Mixed Reality in Virtual World Teleconferencing

Tuomas Kantonen (1), Charles Woodward (1), Neil Katz (2)

(1) VTT Technical Research Centre of Finland
ABSTRACT

In this paper we present a Mixed Reality (MR) teleconferencing application based on Second Life (SL) and the OpenSim virtual world. Augmented Reality (AR) techniques are used for displaying virtual avatars of remote meeting participants in real physical spaces, while Augmented Virtuality (AV), in the form of video-based gesture detection, enables capturing of human expressions to control avatars and to manipulate virtual objects in virtual worlds. Using Second Life to create a shared augmented space representing different physical locations allows us to incorporate the application into existing infrastructure. The application is implemented using the open source Second Life viewer and the ARToolKit and OpenCV libraries.

KEYWORDS: mixed reality, virtual worlds, Second Life, teleconferencing, immersive virtual environments, collaborative augmented reality.

INDEX TERMS: H.4.3 [Information System Applications]: Communications: Applications – computer conferencing, teleconferencing, and video conferencing; H.5.1 [Information Systems]: Multimedia Information Systems – artificial, augmented, and virtual realities.

1 INTRODUCTION
The need for effective teleconferencing systems is increasing, mainly for economic and environmental reasons, as transporting people to face-to-face meetings consumes a lot of time, money and energy. Massively multi-user virtual 3D worlds have lately gained popularity as teleconferencing environments. This interest is not only academic: one of the largest virtual conferences was held by IBM in late 2008, with over 200 participants. The conference, hosted in a private installation of the Second Life virtual world, was a great success, saving an estimated $320,000 compared to the expense of holding the conference in the physical world [1]. In this paper, we present a system for mixed reality teleconferencing in which a mirror world of a conference room is created in Second Life and the virtual world is displayed in the real-life conference room using augmented reality techniques. The real participants' gestures are reflected back into Second Life, and the participants are also able to interact with shared virtual objects on the conference table. A synthetic illustration of such a setting is shown in Figure 1.
IEEE Virtual Reality 2010 20 - 24 March, Waltham, Massachusetts, USA 978-1-4244-6236-0/10/$26.00 ©2010 IEEE
Figure 1. Illustration of a Mixed Reality teleconference: a Second Life avatar among real people wearing ultra-lightweight data glasses, sharing a virtual object on the table, inside a virtual room displayed in a CAVE.
The structure of the paper is as follows. Section 2 describes the background and motivation for our work. Section 3 reviews previous work related to the subject. Section 4 goes into some Second Life technical detail. Section 5 gives an overview of the system we are developing. Section 6 describes our prototype implementation. Section 7 provides a discussion of results, as well as items for future work. Conclusions are given in Section 8.

2 BACKGROUND
There are several existing teleconference systems, ranging from old but still widely used audio and video teleconferencing to web-based conferencing applications. 2D groupware and even massively multi-user 3D virtual worlds have also been used for teleconferencing. Each of these existing systems has its pros and cons. Conference calls are quick and easy to set up, requiring no hardware beyond a mobile phone, yet they are limited to audio only and require a separate channel, e.g. for document sharing. Videoconferencing adds a new modality, as pictures of participants are transferred, but it requires more hardware and bandwidth and is quite expensive at the high end. Web conferencing is lightweight and readily supports document and application sharing, but it lacks natural interaction between users. We see several advantages of using a 3D virtual environment, such as Second Life or OpenSim among many other platforms, as an alternative means for real-time teleconferencing and collaboration. First, the users are able to see all meeting participants and get a sense of presence not possible in a traditional conference call. Second, the integrated voice capability of 3D virtual worlds provides spatial and stereo audio. Third, the 3D environment itself provides a visually appealing shared meeting environment that is simply not possible with other means of teleconferencing. However, the lack of natural gestures constitutes a major drawback for real interaction between the participants.
3 RELATED WORK

In our work, virtual reality and augmented reality are combined in a similar manner as in the original work by Piekarski et al. [2]. Their work was quite limited in the amount of augmented virtuality, as only the position and orientation of users were transferred into the virtual environment. Our work focuses on interaction between augmented reality and a virtual environment, and is therefore closely related to immersive telepresence environments such as [3, 4]. Several different immersive 3D video conferencing systems are described in [5]. Local collaboration in augmented reality has been studied, for example, in [6, 7]. Collaboration is achieved by presenting co-located users the same virtual scene from their respective viewpoints and providing the users with simple collaboration tools such as virtual pointers. Remote AR collaboration has mostly been limited to augmenting live video, as in [8], or later augmenting a 3D model reconstructed from multiple video cameras, as in [9]. Remote sharing of augmented virtual objects and applications has been studied, for example, in [10]. Our work uses Second Life and the open source implementation of the Second Life server called OpenSim, which are multi-user virtual worlds, as the virtual environment for presenting shared virtual objects. Using Second Life in AR has been previously studied by Lang et al. [11] as well as Stadon [12], although their work does not include augmented virtuality. In the simplest case, augmented virtuality can be achieved by displaying real video inside a virtual environment, as in [13]. This approach has also been used for virtual videoconferencing in [14] and for augmenting avatar heads in [15]. Another form of augmented virtuality is avatar puppeteering, where human body gestures are recognized and used to control the avatar, either only the avatar's face, as in [16], or the whole avatar body, as in [17]. However, little previous work has been presented on augmenting Second Life avatars with real-life gestures.
The main exception is the VRWear system [18] for controlling the avatar's facial expressions.

4 SECOND LIFE VIRTUAL WORLD
Second Life is a free, massively multi-user, on-line, game-like 3D virtual world for social interaction. It is based on community-created content and even has a thriving economy. The virtual world's users, called residents, are represented by customizable avatars and can take part in different activities provided by other residents. For interaction, Second Life features spatial voice chat, text chat and avatar animations. Only the left hand of the avatar can be freely animated on the fly, while all other animations rely on prerecorded skeletal animations that the user can create and upload to the SL server. For non-expert SL users, however, meetings in SL can be quite static, with the 'who is currently speaking' indicator being the only active element. From our experience, actively animating the avatar while talking takes considerable training and directs the user's focus away from the discussion. Second Life has a client-server architecture, and the server is scalable to tens of thousands of concurrent users. The server is proprietary to Linden Labs, but there also exists the community-developed, SL-compatible server OpenSimulator [19].

5 SYSTEM OVERVIEW
In this project we developed a prototype and proof of concept of a video conference meeting taking place between Second Life and the real world. Our system combines an immersive virtual environment, collaborative augmented reality and human gesture recognition in a way that supports collaboration between real and virtual worlds. We call the system Augmented Collaboration in Mixed Environments (ACME). In the ACME system, some participants of the meeting occupy a space in Second Life while others are located around a table in the real world. The physical meeting table is replicated in Second Life to support virtual object interactions as well as avatar occlusions. The people in the real world see the avatars augmented around a real-world table, displayed by video see-through glasses, immersive stereoscopic walls or within a video teleconference screen. Participants in Second Life see the real-world people as avatars around the meeting table, augmented with hand and body gestures. Both the avatars and the real people can interact with virtual objects shared between them, on the virtual and physical conference tables respectively. The main components of the system are: co-located users wearing video see-through HMDs, a laptop for each user running the modified SL client, a ceiling-mounted camera above each user for hand tracking, and remote users using the normal SL client. The system is designed for restricted conference room environments where meeting participants are seated around a well-lit, uniformly colored table. As an alternative to HMDs, a CAVE-style stereo display environment or plain video screens can be used. Figure 2 shows how the ACME system is experienced in a meeting between two participants, one attending the meeting in Second Life and the other one in real life. It should be noted that the system is designed for multiple simultaneous remote and co-located users. A video of the ACME system is available at [20].

6 IMPLEMENTATION
6.1 General

The ACME system is implemented by modifying the open source Second Life viewer [21]. The viewer is kept backward compatible with the original Second Life so that, even though more advanced features might require server-side changes, all major ACME features are also available when the user is logged in to the original Second Life world. The SL client was run on Dell Precision M6400 laptops (Intel Mobile Core 2 Duo 2.66 GHz, 4 GB DDR3 533 MHz). Logitech QuickCam Pro for Notebooks USB cameras (640x480 RGB, 30 FPS) were used for video see-through functionality, while a Unibrain Fire-i firewire camera (640x480 YUV, 7.5 FPS) was used for hand tracking. eMagin Z800 (800x600, 40° diagonal FOV) and MyVu Crystal 701 (640x480, 22.5° diagonal FOV) HMDs were used as video see-through displays. Usability studies of the system are currently limited to project-internal testing of individual components. The author has evaluated the technical feasibility of each feature, and comments have been collected during multiple public demonstrations, including a demo at ISMAR 2009. We have been able to identify key points where the application can overcome limitations of current systems, and also points where improvements are needed to create a really usable system. A proper user study will be conducted during 2010 with HIT Lab NZ, comparing the ACME system with other means of telecommunication. Detailed plans of the study have not yet been made.

6.2 Augmenting reality

To be able to use SL for video see-through AR, three steps are required: video capture, camera pose estimation, and rendering of correctly registered virtual objects. Currently the ACME system supports two different video sources, either ARToolKit [22] video capture routines for USB
Figure 2. User views of ACME: Second Life view (screenshot, left), real life view (augmented video, right).
devices or the CMU [23] firewire camera driver API. ARToolKit OpenGL subroutines are used for video rendering. The HMD camera pose is estimated by ARToolKit marker tracking subroutines. Multiple markers are placed around the walls of the conference room and on the table so that at least one marker is always seen by the user wearing an HMD. We experimented with 20 cm by 20 cm and 50 cm by 50 cm markers at distances of 1 to 3 meters from the user. The distance between markers was about three times the width of a marker. The real-world coordinate system is defined by a marker that lies on the conference table. Registration with SL coordinates is done by fixing one SL object to the real-world origin and using the object's coordinate axes as unit vectors. This anchor object is selected in the ACME configuration file. If the marker is not on the table, the anchor object must be transformed accordingly. Occlusion is the ability of a physical object to cover those parts of virtual objects that are physically behind it. In the ACME system, occlusion is implemented by modeling the physical space in the virtual world and using the virtual model as a mask when rendering virtual objects. The virtual model itself is not visible in the augmented image, as otherwise it would cover the very physical objects we want to see. A similar method was used in [24]. The ACME system does not place any restrictions on what kind of virtual objects can be augmented, and any virtual object can also be used as an occlusion model. However, properly augmenting transparent objects has not yet been implemented.

6.3 Hand tracking

For hand tracking, a camera is set up over the conference room table. The camera is oriented downwards so that the whole table is visible in the camera image. The current implementation supports only one hand tracking camera. Hand tracking video capture and processing is done in a separate thread from rendering, so that a lower video frame rate can be used without affecting rendering of the augmented video.
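The decoupled capture-and-process thread described above can be sketched as a generic worker-thread pattern. The sketch below is a Python illustration with invented names (the ACME viewer itself is C++): the tracker runs at its own, lower frame rate and only publishes its latest result, so the render loop never blocks on the camera.

```python
# Sketch of a hand-tracking thread that runs independently of rendering.
# grab_frame and process stand in for the firewire capture and the
# segmentation step; they are illustrative, not the ACME sources.
import threading

class HandTracker:
    def __init__(self, grab_frame, process):
        self._grab_frame = grab_frame      # e.g. camera read, may be slow
        self._process = process            # e.g. HSV hand segmentation
        self._lock = threading.Lock()
        self._latest = None
        self._running = False

    def _loop(self):
        while self._running:
            frame = self._grab_frame()
            if frame is None:              # camera stream ended
                break
            result = self._process(frame)
            with self._lock:
                self._latest = result      # publish newest result only

    def start(self):
        self._running = True
        self._thread = threading.Thread(target=self._loop, daemon=True)
        self._thread.start()

    def latest_hand_position(self):
        """Called from the render loop; never blocks on the camera."""
        with self._lock:
            return self._latest
```

The render loop simply polls `latest_hand_position()` each frame, so a 7.5 FPS tracking camera does not throttle the 30 FPS augmented video.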
Hands are recognized in the video image by HSV (hue, saturation and value) segmentation. The HSV color space has been shown to perform well for skin detection [25]. Each HSV channel is thresholded, and the results are combined into a single binary mask. A calibration utility was created for calibrating the threshold limits to take different lighting conditions into account. The current implementation uses only a single camera for hand tracking; therefore, proper 3D hand tracking has not yet been implemented. The user's hand is always assumed to hover 15 cm above the table, so that the user can perform simple interactions with virtual objects on the table.

6.4 Gesture interaction

Interaction in the ACME system is divided into two categories: interacting with other avatars and interacting with virtual objects.
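The per-channel HSV thresholding used for hand segmentation in Section 6.3 can be sketched as follows. This is a minimal, standard-library illustration rather than the ACME code (which uses OpenCV), and the threshold limits are invented here; in the real system they come from the calibration utility.

```python
# Per-pixel HSV skin segmentation: a pixel is marked 1 only if all three
# channels fall inside their calibrated ranges. Ranges below are assumed
# example values, not calibrated ones.
import colorsys

H_RANGE = (0.0, 0.1)    # reddish hues typical of skin (assumed)
S_RANGE = (0.2, 0.7)
V_RANGE = (0.4, 1.0)

def skin_mask(rgb_image):
    """rgb_image: rows of (r, g, b) byte triples -> binary mask rows."""
    mask = []
    for row in rgb_image:
        mask_row = []
        for (r, g, b) in row:
            h, s, v = colorsys.rgb_to_hsv(r / 255, g / 255, b / 255)
            ok = (H_RANGE[0] <= h <= H_RANGE[1]
                  and S_RANGE[0] <= s <= S_RANGE[1]
                  and V_RANGE[0] <= v <= V_RANGE[1])
            mask_row.append(1 if ok else 0)
        mask.append(mask_row)
    return mask
```

A skin-toned pixel such as (200, 150, 120) passes all three thresholds, while a bright uniform table pixel such as (240, 240, 240) fails on saturation, which is why the uniformly colored table assumption matters.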
Avatar interaction is more relaxed, as the intent of body language is conveyed even when avatar movements do not precisely match the user's motion. Object interaction requires finer control, as objects can be small and in many cases the precise relative position of objects is important. The orientation of the user's face is a strong cue to where the user is currently focusing. When the user is wearing a video see-through HMD, we use the orientation of the camera, already computed for augmented reality visualization, to rotate the avatar's head accordingly. The user's hands are tracked by the hand tracking camera, as explained in Section 6.3. This hand position information is used to move the avatar's hand towards the same position. The Second Life viewer has simple built-in inverse kinematics (IK) logic to control the shoulder and elbow joints so that the palm of the avatar is placed at approximately the correct position. As the current implementation limits the hand to a plane over the table, interaction is restricted to simple pointing gestures. Other animations, for example waving goodbye, can still be used by manually triggering animations from the SL client.

6.5 Object interaction

For easy interaction with objects, direct and correct visual feedback is needed. This is achieved by moving a feedback object with the user's hand. Any SL object can be used as the feedback object by attaching it to the avatar's hand. This feedback object is moved only locally to avoid any network latency. Currently we provide three object interaction techniques: pointing, grabbing and dragging. Interaction is controlled by two different gestures: thumb visible and thumb hidden. Gestures are interpreted from the point of view of the hand tracking camera; therefore the hand must be kept in a proper pose. If the user moves her hand inside an object, the object is highlighted by rendering a white silhouette around it.
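Since the tracked hand is constrained to a plane 15 cm above the table (Sections 6.3 and 6.4), lifting a 2D detection from the overhead camera to a 3D position reduces to a planar mapping. The sketch below is a hedged illustration under the assumption that the downward-looking camera sees exactly the table area; all names and calibration values are invented, and the real system derives its registration from marker tracking.

```python
# Map an overhead-camera pixel onto the fixed hand plane in world
# coordinates. table_size and table_origin stand in for values that the
# real system would obtain from marker-based calibration.

TABLE_HEIGHT = 0.0                   # table surface z in metres (assumed)
HAND_PLANE_Z = TABLE_HEIGHT + 0.15   # hand assumed 15 cm above the table

def hand_world_position(px, py, image_size, table_size,
                        table_origin=(0.0, 0.0)):
    """Pixel (px, py) -> (x, y, z) on the hand plane, assuming the camera
    image maps linearly onto the table rectangle."""
    w, h = image_size          # camera resolution in pixels
    tw, th = table_size        # table extent in metres
    x = table_origin[0] + px / w * tw
    y = table_origin[1] + py / h * th
    return (x, y, HAND_PLANE_Z)
```

The resulting point is what the viewer's built-in IK would be asked to reach, which is why interaction is effectively restricted to pointing gestures on that plane.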
If there is a gesture transition from thumb visible to thumb hidden while an object is highlighted, the object is grabbed. The grabbed object is highlighted with a yellow silhouette. By moving the hand while an object is grabbed, the object can be dragged, that is, the object moves with the hand. Releasing the grabbed object is done with a gesture transition from thumb hidden to thumb visible.
7 DISCUSSION

The current implementation of the ACME system is still quite limited. Even when using multiple large markers there can be registration errors of tens of pixels, creating annoying visual effects, particularly at occlusion boundaries. Augmented objects also jerk noticeably when markers become visible or disappear from the view. Better vision-based tracking techniques, or fusion with inertial sensors, are clearly required for the system to be usable. Visualizing virtual avatars with a head-mounted video see-through display is limited by current HMD technology. Affordable HMDs do not provide a wide enough field of view to be really usable in multi-user conferencing. On the other hand, when augmentation is done into a video teleconferencing image, the user is able to follow virtual participants as easily as other video conference participants. Hand tracking with non-adaptive HSV segmentation is extremely sensitive to lighting and skin color changes. Careful calibration is needed for each user, and recalibration is required whenever the room lighting changes. The current hand gesture recognition is prone to errors and lacks haptic feedback. This makes the interaction feel very unnatural and requires very fine control from the user. Also, the
Figure 3. Interaction with virtual objects: Second Life view (left) and real life view (right). The feedback object is shown as a red ball.
current limitation of the hand motion to a 2D plane makes any sensible interaction rather difficult. It should be noted that most of these shortcomings can be fixed by applying existing, more advanced algorithms. The only major issue without a direct solution is the low quality of currently available affordable HMDs.
8 CONCLUSIONS

In this paper, we have presented a system called ACME for teleconferencing between virtual and physical worlds, including two-way interaction with shared virtual objects, using augmented reality and gesture detection implemented with the Second Life viewer and the ARToolKit and OpenCV libraries. Currently the ACME system provides marker-based augmentation of avatars and virtual objects, visualization including occlusions and, for interaction, head tracking, 2D hand tracking from a monocular camera and grab-and-hold gesture-based interaction with virtual objects. Items for future work include enhanced AR visualization with markerless tracking, more elaborate hand gesture interactions and body language recognition, controlling avatar facial expressions, as well as various user interface issues. Overall, we believe this early work with the ACME system has demonstrated the feasibility of using a mixed reality environment as a means to enhance a collaborative teleconference. Certainly, the ACME system is not a replacement for a face-to-face meeting, but it should simplify and even enhance the 3D meeting experience to the point where mixed world teleconference meetings could be a low-cost yet effective alternative for many business meetings. Within the next few months, our aim is to employ the ACME system in our internal project meetings with overseas partners, which we have so far held in the pure virtual Second Life environment.

ACKNOWLEDGMENTS

The system has been developed in the project "MR-Conference", started in October 2008, with VTT as the main developer, IBM and Nokia Research Center as partner companies, and main funding provided by Tekes (the Finnish Funding Agency for Technology and Innovation). Various people in the project team helped us with their ideas and discussions, with special thanks to Suzy Deffeyes at IBM and Martin Schrader at Nokia Research Center.

REFERENCES
[1] How Meeting In Second Life Transformed IBM's Technology Elite Into Virtual World Believers, http://secondlifegrid.net/casestudies/IBM.
[2] W. Piekarski, B. Gunther, B. Thomas (1999), "Integrating virtual and augmented realities in an outdoor application", Proc. IWAR 1999, pp. 45-49.
[3] P. Kauff and O. Sheer (2002), "An immersive 3D videoconferencing system using shared virtual team user environments", Proc. CVE '02, pp. 338-354.
[4] M. Gross et al. (2003), "blue-c: a spatially immersive display and 3D video portal for telepresence", ACM Transactions on Graphics 22(3), Jul 2003, pp. 819-827.
[5] P. Eisert (2003), "Immersive 3-D video conferencing: challenges, concepts, and implementations", Proc. VCIP 2003, pp. 69-79.
[6] D. Schmalstieg, A. Fuhrmann, G. Hesina, Z. Szalavári, L. Encarnaçäo, M. Gervautz, W. Purgathofer (2002), "The Studierstube Augmented Reality Project", Presence: Teleoperators and Virtual Environments, Feb 2002, pp. 33-54.
[7] M. Billinghurst, I. Poupyrev, H. Kato, R. May (2000), "Mixing realities in shared space: an augmented reality interface for collaborative computing", Proc. ICME 2000.
[8] M. Billinghurst and H. Kato (1999), "Real world teleconferencing", Proc. CHI '99, pp. 194-195.
[9] S. Prince et al. (2002), "Real-time 3D interaction for augmented and virtual reality", ACM SIGGRAPH 2002 Conference Abstracts and Applications, p. 238.
[10] D. Schmalstieg, G. Reitmayr, G. Hesina (2003), "Distributed applications for collaborative three-dimensional workspaces", Presence: Teleoperators and Virtual Environments 12(1), Feb 2003, pp. 52-67.
[11] T. Lang, B. MacIntyre, I. J. Zugaza (2008), "Massively Multiplayer Online Worlds as a Platform for Augmented Reality Experiences", IEEE VR '08, pp. 67-70.
[12] J. Stadon (2009), "Project SLARiPS: An investigation of mediated mixed reality", Arts, Media and Humanities Proc. of the 8th IEEE ISMAR 2009, pp. 43-47.
[13] K. Simsarian, K.-P. Åkesson (1997), "Windows on the World: An example of Augmented Virtuality", Proc. Interfaces 97: Man-Machine Interaction.
[14] H. Regenbrecht, C. Ott, M. Wagner, T. Lum, P. Kohler, W. Wilke, E. Mueller (2003), "An Augmented Virtuality Approach to 3D Videoconferencing", Proc. of the 2nd IEEE and ACM ISMAR, 2003.
[15] P. Quax, T. Jehaes, P. Jorissen, W. Lamotte (2003), "A Multi-User Framework Supporting Video-Based Avatars", Proc. of the 2nd Workshop on Network and System Support for Games, 2003, pp. 137-147.
[16] F. Pighin, R. Szeliski, D. Salesin (1999), "Resynthesizing Facial Animation through 3D Model-Based Tracking", Proc. of the 7th ICCV, 1999, pp. 143-150.
[17] J. Lee, J. Chai, P. Reitsma, J. Hodgins, N. Pollard (2002), "Interactive Control of Avatars Animated with Human Motion Data", ACM Transactions on Graphics 21, Jul 2002, pp. 491-500.
[18] VR-WEAR SL head analysis viewer, http://sl.vr-wear.com/, unpublished.
[19] OpenSimulator, http://opensimulator.org/.
[20] Video of the ACME system, http://www.youtube.com/watch?v=DNB0_c-5TSk.
[21] Second Life Source Downloads, http://wiki.secondlife.com/wiki/Source_archive.
[22] ARToolKit homepage, http://www.hitl.washington.edu/artoolkit/.
[23] CMU 1394 Digital Camera Driver, http://www.cs.cmu.edu/~iwan/1394/.
[24] A. Fuhrmann et al. (1999), "Occlusion in Collaborative Augmented Environments", Computers and Graphics 23(6), 1999, pp. 809-819.
[25] B. D. Zarit, B. J. Super, F. K. H. Quek (1999), "Comparison of Five Color Models in Skin Pixel Classification", In ICCV '99 Int'l Workshop on, pp. 58-63.