Annals of Software Engineering 12, 193–212, 2001. © 2002 Kluwer Academic Publishers. Manufactured in The Netherlands.
Immersive Whiteboard Collaborative System

ZON-YIN SHAE and BELLE TSENG {zshae;belle}@us.ibm.com
IBM Watson Research Center, 30 Sawmill River Road, Hawthorne, NY 10532, USA

WING HO LEUNG [email protected]
Department of Electrical and Computer Engineering, Carnegie Mellon University, 5000 Forbes Avenue, Pittsburgh, PA 15213, USA
Abstract. An immersive whiteboard system is presented in which users at multiple locations can communicate with each other. The system features a virtual environment with vivid avatars, stroke compression and streaming technology to deliver stroke data efficiently across meeting participants, friendly human interaction and navigation, and corresponding virtual and physical whiteboards. The whiteboard is both a physical platform for our input/output interfaces and a virtual screen for sharing common multimedia. It is this whiteboard correspondence that allows the user to physically write on the virtual whiteboard. In addition to drawing on the shared virtual board, the immersive whiteboard in our setup permits users to control the application menus, insert multimedia objects into the world, and navigate around the virtual environment. By integrating multimedia objects and avatar representations into an immersive environment, we provide the users with a more transparent medium so that they feel as if they are communicating and interacting face-to-face. The whiteboard efficiently pulls all the collaboration technologies together. The goal of this collaborative system is to provide a convenient environment for participants to interact with each other and to support collaborative applications such as instant messaging, distance learning and conferencing.
1. Introduction
The whiteboard has been found to be a beneficial medium for communicating spontaneous handwritten ideas and concepts. Several workplace field studies have shown the importance of physical whiteboards as a locale for discussion and collaboration. However, physical whiteboards are only visible locally in one place; their information is not easily available to remote sites. In particular, strokes written on the whiteboard with ink cannot be erased or edited by remote users. Hence, real-time multi-user physical whiteboard collaboration is limited to people in the same room. With the development of ink stroke data capturing devices [Electronics Imaging 1999; Virtual Ink 1999] and the explosive usage of the Internet, a networked whiteboard collaboration system is called for.

The content of a whiteboard is made up of strokes written by users. A stroke is defined as a continuous curve that begins when the pen (or PC mouse) is put down on the whiteboard and ends when the pen (or PC mouse) is lifted up from the whiteboard. These strokes are captured and rendered as a sequence of points. For a typical whiteboard stroke-capturing device, the entities that make up a stroke are temporally
sampled point coordinates. The stroke is then rendered on the display as a curve that passes through those sampled points. The simplest rendering of the stroke thus connects subsequent sampled points with straight lines. In this paper, we will treat the content of a whiteboard as a collection of strokes, each of which is composed of a sequence of sampled points (referred to as stroke data points) connected by straight lines.

Traditionally, stroke data was not treated as a stream data type [Adobe 1998; Douglas and Peucker 1973; Koeman and Vander Weiden 1970; Impedovo 1993; Microsoft 1995]. However, a streaming stroke data mechanism is especially essential in low bandwidth environments (e.g., a regular telephone modem) to reduce the download waiting time and enhance the media quality. Even in high bandwidth environments (e.g., LAN, cable modem or ADSL), streaming technology has been shown to be essential for better quality of service (QoS) and for increasing network bandwidth efficiency (so that more users can be supported). Every medium has its own distinct properties; consequently, an effective streaming mechanism can only be devised based on the medium's unique properties. A whiteboard collaboration system needs to stream stroke data efficiently across all participants. This paper will present a multi-resolution representation of the stroke media that is ready for efficient streaming.

When a group of people gets together in one room to work on an agenda, we refer to this scenario as an ideal collaboration. A collaborative system can thus be viewed as a collection of technologies enabling users at multiple locations to seamlessly communicate with each other and to work jointly on the object of interest. To determine the desired technologies for our collaborative system, we examine existing collaborative and conferencing applications. Current video conferencing systems, such as Microsoft NetMeeting [Microsoft 1996] and CU-SeeMe [White 1996], lack the sense of immersion because each user sees the other users in separate windows, so it is often difficult to tell who is talking to whom. Furthermore, Argyle and Dean [1965] suggested that during communication people vary physical proximity, eye contact and other behaviors to optimize an overall level of intimacy. This suggests that an immersive environment where users can navigate and interact with other users offers the best setting.

There is a widely held belief in the multimedia arena that 3D will become popular and commonplace because it offers a superior way of representing the world around us. In a virtual environment, all forms of multimedia objects can be rendered and positioned at 3D spatial locations. To identify the users in this virtual environment, avatars are required to represent each user. In Interspace [NTT 1998], the avatar's face is merely a 2D image, resulting in perspective distortion when viewed in a virtual environment, except at one viewpoint. In the Distributed Interactive Virtual Environment (DIVE) [Swedish Institute 1998], avatars are animated for actions such as walking, jumping, etc., but lips are not animated to synchronize with speech. In OnLive! Traveler [DigitalSpace 1997], the avatar's lips are synchronized with speech; however, the avatar consists of only a generic head model floating in the air. Tang and Isaacs [1992] suggested that the use of body gestures facilitates smoother interactions between users.
As a result of these studies, for a realistic representation of each user, the avatar must be synthesized from the real user's head. Furthermore, facial expressions, lip synchronization, hand gestures, body movements and user locations should be implemented as features of the avatars.

In this paper, we present an immersive whiteboard collaborative system. The most crucial element of our system is the shared whiteboard. The whiteboard is both a physical platform for our input/output interfaces and a virtual screen for sharing common multimedia. It is the whiteboard that allows users to seamlessly navigate and interact in the immersive environment. The screen also permits users to communicate through natural handwritten sketches. The virtual whiteboard in the immersive environment can actually correspond to the physical whiteboard in our setup. It is this whiteboard correspondence that allows the user to physically write on the immersive whiteboard. In addition to drawing on the shared virtual board, the immersive whiteboard in our setup permits users to control the application menus, insert multimedia objects into the world, and navigate around the virtual environment. By integrating multimedia objects and avatar representations into an immersive environment using a combination of multi-modal media, we provide the users with a more transparent medium so that they feel as if they are communicating and interacting face-to-face.

We use a prototype implementation called the Networked Intelligent Collaborative Environment (NetICE), originally developed at CMU [Leung et al. 2000a, b]. NetICE allows multiple users to share a common world and demonstrates avatar representation, audio, expression, and interaction. In our prototype, the avatar representation in NetICE is enhanced and NetICE is integrated with the shared whiteboards to achieve our goal of seamless and spontaneous collaboration. The whiteboard efficiently pulls all the collaboration technologies together. Figure 1 shows a user (represented by an avatar) using the whiteboard to communicate with others in the immersive world.
Figure 1. Immersive whiteboard client view.
Note that the full configuration described here is recommended for a truly immersive collaboration experience. However, not every client is required to have all the components in order to participate in the conference. For instance, the whiteboard can be used by itself without the avatar environment and vice versa. Also, a PDA device can participate in the collaboration and function as a whiteboard without a projector.

The rest of the paper is organized as follows. Section 2 provides an overview of our collaborative whiteboard system. Section 3 describes the physical whiteboard system and the multi-resolution stroke media representation. Section 4 focuses on the meeting management server network, while section 5 concentrates on the virtual environment augmented with the physical whiteboard and describes how the shared whiteboards bring together the interface for the whole collaborative system. Section 6 describes the avatar representation. Section 7 proposes some applications, including our implementation of a virtual auction house. Finally, section 8 provides a summary.
2. Immersive whiteboard collaboration system
The shared whiteboards in the immersive environment provide the best medium for users to freely discuss handwritten and spontaneous ideas. On a conventional whiteboard, participants are expected to draw and to see other people's drawings on the same surface. It offers a natural interface for both input and display. In our immersive whiteboard system, the whiteboard acts as a 2D stroke input, menu input, 3D object input, 3D navigation, and display.

Several workplace field studies have shown the importance of physical whiteboards as a locale for discussion and collaboration. However, physical whiteboards are only visible locally in one place; their information is not easily available to remote sites. In particular, strokes written on the whiteboard with ink cannot be erased or edited by remote users. Hence, real-time multi-user physical whiteboard collaboration is limited to people in the same room. With the development of ink stroke data capturing devices, the system architecture of a networked collaboration system shown in figure 2 becomes possible. It consists of two virtualized physical whiteboards connected via the Internet. The virtual ink strokes are captured by means of low-cost devices. Low-cost projectors can then be used to display the strokes on the whiteboards. Consequently, we can virtualize the physical whiteboard and enable remote collaboration for discussion, design, and editing in a network environment. The meeting management server network coordinates the meeting and communicates with the clients to establish the environment, which consists of several software components.

The first component is the immersive environment in which the users share their presence and the objects on their agenda are exchanged. A 3D environment can be modeled as a set of graph nodes [Ames et al. 1997]. Each node in the graph can represent either a multimedia object or a transformation operation (e.g., an affine transformation) on the multimedia objects residing in its children nodes. Multimedia objects can only be leaf nodes in the graph, and a transformation node always has children nodes. Each individual multimedia object can be modeled as a set of data points in the 3D coordinate space.
Figure 2. Immersive whiteboard collaboration system.
Each data point can have its individual attributes (e.g., color, lighting) as properties. One multimedia object can have various models (or representations) in a 3D environment. For example, a 2D image can be modeled as a set of points with individual color properties for each point. A 2D image can also be modeled as a flat 2D rectangular box (4 points) with the 2D image as its surface texture. The latter model is called texture mapping. Consequently, the 3D environment provides a common world in which each user can be represented and all multimedia objects can be displayed. The virtual world also allows the users to navigate around the world and discover who or what is around them. Furthermore, the immersive world sets up a common platform for users to view and interact with the objects residing there.

The second component is the avatar representation of each user in the immersive environment. The avatar serves as a form of identification, expression, and navigation. Every user is identified by his/her own unique head model, which can be generated beforehand or online. A user's expressions can be effectively displayed by facial expressions or body gestures. Similarly, a user's position, orientation, and navigation suggest that he/she may currently be interested in only one part of the immersive world. Through these means, the avatar representation is crucial to give the rest of the participants a better understanding of his/her state.

Finally, the third and most crucial component of our system is the shared whiteboard. The whiteboard is both a physical platform for our input/output interfaces and a virtual screen for sharing common multimedia. It is the whiteboard that allows users to seamlessly navigate and interact in the immersive environment. The screen also permits users to communicate through natural handwritten sketches.
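To make the scene-graph model above concrete, the following sketch shows transformation nodes as interior nodes and multimedia objects as leaf nodes. It is only an illustration of the data structure: the class names, the translation-only transform, and the print-based rendering are assumptions of this sketch, not part of the NetICE implementation.

```python
# Illustrative sketch of the scene-graph model: transformations as interior
# nodes, multimedia objects as leaf nodes with per-point attributes.

class MultimediaObject:
    """Leaf node: a set of 3D data points, each with individual attributes."""
    def __init__(self, name, points, attributes=None):
        self.name = name
        self.points = points                 # list of (x, y, z) tuples
        self.attributes = attributes or {}   # e.g. {"texture": "strokes.png"}

    def render(self, transform):
        # Apply the accumulated transformation and "draw" each point.
        for p in self.points:
            x, y, z = transform(p)
            print(f"{self.name}: ({x:.1f}, {y:.1f}, {z:.1f}) {self.attributes}")

class TransformNode:
    """Interior node: a transformation applied to all of its children."""
    def __init__(self, offset, children):
        self.offset = offset                 # a simple translation, for illustration
        self.children = children             # a transformation node always has children

    def render(self, transform=lambda p: p):
        def combined(p, outer=transform, o=self.offset):
            x, y, z = outer(p)
            return (x + o[0], y + o[1], z + o[2])
        for child in self.children:
            child.render(combined)

# Usage: a textured whiteboard quad placed two metres in front of the origin.
board = MultimediaObject("whiteboard",
                         [(0, 0, 0), (1, 0, 0), (1, 1, 0), (0, 1, 0)],
                         {"texture": "strokes.png"})
world = TransformNode((0.0, 0.0, -2.0), [board])
world.render()
```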
3. Physical whiteboard system
The physical setup of an immersive whiteboard requires that each client have the following components:

• A typical whiteboard as found in many classrooms, offices and conference rooms.
• A computer with networking capabilities.
• A pair of audio speakers for spatial sound effects.
• A wireless microphone attached to the user to capture his/her speech.
• A projector to display the immersive environment.
• A stroke input device for capturing handwriting on the whiteboard, such as the Mimio [Virtual Ink 1999] or e-Beam [Electronics Imaging 1999].
• A pen corresponding to the stroke input device for writing on the whiteboard surface.
• Finally, a video camera for capturing the user.

Each component serves as an input or output device to the computer, and the user uses them as interfaces within the immersive environment. Also, a meeting management server can be employed to archive and retrieve the meeting content. It also provides the utilities that support a meeting, such as awareness and security.

The pen corresponding to the stroke input device does not generate an ink mark directly. Instead of leaving a mark when the pen touches the whiteboard, the position of the pen is sent back to the computer. The whiteboard system then displays virtual ink at that position on the computer, which in turn is projected back onto the whiteboard. This concept thus allows multiple users to contribute to the same whiteboard.

Figure 2 depicts the placement and configuration of the setup components listed above on each client. First, align the projector so that it projects its images onto the physical whiteboard. Second, connect the projector to the computer display port. Third, attach the stroke input device to the whiteboard and connect its output line to the computer; this can be a Mimio device or the light pen described in a later section. Next, station the pair of audio speakers for spatial audio and set up the wireless microphone for recording the user's speech. Finally, hook up the video camera to capture the user.

The setup configuration of the physical whiteboard allows the user to (1) view the virtual environment, (2) draw sketches, (3) control the mouse on the computer screen, and (4) navigate around the 3D world, all in front of the whiteboard. In the next section, we will examine the full functionality of the immersive whiteboard. The other audio/visual inputs provided by the microphone, speakers, and camera are more common and thus will not be discussed in much detail.

3.1. Light pen as stroke input

Figure 3 shows a light pen input system [Elrod et al. 1992; Mizui 1999] with a front projection and vision digitizer. The positioning of the camera and projector is in principle arbitrary.
Figure 3. Light pen input with a front projection and vision digitizer.
However, it is preferable to have a ceiling-mounted projector and camera to avoid occlusion of the light pen by the user. We define a stroke as a set of connected dots from the above vision capture system. Since the light pen only has two states, on and off, the starting point and ending point of a stroke cannot be obtained directly from the light pen. Additional image processing is therefore required to determine the start and end of each stroke. The start of a stroke always corresponds to a transition of the light pen from off to on, while the end always corresponds to a transition from on to off. Based on this observation, the start point and end point of a stroke can be found by detecting the lighting transitions of the light pen.

In our experimental system [Hendriks et al. 2001], an off-the-shelf color videoconferencing camera (Videolabs Flexcam) and an IBM Intellistation M Pro with dual 400 MHz Pentium processors are used to capture the location of the light pen. We block the visible environmental light with an IR-pass/visible-block filter (Schott Glass RG830) and inspect only the red component of each video frame. The capture card is an Osprey 100, which is capable of producing a 640 × 480 video frame resolution at a rate of 30 frames per second. With our image-processing algorithm [Hendriks et al. 2001], we can successfully capture strokes at close to the maximum capture rate. In our experiment, we found that the threshold can be set to less than 50% of saturation, and a second-order low pass filter produces a satisfactory result. It should be pointed out that a thresholding-based spot finding algorithm is, in general, sensitive to the environmental lighting. However, with IR optical filtering we successfully eliminated the effects of environmental lighting, and we observed that our vision algorithm is insensitive to the threshold. One advantage of the light pen system is that, strictly speaking, the whiteboard need not be flat; we currently use a curved, high-gain projection screen.
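The following sketch illustrates the thresholding-based spot detection and the off/on transition rule just described. The frame format (a 2-D array of the IR-filtered red channel), the threshold value and the helper names are assumptions of this sketch rather than the actual implementation reported in [Hendriks et al. 2001].

```python
# Sketch of light pen detection: threshold the red (IR-filtered) channel,
# take the centroid of the bright spot, and segment strokes at on/off transitions.
import numpy as np

THRESHOLD = 128          # roughly 50% of an 8-bit channel's saturation (assumed)

def find_pen_spot(red_frame):
    """Return the (x, y) centroid of the bright IR spot, or None if the pen is off."""
    mask = red_frame >= THRESHOLD
    if not mask.any():
        return None
    ys, xs = np.nonzero(mask)
    return float(xs.mean()), float(ys.mean())

def frames_to_strokes(red_frames):
    """Group per-frame pen positions into strokes using off->on / on->off transitions."""
    strokes, current = [], None
    for frame in red_frames:
        spot = find_pen_spot(frame)
        if spot is not None:
            if current is None:          # off -> on: a new stroke starts
                current = []
            current.append(spot)
        elif current is not None:        # on -> off: the stroke ends
            strokes.append(current)
            current = None
    if current:                           # pen still on at the end of capture
        strokes.append(current)
    return strokes

# Usage with synthetic frames: a bright spot that moves across the board.
frames = [np.zeros((480, 640), dtype=np.uint8) for _ in range(5)]
for i, f in enumerate(frames[1:4], start=1):
    f[240, 100 * i] = 255
print(frames_to_strokes(frames))   # one stroke with three sampled points
```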
3.2. Stroke data compression

Electronic whiteboard pen strokes are treated as an error-tolerant stream medium [Shae et al. 2000] and not as a pure data type that cannot endure any data corruption. The media properties of the stroke media are therefore explored. Their characteristically large data volume and file size, dynamic change of content in time, and content information redundancy make stroke data similar to continuous media such as video and audio. As such, streaming mechanisms for this new media type are called for.

Strokes are first segmented into substrokes under two criteria. The first criterion is to segment a stroke such that we eliminate any intersection within one stroke segment. The second criterion is to guarantee that, when the error is calculated, the resulting areas are additive within one stroke segment. A set of points connected by lines from the original substroke is selected to approximate the original substroke. The bounding area between the approximation lines and the original substroke is used to measure the error quantity (named the error performance E in the equations of this section). The optimization process is to find a set of points in the original stroke that minimizes E. In the sense of a local optimum, equations have been derived to select the set of points in the original stroke for the hierarchical LOD (level of detail) representations. Consequently, a multi-resolution representation [Shae et al. 2000] of stroke media content and its associated streaming mechanism is devised, which enables stroke media streaming over the Internet.

In order to compress the original strokes without degrading the final stroke rendering, the error performance between the original strokes and the compressed strokes should be minimized. The error performance of the compressed stroke is the area bounded by the curves of the original stroke and the compressed stroke. Minimizing the error performance is equivalent to getting the two curves to overlap as closely as possible. In our error minimization solution, we use an iterative procedure to find the data points of the prediction stroke.

Referring to figure 4, given an original stroke $\{P_i = (x_i, y_i) \mid 1 \le i \le N\}$, where $P_1$ and $P_N$ are the endpoints of the stroke, our first task is to find one prediction point $\hat{P}$ such that the predicted stroke consisting of the three points $\{P_1, \hat{P}, P_N\}$ minimizes the error performance. One constraint is that the prediction point $\hat{P}$ is one of the points on the original stroke, $\hat{P} \in \{P_i = (x_i, y_i) \mid 1 \le i \le N\}$. This constraint preserves the original point data and results in the original stroke sequence when bandwidth is sufficient. To find the optimal prediction point $\hat{P}$, we revisit the definition of error performance. The error performance between the original stroke $\{P_i = (x_i, y_i) \mid 1 \le i \le N\}$ and the predicted stroke $\{P_1, \hat{P}, P_N\}$ can be calculated as the sum of the triangular areas formed by these data points, which is possible because of the additive property of our error measurement. Let $\mathrm{Area}(P_i, P_j, P_k)$ denote the area of the triangle formed by vertices $P_i$, $P_j$ and $P_k$. With the constraint that the prediction point $\hat{P} = P_J$ is one of the original points, the error performance E between the original stroke and the predicted stroke can be expressed as:

$$E(J) = \sum_{i=3}^{J} \mathrm{Area}(P_1, P_{i-1}, P_i) + \sum_{i=J+1}^{N-1} \mathrm{Area}(P_{i-1}, P_i, P_N),$$

where the first sum is the area bounded by the line segment $P_1 P_J$ and the stroke curve, and the second sum is the area bounded by the line segment $P_J P_N$ and the stroke curve. The error performance E is a function of the selected prediction point $\hat{P} = P_J \in \{P_i \mid 1 < i < N\}$.
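As a rough illustration, the sketch below evaluates the error performance E(J) exactly as defined above, using the unsigned cross product for the triangle areas (the constant factor of one half is omitted since only relative comparisons matter). The function names and the brute-force search over J are illustrative only.

```python
# Sketch of the error performance E(J) defined above (0-based indices).

def area(a, b, c):
    """Unsigned cross product of the triangle's edge vectors (twice its area)."""
    (x1, y1), (x2, y2), (x3, y3) = a, b, c
    return abs((x2 - x1) * (y3 - y1) - (x3 - x1) * (y2 - y1))

def error_performance(stroke, j):
    """E(J) for the predicted stroke {P1, PJ, PN}; stroke is a list of (x, y)."""
    p1, pn = stroke[0], stroke[-1]
    left = sum(area(p1, stroke[k - 1], stroke[k]) for k in range(2, j + 1))
    right = sum(area(stroke[k - 1], stroke[k], pn) for k in range(j + 1, len(stroke) - 1))
    return left + right

# Usage: brute-force search for the prediction point of a small stroke.
stroke = [(0, 0), (1, 2), (2, 3), (3, 2), (4, 0)]
best_j = min(range(1, len(stroke) - 1), key=lambda j: error_performance(stroke, j))
print(best_j, error_performance(stroke, best_j))   # -> 2 2 (the apex point)
```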
Figure 4. Original stroke and predicted stroke.
To find the prediction point $P_J$ which minimizes the error performance, we split the error equation: $E(J) = F(J) + G(J)$, where

$$F(J) = \sum_{i=3}^{J} \mathrm{Area}(P_1, P_{i-1}, P_i), \qquad G(J) = \sum_{i=J+1}^{N-1} \mathrm{Area}(P_{i-1}, P_i, P_N).$$

It then follows that:

$$F(i) = F(i-1) + \mathrm{Area}(P_1, P_{i-1}, P_i), \qquad G(i) = G(i-1) - \mathrm{Area}(P_{i-1}, P_i, P_N).$$
There is a local minimum at point $P_J$ if and only if:

$$E(J-1) \ge E(J) \le E(J+1).$$

For the left inequality,

$$E(J-1) \ge E(J),$$
$$F(J-1) + G(J-1) \ge F(J) + G(J),$$
$$F(J) - \mathrm{Area}(P_1, P_{J-1}, P_J) + G(J) + \mathrm{Area}(P_{J-1}, P_J, P_N) \ge F(J) + G(J),$$
$$\mathrm{Area}(P_{J-1}, P_J, P_N) \ge \mathrm{Area}(P_1, P_{J-1}, P_J).$$

For the right inequality,

$$E(J) \le E(J+1),$$
$$F(J) + G(J) \le F(J+1) + G(J+1),$$
$$F(J) + G(J) \le F(J) + \mathrm{Area}(P_1, P_J, P_{J+1}) + G(J) - \mathrm{Area}(P_J, P_{J+1}, P_N),$$
$$\mathrm{Area}(P_J, P_{J+1}, P_N) \le \mathrm{Area}(P_1, P_J, P_{J+1}).$$

Therefore we have found a prediction point $P_J$ which locally minimizes the error performance if:

$$\mathrm{Area}(P_{J-1}, P_J, P_N) \ge \mathrm{Area}(P_1, P_{J-1}, P_J), \qquad \mathrm{Area}(P_J, P_{J+1}, P_N) \le \mathrm{Area}(P_1, P_J, P_{J+1}).$$

These mathematical findings can be better understood through their geometrical meaning, as shown in figure 5. The stroke curve from $P_1$ to $P_N$ is approximated by the two line segments $P_1 P_J$ and $P_J P_N$. The error performance of this approximation is the area bounded by these two segments and the original stroke curve. Assume that $P_{J-1}$ and $P_{J+1}$ are the two points on the stroke curve adjacent to $P_J$. The error performance resulting from either of these two points should be worse than that of $P_J$ if $P_J$ is a local optimum. It is clear from the geometrical representation in figure 5 that the differences in error performance among these three points are identical to the differences of the triangle areas in the equations above.

The areas of the triangles can be calculated by taking the magnitude of the cross product of the vectors formed by the vertices of the triangle (up to a constant factor of one half, which does not affect the minimization). Consequently,

$$\mathrm{Area}(P_1, P_{i-1}, P_i) = \left| (x_{i-1} - x_1,\, y_{i-1} - y_1) \times (x_i - x_1,\, y_i - y_1) \right| = \left| x_{i-1} y_i - x_i y_{i-1} - x_1 (y_i - y_{i-1}) + y_1 (x_i - x_{i-1}) \right|,$$

and

$$\mathrm{Area}(P_{i-1}, P_i, P_N) = \left| (x_{i-1} - x_N,\, y_{i-1} - y_N) \times (x_i - x_N,\, y_i - y_N) \right| = \left| x_{i-1} y_i - x_i y_{i-1} - x_N (y_i - y_{i-1}) + y_N (x_i - x_{i-1}) \right|.$$

One pass over $P_i$ for $i \in (1, N)$ determines one prediction point $\hat{P} = P_J$ for the original stroke $\{P_i \mid 1 \le i \le N\}$. This prediction point optimally bisects the stroke into two segments. The two smaller segments of the stroke are henceforth referred to as substrokes.
Figure 5. Geometrical meaning of the stroke data points minimization equations.
Afterwards, we can repeat the error minimization pass for the two substrokes and calculate the next set of prediction points. This procedure can be repeated recursively until every point is predicted or until a maximum error threshold is satisfied. In this paper, we assign a level of detail corresponding to the number of error minimization iterations performed. Therefore, the first level of detail refers to finding the first prediction point on a stroke. The second level of detail refers to finding the next two prediction points on the two substrokes. The third level of detail refers to finding the next four prediction points on the four substrokes, and so on. In the next section, we will refer to these levels of detail to determine how many error minimization bisections are performed.
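The sketch below ties these pieces together: a single pass locates the prediction point of a substroke using the incremental F/G update derived above, and the recursion then groups the prediction points by level of detail (one level per bisection iteration). The stopping rule combining a maximum level and an error threshold anticipates criterion three of the next section; the names and parameters are illustrative assumptions, not the reference implementation of [Shae et al. 2000].

```python
# Sketch of the recursive error-minimization bisection: one pass per
# (sub)stroke finds the prediction point via the incremental F/G update,
# then recursion assigns prediction points to levels of detail.

def area(a, b, c):
    """Unsigned cross product of the triangle's edge vectors (twice its area)."""
    (x1, y1), (x2, y2), (x3, y3) = a, b, c
    return abs((x2 - x1) * (y3 - y1) - (x3 - x1) * (y2 - y1))

def find_prediction_point(stroke):
    """Return (index, error) of the point minimizing E(J) for this substroke."""
    n = len(stroke)
    if n <= 2:
        return None, 0.0
    # F grows from the P1 side and G shrinks from the PN side as J advances.
    f = 0.0
    g = sum(area(stroke[k - 1], stroke[k], stroke[-1]) for k in range(2, n - 1))
    best_j, best_e = 1, f + g
    for j in range(2, n - 1):
        f += area(stroke[0], stroke[j - 1], stroke[j])
        g -= area(stroke[j - 1], stroke[j], stroke[-1])
        if f + g < best_e:
            best_j, best_e = j, f + g
    return best_j, best_e

def decompose(stroke, max_level, max_error, level=1, levels=None):
    """Recursively bisect a stroke, grouping prediction points by level of detail."""
    if levels is None:
        levels = {}
    j, err = find_prediction_point(stroke)
    if j is None or err <= max_error or level > max_level:
        return levels                       # point only kept if error still too large
    levels.setdefault(level, []).append(stroke[j])
    decompose(stroke[:j + 1], max_level, max_error, level + 1, levels)
    decompose(stroke[j:], max_level, max_error, level + 1, levels)
    return levels

# Usage: feature points of a short stroke, grouped by level of detail.
stroke = [(0, 0), (1, 2), (2, 3), (3, 2), (4, 0), (6, 1), (8, 4)]
print(decompose(stroke, max_level=3, max_error=0.5))
```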
3.3. Multi-resolution representation and streaming

There are many possibilities in compressing, transmitting, and displaying the original whiteboard data. One may be interested in compressing and separating the stroke data into two extreme sets called the coarse representation and the fine details. Others may be more interested in a continuous stream of increasing detail until all the original data is received. Then there are those who would like to have separate hierarchical representations corresponding to different compressions of the original. Our objective is to provide for all these design requirements by introducing the concept of levels of detail for stroke data [Shae et al. 2000].

The idea of having levels of detail is to provide for different bandwidth capabilities. For broadcasts of whiteboard content, the compressed data is transmitted to a heterogeneous set of users. In order to satisfy every user, the compressed data is divided into different levels of detail. For our whiteboard content, there are three types of stroke data. The first type is the segmentation points, which include the stroke endpoints and the other segmentation points described below. The second type is the feature points, which are calculated from the error minimization prediction points. Finally, the third type is the non-feature points, which include those points not identified by the error minimization procedure.

For the base level of detail, denoted Level 0, the data is comprised of the segmentation points. In the content of a whiteboard, the segmentation points include (1) the beginning and final endpoints of each stroke, (2) the horizontal and vertical extreme segmentation points, and (3) possible inflection points on the stroke. Therefore, there is a minimum of two segmentation points on each stroke, and as the stroke becomes longer and more complicated, each minimum and maximum extreme point is detected and added to the set of segmentation points. As a result, these segmentation points represent only the bare backbone of each stroke and are consequently designated Level 0.

For all subsequent levels of detail, we need to select feature and non-feature points based on the error minimization procedure described in the section above. Before we can perform this task, we need to choose one of three design criteria. The first criterion sets the number of error minimization iterations to perform. The second criterion sets the maximum error threshold to allow. The third criterion is a powerful combination of these two: it sets both the maximum number of iterations and the maximum error threshold. The error minimization procedure is performed up to the specified number of iterations, and the resulting prediction points are included in the data set only if the error is still above the preset maximum threshold. In this paper, we choose criterion three as it generates the best visual results.

For the next level of detail, denoted Level 1, the data is comprised of the first set of feature points. These feature points are calculated only after the Level 0 segmentation points are derived. The segmentation points of each stroke divide the stroke into several smaller substrokes. On each substroke, we calculate the error and test whether it is below the maximum error threshold. If the substroke already passes the test, we move on to the next substroke. Otherwise, we perform one iteration of the error minimization procedure to obtain one optimal prediction point. This set of first-iteration prediction points gives us the Level 1 data. If the design criterion for Level 1 is that the number of iterations equals one, then we are finished with Level 1. On the other hand, if we desire that the number of iterations equal two, then we can perform another iteration of error minimization to obtain another set of prediction points and include these second-iteration prediction points in the Level 1 data. There can be subsequent levels of detail that follow in the same manner, where the data is comprised of further sets of feature points.

After the desired number of levels is determined, the very last level of detail is comprised of the non-feature points. These are all the points on the whole whiteboard that were not selected to belong to any of the previous levels. Furthermore, these points make the least contribution to the visual impression of the whiteboard, as their error is below all preset thresholds. The error minimization procedure for a stroke thus finds the optimal stroke points and groups them into the appropriate levels of detail.
In our implementation, Level 1 simply captures those points that were detected by the first iteration of the error minimization process, and Level 2 captures those points that were detected by the second iteration. Figure 6 shows an original whiteboard stroke (6A), the Level 0 stroke (6B), the Level 0 + Level 1 stroke (6C), and the Level 0 + Level 1 + Level 2 stroke (6D).
Figure 6. Experimental results LOD.
The total number of points required to represent this original stroke is 1181. As depicted in the figure, the numbers of data points in figures 6B, 6C, and 6D are 48, 95, and 184, corresponding to error performance measures of 47472, 12806, and 4019, respectively. It can be seen that two levels of error minimization prediction are sufficient for a visually acceptable reconstruction.
The levels of detail are ordered in terms of when they are created and in sequence of their detail priority, i.e., Level 0, Level 1, Level 2, Level 3, . . . , Level (L − 1), Level L, for a grand total of L + 1 levels. Level 0 represents the coarsest level of detail, gives a visually rough estimate of the whiteboard content, and thus has the highest priority. Level 1 then adds the most significant set of feature points, followed by Level 2, Level 3, . . . , Level (L − 2). Level (L − 1) is composed of the finest details, which may not even be visually significant to the whiteboard. The final Level L contains all the remaining stroke data points (the non-feature points) that have not been selected into any previous level, and thus has the lowest priority. These levels have an inherent priority ordering as well as, in general, increasing data size.

The sequential ordering of the levels of detail is then used to determine their progressive streaming schedule. Higher priority levels of detail should be streamed out first, with lower priority levels following. The server decides in real time when and how many levels of detail it will send to a particular client based on the QoS specifications, such as the delay time, the media quality requirement, the bandwidth, server resource utilization, etc. As a result, the clients see a coarse representation of the whiteboard as soon as possible, with finer details arriving progressively.
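As an illustration of this progressive schedule, the sketch below sends the levels of detail in priority order and stops once a per-client budget is exceeded. The point-count budget is a stand-in for the real QoS specifications (delay, quality, bandwidth, server load); the function name and policy are assumptions of this sketch.

```python
# Sketch of the progressive streaming schedule: Level 0 first, then finer
# levels, until the client's (assumed) point budget is exhausted.

def schedule_levels(levels, point_budget):
    """Return the ordered list of levels to send within a per-client point budget.

    `levels` maps level number -> list of stroke points; lower level numbers
    have higher priority and are streamed first.
    """
    plan, sent = [], 0
    for level in sorted(levels):             # Level 0, Level 1, ..., Level L
        size = len(levels[level])
        if sent + size > point_budget and level > 0:
            break                             # always send at least Level 0
        plan.append(level)
        sent += size
    return plan

# Usage: a low-bandwidth client receives only the coarse levels (point counts
# echo the figure 6 example: 48, 95, 184 and 1181 points in total).
levels = {0: [None] * 48, 1: [None] * 47, 2: [None] * 89, 3: [None] * 997}
print(schedule_levels(levels, point_budget=100))   # -> [0, 1]
```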
4. Meeting management server network
The meeting management server network (e.g., the IBM SameTime server intelligent network [IBM 1999] or the NetICE server [Leung et al. 2000a, b]) provides transparent communication medium services to the client terminals. Two or more users sit in front of their client terminals, which are connected by the network. The meeting management server network and the terminals together create a virtual link between the users such that they feel as if they are collaborating in the same room. The meeting management server network is the bridge that connects all the client terminals together and coordinates their communications. The client terminals are responsible for delivering an immersive environment to the end users by employing various speech and image processing techniques and 3D computer graphics animations. The meeting management server network is responsible for maintaining the state of the system and distributing the received data streams back to the clients. The data streams can be the avatars' states, audio packets, or multimedia objects.

For example, when a client first connects to the meeting server, the server opens a connection and assigns an ID to that client. Then the server determines a position in the virtual conference room for the avatar representation of this new client such that there is no collision with the other avatars already in the conference. After that, the server sends this position as well as the other avatars' states to this client. In addition, the new avatar's state is also sent to the other clients, whose terminals can then update the virtual environment to indicate the presence of the new avatar. Similarly, when a client disconnects from the server, the client's ID becomes available for the next incoming client. The server also notifies the other clients when a client leaves the virtual room.
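A minimal sketch of this connection handling is given below. The placement rule (walking along a line until a collision-free spot is found), the message dictionaries and the ID recycling are assumptions used for illustration; they are not the actual NetICE protocol.

```python
# Sketch of the meeting server's connect/disconnect handling described above.

class MeetingServer:
    MIN_DISTANCE = 1.0                      # minimum avatar separation (assumed)

    def __init__(self):
        self.free_ids = []                  # IDs released by departed clients
        self.next_id = 0
        self.avatars = {}                   # client_id -> (x, z) floor position

    def _allocate_id(self):
        if self.free_ids:
            return self.free_ids.pop()
        cid, self.next_id = self.next_id, self.next_id + 1
        return cid

    def _free_position(self):
        # Walk along a line until the spot collides with no existing avatar.
        x = 0.0
        while any(abs(x - ax) < self.MIN_DISTANCE for ax, _ in self.avatars.values()):
            x += self.MIN_DISTANCE
        return (x, 0.0)

    def connect(self, send):
        client_id = self._allocate_id()
        position = self._free_position()
        self.avatars[client_id] = position
        # Tell the newcomer its ID, its position, and everyone else's state...
        send(client_id, {"you": client_id, "position": position,
                         "others": dict(self.avatars)})
        # ...and announce the newcomer to the existing clients.
        for other in self.avatars:
            if other != client_id:
                send(other, {"joined": client_id, "position": position})
        return client_id

    def disconnect(self, client_id, send):
        del self.avatars[client_id]
        self.free_ids.append(client_id)     # the ID becomes available again
        for other in self.avatars:
            send(other, {"left": client_id})

# Usage with a trivial transport that just prints the outgoing messages.
server = MeetingServer()
a = server.connect(lambda cid, msg: print(cid, msg))
b = server.connect(lambda cid, msg: print(cid, msg))
server.disconnect(a, lambda cid, msg: print(cid, msg))
```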
5. Virtual environment augmented with physical whiteboard
In the client's terminal, the user can navigate freely in the virtual environment and see the other avatars' navigation as well. A user can control his/her avatar to look at other avatars through navigation. The client terminal is also able to detect collisions with other clients in the virtual room, so that the client is prevented from getting too close to other clients while navigating. The clients' positions and orientations are sent to the server and distributed to the other clients when changes are detected. The virtual world provides a common environment for users from multiple locations to gather together, meet each other, and collaborate on their interests.

A virtual environment can contain and display a diversity of multimedia objects. Initially, a 3D world is selected by the server and is loaded onto every client that connects with the server. As a client connects with the server, a 3D avatar representation of the new user is positioned in the world and distributed to everyone else. Afterwards, if a user decides to share and discuss a 3D computer graphics model, this new model can also be introduced into the virtual world. The 3D object is first sent to the server, which then broadcasts the object to all the other clients.

Similarly, the virtual environment supports audio sources at any 3D location, sometimes referred to as 3D sound. This directional sound technology provides a realistic and immersive sound environment [Gardner 1999]. Thus, a user hears an increasingly louder voice while moving towards the speaker in the virtual room; similarly, the voice fades away when the user moves away from the speaker. Moreover, different proportions of sound are steered between the left and right sound channels according to the listener's and speaker's relative positions, so that the listener can tell which person is talking by listening to the direction of the sound. For example, if the speaker is located to the left of the user, then the sound gets steered more to the left channel and the user hears it accordingly. CMU's NetICE prototype achieves this by weighting the output of the left and right sound output channels and making use of head-related transfer functions (HRTF) [Gardner and Martin 1995].
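The sketch below illustrates only the distance attenuation and left/right steering idea described above, using a simple constant-power pan; the actual NetICE prototype weights the channels and applies HRTFs, so the gain model here is an assumption for illustration.

```python
# Simplified sketch of spatial audio: volume falls off with distance and the
# signal is steered between the left and right channels according to the
# speaker's direction relative to the listener.
import math

def spatial_gains(listener_pos, listener_yaw, speaker_pos):
    """Return (left_gain, right_gain) for a speaker at speaker_pos (x, z)."""
    dx = speaker_pos[0] - listener_pos[0]
    dz = speaker_pos[1] - listener_pos[1]
    distance = max(math.hypot(dx, dz), 0.1)
    attenuation = 1.0 / distance                    # louder as the user moves closer
    # Angle of the speaker relative to the direction the listener is facing.
    angle = math.atan2(dx, dz) - listener_yaw
    pan = math.sin(angle)                           # -1 = fully left, +1 = fully right
    left = attenuation * math.sqrt((1.0 - pan) / 2.0)   # constant-power panning
    right = attenuation * math.sqrt((1.0 + pan) / 2.0)
    return left, right

# Usage: a speaker two metres away, to the front-left of the listener.
print(spatial_gains((0.0, 0.0), 0.0, (-1.0, 2.0)))  # left gain > right gain
```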
A virtual environment furnished with a fixed flat surface can be used to display 2D multimedia including images, videos, foils, and even handwritten strokes. Our world contains virtual whiteboard surfaces whose purpose is to provide a means to display these media and to call attention to points of interest. Users can place a photo image on the board, present some prepared foils, or write equations on the whiteboard. The virtual world is projected on the physical whiteboard based on the user's position and viewpoint. The virtual whiteboard and the physical whiteboard occupy the identical space when the user's position and viewpoint are at the virtual whiteboard. Consequently, the virtual world is augmented with the physical whiteboard. With this correspondence between the writing instrument and the projected display, we can perform numerous desirable actions, such as navigating, dragging and dropping 2D/3D objects, and handwriting in the virtual world, by means of pen devices on the physical whiteboard surface.

When the virtual whiteboard directly corresponds to the physical whiteboard, the user can use the pen to write on the board. Whatever is drawn by the ink-less pen is displayed on the user's projection, sent to the server, rendered on the virtual whiteboard in the environment, and shared with the other users. Thus, when the virtual whiteboard of the immersive environment is completely displayed on the physical whiteboard, resulting in a one-to-one correspondence through the calibration, the whiteboard acts as a 2D stroke input device with the corresponding pen. Furthermore, there can be more than one shared whiteboard in the immersive environment. Consequently, the physical whiteboard behaves as a display surface for the computer connected to the projector. It also offers a natural pen interface for users uncomfortable with the rigidity of a computer keyboard and mouse. As such, the whiteboard efficiently pulls all the collaboration technologies together.
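The sketch below illustrates the calibrated correspondence in its simplest form: raw coordinates from the stroke input device are mapped to virtual whiteboard coordinates using the calibrated extent of the projected board. The linear, axis-aligned mapping and the numbers are assumptions; a real calibration would also correct projector keystone distortion.

```python
# Sketch of the one-to-one correspondence between the physical and virtual
# whiteboards: device coordinates -> virtual whiteboard coordinates.

def make_board_mapping(device_corners, board_size):
    """device_corners = (x_min, y_min, x_max, y_max) of the projected board as
    seen by the stroke device; board_size = (width, height) of the virtual board."""
    x_min, y_min, x_max, y_max = device_corners
    width, height = board_size

    def to_board(x, y):
        u = (x - x_min) / (x_max - x_min)     # normalize within the calibrated extent
        v = (y - y_min) / (y_max - y_min)
        return u * width, v * height          # virtual whiteboard coordinates

    return to_board

# Usage: the projection covers device coordinates (500, 300)-(7500, 5300) and
# the virtual whiteboard is 1.6 x 1.2 units.  Each captured pen sample is
# mapped, then sent to the server to be rendered on the shared board.
to_board = make_board_mapping((500, 300, 7500, 5300), (1.6, 1.2))
print(to_board(4000, 2800))    # -> (0.8, 0.6), the centre of the board
```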
6. Avatar representation
When a user enters an immersive environment, his/her presence is revealed as an avatar representation. This representation is crucial to the other users in the 3D world because it serves as a unique form of identity and existence. Furthermore, advanced avatar representations may also convey expressions and moods through facial expressions and body gestures. In our NetICE prototype, the avatar representation of a user demonstrates identification, expression, and navigation.

In a collaborative environment, each participant is expected to contribute to the agenda. However, it is often difficult to associate a contribution with its contributor without a face or a name attached to it. Consequently, for our collaborative applications, a realistic head model of the user is adopted or a named generic avatar is assigned. For the realistic rendering of a user's head, the avatar model is pre-generated beforehand. A customized 3D head model and corresponding texture map can be generated for a user using a laser scanner [Escher et al. 1998], from multiple-view photographs [Kurihara and Arai 1991; Pighin et al. 1998], or from a video camera [Fua 1998]. The customized realistic avatar is kept with the user until he/she connects with the server and loads the realistic avatar head model into the immersive environment as his/her identity. This allows the other users to easily identify the collaborators with their corresponding contributions. Those users without a personalized head model can label a generic avatar with the user's name, which in turn shows up on the front and back of the avatar's shirt.

Using a generic model has two main benefits in our NetICE environment. First, the generic head permits six rudimentary facial expressions. These face animations make the avatars appear more life-like. Second, the speech of the user is synchronized with the respective generic avatar's lips. The lip-sync animation is currently performed by detecting the energy of the incoming audio samples. Simple as it is, from the other users' perspectives this mouth animation is sufficient to develop an intuitive feel for who in the room is speaking.

On the other hand, our static realistic head model is pre-generated and thus does not offer facial expressions or lip synchronization. Instead, our work on realistic video avatars [Leung et al. 2000a, b] uses the images from the real-time video of the user to convey these facial animations. Figure 7 shows some input video frames and the corresponding realistic video avatar.
Figure 7. Three input video frames and their corresponding realistic video avatar.
The output texture-mapped 3D model is able to reproduce facial expressions and mouth movements. At the time of writing, however, this more advanced real-time processing has not been fully integrated with the current system. Regardless of lip-sync, the audio signal generated by a user still originates from his/her avatar's head model and is played back correspondingly.

The avatar's head model, whether a realistic head or a generic head, is attached to a 3D body model. The body allows arm movements as well as animated walking legs. For instance, the avatar may raise a hand before speaking in order to get the attention of the other users. Also, in NetICE the position of the user determines his/her view of the immersive environment. The user can navigate freely in the world and see the other avatars' navigation as well.

The avatar representation of a user is thus essential for a networked collaborative environment. The identity of a user is revealed by means of either the realistic head model of the user or a simple nametag. The expression or mood of the collaborator is demonstrated by facial expressions and body gestures. The navigation of the participants, and therefore their interests, is seen by everyone in the room. Thus the avatar representation simulates the presence of each user in the immersive environment.
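Since the lip-sync animation is driven by the energy of the incoming audio, a minimal sketch of that idea is shown below: the short-term energy of each audio frame is mapped to a mouth opening for the generic avatar. The frame size and scaling constant are illustrative assumptions.

```python
# Sketch of energy-driven lip synchronization: louder frames open the mouth wider.
import math

def mouth_openings(samples, frame_size=160, full_open_energy=0.1):
    """samples: audio samples in [-1, 1]; returns one opening in [0, 1] per frame."""
    openings = []
    for start in range(0, len(samples) - frame_size + 1, frame_size):
        frame = samples[start:start + frame_size]
        energy = sum(s * s for s in frame) / frame_size    # mean squared amplitude
        openings.append(min(energy / full_open_energy, 1.0))
    return openings

# Usage: a silent frame followed by a 220 Hz tone at half amplitude.
samples = [0.0] * 160 + [0.5 * math.sin(2 * math.pi * 220 * n / 8000) for n in range(160)]
print(mouth_openings(samples))    # -> [0.0, 1.0]
```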
7. Applications
There are several conferencing and collaborative applications that use whiteboards as a more natural medium for conducting discussions. Users are more comfortable with pens than with other computer input devices, and whiteboards offer these users that convenience. With immersive whiteboards in a networked collaborative environment, the benefits are enhanced. Not only do the users share a common networked environment, but they can also interact with each other, navigate around the virtual world, exchange ideas on the immersive whiteboards, and share multimedia objects. Some typical applications include virtual conferencing, design collaboration, brainstorming meetings, distance education and Internet learning applications.

For some applications, the whole collaborative whiteboard session can be recorded and later played back in its dynamic form, allowing users to see how a design was finally generated and not just the final static image of the whiteboard. Our system allows playback of these sessions in MPEG-4 format [MPEG4 1998]. For distance education, the record and playback features are essential for students interested in studying the materials again before an exam.
Figure 8. An auction application using immersive whiteboards.
Also, students who come late to the virtual classroom can play back the missed section at any time.

Figure 8 illustrates an auction application using the immersive whiteboard collaborative system. In the scenario, three bidders are interested in a banker's lamp, which is placed on the podium at the front of the room. The moderator at the podium is conducting the auction and placing the current highest bids on the whiteboard. The bids come in via the individual users' audio channels, and the moderator repeats each bid. Any user in this world can make a bid in this auction by raising his hand and saying his price. Furthermore, there are other auctions concurrently taking place in this large auction house. As a user navigates around the house, the other auctions, with their respective immersive whiteboards and auctioned 3D objects, come into view. The auction winner can conclude the purchase transaction online in the virtual world. The mechanism required to complete the e-commerce transaction employing multimedia presentation and interaction has been studied and reported in [Shae et al. 1999, 2000].
8. Summary
In this paper, we presented an immersive whiteboard collaborative system. The whiteboard is both a physical platform for our input/output interfaces and a virtual screen for sharing common multimedia. The virtual world will be projected on the physical whiteboard based on the user’s position and viewpoint. The virtual whiteboard in the immersive environment can actually correspond to the physical whiteboard in the setup. Consequently, the virtual world is augmented with the physical whiteboard. It is this whiteboard correspondence that allows the user to physically write on the immersive whiteboard. The whiteboard surface allows users to seamlessly navigate and interact in the immersive environment. It offers a natural pen interface for users uncomfortable with the rigidity of a computer keyboard and mouse. As a result, the whiteboard efficiently pulls all the collaboration technologies together. A multi-resolution representation of the
stroke media for efficiently streaming stroke data across participants is presented. The system components of the immersive whiteboard collaborative system are discussed. By integrating multimedia objects and avatar representations into an immersive environment using a combination of multi-modal media, we provide the users with a more transparent medium so that they feel as if they are communicating and interacting face-to-face.

References

Adobe (1998), “The Compact Font Format Specification,” http://partners.adobe.com/asn/developer/technotes/postscript.html.
Ames, A.L., D.R. Nadeau, and J. Moreland (1997), VRML 2.0 Sourcebook, Wiley, New York.
Argyle, M. and J. Dean (1965), “Eye-Contact, Distance and Affiliation,” Sociometry 28, 289–304.
DigitalSpace (1997), “OnLive! Traveler 3D World,” http://www.onlive.com/prod/trav/about.html.
Douglas, D. and T. Peucker (1973), “Algorithms for the Reduction of the Number of Points Required to Represent a Digitized Line or its Caricature,” The Canadian Cartographer 10, 2, 112–122.
Electronics Imaging (1999), “E-Beam,” http://www.e-beam.com/.
Elrod, S., R. Bruce, R. Gold, D. Goldberg, F. Halasz, W. Janssen, D. Lee, K. McCall, E. Pedersen, K. Pier, J. Tang, and B. Welch (1992), “Liveboard: A Large Interactive Display Supporting Group Meetings, Presentations, and Remote Collaboration,” In Proceedings on Human Factors in Computing Systems, pp. 599–607.
Escher, M., I. Panzic, and N. Magnenat-Thalmann (1998), “Facial Deformation from MPEG-4,” In Proceedings of Computer Animation ’98, pp. 56–62.
Fua, P. (1998), “Face Models from Uncalibrated Video Sequences,” In Proceedings of CAPTECH’98, pp. 214–228.
Gardner, W.G. (1999), “3D Audio and Acoustic Environment Modeling,” http://www.wavearts.com/.
Gardner, W.G. and K.D. Martin (1995), “HRTF Measurements of a KEMAR,” Journal of the Acoustical Society of America 97, 6, 3907–3908.
Hendriks, F., X. Wang, B. Tseng, and Z. Shae (2001), “Portable Whiteboard System with Vision Input,” In Proceedings of Video Communication and Image Processing (VCIP 2001), CA, USA, SPIE 4310, 87.
IBM (1999), “SameTime Server,” http://www.lotus.com/.
Impedovo, S. (1993), Fundamentals in Handwriting Recognition, Springer, Berlin.
Koeman, C. and F. Vander Weiden (1970), “The Application of Computation and Automatic Drawing Instruments to Structural Generalization,” Cartographic Journal 7, 1.
Kurihara, T. and K. Arai (1991), “A Transformation Method for Modeling and Animation of the Human Face from Photographs,” In Proceedings of Computer Animation ’91, Tokyo, Springer, pp. 45–58.
Leung, W.H., L. Belle, Z. Shae, F. Hendriks, and T. Chen (2000a), “Realistic Video Avatar,” In Proceedings of the International Conference on Multimedia and Expo (ICME) 2000, IEEE Signal Processing Society Press, New York, pp. 631–634.
Leung, W.H., K. Goudeaux, S. Panichpapiboon, S. Wang, and T. Chen (2000b), “Networked Intelligent Collaborative Environment (NetICE),” In Proceedings of the International Conference on Multimedia and Expo (ICME) 2000, IEEE Signal Processing Society Press, New York, pp. 1645–1648.
Microsoft (1995), “True Type 1.0 Font Files,” http://www.microsoft.com/typography/tt/tt.htm.
Microsoft (1996), “NetMeeting Videoconferencing Software,” http://www.microsoft.com/windows/netmeeting/.
Mizui, S. (1999), “Method and Apparatus for Measuring Position by Image Processing,” US Patent 5,901,236.
MPEG4 (1998), “SNHC Verification Model 9.0 (3D Mesh Encoding),” ISO/IEC JTC1/SC29/WG11 MPEG98/M3809.
NTT Software Corporation (1998), “Interspace 3D Virtual Environment,” http://www.ntts.com/ispace.html.
Pighin, F., J. Hecker, D. Lischinski, R. Szeliski, and D. Salesin (1998), “Synthesizing Realistic Facial Expressions from Photographs,” In Proceedings of SIGGRAPH’98, ACM, New York, pp. 75–84.
Shae, Z., B. Tseng, and J. von Kaenel (2000), “Multi-Resolution and Streaming of Whiteboard Stroke Media,” In Proceedings of Multimedia Software Engineering (MSE2000), Taiwan, IEEE Computer Society Press, Silver Spring, MD.
Shae, Z., X. Wang, and J. von Kaenel (1999), “Electronic Shopping through an Interactive Advertising and Browsing Banner,” In Proceedings of the 1st Workshop on Agent for e-Commerce of IAT99, Hong Kong, pp. 217–225.
Shae, Z., X. Wang, and J. von Kaenel (2000), “Transactional Interactive Multimedia Banner,” In Proceedings of Video Communication and Image Processing (VCIP2000), Australia, SPIE 4067, 49.
Swedish Institute of Computer Science (1998), “Distributed Interactive Virtual Environment (DIVE),” http://www.sics.se/dce/dive/dive.html/.
Tang, J.C. and E.A. Isaacs (1992), “Why Do Users like Video? Studies of Multimedia-Supported Collaboration,” Computer Supported Cooperative Work (CSCW) 1, 163–193.
Virtual Ink Inc. (1999), “Mimio,” http://www.mimio.com/.
White, P. (1996), “CU-SeeMe Videoconferencing Software,” http://www.wpine.com/Products/CU-SeeMe/.