Gaze Communication using Semantically Consistent Spaces
Appears in Proceedings of CHI 2000, The Hague, The Netherlands, pages 400–407, April 2000.
Michael J Taylor and Simon M Rowe
Canon Research Centre Europe
Guildford GU2 5YJ, UK
+44 1483 448882
{mjt,simonr}@cre.canon.co.uk
ABSTRACT
This paper presents a design for a user interface that supports improved gaze communication in multi-point video conferencing. We set out to use traditional computer displays to mediate the gaze of remote participants in a realistic manner. Previous approaches typically assume immersive displays, and use live video to animate avatars in a shared 3D virtual world. This shared world is then rendered from the viewpoint of the appropriate avatar to yield the required views of the virtual meeting. We show why such views of a shared space do not convey gaze information realistically when using traditional computer displays. We describe a new approach that uses a different arrangement of the avatars for each participant in order to preserve the semantic significance of gaze. We present a design process for arranging these avatars. Finally, we demonstrate the effectiveness of the new interface with experimental results.
Keywords
Gaze, avatar, animation, virtual meeting, videophones.

INTRODUCTION
The telephone is a remarkably successful means of communicating with one other remote person. However, communicating with more than one person using audio alone is much less successful [7]. In this situation, the appropriate use of video can help by supporting some of the rich set of visual cues that aid effective communication in face-to-face meetings [3]. This paper pursues the vision of making the multi-point virtual meeting as convenient as a telephone call [16] and as effective as a face-to-face meeting. The design of such an interface needs to address many relevant cues, which can be broadly divided into two categories: spatial and non-spatial. Both gaze and gesture, together with stereo sound, are inherently 3D spatial cues. Unfortunately, simplistic videoconferencing interfaces (Figure 1a) destroy the 3D context within which such cues originate. Misinterpretation is inevitable, and users quickly become confused about when to speak and who is speaking [18]. The result is unsatisfactory group communication.

The second category of cues covers the non-spatial aspects of the interface. Examples include audio latency [11,7], audio-visual synchronization and fidelity, conference set-up convenience, and ease of integration with other collaborative tools [20]. These issues are important but beyond the scope of the paper.
Figure 1: Multi-point videophone interfaces. a) A typical simple interface. b) The Hydra interface (figure courtesy of Sellen [17]): each remote participant is represented by a surrogate unit consisting of a camera, display, microphone and speaker. This enables gaze communication.

This paper describes a new interactive system that supports the communication of gaze more effectively than previously possible using traditional computer displays. Previous approaches typically assume immersive displays, and use live video to animate virtual participants (VPs) in a shared 3D virtual world. This shared world is then rendered from the viewpoint of the appropriate VP to yield the required view of the virtual meeting.

The next section summarises previous work on interfaces that support the communication of gaze. We then describe the disadvantages of using shared spaces with traditional computer displays. Under these circumstances, we make the novel observation that such views of a shared space do not convey gaze realistically. We explain why it is necessary to render the scene from the viewpoint of the real participant (RP). Consequently, we then show that it is necessary to animate the VPs differently at each node (RP location) for gaze to be communicated effectively. Next, we describe a
design process for optimally arranging the VPs at each node, in order to preserve the semantic significance of gaze. Finally, we demonstrate the improved effectiveness of gaze communication using the new asymmetric interface with a user study.

THE EFFECT OF GAZE ON GROUP COMMUNICATION
In the context of this paper, we define gaze as the person at whom a single participant is looking. The semantic gaze configuration of a meeting is a list of who each participant is looking at, i.e. a list of each participant's gaze. Gaze has long been known to researchers as an important cue for effective communication [1]. When combined with spatial audio, it is especially important for multi-party meetings, where it supports a participant's ability to catch the eye of other participants and to tell when other participants are looking at them [18]. This is typically observed via conversational phenomena such as length of turn and degree of formality when switching speakers [11]. For example, compared with face-to-face interaction, participants using the interface of Figure 1a are more likely to name the next speaker. This process is known as explicit handover.

Sellen has compared these phenomena across various videoconferencing interfaces, including one called Hydra (Figure 1b) that supports gaze communication [18]. Her measurements showed that in fact there was no significant difference in the frequency of explicit handovers between a system of the type shown in Figure 1a and the gaze-preserving Hydra. Sellen attributes this to the fact that the users often felt psychologically disconnected from the situation, and therefore might have compensated for this by behaving in a generally more explicit manner [5]. As expected, the frequency of parallel conversations was much greater with the Hydra interface. The questionnaire results of her study favored the Hydra system, as expected. The most frequently stated reason was that “they could selectively attend to people, and could tell when people were attending to them.” Another common comment was “that participants liked the multiple sources of audio…[that] helped them keep track of one thread of the conversation when people talked simultaneously.”

The conclusion that we draw from this work is that, despite the failure of the Hydra system to support implicit handover, we take the positive questionnaire results and the occurrence of parallel conversations to be significant evidence of the utility of gaze communication. We believe that the fact that the participants were strangers, and also inexperienced users, could have contributed to the feeling of disconnection, and thus increased the explicit handover frequency. Vertegaal has performed experiments to isolate the effect of gaze from spatial audio. In [20] he concludes that “conveying gaze direction – especially gaze at the facial region – eases turn-taking, allowing more speaker turns and
more effective use of deictic [depends on the circumstances of use] verbal references.”

To summarise, we have cited research that indicates that gaze communication helps participants to:
• Establish when to talk more naturally and effectively [20,18],
• Hold parallel conversations and make side comments to other participants [18].
It is on this basis that we claim that the technique we later describe for improved gaze communication will, in turn, lead to more effective video-mediated communication.

EXISTING INTERFACES THAT SUPPORT GAZE
This section describes recent approaches to gaze communication. In order to present this work, it is convenient first to describe the important properties of three aspects of all video conferencing systems: display type, VP representation and the meeting space geometry.

Display Type
In this paper, we make the distinction between traditional and immersive displays. We use the term traditional display to refer to normal desk-top or lap-top computer screens or TVs. Immersive displays come in two forms. The first is the head mounted display (HMD). The second is a very large screen, typically projected onto a wall and referred to as a spatially immersive display [16,4]. The scene containing the VPs can be displayed on either a traditional display or an immersive display. We believe that traditional displays have many advantages over immersive displays. HMDs have the following disadvantages: they disassociate the user from their familiar environment, make it hard to acquire images of the participant's face, and are rather cumbersome and potentially uncomfortable for spectacle users. Spatially immersive displays are typically expensive room-based installations. In contrast, traditional displays are relatively cheap and ubiquitous at work, on the road and at home. There is therefore considerable potential benefit in designing video communication interfaces for traditional displays.

VP Representation
The simplest representation of the VPs is a window in which raw video taken from the VP’s camera is rendered (Figure 1). A more sophisticated representation is a 3D avatar. The avatar can be animated by the real motions of the participant obtained from tracking algorithms running on the video from the participant's camera [13,19]. The avatar can then be rendered realistically from any view, enabling for example, a profile view to be generated from a captured frontal view. In this way, a single camera can be used to generate the different views of a head necessary for gaze communication. This approach also enables the provision of a richer set of 3D cues to the user. Motion parallax – the visual cue that
encodes scene depth as image motion with respect to viewpoint change – can be generated as the user moves their head. In addition, if auto-stereoscopic displays are used, static binocular stereo effects can also be provided for greater realism.

Meeting Space Geometry
At each node the virtual meeting is necessarily conducted in a 3D space defined by the positions of the VPs in the display and the relative position of the RP in front of the display. Examples range from spaces like Figure 1a, where raw video window VPs are tiled arbitrarily in the display, to interfaces where 3D avatar VPs are embedded in a 3D world which is rendered on the display surface, simulating a view of a real meeting [4,13,19].

The Hydra Interface
Figure 1b shows the first type of interface to address the multi-point communication of gaze [2,17]. It uses surrogate participants to represent the VPs. Each surrogate unit contains a small display, camera, microphone and loudspeaker. This interface is studied in considerable detail by Sellen in [18]. A schematic showing how the surrogates are arranged symmetrically at each node is shown in Figure 2. In this manner, all the views of each participant necessary for gaze communication are obtained from the multiple cameras at each node.

Figure 2: Arrangement for a 4-way Hydra conference. Using such an arrangement, it is possible for each participant to see at whom all the other participants are looking.

In order for gaze to be successfully communicated in this type of system, the angular separation between the optical axis of the camera and the line of sight of the RP must be kept small. This is achieved by using small displays in the surrogate units. If larger displays were used, Hydra VPs would not appear to be looking in quite the right directions. Thus, the Hydra system supports effective gaze communication when small displays are used. The principal disadvantage of the system is cost and scalability, requiring N×(N-1) surrogate units, where N is the number of participants in the virtual meeting.

The MAJIC Interfaces
The MAJIC system [14] follows the same principles as Hydra. It improves upon the Hydra interface by allowing life-size VPs. It positions the cameras behind large half-transparent curved screens, thus enabling the camera axes to be approximately aligned with the gaze of the RPs. MAJIC projects the VPs onto the screen. Like Hydra, it requires N×(N-1) cameras and projectors, and is consequently very expensive.

Desktop MAJIC [15] uses traditional displays. It represents a VP with a set of static images of the particular participant looking in a range of directions within the plane of the image (e.g. points of the clock). Consequently, if more than two VPs are placed in a line, the gaze communication is ambiguous, so for meetings with more than two VPs they cannot be distributed along a line. Instead, they have to be distributed over the width and height of the display surface, which looks very unnatural, as VPs have to be rendered looking up and down, as well as from side-to-side as usual.

Interfaces using 3D Models
Ohya et al. [13] propose the use of large immersive, wall-sized displays. They take the first steps towards a system that does not display raw video. Instead, they track the motion of the RPs. The extracted motion parameters are used to animate 3D avatars – the VPs. These VPs can be rendered from any viewpoint, and so the cameras no longer need to be positioned close to the display as before. The VPs are placed in a shared 3D meeting space. They can be rendered from any view as required by the VP's relative position in the virtual meeting. As with MAJIC, the use of an immersive display means that the rendered VPs can be positioned where they would have been if the meeting were real. However, Ohya et al. track markers attached to the user's face, which has obvious practical disadvantages.

The TRAIVI project [19] again projects views from a shared space onto a wall-sized immersive display. They improve upon Ohya's approach by using a tracking technique based upon features already in the face, such as the eyes and the mouth. They also consider approaches to animating the 3D model of the head with the measured movements of the mouth and expression changes. PANORAMA [12] displays VPs in a shared space using auto-stereoscopic displays (usually HMDs), providing both parallax and binocular stereo cues. To achieve this they use real-time acquisition of the shape of the participants from a triplet of cameras, in the form of a depth map (an image where the intensity at each pixel represents the distance from the camera to the scene). TELEPORT [4] uses wall-sized displays but does not communicate gaze.
Another style of interaction that is common for internet meeting places [10] is based on the interface typically used for multi-user, networked computer games. Each participant controls their position and orientation (and hence view) within a shared 3D world with controls issued from a keyboard or joystick. This type of interface is perfect for controlling racing cars or spaceships, but less appropriate, we feel, for mediating group conversation in virtual meetings in a natural manner.

The GAZE Groupware System
Vertegaal [20] adopts a more straightforward approach. He uses traditional displays. Live video is not used. Instead, he uses a static 'snapshot' for each VP, obtained when the participant looks directly into the camera. This, he argues, reduces the bandwidth necessary for communication and also avoids the apparent directional gaze error caused by the angle between camera and display. These snapshots are rendered on planes, once more in a shared 3D space (Figure 3). The orientation of these planes is driven by the gaze of the particular participant, which is in turn measured by tracking the orientation of the eye using an infrared camera. More precisely, when the RP looks at a VP for more than a certain time, the system alters the state of a central 3D model such that the corresponding image plane is rotated to face the appropriate direction (Figure 3).

Figure 3: Using rotating image planes for VPs.

We believe that it is important to communicate the expressions of VPs. It seems clear, therefore, that it is desirable to send real-time video instead of using static snapshots. Of course, this should be done without compromising the quality of the gaze communication. The use of animated planes instead of 3D head models is problematic for two reasons. First, consider the 4-way conference shown in Figure 3. We define subscripts R and V to indicate real and virtual participants respectively. The RP A (henceforth referred to as AR) should see a profile view of DV, who should appear to be looking into the screen at CV. This profile is correctly provided by Hydra (Figures 1b and 2) because it is a multiple camera system. However, because the GAZE system only uses a single camera, it is not capable, for example, of capturing a natural profile view. Further, because it has no 3D model to re-animate from a different viewpoint, it is not able to generate an artificial profile view. Hence, with the symmetric (square) configuration shown in Figure 3, the closest view to the profile view of AR available from the display-top camera is one at 45° to the plane of AR's face, obtainable when AR is looking at BV or DV. This limits the natural gaze communication capabilities of this type of interface.

This is related to what we call the “Mona Lisa” effect. The angle at which the image is positioned in the virtual space has little effect upon the perceived gaze direction. In other words, a frontal image of a person rendered at 45° in a virtual world still looks like a frontal image (albeit slightly smaller), and so the person appears to be looking out of the screen. What dictates the perceived gaze direction is the actual view of the subject that was painted, in our case by a graphics engine, not the position of the viewer relative to the painting. The GAZE system mounts the snapshot images on 3D solid blocks. The orientation of these blocks can be perceived, and hence the intended gaze direction of the VP can be inferred implicitly.

Finally, we note that in Figure 3, the view that AR has of BV or CV is foreshortened, reducing the resolution of the face, so that expression and mood become harder to assess, and hence communication is impaired. This problem is most acute with AR's view of DV, which in this case is completely degenerate, the image plane of DV being viewed end-on and thus disappearing. Again, these problems can be solved with the use of 3D avatars as VPs, instead of planes.

THE PROBLEM WITH SHARED SPACES
This section shows why the use of shared spaces yields poor gaze communication under practical viewing conditions using traditional displays. All virtual meeting systems known to us that use animated models render the VPs in a common shared 3D space. Each node then renders the model from the viewpoint of its VP. Thus, AR's view of the meeting is invariably obtained by rendering the shared virtual world from the viewpoint of AV (Figure 4). Since the shared virtual space is naturally symmetric, the VPs are positioned around a regular polygon. For this scene to be rendered from the viewpoint of AV, it is natural to place the display such that the nearest VPs are at either side of the display, as shown in Figure 4. This presents no particular problem if immersive displays are used, which handle such wide-angle scenes.

However, in the case of a desk-top computer, we note that people usually sit about 20” away from their displays, which are typically 14” wide. Figure 4 shows how a practical asymmetric viewing situation (solid A) differs from the symmetric regular polygonal position used in shared space systems (dashed A). The angle θR is the angular gaze range of the practical RP, and θN is the gaze range for the symmetric position. Figure 4 shows how θN gets larger as N increases, as θN = 180(1-2/N)°. Now in our example, θR ≈ 2arctan(7/20) ≈ 40°, which is significantly less than θ3 = 60°, θ4 = 90° (Figure 4a) and θ5 = 108° (Figure 4b). Intuitively, if views are rendered using a shared symmetric model from the symmetric (dashed) viewpoint, we would expect the effectiveness of the gaze communication to be reduced because the RP is in a different place. More precisely, we expect the gaze communication performance to degrade with increasing angular gaze error θN - θR as N increases. The next section examines this effect in a little more detail.
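To make the numbers concrete, the short Python sketch below (our own illustration, not part of any system described here) evaluates θN = 180(1-2/N)° alongside θR = 2·arctan(w/2d) for the 14-inch-wide display viewed from 20 inches assumed above, and prints the growing angular gaze error θN - θR.

```python
import math

def symmetric_gaze_range(n):
    """theta_N: angular gaze range (degrees) of a VP on a regular N-sided polygon."""
    return 180.0 * (1.0 - 2.0 / n)

def practical_gaze_range(display_width=14.0, viewing_distance=20.0):
    """theta_R: angular gaze range (degrees) of a real participant sitting
    `viewing_distance` inches from a display `display_width` inches wide."""
    return 2.0 * math.degrees(math.atan(0.5 * display_width / viewing_distance))

theta_r = practical_gaze_range()            # ~38.6 degrees, i.e. roughly 40 degrees
for n in (3, 4, 5):
    theta_n = symmetric_gaze_range(n)       # 60, 90 and 108 degrees
    print(f"N={n}: theta_N={theta_n:.0f} deg, error theta_N - theta_R = {theta_n - theta_r:.1f} deg")
```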
Figure 4: Symmetric and realistic viewpoints. The dotted polygons represent the symmetric arrangement of participants in a shared space. The dashed (solid) circles in front of the displays show the symmetric (realistic) positions of the RP respectively. a) A 4-way meeting, where θ4 = 90°. b) A 5-way meeting, where the symmetric RP position is closer to the display, and θ5 = 108°.

VP-VP Gaze Accuracy Requires Asymmetric Spaces
We first consider the accuracy of the perceived gaze of VPs looking at other VPs when views of a shared symmetric scene are rendered from the symmetric viewpoint. We assume that there is an accurate gaze tracking process in operation at each node [6]. Each RP's gaze is sent to the other nodes over a multicast channel. Thus, each node has a dynamically updated representation of the semantic gaze configuration.

With reference to Figure 4a, we consider how BV and DV looking at each other appear to AR in the asymmetric (practical) viewing position. Figure 5a shows the view that is obtained by rendering the scene from the symmetric (dashed) viewpoint. It shows views of the faces of BV and DV at ±45° to the frontal view. The result is a view that shows them looking out of the screen – they do not appear to be looking at each other. These views are not correct for the asymmetric viewer, who expects to see views of the head nearer to the profile view shown in Figure 5b.

Figure 5: VP-VP gaze accuracy for a 4-way meeting. Left and right VPs are looking at each other. a) A view from the symmetric viewpoint (Figure 4a, dashed) shows the VPs appearing to look out of the screen. b) A view from the asymmetric viewpoint gives the correct appearance. Note the favourable relative scale in the asymmetric case – the middle head is larger.1

1 Head models courtesy of NFTS CREATEC, Ealing Studios, UK: http://www.nfts-createc.org.uk

We also note that the symmetric case generates a greater range in scale of the projections of the VPs, which is again unnatural and makes it difficult to discern both the gaze and the expressions of the smaller VPs. We conclude that to convey VP-VP gaze effectively we should not render from the symmetric viewpoint (dashed positions in Figure 4). Correct VP-VP gaze communication is obtained by:
• Removing the RP's VP from the shared 3D scene.
• Rendering the remaining avatars from the asymmetric viewpoint that corresponds to the position of the RP.
Asymmetric Spaces Require Asymmetric Animation
We have established that we need to render from the asymmetric viewpoint of the RP for VP-VP gaze to appear realistic. We now consider how this affects the rendering of VPs when they look out at the asymmetric RP. Figure 6a shows the asymmetric situation where DV is looking out of the screen at AR. The scene is rendered from AR's viewpoint. Therefore, in order for AR to perceive that DV is looking directly at AR, DV needs to be facing ~70° out of the display from BV. Figure 6b shows the same situation from CR's viewpoint. Here, DV needs to be facing 45° into the display (or equivalently, away from BV) for CR to perceive that DV is looking at AV. This means that DV needs to be in a different orientation relative to the other VPs, depending upon whether it is viewed from AR or CR. Figure 7 illustrates this with head models. It shows both top and RP views for the symmetric (a,c) and asymmetric (b,d) cases respectively. We note that the views in Figure 7c,d are not too different. The point is that the VPs have to be orientated differently to achieve the same apparent gaze.
Hence the use of a shared 3D model is no longer justified. Rather, the semantic gaze configuration should be used to build different individual 3D models as appropriate for each node. These individual models are therefore animated by the semantic gaze signals in an asymmetric manner. We use a shared semantic space, rather than a shared Euclidean space.
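The following sketch illustrates what animating from a shared semantic space, rather than a shared Euclidean space, might look like. It is our own illustration, not code from any system described here: the participant names, the coordinate convention and the per-node layout dictionary are all assumptions.

```python
import math
from typing import Dict, Tuple

# A semantic gaze configuration: who each participant is currently looking at,
# e.g. {"A": "C", "B": "D", "C": "A", "D": "A"}.
GazeConfig = Dict[str, str]
# A per-node layout: each participant's (x, z) position as arranged for that node,
# including the local RP's real position in front of the display (negative z).
Layout = Dict[str, Tuple[float, float]]

def head_yaws_for_node(local: str, gaze: GazeConfig, layout: Layout) -> Dict[str, float]:
    """Yaw (degrees) for every VP rendered at `local`'s node.

    Each VP is turned to face its gaze target's position in this node's own
    layout, so the same semantic configuration produces different Euclidean
    orientations at every node while preserving who appears to look at whom."""
    yaws = {}
    for person, target in gaze.items():
        if person == local:
            continue                      # the local RP's own VP is not rendered
        px, pz = layout[person]
        tx, tz = layout[target]           # if target == local, this is the RP's seat
        yaws[person] = math.degrees(math.atan2(tx - px, tz - pz))
    return yaws
```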
Figure 6: Asymmetric animation. Given that we are compelled to render the scene from the viewpoint of the asymmetric RP for VP-VP gaze to appear realistic, the orientations of the VPs in the scene need to be different to make VP-RP gaze work effectively. The figure shows how the system has to render D looking at A. a) VP-RP: For AR, DV is rendered at ~70° to the display. b) VP-VP: For CR, DV is rendered at ~45° to the display.

In general then, VPs need to be in different relative orientations depending on the RP for which they are being rendered, in order for VP-RP gaze to be communicated effectively. They need to be animated asymmetrically.

Shared Spaces do not Support Gaze Communication
We have shown that rendering views of a shared 3D model cannot yield effective gaze communication. More precisely, we have shown two corollaries of the asymmetric viewing conditions caused by using traditional computer displays:
1. The VPs must be rendered from the RP viewpoint for VP-VP gaze to operate effectively.
2. The orientations of the VPs in the scene depend upon the RP for which they are being rendered.
Figure 7: VP-RP gaze and asymmetric animation. a,c) VPs oriented symmetrically. b,d) VPs oriented asymmetrically, as shown in Figure 6a. Note that the views c) and d) are very similar. The point is that the VPs have to be orientated differently to achieve the same apparent gaze. The relative scale is again better for the asymmetric case.

THE iCON SYSTEM
In this section, we describe our implementation of a videophone system that communicates gaze using different, yet semantically consistent, 3D models at each node. We call the system iCon (eye-contact).

Optimal VP Layout
We first consider the nature of these individual 3D models. Given that we have shown that they are necessarily different for each node, we now consider how they can be constructed in an optimal way, tailored for each RP. Until now, we have assumed that the VPs lie on N-1 points of a regular N-sided polygon (Figures 4-7). However, this was a direct result of the symmetry constraint on the shared virtual world, which is no longer useful. Instead, we note two design criteria:
1. In order to make switches of attention as clear as possible, we would like to maximize the minimum apparent turn that any VP has to undergo to switch attention from one participant (VP or RP) to another.
2. We would like the projections of the VPs to be evenly spaced out across the display to avoid occlusions and large amounts of unused display space.
Maximizing (1) subject to the constraint (2) leads to optimal arrangements of VPs, some of which are shown in Figure 8.
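As an illustration of how such arrangements could be computed, the brute-force sketch below (our own Python illustration, not the paper's algorithm; the coordinate convention, display width, RP position and grid resolution are assumptions) fixes evenly spaced horizontal projections (criterion 2) and searches over VP depths to maximize the minimum head turn (criterion 1).

```python
import itertools
import math

def min_turn(depths, xs, rp=(0.0, -2.0)):
    """Smallest head turn (degrees) any VP needs to switch gaze between two
    other participants (VPs or the RP). Positions are (x, z) with the display
    plane at z = 0 and the RP in front of it at negative z."""
    pts = list(zip(xs, depths)) + [rp]            # all participants; the RP is last
    worst = math.inf
    for i, (xi, zi) in enumerate(pts[:-1]):        # only VPs turn their heads
        others = [p for j, p in enumerate(pts) if j != i]
        for (ax, az), (bx, bz) in itertools.combinations(others, 2):
            a = math.atan2(ax - xi, az - zi)
            b = math.atan2(bx - xi, bz - zi)
            d = abs(a - b)
            worst = min(worst, min(d, 2 * math.pi - d))
    return math.degrees(worst)

def best_layout(n_vps, display_width=2.0, max_depth=3.0, steps=8):
    """Brute-force search over VP depths: criterion 2 fixes the evenly spaced
    x-projections, criterion 1 is maximized over a coarse grid of depths.
    Only practical for small numbers of VPs."""
    xs = [display_width * (i / (n_vps - 1) - 0.5) for i in range(n_vps)]
    grid = [max_depth * k / (steps - 1) for k in range(steps)]
    best = max(itertools.product(grid, repeat=n_vps),
               key=lambda depths: min_turn(list(depths), xs))
    return list(zip(xs, best))

# e.g. best_layout(4) returns four (x, z) VP positions for a 5-way meeting.
```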
Figure 8: Optimal VP layout: VPs are arranged so that the minimum apparent turn that any VP has to undergo to switch attention from one participant to another is maximized, subject to the constraint that their projections are evenly spaced out across the display. Such arrangements are shown for 3, 4, 5 and 8 VPs.

Gaze Tracking and Head Animation
The handset of a normal telephone is often used in preference to a “hands-free” modality. This is usually because the remote party does not want their conversation to be heard by unintended local parties. This is especially true in open-plan offices. The same applies to videophone calls, and we therefore argue that the use of a headset is often desirable, as it supports privacy and hands-free operation.

Hence, Figure 9a shows a user wearing a headset, as seen from the videophone camera. We track RP gaze using markers; in this case there are three: one on each earpiece, and one on the microphone. We use a standard, reliable marker-based tracking algorithm [6]. It would be straightforward to use a marker-less face tracking technique [8,9] based on natural features such as the eyes, but such techniques tend to be less robust. The estimated gaze position is displayed as feedback to the RP. The resulting gaze information is used to animate the VPs' heads as described above.

Figure 9: Gaze tracking and face animation. a) The iCon system tracks three markers on a normal audio headset, which can be used to estimate the gaze of the RP. b) The face region in the video is segmented out and rendered on a generic 3D model of a face.

Face Animation
We use a simple texture mapping technique. Using colour and background models together with the positions of the markers, we segment the face region from the raw video. The face video is rendered as a video-texture onto a generic 3D model of a head (Figure 9b) to form the VP.

EVALUATION OF SYSTEM PERFORMANCE
We set out to compare the gaze perception accuracy obtained using the asymmetric models generated by the iCon system against that obtained using the symmetric models. To do this, we generated a set of five gaze configurations each for 3, 4 and 5 VPs – 15 different gaze configurations in all. The asymmetric VPs were arranged as described in the previous section (Figure 8). For each gaze configuration, we generated a view for both the symmetric and the asymmetric model (shown in Figure 10 for 4 VPs). Users were asked to identify at whom each of the VPs was looking in each configuration. We collected data from 16 users and measured the gaze misclassification rate. The results are presented in Table 1.
Figure 10: Gaze perception test. An example of a 4 VP configuration. Users were asked to estimate at whom each VP is looking for 15 such configurations. a) The view rendered from a symmetric viewpoint of a shared space. b) The view rendered from a realistic viewpoint of an asymmetric space. The VPs are evenly spaced across the display.

Discussion
We see that the mean gaze error rate of the asymmetric model is consistently and significantly smaller than that of the symmetric model.2 These results therefore confirm our hypothesis that the asymmetric model supports more effective gaze communication. As we would expect from Figure 7c,d, there was an insignificant number of cases for either model where users thought they were not being looked at when, in fact, they were. Users easily spotted a frontal view of a face, shown on the far right of both Figure 10a and Figure 10b.

2 Using a Student's t-test for significantly different means, we obtain a tiny probability (
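The truncated footnote refers to a Student's t-test on the error rates. A minimal sketch of such a comparison is given below; it assumes SciPy and a paired test on per-user misclassification rates, which the paper does not specify, and it reproduces no data from Table 1.

```python
from scipy import stats

def compare_error_rates(symmetric_rates, asymmetric_rates):
    """Paired Student's t-test on per-user gaze misclassification rates.

    Each argument is a sequence with one error rate per user, measured under
    the symmetric and asymmetric models respectively.
    Returns the t statistic and the p-value."""
    return stats.ttest_rel(symmetric_rates, asymmetric_rates)

# usage with the per-user rates gathered in the experiment:
# t, p = compare_error_rates(sym_rates, asym_rates)
```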