Collaboration in Multimodal Virtual Worlds: Comparing ... - CiteSeerX

162 downloads 127402 Views 166KB Size Report
communication mode conditions, text-chat, audio conference and video conference. Also a web environment was implemented, in order to be a control condition ...
Collaboration in Multimodal Virtual Worlds: Comparing Touch, Text, Voice and Video Eva-Lotta Sallnäs

Introduction Social aspects of virtual reality is an area of research that has expanded as the technology has matured. Collaborative virtual environments (CVE’s) show great promise for investigating how human-human interaction works. The reason for this is that the mode of communication as well as task contexts, spatial affordances, information presentation and manipulation of common objects can be varied in order to understand effects of - and interrelations between - these factors. The communication mode is often text-chat in virtual environments, and audio or video channels are less often used. It has only recently become possible to support other human senses like touch in three-dimensional virtual environments. In my research the main focus is comparing different communication modes such as text-chat, voice communication, video-conferencing, and also to investigate the effect of supporting the touch modality. Evaluation of collaboration through different communication modes is not as common in the area of CVE’s as in the area of telecommunications and computer-mediated communication (CMC). One reason for this is that social psychology has not had a large impact on research in the field of CVE’s. In social psychological studies of mediated interaction, the focus of interest is for example on how people can build and sustain relations, prevent and solve conflicts, and collaborate to attain joint goals [1]. A general argument that a number of theories make is that the social richness of the communication medium has to be matched with the task in order for collaborators to accomplish these different goals [2, 3, 4, 5]. The capacity that a medium has to transmit social information like tone of voice or facial expressions affects people’s notion of social presence. Social presence is defined as the feeling of being present with another person at a remote location [2]. In virtual reality contexts, the concepts of togetherness, co-presence and also social presence are used in order to address issues of social interaction [27, 28]. Short et al. [2] regard social presence as a single dimension that represents a cognitive synthesis of several factors that are naturally occurring in face-to-face communication. Among these are the capacity to transmit information about tone of voice, gestures, facial expression, direction of gaze, posture, touch and non-verbal cues as they are perceived by the individual to enhance presence in the medium. The threedimensionality of virtual environments adds a number of specific features to traditional communication environments. The virtual environment is often perceived as a place in which people can navigate with an avatar, interact with objects and obtain information. In order to evaluate the affordances of a virtual environment the concept virtual presence has been

developed. Virtual presence is defined as the notion of feeling as if being present in a computergenerated environment that feels like reality [6]. Witmer and Singer [6] define this kind of presence as the subjective experience of being in a place or environment, even when one is physically situated in another. It is relevant to compare social presence to virtual presence in hybrid collaborative environments where several modes of interaction are provided. The reason is that both perceived social and virtual presence have been said to predict improved performance. It is therefore important to investigate interrelations between these two dimensions in relation to performance, and to measure collaboration objectively. Studies on mediated collaboration have focused on people's subjective perception of different communication modes and also people's actual behaviour when interacting in different modes [7]. Communicating through audio is important when collaborating at a distance and improves both task performance and perceived affordances in comparison to text-chat [8, 2, 9, 10]. In one study [11] various modes of interaction were examined ranging over audio, handwriting, typewriting, video and face-to-face. Results showed that people spoke more than they wrote, and people performed best in terms of time to complete tasks when audio was provided compared to text. Research has not found such uniform results regarding benefits from using video conferencing. It is usually shown that video does not add significant advantages to task performance compared to audio [12, 13, 14]. In Chapanis' study [11] it was found that, in terms of time to complete tasks and amount of words spoken, there was no advantage associated with visual access. The advantages of video that have been found are subjective ratings of qualities of communication modes. One study showed that preference ratings for people who met in an audio mode were lower than for those who met in a video mode or met face-to-face [15]. Daly-Jones et al. [16] found in their study that users were more aware of the presence of their partner, could monitor their partners' attentional status better, and felt that the mode of communication aided collaboration more when video was provided. Video has also been shown to support informal communication and relation building [17, 18, 19]. It has been shown in a number of studies of CVE’s, that supporting the touch modality through haptic force feedback improves task performance [20, 21, 22]. Haptic sensing is defined as identifying objects through both motor behaviours and stimulation of skin receptors [23, 24]. One study shows that if two people have the opportunity to "feel" the interface they are collaborating in, they manipulate the interface faster and more precisely [25]. Finally results from one study shows that haptic communication could enhance perceived "togetherness" and improve task performance in pairs working together [26, 27]. In this report two experimental studies are presented that were aimed at gaining a better understanding of the social dynamics and experiences of users collaborating in virtual environments using different communicative modes. The research questions investigated include how people perform in accomplishing different tasks when communicating via different media,

and to what extent people perceive that they are socially and physically present in a mediated environment when communicating via different media. Communication Modes in Virtual Worlds In this experimental study a CVE was implemented (figure 1.) using three different communication mode conditions, text-chat, audio conference and video conference. Also a web environment was implemented, in order to be a control condition for the effect of threedimensionality, using two communication mode conditions, audio conference and video conference (figure 2.). Dependent variables in the experiment were perceived social presence, perceived virtual presence, perceived task performance and these were measured by questionnaires. Furthermore objective measures of the interaction were time to finish the task, frequency of words used and finally words per second. Eighty subjects participated in the experiment. Subjects collaborated in pairs and performed a decision making task together.

Figure 1. One subjects perspective on the information in the CVE in Active Worlds showing the other subject, posters and links to QuickTime movie clips with audio. Two PowerBook PC:s, networked through Ethernet, were used in the experiment. The CVE was constructed in Active Worlds with the appearance of an exhibition with information points (figure 1.). The information consisted of posters with pictures and QuickTime movie clips with video images and audio information. Humanlike avatars represented the subjects. The webenvironment was designed as a web-site with the same information points (posters and QuickTime movie clips with audio) as in the CVE. The information points were placed beside each other at one web page in a sequential fashion in the web-conditions (figure 2.). The subjects had no avatars in the web-conditions.

Figure 2. The web-environment with posters and QuickTime movie clips with audio. In the condition with audio connection two telephones with headsets were used. In the condition with a video connection two 21-inch television monitors were used besides two telephones with headsets providing the audio communication (figure 3.).

Figure 3. Subjects collaborating in the web environment with video conference. The text-chat function that comes with the Active Worlds system was used in the text-chat condition but it was hidden in the audio and video condition. The web conditions included one

audio conference condition and one video conference condition. The task was for the subjects to navigate an information space either in a CVE or at a two dimensional web environment. The subjects could communicate via text-chat, audio conference or video conference in the CVE and by audio conference or video conference in the web environment. The decision making task was presented to the pairs of subjects as a written scenario. Results from this experiment show that text-chat is problematic as a synchronous communications medium. When communicating through CVE text-chat, people do not perceive themselves to be virtually present to a large extent compared to a video or voice conference regardless if it is in a CVE or web environment. The dimensions in the questionnaire imply that people in CVE text-chat do not perceive the environment to be as interactive, that they do not have as much control in the CVE, and that the environment is not as engaging or meaningful. There are no differences regarding perceived virtual presence between video or voice conditions in either CVE or web environments. This result must be investigated further as most studies on virtual presence implicitly state that three dimensionality increases perceived virtual presence. Communication through CVE text-chat makes the interaction in the environment less social compared with CVE video and CVE voice conferences. The dimensions in the social presence questionnaire imply that people do not feel that they can interpret the other person’s emotional state equally well. They have a hard time conveying their feelings and emphasis and do not build up interpersonal relations as well. Results regarding perceived social presence are not as straightforward when CVE text-chat is compared with web audio and web video. There is a significant difference between CVE text-chat and web video but not between CVE text-chat and web audio. This implies that web audio would be more similar to CVE text-chat than the other conditions. But at the same time there are no differences regarding perceived social presence between video or voice conditions in either CVE or web environments. There are only one significant difference regarding perceived performance found in this study which was between the CVE text-chat condition and the CVE video condition. That could mean that people generally perceived that they could perform tasks efficiently in most conditions, although slightly less so in CVE text-chat. Making joint decisions in the CVE text-chat condition is very time consuming and crude, about 29 minutes. Dialogues are much scarcer in text-chat and discussions preceding decisions are not as extensive. Results show a clear pattern that CVE text-chat is significantly different from collaboration through audio or video conference both regarding time to perform tasks and words used per second. However even if this pattern also appears for frequency of words used there is one exception: there is no significant difference regarding frequency of words used between CVE text-chat and web audio. The results regarding audio and video conditions are a bit more complicated to interpret. Although differences are not significant mean, values show that collaborating in web audio is very fast, about 6 minutes and in CVE audio about 10 minutes

whereas people spend more time in web video about 12 minutes and in CVE video about 16 minutes. The difference is bigger between the web audio and web video condition than between the CVE audio and CVE video condition. This might mean that three dimensionality smoothens out the difference between audio and video. Mean values also indicate, although not significant, that people have less extensive dialogues in the web audio (M=819) and CVE audio (M=1084) condition than in the web video (M=1463) and CVE video (M=1243) condition. Also for this measure there is a smaller difference between the CVE audio and CVE video condition than between the web audio and web video condition. There are significant differences between CVE video and CVE audio in terms of the frequency of words spoken per second. Collaborating in the CVE audio condition takes less amount of time and more words per second is used - but people spend more time in the CVE video condition and have lengthier dialogues. This would suggest that the video condition is in fact different from the voice condition. The video condition might be a more enjoyable medium - whereas the audio condition is the most efficient for this task. If compared to the web conditions there are no significant differences between CVE audio and web audio or web video, these conditions are similar in this respect. However CVE video differs significantly from both the web audio and web video condition as well as the CVE audio condition. This suggests that CVE video stands out from the other conditions regarding words used per second. The CVE video condition might not be the most efficient medium but it is the medium in which people spend a large amount of time and in which they at the same time have lengthy dialogues. Results from the two way ANOVA shows that there is a significant main effect regarding words used per second between the audio and video conditions and also between CVE and web conditions. These results further strengthen the results from the one way ANOVA that showed that there in fact seams to be a difference regarding wordiness both because of communication medium used and the three dimensions added in the Active Worlds environment. Earlier studies of collaborative environments that were not related to CVE’s have found similar results: that providing audio communication makes an important difference compared to textchat, whereas the effect of adding video is not as evident [8, 9, 12, 13, 14]. However results from the study reported here show that there indeed are differences between audio and video in a collaborative three-dimensional desktop virtual environment (CVE). It is a challenge to understand all aspects of how support for different human modes of interaction affects collaboration in a CVE. However, studies which systematically compare tasks in different CVE conditions can make a start in this direction. Supporting Touch in Collaborative Environments Haptic feedback is a natural ingredient in face-to-face interaction between people, and it also serves important functions for communication. One example of haptic communication is when a person hands over a precious artifact; people then rely to a large extent on haptic perception for recognising that the object has been successfully received. Handshakes and a pat on the back are

powerful communicative events that are symbolic and convey information about relations, status and emotional states. It is also intuitive for people to combine gestures, deictic references and joint manipulation in collaborative environments. An experimental study was performed in order to test the hypotheses that a distributed CVE supporting the touch modality will increase perceived virtual presence and social presence, improve task performance and increase perceived task performance [29, 30]. The independent variable in this experiment was the interface condition with two treatments, CVE-voice-haptic and CVE-voice-only. The subjects, in different locations, performed five collaborative tasks in both conditions. The haptic devices used in the tests were two 1.0 PHANToM (fig. 4). In the first condition that included haptic force feedback, the subjects obtained haptic force feedback from the dynamic objects, the static walls and the other person in the CVE. The subjects could simultaneously manipulate the dynamic objects that were modelled to simulate real cubes with form, mass, damping and surface friction. The subjects could also hold on to each other by pushing a small button on the part of the haptic device that they also manipulated the virtual cubes with. In the second condition the subjects had no haptic force feedback and could not hold on to each other. The haptic device then functioned as a 3D-mouse. Voice communication in both conditions was provided through a telephone connection using headsets. Task performance was measured by the total time it took the pairs of subjects to perform the five tasks, and also by the frequency of failure to lift cubes together - which was used as a measure of precision.

Figure 4. Subjects are doing the tasks using two versions of the PHANToM, on the left a "T" model and on the right an "A" model. The results showed that haptic force feedback significantly increased task performance, which means that the tasks were completed in less time in the haptic force feedback condition. Subjects used an average of 24 minutes to perform five tasks in the haptic force feedback condition as

against 35 minutes in the condition with no haptic force feedback. An analysis of frequencies of failures to lift cubes together as a measure of precision in task performance shows that it is significantly more difficult to co-ordinate actions with the aim of lifting objects in a threedimensional desktop virtual environment without haptic force feedback. Results show that there is a significant difference between conditions regarding subjects' ability to lift cubes when building one cube out of eight cubes and when constructing two piles with eight cubes. In the haptic force feedback condition, subjects failed to lift cubes on average 4 times when building a cube and 7 times when constructing two piles. In the condition without haptic force feedback, subjects failed to lift cubes on average 12 times when building a cube and 30 times when constructing two piles. Thus a major part of the difference of the time it took to perform tasks can be explained by the fact that subjects’ precision when lifting cubes without haptic force feedback is lower. The questionnaire that measured perceived performance showed that the subjects in the haptic force feedback condition perceived themselves to be performing the tasks significantly better. The mean value for each question on a seven point Likert-type scale show that subjects perceived their task performance to be higher in the three-dimensional visual/voice/haptic condition (5.9) than in the three-dimensional visual/voice only condition (5.1). Supporting haptic force feedback in a distributed collaborative environment makes manipulation of common objects both faster and more precise. There are clear connections between the ease with which people manipulate objects together and how long time it takes to complete the tasks. The results also show that haptic force feedback in a collaborative environment makes task performance more efficient. Earlier results in studies investigating one person interacting with a system providing haptic force feedback have shown similar results [20, 21, 22]. A further study has shown that two people manipulate an interface faster and more precisely if they can feel the interface [25]. The analysis of data from the virtual presence questionnaire shows that conditions differ significantly. The subjects' mean rating of perceived virtual presence was higher on a seven point Likert-type scale in the three-dimensional visual/voice /haptic condition (5.4) than in the threedimensional visual/voice only condition (4.4). Haptic-force-feedback thus adds significantly to people’s perceived virtual presence even in an environment that supports voice communication. An example of this is the observation that the emotional expressions of failure were much fewer in the non-haptic environment when people did not manage to lift the cubes. People seemed to be more disappointed when failing to lift the cubes in the haptic environment. The dimension of social presence measured by a questionnaire did not differ significantly across the two conditions when the items were analysed together as a total dimension. The mean rating on a seven point Likert-type scale of the total dimension for social presence was highest for the three-dimensional visual/voice/haptic condition (5.3) and lowest for the three-dimensional visual/voice only condition (4.8). This means that people only perceived themselves to be marginally more socially present in the haptic environment. The results in the study by Slater and Wilbur [26, 31] indicate

that haptic force feedback increased perceived togetherness between people in a collaborative environment. However, voice as a communication medium was not provided in that study, which suggests that voice possibly overshadows the effect of haptic force feedback to a certain extent. Acknowledgements Anders Hedman is gratefully acknowledged for performing the ActiveWorlds study with me. Kirsten Rassmus-Gröhn and Calle Sjöström are also gratefully acknowledged for their contributions to the haptic force feedback study. I would also like to thank Kerstin SeverinsonEklundh for her valuable comments and suggestions.

References 1. McGrath, J. E. (1993). Time, Interaction and Performance (TIP): A Theory of Groups. Small Group Research. In Baecker, R. M. (Ed.) Readings in Groupware and Computer-Supported Cooperative Work: Assisting Human-Human Collaboration. San Mateo, CA: Kaufmann, pp. 116129. 2. Short, J., Williams, E. & Christie, B. (1976). The Social Psychology of Telecommunications. London: Wiley. 3. Katz, R. and Tushman, M. (1978). Communication Patterns, Project Performance, and Task Characteristics: An Empirical Evaluation in an R&D Setting. Organizational Behavior and Human Performance. 23: 139-162. 4. Daft, R. L. & Lengel, R. (1986). Organizational Information Requirements, Media Richness, and Structural Design. Management Science. 32: 554-571. 5. Rice, R. E. (1993). Media Appropriateness: Using Social Presence Theory to Compare Traditional and New Organizational Media. Human Communication Research. 19(4): 451-484. 6. Witmer, B. G. and Singer, M. J. (1998). Measuring Presence in Virtual Environments: A Presence Questionnaire. Presence: Teleoperators and Virtual Environments. 7(3): 225-240. 7. Sellen, A., J. (1992). Speech Patterns in Video-Mediated Conversations, in Proceedings of CHI '92. ACM, pp. 49-59. 8. Ochsman, R.B. and Chapanis, A. (1974). The Effects of 10 Communication Modes on the Behavior of Teams During Co-operative Problem-solving, International Journal of Man-Machine Studies. 6: 579-619. 9. Vaske, J.J. and Grantham, C.E. (1989). Socializing the Human-Computer Environment. Norwood: Ablex. 10. Matsuura, N., Fujino, G., Okada, K. and Matsushita, Y. (1993). An Approach to Encounters and Interaction in a Virtual Environment. Proceedings of ACM Computer Science Conference. Association for Computing Machinery, pp. 298-303. 11. Chapanis, A. (1975). Interactive Human Communication. Scientific American. 232: 36-42.

12. Whittaker, S. & O-Conaill, B. (1997). The Role of Vision in Face-to-Face and Mediated Communication. In Finn, E. E., Sellen, A.J. and Wilbur, S.B. (Eds.) Video-Mediated Communication. Mahwah, NJ: Lawrence Erlbaum, pp. 23-50. 13. Anderson, A., Newlands, A., Mullen, J., Flemming, A. M., Doherty-Sneddon, G. and Van der Velden, J. (1996). Impact of Video-Mediated Communication on Simulated Service Encounters. Interacting with Computers. 8: 193-206. 14. Olson, J. S., Olson, G. M. and Meader, D. K. (1995). What Mix of Video and Audio is Useful for Small Groups Doing Remote Real-Time Design Work? Proceedings of the ACM CHI´95 Conference on Human Factors in Computing Systems. New York: ACM Press, pp. 362-368. 15. Williams, E. (1977). Experimental Comparisons of Face-to-Face and Mediated Communication: a Review. Psychological Bulletin. 84: 963-976. 16. Daly-Jones, O., Monk, A., Watts, L. (1998). Some Advantages of Video Conferencing Over High-Quality Audio Conferencing: Fluency and Awareness of Attentional Focus. International Journal of Human-Computer Studies. 49: 21-58. 17. Dourish, P. & Bly, S. (1992). Portholes: Supporting Awareness in a Distributed Work Group. Proceedings of CHI´92. pp. 541-547. 18. Bly, S. A., Harrison, S. R. and Irwin, S. (1993). Media Spaces: Video, Audio and Computing. Communications of the ACM. 36: 28-47. 19. Kraut, R. E., Fish, R. S., Root, R.W., and Chalfonte, B.L. (1990). Informal Communication in Organizations: Form, Function, and Technology. In Oskamp, S. and Spacapan, S. (Eds). Human Reactions to Technology: The Claremont Symposium on Applied Social Psychology. Beverly Hills, CA: Sage Publications, pp. 145-199. 20. Gupta, R., Sheridan, T. and Whitney, D. (1997). Experiments Using Multimodal Virtual Environments in Design for Assembly Analysis. Presence: Teleoperators and Virtual Environments. 6(3): 318-338. 21. Hasser, C. J. Goldenberg, A. S., Martin, K. M. and Rosenberg, L. B. (1998). User Performing a GUI Pointing Task with a Low-Cost Force-Feedback Computer Mouse. Proceedings of the ASME Dynamic Systems and Control Division. 64: pp.151-156. 22. Massimino, M.J. and Sheridan, T.B. (1994). Teleoperator Performance With Varying Force and Visual Feedback. Human Factors. 36(1): 145-157. 23. Appelle, S. (1991). Haptic Perception of Form: Activity and Stimulus Attributes. In Heller, M. and Schiff, W. (Eds.) The Psychology of Touch. New Jersey: Lawrence Erlbaum, pp. 169188. 24. Loomis, J. M. and Lederman, S. J. (1986). Tactual Perception, In Boff, K. R., Kaufman, L., Thomas, J. P. (Eds.) Handbook of Perception and Human Performance: Volume II, Cognitive Processes and Performance. New York, John Wiley & Sons, Inc. pp. 31.1-31.41. 25. Ishii, M., Nakata, M. and Sato, M. (1994). Networked SPIDAR: A Networked Virtual Environment with Visual, Auditory, and Haptic Interactions. Presence: Teleoperators and Virtual Environments. 3(4): 351-359.

26. Basdogan, C., Ho, C., Srinivasan, M. A., and Slater, M. (2000). An Experimental Study on the Role of Touch in Shared Virtual Environments. ACM Transactions on Computer-Human Interaction. 7(4): 443-460. 27. Durlach, N. and Slater, M. (2000). Presence in Shared Virtual Environments and Virtual Togetherness. Presence: Teleoperators and Virtual Environments. 9(2): 214-217. 28. Heeter, C. (1992). Being There: The Subjective Experience of Presence. Presence: Teleoperators and Virtual Environments. 1(2): 262-271. 29. Sallnäs, E-L., Rassmus-Gröhn, K., Sjöström, C. (2000). Supporting Presence in Collaborative Environments by Haptic Force Feedback. ACM Transactions on Computer-Human Interaction. 7(4): 461-476. 30. Sallnäs, E-L. (2000). Improved Precision in Mediated Collaborative Manipulation of Objects by Haptic Force Feedback. In Brewster, S., and Murray-Smith, R. (Eds.) Haptic HumanComputer Interaction. Proceedings of First International Workshop, Glasgow, UK, August/September. , pp. 69-75. 31. Slater, M. and Wilbur, S. (1997). A Framework for Immersive Virtual Environments (FIVE): Speculations on the Role of Presence in Virtual Environments. Presence: Teleoperators and Virtual Environments. 6(6): 603-616.