Spatialized Audioconferencing: What are the Benefits?

Ryan Kilgore and Mark Chignell
Department of Mechanical and Industrial Engineering, University of Toronto

Paul Smith
IBM Centre for Advanced Studies*

* The views expressed in this paper are those of the authors and not those of IBM Canada Limited. Copyright © 2003 IBM Canada Ltd. Permission to copy is hereby granted provided the original copyright notice is reproduced in copies made.
Abstract

Audioconference participants often have difficulty identifying the voices of other conferees, especially in ad hoc groups of unfamiliar members. Simultaneous presentation of multiple voices through a single, monaural channel can be discordant and difficult to comprehend. To address these shortcomings, we have developed the Vocal Village, a communications tool that allows for real-time spatialized audioconferencing across the Internet. The Vocal Village system uses binaural audio signals to present the voices of individual conference participants from different apparent positions in space by adding location cues to audio information. This paper describes our experimental research to determine whether the real-time, "within the head" spatialization cues implemented by the Vocal Village are sufficient to provide performance benefits compared to traditional, monaural audioconferencing methods. The performance measures included memory, speaker identification, and participant preference. We also investigated whether giving users the ability to control the location of conference participants within a virtual auditory space further enhanced any such benefits. The "within the head" spatialization used in this experiment did not lead to a statistically significant increase in the ability to remember who said what in an audioconference. However, there was a borderline significant increase in remembering who said what when participants were given the opportunity to move the voices of two similar-sounding conferees into different apparent locations. Participants also significantly preferred spatialized audio formats over the mono audio format. Spatialization had a significant effect on improving participants' perceived confidence in their memory of conferee viewpoints. Additionally, spatialization significantly reduced both the perceived difficulty of identifying speakers during conferences and the amount of attention participants perceived themselves to devote to such voice identification. Providing subjects with the ability to control the apparent location of conference participants resulted in the greatest benefit to both of these measures.

Keywords: Spatial audio, Spatialized audioconference, 3D Voice collaboration

1. Introduction

Reliance on group communication over a distance has increased with the accelerated use of personal computing and a greater frequency of collaboration in dynamic and cross-organizational project teams [10]. However, although an increasing
number of long-distance meetings are being conducted, collaboration tools to support such communication are difficult to use and are often unpopular [9].

Audioconferencing is a form of distance collaboration that is both relatively inexpensive and widely used, particularly in business settings. As a synchronous communication method, audioconferencing facilitates a quality of collaboration between distributed groups that cannot be easily achieved through asynchronous methods such as electronic mail. Audioconferencing occurs in real time and thus facilitates rapid back-and-forth dialogue and exchange of ideas between individual group members by eliminating time lost between interactions. This efficiency can greatly improve the progress of a group's collaboration, which depends in part on how quickly issues can be explored and misunderstandings resolved [12].

The limited amount of research carried out on audioconferencing suggests that the medium's effectiveness depends, to some extent, on task context. Research reviewed by [13] suggests that removing visual cues reduces trust (and thereby increases conflict) by reducing the accuracy of person perception. However, removing visual cues might also be beneficial in some circumstances by limiting the distraction caused by irrelevant personal factors [5]. Perhaps due to reduced levels of distracting cues, some studies [4, 8, 11] have even found increases in decision-making efficiency for audioconferencing. In their review, [5] suggest that although participants claim to be less satisfied with teleconferenced meetings, especially in audio-only settings, individuals actually engage more evenly in this format than in others, experiencing a more uniform distribution of conversational contributions among the participants. They also note that "an interesting aspect of the research on decision quality is that, in general, objective measures fail to support participant perceptions of reduced quality in teleconferenced meetings, even for complex tasks." Thus, while audioconferencing might not always be popular with users, it can be an effective form of communication among workgroups and teams. Such within-group communication tends to be a
key determinant of group effectiveness. For instance, [14] reports that "teams in which members periodically gathered information about others and revealed information about themselves performed better than teams in which members did not do this." Informal audioconferencing can be a very effective method for such information exchange. Further enhancing the quality and functionality of this pervasive form of communication, or even its perceived utility and user experience, would greatly benefit a large population of users.

2. The Vocal Village

We aim to further enhance the quality of collaboration between dispersed individuals by creating new forms of audioconferencing. These new forms are now technologically and economically feasible through the use of soft phones and voice over Internet Protocol (VoIP) systems. To meet our goal of increasing communication effectiveness and improving the satisfaction and experience of audioconference participants, we have developed the Vocal Village system.

The Vocal Village is a real-time audioconferencing application that uses VoIP technology to connect collaborative groups over the Internet. Unlike other VoIP applications, the Vocal Village supplements the traditional audioconference environment by binaurally presenting location cues, causing the voices of individual conference participants to appear as if they are coming from different positions in space. Within the Vocal Village environment, users are able to independently arrange conferees' incoming voice streams across the horizontal plane and view a graphical depiction of the conference arrangement that includes conferee names, positions, and relative volume settings (Figure 1).
Figure 1. The Vocal Village user interface
The Vocal Village uses a client/server architecture that allows audioconference participants to access the Vocal Village through individual client personal computers. Their voices travel as mono sound streams to a central server, which then combines all of the conferee voices into spatialized signals that are in turn sent back to the individual client personal computers.

2.1 Spatialized audio and the Vocal Village

We have chosen to use spatialized audio within the Vocal Village because this format allows for a more natural reproduction of collaborative environments and has been shown to increase human auditory performance in numerous laboratory experiments (see [7] for a review). In complex listening tasks, spatialization of sounds serves as one of several aids during auditory scene analysis [15]. When multiple conferees speak simultaneously in a traditional, monaural audioconference, comprehension of the individual voices is difficult at best: the voices mask each other because they emanate from the same position in the audio space. However, spatially separating the apparent sources of different voices allows the human brain to separate the voices into distinct auditory streams more easily [3]. This spatialization allows each listener to choose which of the voices to attend to at any one time.

Previous work by [1] examined the role of high-fidelity spatialized sound in audioconferences. That study demonstrated that key audioconference performance indicators, including memory of conference events, speaker identification ability and confidence, preference, and perceived comprehension, all increased when individual conferee voices were projected from separate speakers distributed in space rather than from a single speaker source, as occurs in traditional audioconferences. While such speaker arrays provide excellent environments for examining the effects of spatialization, they are not well suited to the desktop environments common to most audioconferences. Multispeaker arrays require hardware and a signal-transport infrastructure well beyond that which is traditionally available to collaborative groups.
Rather than using multispeaker arrays, the Vocal Village system simulates sound spatialization of multiple voices in the horizontal plane using only standard stereo sound outputs. The system requires hardware no more exotic than a standard personal computer sound card and inexpensive stereo headphones. This spatialization is achieved in the Vocal Village by manipulating mono voice streams to incorporate Interaural Time Differences (ITDs) and Interaural Intensity Differences (IIDs), the two major binaural cues for localizing sounds in space [2]. Basic ITD and IID cues can be easily simulated on the fly and require only simple delay and gain manipulation of the mono audio source. Unlike more complex head-related transfer functions (HRTFs), the ITD and IID functions performed by the Vocal Village do not require modifying the frequency spectrum of sounds to include cues for head-related effects and for transformations caused by the shape of the pinna (outer ear). Although this simplified form of spatialization does not allow for highly accurate modeling of sound locations (for example, it is impossible to model elevation with only these two cues), it is a sufficient method for distinguishing locations in the horizontal plane.

Using only ITD and IID cues saves on computational processing requirements and eliminates the need for the individual calibration that more sophisticated spatialization techniques require due to differences in the shapes of people's heads and outer ears. This emphasis on limiting processing requirements supports a key goal of the Vocal Village system, which is to accommodate nearly instantaneous spatialized communication across a variety of platforms. Ultimately, these platforms might include wireless and handheld devices, which have limited processing power. Additionally, research by [6] shows that conversational turn-taking is disrupted if the auditory stream is delayed by more than half a second. Achieving such rapid communication over the Internet requires limiting individual bandwidth needs. The Vocal Village accommodates this limitation by using low-fidelity recording of the individual conferees (8-bit mono sound, sampled at 11,025 Hz) as well as low-fidelity rebroadcast of the final spatialized audio streams (8-bit stereo sound, also at 11,025 Hz).
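The paper does not include the Vocal Village source code, so the following Python sketch is only an illustration of the two steps described above: delay-and-gain (ITD/IID) spatialization of a mono voice stream, and the per-listener mixing step performed by the central server described at the start of this section. The Woodworth spherical-head delay model, the head radius, the far-ear attenuation factor, and the self-voice exclusion are all assumptions made for illustration, not the system's actual parameters.

import numpy as np

SAMPLE_RATE = 11_025      # Hz, the low-fidelity rate used by Vocal Village
HEAD_RADIUS = 0.0875      # m, average adult head (assumed)
SPEED_OF_SOUND = 343.0    # m/s

def itd_seconds(azimuth_deg: float) -> float:
    """Interaural time difference for a source at the given azimuth,
    using the Woodworth spherical-head approximation (an assumption;
    the paper does not specify the exact ITD function used)."""
    theta = abs(np.radians(azimuth_deg))
    return (HEAD_RADIUS / SPEED_OF_SOUND) * (np.sin(theta) + theta)

def spatialize(mono: np.ndarray, azimuth_deg: float,
               max_iid: float = 0.5) -> np.ndarray:
    """Delay-and-gain spatialization of one mono block.

    Positive azimuth = source to the listener's right, so the left ear
    hears the signal slightly later (ITD) and quieter (IID).
    `max_iid` is an assumed far-ear attenuation factor.
    Returns an (n_samples, 2) stereo array, columns = (left, right).
    """
    delay = int(round(itd_seconds(azimuth_deg) * SAMPLE_RATE))
    gain = 1.0 - max_iid * abs(np.sin(np.radians(azimuth_deg)))
    far = np.concatenate((np.zeros(delay), mono))[:len(mono)] * gain
    left, right = (far, mono) if azimuth_deg > 0 else (mono, far)
    return np.column_stack((left, right))

def mix_for_listener(listener, blocks, layout):
    """Server-side step: sum the spatialized blocks of the other
    conferees into the stereo signal sent back to one listener.

    blocks: {conferee: mono np.ndarray, all the same length}
    layout: {conferee: azimuth in degrees, as arranged by this listener}
    """
    n = len(next(iter(blocks.values())))
    mix = np.zeros((n, 2))
    for conferee, mono in blocks.items():
        if conferee == listener:
            continue  # assumed: a listener's own voice is not echoed back
        mix += spatialize(mono, layout[conferee])
    return np.clip(mix, -1.0, 1.0)  # keep overlapping voices in range

Under this model the largest interaural delay (a source at plus or minus 90 degrees) is about 0.66 ms, or roughly seven samples at 11,025 Hz, which illustrates the paper's point that these cues can be computed on the fly at trivial processing cost.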
Using low-fidelity sound sources might even augment the importance of directional cues: spatialization benefits in discrimination tasks have been shown to increase as signal-to-noise ratios decrease (that is, as voice quality decreases) [15].

3. Psychoacoustics study

This section of the paper describes a study that we performed to determine whether the low-fidelity, "within the head" spatialization implemented by the Vocal Village is sufficient to provide performance benefits compared to traditional, monaural audioconferencing. While studies have demonstrated benefits in conference environments of high-fidelity spatialization using studio-quality recordings and multispeaker reproduction [1], similar effects for audio formats suitable to real-time processing and transfer over the Internet have not been investigated.

3.1 Hypotheses

Our expectation prior to experimentation was that the low-fidelity spatialization implemented by the Vocal Village would provide benefits similar to those demonstrated by [1] for high-fidelity spatialization in conference environments. However, we anticipated that the Vocal Village's use of only the IID and ITD binaural cues might limit the magnitude of participants' perception of spatialization and thus possibly reduce the overall benefit. Our general hypothesis comprises three specific hypotheses:

Performance Hypothesis: Cognitive performance, in terms of memory, speaker identification, focal assurance, and confidence, would be greater for low-fidelity spatialized formats than for traditional monaural formats.

Preference Hypothesis: Subjects would prefer the spatialized audioconferences to monaural audioconferences and would perceive an increase in their conference comprehension and in the ease with which they were able to identify speakers in the spatialized formats.

Personalization Hypothesis: Allowing subjects to self-determine the spatial location of individual conference participants would provide additional
benefit in terms of performance and preference measures beyond that achieved by random spatialization of conference participants.

3.2 Design

We implemented a within-subjects design, with audio spatialization format as the independent variable. Twenty-two subjects listened to a series of four pre-recorded audioconferences held between the same four women. These conferences were presented in one of four audio formats:
• Format 1: Nonspatialized audio (Mono format)
• Format 2: Audio spatialized using IID and ITD cues, with conference speakers assigned to random locations (Random spatialized format)
• Format 3: Audio spatialized with the same cues, but with conference speakers arranged in an order determined by the subject (Vocal Village format)
• Format 4: Audio spatialized using CoolEdit, a commercial audio editing environment, with conference speakers assigned to fixed positions (CoolEdit format)

Each participant was equally likely to receive each conference in each of the four audio format conditions, and presentation order was randomized. After listening to each of the four audioconferences, participants were given a memory test, a focal assurance questionnaire, and a postconference questionnaire, modeled on those used in the previous spatialized audioconference study by [1]. Following the experimental session, participants were given a final questionnaire.

Memory Test: Participants completed the two-part memory test immediately after they finished listening to each prerecorded conference. In the first part of the memory test, participants were asked to label a conferee position chart (Figure 2) with the names of the speakers as they were arranged during the conference. (This part of the memory test was omitted for trials presented in the Mono format.) In addition, participants were asked to indicate their confidence in these positions (from 0 to 100%). For the second part of the memory test, participants were presented with 26 statements,
arranged in the order in which they occurred during the conference. Participants were instructed to read the statements and indicate the name of the speaker who made each statement (Speaker Identification), as well as their own confidence that they had correctly identified the speaker (Speaker Identification Confidence).
Figure 2. Conferee positions (four positions, labeled A through D)

Focal Assurance Questionnaire: In the focal assurance questionnaire, participants were asked to briefly outline the viewpoints of the four conferees and to indicate their confidence that they had correctly surmised these viewpoints (Focal Assurance Confidence). The viewpoint statements were later graded on how accurately they reflected conferee viewpoints (Focal Assurance). The grader was blind to participant identity and experimental condition.

Postconference Questionnaire: In the postconference questionnaire, participants rated their own perception of the following four conference attributes using a 1-7 scale:
• The difficulty of determining who was speaking during the conference (Perceived Difficulty)
• Whether more or less of their attention was focused on determining who was speaking (Perceived Attention Allocation)
• The helpfulness of speaker locations during the conference (Perceived Helpfulness)
• Their overall comprehension of the conference (Perceived Comprehension)

Final Questionnaire: In the final questionnaire, participants were asked to reflect on all four conferences they had listened to and to rate the audio format of each conference based on their preference (Format Rating). In addition, they were asked to explain any strategies they used to track who was speaking during each conference.

3.3 Apparatus

Stimuli: Nine voice conferences, each approximately six minutes in length, were held between four women and recorded for use in the experiment. The women were all English speaking, with standard central Canadian accents. They were given controversial topics that typically involved legal action being taken against an individual, for example, an arrest involving the medicinal use of marijuana. The purpose of these topics was two-fold: the controversial material helped maintain the attention of subjects over the duration of the experiment, and it allowed the women to adopt differing viewpoints during the conversation. These conferences were conducted and recorded using the Vocal Village system, with each of the women seated in a remote location but speaking together in real time over the Internet. Of these nine conferences, four were selected as conference stimuli based on similarity in their number of back-and-forth dialogue interactions, equivalent representation of the four individual speakers, and duration.

The conferences were recorded using standard Altec-Lansing headset-microphone combinations in mono 8-bit sound at the low sampling rate of 11,025 Hz and saved as WAV files. Each woman's voice was recorded as a separate audio track, allowing each voice to be individually manipulated using Matlab software to simulate IID and ITD spatialization cues identical to those used in the Vocal Village system. For Format 1 (Mono), the recordings of each woman were left in the monaural format in which they were recorded. For Format 2 and Format 3, individual sound tracks were created for each of the four women in each of four possible locations (-90, -30, 30, and 90 degrees from center), allowing for 24 possible spatialized location arrangements (4! orderings) for every conference. For Format 4, the individual tracks were spatialized using generalized head-related transfer function (HRTF) modeling in CoolEdit.

Presentation: People participated in the experiment one at a time. Each participant listened to the four audioconferences, one in each of the four audio format conditions, using the same off-the-shelf Altec-Lansing stereo headsets that were used to record the conferences. MULTIQUENCE audio mixing software was used to combine the
recorded tracks of each of the four conferees into a single stereo audio signal that was output from a personal computer audio card to the headsets. Subjects were left alone in a quiet room for the duration of each conference.

3.4 Participants

A total of 22 individuals participated in the experiment: 13 males and 9 females, with the majority of subjects recruited from mailing lists of the Interactive Media Lab at the University of Toronto. The remaining three participants were full-time employees of external companies. Prior to experimentation, all participants were screened for any known hearing impairments that might interfere with their ability to perceive spatialized sound. Participants were paid $15 for their time (approximately 70-90 minutes).

3.5 Procedure

Prior to entering the testing room, participants answered a brief survey on their communication habits. Following this survey, the experimental protocol was discussed and the experiment began. The experiment itself consisted of two sections: training and the main task.

Training: In the training section of the experiment, again modeled on the work of [1], participants were asked to listen to four 30-second sound clips, one for each of the four conferees. In these sound clips, conferees stated their name twice, read a 20-second excerpt from a human factors textbook, and again stated their name. The purpose of this training exercise was to familiarize subjects with the four conferees, enabling subsequent identification of the conferees by voice alone. Participants were allowed to listen to these clips as many times as they wished, until they felt comfortable in their ability to identify conferee voices.

Main Task: The main experimental task was performed four times, once for each of the four audio format conditions, and consisted of a self-check phase, a listening phase, and a testing phase. In the self-check phase, subjects listened to two series of four short statements, one read by each of the conferees in a random order. For the first series of four statements, subjects were provided
with the name of the conferee who was reading the statement, but they were told not to look at the name until after they had tried to guess the conferee's identity. For the final four statements, subjects were not provided with the identity of the speaking conferees and were instead instructed to tell the examiner which conferee was speaking after hearing each statement. If participants were unable to successfully identify all four of the speakers, the process was repeated with a new set of stimuli until they could do so. The purpose of this self-check phase was to ensure that subjects were still able to accurately identify conferees prior to each listening phase.

During the listening phase, subjects were left alone in the room to listen to one of the six-minute audioconferences. For the condition in which participants were able to self-determine the locations of conferees, this was accomplished prior to the listening phase by completing the conferee position chart: participants were told to assign one conferee to each of the four possible positions and to write a brief explanation of the reasoning behind the configuration they chose. Subjects were instructed to pay as much attention to the conference as they would if they were an active participant in the discussion, and they were reminded to pay particular attention to the viewpoints of the individual conferees as well as to the locations from which their voices appeared to originate. Finally, subjects were told to signal to the examiner when the conference was finished.

After the listening phase, the examiner reentered the room and the subject began the testing phase, during which they completed a memory test corresponding to the conference they had just listened to, as well as a focal assurance questionnaire and a postconference questionnaire. During the memory test, participants were instructed to answer questions as quickly as possible. After the fourth and final iteration of the main task, subjects were asked to complete the final questionnaire.
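As a concrete illustration of how the second part of the memory test could be scored, the sketch below computes the Speaker Identification and Speaker Identification Confidence measures from one participant's 26 responses. The data structure and function names are hypothetical; the paper does not describe its scoring procedure in code.

from dataclasses import dataclass

@dataclass
class Response:
    statement_id: int
    named_speaker: str    # who the participant said made the statement
    true_speaker: str     # who actually made it
    confidence: float     # self-rated confidence, 0-100

def score_memory_test(responses: list[Response]) -> dict:
    """Speaker Identification = proportion of statements attributed to
    the correct conferee; Speaker Identification Confidence = mean
    self-rated confidence across the 26 statements."""
    n = len(responses)
    correct = sum(r.named_speaker == r.true_speaker for r in responses)
    mean_conf = sum(r.confidence for r in responses) / n
    return {"speaker_identification": correct / n,
            "speaker_id_confidence": mean_conf}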
4. Results

Separate repeated measures analyses of variance were run for each of the dependent variables, with audio format as the independent variable.
Analyses of all four conditions, if significant, were then followed up with analyses of just the three spatialized conditions. This was done to see whether the differences were due to a mono-versus-spatialized effect or to differences between the spatialized formats.
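A minimal sketch of how such repeated measures ANOVAs could be run with the Python statsmodels library is shown below, assuming the results sit in a long-format table with one row per subject and audio format. The file and column names are hypothetical, and the paper does not state which statistics package was actually used.

import pandas as pd
from statsmodels.stats.anova import AnovaRM

# Hypothetical long-format results: one row per subject x audio format.
df = pd.read_csv("results.csv")  # columns: subject, audio_format, score

# Omnibus test across all four formats...
print(AnovaRM(df, depvar="score", subject="subject",
              within=["audio_format"]).fit())

# ...followed, when significant, by the same test restricted to the
# three spatialized formats, to separate a mono-versus-spatialized
# effect from differences among the spatialized formats.
spatialized = df[df["audio_format"] != "mono"]
print(AnovaRM(spatialized, depvar="score", subject="subject",
              within=["audio_format"]).fit())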
While there was no significant difference in ability to remember the viewpoints of individual conferees (Focal Assurance) (F