PROCEEDINGS of the HUMAN FACTORS AND ERGONOMICS SOCIETY 50th ANNUAL MEETING—2006

USING SPATIAL INTERCOMS TO IMPROVE SPEECH INTELLIGIBILITY FOR INTERNATIONAL TEAMS

Michela Terenzi°, Nandini Iyer^, Brian D. Simpson^, Robert S. Bolia^, Francesco Di Nocera°
°Cognitive Ergonomics Laboratory, Department of Psychology, University of Rome “La Sapienza”, Italy
^Air Force Research Laboratory, Wright-Patterson AFB, OH, USA

One of the most critical aspects of international teamwork activity is communications, most of which are in English, despite the fact that most of the team members in the world are non-native English speakers. Previous research has demonstrated that apparent spatial separation of communications can lead to substantial improvements in speech intelligibility and reductions in communications workload (Bolia & Nelson, 2003). This study investigated the benefits of sound spatialization on non-native English speakers, showing a significant improvement in recognition of the correct responses when speech was provided spatially. This result confirms and extends previous research on the usefulness of spatial audio technology in order to enhance speech intelligibility in multitalker communications environments.

INTRODUCTION

Members of teams involved in critical operations (e.g., air traffic controllers) often perform multiple tasks simultaneously, placing high demands on their cognitive capabilities. Most of these tasks are performed in open environments characterized by high levels of noise and other sources of distraction. In international teams, one of the most critical aspects of the operators’ activity is communications, many of which are in English despite the fact that most of the operators in the world are non-native English speakers. Although technical vocabularies are typically limited, the requirement to speak and understand a foreign language represents an additional demand on the operators.

Several studies on the perception of spoken foreign languages have shown that an unfavorable signal-to-noise ratio degrades the performance of non-native speakers more than that of native speakers (Takata and Nábělek, 1990; Bond et al., 1996). Consequently, it should be possible to improve speech intelligibility by increasing the effective signal-to-noise ratio of any particular communication channel. One means of doing this may be via spatial audio displays, which take advantage of the human listener’s natural ability to segregate speech streams more efficiently when they are separated in space. Spatial audio displays are typically generated by digitally filtering the desired number of channels through transformations that recreate the physical cues used for sound localization (Ericson et al., 2004). By doing so, speech on each channel appears to originate from a distinct location in space. Research has demonstrated that such apparent spatial separation can lead to substantial improvements in speech intelligibility and reductions in communications workload (Bolia, 2003b; Bolia & Nelson, 2003).

This intelligibility advantage is critical, and several studies (Bolia, 2003a; Bolia et al., 2005) have examined it with the aim of improving the workplace operations of Airborne Warning And Control System (AWACS) operators engaged in air battle management tasks. The tasks carried out by AWACS operators and air traffic controllers are very similar. Both groups operate in open environments characterized by large instrument panels, simultaneous communications, and multiple task performance.
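The filtering step described in the introduction (passing each channel through transformations that recreate localization cues) can be illustrated with a minimal Python sketch. This is not the display used in the cited studies: real systems filter with measured head-related transfer functions, whereas the hypothetical `toy_impulse_responses` below substitutes a crude interaural time and level difference. All function names, the 0.7 ms maximum delay, and the 6 dB maximum attenuation are assumptions chosen only for illustration.

```python
import math


def convolve(signal, kernel):
    """Full direct-form convolution of two sequences (FIR filtering)."""
    out = [0.0] * (len(signal) + len(kernel) - 1)
    for i, s in enumerate(signal):
        for j, k in enumerate(kernel):
            out[i + j] += s * k
    return out


def toy_impulse_responses(azimuth_deg, sample_rate=44100,
                          max_itd_s=0.0007, max_ild_db=6.0):
    """Hypothetical stand-in for measured head-related impulse responses.

    Assumption: the ear farther from the source receives a delayed and
    attenuated copy of the signal; the nearer ear receives it unchanged.
    Returns (left_ir, right_ir). Positive azimuth means source on the right.
    """
    lateral = math.sin(math.radians(azimuth_deg))  # -1 (left) .. +1 (right)
    delay = round(abs(lateral) * max_itd_s * sample_rate)  # in samples
    far_gain = 10 ** (-abs(lateral) * max_ild_db / 20)
    near_ir = [1.0]
    far_ir = [0.0] * delay + [far_gain]
    return (far_ir, near_ir) if lateral > 0 else (near_ir, far_ir)


def spatialize(mono, azimuth_deg):
    """Filter one mono channel into a (left, right) pair of ear signals."""
    left_ir, right_ir = toy_impulse_responses(azimuth_deg)
    left = convolve(mono, left_ir)
    right = convolve(mono, right_ir)
    # Pad the shorter ear signal so both have the same length.
    length = max(len(left), len(right))
    left += [0.0] * (length - len(left))
    right += [0.0] * (length - len(right))
    return left, right
```

In this sketch, two channels could be given distinct apparent positions by calling `spatialize` with different azimuths and summing the resulting left and right signals sample by sample.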

Almost all the studies conducted so far on operators have involved well-trained, native English speakers, and large benefits from spatial displays have been reported. In order to assess the utility of sound spatialization for international teams, a study employing a communication performance task was conducted on Italian Air Traffic Controllers (ATC) listening to English communication (Fabrizi et al., 2006). Results showed a significant improvement in speech intelligibility for both expert and novice controllers, suggesting that this effect is very robust. However, it is unclear whether the benefit obtained with the Italian ATC group was simply due to the nature of their jobs (presumably, ATC listeners are trained to listen for a target signal in difficult listening situations, such as high noise and competing speech interferers), in which case naïve users of the display might need an extensive learning period. The present paper addresses this issue by replicating a portion of the Fabrizi et al. (2006) study on a sample of Italian undergraduate students (Experiment 1). In that case, the increase in intelligibility was assessed by comparing the spatially separated presentation of two talkers with non-spatialized listening conditions.

With existing communications systems, spatializing two sources is relatively easy, since each channel can be directed to a unique ear. However, operators are often required to monitor three or more simultaneous communication channels. With three sources, spatial audio displays provide a significantly larger benefit in tasks requiring speech intelligibility than they do for only two sources. Target intelligibility in operational environments is influenced not only by the presence of non-target speech interferers, but also by ambient noise. The loss of target intelligibility caused by each of these two types of masker arises from different processes in the auditory system.
When the masker is noise, target intelligibility is reduced because listeners cannot extract target information from those spectro-temporal regions in which the masker energy dominates, a type of masking often referred to as energetic masking. On the other hand, when the masker is speech, target intelligibility is influenced not only by energetic factors (i.e., spectro-temporal overlap of target and masker phrases) but also by the distracting effects of the speech masker. In other words, listeners are unable to parse segments of the target phrase from a similar-sounding masker phrase, and the resulting loss in intelligibility is frequently referred to as informational masking (Brungart, 2001; Brungart et al., 2001).

For native English listeners, spatially separating an energetic masker from a target affords less benefit than spatially separating two or more speech phrases (the overall benefit depends on the location of the target relative to masker locations); in the latter case, target identification improves by as much as 40%. This amount of improvement is equivalent to reducing the number of talkers by 1 to 1.5 or increasing the level of the target by 3 to 9 dB. While the advantage of spatial separation is widely known for native listeners, the role played by energetic and informational maskers in non-native target identification tasks has received little or no attention. If non-native listeners are at a disadvantage compared to native listeners in target identification tasks with noise maskers, one possibility is that they would extract a larger benefit than native listeners from spatially separating the target and noise sources. Furthermore, it is hypothesized that non-native listeners would also derive a larger benefit from the spatial separation of speech maskers. This is because spatial separation is known to allow the listener to focus on one location to the exclusion of others, thereby providing an additional advantage to non-native listeners. To assess whether these hypotheses hold for native and non-native listeners, a second experiment was conducted to evaluate the amount of benefit obtained by spatializing a target phrase in the presence of two speech or two noise maskers.

EXPERIMENT 1

Method

Listeners. Twenty students (10 females; mean age = 23.75 years; SD = 1.29) participated in the experiment.
All listeners were undergraduate students recruited at the University of Rome “La Sapienza” and reported normal or corrected-to-normal vision and normal hearing.

Stimuli. This study used the Coordinate Response Measure (CRM), a communication performance task used as a measure of speech intelligibility in multitalker listening environments (Bolia, Nelson, Ericson, & Simpson, 2000). The phrases in the CRM corpus are spoken by four male and four female talkers. Each phrase is of the form “Ready [call sign], go to [color] [number] now,” with the possibility of eight call signs (Arrow, Baron, Charlie, Eagle, Hopper, Laker, Ringo, and Tiger), four colors (red, white, blue, and green) and eight numbers (1 to 8). Thus, 256 different phrases are available for each talker, with a total of 2048 phrases in the entire corpus. On any given trial, two phrases from the CRM corpus were presented to listeners. In this study, two call signs (“Baron” and “Laker”) were used, with “Baron” being the target call sign. In all trials, the two talkers were different but always either both male or both female (i.e., same-sex talkers). In the spatialized listening conditions, these two phrases were further processed using virtual audio techniques.
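The corpus arithmetic and trial constraints in the Stimuli description can be checked with a short Python sketch. It enumerates the phrase combinations and draws one two-talker trial under the stated constraints (target call sign “Baron”, masker call sign “Laker”, two different talkers of the same sex). The talker labels, the dictionary layout, and the `make_trial` helper are hypothetical, and the source does not say whether the masker's color and number were constrained to differ from the target's, so drawing them independently is an assumption.

```python
import itertools
import random

CALL_SIGNS = ["Arrow", "Baron", "Charlie", "Eagle",
              "Hopper", "Laker", "Ringo", "Tiger"]
COLORS = ["red", "white", "blue", "green"]
NUMBERS = list(range(1, 9))
# Hypothetical talker labels: four male and four female talkers.
TALKERS = [("M", i) for i in range(4)] + [("F", i) for i in range(4)]


def phrase(call_sign, color, number):
    """Text of one CRM phrase."""
    return f"Ready {call_sign}, go to {color} {number} now"


# Every call sign / color / number combination for a single talker.
PHRASES_PER_TALKER = [
    phrase(c, col, n)
    for c, col, n in itertools.product(CALL_SIGNS, COLORS, NUMBERS)
]
# The full corpus: each talker speaks every phrase.
CORPUS = [(talker, p) for talker in TALKERS for p in PHRASES_PER_TALKER]


def make_trial(rng):
    """Draw a target and a masker phrase from two different same-sex talkers.

    Assumption: color and number are drawn independently for each phrase.
    """
    sex = rng.choice(["M", "F"])
    same_sex = [t for t in TALKERS if t[0] == sex]
    target_talker, masker_talker = rng.sample(same_sex, 2)
    target = {"talker": target_talker, "call_sign": "Baron",
              "color": rng.choice(COLORS), "number": rng.choice(NUMBERS)}
    masker = {"talker": masker_talker, "call_sign": "Laker",
              "color": rng.choice(COLORS), "number": rng.choice(NUMBERS)}
    return target, masker
```

The lengths of `PHRASES_PER_TALKER` and `CORPUS` correspond to the 8 × 4 × 8 = 256 phrases per talker and the 8 × 256 = 2048 phrases in the whole corpus given in the text.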

Procedure. On each trial, listeners heard two simultaneous CRM phrases, either spatialized or non-spatialized. When the two phrases were spatialized, the two talkers were presented in virtual auditory space at ±60° in azimuth, where 0° was directly in front of the listener. An additional condition, in which listeners heard only a single talker, was also run to serve as a baseline of the listeners’ intelligibility using the CRM corpus. Listeners were required to click on a 4 × 8 matrix (four colored rows of eight numbers) indicating the color-number combination spoken by the target talker (designated by the call sign “Baron”). Each listener was presented with 9 blocks of 50 trials. The first block of trials was always the single-talker baseline condition. The remaining 8 blocks consisted of either spatialized or non-spatialized listening conditions, and their order was randomized across listeners. Before commencing the experimental blocks, all listeners were familiarized with the CRM task by listening to two blocks of trials in which the talkers were randomly either spatialized or non-spatialized. During the training phase, feedback about the color-number combination was provided to the listeners; no such feedback was provided during the experimental blocks.

Results

An angular transformation was employed to linearize the distributions and equalize the variances of the proportions of correct responses. A repeated measures ANOVA (single vs. non-spatialized vs. spatialized) was run on the transformed proportions of correct responses. The analysis revealed a significant difference between the three conditions (F(2,38)=139.97; p
