
INFLUENCE OF CONTEXT DEPENDENT QUALITY PARAMETERS ON THE PERCEPTION OF EXTERNALIZATION AND DIRECTION OF AN AUDITORY EVENT

STEPHAN WERNER1 AND FLORIAN KLEIN2

1 Technische Universität Ilmenau, Electronic Media Technology Lab, Ilmenau, Germany [email protected]

2 Technische Universität Ilmenau, Electronic Media Technology Lab, Ilmenau, Germany [email protected]

Besides an adequate technical realization of an audio reproduction system, the context of use plays a major role when aiming for a perfect auditory illusion with immersion and plausibility. This contribution presents results from perceptual experiments on context dependent quality parameters. A binaural synthesis of an acoustic scene via a personalized headphone system is used. The investigated quality parameters are influenced by divergence between synthesized scene and listening room, visibility of the scene, and personalization of the system. The test persons describe the plausibility of the perceived auditory scene by means of two quality features: perceived externalization and perceived direction of the auditory event. The analysis shows significant differences in perceived externalization depending on the occurrence of localization errors, but also on divergence or congruence between the listening room and the synthesized room.

INTRODUCTION

Results from perceptual experiments on context dependent quality parameters are presented. The question is to what extent these parameters influence the perception of an auditory illusion. A binaural synthesis of an acoustic scene via a personalized headphone system is used. The investigated quality parameters are influenced by divergence between synthesized scene and listening room, visibility of the scene, and personalization of the system. Two rooms with different acoustic parameters are used as recording and listening rooms. The test persons listen either to a synthesis of the room they are sitting in or to a synthesis of the other room. The test persons describe the plausibility of the perceived auditory scene by means of two quality features: perceived externalization and perceived direction of the auditory event. Because it is unknown whether the relevant quality parameters depend on acoustic or on visual cues, two groups of test persons are used. The first group has no visual cues (dark room), while the second group sees the synthesized source positions and the listening room. We found significant differences in perceived externalization depending on the combination of synthesized and listening room. Differences are also found between the two groups and depending on the personalization of the binaural system.

1 MOTIVATION

The development of audio systems [1] is motivated by the goal of creating perfect auditory illusions with a high degree of immersion [2] and plausibility [3]. Much work has been done to increase the technical quality of such systems. Systems based on the principles of binaural synthesis are able to achieve auditory illusions. Binaural synthesis takes the underlying perceptual processes into account by directly synthesizing the corresponding sound pressure at the eardrums of a listener. The technical elements are perceptually well understood and controllable (see, e.g., [4], [5]). Sound sources in rooms can be described by binaural room impulse responses (BRIRs). The BRIRs can be derived from acoustic room simulations or from measurements of real sound sources in real rooms. A personalization of the binaural system is achievable by using individual BRIRs and individual headphone equalization.

In addition to the technical realization of the correct binaural synthesis and signals, many psychoacoustic effects in the perception of auditory scenes and their interconnections are not yet completely understood. Such effects cover, for example, multimodal interactions between acoustical and visual stimuli like the McGurk effect [6] or the ventriloquism effect (e.g. [7], [8], and [9]). Other perceptual effects depending on the congruence or divergence between the synthesized scene (including the room) and the listening situation also seem to have a non-negligible influence on perception [10]. The quality of experience of an audio reproduction system depends on technical quality elements of the system but also on context-dependent quality parameters.

To contribute to the improvement of binaural synthesis, this paper focuses on the acoustic divergence between listening room and synthesized room, the visibility of the listening room and the simulated source positions, and the personalization of the binaural synthesis system. The quality of experience is measured with listening experiments. The ratings of perceived externalization and direction of the auditory event are shown. However, these are only two of the possible quality features that influence a plausible perception of an auditory illusion. Investigations on quality features can be found, for example, in [11], [12], or [13].

AES 55th International Conference, Helsinki, Finland, 2014 August 27–29

Figure 1: Integration of context dependent quality parameters and effects of auditory adaptation as an extension of the general quality taxonomy for an interactive auditory virtual environment proposed by Silzle [4].

An extension of the general quality taxonomy for an interactive auditory virtual environment proposed by Silzle [4] is shown in figure 1. Based on the quality formation process described by Jekosch [14] and Raake [15], an additional layer is proposed to show that the context influences the quality features. The comparison and judgment between the desired and perceived quality features is therefore influenced by context-dependent quality parameters. This multi-dimensional view of perceived quality and the importance of context are also shown in the ARCU model [16]. We expect that further investigations on context-dependent quality parameters will lead to an expedient adaptation of the application or service to the context of use.

2 BINAURAL SYNTHESIS SYSTEM

For the test stimuli, individual and artificial BRIRs (the latter using a KEMAR head-and-torso simulator) are measured for the selected rooms, sound source, and positions. The stimuli are prepared for auralization via headphones. The binaural system is customizable to each participant to avoid within-cone and out-of-cone confusion errors [17] and to increase the similarity of the simulation to real loudspeakers [18]. Different rooms with defined room acoustics and adequate source-receiver distances are chosen to include reverberation. Reverberation supports the perception of externalization of an auditory illusion and the impression of distance [19]. The headphones are equalized using individual headphone transfer functions (HPTFs) if individual BRIRs are used. HPTFs from the head-and-torso simulator are used if artificial BRIRs are used. In-ear microphones are used to measure individual BRIRs and HPTFs at the entrance of the blocked ear canal of each test person. The microphones are not removed between the BRIR and HPTF measurements. The inverse of an HPTF is calculated by a least-squares method with minimum phase inversion [20]. Stax Lambda Pro headphones are used for playback.

3 OBJECTS OF INVESTIGATION

The listening experiments focus on the evaluation of context dependent quality parameters and their influence on the perception of externalization and direction of the auditory event. Externalization is assumed to be a crucial quality feature for reaching a plausible spatial auditory illusion with binaural headphone systems. Additionally, a correct localization of auditory events is assumed to be a mandatory feature as well. This applies especially to binaural simulations of multi-channel loudspeaker setups as proposed in MPEG-H [21]. Localization accuracies similar to those obtained when listening to real loudspeaker setups are necessary [22]. This includes externalization and correct front-back localization.
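As an illustration of the headphone rendering described in Section 2, the following minimal Python sketch convolves a mono signal with a left and right BRIR and then with an inverse headphone filter per ear. It is only a sketch under stated assumptions: the function names and the toy filter coefficients are hypothetical, the inverse headphone filters are taken as given (the least-squares, minimum-phase inversion of [20] is not reproduced), and a plain time-domain convolution stands in for whatever processing chain the authors actually used.

```python
def convolve(signal, impulse_response):
    """Direct-form linear convolution of two finite sequences."""
    out = [0.0] * (len(signal) + len(impulse_response) - 1)
    for i, s in enumerate(signal):
        for j, h in enumerate(impulse_response):
            out[i + j] += s * h
    return out


def binaural_synthesis(mono, brir_left, brir_right,
                       hp_inverse_left=(1.0,), hp_inverse_right=(1.0,)):
    """Render a mono signal for headphones (hypothetical helper).

    Each ear signal is the mono input convolved with that ear's BRIR
    and then with that ear's inverse headphone filter. The default
    inverse filter (1.0,) is an identity, i.e. no equalization.
    """
    left = convolve(convolve(mono, brir_left), hp_inverse_left)
    right = convolve(convolve(mono, brir_right), hp_inverse_right)
    return left, right


# Toy data, not measured responses: a unit impulse and two-tap "BRIRs".
mono = [1.0, 0.0, 0.0]
left, right = binaural_synthesis(mono, [0.5, 0.25], [0.0, 0.4])
```

With a unit impulse as input, each output channel simply reproduces the corresponding BRIR followed by zeros, which is a quick sanity check for the convolution.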
Listening tests with a pre-test and a main test are conducted. The tests investigate the combinations of listening room and synthesized room. Additional context dependent quality parameters, namely visibility of the listening room and personalization of the binaural synthesis, are investigated.

3.1 Acoustic divergence between the used rooms

A listening lab (Rec. ITU-R BS.1116, V=179 m³, RT60distance(2.2m)=0.16 s) and an emptied seminar room (V=182 m³, RT60distance(2.2m)=1.4 s) are used for the listening test and for the measurement of the BRIRs at a distance of 2.2 m from the listening point. The tests are conducted at the same positions as the recordings in each room to evaluate the influence of the listening situation. The left part of figure 2 shows the combinations of listening room and synthesized room used in the tests.

Figure 2: Left: Combinations of listening and synthesized room used in the listening test; SR=seminar room, HL=listening lab; Right: Positions of the synthesized sound sources; distance of the sources to the listener approx. 2.2 m.

3.2 Visibility of the listening room

The test persons are randomly divided into two groups depending on the presence of visual cues within the tests. For the first group (twelve test persons) the illumination of the listening rooms is minimized (nearly complete darkness) and a visually opaque but sound-transparent black curtain is arranged around the test persons at a distance of 2.2 m. These test persons should have no visual impression or additional visual cues of the listening room. The test persons in the second group (eleven test persons) are placed in the illuminated listening rooms, and dummy loudspeakers are placed around the listener in steps of 30° to provide additional visual cues.

3.3 Sound source positions

Five sound source directions are considered within the test. Genelec 1030A loudspeakers are used to measure the BRIRs for each position. The right part of figure 2 shows the different positions. The distance from the loudspeaker to the listening point is approx. 2.2 m for the pre-test and the main test. The height of the source position is approx. 1.3 m (ear position of a sitting person). The BRIRs for each position and for each test person were recorded in the two rooms. The recording position was the same as the listening position in the test.

3.4 Personalization of the binaural synthesis system

The individual BRIRs of the test persons for the two rooms and all source directions and the individual headphone transfer functions are recorded in a preceding session as part of the pre-test. Furthermore, the BRIRs and HPTFs of a KEMAR head-and-torso simulator (45BA) are recorded. Both the individual and the artificial BRIRs are used to create the binaural test stimuli for the test. The test persons have to attend on two different days to record their individual BRIRs and HPTFs in each room.

4 LISTENING TEST

The pre-test and the main test investigate the perceived direction and externalization of an auditory event. After the recording of the individual BRIRs and HPTFs on the first day, the pre-test is conducted in both rooms. The pre-test serves as introduction and training for the test procedure and the stimuli. The listeners should develop individual percepts for localization and externalization. The main test is conducted in the listening lab and the seminar room separately, in two sessions on different days. In every session the test person listens to individually synthesized and dummy-head synthesized source positions of both recording rooms. The stimuli are presented twice in random order. The perceived angle of incidence is rated by choosing the respective direction on a top-down view. Externalization is rated by choosing the midpoint, inner circle, or outer circle of a rating sheet. Figure 3 shows the rating interfaces used in the experiment. A multi-stimulus design is used in the pre-test and a single-stimulus design in the main test. The sliders of the interface are not active during the test. The grading and definition of the quality feature externalization follow Hartmann and Wittenberg [23]. The following definitions are used in the test: a) midpoint: “The auditory event is entirely in my head or it is very diffuse.”; b) inner circle: “The auditory event is external but it is next to my ears or head.”; c) outer circle: “The auditory event is external and well locatable.”. Note that the definitions are given in German, the native language of the test persons. The use of the different rating designs is motivated by the question of whether ratings obtained with a multi-stimulus design, which requires less rating time, are comparable with ratings obtained with a single-stimulus design.

Figure 3: Graphical user interface used for the ratings in the listening test with exemplary ratings; left=multi-stimulus design for pre-test; right=single-stimulus design for main test.

Twenty-three test persons (six female, 17 male, mean age 27 years) participate in the listening test. The test persons are well experienced with listening tests and are trained before each test. The introduction of the quality features localization and externalization consists of an oral and a written definition in addition to listening to the stimuli.


4.1 Pre-test design

The pre-test also serves as introduction to the test design and procedure of the main test. Furthermore, the test persons can familiarize themselves with the concepts of localization and externalization by comparing the test items within the multi-stimulus test design. One group of test stimuli consists of non-binaural audio signals panned with intensity stereophony. These audio signals are also convolved with a measured omnidirectional room impulse response from the frontal loudspeaker position, so that they include the room acoustics of the listening room but no binaural cues. An unfamiliar male speech signal and a saxophone part are used as test signals. The signals are chosen to include broadband and transient signals. The duration of the test signals is approx. 3 seconds. The second group of test stimuli is a binaural synthesis of the loudspeaker positions using artificial BRIRs of the listening room. These two groups of test stimuli are chosen to obtain stimuli with a low and a high spatial quality. They serve as lower and upper references for the percept externalization. The listening room and the synthesized room are congruent in the pre-test. The listening room and the loudspeaker positions, represented by dummy loudspeakers, are visible to the test persons. The pre-test follows directly after the recordings of the individual BRIRs and HPIRs.

5 RATINGS

The ratings of the quality feature externalization are counted as frequencies. An externalization index is calculated as the ratio between the number of ratings for “external” (outer circle on the rating sheet) and the total number of ratings. An index of “0” indicates complete in-head localization, while an index of “1” indicates externalization of the auditory event. The ratings for the feature localization are recorded as discrete direction angles. For further analysis, a within-cone confusion localization error is calculated for the ratings of the presented directions 0°, 180°, and 330°. The distributions of the ratings for the two audio signals show no significant differences for the two quality features. The ratings for both signals are therefore pooled for further analysis.

5.1 Externalization

The ratings from the pre-test for perceived externalization are shown in figure 4. The context of the pre-test includes congruence between listening room and synthesized room, the use of artificial BRIRs and intensity stereo panned audio signals, and visibility of the listening room as well as of possible source positions (dummy loudspeakers around the listener in steps of 30°).
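The externalization index defined in Section 5 can be written down in a few lines. The sketch below computes the index from a list of rating labels and adds a 95% binomial confidence interval. The label strings and function names are hypothetical, and the Wilson score interval is an assumption made for illustration: the text only states that 95% binomial confidence intervals are shown, not which method was used.

```python
import math


def externalization_index(ratings):
    """Share of ratings that chose the outer circle ("external").

    `ratings` is a list of labels; the strings "midpoint", "inner"
    and "outer" are hypothetical names for the three rating options.
    """
    if not ratings:
        raise ValueError("at least one rating is required")
    n_external = sum(1 for r in ratings if r == "outer")
    return n_external / len(ratings)


def wilson_interval(successes, total, z=1.959964):
    """Wilson score interval for a binomial proportion.

    z = 1.959964 corresponds to a two-sided 95% interval. The choice
    of the Wilson method is an assumption, not taken from the paper.
    """
    p = successes / total
    denom = 1.0 + z * z / total
    center = (p + z * z / (2.0 * total)) / denom
    half = z * math.sqrt(p * (1.0 - p) / total
                         + z * z / (4.0 * total * total)) / denom
    return center - half, center + half


# Hypothetical example: 8 of 10 ratings chose the outer circle.
ratings = ["outer"] * 8 + ["inner", "midpoint"]
index = externalization_index(ratings)
lower, upper = wilson_interval(8, 10)
```

For this toy input the index is 0.8 and the interval is roughly 0.49 to 0.94, which shows how wide such intervals are for small numbers of ratings per condition.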

4.2 Main test design

The test stimuli include divergence and congruence between the listening room and the synthesized room. The listening test is performed in the rooms SR and HL (figure 2, left). The stimulus set varies personalization, room combination, source direction, and visibility of the listening room and loudspeakers. As in the pre-test, speech and saxophone audio samples are used. The main test is a single-stimulus test with a randomized order of test stimuli for each test person. The rating interface is shown in the right part of figure 3. The current stimulus can be replayed. The number of stimuli is 120 (5 directions x 2 signals x 2 rooms x 3 BRIRs x 2 repetitions). In addition to the individual and artificial BRIRs, a set of free-field KEMAR head-related impulse responses is used to synthesize the test stimuli. No non-binaural signals are used in the main test. The free-field stimuli are used as spatial anchors with a low spatial quality [17], [18]. All stimuli are presented as training at the beginning of the main test. The test persons have to rate the training stimuli with a multi-stimulus design as in the pre-test. The test persons are randomly divided into two groups regarding the presence or absence of additional visual cues.
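The stimulus count of the main test (5 directions x 2 signals x 2 rooms x 3 BRIRs x 2 repetitions = 120) can be checked by enumerating the factor combinations. The following sketch builds one randomized single-stimulus playlist per test person. The five angles are taken from the directions mentioned in the results (0°, 90°, 120°, 180°, 330°); the function name, the label strings, and the use of a per-person random seed are illustrative assumptions.

```python
import itertools
import random

# Factor levels; the labels follow the abbreviations used in the figures.
DIRECTIONS_DEG = (0, 90, 120, 180, 330)
SIGNALS = ("speech", "saxophone")
SYNTHESIZED_ROOMS = ("SR", "HL")   # seminar room, listening lab
BRIR_SETS = ("IN", "KK", "FF")     # individual, artificial, free field
REPETITIONS = 2


def build_playlist(seed=None):
    """Return all conditions, each repeated, in random order.

    `seed` is a hypothetical per-person seed that makes the random
    order reproducible; it is not part of the described procedure.
    """
    conditions = list(itertools.product(
        DIRECTIONS_DEG, SIGNALS, SYNTHESIZED_ROOMS, BRIR_SETS))
    playlist = conditions * REPETITIONS
    random.Random(seed).shuffle(playlist)
    return playlist


playlist = build_playlist(seed=1)
```

The playlist contains 60 distinct conditions, each presented twice, which matches the 120 stimuli stated in the text.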

Figure 4: Perceived externalization from the pre-test, dependent on the listening room, direction, congruence between listening and synthesized room; usage of artificial BRIRs; visibility of the listening room and source positions; 95% binomial conf. int.; SR=seminar room, HL=listening lab.

Externalization indices of 0.8 are reached for 90° and 120° in both listening rooms, and approx. 0.7 for 330°. Apart from the non-binaural stimuli, the lowest indices are reached for the less reverberant listening lab and for the 0° and 180° directions.

The main test asks for perceived externalization under different synthesis conditions. Different personalization methods, congruence or divergence between the listening room and the synthesized room, and the visibility of the listening room are investigated. Figure 5 shows the ratings of the test persons without additional visual cues. The perceived externalization depends on the presented source direction. The lowest externalization indices are reached for the synthesis of audio signals with free-field artificial head-related impulse responses. Less externalization is also measured for virtual audio objects placed at directions with expected localization inaccuracies. These directions are 0° with front-back confusions and 180° with back-front confusions, especially when using artificial BRIRs. A higher externalization is reached for congruence between the listening room and the synthesized room than for divergent listening conditions. This effect is visible, for example, for the synthesis of the seminar room using individual BRIRs (SR_IN) or of the listening lab using individual BRIRs (HL_IN) for the 0° and 180° directions. A similar tendency is also visible for other directions and conditions.

Figure 5: Externalization index for combinations of listening and synthesized room, personalization, and without additional visual cues; 95% binomial conf. int.; SR=seminar room, HL=listening lab, KK=artificial BRIRs, FF=free field, IN=individual BRIRs.

Figure 6 shows the ratings of the test persons with additional visual cues (visibility of the listening room and dummy loudspeakers). The distribution of perceived externalization is similar to that of the group without additional visual cues. Externalization is rated lower for the 0° and 180° directions. Compared to the other group, higher indices are reached for test conditions with congruence between the listening and synthesized room than for divergent listening conditions. This effect is also less distinct for the non-personalized binaural synthesis using artificial BRIRs.

Figure 6: Externalization index for combinations of listening and synthesized room, personalization, and additional visual cues with 95% binomial conf. int.; SR=seminar room, HL=listening lab, KK=artificial BRIRs, FF=free field, IN=individual BRIRs.

The presence of additional visual cues has a supporting effect on the perceived externalization. A mean increase of 16% is reached for the group with additional visual cues compared to the other group. No clear tendencies with respect to the personalization method and the combination of listening room and synthesized room are detectable. A higher increase of perceived externalization of 24% is observed for the 180° direction, independent of the other test conditions. This is an unexpected effect because the test persons are instructed to keep still during the test and are not able to see the back positions. Additionally, back-front confusions at a source direction of 180° are decreased for the group with additional visual cues. An analysis of the relationship between localization errors and perceived externalization is presented in section 5.2.

A strong agreement of the indices is visible between the ratings of the pre-test and the main test for artificial BRIRs and the room combinations “HL in HL” and “SR in SR”. The multi-stimulus and the single-stimulus designs seem to be feasible and produce comparable results. The multi-stimulus method decreases the rating time for each test person substantially. Furthermore, non-binaural and free-field stimuli seem to be suitable as anchors with a low spatial quality to define the lower quality point of the rating scale.


5.2 Localization and externalization

Figures 7, 8, and 9 show the relationship between perceived externalization and localization for different directions, test conditions, and the two groups of test persons. The localization performance is given as the within-cone-of-confusion error in percent, which includes front-back and back-front confusions. For a target source direction of 0° the error is termed front-back confusion, for a source at 180° back-front confusion, and for a source at 330° within-cone confusion.
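One way to turn the rated directions into such a confusion error is sketched below. It counts a response as confused when it falls into the opposite front or back hemisphere relative to the target, and reports the share of confused responses in percent. This hemisphere criterion, the function names, and the example responses are assumptions for illustration; the exact criterion used in the study is not given in the text.

```python
def hemisphere(angle_deg):
    """Classify an azimuth as "front", "back" or "side".

    Assumes 0 degrees is straight ahead and 180 degrees is directly
    behind; exactly 90 and 270 degrees lie on the interaural axis.
    """
    a = angle_deg % 360
    if a < 90 or a > 270:
        return "front"
    if 90 < a < 270:
        return "back"
    return "side"


def confusion_rate_percent(target_deg, responses_deg):
    """Percentage of responses in the hemisphere opposite to the target.

    Responses exactly on the interaural axis are not counted as
    confusions (an assumption of this sketch).
    """
    target_side = hemisphere(target_deg)
    if target_side == "side":
        raise ValueError("target lies on the interaural axis")
    if not responses_deg:
        raise ValueError("at least one response is required")
    opposite = "back" if target_side == "front" else "front"
    confused = sum(1 for r in responses_deg if hemisphere(r) == opposite)
    return 100.0 * confused / len(responses_deg)


# Hypothetical responses, not data from the study.
front_back = confusion_rate_percent(0, [0, 180, 150, 30])
back_front = confusion_rate_percent(180, [180, 0, 210, 180])
within_cone = confusion_rate_percent(330, [330, 210, 300, 330])
```

For a target at 330° the mirrored position on the same cone of confusion is 210°, so a response there is counted as a confusion while a response at 300° is not.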

rank correlation coefficient after Spearman1 is -0.76 and -0.75 for the two groups.
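For readers who want to reproduce this kind of analysis, a Spearman rank correlation between confusion error and externalization index can be computed as the Pearson correlation of the ranks. The sketch below is a plain-Python version with average ranks for ties; the function names and the example values are hypothetical and are not the data behind the reported coefficients.

```python
import math


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        # Extend the group while the next sorted value is tied.
        while (end + 1 < len(order)
               and values[order[end + 1]] == values[order[pos]]):
            end += 1
        mean_rank = (pos + end) / 2.0 + 1.0
        for k in range(pos, end + 1):
            ranks[order[k]] = mean_rank
        pos = end + 1
    return ranks


def spearman_rho(x, y):
    """Spearman's rho as the Pearson correlation of the rank vectors."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mean_x, mean_y = sum(rx) / n, sum(ry) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical values: confusion error in percent vs. externalization index.
confusion_percent = [10, 20, 30, 40]
externalization = [0.9, 0.7, 0.4, 0.2]
rho = spearman_rho(confusion_percent, externalization)
```

A negative coefficient, as in this toy example, means that conditions with more confusions tend to have lower externalization indices.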

Figure 8: Localization and externalization at 180° for different listening conditions; 95% binomial conf. int.; Spearman's rho; * with p
