ISSC 2014 / CIICT 2014, Limerick, June 26–27

Virtual 5.1 Surround Sound Localization using Head-Tracking Devices

Brian C. O'Toole†, Marcin Gorzel†, Ian J. Kelly†, Liam O'Sullivan†, Gavin Kearney∗ and Frank Boland†

† Dept. of Electronic & Electrical Eng., Trinity College Dublin, Ireland

∗ Dept. of Theatre, Film & Television, University of York, UK

E-mail: † {bcotoole,gorzelm}@tcd.ie



[email protected]

Abstract — We investigated the impact of exploratory head movements on sound localization accuracy using real and virtual 5.1 loudspeaker arrays. Head orientation data in the horizontal plane was provided either by Microsoft Kinect face-tracking or by the Oculus Rift's built-in Inertial Measurement Unit (IMU), which resulted in significantly different precision and accuracy of measurements. In both cases, results suggest improvements in virtual source localization accuracy in the front and rear quadrants.

Keywords — surround sound, binaural, head tracking, spatial audio.

I Introduction

The human auditory system makes use of several methods to determine the location of incident sound waves. This localization is sensitive to all three dimensions and relies on a number of auditory cues, which provide the listener with a highly accurate picture of the surrounding sonic space. A sound source can be perceived with positional accuracy as high as 1° when located in front of the listener [1]. The capture, reproduction and synthesis of these localization cues is termed spatial audio, and it can be delivered to listeners over loudspeakers or headphones. Applications include gaming, education and music. When producing spatial audio with headphones, mismatches can occur between the actual and perceived localization of sound sources; real-time spatial audio systems should therefore incorporate head tracking to counteract these perceptual mismatches. However, many forms of head tracking are available, with differing performance capabilities. This paper sets out to determine the impact of these differences in performance and to establish whether head tracking remains a net positive in all cases. The remainder of this paper is organized as follows: first, a brief description of the factors involved in three-dimensional sound localization is presented. Later sections detail the objective assessment of two head tracking systems, the Oculus Rift and the Microsoft Kinect. Following from this, a subjective experiment is reported that compared the impact these head trackers had on localization accuracy. Lastly, the results of the experiment, their analysis, and the conclusions drawn from them are presented.

a) Localization of Sound Sources

Perception of the spatial location of a sound source is possible mainly due to interaural cues, that is, differences between the sound waves detected at the two ears. The Interaural Level Difference (ILD) is the difference in level of a sound at the two ears, caused by the shadowing effect of the head; ILDs are most noticeable at high frequencies [2]. The Interaural Time Difference (ITD) is the difference in arrival times of a sound wave at the two ears. For a typical head diameter these arrival times can differ by several hundred microseconds. However, the ILD and ITD cues are only useful for localization in the horizontal plane, due to the position of the ears. Further cues are used by the auditory system to locate sources in the vertical plane, primarily spectral filtering by the listener's anatomy, in particular the outer ears, or pinnae. The folds of the pinnae affect certain frequencies depending on the waves' angle of incidence, and the auditory system locates sound sources based on this response. This allows identification of the vertical angle of elevated sources [3], as well as front/back positioning [4][5]. Combined, the interaural cues and spectral filtering form what is termed the Head Related Transfer Function (HRTF). This function describes how a sound signal is altered as it travels from a given location to the listener's ear canal. Thus, filtering a sound signal with a set of HRTFs produces what is known as a binaural recording which, when presented over headphones or loudspeakers with sufficient channel separation, creates the illusion of sound coming from a particular direction.
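As a rough sanity check of the ITD magnitude quoted above (several hundred microseconds), the classic Woodworth spherical-head approximation, an assumption on our part since the paper does not specify a head model, predicts a maximum ITD of roughly 0.65 ms for a head radius of 8.75 cm:

```python
import numpy as np

def woodworth_itd(azimuth_rad, head_radius_m=0.0875, c=343.0):
    """Approximate ITD (seconds) for a spherical head (Woodworth model)."""
    return (head_radius_m / c) * (np.sin(azimuth_rad) + azimuth_rad)

# Maximum ITD occurs for a source directly to one side (90 degrees azimuth):
print(f"{woodworth_itd(np.pi / 2) * 1e6:.0f} microseconds")  # about 656
```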

b) Reproduction of Spatial Audio using Headphones

Real-time and interactive reproduction of spatial audio using this straightforward binaural technique is rather impractical. Rendering multiple or dynamically moving sources would require a large number of HRTF filter sets, which are tedious to measure and prone to error. Moreover, switching between filters can lead to auditory artefacts caused by waveform discontinuities in the binaural signals [6]. This problem can be addressed by the use of "virtual loudspeakers" [7][8][9]. In this approach, sound sources are spatialized as for loudspeaker reproduction; the loudspeaker signals are then filtered with the HRTFs corresponding to their spatial locations and summed together to produce the left and right headphone signals. For example, the left-ear headphone feed is obtained as:

l = \sum_{i=1}^{N} h_{L,i} \ast q_i \qquad (1)

where h_{L,i} is the time-domain representation of the left-ear HRTF corresponding to the i-th loudspeaker location, q_i is the i-th loudspeaker signal feed, and \ast denotes convolution. The process is analogous for the right-ear signal feed. In order to position a source image at an angle between loudspeakers, a panning algorithm must be used. In this paper we use Pairwise Constant Power Panning (PCPP), which applies a sinusoidal relationship to the gain values for source angles between speakers. The gains g_A, g_B for a speaker pair A, B are calculated as follows:

g_A(\theta_i) = \cos\left( \frac{\theta_i - \theta_A}{\theta_B - \theta_A} \cdot \frac{\pi}{2} \right) \qquad (2)

g_B(\theta_i) = \sin\left( \frac{\theta_i - \theta_A}{\theta_B - \theta_A} \cdot \frac{\pi}{2} \right) \qquad (3)

where θ_i is the intended source image angle, and θ_A, θ_B are the angles of the speaker pair enclosing θ_i. The sine-cosine relationship ensures constant unity power throughout the arc spanned by the relevant speaker pair. PCPP can be applied to any loudspeaker configuration, including standard 5.1 (since the ".1" channel does not contribute to spatialization, it is omitted in this work, and the setup is effectively "5.0"). A 5.0 array has its Front Left and Front Right speakers on either side of the Centre channel, at -30° and 30° respectively; the corresponding rear speakers, Left Surround and Right Surround, are positioned at -110° and 110°. As can be seen in Figure 1, only two virtual loudspeakers are ever active at one time, which ensures the best energy concentration in the intended direction of the sound source. This makes the method well suited to mid-to-high-frequency localization through ILDs. Other approaches exist that activate more than two speakers simultaneously, as described e.g. in the work by West [10].
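To make Eqs. (1)-(3) concrete, the following minimal Python sketch computes PCPP gains for the 5.0 layout given above and renders a mono source through virtual loudspeakers. It assumes head-related impulse responses (HRIRs) of equal length are available for each loudspeaker position and that the speaker angles, gains and HRIR lists all use the same speaker ordering; the function names are illustrative and not taken from the paper.

```python
import numpy as np

SPEAKER_ANGLES_DEG = [-110.0, -30.0, 0.0, 30.0, 110.0]  # 5.0 layout: Ls, L, C, R, Rs

def pcpp_gains(theta_src_deg, speaker_angles_deg=SPEAKER_ANGLES_DEG):
    """Pairwise constant-power panning gains (Eqs. 2-3) for one source angle."""
    angles = speaker_angles_deg          # assumed sorted in increasing order
    gains = np.zeros(len(angles))
    for i in range(len(angles)):
        # Candidate pair (A, B), wrapping around through the rear gap.
        a, b = angles[i], angles[(i + 1) % len(angles)]
        span = (b - a) % 360.0
        offset = (theta_src_deg - a) % 360.0
        if offset <= span:               # source lies between A and B
            frac = offset / span
            gains[i] = np.cos(frac * np.pi / 2.0)                      # Eq. (2)
            gains[(i + 1) % len(angles)] = np.sin(frac * np.pi / 2.0)  # Eq. (3)
            break
    return gains                         # one gain per speaker, speaker order

def virtual_loudspeaker_mix(source, theta_src_deg, hrirs_left, hrirs_right):
    """Eq. (1): pan the source, filter each feed with its HRIR, and sum per ear."""
    gains = pcpp_gains(theta_src_deg)
    left = sum(np.convolve(g * source, h) for g, h in zip(gains, hrirs_left))
    right = sum(np.convolve(g * source, h) for g, h in zip(gains, hrirs_right))
    return left, right
```

Only two of the returned gains are ever non-zero, matching the behaviour shown in Figure 1.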


Fig. 1: 5.0 loudspeaker gains for sound sources at different azimuthal angles calculated according to PCPP

Previous work on localization with physical 5.1 speaker arrays suggests asymmetric source imaging accuracy around the speaker circle [11]. This is due to the unequal spacing of the speakers: accuracy favours the quadrant directly in front of the listener, where the speaker density is greatest.

II Binaural Audio with Head-Tracking

Binaural audio created using the HRTFs of a person other than the listener can still sound convincing despite the anatomical mismatch, thanks to the contribution of the ITDs and ILDs. However, localization may become more difficult and front-back reversals are often reported. When localizing with interaural cues alone, a region known as the cone of confusion exists on either side of the head, its axis collinear with the line between the two ears [12]. At any point on the circumference of this cone's base, the ITD and ILD cues return ambiguous results; consequently, localization cannot be performed with the interaural cues alone. This ambiguity is resolved by small head movements, with the resulting shift of the ambiguous sound source revealing its position in the horizontal plane [13].

Due to the unique combination formed by the listener's ears, head and torso, the HRTF is highly individualized and is non-trivial to synthesize or customize. Fortunately, head movement produces a changing frequency response that is used in a comparative process. It has been shown that head movement is particularly beneficial when a listener is exposed to a synthesized source rendered with a generic, non-personalized HRTF in a head-tracked system [14]. In addition, subjects in tests carried out by Begault & Wenzel [14] achieved a relative decrease of about 50% in reversal errors when using head-tracking. A study carried out by Brimijoin et al. [15] concluded that compensating for head movement in headphone presentation produced a significant reduction in "in-head" localization rates, where sources are perceived as lying within the head of the listener.

In summary, compensation with rotation data from the head trackers can be used to stabilize the virtual sound field. By subtracting the rotation of the head about the vertical axis from the intended source angle, the virtual sound field is effectively rotated in the opposite direction in the listener's headphones, giving the impression that the source is stationary. It is still unclear, however, how head-tracking information is used by the auditory system to resolve the location of a sound source, and to what extent the precision and accuracy of head-tracking affect localization performance.
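The compensation described above amounts to subtracting the tracked head yaw from the intended source azimuth before panning. A minimal sketch follows; the function name and sign convention are ours, not the authors'.

```python
def head_compensated_azimuth(theta_src_deg, head_yaw_deg):
    """Rotate the virtual sound field opposite to the head so the source stays world-fixed."""
    theta = (theta_src_deg - head_yaw_deg) % 360.0
    return theta - 360.0 if theta > 180.0 else theta  # map back to (-180, 180]

# Example: a source intended at 15 degrees with the head rotated by -20 degrees
# is rendered at 35 degrees relative to the head.
print(head_compensated_azimuth(15.0, -20.0))  # 35.0
```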

III Objective Evaluation of Head Tracking Methods

Prior to subjective testing, objective analysis was performed on the two head trackers intended for the trial. First, both trackers were left idling for an hour, with a mannequin head acting as the subject, in order to determine their drift. Both performed satisfactorily in this regard, drifting less than 5° over the course of the hour. Having satisfied this initial fidelity requirement, an accuracy assessment was carried out. Three subjects' head movements were recorded for several minutes simultaneously with the MS Kinect and the Oculus Rift. Without any readily available ground truth for the rotation angle, it was only possible to determine the deviation between the values reported by the two trackers. An excerpt from one testing period is shown in Figure 2.


Fig. 2: Simultaneous head tracking data from the MS Kinect and Oculus Rift.

For the three subjects, the mean absolute error between the values recorded by the MS Kinect and the Oculus Rift was 6.5°. While this may initially appear to be quite a low error given the significantly different tracking methods used by the two devices, some undesirable inconsistencies were found in the data upon further inspection:

• Faster rotation of the head resulted in a greater difference in values between the two trackers, as summarized below:

  Avg. Rot. (deg/s)    Mean Abs. Err. (deg)    Std. Dev. (deg)
  0.1                  3.5                     12.2
  3.2                  7.1                     9.2
  20                   15.2                    19.2

• Cross-correlation analysis of sections of slow and fast head movement suggests a varying latency between the trackers (see the sketch after this list).

These disparities in head tracking performance were tested for their effect in the subsequent subjective trials.
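One way to carry out the cross-correlation latency check mentioned in the list above is to take the lag that maximizes the cross-correlation between the two yaw signals. The sketch below assumes both tracker outputs have been resampled to a common, uniform rate; it is an illustration rather than the authors' exact procedure.

```python
import numpy as np

def estimate_lag_seconds(kinect_yaw, rift_yaw, sample_rate_hz):
    """Estimate the delay of `kinect_yaw` relative to `rift_yaw` (positive = Kinect later)."""
    a = np.asarray(kinect_yaw, dtype=float) - np.mean(kinect_yaw)
    b = np.asarray(rift_yaw, dtype=float) - np.mean(rift_yaw)
    xcorr = np.correlate(a, b, mode="full")        # lags from -(len(b)-1) to +(len(a)-1)
    lag_samples = int(np.argmax(xcorr)) - (len(b) - 1)
    return lag_samples / sample_rate_hz
```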

IV Subjective Evaluation: Localization Accuracy with Head Tracking

In order to establish the effectiveness of the head tracking applied in the sound reproduction system, a subjective experimental trial was designed.

a) Variables

Several variables were tested for their effects in the trial. Of particular interest was the use of head-tracking, and specifically a comparison of the impact on localization between a low-latency, accurate tracking device (the Oculus Rift) and the less accurate, image-processing-based MS Kinect face-tracking algorithm. Testing was also carried out without any head tracking. Two stimuli were used in the test: pink noise bursts and a female speech sample, representing unfamiliar and familiar sound sources respectively. Four virtual source locations were used: 15°, -40°, 88° and -150° (with 0° straight ahead of the listener), in order to cover both median and lateral locations as well as different loudspeaker pairs creating the phantom images.

b) Participants

Ten subjects participated in the subjective testing procedure, for which they received financial compensation. Subjects were mostly Music Technology students who had taken part in previous audio trials, but not in one of this kind. Subjects were encouraged to turn their heads as they normally would, within the effective field of view of the MS Kinect (approximately -40° to 40° of azimuthal head rotation).

c) Environment

The trial was carried out in a noise-isolated studio. Participants were seated at the centre of a 5-speaker array with a radius of 135 cm. Test progress was monitored remotely from the studio's control room. The test setup is illustrated in Figure 3.

Fig. 3: Test environment with the 5-loudspeaker array

Measurement of the studio environment produced a spatially-averaged reverberation time of 0.18 s at 1 kHz. The background noise level was recorded at 26 dBA. Both head-tracking systems were in place simultaneously. In order not to interfere with the facial-recognition-based tracking used by the MS Kinect, participants wore the Rift backwards so that their faces remained visible.

d) Interface

Audio playback was controlled by the participants, who then indicated the direction from which they perceived the sound. This was done through a touchscreen interface on a tablet device, by positioning a marker on a circle representing an overhead view of the seated listener. Participants could also indicate that the sound appeared to be "in-head", or could repeat a sample. The interface displayed the current head orientation, or indicated when head tracking was not in operation.

e) Testing Routine

The test was carried out in three stages. First, participants were familiarized with the interface through a short routine of identifying which speakers were playing sounds to them. Subsequently, the first investigative stage was entered; whether the loudspeaker stage or the headphone stage came first was alternated between subjects [16]. Loudspeaker testing closely resembled the training stage, with the aforementioned variables applied to playback over the physical speaker array. Test items were played back in a randomized order. As the angles used in the test were deliberately chosen to lie between pairs of speakers rather than at the speakers themselves, a panning algorithm was used to vary the sound intensity between speakers according to the intended source direction. No head tracking was applied at this stage. In the other testing stage, subjects donned headphones and proceeded through the test in a similar fashion, with head tracking added as a variable. To produce the desired externalization effect, sources were filtered using HRTFs measured at the loudspeaker positions. To limit the variables in the comparative study, these HRTFs were not personalized for each subject, but were recorded from a third party using in-ear microphones in the test environment. These transfer functions therefore also included some minimal reverberation, characteristic of the room used.

V Results

Mean absolute localization error was computed for the four virtual source positions and four modes of playback (loudspeaker playback, headphone playback with no head-tracking, headphone playback with Oculus Rift tracking, and headphone playback with MS Kinect tracking). The analysis was carried out separately for pink noise bursts and female speech sources. The results are presented with 95% confidence intervals in Figures 4 & 5.
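The absolute localization error reported in these figures is, presumably, the smallest angular distance between the indicated and intended directions, averaged over responses. A sketch of one plausible computation follows; the paper does not state the exact error metric or confidence-interval procedure, so the wrap-around handling and the normal-approximation interval are assumptions.

```python
import numpy as np

def abs_angular_error_deg(response_deg, target_deg):
    """Smallest absolute angle between response and target, in [0, 180] degrees."""
    d = np.abs(np.asarray(response_deg, dtype=float)
               - np.asarray(target_deg, dtype=float)) % 360.0
    return np.where(d > 180.0, 360.0 - d, d)

def mean_with_ci95(errors_deg):
    """Mean absolute error and a normal-approximation 95% confidence interval."""
    e = np.asarray(errors_deg, dtype=float)
    mean = e.mean()
    half_width = 1.96 * e.std(ddof=1) / np.sqrt(e.size)
    return mean, (mean - half_width, mean + half_width)
```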


Fig. 4: Absolute localization error for pink noise virtual sources and four different modes of playback (95% Confidence Intervals)




Fig. 5: Absolute localization error for female speech virtual sources and four different modes of playback (95% Confidence Intervals)

Since the study followed a within-subject factorial design, the statistical significance of the results was investigated with a 4 (playback mode) × 2 (stimulus) two-way factorial ANOVA, performed separately for each of the four virtual source positions. No statistically significant differences were found for the virtual source locations -40° and 88° (F(3, 72) = 1.25, p = 0.2996 and F(3, 72) = 0.29, p = 0.8303, respectively). Only for virtual sources at 15° and -150° did loudspeaker playback result in significantly smaller localization errors than the other modes of playback (F(3, 72) = 4.35, p = 0.0071 and F(3, 72) = 6.99, p = 0.0003, respectively). No other effect of stimulus, nor any interaction between stimulus and playback mode, was found. These results suggest that although head tracking in general reduces the localization error, especially for virtual sources localized to the front and to the back, its effect is rather subtle. It must be remembered, however, that non-individualized HRTFs were used in this experiment, which may have resulted in increased localization blur and front-back reversals for some of the subjects. For example, for the pink noise virtual source at 15°, subjects 4 & 6 experienced very high absolute localization errors of 145° and 145.3° respectively, suggesting strong front-back confusion. These errors were mitigated with Oculus Rift head-tracking, which brought them down to 2° and 15.4° respectively. Thus, this study confirms that the main contribution of head trackers is the reduction of front-back reversals and, consequently, of gross localization errors. It must be noted, though, that this does not apply to all subjects, and further studies should investigate this effect using individualized HRTFs.
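For reference, a within-subject factorial ANOVA of this kind can be run, per source position, on a long-format table of per-subject errors; the sketch below uses statsmodels' AnovaRM. The file and column names are illustrative, and this is not the authors' original analysis script.

```python
import pandas as pd
from statsmodels.stats.anova import AnovaRM

# Expected long format: one row per subject x playback mode x stimulus, e.g.
#   subject, playback ('speakers'|'none'|'rift'|'kinect'),
#   stimulus ('noise'|'speech'), abs_error (degrees)
df = pd.read_csv("localization_errors_15deg.csv")  # hypothetical file name

result = AnovaRM(df, depvar="abs_error", subject="subject",
                 within=["playback", "stimulus"]).fit()
print(result)  # F statistics and p-values for each factor and their interaction
```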

Another issue related to the use of generic HRTFs is virtual sound source externalization. Figure 6 shows the percentage of inside-the-head perception for both source types (female speech & pink noise bursts) and three modes of head-tracking (no head-tracking, Oculus Rift & MS Kinect). As can be seen, externalization is poorer for sources appearing behind the listener than for frontal or lateral source directions. Head-tracking does not appear to have a significant impact on the rate of externalized sound sources.

Fig. 6: Percentages of virtual sound sources perceived inside the head for both source types (female speech & pink noise bursts) and 3 modes of head-tracking (no head-tracking, Oculus Rift & MS Kinect)

Also, some subjects (e.g. subjects 2 & 3) experienced significantly higher rates of in-the-head localization than others (Figure 7). This could be attributed to a greater individual HRTF mismatch with respect to the filters used in the test, although further studies would be required to confirm this.


Fig. 7: Percentages of virtual sound sources perceived inside the head for all 10 subjects

Lastly, the way in which subjects utilized the available head-tracking information was also highly individual. Figure 8 contains 13 s excerpts of the head-tracking data acquired for subjects 2 & 3 while performing the headphone part of the listening test. Note that subjects 2 & 3 are those who experienced the highest rates of inside-the-head localization. As can be seen, the head-movement patterns, and therefore the way the head-tracking information was utilized, differ quite significantly between the two subjects. Their high rates of inside-the-head localization can therefore again be attributed to HRTF feature mismatch rather than to the way head movements were used.


Fig. 8: Oculus Rift head-tracking data recorded for subjects 2 & 3 during the listening test

VI Conclusions

In this study we investigated the influence of head-tracking devices (MS Kinect and Oculus Rift) on virtual source localization in real and virtual loudspeaker arrays. The study showed that loudspeaker localization was better than headphone localization in the front and rear quadrants; in the lateral quadrants, performance was similar for all modes of playback. With respect to the front and rear quadrants, headphone localization with no head-tracking gave the worst results. Improvements can be observed when head-tracking devices are employed; both devices reduce front-back reversal rates to a similar extent, and the localization error diminishes accordingly. Although the improvement can be as high as 50%, for the current sample size (10 trained subjects) it is not statistically significant. Further work should investigate the effect of head-tracking using individualized HRTFs and a larger number of participants.

References

[1] J. Blauert. "Spatial Hearing: The Psychophysics of Human Sound Localization". The MIT Press, 1983.
[2] T. Francart. "Perception of across-frequency interaural level differences". J. Acoust. Soc. Am., 122(5):2826-2831, 2007.
[3] J. Blauert. "Sound localization in the median plane". Acustica, 22:205-213, 1969.
[4] J. Hebrank, D. Wright. "Spectral cues used in the localization of sound sources on the median plane". J. Acoust. Soc. Am., 56:1829-1834, 1974.
[5] R. H. Y. So, B. Ngan, A. Horner, J. Braasch, J. Blauert, K. L. Leung. "Toward orthogonal non-individualised head-related transfer functions for forward and backward directional sound: cluster analysis and an experimental study". Ergonomics, 53(6):767-781, 2010.

[6] M. Otani, T. Hirahara. "Auditory Artifacts due to Switching Head-Related Transfer Functions of a Dynamic Virtual Auditory Display". IEICE Transactions on Fundamentals of Electronics, Communications and Computer Sciences, E91-A(6):1320-1328, June 2008.
[7] A. McKeag, D. McGrath. "Sound Field Format to Binaural Decoder with Head-Tracking". In Proc. of the AES 6th Australian Regional Convention, Melbourne, Australia, 1996.
[8] M. Noisternig. "A 3D Ambisonic Based Binaural Sound Reproduction System". In Proc. of the AES 24th International Conference, 2003.
[9] B.-I. Dalenbäck, M. Strömberg. "Real time walkthrough auralization - the first year". In Proc. of the Institute of Acoustics, 28(2), 2006.
[10] J. R. West. "Five-channel panning laws: an analytic and experimental comparison". Master's Thesis, Chapter 3: IID-based Panning Methods, University of Miami, 1998.
[11] B. Wiggins. "The Generation of Panning Laws for Irregular Speaker Arrays Using Heuristic Methods". In Proc. of the AES 31st International Conference, 2007.
[12] A. W. Mills. "Auditory localization". In J. Tobias (ed.), Foundations of Modern Auditory Theory, vol. 2, 301-345, 1972.
[13] S. Perrett, W. Noble. "The contribution of head motion cues to localization of low-pass noise". Percept. Psychophys., 59(7):1018-1026, 1997.
[14] D. R. Begault, E. M. Wenzel. "Direct comparison of the impact of head tracking, reverberation, and individualized head-related transfer functions on the spatial perception of a virtual speech source". J. Audio Eng. Soc., 49(10):904-916, 2001.
[15] W. O. Brimijoin. "The Contribution of Head Movement to the Externalization and Internalization of Sounds". PLoS ONE, 8(12):e83068, 2013.
[16] C. J. Ziemer, J. M. Plumert, J. F. Cremer, J. K. Kearney. "Estimating distance in real and virtual environments: Does order make a difference?". Atten. Percept. Psychophys., 71(5):1095-1106, July 2009.