CHAPTER 4

GENERATION AND VALIDATION OF VIRTUAL AUDITORY SPACE

Danièle Pralong and Simon Carlile

Virtual Auditory Space: Generation and Applications, edited by Simon Carlile. © 1996 R.G. Landes Company.

1. INTRODUCTION

The aim in generating virtual auditory space (VAS) is to create the illusion of a natural free-field sound using a closed-field sound system (see Fig. 1.1, chapter 1) or, in another free-field environment, using a loudspeaker system. This technique relies on the general assumption that identical stimuli at a listener's eardrum will be perceived identically, independent of their physical mode of delivery. However, it is important to note that the context of an auditory stimulus also plays a role in the percept generated in the listener (chapter 1, section 2.1.2) and that both auditory and nonauditory factors contribute to this.

Stereophony is generated by recording sounds on two separate channels and reproducing these sounds using two loudspeakers placed some distance apart in front of a listener. If there are differences in the program on each channel, the listening experience is enhanced by the generation of a spatial quality to the sound. This could be considered a relatively simple method for simulating auditory space. The easiest way to simulate a particular auditory space is to extend this technique to the use of multiple loudspeakers placed in a room at the locations from which the simulated sound sources are to originate. The usefulness of this technique is limited by the room's size and acoustics, by the number of speakers to be used, and by the restricted area of space within which the simulation is valid for the listener (for review see Gierlich1).

It is now generally accepted that the simulation of acoustical space is best achieved using closed-field systems, since headphones allow


complete control over the signal delivered to the listener’s ears independently of a given room size or acoustical properties. The disadvantage of this technique is that it requires compensation of the transfer function of the sound delivery system itself, that is, the headphones. As we will see in sections 4 and 6 of this chapter, this problem is not trivial, for both acoustical and practical reasons, and is a potential source of variation in the fidelity of VAS.

1.1. GENERATION OF VAS USING A CLOSED-FIELD SYSTEM

One technique for the simulation of auditory space using closed-field systems involves binaural recording of free-field sound using microphones placed in the ears of a mannequin2 or the ear canals of a human subject.3 These recordings, when played back over headphones, result in a compelling recreation of the three-dimensional aspects of the original sound field, provided there is proper compensation for the headphones' transfer characteristics.2-4 Such a stimulus should contain all the binaural and monaural localization cues available to the original listener, with the exception of a dynamic link to head motion. However, this technique is rather inflexible in that it does not permit the simulation of arbitrary signals at any spatial location. Still, the binaural technique remains extremely popular in architectural acoustics and music recording environments, using artificial heads such as the KEMAR by Knowles Electronics, Inc.,5 or the one developed by Gierlich and Genuit6 (for recent reviews see Blauert4 and Møller7).a

The principles underlying the generation of VAS are closely related to the binaural recording technique. The fact that binaural recordings generate a compelling illusion of auditory space when played back over headphones validates the basic approach of simulating auditory space using a closed-field sound system. The more general approach to generating VAS is based on simple linear filtering principles (see chapter 3, section 5.1; also Wightman and Kistler9). Rather than recording program material filtered by the outer ear, the head-related transfer functions (HRTFs) for a particular set of ears are measured for numerous positions in space and used as filters through which any stimulus to be spatialized is passed. The following set of equations, adapted from Wightman and Kistler,9 describes in the frequency domain how the appropriate filter can be generated. The equations apply to one ear only; a pair of filters, one for each ear, must be constructed for each position to be simulated.

a Hammershøi and Sandvad8 recently generalized the term "binaural technique" by introducing the expression "binaural auralization," which is equivalent to the VAS technique described below.


Let X1 be a signal delivered by a free-field system, the spatial properties of which have to be reproduced under closed-field conditions. The signal picked up at the subject's eardrum in the free-field, or Y1, can be described by the following combination of filters:

Y1 = X1 L F M,   (1)

where L represents the loudspeaker transfer function, F the free-field to eardrum transfer function, i.e., the HRTF for a given spatial location, and M the recording system transfer function. Similarly, if a signal X2 is driving a closed-field system, the signal picked up at the eardrum, or Y2, can be described as:

Y2 = X2 H M,   (2)

where H represents the headphone-to-eardrum transfer function, which includes both the headphone's transducing properties and the transfer function from the headphone's output to the eardrum. As the aim is to replicate Y1 at the eardrum with the closed-field system, one has:

Y2 = Y1.   (3)

Then:

X2 H M = X1 L F M.   (4)

Solving for X2:

X2 = X1 L F / H.   (5)

Therefore, if the signal X1 is passed through the filter L F/H and played by the headphones, the transfer function of the headphones is canceled and the same signal will be produced at the eardrum as in free-field conditions. It can be seen from Equation 5 that with this approach the loudspeaker transfer function L is not eliminated from the filter used in the closed-field delivery. L can be obtained by first calibrating the loudspeaker against a microphone with flat transducing properties. Alternatively, L can be more easily eliminated if the closed-field delivery system employed gives a flat signal at the eardrum, like the in-ear tube phones made by Etymotic Research. In this case the filter used in closed-field is F only, which can be obtained by correcting for the loudspeaker and the microphone transfer functions (L M); this is illustrated in Figure 4.1. VAS localization results obtained using this latter method are reported in section 5.2 of this chapter.
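In practice, Equation 5 amounts to a spectral division that can be computed directly with FFTs. The following is a minimal sketch for the simplified case where L has already been removed, so that the filter reduces to F/H (as with the flat-response tube phones mentioned above); the function name, FFT length and regularization constant are our own choices, not the chapter's:

```python
import numpy as np

def vas_filter(f_imp, h_imp, n_fft=1024, eps=1e-6):
    """Equation 5 with L removed: build the closed-field filter F/H from
    the impulse responses of the HRTF (F) and the headphone-to-eardrum
    path (H). The division is regularized so that near-zero values of H
    do not blow up the filter."""
    F = np.fft.rfft(f_imp, n_fft)
    H = np.fft.rfft(h_imp, n_fft)
    G = F * np.conj(H) / (np.abs(H) ** 2 + eps)
    return np.fft.irfft(G, n_fft)

# x2 = np.convolve(x1, vas_filter(f_imp, h_imp))  # one ear; repeat for the other
```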

1.2. ACOUSTICAL AND PSYCHOPHYSICAL VALIDATIONS OF VAS

As the perceptual significance of the fine structure of the HRTF is still poorly understood, the transfer functions recorded for VAS simulations should specify, as far as possible, the original, nondistorted signal


Fig. 4.1. Dotted line: transducing system transfer function (loudspeaker, probe tube and microphone, L M; see equation (1), section 1.1). Broken line: transfer function recorded in a subject’s ear canal with the above system. Solid line: subject’s free-field-to-eardrum transfer function after the transducing system’s transfer function is removed from the overall transfer function (F). The measurements were made at an azimuth and an elevation of 0°.

input to the eardrum (however, see chapter 6, section 2.2). This is particularly important if VAS is to be applied in fundamental physiological and psychophysical investigations. This section is concerned with introducing and clearly differentiating the experimental levels at which the procedures involved in the generation of VAS can be verified.

The identification of the correct transfer functions represents the first challenge in the generation of high fidelity VAS. Firstly, the measured HRTFs will depend on the point in the outer ear where the recordings are taken (see section 2.1). Secondly, recordings are complicated by the fact that the measuring system itself might interfere with the measured transfer function. Section 2.1.2 in the present chapter deals with the best choice of the point for measurement of the HRTF. The question of the perturbation produced by the recording system itself is reminiscent of Heisenberg's uncertainty principle in particle physics. We have examined this previously using a mannequin head fitted with a model pinna and a simplified ear canal.11 As schematized in Figure 4.2A, the termination of the model canal was


fitted with a probe microphone facing out into the ear canal (internal recording system). Transfer functions could also be recorded close to the termination of the canal by an external probe microphone held in place by a small metal insert seated at the entrance of the ear (external recording system); this system is described in more detail in sections 2.1.1 and 2.1.3. Figure 4.2B shows that the spatial location transfer functions computed from the two different microphone outputs were virtually identical. Thus, recordings of transfer functions by the internal microphone, first in the absence and then in the presence of the external recording system, provide a measure of the effects introduced by the latter on the recorded transfer function. Figure 4.3A shows that the main changes in the HRTFs produced by the recording system were direction-independent small attenuations at 3.5 kHz (-1.5 dB) and at 12.5 kHz (-2 dB). The low standard error of the mean of the amplitude data pooled from three experiments (Fig. 4.3B) indicates that these effects were highly reproducible. Effects on phase were between -0.02 and 0.05 radians for frequencies below 2.5 kHz. The likely perceptual significance of the perturbations introduced by our recording system was also estimated by comparing the HRTFs recorded with and without the external recording system after they had been passed through a cochlear filter model12 (see chapter 2, section 2.5.1). In this case the mean attenuation produced by the recording system reached a maximum of 1.4 dB over the frequency range of interest.11 It was concluded that the perturbation produced by the external recording system was unlikely to be perceptually significant.

Fig. 4.2. (A) Schematic drawing of the external recording system used to measure the HRTFs, placed in a model ear, showing the internal microphone probe 2 mm from the eardrum. (B) Transfer functions recorded in the model ear for a stimulus located at azimuth -90° and elevation 0°: recordings from the internal microphone (solid line) and the external recording system (broken line) (dB re free-field stimulus level in the absence of the dummy head). Adapted with permission from Pralong D et al, J Acoust Soc Am 1994; 95:3435-3444.

Fig. 4.3. Mean of the differences in pressure level between measurements obtained in the presence and in the absence of the external recording system across 343 speaker positions, as a function of frequency. Shown are the average of the mean for three experiments (A) and the standard error of the mean (B), incorporating a total of 1029 recordings. Reprinted with permission from Pralong D et al, J Acoust Soc Am 1994; 95:3435-3444.

Hellstrom and Axelsson13 also described the influence of the presence of a probe tube in a simple model of the ear canal, consisting of a 30 mm long tube fitted at the end with a 1/4-inch B&K microphone. The effect was greatest (-1 dB attenuation) at 3.15 kHz, the resonance maximum. However, this measure takes no account of the remainder of the external ear or of the possible direction-dependent effects of their recording system, and the system described was not compatible with the use of headphones.

The second challenge in the generation of VAS is to ensure that the digital procedures involved in the computation of the filters do not cause artifacts. Wightman and Kistler9 have presented an acoustical verification procedure for the reconstruction of HRTFs using headphones. This involves a comparison of the transfer functions recorded for free-field sources with those reconstructed and delivered by headphones. The closed-field duplicated stimuli were within a few dB of amplitude and a few degrees of phase of the originals, and the errors were not dependent on the location of the sound source. This approach validates the computational procedures involved in the simulation, as well as the overall stability of the system, but provides no information about the possible perturbing effects of the recording device. Therefore, acoustical verifications of both the recording procedure (as described above) and the computational procedures are necessary and complementary.

Finally, a psychophysical verification, showing that the synthesized stimuli are perceptually equivalent to the percepts generated by free-field stimuli, constitutes the ultimate and necessary demonstration of the fidelity of VAS. In this type of experiment, the ability of a group of listeners to localize sound sources simulated in VAS is rigorously compared to their ability to localize the same sources in the free-field. To be able to demonstrate any subtle differences in the percepts generated, the localization test must be one which truly challenges sound localization abilities (see chapter 1, section 1.4.2). Wightman and Kistler14 have presented a careful psychoacoustical validation of their HRTF recording and simulation technique. These results are reviewed in section 5.2 together with results obtained recently in our laboratory.
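As a rough illustration of this kind of perceptual weighting, the sketch below smooths a transfer function on a frequency scale that tracks the equivalent rectangular bandwidth (ERB) of the auditory filters, using the standard Glasberg and Moore approximation. This is only a crude stand-in for the cochlear filter model of ref. 12, with function names of our choosing:

```python
import numpy as np

def erb_hz(f_hz):
    """Glasberg & Moore approximation to auditory filter bandwidth."""
    return 24.7 * (4.37 * f_hz / 1000.0 + 1.0)

def cochlear_smooth(freqs_hz, mag_db):
    """Gaussian-smooth a transfer function with a frequency-dependent
    width, a crude proxy for passing it through a cochlear filter bank
    before comparing two HRTF measurements."""
    out = np.empty_like(mag_db)
    for i, fc in enumerate(freqs_hz):
        w = np.exp(-0.5 * ((freqs_hz - fc) / erb_hz(fc)) ** 2)
        out[i] = np.sum(w * mag_db) / np.sum(w)
    return out
```

Differences between two HRTFs smoothed in this way give a first-order estimate of whether a measured perturbation survives the ear's own spectral analysis.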


2. RECORDING THE HRTFS

2.1. ACOUSTICAL ISSUES

The description of an acoustical stimulus at the eardrum, and hence the recording of HRTFs, is complicated by the physics of sound transmission through the ear canal. In order to capture the transfer function of the outer ear, the recording point should be as close as possible to the eardrum. However, this is complicated by the fact that in the vicinity of the eardrum the sound field is complex. Over the central portion of the ear canal, wave motion is basically planar over the frequency range of interest (<18 kHz;15 see also Rabbitt and Friedrich16). Because of the finite impedance of the middle ear, the eardrum also acts as a reflective surface, particularly at high frequencies. The sound reflected at the eardrum interacts with the incoming sound and results in standing waves within the canal.17,18 Closer to its termination at the eardrum, the morphology of the canal can vary in a way which makes the prediction of the pressure distribution in the canal for frequencies above 8 kHz more complex.19 Furthermore, close to the eardrum the distribution of pressure is further complicated by the fact that eardrum motion is itself coupled to the sound field.20

The locations of the pressure peaks and nulls resulting from the longitudinal standing waves along the canal are dependent on frequency. They are manifest principally as a notch in the transfer function from the free-field to a point along the canal, a notch that moves up in frequency as the recording location approaches the eardrum.17,21 This is illustrated in Figure 4.4. To a first approximation, the frequency of the notch can be related to the distance of the recording point from the eardrum using a 1/4 wavelength approximation.21 From these considerations it is clear that a microphone placed within the auditory canal records not only the signal of interest, namely the longitudinal mode of the incoming wave, but also a number of other components which, for the purposes of recording the HRTF, are epiphenomenal. A final point worth mentioning is that the input impedance of the measuring system should be sufficiently high to avoid perturbing the acoustical environment sampled.16,22

The measurement of HRTFs for VAS is further constrained by the requirement that the recording system employed must also be capable of measuring the headphone-to-eardrum transfer function (see section 4). HRTFs can be obtained either from anthropometric model heads fitted with internal microphones or from human subjects. An alternative approach is to use pinnae cast from real human subjects and placed on an acoustical mannequin. Acoustical and psychophysical work is currently being undertaken in our laboratory to test whether this type of recording is equivalent to the HRTFs directly measured at the subject's


Fig. 4.4. The frequency response spectrum of the replica of a human ear canal with the recording probe at depths of 4, 11, 16 and 17 mm into the canal. Reprinted with permission from Chan JCK et al, J Acoust Soc Am 1990; 87:1237-1247.

eardrum. There are a number of implementation issues in the recording of HRTFs from real human ears, and these are discussed in the following sections.

2.1.1. Microphone type

A general problem with most microphones is that, due to their size, they will perturb the sound field over the wavelengths of interest. The choice of microphones suited to HRTF measurement is therefore limited by having to satisfy criteria both of small dimensions and of sufficient sensitivity and frequency response to provide a good signal-to-noise ratio over the frequency range of interest. Chapter 2 (section 1.4.1) gives a list of studies where direction-dependent changes in spectrum at the eardrum have been measured. Recordings were obtained using either probe microphones or miniature microphones small enough to be placed in the ear canal. Both types have a relatively low sensitivity (10-50 mV/Pa), and probe microphones also suffer the further disadvantage of a nonflat frequency response due to the resonance of the probe tube. Furthermore, because of their relative size, a miniature microphone introduced in the


ear canal will also perturb the sound field.23 Therefore HRTFs recorded in this way are not readily suitable for the generation of VAS.b

Although the microphone's own transfer function can be compensated for at various stages of the HRTF recording or VAS generation procedures (cf. sections 1.1 and 2.3.2), a frequency response as flat as possible should be preferred. The HRTFs are characterized by a relatively wide dynamic range, and if the frequency response varies considerably, the spectral notches in the recorded HRTFs could disappear into the noise floor where they coincide with a frequency range in which the system's sensitivity is poor. The microphone should also be sufficiently sensitive to record the HRTFs using test stimuli at sound pressure levels which will not trigger the stapedial reflex. The stapedial reflex results in a stiffening of the ossicular chain which will cause a frequency-dependent variation in the input impedance to the middle ear (see section 2.3.3). Ideally, the microphone should also have a bandpass covering the human hearing frequency range, or at least the range within which the cues to a sound's location are found (see chapter 2).

Two main types of microphones have been employed for recording HRTFs in humans: condenser probe microphones (e.g., Shaw and Teranishi;24 Shaw;25 Butler and Belendiuk;3 Blauert;4 Møller et al26) and small electret microphones (Knowles EA-1934;23 Etymotic;9 Knowles EA-1954;27 Sennheiser KE 4-211-2 28). Metal probe tubes are inadequate due to the curved geometry of the ear canal and are therefore fitted with a plastic extension (e.g., Møller et al28). Alternatively, a plastic probe can be fitted directly to an electret microphone.9,11 Probes have been chosen with a plastic soft enough to bend along the ear canal, but hard enough not to buckle when introduced into the canal.21 Furthermore, a compromise has to be found between the overall diameter of the probe and its potential interaction with the sound field in the ear canal. In our previous study11 the probe wall was thick enough that transmission across it was negligible compared to the level of the signal picked up by the probe's tip (50 dB down on the signal transmitted through the tip from 0.2 to 16 kHz), while the internal diameter was still large enough not to significantly impair the sensitivity of the recording system.

b Møller7 has described how the disturbance should affect the final transfer function; it is argued that the error introduced in the HRTF will cancel if headphones of an "open" type are used for the generation of VAS, i.e., headphones which do not disturb the sound waves coming out of the ear canal, so that the ear canal acoustical conditions will be identical to those in the free-field.


2.1.2. Point of measurement: occluded canal recordings versus deep canal recordings

The location within the outer ear at which the HRTFs are measured is an important acoustical factor. Deciding on a specific recording location is also a prerequisite for reproducible placement of probes at similar acoustical positions for different subjects and for the left and right ears of a given subject. Several studies have demonstrated that in the frequency range where only one mode of wave propagation is present (the longitudinal mode), sound transmission along the ear canal is independent of the direction of the incoming sound source.23,24,29,30 Consequently, the HRTFs measured in the ear canal can be accounted for by a directionally dependent component and a directionally independent component. The directional components are due to the distal structures of the auditory periphery such as the pinna, concha and head; the directionally independent component is due principally to the proximal component of the outer ear, the ear canal. From this point of view, measurements made at any point along the ear canal will contain all the directional information provided by the outer ear and should represent an accurate description of the directional components of the HRTFs. Thus, in theory, HRTFs measured at any point along the ear canal could be used for the generation of VAS, but only if the appropriate correction is made for the direction-independent component so that the absolute sound pressure level at the eardrum is replicated.7

HRTFs described in the literature have been measured at three main positions along the ear canal: deep in the canal (i.e., close to the eardrum), in the middle of the canal, and at the entrance of the canal. Transfer functions have also been measured at the entrance of an occluded ear canal. For the latter technique, the microphone is embedded in an earplug introduced into the canal (see for example refs. 7, 28, 31, 32). The plug's soft material expands and completely fills the outer end of the ear canal, leaving the microphone sitting firmly and flush with the entrance of the ear canal. This method offers the advantage of eliminating the contribution of the outgoing sound reflected at the eardrum.

Occluded ear canal measurements represent an obviously attractive option, as the method is relatively noninvasive and potentially less uncomfortable for the subjects than deep canal measurements (see below). However, there are a number of potentially complicating issues that need to be considered. Firstly, the main question with this technique is the definition of the point in the ear canal at which sound transmission ceases to be direction-dependent. Hammershøi et al (1991; quoted in ref. 7) have shown that one-dimensional sound transmission starts at a point located a few mm outside the ear canal. This contradicts previous results by Shaw and Teranishi24 and Mehrgardt and Mellert30 demonstrating direction-dependent transmission


for locations external to a point 2 mm inside the ear canal (see however Middlebrooks and Green33). Secondly, the description of the canal contribution to the HRTFs as being nondirectional does not necessarily mean that blocking the canal will only affect the nondirectional component of the response. Shaw32 has argued that canal block increases the excitation of various conchal modes. This results simply from the increase in acoustical energy reflected back into the concha as a result of the high terminating impedance of the ear canal entrance. Finally, the HRTFs measured using the occluded canal technique will only be a faithful representation of the HRTFs to the eardrum if the transmission characteristics of the canal itself are accurately accounted for. Occluded canal recordings have been carefully modeled by Møller and coauthors.7,28 However, what is still lacking is an acoustical demonstration that these types of measurements result in a replication of the HRTF at the eardrum of human subjects. Preliminary reports of the psychophysical validation of VAS generated using this method do not seem as robust as those obtained using deep canal recordings8 (see also section 5.2). It may be, however, that refinement of this technique, particularly with respect to precise recording location, will result in a convenient and relatively noninvasive method for obtaining high fidelity HRTFs.

An alternative position for recording the HRTFs is deep within the canal, in the vicinity of the eardrum. Although deep ear canal measurements represent a more delicate approach to the recording of HRTFs, they have to date provided the most robust and accurate VAS14 (also see section 5.2). In this case, the measuring probe should be placed deep enough in the ear canal that the recorded HRTFs are not affected by the longitudinal standing waves over the frequencies of interest. This will ensure that measurements in the frequency range of interest do not suffer a poor signal-to-noise ratio as a result of the pressure nulls within the canal. On the other hand, the probe should be distant enough from the termination of the canal to avoid the immediate vicinity of the eardrum, where the pressure distribution becomes complex at high frequencies. Wightman and Kistler9 have reported that their recording probes were located 1-2 mm from the eardrum. We have chosen to place our probe microphones at a distance of 6 mm from the effective reflecting surface of the eardrum so that the 1/4 wavelength notch due to the canal standing waves occurs above 14 kHz27 (see below). Obviously, this position represents a compromise between the upper limit of the frequency range of interest and the subjects' comfort and safety. Such transfer functions will provide an excellent estimate of the eardrum pressure for frequencies up to 6 kHz.21 However, as the frequency is increased from 6 kHz to 14 kHz, the pressures measured by the probe will represent a progressively greater underestimate of the effective eardrum pressure. As the auditory canal


is not directionally selective, this spectral tilt is unaffected by the location of the sound in space. Additionally, placing the probe at the same acoustical position when measuring the headphone transfer function will ensure that this effect is canceled when the HRTFs are appropriately convolved with the inverse of the headphone transfer function (this will apply if headphones of an "open" type are employed; see section 2.1.1).

2.1.3. Microphone probe holders

Microphone probes have to be held in place in the ear canal by a system satisfying three criteria: it must allow safe and precise control of the probe's position in the ear canal, it must be of minimum dimensions in order to avoid occluding the ear canal or causing any disturbance of the sound field in the frequency range of interest, and it must be suitable for measuring the headphone transfer function. Wightman and Kistler9 introduced the use of custom-made, thin bored-out shells seated at the entrance of the ear canal and fitted with a guide tube. We have used an approach which is largely based on this method.27 While keeping the idea of a customized holder, the original positive shape of the distal portion of the ear canal and proximal part of the concha is recovered by first molding the shape of the external ear and then plating the surface of the mold. The thickness of the holder can be reduced to a minimum (less than 0.25 mm) using metal electroplating. The metal shell is slightly oversized compared to the original ear canal and concha and forms an interference fit when pressed into the canal. Figure 4.5 shows a photograph of the metal insert, probe tube and microphone assembly in place in a subject's ear. The acoustical effects of this recording system (including probe and microphone) on the measured HRTFs for a large number of spatial locations of a sound source were determined using a model head fitted with a probe microphone11 (see section 1.2). The results demonstrate that this system does not significantly perturb the sound field over the frequency range of interest (0.2 to 14 kHz).

Fig. 4.5. Photograph of a metal insert, probe tube and microphone assembly used for head-related transfer function recordings, in place in a subject's ear. Reprinted with permission from Pralong D et al, J Acoust Soc Am 1994; 95:3435-3444.

2.1.4. Methods for locating the probe within the ear canal

For deep canal recordings considerable precautions have to be taken to avoid the possibility of damaging the subject's eardrum. Also, the tissue surrounding the ear canal becomes increasingly sensitive with canal depth and should be avoided as well. It is thus crucial that the probe be introduced to its final location under careful monitoring and by well-trained personnel. The subject's ears should also be free of accumulated cerumen, which could occlude the probe or be pushed further in and touch the eardrum. The use of probe tubes is also complicated by the fact that, due to the slightly curved geometry of the ear canal, in some subjects the probe ceases to be visible a few mm after it has entered the meatus.

Two different methods have been employed for controlling the placement and the final position of in-ear probes: "touch" and acoustical methods. An example of the former is best described in Lawton and Stinson,34 where a recording probe was placed at a distance of 2 mm from the eardrum in order to measure acoustical energy reflectance. The probe tube was fitted with a tuft of soft nylon fibers extending 2 mm beyond the probe tip. The fibers touching the eardrum would result in a bumping or scraping noise detected by the subject. The fibers' deflection was also monitored under an operating microscope (see also Khanna and Stinson17). Wightman and Kistler9 employed a slightly modified version of this method, where a human hair was introduced into the ear canal until the subject reported that it had touched the eardrum. The hair was marked and then measured to determine the precise length for the probe. The probe's final placement was also monitored under an operating microscope (Wightman and Kistler, personal communication).


Acoustical methods represent an alternative to the approaches presented above. Chan and Geisler21 have described a technique for measuring the length of human ear canals, which we have applied to the placement of a probe for the measurement of HRTFs.27 Chan and Geisler showed that the position of a probe tube relative to the eardrum can be determined using the frequency of the standing wave notches and a 1/4 wavelength approximation. In our recording procedure the probe tube is first introduced so that it protrudes into the auditory canal about 10 mm from the distal end. A transfer function is then measured for a sound source located along the ipsilateral interaural axis, where the HRTF is fairly flat and the position of the standing wave notch is best detected. The frequency of the spectral notch is then used to calculate the distance from the probe to the reflecting surface. The frequency of the standing wave notch is monitored until the probe reaches the position of 6 mm from the eardrum, which leaves the notch just above 14 kHz. Both placement methods allow control of the position of the probe with a precision of about 1 mm.

2.1.5. Preservation of phase information

Precise control over the phase of the recorded signals is crucial for the generation of high fidelity VAS. In the case where a recording of absolute HRTFs is desired (i.e., a true free-field-to-eardrum transfer function devoid of the loudspeaker characteristics), the recording system (probe and microphone) has to be calibrated against the loudspeaker in the free-field (see above). It is essential that the two microphones or microphone probes be placed at precisely the same location with respect to the calibration speaker for this measurement, a location which also matches the center of the subject's head during recordings. Indeed, any mismatch will superimpose an interaural phase difference on the signals convolved for VAS. This will create a conflict between time, level and spectral cues and therefore reduce the fidelity of VAS. For instance, a difference of as little as 10 mm in position would result in a time difference of 29 µs, corresponding to a shift of approximately 2° towards the leading ear.
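Both the acoustic placement method and the consequence of a calibration mismatch reduce to simple relations involving the speed of sound. A minimal sketch restating the text's numbers (6 mm, 14 kHz, 10 mm, 29 µs); the function names are our own:

```python
SPEED_OF_SOUND = 343.0  # m/s in air at roughly room temperature

def probe_to_eardrum_distance(notch_hz: float) -> float:
    """1/4 wavelength approximation: distance (m) of the probe tip from
    the reflecting surface, given the standing-wave notch frequency."""
    return SPEED_OF_SOUND / (4.0 * notch_hz)

def placement_itd_error(mismatch_m: float) -> float:
    """Interaural time difference (s) superimposed on the HRTFs by a
    mismatch in microphone position during free-field calibration."""
    return mismatch_m / SPEED_OF_SOUND

# A notch just above 14 kHz puts the probe ~6 mm from the eardrum:
print(probe_to_eardrum_distance(14_000))   # ~0.0061 m
# A 10 mm calibration mismatch superimposes ~29 µs of spurious ITD:
print(placement_itd_error(0.010))          # ~2.9e-05 s
```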

2.2. DIGITIZATION ISSUES

There are a number of important issues to be considered when selecting the digitization parameters for measuring the HRTFs, recording the program material and generating virtual auditory space. The background to these issues has been considered in some detail in chapter 3 and is only briefly considered here. The frequency bandwidth of the signal and its temporal resolution are determined by the digitization rate. However, there is an interesting anomaly in auditory system processing that makes the selection of the digitization rate a less than straightforward matter. Although the precise upper limit of human hearing is


dependent on age, for most practical purposes it can be considered to be around 16 kHz. In this case it could be argued that, assuming very steep anti-aliasing filters, all the information necessary for the reconstruction of the signal should be captured with a digitization rate of 32 kHz. However, we also know that the auditory system is capable of detecting interaural time differences of as little as 6 µs (chapter 2, section 2.2.1). To preserve this level of temporal resolution, digitization rates would need to be as high as 167 kHz, some 5-fold higher than that required to preserve the frequency bandwidth.

The digitization rate chosen has a serious impact on both the computational overheads and the mass storage requirements. The time domain description of the HRTF (its impulse response) is of the order of a few milliseconds, and the length of the filter used to filter the program material will be greater at higher digitization rates. The longer the filter, the greater the computational overheads. Most of the current convolution engines and AD/DA converters commonly used are capable of this kind of throughput at the higher digitization rates and so can be used for generating VAS in real time. However, in practice this is only really possible where the length of the filter is less than 150 taps or so (see chapter 3, sections 5.3 and 6).

In many implementations of VAS, the program material is continuously filtered in real time. In the real world, when we move our head, the relative directions between the ears and the various sound sources also vary. To achieve the same effect in VAS, the location of the head is monitored (chapter 6) and the characteristics of the filters are varied rapidly. With current technology, if VAS is to be generated in real time and be responsive to changes in head orientation, this is likely to require a processor for each channel and, assuming no preprocessing of the digital filters and the higher digitization rates, would be limited to generating only one target in virtual space.

The obvious alternative to this rather limited and clumsy implementation of VAS is to reduce the digitization rate, thereby reducing the computational overhead. However, there are as yet no systematic studies examining digitization rate and the fidelity of VAS. This is somewhat surprising, particularly given the apparent disparity between frequency and time domain sensitivities in the auditory system. Selecting a digitization rate which simply satisfies the frequency bandwidth criterion will limit the temporal resolution to steps of 31 µs. Whether this reduction in temporal resolution leads to any degradation in the fidelity of VAS is unknown. On the other hand, selecting the digitization rate to satisfy the time domain requirements would result in a frequency bandwidth of around 80 kHz. Naturally, the corner frequencies of the anti-aliasing and reconstruction filters would also need to be set accordingly so that the time domain fidelity of the output signal is not compromised (chapter 3, section 3.2).
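The arithmetic behind these figures is simple enough to state in code; the inputs (16 kHz bandwidth, 6 µs ITD resolution) are the ones used in the text:

```python
def nyquist_rate(bandwidth_hz: float) -> float:
    """Minimum sampling rate that preserves a given bandwidth."""
    return 2.0 * bandwidth_hz

def rate_for_time_step(step_s: float) -> float:
    """Sampling rate whose sample period equals the desired temporal step."""
    return 1.0 / step_s

print(nyquist_rate(16_000))      # 32000 Hz: satisfies the bandwidth criterion
print(rate_for_time_step(6e-6))  # ~166667 Hz: preserves 6 µs ITD steps
print(1.0 / 32_000)              # ~3.1e-05 s: temporal step at 32 kHz
```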


The choice of the dynamic range of the DA/AD converters to be used in such a system is determined by both the dynamic range of the HRTFs and the dynamic range of the program material to be presented in VAS. While HRTFs vary from individual to individual, the maximum dynamic range across the frequency range of interest is unlikely to exceed 50 dB. This is determined by the highest gain component in the low to mid frequency range for locations in the ipsilateral anterior space and the deepest notches in the transfer functions in the mid to high frequency range (chapter 1, section 1.5).

The gain of the program material can be controlled either by varying the gain after the DA conversion using a variable gain amplifier or variable attenuator, or alternatively by varying the gain of the digital record prior to conversion. Every electronic device (including DA and AD converters) has a noise floor. To obtain a high fidelity signal, where the signal is relatively uncontaminated by the inherent noise in the device (chapter 3, section 2.2), it is necessary to have as large a difference as possible between the gain of the signal to be output and the noise floor of the device. Where the gain of the signal is controlled post-conversion, and assuming that the output signal fills the full range of the DA converter (chapter 3, section 3.3), the signal-to-noise ratio of the final signal will be optimal. This will not be the case if the gain of the signal is varied digitally prior to conversion, as the signal will then be somewhat closer to the noise floor of the device. Often it is not practical to control the rapidly changing level of a signal program using post-conversion controllers, so a small sacrifice in fidelity needs to be tolerated. However, in practice, the overall average level of a signal can still be controlled by a post-conversion controller. In this case a 16 bit device providing around 93 dB dynamic range (chapter 3, section 3.3) is quite adequate to the needs of everything but the most demanding of applications.
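A numerical sketch of this gain-staging point, under the assumption that the converter's noise floor is fixed: attenuating digitally before conversion costs signal-to-noise ratio one-for-one, whereas a post-conversion attenuator scales signal and converter noise together. The 93 dB figure is the text's; the function is illustrative only:

```python
def snr_after_digital_attenuation(full_scale_snr_db: float,
                                  attenuation_db: float) -> float:
    """SNR at the DA output when the level is reduced digitally: the
    signal drops but the converter noise floor stays where it was."""
    return full_scale_snr_db - attenuation_db

# A 16 bit converter with ~93 dB dynamic range (chapter 3, section 3.3):
print(snr_after_digital_attenuation(93.0, 0.0))   # 93 dB, gain set post-conversion
print(snr_after_digital_attenuation(93.0, 30.0))  # 63 dB after 30 dB digital attenuation
```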

2.3. TEST SIGNAL AND RECORDING ENVIRONMENT

2.3.1. Recording environment

Head-related transfer functions are best recorded in a nonreverberant, anechoic room. Although the HRTFs could be recorded in any environment, HRTFs measured in a reverberant space will include both the free-field-to-eardrum transfer function and the potentially direction-dependent effects of the room. The impulse-response function of the auditory periphery is relatively short (on the order of a few milliseconds). Therefore, if the room is of sufficiently large dimensions, reverberant effects due to the walls and ceiling could be removed by windowing the impulse response function, although floor reflections may still present a problem. Even though reverberation will improve the externalization of synthesized sound sources (Durlach et al73), the maximum flexibility for simulation will be achieved by separating room


acoustical effects from the effects of the subject's head and body. This is particularly important in the case where HRTFs are to be used in the study of fundamental auditory processes. Sound sources can then be simulated for different acoustical environments using a combination of HRTFs and acoustical room models (e.g., Begault35).

2.3.2. Test signals for measuring the HRTFs

The first measurements of the frequency response of the outer ear were obtained in a painstaking and time-consuming way using pure tones and a phase meter.24,29,36 More recently, the impulse-response technique (see chapter 3, section 5.1) has been routinely applied to measure HRTFs. The total measurement time has to be kept as short as possible to minimize subject discomfort and to ensure that the recording system remains stable throughout the measurement session. The main problem with the impulse-response technique is the high peak factor in the test signal; that is, there is a high ratio of peak-to-RMS pressure resulting from the ordered phase relations of each spectral component in the signal. Where the dynamic range of the recording system is limited, this can result in a poor signal-to-noise ratio in the recordings. Several digital techniques involving precomputation of the phase spectrum have been developed to minimize the peak factor (see for instance Schroeder37). The algorithm developed by Schroeder has been applied by Wightman and Kistler9 to the generation of a broadband signal used for the recording of HRTFs. Signal-to-noise ratio can also be increased by using so-called maximum length sequence signals, a technique involving binary two-level pseudo-random sequences.26 Recently, a similar technique based on Golay codes has been applied successfully to the measurement of HRTFs11,27,38 (see chapter 3, section 5.2).

Several other steps can be taken to increase the signal-to-noise ratio in the recording of HRTFs. Energy should be concentrated in the frequency band of interest. Averaging across responses will also provide a significant improvement, even when Golay codes are employed. Finally, the signal delivered can be shaped using the inverse filter functions of elements of the transducing system so that energy is equalized across the spectrum (see for instance Middlebrooks et al23). We have generated Golay codes shaped using the inverse of the frequency response of the whole transducing system to compensate for the loudspeaker and microphone high-frequency roll-off and the probe tube resonance (Fig. 4.6, broken line).11 The resulting response of the microphone was flat (±1 dB above 1 kHz) and at least 50 dB above noise over its entire bandwidth (0.2-16 kHz) (Fig. 4.6, solid line). Signal-to-noise ratio can also be improved at the level of sampling by adjusting the level to optimize analog-to-digital conversion (see section 2.2).
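Golay codes come in complementary pairs whose autocorrelations sum to a perfect impulse, which is what allows the impulse response to be recovered from two low-peak-factor presentations. The following is a minimal sketch of the principle (see chapter 3, section 5.2 for background); the function names and the frequency-domain recovery route are our own, and the recordings are assumed to be long enough to contain the full response:

```python
import numpy as np

def golay_pair(order: int):
    """Complementary Golay pair of length 2**order, built recursively:
    a -> [a, b], b -> [a, -b]."""
    a, b = np.array([1.0]), np.array([1.0])
    for _ in range(order):
        a, b = np.concatenate([a, b]), np.concatenate([a, -b])
    return a, b

def golay_impulse_response(resp_a, resp_b, a, b):
    """Recover the system impulse response from the recorded responses
    to each code. Since |A|^2 + |B|^2 = 2N at every frequency, summing
    the two matched-filtered recordings and dividing by 2N yields h."""
    n = len(resp_a)  # must be >= len(a) + len(h) - 1 to avoid wraparound
    A, B = np.fft.rfft(a, n), np.fft.rfft(b, n)
    Ra, Rb = np.fft.rfft(resp_a, n), np.fft.rfft(resp_b, n)
    H = (np.conj(A) * Ra + np.conj(B) * Rb) / (2.0 * len(a))
    return np.fft.irfft(H, n)

# a, b = golay_pair(10)                      # 1024-point codes
# h = golay_impulse_response(mic_a, mic_b, a, b)
```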


Fig. 4.6. Broken line: transducing system (loudspeaker, probe tube and microphone) transfer function determined using Golay codes (dB re A/D 1 bit). Solid line: amplitude spectrum of the Golay codes stimulus used for recording head-related transfer functions as redigitized from the microphone’s output, as a result of shaping with the inverse of the system’s frequency response. Adapted with permission from Pralong D et al, J Acoust Soc Am 1994; 95:3435-3444.

2.3.3. Sound pressure level

Absolute sound pressure level is an important parameter in recording both the HRTFs and the headphone transfer functions. On the one hand, the level should be high so that the signal-to-noise ratio is optimized. On the other hand, the sound pressure level should be sufficiently moderate that the middle ear stapedial reflex is not triggered. Wightman and Kistler9 presented a control experiment where HRTFs recorded at a free-field SPL of 70 dB were used to filter headphone stimuli delivered at 90 dB SPL. The HRTFs re-recorded in this condition differed significantly from the original HRTFs, in particular in the 1-10 kHz frequency range, reflecting the change in acoustical impedance at the eardrum. This clearly indicates that the SPL of the measuring stimulus should be matched to the level at which VAS is to be generated, or at least lie within a range over which the measurement is linear. We have also measured HRTFs and headphone transfer functions at 70 dB SPL.11

2.4. STABILIZING THE HEAD FOR RECORDINGS

The subject's head can be stabilized during the recording of the HRTFs by various devices such as a clamp, a bite bar or a chin rest. A head clamp is not advisable because of the possibility of perturbing the sound field. Also, the contraction of the masseter muscles


produced by a bite bar results in a noticeable change in the morphology of the distal portion of the auditory canal which may affect the positioning of the probe tube. Therefore a small chin rest seems to represent a good compromise between stability and interference with the recordings.

We estimated the variation introduced into the data by any head movement during the recording session using multiple (9 to 13) measurements of the HRTF made directly in front of 8 subjects (azimuth 0° and elevation 0°).27 The variation in the HRTFs was pooled across all subjects, and the pooled standard deviation was obtained by pooling the squared deviations of each individual's HRTF from their respective means. To gain insight into the perceptual significance of these acoustical variations, the HRTFs were filtered using an auditory filter model to account for the spectral smoothing effects of the auditory filters12 (see section 1.2, and section 2.5.1 in chapter 2). This resulted in pooled standard deviations around 1 dB, except for frequencies around 8 kHz or higher, where the deviation reached a maximum of about 2 dB.

Middlebrooks et al23 introduced the use of a head-tracking device (see section 5.1) worn by the subject during the HRTF measurements, providing a measurement of the head's position throughout the session. We have recently refined this technique by providing subjects with feedback from the head-tracker, allowing them to keep their head within less than 1.5 degrees of azimuth, elevation and roll of the original position.
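The pooled standard deviation just described can be computed directly. A minimal sketch, assuming each subject's repeated measurements are stacked in an array of shape (repeats, frequencies) in dB; the simple normalization by the total number of measurements is our choice, as the chapter does not specify the exact estimator:

```python
import numpy as np

def pooled_std(measurements_by_subject):
    """Pool the squared deviations of each repeated HRTF measurement
    from that subject's own mean, across all subjects, then take the
    square root; returns one value per frequency bin."""
    sq_dev, count = 0.0, 0
    for m in measurements_by_subject:   # m: (n_repeats, n_freqs) array, dB
        sq_dev = sq_dev + ((m - m.mean(axis=0)) ** 2).sum(axis=0)
        count += m.shape[0]
    return np.sqrt(sq_dev / count)
```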

3. FILTERING SIGNALS FOR PRESENTATION

There is a huge literature examining the efficient filtering of signals using digital methods and it would not be possible to give it adequate coverage here. Instead, we will focus on a number of key issues and look at ways in which they are currently being implemented in VAS displays. Digital filtering is a very computationally intensive process and there are a number of efficient, purpose-designed devices available. As discussed above (section 2.2), these devices are, on the whole, being pushed close to their operating limits by the implementation of even fairly simple VAS displays involving a small number of auditory objects and very rudimentary acoustical rendering. Where there are a large number of virtual sources and/or a complex acoustical environment involving reverberation, and therefore long impulse response functions, the computational overheads become sufficiently high that real time generation of VAS is not possible. There are a number of research groups currently examining ways in which the efficiency of the implementation of VAS can be improved (see chapter 6).

The two main approaches to filtering a signal are either in the time domain or in the frequency domain. In the first case, digital filters,


generally in the form of the impulse response of the HRTF and the inverse of the headphone transfer function, are convolved with the program material (chapter 3, section 5.3). The number of multiply-and-accumulate operations required for each digitized point in the program material is directly related to the length of the FIR filter. Each filter can represent a single HRTF or the combination of a number of filter functions, such as the HRTF plus the inverse of the headphone transfer function plus some filter describing the acoustics of the environment in which the auditory object is being presented (e.g., the reverberant characteristics of the environment). In the latter case the filter is likely to be many taps long and require considerable processing. In addition to the computational time taken in executing a filter, there may also be a delay associated with each filter which adds to the total delay added to the program material.

In the situation where the user of a VAS display is interacting with the virtual environment (see chapter 6), the selection of the components of the filters will depend on the relative locations and directions of the head and the different acoustical objects in the VAS being generated (this includes sources as well as reflecting surfaces). There may also be a further small delay added as the appropriate libraries of HRTFs are accessed and preprocessed. In a dynamic system, the total delays are often perceptually relevant and appear as a lag between some action by the operator (e.g., turning the head) and the appropriate response of the VAS display. In such cases the VAS is not perceived to be stable in the way that the real world is stable when we move around. The perceptual anomalies produced by such poor rendering of VAS can lead to simulator sickness, a condition not unlike sea sickness. Thus, computational efficiency is an issue of considerable importance, not just for the fidelity of a real time display but for the general utility of the display.

In a time domain approach, care must be taken when switching filters with a continuous program to avoid discontinuities due to ringing in the filters and large jumps in the characteristics of the filters. If the head is moved slowly in the free-field, a continuous free-field sound source will undergo a continuous variation in filtering. However, by definition, sampling of the HRTF must be done at discrete intervals. If VAS is based on a relatively coarse spatial sampling of the HRTFs and there is no interpolation between the measured HRTFs, there may be significant jumps in the characteristics of the filters. A number of interpolation methods have been employed to better capture the continuous nature of space and, therefore, the variations in the HRTFs (see chapters 5 and 6). As yet there is no systematic study of the effects on the fidelity of the generated VAS of using relatively large steps in the HRTF samples. In any case, the seamless switching of filters requires convolution methods that account for the initial conditions of the system. Similarly, if the filters start with a step change


from zero then they are also likely to ring and produce some spectral splattering in the final signal. Fortunately, this problem is simply dealt with by time windowing to ensure that the onset and offset of the filters make smooth transitions from zero.

The second approach to filtering is in the frequency domain. This approach becomes increasingly efficient as the impulse response of the combined filters becomes relatively long. In this approach the program material is broken up into overlapping chunks and shifted into the frequency domain by an FFT. This is then multiplied by the frequency domain description of the filter and the resultant is shifted back into the time domain by an inverse FFT. Each chunk of the now filtered program material is time windowed and reconstructed by overlapping and adding together the sequential chunks. The efficiency of current implementations of the FFT makes this the most practical approach when the filters are relatively long, as occurs when complex or reverberant acoustical environments are being rendered. There are also new, very efficient methods and hardware for computing very fast FFTs.
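A minimal sketch of this overlap-and-add scheme, in the common variant where the input is cut into non-overlapping blocks whose filtered tails overlap in the output; the function name and block size are our own choices:

```python
import numpy as np

def overlap_add(x, h, block=1024):
    """Filter the program material x with impulse response h by
    processing x in chunks: each chunk is FFT-multiplied by the filter
    spectrum, shifted back to the time domain, and the overlapping
    tails of successive chunks are summed."""
    # FFT long enough to hold block + len(h) - 1 samples (no circular wrap)
    n_fft = 1 << int(np.ceil(np.log2(block + len(h) - 1)))
    H = np.fft.rfft(h, n_fft)
    y = np.zeros(len(x) + len(h) - 1)
    for start in range(0, len(x), block):
        seg = np.fft.irfft(np.fft.rfft(x[start:start + block], n_fft) * H, n_fft)
        stop = min(start + n_fft, len(y))
        y[start:stop] += seg[:stop - start]
    return y
```

For long reverberant filters this costs far fewer operations per output sample than direct time-domain convolution, which is the efficiency argument made above.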

4. DELIVERY SYSTEMS: RECORDING HEADPHONE TRANSFER FUNCTIONS (HpTFs)

Many points discussed in the previous sections of this chapter for the recording of HRTFs also apply to the recording of HpTFs. The calibration of headphones presents some specific problems, some of which can contribute significantly to the spatial quality of signals resynthesized in VAS. Choosing headphones with good transducing qualities is an obvious factor. As shown later in this section, the headphone capsule type has to be taken into consideration as well. In particular, the type of headphones employed should leave the recording system undisturbed, so that the probe's position in the ear canal is not altered when the HpTFs are being measured.

There has been very little published on headphone transfer functions measured in real ears. Calibration is generally performed on artificial ear systems because of the difficulties associated with using real ears. While such systems may be useful for controlling the quality of headphones from the manufacturer's point of view, they have limited relevance to the headphones' performance on real human ears. Moreover, results are generally expressed as differences between the sound pressure level produced by the headphones at a given point in the ear canal and the sound pressure level produced at the same point by a free-field or diffuse-field sound source.39,40 These measurements give information about the ability of headphones to replicate free-field or diffuse-field sound reproduction. In the case of the generation of VAS, the acoustical objective is to reconstitute, using headphones, the sound field at the eardrum that would normally occur for a free-field stimulus. The HRTFs for free-field sources are measured to a specific reference


in the ear canal and the HpTFs must be measured to the same reference point if they are to be properly accounted for (see section␣ 1.1). Figure 4.7 shows a number of measurements of HpTFs recorded for two different types of headphones from the internal probe microphone at the eardrum of our model head (see section 3).c These transfer functions were obtained as follows. The transfer function of the probe-microphone system was first obtained by comparing its free-field response to a loudspeaker with that of a flat transducer (B&K 4133␣ microphone) to the same speaker. The HpTF was then obtained by deconvolving the probe-microphone response from the response recorded in the ear canal with the stimulus delivered through headphones. This transfer function describes both the headphones’ transducing properties and the headphone-to-eardrum transfer function. Figure 4.7A shows a number of measurements of HpTFs recorded for circum-aural headphones (Sennheiser Linear 250). The HpTF is characterized by gains of up to 18 dB for frequencies between 2␣ kHz and 6␣ kHz and between 8␣ kHz and 12␣ kHz, separated by a spectral notch around 7.5␣ kHz. These spectral features are similar to the main features of the freefield-to-eardrum transfer functions obtained for positions close to the interaural axis (see chapter 2, section 1.5). With circum-aural headphones the ear is entirely surrounded by the external part of the phone’s cushion. Therefore, it is not surprising that this type of headphone captures many of the filtering effects of the outer ear. The standard deviation for six different placements of the headphones was worst at 8␣ kHz (around 2 dB) and below 1 dB for the other regions of the spectrum below 12␣ kHz (Fig. 4.7B). This demonstrates that the placement of this type of headphones on the model head is highly reproducible. In contrast, the reproducibility of recordings obtained for repeated placements of headphones of the supra-aural type (Realistic Nova␣ 17) was less satisfying (Fig. 4.7C). Although these transfer functions were characterized by a generally flatter profile, there was considerable variation between replicate fittings and the standard deviation for six different placements reached 8 to 9 dB around 8 and 12␣ kHz (Fig.␣ 4.7D). Indeed, this type of headphone flattens the outer ear convolutions in a way which is difficult to control from placement to placement. Although favored for their flat response when tested on artificial ear systems, this type of headphone is clearly not suitable for VAS experiments which rely on a controlled broadband signal input to the auditory system. Transfer functions for the Sennheiser 250 linear circum-aural headphones recorded from a human subject are shown in Figure 4.7E. In this case measurements were made with the in-ear recording system as c

c This measures only the reproducibility of positioning, not the disturbance of the external recording system used on real human ears.
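In signal-processing terms, the two-step calibration just described amounts to two spectral divisions. The following is a minimal sketch only, assuming single-channel impulse responses that have already been deconvolved from the test stimulus; the variable names and placeholder data are ours, not those of the original recording system, and a practical implementation would regularize the divisions and restrict them to the measurement bandwidth.

    import numpy as np

    N_FFT = 1024
    EPS = 1e-12  # guards the divisions against near-zero bins

    def spectrum(impulse_response):
        """Complex spectrum of a time-domain impulse response."""
        return np.fft.rfft(impulse_response, N_FFT)

    # Placeholder impulse responses; in practice these are measurements.
    rng = np.random.default_rng(0)
    probe_freefield = rng.standard_normal(512)  # probe system vs. loudspeaker
    ref_freefield = rng.standard_normal(512)    # flat reference microphone
    canal_headphone = rng.standard_normal(512)  # probe in canal, headphones on

    # Step 1: probe-microphone transfer function, from the free-field
    # comparison with the flat (e.g., B&K 4133) reference microphone.
    probe_tf = spectrum(probe_freefield) / (spectrum(ref_freefield) + EPS)

    # Step 2: deconvolve the probe response from the ear-canal recording
    # made with the stimulus delivered through the headphones; what is
    # left is the headphone-to-eardrum transfer function (HpTF).
    hptf = spectrum(canal_headphone) / (probe_tf + EPS)

    # Magnitude in dB, the form plotted in Fig. 4.7.
    hptf_db = 20 * np.log10(np.abs(hptf) + EPS)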


Transfer functions for the Sennheiser 250 Linear circum-aural headphones recorded from a human subject are shown in Figure 4.7E. In this case measurements were made with the in-ear recording system described above (section 1.2; see Fig. 4.2). These transfer functions show moderate gains for frequencies between 2 and 4 kHz and between 4 and 6 kHz (up to 10 and 5 dB respectively) and a second 10 dB gain at 12 kHz, preceded by a -25 dB notch at 9 kHz. Although these features are qualitatively similar to those obtained from the model head with the same headphones (Fig. 4.7A), there are significant differences in the amplitude and frequency of the spectral transformations. These can probably be attributed to differences between the human and model pinna shape and size, ear canal length and acoustical reflectance at the eardrum, as well as to the probe's position. These transfer functions compare well with data reported for similar types of headphones by Wightman and Kistler9 and Møller et al.28

The standard deviation for six different placements of the headphones was worst around 14 kHz (about 4 dB), about 2 dB for frequencies between 8 and 12 kHz, and below 1 dB for frequencies below 7 kHz (Fig. 4.7F). Part of the variability observed at high frequencies could be caused by slight movements of the probe due to the pressure applied by the headphones onto the external recording system. Thus, the standard deviation reported here probably represents an overestimate of the variability between different placements of circum-aural headphones on a free human ear.

Circum-aural headphones therefore seem to be best suited for the generation of VAS, and have been used in all the VAS studies published so far (e.g., Sennheiser HD-340;9 Sennheiser HD-43041). Some models of circum-aural headphones also have capsules of an ovoid rather than a round shape, which fit better around the pinna and leave enough space for microphone cartridges to be placed below the ear without being disturbed. Møller et al28 have presented an extensive study describing transfer functions from 14 different types of headphones to the occluded ear canals of 40 different subjects, including many circum-aural models.

Control over the precise position of the headphones around the outer ear remains a critical issue in the generation of high fidelity VAS. In the absence of any acoustical measurement device, the correspondence between subsequent placements of the headphones and the placement used in calibrating the HpTFs cannot be determined. Mispositioning circum-aural headphones slightly above or below the main axis of the pinna can result in significant changes in the measured HpTF, with notches in the transfer function shifting in frequency in ways similar to the changes observed in HRTFs when a free-field sound source moves around the ear (Pralong and Carlile, unpublished observations). This problem can be eliminated, at least in the research environment, by the use of in-ear tube phones such as the Etymotic Research ER-2. These tube phones fit tightly in the ear canal through an expanding foam plug and are calibrated to give a flat response at the eardrum. A psychophysical validation of VAS using Etymotic ER-2 tube phones is presented in section 5.2. Another possible advantage of in-ear tube phones over headphones is that headphones might cause a cognitive impairment in the externalization of a simulated sound source owing to their noticeable presence on the head. On the other hand, the auditory isolation provided by in-ear tube phones could itself be a source of problems.

Finally, HpTFs vary from individual to individual, thus requiring individualized calibration for the generation of high fidelity VAS. This issue is addressed in section 6 of this chapter, together with the general problem of the individualization of HRTFs.

Fig. 4.7 (opposite). (A) Headphone-to-eardrum transfer functions for Sennheiser 250 Linear circum-aural headphones measured using the model head, (C) Realistic Nova 17 supra-aural headphones measured using the model head, and (E) Sennheiser 250 Linear circum-aural headphones measured from the left ear of a human subject. Six recordings were made in each condition, with the headphones taken off and repositioned between each measurement. (B), (D), (F): standard deviations for (A), (C), (E), respectively.
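The standard deviations shown in Fig. 4.7B, D and F are simply the across-fitting standard deviations of the dB magnitude spectra. A minimal sketch with hypothetical numbers (six placements, as above):

    import numpy as np

    # Hypothetical stack of HpTF magnitude spectra in dB: one row per
    # placement of the headphones (6 fittings x 512 frequency bins).
    rng = np.random.default_rng(1)
    hptfs_db = rng.normal(0.0, 2.0, size=(6, 512))

    mean_hptf_db = hptfs_db.mean(axis=0)    # as in Fig. 4.7A, C, E
    sd_db = hptfs_db.std(axis=0, ddof=1)    # as in Fig. 4.7B, D, F
    worst_bin = int(np.argmax(sd_db))       # frequency bin of poorest reproducibility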

5. PERFORMANCE MEASURES OF FIDELITY

Section 4.2 in chapter 1 gives a detailed account of the different methods available for measuring sound localization performance in humans, and concludes that the most demanding test of the fidelity of VAS is to compare the absolute localization of transient stimuli in VAS with that in the free-field. The following sections briefly review the methods by which localization of transient stimuli can be assessed, as well as studies in which localization in VAS has been systematically compared with localization in the free-field. We also illustrate this discussion with some recent results from our own laboratory.

5.1. METHODOLOGICAL ISSUES

There are three main methodological issues that need to be considered in the assessment of sound localization by humans: (i) how the position of a sound source is varied; (ii) how the subject indicates where the sound source is perceived to be; and (iii) the statistical methods by which localization accuracy is assessed. In most experimental paradigms the first two problems are linked, and the solutions that have been employed offer different advantages and disadvantages.

One way in which the sound source location can be varied is by using a fixed number of matched speakers arranged at a number of locations in space around the subject. Following the presentation of a stimulus the subject is required to identify which speaker in the array generated the sound. This procedure has been referred to as "categorization,"42 as subjects allocate their judgment of an apparent sound location to one response alternative in a set of available positions (see for instance Butler;43 Blauert;44 Hebrank and Wright;45,46 Belendiuk and Butler;47 Hartmann;48 Musicant and Butler49). The set of possible responses, or categories, is generally small (below 20).


The obvious technical advantage is that changing the location of the source is simply a matter of routing the signal to a different speaker. One technical difficulty with such a system is that the speakers in the array have to be well matched in terms of transduction performance. This has generally been achieved by testing a large number of speakers and simply selecting those which are best matched. With the advent of digital signal processing (see chapter 3) it has become possible to calibrate each speaker and then modify the spectrum of the signal to compensate for any variations between the sources (see Makous and Middlebrooks50; a minimal sketch of this compensation is given at the end of this passage).

In this kind of experimental setup there is a discrete number of locations, and the subject's foreknowledge of the potential locations constrains the responses to a relatively small number of discrete alternatives. This is particularly a problem when a sound location is ambiguous or appears to come from a location not associated with any of the potential sources. Indeed, differences in the arrangement of the loudspeakers also seem to guide the subject's responses.51-53 For instance, the accuracy of localization of a lowpass sound above the head was found to depend on whether the other sources used in the localization experiments were restricted to the frontal or to the lateral vertical plane.53 Thus, as these authors point out, the context in which the stimulus was presented influenced the localization judgments. A similar pattern of results was reported by Perrett and Noble,51 who compared sound localization under conditions where the subjects' choices were constrained with conditions where they were unconstrained. This clearly demonstrated the powerful influence of stimulus context and of foreknowledge of the potential source locations. Butler et al52 also showed that prior knowledge of the azimuth location of the vertical plane in which a stimulus was presented had a significant effect on the accuracy of localization in that vertical plane. However, a similar interaction was not shown for foreknowledge of the vertical position of the horizontal plane from which a stimulus could be presented (see Oldfield and Parker54 and Wightman and Kistler42 for further discussion of these important points).

A variant of this procedure is to use a limited array of sources to generate the stimuli but to conceal the number and location of these sources from the subject. An acoustically transparent cloth can be placed between the subject and the speaker array and a large number of potential speaker locations can be indicated on the cloth (e.g., Perrett and Noble51). Although only a few of these potential locations may actually be associated with a sound source, the subject's response is not constrained to the small number of real sources. Additionally, the cognitive component associated with foreknowledge of the disposition of the potential sources is also eliminated.
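The per-speaker compensation mentioned above can be sketched as a simple inverse-filtering operation in the frequency domain. This is our illustration rather than the procedure of any particular study; the names are hypothetical, and a practical implementation would band-limit and regularize the inverse to avoid boosting measurement noise.

    import numpy as np

    def compensate(signal, speaker_ir, reference_ir, n_fft=4096, eps=1e-12):
        """Pre-filter `signal` so that, played through the speaker with
        impulse response `speaker_ir`, it approximates playback through
        the reference speaker `reference_ir`."""
        s = np.fft.rfft(signal, n_fft)
        spk = np.fft.rfft(speaker_ir, n_fft)
        ref = np.fft.rfft(reference_ir, n_fft)
        return np.fft.irfft(s * ref / (spk + eps), n_fft)

    # Hypothetical use: equalize a noise burst for one speaker of an array.
    rng = np.random.default_rng(2)
    burst = rng.standard_normal(2048)
    speaker3_ir = rng.standard_normal(256)
    reference_ir = rng.standard_normal(256)
    equalized = compensate(burst, speaker3_ir, reference_ir)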


The second approach to varying the target location is to move a single loudspeaker, so that the choice of locations is a continuous function of space. If this procedure is carried out in the dark, or with the subject blindfolded, the subject has no foreknowledge of potential target locations.

The method by which subjects report their judgments in these studies has to be carefully considered, as some of the variance in the results could be accounted for by the subject's ability to perform the task of indicating the perceived location, rather than by his or her perceptual abilities. In the series of studies by Oldfield and Parker54-56 subjects reported the position of the sound by pointing at the target with a hand-held "gun," the orientation of which could be photographed. A similar procedure was used by Makous and Middlebrooks,50 where subjects pointed their nose towards the perceived sound location and the orientation of the head was registered with an electromagnetic tracking device (see also Fig. 1.2 and section 2.1.2 of chapter 1).d

Head-pointing is currently the most popular method for indicating the location of the stimulus. Precautions have to be taken so that the variability associated with the subject's ability to orient the head is less than the error in localizing the sound source (see below). Also, the resolution of the tracker should be better than human sound localization accuracy. One of the most commonly used head-tracking systems (e.g., Polhemus IsoTrack) has an angular accuracy of about 0.5 degrees, which is well below the best possible human performance. The use of head-tracking devices is further discussed in chapter 6.

Wightman and Kistler14,59 (see also Kistler and Wightman60) used a third type of judgment method, in which subjects indicate the perceived position of a sound source by calling out numerical estimates of apparent azimuth and elevation using spherical coordinates. In this type of experiment the results also depend on the ability of the subject to map perceived locations onto the coordinate system, rather than on motor ability. In all cases, subjects must first learn the task until they produce steady and reliable judgments in the free-field before any comparison can be made with localization performance in VAS. Also, multiple measurements should be made for each position, in order to capture the biological variance and to obtain a good statistical estimate of localization accuracy. All three procedures described above produce similar results with respect to localization errors and their regional dependency, with the difference that the coordinate estimation procedure seems to produce more variance in the data.42

We have recently measured a number of the error components associated with pointing the head towards an unconstrained target in the dark.61

d Another method of estimation has been described recently, in which subjects report the presumed location of a target by pointing onto a spherical model of the coordinate system.57,58 Also, Hammershøi and Sandvad8 have used a similar approach, in which subjects indicated the perceived position from a limited number of locations using a digital notepad.


Turning to face an auditory target is clearly a highly ecological behavior, and tracking the position of the head with a head-mounted device provides an objective measure of the perceived location of the sound source. One problem associated with "head pointing" is that orienting the head is only one component of the behavior that results in visual capture of the source; the other major component is the movement of the eyes. For sound locations where head movement is mechanically unrestricted (around the frontal field and relatively close to the audio-visual horizon) it is very likely that the eyes will be centered in the head, and the position of the head will provide a good indicator of the perceived position of the sound. However, for sound locations which require more mechanically extreme movements of the head, it is more likely that the eyes will deviate from their centered position, so that there will be an error between the perceived location of the source and the position indicated by the head. Recent results from our laboratory verify this prediction, and we have developed training procedures based on head-position feedback which act to minimize these errors.61

5.2. EXPERIMENTAL RESULTS

The study by Wightman and Kistler14 provided the first rigorous assessment of the perceptual validity of VAS. The stimulus employed in their localization tests was a train of eight 250 ms bursts of noise. In order to prevent listeners from becoming familiar with a specific stimulus or transducer characteristic, the spectrum of each stimulus was divided into critical bands and a random intensity was assigned to the noise within each critical band (spectral scrambling). In both the free-field and VAS conditions, the same eight subjects were asked to indicate the apparent spatial position of 72 different target locations by calling out numerical estimates of azimuth and elevation (see above for details). VAS was synthesized using individualized HRTFs, i.e., HRTFs recorded from each subject's own ear canals.9 Subjects were blindfolded in both conditions and no feedback regarding the accuracy of their judgments was given.

The data analysis for this type of experiment is complicated by the fact that targets and responses are distributed on the surface of a sphere. Therefore, spherical statistical techniques have been used in analyzing the data.14,62,63 Two useful parameters provided by these statistics are the centroid judgment, which is reported as an azimuth and elevation and describes the average direction of a set of response vectors for a given location, and the two-dimensional correlation coefficient, which describes the overall goodness of fit between centroids and targets. Another problem with this type of data comes from cone of confusion errors, i.e., front-to-back or back-to-front reversals (see chapter 1, section 2.1.3).


Cone of confusion errors need to be treated as a particular case, as they represent large angular deviations from their targets, and their inclusion in the overall spherical statistics would lead to a large overestimate of the average angular error. Two analytical strategies have been proposed to deal with this type of error. Makous and Middlebrooks50 remove confusions from the pool of data and treat them separately. Wightman and Kistler14 have chosen to count and resolve cone of confusion errors; that is, the response is reflected into the hemisphere where the target was located, and the rate of reversal is computed as a separate statistic. Responses where a confusion occurred therefore still contribute to the computation of overall localization accuracy, while the true perception of the subject is reflected in the reversal statistic.

Table 4.1 provides a general summary of the results obtained by Wightman and Kistler.14 It can be seen that the individual overall goodness of fit ranges from 0.93 to 0.99 in the free-field, and from 0.89 to 0.97 in VAS. Although no significance level was provided for these coefficients, it can at least be said that the free-field and VAS performances are in good agreement. A separate analysis of the azimuth and elevation components of the centroids of the pooled localization judgments for each location shows that, while the azimuthal component of the stimulus location is rendered almost perfectly, the simulation of elevation is less satisfactory, as reflected by a slightly lower correlation between the response and target elevations. Also, the cone of confusion error rate, which is between 3 and 12% in the free-field, increases to between 6 and 20% in VAS. The increase in the front-back confusion rate and the decrease in elevation accuracy suggest that in these experiments there may have been some mismatch between the sound field generated at the eardrum in VAS and the sound field at the eardrum in free-field listening (see chapter 2; also see section 1.2 of this chapter).e

This issue was recently examined in our laboratory using high-fidelity individualized HRTFs acoustically validated as described in section 1.2.64 Free-field localization accuracy was determined in a dark anechoic chamber using a movable sound source (see chapter 1). Broadband noise bursts (0.2-14 kHz, 150 ms) were presented from 76 different spatial locations and 9 subjects indicated the location by turning to face the source. Head position was tracked using an electromagnetic device (Polhemus IsoTrack). Cone of confusion errors were removed from the data set.

Hammershøi and Sandvad8 recently presented a preliminary report of a psychophysical validation of VAS generated using occluded ear canal HRTF recordings. The localization test consisted in the identification of 17 possible sound sources reported using a digital notepad, and two different types of stimuli were tested: a 3 seconds speech signal and 3 seconds of white noise. They show a doubling of the total number of errors in VAS compared to free-field and a particularly large increase in the number of cone of confusions errors. It is not clear yet whether these poor results are related to any procedural problem in the generation of VAS or to the use of occluded ear recordings.
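For concreteness, the two operations described above — computing a centroid judgment, and counting and resolving front-back reversals — can be sketched as follows. The coordinate convention and function names are ours, and the full spherical statistics (including the spherical correlation coefficient of Fisher et al62) are not reproduced here.

    import numpy as np

    def unit_vectors(az_deg, el_deg):
        """Unit vectors from azimuth/elevation in degrees. Convention
        (illustrative): x points straight ahead, y to the left, z up."""
        az = np.radians(np.asarray(az_deg, dtype=float))
        el = np.radians(np.asarray(el_deg, dtype=float))
        return np.stack([np.cos(el) * np.cos(az),
                         np.cos(el) * np.sin(az),
                         np.sin(el)], axis=-1)

    def centroid(az_deg, el_deg):
        """Centroid judgment: the mean direction of a set of response
        vectors, returned as (azimuth, elevation) in degrees."""
        v = unit_vectors(az_deg, el_deg).mean(axis=0)
        v = v / np.linalg.norm(v)
        return np.degrees(np.arctan2(v[1], v[0])), np.degrees(np.arcsin(v[2]))

    def resolve_reversals(target_az, response_az):
        """Count front-back reversals and reflect them into the target's
        hemisphere (the Wightman and Kistler14 strategy)."""
        t = np.asarray(target_az, dtype=float)
        r = np.asarray(response_az, dtype=float)
        flipped = (np.abs(t) < 90) != (np.abs(r) < 90)   # hemisphere mismatch
        resolved = np.where(flipped, np.sign(r) * 180 - r, r)
        return resolved, flipped.mean()                  # responses, reversal rate

    # Hypothetical responses to a frontal target at (20, 10) degrees;
    # the third response is a front-to-back reversal.
    resolved_az, reversal_rate = resolve_reversals(20, [18, 25, 160])
    print(centroid(resolved_az, [12, 8, 11]), reversal_rate)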


Table 4.1. Spherical statistics and cone of confusion errors for measures of free-field sound localization and of VAS sound localization (the latter in parentheses), shown individually for 8 subjects.

ID     Goodness of fit   Azimuth correlation   Elevation correlation   % reversal
SDE    0.93 (0.89)       0.983 (0.973)         0.68 (0.43)             12 (20)
SDH    0.95 (0.95)       0.965 (0.950)         0.92 (0.83)              5 (13)
SDL    0.97 (0.95)       0.982 (0.976)         0.89 (0.85)              7 (14)
SDM    0.98 (0.98)       0.985 (0.985)         0.94 (0.93)              5 (9)
SDO    0.96 (0.96)       0.987 (0.986)         0.94 (0.92)              4 (11)
SDP    0.99 (0.98)       0.994 (0.990)         0.96 (0.88)              3 (6)
SED    0.96 (0.95)       0.972 (0.986)         0.93 (0.82)              4 (6)
SER    0.96 (0.97)       0.986 (0.990)         0.96 (0.94)              5 (8)

Adapted with permission from Wightman FL et al, J Acoust Soc Am 1989; 85:868-878.

Figure 4.8A shows that the mean localization accuracy across all subjects varies with source location, being most accurate for frontal locations. The centroids of the localization judgments for each location were highly correlated with the actual location of the source (spherical correlation: 0.987) and the cone of confusion error rate was 3.1%.

Our VAS stimuli were generated by filtering noise with the appropriate HRTFs and were delivered using in-ear tube phones (Etymotic Research ER-2; see section 4). Sound localization was measured as for the free-field, each subject being tested with at least 5 consecutive blocks of the 76 locations. Subjects acclimatized to the VAS task after a short training period without feedback, and localization accuracy was compared with accuracy in the free-field using the mean angular error for each test location. Although the spatial variation of the mean angular errors in VAS is very similar to that seen in the free-field, the errors were on average larger by about 1.5° (paired t-test: t = 3.44, df = 75, p = 0.0009). The cone of confusion error rate was 6.4%, a fraction of which can be accounted for by the slight increase in the average angular error in VAS around the interaural axis compared to the free-field. The correlation between localization in VAS and localization in the free-field is very high (spherical correlation: 0.983), indicating that performance in the two conditions is very well matched (Fig. 4.8B).

The difficulty of discriminating front and back in simulated auditory space has been known since the binaural technique was first developed.7 Frontal localization in particular seems to be a problem. Sound sources in the frontal hemisphere are often perceived as appearing from behind and are poorly externalized, although this problem in binaural recordings could be related to the use of material recorded from nonindividualized, artificial ear canals (see section 6).

Fig. 4.8. (A) Free-field localization accuracy shown for 9 subjects (see text). The centroids of the localization judgments for each location (+) are very highly correlated with the actual location of the source (spherical correlation: 0.987). The horizontal and vertical axes of the ellipses represent ±1 s.d. about the mean response for each location. (B) VAS localization accuracy shown for 5 subjects (see text). The centroids of the localization judgments (+) are very highly correlated with the HRTF recording locations (spherical correlation: 0.974). *: indicates the position with coordinates 0°, 0°.
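The per-location comparison reported above reduces to computing great-circle angular errors and a paired comparison across the test locations. A minimal sketch, with hypothetical numbers standing in for the real data:

    import numpy as np
    from scipy.stats import ttest_rel

    def unit_vectors(az_deg, el_deg):
        """Unit vectors from azimuth/elevation in degrees (x ahead, y left, z up)."""
        az = np.radians(np.asarray(az_deg, dtype=float))
        el = np.radians(np.asarray(el_deg, dtype=float))
        return np.stack([np.cos(el) * np.cos(az),
                         np.cos(el) * np.sin(az),
                         np.sin(el)], axis=-1)

    def angular_error(t_az, t_el, r_az, r_el):
        """Great-circle angle, in degrees, between target and response."""
        dot = np.sum(unit_vectors(t_az, t_el) * unit_vectors(r_az, r_el), axis=-1)
        return np.degrees(np.arccos(np.clip(dot, -1.0, 1.0)))

    # Hypothetical per-location mean angular errors for 76 test locations.
    rng = np.random.default_rng(3)
    err_free_field = rng.uniform(4.0, 15.0, 76)
    err_vas = err_free_field + rng.normal(1.5, 2.0, 76)

    # Paired comparison across locations, as reported in the text.
    t_stat, p_value = ttest_rel(err_vas, err_free_field)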


The resolution of the frontal and posterior hemispheres seems to be the most fragile aspect of sound localization when tested in an auditory environment where cues are reduced to a minimum. This is exemplified by the fact that front-back resolution is the first aspect of performance to be degraded when sound localization is investigated against a noisy background.57 These results have been discussed by Good and Gilkey57 in relation to the saliency of cues to a sound's location. It appears quite likely that this aspect of localization could also deteriorate as a result of higher level factors, auditory or cognitive. Indeed, identical stimuli might be perceived identically only when they are presented in identical environments. Some experimental factors, such as prior training with feedback in the same environment, still differed between the free-field and VAS localization experiments described above. Another of these factors is the absence of a dynamic link to head motion in VAS. In the light of the discussion presented in chapter 1 (section 2.3.1), it appears rather unlikely that head movements could contribute to the localization of transients in the experiments described in Fig. 4.8. Wightman and colleagues recently reported that dynamically coupling a stimulus presented in VAS to small head movements decreased the number of confusions for those subjects who performed particularly badly in the virtual environment.65 If the stimulus consisted of a train of eight 250 ms bursts, as in their previous work,9 it is possible that the subjects used a scanning strategy to resolve ambiguous spectral cues in the signal, a process different from that involved in the localization of transients.

In conclusion, the experimental data obtained so far by Wightman, Kistler and collaborators, as well as in our laboratory, indicate that the simulation of free-field listening in VAS is largely satisfactory, as shown by the very high correlations between localization accuracy in the two conditions. Efforts remain to be made towards improving elevation accuracy and decreasing the number of cone of confusion errors. The evaluation of the fidelity of VAS using the localization of transient, static stimuli presented in anechoic conditions and uncoupled from head movements is a delicate exercise, and the unusual context and sparsity of cues of such stimuli could render the listener's performance more susceptible to the influence of higher level factors. It should also be borne in mind that the psychophysical validation of VAS relies on the primary assumption that the correct HRTFs have been measured, which remains difficult to establish empirically (see section 1.2 of this chapter). Progress in the understanding of outer ear acoustics and in the control over environmental factors should enable sound localization in VAS with an accuracy matching that in the free-field. It is also expected that in more multi-modal and dynamic virtual environments the contribution of higher level factors that can impair localization performance could become less important, with more continuous stimuli and with disambiguating information obtained from correlated head movements and visual cues.

6. INDIVIDUALIZED VERSUS NONINDIVIDUALIZED HRTFS AND HPTFS

The question of whether individualized HRTFs have to be used to generate high-fidelity VAS is of considerable practical and theoretical interest. So far, the discussion in this chapter has assumed that individual HRTFs are recorded for each subject for whom VAS is generated. Indeed, because of differences in the sizes and shapes of the outer ears, there are large differences between the measured HRTFs of different individuals, particularly at high frequencies, which would seem to justify the use of personalized recordings.10,23,27 Furthermore, when the measured HRTFs are transformed using an auditory filter model, which accounts for the frequency dependence of auditory sensitivity and the frequency- and level-dependent characteristics of cochlear filters, the individual differences in the HRTFs are preserved, suggesting that the perceptually salient features of the HRTFs are likely to differ from subject to subject27 (see chapter 2, section 2.5). Preliminary psychoacoustical studies confirmed the importance of individualized HRTFs and showed that the best simulation of auditory space is achieved when the listener's own HRTFs are used.66,67

It is clear from the previous sections of this chapter that measuring high-fidelity HRTFs is a delicate and time consuming process. Measurements have to be carried out in a sophisticated laboratory environment, which is probably not achievable for all potential users of VAS displays. Wenzel et al63,66 have suggested that any listener might be able to make use of nonindividualized HRTFs if these have been recorded from a subject whose perceptual abilities in both free-field and closed-field simulated sound localization are accurate. In a recent study, Wenzel et al41 asked inexperienced listeners to report the spatial location of headphone stimuli synthesized using the HRTFs and HpTFs of a subject, characterized by Wightman and Kistler,14 who was found to be an accurate localizer. The results show that, using nonindividualized HRTFs, listeners were substantially impaired in elevation judgments and demonstrated a high number of cone of confusion errors. Begault and Wenzel68 reported comparable results in similar experiments where speech sounds rather than broadband noise were used as stimuli. Unfortunately, the acoustical basis of the variability observed in the results of Wenzel et al69 is not known, as the waveform at the subject's eardrum was not recorded in this study. That is, as stated by the authors themselves, "to the extent that each subject's headphone-to-eardrum transfer function differs from SDO's [the accurate localizer], a less faithful reproduction would result."


We have illustrated in section 4 that the HpTFs for circum-aural headphones do capture some of the outer ear filtering effects (see Fig. 4.7). Data from our laboratory show that consequently, like the free-field-to-eardrum transfer functions, the headphone-to-eardrum transfer functions for circum-aural headphones can differ significantly from one subject to another.70 It can be seen in Figure 4.9 that the variability in the transfer functions is considerable for frequencies above 6 kHz, with an inter-subject standard deviation peaking at up to 17 dB around 9 kHz for the right ears. The frequency and depth of the first spectral notch varied between 7.5 and 11 kHz and between -15 and -40 dB respectively. There are also considerable individual differences in the amplitude and the center frequency of the high frequency gain features. The inter-subject differences in the HpTFs shown here are similar to those reported previously for circum-aural headphones.28 As previously described for HRTFs in the frontal midline,9,26,27,71 interaural asymmetries in the HpTFs are also evident, especially for frequencies above 8 kHz.

These data demonstrate that when VAS is generated using circum-aural headphones, the headphone transfer function will differ from subject to subject. It is therefore likely that the subjects in the experiments of Wenzel et al69 described above listened to different signals. This is illustrated by the data presented in Figure 4.10. We chose to examine the effects of using nonindividualized HpTFs by considering in detail the transfer functions of two subjects, A and B, which are among those described in Figure 4.9. Figure 4.10A and B shows that there is a 2 kHz mismatch between the mid-frequency notches of the HpTFs of the two subjects, for both the left and right ears. Differences can also be observed at higher frequencies (above 10 kHz), as well as in the 2 kHz to 7 kHz region, where the second subject is characterized by lower amplitude levels and a shift of the main gain towards lower frequencies. We calculated the spectral profile which would be obtained if subject A's HpTFs were used to deconvolve one of subject A's HRTFs and the result were reconstituted in subject B's ear canal (Fig. 4.10C and D). In this particular example the simulated location was at the level of the interaural axis and facing the left ear. In this condition subject A's inverse headphone transfer functions were removed from the resynthesized stimulus, whereas subject B's headphone transfer functions were actually imposed on the delivered stimulus. It can be seen that, due to the higher amplitude level of subject A's HpTFs in the mid-frequency region, the resynthesized HRTF lacks up to 10 dB in gain between 3 kHz and 7 kHz. Furthermore, due to the mismatch of the mid-frequency notches, the notch in the resynthesized HRTF is shifted down to 8 kHz and a sharp peak is created at 9 kHz, where the notch should have appeared. Minor differences also appear above 10 kHz.

In order to examine how much of the change introduced by the nonindividualized headphone deconvolution might be encoded by the auditory system, the resynthesized HRTFs were passed through the auditory filter model previously described (see chapter 2, section 1.6.1, and also section 1.2 of this chapter).
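In the frequency domain, the cross-listening manipulation illustrated in Figure 4.10C and D can be stated compactly. Writing S(ω) for the stimulus spectrum, and using notation consistent with section 1.1 (the labels A and B are ours), the spectrum reconstituted in subject B's ear canal when subject A's filters are used is

    Y_B(ω) = S(ω) · HRTF_A(ω) · HpTF_B(ω) / HpTF_A(ω),

that is, the intended spatial filter HRTF_A(ω) is multiplied by the ratio of the two subjects' headphone transfer functions. Only where HpTF_B(ω) ≈ HpTF_A(ω) does the simulation reduce to the intended S(ω) · HRTF_A(ω); this simply restates the deconvolution procedure described above.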


Fig. 4.9. Top: headphone-to-eardrum transfer functions measured for a pair of Sennheiser 250 Linear headphones, for the left and right ears of 10 different human subjects. The transfer functions are displaced by 40 dB on the Y axis to facilitate comparison. Bottom: the mean (solid line) and standard deviation (thick dotted line) of the headphone-to-eardrum transfer function for each ear.


Fig. 4.10. Effects of using nonindividualized headphone transfer functions on the resynthesis of HRTFs, for the left and right ears. (A) and (B): transfer functions for Sennheiser 250 Linear circum-aural headphones, for subjects A (solid line) and B (dotted line). (C) and (D): head-related transfer functions for a spatial location on the left interaural axis (azimuth -90°, elevation 0°), as originally recorded for subject A (solid line), and as deconvolved with subject A's HpTFs but imposed on subject B's HpTFs (dotted line). Convolution and deconvolution were modeled in the frequency domain using Matlab (The MathWorks Inc.). (E) and (F): the data displayed in (C) and (D) after being passed through an auditory filter model (see text). To simulate the effect of overall stimulus level, an offset of 30 dB was added to the input of each auditory filter.
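The auditory filter model referred to here and in the caption above is described in chapter 2. As a much simplified stand-in, the excitation-pattern computation can be sketched with rounded, fourth-order gammatone-like filters whose bandwidths follow the ERB scale of Glasberg and Moore.12 The sketch ignores the middle-ear transfer and the level dependence of the real cochlear filters; the 30 dB offset mirrors the caption above, and all names and placeholder data are illustrative.

    import numpy as np

    def erb_hz(fc_hz):
        """Equivalent rectangular bandwidth in Hz (Glasberg and Moore12)."""
        return 24.7 * (4.37 * fc_hz / 1000.0 + 1.0)

    def excitation_pattern(freqs_hz, level_db, centers_hz, offset_db=30.0):
        """Output level, in dB, of a bank of gammatone-like filters
        driven by a magnitude spectrum `level_db` sampled at `freqs_hz`."""
        power_in = 10.0 ** ((level_db + offset_db) / 10.0)
        out_db = []
        for fc in centers_hz:
            b = 1.019 * erb_hz(fc)
            w = (1.0 + ((freqs_hz - fc) / b) ** 2) ** -4.0  # |H|^2, order 4
            out_db.append(10.0 * np.log10(np.sum(w * power_in) / np.sum(w)))
        return np.array(out_db)

    # Hypothetical use: a resynthesized HRTF magnitude spectrum.
    freqs = np.linspace(200.0, 14000.0, 512)
    hrtf_db = 10.0 * np.sin(freqs / 1500.0)        # placeholder spectrum
    centers = np.geomspace(200.0, 14000.0, 60)     # filter centre frequencies
    pattern_db = excitation_pattern(freqs, hrtf_db, centers)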


The output of the cochlear filter model indicates that all of the differences described above are likely to be encoded in the auditory nerve and may well be perceptually relevant (Fig. 4.10E and F).

The distortion of the spectral cues resulting from nonindividualized HpTFs may well be the basis of the disruption of sound localization accuracy previously observed with inexperienced listeners.69 Firstly, the effects on frequencies between 3 kHz and 7 kHz would disrupt a potential cue for the resolution of binaural front-back ambiguities.27 Secondly, the peak in transmission around 9 kHz illustrated in Figure 4.10 results in a filter function which is uncharacteristic of any set of HRTFs previously reported.9,27 Distortion effects similar to those described above have been reported by Morimoto and Ando.10 These authors used a free-field, symmetrical two-channel loudspeaker system to simulate HRTFs in an anechoic chamber and compared the localization accuracy of three subjects with different sizes of pinna. They described, in the frequency domain, the sound pressure level at a subject's ear when a different subject's HRTFs were reconstituted using their system, and observed that when listening to nonindividualized HRTFs the subject's own free-field-to-eardrum transfer functions were imposed on the signal.

The different HRTFs associated with each ear will also lead to interaural spectral differences which vary continuously as a function of sound location. Computing the interaural spectral profile for subject B using subject A's HRTFs with both nonindividualized and individualized HpTFs shows that, in contrast to the effects seen in the monaural spectral profiles, there are only relatively small differences between the binaural spectral differences generated using individualized and nonindividualized HpTFs (data not shown). This might be expected from a close inspection of the HpTFs, which shows that for any particular subject the inter-subject differences are generally similar for both ears. Thus the use of nonindividualized HpTFs is likely to result in a greater disruption of monaural spectral cues than of binaural spectral differences. Interaural level differences are known to contribute to the judgment of the horizontal angle of a sound source (for review, see Middlebrooks and Green72). It is not surprising, then, that when front-back confusions were resolved, most subjects in the Wenzel et al69 study retained the ability to correctly identify the azimuthal angle of sound sources.

In view of the large variations in the individual HpTFs reported here, it is clear that these transfer functions must be accounted for in any controlled "cross-listening" experiment. Such experiments are important because they provide insights into how new listeners might perform with nonindividualized HRTFs and adapt to them over time. This is an important acoustical step in examining the link between the HRTFs and sound localization performance, and the adaptation of listeners to the new sets of cues present in nonindividualized HRTFs.


It remains unclear at this point whether performance with nonindividualized HRTFs can be improved if appropriate training is used and subjects are given the opportunity to adapt to the new HRTFs. There are a number of alternative approaches to matching the user of a VAS display to the HRTFs used in generating that display. One could be to provide a large library of HRTFs and simply allow subjects to select from this library the HRTFs which result in the best sound localization performance. The success of such a system will depend on the number of dimensions over which HRTFs vary in the population and on the development of practical means by which the selection of the appropriate library can be achieved. Another method might involve the identification of the dimensions of variation of HRTFs within the population and the development of mechanisms for modifying a standard HRTF library to match individual listeners. Both of these approaches are currently under investigation in our laboratory.

ACKNOWLEDGMENTS

The new psychoacoustical work from the Auditory Neuroscience Laboratory presented in this chapter was supported by the National Health and Medical Research Council (Australia), the Australian Research Council and the University of Sydney. The assistance of Stephanie Hyams in collecting some of the data is gratefully acknowledged. We also wish to thank the subjects who took part in these experiments. The Auditory Neuroscience Laboratory maintains a Web page outlining the laboratory facilities and current research work at http://www.physiol.usyd.edu.au/simonc.

REFERENCES
1. Gierlich HW. The application of binaural technology. Appl Acoust 1992; 36:219-243.
2. Plenge G. On the differences between localization and lateralization. J Acoust Soc Am 1974; 56:944-951.
3. Butler RA, Belendiuk K. Spectral cues utilized in the localization of sound in the median sagittal plane. J Acoust Soc Am 1977; 61:1264-1269.
4. Blauert J. Spatial Hearing: The psychophysics of human sound localization. Cambridge, Mass.: MIT Press, 1983.
5. Burkhard MD, Sachs RM. Anthropometric manikin for acoustic research. J Acoust Soc Am 1975; 58:214-222.
6. Gierlich HW, Genuit K. Processing artificial-head recordings. J Audio Eng Soc 1989; 37:34-39.
7. Møller H. Fundamentals of binaural technology. Appl Acoust 1992; 36:172-218.
8. Hammershøi D, Sandvad J. Binaural auralization: simulating free field conditions by headphones. Presented at the Audio Engineering Society Convention, Amsterdam, 1994:1-19.
9. Wightman FL, Kistler DJ. Headphone simulation of free field listening. I: Stimulus synthesis. J Acoust Soc Am 1989; 85:858-867.
10. Morimoto M, Ando Y. On the simulation of sound localization. J Acoust Soc Jpn 1980; 1:167-174.
11. Pralong D, Carlile S. Measuring the human head-related transfer functions: A novel method for the construction and calibration of a miniature "in-ear" recording system. J Acoust Soc Am 1994; 95:3435-3444.
12. Glasberg BR, Moore BC. Derivation of auditory filter shapes from notched-noise data. Hearing Res 1990; 47:103-138.
13. Hellstrom P, Axelsson A. Miniature microphone probe tube measurements in the external auditory canal. J Acoust Soc Am 1993; 93:907-919.
14. Wightman FL, Kistler DJ. Headphone simulation of free field listening. II: Psychophysical validation. J Acoust Soc Am 1989; 85:868-878.
15. Shaw EAG. The acoustics of the external ear. In: Studebaker GA, Hochberg I, eds. Acoustical factors affecting hearing aid performance. Baltimore: University Park Press, 1980:109-125.
16. Rabbitt RD, Friedrich MT. Ear canal cross-sectional pressure distributions: mathematical analysis and computation. J Acoust Soc Am 1991; 89:2379-2390.
17. Khanna SM, Stinson MR. Specification of the acoustical input to the ear at high frequencies. J Acoust Soc Am 1985; 77:577-589.
18. Stinson MR, Khanna SM. Spatial distribution of sound pressure and energy flow in the ear canals of cats. J Acoust Soc Am 1994; 96:170-181.
19. Stinson MR, Lawton BW. Specification of the geometry of the human ear canal for the prediction of sound-pressure level distribution. J Acoust Soc Am 1989; 85:2492-2503.
20. Rabbitt RD, Holmes MH. Three dimensional acoustic waves in the ear canal and their interaction with the tympanic membrane. J Acoust Soc Am 1988; 83:1064-1080.
21. Chan JCK, Geisler CD. Estimation of eardrum acoustic pressure and of ear canal length from remote points in the canal. J Acoust Soc Am 1990; 87:1237-1247.
22. Gilman S, Dirks DD. Acoustics of ear canal measurements of eardrum SPL in simulators. J Acoust Soc Am 1986; 80:783-793.
23. Middlebrooks JC, Makous JC, Green DM. Directional sensitivity of sound-pressure levels in the human ear canal. J Acoust Soc Am 1989; 86:89-108.
24. Shaw EAG, Teranishi R. Sound pressure generated in an external-ear replica and real human ears by a nearby point source. J Acoust Soc Am 1968; 44:240-249.
25. Shaw EAG. Transformation of sound pressure level from the free field to the eardrum in the horizontal plane. J Acoust Soc Am 1974; 56:1848-1861.
26. Møller H, Sørensen MF, Hammershøi D et al. Head-related transfer functions of human subjects. J Audio Eng Soc 1995; 43:300-321.
27. Carlile S, Pralong D. The location-dependent nature of perceptually salient features of the human head-related transfer function. J Acoust Soc Am 1994; 95:3445-3459.
28. Møller H, Hammershøi D, Jensen CB et al. Transfer characteristics of headphones measured on human ears. J Audio Eng Soc 1995; 43:203-217.
29. Wiener FM, Ross DA. The pressure distribution in the auditory canal in a progressive sound field. J Acoust Soc Am 1946; 18:401-408.
30. Mehrgardt S, Mellert V. Transformation characteristics of the human external ear. J Acoust Soc Am 1977; 61:1567-1576.
31. Shaw EAG. The external ear: new knowledge. In: Dalsgaard SC, ed. Earmolds and associated problems. Proceedings of the Seventh Danavox Symposium, 1975:24-50.
32. Shaw EAG. 1979 Rayleigh Medal lecture: the elusive connection. In: Gatehouse RW, ed. Localisation of sound: theory and application. Connecticut: Amphora Press, 1982:13-27.
33. Middlebrooks JC, Green DM. Directional dependence of interaural envelope delays. J Acoust Soc Am 1990; 87:2149-2162.
34. Lawton BW, Stinson MR. Standing wave patterns in the human ear canal used for estimation of acoustic energy reflectance at the eardrum. J Acoust Soc Am 1986; 79:1003-1009.
35. Begault DR. Perceptual effects of synthetic reverberation on three-dimensional audio systems. J Audio Eng Soc 1992; 40:895-904.
36. Shaw EAG. The external ear. In: Keidel WD, Neff WD, eds. Handbook of sensory physiology. Berlin: Springer-Verlag, 1974:455-490.
37. Schroeder MR. Synthesis of low-peak-factor signals and binary sequences with low autocorrelation. IEEE Trans Inform Theory 1970; IT-16:85-89.
38. Zhou B, Green DM, Middlebrooks JC. Characterization of external ear impulse responses using Golay codes. J Acoust Soc Am 1992; 92:1169-1171.
39. Theile G. On the standardization of the frequency response of high-quality studio headphones. J Audio Eng Soc 1986; 34:956-969.
40. Villchur E. Free-field calibration of earphones. J Acoust Soc Am 1969; 46:1526-1534.
41. Wenzel EM, Arruda M, Kistler DJ et al. Localization using nonindividualized head-related transfer functions. J Acoust Soc Am 1993; 94:111-123.
42. Wightman FL, Kistler DJ. Sound localization. In: Yost WA, Popper AN, Fay RR, eds. Human psychophysics. New York: Springer-Verlag, 1993:155-192.
43. Butler RA. Monaural and binaural localization of noise bursts vertically in the median sagittal plane. J Audit Res 1969; 3:230-235.
44. Blauert J. Sound localization in the median plane. Acustica 1969-70; 22:205-213.
45. Hebrank J, Wright D. Are the two ears necessary for localization of sound sources on the median plane? J Acoust Soc Am 1974; 56:935-938.
46. Hebrank J, Wright D. Spectral cues used in the localization of sound sources on the median plane. J Acoust Soc Am 1974; 56:1829-1834.
47. Belendiuk K, Butler RA. Monaural location of low-pass noise bands in the horizontal plane. J Acoust Soc Am 1975; 58:701-705.
48. Hartmann WM. Localization of sound in rooms. J Acoust Soc Am 1983; 74:1380-1391.
49. Musicant AD, Butler RA. Influence of monaural spectral cues on binaural localization. J Acoust Soc Am 1985; 77:202-208.
50. Makous JC, Middlebrooks JC. Two-dimensional sound localization by human listeners. J Acoust Soc Am 1990; 87:2188-2200.
51. Perrett S, Noble W. Available response choices affect localization of sound. Perception and Psychophysics 1995; 57:150-158.
52. Butler RA, Humanski RA, Musicant AD. Binaural and monaural localization of sound in two-dimensional space. Perception 1990; 19:241-256.
53. Butler RA, Humanski RA. Localization of sound in the vertical plane with and without high-frequency spectral cues. Perception and Psychophysics 1992; 51:182-186.
54. Oldfield SR, Parker SPA. Acuity of sound localization: a topography of auditory space. I. Normal hearing conditions. Perception 1984; 13:581-600.
55. Oldfield SR, Parker SPA. Acuity of sound localization: a topography of auditory space. II. Pinna cues absent. Perception 1984; 13:601-617.
56. Oldfield SR, Parker SPA. Acuity of sound localization: a topography of auditory space. III. Monaural hearing conditions. Perception 1986; 15:67-81.
57. Good MD, Gilkey RH. Auditory localization in noise. I. The effects of signal-to-noise ratio. J Acoust Soc Am 1994; 95:2896.
58. Good MD, Gilkey RH. Auditory localization in noise. II. The effects of masker location. J Acoust Soc Am 1994; 95:2896.
59. Wightman FL, Kistler DJ. The dominant role of low-frequency interaural time differences in sound localization. J Acoust Soc Am 1992; 91:1648-1661.
60. Kistler DJ, Wightman FL. A model of head-related transfer functions based on principal components analysis and minimum-phase reconstruction. J Acoust Soc Am 1992; 91:1637-1647.
61. Carlile S, Leong P, Hyams S et al. Distribution of errors in auditory localisation. Proceedings of the Australian Neuroscience Society 1996; (in press).
62. Fisher NI, Lewis T, Embleton BJJ. Statistical analysis of spherical data. Cambridge: Cambridge University Press, 1987.
63. Wenzel EM. Localization in virtual acoustic displays. Presence 1992; 1:80-105.
64. Carlile S, Pralong D. Validation of high-fidelity virtual auditory space. Br J Audiol, (in press).
65. Wightman F, Kistler D, Andersen K. Reassessment of the role of head movements in human sound localization. J Acoust Soc Am 1994; 95:3003-3004.
66. Wenzel E, Wightman F, Kistler D. Acoustic origins of individual differences in sound localization behavior. J Acoust Soc Am Suppl 1 1988; 84:S79.
67. Wenzel EM. Issues in the development of virtual acoustic environments. J Acoust Soc Am 1991; 92:2332.
68. Begault DR, Wenzel EM. Headphone localization of speech. Human Factors 1993; 35:361-376.
69. Wenzel EM, Arruda M, Kistler DJ et al. Localization using nonindividualized head-related transfer functions. J Acoust Soc Am 1993; 94:111-123.
70. Pralong D, Carlile S. The role of individualized headphone calibration for the generation of high fidelity virtual auditory space. Proc Australian Neurosci Soc 1995; 6:209.
71. Searle CL, Braida LD, Cuddy DR et al. Binaural pinna disparity: another auditory localization cue. J Acoust Soc Am 1975; 57:448-455.
72. Middlebrooks JC, Green DM. Sound localization by human listeners. Annu Rev Psychol 1991; 42:135-159.
73. Durlach N, Rigopulos A, Pang XD et al. On the externalization of auditory images. Presence 1992; 1:251-257.
