Objective and Subjective Evaluation of Head-Related Transfer Function Filter Design
Jyri Huopaniemi (1, 2), Nick Zacharov (3), and Matti Karjalainen (2)
(1) Center for Computer Research in Music and Acoustics (CCRMA), Department of Music, Stanford University, Stanford, CA 94305-8180, USA
(2) Laboratory of Acoustics and Audio Signal Processing, Helsinki University of Technology, P.O. Box 3000, FIN-02015 HUT, Finland
(3) Nokia Research Center, Speech and Audio Systems Laboratory, P.O. Box 100, FIN-33720 Tampere, Finland
[email protected],
[email protected],
[email protected] http://www-ccrma.stanford.edu/, http://www.acoustics.hut.fi/
(Invited paper to be presented at the 105th Audio Engineering Society Convention, San Francisco, Sept. 26-29, 1998)
Abstract In this paper, the problem of modeling head-related transfer functions (HRTFs) is addressed. Traditionally, HRTFs are approximated in real-time applications using minimum-phase reconstruction and various digital filter design techniques, yielding FIR or IIR structures. In this work, binaural auditory modeling has been applied to HRTF analysis, and filter design methods have been compared from the auditory perception point of view. This paper presents applicable perceptually valid filter design techniques and discusses listening test results for localization and timbre degradation.
1 Introduction The rapidly growing field of 3-D audio is dependent on accurate and correctly reproduced binaural hearing cues. These cues are embodied in head-related transfer functions (HRTFs) that describe the free-field transmission to a point in the human ear canal [1, 2]. This work focuses on the desired accuracy of HRTF representations in binaural synthesis.
Progress in research on filter design for 3-D audio is reported. Earlier results of this study were presented at the 102nd AES Convention in Munich in Spring 1997 by Huopaniemi and Karjalainen [3]. The methods presented in this study are in general applicable to many audio applications, not only ones concerning binaural technology. We have attempted to motivate the use of auditory resolution filter design and modeling techniques in a wide variety of applications (see, e.g., [4, 5] for details). In spatial hearing, it is well known that low-frequency interaural time delay (ITD) cues have a dominant role in localization, and mid- to high-frequency spectral cues enable elevation detection and out-of-head localization, especially in the median plane and on the cone of confusion. It is, however, not clear what the salience of these cues in localization is, and how accurately they should be modeled in binaural synthesis. In the literature, numerous articles can be found on HRTF approximations for real-time and non-real-time simulation. Most of the currently applied methods do not, however, take into account the non-uniform resolution of the auditory system. We have studied different HRTF approximation techniques and compared their perceptual relevance using a binaural auditory model. Based on these results, listening tests on the perception of virtual sources were conducted in headphone listening. The results show that high-frequency complexity of the HRTFs can be reduced according to psychoacoustical theory without affecting the perception. Discussion and considerations for required filter orders for binaural synthesis based on auditory modeling and listening test results are provided. This paper is organized as follows. In Section 2, an overview of digital filter modeling of HRTFs is presented with basic methodology and techniques for auditory filter design. A complete reference list to previous publications on the subject is included.
The hypothesis and problem formulation of the current study are given in Section 3. The HRTF properties and filter design techniques used in objective and subjective analysis are outlined. Topics in objective analysis of HRTFs are covered in Section 4. Binaural auditory modeling is used to derive error estimates for different filter designs. The use of a simplified spectral error criterion is also discussed. Results of the objective evaluation of filter designs under test are given. In Section 5, the subjective listening experiment and its results are presented. Finally, in Section 6, the results of objective and subjective experiments are summarized. Conclusions are drawn from the experiments, and suggestions are made for optimal reproduction of 3-D audio. Discussion and directions for future research are given.
2 Head-Related Transfer Function Modeling In this section, analysis and filter design methods applied to binaural HRTF modeling are overviewed. The task of approximating an ideal HRTF response $H(z)$ by a digital filter $\hat{H}(z)$ is studied. For any filter design, the goal is to minimize a complex-valued error function of the form:

$$E(e^{j\omega}) = H(e^{j\omega}) - \hat{H}(e^{j\omega}) \qquad (1)$$

A list representing research carried out in the field of HRTF approximation by various authors is given in Table 1. The filter orders have been collected from the publications and represent FIR filter orders (number of taps - 1) or IIR filter orders (number of poles or number of zeros). One can see from the table that the results from different studies vary
considerably from one to another. There are many causes for this. Some of the studies are purely theoretical, meaning that the results are formulated in the form of a spectral error measure or by visual inspection. In some of the references, the authors claim that a certain filter order appeared to be satisfactory in informal listening tests. These cases are marked in the table with a question mark. There have been very few formal listening tests in this field that give statistically reliable results [6, 3, 7, 8]. Another question is the validity of the HRTF data used in the studies. Whether the data were equalized for free-field conditions or a certain headphone type, whether dummy-head or individual/non-individual real-head data were used, and whether minimum-phase reconstruction was applied: all these aspects may cause the large deviation seen in the results of the studies (Table 1). Let us look at the problem of finding an optimal filter approximation to a given HRTF. A common way to quantify errors in digital filter design is to use the $L_p$ norms [9, 10]. An $L_p$ error norm of a frequency-domain representation of an arbitrary filter $H(e^{j\omega})$ and its approximation $\hat{H}(e^{j\omega})$ is defined by the following equation:

$$\|E\|_p = \|H(e^{j\omega}) - \hat{H}(e^{j\omega})\|_p \triangleq \left( \frac{1}{2\pi} \int_{-\pi}^{\pi} \left| H(e^{j\omega}) - \hat{H}(e^{j\omega}) \right|^p d\omega \right)^{1/p}, \quad p \geq 1 \qquad (2)$$
2.1 Introduction to the Filter Design Problem
2.1.1 Least-squares Norms The $L_2$ norm is generally known as the least squares (integrated squared magnitude) norm. It is particularly appealing because, by Parseval's relation, this norm can be used directly in both the frequency and the time domains [10, 11]:

$$E_2 = \|H(e^{j\omega}) - \hat{H}(e^{j\omega})\|_2^2 \triangleq \frac{1}{2\pi} \int_{-\pi}^{\pi} \left| H(e^{j\omega}) - \hat{H}(e^{j\omega}) \right|^2 d\omega = \sum_{n=0}^{\infty} \left| h(n) - \hat{h}(n) \right|^2 \triangleq \|h(n) - \hat{h}(n)\|_2^2 \qquad (3)$$
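The equivalence stated in Eq. (3) can be checked numerically on finite-length responses. The following is a minimal sketch, assuming a uniform DFT grid in place of the integral; the two coefficient sets are hypothetical and only serve as illustration.

```python
import cmath
import math

def dft(x):
    """Direct O(N^2) discrete Fourier transform of a real or complex sequence."""
    n = len(x)
    return [sum(x[m] * cmath.exp(-2j * math.pi * k * m / n) for m in range(n))
            for k in range(n)]

def l2_error_time(h, h_hat):
    """Sum of squared coefficient differences (time-domain side of Eq. (3))."""
    return sum(abs(a - b) ** 2 for a, b in zip(h, h_hat))

def l2_error_freq(h, h_hat):
    """Mean squared spectral difference over the DFT grid (frequency-domain side)."""
    H, H_hat = dft(h), dft(h_hat)
    return sum(abs(a - b) ** 2 for a, b in zip(H, H_hat)) / len(h)

# Hypothetical 8-tap response and a cruder approximation, zero-padded to equal length.
h = [1.0, 0.6, -0.3, 0.2, 0.1, -0.05, 0.02, 0.01]
h_hat = [1.0, 0.5, -0.25, 0.0, 0.0, 0.0, 0.0, 0.0]

e_time = l2_error_time(h, h_hat)
e_freq = l2_error_freq(h, h_hat)
print(e_time, e_freq)  # the two values agree up to rounding
```

For finite-length sequences the DFT form of Parseval's relation is exact, so the agreement does not depend on the grid density as long as the DFT length covers both responses.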
If we further employ a positive weighting function $W(e^{j\omega})$, we can write a general weighted least squares (WLS) complex approximation problem in the frequency domain as:

$$E_w \triangleq \frac{1}{2\pi} \int_{-\pi}^{\pi} W(e^{j\omega}) \left| H(e^{j\omega}) - \hat{H}(e^{j\omega}) \right|^2 d\omega \qquad (4)$$

This property is desirable since we are able to incorporate weighting functions into the filter design and optimization in order to model, e.g., the auditory frequency resolution. Another advantage of the least squares formulation is that since the error term is quadratic, there is a global minimum. Popular methods using least-squares error norm minimization are the equation-error method, Prony's method, and the Yule-Walker methods [10].
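2.1.2 Chebyshev Norm The $L_\infty$ norm is often referred to as the Chebyshev norm. This norm, obtained as $p$ approaches infinity, aims to minimize the maximum component of the error term $E$ (hence the term minmax):

$$\|E\|_\infty = \|H(e^{j\omega}) - \hat{H}(e^{j\omega})\|_\infty \triangleq \max_{-\pi \leq \omega < \pi} \left| H(e^{j\omega}) - \hat{H}(e^{j\omega}) \right| \qquad (5)$$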
The Chebyshev norm minimization may be a suitable choice for modeling HRTFs, since the magnitude responses typically exhibit prominent peaks and valleys that are important for localization. Since the minmax approach seeks to minimize the maximum error, which is often found at the transitions in the response, we would expect good performance in HRTF modeling. Another property of human hearing, the logarithmic amplitude resolution, can also be incorporated in a Chebyshev norm minimization algorithm (a weighted log-magnitude spectral error [10]). A drawback of the Chebyshev norm optimization is, however, that the error surface may not always be convex and the result can be unstable and/or only locally optimal. Well-known methods for Chebyshev norm optimization are, e.g., the Remez multiple exchange algorithm and the Parks-McClellan algorithm.
2.1.3 Hankel Norm Hankel-norm-based methods are attractive for HRTF modeling since they provide a general and stable solution to complex frequency response modeling. The Hankel norm lies between the $L_2$ and $L_\infty$ norms and is quantified by the spectral norm of the Hankel matrix of a given response. The most straightforward technique for obtaining an optimal Hankel norm approximation is to find the Hankel singular values of a Hankel matrix (by singular value decomposition) and select a desired number of the singular values to construct the final filter (see, e.g., [10, 12] for discussion).
2.1.4 Methods of Auditory Weighting From the psychoacoustical point of view, the linear frequency scale used in linear transforms such as the Fourier transform is not optimal. In psychoacoustics it has been shown experimentally that scales such as the Bark scale (or the critical-band rate scale) [13] or the ERB (Equivalent Rectangular Bandwidth) rate scale [14] match closely the properties of human hearing. Moreover, presently the ERB scale is believed to be theoretically better motivated than the Bark scale [15]. To a first approximation, a logarithmic scale has also been found reasonable in modeling of the auditory frequency resolution. There are different methods of incorporating the non-uniform resolution into a filter design. We have divided the approaches into three categories: 1) auditory smoothing of the responses prior to filter design, 2) use of a weighting function in filter design, and 3) warping of the frequency axis prior to or during the filter design.

2.1.5 Auditory Smoothing In many cases, the data (in this case the HRTF frequency- or time-domain responses) may be pre-processed using a smoothing technique prior to the filter design. The most
straightforward way to accomplish a frequency-dependent magnitude response smoothing is to use moving averaging of variable window size [10, 16]:

$$|H_S(f)| = \sqrt{ \frac{1}{f_1 - f_0} \int_{f_0}^{f_1} |H(f')|^2 \, df' } \qquad (6)$$

where $f_1 - f_0$ is the window width at frequency $f$. Furthermore, if we define $f_0 = f/\sqrt{K}$ and $f_1 = f\sqrt{K}$, it follows that for third-octave width, $K = 5/4$ and for octave width, $K = 2$ [16]. The window size as a function of frequency can also be derived using auditory criteria, presented below.
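The smoothing of Eq. (6) can be sketched on a sampled magnitude response as follows. This is a minimal illustration, assuming a uniform frequency grid so that the integral may be replaced by a mean over the grid points inside the window; the function name and the notched test response are hypothetical.

```python
import math

def smooth_magnitude(freqs, mags, k):
    """Variable-window RMS smoothing of a sampled magnitude response.

    For each frequency f the window spans [f / sqrt(k), f * sqrt(k)], and the
    integral of Eq. (6) is replaced by a plain mean of |H|^2 over the grid
    points inside the window (an assumption valid for a uniform grid).
    """
    root_k = math.sqrt(k)
    smoothed = []
    for f in freqs:
        lo, hi = f / root_k, f * root_k
        inside = [m * m for g, m in zip(freqs, mags) if lo <= g <= hi]
        smoothed.append(math.sqrt(sum(inside) / len(inside)))
    return smoothed

# Hypothetical test response: flat magnitude with a narrow notch near 8 kHz.
freqs = [100.0 * i for i in range(1, 201)]          # 100 Hz ... 20 kHz
mags = [0.25 if abs(f - 8000.0) < 150.0 else 1.0 for f in freqs]

# Third-octave smoothing, K = 5/4 as given in the text.
third_octave = smooth_magnitude(freqs, mags, 5.0 / 4.0)
```

Because the window width grows in proportion to frequency, a notch of fixed width in Hz is flattened more strongly at high frequencies than the same notch would be at low frequencies.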
2.1.6 Auditory Weighting An alternative to smoothing techniques is to use weighting functions in the filter design that approximate the human auditory resolution. Approximation formulas for the psychoacoustic scales are the following. The ERB scale bandwidth as a function of center frequency is given by the following equation [14]:

$$\Delta f_{CE} = 24.7\,(4.37 f_c + 1) \qquad (7)$$

Similarly, the Bark scale resolution (critical band) can be approximated by the equation [13]:

$$\Delta f_{CB} = 25 + 75 \left[ 1 + 1.4 \left( \frac{f_c}{1000} \right)^2 \right]^{0.69} \qquad (8)$$

Both bandwidths are in Hz. In the usual form of these formulas, the center frequency $f_c$ is given in kHz in Eq. (7) and in Hz in Eq. (8).
Approximation formulas presented in Eqs. (7) and (8) have been used to calculate the weighting functions shown in Fig. 2. The weight of each frequency point is the inverse of the bandwidth calculated with Eqs. (7) and (8). The weight functions were normalized with the corresponding RMS values and scaled to a minimum weight of one (Bark weighting at fc = 24 kHz). As we can see, the ERB scale weighting function focuses more at low and high frequencies (below 300 and above 9000 Hz) than the corresponding Bark function, but for center frequencies (100 - 5000 Hz) the ERB scale assigns less weight than the Bark scale.
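The weighting functions derived from Eqs. (7) and (8) can be sketched as follows. This is a simplified illustration: the weights are the inverse bandwidths scaled to a minimum weight of one, while the RMS normalization mentioned in the text is omitted. The function names and the frequency grid are assumptions made for this example.

```python
def erb_bandwidth_hz(fc_hz):
    """ERB bandwidth of Eq. (7); the formula takes the center frequency in kHz."""
    return 24.7 * (4.37 * fc_hz / 1000.0 + 1.0)

def bark_bandwidth_hz(fc_hz):
    """Critical bandwidth of Eq. (8), center frequency in Hz."""
    return 25.0 + 75.0 * (1.0 + 1.4 * (fc_hz / 1000.0) ** 2) ** 0.69

def auditory_weights(freqs_hz, bandwidth):
    """Inverse-bandwidth weights, scaled so that the smallest weight equals one."""
    raw = [1.0 / bandwidth(f) for f in freqs_hz]
    floor = min(raw)
    return [w / floor for w in raw]

# Hypothetical frequency grid, chosen only for illustration.
freqs_hz = [100.0, 300.0, 1000.0, 5000.0, 9000.0, 24000.0]
erb_w = auditory_weights(freqs_hz, erb_bandwidth_hz)
bark_w = auditory_weights(freqs_hz, bark_bandwidth_hz)

for f, we, wb in zip(freqs_hz, erb_w, bark_w):
    print(f"{f:8.0f} Hz  ERB weight {we:8.2f}  Bark weight {wb:8.2f}")
```

Weights of this kind could be used as the function $W(e^{j\omega})$ in a weighted least-squares design, so that errors in narrow auditory bands count more than errors in wide ones.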
2.1.7 Frequency Warping A third technique to incorporate non-uniform frequency resolution into filter design is to modify the actual frequency scale of the design. Frequency scale warping is in principle applicable to any design or estimation technique. The most popular warping method is to use the bilinear conformal mapping (see, e.g., [17, 10, 18, 5] for more details). The bilinear warping is realized by substituting unit delays with first-order allpass sections:

$$\tilde{z}^{-1} = D_1(z) = \frac{z^{-1} - \lambda}{1 - \lambda z^{-1}} \qquad (9)$$

where $\lambda$ is the warping coefficient. This implies that the frequency-warped version of a filter can be implemented by such a simple replacement technique. It is easy to show that the inverse warping can be achieved with a similar substitution but using $-\lambda$ instead of $\lambda$ (this was used in [10, 19]). This approach is, however, limited to low filter orders, where computational accuracy does not introduce problems. Another procedure is to incorporate the unwarping procedure in the filter implementation, using warped FIR and IIR filter (WFIR, WIIR) structures [5, 3]. In the proposed WFIR and WIIR structures, the unit delays of, e.g., a direct form II filter implementation are replaced by first-order allpass filters. In the case of WIIR filters, remapping of the coefficients is needed to yield realizable implementations [5]. The usefulness of warping by conformal mapping comes from the fact that, given a target transfer function $H(z)$, we may find a lower order warped filter $H_w(D_1(z))$ that is a good approximation of $H(z)$ in an auditory sense. $H_w(D_1(z))$ is then designed in a warped frequency domain so that using allpass delays instead of unit delays maps the design to a desired transfer function in the ordinary frequency domain. For an appropriate value of $\lambda$, the bilinear warping can fit the psychoacoustic Bark scale, based on the critical band concept [13], surprisingly accurately [18]. An approximate formula for the optimum value of $\lambda$ as a function of sampling rate is given in [18]. For a sampling rate of $f_s = 48$ kHz this yields $\lambda = 0.7313$ and for $f_s = 32$ kHz $\lambda = 0.6865$.
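The effect of the allpass substitution on the frequency axis can be illustrated with a short sketch. On the unit circle the allpass section of Eq. (9) has unit magnitude, and its negative phase gives the warped angular frequency; the closed-form expression below is the standard phase function of a first-order allpass. The function names are hypothetical, and the 48 kHz value of the warping coefficient is the one quoted in the text.

```python
import cmath
import math

def allpass(z, lam):
    """First-order allpass section D1(z) of Eq. (9)."""
    z_inv = 1.0 / z
    return (z_inv - lam) / (1.0 - lam * z_inv)

def warped_frequency(omega, lam):
    """Warped angular frequency: the negative phase of D1 on the unit circle."""
    return omega + 2.0 * math.atan2(lam * math.sin(omega), 1.0 - lam * math.cos(omega))

lam = 0.7313                    # value quoted in the text for fs = 48 kHz
fs = 48000.0
for f_hz in (500.0, 2000.0, 8000.0, 16000.0):
    omega = 2.0 * math.pi * f_hz / fs
    nu = warped_frequency(omega, lam)
    print(f"{f_hz:7.0f} Hz -> {nu * fs / (2.0 * math.pi):8.1f} Hz on the warped axis")
```

For a positive warping coefficient the low-frequency region occupies a larger share of the warped axis, which is why a filter designed on that axis spends more of its order on low-frequency detail.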
2.2 Head-Related Transfer Function Models
In this section, we concentrate on modeling aspects of the HRTFs. There are many approaches to HRTF modeling: looking at the physical properties of the outer ear and creating time-domain models [20], creating functional databases from measurements or models [21], or deriving synthetic HRTFs from measured data using different filter design techniques. In this work, we concentrate on this third approach. An attractive property of HRTFs is that they are nearly of minimum phase [22]. The excess phase, i.e., the difference between the original phase response and its minimum-phase counterpart, has been found to be approximately linear. This suggests that the excess phase can be separately implemented as an allpass filter or, more conveniently, as a pure delay. In the case of binaural synthesis, the ITD part of the two HRTFs may then be modeled as a separate delay line, and minimum-phase HRTFs may be used for synthesis. This traditional method is depicted in Fig. 1. Further attractions of minimum-phase systems in binaural simulation are: 1) the energy of the impulse response is optimally concentrated in the beginning, allowing for the shortest filter lengths for a specific amplitude response, and 2) due to the previous property, minimum-phase filters are better-behaved in dynamic interpolation. Several independent publications have stated that minimum-phase reconstruction does not have any perceptual consequences [21, 23, 24]. This information is crucial in the design and implementation of digital filters for 3-D sound. The delay error due to rounding of the ITD to the nearest unit-delay multiple can be avoided using fractional delay (FD) filtering (see [25] for a comprehensive review on the design of FD filters).
2.2.1 Finite Impulse Response Methods The most straightforward ways to approximate HRTF measurements are to use the windowing FIR filter design method or a direct frequency-sampling design technique [11]. If $H(e^{j\omega})$ is the desired frequency response, the direct frequency sampling method is carried out by sampling $N$ points in the frequency domain and computing the inverse discrete Fourier transform (normally using FFT):

$$h(n) = \frac{1}{N} \sum_{k=0}^{N-1} H(k)\, e^{j(2\pi/N)nk} \qquad (10)$$
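Eq. (10) can be written out directly as a short sketch. The direct summation below is used for clarity only (in practice an FFT would be used, as the text notes); the one-pole target response and the function names are hypothetical.

```python
import cmath
import math

def frequency_sampling_fir(samples):
    """Inverse DFT of N uniformly spaced frequency samples H(k), as in Eq. (10)."""
    n_pts = len(samples)
    return [sum(samples[k] * cmath.exp(2j * math.pi * n * k / n_pts)
                for k in range(n_pts)) / n_pts
            for n in range(n_pts)]

def frequency_response(h, omega):
    """Evaluate H(e^{j omega}) of an FIR filter with coefficients h."""
    return sum(c * cmath.exp(-1j * omega * n) for n, c in enumerate(h))

# Hypothetical target: a one-pole lowpass response sampled at N = 16 points.
N = 16
target = [1.0 / (1.0 - 0.5 * cmath.exp(-2j * math.pi * k / N)) for k in range(N)]
h = frequency_sampling_fir(target)
```

The resulting FIR filter matches the target exactly at the N sampling points. Between them the response is only interpolated, and the impulse response is a time-aliased version of the ideal one.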
A time-domain equivalent to uniform frequency sampling is windowing (truncating) an impulse response $h(n)$ with a rectangular window function $w(n)$. Generally, the windowing method can be presented as a multiplication of the desired impulse response with the window function $w(n)$:

$$\hat{h}(n) = h(n)\,w(n) \qquad (11)$$

The rectangular windowing method is known to minimize a truncated time-domain $L_2$ norm [26]. The effect of different window functions in HRTF filter design has been discussed by Sandvad and Hammershøi [6]. They concluded that although rectangular windowing provokes the Gibbs phenomenon, seen as ripple around amplitude response discontinuities, it is still favorable when compared to, e.g., the Hamming window. Another and perhaps more intuitive conclusion is based on the fact that HRTFs do not contain discontinuities but broad fluctuations in the magnitude response, which can be modeled effectively using least-squares fitting. A severe limitation of the windowing method is, however, the lack of a weighting function that could model the non-uniform frequency resolution of the ear. For an extended frequency-sampling method, yielding an LS solution, we can, however, introduce non-uniform frequency sampling and weighting [11]. Kulkarni and Colburn [27, 7] have proposed the use of a weighted least squares technique based on log-magnitude error minimization for finite impulse response HRTF filter design. They claim that an FIR filter of order 64 is capable of retaining most of the spatial information. In a method proposed by Hartung and Raab [28], binaural filters were optimized using auditory criteria, approximating the sensitivity of the human ear with a logarithmic magnitude weighting function. Non-uniform sampling of the frequency grid was applied in order to achieve auditory resolution. Optimization was then carried out to yield the following results: an FIR filter of order 48 "revealed no significant differences in localization performance... Minor divergence was noted with the 32nd-order FIR filter" [28]. Issues in FIR filter design using auditory criteria have been discussed by Wu and Putnam [29]. They derived a perceptual spectral distance measure using a simplified auditory model. The technique was applied to HRTF-like magnitude responses and as a result FIR approximations of order 20 were successfully calculated. Huopaniemi and Karjalainen [3] studied the performance of a windowing FIR filter design method in localization tests using non-individualized HRTFs. The conclusion was that 75% of the population heard no difference between the reference and an FIR approximation of order 40.
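The windowing design of Eq. (11), and the statement that rectangular truncation minimizes the time-domain squared error, can be illustrated with a small sketch. The decaying test response is hypothetical, and the tapered alternative is a one-sided (half) Hamming window, which is an assumption of this example and not necessarily the window form used in [6].

```python
import math

def window_design(h, length, window="rect"):
    """Truncate h to `length` taps and apply the chosen window, as in Eq. (11)."""
    if window == "rect":
        w = [1.0] * length
    elif window == "half_hamming":
        # One-sided Hamming taper: 1.0 at n = 0, falling to 0.08 at the last tap.
        w = [0.54 + 0.46 * math.cos(math.pi * n / (length - 1)) for n in range(length)]
    else:
        raise ValueError("unknown window: " + window)
    # Zero-pad so the result can be compared with the reference sample by sample.
    return [h[n] * w[n] for n in range(length)] + [0.0] * (len(h) - length)

def squared_error(h, h_hat):
    """Time-domain squared error between a response and its approximation."""
    return sum((a - b) ** 2 for a, b in zip(h, h_hat))

# Hypothetical reference response: a decaying oscillation, 64 samples long.
h_ref = [0.8 ** n * math.cos(0.9 * n) for n in range(64)]
err_rect = squared_error(h_ref, window_design(h_ref, 16, "rect"))
err_taper = squared_error(h_ref, window_design(h_ref, 16, "half_hamming"))
print(err_rect, err_taper)
```

With the rectangular window the error consists only of the discarded tail, whereas any taper adds the error of the attenuated retained samples. A taper therefore cannot lower this particular error measure, although it changes how the error is distributed over frequency.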
2.2.2 Infinite Impulse Response Methods The earliest HRTF filter design experiments using pole-zero models were carried out by Kendall [30, 31]. A comparison of FIR and IIR filter design methods was presented in [6, 32]. The non-minimum-phase FIR filters based on individual HRTF measurements were designed using rectangular windowing. The IIR filters were generated using a modified Yule-Walker algorithm that performs least-squares magnitude response error minimization. The low-order fit was enhanced a posteriori by applying a weighting function and discarding selected pole-zero pairs at high frequencies. Listening tests showed that an FIR of order 72, equivalent to a 1.5 ms impulse response, was capable of retaining all of the desired localization information, whereas an IIR filter of order 48 was needed for the same localization accuracy. In the research carried out by [33], the error criteria in the auto-regressive moving average (ARMA) filter design were based on log-magnitude spectrum differences rather than magnitude or magnitude-squared spectrum differences. Furthermore, a new approximation for the log-magnitude error minimization was defined. The theoretical study concluded that it was possible to design low-order HRTF approximations (the given example used 14 poles and zeros) using the proposed method. In a recent study [34], the results have been generalized and they concluded that pole-zero models of order 40 were needed for accurate modeling of HRTFs. They also stated that a least squares (LS) algorithm was inferior in comparison to using a logarithmic error measure. Asano et al. [35] have investigated sound localization in the median plane. They derived IIR models of different orders (equal number of poles and zeros) from individual HRTF data. When compared to a reference, a 40th-order pole-zero approximation yielded good results in the localization tests with the exception of increased front-back confusions in frontal incident angles. Huopaniemi and Karjalainen [3] studied the performance of IIR and WIIR filter design methods in localization tests using non-individualized HRTFs. The conclusions were that an auditorily motivated WIIR design outperforms a traditional IIR design method. Of the population, 75% heard no difference between the reference and a WIIR approximation of order 20. Other IIR approximation models for HRTFs have been presented by [36, 37, 28, 38, 8].
2.2.3 Balanced Model Truncation An attractive technique for HRTF modeling using Hankel norm minimization has been proposed in [12]. By using balanced model truncation (BMT) it is possible to approximate HRTF magnitude and phase response with low-order IIR filters. A complex HRTF system transfer function is written as a state-space difference function, which is then represented in balanced matrix form. A truncated state-space realization $F_m(z)$ of order $n$ can be found with a similarity to the original system $F(z)$ which is approximately quantified by the Hankel norm:

$$\|F(z) - F_m(z)\|_H = 2\,\mathrm{trace}(\Sigma_2) \qquad (12)$$

where

$$\Sigma = \begin{bmatrix} \Sigma_1 & 0 \\ 0 & \Sigma_2 \end{bmatrix} \qquad (13)$$

where $\Sigma_2 = \mathrm{diag}(\sigma_1, \ldots, \sigma_k)$ are the Hankel singular values (HSVs) of the rejected system after truncation and $\Sigma_1 = \mathrm{diag}(\sigma_{k+1}, \ldots, \sigma_n)$ are the HSVs of the truncated system. In the experiments, minimum-phase diffuse-field equalized auditory smoothed HRTFs (based on KEMAR measurements by [39]) were modeled by 10th-order IIR filters created using the BMT technique. The signal-to-error power ratios (SER) were compared to IIR models designed using Prony's method [11] and the Yule-Walker method [40]. The average SER was found to be approximately 10 dB better in BMT models.
2.2.4 Implementation Issues Implementation issues of binaural filters require knowledge of what the system will feature. When comparing different filter design and implementation strategies, one should pay attention particularly to the following viewpoints: Is the system dynamic, i.e., do we need HRTF interpolation and commutation? Are we using minimum-phase HRTF approximations? Are we using specialized hardware (signal processors) for implementation? Are there memory constraints for storing HRTF data? Is the HRTF processing a part of a room acoustics modeling scheme? In this paper, the goal is to present and compare methods for HRTF filter design, thus we only outline solutions to the above questions. In many dynamic virtual acoustic environments, minimum-phase FIR approximations have been chosen for HRTF implementation due to straightforward interpolation, relatively good spectral performance, and simplicity of implementation. Furthermore, this approach may easily be integrated into real-time geometric room acoustics modeling schemes, such as the image-source method [41]. One of the drawbacks is, however, the large memory requirements that this approach causes. A PCA-based method may be attractive when conducting HRTF spatialization, since as few as 5 principal component basis functions have been found to model human HRTFs (or DTFs, directional transfer functions) with good accuracy [21]. The following benchmarks have been calculated for a Texas Instruments TMS320C3x floating-point signal processor, but are practically similar in other processors as well. FIR implementation is efficient (N+3 instructions for N taps), and dynamic coefficient interpolation is possible. Designs are usually straightforward (e.g., frequency sampling), but give limited performance especially at low orders. IIR implementations are generally slower (2N+3 instructions for order N in direct form II implementation) if dynamic synthesis is required. Interpolation and commutation methods such as cross-fading and different transient elimination techniques increase computation. Pole-zero models are suited for arbitrary-shaped magnitude-response designs, thus low-order designs are possible. The efficiency of warped vs. non-warped filters depends on the processor that is used. For Motorola DSP56000 series signal processors a WFIR takes three instructions per tap instead of one for an FIR. For WIIR filters four instructions are needed instead of two for an IIR filter. In custom design chips the warped structures may be optimized so that the overhead due to complexity can be minimized. The warped structures may also be expanded (unwarped) into direct form filters, which will lead to the same computational demands as with normal IIR filters.
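The instruction counts quoted above for the TMS320C3x can be turned into a small cost comparison. The sketch below only restates the two formulas from the text; applying them to the FIR and IIR filter orders examined later in this paper is an illustration added here, and the function names are hypothetical.

```python
def fir_instructions(order):
    """N + 3 instructions for N taps; an FIR filter of a given order has order + 1 taps."""
    return (order + 1) + 3

def iir_instructions(order):
    """2N + 3 instructions for a direct form II IIR filter of order N."""
    return 2 * order + 3

# FIR and IIR orders taken from the test conditions described in Section 3.
for fir_order, iir_order in zip((96, 64, 48, 32, 16), (48, 32, 24, 16, 8)):
    print(f"FIR order {fir_order:3d}: {fir_instructions(fir_order):3d} instructions   "
          f"IIR order {iir_order:3d}: {iir_instructions(iir_order):3d} instructions")
```

Under these formulas an FIR filter of order 2N and an IIR filter of order N cost nearly the same, so the paired orders can be read as a comparison at approximately equal computational load. Interpolation and transient elimination costs are not included.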
3 Problem Formulation for HRTF Analysis In this section, we describe the HRTF data and the methods of filter design used in the objective and subjective analysis. The task was to investigate three different filter design approaches, and the goal was to find methods and criteria for subjective and objective evaluation. We concentrated on finding answers to two problems often found in HRTF filter design: What is the needed filter length for perceptually relevant HRTF synthesis? Can we introduce an "auditory resolution" in the filter design, whereby the spectral and interaural phase details are modeled more accurately at low frequencies and considerably smoothed at high frequencies? A set of 10 human test subjects was chosen for the experiment. A blocked ear canal HRTF measurement technique [2] was used to obtain the needed transfer functions for the experiments (the measurement setup is discussed in greater detail in [42]). An Audax AT100M0 4-inch transducer element in a plastic 190 mm enclosure was used as a sound source for the measurements. The test subjects were seated in an anechoic chamber on a rotating measurement chair. Sennheiser KE211-4 miniature microphones were used. HRTF and headphone responses (for the headphone type used, Sennheiser HD580) were measured for each test person to enable full individual HRTF simulation. Pseudorandom noise was used as the measurement sequence. For the objective and subjective tests, four azimuth angles were chosen: 0°, 40°, 90°, and 210°. These incident angles represent the median plane, frontal plane, and horizontal plane responses. The minimum-phase reconstruction was carried out using windowing in the cepstral domain (as implemented in the Matlab Signal Processing Toolbox rceps.m function [43]). The minimum+excess phase approximation method [19] was used to find the ITD for each incident angle. The ITD was inserted in HRTF synthesis as a delay line.
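Minimum-phase reconstruction by windowing in the cepstral domain can be sketched as follows. This is a minimal illustration of the general technique (folding the real cepstrum onto its causal half), not a transcription of the toolbox routine named above. It uses a direct DFT for self-containment and assumes that the magnitude response has no zeros on the DFT grid; the two-tap test response is hypothetical.

```python
import cmath
import math

def dft(x, sign=-1):
    """Direct O(N^2) DFT (sign=-1) or unscaled inverse DFT (sign=+1)."""
    n = len(x)
    return [sum(x[m] * cmath.exp(sign * 2j * math.pi * k * m / n) for m in range(n))
            for k in range(n)]

def minimum_phase(h, n_fft=64):
    """Minimum-phase counterpart of h via folding of the real cepstrum (n_fft even)."""
    padded = list(h) + [0.0] * (n_fft - len(h))
    spectrum = dft(padded)
    # Real cepstrum: inverse DFT of the log magnitude (assumes no spectral zeros).
    log_mag = [math.log(abs(v)) for v in spectrum]
    cep = [v.real / n_fft for v in dft(log_mag, sign=+1)]
    # Cepstral window: keep n = 0 and n = N/2, double 0 < n < N/2, zero the rest.
    half = n_fft // 2
    folded = [cep[0]] + [2.0 * c for c in cep[1:half]] + [cep[half]] + [0.0] * (half - 1)
    min_spectrum = [cmath.exp(v) for v in dft(folded)]
    return [v.real / n_fft for v in dft(min_spectrum, sign=+1)]

# Hypothetical maximum-phase input: its zero lies outside the unit circle (z = -2).
h_max = [0.5, 1.0]
h_min = minimum_phase(h_max)
print([round(v, 6) for v in h_min[:4]])
```

The reconstructed response has the same magnitude spectrum as the input, but its energy is moved toward the first samples. The delay removed in this way is what a separate ITD delay line reinstates in the synthesis.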
We used three different minimum-phase HRTF approximations: windowed FIR design (rectangular window), time-domain IIR design (Prony's method [11], as implemented in the Matlab Signal Processing Toolbox [43]), and a warped IIR design (warped Prony's method, warping coefficient $\lambda = 0.65$). The reference filter order for each incident angle was chosen to be 256 (257 FIR taps). The tested filter orders were: FIR: 96, 64, 48, 32, 16; IIR: 48, 32, 24, 16, 8; WIIR: 48, 32, 24, 16, 8. As an example, in Fig. 3 magnitude responses of original and filter approximations from one test person's HRTFs at 40° azimuth (0° elevation) are depicted. It can be seen that a WIIR design (using Prony's method) has an improved low-frequency fit (up to approximately 9 kHz) when compared to a uniform frequency resolution IIR design of equivalent order. The better fit at low frequencies when comparing WIIR approximation to windowed FIR can also be observed both in the amplitude response plots and the ITD plot. The warping value of $\lambda = 0.65$ was used, which is slightly lower than for approximate Bark-scale warping. This value was chosen by visual inspection of the magnitude responses to enhance low-frequency fit, while still retaining the overall high-frequency magnitude envelope.
4 Objective Analysis of Head-Related Transfer Functions During the course of this project, it became evident that apart from subjective listening test results we would require an objective measure of HRTF filter design quality. Furthermore, this error measure should be based on auditory criteria. Two such techniques were studied and will be presented in the following.
4.1 Binaural Auditory Model for HRTF Filter Performance Evaluation
A novel idea of using a binaural auditory model for virtual source quality estimation has been proposed by Pulkki et al. [44]. Similar techniques have been presented by, e.g., [45]. The basis of the auditory model applied in this study lies in the coincidence and cross-correlation principles introduced by Jeffress [46]. The early model has been further extended by several authors (e.g. Lindemann [47] and Gaik [48]). A schematic of the binaural auditory model used in this study is depicted in Fig. 4 (slightly modified from [44]). The model consists of the following steps. A pink noise sample convolved with the HRTFs under study is used as the excitation. Pink noise yields a spectrally balanced excitation to the auditory system lacking major temporal attacks, thus the influence of the precedence effect is minimized. The HRTF filtered pink noise samples are passed through a gammatone filterbank (GTFB) [49] of 32 bandpass ERB (equivalent rectangular bandwidth) channels. Half-wave rectification and low-pass filtering (cutoff frequency at 1 kHz) are used to simulate the hair cells and the auditory nerve behavior in each bandpass channel. For the ITD model, the interaural cross-correlation function (IACC) is calculated for each bandpass channel pair (see Fig. 4). The ITD as a function of ERB channel is estimated by calculating the time delay corresponding to the position of the maximum in each bandpass IACC curve. Loudness estimates ($L$ in sones) for the left and right ERB bandpass channels are calculated using the equation $L = \sqrt[4]{\langle H^2 \rangle}$ (an approximation of Zwicker's formula, where he used an exponent of 0.23 [13]), where $\langle H^2 \rangle$ is the average power of the ERB channel output. The loudness levels $LL$ (in phons) for each ERB channel are computed using the formula $LL = 40 + 10 \log_2 L$, resulting in a loudness level spectrum for left and right ear signals.
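The loudness stage of the model can be sketched in a few lines. The functions below implement only the two formulas quoted in the text, applied to the output samples of a single ERB channel; the gammatone filterbank, rectification, and low-pass stages are not reproduced. The function names and the constant test signals are hypothetical.

```python
import math

def loudness_sones(channel_output):
    """Fourth root of the average power of one ERB channel output."""
    mean_power = sum(v * v for v in channel_output) / len(channel_output)
    return mean_power ** 0.25

def loudness_level_phons(loudness):
    """Loudness level from loudness: 40 phons at 1 sone, +10 phons per doubling."""
    return 40.0 + 10.0 * math.log2(loudness)

# Hypothetical channel outputs: constant signals with average power 1 and 16.
unit_power = [1.0] * 100
higher_power = [4.0] * 100

for name, sig in (("unit power", unit_power), ("16x power", higher_power)):
    sones = loudness_sones(sig)
    print(name, sones, loudness_level_phons(sones))
```

Since the loudness is the fourth root of the power, a 16-fold increase in channel power doubles the loudness in sones and raises the loudness level by 10 phons.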
4.1.1 Error criteria for binaural auditory model In order to be able to compare the binaural auditory model outputs for different filter designs, a suitable error measure had to be considered. We computed the root-mean-square (RMS) error to compare the binaural auditory model outputs of different filter designs (and filter orders) with a reference response. In the calculation of the distance measures, we observed the basic phenomena found in human sound localization [1]: 1) the spectral and interaural level difference (ILD) cues dominate localization at frequencies above approximately 1.5 kHz, but may also contribute to localization at lower frequencies, and 2) the interaural time differences (ITD) are the dominant localization cues at frequencies below approximately 1.5 kHz. Based on these assumptions, two quality measures were derived: 1) the perceived loudness level spectrum error, and 2) the perceived loudness level spectrum + ITD modeling error.
The modeling errors were calculated as an RMS difference between the outputs from the auditory models (loudness level spectra or ITDs) of the reference and the filter approximation over a passband from $f_l$ to $f_h$. In the results presented below the following limits were chosen: $f_l = 1.5$ kHz and $f_h = 16$ kHz. The loudness level spectrum error was calculated by summing the left and right ear loudness errors (see footnote 1). The loudness level spectrum + ITD modeling error was calculated by scaling and summing the low-frequency ($f < 1.5$ kHz) ITD modeling error with the high-frequency loudness level spectrum modeling error. As the role of low-frequency spectral cues in sound source localization is ambiguous, a choice was made to exclude the loudness level error at $f < 1.5$ kHz from the RMS error measures. The results of using a binaural auditory model in HRTF filter design quality estimation are depicted in Figs. 5-7. The loudness level spectrum estimates of HRTFs for person 3 at 40° azimuth are shown in Figs. 5-6 and the perceived ITD of the corresponding filter approximations is plotted in Fig. 7 (note that the loudness level spectrum estimates shown in Fig. 6 correspond to the magnitude responses shown in Fig. 3). The dashed line in each subplot is the reference response, and the solid line is the current approximation result. The number of filter coefficients for the row of plots is shown to the right of the figures. Furthermore, an RMS error estimate of the corresponding design is shown in Figs. 5-6 in each of the subplots. From the plots we can see that the spectral detail clearly visible in the original HRTF (see Fig. 3) is smoothed as a consequence of auditory modeling. The RMS error estimate shows that over the passband of loudness level error calculation (1.5-16 kHz) the warped IIR design method provides the best fit to the desired response. This is also true in the case of the ITD estimate, shown in Fig. 7.
4.2 Spectral Distance Measure
Another approach to objective HRTF design analysis is to generate a simple numerical measure of filter design quality that is also meaningful from the perceptual point of view. A similar approach has been considered in, e.g., [29]. In [3], a spectral (magnitude) distance measure was experimentally derived in the following way:

- compute the power spectrum
- resample (by interpolation) uniformly on a logarithmic frequency scale
- smooth with about 0.2 octave resolution (this resolution value was specified somewhat arbitrarily to be not too far from the ERB resolution)
- convert to dB scale

In the next steps, the difference between the filter approximation and a reference spectrum is computed for the passband region. The reference spectrum is in our case the corresponding reference response (257-tap FIR filter) in our listening experiments. An RMS value of the difference spectrum is then computed, and this is used as an objective spectral distance measure to characterize the perceptual difference between the magnitude responses or a deviation from a reference response. This spectral distance measure was used in the previous study [3].
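The steps of the spectral distance measure can be sketched as follows. This is a minimal illustration and not the implementation used in [3]: the FFT length, the number of points per octave, the rectangular moving-average smoother, and the default passband limits are all assumptions made for the example.

```python
import numpy as np

def smoothed_log_spectrum_db(h, fs, f_lo, f_hi, points_per_octave=24,
                             smooth_octaves=0.2, n_fft=4096):
    """Power spectrum of impulse response h on a log-frequency grid, in dB."""
    # Step 1: power spectrum.
    power = np.abs(np.fft.rfft(h, n_fft)) ** 2
    freqs = np.fft.rfftfreq(n_fft, d=1.0 / fs)

    # Step 2: resample by interpolation, uniformly on a log frequency scale.
    n_points = int(round(np.log2(f_hi / f_lo) * points_per_octave)) + 1
    log_freqs = np.geomspace(f_lo, f_hi, n_points)
    power_log = np.interp(log_freqs, freqs, power)

    # Step 3: smooth with about 0.2 octave resolution (moving average,
    # an assumed smoother; edges are padded with the end values).
    width = max(1, int(round(smooth_octaves * points_per_octave)))
    padded = np.pad(power_log, (width // 2, width - 1 - width // 2), mode="edge")
    smoothed = np.convolve(padded, np.ones(width) / width, mode="valid")

    # Step 4: convert to dB (small floor avoids log of zero).
    return log_freqs, 10.0 * np.log10(np.maximum(smoothed, 1e-20))

def spectral_distance_db(h_approx, h_ref, fs, f_lo=1500.0, f_hi=16000.0):
    """RMS difference of the smoothed dB spectra over the passband."""
    _, approx_db = smoothed_log_spectrum_db(h_approx, fs, f_lo, f_hi)
    _, ref_db = smoothed_log_spectrum_db(h_ref, fs, f_lo, f_hi)
    return float(np.sqrt(np.mean((approx_db - ref_db) ** 2)))
```

In this sketch `h_ref` would be the 257-tap reference response and `h_approx` the impulse response of the lower-order design under test.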
Footnote 1: Summation of left and right ear loudnesses to obtain a binaural loudness estimate was used in, e.g., [50].
4.3 Results and Discussion
The results of estimating HRTF filter design quality using a binaural auditory model suggest that high-frequency smoothing of the HRTFs is motivated. The results depicted in Figs. 8-9 confirm that the RMS loudness level spectrum error is lowest for the auditory WIIR filter design, which essentially provides a better low-frequency fit with a tradeoff in high-frequency accuracy. In the figures, both left and right ear HRTF approximations were modeled for 9 test subjects at 4 azimuth angles. It can be seen that an auditory scale filter (WIIR) outperforms both FIR and IIR filter design methods. The results are even more pronounced when the ITD modeling error is added to the error measure; these results, for the same 9 test subjects at 4 azimuth angles, are depicted in Fig. 9. In both plots, the filter design errors start to increase when the number of coefficients falls below 48. The WIIR design error is tolerable down to order 16, whereas both the FIR and IIR design errors start to increase earlier.
5 Subjective Analysis of Head-Related Transfer Functions

In order to verify the theoretical filter design results, we carried out headphone listening experiments. The goal was to study the performance of individualized HRTF filter approximations using different design techniques and different filter orders. Listening experiments were carried out for the three HRTF filter design methods described in the previous section: FIR, IIR, and WIIR. A total of 10 male test subjects participated in the listening experiment, with ages ranging between 25 and 51. The hearing of all test subjects was tested using standard audiometry. None of the subjects had reportable hearing loss that could affect the test results. The HRTF approximations were individually equalized for a specific headphone type (Sennheiser HD580).
5.0.1 Test Method

An A/B paired comparison hidden reference paradigm was employed for the listening tests, with two independent grading scales. The subjects were asked to grade localization and timbre impairment against the hidden reference on a continuous 1.0 to 5.0 scale (as proposed in ITU-R BS.1116 [51]). The hidden reference in each case was the 257-tap filter. In each trial, two test sequences were presented with 0.5 s between samples (i.e., A/B//A/B). A full permutation set was employed and two different random orders of presentation were used to minimise bias. To obtain data regarding listener reliability, the reference (the 257-tap FIR case) was also tested against itself. Each test type was repeated two times. Listeners were given written and oral instructions.

5.0.2 Test Stimuli

A pink noise sample with a length of one second (50 ms onset and offset ramps) was used in the final experiment. The level of the stimuli was adjusted so that the peak A-weighted SPL did not exceed 70 dB at any point. This was done in order to avoid level adaptation and the acoustic reflex (stapedius reflex). No gain adjustment of the test sequences calculated for one person was carried out, since the only variability in level was (possibly) introduced by the HRTF filters used.
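A stimulus of the kind described above (one second of pink noise with 50 ms onset and offset ramps) could be generated along the following lines. The sketch is illustrative only: the spectral shaping method, the raised-cosine ramp shape, the 48 kHz sample rate, and the peak normalization are assumptions, and the calibration to 70 dB A-weighted SPL depends on the playback chain and is not covered.

```python
import numpy as np

def pink_noise_burst(duration_s=1.0, ramp_s=0.05, fs=48000, seed=0):
    """Pink noise burst with raised-cosine onset and offset ramps."""
    rng = np.random.default_rng(seed)
    n = int(round(duration_s * fs))

    # Shape white noise so that power falls as 1/f (amplitude as 1/sqrt(f)).
    spectrum = np.fft.rfft(rng.standard_normal(n))
    freqs = np.fft.rfftfreq(n, d=1.0 / fs)
    scale = np.zeros_like(freqs)          # DC component is removed
    scale[1:] = 1.0 / np.sqrt(freqs[1:])
    pink = np.fft.irfft(spectrum * scale, n)

    # Raised-cosine ramps (assumed shape) at the start and end of the burst.
    n_ramp = int(round(ramp_s * fs))
    ramp = 0.5 * (1.0 - np.cos(np.pi * np.arange(n_ramp) / n_ramp))
    envelope = np.ones(n)
    envelope[:n_ramp] = ramp
    envelope[n - n_ramp:] = ramp[::-1]

    burst = pink * envelope
    return burst / np.max(np.abs(burst))  # peak-normalize to +/-1
```

The burst would then be filtered with the left and right HRTF approximations before presentation.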
5.0.3 Test Procedure

The test person was seated in a semi-anechoic chamber (anechoic chamber with hardboard floor). The test stimuli were presented over headphones. A computer keyboard was placed in front of the test person. Each test person was individually familiarized with the task and instructed to grade on the localization and timbre scales for each test signal pair. An example view of the listening test software is shown in Fig. 10. In total, five different filter approximations for each of the three filter types at four apparent source positions were used. Each alternative was played three times. The results of the listening tests were gathered automatically by a program written for the QuickSig environment [52]. The result data were transferred into a statistical software package (SPSS), where the analysis was performed.
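The set of test conditions described above can be enumerated with a short script. This is a hypothetical sketch and not the QuickSig program used in the study: the levels are taken from Table 2, while the assumption that the reference is compared against itself once per angle, the dictionary layout, and the seeded shuffle are illustrative choices.

```python
import itertools
import random

FILTER_TYPES = ("FIR", "IIR", "WIIR")
FILTER_SIZES = (17, 33, 49, 65, 97)   # coefficients of the tested designs
ANGLES_DEG = (0, 40, 90, 210)         # apparent source positions
REFERENCE = ("FIR", 257)              # hidden reference filter

def build_trials(seed):
    """Return one randomized presentation order of all test conditions."""
    trials = [
        {"type": t, "size": n, "angle": a}
        for t, n, a in itertools.product(FILTER_TYPES, FILTER_SIZES, ANGLES_DEG)
    ]
    # Assumed: the reference is also graded against itself at every angle.
    trials += [
        {"type": REFERENCE[0], "size": REFERENCE[1], "angle": a}
        for a in ANGLES_DEG
    ]
    random.Random(seed).shuffle(trials)  # seeded for reproducible orders
    return trials
```

Calling `build_trials` with two different seeds would give the two random presentation orders mentioned in the test method.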
5.1 Results and Discussion
5.1.1 Data and Model Verification

The data were initially checked to ensure equal variance across listeners. At this point it was found that one listener had a very different variance from the others. Upon closer inspection, this listener was found to be grading very similarly for all systems, had a non-normal distribution of data, and also had very low error variance and poor F-statistics. This indicates that this listener could not discriminate between systems, and the listener was thus eliminated from subsequent analysis.

The data were then tested for meeting the assumptions of the analysis of covariance (ANCOVA) model. The data were found to be normally distributed, though slightly skewed, which is typical of subjective data. The ANCOVA model is fairly robust to slight skewness of data. Residuals were found to be normally distributed. It should be noted that the model did not meet the requirements for homogeneity of variance, and the Levene statistics were found to be significant. This is not considered critical, as in all other respects the ANCOVA assumptions have been met and the raw and modelled data are strongly correlated.

5.1.2 Analysis

An ANCOVA was employed for the full analysis of the data considering all factors: filter order (FILTSIZ), filter type (TYPE), listener (PERSON), and reproduction angle (ANGLE). A covariate (ORDER) was included to represent the order in which the test was performed. A full factorial analysis was made for all factors and covariates, employing a type III sum of squares. This analysis was repeated for both dependent variables, localization and timbre; the results are shown in the ANCOVA Tables 4-5. Initial inspection of these tables shows that the ANCOVA models are valid (Localization: F = 8.733, p < 0.00; Timbre: F = 8.639, p < 0.00). All plotted means are based upon the ANCOVA modelled data. When considering the means illustrated in Figs. 11-16, it would appear that there is a correlation between the dependent variables.
To test this, a bivariate correlation was performed between the localization and timbre variables, employing Spearman's rho test. The output of this test is shown in Table 6 and indicates a correlation of 89% (p < 0.01). This level of correlation is quite high but may be explained in the following manner:

- Timbre and localization are implicitly related by the amplitude response characteristics of HRTFs.
- Untrained listeners were employed for the tests. As this task is more complex, involving two measures, listeners may not have been able to discriminate between the two scales. The net result in such cases is that listeners grade similarly for both scales, which are effectively representative of the mean opinion score (MOS) scale for overall perceived quality.

It is not possible to argue which of these two factors is dominating. The outputs of the two ANCOVA tables are not identical, as illustrated in Table 3, which shows that the rank order of the factors differs between the two dependent variables. For this reason the dependent variables are considered separately.

The ORDER covariate is found to be a significant factor. This indicates that the order of presentation did have an influence on the grading, which is a further sign that the listening panel was untrained. However, this is not a critical factor, as long as it is included in the ANCOVA. Furthermore, upon inspection of means as a function of PERSON, it can be seen that listeners are grading consistently with a common trend (see Figs. 11-12). There is a significant difference for this factor for both scales, but this is considered normal, as listeners do tend to perceive and grade differently.

The dominant factor for both variables was FILTSIZ (see Tables 2-5). FILTSIZ is plotted with respect to the FIR grading as a difference grading (diffgrade), to illustrate the relative performance of the IIR and WIIR filters. A positive diffgrade implies superiority. It is clear from Figs. 15-16 that there is only a marginal performance difference between FIR and IIR filters for all filter sizes. Clearly, the 17-coefficient IIR filter should be avoided. The WIIR filter appears to reach an early peak in its performance compared to the others at around 33 coefficients, at which point it is nearly one diffgrade better than an IIR filter of the same length and 0.75 diffgrades better than an FIR filter. Above this length the WIIR performance does not improve, and the FIR filter provides a competitive solution for longer filters. The difference between FIR and WIIR at 97 coefficients is negligible. IIR filter quality improves slowly as a function of length but only approaches the same quality as FIR filters above 65 coefficients.

Considering the second most significant factor, TYPE, in all cases the WIIR is found superior, with the FIR design in second place. For 37-66 coefficients, the WIIR filter is more than 0.4 diffgrades superior to the FIR filter of equivalent length.

When considering the degradation as a function of ANGLE, we can see from Figs. 13-14 that timbre degradation is more strongly affected than localization. The highest quality for both scales occurs at 90°. This implies that the same quality level can be achieved at 90° with inferior filters than at other angles. Blauert has described the reduced localization sensitivity of human listeners (localization blur) at the sides in the horizontal plane, which may provide an explanation for this phenomenon [1]. It can also be considered that listeners are more critical of timbre degradation, and to a lesser extent of localization degradation, in the frontal direction.
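The Spearman rank correlation used above to compare the localization and timbre grades can be computed as in the following sketch. It is a generic illustration and not the SPSS procedure used in the study; the function names are hypothetical, and any grade vectors passed to it would have to come from the listening test data, which are not reproduced here.

```python
import numpy as np

def average_ranks(values):
    """Ranks starting at 1; tied values receive the mean of their ranks."""
    values = np.asarray(values, dtype=float)
    order = np.argsort(values, kind="mergesort")
    sorted_vals = values[order]
    ranks = np.empty(len(values), dtype=float)
    start = 0
    while start < len(values):
        stop = start
        # Extend the run while the next sorted value is tied with this one.
        while stop + 1 < len(values) and sorted_vals[stop + 1] == sorted_vals[start]:
            stop += 1
        ranks[order[start:stop + 1]] = 0.5 * (start + stop) + 1.0
        start = stop + 1
    return ranks

def spearman_rho(x, y):
    """Spearman's rho: Pearson correlation of the rank-transformed data."""
    return float(np.corrcoef(average_ranks(x), average_ranks(y))[0, 1])
```

A value of rho close to 1 for paired localization and timbre grades would correspond to the high correlation reported in Table 6.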
6 Conclusions

In this paper, the HRTF filter design problem was addressed from the objective and subjective evaluation point of view. Filter design methods taking into account the non-uniform frequency resolution of the human ear were studied and summarized. A new technique for auditory spectral error estimation was incorporated, based on a computationally efficient binaural auditory model. Subjective listening tests were performed to compare the theoretical model results with empirical localization performance. The results suggest the following preliminary conclusions:

- The high-frequency spectral content present in HRTFs can be smoothed using auditory criteria without degrading localization performance.
- A binaural auditory model can be used to give a quantitative prediction of perceptual HRTF filter design performance.
- A warped IIR filter of order 16 appears to be sufficient for retaining most of the perceptual features of HRTFs.

In conclusion, it can be stated that filter design methods for 3-D sound can gain considerable efficiency when an auditory resolution is used. The non-uniform frequency resolution can be approximated using pre-smoothing, weighting functions, or frequency warping. The binaural auditory model outputs and listening test results gave similar results in terms of detectable (perceptually audible) differences between original and approximated HRTFs. The required filter length for high-quality 3-D sound synthesis is, however, also dependent on the incident angle of the incoming sound.
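To illustrate how frequency warping yields a non-uniform frequency resolution, the sketch below evaluates the frequency mapping of a first-order allpass section D(z) = (z^-1 - lambda) / (1 - lambda z^-1), the conventional building block of warped filters. The function name and the choice of this particular allpass convention are assumptions for the example; the value lambda = 0.65 is the one quoted for the WIIR design in Fig. 3.

```python
import numpy as np

def warped_frequency(omega, lam):
    """Map normalized frequency omega (rad/sample) through a first-order
    allpass with warping coefficient lam (negative of the allpass phase)."""
    omega = np.asarray(omega, dtype=float)
    return omega + 2.0 * np.arctan(lam * np.sin(omega) / (1.0 - lam * np.cos(omega)))

if __name__ == "__main__":
    fs = 48000.0   # sample rate assumed from the Fig. 2 caption
    lam = 0.65     # warping coefficient quoted in Fig. 3
    for f in (500.0, 1000.0, 4000.0, 16000.0):
        nu = float(warped_frequency(2.0 * np.pi * f / fs, lam))
        print(f"{f:8.0f} Hz -> {nu * fs / (2.0 * np.pi):8.0f} Hz on the warped axis")
```

For positive lambda the mapping stretches the low-frequency region by a factor approaching (1 + lambda) / (1 - lambda) and compresses the high frequencies, so a filter designed on the warped axis spends more of its order at low frequencies.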
6.1 Discussion
Subjective estimation of HRTF filter design quality is not an easy task. One of the most important issues in carrying out subjective tests is the problem formulation: which features should be tested? In our experiments, the subjects were not asked to determine the angle of arrival of the test tones. This rules out the possibility of checking for localization errors such as in-head localization or front-back confusions. Therefore we made the assumption that the reference HRTF is "perfect" and that the quality degradation is always related to that reference. We found that timbre and localization are not orthogonal measures in virtual source quality estimation (as was expected). However, the correlation between objective and subjective results in the filter design comparison was found to be good, which suggests that the binaural auditory model can be used as a tool for perceptual quality estimation. It is also clear that limited information can be gained from the simple localization task, which provides little information regarding the perceived spatial reproduction quality of a filter design. Multidimensional spatial attributes may provide researchers with a greater knowledge of how individuals assess spatial sound for different tasks and provide greater insight into how to design such systems. However, the search for multidimensional spatial attributes is not a simple one, but should be considered an important task for further research.
7 Acknowledgments

The authors would like to thank Mr. Ville Pulkki (Helsinki University of Technology (HUT), Acoustics Lab), whose work and software implementations were the source of motivation for the auditory modeling part of the present work. Mr. Klaus Riederer (HUT Acoustics Lab) is acknowledged for measuring all the HRTF data used in this study. The authors are also grateful to Prof. Julius Smith (CCRMA, Stanford University), Mr. Matti Hämäläinen (Nokia Research Center) and Dr. Vesa Välimäki (HUT Acoustics Lab) for reading and commenting on versions of this paper. Last but not least, the 10 test subjects who suffered through HRTF measurements and listening tests are sincerely thanked for their patience and enthusiasm. This work has been financially supported by the Emil Aaltonen Foundation, the Helsinki University of Technology Foundation, and the Nokia Foundation.
8 References

[1] J. Blauert. Spatial hearing. The psychophysics of human sound localization. MIT Press, Cambridge, MA, USA, 1997.
[2] H. Møller, M. Sørensen, D. Hammershøi, and C. Jensen. Head-related transfer functions of human subjects. Journal of the Audio Engineering Society, 43(5):300-321, May 1995.
[3] J. Huopaniemi and M. Karjalainen. Review of digital filter design and implementation methods for 3-D sound. In Proceedings of the 102nd Convention of the Audio Engineering Society, Preprint 4461, Munich, Germany, 1997.
[4] M. Karjalainen, E. Piirilä, A. Järvinen, and J. Huopaniemi. Comparison of loudspeaker equalization methods based on DSP techniques. In Proceedings of the 102nd Convention of the Audio Engineering Society, Preprint 4437, Munich, Germany, 1997.
[5] M. Karjalainen, A. Härmä, U. Laine, and J. Huopaniemi. Warped filters and their audio applications. In Proceedings of the IEEE Workshop of Applications of Signal Processing to Audio and Acoustics, Mohonk Mountain House, New Paltz, New York, 1997.
[6] J. Sandvad and D. Hammershøi. Binaural Auralization. Comparison of FIR and IIR Filter Representation of HIRs. In Proceedings of the 96th Convention of the Audio Engineering Society, Preprint 3862, Amsterdam, The Netherlands, 1994.
[7] A. Kulkarni and H. S. Colburn. Finite-impulse-response models of the head-related transfer-function. 1997. Submitted to J. Acoust. Soc. Am.
[8] A. Kulkarni and H. S. Colburn. Infinite-impulse-response models of the head-related transfer-function. 1997. Submitted to J. Acoust. Soc. Am.
[9] G. H. Golub and C. F. Van Loan. Matrix computations. The Johns Hopkins University Press, 3rd edition, 1996.
[10] J. O. Smith. Techniques for Digital Filter Design and System Identification with Application to the Violin. PhD thesis, Stanford University, Stanford, California, USA, June 1983.
[11] T. Parks and C. Burrus. Digital Filter Design. John Wiley & Sons, New York, 1987.
[12] J. Mackenzie, J. Huopaniemi, V. Välimäki, and I. Kale. Low-order modelling of head-related transfer functions using balanced model truncation. IEEE Signal Processing Letters, 4(2):39-41, 1997.
[13] E. Zwicker and H. Fastl. Psychoacoustics: Facts and Models. Springer-Verlag, Heidelberg, Germany, 1990.
[14] B. C. J. Moore, R. W. Peters, and B. R. Glasberg. Auditory filter shapes at low center frequencies. Journal of the Acoustical Society of America, 88:132-140, 1990.
[15] B. C. J. Moore, R. W. Peters, and B. R. Glasberg. A revision of Zwicker's loudness model. Acta Acustica, 82:335-345, 1996.
[16] J. Köring and A. Schmitz. Simplifying cancellation of cross-talk for playback of head-related recordings in a two-speaker system. Acustica, 179:221-232, 1993.
[17] H. W. Strube. Linear prediction on a warped frequency scale. Journal of the Acoustical Society of America, pp. 1071-1076, 1980.
[18] J. O. Smith and J. Abel. The Bark bilinear transform. In Proceedings of the IEEE Workshop of Applications of Signal Processing to Audio and Acoustics, New Paltz, New York, October 1995.
[19] J.-M. Jot, V. Larcher, and O. Warusfel. Digital signal processing issues in the context of binaural and transaural stereophony. In Proceedings of the 98th Convention of the Audio Engineering Society, Preprint 3980, Paris, France, 1995.
[20] C. P. Brown and R. O. Duda. A structural model for binaural sound synthesis. IEEE Transactions on Speech and Audio Processing, 1998. In press.
[21] D. J. Kistler and F. L. Wightman. A model of head-related transfer functions based on principal components analysis and minimum-phase reconstruction. Journal of the Acoustical Society of America, 91(3):1637-1647, 1992.
[22] S. Mehrgardt and V. Mellert. Transformation Characteristics of the External Human Ear. Journal of the Acoustical Society of America, 61(6):1567-1576, 1977.
[23] A. Kulkarni, S. K. Isabelle, and H. S. Colburn. On the minimum-phase approximation of head-related transfer functions. In Proceedings of the IEEE Workshop of Applications of Signal Processing to Audio and Acoustics, New Paltz, New York, October 1995.
[24] A. Kulkarni, S. K. Isabelle, and H. S. Colburn. Sensitivity of human subjects to head-related-transfer-function phase spectra. 1997. Submitted to J. Acoust. Soc. Am.
[25] T. I. Laakso, V. Välimäki, M. Karjalainen, and U. K. Laine. Splitting the unit delay - tools for fractional delay filter design. IEEE Signal Processing Magazine, 13(1):30-60, January 1996.
[26] A. V. Oppenheim and R. W. Schafer. Discrete-Time Signal Processing. Prentice Hall, New Jersey, 1989.
[27] A. Kulkarni and H. S. Colburn. Efficient finite-impulse-response filter models of the head-related transfer function. Journal of the Acoustical Society of America, 97(5):3278, 1995.
[28] K. Hartung and A. Raab. Efficient modeling of head-related transfer functions. Acta Acustica, 82(suppl. 1):S88, 1996.
[29] S. Wu and W. Putnam. Minimum perceptual spectral distance FIR filter design. In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, volume 1, pp. 447-450, 1997.
[30] G. S. Kendall and C. A. P. Rodgers. The simulation of three-dimensional localization cues for headphone listening. In Proceedings of the International Computer Music Conference, 1982.
[31] G. S. Kendall and W. L. Martens. Simulating the Cues of Spatial Hearing in Natural Environments. In Proceedings of the International Computer Music Conference, pp. 111-125, Paris, France, 1984.
[32] J. Sandvad and D. Hammershøi. What is the most efficient way of representing HTF filters? In Proc. NORSIG'94, pp. 174-178, 1994.
[33] M. A. Blommer and G. H. Wakefield. On the design of pole-zero approximations using a logarithmic error measure. IEEE Transactions on Signal Processing, 42(11):3245-3248, November 1994.
[34] M. A. Blommer and G. H. Wakefield. Pole-zero approximations for head-related transfer functions using a logarithmic error criterion. IEEE Transactions on Speech and Audio Processing, 5(3):278-287, May 1997.
[35] F. Asano, Y. Suzuki, and T. Sone. Role of spectral cues in median plane localization. Journal of the Acoustical Society of America, 88(1):159-168, 1990.
[36] C. Ryan and D. Furlong. Effects of headphone placement on headphone equalization for binaural reproduction. In Proceedings of the 98th Audio Engineering Society (AES) Convention, Preprint 4009, Paris, France, Feb. 25-28, 1995.
[37] R. L. Jenison. A spherical basis function neural network for pole-zero modeling of head-related transfer functions. In Proceedings of the IEEE Workshop of Applications of Signal Processing to Audio and Acoustics, New Paltz, New York, October 1995.
[38] A. Kulkarni and H. S. Colburn. Infinite-impulse-response filter models of the head-related transfer function. Journal of the Acoustical Society of America, 97(5):3278, 1995.
[39] W. G. Gardner and K. D. Martin. HRTF measurements of a KEMAR. Journal of the Acoustical Society of America, 97(6):3907-3908, 1995.
[40] B. Friedlander and B. Porat. The modified Yule-Walker method of ARMA spectral estimation. IEEE Trans. Aerospace Electronic Syst., AES-20(2):158-173, 1984.
[41] L. Savioja, J. Huopaniemi, T. Lokki, and R. Väänänen. Creating Interactive Virtual Acoustic Environments. 1998. Submitted to J. Audio Eng. Soc.
[42] K. Riederer. Repeatability analysis of HRTF measurements. San Francisco, USA, September 26-29, 1998. To be presented at the 105th Convention of the Audio Engineering Society.
[43] Mathworks. MATLAB Signal Processing Toolbox. User's Manual, 1994.
[44] V. Pulkki, M. Karjalainen, and J. Huopaniemi. Analyzing virtual sound sources using a binaural auditory model. In Proceedings of the 104th Convention of the Audio Engineering Society, Preprint 4697, Amsterdam, The Netherlands, May 1998. A revised version submitted to J. Audio Eng. Soc.
[45] E. A. Macpherson. A computer model of binaural localization for stereo imaging measurement. Journal of the Audio Engineering Society, 39(9):604-622, 1991.
[46] L. A. Jeffress. A place theory of sound localization. J. Comp. Physiol. Psych., 61:468-486, 1948.
[47] W. Lindemann. Extension of a binaural cross-correlation model by contralateral inhibition. I. Simulation of lateralization for stationary signals. Journal of the Acoustical Society of America, 80(6):1608-1622, December 1986.
[48] W. Gaik. Combined evaluation of interaural time and intensity differences: Psychoacoustic results and computer modeling. Journal of the Acoustical Society of America, 94(1):98-110, July 1993.
[49] M. Slaney. An efficient implementation of the Patterson-Holdsworth filter bank. Apple Technical Report 35, Apple Computer, Inc., 1993.
[50] B. C. J. Moore, B. R. Glasberg, and T. Baer. A model for the prediction of thresholds, loudness, and partial loudness. Journal of the Audio Engineering Society, 45(4):224-240, April 1997.
[51] ITU-R. Recommendation BS.1116. Methods for the subjective assessment of small impairments in audio systems including multichannel sound systems. Geneva, 1994.
[52] M. Karjalainen. DSP software integration by object-oriented programming - a case study of QuickSig. IEEE ASSP Magazine, pp. 21-31, 1990.
[53] D. R. Begault. Challenges to the successful implementation of 3-D sound. Journal of the Audio Engineering Society, 39(11):864-870, 1991.
[Figure 1 (block diagram). Blocks shown: Audio Input; User Input (azimuth θ, elevation φ); HRTF Database (θ = [0:10:350], φ = [−90:10:90]); ITD Table; Coefficient Interpolation; delay lines DL_l and DL_r; filters h_{l,i}(n,θ,φ) and h_{r,i}(n,θ,φ); Crosstalk Canceling; outputs Transaural Listening and Binaural Listening.]
Figure 1: General implementation framework of dynamic 3-D sound using non-recursive filters. HRTF modeling is carried out using pure delays to represent the ITD (DL_l and DL_r), and minimum-phase reconstructed interpolated HRIRs (h_{l,i}(n) and h_{r,i}(n)).
[Figure 2 (plot): weight value versus frequency (Hz) on logarithmic axes; curves: ERB weighting, Bark weighting.]
Figure 2: Auditory weighting functions as a function of frequency, fs = 48 kHz.
[Figure 3 (plot): "Modeling of minimum-phase HRTFs: azi=40deg, person: 3". Relative Magnitude / dB versus Frequency / Hz (logarithmic). Curves: Minimum-phase Original; FIR, order: 48; IIR, design: Prony, order: 24; WIIR, lambda = 0.65, design: Prony, order: 24.]
Figure 3: HRTF magnitude response and different filter approximations. Right ear, azimuth angle 40°, person 3.
[Figure 4 (block diagram). Blocks shown: pink noise; HRTF L and HRTF R; GTFB; half wave rectification; low pass filtering; IACC; LL; outputs ITD spectrum, Left LL Spectrum, Right LL Spectrum.]
Figure 4: Detailed view of the binaural auditory model used in the objective evaluation of HRTF filter designs.
[Figure 5 (plot grid): loudness level LL(L) / phons versus Frequency / kHz (0.2-21). Panel titles: FIR, IIR, WIIR (azi=40, id=3); rows: 97, 65, 49, 33, and 17 filter coefficients. RMS error values printed in the subplots, per row in extraction order: 97: 3.1908, 3.0824, 0.96047; 65: 5.1938, 5.8484, 1.9814; 49: 5.4452, 7.1516, 2.433; 33: 7.1978, 8.6572, 4.4735; 17: 9.0677, 9.7706, 7.4997.]
Figure 5: Binaural auditory model evaluation of filter design quality. Left ear, azimuth angle 40°, person 3. Solid line: filter approximation, dashed line: reference.
[Figure 6 (plot grid): loudness level LL(R) / phons versus Frequency / kHz (0.2-21). Panel titles: FIR, IIR, WIIR (azi=40, id=3); rows: 97, 65, 49, 33, and 17 filter coefficients. RMS error values printed in the subplots, per row in extraction order: 97: 3.1241, 2.4967, 0.7613; 65: 5.0635, 5.293, 1.7185; 49: 5.1856, 6.2493, 2.1406; 33: 6.3121, 7.4215, 5.0108; 17: 8.1712, 8.2394, 7.3156.]
Figure 6: Binaural auditory model evaluation of filter design quality. Right ear, azimuth angle 40°, person 3. Solid line: filter approximation, dashed line: reference.
[Figure 7 (plot grid): ITD / ms versus Frequency / kHz (0.2-1). Panel titles: FIR, IIR, WIIR (azi=40, id=3); rows: 97, 65, 49, 33, and 17 filter coefficients.]
Figure 7: Binaural auditory model evaluation of filter design quality. ITD, azimuth angle 40°, person 3. Solid line: filter approximation, dashed line: reference.
[Figure 8 (plot): "Subjects: 9, azims: 4". Composite modeling error versus number of filter coefficients (10-100); curves: FIR, IIR, WIIR.]
Figure 8: Binaural auditory model results for 9 subjects, 4 azimuth angles using three different filter design methods. The composite (L+R) loudness level spectrum error estimate was used.
[Figure 9 (plot): "Subjects: 9, azims: 4". ITD+L+R modeling error versus number of filter coefficients (10-100); curves: FIR, IIR, WIIR.]
Figure 9: Binaural auditory model results for 9 subjects, 4 azimuth angles using three different filter design methods. The combined ITD and composite (L+R) loudness level spectrum error estimate was used.
Figure 10: Listening test questionnaire window for HRTF filter design quality (localization and timbre grading).
[Figure 11 (plot): 95% CI predicted value for LOCALIZATION (1.0-5.0) versus PERSON (1-9); series by TYPE: FIR, IIR, WIIR.]
Figure 11: Listening test results for three filter types (FIR, IIR, WIIR). Predicted values as a function of test persons. Tested variable: localization.
[Figure 12 (plot): 95% CI predicted value for TIMBRE (1.0-5.0) versus PERSON (1-9); series by TYPE: FIR, IIR, WIIR.]
Figure 12: Listening test results for three filter types (FIR, IIR, WIIR). Predicted values as a function of test persons. Tested variable: timbre.
[Figure 13 (plot): 95% CI predicted value for LOCALIZATION (1.0-5.0) versus ANGLE (0, 40, 90, 210); series by TYPE: FIR, IIR, WIIR.]
Figure 13: Listening test results for three filter types (FIR, IIR, WIIR). Predicted values as a function of presentation angle. Tested variable: localization.
[Figure 14 (plot): 95% CI predicted value for TIMBRE (1.0-5.0) versus ANGLE (0, 40, 90, 210); series by TYPE: FIR, IIR, WIIR.]
Figure 14: Listening test results for three filter types (FIR, IIR, WIIR). Predicted values as a function of presentation angle. Tested variable: timbre.
[Figure 15 (plot): 95% CI predicted value for LOCALIZATION (-1.0 to 1.0) versus FILTSIZ (17, 33, 49, 65, 97, 257); series by TYPE: FIR, IIR, WIIR.]
Figure 15: Listening test results for three filter types (FIR, IIR, WIIR). Predicted values as a function of filter type and order. Tested variable: localization.
[Figure 16 (plot): 95% CI predicted value for TIMBRE (-1.0 to 1.5) versus FILTSIZ (17, 33, 49, 65, 97, 257); series by TYPE: FIR, IIR, WIIR.]
Figure 16: Listening test results for three filter types (FIR, IIR, WIIR). Predicted values as a function of filter type and order. Tested variable: timbre.
Research Group | Design Type | Filter Order | Study
---------------|-------------|--------------|------
Begault, 1991 [53] | binaural / FIR | 80-512 | empirical
Sandvad and Hammershøi, 1994ab [6, 32] | binaural / FIR | 72 | empirical
Kulkarni and Colburn, 1995&1997 [27, 7] | binaural / FIR | 64 | empirical
Hartung and Raab, 1996 [28] | binaural / FIR | 48 | empirical
Huopaniemi and Karjalainen, 1997 [3] | binaural / FIR | 40 | empirical
Asano et al., 1990 [35] | binaural / IIR | >40 | empirical
Sandvad and Hammershøi, 1994ab [6, 32] | binaural / IIR | 48 | empirical
Blommer and Wakefield, 1994 [33] | binaural / IIR | 14 | theoretical
Jot et al., 1995 [19] | binaural / IIR | 10-20 | empirical?
Ryan and Furlong, 1995 [36] | binaural / IIR | 24 | empirical?
Kulkarni and Colburn, 1995&1997 [38, 8] | binaural / IIR | 6 | empirical
Kulkarni and Colburn, 1995&1997 [38, 8] | binaural / IIR | 25 (all-pole) | empirical
Hartung and Raab, 1996 [28] | binaural / IIR | 34/10 | empirical
Mackenzie et al., 1997 [12] | binaural / IIR | 10 | theoretical
Blommer and Wakefield, 1997 [34] | binaural / IIR | 40 | theoretical
Huopaniemi and Karjalainen, 1997 [3] | binaural / IIR | 20 | empirical

Table 1: Binaural HRTF filter design data from the literature.
Factor | Levels
-------|-------
Filter type (TYPE) | FIR, IIR, WIIR
Filter length (FILTSIZ) | 17, 33, 49, 65, 97, 257
Angle of reproduction (ANGLE) | 0°, 40°, 90°, 210°
Listeners (PERSON) | 10 listeners
Order of sample presentation (covariate ORDER) | 2 orders

Table 2: Factors and levels for the subjective experiment.
Rank Order | Localization | Timbre
-----------|--------------|-------
1 | FILTSIZ | FILTSIZ
2 | TYPE | TYPE
3 | PERSON | ANGLE
4 | ORDER (covariate) | PERSON
5 | ANGLE | ORDER (covariate)
6 | FILTSIZ*PERSON | ANGLE*FILTSIZ
7 | ANGLE*FILTSIZ | ANGLE*FILTSIZ*TYPE
8 | ANGLE*PERSON | FILTSIZ*PERSON

Table 3: ANCOVA output in rank order of factors for localization and timbre.
Table 4: The ANCOVA table for LOCALIZATION.
Table 5: The ANCOVA table for TIMBRE.
Table 6: Correlation analysis between localization and timbre.