efficient hrtf synthesis using an interaural transfer function ... - eurasip

EFFICIENT HRTF SYNTHESIS USING AN INTERAURAL TRANSFER FUNCTION MODEL

Gaëtan Lorho, Jyri Huopaniemi, Nick Zacharov, David Isherwood
Nokia Research Center, Speech and Audio Systems Laboratory, P.O. Box 407, FIN-00045 Nokia Group, FINLAND
e-mail: [gaetan.lorho,jyri.huopaniemi]@nokia.com, [nick.zacharov,david.isherwood]@nokia.com

ABSTRACT

We propose an alternative method for binaural synthesis using the ratio of the HRTFs at the two ears, the interaural transfer function (ITF), for deriving the response at the contralateral ear from the ipsilateral one. This technique allows an optimal binaural synthesis if an accurate ITF model is used, but a computationally more efficient binaural synthesis can also be obtained by simplifying the ITF filter used for the contralateral channel processing, which is perceptually less critical than the ipsilateral response. Different filter designs and filter orders are investigated for both speech and wide-band signals. Subjective tests are carried out to investigate the perceptual differences introduced by this ITF model. The results indicate that an ITF filter can be of considerably lower complexity when compared to the ipsilateral filter, thus allowing for computational savings without significant perceptual degradation.

1 INTRODUCTION

The reproduction of virtual audio environments relies on the exact replication of spatial cues used by the auditory system. Impulse responses measured at the two ears of a human subject or a dummy-head contain the information necessary for directional sound localisation. By processing a monophonic signal with these Head Related Transfer Functions (HRTFs), the two resulting signals produce a virtual source of sound when replayed via headphones. The principle of binaural synthesis is illustrated by the block diagram in Fig. 1.

HRTFs are normally modelled with digital filters (Hi for the ipsilateral ear, Hc for the contralateral ear) incorporating a minimum-phase approximation of the impulse responses and a simple delay line for the ITD (denoted z^{-D}, where D is the integer sample delay) [1]. In previous work [2] it has been found that FIR filters as short as 40 taps can be used (at an audio sample rate of 48 kHz) for high-quality binaural presentation. Nevertheless, real-time implementation is computationally costly and becomes problematic when a large number of sound sources has to be rendered. The possibility of using a simple model for deriving the contralateral signal from the ipsilateral one would allow more efficient binaural synthesis.

The idea of a contralateral HRTF model has been discussed by Avendano et al. [3], where a spherical head model was used to transform the ipsilateral HRTF to the contralateral one. In our novel approach, we use the same basic principle, but apply the interaural transfer function (ITF) to derive the contralateral response, as illustrated in the block diagram of Fig. 2. Furthermore, we look at a simplified model for this ITF. This simplification can be justified by the fact that the response at the ear opposite to the sound source is perceptually less critical due to the large ILD introduced by the presence of the head.
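The two signal-flow structures of Fig. 1 and Fig. 2 can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the filters are short toy FIR responses (not measured HRTFs), the delay value is hypothetical, and the delay is assumed to sit in the contralateral path. When the ITF is exact (Hc = Hi · ITF), both structures give the same contralateral signal.

```python
import numpy as np

def delay(signal, d):
    """Delay a signal by d integer samples (the z^-D block)."""
    return np.concatenate([np.zeros(d), signal])

def binaural_direct(x, h_i, h_c, d):
    """Fig. 1 structure: one filter per ear, delay on the contralateral path."""
    y_i = np.convolve(x, h_i)
    y_c = delay(np.convolve(x, h_c), d)
    return y_i, y_c

def binaural_itf(x, h_i, itf, d):
    """Fig. 2 structure: contralateral signal derived from the ipsilateral one."""
    y_i = np.convolve(x, h_i)
    y_c = delay(np.convolve(y_i, itf), d)
    return y_i, y_c

# Toy data, illustrative values only.
rng = np.random.default_rng(0)
x = rng.standard_normal(256)             # monophonic input
h_i = np.array([1.0, 0.5, -0.2, 0.1])    # toy ipsilateral response
itf = np.array([0.5, 0.3])               # toy low-pass-like ITF
h_c = np.convolve(h_i, itf)              # chosen so that Hc = Hi * ITF exactly
d = 20                                   # hypothetical ITD in samples

yi_a, yc_a = binaural_direct(x, h_i, h_c, d)
yi_b, yc_b = binaural_itf(x, h_i, itf, d)
max_diff = np.max(np.abs(yc_a - yc_b))   # numerically zero for an exact ITF
```

In the second structure only the short ITF filter runs on the contralateral path, which is where a low-order model would reduce the cost.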

Figure 1. Principle of binaural synthesis, including two filters (Hi for the ipsilateral HRTF and Hc for the contralateral HRTF) and a pure delay line for the ITD.

Figure 2. Alternative method for binaural synthesis using the ipsilateral HRTF, the ratio of the HRTFs at the two ears, the interaural transfer function (ITF = Hc / Hi) and a pure delay line for the ITD.

2 INTERAURAL TRANSFER FUNCTION MODEL

In this section, methods for low-order filter modelling of the ITF are presented. The original ITF filter used in this study can be expressed as the ratio of a minimum-phase approximation of the contralateral and ipsilateral HRTFs (ITF = Hc / Hi) (see [4] and [5] for more details). The main advantage of using this method comes from the fact that a lower order filter can be fitted to the ITF without affecting the overall perception. Methods for HRTF filter approximation have been presented in [6]. An attractive low-order IIR filter design method, balanced model truncation (BMT) [7], has been used to compute low-order approximations of ITFs for this study.

Prior to the filter design, some pre-processing steps are required. First, the HRTFs have to be minimum-phase in order to derive a causal ITF. Furthermore, pre-smoothing of the response is desired to enhance the low-order filter fit. Psychoacoustically motivated smoothing such as ERB or Bark smoothing can be used.

In Fig. 3, the frequency responses of the ipsilateral and contralateral HRTFs are shown with the corresponding ITF for a sound source at 80° azimuth, 0° elevation. Due to the presence of the head, the response at the contralateral ear is attenuated for high frequencies. The ITF also contains this low-pass effect as a main feature. A more complex structure is seen at high frequency but at a level 20 dB lower.
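The minimum-phase step and the ratio ITF = Hc / Hi described above could be computed as in the sketch below. It assumes the common cepstral (homomorphic) construction of a minimum-phase spectrum, which the paper does not specify, and it uses synthetic impulse responses; the names `minimum_phase_spectrum` and `itf_from_hrirs` and the magnitude floor are illustrative.

```python
import numpy as np

def minimum_phase_spectrum(magnitude):
    """Minimum-phase spectrum with the given full-length FFT magnitude (even length)."""
    n = len(magnitude)
    cepstrum = np.fft.ifft(np.log(magnitude)).real
    fold = np.zeros(n)
    fold[0] = 1.0
    fold[1:n // 2] = 2.0      # fold the anti-causal cepstrum onto the causal part
    fold[n // 2] = 1.0
    return np.exp(np.fft.fft(cepstrum * fold))

def itf_from_hrirs(h_ipsi, h_contra, n_fft=1024, floor=1e-6):
    """ITF = Hc_min / Hi_min, returned as a spectrum and as an impulse response."""
    mag_i = np.maximum(np.abs(np.fft.fft(h_ipsi, n_fft)), floor)
    mag_c = np.maximum(np.abs(np.fft.fft(h_contra, n_fft)), floor)
    itf_spec = minimum_phase_spectrum(mag_c) / minimum_phase_spectrum(mag_i)
    itf_ir = np.fft.ifft(itf_spec).real
    return itf_spec, itf_ir, mag_i, mag_c

# Synthetic data: the contralateral response is the ipsilateral one passed
# through a known attenuating, minimum-phase kernel (illustrative values only).
rng = np.random.default_rng(1)
h_ipsi = rng.standard_normal(64) * np.exp(-np.arange(64) / 8.0)
kernel = 0.3 * np.array([0.6, 0.3, 0.1])
h_contra = np.convolve(h_ipsi, kernel)

itf_spec, itf_ir, mag_i, mag_c = itf_from_hrirs(h_ipsi, h_contra)
```

Because both spectra are minimum-phase, their ratio is causal and stable, and in this synthetic case the first taps of `itf_ir` reproduce the kernel used to build the contralateral response.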

[Figure 3 plot. Panel title: "Ipsilateral / contralateral HRTFs and ITF, 30° azimuth, 0° elevation". Axes: Amplitude (dB) versus Frequency (Hz). Curves: Ipsilateral HRTF, Contralateral HRTF, ITF = Contra / ipsi.]

Figure 3. Ipsilateral, contralateral and ITF frequency responses for individual HRTFs measured at 80° azimuth (0° elevation).
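The pre-smoothing step mentioned in Section 2 could look like the sketch below. It assumes the Glasberg and Moore ERB formula and a simple rectangular power average one ERB wide; the paper does not state which smoothing kernel was used, and the rippled test response is synthetic.

```python
import numpy as np

def erb_bandwidth(freq_hz):
    """Equivalent rectangular bandwidth in Hz (Glasberg and Moore formula)."""
    return 24.7 * (4.37 * freq_hz / 1000.0 + 1.0)

def erb_smooth(magnitude, freqs_hz):
    """Average the power response over a one-ERB-wide window around each bin."""
    power = np.asarray(magnitude, dtype=float) ** 2
    smoothed = np.empty_like(power)
    for k, f in enumerate(freqs_hz):
        half = 0.5 * erb_bandwidth(f)
        in_band = np.abs(freqs_hz - f) <= half   # always includes bin k itself
        smoothed[k] = power[in_band].mean()
    return np.sqrt(smoothed)

fs = 48000.0
n_bins = 513
freqs = np.linspace(0.0, fs / 2.0, n_bins)
# Toy comb-like magnitude response with a 500 Hz ripple period.
rippled = 1.0 + 0.5 * np.cos(2.0 * np.pi * freqs / 500.0)
smoothed = erb_smooth(rippled, freqs)
```

The window widens with frequency, so fine peak-and-dip structure at high frequency is averaged out while the low-frequency shape is left almost untouched.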

Figure 5. Magnitude frequency responses plot of an average ITF (256th order FIR filters) for the horizontal plane from 0 to 180° azimuth.
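Section 2 names balanced model truncation (BMT) [7] as the low-order IIR design method. The sketch below shows one common way to apply BMT to an FIR prototype: a delay-line state-space model is projected onto the dominant singular vectors of its Hankel matrix. It is an assumption-laden illustration, not the authors' code; it omits frequency warping and smoothing, and the function name `bmt_fir_to_iir` is hypothetical.

```python
import numpy as np

def bmt_fir_to_iir(h, order):
    """Reduce an FIR filter h to an IIR filter of the given order via BMT."""
    h = np.asarray(h, dtype=float)
    n = len(h) - 1
    # Delay-line state-space realisation of the FIR filter.
    A = np.diag(np.ones(n - 1), -1)
    B = np.zeros((n, 1))
    B[0, 0] = 1.0
    C = h[1:].reshape(1, n)
    D = h[0]
    # Hankel matrix built from h[1], h[2], ... (zero beyond the FIR length).
    padded = np.concatenate([h[1:], np.zeros(n)])
    idx = np.arange(n)[:, None] + np.arange(n)[None, :]
    hankel = padded[idx]
    # Keep the state directions with the largest Hankel singular values.
    _, sigma, vt = np.linalg.svd(hankel)
    v1 = vt[:order].T
    a_r = v1.T @ A @ v1
    b_r = v1.T @ B
    c_r = C @ v1
    # State space to transfer function:
    # den = det(zI - A), num = det(zI - A + B C) + (D - 1) * den.
    den = np.real(np.poly(a_r))
    num = np.real(np.poly(a_r - b_r @ c_r)) + (D - 1.0) * den
    return num, den, sigma

# Check on a 128-tap truncation of a known first-order response 1 / (1 - 0.8 z^-1).
h = 0.8 ** np.arange(128)
num, den, sigma = bmt_fir_to_iir(h, 1)
```

For this test response the first-order model recovers the pole at 0.8. With a measured ITF, the decay of `sigma` gives a rough indication of how many states are worth keeping.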

In this computational experiment, ITF filters of order 8, 6, 4, 2 and 1 were designed for individualised HRTFs and for a generalised (non-individual) HRTF response. The ITF calculated from the generalised HRTF set is compared in the horizontal plane (0 to 180° azimuth, 0° elevation). Figs. 5 and 6 show, for all 19 incident angles, decibel magnitude frequency responses of 128th order FIR filters and the corresponding 6th order IIR filter models. In Fig. 4, an example of the ITF model for different IIR filter orders is also presented for an individual HRTF at 80° azimuth (0° elevation), together with the contralateral HRTF obtained from this model. An accurate frequency match is possible with an 8th order filter, but even a first order filter is sufficient for reproducing the main effect of low-pass filtering around 5 kHz.

[Figure 4 plots. Left panel: "Original and modeled ITF, 80° azimuth". Right panel: "Original and modeled contra. HRTF, 80°". Axes: Relative magnitude (dB) versus Frequency (Hz). Curves: 128-tap orig., IIR order 8, 6, 4, 2 and 1.]

Figure 4. 128-tap FIR filter model of an ITF calculated from an individual HRTF set at 80° azimuth, and the associated ITF model for different IIR filter orders (8, 6, 4, 2 and 1) at the left. The contralateral HRTF model obtained with the different ITFs is shown at the right.

Figure 6. Corresponding 6th order IIR filters modelled from the average ITFs.

Also, when a very low order is required for the ITF model, a frequency warping with a large emphasis on low frequency is sometimes necessary in order to retain the main low-frequency structure of the ITF rather than the more complex peak and dip structure at high frequency. The use of this frequency warping can actually be seen as a steering of the filter design towards a general low-pass effect representing the main head shadow effect in the HRTFs.

3 SUBJECTIVE EVALUATION OF THE ITF MODEL

3.1 Test Specification

A series of headphone listening tests was performed and analysed to study the benefits of the presented technique. These were designed to evaluate spatial and timbral differences between binaural signals whose contralateral signals are generated from a) the normal contralateral HRTF (Fig. 1) and b) the convolution of the ipsilateral HRTF and the ITF (Fig. 2). All filter models use a generalised HRTF set. The transfer functions were applied to three monophonic test stimuli (narrow-band speech, pink noise and music) at sampling rates of 8 kHz and 48 kHz, with intended azimuth angles of 10°, 40°, 70°, 90°, 100°, 120° and 150°, all at 0° elevation. The reference HRTF model uses 24th order warped IIR filter models (warping coefficient λ = 0.65), which have been shown to retain the perceptual features of HRTFs [2] (at a sampling rate of 48 kHz). The ITF model uses 2nd and 8th order IIR filters.

3.2 Test Method

Two listening tests were employed in this experiment to improve the independence of ratings between the different test disciplines. The first required listeners to grade the absolute lateral angle of test items in the horizontal plane and to grade whether the virtual source was intra-cranial or extra-cranial. The second test involved a triple stimulus comparison of the differences between the algorithms under evaluation and the reference in terms of spatial and timbral degradation. All tests were performed in a listening space, and used reproduction equipment, conforming to the requirements stipulated in [11]. The GuineaPig2 subjective testing system [8] was used for test creation and presentation.

3.2.1 Coloration Test

The coloration test was designed to study timbral and spatial degradation with respect to the defined reference. A triple stimulus, hidden reference paradigm is employed to study these aspects. Timbral and spatial coloration introduced by the algorithms is evaluated on a 5-point impairment scale, as illustrated in Fig. 7. An efficient mixed design was created with these factors, which are compared against the reference. Nine listeners from the permanent NRC listening panel were employed.

Figure 7. GuineaPig [8] UI for the coloration test.

3.2.2 Localisation Test

A single stimulus paradigm required subjects to localise each test item in the horizontal plane and thus minimise bias effects caused by a reference sample. A 5° grid was employed in the range 0° to ±180° azimuth, using the user interface illustrated in Fig. 8. Ratings were made with reference to a visual pointer [9]. A measure of distance was graded by quantizing the perceived distance from the centre of the head to be either "inside head", "outside head", or "ill-defined". Subjects were presented with examples of the extent to which this distance property would change in the training phase. This analysis will be presented at a later date.

Figure 8. GuineaPig [8] UI for the localisation test.

3.3 Subjective Test Results

The initial analysis considers the reliability of the listeners in correctly identifying the hidden reference samples. Listeners correctly identified these samples 65.6% to 82.5% of the time. This is an indication that a) listeners have discrimination sensitivity and b) there are differences between the algorithms.

3.3.1 Coloration Test Results

A multivariate analysis of variance (MANOVA) model was employed for the primary analysis of correct ratings. The dependent variables timbre and spatial degradation were studied with the five fixed factors (sampling rate, filter order, listener, azimuth and program) including main and two-way effects. Assumptions for the MANOVA model were tested. Box's M test for equality of covariance matrices is not significant and residuals are found to be normally distributed, in accordance with the MANOVA requirements. Both the model and all factors were found to be significant for both dependent variables at a 95% level, with one exception: the factor azimuth was not significant for the timbre variable.

From Figs. 9 and 10 it can be seen that maximum degradation for both timbre and spatial quality occurs with the 2nd order filter at 48 kHz. Pink noise is always the most critical signal, which suggests that it is not ideally representative of real signals for quality evaluations. Speech is the least critical signal due to its intrinsic bandwidth. Acceptable performance is found with the 8th order filter at both sampling rates and with the 2nd order filter at 8 kHz, with degradations of less than one grade (i.e. perceptible, but similar). However, all filters have degradations of less than two grades (i.e. slightly different), which is quite acceptable. In Fig. 11 it can be seen that a significant degradation in the perceived spatial quality occurs with 2nd order filters compared to 8th order. This is also notable for timbre. The Pearson correlation coefficient between the timbre and spatial scales is found to be 0.56 (p < 0.00), indicating a possible correlation between the two attributes.

It can be concluded that the use of listeners experienced in assessing spatial and timbral qualities is beneficial and that there is a clear association between these two perceptual domains, as discussed in [2, 10].

3.3.2 Localisation Test Results

An ANOVA including main and two-way interactions for all 6 factors (including filter type) was performed for the localisation experiment and found to be significant. It was found that filter type is not significant, indicating that for the localisation task there are no significant differences between the normal contralateral HRTF and the ITF model. Significant factors include listener, azimuth, sampling rate and program. Scatterplots of azimuth judgments versus presented angles are shown in Fig. 12 for the reference HRTF and for the 2nd and 8th order ITF models (48 kHz sampling rate, all program items). The front-back confusions have been corrected in these plots (by reflecting the azimuth responses around 90°, as discussed in [11] and [5]) for a qualitative comparison. The confusion rates were 34.9%, 31.2% and 32.3% for the 2nd order filters, the 8th order filters and the reference, respectively.

3.3.3 Discussion

The subjective test results appear to correlate well with the theoretical results discussed in Section 2. The benefit of the ITF model depends on the audio material and the sampling rate, but performance is satisfactory for many applications.
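As a rough indication of the complexity argument for the filter orders used in Section 3.1, the sketch below counts multiplications per output sample. The cost model is an assumption (a direct-form IIR filter of order N is taken to need 2N + 1 multiplications, and the extra cost of warped structures and of the delay line is ignored); the paper does not state how its own savings figure was derived.

```python
def iir_multiplies(order):
    """Multiplications per output sample for a direct-form IIR filter (assumed cost model)."""
    return 2 * order + 1

def saving_percent(hrtf_order, itf_order):
    """Relative saving of the ITF structure (Fig. 2) over two full HRTF filters (Fig. 1)."""
    traditional = 2 * iir_multiplies(hrtf_order)
    itf_based = iir_multiplies(hrtf_order) + iir_multiplies(itf_order)
    return 100.0 * (traditional - itf_based) / traditional

for itf_order in (2, 8):
    # 24th order reference HRTF filters, as in the listening test.
    print(itf_order, round(saving_percent(24, itf_order), 1))
# prints "2 44.9" and "8 32.7"
```

Under this simple count, an 8th order ITF next to a 24th order ipsilateral filter saves about a third of the multiplications, and a 2nd order ITF saves somewhat under half.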

4 CONCLUSIONS

In this paper, a novel and efficient model for binaural synthesis using HRTFs was presented. The model is based on the use of a low-order model of the interaural transfer function for the contralateral ear. The paper discusses filter design aspects and presents listening test results for the ITF model when compared with a traditional HRTF rendering method. This algorithm can enable computational savings of over 30% without significant perceptual degradation when compared to traditional methods.

References

[1] J.-M. Jot, V. Larcher and O. Warusfel, "Digital signal processing issues in the context of binaural and transaural stereophony", Audio Eng. Soc. 98th Convention, Paris, France, Feb. 1995, preprint no. 3980.
[2] J. Huopaniemi, N. Zacharov and M. Karjalainen, "Objective and subjective evaluation of head-related transfer function filter design", J. Audio Eng. Soc., vol. 47, no. 4, Apr. 1999, pp. 218-239.
[3] C. Avendano, R. O. Duda and R. Algazi, "Modeling the contralateral HRTF", Proc. Audio Eng. Soc. 16th Int. Conf., Rovaniemi, Finland, Apr. 1999, pp. 313-318.
[4] J.-M. Jot, "Etude et realisation d'un spatialisateur de sons par modeles physiques et perceptifs", Ph.D. thesis, Telecom Paris, 1992.
[5] W. G. Gardner, 3-D Audio Using Loudspeakers. Boston: Kluwer Academic Publishers, 1998.
[6] J. Huopaniemi and J. O. Smith, "Spectral and time-domain preprocessing and the choice of modeling error criteria for binaural digital filters", Proc. Audio Eng. Soc. 16th Int. Conf., Rovaniemi, Finland, Apr. 1999, pp. 301-312.
[7] J. Mackenzie, J. Huopaniemi, V. Välimäki and I. Kale, "Low-order modeling of head-related transfer functions using balanced model truncation", IEEE Signal Processing Letters, vol. 4, no. 2, pp. 39-41, Feb. 1997.
[8] J. Hynninen and N. Zacharov, "GuineaPig - A generic subjective test system for multichannel audio", Audio Eng. Soc. 106th Convention, Munich, Germany, May 1999, preprint no. 4871.
[9] M. J. Evans, "Obtaining accurate responses in directional listening tests", Audio Eng. Soc. 104th Convention, Amsterdam, Netherlands, May 1998, preprint no. 4730.
[10] N. Zacharov and J. Huopaniemi, "Results of a round robin subjective evaluation of virtual home theatre sound systems", Audio Eng. Soc. 107th Convention, New York, USA, Sept. 1999, preprint no. 5067.
[11] D. J. Kistler and F. L. Wightman, "A model of head-related transfer functions based on a principal components analysis and minimum phase reconstruction", J. Acoust. Soc. Am., vol. 91, no. 3, pp. 1637-1647, Mar. 1992.

Figure 9. Means and 95% confidence intervals for timbre degradation as a function of sampling rate, filter order and program item, averaged over azimuth.

Figure 10. Means and 95% confidence intervals for spatial degradation as a function of sampling rate, filter order and program item, averaged over azimuth.

Figure 11. Means and 95% confidence intervals for spatial degradation as a function of filter order and azimuth.

Figure 12. Scatterplots of absolute azimuth errors for all program items at 48 kHz (left plot: 2nd order ITF filters, centre plot: 8th order ITF filters and right plot: reference HRTF).
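The front-back correction applied to the localisation responses (reflection about 90°, Section 3.3.2) can be sketched as follows. The decision rule is an assumption: a response counts as a front-back confusion when its mirror image about 90° lies closer to the target than the response itself. The trial data are hypothetical and angles are absolute azimuths in degrees.

```python
def correct_front_back(target_deg, response_deg):
    """Reflect a response about 90 degrees if that brings it closer to the target."""
    reflected = 180.0 - response_deg
    if abs(reflected - target_deg) < abs(response_deg - target_deg):
        return reflected, True       # counted as a front-back confusion
    return response_deg, False

# Hypothetical (target, response) pairs, not data from the listening test.
trials = [(40, 150), (40, 45), (120, 50), (90, 90)]
results = [correct_front_back(t, r) for t, r in trials]
corrected = [angle for angle, _ in results]
flags = [flag for _, flag in results]
confusion_rate = 100.0 * sum(flags) / len(flags)
```

With these four trials, two responses are reflected (150° becomes 30° and 50° becomes 130°), giving a confusion rate of 50%.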
