Perceptually Motivated Processing for Spatial Audio Microphone Arrays

Audio Engineering Society

Convention Paper

Presented at the 115th Convention, 2003 October 10–13, New York, New York

This convention paper has been reproduced from the author's advance manuscript, without editing, corrections, or consideration by the Review Board. The AES takes no responsibility for the contents. Additional papers may be obtained by sending request and remittance to Audio Engineering Society, 60 East 42nd Street, New York, New York 10165-2520, USA; also see www.aes.org. All rights reserved. Reproduction of this paper, or any portion thereof, is not permitted without direct permission from the Journal of the Audio Engineering Society.

Christoph P. A. Reller¹, Malcolm O. J. Hawksford²

¹ Centre for Audio Research and Engineering, University of Essex, UK CO4 3SQ, [email protected]

² Centre for Audio Research and Engineering, University of Essex, UK CO4 3SQ, [email protected]
ABSTRACT

A preliminary study is presented that investigates processing of microphone array signals for multi-channel recording applications. Generalised, perceptually and acoustically based approaches are taken, where a spaced array of M microphones is mapped to an array of L loudspeakers. The principal objective is to establish transformation matrices that integrate both microphone and loudspeaker array geometry in order to reproduce a subjectively accurate illusion of the original soundfield. Techniques of acoustical vector synthesis and plane wave reconstruction are incorporated at low frequencies, migrating to an approach based upon head-related transfer functions (HRTFs) at higher frequencies. Error surfaces based on the HRTF reconstruction error are used to assess perceptually relevant solutions. Simulation results presented in a 5-channel format are calculated for processed audio material with and without acoustical boundary reflections.

1 INTRODUCTION

In the past few years numerous approaches have been undertaken to design systems that provide a more or less realistic virtual 2- or 3-dimensional image of a sound scene for a variety of listening conditions [1]. Constraints such as the limited number of transducers, room interference, listening area requirements, inter-individual differences and others make this a non-trivial task. The preliminary study presented here strives for an approach that records an acoustical event using an arbitrarily shaped microphone array in the horizontal plane and reconstructs an image event on a loudspeaker array for a single listener. In order to accommodate realistic situations, localisation and overall perceived sound quality should degrade gracefully for out-of-origin listening positions.

The simple case of linear filters between each microphone and every loudspeaker is taken. The design method for these is called the re-encoding principle in some papers [1]; here it is referred to as reconstruction. The idea is to start with an acoustically or perceptually relevant event, e.g. a plane wave, which is recorded by the microphones. The filters are then adjusted such that during playback the event is restored (matches the original) as closely as possible.

First, reconstruction of plane waves is considered, and simple approaches to partially match the acoustic vector at the origin are given to yield solutions for the low frequency band. Second, reconstruction of ear signals by means of HRTFs (head-related transfer functions) is introduced. This approach differs from the general transaural cancellation techniques [2] but equally aims at reconstructing the signals perceived. In most situations the resulting filters have an integrating characteristic with infinite gain at DC. A partial remedy is given using a combined approach incorporating both acoustical and HRTF reconstruction.

The problem will usually be either over- or underdetermined, such that often only approximate solutions for the filters can be achieved. The resulting error is assessed graphically and numerically in terms of the HRTF reconstruction error, which is believed to be especially relevant from a perceptual point of view and is plotted as an error surface. A conventional stereo system and a 3/2 surround system are also assessed in this way.

In order to validate the theoretical results and to test for robustness in real-life situations, audio material was processed to be played back on a 5-loudspeaker array for subjective assessment. In a first step, a monophonic sound source tracing a predetermined path was recorded using software simulation. This was done both with and without acoustical reflections. In a second step the microphone signals were filtered in order to produce the five loudspeaker signals. A selection of microphone arrays and corresponding filters were thus compared with the simulation of a traditional 3/2 array recording. Figure 1-1 schematically summarises the elements this project comprises. The acoustical error is not assessed here because it cannot be made visible as simply as the HRTF error. Some possibilities for extensions and refinement are given towards the end.

Figure 1-1 – Overview of process: anechoic and reverberant material are processed by the computed filter matrix A(f); the acoustical and HRTF reconstruction errors are assessed by graphical and numerical evaluation and by subjective listening tests.

2 DEFINITIONS AND ACOUSTICAL CONCEPTS USED

Although only 2-dimensional geometries are considered here, some parts of the system would allow a straightforward expansion to the third dimension. Figure 2-1 shows the conventions of the polar coordinate system used here. M microphones located at polar coordinates (rm, φm) sample the soundfield in space, where the subscript m indicates the microphone number (m = 1, 2, …, M). L loudspeakers are used during playback, located at angles φl (l = 1, 2, …, L). Their distances from the coordinate origin are assumed to be equal and large enough such that their radiation can be approximated by a plane wave.

Figure 2-1 – Coordinate Conventions

2.1 The Filtering Matrix

For an acoustical event, e.g. a plane wave, the microphones record input signals Im(f) in the frequency domain. These are filtered by M⋅L filters Aml(f) connecting microphone m to loudspeaker l in order to produce the output loudspeaker signals Ol(f). Writing the microphone and loudspeaker signals as row vectors,

    I(f) = [I1(f)  I2(f)  …  IM(f)]   and   O(f) = [O1(f)  O2(f)  …  OL(f)],

and the filters as a matrix,

    A(f) = [ A11(f)  A12(f)  …  A1L(f)
             A21(f)  A22(f)  …  A2L(f)
               ⋮       ⋮     ⋱     ⋮
             AM1(f)  AM2(f)  …  AML(f) ]

the filtering operation can be described as a matrix multiplication in the frequency domain:

    O(f) = I(f)⋅A(f)   (1)

This system of equations can be extended to include several acoustical events, e.g. several plane waves with different incident angles. This is done by adding corresponding rows to the vectors O(f) and I(f), thus converting them to matrices.
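As a minimal illustration of equation (1), the MATLAB sketch below (MATLAB being the implementation language used in this study) applies a filter matrix bin by bin. All dimensions and the random spectra are assumed placeholders, not values from the paper.

    % Minimal sketch of equation (1): apply A(f) one frequency bin at a time.
    N = 512; M = 5; L = 5;                   % assumed example dimensions
    I = randn(N, M) + 1j*randn(N, M);        % placeholder microphone spectra
    A = randn(M, L, N) + 1j*randn(M, L, N);  % placeholder filter matrix A(f)
    O = zeros(N, L);
    for n = 1:N
        O(n, :) = I(n, :) * A(:, :, n);      % O(f) = I(f) * A(f) per bin
    end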

2.2 Plane Waves

A basic but convenient acoustical event is a plane wave with incident angle φx carrying a signal S(f). Solving the Helmholtz equation [3] can yield complex-valued plane waves whose surfaces of equal pressure are planes travelling with a velocity c = 343 m/s in the direction (φx − π). The acoustic pressure p(r, f) can thus be written in complex form:

    p(r, f) = S(f)⋅e^(−jk⋅r)

In this equation k = ω/c is the wave number and the vector k points in the direction of incidence φx (ω = 2πf is the radian frequency). The actual pressure, being a real number, is simply the real part; the analytic (complex) description is however easier to work with. In the 2-dimensional cylindrical coordinate system a plane wave is thus described by:

    p(r, φ, f) = S(f)⋅e^(−jkr cos(φ−φx))   (2)

2.3 Microphone Signals

The microphone signal Im(f) depends on the location (rm, φm), on the directivity function Dm(φ) of the microphone and on the parameters of the acoustical event. Microphones considered here are assumed to record a combination of the acoustic pressure p(rm, φm, ω) and its gradient. The acoustic velocity v(r, ω) is related to the pressure as [4]:

    ∇p(r, ω) = −jωρ0 v(r, ω)

where ρ0 is the ambient air density. In cylindrical coordinates the pressure gradient recorded by a microphone at (rm, φm) pointing outwards thus is:

    ∂p(r, φ, ω)/∂r = jωρ0 vn(r, φ, ω)

A general first-order microphone is assumed to record the signal:

    Im(φ, ω) = α⋅p(rm, φm, ω) + (1−α)⋅jωρ0 vn(r, φ, ω),   α ∈ [0, 1]

The filtering Pm(φx, f) performed on the signal S(f) carried by a plane wave when recording thus results in the microphone signal:

    Im(φx, f) = S(f)⋅Pm(φx, f) = S(f)⋅[α + (1−α)⋅cos(φm−φx)]⋅e^(−jkrm cos(φm−φx))   (3)

where the term in the complex exponent represents the time delay with respect to the origin, and the directivity function Dm(φx) is the modulus of Pm(φx, f):

    Dm(φx) = α + (1−α)⋅cos(φm−φx)

For α = 1 the microphone records the pressure only and has an omnidirectional response, whereas for α = 0 only the pressure gradient is recorded, resulting in a cosinusoidal directivity called figure of eight. For values α ∈ (0, 1) the directivity blends between the two extremes. The case α = 0.5 results in what is known as a cardioid and α = 0.25 in a hypercardioid response (see figure 2-2).
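The sketch below evaluates equation (3) for one microphone over a frequency axis; the geometry values are assumed examples rather than values taken from the paper.

    % Minimal sketch of equation (3): response seen by one first-order
    % microphone for a plane wave arriving from phi_x.
    c     = 343;                       % speed of sound [m/s]
    f     = (0:255)' * 44100/512;      % assumed frequency axis [Hz]
    k     = 2*pi*f / c;                % wave number
    alpha = 0.5;                       % cardioid directivity (section 2.3)
    r_m   = 0.072; phi_m = 2*pi/5;     % assumed microphone position
    phi_x = pi/3;                      % assumed incidence angle
    D_m = alpha + (1 - alpha)*cos(phi_m - phi_x);    % directivity D_m(phi_x)
    P_m = D_m * exp(-1j*k*r_m*cos(phi_m - phi_x));   % equation (3), with delay
    % The recorded spectrum is then I_m = S .* P_m for a source spectrum S(f).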

Figure 2-2 – Microphone Directivities (polar patterns for α = 1: omnidirectional; α = 0.5: cardioid; α = 0.25: hypercardioid; α = 0: figure of eight)

2.4 The Acoustic Vector

The term acoustic vector (called velocity vector in [1]) is used here only to describe frequency-independent effects, in contrast to its more general meaning in other papers. It can best be visualised as the DC or very low frequency component of an acoustical event. The acoustic vector of a plane wave (equation 2) therefore is just equal to v = −S⋅k̂, where S is the frequency-independent signal the wave is carrying and k̂ is the unit vector of k. With reference to figure 2-3, the acoustic vector of the loudspeaker-generated field can be defined as follows:

    v = −(O⋅RL) / (Σ_{l=1}^{L} Ol)

where the unit vectors r̂l in RL = [r̂1 r̂2 … r̂L]^T point in the direction of the loudspeakers.

Figure 2-3 – Acoustic Vector v for Loudspeaker Field
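The sketch below evaluates this definition for an assumed set of frequency-independent loudspeaker gains; the layout and gain values are placeholders.

    % Minimal sketch of the acoustic vector of a loudspeaker-generated field.
    phi_l = 2*pi*(0:4)'/5;              % assumed regular 5-speaker layout
    O_l   = [1; 0.5; 0; 0; 0.5];        % assumed frequency-independent gains
    R_L   = [cos(phi_l) sin(phi_l)];    % unit vectors towards the speakers
    v     = -(O_l' * R_L) / sum(O_l);   % 2-D acoustic (velocity) vector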

2.5 HRTFs

A head-related transfer function He(φ, f), similarly to the filtering functions Pm(φ, f) of the previous section, describes the filtering performed by the pinna and torso, i.e. the ratio of the ear signal to the signal S(f) carried in the plane wave [5]. In order to accommodate as many ears as required, the index e = 1, 2, …, E denoting the number of the ear is introduced. Although in a traditional measuring setup HRTFs are measured using a sound source at finite distance [6], they approximate the filtering performed on a plane wave (a source at infinite distance). It should not be forgotten that any system working with HRTFs suffers from inter-individual differences in these functions of several dB [1]. These affect especially the interaural intensity difference (IID) (increasingly above 1 kHz) and, to a lesser extent, the interaural time difference (ITD). The ITD is believed to be perceptually predominant for localisation at low frequencies up to 700 Hz, where no excess phase between the ears occurs. ITDs give rise to what are called the cones of confusion, the 3-dimensional regions in which the ITDs are the same. This is the main effect behind front-back confusion and wrongly perceived elevation when listening on a horizontal loudspeaker layout. The IID seems to be the main localisation factor above 700 Hz. Since the spread in inter-individual differences increases with frequency, synthesising generic ear signals above a certain limit is difficult and can result in sound coloration.

The HRTFs used here are derived using a Fourier transform of the KEMAR HRIRs (head-related impulse responses) taken from [6]. Since the intention is to use them for direct sound and reverberation, they are diffuse-field equalised as described in [7]:

    Heq(φi, f) = ( |H(φi, f)| / sqrt( (1/I) Σ_{i=1}^{I} |H(φi, f)|² ) ) ⋅ e^(j∠H(φi, f))

The denominator is the average power of all HRTFs covering all incident angles. Theoretically the incident angles should cover the whole 3-dimensional space and not just the horizontal plane, but this effect was assumed to be negligible compared to inter-individual differences. This equalisation ensures that a diffuse sound field, as occurs in the late part of reverberation, is synthesised correctly, and it also extracts more of the interaural differences in the transfer functions.
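The sketch below applies this equalisation to an assumed HRTF set; random values stand in for the measured data.

    % Minimal sketch of the diffuse-field equalisation above.  H is an
    % assumed I-by-N matrix of HRTFs (I incident angles, N frequency bins).
    I = 72; N = 256;
    H = randn(I, N) + 1j*randn(I, N);          % placeholder measured HRTFs
    dfMag = sqrt(mean(abs(H).^2, 1));          % diffuse-field magnitude, 1-by-N
    Heq = (abs(H) ./ repmat(dfMag, I, 1)) ...  % magnitude normalised per bin,
          .* exp(1j*angle(H));                 % phase left untouched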

3 FILTER DESIGN BY RECONSTRUCTION

It is the aim of a playback system to reconstruct some quantity that was present in the original recording situation. This can usually not be done perfectly because the system is likely to be overdetermined. Therefore an error variable is added to the equation, and the objective is to minimise this error.

The approach taken by wavefront synthesis and higher order Ambisonics is to strive for a reconstruction of the complete sound field within some area or volume [4]. Rayleigh's integrals [3][4] provide a means of reconstructing the pressure within a volume from knowledge of either the pressure or the normal velocity on its surface. The 2-dimensional equivalent uses an area and its contour [4]. In a uniform circular array this contour is sampled at equal distances, and for a sufficiently large radius the Shannon sampling criterion applies:

    λmin = 2dmax  →  fmax = c / (2dmax)   (4)

where dmax is the maximal distance between two adjacent microphones in the array and λ is the wavelength. Although spatial aliasing occurs at frequencies above fmax, the perceived degradation seems to be graceful [8][9].

In higher order Ambisonics the sound field is decomposed into spherical or cylindrical harmonics, the latter for the 2-dimensional case, where the maximal order N recordable with first-order microphones is given by [9]:

    N < M/2   (5)

This restricts the angular resolution of the system. As in wavefront synthesis, spatial aliasing occurs on the estimated cylindrical harmonic spectrum. This effect seems to degrade the performance gradually for off-centre positions and high frequencies.

Given a restriction on the number of transducers, faithful reproduction of the complete soundfield is not feasible in most situations. This is where transaural methods can provide a possible solution [2]. They aim at reconstructing the ear signals only, arguing that these are the ones perceived. Once the ear signals are known, crosstalk cancellation schemes can be used to re-establish them by means of loudspeakers. Theoretically the number of ear signals achievable is limited by the number of loudspeakers:

    E ≤ L   (6)

Apart from the deficiencies of HRTF-related approaches already mentioned, transaural methods assume that the listener is at an exactly known location. For frequencies above e.g. 1 kHz the dislocation producing a 90° phase shift can be as small as 0.09 m.

3.1 Plane Wave Reconstruction

The pressure p(r, φ, f) due to the original plane wave with incident angle φx carrying the signal S(f) is given by equation (2):

    p(r, φ, f) = S(f)⋅e^(−jkr cos(φ−φx))   (2)

The recorded signals therefore are:

    I(f) = S(f)⋅PM(f)   (7)

where the spatial propagation vector including the directivity functions Dm of the microphones is:

    PM(f) = [ D1⋅e^(−jkr1 cos(φx−φ1))  …  DM⋅e^(−jkrM cos(φx−φM)) ]

If the loudspeakers are assumed to generate plane waves, the pressure p′(r, φ, f) produced by them is a superposition of L plane waves and can be written as the vector product:

    p′(r, φ, f) = O(f)⋅PL(r, φ, f)   (8)

where

    PL(r, φ, f) = [ e^(−jkr cos(φ−φ1))  …  e^(−jkr cos(φ−φL)) ]^T

is a propagation vector and O(f) contains the loudspeaker signals as defined in section 2.1. According to equation (1) the latter can be expressed in terms of I(f) and A(f) such that:

    p′(r, φ, f) = I(f)⋅A(f)⋅PL(r, φ, f)   (9)

Using equation (7) the synthesised pressure can ultimately be expressed as:

    p′(r, φ, f) = S(f)⋅PM(f)⋅A(f)⋅PL(r, φ, f)   (10)

Reconstruction means setting the original event in equation (2) equal to the synthesised one in equation (10):

    e^(−jkr cos(φ−φx)) = PM(f)⋅A(f)⋅PL(r, φ, f) + εA(r, φ, f)

where an error εA(r, φ, f) has been added to accommodate imperfect reconstruction. This system of equations can be extended by adding rows to the left-hand side, to PM(f) and to εA(r, φ, f), such that several plane waves with incident angles φx (x = 1, 2, …, X) are reconstructed:

    PX(r, φ, f) = PM(f)⋅A(f)⋅PL(r, φ, f) + εA(r, φ, f)   (11)

PX(r, φ, f) is a propagation vector equivalent to PL(r, φ, f). One way to proceed from here would be to choose some strategically placed points (r, φ) in PX(r, φ, f) and PL(r, φ, f) at which sound field reconstruction would be aimed. But equation (11) can be simplified further by using a phase-mode representation of the plane wave, thus decomposing it into complex cylindrical harmonics (see appendix A). The Bessel functions in this expansion, being dependent solely on kr, cancel on both sides, leaving:

    PX(φ) = PM(f)⋅A(f)⋅PL(φ) + εA(φ, f)   (12)

where

    PL(φ) = Σ_{n=−∞}^{∞} [ e^(jn(φ−φ1))  e^(jn(φ−φ2))  …  e^(jn(φ−φL)) ]^T

and PX(φ) is defined equivalently. This equation can now be solved approximately for A(f) using pseudo inverses:

    A(f) = pinv(PM(f))⋅PX(φ)⋅pinv(PL(φ))   (13)
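The sketch below evaluates equation (13) at a single frequency, truncating the phase-mode representation at order Nh (truncation is one of the options mentioned in this section). Geometry and parameter values are assumed examples; pinv is MATLAB's built-in Moore-Penrose inverse.

    % Minimal sketch of equation (13) at one frequency.
    c = 343; f0 = 500; k = 2*pi*f0/c; alpha = 0.5; Nh = 2;
    M = 5;  phi_m = 2*pi*(0:M-1)'/M;  r_m = 0.072;   % microphone array
    L = 5;  phi_l = 2*pi*(0:L-1)'/L;                 % loudspeaker angles
    X = 10; phi_x = 2*pi*(0:X-1)'/X;                 % matched incidence angles
    % P_M: X-by-M, one propagation row per matched plane wave (equation (7))
    dphi = repmat(phi_x, 1, M) - repmat(phi_m', X, 1);
    PM   = (alpha + (1 - alpha)*cos(dphi)) .* exp(-1j*k*r_m*cos(dphi));
    % Truncated phase-mode matrices standing in for P_X(phi) and P_L(phi)
    n   = -Nh:Nh;
    PXm = exp(-1j * phi_x * n);          % X-by-(2*Nh+1)
    PLm = exp(-1j * phi_l * n);          % L-by-(2*Nh+1)
    A   = pinv(PM) * PXm * pinv(PLm);    % M-by-L filter matrix at this frequency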

pinv(…) stands for the Moore-Penrose pseudo inverse, defined as

    pinv(A) = ( (A^T)*⋅A )^(−1) ⋅ (A^T)*

where * stands for the complex conjugate [10]. It has the property of minimising the least square error and at the same time the norm of the result. The former is important for the case of an overdetermined system, where the error has to be minimised; the latter minimises the power in A(f) and with it also the power in the loudspeaker signals O(f) [2].

If all incident angles of the plane waves are set to the loudspeaker angles, φx = φl, then the propagation matrices PL(φ) and PX(φ) in equation (12) are the same and thus cancel, leaving an L by L identity i on the left side:

    i = PML(f)⋅A(f) + εA(φ, f)   (14)

    A(f) = pinv(PML(f))

As described in [9] and [12], the resulting filters A(f) exhibit an integrating character at low frequencies with infinite gain at DC (see graph 4-5 for an example). Consequently they cannot be used directly. According to [9] the reason for this lies in the farfield representation of the loudspeakers. A method using the nearfield radiation of the loudspeakers and of the acoustical events to be reconstructed (e.g. sources located at (rx, φx)) therefore leads to valid solutions [9], which apparently work well but are not considered further here. Another way of dealing with this problem could be to truncate the order N of the cylindrical harmonics in PL(φ) and PX(φ) in a frequency dependent way.

3.2 Acoustic Vector Reconstruction

For low frequencies the microphone array can, as a very rough approximation, be assumed to be coincident because the wavelengths are much greater than the extent of the array. Therefore a frequency-independent reconstruction applying the acoustic vector can be used for this frequency band.

3.2.1 Acoustic Vector

All complex exponent terms in equations (12) and (14) become equal to 1 and the reconstruction equation is:

    i = PM⋅A + εA

where i is an L by L identity and PM has L rows. The phase terms in PM disappear, leaving only the directivity functions Dm. For α = 1 only omnidirectional microphones are used and the system cannot be solved, reflecting the fact that, based on the coincidence assumption, all microphones record the same signal. Because of the cosine function in Dm the rank of PM is half its smallest extent, such that the system is underdetermined for M ≤ 2. The resulting pseudo inverse A = pinv(PM) actually matches the 2-dimensional acoustic vector at the origin for L ≥ 2 and M ≥ 2 [11]. For regular array layouts the real gains Aml can be shown to be a kind of circular sinc function with argument (φm − φl) [12][9]. These can in general take positive and negative values (see figure 3-1 for an example) in order to match the acoustic vector as defined in section 2.4, and therefore negative gains (phase inversion) can result.
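The sketch below builds the DC propagation matrix from the directivities and takes its pseudo inverse, as described above; a regular layout is assumed as an example.

    % Minimal sketch of section 3.2.1: at DC only the directivities remain.
    M = 5; phi_m = 2*pi*(0:M-1)'/M; alpha = 0.5;
    L = 5; phi_l = 2*pi*(0:L-1)'/L;
    PM = alpha + (1 - alpha)*cos(repmat(phi_l, 1, M) - repmat(phi_m', L, 1));
    A  = pinv(PM);   % M-by-L real gains; rows resemble circular sinc functions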

Figure 3-1 – Circular sinc Function of Order 7 (csinc7(φ) plotted over φ [°])

3.2.2 One-To-One Mapping

A more intuitive approach was derived from the traditional 3/2 surround microphone array, where for five loudspeakers the connection matrix A is simply an identity scaled by a constant factor. This "one-to-one" connection was extended to arbitrary numbers of microphones and loudspeakers using the following set of rules (a minimal sketch follows the list):

• Each microphone is connected to the two adjacent loudspeakers, or to one if it is located at the same angle.
• The two gains are obtained by using a panning function, e.g. calculating the ratio of the angular difference between the microphone and the loudspeaker to the angle between the two loudspeakers.
• An overall factor related to the number of microphones and loudspeakers is applied to adjust the low frequency match error to 0.

This mapping ensures that no negative gains occur, and for arrays with M > L many of the gains Aml are set to 0.
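The sketch below is one possible reading of these rules, assuming linear panning between the two loudspeakers adjacent to each microphone; the overall factor g0 is an assumption (the text adjusts it so that the low-frequency match error becomes 0).

    % Minimal sketch of the one-to-one mapping rules above.
    M = 5; phi_m = 2*pi*(0:M-1)'/M;
    L = 5; phi_l = 2*pi*(0:L-1)'/L;
    g0 = 1;                                % assumed overall factor
    A  = zeros(M, L);
    for m = 1:M
        d = mod(phi_l - phi_m(m), 2*pi);   % angle from mic m to each speaker
        [dccw, iccw] = min(d);             % nearest speaker counter-clockwise
        [dcw,  icw ] = min(mod(-d, 2*pi)); % nearest speaker clockwise
        if dccw == 0
            A(m, iccw) = g0;               % same angle: single connection
        else
            w = dccw + dcw;                % angle between the two speakers
            A(m, iccw) = g0 * dcw  / w;    % the closer speaker gets more gain
            A(m, icw ) = g0 * dccw / w;
        end
    end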

3.2.3 Raised Cosine Window

In a general case a microphone signal can be fed to any loudspeaker with a gain obtained from an angular window function. A raised cosine window was used in a similar manner as in the section above but covering the complete 360° space. Since equally spaced sinusoidal functions add to a constant (as the angular sinc functions also do), the overall attenuation factor for regular layouts is 1/(M + L). In contrast to reconstruction of the acoustic vector this approach only makes sense for uniform arrays, but it results in only positive gains A. Phase-inverted loudspeaker signals can be a disadvantage when large listening areas are considered [1].

3.3 HRTF Reconstruction

For several ears and incident angles φx (x = 1, 2, …, X) the HRTFs are combined in a matrix:

    HX(f) = [ H1(φ1, f)  H2(φ1, f)  …  HE(φ1, f)
              H1(φ2, f)  H2(φ2, f)  …  HE(φ2, f)
                 ⋮           ⋮       ⋱      ⋮
              H1(φX, f)  H2(φX, f)  …  HE(φX, f) ]

Setting the angles equal to the loudspeaker incident angles φx = φl yields the equivalent matrix HL(f). The ear signals e(f) due to the original plane waves with incident angles φx carrying signals S(f) are given by the HRTFs:

    e(f) = S(f)⋅HX(f)

The ear signals e′(f) due to the plane waves generated by the loudspeakers can be written as:

    e′(f) = O(f)⋅HL(f)

HX(f) and HL(f) are equivalent to the propagation matrices PX(φ) and PL(φ) in section 3.1 insofar as both relate plane waves to the acoustical event to be reconstructed. Substituting for O(f) and then for I(f):

    e′(f) = I(f)⋅A(f)⋅HL(f)
    e′(f) = S(f)⋅PM(f)⋅A(f)⋅HL(f)

Now the original and the synthesised ear signals are set equal and the reconstruction error εH(f) is added:

    HX(f) = PM(f)⋅A(f)⋅HL(f) + εH(f)   (15)

This equation is the equivalent of the plane wave reconstruction equation (12) and can be interpreted as synthesising (usually many) HRTFs HX(f) from the (usually fewer) HRTFs HL(f). The filters can again be approximated as the least square solution:

    A(f) = pinv(PM(f))⋅HX(f)⋅pinv(HL(f))   (16)

If the matching angles are set to the loudspeaker angles φx = φl, then HX(f) and HL(f) are equal and thus cancel, yielding the same reconstruction equation as for the plane wave matching (equation (14)).
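The sketch below evaluates equation (16) at one frequency. HX (X-by-E) and HL (L-by-E) would come from the equalised HRTF set of section 2.5 and PM would be built as in section 3.1; random placeholders stand in for all of them here.

    % Minimal sketch of equation (16) at one frequency.
    X = 10; E = 6; L = 5; M = 5;
    PM = randn(X, M) + 1j*randn(X, M);   % placeholder propagation matrix
    HX = randn(X, E) + 1j*randn(X, E);   % placeholder target-angle HRTFs
    HL = randn(L, E) + 1j*randn(L, E);   % placeholder loudspeaker-angle HRTFs
    A  = pinv(PM) * HX * pinv(HL);       % M-by-L least-square filter matrix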

Due to the farfield representation in PM(f), the resulting filters A(f) also have an integrating character at low frequencies with infinite DC gain. The solution to this would again be a nearfield calculation, placing the sound sources at points (rx, φx) at finite distance. Given the way HRTFs are measured (using finite-distance sources), this can only improve the reconstruction. Theoretically it would even be possible to use truly 2-dimensional (or even 3-dimensional) HRTFs H(r, φ, ω) (or H(r, φ, θ, ω) with the elevation angle θ) to reconstruct proximity effects. These must however be expected to be minor and to depend strongly on the HRTFs of the listener. The integrating character of the filters does not pose problems at higher frequencies, such that in this region they yield valid results.

3.4 Combined Approach

Because of their differing virtues, it would be desirable to combine both acoustical and HRTF reconstruction. Ideally this should be done in a frequency-dependent way, yielding optimised trade-offs. Acoustical reconstruction, being feasible at low frequencies, is valid over an area of some extent, whereas HRTF reconstruction is theoretically possible up to the highest frequency but only for a very constrained position of the listener. The combination could yield a system that degrades gracefully with out-of-origin location and exhibits sharp localisation in ideal conditions. Because of the similarity of equations (12) and (15) they can be combined into a single expression:

    [PX(φ)  HX(f)] = PM(f)⋅A(f)⋅[PL(φ)  HL(f)] + [εA(φ, f)  εH(f)]   (17)

    A(f) = pinv(PM(f))⋅[PX(φ)  HX(f)]⋅pinv([PL(φ)  HL(f)])   (18)

If the matching angles φx are placed in between the loudspeaker angles φl and the acoustical reconstruction used is stable, then stable results can be achieved despite the unstable characteristic exhibited by the pseudo inverse of PM(f).

In the methods based on the acoustic vector, the reconstruction equations have a slightly different form, or the gains A are found using one of the procedures described above. In order to be able to use a combined approach with HRTFs nevertheless, the product PML(f)⋅A is calculated first, where A (without frequency argument) denotes the gains obtained in a first step through an acoustic vector method and must not be mixed up with A(f), the unknown filters. PML(f) is the microphone propagation matrix for φx = φl. This yields the reconstruction equation:

    [PML(f)⋅A  HX(f)] = PM(f)⋅A(f)⋅[i  HL(f)] + [εA(φ, f)  εH(f)]   (19)

    A(f) = pinv(PM(f))⋅[PML(f)⋅A  HX(f)]⋅pinv([i  HL(f)])   (20)

Another alternative applied in this study is a 2-band processing approach where the combined reconstruction is used for high frequencies and an acoustical reconstruction at low frequencies. The calculation of A(f) is done twice and the values are then crossfaded in the transition band using some suitable function, e.g. a raised cosine or a straight line (a minimal sketch of this crossfade follows).
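The sketch below crossfades two assumed filter sets with a raised cosine centred on 800 Hz with a 400 Hz transition width (the values used later in section 4); the filter sets themselves are placeholders.

    % Minimal sketch of the 2-band processing above.
    f  = (0:255)' * 44100/512;                     % assumed frequency axis [Hz]
    N  = numel(f); M = 5; L = 5;
    A_lo = randn(M, L, N); A_hi = randn(M, L, N);  % placeholder filter sets
    fc = 800; bw = 400;
    x  = min(max((f - (fc - bw/2))/bw, 0), 1);     % 0 below, 1 above the band
    w  = 0.5*(1 - cos(pi*x));                      % raised-cosine weight
    A  = zeros(M, L, N);
    for nn = 1:N
        A(:, :, nn) = (1 - w(nn))*A_lo(:, :, nn) + w(nn)*A_hi(:, :, nn);
    end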

4 EVALUATION

This section describes the methods used to assess the theory developed in section 3. First of all, most parts of the system were implemented in the MATLAB¹ script language with a graphical user interface (see appendix C for a screen shot). In order to validate the outcomes it was decided to process some audio material with the aim of running subjective listening tests. This was done to test the assumptions made and the approaches and ideas developed in a real-life situation.

The development so far does not include any optimisation regarding parameters other than A(f); therefore some sensible starting points for the other parameters had to be chosen. Due to physical constraints the loudspeaker array was limited to 5 elements. Although the theory allows the microphone array to take any arbitrary shape in the horizontal plane, uniform circular arrays were chosen exclusively, mainly to reduce the parameter space by one dimension. The diameter of the array was chosen to be near the (acoustical) distance between the ears, obtained by inspection from the HRIR database, yielding a radius rm = 0.072 m. Although different values α for the microphone directivity function were investigated, the value presented here is exclusively α = 0.5. Differing values exhibited only quantitative differences, and by choosing 0.5 the pressure and its gradient are given equal weight.

¹ MATLAB is a trademark of The MathWorks Inc.

An interesting parameter to fine-tune was the set of angles φx and their number, because this enables the reconstruction to be done in an incident-angle-dependent way.

4.1 Visual and Numerical Assessment

For well controlled listening situations the HRTF reconstruction error εH(f) is believed to be the most relevant from a perceptual point of view. The least square approximation in equations (16), (18) and (20) minimises this quantity. It is instructive to assess it as an error surface even if it is the error εA(f) that is being minimised (equations (13), (18) and (20)).

4.1.1 Measures of Error

Once the filter functions A(f) are computed using one of the mentioned methods, they are used to find the reconstructed HRTFs for a left ear at φe = 90°. The resulting Hres(φ, f) is compared with the original target function Htar(φ, f) by calculating the magnitude difference

    εmag,dB(φ, f) = 20 log|Hres(φ, f)| − 20 log|Htar(φ, f)|

and the phase difference

    εphase(φ, f) = ∠Hres(φ, f) − ∠Htar(φ, f)

From a perceptual point of view the magnitude difference is believed to be more significant than the phase difference, especially at high frequencies [1][5]. It seems logical that it is the interaural difference (ID) which produces the strongest localisation effect [13]. By defining the ID as the difference of the HRTFs for the left and the right ear,

    HID(φ, f) = H(φ, f) − H(−φ, f),

the error in ID is obtained:

    εID = (Hres(φ, f) − Hres(−φ, f)) − (Htar(φ, f) − Htar(−φ, f)) = ε(φ, f) − ε(−φ, f)

Although implemented in the MATLAB script, the plots shown here are solely of the type εmag,dB because the ID does not seem to be more revealing and is more difficult to relate to what is heard in the listening tests. (See the screen shot of the graphical user interface in appendix C for an example of an ID plot.) To make any dependence of the surface on the angle φ and on f visible, the rms and mean values are calculated as well:

    εrms(φ) = sqrt( (1/(f2 − f1)) ∫_{f1}^{f2} ε²mag(φ, f) df )

    εmean(φ) = (1/(f2 − f1)) ∫_{f1}^{f2} εmag(φ, f) df

    εrms(f) = sqrt( (1/360°) ∫_0^{360°} ε²mag(φ, f) dφ )

    εmean(f) = (1/360°) ∫_0^{360°} εmag(φ, f) dφ

These measures are inserted along the angle and frequency axes of the surface plot. The rms and mean values are printed as dotted and solid lines respectively.

4.1.2 Acoustical Reconstruction

Surface plots of εmag,dB(φ, f) for some configurations of arrays and filters A(f) obtained with the acoustically based methods presented will be discussed in this section. Graph 4-1 shows the error for the conventional stereo loudspeaker pair recorded with two spaced cardioid microphones. Both transducer arrays are placed at ±30° and the microphones are 0.072 m apart with directivity factor α = 0.5 (cardioid). The filters in this case are obtained with the one-to-one algorithm, thus A is a diagonal matrix with only real gains, realising a traditional stereo recording setup. This configuration clearly is not capable of synthesising sounds coming from the rear (see figure 2-1 for the coordinate conventions used). Furthermore, above 3 kHz the error starts to exhibit an increasingly wider spread with fast changing excursions. This is due to the HRTF database, where the complexity of the transfer functions increases above 3 kHz. For frequencies beyond about 6 kHz, inter-individual differences and interference from the HRTF measurement setup become substantial factors.

Graph 4-1 – M=L=2, one-to-one, rm=0.072 m, α=0.5 (error surface εH,mag [dB] over incidence angle φ [°] and frequency f [Hz])
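The sketch below computes the error surface and the rms/mean summaries of section 4.1.1; the HRTF surfaces and the band limits f1, f2 are assumed placeholders.

    % Minimal sketch of the error measures of section 4.1.1.  Hres and Htar
    % are assumed P-by-N matrices (P angles, N frequency bins).
    P = 72; N = 256; f = (0:N-1)' * 44100/512;
    Hres = randn(P, N) + 1j*randn(P, N);           % placeholder surfaces
    Htar = randn(P, N) + 1j*randn(P, N);
    emag = 20*log10(abs(Hres)) - 20*log10(abs(Htar));   % eps_mag,dB(phi, f)
    band = f >= 100 & f <= 16000;                  % assumed band limits f1, f2
    erms_phi  = sqrt(mean(emag(:, band).^2, 2));   % rms over frequency
    emean_phi = mean(emag(:, band), 2);            % mean over frequency
    erms_f    = sqrt(mean(emag.^2, 1));            % rms over angle
    emean_f   = mean(emag, 1);                     % mean over angle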

Graph 4-2 shows a setup with L = M = 5 loudspeakers and microphones at equally distributed angles φl = φm = m⋅360°/M, with rm = 0.072 m and α = 0.5 for all microphones. Still the one-to-one algorithm is used, such that this configuration represents a simple uniform 5-element microphone loudspeaker array. In recommended setups neither the angles are equally distributed nor are the radii and directivity the same for all microphones [15][16]. The graph clearly shows that at low frequencies sounds coming from the rear can well be produced. With a microphone spacing of about 0.09 m the upper limit before spatial aliasing occurs is about 1.8 kHz for this array, whereas reconstruction within a 5 dB rms error value is confined to frequencies below 400 Hz. Furthermore it is noticeable that the localisation seems to be better for contralateral incidence.

Graph 4-2 – M=L=5, one-to-one, rm=0.072 m, α=0.5

Graph 4-3 shows the error surface for the same array but calculated using plane wave reconstruction with cylindrical harmonics up to the first order (N = 1) for PL(φ). The polar plot shows schematically the incident source angles φx that were used. The error has reduced somewhat, especially on the lateral side and for frontal incidence. The 5 dB rms border is now at about 900 Hz. There are however some negative-going spikes in the surface which probably indicate sound coloration.

Graph 4-3 – M=L=5, plane waves N=1, rm=0.072 m, α=0.5

The error surface for the same settings but with N = 2 for PL(φ) is shown in graph 4-4. The depiction of the frequency response Am=1,l=2(f) in graph 4-5 illustrates the integrating nature of the filter, as mentioned in section 3.1. This set of filters A(f) is thus not directly implementable. The error has improved somewhat insofar as it is flatter, and the frequency where the rms value increases beyond 5 dB is about 1.2 kHz. It must be emphasised that these acoustical methods aim for a reconstruction within a significant listening area, which stands in contrast to HRTF based methods (see next section). The error εH(φ, f) assesses only the ideal sweet spot. Increasing the order N might therefore also increase the listening area (see equation (21) in appendix A).

Graph 4-4 – M=L=5, plane waves N=2, rm=0.072 m, α=0.5

The raised cosine window approach yields similar surfaces but with a general attenuation for medium and high frequencies. Increasing the radius rm reduces the spatial aliasing frequency and thus results in the uneven high frequency excursions extending into lower bands. When rm is decreased, small improvements can be observed, but the pseudo inverse of PM(f) exhibits higher gains at low frequencies. Increasing the number of microphones M does give the capability of recording higher order cylindrical harmonics, but the playback system is still restricted to the second order.

Graph 4-5 – Am=1,l=2(f); solid line: 20 log|A(f)| [dB], dashed line: (5/2π)⋅∠A(f) [°]

4.1.3 HRTF Reconstruction

Since HRTF reconstruction aims at restoring the ear signals, rear localisation can be reproduced with only two loudspeakers. Graph 4-6 shows this for M = 5 equally distributed microphones (rm = 0.072 m and α = 0.5) and L = 2 loudspeakers at ±30°. The incident source angles φx that were matched are shown schematically in the polar plot. This reconstruction works with a total of 6 ears spaced very closely around ±90°, making the system largely overdetermined, such that the reconstruction, although imperfect, should be tolerant to small head rotations. The rms error of less than 5 dB extends up to 1 kHz and the surface is very flat. This is because it is the HRTF error εH(φ, f) which is being minimised and not the acoustical error εA(φ, f) as in the previous section. HRTF reconstruction relies on more than two microphones in order to extract the ear signals. This is different from the acoustical approach, where more microphones than loudspeakers do not principally decrease the reconstruction error. The filters A(f) are expected to be unstable at low frequency, but some trials revealed that moving the matching angles φx away from the loudspeaker angles φl reduces this effect. This might be because for φx = φl, A(f) is just the pseudo inverse of PM(f) (see equation (14)), which is again an acoustical reconstruction. As shown in graph 4-6, Am=4,l=1(f) is unstable only at very low frequencies. The high frequency part of this filter is however uneven, which could lead to sound coloration. Notice that it is possible to compensate for a lack of frontal localisation in the error surface by rearranging the angles φx. The error surface however reflects ideal listening conditions, such that it is desirable to combine HRTF with acoustical reconstruction. This is especially important when more than two loudspeakers are used because then, at low frequencies, the acoustic vector can approximately be matched in two dimensions.

Graph 4-6 – M=5 L=2, HRTF, rm=0.072 m, α=0.5; Am=4,l=1(f)

The error surface depicted in graph 4-7 is obtained using the same microphone loudspeaker layout (M = L = 5) and the same matching angles φx and ears as in the case presented in graph 4-3. The reconstruction method is a combined one using HRTF reconstruction and the one-to-one algorithm as defined by equation (20). Although the surface is almost the same as for the two-speaker layout in graph 4-6, it must be expected to be perceived differently, because the acoustic vector is reconstructed to some extent, making the system less susceptible to HRTF-related deficiencies. Due to the combined approach the functions A(f) are no longer unstable at low frequencies, which can clearly be seen in the example filter Am=3,l=2(f) (graph 4-8).

Graph 4-7 – M=L=5, HRTF & one-to-one, rm=0.072 m, α=0.5

Furthermore the impulse response am=3,l=2(t), which is the inverse Fourier transform with respect to time of Am=3,l=2(f), is depicted. The impulse response is well behaved, as in most cases where a combined HRTF and acoustical matching was used. It must be stated that in some cases, even if the magnitude frequency response looked visually stable, the corresponding impulse responses exhibited large amounts of pre- or post-ring due to the phase response of Aml(f). This was the reason for including plots showing not only the complex frequency response but also the impulse response in the visualisation software.

Graph 4-8 – Am=3,l=2(f), am=3,l=2(t)

When increasing the number of microphones, faithful HRTF reconstruction extends more and more towards the high frequencies. The integrating character of the filters A(f) however builds up rapidly and starts to extend into higher bands. As this behaviour is believed to be produced by the plane wave representation in PM(f) [9], the easiest way to reduce it seems to be to decrease the number of incident angles φx and place them in between the loudspeaker angles φl. Graph 4-9 shows the results of a combined matching (HRTF and plane wave method as defined by equation (18)) used on arrays with M = 12 and L = 5, with 7 incident angles φx and 6 ears as depicted in the polar plot. There is some improvement when comparing with the M = 5 microphones case, the 5 dB rms error border now being at about 1.7 kHz. The reduced number of matching angles is however clearly visible on the surface. Since they are located in between the loudspeaker angles, this reconstruction scheme can be expected to help make the awareness of the loudspeaker positions disappear, which stands in contrast to conventional amplitude panning approaches [17]. As shown in the frequency response of Am=4,l=2(f), the filters are however still unstable at low frequencies.

Graph 4-9 – M=12 L=5, HRTF & Plane Wave, rm=0.111 m, α=0.5; Am=4,l=2(f)

This problem was solved using two different reconstruction filters for low and high frequencies together with crossfading. For the lower band a plane wave reconstruction was applied as in equation (13), and the higher band uses the combined approach as before. The transition is calculated using a raised cosine function centred on 800 Hz with a transition width of 400 Hz. In the corresponding error surface (graph 4-10) only slight effects due to this crossfading can be seen. The function Am=4,l=2(f) is now stable at low frequencies, but the transition region can be identified in the graph.

Graph 4-10 – M=12 L=5, HRTF & Plane Wave, rm=0.111 m, α=0.5; Am=4,l=2(f)

4.2 Subjective Listening Tests

Although no decisive criteria for selecting array shapes and sizes have yet been developed, some subjective listening tests were undertaken. Their main aim was to validate some of the selected arrays and reconstruction methods with respect to real-life listening situations. Figure 4-1 shows the subset of microphone locations chosen, incorporating two regularly spaced circular arrays with M = 5 and M = 12 (a total of 16) microphones at a radius rm = 0.072 m.

Figure 4-1 – Microphone Angles Used for Processing

Since it is known that acoustical reverberation contributes to a large extent to the perceived source location and envelopment, it was decided to obtain two kinds of recordings, anechoic and reverberant, using software simulation. Moreover, localisation does seem to improve if sources move. The realisation of this poses some complication when simulating recordings because the impulse responses change with time, but convincing image sources can be expected.

Once the recordings are obtained using one of the methods described below, the resulting microphone signals I(f) need to be filtered by A(f) as described by equation (1) so as to produce the loudspeaker signals O(f). The loudspeaker arrays used are a regularly spaced 5-element array (L = 5) with angles φl = l⋅360°/L, as well as the recommended 3/2 surround setup with front speakers at 0° and ±30° and rear speakers at ±110°. Both arrays are on a circle with radius rl = 1.8 m. In order to be able to reproduce 5 signals at a time they are AC3² encoded and played back on a device integrating the AC3 decoder, the DACs and the amplifiers. Figure 4-2 schematically summarises the procedure.

Figure 4-2 – Overview Subjective Listening Tests (software simulation: point sound source and image sources → microphone array signals i1(t)…iM(t) → filters a(t) → loudspeaker signals o1(t)…o5(t) → AC3 encoder → AC3 decoder → loudspeaker array)

4.2.1 Simulated Impulse Responses in an Acoustic Environment

The software package used (CATT-Acoustic³ [18]) allows the construction of an arbitrarily shaped room using geometrical descriptions of planes. The surfaces thus created generate reverberation depending on their properties. For this project the provided tutorial theatre-type hall was used (see figure 4-4).

² AC3 is a trademark of Dolby Laboratories.
³ CATT-Acoustic is a trademark of CATT.

Source and receiver locations are added in order to calculate a complete echogram for all combinations of these. The echogram contains the full 3-dimensional information at the receiver location as the response to an impulse launched from the source. From the echogram a microphone impulse response can be calculated using an idealised frequency-independent directivity function as in equation (3). Alternatively, two ear signals can be derived using built-in HRTFs for listening tests over headphones. This possibility, potentially yielding convincing and sharply localised images, proved to be useful as a means for comparison.

In order to cope with moving objects (either sources or receivers) the software offers a "walk-through" convolver, a time-varying filter which takes as inputs a list of impulse responses with corresponding time durations indicating how long each one is to be used. Interpolating between the impulse responses, an audio file is thus filtered in a time-variant manner. Figure 4-3 schematically summarises the signal paths for reverberant and anechoic recording.

4.2.2 Anechoic Recording

An algorithm for anechoic recordings of moving sound sources was developed in MATLAB. The starting point is a sound file s(n) of a monophonic source and a description of the path in terms of its geometry at some points in time. First the path location for every sample n is calculated by interpolation using the given points. The delay between each sample positioned in space at (rn, φn) and the microphone at (rm, φm) is then obtained using the velocity of propagation c = 343 m/s. Thus non-integer arrival times narr in the sample time dimension can be found. In this way the signal is time-varyingly delayed with sub-sample accuracy. In order to obtain the sample value sarr(n) at integer values of n, the recorded signal sarr(narr) is interpolated to integer locations of n. This time-varying resampling thus faithfully reproduces all time-related effects, even Doppler shift. The attenuation is calculated in two parts: first a radiation attenuation factor krad ∝ 1/r² is assumed, as found for omnidirectional sources; secondly the attenuation kdir due to the directivity function of the microphone is computed using equation (3). Since this recording facility depends on the microphone array geometry, it was integrated into the graphical user interface.
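The sketch below follows the algorithm just described under stated assumptions: a straight-line path and all parameter values are placeholders, linear interpolation stands in for the interpolation used in the study, and the directivity factor kdir of equation (3) is omitted for brevity.

    % Minimal sketch of the anechoic moving-source recording algorithm.
    fs = 44100; c = 343;
    s  = randn(fs, 1);                        % 1 s of placeholder source audio
    nn = (0:numel(s)-1)';
    x_src = linspace(-2, 2, numel(s))';       % assumed source path [m]
    y_src = 2*ones(numel(s), 1);
    x_mic = 0.072*cos(2*pi/5);                % assumed microphone position
    y_mic = 0.072*sin(2*pi/5);
    d     = hypot(x_src - x_mic, y_src - y_mic);     % distance per sample
    n_arr = nn + (d/c)*fs;                    % non-integer arrival times n_arr
    g     = 1./d.^2;                          % radiation attenuation k_rad
    i_m   = interp1(n_arr, g.*s, nn, 'linear', 0);   % resample to integer grid
    % The time-varying delay in n_arr reproduces Doppler shift automatically.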

Figure 4-3 – Schematic Description of the Signal Paths for the Simulated Recordings: a) reverberant processing (CATT-Acoustic): anechoic mono material, path description, source and microphone locations and the mic model yield echograms and impulse responses, which are applied by walk-through convolution to give i(t), followed by convolution with a(t) to give o(t); b) anechoic processing (MATLAB): anechoic mono material, path description and source/microphone locations are processed by resampling/interpolation to give i(t), followed by convolution with a(t) to give o(t).

4.2.3 Sound Scenes Recorded

Recording moving sound sources in software emulation implied huge amounts of data in the form of echograms and impulse responses. To keep the processing within limits, two sound scenes were selected:

• Scene A: a female voice tracing a circle around the listener.
• Scene B: a small drum moving along a path covering much of the (r, φ)-space.

For the circle in scene A a radius of 2 m was chosen, to be traced within about 11 s. For the reverberated recordings the sources were placed at 10° intervals facing towards the centre, and their directivity function was set to one similar to a small loudspeaker. This gives a total of 648 impulse responses to be calculated.

Figure 4-4a shows the room used, with the listener position including its facing direction (blue) and the sound sources (red). The path chosen for scene B is depicted in figure 4-4b. It is constrained to the horizontal surface and starts at the front. The sound source was assumed to have an omnidirectional directivity, and a duration of about 42 s was chosen. 40 locations along the path lead to a total of 756 impulse responses.

Figure 4-4 – Location of the Paths and the Microphone Array for the Recorded Scenes (a) Scene A; b) Scene B)

4.2.4 Listening Tests

Since only a very restricted number of tests was carried out, and in a not fully controlled way, they cannot be regarded as representative. They nevertheless give some insight into where the problems and the strengths lie and what future developments might be envisaged. Informal listening tests were performed continuously by the author once a working system was present, in order to validate theoretical results. This section reports some of the notable findings. Further tests based on a questionnaire were run by an independent supervisor; see appendix B for a short description of these tests and their results.

In general, localisation seemed to be reinforced when HRTF processing was used. As expected, spaciousness and envelopment were always enormously improved for material with acoustical reflections. For this material, coloration was generally perceived as less severe than in the anechoic tests.

4.2.4.1 M = 5, L = 5

Being the conventional microphone loudspeaker array setup for surround recordings, this configuration was explored the most. First of all the traditional one-to-one mapping scheme, where the microphone signals are not processed at all, was assessed. This case exhibited very natural sounding scenes, and especially envelopment and spaciousness were perceived strongly. The loudspeaker positions seemed however to be quite obviously revealed, and localisation, especially at the front, was quite diffuse for the uniform circular loudspeaker array and sharpened somewhat for the recommended 3/2 surround setup. The raised cosine window method resulted in some attenuation of the high frequencies, as was observed on the error surface, leading to a slightly dull but otherwise natural sound. Localisation was however even more diffuse than before. The acoustic vector and plane wave matching schemes sharpened localisation considerably but added slight coloration in the high frequency band. Dislocations of the listening position greater than 0.15 m resulted in wrong localisation, and speakers became audible as separate sound sources depending on where exactly the image sound was moving. By their nature all these acoustically based methods are insensitive to head rotation.

Combined HRTF and acoustical methods lead to different perceptions, such that with some listening experience it was easy to tell an HRTF-based signal from an acoustically based one. Moreover, the perception seemed to change with the number of listening repetitions, which could indicate a learning effect in the cognitive system of the listener. Ten matching angles φx situated between the loudspeaker angles were used. Localisation was significantly sharpened for all the combined methods. There was however a great difference in localisation strength between frontal and lateral incidence.

The latter was clearly improved when compared with the acoustical methods. The HRTF combination with one-to-one gave the least sound coloration and the acoustic vector method the most. This coloration was usually audible as some sharpness in the high frequency band and was improved with an increase in the number of ears used in the reconstruction. Decreasing the number of angles φx further, down to five, coloration was again diminished but frontal localisation weakened. When the matching angles were distributed irregularly with an emphasis at the front and at the back, localisation improved again. Especially at the back a convincing and sharp phantom source was perceived.

4.2.4.2 M > 5, L = 5

For the acoustical methods, increasing the number of microphones did not have a great impact. Even for the frequency dependent plane wave reconstruction only a slight difference in sound colour was observed. This is because the plane wave model of the replay system (PL(φ)) does not allow higher order cylindrical harmonics. With HRTF based approaches, when increasing M the ear signals can be reproduced more faithfully (compare graph 4-10). In order to cope with the low frequency instability of the filters, combined HRTF and acoustical methods were used with 2-band processing, using only the latter for the lower band. Also the angles φx (no more than 7 of them for M = 12) had to be kept well away from the loudspeaker angles φl. Localisation sharpness seemed to increase only by a small amount and was judged to be best for the combination with plane wave reconstruction. Sound coloration however increased, especially at high frequencies above say 2 kHz. This is where large excursions in the HRTFs occur. Spatial aliasing is also most likely to degrade the effectiveness of the reconstruction, such that the question arises how to process this highest frequency part of the signal.

4.2.4.3 M ≥ 5, L = 2

This case is of interest because most of today's playback systems are stereophonic. The loudspeaker angles were chosen at ±30° as recommended. This decision is however not optimised, and other setups might perform better, because HRTF crosstalk cancellation techniques are known to work best with very closely [19] or very distantly spaced loudspeakers. Acoustical methods were only tried as a reference because they cannot substantially surpass the performance of conventional stereo recording techniques. Surprisingly, for a well controlled location of the listener the sound scenes processed with combined HRTF methods were perceived very similarly to the L = 5 case (five loudspeakers), with clear lateral localisation making the loudspeaker positions disappear. With some listening experience it was possible to tell an L = 5 from an L = 2 presentation due to different sound coloration and impaired localisation. The stereophonic case was however strongly restricted to the exact listener position, to dislocations within about 8 centimetres and head rotations within about ±15°.

For comparison, some other material containing ten stationary sound sources was produced in CATT-Acoustic using a first order crosstalk canceller applied to binaural signals. Since this is equivalent to recording with an artificial head, the ear signals are known exactly. Consequently localisation was considerably sharper than for the scenes processed for HRTF reconstruction. The sweet spot is however restricted even more, and slight dislocation or head rotation destroys the sound image completely.

RELLER AND HAWKSFORD

PERCEPTUAL MICROPHONE ARRAYS which is a significant improvement when compared to traditional stereo and surround systems. The listening tests clearly showed some of the drawbacks. As typical for HRTF based approaches sound coloration was a noticeable artefact. Some indication for the reasons is given, of which lack of smoothness in the HRTFs and post processing of the filters as well as spatial aliasing and imperfect listening conditions are judged to be the predominant ones. The stereophonic playback setup showed that the ear signals indeed are reconstructed to some extent making loudspeakers disappear from a perceptual point of view. This is a true alternative to recording with an artificial head and using crosstalk cancellation. Some indication of further developments and extensions are given of which nearfield calculation is estimated to have the greatest impact. They are considered to be as vital as interesting to be followed in order to make the system perform to its full potential.

microphone directivity, matching angles distribution, loudspeaker placement and others are certainly necessary. Considering the large space that these cover it must be expected to be a very difficult task to do combined optimisations and they might have to be done using a neural network or a similar approach [20]. From a processing point of view it might be interesting to consider more complex array processing systems. Circular arrays often work by first taking the inverse discrete Fourier transform with respect to the microphone angle [21]. They thus process directly the cylindrical harmonics (or phase mode). This area is considered to be vital when it comes to fast computation. Using measured HRTFs seemed to interfere at high frequencies. To minimise coloration, they should be smoothed in frequency f as well as in angle φ since HRTFs are functions of both f and φ [14]. Alternatively the filters A(f) could be smoothed. HRTFs based on an acoustical model, e.g. based on the diffraction around a rigid sphere, are expected to perform better with respect to inter-individual differences although they might impair localisation sharpness. The impulse responses obtained should probably be windowed in order to reduce pre- and post-ring. If delays between microphones and loudspeakers can be separated, minimum phase filters could be obtained using the Hilbert transform.
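To illustrate the phase-mode processing mentioned above, the following sketch decomposes the spectra of a uniform circular array into cylindrical harmonics by an inverse DFT across the microphone index. The uniform geometry and the np.fft sign convention are assumptions for illustration; this is not the specific beamformer design of [21].

```python
import numpy as np

def phase_modes(mic_spectra):
    """Phase-mode (cylindrical harmonic) decomposition of a uniform
    circular array: an inverse DFT across the microphone index.

    mic_spectra : (M, F) complex spectra of M microphones assumed to sit
                  at angles phi_m = 2*pi*m/M
    returns     : (M, F) phase-mode coefficients in FFT ordering
                  n = 0, 1, ..., -1
    """
    # ifft over the microphone axis projects onto the complex cylindrical
    # harmonics e^{jn*phi} and includes the 1/M normalisation
    return np.fft.ifft(mic_spectra, axis=0)
```

In such schemes each mode would typically be equalised against its Bessel-function weighting (cf. equation (21) in Appendix A) before re-encoding to the loudspeakers, which is where the computational attraction lies.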

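The minimum-phase construction suggested above could be realised with the standard real-cepstrum method, in which the Hilbert-transform relation between log magnitude and phase is evaluated via FFTs. This is a generic sketch, not code from the evaluation software; it assumes the inter-channel delays have already been removed.

```python
import numpy as np

def minimum_phase(h, nfft=None):
    """Minimum-phase counterpart of the FIR filter h via the real cepstrum,
    i.e. the Hilbert-transform relation between log magnitude and phase.

    h : real impulse response with bulk delay already removed
    """
    n = len(h)
    nfft = nfft or 8 * n                       # generous zero-padding
    H = np.fft.fft(h, nfft)
    log_mag = np.log(np.maximum(np.abs(H), 1e-12))  # floor guards log(0)
    cep = np.fft.ifft(log_mag).real            # real cepstrum
    # Fold the cepstrum: keep c[0], double the causal part, zero the rest
    w = np.zeros(nfft)
    w[0] = 1.0
    w[1:(nfft + 1) // 2] = 2.0
    if nfft % 2 == 0:
        w[nfft // 2] = 1.0
    h_min = np.fft.ifft(np.exp(np.fft.fft(cep * w))).real
    return h_min[:n]
```

scipy.signal.minimum_phase implements essentially the same homomorphic computation and could be used instead.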
6 CONCLUSION

A working system based on the reconstruction of acoustically and perceptually relevant quantities of the soundfield was developed. Both types of quantity were cast in the same mathematical frame and were shown to exhibit equivalent forms. This made it possible to approximate solutions in a least-squares sense for either of them or for a combined approach. The obvious acoustical element, the plane wave, does not seem to be ideal in every respect: it is believed that matching the farfield representation of the loudspeakers to the source-generated fields is what renders the filters unstable under some conditions. Nevertheless, combined approaches and 2-band processing did yield valid results that were investigated further.

Graphical assessment of HRTF reconstruction errors was used to establish a variety of solutions, which, however, do not cover the space of all possible parameter values very well. Being able to place the incidence angles to be matched freely around 360° proved to be a powerful tool for fine-tuning. Despite the danger of unstable filter responses, the error surface can be flattened for frequencies below 1.5 kHz, which is a significant improvement when compared to traditional stereo and surround systems.

The listening tests clearly showed some of the drawbacks. As is typical for HRTF-based approaches, sound coloration was a noticeable artefact. Some indication of the reasons is given, of which lack of smoothness in the HRTFs and post-processing of the filters, as well as spatial aliasing and imperfect listening conditions, are judged to be the predominant ones. The stereophonic playback setup showed that the ear signals are indeed reconstructed to some extent, making the loudspeakers disappear from a perceptual point of view; this is a true alternative to recording with an artificial head and using crosstalk cancellation.

Some indications of further developments and extensions are given, of which nearfield calculation is estimated to have the greatest impact. They are considered as vital as they are interesting to pursue in order to make the system perform to its full potential.

7 REFERENCES

[1] J.-M. Jot, V. Larcher, J.-M. Pernaux, A Comparative Study of 3-D Audio Encoding and Rendering Techniques, AES 16th International Conference, 1999

[2] J. Bauck, D. H. Cooper, Generalized Transaural Stereo, AES 93rd Convention, San Francisco, 1992, Reprint No. 3401

[3] Earl G. Williams, Fourier Acoustics, Academic Press, 1999, Chapter 1

[4] E. Hulsebos, D. de Vries, E. Bourdillat, Improved Microphone Array Configurations for Auralization of Sound Fields by Wave Field Synthesis, J. Audio Eng. Soc., Vol. 50, 2002 October, p. 779–790

[5] C. I. Cheng, G. H. Wakefield, Introduction to Head-Related Transfer Functions (HRTFs): Representations of HRTFs in Time, Frequency and Space, AES 107th Convention, New York, 1999, Reprint No. 5026

[6] http://interface.cipic.ucdavis.edu/

[7] V. Larcher, G. Vandernoot, J.-M. Jot, Equalization Methods in Binaural Technology, AES 105th Convention, San Francisco, 1998, Reprint No. 4858

[8] M. M. Boone, E. N. G. Verheijen, P. E. van Tol, The Wave Field Synthesis Concept Applied to Sound Reproduction, AES 96th Convention, Amsterdam, 1994, Reprint No. 3814

[9] J. Daniel, R. Nicol, S. Moreau, Further Investigations of High Order Ambisonics and Wavefield Synthesis for Holophonic Sound Imaging, AES 114th Convention, Amsterdam, 2003, Reprint No. 5788

[10] V. V. Mitin, D. A. Romanov, M. A. Polis, Modern Advanced Mathematics for Engineers, John Wiley & Sons, 2001, p. 228

[11] M. A. Poletti, The Design of Encoding Functions for Stereophonic and Polyphonic Sound Systems, J. Audio Eng. Soc., Vol. 44, No. 11, 1996 November, p. 948–963

[12] M. A. Poletti, A Unified Theory of Horizontal Holographic Sound Systems, J. Audio Eng. Soc., Vol. 48, No. 12, 2000 December, p. 1155

[13] M. O. J. Hawksford, Multi-Channel Coding with HRTF Enhancement for DVD-A and Virtual Sound Systems, AES 108th Convention, Paris, 2000, Reprint No. 5158

[14] J. Huopaniemi, J. O. Smith, Spectral and Time-Domain Preprocessing and the Choice of Modeling Error Criteria for Binaural Digital Filters, AES 16th International Conference, 1999

[15] M. Williams, G. le Dû, Multichannel Microphone Array Design, AES 108th Convention, Paris, 2000, Reprint No. 5157

[16] P. Segar, F. Rumsey, Optimisation and Subjective Assessment of Surround Sound Microphone Arrays, AES 110th Convention, Amsterdam, 2001, Reprint No. 5368

[17] G. Dickins, M. Flax, A. McKeag, D. McGrath, Optimal 3D Speaker Panning, AES 16th International Conference, 1999

[18] http://www.catt.se/

[19] O. Kirkeby, P. A. Nelson, H. Hamada, Virtual Source Imaging Using the "Stereo Dipole", AES 103rd Convention, New York, 1997, Reprint No. 4574

[20] A. Laborie, R. Bruno, S. Montoya, A New Comprehensive Approach of Surround Sound Recording, AES 114th Convention, Amsterdam, 2003, Reprint No. 5717

[21] S. C. Chan, C. K. S. Pun, On the Design of Digital Broadband Beamformer for Uniform Circular Array with Frequency Invariant Characteristics, IEEE International Symposium on Circuits and Systems, 2002, Volume 1, p. 693–696

8 APPENDICES

A) Phase Mode Representation of a Plane Wave

A plane wave given by

$p(r, \phi, f) = S(f) \, e^{-jkr\cos(\phi-\phi_x)}$

can be approximated by the sum

$e^{-jkr\cos(\phi-\phi_x)} = \sum_{n=-\infty}^{\infty} (-j)^n J_n(kr) \, e^{jn(\phi-\phi_x)}$    (21)

where $e^{jn\phi}$ are the complex cylindrical harmonics. See [12] for a derivation.
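Equation (21) is easy to verify numerically. The short script below compares the exact plane-wave kernel with the sum truncated at |n| ≤ N; the values kr = 5 and N = 25 are arbitrary illustrative choices.

```python
import numpy as np
from scipy.special import jv

# Numerical check of the phase-mode expansion, equation (21)
kr = 5.0                                   # arbitrary wavenumber-radius product
phi = np.linspace(0.0, 2 * np.pi, 360)     # angle relative to incidence angle phi_x
exact = np.exp(-1j * kr * np.cos(phi))

N = 25                                     # truncation order, |n| <= N
n = np.arange(-N, N + 1)
truncated = ((-1j) ** n * jv(n, kr)) @ np.exp(1j * np.outer(n, phi))

print(np.max(np.abs(exact - truncated)))   # at machine-precision level here
```

The error is negligible here because J_n(kr) decays rapidly once n exceeds kr; the same decay is what limits the number of usable phase modes of a finite array at low frequencies.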

B) Listening Tests (conducted by Rabiul Syed)

Both reverberated scenes A and B were presented in three different ways, each candidate hearing only one version:

• Version I) Combined HRTF and acoustic vector reconstruction with M = L = 5 microphones and loudspeakers, similar to the one in Graph 4-7.
• Version II) Combined HRTF and acoustic vector reconstruction with M = 5 microphones and L = 2 loudspeakers, similar to the one in Graph 4-6.
• Version III) Binaural signals obtained with the CATT-Acoustic software, presented on headphones without headphone equalisation.

A short questionnaire asked the candidates to indicate the following:

• Scene A: a) How well is the circle perceived? b) Where does it end? (Start and end points are both biased somewhat to the left, as can be seen in figure 4-4a.) c) Does the timbre of the voice alter? d) Is elevation perceived?
• Scene B: Make a drawing of the perceived path.
• General: a) Describe the angular distribution of incidence of the reverberation. b) Give the number of activated loudspeakers perceived. (The visual setup always included all 5 loudspeakers; not applicable to Version III.)

A short summary of the results is given below.

Version I) (2 candidates):
• Scene A: a) Circle perceived well. b) End point perceived wrongly by one person. c) Slight alteration in timbre perceived. d) Elevation perceived by one candidate.
• Scene B: One person got the roughly circular shape with correct start and end points but no change in distance; for the other person the front part of the path was wrongly perceived as being at the back.
• General: a) One person perceived more reverberation coming from the back than from the front. b) The numbers of active loudspeakers perceived were 5 and 4.

Version II) (5 candidates):
• Scene A: a) The circle was perceived well by only two candidates. b) End point detected correctly by all candidates. c) Alterations in timbre perceived by all. d) No elevation perceived.
• Scene B: None of the candidates drew the correct path, but all got the rough idea (start at the left, move to the right, move to the left). Only one candidate perceived sound at the back. For 4 candidates the path covered considerably more than the loudspeaker angles ±30°.
• General: a) 2 candidates perceived less reverberation coming from the back than from the front. b) None of the candidates found out that only 2 loudspeakers were active: 3 persons perceived 3 active loudspeakers and 2 persons perceived 5.

Version III) (2 candidates):
• Scene A: a) Circle barely perceived. b) End point perceived wrongly by more than 60°. c) Alterations in the timbre of the voice perceived. d) Elevation perceived by both candidates.
• Scene B: Only the rough shape perceived (start at the left, move to the right, move to the left), with front-back confusion over most parts of the path.
• General: a) One person perceived more reverberation coming from the back than from the front.

C) MATLAB Evaluation Software Screen Shot