Diffuse Field Modeling using Physically-Inspired Decorrelation Filters: Improvements to the Filter Design Method

David Romblom
Abstract

Diffuse Field Modeling uses decorrelation filters based on the statistical description of reverberation to approximate the expected measurements from an array of outward-facing cardioid microphones in an acoustic diffuse field. The cardioid microphones are formed from linear combinations of a B-Format Room Impulse Response that has been altered to be diffuse. This third publication is an extension of two previous publications: Part I [1] presented the physical justification and filter design, and Part II [2] presented a perceptual evaluation of the algorithm for both a large and a small loudspeaker array. The current paper presents a method to generate decorrelation filters for three dimensional loudspeaker arrays with arbitrary angular spacing as well as for binaural configurations. It also presents a method for determining the frequency spacing of the constituent bandpass filters based on estimates of the RT60 of the B-Format Room Impulse Response.
1 Introduction
The objective of the original Diffuse Field Modeling (DFM) paper [1] was to create an ensemble of decorrelation filters that, when combined with a specially treated B-Format Room Impulse Response (RIR), would approximate the expected measurements on a real array of outward-facing cardioid microphones in an acoustic diffuse field. A single-layer approximation of the Kirchhoff Helmholtz Integral (K/H) was developed and the theoretical bounding surface S was replaced with an array of outward-facing cardioid microphones in a recording space routed to a conceptually-coincident, inward-facing array of cardioid radiators in a listening space. The virtual microphone array in the recording space is comprised of spatially separated microphones while the B-Format RIR corresponds to a single point; the “missing information” of the acoustic diffuse field
Abbreviation   Meaning
RT60           Reverberation Time
SAC            Spatial Autocorrelation
FAC            Frequency Autocorrelation
RIR            Room Impulse Response
K/H            Kirchhoff Helmholtz Integral
DFM            Diffuse Field Modeling

Table 1: Abbreviations used throughout the paper.
Symbol    Quantity
I         Total number of channels, indexed by i.
L         Total number of "design" center frequencies, indexed by l.
A_{i,l}   Structured random gain per channel (index i) and center frequency (index l).
C_{i,l}   Outward-facing cardioid microphone in simulated diffuse field.
Ω         Continuous time angular frequency.
ω         Discrete time angular frequency.
g_c       Linear gain defining the effective bandwidth of physical frequency autocorrelation.
τ         Modal decay constant (τ = RT60/13.82 for p²).
p         Acoustic pressure.
p̂         Acoustic pressure as complex phasor.
Q         Number of plane waves in the diffuse field simulation, indexed by q.
R         Radius of loudspeaker array.

Table 2: Notation used throughout the paper.
is simulated by applying the DFM decorrelation filters to cardioid microphones formed from the B-Format RIR. The original algorithm was validated as a component of a physically-plausible virtual acoustic model in a perceptual evaluation by Rummukainen et al. in [2]. For a large array comprised of a circular 16-channel horizontal ring and a 4-channel height ring, the initial implementation was found to be usable by 11 of 16 expert listeners from the McGill Sound Recording program. Magnitude rating experiments for "spatial quality," "timbral quality," and "realism" scales indicated that it was necessary to use a B-Format RIR (as opposed to an omni-directional RIR), that the audio structure of this B-Format RIR should be diffuse, and that the early reflections ought to be geometrically modeled. A second experiment using a 5.1 loudspeaker configuration found DFM (in conjunction with modeled reflections) was comparable to the Hamasaki Square [3]. This microphone technique was chosen because of its physical interpretation as a sparse sampling of a diffuse field, because of its general acceptance in the sound recording community [4], and because of the high magnitude ratings of "distance" and "envelopment" in a previous experiment by this author [5]. A closely related experiment [6] demonstrated that the auditory system is sensitive to directional energetic differences in reverberation. The perceptual thresholds were estimated for a lateral condition (-2.5 dB), for a height condition (-6.8 dB), and for a frontal condition (-3.2 dB). Note that these levels correspond to segments of the loudspeaker array used in the experiment and that the spectral differences
measured in the center of the array using both opposing cardioid microphones and a HATS dummy head were considerably less. Additionally, opposing cardioid measurements in a shoebox style concert hall demonstrate that physical differences corresponding to the perceptual thresholds can be found in existing acoustic spaces. Such energetic directional differences can be recorded at one point using a B-Format microphone; the DFM decorrelation filters allow the simulation of virtual microphones at other points in this non-ideal diffuse field. The objective of this paper is to extend the original work to three dimensional loudspeaker arrays with arbitrary angular spacing, to make the filter design process more robust, and to develop a binaural adaption amenable to rotational head tracking. This paper is organized as follows: Section 2.1 gives an overview of the original algorithm and the departures of this paper, Section 2.2 describes the "bank of bandpass" filter structure, Section 2.3 discusses how to obtain the structured complex gain A_{i,l} from a simulated diffuse field, Section 2.4 gives a systematic method for spacing the L center frequencies, and Section 2.5 discusses the justifications for a high frequency limit. Sections 3.1 and 3.2 present numerical validations of the frequency and spatial autocorrelation of the A_{i,l} values, and Section 3.3 presents a numerical validation of the spatial autocorrelation in the resulting acoustic fields. Section 4.1 discusses the relationship of DFM to known psychoacoustic experimental results and Section 4.2 discusses the engineering decisions made during the development of the algorithm.
2 Filter Design

2.1 Overview
In the original DFM publication [1], each individual decorrelation filter was synthesized as the superposition of many exponentially windowed bandpass filters having regularly-spaced center frequencies and random complex gains A_{i,l} (channel i, frequency index l). Following the statistical description of reverberation [7], the magnitude squared |A_{i,l}|² was drawn from an exponential distribution and the phase ∠A_{i,l} was distributed uniformly on the interval [−π, π[. At low frequencies, the values A_{i,l} were filtered across space such that the spatial autocorrelation (SAC) of the channels converged to the theoretically expected sinc(kr) [7][8][9]; simulations of reconstructed fields also converged to this same expected SAC. The diffuse field model is a pure tone model [7][9]. Room modes, however, have non-zero bandwidth, and the resulting smearing of modes above the Schroeder frequency leaves the spectrum smoother or more jagged depending on the RT60 [7][9]. This is the reason that the original strategy overlapped gated exponentials in frequency: by adjusting the decay constant ν, the smoothness of the filter spectra could be made to correspond to the smoothness of the B-Format RIR spectrum. This smoothness can be described in terms of FAC and related
to RT60 [7]. The particular values of A_{i,l} were altered by this frequency overlap, although the statistical distributions and the spatial autocorrelation were largely preserved. A problem with this approach was that large amplitude spikes could occur in the superimposed response, and entire filter sets had to be iteratively accepted or rejected based on a post-hoc assessment.

It was noted in the development of the original algorithm that a numerical simulation of a pure tone diffuse field model converged to the expected spatial autocorrelation, and that the statistical relationships corresponding to different microphone directivity patterns [10] could be approximated by associating these directivity patterns with the simulated observations. See Section 2.3 for further discussion. Because the simulation provides an approximate statistical relationship for arbitrary spacings and orientations, and because this structure would be difficult to obtain by way of the spatial filtering strategy used in the original publication, in this paper the random gains A_{i,l} will be obtained from a pure tone diffuse field simulation for a set number of L control frequencies.

It is desired to blur the structured values of A_{i,l} as little as possible. As such, the exponential window had to be abandoned because its wide frequency bandwidth modified many surrounding values of A_{i,l}; the Hamming window was instead chosen because of its low sidelobe structure [11]. An equivalence between the bandwidth of the main lobe of the Hamming window transform and the target FAC was established; because of the relationship between FAC and RT60 [7][9], this gives a systematic means of determining the necessary filter length M based on an estimate of the frequency-dependent RT60. Also developed is a method of specifying the minimum necessary spacing of bandpass filters by defining a degree of "channel overlap" that modestly alters the values of A_{i,l} while not leaving gaps in the frequency spectrum. A drawback of using the Hamming window is that it introduces a smooth build up of diffuse audio of duration M/2 that may or may not be acceptable depending on the application. An M = 16k-tap FIR filter is sufficient for RT60 values of approximately 2 seconds, and, at 48 kHz, the value of M/2 corresponds to 170 ms. The presence of high-order modeled early reflections would help "fill in" this time gap and could be viewed as a cross fade between deterministic and stochastic signal elements.

The central concept in DFM is that the decorrelation filters simulate measurements on the edge of the virtual array at some distance from the actual location of the B-Format RIR recording. The introduced magnitude and phase have frequency-dependent spatial autocorrelation corresponding to the virtual array locations, and this variation changes with respect to frequency at a rate defined by the frequency autocorrelation. At very low frequencies, however, the origin and the edge values are highly correlated, and the introduction of a simulated field is unnecessary. As such, the simulated diffuse field is "faded in" according to (1 − sinc(kr)) so that only the uncorrelated aspect of the field is simulated by the filters.

Finally, all of the above concepts are used to create a simple binaural adaption that transposes the cardioid
microphones to the edge of a small, 2-channel virtual array. The locations of the microphones can be chosen to be in the angular locations of the ears, and, because changing the orientation of cardioid microphones formed from the B-Format RIR is a matter of changing the weighting gains, the system is amenable to rotational head-tracking. However, the sinc(kr) correlation relationship corresponds to unobstructed observations of a diffuse field, while the head and torso of a listener are substantial, frequency-dependent obstructions [12]. The paradigm of transposed cardioid microphones has been maintained by simply doubling the head radius; see Section 2.3 for further discussion.
2.2 Filter Structure
The filter structure used in DFM remains a superposition of L bandpass filters; however, the Hamming window replaces the original exponential window because of its substantially lower sidelobe structure. The center frequencies are determined iteratively using an estimate of the frequency autocorrelation (FAC), a relationship between FAC and the Hamming bandwidth, and an overlap factor between adjacent bandpass filters. This design process is described in Section 2.4. At each center frequency l, a Hamming window w_l[n] of length M_l is modulated by the frequency ω_l, resulting in a bandpass filter with impulse response:

b_l[n] = e^{j \omega_l n} w_l[n]    (1)

For each channel i, the superimposed "bank of bandpass" decorrelation filter is defined by:

d_i = \sum_{l=1}^{L} A_{i,l} b_l    (2)
where A_{i,l} is derived from a spatial simulation developed below in Section 2.3. This results in the frequency response:

D_i(e^{j \omega_m}) = \sum_{l=1}^{L} A_{i,l} B_l(e^{j \omega_m})    (3)
Note that there are L bandpass filters while there are M observation frequencies ω_m, depending on the length of the Discrete Fourier Transform (DFT) used. Anticipating the following sections, exemplar filters are shown in Figure 1 for Pollack Hall at McGill University (RT60 ≈ 2 seconds at low frequency) for a 16-channel loudspeaker array with a 2 m radius, and for the binaural adaption.
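As a concrete illustration of Eqs. 1 and 2, the following sketch assembles one decorrelation filter from modulated Hamming windows. The function name and the placeholder center frequencies, window lengths, and random gains are illustrative assumptions; in the actual design these quantities come from the procedures of Sections 2.3 and 2.4, and the handling of the real part is likewise an assumption rather than a statement of the author's implementation.

```python
import numpy as np

def bank_of_bandpass_filter(omega_l, M_l, A_il, M_max):
    """Superimpose L modulated Hamming windows (Eqs. 1-2).

    omega_l : array of L center frequencies in radians/sample
    M_l     : array of L odd window lengths (Type 1 FIR)
    A_il    : array of L complex gains for this channel
    M_max   : length of the longest window; all windows share the
              same center sample to align group delay
    """
    d = np.zeros(M_max, dtype=complex)
    center = M_max // 2
    for wl, Ml, Ail in zip(omega_l, M_l, A_il):
        n = np.arange(Ml) - Ml // 2                 # window centered at n = 0
        b = np.exp(1j * wl * n) * np.hamming(Ml)    # Eq. 1
        d[center + n] += Ail * b                    # Eq. 2, time-aligned sum
    return d.real                                   # assumption: keep the real part

# Illustrative use with placeholder design values
rng = np.random.default_rng(0)
L = 64
omega_l = np.linspace(0.01, 1.0, L)                 # placeholder spacing
M_l = np.full(L, 2047)                              # placeholder odd lengths
A_il = rng.standard_normal(L) + 1j * rng.standard_normal(L)
d_i = bank_of_bandpass_filter(omega_l, M_l, A_il, M_max=2047)
```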
[Figure 1 plots: dB magnitude versus frequency (Hz), showing the filter response against the target response. Panel (a): Pollack Hall, 16-channel array, 2 m radius. Panel (b): Pollack Hall, binaural adaption.]
Figure 1: Exemplar decorrelation filters designed for the Room Impulse Response (RIR) of Pollack Hall at McGill University. The circles are the magnitudes (|A_{i,l}|) derived from the spatial simulation described in Section 2.3. The frequency spacing of the bandpass filters is determined iteratively based on the estimation of RT60 (equivalently frequency autocorrelation) and the bandwidth and overlap of a Hamming window. The solid line is the magnitude spectrum resulting from the superposition of L bandpass filters. The phase response (∠A_{i,l}) is altered by the time alignment of the bandpass filters and makes visual interpretation difficult. Details are described in Section 2.4. Note that the variation in A_{i,l} is "faded out" at low frequencies depending on the array radius because the array edge values are highly correlated with the pressure estimated by the B-Format RIR. Note also the differing frequency scales used between the loudspeaker array and binaural cases.
2.3 Obtaining A_{i,l} from a Simulated Diffuse Field
The diffuse field model is a mathematical idealization that describes the field as a superposition of q = 1, 2, ..., Q plane waves of random direction, phase, and amplitude. For pure tone fields the complex pressure p̂ is described as a summation:

\hat{p} = \sum_{q=1}^{Q} \hat{p}_q e^{j k \mathbf{n}_q \cdot \mathbf{r}}    (4)
p̂_q is the complex amplitude of the qth plane wave, n_q is the directional vector of the qth plane wave, and r is a vector pointing to the observation point. The spatial autocorrelation (SAC) is the expected value of the product of p(r) and p(r + ∆r) averaged over space [7][8]:

\langle p(\mathbf{r}) p(\mathbf{r} + \Delta\mathbf{r}) \rangle = \langle p^2 \rangle \frac{\sin k|\Delta\mathbf{r}|}{k|\Delta\mathbf{r}|}    (5)
r is an observation point, ∆r is a fixed offset, k is the wave number, and ⟨p²⟩ is the mean square pressure averaged over space. Lower frequencies remain correlated over greater distances. To understand the influence that the microphone directivity pattern exerts over SAC, consider the pressure observed at two points r_1 and r_2. For a given spatial frequency k = ω/c, the maximum phase progression corresponds to the plane wave propagating along n_q exactly parallel to the connecting vector r_{1,2}. Due to their angled projection onto r_{1,2}, the arguments of all other diffuse plane waves proceed more slowly and will contribute a lesser value of k. If a collinear array of equally spaced points is considered, the observed spatial frequencies along r_{1,2} span all values up to an ideal brick-wall cutoff of k = ω/c. The use of pressure gradient or cardioid microphones changes the weighting of each contributing plane wave and, depending on the orientation, will emphasize the regions of k differently. The SAC is then the inverse wavenumber (Fourier) transform of the power spectral density in k (Wiener-Khintchine theorem) [13]. The magnitude squared coherence of differing microphone directivity patterns and orientations in isotropic diffuse fields is shown formally by Elko in [10]. The spatial autocorrelation for a fixed-radius array with arbitrary angular spacing would be extremely difficult to determine mathematically, and in this paper the relationship is instead determined by numerical simulation. A cardioid microphone can be formed by combining the acoustic pressure p̂ and pressure gradient ∇_n p̂. A gradient microphone is typically phase and frequency compensated by the mass-dominated ribbon or diaphragm response (1/jk), which results in a signal proportional to the acoustic velocity [14]. Summing v̂_n ρc with the pressure creates a cardioid microphone in any desired direction n. Note that −∇_n p̂ / jk = v̂_n ρc because of Newton's 2nd Law: −∇_n p̂ = jωρ v̂_n. The locations of the virtual microphone array are defined having radial distance R and arbitrary azimuth
or elevation angles (φ, θ). Following the single-layer approximation of the Kirchhoff Helmholtz Integral, this virtual array should be coincident with the I real loudspeakers. Two additional points are defined having radius (R − ∆R) and (R + ∆R). For each channel i, the three points are converted to Cartesian coordinates and the pressure contribution of all Q plane waves is computed and summed. The pressure gradient ∇_r p̂ is approximated as:

\nabla_r \hat{p} \approx \frac{ (\hat{p}_R - \hat{p}_{(R-\Delta R)}) + (\hat{p}_{(R+\Delta R)} - \hat{p}_R) }{2 \Delta R}    (6)

The signal corresponding to an outward-facing cardioid microphone is then:

C_i = \frac{1}{2} \left( \hat{p} + \frac{-\nabla_r \hat{p}}{jk} \right)    (7)
For a given spatial frequency k and array radius R, the origin and edge values are correlated according to Eq. 5. It is not desired to introduce variation that the B-Format microphone can account for; doing so would distort any spatial information that might exist at low frequencies. At higher frequencies, however, the origin and the edge are essentially uncorrelated and the simulated structure is "faded in" according to:

A_{i,l} = \frac{\sin kR}{kR} + C_i \left( 1 - \frac{\sin kR}{kR} \right)    (8)
The left hand term represents what B-Format is capable of accurately representing: low frequencies determined by the array radius R. The right hand term represents the added contribution of the simulated diffuse field modeled with decorrelation filters. Adapting DFM for binaural scenarios is extremely attractive because the diffuse field can be head-tracked by changing the directions of the cardioid microphones formed from the B-Format RIR. This requires only manipulation of the relative gains, and would preserve directional energetic differences present in the B-Format RIR. However, we cannot assume the same correlation relationship between an unobstructed field (the B-Format RIR and the virtual array) and an obstructed field (the B-Format RIR and the listener's ears). In this case, each constituent plane wave will be modified according to the properties of the head and torso, which include frequency-dependent shadowing, amplification due to reflection, and increased or decreased path length [12]. Because at least half of the plane waves in Eq 4 would be occluded, it is assumed that the ear signals will be less correlated than predicted by Eq 5. In the current work, it was desired to maintain the paradigm of transposing cardioid microphones by way of decorrelation filters. Inside-the-head localization has been avoided by simply doubling the array distance R. In the future, it would be preferable to create the decorrelation filters from a simulated diffuse field including the head model. Parametric approaches such as those by Duda and Algazi et al. [15][16][17][18] are attractive because they allow arbitrary angular
incidence, which pairs well with the simulated diffuse field model, and offer a degree of individualization at low frequencies. The binaural adaption is similar to the work of Menzer and Faller [19], who also created a binaural method using a B-Format RIR in conjunction with Head-Related Transfer Functions (HRTF). Their method creates a static individualized binaural room impulse response (BRIR) by matching the interaural cross-correlation (IACC) associated with the ensemble of HRTF measurements and the Energy Decay Relief of the B-Format RIR. The binaural adaption of DFM maintains the ability to implement rotational head-tracking by decoupling the B-Format RIR and the decorrelation filters.
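The following sketch shows how the structured gains A_{i,l} of Eq. 8 could be drawn from a pure tone diffuse field simulation at a single wavenumber, following Eqs. 4, 6, and 7. The function name, the plane wave amplitude normalization, the finite-difference step ∆R, and the sign conventions for the outward-facing cardioid are assumptions made for illustration, not the author's exact implementation.

```python
import numpy as np

def simulate_A_il(mic_dirs, R, k, Q=1024, dR=1e-3, rng=None):
    """Sketch: structured gains A_{i,l} at one wavenumber k (Eqs. 4, 6-8).

    mic_dirs : (I, 3) unit vectors toward the virtual microphone positions
    R        : array radius in meters
    k        : wavenumber omega/c at the current center frequency
    """
    rng = np.random.default_rng() if rng is None else rng
    # Q plane waves of random direction and complex amplitude (Eq. 4)
    n_q = rng.standard_normal((Q, 3))
    n_q /= np.linalg.norm(n_q, axis=1, keepdims=True)
    p_q = (rng.standard_normal(Q) + 1j * rng.standard_normal(Q)) / np.sqrt(Q)

    def pressure(points):
        # p_hat(r) = sum_q p_hat_q * exp(j k n_q . r)
        return np.exp(1j * k * points @ n_q.T) @ p_q

    sinc_kR = np.sinc(k * R / np.pi)              # sin(kR)/(kR)
    A = np.empty(len(mic_dirs), dtype=complex)
    for i, n in enumerate(mic_dirs):
        pts = np.outer([R - dR, R, R + dR], n)    # three radial observation points
        p_minus, p_0, p_plus = pressure(pts)
        grad = ((p_0 - p_minus) + (p_plus - p_0)) / (2 * dR)   # Eq. 6
        C_i = 0.5 * (p_0 + (-grad) / (1j * k))                 # Eq. 7, outward cardioid
        A[i] = sinc_kR + C_i * (1 - sinc_kR)                   # Eq. 8, low-frequency fade
    return A

# Illustrative 16-channel horizontal ring, 2 m radius, k for 500 Hz
phi = np.linspace(0, 2 * np.pi, 16, endpoint=False)
dirs = np.stack([np.cos(phi), np.sin(phi), np.zeros_like(phi)], axis=1)
A_il = simulate_A_il(dirs, R=2.0, k=2 * np.pi * 500 / 343.0)
```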
2.4 Determining Center Frequency Spacing using RT60 Estimates
In the same way that spatial autocorrelation describes the average similarity between the pressure p at two points in space separated by distance ∆R, the frequency autocorrelation describes the average similarity between the pressure p at two points in frequency separated by ∆Ω. Because shorter RT60 values correspond to room modes with wider bandwidths, the spectra are, on average, more correlated for greater ∆Ω. Longer RT60 values correspond to room modes with narrow bandwidths and are less correlated for similar ∆Ω. The modal decay constant is related to RT60 by τ = RT60/13.82 for pressure squared (the constant 6.9 pertains to linear pressure). The frequency autocorrelation is described by [7]:

\frac{\langle a(\Omega) a(\Omega + \Delta\Omega) \rangle}{\langle a^2(\Omega) \rangle} = \frac{1}{1 + (\tau \Delta\Omega)^2}    (9)
This relationship can be used to determine the main lobe bandwidth of the Hamming window. A criterion FAC level g_c is defined and Equation 9 is manipulated to determine the corresponding frequency ∆Ω_c:

\Delta\Omega_c = \frac{1}{\tau} \sqrt{\frac{1}{g_c} - 1}    (10)
The equivalent bandwidth is defined to be twice ∆Ω_c due to the symmetry of frequency autocorrelation and the fact that bandwidth is typically two-sided. In discrete time angular frequency the equivalent bandwidth is:

\omega_c = \frac{2 \Delta\Omega_c}{F_s}    (11)
The bandwidth of a Hamming window in discrete time angular frequency is [11]:

\omega_H = \frac{8\pi}{M}    (12)
where M is the length of the window, which can be determined by using the equivalent bandwidth ω_c to define the main lobe width ω_H of the Hamming window (ω_H = ω_c):

M = \frac{8\pi}{\omega_c}    (13)
A good correspondence between the theoretical FAC and the resulting FAC of the ensemble of filters has been obtained using g_c = 0.1. This can be seen in Figure 2 in Section 3.1. Note that both the theoretical prediction and the DFM filters are in good agreement with the FAC measurement from the B-Format RIR. RT60 varies slowly with respect to frequency and, to account for this, the window length M_l is allowed to vary. All values of M_l are rounded to the nearest odd number to form FIR filters that are symmetric around a center point (Type 1 FIR). The center points of all L filters are aligned to avoid discrepancies in group delay. Because ∆ω_l will also vary, the next center frequency is determined iteratively:

\omega_{l+1} = \omega_l + \Delta\omega_l    (14)

where

\Delta\omega_l = \frac{7}{16} \omega_c    (15)
The choice of the 7/16 constant represents a modest overlap between the main lobes of each individual bandpass filter. This was done to ensure the resulting spectrum was adequately interpolated and is considered to be an acceptable blurring of the spatial structure present in A_{i,l}. The constant 7/16 was determined by setting all values of |A_{i,l}| to unity. The value of ∠A_{i,l} was also set to 0.0; however, the time alignment of the individual bandpass filters led to some phase variation. At the value of 7/16, the resulting spectrum has almost no magnitude variation (< 0.5 dB); even slightly greater spacings such as 15/32 caused magnitude dips between the values of |A_{i,l}|. When the variation in A_{i,l} is reinstated, the resulting phase and magnitude spectra are smooth with occasional narrow zeros caused by greatly differing phase values between the channels. In typical filter design cases, these zeros would be objectionable, but they also occur in modal diffuse fields [20] and are considered to be acceptable for this application. The lowest value of ω_l can be chosen depending on the expected frequency response of the loudspeaker array at hand. In development a value corresponding to 16 Hz was used.
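A minimal sketch of the iterative spacing procedure of Eqs. 10 to 15 is given below, assuming a callable that returns an RT60 estimate for a given frequency. The function name, the frequency limits, and the rounding of M to an odd length are illustrative choices; the criterion g_c = 0.1 and the 7/16 overlap follow the values stated above.

```python
import numpy as np

def design_centers(rt60_of_hz, fs=48000.0, f_lo=16.0, f_hi=8000.0,
                   g_c=0.1, overlap=7.0 / 16.0):
    """Iteratively place center frequencies and window lengths (Eqs. 10-15).

    rt60_of_hz : callable returning the frequency-dependent RT60 in seconds
                 for a frequency given in Hz
    Returns arrays of center frequencies omega_l (rad/sample) and odd
    window lengths M_l.
    """
    omega_l, M_l = [], []
    omega = 2 * np.pi * f_lo / fs
    omega_hi = 2 * np.pi * f_hi / fs
    while omega < omega_hi:
        f_hz = omega * fs / (2 * np.pi)
        tau = rt60_of_hz(f_hz) / 13.82                       # modal decay constant (p^2)
        dOmega_c = (1.0 / tau) * np.sqrt(1.0 / g_c - 1.0)    # Eq. 10
        omega_c = 2.0 * dOmega_c / fs                        # Eq. 11
        M = int(round(8 * np.pi / omega_c))                  # Eqs. 12-13
        M += 1 - (M % 2)                                     # force odd length (Type 1 FIR)
        omega_l.append(omega)
        M_l.append(M)
        omega += overlap * omega_c                           # Eqs. 14-15
    return np.array(omega_l), np.array(M_l)

# Example with a flat 2-second RT60 (placeholder)
omegas, lengths = design_centers(lambda f: 2.0)
```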
2.5 High Frequency Cutoff
The acoustic absorption of materials is described by the coefficient α and is typically frequency-dependent [7][20]. Many common materials such as upholstery, curtains, or carpet attenuate high frequencies more than
low frequencies. Acoustic sources typically have frequency-dependent directivity, with low frequencies being nearly omni-directional and high frequencies being highly directional [21][20]. A common interpretation of the diffuse field model attributes the constituent plane waves to mirror sources in a perfectly reflective room [20]. If the impulse response of these mirror sources is considered in conjunction with the attenuation due to α and to off-axis directivity, the high frequency and higher-order reflections will decay to negligible levels very quickly. Even in the case of sustained high frequency sound, the spatial variation of an acoustic field is on the order of the wavelength (set by the wavenumber k) [20] and, for high frequencies, will be comparable to or smaller than the human head [12]. Further, each constituent plane wave will be subject to frequency-dependent occlusion and time delay [12][16][17] and, for lateral plane waves, will make very different contributions to the two ears. Due to the above considerations, and to limit the computational complexity of generating the decorrelation filters, it was decided to limit the algorithm to an upper cutoff frequency. Above this frequency, the MATLAB fir1() function was used to generate a high-pass filter that was temporally aligned with the center point of the DFM bandpass filters. Both timbral coloration and in-the-head localization are avoided using an upper cutoff of 8 kHz, the Tanna and Pollack Hall B-Format RIRs, and the binaural adaption of the DFM filters.
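For reference, a sketch of the complementary high-pass is shown below, using scipy.signal.firwin as a stand-in for the MATLAB fir1() function mentioned above. Forcing an odd length keeps the Type 1 center point shared with the DFM bandpass filters; how the high-pass band is routed (for example, summed into each decorrelation filter so the band above the cutoff passes without decorrelation) is an assumption, since the paper does not spell this out.

```python
import numpy as np
from scipy.signal import firwin

def highpass_complement(M, f_cut=8000.0, fs=48000.0):
    """High-pass covering the band above the DFM cutoff frequency.

    M is forced odd so the filter is Type 1 and shares the center point
    of the DFM bandpass filters.
    """
    M = M + 1 - (M % 2)                      # force odd length
    return firwin(M, f_cut, pass_zero=False, fs=fs)
```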
3 Numerical Validation

3.1 Validation of Frequency Autocorrelation
The RT60 of an acoustic space is typically frequency-dependent and, because of the relationship in Equation 9, so is the frequency autocorrelation (FAC). This variation is gradual and can be considered to be locally stationary at a given frequency. To validate the FAC of the ensemble of filters, the zero-padded (16x) Discrete Fourier Transform (DFT) of each filter is computed, a base frequency is chosen, and the autocorrelation of the frequency sequence is computed for a small span of frequencies (≈ 10 Hz) in the up-sampled spectrum using the MATLAB autocorr() function. These autocorrelation sequences are then averaged over the ensemble of filters and compared to the theoretical predictions of Equation 9. A comparison is also made to the computed FAC of the RIR itself; this was done using the same length DFT and, to improve the smoothness and consistency of the results, was averaged over 6 different cardioid microphones formed from the B-Format RIR. Results for Tanna Hall with a base frequency of 200 Hz are shown in Figure 2.
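A sketch of this check is given below. The function name, the mean removal, and the autocorrelation normalization are assumptions standing in for the MATLAB autocorr() call described above; the zero-padding factor and the roughly 10 Hz span follow the text.

```python
import numpy as np

def ensemble_fac(filters, fs=48000.0, f_base=200.0, span_hz=10.0, zero_pad=16):
    """Average the normalized autocorrelation of the filter spectra over a
    small band above f_base, across the ensemble of decorrelation filters.

    filters : (I, M) array of decorrelation filter impulse responses
    Returns (frequency offsets in Hz, averaged FAC estimate).
    """
    I, M = filters.shape
    nfft = zero_pad * M
    spectra = np.fft.rfft(filters, n=nfft, axis=1)
    df = fs / nfft
    k0 = int(round(f_base / df))
    n_lags = int(round(span_hz / df))
    acc = np.zeros(n_lags)
    for i in range(I):
        seg = spectra[i, k0:k0 + 2 * n_lags]
        seg = seg - seg.mean()                       # remove local mean
        for lag in range(n_lags):
            a, b = seg[:n_lags], seg[lag:lag + n_lags]
            acc[lag] += np.real(np.vdot(a, b)) / (np.abs(np.vdot(a, a)) + 1e-12)
    return np.arange(n_lags) * df, acc / I
```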
[Figure 2 plot: Frequency Autocorrelation versus frequency, 200 to 210 Hz; curves: Theoretical, DFM, Measured.]
Figure 2: Frequency Autocorrelation (FAC) computed with a base frequency of 200 Hz for Tanna Hall at McGill University. The theoretical FAC predicted by Eq 9 is plotted as the solid line, the ensemble of DFM filters is the dashed line, and the Tanna Hall Room Impulse Response (RIR) is the dotted line. A smoother and more consistent estimate of the measured FAC was calculated by averaging the FAC from 6 different cardioid microphones formed from the B-Format RIR.
3.2 Validation of Spatial Autocorrelation
In the first DFM publication, the spatial autocorrelation (SAC) was computed for regularly-spaced, circular loudspeaker arrays; the frequency to which the loudspeaker channels were correlated depended on the channel spacing. The correlation was also plotted with respect to the product of spatial frequency k and distance r to allow comparison with the theoretical ideal sinc(kr). In this paper, deriving the introduced randomness from a simulated diffuse field allows for the convenient generation of three dimensional loudspeaker arrays with arbitrary angular spacing. The regularly-spaced computation, however, must be abandoned in favor of computing the pairwise SAC between every pair of loudspeakers in an array. The number of pairwise combinations for I channels is the binomial coefficient C(I, 2) and, for each pair, the DFT is computed for both decorrelation filters. The distance between the pair is computed, as is the normalized conjugate product of the spectra. The frequency index is reinterpreted with respect to kr for the given spacing r and this spectrum is stored. Most loudspeaker pairs have differing distances and the kr spectra are not identically sampled. As such, a small increment ∆kr is defined and all coherence values in this band are averaged. Results are shown in Figure 3.
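A sketch of this pairwise computation is shown below. The function name, the DFT length, the kr range, and the bin width are illustrative assumptions; the averaging of the normalized conjugate product into ∆kr bins follows the description above.

```python
import numpy as np
from itertools import combinations

def pairwise_sac(filters, positions, fs=48000.0, c=343.0, n_bins=200,
                 kr_max=100.0, nfft=None):
    """Average the pairwise channel coherence into bins of kr.

    filters   : (I, M) decorrelation filter impulse responses
    positions : (I, 3) loudspeaker positions in meters
    Returns (kr bin centers, averaged coherence).
    """
    I, M = filters.shape
    nfft = nfft or 4 * M
    spectra = np.fft.rfft(filters, n=nfft, axis=1)
    freqs = np.fft.rfftfreq(nfft, d=1.0 / fs)
    k = 2 * np.pi * freqs / c

    edges = np.linspace(0.0, kr_max, n_bins + 1)
    sums = np.zeros(n_bins)
    counts = np.zeros(n_bins)
    for i, j in combinations(range(I), 2):
        r = np.linalg.norm(positions[i] - positions[j])
        # normalized conjugate product of the two spectra
        num = np.real(spectra[i] * np.conj(spectra[j]))
        den = np.abs(spectra[i]) * np.abs(spectra[j]) + 1e-12
        coh = num / den
        idx = np.digitize(k * r, edges) - 1           # reinterpret frequency as kr
        valid = (idx >= 0) & (idx < n_bins)
        np.add.at(sums, idx[valid], coh[valid])
        np.add.at(counts, idx[valid], 1)
    centers = 0.5 * (edges[:-1] + edges[1:])
    return centers, sums / np.maximum(counts, 1)
```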
[Figure 3 plot: Spatial Autocorrelation, horizontal axis 10^-1 to 10^2 (log scale); curves: Theoretical, DFM.]

Figure 3: Averaged pairwise channel correlation plotted with respect to kr; the pairwise approach was necessary for arbitrarily spaced 3-dimensional loudspeaker arrays. While it is anticipated that adjacent loudspeakers will have the most acoustic interaction, the results above suggest that all loudspeakers converge to the expected sinc(kr) spatial autocorrelation.

3.3 Validation of Reconstructed Fields

The discrete formulation of the Kirchhoff-Helmholtz Integral used in the first DFM publication (Section 5.1 of [1]) was again used to generate a complex pressure field in the horizontal plane. A reference diffuse field was simulated with Q = 1024 plane waves of random direction and complex amplitude. A virtual array was formed using the cardioid microphone to cardioid loudspeaker approximation of K/H, and the driving functions for the radiators were formed from linear combinations of pressure p and normal velocity v_n ρc = −∇_n p/jk measured from the simulated field on the edge of the virtual array. The pressure p̂ = W and the x and y pressure gradients (−∇_{x,y} p̂)/jk = X, Y are observed at the origin. For every channel i, an outward-facing cardioid C(i) is derived from a linear combination of X and Y based upon the azimuth angle φ_i as C(i) = W + cos(φ_i)X + sin(φ_i)Y. For comparison, a case with no decorrelation is formed and the loudspeaker driving signal is E(i) = C(i). In the DFM case, decorrelation filters are applied to the driving signals as E(i) = D_i(m)C(i). For viewing, plots of the real component of the complex pressure field were computed for a dense Cartesian grid. These were visually similar to the original publication and are not reproduced for the sake of brevity. The quantity β_m(|r_0|) defined in the original publication follows the definition of coherence (Fourier transform of the normalized cross-correlation) [9][13]:

\beta_m(|r_0|) = \frac{ \sum_I P(0) P^*(|r_0|, \phi_i) }{ \sqrt{ \sum_I |P(0)|^2 \sum_I |P(|r_0|, \phi_i)|^2 } }    (16)
and is used to compute the spatial autocorrelation for concentric rings |r_0| around the origin by considering the origin reference pressure P(0) and the field pressure P(|r_0|, φ_i). Figure 4 shows the spatial autocorrelation of the reconstructed field. Because the particular field heavily influences the SAC estimate, these results were averaged over 32 different simulated fields generated with different MATLAB random number generator seeds.
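A minimal sketch of Eq. 16 is given below; the function name and the choice to return the real part of the normalized sum (consistent with the signed values plotted in Figure 4) are assumptions.

```python
import numpy as np

def beta_m(P0, P_ring):
    """Coherence between the origin pressure and a ring of field points (Eq. 16).

    P0     : complex pressure at the origin
    P_ring : (I,) complex pressures at radius |r0| and azimuths phi_i
    """
    num = np.sum(P0 * np.conj(P_ring))
    den = np.sqrt(len(P_ring) * np.abs(P0) ** 2 * np.sum(np.abs(P_ring) ** 2))
    return np.real(num) / den
```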
[Figure 4 panels: spatial autocorrelation β_m(r) versus distance (0.2 to 1.4 meters) at 80, 160, and 320 Hz; curves: Ref, DFM, None.]
Figure 4: Averaged spatial autocorrelation of the reconstructed fields at 80, 160, and 320 Hz. Because the particular field heavily influences the SAC estimate, these results were averaged over 32 different simulated fields generated with different random number seeds. The virtual array using measured edge values is considered to be the reference and is plotted as the solid line. The dashed line uses origin observations with the DFM decorrelation filters, and the dotted line uses origin observations and no decorrelation filters.
4 Discussion
The intent of this paper is to provide a systematic and robust filter design method for irregular angular loudspeaker spacings and for a binaural adaption. This has been done by deriving the frequency-dependent spatial structure of an acoustic diffuse field from simulation. To minimize distortion to this structure, the filter design has been altered such that the constituent bandpass filters have minimal spectral overlap and the center frequencies are determined iteratively based on an estimate of frequency autocorrelation. In the following, Section 4.1 discusses the relationship to existing psychoacoustic research and the areas where further perceptual research is needed. The resulting temporal onset, the computational cost, and a speculative link to spherical basis functions are presented in Section 4.2.
4.1 Relationship to Existing Psychoacoustic Research
DFM introduces frequency-dependent channel correlation for both sparse and dense loudspeaker arrays. Some insight into the perceptual ramifications of this channel correlation can be gained from the experimental results presented by Blauert [12] regarding the auditory events associated with loudspeaker arrays with variable channel coherence kΦ . This quantity is defined as the maximum broadband cross-correlation and the subscript Φ is used to distinguish it from the wavenumber k. For a regularly-spaced, 4-channel loudspeaker array, kΦ values near 1.0 lead to auditory events heard near or above the head, and kΦ levels near 0.0 lead to distinct auditory events heard at the individual loudspeakers. Intermediate to this (kΦ ≈ 0.3) is an auditory event of large spatial extent filling the listening area. DFM is similar to this intermediate case with the
major difference being that kΦ is a broadband quantity while the channel correlation resulting from DFM is frequency-dependent. A study by Hiyama and Hamasaki [22] sought to determine the minimum number of loudspeakers required to perceptually approximate a diffuse field. Using a "difference grade" rating experiment, as few as 6 regularly-spaced loudspeakers were found to be perceptually comparable to a 24-loudspeaker reference. A closely related experiment by Santala and Pulkki [23] used uncorrelated pink noise in a 15-loudspeaker frontal arc and presented subjects with a wide variety of regular and irregular configurations of active and inactive loudspeakers. Notable is the fact that many configurations differing by only one loudspeaker or one pair of symmetric loudspeakers were rated similarly, which suggests that the exact distribution of loudspeakers could not be perceived accurately. It is of particular interest to understand how the frequency-dependent channel correlation in DFM filters influences these results. In headphone presentation, the maximum broadband interaural cross-correlation (IACC) influences the lateral width of an auditory event heard on the interaural axis [12] and can also be described by kΦ. If the same signal is sent to both ears (kΦ = 1.0), the auditory event will be heard in the center of the head with relatively concise localization. Two separate auditory events will be heard as kΦ nears 0.0: one located at each ear. In between these two extremes, the width of the auditory event is related to the value of kΦ, with lower values being wider. This is consistent with the listening assessment of the binaural adaption of DFM in conjunction with a simple spherical head model [17] for the direct path and early reflections. Turning the decorrelation filters off or using a very small head radius led to inside-the-head localization, while using a plausible head radius (with the suggested doubling to account for the obstructed field) led to a natural-sounding image. It was noted previously that the binaural adaption is similar to the work of Menzer and Faller [19] but allows rotational head tracking and a degree of individualization (head radius) in the filter design. A perceptual experiment conducted by the same authors [24] replaced the diffuse aspect of a binaural room impulse response (BRIR) with white Gaussian noise matched to have a similar energy decay relief. The IACC was matched using both frequency-independent and frequency-dependent methods and a number of "split points" between the measured BRIR and the synthetic diffuse tail. Notable is the fact that frequency-dependent IACC in conjunction with split points of 80 and 200 ms were rated as nearly indistinguishable from the reference. The binaural adaption of DFM also imparts frequency-dependent IACC, and the temporal onset of the Hamming window (≈ 170 ms for a 16k-tap filter) is similar to the split point of 200 ms. DFM introduces amplitude and phase variation into the generated channels of diffuse reverberation, similar to what would be found if an array of room microphones were set up in an actual acoustic diffuse field. Notable is the lack of timbral coloration found in the perceptual evaluation [2] and confirmed in the
development of this paper. It appears that this introduced variation is finer than the frequency resolution of the inner ear [25] and tends to “average out” over a critical bandwidth. A similar point is made by Schroeder in his seminal paper on artificial reverberation [26].
4.2 Engineering Decisions and Future Work
DFM requires that the structure of the B-Format RIR be diffuse according to the physical definition of randomly incident plane waves or a superposition of exponentially decaying modes. To satisfy this constraint, a measured RIR must be treated to remove the direct path and any strong specular reflections. Given the known need for modeled reflections [2] and the smooth "bloom" associated with convolution with a Hamming window, it was decided to simply remove the initial part of the RIR corresponding to half the window length (referenced to the direct path). An M = 16k-tap FIR filter is sufficient for RT60 values of approximately 2 seconds, and, at 48 kHz, the value of M/2 corresponds to 170 ms. Note that the filter lengths are shorter for lesser RT60 values. On one hand, this approach provides a gradual build up of diffuse energy and the structure of this audio is very likely to be perfectly diffuse. On the other hand, this approach requires a high-order reflection engine to properly synthesize the mixing of specular early reflections and the diffuse field. This is likely acceptable for the creation of scene-based content and experimental stimuli. There are likely cases where the temporal onset associated with the Hamming window is not acceptable. It is left as future work to develop a method that balances the temporal response with the bandpass channel overlap.

DFM is computationally reasonable for large loudspeaker arrays, where a long convolution is required for each of the four B-Format RIR channels and a lesser (nominally 16k-tap) convolution is then required for each loudspeaker. For mobile binaural applications, however, these convolutions are computationally prohibitive. The long convolution with the B-Format RIR can be eliminated by precomputing the diffuse streams and storing them for later playback. It is left as future work, however, to determine a computationally modest approximation to the DFM filters appropriate for this use case.

The relationship between spherical basis functions (SBF) and multipoles is demonstrated in [27] and, for the 1st-order case, multipoles and SBF are proportional to within a constant. The connection between spherical basis functions and the Kirchhoff-Helmholtz Integral is that an infinite SBF series can predict the field values p̂ on the bounding surface and a differentiation property [27] allows the computation of ∇p̂. DFM shares a conceptual similarity in the fact that an observation at the origin can be used to determine values on a bounding surface, though B-Format can only predict values on the bounding surface for very low frequencies [28]. During this work it has been apparent that there may be a formal relationship to SBF, in particular
that the simulated fields might be "filling in" the missing higher-order harmonics. For the sake of time and a readable manuscript this was not pursued further; however, the decision to "fade in" the simulated field shown in Equation 8 was informed by the seeming parallel to the SBF interpretation.
5 Conclusion
The modal and diffuse field models are foundational concepts underlying the statistical description of reverberation [7][20]. These statistical descriptions do not attempt to predict exact acoustic field values, but rather detail the probability distribution and autocorrelation with respect to time, frequency, and space. The auditory system's estimation of a source's distance is thought to consider a number of physical properties, one notable property being the ratio between the direct sound (deterministic) and the (stochastic) reverberant diffuse field [12][25][29]. The Kirchhoff Helmholtz Integral (K/H) states that if the pressure and pressure gradient are known on the bounding surface then the field within the surface can be reconstructed using a secondary array of sources [27][7]. In practice, the two "layers" of K/H must be collapsed into a single layer and the bounding surface is reduced to discrete points of reception and re-radiation. The virtual sources used in wave field synthesis (WFS) [30] can be seen as the simulation of deterministic acoustic values on a single-layer bounding surface; summing localization (amplitude panning [12]) can be interpreted as the perceptual simulation of deterministic values for sparsely sampled bounding surfaces. The statistical time / frequency aspects of the reverberant decay have been the subject of scientific research ranging from Sabine [7] to Schroeder [26] to Jot [31]. These time / frequency aspects, as well as broad energetic directional differences, are captured by a B-Format Room Impulse Response (RIR). Diffuse Field Modeling, in conjunction with a B-Format RIR, simulates the (stochastic) spatial structure of a diffuse field for either sparse or dense K/H bounding surfaces. To best approximate the original diffuse field, this spatial structure changes with respect to frequency based on the estimated frequency autocorrelation (equivalently the RT60) of the RIR. The contributions of this paper allow the creation of decorrelation filters for three dimensional loudspeaker arrays with arbitrary angular spacing, as well as a simple binaural adaption. The filter design process has been made more systematic by using estimates of the RT60 to determine the bandwidth of bandpass filters created using Hamming windows. Sound engineers typically use differing strategies for the direct and room sound [4]. DFM allows greater logistical flexibility for the diffuse sound; however, it must be used in parallel with point source techniques such as amplitude panning [12], vector-base amplitude panning (VBAP) [32], wave field synthesis (WFS) [30], or Head Related Transfer Functions [12] for the direct sound and specular
reflections. DFM can be used for spatial content creation in the current sound recording paradigm and is extensible to dense loudspeaker arrays such as those found in WFS. The development has been done in terms of B-Format Room Impulse Responses; however, it is equally applicable to B-Format audio streams that are known to have diffuse structure. The ability to systematically render an approximation of a diffuse field from a 4-channel B-Format audio stream should make DFM a suitable component of a high-definition object-based transmission format. It is hoped that DFM will become a complement to existing techniques and assist in the creation of virtual acoustic scenarios that are perceptually indistinguishable from acoustic reality.
Acknowledgements

The author would like to thank Philippe Depalle for a number of insightful technical discussions during the early development of this project and, more significantly, during his graduate work. He would also like to thank Ralph Algazi and Catherine Guastavino for their help reviewing the manuscript.
References
[1] David Romblom, Philippe Depalle, Catherine Guastavino, and Richard King. Diffuse Field Modeling using Physically-Inspired Decorrelation Filters and B-Format Microphones: Part I Algorithm. The Journal of the Audio Engineering Society, 64(4):177–193, April 2016.

[2] Olli Rummukainen, David Romblom, and Catherine Guastavino. Diffuse Field Modeling using Physically-Inspired Decorrelation Filters and B-Format Microphones: Part II Evaluation. The Journal of the Audio Engineering Society, 64(4):194–207, April 2016.

[3] Kimio Hamasaki. Multichannel Recording Techniques for Reproducing Adequate Spatial Impression. In Audio Engineering Society Conference: 24th International Conference: Multichannel Audio, The New Reality, July 2003.

[4] Günther Theile. Multichannel Natural Recording Based on Psychoacoustic Principles. In Audio Engineering Society 108th Convention, Feb. 2000.

[5] David Romblom, Richard King, and Catherine Guastavino. A Perceptual Evaluation of Room Effect Methods for Multichannel Spatial Audio. In Audio Engineering Society 135th Convention, Oct. 2013.
[6] David Romblom, Catherine Guastavino, and Philippe Depalle. Perceptual Thresholds for Non-Ideal Diffuse Field Reverberation. Journal of the Acoustical Society of America, 140(5):3908–3916, November 2016.

[7] Allan D. Pierce. Acoustics: An Introduction to Its Physical Principles and Applications. The Acoustical Society of America, Melville, New York, 1994. Chapter 6.

[8] Richard K. Cook. Measurement of Correlation Coefficients in Reverberant Sound Fields. The Journal of the Acoustical Society of America, 27(6):1072–1077, 1955.

[9] Finn Jacobsen and Thibaut Roisin. The Coherence of Reverberant Sound Fields. The Journal of the Acoustical Society of America, 108(1):204–210, 2000.

[10] Gary W. Elko. Spatial Coherence Functions for Differential Microphones in Isotropic Noise Fields. In Michael Brandstein and Darren Ward, editors, Microphone Arrays, Digital Signal Processing, pages 61–85. Springer-Verlag Berlin Heidelberg, 2001.

[11] Alan V. Oppenheim and Ronald W. Schafer. Discrete-Time Signal Processing (3rd Edition) (Prentice-Hall Signal Processing Series). Prentice Hall, 3 edition, August 2009.

[12] Jens Blauert. Spatial Hearing - Revised Edition: The Psychophysics of Human Sound Localization. The MIT Press, Cambridge, Massachusetts and London, England, Oct. 1996. Chapters 2 and 3.

[13] Monson H. Hayes. Statistical Digital Signal Processing and Modeling. John Wiley & Sons, New York, 1996.

[14] John Eargle. The Microphone Book. Focal Press, Burlington MA, 2004. Chapters 4 and 5.

[15] V. Ralph Algazi, Carlos Avendano, and Richard O. Duda. Estimation of a Spherical-Head Model from Anthropometry. The Journal of the Audio Engineering Society, 49(6):472–479, 2001.

[16] C. Phillip Brown and Richard O. Duda. A Structural Model for Binaural Sound Synthesis. IEEE Transactions on Speech and Audio Processing, 6(5):476–488, 1998.

[17] Richard O. Duda and William L. Martens. Range Dependence of the Response of a Spherical Head Model. The Journal of the Acoustical Society of America, 104(5):3048–3058, 1998.

[18] V. Ralph Algazi, Richard O. Duda, and Dennis M. Thompson. The Use of Head-and-Torso Models for Improved Spatial Sound Synthesis. In Audio Engineering Society Convention 113. Audio Engineering Society, 2002.
[19] Fritz Menzer, Christof Faller, and Hervé Lissek. Obtaining Binaural Room Impulse Responses From B-Format Impulse Responses Using Frequency-Dependent Coherence Matching. IEEE Transactions on Audio, Speech, and Language Processing, 19(2):396–405, 2011.

[20] Frank J. Fahy. Foundations of Engineering Acoustics. Academic Press, London, 2000.

[21] Kinsler, Frey, Coppens, and Sanders. Fundamentals of Acoustics. John Wiley & Sons, New York, 1982.

[22] Koichiro Hiyama, Setsu Komiyama, and Kimio Hamasaki. The Minimum Number of Loudspeakers and its Arrangement for Reproducing the Spatial Impression of a Diffuse Sound Field. In Audio Engineering Society 113th Convention, Oct. 2002.

[23] Olli Santala and Ville Pulkki. Directional Perception of Distributed Sound Sources. The Journal of the Acoustical Society of America, 129(3):1522–1530, 2011.

[24] Fritz Menzer and Christof Faller. Investigations on an Early-Reflection-Free Model for BRIR. The Journal of the Audio Engineering Society, 58(9):709–723, 2010.

[25] Brian C.J. Moore. An Introduction to the Psychology of Hearing. Academic Press, 2003.

[26] Manfred Schroeder. Natural Sounding Artificial Reverberation. The Journal of the Audio Engineering Society, 10(2):219–223, 1962.

[27] Nail A. Gumerov and Ramani Duraiswami. Fast Multipole Methods for the Helmholtz Equation in Three Dimensions. Elsevier Series in Electromagnetism. Elsevier Science, Amsterdam, 2005.

[28] Rozenn Nicol. Sound Spatialization By Higher Order Ambisonics: Encoding And Decoding A Sound Scene In Practice From a Theoretical Point Of View. In Proceedings of the 2nd International Symposium on Ambisonics and Spherical Acoustics, 2010.

[29] Pavel Zahorik, Douglas S. Brungart, and Adelbert W. Bronkhorst. Auditory Distance Perception in Humans: A Summary of Past and Present Research. Acta Acustica united with Acustica, 91(3):409–420, 2005.

[30] Jens Ahrens. Analytic Methods of Sound Field Synthesis. Springer-Verlag Berlin Heidelberg, 2012. Chapter 1.

[31] Jean-Marc Jot, Laurent Cerveau, and Olivier Warusfel. Analysis and Synthesis of Room Reverberation based on a Statistical Time-Frequency Model. In Audio Engineering Society 103rd Convention, Sept. 1997.
[32] Ville Pulkki. Virtual Sound Source Positioning Using Vector Base Amplitude Panning. The Journal of the Audio Engineering Society, 45(6):456–466, June 1997.
Author Biography

David Romblom received Music and Electrical Engineering degrees from the University of Michigan, a Master's in Media Arts and Technology from the University of California at Santa Barbara, and his Ph.D. in Music Technology from McGill University. He has worked for E-mu Systems, Universal Audio, Sennheiser Research, and Applied Acoustics Systems, and is currently directing the research and development efforts at Dysonics in San Francisco. His interests center on perceptually-transparent approximations of acoustic reality. In his spare time he enjoys playing the piano and being outside.