2011 Joint Workshop on Hands-free Speech Communication and Microphone Arrays

May 30 - June 1, 2011

GENERATING VIRTUAL MICROPHONE SIGNALS USING GEOMETRICAL INFORMATION GATHERED BY DISTRIBUTED ARRAYS

Giovanni Del Galdo¹, Oliver Thiergart², Tobias Weller¹, and Emanuël A. P. Habets²

¹ Fraunhofer Institute for Integrated Circuits IIS, Erlangen, Germany
² International Audio Laboratories Erlangen, Germany
Email: [email protected]

ABSTRACT

Conventional recording techniques for spatial audio are limited by the fact that the spatial image obtained is always relative to the position in which the microphones have been physically placed. In many applications, however, it is desired to place the microphones outside the sound scene and yet be able to capture the sound from an arbitrary perspective. This contribution proposes a method to place a virtual microphone at an arbitrary point in space by computing a signal perceptually similar to the one which would have been picked up if the microphone had been physically placed in the sound scene. The method relies on a parametric model of the sound field based on point-like isotropic sound sources. The required geometrical information is gathered by two or more distributed microphone arrays. Measurement results demonstrate the applicability of the proposed method and reveal its limitations.

Index Terms— Spatial sound, Sound localization, Audio recording, Parameter estimation

1. INTRODUCTION

Spatial sound acquisition aims at capturing either an entire sound scene or just certain desired components, depending on the application at hand. Several recording techniques providing different advantages and drawbacks are available for these purposes. For instance, close-talking microphones are often used for recording individual sound sources with high SNR and low reverberation, while more distant configurations such as XY stereophony represent a way of capturing the spatial image of an entire sound scene. More flexibility in terms of directivity can be achieved with beamforming, where a microphone array is used to realize steerable pick-up patterns. Even more flexibility is provided by parametric methods such as directional audio coding (DirAC) [1], with which it is possible to realize spatial filters with arbitrary pick-up patterns [2] as well as other signal processing manipulations of the sound scene [3, 4].

All these methods have in common that they are limited to a representation of the sound field with respect to only one point, namely the measurement location. Thus, the required microphones must be placed at very specific, carefully selected positions, e.g., close to the sources or such that the spatial image can be captured optimally. In many applications, however, this is not feasible, and it would therefore be beneficial to place the microphones farther away from the sound sources and still be able to capture the sound as desired.

There exist several field reconstruction methods for estimating the sound field at a point in space other than where it was measured. One method is acoustic holography [5], which allows one to compute the sound field at any point within an arbitrary volume, given that the sound pressure and particle velocity are known on its entire surface.


Therefore, when the volume is large, an impractically large number of sensors is required. Moreover, the method assumes that no sound sources are present inside the volume, making the algorithm infeasible for our needs. The related wave field extrapolation [5] aims at extrapolating the known sound field on the surface of a volume to outer regions. The extrapolation accuracy, however, degrades rapidly for larger extrapolation distances as well as for extrapolations towards directions orthogonal to the direction of propagation of the sound [6]. In [7] a plane wave model is assumed, such that the field extrapolation is possible only at points far from the actual sound sources, i.e., close to the measurement point.

To overcome the drawbacks of these field reconstruction methods, this contribution proposes a parametric method capable of estimating the sound signal of a virtual microphone placed at an arbitrary location. In contrast to the methods previously described, the proposed method does not aim directly at reconstructing the sound field, but rather at providing sound that is perceptually similar to the one which would be picked up by a microphone physically placed at this location. This is possible thanks to a parametric model of the sound field based on isotropic point-like sound sources (IPLS). The required geometrical information, namely the instantaneous position of all IPLS, is gathered via triangulation of the directions of arrival (DOA) estimated with two or more distributed microphone arrays. Therefore, knowledge of the relative position and orientation of the arrays is required. Nevertheless, no a priori knowledge of the number and position of the actual sound sources is necessary. Given the parametric nature of the method, the virtual microphone can possess an arbitrary directivity pattern as well as physical or non-physical behaviors, e.g., with respect to the pressure decay with distance. The presented approach is verified by studying the parameter estimation accuracy based on measurements in a reverberant environment.

The paper is structured as follows: In Section 2 the sound field model is introduced and the geometric parameter estimation algorithm is derived. In Section 3 the virtual microphone approach is presented and discussed in detail. The algorithm is verified with measurement results in Section 4. Section 5 concludes the paper.

2. GEOMETRIC PARAMETER ESTIMATION

2.1. Sound Field Model

The sound field is analyzed in the time-frequency domain, for instance obtained via a short-time Fourier transform (STFT), in which k and n denote the frequency and time indices, respectively.

Fig. 1. Geometry used throughout this contribution: microphone arrays at p1 and p2 (orientations c1, c2), the IPLS at pIPLS(k, n) seen under the azimuths ϕ1, ϕ2 along the direction vectors d1, d2, and the virtual microphone at pv with vector s; O is the origin of the global coordinate system.

The complex pressure Pv(k, n) at an arbitrary position pv for a certain k and n is modeled as a single spherical wave emitted by a narrowband isotropic point-like source (IPLS), i.e.,

    P_v(k, n) = P_{IPLS}(k, n) \, \gamma\bigl(k, p_{IPLS}(k, n), p_v\bigr),    (1)

where PIPLS(k, n) is the signal emitted by the IPLS at its position pIPLS(k, n). The complex factor γ(k, pIPLS, pv) expresses the propagation from pIPLS(k, n) to pv, i.e., it introduces appropriate phase and magnitude modifications as discussed in Section 3. Hence, we assume that in each time-frequency bin only one IPLS can be active. Nevertheless, multiple narrowband IPLSs located at different positions can be active at a single time instance n.

Each IPLS models either direct sound or a distinct room reflection, such that its position pIPLS(k, n) ideally corresponds to an actual sound source located inside the room, or to a mirror image sound source located outside, respectively. Notice that this single-wave model is accurate only for mildly reverberant environments, given that the source signals fulfill the W-disjoint orthogonality (WDO) condition, i.e., the time-frequency overlap is sufficiently small. This is normally true for speech signals [8]. The next two sections deal with the estimation of the positions pIPLS(k, n), whereas Section 3 deals with the estimation of PIPLS(k, n) and the computation of γ(k, pIPLS, pv).

2.2. Position Estimation

The position pIPLS(k, n) of an IPLS active in a certain time-frequency bin is estimated via triangulation on the basis of the direction of arrival (DOA) of sound measured at two or more different observation points. Let us consider the geometry in Fig. 1, where the IPLS of the current (k, n) is located at the (unknown) position pIPLS(k, n). In order to determine the required DOA information, we use two microphone arrays with known geometry, position, and orientation placed at p1 and p2, respectively. The array orientations are defined by the unit vectors c1 and c2. The DOA of the sound is determined at p1 and p2 for each (k, n) using a DOA estimation algorithm, for instance as provided by the DirAC analysis [1]. The output of the DOA estimators from the point of view (POV) of the arrays can be expressed as the unit vectors e1^POV(k, n) and e2^POV(k, n) (not depicted in the plot). For instance, when operating in 2D,

    e_1^{POV}(k, n) = \bigl[\cos\varphi_1(k, n), \; \sin\varphi_1(k, n)\bigr]^T,    (2)

where ϕ1(k, n) is the azimuth of the DOA estimated at the first array, as depicted in Fig. 1. The corresponding DOA unit vectors e1(k, n) and e2(k, n), with respect to the global coordinate system with origin O, are computed via

    e_1(k, n) = R_1 \, e_1^{POV}(k, n),    (3)
    e_2(k, n) = R_2 \, e_2^{POV}(k, n),

where the R are coordinate transformation matrices, e.g.,

    R_1 = \begin{bmatrix} c_{1,x} & -c_{1,y} \\ c_{1,y} & c_{1,x} \end{bmatrix},    (4)

when operating in 2D and c1 = [c1,x, c1,y]^T. For carrying out the triangulation we define the direction vectors d1(k, n) and d2(k, n) as

    d_1(k, n) = d_1(k, n) \, e_1(k, n),    (5)
    d_2(k, n) = d_2(k, n) \, e_2(k, n),

where d1(k, n) = ||d1(k, n)|| and d2(k, n) = ||d2(k, n)|| are the unknown distances between the IPLS and the two microphone arrays. The triangulation is computed by solving

    p_1 + d_1(k, n) = p_2 + d_2(k, n)    (6)

for either d1(k, n) or d2(k, n). Finally, the position pIPLS(k, n) of the IPLS is given by

    p_{IPLS}(k, n) = d_1(k, n) \, e_1(k, n) + p_1.    (7)

Equation (6) always provides a solution when operating in 2D, unless e1(k, n) and e2(k, n) are parallel. When using more than two microphone arrays or when operating in 3D, however, a triangulation is not directly possible when the direction vectors d do not intersect. In this case, we can compute the point which is closest to all direction vectors d and use the result as the position of the IPLS.
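As an illustration (not part of the original processing chain), the following Python sketch implements the 2D triangulation of Eqs. (2)-(7) for a single time-frequency bin; the function names and the example geometry are our own.

import numpy as np

def rotation_matrix(c):
    """Coordinate transformation matrix of Eq. (4) for the array
    orientation unit vector c = [c_x, c_y]."""
    return np.array([[c[0], -c[1]],
                     [c[1],  c[0]]])

def triangulate_ipls(p1, p2, c1, c2, phi1, phi2):
    """Estimate the IPLS position for one (k, n) bin from the two local
    DOA azimuths phi1 and phi2 (in radians), following Eqs. (2)-(7)."""
    # Eq. (2): DOA unit vectors from the point of view of each array
    e1_pov = np.array([np.cos(phi1), np.sin(phi1)])
    e2_pov = np.array([np.cos(phi2), np.sin(phi2)])
    # Eq. (3): rotate into the global coordinate system
    e1 = rotation_matrix(c1) @ e1_pov
    e2 = rotation_matrix(c2) @ e2_pov
    # Eq. (6): solve p1 + d1*e1 = p2 + d2*e2 for the unknown distances
    A = np.column_stack((e1, -e2))
    if abs(np.linalg.det(A)) < 1e-12:
        return None  # parallel DOA vectors, no unique intersection
    d1, d2 = np.linalg.solve(A, p2 - p1)
    # Eq. (7): position of the IPLS
    return p1 + d1 * e1

# Example: arrays at [-1, 0] and [1, 0], both oriented along the global x-axis,
# so that the local and global coordinate systems coincide (R = identity).
p1, p2 = np.array([-1.0, 0.0]), np.array([1.0, 0.0])
c1 = c2 = np.array([1.0, 0.0])
print(triangulate_ipls(p1, p2, c1, c2, np.deg2rad(45), np.deg2rad(135)))  # -> [0, 1]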

Notice that in general all observation points p1, p2, . . . must be located such that the sound emitted by the IPLS falls into the same temporal block n. Otherwise, combining the DOA information for a certain time-frequency bin cannot provide useful results. This requirement is fulfilled when the distance ∆ between any two observation points is smaller than

    \Delta_{max} = c \, \frac{n_{FFT} (1 - R)}{f_s},    (8)

where nFFT is the STFT window length, 0 ≤ R < 1 specifies the overlap between successive time frames, and fs is the sampling frequency. For example, for a 1024-point STFT at 48 kHz with 50% overlap (R = 0.5), the maximum spacing between the arrays that fulfills the above requirement is ∆ = 3.65 m.

2.3. Parameter Estimation in Practice

When only direct sound and distinct room reflections are present, the estimator introduced in the previous section leads to a position pIPLS(k, n) which corresponds to either an actual sound source (for direct sound) or a mirror image source (for a distinct room reflection). However, in most cases, the sound field measured in a room does not consist only of direct sound and distinct room reflections, i.e., non-diffuse sound, but also of diffuse sound, which is not considered in the model in (1). In the case of pure diffuse sound, the estimated position pIPLS(k, n) is random and its distribution depends on the DOA estimator used. For instance, when using DirAC [1] the estimated DOA is a uniformly distributed random variable.

In this case, the triangulation in (6) leads to positions which are concentrated around the observation points p1 and p2. To visualize this, consider Fig. 2, which shows two array positions at p1 = [−1, 0]^T and p2 = [1, 0]^T and several local DOA vectors e1 and e2 (represented by the black lines), equally spaced in azimuth. It can be seen that the intersection points are denser at distances close to the microphone arrays and sparser at larger distances. This confirms that when both arrays estimate uniformly distributed random DOAs, the triangulation yields positions with higher probability near the two arrays. This is also shown by the greyscale plot in Fig. 2, which depicts the probability density function (PDF) (in dB) of the localized position when both arrays estimate uniformly distributed random DOAs.

Fig. 2. Black lines: equally spaced DOA vectors from the point of view of two arrays. Greyscale plot: PDF in dB of the localized position when both arrays estimate uniformly distributed random DOAs.

The two specific scenarios discussed above, namely non-diffuse sound only and diffuse sound only, accurately describe most scenarios encountered in practice in the case of speech signals. In fact, the frequent onsets and speech pauses lead to situations in which either non-diffuse sound or diffuse sound is dominant.
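The concentration of diffuse-field estimates around the arrays can be reproduced with a small Monte Carlo experiment; the sketch below is our own illustration (arrays aligned with the global axes, so the rotation of Eq. (3) is omitted) and simply histograms the triangulated positions obtained from independent, uniformly distributed DOAs.

import numpy as np

rng = np.random.default_rng(0)
p1, p2 = np.array([-1.0, 0.0]), np.array([1.0, 0.0])

positions = []
for _ in range(100_000):
    # Diffuse sound: both arrays report independent, uniformly distributed DOAs
    phi1, phi2 = rng.uniform(0.0, 2.0 * np.pi, size=2)
    e1 = np.array([np.cos(phi1), np.sin(phi1)])
    e2 = np.array([np.cos(phi2), np.sin(phi2)])
    A = np.column_stack((e1, -e2))          # Eq. (6): p1 + d1*e1 = p2 + d2*e2
    if abs(np.linalg.det(A)) < 1e-6:
        continue                            # (nearly) parallel DOA vectors
    d1, _ = np.linalg.solve(A, p2 - p1)
    pos = p1 + d1 * e1                      # Eq. (7)
    if np.all(np.abs(pos) < 5.0):           # keep only points inside the plotted area
        positions.append(pos)

positions = np.array(positions)
# 2D histogram of the localized positions: the probability mass piles up
# around p1 and p2, mirroring the PDF shown in Fig. 2
hist, xedges, yedges = np.histogram2d(positions[:, 0], positions[:, 1], bins=80)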

3. VIRTUAL MICROPHONE GENERATION

Once the IPLS have been localized as discussed in the previous section, the omnidirectional pressure signal Pv(k, n) at the (arbitrary) position pv of the virtual microphone can be estimated following (1). Based on this signal we can then derive the output P̃v(k, n) of a virtual microphone with arbitrary directivity. The pressure signal PIPLS(k, n) required in (1) is obtained from the pressure signal Pref(k, n) of a physical reference microphone located at pref. Analogously to (1), we can write

    P_{ref}(k, n) = P_{IPLS}(k, n) \, \gamma(k, p_{IPLS}, p_{ref}),    (9)

which is solved for PIPLS(k, n). The reference signal Pref(k, n) can be obtained, for instance, from one microphone of the arrays, as discussed in Section 3.1. In general, the complex factor γ(k, pa, pb) expresses the phase rotation and amplitude decay introduced by the propagation of a spherical wave from its origin at pa to pb. However, practical tests indicated that considering only the amplitude decay in γ leads to plausible impressions of the virtual microphone signal with significantly fewer artifacts than when the phase rotation is also considered. The exact computation of the reference signal Pref(k, n), the propagation factor γ, and the virtual microphone output P̃v(k, n) is given in the following sections.

3.1. Reference Pressure Signal

The reference pressure signal Pref(k, n) is derived from the array microphones. Assuming that the microphone arrays consist of omnidirectional sensors, there exist different approaches to generate Pref(k, n), for instance

• using one specific, fixed microphone,
• using the microphone which is closest to the localized IPLS or to the position of the virtual microphone,
• combining the microphone signals of one array,
• combining all available microphones.

Using the array microphone closest to the position pIPLS(k, n) of the IPLS potentially provides a higher SNR and lower reverberation (since the distance to the sound source is smaller), but will likely introduce coloration or artifacts when switching to a different sensor for each (k, n). A similar observation can be made when combining the array sensors, e.g., via beamforming. A beamformer can be realized in a straightforward manner by exploiting the geometrical information on the IPLS. For instance, one possible beamforming solution combining both microphone arrays compensates the delay between all sensor signals and then computes their sum (see the illustrative sketch at the end of this subsection). Since both arrays are sufficiently close that a sound event appears at all microphones in the same time-frequency bin (k, n), the delay compensation can be realized by a phase rotation for each (k, n). This requires exact knowledge of the phase differences between the sensors, which can be obtained directly from the geometrical parameters, namely from the distances between the individual sensors and the position of the current IPLS. This approach, however, requires a very precise parameter estimation, especially when compensating the delays between the sensors of different arrays. The performance of the different methods for computing the reference pressure signal is a topic of current research. For this contribution, we restrict the discussion to the case in which we take as reference pressure signal Pref(k, n) the array microphone which is closest to the virtual microphone at pv, as this requires the smallest modifications for generating the virtual microphone signal and thus potentially provides the fewest artifacts.
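The delay-compensation option listed above can be sketched as follows; this is an illustrative implementation with our own function names, not the processing actually used in this paper (which takes the single microphone closest to the virtual microphone).

import numpy as np

def delay_and_sum_bin(sensor_bins, mic_positions, p_ipls, k, n_fft, fs, c=343.0):
    """Combine the sensor signals of one time-frequency bin into a reference
    pressure by compensating the propagation delays towards the localized IPLS.

    sensor_bins   : complex STFT values of all microphones for this (k, n)
    mic_positions : corresponding sensor positions
    p_ipls        : localized IPLS position for this (k, n)
    """
    sensor_bins = np.asarray(sensor_bins)
    mic_positions = np.asarray(mic_positions, dtype=float)
    # Distances between the IPLS and the individual sensors
    dists = np.linalg.norm(mic_positions - p_ipls, axis=1)
    # Relative delays with respect to the closest sensor
    delays = (dists - dists.min()) / c
    # Phase rotation aligning all sensors at the bin centre frequency
    f_k = k * fs / n_fft
    aligned = sensor_bins * np.exp(2j * np.pi * f_k * delays)
    return aligned.mean()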

Fig. 3. Left: measurement setup with two sources A and B, and two array positions p1 and p2. Right: microphone array consisting of four omnidirectional sensors with spacing d = 3.2 cm.

Fig. 4. Room impulse response h1(t) between source A and the first microphone at p1; the direct part and the early part are marked.

3.2. Magnitude Reconstruction

The sound energy which can be measured at a certain point in space depends strongly on the distance r from the sound source. In many situations, this behavior can be modeled with sufficient accuracy using well-known physical principles, such as the 1/r decay of the sound pressure in the far field of a point source. Therefore, when the distance of both the reference microphone and the virtual microphone from the sound source is known, we can estimate the sound energy at the position of the virtual microphone from the signal energy of the reference microphone. This means that the output signal of the virtual microphone can be obtained by applying proper gains to the reference pressure signal.

Let the reference microphone be located at pref = p1 as shown in Fig. 1 and the virtual microphone at pv. Since the geometry in Fig. 1 is known in detail, we can easily determine the distance d1(k, n) = ||d1(k, n)|| between the reference microphone and the IPLS, as well as the distance s(k, n) = ||s(k, n)|| between the virtual microphone and the IPLS, namely

    s(k, n) = \|s(k, n)\| = \|p_1 + d_1(k, n) - p_v\|.    (10)

The sound pressure Pv(k, n) at the position of the virtual microphone is computed by combining (1) and (9), leading to

    P_v(k, n) = \frac{\gamma(k, p_{IPLS}, p_v)}{\gamma(k, p_{IPLS}, p_{ref})} \, P_{ref}(k, n).    (11)

As mentioned at the beginning of the section, the factors γ only consider the amplitude decay due to the propagation. Assuming, for instance, that the sound pressure decreases with 1/r, then

    P_v(k, n) = \frac{d_1(k, n)}{s(k, n)} \, P_{ref}(k, n).    (12)
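The per-bin weighting of Eq. (12) can be sketched as follows; the function name and the gain limit (introduced in Section 4.2 to avoid extreme amplifications) are our own illustrative choices.

import numpy as np

def virtual_mic_bin(p_ref_bin, p_ipls, p_ref_pos, p_v, max_gain=4.0):
    """Apply the 1/r magnitude rule of Eq. (12) to one reference-signal bin.

    p_ref_bin : complex STFT value P_ref(k, n) of the reference microphone
    p_ipls    : localized IPLS position for this (k, n)
    p_ref_pos : position of the reference microphone
    p_v       : position of the virtual microphone
    """
    d1 = np.linalg.norm(p_ipls - p_ref_pos)   # IPLS -> reference microphone
    s = np.linalg.norm(p_ipls - p_v)          # IPLS -> virtual microphone, Eq. (10)
    gain = d1 / max(s, 1e-6)
    # Limit the distance ratio to avoid extreme amplifications (cf. Section 4.2)
    return min(gain, max_gain) * p_ref_bin

# Example: the virtual microphone is closer to the IPLS than the reference
# microphone, so this bin is amplified.
out = virtual_mic_bin(1.0 + 0.0j, np.array([0.0, 2.0]),
                      np.array([-1.0, 0.0]), np.array([0.0, 1.0]))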

When the model in (1) holds, i.e., when only direct sound is present, then (12) can accurately reconstruct the magnitude information. However, in the case of pure diffuse sound fields, i.e., when the model assumptions are not met, the presented method yields an implicit dereverberation of the signal when moving the virtual microphone away from the positions of the sensor arrays. In fact, as discussed in Section 2.3, in diffuse sound fields we expect most IPLS to be localized near the two sensor arrays. Thus, when moving the virtual microphone away from these positions, we likely increase the distance s = ||s|| in Fig. 1. Therefore, the magnitude of the reference pressure is decreased when applying a weighting according to (11). Correspondingly, when moving the virtual microphone close to an actual sound source, the time-frequency bins corresponding to the direct sound will be amplified, such that the overall audio signal will be perceived as less diffuse. By adjusting the rule in (12), one can control the direct sound amplification and diffuse sound suppression at will.

3.3. Virtual Microphone Directivity

From the geometrical information estimated in Section 2.2, we can apply arbitrary directivity patterns to the virtual microphone. In doing so, one can, for instance, separate a source from a complex sound scene, assuming that the model assumptions hold. Since the DOA of the sound can be computed at the position pv of the virtual microphone, namely

    \varphi_v(k, n) = \arccos\!\left( \frac{s \cdot c_v}{\|s\|} \right),    (13)

where cv is a unit vector describing the orientation of the virtual microphone, we can realize arbitrary directivities for the virtual microphone. For instance,

    \tilde{P}_v(k, n) = P_v(k, n) \bigl[ 1 + \cos\bigl(\varphi_v(k, n)\bigr) \bigr]    (14)

is the output of a virtual microphone with cardioid directivity. It is clear that the directional patterns which can potentially be generated in this way depend on the accuracy of the position estimation.
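Building on Eqs. (13) and (14), the sketch below shows how a first-order pick-up pattern could be applied to the virtual microphone signal; the helper name and the generalization parameter alpha are our own (alpha = 0.5 corresponds, up to a factor of two, to the cardioid of Eq. (14)).

import numpy as np

def directive_virtual_mic_bin(p_v_bin, p_ipls, p_v_pos, c_v, alpha=0.5):
    """Weight one virtual-microphone bin with the first-order directivity
    alpha + (1 - alpha) * cos(phi_v), where phi_v follows Eq. (13)."""
    s = p_ipls - p_v_pos                            # vector from virtual mic to IPLS
    cos_phi = np.dot(s, c_v) / np.linalg.norm(s)    # Eq. (13), with c_v a unit vector
    return (alpha + (1.0 - alpha) * cos_phi) * p_v_bin

# Example: virtual microphone at the origin looking towards a source at [0, 2];
# sound localized at that source is kept, sound from the opposite side is nulled.
c_v = np.array([0.0, 1.0])
kept = directive_virtual_mic_bin(1.0, np.array([0.0, 2.0]), np.zeros(2), c_v)
rejected = directive_virtual_mic_bin(1.0, np.array([0.0, -2.0]), np.zeros(2), c_v)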


4. MEASUREMENT RESULTS

Measurements were carried out in a room (9.3 m × 7.5 m × 4.2 m) with reverberation time T60 ≈ 0.36 s to verify the accuracy of the geometrical parameter estimation. The measurement setup as well as one of the two identical arrays are depicted in Fig. 3. The microphone arrays, each consisting of M = 4 omnidirectional sensors with spacing d = 3.2 cm, are located at p1 and p2, respectively. Two sound sources are placed at A and B, emitting a female and a male speech signal, respectively. The distance between p1 and p2, as well as between A and B, is r = 2 m. The microphone signals are transformed into the frequency domain using a 1024-point STFT at fs = 48 kHz with 50% overlap (R = 0.5). According to Section 2, this spectro-temporal resolution ensures that a sound event arrives at both arrays in the same time-frequency bin and also guarantees the W-disjoint orthogonality (WDO) [8].

The geometrical analysis, which yields for each (k, n) the position pIPLS(k, n) of the individual sound events, is computed following the theory in Section 2.2. As reference pressure Pref(k, n) we use the sensor P1(k, n) of the microphone array located at p1. The DOA of the sound is estimated in the horizontal plane at both points p1 and p2 following the DirAC algorithm [1]. In DirAC, the DOA unit vectors e1^POV(k, n) and e2^POV(k, n) are defined as

    e^{POV}(k, n) = - \frac{I_a(k, n)}{\|I_a(k, n)\|},    (15)

where Ia(k, n) is the active sound intensity vector at the observation point. This vector is computed for both arrays via [9]

    I_a(k, n) = \mathrm{Re}\bigl\{ P(k, n) \, V^*(k, n) \bigr\},    (16)

where Re{·} is the real-part operator and (·)* denotes complex conjugation. The sound pressure P(k, n) in the center of the array is computed by taking the mean of the complex signals at the four array microphones. The corresponding particle velocity vector V(k, n) is determined from pressure differences [10]. Due to the spacing d between the array microphones, spatial aliasing occurs at frequencies higher than [11]

    f_{max} = \frac{c}{\sqrt{2}\, d} \approx 7.5\ \mathrm{kHz},    (17)

where c is the speed of sound. Therefore, we limit all following investigations to a maximum frequency of fmax.
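As an illustration of the DOA estimation in Eqs. (15) and (16), the following sketch computes the local azimuth for one time-frequency bin, assuming that the pressure and the two horizontal particle-velocity components of that bin are already available (e.g., obtained from the omnidirectional sensors via averaging and pressure differences); the names are our own.

import numpy as np

def doa_from_intensity(p_bin, v_bin):
    """Return the local DOA azimuth (radians) for one time-frequency bin.

    p_bin : complex sound pressure P(k, n) at the array center
    v_bin : length-2 complex particle velocity vector V(k, n) = [Vx, Vy]
    """
    # Eq. (16): active sound intensity vector
    intensity = np.real(p_bin * np.conj(np.asarray(v_bin)))
    # Eq. (15): the DOA unit vector points against the energy flow
    e_pov = -intensity / (np.linalg.norm(intensity) + 1e-12)
    return np.arctan2(e_pov[1], e_pov[0])

# Example: a plane wave arriving from azimuth 30 degrees; its particle velocity
# points along the propagation direction (away from the source), so the
# intensity-based estimate recovers the 30 degree DOA.
phi_true = np.deg2rad(30.0)
p = 1.0 + 0.0j
v = -p * np.array([np.cos(phi_true), np.sin(phi_true)])
print(np.rad2deg(doa_from_intensity(p, v)))   # ~30.0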

Fig. 5. Spatial power densities (SPD) [dB] of the localized positions pIPLS for a single sound source situation: (a) complete RIR, (b) direct sound, (c) early part, (d) late part.

Fig. 6. SPD [dB] of the localized positions pIPLS when both sound sources (marked by the white dots) are active at the same time.

Notice further that in diffuse sound fields, the estimated intensity vector Ia(k, n) in (15) points to random directions. Therefore, the directions of the DOA unit vectors e^POV(k, n) are uniformly distributed over [0, 2π), leading to the distribution in Fig. 2 as discussed in Section 2.3.

4.1. Single Talker Situation

Let us first study the performance of the proposed system for a single talker situation where only sound source A is active. Figure 4 shows the room impulse response (RIR) h1(t) of sensor P1(k, n) of the microphone array at p1. The direct sound part and the early part of the RIR are marked by the two windows. In the following, we filter out different parts of the measured RIRs by means of the depicted windows, and then convolve the dry speech signal with the resulting impulse responses to obtain the microphone recordings. In doing so, we can separately analyze the influence of the different parts of the sound field on the parameter estimation.

To investigate the accuracy of the parameter estimation, we compute the spatial power density (SPD) Γ(p) of the estimated IPLS positions pIPLS. The SPD describes the sound energy localized at a certain position p = [x, y]^T, i.e.,

    \Gamma(p) = \sum_{(k, n) \in K} |P_{ref}(k, n)|^2,    (18)

where K = {(k, n) | p = pIPLS(k, n)} and Pref(k, n) is the reference pressure signal as explained in Section 3.1. Before computing Γ(p), all localized positions pIPLS with a distance larger than the room size are clipped to the room borders.
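A small sketch of how the SPD of Eq. (18) could be accumulated on a spatial grid from the per-bin IPLS positions and reference-signal bins; the grid resolution and names are our own illustrative choices.

import numpy as np

def spatial_power_density(ipls_positions, ref_bins, room_min, room_max, cell=0.05):
    """Accumulate |P_ref(k, n)|^2 on a 2D grid at the localized IPLS positions,
    following Eq. (18); positions outside the room are clipped to its borders."""
    room_min = np.asarray(room_min, dtype=float)
    room_max = np.asarray(room_max, dtype=float)
    shape = np.ceil((room_max - room_min) / cell).astype(int)
    spd = np.zeros(shape)
    for pos, value in zip(ipls_positions, ref_bins):
        pos = np.clip(pos, room_min, room_max)            # clip to the room borders
        idx = np.minimum(((pos - room_min) / cell).astype(int), shape - 1)
        spd[idx[0], idx[1]] += np.abs(value) ** 2         # Eq. (18)
    return spd

# Example with a handful of (position, reference-bin) pairs
positions = [np.array([0.3, 2.0]), np.array([0.31, 2.02]), np.array([-6.0, 9.0])]
bins = [0.8 + 0.1j, 0.5 - 0.2j, 0.1 + 0.0j]
spd = spatial_power_density(positions, bins, room_min=[-5, -4], room_max=[5, 4])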


The SPDs Γ(p) for the single source scenario are depicted in Fig. 5. Plot (a) shows the results when the complete measured RIRs are used. The white dot represents the true source position, while the black dots show the microphone arrays. The cross marks the center of gravity. As can be seen, most energy is localized around the true source position. However, we notice an estimation bias towards the right array. The energy of the mirror image sources, which is mapped to the room borders, is also clearly visible in the plot. We further notice that some energy is distributed around both arrays. This energy mainly corresponds to the diffuse sound energy, which, as discussed in Section 2.3, is localized with higher probability around the array locations. To verify this claim, Fig. 5(b)–(d) shows the corresponding SPDs when filtering out different parts of the measured RIRs. Notice that we use the same decibel scale as in Fig. 5(a), but normalize the plots to a maximum of 0 dB. When only direct sound components are present (Fig. 5(b)), nearly all energy is localized around the true source position. In Fig. 5(c), which illustrates the result when only the early part of the sound arrives at the arrays, we again notice the localization bias towards the right microphone array. This indicates that the ground reflection has a significant influence on the position estimation. Figure 5(d) depicts the results when only the late part of the sound arrives at the microphone arrays. In this case most energy is localized around the array positions, verifying the theory in Section 2.3.

4.2. Double Talk Situation

For a situation with two speakers active at the same time, the SPD is shown in Fig. 6. The black circles indicate the positions of the microphone arrays and the white circles indicate the exact speaker positions. Although both speakers are active at the same time, a distinct power concentration around both true speaker positions can be seen. As for the single talker case, a concentration of energy can be observed around the microphone arrays due to the diffuse energy in the late signal part (see Fig. 5).

As an example, we place a virtual microphone at the origin. The signal from the first microphone of the first array is chosen as the reference signal. Then, this signal is weighted as in (12). To avoid extreme amplifications, the ratio of the distances has been limited to a reasonable maximum. The spectrogram of the virtual microphone signal Pv(k, n) is shown in Fig. 7(a). In this scenario, first only speaker A is active, then only speaker B, and in the end both speakers are active at the same time.

Fig. 7. Plot (a) shows the virtual microphone signal Pv(k, n) (virtual omnidirectional sensor). Plot (b) shows the virtual microphone signal P̃v(k, n) when applying a cardioid directivity. The virtual microphone is placed at the origin and directed towards source A.

Since the two sound sources are spatially separated, it is also possible to separate their signals by means of a directive virtual microphone. Therefore, in addition to the distance-dependent filter, a cardioid-like pick-up pattern as described in (14) is assigned to the virtual microphone, with the look direction pointed at speaker A. The spectrogram of the resulting signal (given the same speaker scenario as above) is depicted in Fig. 7(b). It can be seen that while the signal of speaker A remains unchanged, the signal of speaker B is highly attenuated. The results show that for the presented scenario source separation is indeed possible to a certain extent using a virtual microphone. Nevertheless, appropriate listening tests are necessary in order to determine the signal quality quantitatively.

5. CONCLUSIONS

This contribution proposes the use of isotropic point-like sources (IPLS) to model a complex sound scene. Each IPLS represents either direct sound or a distinct room reflection and is active only in a single time-frequency bin. By estimating the direction of arrival of sound at two or more points in space, e.g., via microphone arrays, it is possible to localize the position (direction and distance) of the IPLS. Given the estimated source positions and the pressure signal measured at an appropriate reference position, one can compute the output signal of an arbitrary virtual microphone. Informal listening tests indicated that the signal of the virtual microphone is perceptually similar to the one which would have been measured had the microphone been placed physically in the sound scene. The parametric nature of the scheme allows us to define an arbitrary directivity pattern for the virtual microphone and also to realize a non-physical behavior, for instance by applying an arbitrary decay with distance. The introduced signal model is valid for mildly reverberant environments given that the time-frequency overlap of the emitted sound source signals is sufficiently small. This assumption is normally true for speech signals. The proposed approach has been verified by measurements in a reverberant environment in both a single and a double talker scenario.

6. REFERENCES

[1] V. Pulkki, "Spatial sound reproduction with directional audio coding," J. Audio Eng. Soc., vol. 55, no. 6, pp. 503–516, June 2007.

[2] M. Kallinger, H. Ochsenfeld, G. Del Galdo, F. Kuech, D. Mahne, R. Schultz-Amling, and O. Thiergart, "A spatial filtering approach for directional audio coding," in Audio Engineering Society Convention 126, Munich, Germany, May 2009.

[3] R. Schultz-Amling, F. Kuech, O. Thiergart, and M. Kallinger, "Acoustical zooming based on a parametric sound field representation," in Audio Engineering Society Convention 128, London, UK, May 2010.

[4] J. Herre, C. Falch, D. Mahne, G. Del Galdo, M. Kallinger, and O. Thiergart, "Interactive teleconferencing combining spatial audio object coding and DirAC technology," in Audio Engineering Society Convention 128, London, UK, May 2010.


[5] E. G. Williams, Fourier Acoustics: Sound Radiation and Nearfield Acoustical Holography, Academic Press, 1999.

[6] A. Kuntz and R. Rabenstein, "Limitations in the extrapolation of wave fields from circular measurements," in 15th European Signal Processing Conference (EUSIPCO 2007), 2007.

[7] A. Walther and C. Faller, "Linear simulation of spaced microphone arrays using B-format recordings," in Audio Engineering Society Convention 128, London, UK, May 2010.

[8] S. Rickard and Z. Yilmaz, "On the approximate W-disjoint orthogonality of speech," in Proc. IEEE Int. Conf. on Acoustics, Speech and Signal Processing (ICASSP 2002), April 2002, vol. 1.

[9] F. J. Fahy, Sound Intensity, Essex: Elsevier Science Publishers Ltd., 1989.

[10] R. Schultz-Amling, F. Kuech, M. Kallinger, G. Del Galdo, J. Ahonen, and V. Pulkki, "Planar microphone array processing for the analysis and reproduction of spatial audio using directional audio coding," in Audio Engineering Society Convention 124, Amsterdam, The Netherlands, May 2008.

[11] M. Kallinger, F. Kuech, R. Schultz-Amling, G. Del Galdo, J. Ahonen, and V. Pulkki, "Enhanced direction estimation using microphone arrays for directional audio coding," in Hands-Free Speech Communication and Microphone Arrays (HSCMA 2008), May 2008, pp. 45–48.
