IEEE/ACM TRANSACTIONS ON AUDIO, SPEECH, AND LANGUAGE PROCESSING, VOL. 22, NO. 12, DECEMBER 2014
An Informed Parametric Spatial Filter Based on Instantaneous Direction-of-Arrival Estimates

Oliver Thiergart, Student Member, IEEE, Maja Taseska, Student Member, IEEE, and Emanuël A. P. Habets, Senior Member, IEEE
Abstract—Extracting desired source signals in noisy and reverberant environments is required in many hands-free communication systems. In practical situations, where the position and number of active sources may be unknown and time-varying, conventional implementations of spatial filters do not provide sufficiently good performance. Recently, informed spatial filters have been introduced that incorporate almost instantaneous parametric information on the sound field, thereby enabling adaptation to new acoustic conditions and moving sources. In this contribution, we propose a spatial filter which generalizes the recently proposed informed linearly constrained minimum variance filter and informed minimum mean square error filter. The proposed filter uses multiple direction-of-arrival estimates and second-order statistics of the noise and diffuse sound. To determine those statistics, an optimal diffuse power estimator is proposed that outperforms state-of-the-art estimators. Extensive performance evaluation demonstrates the effectiveness of the proposed filter in dynamic acoustic conditions. For this purpose, we have considered a challenging scenario which consists of quickly moving sound sources during double-talk. The performance of the proposed spatial filter was evaluated in terms of objective measures including segmental signal-to-reverberation ratio and log spectral distance, and by means of a listening test confirming the objective results.

Index Terms—Dereverberation, interference reduction, microphone array processing, optimal beamforming.
I. INTRODUCTION

Sound acquisition in noisy and reverberant environments with simultaneously active sources remains a challenging task, especially when the sources are moving or emerging and when the noise and reverberation vary quickly across time. A large variety of spatial filtering techniques has been proposed for capturing the desired source signals while suppressing undesired interferers, noise, and reverberation. These spatial filtering techniques often make use of classical spatial filters [1]–[15] or parametric filters [16]–[19]. Classical
Manuscript received April 26, 2014; revised July 24, 2014; accepted October 02, 2014. Date of publication October 16, 2014; date of current version October 24, 2014. The associate editor coordinating the review of this manuscript and approving it for publication was Prof. Mads Græsbøll Christensen. The authors are with the International Audio Laboratories Erlangen (a joint institution of the Friedrich-Alexander-University Erlangen-Nürnberg (FAU) and the Fraunhofer Institute for Integrated Circuits IIS), 91058 Erlangen, Germany (e-mail: [email protected]; [email protected]; [email protected]). Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org. Digital Object Identifier 10.1109/TASLP.2014.2363407
spatial filters are computed as a closed-form or adaptive solution of a specific optimization problem. Both implementations often require a priori knowledge of the directions of the desired sources or a period in which only the desired sources are active. For closed-form solutions, this information is required to estimate the propagation vectors or second-order statistics (SOS) of the desired sources and the SOS of the noise and reverberation, while for adaptive solutions, it is required to control the filter update. A drawback of these solutions is the inability to adapt sufficiently quickly to new situations, e.g., source movements, competing speakers that become active when the desired source is active, or changing power ratios between the noise and reverberant sound. Parametric spatial filters are often based on a relatively simple sound field model, i.e., the received signal in the time-frequency domain is formed by a superposition of a direct sound and a reverberant sound component. Usually, the reverberant sound is modeled as a (time-varying) diffuse sound field [20] while the direct sound is modeled as a single plane wave for each time-frequency instant. The parametric spatial filters are computed based on almost instantaneous estimates of the model parameters, e.g., the direction-of-arrival (DOA) or diffuseness of the sound. When these parameters are estimated with a high time resolution, parametric filters can adapt quickly to new situations. However, even though the model parameters are estimated using multiple microphones, the output signal is usually obtained by filtering only a single microphone signal. Moreover, the filters only perform as desired when the underlying signal model is satisfied. Unfortunately, the common single plane wave signal model can easily be violated in practice, namely when multiple sources are active at the same time (e.g., double talk) [21].
To overcome the drawbacks of the aforementioned spatial filtering techniques, we recently proposed an informed linearly constrained minimum variance (LCMV) filter that unifies the concepts of the classical spatial filters and parametric filters [22]. This filter provides an arbitrary spatial response for at most $L$ sound sources being simultaneously active per time-frequency instant. The arbitrary spatial response enables different applications such as source extraction or spatial sound reproduction. The underlying sound field model considers $L$ plane waves per time-frequency instant, as well as diffuse sound and noise. In this manner, model violations are less likely to occur. Similar to the parametric filters, the informed LCMV filter incorporates almost instantaneous parametric information on the sound field, namely DOA estimates and a diffuse-to-noise
ratio (DNR) estimate for each time-frequency instant, resulting in an informed spatial filter. In contrast to conventional spatial filter implementations, the almost instantaneous parametric information updates the spatial filter for each time-frequency instant, allowing the filter to quickly adapt to changes in the acoustic scene while still capturing the sound as desired. Moreover, using the DNR information in the filter computation allows us to adjust the trade-off between noise and diffuse sound reduction, depending on which of the two components is more dominant. Conceptually, the informed LCMV filter provides a distortionless response for the plane waves, which, however, limits the maximum attenuation of diffuse sound and noise. Therefore, an informed minimum mean square error (MMSE) filter was proposed [23]. This filter has similar benefits as the informed LCMV filter [22], namely an arbitrary spatial response and a very short response time, but provides a stronger attenuation of the diffuse sound and noise, however at the expense of a higher speech distortion. Even though the recently proposed informed spatial filters (ISFs) can outperform the classical linear filters and parametric filters, some problems remain unsolved: both recently proposed ISFs require an estimate of the diffuse sound power, which was previously obtained with the estimator proposed in [22]. That estimator consists of an auxiliary spatial filter suppressing the plane waves and capturing only the diffuse sound and noise. Unfortunately, this filter is not optimal, as it does not maximize the DNR. Moreover, both informed filters consider a fixed number of plane waves $L$ which is defined a priori. This, however, is problematic in practice. In fact, if $L$ is too low, model violations can occur. In contrast, if $L$ is too high, the performance of the filters in attenuating the noise and reverberation is reduced.
Another limitation of the recently proposed informed filters is that they assume spatially white noise. Hence, undesired noise sources such as fans or air-conditioning are not suppressed effectively. In this paper, we propose an ISF referred to as the informed parametric multi-wave multi-channel Wiener (PMMW) filter, which is similar to the approach in [15], but represents a generalization of the informed LCMV filter [22] and the informed MMSE filter [23]. The proposed spatial filter allows us to control the trade-off between signal distortion, noise suppression, and dereverberation with control parameters. In contrast to [15], the proposed filter is derived such that the exact LCMV and MMSE solutions can be obtained by adjusting the control parameters. To provide a good performance in practice, the control parameters are computed in a signal-dependent manner as a function of the input signal-to-diffuse-plus-noise ratio (SDNR). To further improve the performance, the proposed ISF (i) does not assume that the noise is spatially white, as done in [22], but estimates the noise statistics from the microphone signals, (ii) estimates the number of waves per time-frequency instant, such that we obtain a better performance in terms of noise suppression and signal distortion, and (iii) incorporates a novel estimator for the diffuse sound power which, in contrast to the estimator in [22], is optimal in the sense that it maximizes the DNR.
The remainder of the paper is organized as follows: Section II introduces the sound field model and formulates the problem. In Section III, we explain the proposed ISF. The estimation of the required parameters and the novel estimator for the diffuse sound power are discussed in Section IV. The ISF is evaluated in Section V. Section VI concludes the paper.

Notation: Lower-case boldface symbols denote vectors and upper-case boldface symbols denote matrices. $\mathbf{I}_M$ is the identity matrix of size $M \times M$. The operators $(\cdot)^{\mathrm{T}}$, $(\cdot)^{\mathrm{H}}$, and $(\cdot)^{*}$ denote transpose, conjugate transpose, and complex conjugate, respectively.

II. SIGNAL MODEL AND PROBLEM FORMULATION

A. Signal Model

In the following, we generalize the multi-wave sound field model introduced in [22] and [23]. Let us consider the time-frequency domain with frequency index $k$ and time index $n$. For each $(k,n)$ we assume $L$ plane waves propagating in an isotropic and homogeneous diffuse field. The plane waves represent the direct sound of multiple sound sources located in a reverberant environment, while the diffuse sound represents the reverberation. Note that $L$ might be smaller than the actual number of active sound sources. In fact, if the source signals are sufficiently sparse in the time-frequency domain, then the number of sound sources being active per time-frequency instant is usually smaller than the total number of sound sources. This is typically the case for speech signals [24]. In contrast to [22] and [23], the number of plane waves $L$ is not assumed to be known a priori and fixed, but is considered to be time and frequency dependent. The dependency of $L$ on $(k,n)$ is omitted for brevity in the following. The sound is captured with $M$ omnidirectional microphones positioned at $\mathbf{d}_1, \ldots, \mathbf{d}_M$. The microphone signals are written as

$\mathbf{x}(k,n) = \sum_{l=1}^{L} \mathbf{x}_l(k,n) + \mathbf{x}_{\mathrm{d}}(k,n) + \mathbf{x}_{\mathrm{n}}(k,n),$ (1)

where $\mathbf{x}_l(k,n)$ contains the signals proportional to the sound pressure of the $l$-th plane wave at the different microphones, $\mathbf{x}_{\mathrm{d}}(k,n)$ is the measured diffuse field, and $\mathbf{x}_{\mathrm{n}}(k,n)$ is a noise component. With the noise we can model for instance a stationary background noise (e.g., fan noise) or the microphone self-noise. In contrast to [22] and [23], $\mathbf{x}_{\mathrm{n}}(k,n)$ does not need to be spatially white. Without loss of generality, we consider the first microphone, located at $\mathbf{d}_1$, as the reference microphone. The direct sound in (1) can be written as

$\mathbf{x}_l(k,n) = \mathbf{a}_l(k,n)\, X_l(k,n,\mathbf{d}_1),$ (2)

where $X_l(k,n,\mathbf{d}_1)$ is the signal proportional to the sound pressure of the $l$-th plane wave at the reference microphone and $\mathbf{a}_l(k,n)$ is the propagation vector of the $l$-th plane wave. The propagation vector depends on the DOA of the plane wave, which is expressed by the unit-norm vector $\mathbf{n}_l(k,n)$. Assuming that the plane
waves are propagating in the horizontal plane, the DOA vector can be expressed as

$\mathbf{n}_l(k,n) = [\cos\varphi_l(k,n),\ \sin\varphi_l(k,n)]^{\mathrm{T}},$ (3)

where $\varphi_l(k,n)$ is the azimuth of the DOA of the $l$-th plane wave. An extension to plane waves propagating in three-dimensional space is straightforward. Especially in dynamic multi-source scenarios, the DOAs of the plane waves can vary rapidly across time and frequency. The $i$-th element of the $l$-th propagation vector, i.e., $a_{l,i}(k,n)$, is the relative transfer function (RTF) between the first and the $i$-th microphone for the $l$-th plane wave. For omnidirectional microphones the RTF can be written as

$a_{l,i}(k,n) = \exp\{\jmath\,\kappa\,\mathbf{r}_i^{\mathrm{T}}\,\mathbf{n}_l(k,n)\},$ (4)

where $\jmath$ denotes the imaginary unit, $\mathbf{r}_i = \mathbf{d}_i - \mathbf{d}_1$ is the displacement vector between microphone $i$ and microphone 1, and $\kappa$ is the wavenumber. Assuming that the three terms in (1) are mutually uncorrelated, we can express the power spectral density (PSD) matrix of the microphone signals as

$\boldsymbol{\Phi}(k,n) = \mathrm{E}\{\mathbf{x}(k,n)\,\mathbf{x}^{\mathrm{H}}(k,n)\}$ (5a)
$\phantom{\boldsymbol{\Phi}(k,n)} = \mathbf{A}(k,n)\,\boldsymbol{\Phi}_{\mathrm{s}}(k,n)\,\mathbf{A}^{\mathrm{H}}(k,n) + \boldsymbol{\Phi}_{\mathrm{d}}(k,n) + \boldsymbol{\Phi}_{\mathrm{n}}(k),$ (5b)

where the matrix $\mathbf{A}(k,n) = [\mathbf{a}_1(k,n), \ldots, \mathbf{a}_L(k,n)]$ contains the propagation vectors of the $L$ plane waves. Moreover, the PSD matrix of the plane waves is given by $\boldsymbol{\Phi}_{\mathrm{s}}(k,n) = \mathrm{E}\{\mathbf{x}_{\mathrm{s}}(k,n)\,\mathbf{x}_{\mathrm{s}}^{\mathrm{H}}(k,n)\}$, where the signal vector of the $L$ waves is $\mathbf{x}_{\mathrm{s}}(k,n) = [X_1(k,n,\mathbf{d}_1), \ldots, X_L(k,n,\mathbf{d}_1)]^{\mathrm{T}}$. For mutually uncorrelated plane waves $\boldsymbol{\Phi}_{\mathrm{s}}(k,n)$ is diagonal and the powers of the plane waves are given by $\phi_l(k,n) = \mathrm{E}\{|X_l(k,n,\mathbf{d}_1)|^2\}$. The PSD matrix of the diffuse sound is expressed as

$\boldsymbol{\Phi}_{\mathrm{d}}(k,n) = \phi_{\mathrm{d}}(k,n)\,\boldsymbol{\Gamma}_{\mathrm{d}}(k).$ (6)

The power $\phi_{\mathrm{d}}(k,n)$ of the homogeneous diffuse field is identical for all microphones and can vary rapidly across time and frequency. The $(i,i')$-th element of the spatial coherence matrix $\boldsymbol{\Gamma}_{\mathrm{d}}(k)$, denoted by $\gamma_{i,i'}(k)$, is the spatial coherence between microphones $i$ and $i'$ for a purely diffuse sound field [25] at frequency index $k$. We assume that $\boldsymbol{\Gamma}_{\mathrm{d}}(k)$ is known a priori and time-invariant, which is a reasonable assumption in practice. For instance, for a spherically isotropic diffuse field and omnidirectional microphones, we have $\gamma_{i,i'}(k) = \mathrm{sinc}(\kappa\, \|\mathbf{d}_i - \mathbf{d}_{i'}\|)$. In contrast to the PSD matrix of the diffuse sound, the PSD matrix of the noise $\boldsymbol{\Phi}_{\mathrm{n}}(k)$ is assumed to be time-invariant or slowly time-varying.

B. Problem Formulation

The aim of the paper is to capture the direct sound with a specific, arbitrary response for each plane wave depending on its DOA, while attenuating the diffuse sound and noise.
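The geometric quantities of the signal model, i.e., the propagation vectors in (4) and the diffuse coherence matrix in (6), can be sketched numerically. The following is a minimal illustration (the function names and the example array geometry are our own, not from the paper):

```python
import numpy as np

def propagation_vector(mic_pos, azimuth, freq, c=343.0):
    """Plane-wave propagation vector a_l for omnidirectional microphones
    in the horizontal plane, in the spirit of eqs. (3)-(4).
    mic_pos: (M, 2) positions in metres; first row is the reference mic."""
    kappa = 2 * np.pi * freq / c                         # wavenumber
    n_l = np.array([np.cos(azimuth), np.sin(azimuth)])   # DOA unit vector
    r = mic_pos - mic_pos[0]                             # displacement to reference mic
    return np.exp(1j * kappa * r @ n_l)                  # RTFs w.r.t. microphone 1

def diffuse_coherence(mic_pos, freq, c=343.0):
    """Spatial coherence matrix Gamma_d of a spherically isotropic diffuse
    field for omnidirectional microphones: gamma_ii' = sinc(kappa * r_ii')."""
    kappa = 2 * np.pi * freq / c
    dist = np.linalg.norm(mic_pos[:, None, :] - mic_pos[None, :, :], axis=-1)
    return np.sinc(kappa * dist / np.pi)   # np.sinc(x) = sin(pi x)/(pi x)
```

The first element of each propagation vector equals 1 (the reference microphone), and all elements have unit modulus, as expected for plane waves and omnidirectional sensors.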
The desired signal $Y(k,n)$ can be expressed as a weighted sum of the plane waves at the first (reference) microphone, i.e.,

$Y(k,n) = \sum_{l=1}^{L} D_l(k,n)\, X_l(k,n,\mathbf{d}_1),$ (7)
Fig. 1. Desired response functions for different applications. (a) Stereo sound reproduction (VBAP panning scheme [26]); (b) Source separation (desired source around 60°); (c) Solid line: arbitrary response function (the dashed line in (c) represents the resulting directivity pattern of an example spatial filter, which assumes plane waves with DOAs $\varphi_1$ and $\varphi_2$).
where $\mathbf{d}(k,n) = [D_1(k,n), \ldots, D_L(k,n)]^{\mathrm{T}}$ is referred to as the desired response vector. The $l$-th element $D_l(k,n)$ is the desired response for the $l$-th plane wave, which depends on the DOA of the plane wave. Note that $D_l(k,n)$ can be complex-valued and frequency-dependent. The values $D_l(k,n)$ result from a user-defined, arbitrary response function, denoted by $G(\varphi)$. In other words, $D_l(k,n)$ is the value of the arbitrary response function evaluated at the DOA of the $l$-th plane wave, i.e., $D_l(k,n) = G(\varphi_l(k,n))$. The desired response function depends on the application. For instance:
• In spatial sound reproduction, where $Y(k,n)$ in (7) is one of the loudspeaker signals, $G(\varphi)$ is the panning function corresponding to the loudspeaker. Example panning functions for the left and right loudspeaker of a stereo reproduction setup are depicted in Fig. 1(a). For example, if the $l$-th plane wave arrives from 30°, the panning function for the left loudspeaker yields 1 while for the right loudspeaker we obtain 0. As a result, the plane wave is reproduced from the left loudspeaker only and is thus perceived from the left.
• In speech enhancement for hands-free communication, we may aim at extracting all direct sound components without attenuation while only suppressing the diffuse sound and noise. In this case, $Y(k,n)$ in (7) is the sum of all plane
waves, and the desired response function is equal to 1 for all possible DOAs.
• In source separation, we may aim at extracting the signals of sound sources located in a specific angular region, while attenuating the signals of sources located in other regions. For instance, if the desired sources are located around 60°, we can use the response function depicted in Fig. 1(b), which extracts all plane waves arriving close to 60° while suppressing plane waves interfering from other directions. By defining the region of interest relatively broadly, desired sources can move freely within the defined region while still being captured as desired. Clearly, this application requires a priori knowledge about the position of the desired sources to define an appropriate response function $G(\varphi)$. In some applications, we can assume that the desired sources are located at a fixed region of interest (e.g., in front of a TV), while in other applications we can use for instance face recognition techniques to localize and track the desired sources.
In general, one can design arbitrary and time-variant response functions $G(\varphi)$, including response functions such as the one in Fig. 1(c) (solid line), which may be a user-defined function to attenuate plane waves from 45° by 19 dB, while plane waves from 110° are captured with unit gain. The desired response function can also correspond to a head-related transfer function (HRTF) to enable spatial sound reproduction over headphones. In this case, $G(\varphi)$ will be complex-valued. In the next section, we discuss the estimation of the desired signal $Y(k,n)$ from the microphone signals $\mathbf{x}(k,n)$.

III. INFORMED SPATIAL FILTERING

This section introduces the ISF for estimating $Y(k,n)$ in (7). The general solution of the filter is presented in Section III-A. Practical considerations are discussed in Section III-B.

A. General Solution

An estimate $\widehat{Y}(k,n)$ of the desired signal is obtained by a linear combination of the microphone signals $\mathbf{x}(k,n)$, i.e.,

$\widehat{Y}(k,n) = \mathbf{w}^{\mathrm{H}}(k,n)\,\mathbf{x}(k,n),$ (8)

where $\mathbf{w}(k,n)$ is a complex weight vector of length $M$.
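The desired signal in (7) is a response-weighted sum of the plane-wave signals at the reference microphone. As a sketch, a box-shaped source-separation response function in the spirit of Fig. 1(b) could look as follows (the region width and the −40 dB floor are illustrative assumptions, not values from the paper):

```python
import numpy as np

def separation_response(azimuth_deg, center_deg=60.0, width_deg=40.0):
    """Hypothetical box-shaped response function G(phi): unit gain inside
    the region of interest, strong attenuation (-40 dB) elsewhere."""
    wrapped = ((np.asarray(azimuth_deg) - center_deg + 180.0) % 360.0) - 180.0
    inside = np.abs(wrapped) <= width_deg / 2
    return np.where(inside, 1.0, 10 ** (-40 / 20))

def desired_signal(X, doas_deg):
    """Desired signal as in (7): weighted sum of the plane-wave signals X_l
    at the reference microphone, with weights D_l = G(phi_l)."""
    d = separation_response(np.asarray(doas_deg))
    return d @ X
```

For two waves at 60° and 170°, the second wave is attenuated by 40 dB before the summation, so the output is dominated by the wave inside the region of interest.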
Obtaining an accurate estimate of $Y(k,n)$ requires a spatial filter that can capture multiple source signals, namely the $L$ plane waves, with the desired response. In [22], the filter weights $\mathbf{w}(k,n)$ in (8) were derived as an informed LCMV filter, which minimizes the diffuse sound and stationary noise at the filter output. In [23], the weights were derived as an informed MMSE filter, which minimizes the mean square error between $Y(k,n)$ and $\widehat{Y}(k,n)$. As explained later, both filters have specific advantages and disadvantages. In the following, we propose an optimal multi-wave filter which represents a generalization of the informed LCMV filter and the informed MMSE filter. The proposed spatial filter is referred to as the informed PMMW filter and is found by minimizing the stationary noise and diffuse sound at the filter output subject to a constraint that limits the signal distortion of
the extracted direct sound. Expressed mathematically, the filter weights are computed as
$\mathbf{w}(k,n) = \operatorname{arg\,min}_{\mathbf{w}}\ \mathbf{w}^{\mathrm{H}}\,\boldsymbol{\Phi}_{\mathrm{u}}(k,n)\,\mathbf{w}$ (9)

subject to

$\mathrm{E}\{|Y_l(k,n) - \widehat{Y}_l(k,n)|^2\} \le \sigma_l^2(k,n), \quad l \in \{1, \ldots, L\},$ (10)

where $\boldsymbol{\Phi}_{\mathrm{u}}(k,n) = \boldsymbol{\Phi}_{\mathrm{d}}(k,n) + \boldsymbol{\Phi}_{\mathrm{n}}(k)$ is the noise-plus-diffuse sound PSD matrix and $Y_l(k,n) = D_l(k,n)\,X_l(k,n,\mathbf{d}_1)$ is the desired filter output for the $l$-th plane wave. Moreover, $\widehat{Y}_l(k,n) = \mathbf{w}^{\mathrm{H}}\,\mathbf{a}_l(k,n)\,X_l(k,n,\mathbf{d}_1)$ is the actual filter output for the $l$-th plane wave, which is potentially distorted. The desired maximum distortion of the $l$-th plane wave is specified with $\sigma_l^2(k,n)$. A higher maximum distortion means that we can better attenuate the noise and diffuse sound in (9). As shown in the appendix, a closed-form solution for $\mathbf{w}(k,n)$ is

$\mathbf{w}(k,n) = \boldsymbol{\Phi}_{\mathrm{u}}^{-1}\,\mathbf{A}\,\boldsymbol{\Psi}\,\mathbf{d},$ (11)

where $\mathbf{d}(k,n)$ is the desired response vector for the direct sound used in (7) and

$\boldsymbol{\Psi}(k,n) = \left[\mathbf{A}^{\mathrm{H}}\,\boldsymbol{\Phi}_{\mathrm{u}}^{-1}\,\mathbf{A} + \boldsymbol{\Lambda}\,\boldsymbol{\Phi}_{\mathrm{s}}^{-1}\right]^{-1}.$ (12)

Computing the filter requires the DOAs of the plane waves [to compute the propagation matrix $\mathbf{A}(k,n)$ and response vector $\mathbf{d}(k,n)$] as well as the PSD matrix $\boldsymbol{\Phi}_{\mathrm{s}}(k,n)$ and the noise-plus-diffuse sound PSD matrix $\boldsymbol{\Phi}_{\mathrm{u}}(k,n)$. Note that the filter is updated (informed) with this information for each $(k,n)$. The real-positive diagonal matrix $\boldsymbol{\Lambda}(k,n) = \mathrm{diag}\{\lambda_1, \ldots, \lambda_L\}$ contains time- and frequency-dependent control parameters $\lambda_l(k,n)$, which allow us to control the trade-off between noise suppression and signal distortion for each plane wave. The parameters $\lambda_l$ are determined by the maximum distortions $\sigma_l^2$ in (10), and vice versa (the exact relation is discussed in the appendix). For instance, for $\boldsymbol{\Lambda} = \mathbf{0}$, we obtain the informed LCMV filter proposed in [22], for which the distortions are zero [given $\mathbf{A}(k,n)$ does not contain estimation errors], i.e.,

$\mathbf{w}_{\mathrm{LCMV}}(k,n) = \mathbf{H}(k,n)\,\mathbf{d}(k,n),$ (13)

where

$\mathbf{H}(k,n) = \boldsymbol{\Phi}_{\mathrm{u}}^{-1}\,\mathbf{A}\left[\mathbf{A}^{\mathrm{H}}\,\boldsymbol{\Phi}_{\mathrm{u}}^{-1}\,\mathbf{A}\right]^{-1}.$ (14)

The $l$-th column of the filter matrix $\mathbf{H}(k,n)$ can be interpreted as a separate spatial filter that extracts the $l$-th plane wave without distortions. The extracted plane waves are then weighted with the desired responses contained in $\mathbf{d}(k,n)$ and summed, resulting in the output signal $\widehat{Y}(k,n)$. The informed LCMV filter provides a trade-off between white noise gain (WNG) and directivity index (DI) depending on which of the two undesired components (diffuse sound or noise) is more prominent [22].
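As a numerical sketch, the closed-form weights can be implemented as follows, assuming the parametric multi-wave Wiener form $\mathbf{w} = \boldsymbol{\Phi}_{\mathrm{u}}^{-1}\mathbf{A}[\mathbf{A}^{\mathrm{H}}\boldsymbol{\Phi}_{\mathrm{u}}^{-1}\mathbf{A} + \boldsymbol{\Lambda}\boldsymbol{\Phi}_{\mathrm{s}}^{-1}]^{-1}\mathbf{d}$, which reduces to the LCMV solution for $\boldsymbol{\Lambda} = \mathbf{0}$ and to the MMSE/multichannel Wiener solution for $\boldsymbol{\Lambda} = \mathbf{I}$. This explicit form is our reconstruction of the trade-off; the paper's appendix gives the exact derivation:

```python
import numpy as np

def pmmw_weights(A, Phi_u, phi_s, lam, d):
    """Sketch of the informed PMMW weights (reconstructed form):
        w = Phi_u^{-1} A [A^H Phi_u^{-1} A + Lambda Phi_s^{-1}]^{-1} d
    A:     (M, L) propagation matrix
    Phi_u: (M, M) diffuse-plus-noise PSD matrix
    phi_s: (L,)   plane-wave powers (diagonal of Phi_s)
    lam:   (L,)   control parameters; 0 -> LCMV, 1 -> MMSE
    d:     (L,)   desired responses."""
    B = np.linalg.solve(Phi_u, A)                      # Phi_u^{-1} A
    inner = A.conj().T @ B + np.diag(lam / phi_s)
    return B @ np.linalg.solve(inner, d)
```

For $\boldsymbol{\Lambda} = \mathbf{0}$ the resulting weights satisfy the distortionless constraints $\mathbf{A}^{\mathrm{H}}\mathbf{w} = \mathbf{d}$ exactly; for $\boldsymbol{\Lambda} = \mathbf{I}$ they coincide with $(\mathbf{A}\boldsymbol{\Phi}_{\mathrm{s}}\mathbf{A}^{\mathrm{H}} + \boldsymbol{\Phi}_{\mathrm{u}})^{-1}\mathbf{A}\boldsymbol{\Phi}_{\mathrm{s}}\mathbf{d}$ by the matrix inversion lemma.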
For $\boldsymbol{\Lambda} = \mathbf{I}$, the filter becomes the informed MMSE filter proposed in [23], i.e.,

$\mathbf{w}_{\mathrm{MMSE}}(k,n) = \boldsymbol{\Phi}_{\mathrm{u}}^{-1}\,\mathbf{A}\left[\mathbf{A}^{\mathrm{H}}\,\boldsymbol{\Phi}_{\mathrm{u}}^{-1}\,\mathbf{A} + \boldsymbol{\Phi}_{\mathrm{s}}^{-1}\right]^{-1}\mathbf{d}.$ (15)

This filter provides a stronger suppression of the noise compared to the informed LCMV filter but introduces undesired signal distortions ($\sigma_l^2 > 0$). For $0 < \lambda_l < 1$ we can achieve for each plane wave a trade-off between the LCMV filter and the MMSE filter, such that we obtain a strong attenuation of the noise while still ensuring a tolerable amount of undesired signal distortions. For $\lambda_l > 1$, we can achieve an even stronger attenuation of the noise and diffuse sound than with the MMSE filter, at the expense of stronger signal distortions. As shown in the appendix, the parameters $\lambda_l$ are mutually dependent, i.e., the parameter for the $l$-th wave does not only depend on the desired distortion $\sigma_l^2$, but also on the other control parameters. The individual steps of the informed spatial filtering are summarized in Algorithm 1. The estimation of the required parameters is discussed in Section IV. As an example of an ISF, Fig. 1(c) (solid line) shows the magnitude of an arbitrary (e.g., user-defined) desired response function $G(\varphi)$ as a function of the azimuth angle $\varphi$. This function means that we aim at capturing a plane wave arriving from 45° (indicated by the circle) with a gain of −19 dB, while a second plane wave arriving from 110° (indicated by the square) is captured with a gain of 0 dB. Both gains would then form the desired response vector $\mathbf{d}$ in (11) (assuming both waves are simultaneously active). The directivity pattern of the resulting spatial filter, which would capture both plane waves with the desired responses contained in $\mathbf{d}$, is depicted by the dashed line in Fig. 1(c). Here, we are considering the LCMV solution [$\boldsymbol{\Lambda} = \mathbf{0}$] and a uniform linear array (ULA) with omnidirectional microphones and a given microphone spacing in cm at a given frequency in kHz. As we can see in the plot, the directivity pattern of the spatial filter exhibits the desired gains for the DOAs of the two plane waves. Note that the directivity pattern of the spatial filter is different for different $\boldsymbol{\Lambda}$ and DOAs of the direct sound.
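The behavior described for Fig. 1(c) can be reproduced numerically: an LCMV-type filter computed for two constrained DOAs exhibits exactly the desired gains at those DOAs when its directivity pattern is evaluated. The array geometry, spacing, frequency, and noise model below are illustrative assumptions, not the values used in the figure:

```python
import numpy as np

def beampattern(w, mic_pos, freq, azimuths, c=343.0):
    """Directivity pattern |w^H a(phi)| of a spatial filter over azimuth,
    for omnidirectional microphones in the horizontal plane."""
    kappa = 2 * np.pi * freq / c
    r = mic_pos - mic_pos[0]
    n = np.stack([np.cos(azimuths), np.sin(azimuths)])   # (2, Q)
    A = np.exp(1j * kappa * r @ n)                       # (M, Q) steering matrix
    return np.abs(w.conj() @ A)

# Hypothetical setup: 4-mic ULA, 3 cm spacing, f = 3 kHz, two constrained
# DOAs with gains of -19 dB and 0 dB (as in the Fig. 1(c) discussion).
mic_pos = np.stack([np.arange(4) * 0.03, np.zeros(4)], axis=1)
freq, c = 3000.0, 343.0
kappa = 2 * np.pi * freq / c
doas = np.deg2rad([45.0, 110.0])
A = np.exp(1j * kappa * (mic_pos - mic_pos[0]) @ np.stack([np.cos(doas), np.sin(doas)]))
d = np.array([10 ** (-19 / 20), 1.0])
Phi_u = np.eye(4)                      # spatially white noise for the sketch
B = np.linalg.solve(Phi_u, A)
w = B @ np.linalg.solve(A.conj().T @ B, d)   # LCMV-type weights
```

Evaluating `beampattern(w, mic_pos, freq, doas)` returns the desired gains at the two constrained DOAs, while the pattern at other angles is whatever the minimization yields.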
Moreover, the directivity pattern is essentially different from the desired response function $G(\varphi)$. In fact, it is not the aim of the spatial filter to resemble the response function for all angles $\varphi$, but to provide the desired response for the DOAs of the $L$ plane waves.

B. Practical Considerations

1) Control Parameters $\lambda_l(k,n)$: The multi-wave signal model in Section II and the corresponding spatial filter (11) allow us to control the noise and diffuse sound suppression, as well as the signal distortion, for each of the $L$ waves with the corresponding control parameter $\lambda_l(k,n)$. In general, it is desired to capture a plane wave with low undesired distortions if the wave is strong compared to the noise and diffuse sound. In this case, a strong attenuation of the noise and diffuse sound is less important, since the noise and diffuse sound are weak compared to the plane wave signal and might even be masked. On the other hand, if the plane wave is weak compared to the noise and diffuse sound, a strong attenuation of these components is
Algorithm 1 Summary of the ISF processing

1) Estimate the stationary noise PSD matrix $\boldsymbol{\Phi}_{\mathrm{n}}$, e.g., during silence or prior to the processing [Section IV-C].
2) Estimate the number of plane waves $L$ and the DOAs $\varphi_l$ [Section IV-A and Section IV-B].
3) Estimate the diffuse sound PSD matrix $\boldsymbol{\Phi}_{\mathrm{d}}$ [Section IV-D] and the signal PSD matrix $\boldsymbol{\Phi}_{\mathrm{s}}$ [Section IV-E].
4) Compute the spatial response vector $\mathbf{d}$ from the desired spatial response function, which was defined a priori [Section III-A].
5) Compute the filter control parameters contained in the matrix $\boldsymbol{\Lambda}$ [Section III-B1].
6) Compute the filter weights $\mathbf{w}$ in (11) and estimate the desired signal $\widehat{Y}$ with (8).
desired, and undesired distortions of the plane wave signal are assumed to be less critical. To obtain this behavior of the ISF, the control parameter $\lambda_l(k,n)$ is made signal-dependent. To ensure a computationally feasible algorithm, we compute all parameters $\lambda_l$ independently, even though the parameters jointly determine the signal distortion for the $l$-th plane wave, as discussed in the appendix. To compute $\lambda_l(k,n)$, let us first introduce the logarithmic input SDNR for the $l$-th plane wave as

$\Psi_l(k,n) = 10 \log_{10} \dfrac{\phi_l(k,n)}{\phi_{\mathrm{u}}(k,n)},$ (16)

where $\phi_{\mathrm{u}}(k,n) = \phi_{\mathrm{d}}(k,n) + \phi_{\mathrm{n}}(k,n)$ is the noise-plus-diffuse sound power, with $\phi_{\mathrm{n}}(k,n)$ denoting the mean power of the noise across the microphones. The parameter $\lambda_l$ should approach 0 [leading to the LCMV solution (13)] if $\Psi_l$ is large. On the other hand, $\lambda_l$ should become 1 [leading to the MMSE solution (15)] or larger than 1 [leading to an even more aggressive filter] if $\Psi_l$ is small. This behavior for $\lambda_l$ can be achieved, for instance, if we compute $\lambda_l$ via a sigmoid function, i.e.,

$\lambda_l(k,n) = f(\Psi_l(k,n)),$ (17)

which is monotonically increasing with decreasing input SDNR. A suitable sigmoid function is for instance given by

$f(\Psi_l) = \dfrac{\lambda_{\max}}{1 + \exp\{s\,(\Psi_l - \Psi_0)\}},$ (18)

where $\lambda_{\max}$, $s$, and $\Psi_0$ are control parameters that provide high flexibility in controlling the behavior of $\lambda_l$. In practice, the control parameters may need to be adjusted specifically for the given application, and also depending on the accuracy of the parameter estimators in Section IV, to obtain the best performance. Clearly, different functions to control $\lambda_l$ can be designed depending on the specific application and the desired behavior of the filter. However, the sigmoid function in (18), with the associated parameters, provides a high flexibility in adjusting the behavior of the spatial filters.
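A minimal sketch of the sigmoid mapping in (17)–(18) follows; the exact parametrization and default values in the paper may differ, so `lam_max`, `slope`, and `shift_db` are illustrative knobs:

```python
import numpy as np

def control_parameter(sdnr_db, lam_max=3.0, slope=1.0, shift_db=0.0):
    """Sigmoid mapping from the input SDNR (in dB) to the control parameter
    lambda_l, in the spirit of (18): large SDNR -> lambda ~ 0 (LCMV-like),
    small SDNR -> lambda -> lam_max (aggressive suppression)."""
    return lam_max / (1.0 + np.exp(slope * (np.asarray(sdnr_db) - shift_db)))
```

The function is monotone: a strong plane wave (high SDNR) is extracted almost without distortion, while a weak wave buried in diffuse sound and noise is traded against a stronger suppression of those components.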
Fig. 2. Example sigmoid functions $f(\Psi_l)$ depending on the input SDNR $\Psi_l$.

Fig. 2 shows three examples of the sigmoid function in (18). Note that the sigmoid functions are plotted in decibels. The default parameters (solid line) considered throughout this work use $\lambda_{\max} = 3$ (4.8 dB) together with the corresponding slope and shift values. This parameter setting provided a good performance in most scenarios considered throughout this work. With $\lambda_{\max}$, we control the maximum of the sigmoid function for low input SDNRs. Higher values for $\lambda_{\max}$ lead to a more aggressive noise and diffuse sound suppression when the input SDNR is low. With $s$ and $\Psi_0$ we control the slope and the shift along the $\Psi_l$ axis, respectively. For instance, shifting the sigmoid function towards low input SDNRs and using a steep slope means that the plane waves are extracted with low undesired distortions unless the diffuse sound and noise become very strong. Accordingly, the parameter setting 2 in Fig. 2 [dashed line] would yield a less aggressive filter, while the parameter setting 3 [dash-dot line, $\lambda_{\max} = 10$ (10 dB)] would yield a more aggressive filter. Note that all parameters can be designed frequency dependent.

2) Assumed Number of Plane Waves $L$: The number of plane waves $L$ assumed in Section III-A has a strong influence on the performance of the filter. If $L$ is too small (smaller than the actual number of prominent waves), then the signal model in (1) is violated. In this case, the filter (11) extracts fewer plane waves than desired, which leads to undesired distortions of the direct signal components. The effect of such model violations is discussed, e.g., in [21]. On the other hand, if $L$ is too high, the spatial filter has fewer degrees of freedom to minimize the noise and diffuse sound power in (9). Therefore, we assume that $L$ is signal dependent and estimate $L$ for each $(k,n)$ as explained in Section IV-A.

3) Early Reflections: The signal model in (1) assumes that the direct sounds, i.e., the plane waves, are mutually uncorrelated. This assumption greatly simplifies the derivation of the informed PMMW filter (shown in the appendix), leading to the closed-form solution in (11) and (12). Assuming mutually uncorrelated plane waves is reasonable for the direct sounds of different sources, but typically does not hold for the direct sound of a source and its early reflections.¹ Hence, we assume that the plane waves for a given time-frequency instant correspond to different sound sources (rather than one or more reflections of the same source). Predicting the impact of mutually correlated direct sounds and reflections on the filter performance is difficult. In fact, not only is the underlying assumption of the filter violated, but also the parameter estimators in Section IV are influenced (especially the DOA estimators). Therefore, we leave this study for future research. Nevertheless, the experimental results in Section V show that the proposed filter yields the desired performance in practice, where early reflections also occur.

¹Note that early reflections of the sound sources are not directly considered by the model in (1), but they can be represented as plane waves as well.

IV. PARAMETER ESTIMATION
Several parameters need to be estimated to compute the ISF introduced in Section III. In the following, we explain the estimation of the number of waves $L$, the DOAs $\varphi_l(k,n)$, the stationary noise PSD matrix $\boldsymbol{\Phi}_{\mathrm{n}}$, the diffuse sound PSD matrix $\boldsymbol{\Phi}_{\mathrm{d}}(k,n)$, and the signal PSD matrix $\boldsymbol{\Phi}_{\mathrm{s}}(k,n)$. In order to achieve a good performance of the ISF in dynamic scenarios, the parameters need to be estimated with a sufficiently high time-frequency resolution. Therefore, we compute all parameters for each time-frequency instant.

A. Number of Waves

There exist different approaches to estimate the number of signal components forming a mixture, for instance by computing the minimum description length (MDL) [27]. Alternatively, one can consider the number of prominent eigenvalues of the microphone signal PSD matrix, in our case $\boldsymbol{\Phi}(k,n)$. This approach is used in [14], where an eigenvector is classified as signal or noise component by considering the corresponding eigenvalue relative to the maximum and minimum eigenvalue of the signal PSD matrix. The eigenvalue-based approach has the advantage that it is computationally efficient in our application, since the eigenvalue decomposition of $\boldsymbol{\Phi}(k,n)$ is required for the DOA estimation as well (Section IV-B). Therefore, we use this approach in the following. Let $\beta_1 \ge \beta_2 \ge \ldots \ge \beta_M$ be the eigenvalues of $\boldsymbol{\Phi}(k,n)$. These eigenvalues are real-positive since $\boldsymbol{\Phi}(k,n)$ is Hermitian. The number of plane waves $L$ is set to the number of dominant eigenvalues. Similar to [14], an eigenvalue $\beta_i$ is dominant if all of the following three conditions are satisfied: (i) the ratio between $\beta_i$ and the strongest eigenvalue $\beta_1$ is larger than a specific eigenvalue ratio $R_1$, i.e.,

$\beta_i / \beta_1 > R_1,$ (19)

(ii) the ratio between $\beta_i$ and the smallest eigenvalue $\beta_M$ is larger than a specific eigenvalue ratio $R_2$, i.e.,

$\beta_i / \beta_M > R_2,$ (20)

and (iii) $\beta_i$ is larger than a minimum value $\beta_{\min}$, i.e.,

$\beta_i > \beta_{\min}.$ (21)

In the following, we define the minimum eigenvalue $\beta_{\min}$ in a signal-dependent manner, relative to the noise power $\phi_{\mathrm{n}}(k,n)$, i.e.,

$\beta_{\min}(k,n) = c_{\mathrm{n}}\,\phi_{\mathrm{n}}(k,n),$ (22)

where $c_{\mathrm{n}}$ is a real-positive number. Finally, we limit the estimated $L$ to a specific maximum value $L_{\max}$. The parameters $R_1$, $R_2$, and $c_{\mathrm{n}}$ allow us to adjust the estimator for different applications and recording scenarios. In practice, it is often helpful to slightly overestimate $L$, namely to ensure that
the DOA estimator (discussed in the next subsection) localizes the prominent direct sound components even if the noise or diffuse sound is comparatively strong, or during more difficult double-talk scenarios. The overestimation improves the robustness of the informed spatial filters against signal distortions, but also reduces the performance in terms of noise suppression and dereverberation (see Section III-B2). While a smaller value for $R_1$ overestimates $L$ when the direct sound is prominent, a smaller value for $R_2$ overestimates $L$ during the reverberant tails. Moreover, a smaller $c_{\mathrm{n}}$ increases the estimated $L$ for lower signal-to-noise ratios (SNRs). The best parameter setting for estimating $L$ depends on the application, which may require lower signal distortions or a better noise suppression. A suitable parameter setting, which resulted in a good performance for most scenarios considered throughout this work, is presented with the experimental results in Section V.

B. Direction-of-Arrival

The DOAs $\varphi_l(k,n)$ of the plane waves can be obtained with classical narrowband DOA estimators such as ESPRIT [28] or root MUSIC [29]. While ESPRIT is computationally more efficient, root MUSIC is in general more accurate. Since the DOA represents a crucial parameter for the ISF, we use root MUSIC throughout this work. This estimator can be applied to ULAs, non-uniform linear arrays [30], and circular arrays [31]. The DOA estimator takes as input the signal PSD matrix $\boldsymbol{\Phi}(k,n)$ defined in (5a). In practice, $\boldsymbol{\Phi}(k,n)$ is often estimated by approximating the expectation in (5a) using a recursive temporal averaging filter, i.e.,

$\widehat{\boldsymbol{\Phi}}(k,n) = \alpha\,\widehat{\boldsymbol{\Phi}}(k,n-1) + (1-\alpha)\,\mathbf{x}(k,n)\,\mathbf{x}^{\mathrm{H}}(k,n),$ (23)

where $\alpha \in [0,1)$ is the filter coefficient corresponding to a specific time constant. Once the DOAs $\varphi_l(k,n)$ are estimated, we can obtain the elements of the response vector $\mathbf{d}(k,n)$ from the desired arbitrary response function in (7), which was defined a priori, e.g., as in Fig. 1(c) (solid line).
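The recursive PSD update in (23) and the eigenvalue-based selection of the number of waves from Section IV-A can be sketched together; the threshold names and default values below are our own illustrative choices, not the settings reported in the paper:

```python
import numpy as np

def update_psd(Phi_prev, x, alpha=0.9):
    """Recursive temporal averaging of the signal PSD matrix as in (23):
    Phi(k,n) = alpha * Phi(k,n-1) + (1 - alpha) * x x^H."""
    return alpha * Phi_prev + (1 - alpha) * np.outer(x, x.conj())

def estimate_num_waves(Phi, noise_power, r1=0.1, r2=10.0, c_n=5.0, L_max=3):
    """Eigenvalue-based estimate of the number of plane waves L (sketch of
    the three dominance conditions of Section IV-A). An eigenvalue is
    dominant if it is (i) not too small relative to the largest eigenvalue,
    (ii) large relative to the smallest eigenvalue, and (iii) above a
    noise-dependent floor; the estimate is clipped to L_max."""
    ev = np.sort(np.linalg.eigvalsh(Phi))[::-1]        # real, descending
    dominant = (ev / ev[0] > r1) & (ev / ev[-1] > r2) & (ev > c_n * noise_power)
    return min(int(np.count_nonzero(dominant)), L_max)
```

Slightly loosening `r1`, `r2`, or `c_n` overestimates the number of waves, which, as discussed above, trades noise suppression against robustness to signal distortions.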
The elements of the propagation vectors required for the ISF can be computed, for instance, with (4) if we have omnidirectional microphones.

C. Noise PSD Matrix

Since the SOS of the noise are time-invariant or slowly time-varying, we can estimate the noise PSD matrix, for instance, during time frames in which the sound sources are inactive and no diffuse sound is present. Several corresponding approaches for estimating the noise PSD matrix, which can be used in the ISF framework, are discussed in the literature, for instance in [12], [32]–[34].

D. Diffuse PSD Matrix

This subsection discusses the estimation of the diffuse sound power, from which we can compute the diffuse sound PSD matrix with (6). As in [22], we employ an additional spatial filter, which suppresses the plane waves and captures only the diffuse sound and noise. In
contrast to [22], the filter maximizes the DNR at the filter output, which is given by
(24)

This is a very desirable property, as discussed at the end of this subsection. The proposed filter is compared in Section V-A to the approach in [22]. The proposed filter is computed by minimizing the noise at the filter output, i.e., (25) subject to (26a) and (26b). Constraint (26a) ensures that the power of the plane waves is zero at the filter output. Note that a weight vector which satisfies (26a) also satisfies the corresponding constraint for each propagation vector individually, which means that each individual plane wave is canceled out. With constraint (26b), we capture the diffuse sound power with a specific factor. This factor is necessarily real-positive and ensures non-zero weights. For any such factor, the filter minimizes the ratio (27) subject to (26), which is equivalent to maximizing the output DNR (24) [subject to (26)]. To compute the weights, we first consider the matrix in (26a), which is Hermitian with rank equal to the number of plane waves and thus has that many non-zero real-positive eigenvalues, the remaining eigenvalues being zero. We consider the eigenvectors corresponding to the zero eigenvalues. Any linear combination of these vectors can be used as a weight vector that satisfies (26a), i.e., (28), where the basis matrix contains these eigenvectors and a vector of complex weights defines the linear combination. The optimal combination, which yields the weights in (28) that minimize the stationary noise, can be found by inserting (28) into (25), i.e., (29) subject to (26b). The cost function to be minimized is now
(30)

where the scalar factor is the Lagrange multiplier. Setting the complex partial derivative of the cost function with respect to the combination weights to zero, we obtain (31)
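The complete estimation procedure (25)–(33) can be sketched numerically as follows, assuming the propagation vectors, the diffuse coherence matrix, and the noise PSD matrix are given. All names and shapes are our own; this is an illustration, not the authors' code:

```python
import numpy as np

def diffuse_power_estimate(phi_x, gamma_d, phi_n, A):
    """Sketch of the diffuse-power estimator (25)-(33).

    phi_x   : (M, M) estimated signal PSD matrix, cf. (23)
    gamma_d : (M, M) diffuse coherence matrix
    phi_n   : (M, M) noise PSD matrix
    A       : (M, L) propagation (steering) vectors of the L plane waves
    """
    M, L = A.shape
    # Eigenvectors of A A^H belonging to the zero eigenvalues span the
    # subspace in which all L plane waves are canceled, cf. (26a)/(28).
    eigval, eigvec = np.linalg.eigh(A @ A.conj().T)
    B = eigvec[:, : M - L]  # eigh sorts ascending -> zero eigenvalues first
    # Principal generalized eigenvector of (B^H Gamma_d B, B^H Phi_n B)
    # maximizes the output DNR, cf. (24)/(31).
    D = B.conj().T @ gamma_d @ B
    N = B.conj().T @ phi_n @ B
    lam, V = np.linalg.eig(np.linalg.solve(N, D))
    w = B @ V[:, np.argmax(lam.real)]
    # Diffuse power estimate, cf. (33); the residual noise term is not
    # subtracted, so the estimate is slightly biased upwards (Section IV-D).
    return np.real(w.conj() @ phi_x @ w) / np.real(w.conj() @ gamma_d @ w)
```

Because the weights lie exactly in the null space of the plane-wave constraints, the direct sound does not leak into the estimate; only the residual noise causes the small upward bias discussed in the text.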
where the two matrices in (31) are the diffuse coherence matrix and the noise PSD matrix projected onto the null-space basis of (28). Equation (31) is a generalized eigenvalue problem, and its solution is the generalized eigenvector corresponding to the largest eigenvalue. The weights are found with (28) using this eigenvector. Note that the computed weights do not necessarily satisfy (26b). However, this is irrelevant in the following, since the computed weights still maximize the DNR at the filter output, as mentioned above. To estimate the diffuse sound power, we apply the weights to the signal power spectral density matrix defined in (5) and estimated with (23), leading to (32). By rearranging (32), we obtain an estimate of the diffuse sound power, i.e., (33a) and (33b), where the error term is given in (27). If the output DNR (24) is high, the diffuse term becomes large compared to the error term, and (33) represents an accurate estimate of the diffuse sound power. Moreover, since the output DNR is maximized by the proposed filter, the error term is minimized. Therefore, the proposed filter is optimal for estimating the diffuse sound power with (33). To further improve the estimation accuracy, one could subtract the error term, which can be computed with (27). In practice, however, this might lead to negative power estimates when the involved quantities contain estimation errors. Therefore, we use (33) to estimate the diffuse sound power. Although this in theory overestimates the power, we found that it yields slightly better results in practice.

E. Signal PSD Matrix

To estimate the signal PSD matrix, we use the approach presented in [23]. This approach uses knowledge of the microphone signal PSD matrix, the noise-plus-diffuse-sound PSD matrix, and the propagation matrix. The resulting estimate is optimal in the least-squares (LS) sense and is computed as (34), where diag(·) yields a vector containing the main-diagonal elements of a matrix, vec(·) stacks the columns of a matrix into one vector, and the involved matrices are defined in (35a)–(35c). The reader is referred to [23] for further details.

V. EXPERIMENTAL RESULTS

We have carried out simulations and measurements to verify the proposed ISF.
First, Section V-A analyzes the spatial filter in (25) for estimating the diffuse sound power and compares it to
Fig. 3. Performance of different spatial filters for extracting diffuse sound. (a) DNG (dB) of the filter of [22]; (b) DNG (dB) of the filter of [22] compared to the filter proposed in Section IV-D.
the corresponding filter proposed in [22]. Secondly, Section V-B evaluates the ISF proposed in Section III based on computer simulations. Finally, Section V-C presents listening test results based on measured data.

A. Spatial Filter for Estimating the Diffuse Sound Power
In [22], we recently proposed a linearly constrained spatial filter for estimating the diffuse sound power. This filter attenuates the plane waves while capturing the sound from a direction from which no plane wave arrives; this direction is computed such that it has the maximum distance to all directions from which a plane wave arrives. In the following, we compare the performance of the filter proposed in (25) to the filter of [22]. For this purpose, let us first introduce the diffuse-to-noise ratio gain (DNG) as (36), where the input DNR is the DNR at a single microphone. The DNG describes by how much the filter improves the DNR compared to the filter input. As explained in Section IV-D, a high output DNR, i.e., a high DNG, is desired when estimating the diffuse sound power. For spatially white noise, we have (37). Fig. 3(a) shows the DNG for the filter of [22], assuming spatially white noise. The filter was computed for two plane waves arriving from fixed directions, considering a uniform linear array of omnidirectional microphones. The DNG was computed as a function of the direction from which the filter captures the sound. The dashed line in Fig. 3(a) shows the direction estimated as in [22]. The solid
Fig. 5. Response function (dB) considered throughout the experiments. The markers show the directions of the interferer (circles) and desired source (square).
Fig. 4. Simulated shoebox room with two speech sources captured by a non-uniform linear array of microphones.
line shows the optimal direction for which we obtain the maximum DNG. The plot shows that with the estimated direction, we do not obtain the desired maximum DNG, and in some angular regions the difference to the maximum is large. Note that the optimal direction (solid line) is not available as an analytical function. Fig. 3(b) shows the DNG for the filter of [22] [along the dashed and solid lines in Fig. 3(a)] compared to the proposed spatial filter (dash-dot line). We can see that with the proposed filter, we obtain nearly the same DNG as would be possible with the filter of [22] if the optimal direction were available. This shows that the proposed filter is optimal for estimating the diffuse sound power, as explained in Section IV-D.

B. ISF System Evaluation

In the following, we evaluate the ISF. We first introduce the simulation setup. Secondly, we evaluate the parameter estimation. Finally, we investigate the performance of the ISF. Corresponding listening examples are provided online [35].

1) Simulation Setup: We consider a typical living-room scenario as depicted in Fig. 4. A reverberant shoebox room and a non-uniform linear array with omnidirectional microphones (microphone spacing 12–6–3–6–12 cm) were simulated using the source-image method [36]. Two speech sources were located in front of the microphone array. Source A represents an interfering (undesired) speaker, while Source B represents the desired speaker. The desired Source B is located at a distance of 1.7 m (the array broadside corresponds to 90°). The interfering Source A is active from different directions as indicated by the circles: In Scenario I, Source A moves along the dashed line from position 1 to position 3 and back within 4 s. In Scenario II, Source A is subsequently active at positions 1 to 4 (1.3 s at each position). The two scenarios are used to evaluate the performance of the ISF in dynamic situations. For both scenarios, the input signals consisted of 1 s silence, single talk (Source A), double talk (Source A and B), and single talk (Source B).
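The diffuse (reverberant) sound in such setups is commonly modeled by the spatial coherence of an ideal spherically isotropic field, which for omnidirectional microphones follows a sinc law. A small sketch for a linear array; the 12–6–3–6–12 cm spacings are taken from the setup above, while the sinc model itself is standard and not spelled out in this excerpt:

```python
import numpy as np

def diffuse_coherence_ula(positions_m, freq_hz, c=343.0):
    """Spatial coherence matrix of an ideal spherically diffuse field for
    omnidirectional microphones: Gamma[i, j] = sinc(2 f d_ij / c).

    positions_m : 1-D array of microphone positions along the array axis (m)
    """
    d = np.abs(np.subtract.outer(positions_m, positions_m))
    return np.sinc(2.0 * freq_hz * d / c)  # np.sinc(x) = sin(pi x)/(pi x)

# Non-uniform array with 12-6-3-6-12 cm spacings, as in the setup above
mics = np.cumsum([0.0, 0.12, 0.06, 0.03, 0.06, 0.12])
```

At low frequencies the coherence approaches one across the whole array, which is exactly the regime in which superdirective-type filters lose WNG, as discussed later in Section V-B3.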
The sources were also moving during the double-talk part. White Gaussian noise was added to the microphone signals, resulting in a segmental signal-to-noise
ratio (SegSNR) of 34 dB. The sound was sampled at 16 kHz and transformed into the time-frequency domain using a 512-point short-time Fourier transform (STFT) with 50% overlap. The microphone signal PSD matrix was estimated with the recursive temporal averaging filter in (23). The noise PSD matrix was computed from the silent period at the beginning of the signal. The number of plane waves was estimated as explained in Section IV-A. Moreover, we defined a maximum number of waves which, due to the sparsity of speech, is sufficiently high even for scenarios in which many speech sources are active. The DOAs were estimated using root MUSIC. The diffuse sound PSD matrix and the signal PSD matrix were computed as explained in Sections IV-D and IV-E, respectively. Note that all required parameters were determined with the same temporal resolution, as all parameter estimators involve the microphone signal PSD matrix. We considered the real-valued desired response function in Fig. 5, i.e., we aimed at extracting the desired Source B without attenuation while attenuating the power of the interfering Source A by 21 dB. The gain function becomes broader towards lower frequencies to compensate for the lower DOA estimation accuracy there. Note that attenuating the interferers by 21 dB instead of nulling them out completely [which would correspond to a zero gain for DOAs outside the spatial window] has practical reasons: Placing a spatial null towards the exact DOA of the interfering wave is difficult in practice due to DOA estimation errors, especially when multiple sources are active. Since the directivity patterns of spatial filters are typically very steep in the vicinity of a spatial null, we would capture the interfering waves with strongly fluctuating gains if the estimated DOAs are noisy, leading to perceivable musical tones. Limiting the attenuation to a reasonable value greatly reduces this problem.
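A frequency-dependent spatial window of the kind described above can be sketched as follows; the concrete widths and the 90° look direction are illustrative assumptions, not the exact function of Fig. 5:

```python
import numpy as np

def desired_response(doa_deg, freq_hz, look_deg=90.0, atten_db=-21.0):
    """Illustrative response function: unit gain inside a spatial window
    around the look direction, -21 dB outside, with the window widening
    towards low frequencies to absorb DOA estimation errors.
    """
    # Hypothetical half-width: 10 deg at 8 kHz, growing as frequency drops.
    half_width = 10.0 * np.sqrt(8000.0 / max(freq_hz, 125.0))
    inside = abs(doa_deg - look_deg) <= half_width
    return 1.0 if inside else 10.0 ** (atten_db / 20.0)
```

Bounding the attenuation at a finite value, as in the paper, keeps the gain applied to a noisily localized interferer from fluctuating wildly between 0 and 1 across frames.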
In the following, we compare the informed LCMV filter (13), the informed MMSE filter (15), and the proposed informed PMMW filter (11)–(12). For the latter, we considered the parameter function in (17)–(18) with the default setting, depicted in Fig. 2 (solid line). We can see that for higher SDNRs the function forces the LCMV solution, while for lower SDNRs it forces a filter whose noise suppression is even stronger than that of the MMSE solution.
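The informed LCMV filter referenced as (13) is not reproduced in this excerpt; the following is a hedged sketch of a multi-constraint LCMV of the kind described, which satisfies the desired response for the currently estimated DOAs while minimizing the diffuse-plus-noise power (naming is our own):

```python
import numpy as np

def informed_lcmv_weights(A, g, phi_d, gamma_d, phi_n):
    """Sketch of an informed LCMV filter in the spirit of (13): satisfy
    A^H w = g for the L current plane waves while minimizing the
    diffuse-plus-noise power (all inputs are per time-frequency bin).

    A       : (M, L) instantaneous propagation vectors (from DOA estimates)
    g       : (L,) real desired gains from the response function, cf. Fig. 5
    phi_d   : scalar diffuse sound power estimate, cf. Section IV-D
    gamma_d : (M, M) diffuse coherence matrix
    phi_n   : (M, M) noise PSD matrix, cf. Section IV-C
    """
    phi_u = phi_d * gamma_d + phi_n          # undesired (diffuse + noise) PSD
    ui_A = np.linalg.solve(phi_u, A)         # Phi_u^{-1} A
    return ui_A @ np.linalg.solve(A.conj().T @ ui_A, g)
```

Because the constraints are rebuilt from instantaneous DOA estimates in every time-frequency bin, the filter can follow moving sources without explicit adaptation control.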
Fig. 6. Estimated number of plane waves for Scenario I. The dashed lines indicate the beginning and end of the double talk period.
2) Parameter Estimation Performance: In the following, we study the parameter estimation for Scenario I (moving interferer). Note that similar results are obtained for Scenario II. Fig. 6 shows the estimated number of plane waves for Scenario I. The first dashed line indicates where Source B becomes active (beginning of double talk) and the second dashed line indicates where Source A stops (end of double talk). During the silent periods, where no source is active, the estimated number of waves is zero, as desired. This results in zero filter weights, i.e., the output signal of the filter is zero. During signal activity, the estimated number varies, independently of how many of the two sources are active. The reason is mainly the reverberation, which leads to high eigenvalues in Section IV-A. With the parameters introduced in Section IV-A we can adjust the estimator as explained there. However, we found that the parameters specified at the beginning of this section yielded good results for most considered scenarios. As shown later, estimating the number of waves for each time-frequency instant leads to better ISF performance compared to using a fixed number.

To investigate the diffuse sound power estimation, we consider the estimated DNR and the true DNR in Fig. 7, computed for Scenario I. We obtain a high DNR during speech activity due to the reverberation in the room. Qualitatively, the true and estimated DNR look very similar, indicating that the proposed estimator for the diffuse sound power is relatively accurate. However, we observe in Fig. 7(b) a slight overestimation when the DNR is low, which is explained in Section IV-D.

Fig. 7. True and estimated DNR (dB) for Scenario I. (a) true DNR; (b) estimated DNR.

To investigate the estimation of the DOAs and signal PSDs, we consider the so-called long-term spatial power density (LT-SPD), which characterizes the power of the arriving plane waves for each direction. The input LT-SPD is computed by summing across frequency the estimated power of the plane waves arriving from a given direction, i.e.,

(38)

where the sum runs up to the highest frequency band of interest, the Kronecker delta selects the waves arriving from the considered direction, and an overline denotes temporal averaging. The output LT-SPD is found with (38) by multiplying each wave power with the corresponding gain of the response function; it characterizes the power of the plane waves at the output of the filter. The input LT-SPD is depicted in Fig. 8(a). The dashed lines indicate the true DOAs of the two sources. The plot shows that most estimated power is concentrated around the true DOAs. The two sources are localized sufficiently accurately even during the double-talk period. Nevertheless, some power is also localized in spatial regions where no source is active. This power is mainly reverberant sound, for which the estimated DOAs are random and the corresponding powers are small. Fig. 8(b) shows the output LT-SPD. The undesired source is strongly suppressed by the filter while the desired source is well preserved. The signal PSD estimation was recently also discussed in [23].

3) Spatial Filter Performance: We finally evaluate the performance of the informed LCMV, MMSE, and PMMW filters. As a reference, we consider the spatial filter that maximizes the WNG and the robust superdirective (SD) beamformer that maximizes the DI [37]. Both reference filters possess a single linear constraint corresponding to the fixed look direction towards the desired Source B, and both are time-invariant. The maximum-WNG filter is computed as

(39)

subject to (40). The robust SD beamformer minimizes the diffuse sound power at the filter output and is computed as
(41)

subject to (40) and to a quadratic constraint that lower-bounds the WNG (in dB). This parameter defines the minimum WNG and determines the achievable DI of the filter [37]. Note that solving (41) leads to a non-convex optimization problem due to the quadratic constraint, which is time-consuming to solve.
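The paper solves (41) with the CVX toolbox; as an illustrative alternative, the classical diagonal-loading approach to robust superdirective beamforming increases a loading factor until the WNG constraint is met. The following is a sketch of that substitute technique, not the paper's solver:

```python
import numpy as np

def robust_sd_weights(gamma_d, a, min_wng_db=-10.0):
    """Sketch of a robust superdirective (MVDR-style) beamformer via
    diagonal loading: raise mu until the white noise gain constraint holds.

    gamma_d    : (M, M) diffuse coherence matrix
    a          : (M,) steering vector towards the desired source
    min_wng_db : illustrative minimum admissible WNG in dB
    """
    wng_min = 10.0 ** (min_wng_db / 10.0)
    mu = 0.0
    for _ in range(100):
        w = np.linalg.solve(gamma_d + mu * np.eye(len(a)), a)
        w /= (a.conj() @ w)                  # distortionless: a^H w = 1
        wng = 1.0 / np.real(w.conj() @ w)    # WNG = |w^H a|^2 / (w^H w)
        if wng >= wng_min:
            return w
        mu = 2.0 * mu + 1e-4                 # heavier loading -> higher WNG
    return w
```

The trade-off is the same one the constraint in (41) encodes: more loading improves robustness to microphone self-noise but lowers the achievable directivity.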
TABLE I PERFORMANCE OF THE SPATIAL FILTERS DURING DOUBLE TALK AND MOVING UNDESIRED SOUND SOURCE [IMPROVEMENT COMPARED TO AN UNPROCESSED MICROPHONE SIGNAL]. ALL VALUES IN DB. BEST VALUES ARE UNDERLINED (EXCLUDING PMMW2 AND PMMW3)
Fig. 8. Spatial power density [dB] of the estimated direct sound over time for Scenario I (top: input SPD, bottom: output SPD) to evaluate the performance of the DOA estimation and signal power estimation.
The weights were computed using the CVX MATLAB toolbox [38], [39]. Table I shows the performance of the different filters in terms of SegSNR, segmental signal-to-interference ratio (SegSIR), segmental signal-to-reverberation ratio (SegSRR), and mean log spectral distance (LSD) for Scenarios I and II. The values are computed over the more difficult double-talk part. The table shows the improvement compared to an unprocessed microphone signal. In addition to the PMMW filter studied so far, which is computed with the default control parameters (yielding the solid line in Fig. 2), we have computed the filter with less aggressive control parameters (setting 2 in Fig. 2) as well as with more aggressive control parameters (setting 3 in Fig. 2). The less aggressive variant is denoted in Table I by PMMW2, while the more aggressive variant is denoted by PMMW3. The specific parameter values are given in Section III-B1. The LSD for a given filter was computed similarly as in [40], i.e.,

(42)

Typically, the first signal in (42) is the signal obtained when applying the filter to the complete microphone signals. Here, however, it is only the direct sound of the desired Source B at the filter output. Therefore, (42) is a measure characterizing how much the desired source is distorted by the filter; it does not represent how similar the complete filtered output signal is to the desired signal. The log spectrum was limited to a dynamic range of 50 dB. The mean LSD shown in Table I is obtained by averaging (42) over all double-talk frames. The values in Table I show that, in terms of SegSIR, the proposed informed filters (all
settings) outperform the maximum-WNG and SD filters in both scenarios. While the latter two filters have no knowledge of the interferer position, the informed filters estimate the DOA of the interferer and hence can suppress it more strongly. The informed MMSE filter and the PMMW filters (default setting, aggressive setting) provide the highest SegSIR. In terms of SegSNR, the informed MMSE filter and the PMMW filter provide very similar results, namely a strong suppression of the noise by more than 7 dB. The more aggressive PMMW3 filter also performed well. In contrast, the SD filter suffers from a poor WNG, especially at low frequencies, which leads to a strong noise amplification resulting in the lowest SegSNR among all filters. In terms of SegSRR, the informed filters even outperform the SD filter, which in turn provides a clearly better performance than the maximum-WNG filter. In terms of LSD, the maximum-WNG and SD filters perform best, i.e., in contrast to the ISFs they introduce almost no signal distortion. This result is not surprising, since the fixed filters were steered towards the true direction of the desired source, while the informed filters estimate the DOAs. The DOA estimation errors lead to signal distortions, e.g., when the desired source is localized outside the spatial window in Fig. 5. Among the informed filters, the LCMV filter yields the lowest signal distortion, followed by the PMMW filters. The comparatively aggressive PMMW and PMMW3 filters yield lower signal distortions than the informed MMSE filter, even though they perform almost equally well or better in terms of the other measures. In general, the results in Table I show that the proposed parametric filters (especially for the default control parameters) provide a good trade-off between the informed LCMV and the informed MMSE filter. In fact, the PMMW filters perform better than the LCMV filter in terms of interference suppression, noise suppression, and dereverberation, and introduce lower signal distortions than the MMSE filter.
Therefore, the proposed PMMW filter is expected to outperform the other filters in terms of perceptual performance, which is verified by a subjective listening test in Section V-C. By adjusting the control parameters of the PMMW filter, we can control the trade-off between noise suppression and dereverberation performance on the one hand and the amount of signal distortion on the other. To study the performance of the ISF in different room situations, we have carried out the simulation (Scenario II) for different reverberation times. The resulting SegSIR, SegSNR, SegSRR, and mean LSD are depicted in Fig. 9. We compared the proposed informed PMMW filter (default control parameters), the informed LCMV filter, the
TABLE II
PERFORMANCE OF THE INFORMED LCMV FILTER DURING DOUBLE TALK IF THE NUMBER OF WAVES IS EITHER FIXED OR ESTIMATED [IMPROVEMENT COMPARED TO AN UNPROCESSED MICROPHONE SIGNAL]. ALL VALUES IN DB. BEST VALUES ARE UNDERLINED
Fig. 9. Performance of the spatial filters during double talk for different reverberation times (Scenario II). The plots show the improvement compared to an unprocessed microphone signal. (a) SegSIR; (b) SegSNR; (c) SegSRR; (d) mean LSD.
informed MMSE filter, and the SD beamformer. The plots show that the performance of the informed filters decreases strongly for large reverberation times, i.e., the interference suppression, noise suppression, and diffuse sound suppression are reduced [Fig. 9(a)–(c)], while the signal distortion increases [Fig. 9(d)]. A main reason for this performance loss is the less accurate DOA estimation at larger reverberation times. In contrast, the SD filter provides almost the same performance for all reverberation times. Nevertheless, the informed filters outperform the SD filter in terms of SegSIR, SegSNR, and SegSRR even at larger reverberation times. The proposed PMMW filter provides a good trade-off between the LCMV and MMSE filters, i.e., its performance in terms of noise and interference suppression and dereverberation is better than that of the LCMV filter and similar to that of the MMSE filter. In terms of speech distortion, the proposed filter is better than the informed MMSE filter but worse than the LCMV filter.

4) ISF Performance for Fixed and Estimated Number of Waves: In this section, we verify the benefit of estimating the number of plane waves for each time-frequency instant instead of using a fixed number as in [22] and [23]. Table II shows the performance of the informed LCMV filter for Scenarios I and II when the number of waves is estimated as discussed in Section IV-A, or fixed. All other simulation settings and algorithmic steps were not modified. The results in Table II show that the number of waves only slightly influences the filter performance in terms of SegSIR at the filter output (the differences mainly arise from different DOA estimation performance when assuming different numbers of waves). However, in terms of SegSNR and SegSRR, we obtain the highest performance for a small fixed number of waves, while the lowest performance is obtained for a larger fixed number. The reason for the better performance with fewer assumed waves is the larger number of degrees of freedom left to the filter to minimize the noise and diffuse sound in (9). In terms of LSD, however, we obtain better results when assuming more waves. In fact, during double talk it is not sufficient to represent the sound field with a single plane wave per time-frequency instant, i.e., the signal model (1) is violated in many frames if the assumed number of waves is too small. In this case, if the two sources are, for instance, nearly equally strong at a time-frequency instant, the desired source may be localized outside the spatial window depicted in Fig. 5, resulting in signal distortions. When considering multiple waves, such signal distortions can be significantly reduced. The comparatively low amount of signal distortion during multi-talk is a main advantage of considering multiple waves in the signal model. This advantage becomes even more significant when broadband sources (e.g., music) are active together with speech sources, which violates a single-wave model even more strongly [21]. When estimating the number of waves as proposed in Section IV-A, we obtain a good trade-off between noise suppression, dereverberation performance, and signal distortions, as shown in Table II. This outcome applies in a similar way to the other informed filters.

C. Listening Test Results

A MUSHRA listening test with measured data and 11 participants was carried out to verify the perceptual quality of the presented approaches. The measurements were carried out in a reverberant room. Two loudspeakers were used to emit speech (male and female) from two different directions at the same time. The sound was captured at a sampling frequency of 16 kHz at a distance of 1.7 m from the loudspeakers with a non-uniform linear array of omnidirectional microphones (spacing 12–6–3–6–12 cm). The male speaker (desired source) was active from 90° (array broadside) while the female speaker (undesired source) was subsequently active from 31°, 57°, 120°, and 148°.
The abrupt jumps of the undesired speaker during double talk with the desired speaker represent a challenging scenario for spatial filters. The five spatial filters (informed LCMV, informed MMSE, informed PMMW with default control parameters, maximum-WNG, and robust SD) were computed and applied to the microphone signals as explained in Section V-B. For the informed filters, we considered the response function in Fig. 5 (main lobe shifted to 90°), i.e., the desired source was captured with unit gain while the undesired source was attenuated by 21 dB. The maximum-WNG and SD filters were steered towards the desired speaker, i.e., their look direction was set to 90°. The participants listened to the output signals of the filters, which were reproduced over headphones. The MUSHRA test was repeated five times, and in each test the participants evaluated a different aspect of the filters. The reference signal was the direct sound of the
TABLE III
OVERVIEW OF WHICH FILTERS WERE SIGNIFICANTLY BETTER THAN OTHER FILTERS BASED ON A t-TEST
Fig. 10. MUSHRA listening test results. The plot shows the averages and 95% confidence intervals.
desired speaker plus the direct sound of the undesired speaker attenuated by 21 dB, both obtained from the windowed impulse responses. The lower anchor was an unprocessed and low-pass filtered microphone signal. The corresponding listening test files are provided online [35]. The MUSHRA scores are depicted in Fig. 10. Moreover, Table III shows which filters were significantly better than other filters based on a t-test. In the first MUSHRA test, denoted by intf, the listeners were asked to grade the strength of the interferer suppression, i.e., the suppression of the undesired speaker. Note that the number in Fig. 10 denotes how many listeners passed the recommended post-screening procedure [41] (here, listeners were excluded if their MUSHRA score was more than 25 points above or below the average). We can see that the informed filters (denoted by LCMV, MMSE, and PMMW) performed significantly better than the reference filters (WNG and DI). The rather aggressive informed MMSE filter yielded the strongest interferer suppression. In the second test (nse), the listeners were asked to grade the microphone noise suppression performance. The informed filters clearly outperformed the reference filters, even the filter that maximizes the WNG. The proposed informed PMMW filter was as good as the informed MMSE filter, and both were significantly better than the informed LCMV filter. The SD beamformer (DI) strongly amplified the noise and hence was graded lowest. In the third test (derev), the dereverberation performance was graded. Here, the proposed informed PMMW filter was significantly better than the informed LCMV filter, but worse than the informed MMSE filter. In the fourth test (dist), the listeners were asked to grade the distortion of the direct sound of the desired speaker (high grades had to be assigned if the speech distortion was low).
We notice that the informed MMSE filter yielded strongly noticeable speech distortion, while the informed LCMV and PMMW filters resulted in low distortion. The reference filters (for which the correct look direction was set) yielded the lowest speech distortion. Finally, the overall quality (the listeners' preference) was evaluated in the fifth test (Q). The best overall quality was provided by the informed PMMW and LCMV filters, which were significantly better than the MMSE filter and the reference filters. In general, the results in Fig. 10 confirm the results in Table I, i.e., the informed PMMW filter can provide a good trade-off between noise and diffuse sound suppression and speech distortion.
VI. CONCLUSIONS

We derived an informed parametric multi-wave multichannel Wiener filter that generalizes the recently proposed informed LCMV and informed MMSE filters. The informed spatial filtering framework recently developed by the authors was extended by an optimal DNR estimator and a time-frequency bin-wise estimation of the number of waves, leading to improved performance and a better trade-off between diffuse noise reduction and WNG. The PMMW filter provides a trade-off between interference reduction, noise reduction, and signal distortion by adjusting the allowable signal distortion for each wave separately. The performance evaluation was carried out in multi-talk scenarios with moving sources and for different reverberation levels. The comparison between the proposed PMMW filter, its special cases, namely the LCMV and MMSE filters, and a signal-independent superdirective beamformer demonstrated the flexibility of the proposed filter and its advantage over the superdirective beamformer, even at higher reverberation levels where the DOA estimation is less accurate. The objective evaluation results were corroborated by a MUSHRA listening test using measured data.
APPENDIX

To derive the PMMW filter, we omit all time and frequency dependencies for brevity. The cost function to be minimized to obtain the weights in (9) subject to (10) is given by (43), where each Lagrange multiplier is paired with the desired maximum distortion of the corresponding plane wave signal. Inserting the signal model (2) yields (44), where the Lagrange multipliers form a diagonal matrix. Expressing (43) as (44) requires the signal PSD matrix to be diagonal, i.e., the plane wave signals must be mutually uncorrelated. Setting the partial derivative of the cost function with respect to the weights to zero, we obtain (45)
The weights can now be computed as (46), where the matrix is given by (47). Equation (47) was derived from an alternative cost function in [15]. Applying the matrix inversion lemma yields (48). Equation (48) is identical to (11) with (12). Note that the l-th column of this matrix is the parametric multichannel Wiener filter for extracting the l-th wave, which was derived in [42] for a single-wave model. For the multi-wave model in (1), this filter is given by (49), where the reduced matrices are obtained by removing the l-th column from the propagation matrix and the l-th row and column from the corresponding PSD matrices, respectively. For this filter we have (50). Equation (50) allows us to determine the Lagrange multiplier such that the signal distortion of the l-th wave is limited to the desired maximum. Using this in (12) yields the optimal solution to (9) subject to (10). However, we can see that the multipliers are mutually dependent. Thus, computing them requires expressing (50) as a system of equations and solving jointly for all multipliers.

REFERENCES

[1] J. Benesty, J. Chen, and Y. Huang, Microphone Array Signal Processing. Berlin, Germany: Springer-Verlag, 2008.
[2] S. Doclo, S. Gannot, M. Moonen, and A. Spriet, "Acoustic beamforming for hearing aid applications," in Handbook on Array Processing and Sensor Networks, S. Haykin and K. Ray Liu, Eds. New York, NY, USA: Wiley, 2008, ch. 9.
[3] S. Gannot and I. Cohen, "Adaptive beamforming and postfiltering," in Springer Handbook of Speech Processing, J. Benesty, M. M. Sondhi, and Y. Huang, Eds. Berlin, Germany: Springer-Verlag, 2008, ch. 47.
[4] J. Benesty, J. Chen, and E. A. P. Habets, Speech Enhancement in the STFT Domain, ser. Springer Briefs in Electrical and Computer Engineering. Berlin, Germany: Springer-Verlag, 2011.
[5] S. Nordholm, I. Claesson, and B. Bengtsson, "Adaptive array noise suppression of handsfree speaker input in cars," IEEE Trans. Veh. Technol., vol. 42, no. 4, pp. 514–518, Nov. 1993.
[6] O. Hoshuyama, A. Sugiyama, and A.
Hirano, “A robust adaptive beamformer for microphone arrays with a blocking matrix using constrained adaptive filters,” IEEE Trans. Signal Process., vol. 47, no. 10, pp. 2677–2684, Oct. 1999.
[7] S. Gannot, D. Burshtein, and E. Weinstein, "Signal enhancement using beamforming and nonstationarity with applications to speech," IEEE Trans. Signal Process., vol. 49, no. 8, pp. 1614–1626, Aug. 2001.
[8] W. Herbordt and W. Kellermann, "Adaptive beamforming for audio signal acquisition," in Adaptive Signal Processing: Applications to Real-World Problems, ser. Signals and Communication Technology, J. Benesty and Y. Huang, Eds. Berlin, Germany: Springer-Verlag, 2003, ch. 6, pp. 155–194.
[9] R. Talmon, I. Cohen, and S. Gannot, "Convolutive transfer function generalized sidelobe canceler," IEEE Trans. Audio, Speech, Lang. Process., vol. 17, no. 7, pp. 1420–1434, Sep. 2009.
[10] A. Krueger, E. Warsitz, and R. Haeb-Umbach, "Speech enhancement with a GSC-like structure employing eigenvector-based transfer function ratios estimation," IEEE Trans. Audio, Speech, Lang. Process., vol. 19, no. 1, pp. 206–219, Jan. 2011.
[11] E. A. P. Habets and J. Benesty, "A two-stage beamforming approach for noise reduction and dereverberation," IEEE Trans. Audio, Speech, Lang. Process., vol. 21, no. 5, pp. 945–958, May 2013.
[12] M. Taseska and E. A. P. Habets, "MMSE-based blind source extraction in diffuse noise fields using a complex coherence-based a priori SAP estimator," in Proc. Int. Workshop Acoust. Signal Enhance. (IWAENC), Aachen, Germany, Sep. 2012.
[13] G. Reuven, S. Gannot, and I. Cohen, "Dual-source transfer-function generalized sidelobe canceller," IEEE Trans. Audio, Speech, Lang. Process., vol. 16, no. 4, pp. 711–727, May 2008.
[14] S. Markovich, S. Gannot, and I. Cohen, "Multichannel eigenspace beamforming in a reverberant noisy environment with multiple interfering speech signals," IEEE Trans. Audio, Speech, Lang. Process., vol. 17, no. 6, pp. 1071–1086, Aug. 2009.
[15] S. Markovich-Golan, S. Gannot, and I. Cohen, "A weighted multichannel Wiener filter for multiple sources scenarios," in Proc. IEEE 27th Conv. Elect. Electron. Eng. Israel (IEEEI), Nov. 2012, pp. 1–5.
[16] I. Tashev, M. Seltzer, and A. Acero, "Microphone array for headset with spatial noise suppressor," in Proc. Int. Workshop Acoust. Echo Noise Control (IWAENC), Eindhoven, The Netherlands, 2005.
[17] M. Kallinger, G. Del Galdo, F. Kuech, D. Mahne, and R. Schultz-Amling, "Spatial filtering using directional audio coding parameters," in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process. (ICASSP '09), Apr. 2009, pp. 217–220.
[18] M. Kallinger, G. Del Galdo, F. Kuech, and O. Thiergart, "Dereverberation in the spatial audio coding domain," in Audio Eng. Soc. Conv. 130, London, U.K., May 2011.
[19] O. Thiergart, G. Del Galdo, M. Taseska, and E. Habets, "Geometry-based spatial sound acquisition using distributed microphone arrays," IEEE Trans. Audio, Speech, Lang. Process., vol. 21, no. 12, pp. 2583–2594, Dec. 2013.
[20] F. Jacobsen and T. Roisin, "The coherence of reverberant sound fields," J. Acoust. Soc. Amer., vol. 108, no. 1, pp. 204–210, Jul. 2000.
[21] O. Thiergart and E. A. P. Habets, "Sound field model violations in parametric spatial sound processing," in Proc. Int. Workshop Acoust. Signal Enhance. (IWAENC), Aachen, Germany, Sep. 2012.
[22] O. Thiergart and E. A. P. Habets, "An informed LCMV filter based on multiple instantaneous direction-of-arrival estimates," in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process. (ICASSP), May 2013, pp. 659–663.
[23] O. Thiergart, M. Taseska, and E. A. P. Habets, "An informed MMSE filter based on multiple instantaneous direction-of-arrival estimates," in Proc. 21st Eur. Signal Process. Conf. (EUSIPCO '13), 2013.
[24] O. Yilmaz and S. Rickard, "Blind separation of speech mixtures via time-frequency masking," IEEE Trans. Signal Process., vol. 52, no. 7, pp. 1830–1847, Jul. 2004.
[25] G. W. Elko, "Spatial coherence functions for differential microphones in isotropic noise fields," in Microphone Arrays: Signal Processing Techniques and Applications, M. Brandstein and D. Ward, Eds. Berlin, Germany: Springer, 2001, ch. 4, pp. 61–85.
[26] V. Pulkki, "Virtual sound source positioning using vector base amplitude panning," J. Audio Eng. Soc., vol. 45, no. 6, pp. 456–466, 1997.
[27] E. Fishler, M. Grosmann, and H. Messer, "Detection of signals by information theoretic criteria: General asymptotic performance analysis," IEEE Trans. Signal Process., vol. 50, no. 5, pp. 1027–1036, May 2002.
[28] R. Roy and T. Kailath, "ESPRIT-estimation of signal parameters via rotational invariance techniques," IEEE Trans. Acoust., Speech, Signal Process., vol. 37, no. 7, pp. 984–995, Jul. 1989.
[29] B. Rao and K. Hari, "Performance analysis of root-MUSIC," in Proc. 22nd Asilomar Conf. Signals, Syst., Comput., 1988, vol. 2, pp. 578–582.
[30] A. Mhamdi and A. Samet, "Direction of arrival estimation for nonuniform linear antenna," in Proc. Int. Conf. Commun., Comput. Control Applicat. (CCCA), 2011, pp. 1–5.
[31] M. Zoltowski and C. P. Mathews, "Direction finding with uniform circular arrays via phase mode excitation and beamspace root-MUSIC," in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process. (ICASSP '92), 1992, vol. 5, pp. 245–248.
[32] E. A. P. Habets, "A distortionless subband beamformer for noise reduction in reverberant environments," in Proc. Int. Workshop Acoust. Echo Control (IWAENC), Tel Aviv, Israel, Aug. 2010.
[33] M. Souden, J. Chen, J. Benesty, and S. Affes, "An integrated solution for online multichannel noise tracking and reduction," IEEE Trans. Audio, Speech, Lang. Process., vol. 19, no. 7, pp. 2159–2169, Sep. 2011.
[34] R. Hendriks and T. Gerkmann, "Noise correlation matrix estimation for multi-microphone speech enhancement," IEEE Trans. Audio, Speech, Lang. Process., vol. 20, no. 1, pp. 223–233, Jan. 2012.
[35] O. Thiergart, M. Taseska, and E. A. P. Habets, "Sound examples for an informed parametric spatial filter based on instantaneous direction-of-arrival estimates," [Online]. Available: http://www.audiolabs-erlangen.de/resources/2014-TASLP-ISF/
[36] J. B. Allen and D. A. Berkley, "Image method for efficiently simulating small-room acoustics," J. Acoust. Soc. Amer., vol. 65, no. 4, pp. 943–950, 1979.
[37] H. Cox, R. Zeskind, and M. Owen, "Robust adaptive beamforming," IEEE Trans. Acoust., Speech, Signal Process., vol. ASSP-35, no. 10, pp. 1365–1376, Oct. 1987.
[38] CVX Research, Inc., "CVX: Matlab software for disciplined convex programming, version 2.0 beta," Sep. 2012. [Online]. Available: http://cvxr.com/cvx
[39] M. Grant and S. Boyd, "Graph implementations for nonsmooth convex programs," in Recent Advances in Learning and Control, ser. Lecture Notes in Control and Information Sciences, V. Blondel, S. Boyd, and H. Kimura, Eds. Berlin, Germany: Springer-Verlag, 2008, pp. 95–110.
[40] P. A. Naylor and N. D. Gaubitch, Speech Dereverberation. New York, NY, USA: Springer, 2010.
[41] Method for the Subjective Assessment of Intermediate Quality Levels of Coding Systems, Rec. ITU-R BS.1534-1, International Telecommunication Union, 2003.
[42] M. Souden, J. Benesty, and S. Affes, "On optimal frequency-domain multichannel linear filtering for noise reduction," IEEE Trans. Audio, Speech, Lang. Process., vol. 18, no. 2, pp. 260–276, Feb. 2010.

Oliver Thiergart (M'11–S'13) was born in 1983 in Pößneck, Germany. He studied media technology at Ilmenau University of Technology (TUI), Ilmenau, Germany, and received his Dipl.-Ing. (M.Sc.) degree in 2008. In 2008, he was with the Fraunhofer Institute for Digital Media Technology IDMT in Ilmenau, where he worked on sound field analysis with microphone arrays. He then joined the Audio Department of the Fraunhofer Institute for Integrated Circuits IIS in Erlangen, Germany, where he worked on spatial audio analysis and reproduction. In 2011, he became a member of the International Audio Laboratories Erlangen, where he is currently pursuing a Ph.D. in the field of parametric spatial sound processing. His current research interests include spatial audio processing, source localization, spatial filtering, and joint audio-video processing.
Maja Taseska (S'13) was born in 1988 in Ohrid, Macedonia. She received her B.Sc. degree in electrical engineering from Jacobs University, Bremen, Germany, in 2010, and her M.Sc. degree from the Friedrich-Alexander-University, Erlangen, Germany, in 2012. She then joined the International Audio Laboratories Erlangen, where she is currently pursuing a Ph.D. in the field of informed spatial filtering. Her current research interests include source localization, spatial filtering, blind source separation, and noise reduction.
Emanuël A. P. Habets (S'02–M'07–SM'11) received his B.Sc. degree in electrical engineering from the Hogeschool Limburg, The Netherlands, in 1999, and his M.Sc. and Ph.D. degrees in electrical engineering from the Technische Universiteit Eindhoven, The Netherlands, in 2002 and 2007, respectively. From March 2007 until February 2009, he was a Postdoctoral Fellow at the Technion, Israel Institute of Technology, and at Bar-Ilan University in Ramat Gan, Israel. From February 2009 until November 2010, he was a Research Fellow in the Communication and Signal Processing group at Imperial College London, United Kingdom. Since November 2010, he has been an Associate Professor at the International Audio Laboratories Erlangen (a joint institution of the University of Erlangen and Fraunhofer IIS) and Head of the Spatial Audio Research Group at Fraunhofer IIS, Germany. His research interests center on audio and acoustic signal processing; he has worked in particular on dereverberation, noise estimation and reduction, echo reduction, system identification and equalization, source localization and tracking, and crosstalk cancellation. Dr. Habets was a member of the organization committee of the 2005 International Workshop on Acoustic Echo and Noise Control (IWAENC) in Eindhoven, The Netherlands, a general co-chair of the 2013 International Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA) in New Paltz, New York, and a general co-chair of the 2014 International Conference on Spatial Audio (ICSA) in Erlangen, Germany. He is a member of the IEEE Signal Processing Society Technical Committee on Audio and Acoustic Signal Processing (2011–2013) and a member of the IEEE Signal Processing Society Standing Committee on Industry Digital Signal Processing Technology (2013–2015). Since 2013, he has been an Associate Editor of the IEEE SIGNAL PROCESSING LETTERS.