Robust Multi-Source Localization Over Planar Arrays Using MUSIC-Group Delay Spectrum

Lalan Kumar, Student Member, IEEE, Ardhendu Tripathy, and Rajesh M. Hegde, Member, IEEE
Abstract—Subspace-based source localization methods utilize the spectral magnitude of the MUltiple SIgnal Classification (MUSIC) method. However, all these methods require a large number of sensors to resolve closely spaced sources. A novel method for high-resolution source localization based on the group delay of MUSIC is described in this work. The method can resolve both the azimuth and elevation angles of closely spaced sources using a minimal number of sensors over a planar array. At the direction of arrival (DOA) of the desired source, a transition is observed in the phase spectrum of MUSIC. The negative differential of the phase spectrum, also called the group delay, results in a peak at the DOA. The proposed MUSIC-Group delay spectrum, defined as the product of the MUSIC-Magnitude (MM) and group delay spectra, resolves spatially close sources even under reverberation owing to its spatial additive property. This is illustrated by performing spectral analysis of the MUSIC-Group delay function under reverberant environments. A mathematical proof of the spatial additive property of the group delay spectrum is also provided. Source localization error analysis, sensor perturbation analysis, and Cramér-Rao bound (CRB) analysis are then performed to verify the robustness of the MUSIC-Group delay method. Experiments on speech enhancement and distant speech recognition are also conducted on the spatialized TIMIT and MONC databases. Experimental results obtained using objective performance measures and word error rates (WER) indicate reasonable robustness when compared to conventional source localization methods in the literature.

Index Terms—Source localization, DOA, MUSIC, group delay, phase, UCA, azimuth, elevation.
I. INTRODUCTION
SOURCE localization has been a very active area of research for several decades because of its extensive application in fields such as teleconferencing and robotics. Correlation-based source localization is a widely used method in this context. Techniques like generalized cross correlation with phase transform (GCC-PHAT) and generalized cross correlation with the Roth filter (GCC-Roth) [1] are used for time delay estimation (TDE) and subsequently for source localization.
Beamforming techniques are another class of approaches to direction of arrival (DOA) estimation and source localization. These methods work on the principle of spatial filtering. The beamformer with multiple linear constraints is known as the linearly constrained minimum variance (LCMV) beamformer. The minimum variance distortionless response (MVDR) beamformer [2] is a special case of the LCMV beamformer; the LCMV beamformer used here imposes the constraints of unity gain toward the signal of interest (SOI) from its DOA while placing a null in the direction of the undesired source. TDE-based algorithms do not perform well for multiple DOAs [3], and TDE becomes even more challenging under reverberation. For multiple sources, the DOA estimates obtained with beamforming-based methods are inconsistent [4], and the bias of these estimates may be significant for closely spaced sources.

Subspace-based methods like MUltiple SIgnal Classification (MUSIC) [5] and its variants, root-MUSIC [6] and ESPRIT (Estimation of Signal Parameters via Rotational Invariance Techniques) [7], are among the most computationally efficient methods for DOA estimation. The high resolution of these methods is due to subspace decomposition [8]. Several modifications to MUSIC for high-resolution DOA estimation have been suggested. In [9], it is established that unweighted MUSIC provides the best resolution. The derivative MUSIC proposed in [10] localizes two closely spaced and correlated sources using the derivative of the data correlation matrix. In [11], the second-order differential of the MUSIC-Magnitude spectrum is used to resolve closely spaced sources. The MUSIC method is widely studied due to its computational efficiency. However, it requires a large number of sensors to resolve closely spaced sources. In reverberant environments, it requires a comprehensive search algorithm for deciding candidate peaks for the DOA owing to a large number of spurious peaks [12].

Conventionally, DOA estimation utilizes the spectral magnitude of MUSIC to compute the DOA of multiple sources incident on an array of sensors. The phase information of the MUSIC spectrum has been studied in [13] for DOA estimation over a uniform linear array (ULA). In this work, we propose to use the negative differential of the unwrapped phase spectrum (group delay) of MUSIC for DOA estimation over planar arrays. Although the group delay function has been used widely in temporal frequency processing for its high-resolution properties [14], the additive property of the group delay function has hitherto not been utilized in spatial spectrum analysis. The primary contribution of this work is the utilization of the MUSIC-Group delay (MGD) spectrum to localize closely spaced sources with a limited number of sensors over planar arrays, under reverberant environments.
Fig. 1. Spectral magnitude of MUSIC for a UCA, with sources at (50°, 15°) and (60°, 20°).
The rest of the paper is organized as follows. The proposed method for azimuth and elevation estimation over a planar array is presented in Section II. Localization error analysis is presented in Section III. Section IV describes the performance evaluation of the proposed method in terms of experiments on source localization, speech enhancement, and distant speech recognition. Section V concludes the paper.

II. MUSIC-GROUP DELAY SPECTRUM FOR ROBUST SOURCE LOCALIZATION
Subspace-based methods for DOA estimation based on the spectral magnitude of MUSIC require a large number of sensors for resolving spatially close sources and are prone to errors under reverberant conditions. In [15], a method for high-resolution source localization based on the MUSIC-Group delay spectrum over a ULA has been proposed. This method is able to resolve closely spaced sources with a limited number of sensors. In the following section, the MUSIC-Group delay based method for two-dimensional source localization over planar arrays is proposed.

A. MUSIC-Group Delay Spectrum for Source Localization Over a Planar Array

The signal received by a planar array with $M$ microphones from $D$ narrowband sources can be represented as

$$\mathbf{x}(t) = \mathbf{A}(\Theta)\,\mathbf{s}(t) + \mathbf{n}(t) \tag{1}$$

where $\mathbf{A}(\Theta)$ is the $M \times D$ steering matrix, expressed as

$$\mathbf{A}(\Theta) = [\mathbf{a}(\theta_1, \phi_1),\ \mathbf{a}(\theta_2, \phi_2),\ \ldots,\ \mathbf{a}(\theta_D, \phi_D)] \tag{2}$$

and $\mathbf{s}(t)$ is the $D \times 1$ vector of signal amplitudes at the reference microphone. A particular steering vector, consisting of time delays [16], is given by

$$\mathbf{a}(\theta, \phi) = \left[e^{-j\omega\tau_1(\theta,\phi)},\ e^{-j\omega\tau_2(\theta,\phi)},\ \ldots,\ e^{-j\omega\tau_M(\theta,\phi)}\right]^T \tag{3}$$

where $\theta$ is the azimuth angle, measured counterclockwise from the positive $x$-axis, and $\phi$ is the elevation angle, measured down from the positive $z$-axis. $\tau_m(\theta,\phi)$ is the time delay at the $m$th microphone with respect to the reference microphone, and $\omega$ is the narrowband signal frequency. The signal and the noise are assumed to be stationary, zero-mean, uncorrelated random processes.
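As a concrete illustration of the data model in (1)-(3), the following Python sketch synthesizes narrowband snapshots from a table of propagation delays. The geometry, source amplitudes, and noise level are illustrative assumptions, not values from the paper.

```python
import numpy as np

def steering_vector(tau, omega):
    """Steering vector of (3): a = [e^{-j w tau_1}, ..., e^{-j w tau_M}]^T."""
    return np.exp(-1j * omega * tau)

def received_snapshots(taus, omega, S, noise_var=1e-2,
                       rng=np.random.default_rng(0)):
    """x(t) = A s(t) + n(t) of (1); taus has shape (M, D), S has shape (D, T).
    The circularly symmetric Gaussian noise model is an assumption here."""
    A = np.stack([steering_vector(taus[:, d], omega)
                  for d in range(taus.shape[1])], axis=1)   # M x D matrix of (2)
    M, T = A.shape[0], S.shape[1]
    N = np.sqrt(noise_var / 2) * (rng.standard_normal((M, T))
                                  + 1j * rng.standard_normal((M, T)))
    return A @ S + N
```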
Fig. 2. Spectral phase of MUSIC for UCA (top) and ULA (bottom). Sources at (50°, 15°) and (60°, 20°) for the UCA; sources at 50° and 60° for the ULA.
In this work, a uniform circular array (UCA) [17] is considered. In this case, the delays are related to the azimuth and elevation angles as [18]

$$\tau_m = -\frac{r_m}{c}\,\sin\phi\,\cos(\theta - \psi_m) \tag{4}$$

where $r_m$ and $\psi_m$ are the radius and azimuth angle of the $m$th microphone with the center of the circular array as the reference, and $c$ is the speed of sound. The MUSIC-Magnitude (MM) spectrum for a planar array is given by

$$P_{MM}(\theta, \phi) = \frac{1}{\mathbf{a}^H(\theta,\phi)\,\mathbf{V}_N \mathbf{V}_N^H\,\mathbf{a}(\theta,\phi)} \tag{5}$$

where $\mathbf{V}_N = [\mathbf{v}_1, \ldots, \mathbf{v}_{M-D}]$ is the noise subspace and $\mathbf{v}_i$ is the $i$th noise eigenvector. The denominator takes a null value when $(\theta,\phi)$ is the signal direction. Hence, the MUSIC-Magnitude spectrum has a peak at the DOA represented by the azimuth and elevation pair. However, when the sources are closely spaced, the MUSIC-Magnitude spectrum is unable to resolve them clearly, giving many spurious peaks or a single peak when a limited number of sensors is used. This is illustrated in Figs. 1 and 4(a), respectively. The experimental setup for Figs. 1-3 utilizes a UCA of twelve sensors placed on two concentric circles, with four sensors on the inner circle and eight on the outer circle. The sources are placed at (50°, 15°) and (60°, 20°). Corresponding results for a ULA are also illustrated in Figs. 2 and 3; the ULA, consisting of eight sensors, is used to illustrate the azimuth of the sources only.
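A minimal sketch of the UCA delays in (4) and the MM spectrum in (5) follows. The radius, a single-circle sensor layout, and the sign convention of the delays are assumptions for illustration (the paper's twelve-sensor array actually uses two concentric circles).

```python
import numpy as np

c = 343.0                             # speed of sound (m/s), assumed
r = 0.10                              # UCA radius (m), assumed
M = 8                                 # sensors on one circle (assumption)
psi = 2 * np.pi * np.arange(M) / M    # sensor azimuth angles psi_m

def uca_delays(theta, phi):
    """Delays of (4): tau_m = -(r/c) sin(phi) cos(theta - psi_m)."""
    return -(r / c) * np.sin(phi) * np.cos(theta - psi)

def noise_subspace(X, D):
    """M x (M-D) noise subspace V_N from the sample covariance of X (M x T)."""
    Rxx = X @ X.conj().T / X.shape[1]
    w, V = np.linalg.eigh(Rxx)        # eigenvalues in ascending order
    return V[:, : X.shape[0] - D]     # eigenvectors of the M-D smallest

def music_magnitude(theta, phi, omega, Vn):
    """MUSIC-Magnitude spectrum of (5) at one (theta, phi) grid point."""
    a = np.exp(-1j * omega * uca_delays(theta, phi))
    return 1.0 / np.real(a.conj() @ Vn @ Vn.conj().T @ a)
```

Scanning `music_magnitude` over a (theta, phi) grid yields the MM surface of Fig. 1.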
Fig. 3. Illustration of the standard group delay of MUSIC and the MUSIC-Group delay as proposed in this work. (a) Standard group delay spectrum of MUSIC for UCA (top) and ULA (bottom). (b) MUSIC-Group delay spectrum for UCA (top) and ULA (bottom). Sources are at (50°, 15°) and (60°, 20°) for the UCA, and at 50° and 60° for the ULA.
To overcome this limitation of MUSIC, the group delay function of the MUSIC spectrum is presented herein for resolving closely spaced sources with a limited number of sensors. The proposed MUSIC-Group delay spectrum for two-dimensional DOA (azimuth and elevation) estimation over planar arrays is defined as

$$P_{MGD}(\theta, \phi) = P_{MM}(\theta, \phi)\,\bigl\|-\nabla_{\theta,\phi}\arg\bigl(U(\theta,\phi)\bigr)\bigr\| \tag{6}$$

where $\nabla_{\theta,\phi}\arg(U(\theta,\phi))$ indicates the gradient of the unwrapped phase spectrum of the noise-subspace projection $U(\theta,\phi) = \mathbf{a}^H(\theta,\phi)\,\mathbf{V}_N$. The gradient is taken with respect to the spatial variables $\theta$ and $\phi$. Phase spectra of MUSIC for the UCA and the ULA are shown in Fig. 2. It can be noted from the figure that in the neighborhood of the DOA there is a sharp change in the unwrapped phase spectrum for both the UCA and the ULA. Differentiating this unwrapped phase spectrum results in very sharp peaks at the locations of the DOAs. In practice, abrupt changes in phase can also occur due to microphone calibration errors; hence, the differential phase can exhibit a sharp peak at an angle even if it is not a DOA. This differential phase (group delay) spectrum is illustrated in Fig. 3(a) for the UCA (top) and the ULA (bottom). The MUSIC-Group delay spectrum, being the product of the MUSIC-Magnitude and group delay spectra, removes the spurious peaks and retains only the peaks corresponding to the DOAs, as illustrated in Fig. 3(b).
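A minimal sketch of (6) on a sampled (theta, phi) grid follows. It assumes the MUSIC phase spectrum has already been computed on that grid; the unwrapping order and the use of the gradient magnitude as the 2-D group delay measure are implementation assumptions.

```python
import numpy as np

def mgd_spectrum(P_mm, phase, d_theta, d_phi):
    """Sketch of (6): product of the MM spectrum and the magnitude of the
    negative gradient of the unwrapped MUSIC phase. `P_mm` and `phase` are
    2-D arrays over a (theta, phi) grid with spacings d_theta, d_phi."""
    ph = np.unwrap(np.unwrap(phase, axis=0), axis=1)   # unwrap along both axes
    g_theta, g_phi = np.gradient(-ph, d_theta, d_phi)  # negative phase gradient
    tau = np.hypot(g_theta, g_phi)                     # 2-D group delay magnitude
    return P_mm * tau                                  # product spectrum of (6)
```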
B. Spectral Analysis of the MUSIC-Group Delay Function Under Reverberant Conditions

In this section, the performance of MUSIC and MUSIC-Group delay is presented in reverberant environments. The performance of subspace-based methods degrades due to multipath effects. The data model presented in (1) includes only the direct path and is no longer valid under reverberation. The received data at the $m$th microphone is given by

$$x_m(k) = \sum_{i=1}^{D} h_{im} * s_i(k) + n_m(k) \tag{7}$$

where $h_{im}$ is the room impulse response (RIR) between the $m$th microphone and the $i$th source, and the symbol $*$ denotes convolution. $s_i(k)$ is the $k$th snapshot of the $i$th source signal and $n_m(k)$ is the additive noise. In subspace-based methods like MUSIC, the signal eigenvalues of the received-signal correlation matrix are significant compared to the noise eigenvalues. However, because of multipath effects under reverberation, extraneous eigenvalues become significant. This affects the performance of subspace-based methods, especially MUSIC [19]. Reverberation is quantified using the direct-to-reverberant energy ratio (DRR) or the reverberation time $T_{60}$ [20]. With an increase in DRR or a decrease in reverberation time, the room impulse response approaches a delta function, improving the accuracy of DOA estimation. The MUSIC-Magnitude and MUSIC-Group delay spectra are shown in Fig. 4 for two sources at (100°, 15°) and (105°, 17°) at a reverberation time $T_{60}$ of 400 ms. The RIR is simulated by the image method [21], as implemented in [22]. It can be seen that the MUSIC-Group delay spectrum is able to resolve the sources, whereas the MUSIC-Magnitude spectrum gives a single peak. In the following section, the resolving power of the MUSIC-Group delay spectrum for azimuth and elevation estimation is justified by proving the 2-D additive property of the group delay spectrum.
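Before turning to that proof, the convolutive data model of (7) can be made concrete with the sketch below, assuming RIRs produced by any image-method simulator such as [21], [22]. The noise level and the helper name `reverberant_snapshots` are illustrative.

```python
import numpy as np

def reverberant_snapshots(rirs, sources, noise_var=1e-3,
                          rng=np.random.default_rng(1)):
    """Sketch of (7): x_m(k) = sum_i (h_im * s_i)(k) + n_m(k).
    rirs[i][m] is the RIR between source i and microphone m (e.g., from an
    image-method simulator); sources[i] is a 1-D time-domain signal."""
    n_src, n_mic = len(rirs), len(rirs[0])
    T = len(sources[0])
    x = np.zeros((n_mic, T))
    for m in range(n_mic):
        for i in range(n_src):
            # Direct path plus multipath, via convolution with the RIR
            x[m] += np.convolve(sources[i], rirs[i][m])[:T]
        x[m] += np.sqrt(noise_var) * rng.standard_normal(T)
    return x
```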
C. Two-Dimensional Additive Property of the MUSIC-Group Delay Spectrum

The high resolution of the proposed MUSIC-Group delay method is due to the additive property of the MUSIC-Group delay spectrum. For closely spaced sources under reverberation, the peaks corresponding to the DOAs merge, giving a single peak in the MUSIC spectrum. However, as described in Section II-A, closely spaced sources can be resolved by the MUSIC-Group delay spectrum using a limited number of sensors. This high-resolution property follows because a product in the MUSIC-Magnitude domain is equivalent to an addition in the MUSIC-Group delay domain [15].
Fig. 4. Plots illustrating the azimuth and elevation angles as estimated by (a) the MUSIC-Magnitude and (b) the MUSIC-Group delay spectrum for sources at (100°, 15°) and (105°, 17°), with a reverberation time of 400 ms. MM estimates a single peak at (105°, 18°). MGD estimates two peaks, at (100°, 19°) and (108°, 17°).
The mathematical proof of the additive property of the MUSIC-Group delay spectrum for a ULA has already been dealt with in [15]. In the case of a ULA, the steering vector exhibits Vandermonde structure, and hence the root-MUSIC polynomial approach is used for showing the additive property. This is not the case for a UCA, as is clear from (3) and (4). The UCA can, however, be divided into a number of cross sections, where each cross section represents a ULA. A single ULA is able to estimate only the azimuth angle of arrival. Therefore, in general, two ULAs are sufficient to obtain an estimate of both the azimuth and elevation angles; having more than two ULAs improves the robustness of the estimates. For multiple incident signals, pairing of the corresponding estimates from the various ULAs can be carried out as in [23]. Other pairing methods for eigenvalue association can be found in [24]. Generalizing the pairing methods of eigenvalue association for a UCA [23], [24], the steering vector can be expressed as a vector of exponentials

$$\mathbf{a}(\theta,\phi) = \left[e^{-j\omega\tau_{11}}, \ldots, e^{-j\omega\tau_{M_1 1}}, \ldots, e^{-j\omega\tau_{1P}}, \ldots, e^{-j\omega\tau_{M_P P}}\right]^T \tag{8}$$

where $M_p$ is the number of sensors in the $p$th cross section of the UCA and $p = 1, \ldots, P$. Note that $\tau_{mp}$ is the delay at the $m$th microphone in the $p$th cross section of the UCA. The steering vector of the $p$th cross section can now be expressed to have Vandermonde structure as follows:

$$\mathbf{a}_p(\theta,\phi) = \left[1,\ z_p,\ z_p^2,\ \ldots,\ z_p^{M_p - 1}\right]^T \tag{9}$$

where

$$z_p = e^{-j\omega\,\delta_p(\theta,\phi)} \tag{10}$$

and $\delta_p(\theta,\phi)$ is the inter-element delay of the $p$th cross-sectional ULA. From (5), constructing the root-MUSIC polynomial for the UCA, we have

$$J(z) = \mathbf{a}^T(z^{-1})\,\mathbf{V}_N \mathbf{V}_N^H\,\mathbf{a}(z). \tag{11}$$

Utilizing (9) and re-writing the root-MUSIC polynomial as a sum of polynomials in $z_1$ and $z_2$, denoted by $J_1(z_1)$ and $J_2(z_2)$ respectively, we have

$$J(z) = J_1(z_1) + J_2(z_2). \tag{12}$$

For an actual DOA, the polynomial $J(z)$, and hence each polynomial corresponding to a cross section of the UCA (e.g., $J_1(z_1)$), becomes zero. It is to be noted that $J_1(z_1)$ is a polynomial in $z_1$ having $2(M_1 - 1)$ roots. Among the roots of this polynomial, there can be a maximum of $D$ roots corresponding to the $D$ sources. It is also possible for two or more different incident signals to lie on the cone of confusion of a particular ULA, in which case more of the roots will lie very close to the origin of the $z_1$-plane. In either case, roots with magnitude close to zero can be ignored. Constructing a polynomial from the roots corresponding to the sources, we have

$$B(z_1) = \prod_{d=1}^{D} (z_1 - z_{1d}) \tag{13}$$

where $z_{1d}$ is the $d$th root of $J_1(z_1)$. We have assumed, for mathematical simplicity, that all sources fall in the field of view of the first cross section. Without loss of generality, and to maintain consistency with the definition of the MUSIC method, one can invert $B(z_1)$ and express it as a combined resonator

$$H(z_1) = \frac{1}{B(z_1)} = \prod_{d=1}^{D} \frac{1}{z_1 - z_{1d}}. \tag{14}$$

This complies with the approach wherein a DOA is looked at as a pole rather than a zero.
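The rooting step implied by (11)-(13) for a single cross-sectional ULA can be sketched as follows. Building the polynomial coefficients from the summed diagonals of $\mathbf{V}_N\mathbf{V}_N^H$ is the standard root-MUSIC device; the root-selection rule shown is one plausible reading of the pruning described above.

```python
import numpy as np

def cross_section_roots(Vn, D):
    """Root the polynomial J_p(z) = a^T(1/z) Vn Vn^H a(z) for one M_p-sensor
    ULA cross section, where a(z) = [1, z, ..., z^{M_p-1}]^T is Vandermonde.
    Vn has shape (M_p, M_p - D). Returns up to D roots nearest the unit circle."""
    C = Vn @ Vn.conj().T
    Mp = C.shape[0]
    # Coefficient of z^k in J_p is the sum of the k-th diagonal of C;
    # np.roots expects coefficients from the highest degree downward.
    coeffs = np.array([np.trace(C, offset=k) for k in range(Mp - 1, -Mp, -1)])
    roots = np.roots(coeffs)
    # Keep roots inside or on the unit circle; at most D of them correspond
    # to sources, and near-zero roots are ignored as discussed above.
    roots = roots[np.abs(roots) <= 1.0]
    return roots[np.argsort(1.0 - np.abs(roots))][:D]
```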
Fig. 5. Two-dimensional spectral plots for the cascade of two individual DOAs (resonators): (a) source with DOA (60°, 15°); (b) source with DOA (55°, 18°); (c) MUSIC-Magnitude spectrum; (d) MUSIC-Group delay spectrum.
As we are interested in the group delay spectrum of the combined resonator, $H(z_1)$ can also be re-written as a product of poles as shown below:

$$H(z_1) = \left(\prod_{d=1}^{D} \frac{1}{|z_1 - z_{1d}|}\right) e^{-j\sum_{d=1}^{D}\beta_d(\theta)} \tag{15}$$

where $|z_1 - z_{1d}|$ is the magnitude and $\beta_d(\theta)$ is the phase contributed by the resonator pole $z_{1d}$; each is a function of $\theta$, the spatial variable. It may be noted from (15) that the combined resonator exhibits a product of the magnitude spectra of the individual resonators. On the other hand, it exhibits a sum of the phase spectra of the individual resonators. Taking the negative derivative of the unwrapped phase spectrum of the combined resonator, we finally have

$$\tau_{gd}(\theta) = -\frac{d}{d\theta}\sum_{d=1}^{D}\beta_d(\theta) = \sum_{d=1}^{D}\tau_d(\theta). \tag{16}$$

It is clear from (15) and (16) that the MUSIC-Magnitude is a product spectrum, while the MUSIC-Group delay spectrum exhibits the additive property. Due to this additive property, the peaks are preserved in the MUSIC-Group delay spectrum even for closely spaced sources, whereas the MUSIC-Magnitude spectrum fails to preserve them. This is illustrated in Fig. 5. Two individual resonators at DOAs (60°, 15°) and (55°, 18°) are considered, and the MUSIC-Magnitude and MUSIC-Group delay spectra for the cascade of these two resonators are plotted. It can be noted from Fig. 5(c) that the magnitude spectrum is unable to resolve the two sources, as the two peaks merge due to the multiplicative property of the magnitude spectrum. On the contrary, the MUSIC-Group delay spectrum is able to resolve the two sources owing to its 2-D additive property, as can be seen in Fig. 5(d).
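The additive property of (15)-(16) is easy to verify numerically for a one-dimensional spatial variable, as in the sketch below; the pole radii and angles are illustrative and not taken from Fig. 5.

```python
import numpy as np

# For a cascade H(z) = prod_d 1/(z - z_d), magnitudes multiply while the
# unwrapped phases (and hence group delays) add. Poles are illustrative.
theta = np.linspace(0, np.pi, 2048)
z = np.exp(1j * theta)
poles = [0.98 * np.exp(1j * np.deg2rad(55)),
         0.98 * np.exp(1j * np.deg2rad(60))]

H_each = [1.0 / (z - p) for p in poles]
H_casc = np.prod(H_each, axis=0)

gd = lambda H: -np.gradient(np.unwrap(np.angle(H)), theta)  # group delay
gd_sum = sum(gd(H) for H in H_each)

# (15): magnitude of the cascade is the product of individual magnitudes.
assert np.allclose(np.abs(H_casc),
                   np.prod([np.abs(H) for H in H_each], axis=0))
# (16): group delay of the cascade is the sum of individual group delays.
assert np.allclose(gd(H_casc), gd_sum, atol=1e-6)
```

The summed group delay keeps a distinct peak near each pole angle, while the multiplied magnitudes smear into a single broad peak when the poles are close, mirroring Figs. 5(c) and 5(d).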
III. LOCALIZATION ERROR ANALYSIS

Subspace-based methods like MUSIC and MUSIC-Group delay are sensitive to finite sample effects, an imprecisely known noise covariance, a perturbed array manifold, and reverberation. Finite sample effects occur since it is not possible to obtain a perfect covariance matrix of the received data over an array; in practice, estimation of the sample covariance requires averaging over several snapshots of the received data. The finite sample effects can be neglected by assuming a high SNR or a large number of snapshots. The error due to an imprecisely known noise covariance is also neglected, in order to analyze the effect of sensor position error and reverberation on the proposed method. In the ensuing section, the performance of MUSIC and MUSIC-Group delay is presented under sensor perturbation errors. Performance evaluation is also conducted in a reverberant environment, and a numerical analysis is presented comparing the root mean square error (RMSE) of various methods under reverberation with the Cramér-Rao bound (CRB).

A. Performance Under Sensor Perturbation Error

Let $\mathbf{r}_m = [x_m, y_m, z_m]^T$ be the nominal sensor position for the $m$th sensor. The position matrix is formed from the nominal sensor positions as

$$\mathbf{R} = [\mathbf{r}_1, \mathbf{r}_2, \ldots, \mathbf{r}_M].$$

The steering vector defined in (3) can then be re-written as [25]

$$\mathbf{a}(\theta,\phi) = \left[e^{-j\mathbf{k}^T(\theta,\phi)\mathbf{r}_1}, \ldots, e^{-j\mathbf{k}^T(\theta,\phi)\mathbf{r}_M}\right]^T \tag{17}$$
TABLE I: COMPARISON OF THE AVERAGE RMSE OF VARIOUS METHODS WITH THE CRB (ILLUSTRATED IN THE FIRST ROW) OVER A RANGE OF AZIMUTH AND ELEVATION ANGLES, AT A T60 OF 200 ms AND AN SNR OF 10 dB
where $\mathbf{k}(\theta,\phi)$ is the wave vector of the signal. The displacement of the $m$th sensor from its nominal position is denoted by $\delta\mathbf{r}_m$. These position perturbations are assumed to be i.i.d. Gaussian random variables, independent of the signals and of any additive noise at the sensor outputs. In any DOA estimation process, the sensor perturbations are assumed to be time-invariant, i.e., the same perturbation applies to every snapshot. The position error matrix is formed similarly to the position matrix as

$$\Delta\mathbf{R} = [\delta\mathbf{r}_1, \delta\mathbf{r}_2, \ldots, \delta\mathbf{r}_M].$$

Hence, the perturbed sensor positions are given by $\tilde{\mathbf{r}}_m = \mathbf{r}_m + \delta\mathbf{r}_m$. The $i$th steering vector associated with the perturbed sensor positions, $\tilde{\mathbf{a}}(\theta_i,\phi_i)$, can be written as a perturbation of the nominal steering vector [26]. Under sensor perturbation error, the signal model in (1) becomes

$$\mathbf{x}(t) = \tilde{\mathbf{A}}(\Theta)\,\mathbf{s}(t) + \mathbf{n}(t). \tag{18}$$

The perturbed array manifold $\tilde{\mathbf{A}}(\Theta)$ is given by

$$\tilde{\mathbf{A}}(\Theta) = [\tilde{\mathbf{a}}(\theta_1,\phi_1), \ldots, \tilde{\mathbf{a}}(\theta_D,\phi_D)]. \tag{19}$$

The effect of the sensor perturbation on the array autocorrelation matrix is simulated as described in [26], and the analysis is carried out. The resolution of the MUSIC-Magnitude and MUSIC-Group delay methods under perturbation errors is illustrated in Figs. 6(a) and 6(b), respectively, which show contour plots of the respective spectra. Note that the MUSIC-Magnitude spectrum shows a single peak with contours around it, while the MUSIC-Group delay spectrum shows two distinct peaks with separate contours.

Fig. 6. Contour plots of (a) the MUSIC-Magnitude spectrum and (b) the MUSIC-Group delay spectrum, under sensor perturbation errors.

B. Cramér-Rao Bound Analysis

The Cramér-Rao bound provides a lower bound on the mean square error (MSE) of an estimate of an unknown parameter. We compare the average root mean square error of the DOA estimates of various methods with the CRB. The circular array geometry being uncoupled, the statistical coupling between the azimuth and elevation estimates is ignored. The Cramér-Rao inequality for estimating a parameter vector $\boldsymbol{\eta}$ is given as

$$E\{(\hat{\boldsymbol{\eta}} - \boldsymbol{\eta})(\hat{\boldsymbol{\eta}} - \boldsymbol{\eta})^T\} \geq \mathbf{J}^{-1}$$

where the $(i,j)$th element of the Fisher information matrix $\mathbf{J}$ is given by [27], [28]

$$J_{ij} = N\,\mathrm{tr}\!\left(\mathbf{R}^{-1}\frac{\partial\mathbf{R}}{\partial\eta_i}\,\mathbf{R}^{-1}\frac{\partial\mathbf{R}}{\partial\eta_j}\right)$$

where $N$ is the number of snapshots and $\mathbf{R}$ is the array correlation matrix. For 2-D DOA estimation, the unknown parameter vector is $\boldsymbol{\eta} = [\theta_1, \phi_1, \ldots, \theta_D, \phi_D]^T$, and the elements of the Fisher information matrix follow by substituting the corresponding derivatives of $\mathbf{R}$ into the expression above.
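A numerical sketch of the Fisher information computation above follows. It forms the derivatives of the array correlation matrix by central differences; the callable `R_of`, the step size, and the parameter ordering are assumptions for illustration.

```python
import numpy as np

def fisher_information(R_of, eta, N, eps=1e-6):
    """Sketch of J_ij = N tr(R^{-1} dR/deta_i R^{-1} dR/deta_j).
    `R_of` maps a parameter vector eta (e.g., [theta_1, phi_1, ...])
    to the M x M array correlation matrix; derivatives are numerical."""
    P = len(eta)
    Rinv = np.linalg.inv(R_of(eta))
    dR = []
    for i in range(P):
        e = np.zeros(P); e[i] = eps
        dR.append((R_of(eta + e) - R_of(eta - e)) / (2 * eps))
    J = np.empty((P, P))
    for i in range(P):
        for j in range(P):
            J[i, j] = N * np.real(np.trace(Rinv @ dR[i] @ Rinv @ dR[j]))
    return J

# The CRB on each parameter is the corresponding diagonal entry of inv(J);
# the average RMSE bound over a set of angles follows from its square root.
```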
The azimuth and elevation angles of the sources are varied over a range of values (see Table I) at a reverberation time $T_{60}$ of 200 ms and an SNR of 10 dB. The DOA estimation is performed using MVDR and beamspace MUSIC (BSM) [29], [30], apart from MUSIC-Magnitude and MUSIC-Group delay. For this simulation, a 15-channel UCA with a radius of the order of the wavelength $\lambda$ is considered. The maximum phase mode excited for BSM is taken to be 7. Two closely spaced, uncorrelated sources, with a small separation in azimuth and elevation, are taken in this analysis. The average RMSE of the azimuth and elevation estimates obtained by the four methods is compared with the average Cramér-Rao bound in Table I. It can be seen that the average RMSE for MUSIC-Group delay is the lowest.

C. Source Localization Error Analysis Under Reverberant Environments

Source localization under reverberant environments is challenging, especially for subspace-based methods. In the following section, performance evaluation of the proposed method is conducted in an indoor environment, where the effect of reverberation is known to be prominent. Hence, we consider the small meeting room setup shown in Fig. 8, which has four participants around the table.
Fig. 7. Two-dimensional scatter plots of the localization of sources at (20°, 10°) and (10°, 5°) using (a) the MUSIC-Magnitude method and (b) the MUSIC-Group delay method. The reverberation time is 150 ms, the SNR is 40 dB, and the number of independent trials is 500. The red dot indicates the actual DOA.
The error analysis of the DOA estimates is presented herein using scatter plots. The reverberation is simulated as discussed in Section IV-A, and the noise is generated using a zero-mean, unit-variance Gaussian distribution. The experiment was conducted under reverberation with a $T_{60}$ of 150 ms, which typically corresponds to a small meeting room. DOA estimation trials are conducted for two closely spaced sources at (20°, 10°) and (10°, 5°). The SNR considered is 40 dB, so as to isolate the effect of reverberation. The azimuth and elevation estimates from 500 independent trials are plotted in Fig. 7. In the case of MUSIC-Magnitude, there were several trials in which the estimates overlapped each other, leading to poor localization of the sources; the estimates are also unevenly distributed around the actual DOA, as illustrated in Fig. 7(a). Fig. 7(b) shows the distribution of the estimates of the proposed method. It can be seen that the average estimate will be closer to the actual DOA in the case of the proposed method.

IV. EXPERIMENTAL EVALUATION

The performance of the proposed method is evaluated by conducting experiments on speech enhancement, perceptual evaluation, and distant speech recognition. In the following section, the experiment on speech enhancement is presented as an improvement in the signal-to-interference ratio (SIR) [31]. Experiments on perceptual evaluation are also conducted for various methods and quantified using objective measures. Distant speech recognition results are presented as word error rates (WER). The proposed MUSIC-Group delay method is compared with MUSIC-Magnitude (MM), beamspace MUSIC (BSM) [29], [30], the linearly constrained minimum variance (LCMV) beamformer, and the minimum variance distortionless response (MVDR) beamformer.

A. Experimental Conditions

The proposed algorithm was tested in a typical meeting room environment. A room with dimensions 730 cm × 620 cm × 340 cm was used in the experiments. The experimental setup consists of a uniform circular, 15-channel microphone array with a radius of 10 cm. It has one desired speaker, one competing speaker, and two interfering sources, as shown in Fig. 8. White noise and babble noise from the NOISEX-92 database [32] were used as the stationary and nonstationary interfering sources, respectively.
Fig. 8. Experimental setup in a meeting room with two speakers (S1 and S2) and two interfering sources (stationary noise source SN and nonstationary noise source NS). The sources are located at (35°, 17°), (40°, 19°), (30°, 15°), and (45°, 21°), respectively. The radius of the circular array is 10 cm.
The signals are acquired over the array of microphones; under reverberation, each signal is convolved with a room impulse response. In real-life experimental conditions, a room impulse response can be obtained in two ways: a microphone can be used to record a short sounding pulse, directly giving the room impulse response, or a maximum length sequence (MLS) can be used. In this work, the RIR is simulated using the image method [21] as implemented in [22]. DOAs are estimated over the acquired signals using the various algorithms. A filter sum beamformer (FSB) is trained using the DOA estimates obtained, and the signals are reconstructed using the beamformer. Distant speech recognition (DSR) and speech enhancement experiments are conducted on the reconstructed speech signal. The complete procedure is depicted in Fig. 9.

B. Experiments on Speech Enhancement in a Multisource Environment

The performance of the proposed method is presented herein as an improvement in SIR. The input SIR of the $i$th speaker relative to the stationary or nonstationary interfering source $v$ at microphone $m$ is defined as

$$\mathrm{SIR}_{in}(i,m) = \frac{\sum_{t}\sum_{k} |H_{im}(k)\,S_i(t,k)|^2}{\sum_{t}\sum_{k} |H_{vm}(k)\,V(t,k)|^2} \tag{20}$$

where $S_i(t,k)$ is the $i$th speech signal in the short-time Fourier transform (STFT) domain with a rectangular window of length $L$, $H_{im}(k)$ is the impulse response for the $i$th speaker and $m$th microphone pair, $V(t,k)$ is the interfering source in the STFT domain, $t$ is the frame number, and $k$ is the frequency index. The output SIR is defined in a similar fashion as

$$\mathrm{SIR}_{out}(i) = \frac{\sum_{t}\sum_{k} |Y_{s_i}(t,k)|^2}{\sum_{t}\sum_{k} |Y_{v}(t,k)|^2} \tag{21}$$

where $Y_{s_i}(t,k)$ and $Y_v(t,k)$ denote the components of the reconstructed or beamformed signal due to the $i$th speaker and the interferer, respectively.
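The energy ratios of (20)-(21) can be computed as in the following sketch, assuming access to the time-domain components attributed to the speaker and to the interferer; the window length and the helper names `stft_rect` and `sir_db` are illustrative.

```python
import numpy as np

def stft_rect(x, L):
    """STFT with a rectangular window of length L (no overlap), as assumed
    in (20); returns a frames-by-frequency array indexed by (t, k)."""
    T = (len(x) // L) * L
    return np.fft.rfft(x[:T].reshape(-1, L), axis=1)

def sir_db(target, interferer, L=512):
    """STFT-domain energy-ratio SIR, per the definitions of (20)-(21).
    `target` and `interferer` are the time-domain components attributed
    to the speaker and to the noise source, respectively."""
    S = stft_rect(target, L)
    V = stft_rect(interferer, L)
    return 10 * np.log10(np.sum(np.abs(S) ** 2) / np.sum(np.abs(V) ** 2))

# Improvement in SIR for speaker i: compare the beamformed components with
# those observed at a reference microphone, i.e.
#   sir_db(y_target, y_interf) - sir_db(x_target, x_interf)
```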
Fig. 9. Flow diagram illustrating the methodology followed in the performance evaluation for distant speech signals acquired over a circular array.
TABLE II: ENHANCEMENT IN SIR (dB) COMPARED FOR VARIOUS METHODS AT DIFFERENT REVERBERATION TIMES. S1 IS THE DESIRED SPEAKER, S2 IS THE COMPETING SPEAKER, NS IS THE NONSTATIONARY NOISE SOURCE, AND SN IS THE STATIONARY NOISE SOURCE

The beamformer used to reconstruct the signal herein was LCMV. The results on SIR improvement are presented in Table II. It can be seen that the proposed method performs better than all the conventional methods.

TABLE III: COMPARISON OF PERCEPTUAL EVALUATION RESULTS USING VARIOUS METHODS, BASED ON OBJECTIVE MEASURES
C. Experiments on Perceptual Evaluation of Enhanced Speech

In this section, we evaluate the proposed method by computing objective measures of perceptual evaluation on the enhanced speech. Here, the desired speaker and stationary noise source pair is considered for evaluation. Six hundred sentences from the TIMIT database [33] were selected and randomized to perform the experiments. The objective measures used herein for evaluating speech quality are the log-likelihood ratio (LLR) [34], segmental SNR (segSNR) [34], weighted-slope spectral (WSS) distance [35], and the perceptual evaluation of speech quality (PESQ) [36]. The results are presented in Table III at two reverberation levels, the longer $T_{60}$ being 250 ms. The PESQ and segSNR scores are high while the LLR and WSS scores are low for the proposed method, indicating better reconstruction of the signal.

D. Experiments on Distant Speech Recognition

Speaker-independent large-vocabulary speech recognition experiments are conducted for speech acquired over circular microphone arrays [37], [38] in a meeting room scenario. The experimental results are presented as word error rates (WER). The WER is calculated as

$$\mathrm{WER} = \frac{S + D + I}{N} \times 100\,\%$$

where $N$ is the total number of words, $S$ the total number of substitutions, $D$ the total number of deletions, and $I$ the total number of insertions. To ensure conformity with standard databases, sentences from the TIMIT database [33] were selected, and continuous digit recognition experiments were conducted on the MONC database [39].
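For completeness, the WER computation defined above reduces to a one-line function; the counts in the usage example are hypothetical.

```python
def wer(n_words, subs, dels, ins):
    """WER as defined above: (S + D + I) / N, reported in percent."""
    return 100.0 * (subs + dels + ins) / n_words

# Example (hypothetical counts): 1000 reference words with 50 substitutions,
# 20 deletions, and 10 insertions give a WER of 8.0%.
print(wer(1000, 50, 20, 10))
```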
TABLE IV: COMPARISON OF DISTANT SPEECH RECOGNITION PERFORMANCE IN TERMS OF WER (IN PERCENT) AT VARIOUS REVERBERATION TIMES
Separate sets of sentences were used for training and testing. For TIMIT, the complete test set of 1344 sentences from 112 male and 56 female speakers was used; the rest were used for training the speech models. For MONC, the speech models were trained with 8400 isolated and continuous digit sentences, and 650 continuous digit sentences were used for testing. Three-state, eight-mixture hidden Markov models (HMMs) were used in the experiments on the TIMIT database; for the experiments on the MONC database, three-state, sixteen-mixture HMMs were used. Table IV lists the WER for the various methods along with a close-talking microphone (CTM) as the benchmark. The MUSIC-Group delay method indicates a reasonable reduction in WER when compared to the other methods.

V. CONCLUSION

In this work, a novel high-resolution source localization method based on the MUSIC-Group delay spectrum is discussed. The method provides robust azimuth and elevation estimates of closely spaced sources, as indicated by source localization experiments, when compared to conventional source localization methods. The significance of the MUSIC-Group delay method in speech enhancement and distant speech recognition is also illustrated by improvements in signal-to-interference ratios and lower word error rates. Pole focusing methods that can utilize the minimum phase property of the MUSIC-Group delay spectrum are currently being studied in the context of source localization under high levels of convolutional distortion. The significance of MUSIC-Group delay based methods for source localization over a spherical array of microphones is non-trivial and is currently being explored. Additionally, a proof of the additive property of the MUSIC-Group delay spectrum using series expansion, in the context of spherical harmonics, is also being studied.

REFERENCES

[1] C. Knapp and G. Carter, "The generalized correlation method for estimation of time delay," IEEE Trans. Acoust., Speech, Signal Process., vol. 24, no. 4, pp. 320-327, 1976.
[2] J. Capon, "High-resolution frequency-wavenumber spectrum analysis," Proc. IEEE, vol. 57, no. 8, pp. 1408-1418, Aug. 1969.
[3] H. Teutsch, Modal Array Signal Processing: Principles and Applications of Acoustic Wavefield Decomposition. New York, NY, USA: Springer, 2007.
[4] P. Stoica and R. Moses, Introduction to Spectral Analysis. Upper Saddle River, NJ, USA: Prentice-Hall, 1997, vol. 89.
[5] R. O. Schmidt, "Multiple emitter location and signal parameter estimation," IEEE Trans. Antennas Propag., vol. AP-34, no. 3, pp. 276-280, 1986.
[6] A. Barabell, "Improving the resolution performance of eigenstructure-based direction-finding algorithms," in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process. (ICASSP), 1983, vol. 8, pp. 336-339.
[7] R. Roy, A. Paulraj, and T. Kailath, "ESPRIT: A subspace rotation approach to estimation of parameters of cisoids in noise," IEEE Trans. Acoust., Speech, Signal Process., vol. 34, no. 5, pp. 1340-1342, Oct. 1986.
[8] J. Foutz, A. Spanias, and M. K. Banavar, "Narrowband direction of arrival estimation for antenna arrays," Synthesis Lectures on Antennas, vol. 3, no. 1, pp. 1-76, 2008.
[9] P. Stoica and A. Nehorai, "MUSIC, maximum likelihood, and Cramér-Rao bound: Further results and comparisons," IEEE Trans. Acoust., Speech, Signal Process., vol. 38, no. 12, pp. 2140-2150, 1990.
[10] R. Scholes, D. Braunreiter, and P. Delaney, "Derivative MUSIC angle estimation," in Proc. IEEE Nat. Aerosp. Electron. Conf. (NAECON), 1997, vol. 1, pp. 396-402.
[11] K. Ichige, Y. Ishikawa, and H. Arai, "Accurate direction-of-arrival estimation using second-order differential of MUSIC spectrum," in Proc. IEEE Int. Symp. Intell. Signal Process. Commun. (ISPACS), 2006, pp. 995-998.
[12] J. Chen, K. Yao, and R. Hudson, "Source localization and beamforming," IEEE Signal Process. Mag., vol. 19, no. 2, pp. 30-39, 2002.
[13] K. Ichige, K. Saito, and H. Arai, "High resolution DOA estimation using unwrapped phase information of MUSIC-based noise subspace," IEICE Trans. Fundam. Electron. Commun. Comput. Sci., vol. E91-A, pp. 1990-1999, Aug. 2008.
[14] B. Yegnanarayana and H. A. Murthy, "Significance of group delay functions in spectrum estimation," IEEE Trans. Signal Process., vol. 40, pp. 2281-2289, Sep. 1992.
[15] M. Shukla and R. M. Hegde, "Significance of the MUSIC-group delay spectrum in speech acquisition from distant microphones," in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process. (ICASSP), 2010, pp. 2738-2741.
[16] M. Brandstein and D. Ward, Eds., Microphone Arrays. Berlin, Germany: Springer-Verlag, 2001.
[17] A. Tripathy, L. Kumar, and R. M. Hegde, "Group delay based methods for speech source localization over circular arrays," in Proc. IEEE Joint Workshop Hands-Free Speech Commun. Microphone Arrays (HSCMA), 2011, pp. 64-69.
[18] H. L. Van Trees, Optimum Array Processing. New York, NY, USA: Wiley-Interscience, 2002.
[19] R. Mandala, M. Shukla, and R. Hegde, "Group delay based methods for recognition of distant talking speech," in Conf. Rec. 44th Asilomar Conf. Signals, Syst., Comput. (ASILOMAR), Nov. 2010, pp. 1702-1706.
[20] P. Zahorik, "Direct-to-reverberant energy ratio sensitivity," J. Acoust. Soc. Amer., vol. 112, pp. 2110-2117, 2002.
[21] J. B. Allen and D. A. Berkley, "Image method for efficiently simulating small-room acoustics," J. Acoust. Soc. Amer., vol. 65, pp. 943-950, 1979.
[22] E. A. Habets, Room Impulse Response Generator, 2003-2010 [Online]. Available: http://home.tiscali.nl/ehabets/rir_generator.html
[23] K. Wong and M. Zoltowski, "Root-MUSIC-based azimuth-elevation angle-of-arrival estimation with uniformly spaced but arbitrarily oriented velocity hydrophones," IEEE Trans. Signal Process., vol. 47, no. 12, pp. 3250-3260, Dec. 1999.
[24] H. Y., "Techniques of eigenvalues estimation and association," Digit. Signal Process., vol. 7, no. 7, pp. 253-259, Oct. 1997.
[25] A. Manikas, Differential Geometry in Array Processing. Singapore: World Scientific, 2004, vol. 57.
[26] V. Cevher and J. H. McClellan, "2-D sensor perturbation analysis: Equivalence to AWGN on array outputs," presented at the IEEE Sensor Array Multichannel Signal Process. Workshop (SAM), Washington, DC, USA, Aug. 4-6, 2002.
[27] P. Stoica and A. Nehorai, "MUSIC, maximum likelihood, and Cramér-Rao bound," IEEE Trans. Acoust., Speech, Signal Process., vol. 37, no. 5, pp. 720-741, May 1989.
[28] T. Filik and T. Tuncer, "Design and evaluation of V-shaped arrays for 2-D DOA estimation," in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process. (ICASSP), Mar.-Apr. 2008, pp. 2477-2480.
[29] P. Stoica and A. Nehorai, "Comparative performance study of element-space and beam-space MUSIC estimators," Circuits, Syst., Signal Process., vol. 10, no. 3, pp. 285-292, 1991.
[30] C. P. Mathews and M. D. Zoltowski, "Signal subspace techniques for source localization with circular sensor arrays," Purdue Univ., ECE Tech. Rep., 1994 [Online]. Available: http://docs.lib.purdue.edu/ecetr/
[31] S. Markovich, S. Gannot, and I. Cohen, "Multichannel eigenspace beamforming in a reverberant noisy environment with multiple interfering speech signals," IEEE Trans. Audio, Speech, Lang. Process., vol. 17, no. 6, pp. 1071-1086, 2009.
[32] A. Varga and H. J. Steeneken, "Assessment for automatic speech recognition: II. NOISEX-92: A database and an experiment to study the effect of additive noise on speech recognition systems," Speech Commun., vol. 12, no. 3, pp. 247-251, 1993.
[33] J. S. Garofolo, TIMIT Acoustic-Phonetic Continuous Speech Corpus. Philadelphia, PA, USA: Linguistic Data Consortium, 1993.
[34] J. Hansen and B. Pellom, "An effective quality evaluation protocol for speech enhancement algorithms," in Proc. ICSLP, 1998, vol. 7, pp. 2819-2822.
[35] D. Klatt, "Prediction of perceived phonetic distance from critical-band spectra: A first step," in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process. (ICASSP), 1982, vol. 7, pp. 1278-1281.
[36] Perceptual Evaluation of Speech Quality (PESQ): An Objective Method for End-to-End Speech Quality Assessment of Narrow-Band Telephone Networks and Speech Codecs, ITU-T Draft Recommendation P.862, 2001.
[37] M. Seltzer, "Bridging the gap: Towards a unified framework for hands-free speech recognition using microphone arrays," in Proc. Hands-Free Speech Commun. Microphone Arrays (HSCMA), May 2008, pp. 104-107.
[38] W. Zhang and B. Rao, "Robust broadband beamformer with diagonally loaded constraint matrix and its application to speech recognition," in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process. (ICASSP), 2006, pp. 785-788.
[39] CSLU, "Multi channel overlapping numbers corpus distribution," Linguistic Data Consortium [Online]. Available: http://www.cslu.ogi.edu/corpora/corpCurrent.html
Lalan Kumar (S'14) received the B.Tech. degree in electronics engineering from IIT (BHU), Varanasi, India, in 2008. He worked at Motorola, Bangalore, as a software engineer on the multimedia team from 2008 to 2009. Since 2010, he has been working toward the Ph.D. degree in signal processing at the MiPS Lab, IIT Kanpur, India, where he is working on array signal processing in relation to source localization. His areas of interest include acoustics, speech, and music signal processing.
Ardhendu Tripathy received the B.Tech. degree in electrical engineering from IIT Kanpur, India, in 2012. He worked as an interim engineering intern at Qualcomm. He is currently pursuing the Ph.D. degree in electrical and computer engineering at Iowa State University, Ames, IA.
Rajesh M. Hegde (M'09) is an Associate Professor and P. K. Kelkar Research Fellow with the Department of Electrical Engineering at IIT Kanpur. His current areas of research interest include multimedia signal processing, multi-microphone speech processing, pervasive multimedia computing, ICT for socially relevant applications in the Indian context, and applications of signal processing in wireless networks, with a specific focus on emergency response and transportation applications. He has also worked on NSF-funded projects on ICT and mobile applications at the University of California San Diego, USA, where he was a researcher and lecturer in the Department of Electrical and Computer Engineering between 2005 and 2008. He is also a member of the national working group of ITU-T (NWG-16) on developing multimedia applications. Additional biographic information can be found at http://home.iitk.ac.in/rhegde.