
A methodology for sound source localization and tracking
Development of 3D microphone array for near-field and far-field applications

Muhammad Imran
Department of Architectural Engineering, Hanyang University
222-Wangsimni-ro, Seoul, Korea
[email protected]

Akhtar Hussain, Nasir M. Qazi, Muhammad Sadiq
Centers of Excellence in Science and Technology (CESAT)
Islamabad, Pakistan
[email protected], [email protected], [email protected]

Abstract— Acoustic source localization and tracking using microphone arrays has become a focus of interest in room acoustics, teleconference systems and the tracking of sound-producing objects. Current methods for source localization rely on conventional time-delay estimation between microphone pairs, while ignoring ambient noise, reflections from surrounding surfaces and reverberation in closed spaces. In this study, an acoustic source localizer and tracker (ASLT) based on a 3D microphone array is designed and developed for real-time source detection and localization. Two practical approaches were examined and evaluated, based on direction of arrival (DOA) estimation techniques and on the steered power response (SPR) of the array, in order to improve the accuracy of tracking multiple sources in full 3D coordinates. Among the time-delay estimation techniques, generalized cross correlation (GCC) is employed in the frequency domain using multiple combinations of microphone pairs of the array, with the phase transform (PHAT) weighting function for optimum detection of sources in reverberant environments. PHAT performs well in the presence of ambient noise, even when the signal-to-noise ratio (SNR) is low. For SPR, minimum variance distortionless response (MVDR) beamformer weights are evaluated and proposed for accurate source tracking applications. A microphone array is designed using six transducers in a spherical configuration and is used to evaluate the proposed methodology. Measurements are carried out in a reverberant chamber under different noise conditions to validate the practicality of the algorithmic chain, and the results are presented to demonstrate the efficiency of the proposed microphone array design and localization technique.

Keywords— Minimum variance distortionless response (MVDR); Generalized cross correlation (GCC); Direction of arrival (DOA); Phase transform (PHAT); Spherical microphone array

I. INTRODUCTION

Recent developments in acoustic source localization and tracking using microphone arrays have numerous applications in room acoustics measurements, teleconference systems, the automotive industry and the tracking of sound-producing objects for surveillance systems. The main aim of any localizer is to accurately estimate the direction of arrival (DOA) of single or

multiple active sound sources simultaneously, in order to point out the listening stream of the sound field. A sound localizer is a core part of any capture system that uses microphone arrays of different configurations and beamsteering. The accuracy of sound source localization depends critically on the ambience and reverberation of indoor environments, such as small rooms, offices and auditoria. Traditionally, localization methods are based on two approaches to estimate the direction of arrival. The first approach is based on time difference of arrival (TDOA) estimation using combinations of microphone pairs, whereas the second approach maximizes the steered power response (SPR) at the output of a delay-and-sum beamformer. Both approaches use a single frame of the captured sound for estimation, which limits precision and requires additional post-processing algorithms to track multiple sources in real-time applications. Recent developments in simultaneous multiple source localization, such as MUSIC and ESPRIT, resolve this problem to a satisfactory extent; however, these techniques are only applicable to narrowband sources.

Over the past few decades, microphone-array-based processing has been investigated for sound localization and tracking in order to extend the existing techniques to multi-channel processing. Different microphone array configurations have been studied and proposed for the estimation process, such as direction of arrival (DOA) estimation and source localization using correlation between pairs of microphones.

In our study, we investigated both methods, i.e. TDOA and SPR. Among the time-delay estimation techniques, we employed the generalized cross correlation (GCC) method in the frequency domain using several combinations of microphone pairs of the array with different weighting functions. The array used for this purpose consists of six microphones evenly distributed on an imaginary spherical surface. Different weighting functions, such as the phase transform (PHAT), the smoothed coherence transform (SCOT) and the maximum likelihood (ML) weighting for the GCC algorithm, are evaluated in a reverberant and noisy environment. A comparative study and evaluation of these weighting functions is performed and presented, and the PHAT weighting function is proposed for optimum detection of sources in reverberant environments, especially for multiple sources. For SPR, the minimum variance distortionless response weighting is evaluated and proposed for accurate





source tracking applications, achieving a high SNR of the steered signal under reverberant conditions. In addition, a practical ASLT based on a six-channel open-sphere microphone array is proposed, together with a voice activity detector. The voice activity detector is used to ignore ambient noise and to capture only the active speaker or sound source. Several algorithms essential for real-time implementation are developed in order to derive and evaluate an appropriate localization estimator. An experimental evaluation was performed to verify the algorithmic chain and to validate the results under different environmental conditions; for this purpose, a measurement setup was installed in a reverberant chamber. The results obtained from the measurements agreed well with the known source positions, achieving higher accuracy than the traditional methods.

II. FORMULATION AND METHODOLOGY

High accuracy and precision in estimating acoustic sources are major challenges, particularly in noisy and reverberant environments, where the contributions of correlated and uncorrelated noise play an important role. Audible sound covers a frequency range spanning three orders of magnitude, from 20 Hz to 20 kHz. Different sound properties, such as diffraction at low frequencies, scattering at high frequencies, and specular and diffuse reflections in closed spaces, need to be considered in order to accurately model the sound field [1]. Accurate detection of a sound source is essential for estimating its directional information. Our approach is supported by multi-channel microphone array capture, the weighted GCC method and optimal adaptive MVDR beamformer weights. Fig. 1 shows the block diagram of the proposed sound source localizer and tracker.

Fig. 1. Block diagram of an acoustic source localizer and tracker (ASLT)

The first step is the design of an optimal microphone array for sound capture and processing. The captured sound from each microphone element is then converted to the frequency domain. A voice activity detector (VAD) is introduced to distinguish between active sound and noise in order to assist the localization algorithm. The VAD computes the likelihood of an active source being present in every frequency bin, which is used as a weighting filter [2]. The source localization core performs the localization for each frame of the input signal for one or multiple sound sources, depending on the microphone array configuration and the

confidence level of the measured accuracy from the previous frame. In our approach, the computation of the localization is terminated at this stage if the VAD detects no active source. Finally, the beamsteering response is activated to further track the source in a particular look direction. An optimization process is initiated, accompanied by a confidence level, to reject or accept the localization results. The location estimates are obtained in the form of azimuth and elevation angles referenced to the microphone array coordinate system.

A. Microphone array and capturing system

A microphone array is a set of spatially distributed transducers aligned in a specific configuration on a real or imaginary surface. Different microphone array configurations have been studied in recent years, such as linear arrays, planar arrays, circular arrays and spherical arrays, each with its own pros and cons. The choice of configuration depends on the intended use of the microphone array and the accuracy requirements. Among the advantages of spherical geometries are easy mathematical handling and physical design [1]. Spherical array configurations are further categorized into many types depending on the measurement radius, the surface of the sphere, and the number of microphones and their arrangement [3]. The design of an array involves different factors: at low frequencies a large sphere radius is required, whereas at high frequencies more microphones are needed for wideband analysis. We therefore developed an array in which six omnidirectional transducer elements are spatially distributed on an imaginary spherical surface of radius 7.5 cm. This array is designed to resolve frequencies of 8 kHz to 10 kHz with a low contribution of spatial aliasing. The conceptual diagram of the array is shown in Fig. 2.

Fig. 2. Conceptual diagram of the array
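As a concrete illustration of the array geometry described above, the short sketch below generates candidate positions for six omnidirectional elements evenly spread on an imaginary sphere of radius 7.5 cm and lists the M(M−1)/2 = 15 unique microphone pairs used later for the pairwise GCC computations. The octahedral (axis-aligned) layout is an assumption made here for illustration; the exact element placement is not specified in the text.

```python
import itertools
import numpy as np

RADIUS = 0.075  # sphere radius in metres (7.5 cm, as stated above)

def octahedral_positions(radius):
    """Six points evenly spread on a sphere (octahedron vertices) -- assumed layout."""
    directions = np.array([
        [ 1, 0, 0], [-1, 0, 0],
        [ 0, 1, 0], [ 0, -1, 0],
        [ 0, 0, 1], [ 0, 0, -1],
    ], dtype=float)
    return radius * directions

mics = octahedral_positions(RADIUS)                     # shape (6, 3), metres
pairs = list(itertools.combinations(range(len(mics)), 2))

print(f"number of microphones: {len(mics)}")
print(f"unique pairs M(M-1)/2: {len(pairs)}")           # 15 for M = 6
for i, j in pairs:
    d = np.linalg.norm(mics[i] - mics[j])
    print(f"pair ({i},{j}): spacing {d * 100:.1f} cm")
```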

B. Time Delay Estimation Technique

1) Time difference of arrival (TDOA) estimation: The direction of arrival of a sound source can be estimated with two microphones by the following equation:

$\theta = \sin^{-1}\!\left(\dfrac{\tau_D\, v}{d}\right)$    (1)

where v is the speed of sound, $\tau_D$ is the time difference between the sound arriving at the two microphones from the source, known as the time delay, and d is the separation of the microphone elements.
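A minimal sketch of (1), assuming a far-field source and a speed of sound of 343 m/s; the microphone spacing and the delay value used below are illustrative only.

```python
import numpy as np

def doa_from_delay(tau_d, d, v=343.0):
    """Direction of arrival from a pair delay, eq. (1): theta = asin(tau_d * v / d)."""
    arg = np.clip(tau_d * v / d, -1.0, 1.0)   # clip to avoid NaN from measurement noise
    return np.degrees(np.arcsin(arg))

# Example: 0.15 m spacing and a 218.7 microsecond delay give roughly 30 degrees
print(doa_from_delay(tau_d=218.7e-6, d=0.15))
```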




To estimate the time delay, two models are commonly used: the near-field and the far-field model. In the far-field model the distance of the source from the microphone array is assumed to be very large compared with the separation between the microphone elements. Commonly, the cross-correlation function is used to estimate the difference of arrival of the sound signal at each microphone element by finding the maximum of the correlation function:

$\mathbf{R}_{12} = \mathrm{iFFT}\!\left[\mathbf{X}_1 \mathbf{X}_2^{H}\right]$    (2)

where $\mathbf{X}_1 \mathbf{X}_2^{H}$ is the cross-power spectral density used for estimating the correlation spectrum, and $\mathbf{R}_{12}$ gives the correlation peaks at the time shifts between the two microphones [2]. In this approach, described in (2), the cross-correlation function is affected more by high-energy frequency bins than by low-energy bins. Furthermore, the separation between the transducers is kept small to reduce aliasing effects; however, smaller separations neglect the lower part of the frequency band. Since most of the energy lies in the lower band of the spectrum, a small distance between the microphone elements limits the use of the simple cross-correlation function at low frequencies. Another factor is the width of the peaks in the correlation function, which also depends on the frequency components present in the spectrum: the peaks become wider with a larger contribution of low-frequency components, leading to flatter, smeared peaks and lower precision of the time delay estimate [2].

As an alternative to the simple cross-correlation function, Knapp [4] introduced the generalized cross-correlation (GCC) function, which incorporates filters $H_1$ and $H_2$ in computing the cross-power density function, leading to a frequency weighting of the GCC function in (2):

$\mathbf{R}_{12} = \mathrm{iFFT}\!\left[\boldsymbol{\psi} \cdot \mathbf{G}_{X_1 X_2}\right]$    (3)

where $\boldsymbol{\psi} = \mathbf{H}_1 \mathbf{H}_2^{H}$ is the weighting function, '·' denotes element-wise multiplication, and $\mathbf{G}_{X_1 X_2} = \mathbf{X}_1 \mathbf{X}_2^{H}$. Several weighting functions, such as the phase transform (PHAT), the smoothed coherence transform (SCOT) and the maximum likelihood (ML) weighting, are available to design an optimal time delay estimator; each has its own pros and cons, depending on the environmental conditions and the microphone array configuration. In this paper we investigated and propose PHAT as the weighting function in the GCC because of its advantages under the conditions in which the sound source has to be estimated. The mathematical formulation of PHAT is given in (4); it removes the magnitudes and applies a uniform weighting to the phase of each frequency bin [4]:

$\psi_{PHAT}(f) = \dfrac{1}{\left|G_{X_1 X_2}(f)\right|}$    (4)

Ideally, the PHAT weighting produces sharp peaks in the cross-correlation function. This property of PHAT enables the detection of multiple sources, especially in reverberant environments where a large number of reflections is likely to be captured by the array.
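The following sketch shows a GCC-PHAT computation for one microphone pair, following (2)–(4); the frame length, sample rate and test signal are illustrative assumptions rather than values taken from the measurements.

```python
import numpy as np

def gcc_phat(x1, x2, fs, max_tau=None):
    """GCC with PHAT weighting for one microphone pair; returns (lag in s, correlation)."""
    n = 2 * len(x1)                               # zero-pad to avoid circular wrap-around
    X1 = np.fft.rfft(x1, n=n)
    X2 = np.fft.rfft(x2, n=n)
    G = X1 * np.conj(X2)                          # cross-power spectrum, eq. (2)
    psi = 1.0 / np.maximum(np.abs(G), 1e-12)      # PHAT weighting, eq. (4)
    r = np.fft.irfft(psi * G, n=n)                # weighted GCC, eq. (3)
    r = np.concatenate((r[-(n // 2):], r[:n // 2 + 1]))   # centre zero lag
    centre = n // 2
    max_shift = centre if max_tau is None else min(int(max_tau * fs), centre)
    window = r[centre - max_shift: centre + max_shift + 1]
    tau = (np.argmax(np.abs(window)) - max_shift) / fs
    return tau, window

# Toy usage: broadband noise, channel 2 delayed by 10 samples.
# With the R12 = iFFT[X1 X2*] convention, the peak appears at a negative lag
# when channel 2 lags channel 1.
fs = 16000
rng = np.random.default_rng(0)
x1 = rng.standard_normal(1024)
x2 = np.roll(x1, 10)
tau, _ = gcc_phat(x1, x2, fs)
print(f"peak lag: {tau * fs:.0f} samples (expected -10: channel 2 lags channel 1)")
```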

2) Combining the pairs: The direction of arrival (DOA) estimated with a single microphone pair is limited to the range −90° to +90°. In the proposed microphone array, six transducer elements are evenly distributed on an imaginary spherical surface in such a way that DOA estimation is possible over the full 3D sphere with increased localization accuracy. The array offers fifteen unique combinations of microphone elements, computed by the general formula M(M−1)/2. However, combining the individual results from each microphone pair is not straightforward, even for a single source and a linear array where only one direction has to be estimated. In reverberant conditions the GCC function produces several correlation peaks, of which only one is correct, especially when a strong reflection is captured by the array. An algorithm is therefore designed to combine the GCC functions of all microphone pairs. The GCC of each pair is calculated employing the PHAT weighting and converted into a function of both azimuth and elevation angles, $h_i(\varphi, \theta)$ [5]. The final source direction is estimated as the maximum of the combined cross-correlation functions of all microphone pairs using (5):

$(\hat{\varphi}, \hat{\theta}) = \underset{\varphi,\theta}{\arg\max} \left( \sum_{i=1}^{M(M-1)/2} h_i(\varphi, \theta) \right)$    (5)

In this formulation the spatial resolution is not sufficient to cover the whole space with high resolution, and computing $h_i(\varphi, \theta)$ at a higher resolution requires interpolation between the computed directions, which introduces additional error. A further development of this approach is presented in [6] in order to achieve a sufficiently high resolution. In this approach the source space is discretized into L points, as modeled in (6):

$p^{*}(l) = \underset{l}{\arg\max}\; \big(E(l)\big)$    (6)

where E(l) is the estimated power for direction l. The formulation for computing E(l) is given in (7), after removing the duplicate computations of microphone pairs and adding PHAT as the weighting function:

$E(l) = \sum_{r=1}^{M-1} \sum_{s=r+1}^{M} \left| \int_{-F_s/2}^{F_s/2} \Psi(f)\, X_r(f)\, X_s^{*}(f)\, e^{\,j 2 \pi f (\tau_r - \tau_s)}\, df \right|$    (7)

where $\tau_r - \tau_s$ is the time difference of arrival and $\Psi$ is the PHAT weighting function described in (4). This modeling of the combined GCC in (7) is further developed for the second method, the steered response power.
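A sketch of the discretized search in (6)–(7): for each candidate direction on a coarse grid, the PHAT-weighted cross-spectra of all microphone pairs are phase-aligned with the pair delays predicted by a far-field model and summed. The grid resolution and the far-field delay model are illustrative assumptions.

```python
import itertools
import numpy as np

C = 343.0  # speed of sound in m/s

def candidate_directions(n_az=36, n_el=9):
    """Coarse azimuth/elevation grid of unit vectors (assumed resolution)."""
    az = np.radians(np.arange(0, 360, 360 // n_az))
    el = np.radians(np.linspace(0, 80, n_el))
    return [(a, e, np.array([np.cos(e) * np.cos(a), np.cos(e) * np.sin(a), np.sin(e)]))
            for a in az for e in el]

def srp_phat_map(frames, mics, fs):
    """E(l) of eq. (7) on a direction grid; frames: (n_mics, n_samples) time-domain block."""
    n = frames.shape[1]
    X = np.fft.rfft(frames, axis=1)                       # spectra of all channels
    freqs = np.fft.rfftfreq(n, d=1.0 / fs)
    pairs = list(itertools.combinations(range(len(mics)), 2))
    scores = []
    for az, el, u in candidate_directions():
        tau = mics @ u / C                                # far-field delay per microphone
        e_l = 0.0
        for r, s in pairs:
            G = X[r] * np.conj(X[s])
            G /= np.maximum(np.abs(G), 1e-12)             # PHAT weighting, eq. (4)
            # phase-align the pair with the delay difference predicted for direction u
            e_l += np.abs(np.sum(G * np.exp(2j * np.pi * freqs * (tau[r] - tau[s]))))
        scores.append((e_l, np.degrees(az), np.degrees(el)))
    return max(scores)                                    # eq. (6): argmax over the grid

# Usage sketch (mics from the geometry example above, frames captured elsewhere):
# best = srp_phat_map(frames, mics, fs=44100)
# print(f"E(l) = {best[0]:.1f} at azimuth {best[1]:.0f} deg, elevation {best[2]:.0f} deg")
```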




C. Steered power response method and beamforming

In this section the steered-response power methods are discussed and formulated. Recent investigations and developments in SPR mainly focus on conventional beamforming techniques, in which the beamformer scans for the source and computes the magnitude-squared response. Traditional beamforming methods used for source localization work in a limited frequency band; a broadband beamformer is therefore necessary to cover the full audio spectrum. In this paper we investigated MVDR steered-power-response weightings with PHAT as a combiner of the GCC functions of all microphone pairs. In general, the power output of a beamformer is a function of the steering direction c and is given by (8):

$P_{BF}(c) = D(c)^{H}\, S\, D(c)$    (8)

where S is the cross-power spectral density matrix of the captured samples, D(c) is the directivity (steering) vector of the microphone elements, and $P_{BF}(c)$ is the beamformer output power. This method is known as the Bartlett beamformer, which uses a delay-and-sum beamformer to scan the space. Furthermore, certain weighting functions can be introduced in (8) at each frequency bin, such as maximum likelihood (ML), multiple signal classification (MUSIC) and PHAT. By inserting the MVDR weighting function in (8), which corresponds to the MMSE and ML estimators of the source, we obtain, after simplification, the steered power function for MVDR as represented by (9):

$P(c) = \dfrac{1}{D(c)^{H}\, S^{-1}\, D(c)}$    (9)
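As a sketch of (8) and (9) at a single frequency bin, the steering vector below uses far-field phase delays toward a look direction; the diagonal loading added before inverting S is a common practical safeguard and is an assumption here, not part of the original formulation.

```python
import numpy as np

C = 343.0  # speed of sound in m/s

def steering_vector(mics, look_dir, f):
    """Far-field directivity D(c) of the array toward unit vector look_dir at frequency f."""
    tau = mics @ look_dir / C                        # per-microphone propagation delay
    return np.exp(-2j * np.pi * f * tau)             # shape (n_mics,)

def bartlett_power(S, D):
    """Delay-and-sum (Bartlett) steered power, eq. (8)."""
    return np.real(np.conj(D) @ S @ D)

def mvdr_power(S, D, loading=1e-3):
    """MVDR steered power, eq. (9); diagonal loading added for numerical stability."""
    S_reg = S + loading * np.real(np.trace(S)) / len(D) * np.eye(len(D))
    return 1.0 / np.real(np.conj(D) @ np.linalg.inv(S_reg) @ D)

# Usage sketch: S is the cross-power spectral density matrix estimated at one bin,
# e.g. an average of X_k X_k^H snapshots over several frames.
```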

The SPR in (9) works for a single frequency bin and is denoted in this paper as $P_{MVDR}(c)$. For audio signals, where the frequency band is wide and the sound energy is sparsely distributed over the frequency bins, a framework is needed that combines the responses $P_{MVDR}(c, k)$ from the different bins, where k denotes the frequency bin index. A simple approach is to average the SPR over the whole frequency band; instead, another approach is proposed that uses the PHAT-style weighting derived for the GCC as a combiner, as given in (10):

$P_{MVDR}(c) = \sum_{k=1}^{M} \dfrac{1}{X_k^{H} X_k}\, P_{MVDR}(c, k)$    (10)

where $X_k$ is the vector of input signals from all transducer elements at frequency bin k.
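A sketch of the broadband combination in (10); the per-bin cross-spectral matrices, snapshot vectors and bin frequencies are assumed inputs estimated elsewhere in the processing chain.

```python
import numpy as np

C = 343.0  # speed of sound in m/s

def broadband_mvdr_power(S_bins, X_bins, mics, look_dir, freqs, loading=1e-3):
    """Broadband MVDR steered power combined over bins as in eq. (10).

    S_bins : (n_bins, n_mics, n_mics) per-bin cross-power spectral density matrices
    X_bins : (n_bins, n_mics) per-bin snapshot vectors X_k for the 1/(X_k^H X_k) weights
    """
    n_mics = mics.shape[0]
    tau = mics @ look_dir / C                                   # far-field delays
    total = 0.0
    for S_k, X_k, f_k in zip(S_bins, X_bins, freqs):
        D = np.exp(-2j * np.pi * f_k * tau)                     # steering vector D(c)
        S_inv = np.linalg.inv(S_k + loading * np.eye(n_mics))   # regularized inverse
        p_k = 1.0 / np.real(np.conj(D) @ S_inv @ D)             # P_MVDR(c, k), eq. (9)
        w_k = 1.0 / np.real(np.conj(X_k) @ X_k)                 # 1 / (X_k^H X_k)
        total += w_k * p_k
    return total

# Scanning look_dir over a steering grid and taking the maximum of this broadband
# power yields the DOA estimate, analogous to the grid search of eq. (6).
```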

D. Framework for methodology

This section describes the implementation of the methodology discussed in the preceding subsections. The processing chain is shown in Fig. 1. As a first step the sound is captured and then passed to the framing and windowing block, where the time-domain data of each frame is converted to the frequency domain by a short-time discrete Fourier transform (DFT). The frequency-domain data is passed to the localization block for sound source estimation based on the GCC with PHAT weighting; before the estimation, however, another core block, the voice activity detector (VAD), is activated. The VAD detects active signals, such as speech, in a noisy signal containing both speech and noise. Wang [7] proposed a VAD algorithm that bases its decision on the assumption of quasi-stationary noise. A noise model has to be developed for the VAD, and the voice activity decision is made on the basis of the energy and spectral shifts of each frame of the input signal. Finally, the MVDR weights are introduced and implemented for each frequency bin.
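The VAD itself is not specified in detail here; the sketch below only illustrates the general idea with a simple frame-wise decision based on energy and spectral change against a running noise estimate. The thresholds, the smoothing factor and the 20 ms framing with 50% overlap are illustrative assumptions, not parameters taken from the paper.

```python
import numpy as np

def frame_signal(x, fs, frame_ms=20, hop_ms=10):
    """Split a signal into windowed frames (assumed 20 ms frames, 10 ms hop)."""
    frame = int(fs * frame_ms / 1000)
    hop = int(fs * hop_ms / 1000)
    win = np.hanning(frame)
    starts = range(0, len(x) - frame + 1, hop)
    return np.stack([x[s:s + frame] * win for s in starts])

def simple_vad(x, fs, energy_factor=3.0, flux_factor=2.5, alpha=0.95):
    """Frame-wise activity flags from energy and spectral flux vs. a running noise model."""
    frames = frame_signal(x, fs)
    spectra = np.abs(np.fft.rfft(frames, axis=1))
    energy = spectra.sum(axis=1)
    noise_energy = energy[:5].mean()                 # assume the first frames are noise only
    noise_spec = spectra[:5].mean(axis=0)
    flags = []
    for e, s in zip(energy, spectra):
        flux = np.sum((s - noise_spec) ** 2) / np.sum(noise_spec ** 2 + 1e-12)
        active = bool(e > energy_factor * noise_energy) or bool(flux > flux_factor)
        flags.append(active)
        if not active:                               # update the quasi-stationary noise model
            noise_energy = alpha * noise_energy + (1 - alpha) * e
            noise_spec = alpha * noise_spec + (1 - alpha) * s
    return np.array(flags)

# Frames flagged False are skipped by the localization core, as described above.
```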

III. EXPERIMENTAL EVALUATION

An experimental evaluation was conducted in order to verify the algorithms and to validate the performance of the microphone array. The measurements were conducted under well-controlled conditions in a reverberant chamber.

A. Measurement setup and procedure

The evaluation of the GCC-PHAT weighting function for the six-channel microphone array was performed with a speech source placed 3 m from the array at 0° azimuth, while two omnidirectional noise sources were placed at ±45°, at the same distance as the speech source, with reference to the array coordinate system. Omnidirectional dodecahedron loudspeakers were used as the speech and noise sources. The dodecahedron placed at 0° azimuth was driven with a speech signal for approximately 20 s, whereas the two dodecahedrons placed at ±45° were driven with pink noise, which is analogous to natural ambient noise. The sound pressure levels of the three sources were controlled to produce the desired signal-to-noise ratio (SNR) at the receiving position. The mean reverberation time (RT) of the chamber is 1.18 s. The array reference coordinate system and the measurement setup are shown in Fig. 3. The evaluation of the SPR methodology with MVDR weighting, discussed in Section II, was conducted using the same measurement setup as shown in Fig. 3. In this evaluation case the three dodecahedrons were driven with three speech sources simultaneously in order to estimate multiple source localization by providing a steering grid of angles to the MVDR beamformer.

Fig. 3. Measurement setup and microphone reference coordinate system

B. Results and discussion

In this section the results of the performance evaluation of the GCC-PHAT function and the MVDR beamformer for the six-channel microphone array under noisy and reverberant conditions are discussed.




1) DOA results using GCC weighting functions: A speech source is placed 3 m from the array at 0°. The GCC functions of all microphone pair combinations for one speech frame, computed with different weighting functions, are shown in Fig. 4. As observed from Fig. 4, the maximum likelihood (ML) weighting produces a wider GCC peak and shows poor results under reverberant conditions, whereas sharp peaks are observed for the PHAT and MLR weightings. We computed the DOA of the speech source for each frame by searching for the maximum peak of the GCC-PHAT function, invoking the VAD algorithm. A speech signal of 20 s length was selected, which normally contains about 500 voiced frames at a 20 ms window size and a 44.1 kHz sample rate. The GCC-PHAT of every frame was computed and the maximum peak was located; a quadratic interpolation over the maximum and its two neighbouring points of the GCC curve was employed to increase the precision. The results are displayed in Table I. They show that the PHAT and MLR weighting functions detect a higher number of speech frames, indicating better performance, whereas ML detects fewer frames. Furthermore, the mean values and standard deviations (STD) of PHAT are better than those of ML, MLR and the unweighted function.

TABLE I. DOA ACCURACY FOR DIFFERENT WEIGHTING FUNCTIONS

Weighting Function | No. of Measurements (Frames) | Mean (Deg.) | Deviation (Deg.)
No Weighting       | 400                          | 3.219       | 7.041
ML                 | 460                          | 2.321       | 5.739
PHAT               | 491                          | 0.783       | 1.684
MLR                | 498                          | 1.310       | 2.407
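The quadratic interpolation mentioned above refines the discrete GCC peak using the maximum sample and its two neighbours. The sketch below uses the standard three-point parabolic-vertex formula, which is a common choice but is stated here as an assumption about the exact interpolation used.

```python
import numpy as np

def refine_peak(r, fs):
    """Sub-sample peak position of a GCC curve via three-point parabolic interpolation."""
    k = int(np.argmax(r))
    if k == 0 or k == len(r) - 1:
        return k / fs                       # no neighbours on both sides; keep the raw peak
    y0, y1, y2 = r[k - 1], r[k], r[k + 1]
    denom = y0 - 2 * y1 + y2
    delta = 0.0 if denom == 0 else 0.5 * (y0 - y2) / denom   # vertex offset in samples
    return (k + delta) / fs

# Example: a sampled curve peaking between bins 4 and 5
r = np.array([0.0, 0.2, 0.5, 0.8, 0.95, 0.97, 0.7, 0.3])
print(f"refined peak position: {refine_peak(r, fs=1.0):.2f} samples")
```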

2) SPR-based DOA estimation results: This section describes the results obtained with the SPR-MVDR weighting under the same environmental conditions as in the previous measurements. To analyze the performance of the MVDR-weighted GCC combined with the PHAT weighting, three sound sources were installed at the positions (0°, 90°), (45°, 45°) and (90°, 90°) and driven simultaneously with full-band sine sweep signals of 3 s duration. The impulse responses were collected at the microphone array and processed for further computation. Histograms of both the azimuth and elevation angles of the DOA are plotted to confirm the simultaneous detection of the three sources and to show the accuracy of the computed results in Fig. 5.

Fig. 4. A comparison of GCC functions with different weighting functions




Fig. 5a shows the azimuth angles while Fig. 5b shows the elevation angles. The mean values and standard deviations for each direction are presented in Table II, where φ is the azimuth and θ is the elevation angle. As observed from Table II, the localization shows better results using MVDR weights with GCC-PHAT compared with those obtained with GCC-PHAT alone.

Fig. 5a. Histograms of the DOA estimates for azimuth angles

Fig. 5b. Histograms of the DOA estimates for elevation angles

TABLE II. DOA ACCURACY FOR SIMULTANEOUS SOURCE DETECTION

Actual Position (Deg.) (φ, θ) | Computed Mean (Deg.) (φ, θ) | Computed Deviation (Deg.) (φ, θ)
(0, 90)                       | (-1.32, 89.02)              | (-0.32, 0.08)
(45, 45)                      | (46.34, 44.01)              | (1.05, 0.68)
(90, 90)                      | (89.76, 88.91)              | (-0.74, -1.30)

IV. CONCLUSION

We presented a framework for direction of arrival estimation and sound source localization using a six-channel spherical microphone array with an open-sphere configuration, based on the PHAT-GCC and the MVDR-weighted PHAT-GCC algorithms. The contribution of this paper is the optimal design of the microphone array using a minimum number of channels while producing convincing results within an accuracy of ±2°. In addition, the proposed MVDR-weighted PHAT-GCC ASLT framework performs well in noisy and reverberant conditions. The experimental evaluation of the proposed methodology demonstrated higher performance than the other GCC weighting functions and conventional steered power response techniques. The results showed that the PHAT-weighted GCC function produces sharper cross-correlation peaks in reverberant and noisy environments, which leads to more precise and robust source localization, whereas the MVDR-weighted SPR localizer estimated the simultaneous multiple sound sources with an STD of ±1.3° in DOA for both azimuth and elevation.

REFERENCES



[1] I. J. Tashev, Sound Capture and Processing, John Wiley & Sons, Ltd., 2009.
[2] B. Rafaely, "Analysis and design of spherical microphone arrays," IEEE Transactions on Speech and Audio Processing, vol. 13, pp. 135-143, 2005.
[3] C. Knapp and G. C. Carter, "The generalized correlation method for estimation of time delay," IEEE Transactions on Acoustics, Speech and Signal Processing, vol. 24, pp. 320-327, 1976.
[4] S. T. Birchfield and D. K. Gillmor, "Acoustic source direction by hemisphere sampling," in Proc. IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2001, vol. 5, pp. 3053-3056.
[5] M. Brandstein and D. Ward (eds.), Microphone Arrays, Springer-Verlag, Berlin, Germany, 2001.
[6] H. Wang and P. Chu, "Voice source localization for automatic camera pointing system in videoconferencing," in Proc. IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 1997, pp. 187-190.




