2011 Joint Workshop on Hands-free Speech Communication and Microphone Arrays
May 30 - June 1, 2011
REAL-TIME PROTOTYPE FOR MULTIPLE SOURCE TRACKING THROUGH GENERALIZED STATE COHERENCE TRANSFORM AND PARTICLE FILTERING

Francesco Nesta, Alessio Brutti, Luca Cristoforetti

Fondazione Bruno Kessler-Irst, via Sommarive 18, 38123 Trento, Italy
{nesta|brutti|cristofo}@fbk.eu
ABSTRACT

This real-time demo implements a novel method for tracking the bidimensional TDOAs of several concurrent acoustic sources using a small number of microphones. A batch on-line frequency-domain ICA is applied to the mixture recorded by an array consisting of three or four microphones. The estimated mixing matrices are used to compute an approximated instantaneous kernel density of the time-delay propagation through the Generalized State Coherence Transform (GSCT). Tracking of the propagation parameters of each source is performed through multiple disjoint particle filters, which estimate the posterior kernel density of the bidimensional time-delays based on the instantaneous GSCT function. The tracked 2D TDOAs are then used to drive a Blind Source Separation (BSS) algorithm that extracts the sound source selected by the user. The demo will be run in the presence of up to five competing sources.

Index Terms— Blind Source Separation (BSS), Independent Component Analysis (ICA), multiple speaker tracking, kernel methods, sequential Bayesian methods.

1. INTRODUCTION

Blind characterization of the parameters of the acoustic propagation of multiple sources is an ambitious task that helps in solving several crucial problems in many speech-related applications. An approximated description of the acoustic propagation can be provided by estimating the Time Difference Of Arrival (TDOA) of acoustic waves impinging on multiple microphone pairs. Solutions typically rely on signals acquired by microphone arrays, either compact or distributed [1]. While a wide literature of well-established methods exists for the single-source case [2], few of these methods can explicitly deal with multiple competing sources. Scenarios where several audio sources are active at the same time have received increasing interest from the scientific community, especially in the fields of Blind Source Separation (BSS) [3] and multiple source localization [4, 5]. In the context of BSS, effective methods for TDOA estimation of multiple sources have been proposed [6, 7]. In particular, the Generalized State Coherence Transform (GSCT) [6] is an attractive solution for the estimation of multidimensional TDOAs associated with multiple competing sources and can be applied even when the number of sources is greater than the number of sensors and in the presence of spatial aliasing. The GSCT takes advantage of the power of ICA in estimating complex-valued propagation models in different frequencies and time segments. These models are then combined in the GSCT through an efficient non-linear integration that maps them directly into the TDOA space. TDOA vectors representing the acoustic propagation of multiple sources can be estimated without requiring any knowledge of the microphone geometry. Since the GSCT is based on a batch ICA implementation, it gives
an instantaneous picture of the acoustic propagation for a given time segment but does not allow explicit source parameter tracking in time-varying mixing conditions. In this regard, Particle Filters (PF) represent an effective solution for tracking multiple sources [8]. By means of weighted samples, PF methods approximate the posterior Probability Density Function (PDF) of the targets based on current and past observations. We therefore follow the directions recently proposed in [9] and [10] and set up a PF tracking framework where the measurement likelihoods are obtained from the instantaneous GSCT functions. Tracking of multiple sources is performed in a disjoint fashion by associating an independent filter with each target. Spatial exclusion is employed, in the form of masks applied to the GSCT, to ensure that two or more filters do not collapse onto the same target. Thanks to the adopted disjoint tracking scheme, we do not need to explicitly model the joint probability of the target states, which reduces the computational complexity. By combining the GSCT with PF in the TDOA domain, we obtain a tracking system that is as general as possible. In fact, if the microphone geometry is known, the source spatial locations can be derived from the estimated TDOAs, which, moreover, can be directly used to drive BSS algorithms. The system has the following advantages: (a) it does not explicitly require any prior knowledge of the array geometry; (b) it efficiently exploits ICA in the frequency domain; (c) it is able to track a number of sources greater than the number of microphones; (d) it allows the estimation of multidimensional TDOA parameters without spatial ambiguity.

2. SYSTEM ARCHITECTURE DESCRIPTION

The system is available in a configuration with a centralized array of three closely spaced microphones and in a configuration with two distributed microphone pairs (i.e., four microphones), which enhances the orthogonality among the dimensions. The hardware setup of the demo for the four-channel case is described in figure 1. Up to five loudspeakers play independent audio signals (e.g., music and speech). The signals acquired by the omni-directional microphones go through a pre-amplification stage and are sampled at 16 kHz by an external 4-in/4-out USB audio card (ESI Maya44USB) connected to a laptop. The graphical interface of the system displays, in separate windows, the particles of each filter, the instantaneous GSCT functions and the resulting bidimensional TDOA estimates. Through the interface, users can select one of the sources, which is then extracted and reproduced in the headphones. The global architecture of the demo is depicted in figure 2 and described in the following.
Fig. 1. Demo setup for the configuration with two distributed microphone pairs.
Fig. 2. Block diagram describing the architecture of the adopted tracking framework.
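The core measurement in this architecture is the GSCT block of figure 2. As a rough, one-dimensional illustration of the non-linear integration it performs (a single-pair simplification under assumed parameter values; the actual multidimensional transform is defined in [6] and [12]), the following sketch accumulates, over frequency bins, a bounded kernel of the distance between the ICA-estimated mixing ratios and the ratios predicted by each candidate TDOA, which avoids explicit phase unwrapping:

```python
import numpy as np

# Simplified, single-pair illustration of the GSCT idea (see [6] for the real
# transform): ICA gives, for every frequency bin k, a complex mixing ratio per
# separated output.  Each ratio is compared with the ratio exp(-j*omega_k*tau)
# predicted by a candidate TDOA tau, and a bounded non-linear kernel of the
# distance is accumulated over frequency.

def sct_1d(ratios, freqs, tau_grid, alpha=5.0):
    """ratios: (K,) complex mixing ratios from ICA; freqs: (K,) bin frequencies [Hz];
    tau_grid: candidate TDOAs [s].  Returns an accumulated coherence per candidate."""
    r = ratios / np.abs(ratios)                       # keep the phase only (unit modulus)
    omega = 2 * np.pi * freqs
    # distance between observed and hypothesised ratios for every (bin, candidate)
    d = np.abs(r[:, None] - np.exp(-1j * omega[:, None] * tau_grid[None, :]))
    return np.sum(1.0 - np.tanh(alpha * d), axis=0)   # bounded kernel, summed over bins

# Toy check: a source with a true delay of 0.2 ms should produce a peak near 0.2 ms.
fs, K = 16000, 257
freqs = np.linspace(0, fs / 2, K)
ratios = np.exp(-1j * 2 * np.pi * freqs * 2e-4)
tau_grid = np.linspace(-1e-3, 1e-3, 401)
print(tau_grid[np.argmax(sct_1d(ratios, freqs, tau_grid))])   # ≈ 0.0002
```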
2.1. Batch audio data-stream framing
The sampled time-domain signals are transformed into a discrete time-frequency representation by applying a Short-Time Fourier Transform (STFT) with overlapped Hanning windows (e.g., 75% overlap) in order to obtain frames with a certain degree of temporal continuity. Since the accuracy of ICA depends on the amount of observed data, STFT time-series are obtained by grouping the estimated STFT frames into batch segments. The STFT frames of different batch segments are again overlapped to enforce continuity of the observed data.
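A minimal sketch of this framing stage is given below; the frame length, overlap factors and batch size are illustrative values and not necessarily those used in the demo:

```python
import numpy as np

def stft(x, frame_len=1024, overlap=0.75):
    """Hanning-windowed STFT with the given fractional frame overlap."""
    hop = int(frame_len * (1 - overlap))
    win = np.hanning(frame_len)
    n_frames = 1 + (len(x) - frame_len) // hop
    frames = np.stack([x[i * hop:i * hop + frame_len] * win for i in range(n_frames)])
    return np.fft.rfft(frames, axis=1)            # shape: (n_frames, n_bins)

def batch_segments(X, batch_frames=64, batch_overlap=0.5):
    """Group STFT frames into overlapping batch segments for the batch ICA stage."""
    hop = int(batch_frames * (1 - batch_overlap))
    n_batches = 1 + (X.shape[0] - batch_frames) // hop
    return [X[i * hop:i * hop + batch_frames] for i in range(n_batches)]

# Example: two seconds of 16 kHz noise split into overlapping ICA batches.
x = np.random.randn(32000)
batches = batch_segments(stft(x))
print(len(batches), batches[0].shape)
```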
2.2. RR-ICA
A simplified implementation of the frequency-domain Recursively-Regularized ICA (RR-ICA) [11] is applied to all the batch segments in order to estimate the mixing parameters H(k). To reduce the computational load, only the smoothing across the demixing matrices and the scaling normalization are implemented, which has proven sufficient for a stable and accurate estimation of H(k).
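As a hedged sketch of these two ingredients, the snippet below applies a simple neighbour smoothing to the per-bin demixing matrices and a minimal-distortion-style scaling normalization (a common choice for removing the arbitrary per-bin ICA scaling; the actual RR-ICA regularization is described in [11]):

```python
import numpy as np

def smooth_demixing(W, weight=0.5):
    """W: (K, N, N) demixing matrices, one per frequency bin.
    Each bin is averaged with its neighbours to enforce spectral continuity."""
    W_s = W.copy()
    W_s[1:-1] = weight * W[1:-1] + 0.5 * (1 - weight) * (W[:-2] + W[2:])
    return W_s

def scale_normalize(W):
    """Minimal-distortion-style scaling: rescale each output so that the
    arbitrary per-bin ICA scaling is projected back onto the sensors."""
    W_n = np.empty_like(W)
    for k in range(W.shape[0]):
        H = np.linalg.inv(W[k])               # estimated mixing matrix H(k)
        W_n[k] = np.diag(np.diag(H)) @ W[k]   # rescale rows of W(k) by diag(H(k))
    return W_n

# Example with random complex demixing matrices for 257 bins and 3 channels.
rng = np.random.default_rng(0)
W = rng.standard_normal((257, 3, 3)) + 1j * rng.standard_normal((257, 3, 3))
H = np.linalg.inv(scale_normalize(smooth_demixing(W)))   # mixing parameters H(k)
print(H.shape)
```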
2.3. GSCT-PF

Tracking of multiple sources is performed by instantiating independent PFs, implementing Sampling Importance Resampling (SIR-PF), whose target states are defined in the bidimensional TDOA domain. The filters are updated one after the other in an iterative way by evaluating an enhanced version of the GSCT function [12] at the particle TDOA positions. The GSCT is computed with bidimensional vectors, obtained from H(k) as in [13], and is normalized on average in order to obtain values in the range [0, 1]. Moreover, when processing a given filter, a spatial exclusion mask is adopted which forces to zero the GSCT values around the positions of the targets associated with the other filters. The TDOA estimates for each filter are obtained by applying a maximum likelihood criterion to the approximated PDF.

2.4. BSS system

The estimated TDOAs are used to drive a source separation system. Either spatial binary masking or a constrained semi-blind source separation [14] is adopted to enhance a given source of interest.
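For the tracking step of Sec. 2.3, the following is a minimal sketch of one disjoint SIR-PF update in the 2D TDOA domain; the GSCT map is abstracted as a function of a 2D TDOA vector returning values in [0, 1], and the particle count, random-walk drift and exclusion radius are assumed values for illustration, not those of the actual system:

```python
import numpy as np

class Sir2dTdoaFilter:
    """One disjoint SIR particle filter with a 2D TDOA state (illustrative only)."""

    def __init__(self, n_particles=500, tau_max=1e-3, drift=2e-5, rng=None):
        self.rng = rng if rng is not None else np.random.default_rng()
        self.tau_max, self.drift = tau_max, drift
        self.particles = self.rng.uniform(-tau_max, tau_max, size=(n_particles, 2))
        self.estimate = np.zeros(2)          # current 2D TDOA point estimate

    def update(self, gsct, other_targets, excl_radius=1e-4):
        # 1. Propagate: random-walk dynamics in the bidimensional TDOA plane.
        self.particles += self.rng.normal(0.0, self.drift, self.particles.shape)
        np.clip(self.particles, -self.tau_max, self.tau_max, out=self.particles)
        # 2. Weight: the GSCT value acts as the measurement likelihood,
        #    forced to zero around the targets tracked by the other filters.
        w = np.array([gsct(p) for p in self.particles])
        for t in other_targets:
            w[np.linalg.norm(self.particles - t, axis=1) < excl_radius] = 0.0
        w = w / w.sum() if w.sum() > 0 else np.full(len(w), 1.0 / len(w))
        # 3. Point estimate: maximum-likelihood style (best-weighted particle).
        self.estimate = self.particles[np.argmax(w)].copy()
        # 4. Resample (SIR): draw a new particle set proportionally to the weights.
        self.particles = self.particles[self.rng.choice(len(w), size=len(w), p=w)]

# Toy usage: two disjoint filters updated iteratively on two synthetic GSCT peaks.
peaks = [np.array([3e-4, -2e-4]), np.array([-4e-4, 1e-4])]
gsct = lambda tau: sum(np.exp(-np.sum((tau - p) ** 2) / (2 * 5e-5 ** 2)) for p in peaks)
filters = [Sir2dTdoaFilter(rng=np.random.default_rng(i)) for i in range(2)]
for _ in range(20):
    for i, f in enumerate(filters):
        f.update(gsct, [g.estimate for j, g in enumerate(filters) if j != i])
print([f.estimate for f in filters])
```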
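For the separation step of Sec. 2.4, the sketch below shows one common form of TDOA-driven spatial binary masking for a single microphone pair; it is a generic illustration and not necessarily the masking or the semi-blind separation [14] used in the demo:

```python
import numpy as np

def binary_mask_separation(X1, X2, freqs, tdoas, target_idx):
    """X1, X2: STFTs of a microphone pair, shape (n_frames, n_bins);
    freqs: bin frequencies [Hz]; tdoas: tracked TDOAs [s] of all active sources;
    target_idx: index of the source selected by the user."""
    omega = 2 * np.pi * np.asarray(freqs)[None, :]
    obs = X2 / (X1 + 1e-12)                   # observed cross-channel ratio
    obs = obs / (np.abs(obs) + 1e-12)         # keep only its phase
    # distance between the observed ratio and the one predicted by each TDOA
    dists = np.stack([np.abs(obs - np.exp(-1j * omega * tau)) for tau in tdoas])
    mask = np.argmin(dists, axis=0) == target_idx
    return X1 * mask                          # masked reference-channel STFT

# Toy usage with random spectra and two tracked delays.
rng = np.random.default_rng(0)
X1 = rng.standard_normal((10, 257)) + 1j * rng.standard_normal((10, 257))
X2 = rng.standard_normal((10, 257)) + 1j * rng.standard_normal((10, 257))
freqs = np.linspace(0, 8000, 257)
Y = binary_mask_separation(X1, X2, freqs, tdoas=[3e-4, -2e-4], target_idx=0)
print(Y.shape)
```

In this simplified scheme, each time-frequency bin is assigned to the tracked source whose TDOA best explains the observed inter-channel phase, and only the bins assigned to the user-selected source are kept.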
3. REFERENCES

[1] M. Brandstein and D. Ward, Eds., Microphone Arrays, Springer, 2001.
[2] M. Wölfel and J. McDonough, Distant Speech Recognition, John Wiley and Sons, 2009.
[3] M. S. Pedersen, J. Larsen, U. Kjems, and L. C. Parra, "A survey of convolutive blind source separation methods," in Springer Handbook of Speech Processing, Nov. 2007.
[4] A. Lombard, T. Rosenkranz, H. Buchner, and W. Kellermann, "Multidimensional localization of multiple sound sources using averaged directivity patterns of blind source separation systems," in ICASSP, 2009.
[5] A. Brutti, M. Omologo, and P. Svaizer, "Multiple source localization based on acoustic map de-emphasis," EURASIP Journal on Audio, Speech, and Music Processing, 2010.
[6] F. Nesta and M. Omologo, "Generalized state coherence transform for multidimensional localization of multiple sources," in WASPAA, 2009.
[7] H. Sawada, S. Araki, R. Mukai, and S. Makino, "Grouping separated frequency components by estimating propagation model parameters in frequency-domain blind source separation," IEEE Trans. on Audio, Speech, and Language Processing, vol. 15, no. 5, pp. 1592–1604, July 2007.
[8] M. S. Arulampalam et al., "A tutorial on particle filters for on-line nonlinear/non-Gaussian Bayesian tracking," IEEE Trans. on Signal Processing, vol. 50, no. 2, February 2002.
[9] F. Antonacci et al., "Tracking multiple acoustic sources using particle filtering," in EUSIPCO, 2006.
[10] P. Teng, A. Lombard, and W. Kellermann, "Disambiguation in multidimensional tracking of multiple acoustic sources using a Gaussian likelihood criterion," in ICASSP, 2010.
[11] F. Nesta, P. Svaizer, and M. Omologo, "Convolutive BSS of short mixtures by ICA recursively regularized across frequencies," IEEE Trans. on Audio, Speech, and Language Processing, vol. 19, no. 3, pp. 624–639, March 2011.
[12] F. Nesta and A. Brutti, "Self-clustering non-Euclidean kernels for improving the estimation of multidimensional TDOA of multiple sources," in HSCMA, May 2011.
[13] F. Nesta and M. Omologo, "Approximated kernel density estimation for multiple TDOA detection," in ICASSP, May 2011.
[14] F. Nesta, T. S. Wada, and B.-H. Juang, "Batch-online semi-blind source separation applied to multi-channel acoustic echo cancellation," IEEE Trans. on Audio, Speech, and Language Processing, vol. 19, no. 3, pp. 583–599, March 2011.