FAST AND ROBUST REALTIME SPEAKER TRACKING USING MULTICHANNEL AUDIO AND A PARTICLE FILTER

E. Rückert*, L. Ottowitz
Graz University of Technology
[email protected], [email protected]

M. Képesi
Graz University of Technology, SPSC Laboratory
[email protected]

ABSTRACT

In this work, a method to track the azimuth (horizontal angle) of multiple speakers in a typically reverberant real office environment is presented. The steered-response-power algorithm (SRP-PHAT) or the recently published joint position and pitch extraction approach (PoPi), combined with sequential Monte Carlo estimation, leads to a robust and fast tracker for audio indexing. One intention of this work was to extract the harmonic frequency, or pitch, of the speech. This additionally or jointly extracted information is useful for speech recognition systems to find word boundaries, or in the mobile robot field to increase the person detection performance. The results are computed in realtime on a typical notebook with a 1.6 GHz CPU. A circular arrangement of four microphones, which fits on a mobile robot or a conference table, is sufficient to obtain a speaker detection rate of at least 80 percent with an accuracy of three degrees. The performance could be further enhanced with more microphones or a better voice activity detector (VAD).

1. INTRODUCTION

Multimedia indexing is a topic of great interest. A robust combination of visual and audible information enables several new applications and enhances the performance of vision tasks. Applications include, among others, automatic camera steering, conference recordings and multi-party speech segmentation. Mobile robots could react to a speaking person and then start a communication process. Acoustic source locations can be computed faster than locations obtained with vision-based approaches, and the location of a speaking person can be used to restrict the search region in an image and thus boost a vision-based algorithm. The joint pitch frequency extraction introduced by [1] provides a powerful feature for speech recognition systems to identify word boundaries and gender. Abdulla et al. [2] show that gender-specific recognition methods perform better than gender-independent models. In the field of mobile robotics, the pitch could also be used to increase the face recognition rate by restricting the search space.

In this work, a sequential Monte Carlo method, commonly known as a particle filter, was used for tracking. This technique facilitates sensor fusion and performs well even under adverse conditions. But the biggest advantage is the reduction of the calculations needed per frame. The considered position and pitch plane according to [1] has 142.2 · 10^3 possible states (360 · 395, with fs = 44.1 kHz and f ∈ [80, 280] Hz). Our particle filter works well with 500 particles, which means that only 500 calculations per frame are needed to build a robust tracker for multiple speakers. Figure 1 illustrates the hardware setting used in our experiments. The microphones are oriented upwards to obtain consistent measurements. The particle filter framework is described in Section 2. Sections 3 and 4 discuss the measurement models. The experimental results are presented in Section 5 and we conclude in Section 6.

* Thanks to T. Habib for helping with the recordings.

Fig. 1. Microphone array with different pair combinations (4, 8 and 16 microphones)
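The state-space size quoted in the introduction can be sanity-checked with a few lines. The exact pitch discretization (one candidate per integer correlation lag at fs = 44.1 kHz) is our assumption; the paper only states the resulting counts.

```python
# Size of the position-pitch state space vs. the number of particles.
# Assumption: one pitch candidate per integer correlation lag at fs = 44.1 kHz.
fs = 44_100          # sample rate in Hz
f_lo, f_hi = 80, 280 # pitch search range in Hz

angles = 360                          # azimuth resolution of 1 degree
pitch_lags = fs // f_lo - fs // f_hi  # integer lags between fs/280 and fs/80

states = angles * pitch_lags
print(angles, pitch_lags, states)     # close to the 360 * 395 = 142.2e3 states cited
print(states // 500)                  # hundreds of times fewer evaluations with 500 particles
```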

2. PARTICLE FILTER FOR POSITION ESTIMATION

Sequential Monte Carlo methods approximate an optimal Bayesian filter [3] by representing a probability distribution of speaker positions with a set of particles. Bayes' theorem is defined as follows:

p(x_k | z_{1:k}) = p(z_k | x_k) · p(x_k | z_{1:k−1}) / p(z_k | z_{1:k−1})    (1)

where p(x_k) is the state probability at time step k and z_{1:k} are all observations. p(z_k | z_{1:k−1}) is the normalization coefficient. The filter distribution p(x_k | z_{1:k}) can be computed recursively by:

p(x_k | z_{1:k}) = m · p(z_k | x_k) · ∫ p(x_k | x_{k−1}) · p(x_{k−1} | z_{1:k−1}) dx_{k−1}    (2)

The normalization coefficient is denoted by m. By approximating p(x_{k−1} | z_{1:k−1}) with a weighted particle set, we get the formulation of the particle filter:

p(x_k | z_{1:k}) ≈ m · p(z_k | x_k) · Σ_{n=1}^{N} w_{k−1}^n · p(x_k | x_{k−1}^n)    (3)

where N is the number of particles. For more details we refer to [3].

2.1. Algorithm

A simple state-space model consists of a dynamic model, an observation model and a sampling technique. Other orderings and model definitions exist, such as the one used in [4]. Each particle α_i describes one point (one speaker hypothesis) in a state space of any dimension. Initially, N particles are placed randomly in the state space. In the update step, one of the observation models from Sections 3 or 4 is used to calculate the likelihoods of the particles. In the following step, the next generation of particles is drawn and, if necessary, new particles are added. Afterwards, each particle is randomly moved in the state space according to the dynamic model. The particle filter converges after a few cycles to the observed probability distribution.

2.2. Tracking problem

The horizontal angle φ from the center of a circular microphone array to a source location is estimated. The observation model described in Section 3 extracts the angle and the pitch f at the same time, whereas the SRP-PHAT model provides only the horizontal angle. In the latter case the pitch must be estimated separately, as described in Section 4. That result is not very accurate, because the pitch value is not linked to the speaker position, and the computational time increases. A comparison is given in Section 5.3. A particle is defined as α_i = ⟨φ_i, f_i, w_i⟩. The distance to a speaker is not taken into account. A similar acoustic model is discussed in [7].
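The per-frame cycle (update, resample, diffuse) can be sketched as follows. The likelihood function below is a synthetic placeholder standing in for the PoPi or SRP-PHAT observation models, and the variance values are taken in the spirit of the dynamic model; this is an illustration, not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 500  # number of particles, as in the paper

# State: azimuth phi in degrees [0, 360), pitch f in Hz [80, 280].
phi = rng.uniform(0.0, 360.0, N)
f = rng.uniform(80.0, 280.0, N)

def likelihood(phi, f):
    """Placeholder observation model: a single peak at 280 deg / 220 Hz.
    In the paper this would be the PoPi plane (Sec. 3) or SRP-PHAT (Sec. 4)."""
    d_phi = np.minimum(np.abs(phi - 280.0), 360.0 - np.abs(phi - 280.0))
    return np.exp(-0.5 * ((d_phi / 10.0) ** 2 + ((f - 220.0) / 15.0) ** 2))

for _ in range(10):  # a few filter cycles
    # Update: weigh particles by the observation model and normalize.
    w = likelihood(phi, f)
    w = w / w.sum()

    # Resample: draw the next generation according to the weights.
    idx = rng.choice(N, size=N, p=w)
    phi, f = phi[idx], f[idx]

    # Dynamic model: random diffusion of a few Hz and a few degrees per frame.
    phi = (phi + rng.normal(0.0, 3.0, N)) % 360.0
    f = np.clip(f + rng.normal(0.0, 3.0, N), 80.0, 280.0)

# After a few cycles the cloud concentrates near the likelihood peak.
print(round(float(np.median(phi))), round(float(np.median(f))))
```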

2.3. Dynamic model

This model defines the temporal evolution of the state. A pitch variance of 2-4 Hz and an angle variance of 2-4° per VAD frame are sufficient for tracking multiple speakers in a small reverberant office.

2.4. Observation model

Two observation models are discussed in the next sections. These models are responsible for calculating the particle weights. In the simplest case, all weights are re-calculated and normalized in every particle filter cycle. It is useful, however, to also take previous estimates into account: a trace of three weights is stored for every particle and their sum is used as the probability value. This makes the tracking very robust, but it remains hard to detect simultaneously speaking persons.

2.5. Sampling technique

To reduce the variance of the modeled probability distribution, a resampling method is used. In each cycle the next generation of particles is drawn from the set of old particles according to their weights (likelihoods) w_i. Particles with larger weights are drawn more often. This statistical approach can model multiple objects via multi-modal distributions.

Fig. 2. Position and pitch probability plane with particles; peak at 220 Hz and 280 degrees

The particles shown in Figure 2 converge after a few steps to the maxima of the position and pitch probability plane in the background. Points with a pitch value smaller than 190 Hz denote male speakers and lie in the upper part of the image.

2.6. Particle clustering

A simple and fast clustering algorithm based on the particle weights and their Euclidean distance to each other is presented:

dist(α_i, α_j) = sqrt((f_i − f_j)^2 + angleDist(φ_i, φ_j)^2)    (4)

The method iterates through the sorted particle list just once. The first particle in the list has the highest weight and creates the first cluster. Afterwards, the distances between the current particle and all cluster centers are compared. If the distance is smaller than a threshold, the particle is added to the closest cluster, but the cluster center stays the same; otherwise, a new cluster is created. The function angleDist(φ_i, φ_j) calculates the correct distance between two angles, while dist(α_i, α_j) computes the distance between two particles. At the end, the probability of each cluster is measured and weak estimates are discarded. This approach is quite similar to a mean-shift algorithm, but computing the mean is not necessary here, because the particle weights are significant enough to estimate the current speaker position.
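The single-pass clustering described above can be sketched as follows. The threshold of 8.0 and the minimum probability of 0.5 are the values given in Section 2.7; the helper names and the toy data are ours.

```python
import math

def angle_dist(a, b):
    """Shortest distance between two angles in degrees."""
    d = abs(a - b) % 360.0
    return min(d, 360.0 - d)

def dist(p, q):
    """Equation (4): distance between two particles (phi, f, w)."""
    return math.hypot(p[1] - q[1], angle_dist(p[0], q[0]))

def cluster(particles, threshold=8.0, min_prob=0.5):
    """Single pass over the particles sorted by descending weight.
    Each cluster keeps its founding particle as a fixed center."""
    particles = sorted(particles, key=lambda p: -p[2])
    clusters = []  # list of (center_particle, summed_weight)
    for p in particles:
        for i, (center, prob) in enumerate(clusters):
            if dist(p, center) < threshold:
                clusters[i] = (center, prob + p[2])  # center stays the same
                break
        else:
            clusters.append((p, p[2]))
    # Discard weak clusters.
    return [(c, prob) for c, prob in clusters if prob >= min_prob]

# Toy example: one strong speaker near 82 deg / 210 Hz plus stray particles.
parts = [(82.0, 210.0, 0.4), (83.5, 212.0, 0.35), (81.0, 209.0, 0.18),
         (200.0, 120.0, 0.05), (10.0, 260.0, 0.02)]
print(cluster(parts))  # one surviving cluster, centered at (82.0, 210.0, 0.4)
```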

Fig. 3. Clustered speaker at 82 degrees with a probability of 93 percent. DoA algorithm used: joint position and pitch extraction

Figure 3 shows the clustering result for a single speaker at 82 degrees with a probability of 93 percent. Additionally, a picture and the name of the speaker were displayed whenever the tracked pitch value fell into a predefined range of 200 Hz - 220 Hz. This approach would need to be made more sophisticated, because in a multi-speaker environment the pitch ranges would probably overlap.
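The naive pitch-to-identity mapping mentioned above can be sketched like this. The names and the second range are illustrative; as noted in the text, overlapping ranges break this scheme in multi-speaker settings.

```python
# Map a tracked pitch value to a registered speaker by fixed pitch ranges.
# Illustrative data only; only the 200-220 Hz range comes from the paper.
speakers = {
    "speaker_a": (200.0, 220.0),  # range used in the Fig. 3 demo
    "speaker_b": (120.0, 150.0),
}

def identify(pitch_hz):
    for name, (lo, hi) in speakers.items():
        if lo <= pitch_hz <= hi:
            return name
    return None

print(identify(210.0))  # speaker_a
print(identify(180.0))  # None: pitch outside every registered range
```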

2.7. Optimizations and parameter values

To approximate the state space in a dynamic environment, N = 500 particles are used. At every frame, 10 percent new particles are added and the worst 10 percent are deleted before continuing with the resampling process. Speakers with pitch values below 190 Hz are considered male. The distance threshold of the clustering algorithm is set to 8.0 and clusters with a probability lower than 0.5 are discarded. This means that at most two speakers can be tracked simultaneously, if all particles are subdivided into two parts of equal size. The SRP-PHAT model detects multiple speakers alternately. The variances are set to 2.0 Hz and 2.0 degrees.

3. JOINT POSITION AND PITCH EXTRACTION OBSERVATION MODEL (POPI)

Recent work from Képesi et al. [5] was an inspiration to build a realtime speaker location application. Figure 4 illustrates the extracted information of one frame. Képesi et al. proposed a method based on the cross-correlation (Equation 5) to calculate the pitch and position plane (PoPi-plane):

corr_{f,g}(i) = (f ⋆ g)(i) = ∫_{−∞}^{∞} f(u + i) · g*(u) du    (5)

The cross-correlation between two microphone signals is calculated in the frequency domain:

corr_{f,g}(i) = IFFT(F(i) · G*(i))    (6)

Finally, the plane is refined with a weighting function wv(f) to get rid of cross-terms. This vector is based on the cepstrum of one cross-correlation. For a detailed description, see the paper of Wohlmayr et al. [1].

Fig. 4. Position and pitch probability plane, with a peak at 280 Hz and 210 degrees

An observation model based on the presented formalism is built. First, the cross-correlation between the microphone signals of each pair (e.g. microphones 1-9 and 5-13; see Figure 1 for the arrangement with four microphones) is calculated. One of the correlation measurements is chosen to calculate the weight vector.
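Equation (6), the frequency-domain cross-correlation, can be sketched with NumPy as follows; this is a minimal illustration with a synthetic signal, not the authors' code.

```python
import numpy as np

def cross_correlation(f_sig, g_sig):
    """Equation (6): corr_{f,g} = IFFT(F . conj(G)) for two real signals."""
    n = len(f_sig)
    F = np.fft.rfft(f_sig)
    G = np.fft.rfft(g_sig)
    return np.fft.irfft(F * np.conj(G), n)

# Toy check: a signal correlated with a delayed copy of itself peaks at the lag.
rng = np.random.default_rng(1)
sig = rng.standard_normal(1024)
delayed = np.roll(sig, 7)  # f leads g by 7 samples (circularly)
corr = cross_correlation(delayed, sig)
print(int(np.argmax(corr)))  # 7
```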

3.1. Weighting function

This function depends on the pitch f. The cepstrum of the cross-correlation signal of one microphone pair is calculated and smoothed with a filter function G:

CEP = IFFT(log(|FFT(corr_{f1,g1})|)) ∗ G    (7)

The weighting function wv(f) emphasizes the correct probability regions and leads to a sharp pitch and position estimate:

wv(f) = CEP(sampleRate / f)    (8)

If the weighting were skipped, the PoPi-plane would contain several incorrect estimates corresponding to higher harmonics. For example, a speaker with a pitch of 100 Hz leads to at least two peaks, at 100 Hz and 200 Hz, both inside the pitch range of [80 Hz, 280 Hz].
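Equations (7) and (8) can be sketched as follows. The moving-average smoothing kernel and the impulse-train test signal are our choices; the paper does not specify the filter G.

```python
import numpy as np

fs = 44_100  # sample rate in Hz

def cepstral_weight(corr, pitches_hz, smooth_len=9):
    """Equations (7)-(8): cepstrum of a cross-correlation, smoothed,
    then sampled at the lag sampleRate / f for each pitch candidate."""
    spec = np.abs(np.fft.fft(corr))
    cep = np.fft.ifft(np.log(spec + 1e-12)).real  # Eq. (7), before smoothing
    kernel = np.ones(smooth_len) / smooth_len     # simple moving-average "G"
    cep = np.convolve(cep, kernel, mode="same")
    lags = (fs / pitches_hz).astype(int)          # Eq. (8): lag = sampleRate / f
    return cep[lags]

# Toy signal: impulse train with a 210 Hz period, standing in for corr_{f1,g1}.
period = fs // 210
corr = np.zeros(4096)
corr[::period] = 1.0

pitches = np.arange(80.0, 281.0)
wv = cepstral_weight(corr, pitches)
i210 = int(np.argmin(np.abs(pitches - 210.0)))
i150 = int(np.argmin(np.abs(pitches - 150.0)))
# The cepstral weight at the true 210 Hz pitch clearly exceeds an off-pitch value.
print(float(wv[i210]), float(wv[i150]))
```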

3.2. PoPi-plane

The following function is used in the update step of the particle filter described in Section 2:

p(φ, f) = wv(f) · Σ_{mp=0}^{MP} [ (1 / (2K+1)) · Σ_{k=−K}^{+K} corr_{f_mp,g_mp}(k · Ps(f) + Ds(φ)) ]    (9)
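Equation (9) can be sketched together with the helper functions Ps, Ds and K that are defined next. The geometry handling is simplified to a single microphone pair, and the Ds form follows the pair-angle variant of Equation (16); parameter values are those stated in the text.

```python
import math

fs = 44_100          # sampleRate in Hz
frame_len = 2048     # frameLen in samples
mic_dist = 0.4       # array diameter in meters
c = 343.0            # speed of sound in m/s

def Ps(f):
    """Eq. (10): correlation lag of one pitch period, in samples."""
    return fs / f

def Ds(phi_deg, mp_angle_deg=0.0):
    """Eq. (16)-style DoA lag for one microphone pair, in samples."""
    return mic_dist * math.cos(math.radians(mp_angle_deg - phi_deg)) * fs / c

def K(f):
    """Eq. (13): number of correlation peaks considered for pitch f."""
    return min(5, int(f * frame_len / fs) - 1)

def popi_score(corr, phi_deg, f, wv=1.0):
    """Eq. (9) for a single microphone pair: average the correlation at the
    DoA lag plus the first K pitch-period multiples, weighted by wv(f).
    Lags wrap circularly into the correlation vector."""
    k_max = K(f)
    total = 0.0
    for k in range(-k_max, k_max + 1):
        lag = int(round(k * Ps(f) + Ds(phi_deg))) % len(corr)
        total += corr[lag]
    return wv * total / (2 * k_max + 1)

print(K(100.0), K(280.0))  # 3 5: more pitch periods fit in a frame at higher pitch
```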

The probability for one ⟨φ, f⟩ tuple is obtained by summing up the beliefs (Equation 5) of all microphone pairs. MP is the number of microphone combinations and corr_{f_mp,g_mp} is the corresponding correlation vector. The correlation lag depends on φ and f; therefore the following two functions are introduced:

Ps(f) = sampleRate / f    (10)

The number of samples corresponding to the difference of arrival (DoA) of interest is given by Ds(φ):

Ds(φ) = frameLen · micDist · cos(φ_shifted) · sampleRate / (343 m/s)    (11)

Both functions can be stored in lookup tables to speed up the application. The speed of sound is approximately 343 m/s and micDist is 0.4 m with the given hardware setting. A frameLen of 2048 samples and a sampleRate of 44.1 kHz are used in the experiments.

The angle φ is estimated over the whole range of [0, 359]°:

φ_shifted = projection(φ − rotationAngle_mp)    (12)

with a given rotationAngle_mp corresponding to the hardware setting. The projection makes sure that the angle φ_shifted always stays in the range [0, 359]°. Finally, K is the number of considered correlation peaks and depends on the pitch frequency candidate f:

K = min(5, ⌊f · frameLen / sampleRate⌋ − 1)    (13)

4. STEERED-RESPONSE-POWER OBSERVATION MODEL (SRP-PHAT)

This simple source localization algorithm is explained by DiBiase et al. in [6] and also rests upon the cross-correlation (6). The method normalizes the cross-correlation in the frequency domain so that a single sharp peak remains:

corrSRP_{f_mp,g_mp}(i) = IFFT( F(i) · G*(i) / |F(i) · G*(i)| )    (14)

After the calculation of the modified cross-correlation (14), the maximum is restricted to 1. f_mp and g_mp are the corresponding microphone signals of one microphone pair. Now the probability for every DoA value is estimated:

p(φ) = Σ_{mp=0}^{MP} corrSRP_{f_mp,g_mp}(Ds(φ, mpAngle))    (15)

The function Ds(φ, mpAngle) is stored in a lookup table and depends on the microphone array diameter, the speed of sound, the sample rate, the microphone pair angle and, finally, on the given DoA of the source φ. The delays for all possible φ values in the whole range [0, 359]° are considered:

Ds(φ, mpAngle) = micDist · cos(mpAngle − φ) · sampleRate / (343 m/s)    (16)

The SRP-PHAT technique was also used to estimate the vertical angle to a sound source. For this extra task, only the time-difference-of-arrival lookup table has to be extended and, of course, all particles need one more property. The results are not as accurate as with the simple 2D SRP-PHAT. The state space is also bigger, and at least 2000 particles are necessary to approximate the probability function.

5. EXPERIMENTAL RESULTS

5.1. Static experiments
Noise effects and optimizations of the joint position and pitch extraction are explained well in [1]. This method works fine under good conditions, but is not useful for reverberant environments. Future work will look at employing filter banks on the microphone signals: instead of calculating just one cross-correlation between two signals, the performance could be improved by computing cross-correlations for 17 channels. Finally, the correlations are normalized and summed up. This sum for each microphone pair would replace Equation 5, and the following steps would stay the same. First tests showed accurate results, but the calculation effort is not acceptable for realtime applications.
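The filter-bank idea sketched above could look roughly like this. The brick-wall band split and the band layout are our assumptions; the paper only states that 17 channels were tried.

```python
import numpy as np

def banded_cross_correlation(f_sig, g_sig, n_bands=17):
    """Filter-bank variant of Eq. (6): split the cross-spectrum into bands,
    correlate per band, normalize each result, then sum over the bands."""
    n = len(f_sig)
    F = np.fft.rfft(f_sig)
    G = np.fft.rfft(g_sig)
    cross = F * np.conj(G)
    edges = np.linspace(0, len(cross), n_bands + 1).astype(int)
    total = np.zeros(n)
    for lo, hi in zip(edges[:-1], edges[1:]):
        band = np.zeros_like(cross)
        band[lo:hi] = cross[lo:hi]  # ideal (brick-wall) band split
        corr = np.fft.irfft(band, n)
        peak = np.max(np.abs(corr))
        if peak > 0:
            total += corr / peak    # normalize, then sum over bands
    return total

# The summed band correlations still peak at the true inter-channel lag.
rng = np.random.default_rng(2)
sig = rng.standard_normal(2048)
corr = banded_cross_correlation(np.roll(sig, 5), sig)
print(int(np.argmax(corr)))  # 5
```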

The methods were validated with challenging recordings under noisy conditions. Figures 5 and 6 demonstrate the accuracy of the PoPi and SRP-PHAT observation models with single speakers at fixed positions. Each set of three lines corresponds to 4, 8 and 16 microphones; normally the best result is obtained with 16 microphones (every third line). Each of the seven tests was computed 10 times to draw better conclusions. The last three tests are spoken words. The implemented joint position and pitch extraction method is not robust enough to solve the localization task, as shown in Figure 5: the recognition rate is around 5 percent. In contrast, the SRP-PHAT localization technique leads to rates beyond 80 percent, see Figure 6.

Fig. 5. Single speaker at fixed positions, PoPi model

Fig. 7. First row: moving speaker, vowels. Second row: moving speaker, sentence. Last row: one speaker at 180 degrees and one speaker moving, vowels. Used model: PoPi (the pitch values were not considered), no weight trace. The first column belongs to an arrangement of 4 microphones, the second to 8, while the results of the third column were generated with 16 microphones.

Fig. 6. Single speaker at fixed positions, SRP-PHAT model (weight trace of three)

The results are satisfying even with just 4 microphones, as one can see in Figure 6. With 16 microphones the exact position is sometimes missed by one degree, but not with 8; this effect disappears with an angle tolerance of two degrees. A simple energy-based voice activity detector was used to trigger the estimation process. A better VAD implementation could enhance the performance.

5.2. Dynamic experiments

More interesting are the next three dynamic experiments. In the first test, a person starts speaking at 90° and moves to 350°, speaking vowels. The next row shows a single person moving from 90° to 45°. This is the hardest test, because the person speaks a sentence during the movement and loses the direct path to the microphone array. Finally, one speaker stays at 180° speaking vowels while another speaker moves from 90° to 350°.

In Figure 7 some boxes are blank; in these cases the model did not detect a person. The tests with spoken vowels are sometimes correct. The first column belongs to an arrangement of 4 microphones, the second to 8, while the results of the third column were generated with 16 microphones. As mentioned in Section 2.4, a memory of stored weights for each particle increases the performance, but makes tracking multiple speakers harder. This is depicted in the first subplot of the second row of Figure 8. The version without the weight trace (Figure 8) is able to detect multiple speakers, but has more false positives; Figure 9 uses a weight trace of three. Figure 8 demonstrates the power of our framework, which is able to estimate the position of multiple moving speakers under noisy conditions with an error smaller than three degrees. Figures 9, 10 and 11 show the SRP-PHAT observation model results with a weighted trace of likelihoods. Figure 9 shows the detected speaker positions with 4 microphones and the additionally extracted pitch. The points are not widely spread because each test was run just three times and all detections are plotted in one image. The pitch extraction method from Section 3 is also used to estimate the pitch in Figures 10 and 11. The more microphones are considered, the better the estimation becomes.

5.3. DoA Model Comparison

The SRP-PHAT model is fast and robust with at least 4 microphones. This simple algorithm needs just a typical off-the-shelf

notebook to be computed in realtime, and it enables a wide range of new skills for a mobile robot. For instance, the robot could react to a speaking person who is not in the field of view. The PoPi technique, on the other hand, extracts position and pitch values jointly and delivers more information about the scene. Future work will tackle its two main problems: the calculation time has to be reduced, and the accuracy must meet the demands of a natural reverberant environment. Such a model could detect multiple speakers in realtime. Furthermore, the pitch information could enhance the speech recognition performance and boost the person detection rate of a visual system.

Fig. 8. First row: moving speaker, vowels. Second row: moving speaker, sentence. Last row: one speaker at 180 degrees and one speaker moving, vowels. Used model: SRP-PHAT (no weight trace). The first column belongs to an arrangement of 4 microphones, the second to 8, while the results of the third column were generated with 16 microphones.

Fig. 9. 4 microphones: First column: moving speaker, sentence. Second column: additionally extracted pitch. Method: SRP-PHAT

Fig. 10. 8 microphones: First column: moving speaker, sentence. Second column: additionally extracted pitch. Method: SRP-PHAT

Fig. 11. 16 microphones: First column: moving speaker, sentence. Second column: additionally extracted pitch. Method: SRP-PHAT

5.4. Hardware Settings

The RME Fireface soundcard works at 16 kHz and 16 bits per sample. All experiments were computed on a Samsung X20 notebook with 2 GB RAM and a 1.6 GHz CPU. Microsoft C# was chosen as the programming language. Cheap condenser microphones with a preamplifier were used to construct the microphone array with a diameter of 0.4 m. Guillaume Lathoud [7] uses a different setup with two microphone arrays; therefore it is not possible to compare the methods directly. In the same way, the result of M. Fallon et al. [8] is not comparable: they mounted several microphones on the walls of a reverberant room. The obtained result is similar to this work, but the tracking task is easier if the distance between the microphones is bigger. Our point of interest is a mobile and compact hardware setting.

6. CONCLUSION

A particle filter framework was introduced to approximate Bayesian probability distributions. The tracker performs well in realtime and allows several sensor measurements to be combined easily. The joint position and pitch extraction is, in its given form, not practicable, and research on extensions and improvements is ongoing. The traditional steered-response-power method leads to very good results, as shown in Section 5, but the pitch extraction is not involved in the tracking process and is therefore not very accurate. For a mobile application, four cheap standard microphones are enough to track multiple speakers in reverberant real office environments. Future work will consider a robust person tracker framework combining visual and acoustic sensor information.

7. REFERENCES

[1] M. Wohlmayr and M. Képesi, "Joint Position-Pitch Extraction from Multichannel Audio," Proc. Interspeech 2007, Antwerp, Belgium, August 2007.

[2] W. H. Abdulla and N. K. Kasabov, "Improving speech recognition performance through gender separation," Proc. ANNES, pp. 218-222, 2001.

[3] N. J. Gordon, D. J. Salmond and A. F. M. Smith, "Novel approach to nonlinear/non-Gaussian Bayesian state estimation," IEE Proceedings-F, vol. 140, no. 2, pp. 107-113, April 1993.

[4] D. Gatica-Perez, J. Odobez, S. Ba, K. Smith and G. Lathoud, "Tracking people in meetings with particles," Proc. Int. Workshop on Image Analysis for Multimedia Interactive Services (WIAMIS), invited paper, Montreux, April 2005.

[5] M. Képesi, M. Wohlmayr and T. Habib, "Pitch-Driven Position Estimation of Speakers in Multispeaker Environments," 3rd Congress of the Alps Adria Acoustics Association, Graz, Austria, September 2007.

[6] J. DiBiase, H. Silverman and M. S. Brandstein, "Robust localization in reverberant rooms," in Microphone Arrays: Signal Processing Techniques and Applications, M. S. Brandstein and D. B. Ward, Eds., pp. 131-154, Springer, New York, 2001.

[7] G. Lathoud, J.-M. Odobez and D. Gatica-Perez, "AV16.3: An Audio-Visual Corpus for Speaker Localization and Tracking," Proc. MLMI'04 Workshop, 2004.

[8] M. Fallon, S. Godsill and A. Blake, "Joint Acoustic Source Location and Orientation Estimation using Sequential Monte Carlo," Proc. Digital Audio Effects (DAFx), Montreal, Canada, September 2006.