A DSP Implementation of Source Location Using Microphone Arrays

Daniel V. Rabinkin, Richard J. Renomeron, Arthur Dahl, Joseph C. French, James L. Flanagan
Center for Computer Aids for Industrial Productivity (CAIP)
Rutgers University, P.O. Box 1390, Piscataway, NJ 08855-1390

Michael H. Bianchi
Bell Communications Research
445 South St., Morristown, NJ 07960-6438

ABSTRACT

The design, implementation, and performance of a low-cost, real-time DSP system for source location are discussed. The system consists of an 8-element electret microphone array connected to a Signalogic DSP daughterboard hosted by a PC. The system determines the location of a speaker in the audience of an irregularly shaped auditorium. The auditorium presents a non-ideal acoustical environment: some of the walls are acoustically treated, but significant reverberation remains, along with a large amount of low-frequency noise from fans in the ceiling. The source location algorithm is implemented as a two-step process. The first step determines the time delay of arrival (TDOA) for selected microphone pairs; a modified version of the cross-power spectrum phase method, implemented on the DSP daughterboard, is used to compute the TDOAs. The second step uses the computed TDOAs in a least mean squares gradient descent search, implemented on the PC, to compute a location estimate.

Keywords: microphone arrays, speech, source location, real-time processing

1. INTRODUCTION

In a teleconferencing environment, it is often necessary to focus audio and video sensors on different locations as participants contribute to the discussion. One approach to this problem is to provide a hand-held or body-worn microphone to each potential speaker to capture sound, and to use a set of human-controlled cameras to provide video. However, this requires extensive wiring and long setup time, and is potentially inconvenient to the participants of the teleconference. An alternative method is to use microphone array processing techniques to determine the location of the active speaker. This information can be used to automatically focus audio and video devices without the need for camera operators and unwieldy sound equipment. An implementation of a source location system using inexpensive and readily available hardware components is described in this paper.

Brandstein [1] classifies approaches to the source location problem into three classes: steered-beamformer locators, high-resolution spectral estimation locators, and time delay of arrival (TDOA) locators. Steered-beamformer locators scan the space of interest with an acoustic beam and find the region that produces the highest beam output energy. This method is very simple to implement and was used exclusively in early analog array systems [2]. However, it suffers from extremely poor spatial resolution and poor response time, and it becomes impractical when a large number of discrete locations are to be scanned or when continuous spatial resolution is desired. High-resolution spectral estimation locators [3] are based on a spectral phase correlation matrix derived for all elements in the array and estimated from the captured data. The spectral estimation techniques attempt to perform a "best" statistical fit for source location using this matrix. Although numerous techniques exist, most are limited in application to narrowband signals and regularly spaced sensor arrays. These techniques also tend to rely on a high degree of statistical stationarity in the signal and an absence of reflections and interfering sources. Most applications of spectral estimation locators occur in the radar domain; they are not general and robust enough to be applicable to wideband acoustic signals, and they tend to be computationally intensive, making them poorly suited for real-time applications.

The TDOA technique has been the technique of choice in recent digital systems [4,5]. It is based on evaluating the delay of arrival of the sound wavefront between a given microphone pair. A set of delays is computed for chosen microphone pairs, and the source geometry is derived from these delays. The delay-of-arrival technique is essentially the spectral estimation technique reduced to a two-sensor matrix. The system described in this paper uses a modification of the method developed by Omologo and Svaizer [6] to determine delays between selected microphone pairs. After the delay estimates have been computed, a nondirected gradient descent search is performed over all possible locations to find the best match for a sound source location based on these delays. The system then directs a computer-controlled camera at the source of sound.

2. DELAY ESTIMATION ALGORITHM

Given a single source of sound that produces a time-varying signal x(t), each microphone in the array receives the signal

    m_i(t) = \alpha_i x(t - t_i) + n_i(t)                                                (1)

where i is the microphone number, \alpha_i is an attenuation factor, t_i is the time it takes sound to propagate from the source to microphone i, and n_i(t) is the noise signal present at microphone i. The time delay of arrival (TDOA) for a given microphone pair i, k is defined as:

    D_{ik} = t_i - t_k                                                                   (2)

The goal of the first step of the source location process is to determine D_{ik} for some subset of microphone pairs. The Fourier transform of the captured signal can be expressed as:

    m_i(t) \leftrightarrow M_i(\omega) = \alpha_i X(\omega) e^{-j\omega t_i} + N_i(\omega)    (3)

where X(\omega) is the Fourier transform of the source signal x(t).

It can be assumed that the average energy of the captured signals is significantly greater than that of the interfering noise, namely that

    \alpha_i^2 |X(\omega)|^2 \gg |N_k(\omega)|^2   \quad \text{for all } i, k             (4)

In addition, the signal-to-noise ratio (SNR) of the captured signal is defined as:

    SNR_i(\omega) \triangleq 10 \log \left( \frac{\alpha_i^2 |X(\omega)|^2}{|N_i(\omega)|^2} \right)    (5)

If the condition in (4) is true, the cross-correlation of m_i(t) and m_k(t),

    R_{ik}(\tau) = \int_{-\infty}^{\infty} m_i(t)\, m_k(t - \tau)\, dt                    (6)

can be expected to attain its maximum at \tau = D_{ik}. The frequency-domain representation of (6) is

    R_{ik}(\tau) \leftrightarrow S_{ik}(\omega) = M_i(\omega) M_k^*(\omega)               (7)

and may be expanded using (3):

    S_{ik}(\omega) = (\alpha_i X(\omega) e^{-j\omega t_i} + N_i(\omega))(\alpha_k X^*(\omega) e^{j\omega t_k} + N_k^*(\omega))
                   = \alpha_i \alpha_k |X(\omega)|^2 e^{-j\omega (t_i - t_k)} + N_i(\omega) N_k^*(\omega)
                     + \alpha_i X(\omega) e^{-j\omega t_i} N_k^*(\omega) + \alpha_k X^*(\omega) e^{j\omega t_k} N_i(\omega)    (8)

The last three terms of (8) are negligible compared to the first term, based on the assumption in (4). Expression (8) then reduces to:

    S_{ik}(\omega) \approx \alpha_i \alpha_k |X(\omega)|^2 e^{-j\omega D_{ik}}            (9)

Thus D_{ik} can be found by evaluating

    D_{ik} = \arg\max_\tau R_{ik}(\tau) = \arg\max_\tau \mathcal{F}^{-1}\{S_{ik}(\omega)\}    (10)

where \mathcal{F}^{-1} denotes the inverse Fourier transform.

2.1. Cross-Power Spectrum Phase (CPSP)

Given no a priori statistics about the source signal and the interfering noise, it can be shown [7] that the optimal approach for estimating D_{ik} is to whiten the cross-correlation by normalizing it by its magnitude:

    \frac{S_{ik}(\omega)}{\alpha_i \alpha_k |X(\omega)|^2} = e^{-j\omega D_{ik}} \leftrightarrow \delta(\tau - D_{ik})    (11)

Since \alpha_i \alpha_k |X(\omega)|^2 in (11) is not known, the approximation in (4) may again be applied, and it is possible to normalize by the product of the magnitudes of the captured signals. The resulting function is defined as the cross-power spectrum phase (CPSP) function:

    CPSP_{ik}(\omega) \triangleq \frac{M_i(\omega) M_k^*(\omega)}{|M_i(\omega)|\, |M_k(\omega)|}    (12)

and its time-domain counterpart

    cpsp_{ik}(\tau) \triangleq \mathcal{F}^{-1}\left\{ \frac{\mathcal{F}\{m_i(t)\}\, \mathcal{F}^*\{m_k(t)\}}{|\mathcal{F}\{m_i(t)\}|\, |\mathcal{F}\{m_k(t)\}|} \right\}    (13)

It is seen from (11) that the output of the CPSP function is delta-like with a peak at \tau = D_{ik}.

The analysis above is derived for analog signals and stationary {D_{ik}}. For digital processing, the m_i(t) are sampled and become discrete sequences m_i(n). Also, the {D_{ik}} are not stationary, since the source of sound is liable to change location, and finite analysis frames are required by computational constraints. Hence the usual windowing techniques are applied and the sampled signals are broken into analysis frames. After conversion into the discrete finite-sequence domain, (13) becomes:

    cpsp_{ik}(n) = IDFT\left\{ \frac{DFT\{m_i(n)\}\, DFT^*\{m_k(n)\}}{|DFT\{m_i(n)\}|\, |DFT\{m_k(n)\}|} \right\}    (14)

2.2. Modified CPSP

The whitening of the cross-correlation spectrum is based on the assumption that the statistical behavior of both signal and noise is uniform across the entire spectrum; more specifically, it is assumed that the approximation in (4) is equally valid for all \omega. In practice, the SNR level varies with \omega. Untreated enclosures commonly contain large amounts of acoustic noise below 200 Hz, so it is desirable to discard the portion of the CPSP below that frequency. Also, given that the source component x(t) provides the dominant share of the overall energy in m_i(t), a higher SNR can be expected at frequencies where the magnitude of M_i(\omega) is greater. It is therefore desirable to weight the portions of S_{ik}(\omega) with greater magnitude more heavily. An alternative to (12) is proposed that performs only a partial whitening of the spectrum:

    CPSP^{\gamma}_{ik}(\omega) \triangleq \frac{M_i(\omega) M_k^*(\omega)}{(|M_i(\omega)|\, |M_k(\omega)|)^{\gamma}}, \qquad 0 \le \gamma \le 1    (15)

A discrete equivalent of the modified CPSP is expressed as:

    cpsp^{\gamma}_{ik}(n) = IDFT\left\{ \frac{DFT\{m_i(n)\}\, DFT^*\{m_k(n)\}}{(|DFT\{m_i(n)\}|\, |DFT\{m_k(n)\}|)^{\gamma}} \right\}, \qquad 0 \le \gamma \le 1    (16)

Setting \gamma to zero produces the unnormalized cross-correlation, while setting \gamma to one recovers (14). A good value of \gamma may be determined experimentally, and varies with the characteristics of the room noise and the acoustical reflectivity of the room walls. A value of about 0.75 was found to be optimal for several different enclosures (see Section 5).
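As a concrete illustration, the discrete modified CPSP together with the peak search of (10) can be sketched in a few lines of NumPy. This is a minimal sketch, not the DSP32C implementation: the exponent gamma is the partial-whitening parameter of Section 2.2, and the frame length and sampling rate below simply match the values used later in Section 4.

```python
import numpy as np

def modified_cpsp_tdoa(mi, mk, gamma=0.75, fs=16000):
    """TDOA estimate for one frame pair via the modified CPSP (Eq. 16);
    gamma=1 gives the plain CPSP (Eq. 14), gamma=0 the unnormalized
    cross-correlation."""
    n = len(mi)
    Mi = np.fft.rfft(mi)
    Mk = np.fft.rfft(mk)
    cross = Mi * np.conj(Mk)                 # DFT{m_i} DFT*{m_k}
    denom = (np.abs(Mi) * np.abs(Mk)) ** gamma
    denom[denom == 0.0] = 1.0                # guard against empty bins
    cpsp = np.fft.irfft(cross / denom, n)    # delta-like, peaks at the delay
    lag = int(np.argmax(cpsp))               # peak search of Eq. (10)
    if lag > n // 2:                         # wrap circular lags to negative
        lag -= n
    return lag / fs
```

For a noise-free circularly shifted frame the peak falls exactly on the true delay; with real microphone signals the peak is blurred by noise and reverberation, which is what motivates the partial whitening.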

3. SEARCH ALGORITHM

The goal of the search algorithm is to determine the location of the sound source from a selected set of TDOAs. Consider a sound source s with coordinates s = {x_s, y_s, z_s}, and a microphone pair m_1, m_2 with coordinates m_1 = {x_{m1}, y_{m1}, z_{m1}} and m_2 = {x_{m2}, y_{m2}, z_{m2}}, as shown in Figure 1. Sound takes time t_1 to propagate from s to m_1 and time t_2 to propagate from s to m_2. A given propagation time t_i may be computed by the expression:

    t_i = \frac{d(m_i, s)}{V_{sound}} = \frac{\sqrt{(x_{m_i} - x_s)^2 + (y_{m_i} - y_s)^2 + (z_{m_i} - z_s)^2}}{V_{sound}}    (17)

where V_{sound} is the speed of sound, approximately 340 meters/second at normal temperature and pressure, and d(\cdot) is the measure of physical distance.
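A quick numerical check of (17) and the TDOA definition (2) is sketched below for a hypothetical microphone pair 30 cm apart; the coordinates are invented for illustration, not taken from the CAIP array geometry.

```python
import numpy as np

V_SOUND = 340.0  # speed of sound (m/s) at normal temperature and pressure

def propagation_time(mic, src):
    """t_i = d(m_i, s) / V_sound, per Eq. (17)."""
    diff = np.asarray(mic, dtype=float) - np.asarray(src, dtype=float)
    return float(np.linalg.norm(diff)) / V_SOUND

# Hypothetical pair 30 cm apart, source a couple of metres away.
m1, m2 = (0.0, 0.0, 1.0), (0.3, 0.0, 1.0)
s = (2.0, 1.5, 1.2)
d12 = propagation_time(m1, s) - propagation_time(m2, s)  # TDOA of Eq. (2)
```

Note that |D_12| can never exceed the pair spacing divided by V_sound (here 0.3/340 s, about 0.88 ms), which is also a useful sanity bound on measured TDOAs.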


Figure 1. Sound propagation from source to microphones.

3.1. Estimation of Source Position

The TDOA estimate computed for the pair m_1, m_2 defines the difference t_1 - t_2. This difference in turn defines a surface H: all points on H have associated propagation times t'_1, t'_2 for which the difference t'_1 - t'_2 is equal to the TDOA estimate D_{12}, and all points lying on this surface are potential locations of the source s. The surface is one sheet of a three-dimensional hyperboloid defined by the parametric equation:

    \frac{d(p, m_1) - d(p, m_2)}{V_{sound}} = t_1 - t_2 = D_{12}    (18)

where p = {x_p, y_p, z_p} defines a point on H. A given set of TDOA estimates {D_{ik}} has an associated set of hyperboloids {H_{ik}}. The location of the sound source must be a point p that lies on every H_{ik} and satisfies the set of associated parametric equations:

    \frac{d(p, m_1) - d(p, m_2)}{V_{sound}} = D_{12}
    \vdots
    \frac{d(p, m_i) - d(p, m_k)}{V_{sound}} = D_{ik}
    \vdots                                                           (19)

Thus a set of three {D_{ik}} uniquely specifies the coordinates of the source (barring degenerate cases). For sets of four or more {D_{ik}}, a solution may exist only in a least mean square (LMS) sense. Approximate closed-form solutions of (19) exist in the two-dimensional case† [6]. In the three-dimensional case, an approximate solution may be obtained when special restricted arrangements of the microphone pairs are used [8]. If three-dimensional resolution is required for arbitrary microphone arrangements, a closed-form solution does not exist, and numerical methods must be used. Given a set of computed TDOAs {D_{ik}}, we seek a point p' with an associated set of TDOAs {D'_{ik}} such that the error between {D_{ik}} and {D'_{ik}} is minimized:

    E = \sum_{\text{all } i,k} (D_{ik} - D'_{ik})^2                  (20)

p' is defined as:

    p' = \arg\min_p E                                                (21)

A nondirected gradient descent search is used to find p' [9]. A variable step size is used to speed up the search. The error function E tends to be quite smooth; virtually no instances of search failure due to local minima were detected over the course of the work presented here.
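The search over (20) and (21) can be sketched as below. This is a minimal illustration of a nondirected, variable-step gradient descent with a finite-difference gradient; the step-size schedule, tolerances, and microphone geometry in the example are illustrative choices, not the values used in the original system.

```python
import numpy as np

V_SOUND = 340.0  # m/s

def expected_tdoas(p, pairs):
    """D'_ik(p) for each microphone pair, from the times of Eq. (17)."""
    return np.array([(np.linalg.norm(p - mi) - np.linalg.norm(p - mk)) / V_SOUND
                     for mi, mk in pairs])

def lms_error(p, pairs, tdoas):
    """E of Eq. (20): squared mismatch between measured and predicted TDOAs."""
    return float(np.sum((tdoas - expected_tdoas(p, pairs)) ** 2))

def locate(pairs, tdoas, p0, step=0.5, iters=500, eps=1e-4):
    """Variable-step gradient descent toward p' of Eq. (21)."""
    p = np.asarray(p0, dtype=float)
    e = lms_error(p, pairs, tdoas)
    for _ in range(iters):
        # Forward-difference estimate of the gradient of E at p.
        g = np.array([(lms_error(p + eps * d, pairs, tdoas) - e) / eps
                      for d in np.eye(3)])
        gn = np.linalg.norm(g)
        if gn == 0.0:
            break
        trial = p - step * g / gn
        et = lms_error(trial, pairs, tdoas)
        if et < e:
            p, e = trial, et
            step *= 1.2      # grow the step while descent keeps succeeding
        else:
            step *= 0.5      # shrink it after an overshoot
            if step < 1e-5:
                break
    return p
```

With microphones placed at the corners of a 2 m square and noise-free TDOAs, the search converges to the true source position to within a few millimetres; note that with a coplanar array the coordinate normal to the array plane is ambiguous, which is one reason degenerate geometries must be avoided.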

3.2. False TDOA Rejection

It is found that the TDOA estimates described in Section 2 are produced with a small but finite probability of significant error. Figure 2 shows a simplified two-dimensional case where an incorrect TDOA estimate has been computed.
Figure 2. Search with an incorrect TDOA estimate: the initial position estimate P', based on the bad TDOA estimate D_78 (hyperboloid H'_78), differs from the correct position estimate P.

Since (19) is overdetermined, we may prune equations from the system and still be able to find a suitable p'. We wish to develop a procedure to remove a bad TDOA estimate D_bad from the set of {D_{ik}}. After the search reaches an error minimum, each D_{ik} has an associated contribution to the error sum:

    E_{ik} = (D_{ik} - D'_{ik})^2    (22)

If all but one of the TDOA estimates are good, we may expect the position estimate p' of the first search to be fairly close to the correct p. We can therefore expect the error E_bad associated with D_bad to be larger than all the other {E_{ik}}, and the overall error for the given set of TDOA estimates to be large. A flowchart of a strategy for dealing with incorrect delay estimates is given in Figure 3. If, after the search converges, the overall error exceeds a set threshold, the equation associated with the D_{ik} that contributes the most to the error sum is removed from (19), and a new search is performed on the reduced system. This procedure may be repeated several times to remove multiple bad TDOA estimates. However, the reliability of the resulting position estimates decreases with each iteration; the probability that a given D_{ik} is correct must be high in order for the rejection procedure to be used effectively.

Figure 3. Flowchart of the procedure to reject incorrect TDOA estimates: the system of equations for the 12 TDOA estimates is evaluated; if the error is below the threshold the position is output; otherwise the TDOA with the biggest error contribution is dropped and the search is repeated. The frame is rejected if too few TDOAs are left (inconsistent TDOAs) or if the real-time constraint is exceeded (time out).

† Two-dimensional refers to the case where all microphones, and the sound source, are assumed to lie on a common plane.

4. IMPLEMENTATION

The CAIP Source Location System, shown in Figure 4, consists of an 8-element microphone array connected to a Signalogic SIG32C-8 DSP board hosted by a generic 66 MHz 486DX2-based PC, and a Canon VC-C1 camera connected to the serial port of the PC. The software for the system consists of two parts. The first is a Microsoft Windows-based program that controls the DSP, computes the source position from the TDOA estimates, and controls the camera. The second is a DSP32C program that performs A/D conversion and calculates the TDOA estimates in real time. Communication between the DSP board and the host PC is handled via Signalogic-developed Dynamic Link Library (DLL) routines.

Sound is captured using the 8 microphones and digitized by the DSP board at 16 kHz. The TDOA estimation is performed at most twice per second to satisfy real-time computational constraints. A speech activity detector is used to determine whether the incoming signal is speech and should be processed. The detector analyzes the signal from one of the microphones in the array. The signal is highpass filtered at 200 Hz to remove room-mode noise. The output of the filter is then rectified and applied to three separate integrators with different rates of integration. The slowest integrator is used to determine the noise floor. If the outputs of both the medium-rate integrator and the fast integrator exceed the noise floor, and the output level of the fast integrator exceeds that of the slow integrator during intervals of time, the signal is deemed to be speech. The excess of the fast integrator over the medium integrator indicates variation of signal energy, which distinguishes speech from steady mechanical noise. The weights with which the integrator outputs are compared, as well as the coefficients of integration, were determined empirically.

Once speech activity is detected, a 1024-sample frame of data is gathered from the eight channels, and an in-place FFT is performed on all 8 channels.
The spectrum is preemphasized to mitigate low-frequency noise from ventilation systems and other ambient room noise, and a cross-power spectrum estimate for each pair is then computed according to (14). The program running on the DSP32C is capable of providing a set of 12 TDOA estimates approximately twice per second. When it completes calculation of a TDOA estimate set, the DSP32C sends an interrupt to the controlling DLL, which in turn sends a Windows message to the PC-side program to pick up the data. The estimated position of the source is then calculated using the algorithm discussed in Section 3.

Figure 4. The CAIP Source Location System: the 8-element microphone array feeds the Signalogic DSP board (AT&T DSP32C), which calculates the TDOA estimates; over the PC bus, the DSP host PC calculates the source coordinates from the TDOA estimates, calculates pan and tilt values, and controls the camera through a serial port.

The TDOA estimates obtained by the DSP32C program will not always lead to a valid position estimate, because reflections and noise cause inaccuracies in the TDOA estimation. To guard against pointing the camera toward an incorrect location, the PC-side program assumes that the position estimate is incorrect (and does not move the camera) if the calculated error metric exceeds a certain threshold, if more than three TDOA estimates are dropped, or if the search does not find an optimal position estimate after 45 iterations. In addition, the last "good" position estimate is used as the starting point for the search on the next set of TDOA estimates. This allows the algorithm to "home in" on a source if the initial location estimate is close to the actual position of the source but was rejected because of excessive error or iteration count, and it keeps incorrect position estimates from influencing future position estimates. If a correct position estimate is obtained, the camera is aimed toward the position by sending pan and tilt values to the camera via the PC's serial port. The pan and tilt values are computed from the position coordinates using a simple geometric formula, and are sent to the camera via an off-the-shelf driver for the VC-C1 camera. After this is completed, the PC-side program waits for another data-ready message from the DSP board communication DLL.
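The three-integrator speech activity test described above can be sketched as below. The 200 Hz highpass cutoff follows the text, but the integrator time constants, the comparison margins, and the one-pole filter form are illustrative placeholders: the paper states that the actual coefficients were determined empirically and does not give them.

```python
import numpy as np

def leaky_integrate(x, alpha):
    """First-order leaky integrator: y[n] = alpha*y[n-1] + (1-alpha)*x[n]."""
    y = np.empty_like(x)
    acc = 0.0
    for i, v in enumerate(x):
        acc = alpha * acc + (1.0 - alpha) * v
        y[i] = acc
    return y

def speech_active(frame, fs=16000):
    """Sketch of the three-integrator activity test on one frame."""
    # Crude one-pole highpass at ~200 Hz to suppress room-mode noise.
    a = np.exp(-2.0 * np.pi * 200.0 / fs)
    hp = np.empty_like(frame)
    prev_x = prev_y = 0.0
    for i, v in enumerate(frame):
        prev_y = a * (prev_y + v - prev_x)
        prev_x = v
        hp[i] = prev_y
    rect = np.abs(hp)                      # rectify the filter output
    slow = leaky_integrate(rect, 0.9999)   # noise-floor tracker
    med = leaky_integrate(rect, 0.999)
    fast = leaky_integrate(rect, 0.99)
    floor = slow[-1]
    # Speech: both faster integrators rise above the noise floor, and the
    # fast integrator exceeds the slow one (i.e. the energy is varying).
    return bool(fast[-1] > 2.0 * floor and med[-1] > 2.0 * floor
                and fast[-1] > slow[-1])
```

On a frame where a tone has just turned on, the fast integrator rises well above the slowly tracked floor and the test fires; on silence it does not.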


Figure 5. The 3 m × 4 m room used to test the source location system, showing the array microphones and the recording locations.

5. PERFORMANCE

5.1. Offline Results

The algorithms described in Sections 2 and 3 were implemented offline and run on data recorded in real rooms with the microphone array. This enabled convenient evaluation of the various aspects of the source location algorithm. The performance of the algorithm was evaluated in two acoustical settings:

1. Regular room - partially foam-padded (around the array), but with most of the wall area, as well as the floor and ceiling, untreated (Figure 5). The room is about 3 × 4 meters in size.

2. Reverberant auditorium - a highly reverberant and completely untreated auditorium. All walls are untreated cement, and the ceiling has large metal beams running down its length (Figure 6). The auditorium is about 10 × 15 meters in size.

The performance of the TDOA estimator was evaluated in both settings. The sentence "I have a question. How much are tomatoes?" was uttered by several speakers from 25 locations representing a fair coverage of the room floor space (Figure 5). Frames of the captured signal that were detected to contain speech were processed, and TDOA estimates were computed using the previously described methods. The estimates were compared to those that should have been obtained based on the known positions of the talkers. A similar procedure was performed for the auditorium (Figure 6) with 20 positions covering the floor space. A resulting TDOA estimate was deemed incorrect if it differed from the expected TDOA by more than a preset error bound. The percentage of all TDOAs that exceeded the allowed error is given in Figure 7. Plots are given for the regular room and the auditorium, with error curves for an error bound of one sample and of two samples.† The error rate is plotted against the whitening parameter γ defined in equation (16). It is observed that error is minimized when γ is set to 0.75 in the regular room, and 0.8 in the reverberant room. The overall performance of the TDOA estimation algorithm is excellent in the benign regular-room environment, and suffers considerably in the harsh reverberant environment of the auditorium.

† At a sampling rate of 16 kHz, one sample represents an estimate error of 62.5 μs. With a microphone pair spacing of 30 cm this represents a broadside deviation of about 3°.


Figure 6. The 10 m × 15 m auditorium used to test the source location system, showing the array microphones and the recording locations.

Figure 7. Error in TDOA estimates versus the whitening parameter γ (0.65 to 1.0), for the highly reverberant 10 × 15 m auditorium (top) and the regular 3 × 4 m room (bottom). Curves are shown for error bounds of one sample (x) and two samples (o).

5.2. Real-Time Results

The system has been tested in the rooms shown in Figures 5 and 6, and in an 8 m × 5 m conference room at Bellcore that was significantly less reverberant than the auditorium shown in Figure 6. The unmodified CPSP expression (14) and simple energy detection were used in this implementation. In both the small room (Figure 5) and the Bellcore conference room, the system performed moderately well, taking at most 2 seconds to aim the camera at a speaker, provided the speaker was talking at a moderate level. The position estimates deemed correct by the search algorithm were within 30 cm of the actual position of the speaker. In addition, the camera was never moved to point at an incorrect position; that is, every location estimate deemed correct by the system was truly correct. These results, especially in the Bellcore conference room, were very encouraging. However, the performance in the highly reverberant auditorium (Figure 6) was quite poor. While the camera was never aimed erroneously, the system often failed to find a correct location estimate, and the speaker needed to shout to make the camera move. Examination of the TDOA estimates produced by the program revealed an excessive number of incorrect values. The probable cause is the considerable amount of reverberation in the room, which makes the TDOA estimation algorithm less reliable; this is consistent with the offline results described above.

6. CONCLUSION

A low-cost, real-time system for source location using microphone arrays was presented. The system is capable of determining the location of a sound source and controlling a camera in real time. Using the cross-power spectrum phase (CPSP) algorithm, the TDOAs for the different microphone pairs are estimated. These TDOA estimates are then transferred to the PC, where they are used to calculate the position of the source. Once the position of the source has been obtained, the appropriate parameters are sent to the camera. The real-time system has been tested successfully in a typical medium-sized conference room with a moderate amount of reverberation, and in a small acoustically treated room. It performed poorly in an extremely reverberant auditorium, but it is believed that the performance can be improved somewhat by implementing the speech detector described in Section 4 and by changing the CPSP algorithm to emphasize portions of the spectrum where there is high speech energy.

Future work will include refining the current system for better performance and extending it to include spatially selective sound capture. The speech detector and the modified CPSP expression (16) will be added to the real-time implementation and should mitigate some of the problems caused by reverberation. Furthermore, the output of the source location system can be used to steer a beamformer toward the sound source, thus allowing focused audio to accompany the focused video.

REFERENCES

1. M. Brandstein. A framework for speech source localization using sensor arrays. Doctoral dissertation, Brown University, May 1995.
2. J. Flanagan, J. Johnson, R. Zahn, and G. Elko. Computer-steered microphone arrays for sound transduction in large rooms. Journal of the Acoustical Society of America, pages 1508-1518, November 1985.
3. D. Johnson and D. Dudgeon. Array Signal Processing: Concepts and Techniques. Prentice Hall, first edition, 1993.
4. M. S. Brandstein, J. E. Adcock, and H. F. Silverman. A practical time-delay estimator for localizing speech sources with a microphone array. Computer, Speech, and Language, 9(2):153-169, April 1995.
5. M. Omologo and P. Svaizer. Use of the crosspower-spectrum phase in acoustic event localization. Technical Report No. 9303-13, IRST, Povo di Trento, Italy, March 1993.
6. M. Omologo and P. Svaizer. Acoustic event localization using a crosspower-spectrum phase based technique. In Proceedings of ICASSP-94, Adelaide, Australia, 1994.
7. C. Knapp and G. Carter. The generalized correlation method for estimation of time delay. IEEE Transactions on Acoustics, Speech, and Signal Processing, Vol. ASSP-24, No. 4, August 1976.
8. M. S. Brandstein, J. E. Adcock, and H. F. Silverman. A closed-form method for finding source locations from microphone-array time-delay estimates. In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, pages 3019-3022, Detroit, MI, May 8-12, 1995.
9. W. Press, S. Teukolsky, W. Vetterling, and B. Flannery. Numerical Recipes in C: The Art of Scientific Computing. 2nd edition, Cambridge University Press, 1992.

Acknowledgements

This work has been supported by a contract with Bellcore, and by NSF Grant No. MIP-9314625 and ARPA Contract DABT63-93-C0037.
