A LOCALIZATION-ERROR-BASED METHOD FOR MICROPHONE-ARRAY DESIGN Michael S. Brandstein John E. Adcock Harvey F. Silverman Laboratory for Engineering Man/Machine Systems Division of Engineering Brown University Providence, RI 02912
ABSTRACT
This paper presents a means for predicting the error region associated with a speech-source location estimate obtained from a set of microphones in a room environment. The error predictor presented is derived assuming a specific source-sensor geometry consisting of pairs of closely-spaced sensors for which a delay estimate associated with the potential source has been evaluated. The accuracy of the predictor is evaluated through a set of Monte Carlo simulations, and an application of the predictor to microphone-array design in the context of a video-teleconferencing scenario is presented.

1. INTRODUCTION
A fundamental requirement of microphone-array systems for speech acquisition is the ability to locate and track a speech source. Accordingly, various methods for talker localization in a room environment have been investigated [1, 2, 3, 4]. For audio-based applications, an accurate fix on the primary talker, as well as knowledge of any interfering talkers or coherent noise sources, is necessary to effectively steer the array, enhancing a given source while simultaneously attenuating those deemed undesirable. Given a source-location estimate, a measure of the region of spatial uncertainty related to the estimate is essential before the information can be judiciously employed in a practical application. Knowledge of the error region is useful for designing the array beam pattern: it allows for a sufficiently wide beam to avoid audio dropout due to misaim while simultaneously restricting the beam size to provide maximum attenuation of interfering sources. Similarly, when pointing a video camera, this error information is useful for selecting camera magnification, permitting close-up views of a talker without overzooming. As will be discussed below, this error measure may also be used to aid in array design. A method is presented to predict the error region associated with a source-location estimate. The accuracy of the predictor is illustrated through Monte Carlo simulations and an application of the predictor to microphone-array design is presented.

This work was funded by NSF grants MIP-9314625, MIP-9120843, and MIP-9509505.

2. SOURCE LOCALIZATION PROBLEM
The localization problem addressed here may be stated as follows. There are $N$ pairs of sensors $m_{i1}$ and $m_{i2}$ for $i \in [1, N]$. The ordered triplets $(x, y, z)$ of spatial coordinates for the sensors will be denoted by $\mathbf{m}_{i1}$ and $\mathbf{m}_{i2}$, respectively. For each sensor pair, a time difference of arrival (TDOA) estimate, $\tau_i$, for a signal source located at $\mathbf{s}$ is available. The true TDOA associated with a source, $\mathbf{s}$, and the $i$th sensor pair is given by:
$$T(\{\mathbf{m}_{i1}, \mathbf{m}_{i2}\}, \mathbf{s}) = \frac{|\mathbf{s} - \mathbf{m}_{i1}| - |\mathbf{s} - \mathbf{m}_{i2}|}{c} \tag{1}$$

where $c$ is the speed of propagation in the medium. In practice, $\tau_i$ represents a corrupted version of the true TDOA and in general, $\tau_i \neq T(\{\mathbf{m}_{i1}, \mathbf{m}_{i2}\}, \mathbf{s})$. In addition to the $\tau_i$, a variance estimate, $\sigma_i^2$, associated with each TDOA is also assumed to be available as a byproduct of the time-delay estimation procedure [5]. Given these $N$ sensor-pair, TDOA-estimate combinations:

$$\{\mathbf{m}_{i1}, \mathbf{m}_{i2}\},\ \tau_i,\ \sigma_i^2 \quad \text{for } i = 1, \ldots, N$$

it is desired to obtain an estimate of the source-location, $\hat{\mathbf{s}}$. If the TDOA estimates are assumed to be independently corrupted by additive zero-mean Gaussian noise, the Maximum Likelihood (ML) estimate $\hat{\mathbf{s}}_{ML}$ is found through minimization of a least-squares error criterion [6] denoted here by $J_{ML}(\hat{\mathbf{s}})$:

$$\hat{\mathbf{s}}_{ML} = \arg\min_{\hat{\mathbf{s}}} J_{ML}(\hat{\mathbf{s}}) \tag{2}$$

where

$$J_{ML}(\hat{\mathbf{s}}) = \sum_{i=1}^{N} \frac{1}{\sigma_i^2} \left[\tau_i - T(\{\mathbf{m}_{i1}, \mathbf{m}_{i2}\}, \hat{\mathbf{s}})\right]^2. \tag{3}$$
While other localization error criteria are amenable to this localization problem and have been shown to yield advantageous results under the least favorable conditions [1], only the ML error criterion will be considered here.
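As an illustration of Equations (1) and (3), the following minimal Python sketch evaluates the true TDOA for a sensor pair and the ML criterion for a candidate location. The function names `tdoa` and `j_ml`, the speed of sound of 343 m/s, the two-pair geometry, and the 20 microsecond TDOA standard deviation are all assumptions chosen for the example and do not come from the paper.

```python
import numpy as np

C = 343.0  # assumed speed of sound in air (m/s)

def tdoa(m1, m2, s, c=C):
    """True TDOA of Eq. (1): (|s - m1| - |s - m2|) / c."""
    return (np.linalg.norm(s - m1) - np.linalg.norm(s - m2)) / c

def j_ml(s_hat, pairs, tau, sigma2, c=C):
    """ML criterion of Eq. (3) evaluated at a candidate location s_hat."""
    predicted = np.array([tdoa(m1, m2, s_hat, c) for m1, m2 in pairs])
    return float(np.sum((tau - predicted) ** 2 / sigma2))

# Hypothetical geometry: two sensor pairs and one source (metres).
pairs = [
    (np.array([0.0, 0.0, 1.0]), np.array([0.5, 0.0, 1.0])),
    (np.array([0.0, 3.0, 1.0]), np.array([0.0, 3.5, 1.0])),
]
source = np.array([2.0, 2.0, 1.5])
sigma2 = np.full(len(pairs), (20e-6) ** 2)  # assumed TDOA variances (s^2)

# Noise-free TDOAs: the criterion vanishes at the true source location.
tau_true = np.array([tdoa(m1, m2, source) for m1, m2 in pairs])
print(j_ml(source, pairs, tau_true, sigma2))
print(j_ml(source + np.array([0.1, 0.0, 0.0]), pairs, tau_true, sigma2))
```

With noise-free delays the criterion is zero at the true location and positive at the displaced one; with noisy delays the minimizer of `j_ml` plays the role of the ML estimate in Equation (2).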
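3. ESTIMATION OF LOCALIZATION ERROR REGION
For a pair of sensors, $m_{i1}$ and $m_{i2}$, with midpoint $\mathbf{m}_i$ and unit axis $\mathbf{a}_i$, Figure 1 depicts the relationship between the (assumed known) true source-location $\mathbf{s}$ and an estimate of the location $\hat{\mathbf{s}}$ relative to this $i$th sensor pair. $R_i$ is defined as the distance from $\mathbf{s}$ to $\mathbf{m}_i$ and $\theta_i$ as the angle between the directed line segment $\mathbf{s} - \mathbf{m}_i$ and the sensor pair axis $\mathbf{a}_i$. The values $\hat{R}_i$ and $\hat{\theta}_i$ are defined similarly for the location estimate $\hat{\mathbf{s}}$. The 3-dimensional Cartesian displacement vector from $\mathbf{s}$ to $\hat{\mathbf{s}}$ is denoted by $\Delta\mathbf{s}$.

Figure 1. The relationship between the true source-location $\mathbf{s}$ and an estimate of the location $\hat{\mathbf{s}}$ relative to the $i$th sensor pair.

For sources with a large source-range to sensor-separation ratio ($R_i / |\mathbf{m}_{i1} - \mathbf{m}_{i2}| \gg 1$), the direction angles, $\hat{\theta}_i$ and $\theta_i$, are well approximated by:

$$\hat{\theta}_i = \cos^{-1}\left(\frac{c \, T(\{\mathbf{m}_{i1}, \mathbf{m}_{i2}\}, \hat{\mathbf{s}})}{|\mathbf{m}_{i1} - \mathbf{m}_{i2}|}\right), \qquad \theta_i = \cos^{-1}\left(\frac{c \, T(\{\mathbf{m}_{i1}, \mathbf{m}_{i2}\}, \mathbf{s})}{|\mathbf{m}_{i1} - \mathbf{m}_{i2}|}\right) \tag{4}$$

and $\hat{\theta}_i$ is related to the positional vectors via the dot product:

$$\hat{R}_i \cos\hat{\theta}_i = (\mathbf{s} + \Delta\mathbf{s} - \mathbf{m}_i) \cdot \mathbf{a}_i = R_i \cos\theta_i + \Delta\mathbf{s} \cdot \mathbf{a}_i.$$

Substituting the first-order Taylor series expansion for $\hat{R}_i$ ($\hat{R}_i \approx R_i + \frac{\mathbf{s} - \mathbf{m}_i}{R_i} \cdot \Delta\mathbf{s}$) into this expression yields:

$$R_i \cos\hat{\theta}_i + \frac{(\mathbf{s} - \mathbf{m}_i) \cdot \Delta\mathbf{s}}{R_i} \cos\hat{\theta}_i = R_i \cos\theta_i + \Delta\mathbf{s} \cdot \mathbf{a}_i$$

or equivalently:

$$\cos\hat{\theta}_i - \cos\theta_i = \left[\frac{\mathbf{a}_i}{R_i} - \frac{(\mathbf{s} - \mathbf{m}_i) \cos\hat{\theta}_i}{R_i^2}\right] \cdot \Delta\mathbf{s}. \tag{5}$$

From the assumption of a small $\Delta\mathbf{s}$, $\frac{\cos\hat{\theta}_i}{R_i^2} \approx \frac{\cos\theta_i}{R_i^2}$. Applying this and (4) to the cosine terms on the left side of (5) produces:

$$T(\{\mathbf{m}_{i1}, \mathbf{m}_{i2}\}, \hat{\mathbf{s}}) - T(\{\mathbf{m}_{i1}, \mathbf{m}_{i2}\}, \mathbf{s}) = \frac{|\mathbf{m}_{i2} - \mathbf{m}_{i1}|}{c} \left[\frac{\mathbf{a}_i}{R_i} - \frac{(\mathbf{s} - \mathbf{m}_i) \cos\theta_i}{R_i^2}\right] \cdot \Delta\mathbf{s} = \mathbf{h}_i' \Delta\mathbf{s} \tag{6}$$

where $\mathbf{h}_i'$ is the $(1 \times 3)$ vector relating the difference in TDOA for the $i$th sensor pair to the estimate displacement vector. Denoting the $(N \times 1)$ vector of TDOA differences by $\Delta\boldsymbol{\tau}_{\hat{\mathbf{s}}}$, and the $(N \times 3)$ matrix composed of the $\mathbf{h}_i'$ vectors by $\mathbf{H}$, (6) may be extended to all $N$ microphone pairs as:

$$\Delta\boldsymbol{\tau}_{\hat{\mathbf{s}}} = \mathbf{H} \Delta\mathbf{s}.$$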
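The case where the source-location is estimated by minimization of the ML error criterion, or $\hat{\mathbf{s}} = \hat{\mathbf{s}}_{ML}$, is now examined. Defining $\mathbf{W}$ as the $(N \times N)$ diagonal matrix of the reciprocal TDOA estimate variances and $\Delta\boldsymbol{\tau}$ to be the $(N \times 1)$ vector of differences between the estimated TDOA and true TDOA, the ML error criterion (3) may be rewritten as:

$$J_{ML}(\hat{\mathbf{s}}) = (\Delta\boldsymbol{\tau} - \Delta\boldsymbol{\tau}_{\hat{\mathbf{s}}})' \mathbf{W} (\Delta\boldsymbol{\tau} - \Delta\boldsymbol{\tau}_{\hat{\mathbf{s}}}).$$

$J_{ML}(\hat{\mathbf{s}})$ is minimized when $\hat{\mathbf{s}} = \hat{\mathbf{s}}_{ML}$, which gives:

$$J_{ML}(\hat{\mathbf{s}}_{ML}) = (\Delta\boldsymbol{\tau} - \mathbf{H}\Delta\mathbf{s}_{ML})' \mathbf{W} (\Delta\boldsymbol{\tau} - \mathbf{H}\Delta\mathbf{s}_{ML}). \tag{7}$$

The right side of this equation is in weighted-linear-least-squares-error form and can be shown [6] to be minimized when:

$$\Delta\mathbf{s}_{ML} = (\mathbf{H}'\mathbf{W}\mathbf{H})^{-1} \mathbf{H}'\mathbf{W} \Delta\boldsymbol{\tau}.$$

The delay estimates have been assumed to be corrupted by a zero-mean, uncorrelated noise source and therefore $E(\Delta\boldsymbol{\tau}) = 0$. Taking the variance estimates $\sigma_i^2$ to be the true noise variances, it also follows that $E(\Delta\boldsymbol{\tau}\Delta\boldsymbol{\tau}') = \mathbf{W}^{-1}$. With this we can derive the final expression for $\mathrm{cov}\{\hat{\mathbf{s}}_{ML}\}$ in terms of the source and sensor locations and the TDOA estimate variances:

$$\mathrm{cov}\{\hat{\mathbf{s}}_{ML}\} = E\left[\Delta\mathbf{s}_{ML} \Delta\mathbf{s}_{ML}'\right] = (\mathbf{H}'\mathbf{W}\mathbf{H})^{-1}\mathbf{H}'\mathbf{W} \, E\left[\Delta\boldsymbol{\tau}\Delta\boldsymbol{\tau}'\right] \mathbf{W}\mathbf{H}(\mathbf{H}'\mathbf{W}\mathbf{H})^{-1} = (\mathbf{H}'\mathbf{W}\mathbf{H})^{-1}. \tag{8}$$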
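The predictor of Equations (6) and (8) can be sketched in a few lines of Python. This is a minimal illustration and not the authors' implementation: the names `h_row` and `predicted_cov` are hypothetical, the unit axis is assumed to point from the first sensor of each pair to the second, the speed of sound is assumed to be 343 m/s, and the four sensor pairs and the 20 microsecond TDOA standard deviation are invented for the example.

```python
import numpy as np

C = 343.0  # assumed speed of sound in air (m/s)

def h_row(m1, m2, s, c=C):
    """Row h_i' of Eq. (6) for one sensor pair and true source location s."""
    d = np.linalg.norm(m2 - m1)          # sensor separation
    a = (m2 - m1) / d                    # unit axis (assumed to point m1 -> m2)
    r = s - 0.5 * (m1 + m2)              # vector from pair midpoint to source
    R = np.linalg.norm(r)
    cos_theta = (r @ a) / R
    return (d / c) * (a / R - r * cos_theta / R ** 2)

def predicted_cov(pairs, s, sigma2, c=C):
    """Predicted location-error covariance (H' W H)^-1 of Eq. (8)."""
    H = np.array([h_row(m1, m2, s, c) for m1, m2 in pairs])
    W = np.diag(1.0 / np.asarray(sigma2))
    return np.linalg.inv(H.T @ W @ H)

# Hypothetical geometry: one diagonal pair on each wall of a 6 m x 6 m room.
pairs = [tuple(np.array(p, dtype=float) for p in pair) for pair in [
    ((0.0, 2.75, 1.75), (0.0, 3.25, 2.25)),
    ((6.0, 2.75, 2.25), (6.0, 3.25, 1.75)),
    ((2.75, 0.0, 1.75), (3.25, 0.0, 2.25)),
    ((2.75, 6.0, 2.25), (3.25, 6.0, 1.75)),
]]
source = np.array([2.0, 4.0, 1.5])
sigma2 = np.full(len(pairs), (20e-6) ** 2)  # assumed TDOA variances (s^2)

cov = predicted_cov(pairs, source, sigma2)
print("total standard deviation (m):", np.sqrt(np.trace(cov)))
```

Each row returned by `h_row` is orthogonal to the line from the pair midpoint to the source, which reflects the fact that, under the large-range approximation, a single pair constrains bearing but not range. At least three pairs with linearly independent rows are therefore needed for the matrix inverse to exist.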
4. DEMONSTRATION OF ERROR PREDICTOR
To illustrate the accuracy of the error predictor, a simulation was performed. Figure 2(a) shows the experimental arrangement: four 0.5 m × 0.5 m square arrays centered along the walls of the 6 m × 6 m × 4 m rectangular room. The diagonal combinations within each sub-array were used as the sensor pairs, yielding 8 TDOA estimates. Monte Carlo simulations were performed; 100 independent location estimates were computed for each of 36 source-locations at four distinct heights within the room. For each of the 100 trials at each source-location, the true TDOA values for each sensor pair were calculated and then corrupted by uncorrelated additive white Gaussian noise. The ML location estimate $\hat{\mathbf{s}}_{ML}$ was then calculated for each trial via a quasi-Newton algorithm constrained to search within the physical dimensions of the room.

Figure 2. (a) Array geometry used to demonstrate the effectiveness of the error predictor (sensor arrays of 0.5 m × 0.5 m in a 6 m × 6 m × 4 m room). (b) Measured (thin with offset) and predicted (thick) principal components of location-error covariance. Panels: height = 2.0 m, 2.5 m, 3.0 m, and 3.5 m; axes: dimension of room (meters), 0 to 6.

Figure 3. Overhead view of the room and talker locations (4 m × 7 m room with monitor unit and table).

Figure 2(b) presents an overhead view of the room displaying the principal components of the predicted and measured location-error covariances. It is apparent from these results that the predicted location error closely matches the results of the Monte Carlo simulations. Discrepancies between observed and predicted values are greatest in the large variance cases. Source height has little effect on the overall estimation precision, except in the cases where altering the source height significantly alters the source's bearing angle relative to a sensor pair. A quantitative analysis of the error predictor is detailed in [1].
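A scaled-down version of this experiment can be sketched in Python for a single source location. Several choices here are assumptions rather than details from the paper: the arrays are placed at mid-height on each wall, the TDOA noise standard deviation is 20 microseconds, the speed of sound is 343 m/s, 500 trials are used instead of 100, and an unconstrained Gauss-Newton iteration started at the true location stands in for the constrained quasi-Newton search. All function names are hypothetical.

```python
import numpy as np

C = 343.0  # assumed speed of sound in air (m/s)

def wall_pairs(room=(6.0, 6.0, 4.0), side=0.5):
    """Diagonal sensor pairs of four square arrays, one per wall (8 pairs)."""
    lx, ly, lz = room
    h = side / 2.0
    z = np.array([0.0, 0.0, 1.0])
    # (array centre, horizontal in-wall direction); mid-height is an assumption
    walls = [((0.0, ly / 2, lz / 2), (0.0, 1.0, 0.0)),
             ((lx, ly / 2, lz / 2), (0.0, 1.0, 0.0)),
             ((lx / 2, 0.0, lz / 2), (1.0, 0.0, 0.0)),
             ((lx / 2, ly, lz / 2), (1.0, 0.0, 0.0))]
    M1, M2 = [], []
    for centre, u in walls:
        c0, u = np.array(centre), np.array(u)
        M1 += [c0 - h * u - h * z, c0 - h * u + h * z]
        M2 += [c0 + h * u + h * z, c0 + h * u - h * z]
    return np.array(M1), np.array(M2)

def tdoas(M1, M2, s, c=C):
    """Eq. (1) for all pairs; M1 and M2 hold (N, 3) sensor coordinates."""
    return (np.linalg.norm(s - M1, axis=1) - np.linalg.norm(s - M2, axis=1)) / c

def exact_jacobian(M1, M2, s, c=C):
    """Exact (N, 3) gradient of Eq. (1) with respect to s."""
    u1 = (s - M1) / np.linalg.norm(s - M1, axis=1, keepdims=True)
    u2 = (s - M2) / np.linalg.norm(s - M2, axis=1, keepdims=True)
    return (u1 - u2) / c

def far_field_H(M1, M2, s, c=C):
    """Matrix H whose rows are the h_i' of Eq. (6)."""
    d = np.linalg.norm(M2 - M1, axis=1, keepdims=True)
    a = (M2 - M1) / d
    r = s - 0.5 * (M1 + M2)
    R = np.linalg.norm(r, axis=1, keepdims=True)
    cos_t = np.sum(r * a, axis=1, keepdims=True) / R
    return (d / c) * (a / R - r * cos_t / R ** 2)

def ml_estimate(M1, M2, tau, sigma2, s0, iters=20):
    """Gauss-Newton minimisation of Eq. (3), started at s0 (a stand-in solver)."""
    s = np.array(s0, dtype=float)
    w = 1.0 / sigma2
    for _ in range(iters):
        J = exact_jacobian(M1, M2, s)
        resid = tau - tdoas(M1, M2, s)
        s = s + np.linalg.solve(J.T @ (w[:, None] * J), J.T @ (w * resid))
    return s

def monte_carlo(M1, M2, s, sigma, trials=500, seed=0):
    """Return (predicted, measured) location-error covariances at source s."""
    rng = np.random.default_rng(seed)
    sigma2 = np.full(len(M1), sigma ** 2)
    tau_true = tdoas(M1, M2, s)
    est = np.array([
        ml_estimate(M1, M2, tau_true + rng.normal(0.0, sigma, len(M1)), sigma2, s)
        for _ in range(trials)
    ])
    H = far_field_H(M1, M2, s)
    predicted = np.linalg.inv(H.T @ (H / sigma2[:, None]))  # Eq. (8)
    return predicted, np.cov(est.T)

M1, M2 = wall_pairs()
source = np.array([2.0, 4.0, 1.5])
predicted, measured = monte_carlo(M1, M2, source, sigma=20e-6)
ratio = np.sqrt(np.trace(measured) / np.trace(predicted))
print(f"measured / predicted total standard deviation: {ratio:.2f}")
```

A ratio near one indicates agreement between the sample covariance and Equation (8) at this location. Starting the solver at the true location sidesteps the initialization problem that a practical localizer would face, so the sketch tests only the covariance prediction and not the search procedure.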
5. MICROPHONE ARRAY DESIGN
The placement of sensors can greatly affect the accuracy of source-location estimates. The choice of the number of sensor pairs and their positions ultimately depends upon minimizing some form of a precision-based cost function constrained by the number of available sensors, the physical environment, and the required intra-pair separation distances. Some work in this area related specifically to speech source acquisition has been reported in [7, 8]. At the core of an optimization procedure there must be a way to evaluate location accuracy given source and sensor locations. Equation (8) may be applied for this purpose.

The following simulation illustrates the use of the error estimation procedure for array design in a video-teleconferencing environment. Figure 3 displays the simulated room: a rectangular table centered within a 4 m × 7 m × 2.75 m idealized, anechoic enclosure. Participants are seated and standing at regular intervals along the sides of the table, at the table head, and at remote positions off the corner locations for a total of 15 × 2 = 30 speaker locations. There is a display monitor at one end of the room which the talkers face. The talkers are modeled as cardioid radiators while the microphones are modeled with a cosine receptive pattern. The noise power is uniform at each microphone and the TDOA variance is computed as the sum of the signal-noise power ratio of the two microphones. The microphones are constrained to orthogonal pairs in the manner of [2] with an inter-sensor spacing of 0.25 m.

Figure 4 depicts the optimal locations for several combinations of three 4-element arrays. Many more combinations of sensor units have been simulated and are detailed in [9], but only 4 are shown here due to space restrictions. In each case, each 4-element unit was allowed to move with two degrees of freedom along a specified surface (the monitor, ceiling, table, or walls). The localization error associated with a combination of microphone arrays, $M$, and the set of predefined source-locations is the average of the total standard deviations for the 30 individual error covariances:

$$\text{Average Error}\{M\} = \frac{1}{30} \sum_{i=1}^{30} \left(\mathrm{trace}\left[\mathrm{cov}\{\hat{\mathbf{s}}_{ML_i}\}\right]\right)^{1/2}$$
where $\mathrm{cov}\{\hat{\mathbf{s}}_{ML_i}\}$ is the error covariance associated with the $i$th source location. Standard optimization techniques were used to find the array positions that minimize the average total standard deviation value associated with a particular set of positions. For each scenario the principal components and the average total error listed have been scaled relative to a uniform noise condition.

Figure 4. Combinations of three arrays for video-teleconferencing scenario: (a) 1 Monitor / 2 Ceiling (Error: 1.00); (b) 1 Monitor / 2 Side (Error: 1.46); (c) 3 Ceiling (Error: 1.80); (d) 3 Table (Error: 2.02). Lines indicating the principal components of the error at each location are shown. Array positions are indicated by circles.

Of the constrained minimal error placements shown, the arrangement combining a single array unit on the monitor and two ceiling arrays, (a), yields the lowest total location error. The 3 ceiling, (c), and 3 table, (d), situations are the least effective of the three array unit combinations and exhibit a curious asymmetry in their optimal placement; the rightmost array is approximately 0.2 m off the midline in each case. It is apparent that it is preferable to have microphones at the room focal point (the video monitor) rather than on the ceiling or table, even though arrays on the ceiling and table are closer to the candidate source-locations. This is a function of the source modeling.
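The average-error cost used to rank candidate placements can be sketched as follows. This is a simplified illustration under stated assumptions: the TDOA variance is taken as uniform (20 microseconds standard deviation) instead of being derived from the cardioid source and cosine microphone models, the speed of sound is assumed to be 343 m/s, the 30 talker positions and both ceiling placements are invented, and the function names are hypothetical. An optimizer would minimize `average_error` over the array positions; here two fixed placements are simply compared.

```python
import numpy as np

C = 343.0  # assumed speed of sound in air (m/s)

def unit_pairs(centre, u, v, spacing=0.25):
    """Two orthogonal sensor pairs of one 4-element unit in the plane (u, v)."""
    c0, u, v = (np.asarray(x, dtype=float) for x in (centre, u, v))
    h = spacing / 2.0
    return [(c0 - h * u, c0 + h * u), (c0 - h * v, c0 + h * v)]

def predicted_cov(pairs, s, sigma2, c=C):
    """Eq. (8) with uniform TDOA variance sigma2 (a simplifying assumption)."""
    M1 = np.array([p[0] for p in pairs])
    M2 = np.array([p[1] for p in pairs])
    d = np.linalg.norm(M2 - M1, axis=1, keepdims=True)
    a = (M2 - M1) / d
    r = s - 0.5 * (M1 + M2)
    R = np.linalg.norm(r, axis=1, keepdims=True)
    cos_t = np.sum(r * a, axis=1, keepdims=True) / R
    H = (d / c) * (a / R - r * cos_t / R ** 2)
    return sigma2 * np.linalg.inv(H.T @ H)

def average_error(pairs, sources, sigma2):
    """Mean over source locations of sqrt(trace(cov)), as in the text."""
    return float(np.mean([np.sqrt(np.trace(predicted_cov(pairs, s, sigma2)))
                          for s in sources]))

def ceiling_units(xy_list, height=2.75):
    """Three (or more) 4-element units lying flat on the ceiling."""
    pairs = []
    for x, y in xy_list:
        pairs += unit_pairs((x, y, height), (1, 0, 0), (0, 1, 0))
    return pairs

# Hypothetical talker grid: 15 floor positions x 2 heights (seated, standing).
floor = [(x, y) for x in (1.0, 3.0) for y in np.linspace(1.5, 5.5, 7)] + [(2.0, 6.2)]
sources = [np.array([x, y, z]) for x, y in floor for z in (1.2, 1.7)]

sigma2 = (20e-6) ** 2  # assumed uniform TDOA variance (s^2)
spread = ceiling_units([(1.0, 1.5), (3.0, 3.5), (1.0, 5.5)])
clustered = ceiling_units([(1.7, 0.3), (2.0, 0.3), (2.3, 0.3)])

err_spread = average_error(spread, sources, sigma2)
err_clustered = average_error(clustered, sources, sigma2)
print(f"spread: {err_spread:.4f} m, clustered: {err_clustered:.4f} m")
```

Because directivity is ignored here, this sketch cannot reproduce the paper's finding that a monitor-mounted array is preferable; it only shows how Equation (8) turns a candidate placement into a single scalar cost.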
6. DISCUSSION
In this paper we have described a means for evaluating the error region associated with talker-location data obtained from a microphone-array system. The error predictor has been shown to accurately model simulated source position estimates generated for a series of ideal room environments. Further experiments involving real rooms, physical talkers and actual microphone-array systems have been conducted and are given in [1]. The results of these real-world evaluations confirm the results of the simulations presented here; the error covariance predictor serves as a valid indicator of the array source-localization performance. In addition to providing a measure of confidence for the source location estimates, the error region predictor may be incorporated into a host of practical applications. As the preceding section illustrates, the predictor is a useful basis for a sensor placement error criterion and provides an effective tool for array design with regard to source location estimation in room environments. While the example presented here considered an idealized, anechoic room and specific restrictions on the microphone locations, the procedure may be easily extended to more sophisticated room models and generalized sensor placements.
REFERENCES
[1] M. S. Brandstein. A Framework for Speech Source Localization Using Sensor Arrays. PhD thesis, Brown University, Providence, RI, May 1995.
[2] M. S. Brandstein, J. E. Adcock, and H. F. Silverman. A closed-form method for finding source locations from microphone-array time-delay estimates. In Proceedings of ICASSP95, pages 3019-3022. IEEE, 1995.
[3] M. Omologo and P. Svaizer. Acoustic event localization using a crosspower-spectrum phase based technique. In Proceedings of ICASSP94, pages II-273 to II-276. IEEE, 1994.
[4] E. E. Jan, P. Svaizer, and J. L. Flanagan. A database for microphone array experimentation. In Proceedings of Eurospeech 95, pages 813-816. ESCA, 1995.
[5] M. S. Brandstein, J. E. Adcock, and H. F. Silverman. A practical time-delay estimator for localizing speech sources with a microphone array. Computer, Speech, and Language, 9(2):153-169, April 1995.
[6] S. Kay. Fundamentals of Statistical Signal Processing: Estimation Theory. Prentice Hall, first edition, 1993.
[7] H. F. Silverman. Some analysis of microphone arrays for speech data acquisition. IEEE Trans. Acoust. Speech Signal Process., ASSP-35(2):1699-1712, December 1987.
[8] S. Gazor and Y. Grenier. Criteria for positioning of sensors for a microphone array. IEEE Trans. Speech Audio Proc., 3(4):294-303, July 1995.
[9] M. S. Brandstein, J. E. Adcock, and H. F. Silverman. Microphone-array localization error estimation with application to sensor placement. Submitted to J. Acoust. Soc. Am., June 1995.