Microphone-Array Localization Error Estimation with Application to Sensor Placement Michael S. Brandstein, John E. Adcock, and Harvey F. Silverman Laboratory for Engineering Man/Machine Systems Division of Engineering Brown University Providence, RI 02912 email:
[email protected] tel. (401) 863-1431 email:
[email protected] tel. (401) 863-1044 email:
[email protected] tel. (401) 863-2276 (fax) (401) 863-1157 PACS: 43.72 December 21, 1995
Abstract This paper presents a means for predicting the error region associated with a speech-source location estimate obtained from a set of microphones in a room environment. The error predictor presented is derived assuming a speci c source-sensor geometry consisting of pairs of closely-spaced sensors for which a delay estimate associated with the potential source has been evaluated. The accuracy of the predictor is evaluated through a set of Monte Carlo simulations and an application of the predictor to microphone-array design in the context of a video-conferencing scenario is presented. PACS: 43.72
Michael S. Brandstein et al JASA submission
1
1 Introduction A steerable array of microphones has the potential to replace the traditional head-mounted or deskstand microphone as the input transducer system for acquiring speech data in many applications. An array of microphones has a number of advantages over a single-microphone system. First, it may be electronically aimed to provide a high-quality signal from a desired source-location while it simultaneously attenuates interfering talkers and ambient noise. In this regard, an array has the potential to outperform a single, well-aimed, highly-directional microphone. Second, an array system does not necessitate local placement of transducers, will not encumber the talker with a hand-held or head-mounted microphone, and does not require physical movement to alter its direction of reception. These features make it advantageous in settings involving multiple or moving sources. Finally, it has potential capabilities that a single microphone does not; namely automatic detection, location, and tracking of active talkers in its receptive area. Existing array systems have been used in a number of applications [1, 2]. These include teleconferencing [3, 4, 5, 6], speech recognition [7, 8, 9, 10], speaker identi cation [11], speech acquisition in an automobile environment [12, 13], sound capture in reverberant enclosures [14, 15, 16], large-room recordingconferencing [17], acoustic surveillance [18, 19], and hearing aid devices [20]. These systems also have the potential to be bene cial in several other environments, the performing arts and sporting communities, for instance. An essential requirement of these sensor array systems is the ability to locate and track a speech source and accordingly various methods for talker localization in a room environment have been investigated [21, 22, 23, 16, 18]. For audio-based applications, an accurate x on the primary talker, as well as knowledge of any interfering talkers or coherent noise sources, is necessary to eectively steer the array, enhancing a given source while simultaneously attenuating those deemed undesirable. Source-location information may be used to discriminate individual speakers in a
Michael S. Brandstein et al JASA submission
2
multi-source scenario. With this information available, it would then be possible to automatically focus upon and follow a given source on an extended basis. Of particular interest is the use of speaker location estimates for aiming one or more cameras in a video-conferencing system. In this application, the automated localization system eliminates the need for human camera operators. Given a source-location estimate, a measure of the region of spatial uncertainty related to the estimate is essential before the information can be judiciously employed in a practical application. For instance, given a source-location estimate, knowledge of the associated error region is required for designing the array beam pattern; allowing for a suciently wide beam to avoid audio dropout due to misaim while simultaneously restricting the beam size to provide maximum attenuation of interfering sources. Similarly, in pointing a video camera, this error information is useful for selecting camera magni cation, permitting close-up views of a talker without over-zooming. This data may also be used to aid in array design. As will be demonstrated in Section 5, the estimate uncertainty may be incorporated into a process for determining optimal microphone placement. The assessment of array localization accuracy has been studied extensively in the context of sonar and radar applications [24, 25, 26, 27, 28, 29]. These analyses concentrate primarily on linear sensor arrays, evaluating location estimates with regard to range and bearing error. This information is also of interest for navigational systems [30, 31, 32, 33] where a location estimate is calculated from clocking signals transmitted from known positions. These applications vary in key respects from the speech-source/room-environment scenario addressed below. This paper presents a means for predicting the error region associated with a speech-source location estimate obtained from a set of microphones placed within a room environment. The error predictor presented is derived assuming the speci c source-sensor geometry utilized in [21]; the basic unit of which consists of a pair of closely-spaced sensors and a single delay estimate associated with the potential source. Delay estimates are evaluated with respect to the particular
Michael S. Brandstein et al JASA submission
3
sensor pair. Unlike the case of navigational systems, there is no attempt made to de ne timedierence of arrival (TDOA) values relative to a single reference sensor or an absolute scale. This philosophy is motivated by several arguments. Primarily, in a near- eld source environment, source directionality can create signi cant signal dissimilarities at spatially distant sensors. In the interest of obtaining accurate and reliable TDOA estimates, the individual sensors in each pair must be kept close together. Additionally, the precision of the location estimate is dependent upon the placement of the sensors relative to the actual source-location and the acoustic environment. This may require placing sensors in a wide variety of positions throughout the enclosure. In general, sensor geometries will not be limited to linear arrays, preventing an analysis of location estimate accuracy in terms of range and bearing alone. Given only a xed number of sensors arranged in spatially-local sensor pairs, it is not robust and frequently not possible, to evaluate all the TDOA gures relative to a single sensor location. The proposed sensor-pair geometry addresses the problem of source localization given these autonomous sensor pair-TDOA units. The following section describes the source-localization problem addressed by this work and presents a statistically optimal location estimator. Section 3 contains the derivation of an error region predictor which is then evaluated through a set of simulations in Section 4. Section 5 illustrates the speci c application of the predictor in the context of array design in a video-conferencing scenario. Finally, Section 6 presents some conclusions and suggestions for future work.
2 Source Localization Problem The locationing problem addressed here may be stated as follows. There are N pairs of sensors
mi1 and mi2 for i 2 [1; N ]. The ordered triplet (x,y,z) of spatial coordinates for the sensors will be denoted by mi1 and mi2, respectively. For each sensor pair, a TDOA estimate, i , for a signal source located at s is available. The true TDOA associated with a source, s, and the ith sensor-pair
Michael S. Brandstein et al JASA submission is given by:
4
T (fmi1; mi2g; s) = js ? mi1 j ?c js ? mi2j
(1)
where c is the speed of propagation in the medium. In practice, i represents a corrupted version of the true TDOA and in general, i 6= T (fmi1; mi2g; s). In addition to the i , a variance estimate,
i2, associated with each TDOA is also assumed to be available as a byproduct of the time-delay estimation procedure. This variance estimate is generally a function of the signal content and sensor SNR conditions as well as the source radiation pattern and the physical conditions of the room. Throughout this discussion i is assumed to be an unbiased estimate of the true TDOA. This assumption is reasonable under a broad range of physical conditions. For acoustically live environments reverberation and multipath aect TDOA estimation in a manner similar to that of uncorrelated white noise under diminished SNR situations [34]. These conditions would then be incorporated into an in ated TDOA variance estimate. Given these N sensor-pair, TDOA-estimate combinations:
fmi1; mi2g; i; i2
for i = 1; : : :; N
it is desired to obtain an estimate of the source-location, ^s. If the TDOA estimates are assumed to be independently corrupted by additive zero-mean Gaussian noise, the Maximum Likelihood (ML) estimate ^sM L is found through minimization of a least-squares error criterion [35] denoted here by
JM L (s):
^sM L = arg min (J (^s)) ^s M L where
JM L(^s) =
N X
1
2 [i ? T (fmi1; mi2g; ^s)] i=1 i
(2)
2
(3)
Michael S. Brandstein et al JASA submission
5
While other localization error criteria are amenable to this localization problem and have been shown to yield advantageous results under the least favorable conditions [21], only the ML error criterion will be considered here.
3 Estimation of Localization Error Region In what follows it is assumed that the true source-location s is known and that the goal is to develop a statistical analysis of the precision associated with the source-location estimate ^s. First, the displacement associated with the source estimate is expressed in terms of the true location and the array geometry, and then analyzed relative to the localization-error criterion.
3.1 Displacement Geometry For a pair of sensors, mi1 and mi2 , with midpoint mi and unit axis ai given by:
mi = mi1 +2 mi2 i1 ? mi2 ai = jm m ? m j; i1
i2
(4)
Figure 1 depicts the relationship between the true source-location s and an estimate of the location ^s relative to this ith sensor pair. Ri is de ned as the distance from s to mi and i as the angle between the directed line segment s ? mi and the sensor pair axis ai . The values R^ i and ^i are de ned similarly for the location estimate ^s. The 3-dimensional Cartesian displacement vector from
s to ^s is denoted by:
2 6 6 6 6 6 6 6 4
sx
s = s
3
7 7 7 7 7 y 7 7 5
sz
Michael S. Brandstein et al JASA submission
6
For sources with a large source-range to sensor-separation ratio (Ri =jmi1 ? mi2j 1), the direction angles, ^i and i , are well approximated by: s) i1; mi2 g; ^ ^i = cos?1 c Tj(mfm? mi2j i1 mi2g; s) i = cos?1 c Tj(mfmi?1; m j i1
i2
(5)
and ^i is related to the positional vectors via the dot product:
R^i cos ^i = (s + s ? mi ) ai = (s ? mi ) ai + s ai = Ri cos i + s ai
(6)
Following [32], R^ i is represented by its rst-order Taylor series expansion about the true sourcelocation:
R^i Ri + s ?Rmi s i
(7)
This linearization of the source range requires that the source-location estimate be suciently close to the true location (i.e. small s) such that the error introduced in this approximation is negligible in comparison to localization errors due to TDOA estimate errors. As will be shown, this requirement is met in practice. Additionally, ^s is an inherently biased estimate of the true location. By truncating the series expansion the non-linear terms responsible for this eect are eliminated, and accordingly, the approximation that follows does not account for any biasing found in the actual location estimate.
Michael S. Brandstein et al JASA submission
7
Substituting (7) into (6) yields: s ? m i ^ Ri cos i + s cos ^i = Ri cos i + s ai
Ri
or equivalently:
#
"
^i s cos ^i ? cos i = Rai ? (s ? mi) cos R2 i
From the assumption of a small s,
cos i ^2 R
i
(8)
i
cos ^i ^2 R
i
. Applying this and (5) to the cosine terms on
the left side of (8), produces:
"
#
ai ? (s ? m ) cos i s c [ T ( f m s ) ? T ( f m i i1 ; mi2g; ^ i1 ; mi2g; s)] = jmi2 ? mi1j Ri R2i
And nally: [T (fmi1; mi2g; ^s) ? T (fmi1; mi2g; s)] =
|
"
#
jmi2 ? mi1j ai ? (s ? m ) cos i s i c Ri R2i {z
hi
}
0
= h0i s
(9)
where h0i is the (1 3) vector relating the dierence in TDOA for the ith sensor pair to the estimate displacement vector. It will be useful to express (9) for the N sensor pairs via matrix notation. The (N 1) vector of TDOA dierences is denoted by s^ and the (N 3) matrix composed of the h0i vectors will be
Michael S. Brandstein et al JASA submission
8
given by H , i.e.: 2
s^ =
6 6 6 6 6 6 6 6 6 6 6 6 4
[T (fm11; m12g; ^s) ? T (fm11; m12g; s)] [T (fm21; m22g; ^s) ? T (fm21; m22g; s)] .. . [T (fmN 1; mN 2g; ^s) ? T (fmN 1; mN 2g; s)]
3
2
7 7 7 7 7 7 7 7 7 7 7 7 5
6 6 6 6 6 6 6 6 6 6 6 6 4
H=
h0
3
1 7 7 7 7 7 2 7 7 7 7 7 7 7 5
h0 .. .
h0N
The estimate displacement is then related to the N delay-estimate, sensor-pair combinations by
^s = Hs
(10)
3.2 Source Estimate Based Upon the ML Criterion The case where the source-location is estimated by minimization of the ML error criterion given in (2) is now examined; ^s = ^sM L. The ML error criterion (3) may be rewritten as:
JM L (^s) = =
N X i=1 N X i=1
1 h ? T (fm ; m g; ^s)i2 i i1 i2 2
i
1 h ? T (fm ; m g; s) + i1 i2 2 i i
T (fmi1; mi2g; s) ? T (fmi1; mi2g; ^s)
i2
(11) De ning W as the (N N ) diagonal matrix of the reciprocal TDOA estimate variances and
Michael S. Brandstein et al JASA submission
9
to be the (N 1) vector of dierences between the estimated TDOA and true TDOA, 2
6 6 6 6 6 6 6 6 6 6 6 6 4
1 ? T (fm11; m12g; s) 2 ? T (fm21; m22g; s) .. .
N ? T (fmN 1; mN 2g; s)
3
2
3
7 7 7 7 7 7 7 7 7 7 7 7 5
6 12 6 1 6 6 6 6 6 6 6 6 6 6 4
7 7 7 7 7 7 7 7 7 7 7 7 5
W
1
22
... 1
2 N
Equation (11) is rewritten as:
JM L (^s) = ( ? s^)0W( ? s^) The ML criterion is assumed to be minimized when ^s = ^sM L, which gives:
JM L(^sM L) = min J (^s) = ( ? s^)0W( ? s^) ^s M L = ( ? HsM L )0 W( ? HsM L )
(12)
The right side of this equation is identical in form to the weighted-linear-least-squares-error and can be shown [35] to be minimized when:
sM L = (H0WH)?1H0W
(13)
Therefore, minimization of JM L would result in a sM L as given above. Equation (13) relates the displacement vector associated with a location estimate ^sM L to the TDOA estimates that would produce this particular estimate via minimization of the nonlinear LS error criterion JM L .
Michael S. Brandstein et al JASA submission
10
The covariance of sM L is given by: h
covfsM Lg = E (sM L ? E(sM L))(sM L ? E(sM L))0
i
(14)
The delay estimates have been assumed to be corrupted by a zero-mean, uncorrelated noise source and therefore E ( ) = 0. Substituting this and (13) into (14) yields:
covfsM L g = E sM L sM L0
= (H0WH)?1 H0 WE ( 0 )W0 H (H0WH)?1 = (H0WH)?1 H0 WE ( 0 )W0 H(H0WH)?1
0
(15)
Because E ( ) = 0, cov f g = E ( 0 ) which is an (N N ) diagonal matrix with diagonal elements i2. Similarly, the weighting coecients that comprise the diagonal elements of
W were selected to be (1=i2), and thus, E( 0 )W0 = IN. Equation (15) therefore simpli es to:
covfsM L g = (H0WH)?1 (H0WH)(H0WH)?1 = (H0WH)?1
(16)
Equation (16) predicts the covariance of the ^sM L estimate given knowledge of the source and sensor locations as well the TDOA estimate variances.
4 Analysis of Estimate Error Predictors To evaluate the accuracy of (16) as a predictor of the estimator's true covariance, two sets of simulations were conducted with the varying parameter being the positions of the sensor pairs.
Michael S. Brandstein et al JASA submission
11
4.1 Evaluation #1 In this rst computer simulation a ten-element bi-linear sensor array with :5m spacings was used. The array was centered on one wall in a 6m 6m 4m idealized, anechoic rectangular room as depicted in Figure 2a. The eight pairs of diagonally adjacent sensors were selected as the sensor pairs. Monte Carlo simulations were performed; 100 independent location estimates were computed for each of 36 source-locations within the room. Source-locations were spaced one meter apart in two horizontal planes at 2m and 3m high. For each of the 100 trials at each source-location the true TDOA values for each sensor pair were calculated and then corrupted by uncorrelated additive white Gaussian noise. The corrupting noise level at each sensor pair was xed at a moderate level, a standard deviation of 10?2m when scaled by c. The ML location estimate ^sM L was then calculated for each trial via a quasi-Newton algorithm constrained to search within the physical dimensions of the room. Figure 3 displays the results of these simulations. Each plot in this gure is a view from above the room. The sensor array is indicated by the ve circles on the left vertical axis. In Figure 3 a, the 3600 (36 locations, 100 estimates per location) ^sM L estimates have been plotted with dots. Because the room is symmetric about the midpoint of the array, only one height was used for the locations in each half-plane. In each of these plots, the lower horizontal half-plane contains source-locations 2m high, level with the mid-line of the sensor array. The sources in the upper half-plane are 3m high, one meter above the mid-line of the sensor array. Figure 3 b shows the principal component vectors of the predicted covariance matrix scaled to 2.5 standard deviations with thick lines, and the principal component vectors of the distribution of measured source-locations. The principal components of the measured distributions are oset vertically by a small amount for graphical clarity. For each source-location, the predicted error
Michael S. Brandstein et al JASA submission
12
^s
s s R^ i Ri ^i
mi2
mi
i
mi1 ai
Figure 1: The relationship between the true source-location s and an estimate of the location ^s relative to the ith sensor pair.
Sensor Arrays 4m
4m
.5 m 2.0 m
6m
Sensor Array
.5m .5m
6m
6m
6m
(a) (b) Figure 2: Location Error Evaluation Scenarios: (a) Ten-element bi-linear sensor array centered on one wall. (b) Four sets of square sensor arrays are positioned at the center of each wall.
Michael S. Brandstein et al JASA submission
13
covariance matrix was calculated via (16). An eigenvector-eigenvalue decomposition of the (3 3) matrix yields its principal components vectors [36]. Geometrically, if covfsM Lg is positive de nite with eigenvalue-eigenvector pairs (i ; ei) for i = 1; 2; 3, all the (3 1) vectors x which satisfy: (x ? x)T (covfsM Lg)?1 (x ? x) = h2
p
de ne a hyper-ellipsoid centered about x with axes h i ei . The eigenvalues correspond to the variance of the data set projected onto the corresponding eigenvector or principal component. Setting h = 2:5 in the above expression will therefore generate a hyper-ellipsoid with axes extending 2:5 standard deviations in either direction from the center of the conic along each of the principal component vectors. If the distribution of source estimates possesses a trivariate normal density, a given estimate would have a 0:9 probability of falling on or within such a hyper-ellipsoid [36]. In each case, the estimator has been assumed to be zero-biased and thus the center of the hyperellipsoid is the given source-location. The lines in Figure 3b display the scaled principal component vectors which correspond to the axes of the hyper-ellipsoid associated with the predicted error covariance of each source-location shown in the Figure 3a. Table 1 presents detailed numerical data for selected source-locations in Figure 3. In each case, the principal component standard deviations are listed for the predicted error covariance as well as for the sample covariances associated with the set of ^sM L estimates. The fourth row gives the total standard deviation. This value is calculated as the square root of the component variance summation and is equivalent to the square root of the trace of the covariance matrix. Several characteristics are apparent from Figure 3 and Table 1. The predicted error covariance is generally within a few centimeters of the observed covariance of the ^sM L estimator. As the gure suggests and the table quanti es, disparities between the predicted and observed are most extreme
Michael S. Brandstein et al JASA submission
14
6
6 A
A E
4
5
E
4
C
sensor array
sensor array
C 3
B 1
3
F
height = 2.0 m
height = 2.0 m
F 2
height = 3.0 m
height = 3.0 m
5
2
B 1
D D 0 0
1
2 3 4 dimension of room (meters)
5
6
0 0
1
2 3 4 dimension of room (meters)
5
6
(a) (b) Figure 3: Location Error Evaluation #1: (a) An overhead view of the ^sM L source estimates for the 100 trial simulations at 36 source-locations. (b) For each source-location principal component axes of the predicted estimate error covariance are shown in thick lines and the measured covariance are shown in thin lines (with a small vertical oset). Alphabetic labels refer to those source-locations for which numerical data is presented in Table 1.
Standard Deviations of Point Principal Components (cm) ^sML Label predicted st 1 22.7 21.0 nd 2 9.0 14.9 A rd 3 1.8 1.8 total 24.6 25.8 1st 10.6 10.5 nd 2 1.9 1.9 B rd 3 1.5 1.3 total 10.9 10.8 1st 10.7 10.7 nd 2 2.2 2.2 C rd 3 2.0 1.9 total 11.1 11.1
Standard Deviations of Point Principal Components (cm) ^sML Label predicted st 1 35.4 33.0 nd 2 3.7 3.2 D rd 3 3.0 2.8 total 35.7 33.3 1st 33.6 34.1 nd 2 3.7 3.8 E rd 3 3.5 3.6 total 33.9 34.5 1st 39.9 35.6 nd 2 4.0 4.2 F rd 3 3.9 3.8 total 40.2 36.0
Table 1: Location Error Evaluation #1: The principal component standard deviations for the predicted error covariance and the sample covariances derived from the ^sM L estimates. Point labels refer to source locations in Figure 3.
Michael S. Brandstein et al JASA submission
15
in those cases involving relatively large error variances, where the linearity assumptions used in the derivation of the error covariance predictor are less valid. Consider the source point labeled `D', located at (3:5m; 0:5m) and at a height of 2m. The total observed standard deviation is 33:3cm for the estimator while the predicted total is 35:7cm. The 2:4cm dierence is largely accounted for by the rst principle component. At the other extreme is the point `B' at (1:5m; 1:5m) and height 2m. The disparity between predicted and observed total standard deviation is only :1cm. For sources positioned near the boundary of the room, this disparity between observation and prediction may be due, in part, to arti cially low observed covariance values brought about by the search constraint placed on the ML-error criterion minimizer. Those source-locations that are estimated to be outside of the physical room are placed at the room boundary and the spread of source estimates is subsequently skewed. This eect is apparent in the statistics for point `F' where there is a visible cluster of estimate points at the wall of the room and the observed rst principal component standard deviations are sizably less than the predicted value. As expected, the source-estimation procedure is most accurate for sources close to and within the dimensions of the sensor array. The precision of the location estimates is extremely sensitive to the bearing for sources close to end- re, and the accuracy of the range estimates drops rapidly as the true source range increases. These observations are consistent with results reported for standard linear arrays, as in [37]. For a xed (x,y) position relative to the oor, the variation in height of the half-planes had little eect on the estimator's precision. An exception to this rule being the source-location labeled `A' at location (0:5m; 5:5m) and height 3m and its symmetric counterpart at (0:5m; 0:5m) and height 2m. While the former is farther from the sensors than the latter, it possesses a milder bearing condition relative to the array. As the gure illustrates, this small improvement in bearing angle has a dramatic eect on the error spread of the source's location estimates in comparison to its counterpart's.
Michael S. Brandstein et al JASA submission
16
4.2 Evaluation #2 In the previous simulation, the source-locations that were estimated most precisely possessed two key features: they were broadside (or nearly broadside) to the sensor array, and they were not very distant from the sensors. Short of placing physical obstacles at positions considered undesirable, the set of potential source-locations within a room cannot be dictated. However, there may be a great deal of freedom in the placement of the sensors. The results of Evaluation #1 motivated the choice of the array con guration illustrated by Figure 2 b in which a 0:5m 0:5m square array has been centered along each wall of the same 6m 6m 4m anechoic, rectangular room. This sensor arrangement provides for an improved coverage of the room environment. The vast majority of potential source-locations are close and at a broadside angle to at least one sensor pair. The diagonal combinations within each sub-array were used as the sensor pairs, yielding the
p
same number of TDOA estimates (eight) and the same sensor spacings (:5 2m) as in the previous experiment. Monte Carlo simulations were performed in an manner identical to those of Evaluation #1. Now that the array con guration is symmetric in two dimensions, the source-locations in each of the four parallel quarter-planes are at a single height, ranging from the midpoint of the room, 2m, to a half-meter short of the ceiling, 3:5m. The results of this experiment are presented in Figure 4 and Table 2. Once again, the expression (16) accurately predicts the results of the Monte Carlo simulations. Discrepancies between observed and predicted values continue to be greatest in the large variance cases. Source height has little eect on the overall estimation precision, except in the cases where altering the source height signi cantly alters the source's bearing angle relative to a sensor pair. In short, the trends from Evaluation #1 are also seen with this alternative sensor arrangement. However, the overall sourcelocalization error has been reduced signi cantly as a result of the more judicious placement of sensors; the predicted mean total standard deviation decreased from 30:4cm in Evaluation # 1 to
Michael S. Brandstein et al JASA submission
17
6
6 A
E
4
height = 3.5 m
5
height = 3.0 m
height = 3.0 m
height = 3.5 m
A 5
E 4
C C 3
3
B 1
height = 2.0 m
2
height = 2.5 m
height = 2.5 m
height = 2.0 m
F F
2 B 1
D D 0 0
1
2 3 4 dimension of room (meters)
5
6
0 0
1
2 3 4 dimension of room (meters)
5
6
(a) (b) Figure 4: Location Error Evaluation #2: (a) A top-down view of the ^sM L source estimates for the 100 trial simulations at 36 source-locations. (b) For each source-location principal component axes of the predicted estimate error covariance are shown in thick lines and for the measured are covariance shown in thin lines (with a small vertical oset).
Standard Deviations of Point Principal Components (cm) ^sML Label predicted st 1 14.7 13.5 nd 2 5.4 4.9 A rd 3 2.8 2.8 total 15.9 14.6 1st 6.6 5.1 nd 2 2.9 2.9 B 3rd 1.9 2.0 total 6.6 6.2 st 1 3.2 3.8 nd 2 3.1 3.3 C 3rd 2.9 2.7 total 5.3 5.7
Standard Deviations of Point Principal Components (cm) ^sML Label predicted st 1 4.5 4.9 nd 2 2.0 2.0 D rd 3 1.2 1.2 total 5.0 5.4 st 1 4.1 4.2 nd 2 3.5 3.8 E 3rd 2.3 2.3 total 5.8 6.1 st 1 4.5 4.7 nd 2 2.0 1.8 F 3rd 1.2 1.5 total 5.0 5.2
Table 2: Location Error Evaluation #2: The principal component standard deviations for the predicted error covariance and the sample covariances derived from the ^sM L estimates. Point labels refer to source-locations in Figure 4.
Michael S. Brandstein et al JASA submission
18
7:3cm in Evaluation #2. Also the hyper-ellipsoids in Evaluation #2 are considerably less eccentric than those generated via the bilinear array of Evaluation #1.
5 Optimal Microphone Placement As Evaluations #1 and #2 illustrate, the placement of sensors within a room can dramatically aect the accuracy of the source-location estimates. Sensor positioning is generally subject to the physical or aesthetic constraints of the environment, but may also be restricted by the requirements of the time-delay estimation procedure. It has been assumed throughout this discussion that the source localization process is independent of the TDOA estimation, requiring only the sensor locations, TDOA estimates, and the estimate variances. However, the precision of time-delay estimators is highly dependent upon the coherence or similarity of the signals received at the two sensors. It is therefore essential to the accuracy of the TDOA estimates that the separation of the sensors within each sensor pair be small enough to prevent signi cant disparities in the received signal quality or content between the sensor pair. This quali cation makes certain placement scenarios that seem advantageous for localization, ineective due to the detrimental eects on the accuracy of the time-delay estimates; the increased TDOA variance may overwhelm the advantages of a broadened baseline. In practice, the selection of sensor separation distances requires a knowledge of the environmentally-dependent performance characteristics associated with the time-delay estimation procedure employed. The choice of the number of sensor pairs and their positions ultimately depends upon minimizing some form of a precision-based cost function that is constrained by the number of available sensors, the physical environment, and the required intra-pair separation distances. The details and method of minimizing such a cost function will vary from one application to another. Some work in this area related speci cally to speech source acquisition has been reported in [7, 3, 38, 39]. In many
Michael S. Brandstein et al JASA submission
19
scenarios, prior information concerning the potential locations of signal sources or a set of spatial regions from which it is desirable to obtain `good' location estimates may be speci ed and the complexity of the cost function will be greatly reduced. Regardless of the speci cs, at the core of an optimization procedure there must be a means of evaluating estimation accuracy given source and sensor locations. The previously derived expressions for predicting error covariance may be applied for this purpose. The following simulations exhibit the application of the error estimation procedure to array design in a video-conferencing environment. For this example the source positions and orientations are explicit and a practical cost function is easily de ned. Furthermore, the simulation procedure detailed below may easily be modi ed and extended to a cover a wide range of applications.
5.1 Video-Conferencing Simulation Parameters Details of the simulations are given below:
The Room: A 4m 1:2m :1m rectangular table at a height of :7m, centered within a 4m 7m 2:75m idealized, anechoic enclosure. See Figure 5. No modeling of reverberations or secondary sources is included.
Talker Locations: Participants are seated (height=1:25m) and standing (height=1:75m) at :75m intervals along the sides of the table, at the table head, and a remote positions o of the corner locations for a total of 30 speaker locations. Associated with each speaker location is source orientation vector. In this case, all the talkers face a monitor/camera found at the front of the table and at a height of 1:5m. Again, see Figure 5.
Microphone Placement Constraints: Microphones are restricted to occur in four-element arrays each consisting of two orthogonal microphone pairs with 25cm sensor separations as
Michael S. Brandstein et al JASA submission
20
Monitor Unit 4m
Table
7m
Figure 5: Overhead View of the Room and Talker Locations
.25 m
Mics .25 m Mics
Figure 6: Diagram of a Four-Unit Microphone-Array
Michael S. Brandstein et al JASA submission
21
shown in Figure 6. These particular sensor restrictions are dictated by the practical timedelay estimation considerations discussed above and by the requirements of the Linear Intersection location estimator, a closed-form localization algorithm detailed in [23] and employed for the actual location estimation in practice. The four-element arrays are further constrained to placements on the monitor/camera unit, the table surface, the room walls, or the room ceiling (microphones are assumed to be held in chandelier units and placed at a height :25m below the actual ceiling). In each case the individual microphones are assumed to be mounted normal to the particular surface and a microphone orientation vector is included along with sensor location.
Source Model: The sources are modeled as cardioid radiators. The source amplitude is a function of radiation angle and distance. In particular: 1 + cos( ) sm A(s; m) / 2rsm
where sm is the angle generated by the source orientation vector and the source-microphone displacement vector, s ? m, and rsm is the source-microphone separation, js ? mj.
Microphone Model: The microphones are modeled as possessing a cosine reception pattern. The reception amplitude is a function of source angle relative to the microphone orientation vector.
B(m; s) / j cos(ms)j where ms is the angle generated by the microphone orientation vector and the sourcemicrophone displacement vector.
Michael S. Brandstein et al JASA submission
22
TDOA Variance: The variance of the TDOA estimate, i, for a source s relative to the ith sensor pair fmi1; mi2g is assumed to be related to the sum of the reciprocal signal-noise power ratio at the two microphones:
2
2
i2 = variance(i) / A(s; m )1B(m ; s) + A(s; m )1B(m ; s) : i1 i1 i2 i2 The noise power is assumed to be uniform at each microphone and is therefore incorporated into the constant of proportionality. Given an arrangement of four-element microphone-arrays, the localization error covariance for a particular source-location may be calculated via Equation (16). The localization error associated with a combination of microphone arrays, M, and the set of prede ned source-locations is the average of the total standard deviations for the 30 individual error covariances: 30 1 1X (trace[covfsM Li g]) 2 Average Total ErrorfMg = 30 i=1
where covfsM Li g is the error covariance associated with the ith source location. The units of this statistic are meters however, given the constants of proportionality in the modeling expressions, the total average error values presented will be normalized relative to the minimum error microphone positioning scenario (see Figure 8(A)). The rst set of simulations considered various arrangements of two four-element arrays. In each case, the arrays were allowed to vary with two degrees of freedom along a speci ed surface (the monitor, ceiling, table, or walls). Array positions were selected to minimize the average total standard deviation value associated with a particular set of surfaces. Figure 7 presents the best results obtained for several combinations of arrays. Each of the plots displays an overhead view of the room with the principal component vectors associated with each talker location included. For
Michael S. Brandstein et al JASA submission
23
(A) 1 Monitor/ 1 Ceiling (Error: 2.47)
(B) 1 Monitor/ 1 Table (Error: 2.95)
(C) 2 Ceiling (Error: 3.71)
(D) 2 Table (Error: 8.08)
(E) 2 Side (Error: 4.41)
Figure 7: Combinations of Two Arrays
Michael S. Brandstein et al JASA submission
24
each scenario the principal components and the average total error listed have been scaled relative to a uniform noise condition. Of the ve scenarios considered, the monitor/ceiling arrangement, plot (A), yields the lowest overall error, 2.47. The best positioning of the two arrays was found when the monitor array was mounted on the center of the monitor unit at a height of 1:22m, just below the focus point of the talker orientations. In the plot, the monitor array is illustrated as three small circles protruding from the edge of the monitor unit. The corresponding placement of the ceiling array is :25m behind the table center as shown in plot (A). The principle components of the error covariance display behavior similar to the results of the simulations conducted in the previous section; the major components are generally oriented toward the array units and increase with their range from the arrays. Placing one array on the monitor and one on the table, plot (B), produced the next best results. For this case the height of the table array was compensated for by a positioning of the monitor mounted array at a height of 1:77m, above the focus point. Comparing scenarios involving a monitor array, plots (A) and (B), to those without, plots (C), (D), and (E), it is apparent that it is preferable to have microphones at the room focal point rather than on the ceiling or table, even though arrays on the ceiling and table are closer to the candidate source-locations. Finally, plot (D) incorporating two table arrays, yielded the worst performance of the conditions tested. Interestingly, the result for this condition is not symmetric. The rightmost table array consistently nds a minimal position :2m from the table midline. Figure 8 displays the results obtained with various combinations of three array units. In general, the inclusion of the third array greatly improves the overall localization accuracy relative to the two array scenarios. Once again, conditions involving a monitor array, plots (A), (B), (E), and (F), achieve the smallest error gures. The monitor/2 ceiling combination, plot (A), produces the best results of any condition evaluated, slightly outperforming the monitor/2 table, plot (B), and
Michael S. Brandstein et al JASA submission
25
(A) 1 Monitor/ 2 Ceiling (Error: 1.00)
(B) 1 Monitor/ 2 Table (Error: 1.06)
(C) 3 Ceiling (Error: 1.80)
(D) 3 Table (Error: 2.02)
(E) 1 Monitor / 2 Side (Error: 1.46)
(F) 1 Monitor / 1 Table / 1 Ceiling (Error: 1.03)
Figure 8: Combinations of Three Arrays
Michael S. Brandstein et al JASA submission
26
monitor/table/ceiling, plot (F), scenarios. The 3 ceiling, (C), and 3 table, (D), situations are the least eective of the three array unit combinations and exhibit a curious asymmetry in their optimal placement; the rightmost array is approximately :2m o of the midline in each case.
6 Discussion In this paper we have described a means for evaluating the error region associated with talkerlocation data obtained from a microphone-array system. The error predictor was shown to accurately model simulated source position estimates generated for a series of ideal room environments. Further experiments involving real rooms, physical talkers and actual microphone-array systems have been conducted and are given in [21]. The results of these real-world evaluations con rm the results of the simulations presented here; the error covariance predictor serves as a valid indicator of the array source-localization performance. In addition to providing a measure of con dence for the source location estimates, the error region predictor may be incorporated into a host of practical applications. As the preceding section illustrates, the predictor is a useful basis for a sensor placement error criteria and provides an eective tool for array design with regard to source location estimation in room environments.
Michael S. Brandstein et al JASA submission
27
References [1] J. L. Flanagan and H. F. Silverman. Material for international workshop on microphone-array systems: Theory and practice. LEMS Technical Report 113, LEMS, Division of Engineering, Brown University, Providence, RI 02912, October 1992. [2] J. L. Flanagan and H. F. Silverman. Material for international workshop on microphone-array systems: Theory and practice. Technical report, CAIP, Rutgers University, Piscataway, NJ 08855-1390, October 1994. [3] J. L. Flanagan. Bandwidth design for speech-seeking microphone arrays. In Proceedings of ICASSP85, pages 732{735, Tampa, FL, March 1985.
[4] W. Kellerman. A self-steering digital microphone array. In Proceedings of ICASSP91, pages 3581{3584, Toronto, CA, May 1991. [5] J. Flanagan. Beamwidth and usable bandwidth of delay-steered microphone arrays. AT&T Technical Journal, Vol.64:983{995, April 1985.
[6] J. Flanagan, D. Berkley, G. Elko, J. West, and M. Sondhi. Autodirective microphone systems. Acustica, 73:58{71, 1991.
[7] H. F. Silverman. Some analysis of microphone arrays for speech data acquisition. IEEE Trans. Acoust. Speech Signal Process., ASSP-35(2):1699{1712, December 1987.
[8] C. Che, M. Rahim, and J. Flanagan. Robust speech recognition in a multimedia teleconferencing environment. J. Acoust. Soc. Am., Vol.92(4, pt.2):2476(A), 1992. [9] C. Che, Q. Lin, J. Pearson, B. deVries, and J. Flanagan. Microphone arrays and neural networks for robust speech recognition. In Proceedings of the Human Language Technology
Michael S. Brandstein et al JASA submission
28
Workshop, pages 342{347, Plainsboro, NJ, March 8-11 1994. ARPA-Software & Intelligent
Systems Technology Oce. [10] D. Giuliani, M. Omologo, and P. Svaizer. Talker localization and speech recognition using a microphone array and a cross-power spectrum phase analysis. In Proceedings of ICSLP, volume 3, pages 1243{1246, Sep. 1994. [11] Q. Lin, E. Jan, and J. Flanagan. Microphone arrays and speaker identi cation. IEEE Transactions on Speech and Audio Processing, 2(4):622{629, October 1994.
[12] Y. Grenier. A microphone array for car environments. In Proceedings of ICASSP92, pages I{305 { I{309, San Francisco, CA, April 1992. [13] S. Oh, V. Viswanathan, and P. Papamichalis. Hands-free voice communication in an automobile with a microphone array. In Proceedings of ICASSP92, pages I{281 { I{284, San Francisco, CA, April 1992. [14] J. Flanagan, A. Surendran, and E. Jan. Spatially selective sound capture for speech and audio processing. Speech Communication, 13(1-2):207{222, 1993. [15] E. Jan, P. Svaizer, and J. Flanagan. Matched- lter processing of microphone array for spatial volume selectivity. In Proceedings of ISCAS95, pages 1460{1463. IEEE, 1995. [16] E. Adugna. Speech Enhancement Using Microphone Arrays. PhD thesis, Rutgers University, New Brunswick, NJ, October 1994. [17] J. L. Flanagan, J. D. Johnson, R. Zahn, and G. W. Elko. Computer-steered microphone arrays for sound transduction in large rooms. J. Acoust. Soc. Amer., 78(5):1508{1518, November 1985.
Michael S. Brandstein et al JASA submission
29
[18] Maurizio Omologo and Piergiorgio Svaizer. Acoustic event localization using a crosspowerspectrum phase based technique. In Proceedings of ICASSP94, pages II{273{ II{276. IEEE, 1994. [19] M. Omologo and P. Svaizer. Use of the cross-power spectrum phase in acoustic event localization. Technical Report Technical Report No. 9303-13, IRST, Povo di Trento, Italy, March 1993. [20] J. E. Greenberg and P. M. Zurek. Evaluation of an adaptive beamforming method for hearing aids. J. Acoust. Soc. Amer., 91:1662{1676, March 1992. [21] M. S. Brandstein. A Framework for Speech Source Localization Using Sensor Arrays. PhD thesis, Brown University, Providence, RI, May 1995. [22] M. S. Brandstein, J. E. Adcock, and H. F. Silverman. A practical time-delay estimator for localizing speech sources with a microphone array. Computer, Speech, and Language, 9:153{ 169, April 1995. [23] M. Brandstein, J. Adcock, and H. Silverman. A closed-form method for nding source locations from microphone-array time-delay estimates. In Proceedings of ICASSP95, pages 3019{3022. IEEE, 1995. [24] J. Hassab. Underwater Signal and Data Processing. CRC Press, 1989. [25] S. Haykin, editor. Array Processing: Applications to Radar. Academic Press, 1980. [26] W. Hahn. Optimum signal processing for passive sonar range and bearing estimation. J. Acoust. Soc. Am., Vol.58(1):201{207, July 1975.
[27] V. MacDonald and P. Schultheis. Optimum passive bearing estimation. J. Acoust. Soc. Am., Vol.46:37{43, 1969.
Michael S. Brandstein et al JASA submission
30
[28] W. Bangs and P. Schultheis. Space-time processing for optimal parameter estimation. In J. Griths, P. Stocklin, and C. Van Schooneveld, editors, Signal Processing, pages 577{590. New York:Academic, 1973. [29] G. Carter. Variance bounds for passively locating an acoustic source with a symmetric line array. J. Acoust. Soc. Am., Vol.62(4):922{926, October 1977. [30] P. Janiczek, editor. Global Positioning System. Washington, D.C.:The Institute of Navigation, 1980. [31] J. Van Etten. Navigational systems: Fundamentals of low and very low frequency hyperbolic techniques. Electrical Commun., vol. 45(3):192{212, 1970. [32] H. Lee. A novel procedure for accessing the accuracy of hyperbolic multilateration systems. IEEE Trans. Aerosp. Electron., AES-11(1):2{15, January 1975.
[33] N. Marchand. Error distributions of best estimate of position from multiple time dierence hyperbolic networks. IEEE Trans. Aerosp. Navigat. Electron., vol.11:96{100, June 1964. [34] S. Bedard, B. Champagne, and A. Stephenne. Eects of room reverberation on time-delay estimation performance. In Proceedings of ICASSP94, pages II{261{ II{264. IEEE, 1994. [35] S. Kay. Fundamentals of Statistical Signal Processing: Estimation Theory. Prentice Hall, rst edition, 1993. [36] D. Johnson and D. Wichern. Applied Multivariate Statistical Analysis. Prentice Hall, rst edition, 1982. [37] G. Carter. Time delay estimation for passive sonar signal processing. IEEE Trans. Acoust., Speech, Signal Processing, ASSP-29(3):463{470, June 1981.
Michael S. Brandstein et al JASA submission
31
[38] F. Pirz. Design of a wideband, constant beamwidth, array microphone for use in the near eld. Bell System Technical Journal, 58(8):1839{1850, October 1979.
[39] M. Goodwin and G. Elko. Constant beamwidth beamforming. In Proceedings of ICASSP93, pages I{169{I{172. IEEE, 1993.