Implementation of an acoustic localization algorithm for video camera steering

Paolo Minero, Mustafa Ergen
{minero,ergen}@eecs.berkeley.edu
University of California at Berkeley
1 Introduction
Advances in wireless networking and sensor network design open new prospects for performing demanding collaborative sensing and signal processing. This project was motivated by the ambitious plan of designing and implementing acoustic applications in low-cost distributed wireless sensor networks. As a first step, we considered the problem of acoustic localization through a set of distributed sensors. Acoustic localization has been extensively studied for microphone arrays since it has several practical applications, e.g., video surveillance, video conferencing, and home and military applications. In this report, we first introduce a classic mathematical model for the propagation of wideband acoustic sources in a reverberant environment. Then we review two traditional approaches to acoustic localization, based on the maximum-likelihood and cross-correlation methods. Finally, we describe the implementation of a Windows-based wireless networked acoustic sensor testbed used for video camera steering.
2 Channel model
In the following we consider wideband signals (i.e., 30 Hz-15 kHz) such as voice, vibrations, or movements. We assume operation in the near-field regime and a known propagation speed (typically 345 m/s for an acoustic source). The acoustic source is located in a reverberant space. Consider the situation in which 2R sensors are randomly distributed in space and a single acoustic source generates the wavefront x(t) in a multipath environment. The channel between the source and each sensor is modeled as an LTI system, and the signal received at each sensor is sampled at rate f_c (one sample per unit of time). Assume that all sensors sample the received signal synchronously. We assume that the channel response has a finite number of taps L, and that the channel taps do not vary over N sample times. Starting at time 1, a sequence of samples y_i[1], y_i[2], ... is received at the i-th sensor, i = 1, ..., 2R. The resulting single-input multiple-output (SIMO) channel is
$$y_i[n] = \sum_{l=0}^{L-1} h_i[l]\, x[n-l] + w_i[n], \qquad i = 1, 2, \ldots, 2R \qquad (1)$$
where h_i[l] is the sampled impulse response of the channel between the source and the i-th sensor and w_i[n] is additive background noise, modeled as a white Gaussian process with zero mean and variance σ_n². The noise is assumed uncorrelated from sensor to sensor. In vector form, the SIMO system is equivalent to
$$y_i = H_i x + w_i, \qquad i = 1, 2, \ldots, 2R \qquad (2)$$
where
$$y_i = [y_i[1] \; y_i[2] \; \ldots \; y_i[N]]^T, \qquad x = [x[1] \; x[2] \; \ldots \; x[N]]^T,$$
and H_i is the N × N lower-triangular Toeplitz convolution matrix whose first column is [h_i[0] h_i[1] ... h_i[L-1] 0 ... 0]^T:
$$H_i = \begin{bmatrix}
h_i[0] & 0 & \cdots & 0 \\
h_i[1] & h_i[0] & \ddots & \vdots \\
\vdots & \ddots & \ddots & 0 \\
0 & \cdots & h_i[1] & h_i[0]
\end{bmatrix}.$$
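To make the channel model concrete, the following sketch simulates equation (1) for a few sensors. It is not part of the original package (which is written in Matlab); the sampling rate, number of taps, tap positions, and noise variance below are arbitrary values chosen only for illustration.

    # Sketch of the SIMO channel model (1): each sensor receives the source signal
    # convolved with a sparse multipath channel plus white Gaussian noise.
    # All numeric values are assumptions made for this example.
    import numpy as np

    rng = np.random.default_rng(0)
    fs = 16000            # sampling rate f_c (assumed)
    N = 1024              # samples per block
    L = 64                # channel taps per sensor
    num_sensors = 4       # 2R sensors
    noise_var = 1e-3      # variance of w_i[n]

    x = rng.standard_normal(N)                 # wideband source signal (white, for simplicity)
    y = np.zeros((num_sensors, N))
    channels = []
    for i in range(num_sensors):
        h = np.zeros(L)
        direct = int(rng.integers(5, 20))      # delay (in samples) of the direct path
        h[direct] = 1.0                        # direct path: strongest tap
        echoes = rng.integers(direct + 1, L, size=3)
        h[echoes] = 0.3 * rng.standard_normal(3)   # weaker reverberant taps
        channels.append(h)
        # y_i[n] = sum_l h_i[l] x[n-l] + w_i[n], as in (1)
        y[i] = np.convolve(x, h)[:N] + np.sqrt(noise_var) * rng.standard_normal(N)

    # The propagation delay to each sensor is the index of the strongest tap
    # (this becomes equation (3) in the next section).
    print([int(np.argmax(np.abs(h))) for h in channels])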
3 Problem formulation
We are interested in estimating the location of the source r_s = [r_sx r_sy r_sz]^T, given the received samples y_i[n] and the location of each sensor r_i = [r_ix r_iy r_iz]^T, i = 1, ..., 2R. Let us partition the sensors into pairs, and let us denote by j1 and j2 the two sensors in the j-th pair. In a multipath channel such as (1), an observable signal characteristic is the time difference of arrival (TDOA) of a source relative to each pair of sensors. In fact, the signal component propagating through the direct path is usually stronger than the components due to reflections, so the channel tap due to the direct path is most of the time the largest in magnitude. As such, the propagation delay in samples τ_i between the i-th sensor and the source can be determined as
$$\tau_i = \arg\max_{l} |h_i[l]|, \qquad l = 0, 1, \ldots, L-1 \qquad (3)$$
and the TDOA between the two channels in the j-th pair of sensors is obtained as
$$\tau_j = \tau_{j1} - \tau_{j2}, \qquad j = 1, 2, \ldots, R \qquad (4)$$
Given the TDOA and the sampling rate f_c, one can easily compute the relative distance of the acoustic source from the pair of microphones as
$$D_j = \frac{\tau_j v}{f_c}, \qquad j = 1, 2, \ldots, R \qquad (5)$$
where v is the propagation speed of sound. The TDOA estimates of each pair of sensors, along with knowledge of the sensor positions, are used to arrive at a source location estimate. Given D_j, the sound source r_s must lie on a three-dimensional hyperboloid with equation d(r_j1, r_s) - d(r_j2, r_s) = D_j, where d(x, y) denotes the Euclidean distance between the vectors x and y. Let D = [D_1 ... D_R]^T denote the set of distances from the source to every pair of sensors. The vector D identifies a set of hyperboloids and, as a consequence, the location of the source must be the unique intersection point of this set of hyperboloids. However, the channel impulse response h_i is usually not known at the i-th sensor and must be estimated from the received samples y_i. The channel estimation problem is particularly challenging since we do not know the transmitted signal x: the channel must be estimated blindly. In the next section we derive the optimum ML estimator for the joint estimation of the source signal x and the channel h_i given y_i. Errors in the channel estimates may introduce errors in D, and there might not exist a point lying on all the hyperboloids. So, given a set of R pairs of sensors, we estimate the source location as the vector r̂_s which minimizes the LS error with respect to the vector D. Thus, the source location is estimated as
$$\hat{r}_s = \arg\min_{r} \sum_{i=1}^{R} \left| d(r_{i1}, r) - d(r_{i2}, r) - D_i \right|^2 \qquad (6)$$
In the next section, two methods for channel and TDOA estimation are introduced.
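As an illustration of the least-squares step (6), the sketch below solves the hyperbolic localization problem with scipy.optimize.least_squares. It is not the server.m code: the microphone positions, the synthetic source used to generate the D_j values, and the initial guess are assumptions made for the example.

    # Least-squares source localization from pairwise distance differences, as in (6).
    import numpy as np
    from scipy.optimize import least_squares

    # Four pairs of microphones (eight sensors) at assumed positions, in meters.
    pairs = [
        (np.array([0.0, 0.0, 1.0]), np.array([0.4, 0.0, 1.0])),
        (np.array([4.0, 0.0, 1.0]), np.array([4.0, 0.4, 1.0])),
        (np.array([4.0, 3.0, 1.0]), np.array([3.6, 3.0, 1.0])),
        (np.array([0.0, 3.0, 1.0]), np.array([0.0, 2.6, 1.0])),
    ]

    true_source = np.array([2.5, 1.2, 1.5])    # used only to synthesize the D_j values
    D = np.array([np.linalg.norm(r1 - true_source) - np.linalg.norm(r2 - true_source)
                  for r1, r2 in pairs])

    def residuals(r):
        # one residual d(r_j1, r) - d(r_j2, r) - D_j per pair, as in (6)
        return np.array([np.linalg.norm(r1 - r) - np.linalg.norm(r2 - r) - Dj
                         for (r1, r2), Dj in zip(pairs, D)])

    sol = least_squares(residuals, x0=np.array([2.0, 1.5, 1.0]))
    print(sol.x)    # near true_source, although TDOA geometry may leave some directions weakly constrained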
4 TDOA estimation

4.1 ML estimation
Let us consider the i-th sensor. We are interested in jointly estimating the source signal x and the channel h_i given the received y_i. The block of samples y_i can be transformed to the frequency domain by a DFT of length N_c. In order to approximate the linear convolution in (1), appropriate zero padding can be applied, so in general N_c > N. In the frequency domain, the channel model is given by
$$Y_i = H_i X + W_i, \qquad i = 1, 2, \ldots, 2R \qquad (7)$$
where Y_i = [Y_i[1] Y_i[2] ... Y_i[N_c]]^T represents the N_c-point DFT of the sampled data received at the i-th sensor. The ML estimation problem can be expressed as
$$\max_{H_i, X} f(Y_i \mid H_i, X) \qquad (8)$$
It can be shown that the solution to the ML problem is the following:
$$\hat{H}_i = \arg\min_{H_i} \left\| \left( I - H_i \left( H_i^T H_i \right)^{-1} H_i^T \right) Y_i \right\|^2 \qquad (9)$$
$$\hat{X} = \left( H_i^T H_i \right)^{-1} H_i^T Y_i \qquad (10)$$
Given the channel estimate (9), one can compute the propagation delay in samples as
$$\tau_i = \arg\max_{j} \left| \mathrm{IDFT}\big( \hat{H}_i \big)[j] \right| \qquad (11)$$
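A numerical illustration of the cost in (9)-(10) is sketched below: for a candidate channel h, it builds the convolution matrix H and measures how much of the observed block lies outside the column space of H. Here H is built from the full (N+L-1)-sample linear convolution so that it is tall and the projection is informative; this is only an illustration of the quantity being minimized, not the real-time procedure used in the testbed.

    # Evaluate the ML projection cost (9) and the source estimate (10) for one candidate channel.
    import numpy as np
    from scipy.linalg import toeplitz

    def conv_matrix(h, N):
        """(N+L-1) x N Toeplitz matrix such that conv_matrix(h, N) @ x == np.convolve(h, x)."""
        col = np.concatenate([h, np.zeros(N - 1)])
        row = np.concatenate([[h[0]], np.zeros(N - 1)])
        return toeplitz(col, row)

    def ml_cost(h_candidate, y_full, N):
        """|| (I - H (H^T H)^{-1} H^T) y ||^2 for one candidate channel, as in (9)."""
        H = conv_matrix(h_candidate, N)
        x_hat, *_ = np.linalg.lstsq(H, y_full, rcond=None)   # the source estimate of (10)
        r = y_full - H @ x_hat
        return float(r @ r)

    # The true channel gives (near) zero cost; a wrong candidate does not.
    rng = np.random.default_rng(1)
    N, h_true = 256, np.array([1.0, 0.5, 0.25])
    x = rng.standard_normal(N)
    y_full = np.convolve(h_true, x)
    print(ml_cost(h_true, y_full, N), ml_cost(np.array([1.0, 0.0, 0.0]), y_full, N))

Searching this cost over all candidate channels is what makes the ML approach expensive, which motivates the remark below.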
Complexity prohibits the use of (9) in real-time applications. More generally, all the optimal strategies proposed in the literature are computationally intensive, but they tend to be robust in realistic environments. The majority of practical localization systems are based on less complex but suboptimal estimators.
4.2 GCC method
The most commonly used method for TDOA estimation is the GCC method, introduced in [2]. The TDOA estimate for each pair of sensors is obtained as the time lag which maximizes the cross-correlation function between filtered versions of the received signals; the GCC function is in fact defined as the cross-correlation function of the filtered signals. In [2], the signals at the two sensors are modeled as
$$y_1(t) = x(t) + n_1(t)$$
$$y_2(t) = \alpha\, x(t - D) + n_2(t)$$
where n_1(t), n_2(t) and x(t) are assumed jointly wide-sense stationary. A voice signal is usually assumed stationary over 20-30 ms frames. A property of the autocorrelation function is that R(τ) ≤ R(0). The cross-correlation of y_1(t) and y_2(t) is
$$R_{y_1 y_2}(\tau) = R_{xx}(\tau) * \delta(\tau - D) \qquad (12)$$
and it has a peak at τ = D. So, the delay estimate is
$$\hat{D} = \arg\max_{\tau} R_{y_1 y_2}(\tau) \qquad (13)$$
If the input signal is ergodic, the cross-correlation function can be approximated by the time-average correlation
$$\hat{R}_{y_1 y_2}(\tau) = \frac{1}{T - \tau} \int_{\tau}^{T} y_1(t)\, y_2(t - \tau)\, dt \qquad (14)$$
The cross-correlation function is related to the cross-spectral density function through an inverse Fourier transform:
$$\hat{R}_{y_1 y_2}(\tau) = \int_{-\infty}^{\infty} G_{y_1 y_2}(f)\, e^{j 2 \pi f \tau}\, df \qquad (15)$$
In order to eliminate the influence of the possible autocorrelation of the source signal, it is desirable to pre-filter y_1 and y_2 before their cross-correlation is computed. Various filter designs have been proposed (whitening filters, Wiener-Hopf, ML for Gaussian sources). When y_1 and y_2 are filtered, the cross-correlation function becomes
$$\hat{R}_{y_1 y_2}(\tau) = \int_{-\infty}^{\infty} \Psi(f)\, G_{y_1 y_2}(f)\, e^{j 2 \pi f \tau}\, df \qquad (16)$$
The GCC technique does not perform well in a reverberant environment (the mathematical model assumes free-space propagation). A basic approach to dealing with a multipath channel is to de-emphasize the frequency-dependent weighting. The Phase Transform (PHAT) method places equal emphasis on each component of the cross-spectrum phase. The corresponding filter design is
$$\Psi(f) = \frac{1}{\left| G_{y_1 y_2}(f) \right|} \qquad (17)$$
The resulting peak corresponds to the dominant delay in the reverberant signal. However, the weighting accentuates components with poor SNR, and it is known that the PHAT filter performs badly under high reverberation and high noise conditions. We can apply the GCC method to the channel model in (1). For starters, the model can be written as
$$y_i = H_i^d x + z_i, \qquad i = 1, 2, \ldots, 2R \qquad (18)$$
where H_i^d is the component of the channel response due to the direct path, and z_i is colored noise, given by the sum of w_i and the component of the signal due to reverberation. As an example, let us consider the j-th pair of microphones. A possible representation of the two channel matrices is
$$H_{j1}^d = h_{j1}[0]\, I_N, \qquad \left( H_{j2}^d \right)_{k,\,k-m} = h_{j2}[m] \ \text{ for } k = m+1, \ldots, N, \ \text{ zero elsewhere} \qquad (19)$$
that is, the direct path reaches sensor j1 with no delay and sensor j2 with a delay of m samples.
In this specific case we have that
$$y_{j1}[n]\, y_{j2}[n+\tau] \;\le\; y_{j1}[n]\, y_{j2}[n+m] \;=\; h_{j1}[0]\, h_{j2}[m]\, |x[n]|^2 + z_{j1}[n]\, z_{j2}[n+m] \qquad (20)$$
More generally, the TDOA for a pair of sensors can be computed directly as the lag which maximizes the correlation of the two received sample sequences. Thus, the TDOA can be formally expressed in the following way:
$$\tau_j = \arg\max_{m} \sum_{k=0}^{N} y_{j1}[k]\, y_{j2}[k + m] \qquad (21)$$
The advantage of this approach compared to the ML estimation is that the GCC method is computationally less demanding. On the other hand, the noise is now correlated, and in the presence of multipath the global maximum of (21) may not correspond to the true delay. In fact, reverberation causes spurious peaks which may have greater amplitude than the peak due to the true source, so that choosing the maximum peak may not give accurate results. The frequency of these spurious peaks increases dramatically as the reverberation time becomes larger than 200 ms. Also, the method cannot accommodate multi-source scenarios. Although suboptimal, the method performs well in rooms with low reverberation, and its simplicity suggests its use in real-time applications.
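A minimal GCC-PHAT sketch tying together (16), (17) and (21) is given below. It estimates the TDOA between two microphone signals from the peak of the phase-transform-weighted cross-correlation. The signal lengths, sampling rate, and the simple peak picking (no interpolation) are assumptions made for the example; the testbed implements this processing in Matlab (client.m).

    # GCC-PHAT time-difference-of-arrival estimation between two microphone signals.
    import numpy as np

    def gcc_phat_tdoa(y1, y2, fs, max_tau=None):
        """Estimated delay of y2 relative to y1, in seconds (positive if y2 arrives later)."""
        n = len(y1) + len(y2)                      # zero padding to approximate linear correlation
        Y1 = np.fft.rfft(y1, n)
        Y2 = np.fft.rfft(y2, n)
        G = Y2 * np.conj(Y1)                       # cross-spectrum
        G /= np.abs(G) + 1e-12                     # PHAT weighting Psi(f) = 1/|G|, as in (17)
        cc = np.fft.irfft(G, n)                    # GCC function, as in (16)
        max_shift = n // 2
        if max_tau is not None:                    # optionally restrict to physically possible lags
            max_shift = min(int(fs * max_tau), max_shift)
        cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))
        lag = int(np.argmax(np.abs(cc))) - max_shift   # peak location, as in (21)
        return lag / fs

    # Example: y2 is y1 delayed by 25 samples, so the estimated lag should be about 25 samples.
    fs, v = 16000, 345.0
    rng = np.random.default_rng(2)
    x = rng.standard_normal(4096)
    y1 = x
    y2 = np.concatenate((np.zeros(25), x))[:len(x)]
    tau = gcc_phat_tdoa(y1, y2, fs)
    print(tau * fs, tau * v)    # lag in samples, and the relative distance D of (5) in meters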
5 Implementation

5.1 The synchronization problem
The GCC/PHAT method (17) has been implemented in an acoustic tracking application. This method assumes that each pair of sensors in the network performs synchronous sensing. By synchronous sensing we mean that the audio signals must be sampled synchronously and the sampled data must be shared within each pair to perform the signal processing. On the other hand, different pairs need not be synchronized, and sampled data can be collected with time tags at a central node; thus, the synchronization requirement is limited to each pair. Synchronous sensing implies limited timing errors: when sampling at rates higher than 8 kHz, the timing error cannot exceed a few microseconds. Besides clock synchronization across nodes, perfect synchronous sampling requires controlling the latencies of the sampling system. However, achieving clock synchronization of the order of microseconds in a distributed sensor network is a challenging task, and many audio cards have large and non-deterministic latencies when asked to start recording. In [3], clock synchronization is achieved by implementing the RBS algorithm on Linux machines, and the accuracy of the sampling sub-system is controlled using an "audio server". RBS is a protocol for high-precision synchronization which, on the other hand, demands a significant communication load. In our implementation, we avoided the synchronization problem by attaching two microphones to the same board and sampling the received signals through a common processor. This choice also solves the problem of the accuracy of the sampling sub-system, since the non-deterministic latency of the audio card is the same for the two microphones. The resulting network topology is drawn in Figure 1, where each pair of microphones is connected to a processing board, and the boards are wirelessly linked to a central node.
Figure 1: Network structure
5.2 Hardware and software requirements
The hardware system consists of eight microphones, five laptops, and one video camera. The number of microphones is arbitrary, as long as it is a multiple of two. The microphones are divided into pairs, and each pair is connected to a laptop. Since the microphone input line is mono, the microphones must be connected to the laptop through the line in. However, the line in is un-powered, so the microphones must be powered with a battery. We tested two types of microphones: the Radio Shack Omnidirectional Boundary Microphone and the Sony ECM-F8. We used five IBM ThinkPads running Windows XP, and the camera we tested is a Sony EVI-D71. The microphones cannot be plugged into the line in through a simple mini-to-mini Y adaptor. Instead, each microphone plugs into a mini-to-RCA connector, and the two female RCA ends fit into an RCA-to-mini Y adaptor. We used standard connectors and adaptors available in stores. The remaining laptop is used as an information sink and to send commands to the camera.
In the following, we will refer to the central laptop as the server and to the laptops connected to the microphones as clients. Clients and server communicate through a TCP connection, so the laptops must be connected via wireless LAN or Ethernet; we connected the laptops through an 802.11a access point (a minimal sketch of this exchange is given after the file list below). The operating system used on each laptop is Windows XP, and Matlab (with the Data Acquisition Toolbox) must be installed on each machine. The package we developed consists of the following files:

• server.m: the main application run at the server. Based on the distance estimates, it performs the LS algorithm for estimating the source location (6) and sends commands to the camera.

• client.m: the main application run at the client. It estimates the source distance through the PHAT method (17) and sends the information to the server.

• myfun.m and LMS.m: code run at the server while performing the LS algorithm (6).

• detect voice mex.dll: silence/speech detector run at the client.

• MCreateFigure.m and Updater.m: they create and update the main figure displayed at the server.

• pnet.m, pnet getvar.m, pnet putvar.m, pnet remote.m: they manage the TCP/IP connection.

• camcmd.exe and Devserv.exe: these applications send commands to the camera through UDP packets.
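The actual testbed implements this exchange in Matlab with the pnet utilities listed above. The snippet below is only a language-neutral sketch of the same idea; the host address, port, and the plain-text message format "pair_id,distance" are made-up example values, not the format used by the package.

    # Sketch of the client/server TCP exchange: each client reports a distance estimate D_j,
    # and the server hands it to the localization routine.
    import socket

    HOST, PORT = "192.168.0.10", 5005          # address of the server laptop (example values)

    def client_send(pair_id, distance):
        # client side: report one distance estimate to the server
        with socket.create_connection((HOST, PORT)) as s:
            s.sendall(f"{pair_id},{distance:.4f}\n".encode())

    def server_loop(handle_update):
        # server side: accept client reports and pass them to the update routine,
        # which would rerun the LS step (6) when an estimate changes
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as srv:
            srv.bind(("", PORT))
            srv.listen()
            while True:
                conn, _ = srv.accept()
                with conn:
                    line = conn.makefile().readline()
                    pair_id, distance = line.strip().split(",")
                    handle_update(int(pair_id), float(distance))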
5.3 Setup
The core of the code consists of two files, server.m and client.m. Before setting the parameters in these files, we need to define a Cartesian set of coordinates for the room, then fix the camera and microphone locations and measure these positions with respect to the chosen set of coordinates. In this demo it is assumed that the server knows exactly the positions of the microphones and the camera, so the coordinates must be stored in server.m. Then, we need to connect server and clients to the same sub-network. Each client should be able to ping the server, and the port chosen for the TCP application should not conflict with other applications. Next, we need to configure the operating system of each client to use Line-In for recording; since we are using Windows, this can be done through the Volume Control panel. Finally, some specific parameters may need to be set in server.m and client.m; refer to the code documentation for details. The video camera must be connected to the server.
5.4 Running the demo
Once server.m is run, a window is shown on the server screen (an example is shown in Figure 2). The figure shows the geometry of the room and the relative positions of the microphones. A green circle represents the final estimate of the acoustic source location; the size of the circle is proportional to the estimation error, i.e., the smaller the circle, the more accurate the estimate. After that, the server waits for connections from the clients. Every time a packet is received from a client, the distance estimates are updated with the new information stored in the packet. If at least one distance estimate has changed, the LS algorithm (6) is run to re-estimate the source location. When the LS algorithm produces an estimate with a sufficiently small error, the source location is updated on the figure and a command is sent to steer the camera. The user can stop the server by clicking the 'Stop' button in the figure.

Figure 2: Network structure

Once client.m is run, Matlab samples the sound signal received at the two microphones and stores a vector of samples corresponding to one second of sound. The two data vectors are sampled synchronously and are ready to be processed. A simple silence/speech detector analyzes the sampled data to identify frames of voice. If the presence of voice is detected, the GCC/PHAT algorithm is run to provide an estimate of the source distance. If the information is reliable, i.e., the estimated relative distance is not larger than the distance between the two microphones, the estimate is sent to the server through the TCP connection. After that, Matlab records a new block of sound samples to process; this loop repeats until a timer set by the user expires (a sketch of the loop is given below).
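The skeleton below summarizes the client loop just described. The real implementation is client.m (Matlab with the Data Acquisition Toolbox); here the capture, voice-detection, TDOA, and reporting steps are passed in as callables because those helpers are placeholders, and the sampling rate and microphone spacing are example values.

    # Skeleton of the processing loop run by each client, following the description above.
    import time

    def client_loop(pair_id, record_one_second, detect_voice, estimate_tdoa, send_to_server,
                    stop_time, fs=16000, mic_spacing=0.4, v=345.0):
        while time.time() < stop_time:              # run until the user-set timer expires
            y1, y2 = record_one_second(fs)          # synchronous two-channel capture via the line in
            if not detect_voice(y1):                # silence/speech detector
                continue
            tau = estimate_tdoa(y1, y2, fs)         # GCC/PHAT TDOA estimate in seconds (Section 4.2)
            distance = tau * v                      # relative distance D_j, as in (5)
            if abs(distance) <= mic_spacing:        # discard estimates larger than the microphone spacing
                send_to_server(pair_id, distance)   # e.g. over the TCP connection to the server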
6 Conclusion and future work
In this project, we implemented a Windows-based wireless sensor network testbed for acoustic localization. The testbed has been used for a camera steering application. Future work includes the implementation of the algorithm on ZigBee boards; the major challenge is the design of an efficient synchronization protocol with minimal communication load. We also aim to use sensor networks for speech enhancement applications, e.g., acoustic beamforming and noise suppression.
References

[1] J. C. Chen, K. Yao, and R. E. Hudson, "Source localization and beamforming," IEEE Signal Processing Magazine, vol. 19, no. 2, pp. 30-39, March 2002.

[2] C. H. Knapp and G. C. Carter, "The generalized correlation method for estimation of time delay," IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 24, no. 4, pp. 320-327, 1976.

[3] J. C. Chen, R. E. Hudson, and K. Yao, "Maximum likelihood source localization and unknown sensor location estimation for wideband signals in the near-field," IEEE Transactions on Signal Processing, vol. 50, pp. 1843-1853, Aug. 2002.

[4] S. Hirsch, Acoustic Tracker documentation, www.mathworks.com.