DoA of gunshot signals in a spatial microphone array: performance of the interpolated Generalized Cross-Correlation method
Izabela L. Freire† and José A. Apolinário Jr.†,‡
Military Institute of Engineering (IME)
†Program of Defense Engineering and ‡Program of Electrical Engineering
Praça General Tibúrcio, 80
22.290-2770 Rio de Janeiro, Brazil
Abstract—The direction of arrival (DoA) of impulsive, wideband audio signals, specifically of gunshot signals, is the main interest of this work. The Generalized Cross-Correlation (GCC) method, typically employed for wideband signals, is used in a tetrahedral microphone array to estimate horizontal and vertical angles. The performance of the scheme is investigated with signals recorded in an open-air, low-noise environment and then repeatedly replayed in an open-air, noisy environment, under controlled DoA. To increase the accuracy of the DoA estimates, the correlation function evaluated between each pair of microphones and used by the GCC methods goes through an interpolation process. From this study, we conclude that the GCC method can perform adequately in the task of estimating the DoA of a gunshot signal, even in the presence of strong noise and with a lower sampling frequency.
Index Terms—Direction-of-arrival, DOA estimation, GCC algorithms, microphone array, anti-sniper system
Authors thank the Brazilian agencies CAPES and CNPq for partial funding of this paper.
I. INTRODUCTION
This paper presents techniques for the estimation of the Direction of Arrival (DoA) of wideband impulsive signals [1]. Specifically, the DoA of gunshot signals in open-field locations is to be estimated, in the context of an anti-sniper system. The shooting of a supersonic projectile can be associated with two separate audio signals: the muzzle blast (MB), an acoustic signal of supersonic origin, caused by the explosion of the charge that propels the bullet, and thus generated at the location of the sniper, and the ballistic shockwave (bSW), generated, also with supersonic origin, as the projectile travels through air [2], [3]. The DoA estimation problem is central to an anti-sniper system, as joint knowledge of the MB and bSW DoAs gives not only the direction from which a firearm has been shot, but also an estimate of the location along that direction [4], [5].
The first step of a DoA estimation is determining the presence of a gunshot; for that, we assume that the methods of [6] and [7] perform well enough to be used with success even in the presence of additive noise. Other clues about the sniper and the equipment used can be extracted from the acoustic signal generated by the shooting event, such as the characterization of the firearm, ammunition, and bullet caliber [5], [6], but these are not the topic of this paper. Here we focus on the DoA estimation of impulsive signals, such as MB and bSW, and the experimental part of this work deals exclusively with the MB signal.
Due to the impulsive, wideband characteristics of the signals of interest, the DoA algorithms are picked from the class of Generalized Cross-Correlation methods [8], which estimate the time-difference-of-arrival (TDoA) as the lag time that maximizes the (weighted) cross-correlation function between signals acquired from a pair of microphones. Different weighting functions give rise to the unweighted cross-correlation, Maximum Likelihood, and Phase Transform (PHAT) algorithms. Algorithms in this family have a low computational cost in comparison with other DoA estimation techniques. Note that the GCC family of algorithms is poorly suited to reverberant environments [9].
In this paper, DoA is estimated for signals arriving from controlled locations. Gunshot signals are recorded in an open-field, low-noise environment, and then repeatedly presented, via a soundbox, arriving from azimuths and elevations of choice. The use of recordings instead of live gunshots allows more precision in measuring the angles to be estimated.
The main purpose of this paper is to describe the performance of a family of DoA estimation methods known as GCC, presenting statistical results while detailing the techniques and their performance in a practical situation where noise, limited sampling frequency, and possible influence of the environment are present. In [10], the performance of GCC algorithms for multiple source localization is studied. Here, in order to increase DoA estimation accuracy, we propose an interpolation of the GCC correlation functions taken between each pair of microphones. Practical issues leading to a hardware implementation, such as the performance of this scheme with a lower sampling frequency and the microphone setup (number of microphones, their geometry, and the choice of pairs of microphones to be used in estimating the DoA), are topics of current investigation.
II. THE SETUP
In this section, we detail the array and the corpus used in our work. Fig. 1 depicts the spatial microphone array, a tetrahedron with an equilateral (or close to equilateral) base. The coordinates of the microphones are given, in Cartesian coordinates, with origin placed at the geometric center of the base of the tetrahedron, by
$$\mathbf{p}_1 = [0 \;\; 0 \;\; 0.266]^T, \quad \mathbf{p}_2 = [0.272 \;\; 0 \;\; 0]^T, \quad \mathbf{p}_3 = [-0.136 \;\; 0.235 \;\; 0]^T, \quad \text{and} \quad \mathbf{p}_4 = [-0.136 \;\; {-0.235} \;\; 0]^T.$$
Fig. 1. The microphone array used in this work.
[Fig. 1 labels: axes x, y, z; microphones 1 to 4; angles θ and φ; unit vector $\mathbf{a}_{\theta,\phi}$.]
Angles φ and θ, as shown in Fig. 1, define the unit vector $\mathbf{a}_{\theta,\phi}$, in the direction of the plane wave, and represent the azimuth and the grazing angle (complement of the elevation) of the sound source. Azimuth was initially measured with a theodolite, and survey marks were positioned with precision of tenths of minutes; later, the soundbox and the microphone array were positioned with a laser beam, which reduced the precision of the initial measurement. Elevation was calculated as the arctangent of a ratio of distances. The soundbox, as well as the base of the rotating microphone array, was kept exactly above the survey marks.
The original gunshot signals were recorded in an open-field location. Five different examples were selected for further experimentation. The recorded signals were repeatedly reproduced via a soundbox (a total of 300 repetitions, 60 of each different example, for each of the 8 different locations, here referred to as datasets 1 to 8) in a second open-field location. An example of a gunshot signal, recorded by the microphone array, and its power spectral density are shown in Fig. 2.
Fig. 2. A gunshot muzzle blast and its PSD, as recorded by a microphone whose frequency response is almost flat up to 20 kHz and decays sharply after that. The gunshot muzzle blast is a wideband impulsive signal.
[Fig. 2 panels: waveform, amplitude versus time (ms); Power Spectral Density, power/frequency (dB/Hz) versus frequency (kHz).]
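As a quick check on the array geometry, the following Python sketch computes the edge lengths of the tetrahedron from the coordinates p1 to p4 listed above and converts the longest edge into a maximum inter-microphone delay in samples. It assumes that the coordinates are in meters and that the speed of sound is 343 m/s; neither value is stated in the text, and `SPEED_OF_SOUND`, `edge_lengths`, and `max_lag_samples` are illustrative names.

```python
import itertools
import numpy as np

SPEED_OF_SOUND = 343.0  # m/s; assumed value, not given in the paper

# Microphone coordinates p1..p4 from Section II (assumed to be in meters).
MICS = np.array([
    [0.0, 0.0, 0.266],
    [0.272, 0.0, 0.0],
    [-0.136, 0.235, 0.0],
    [-0.136, -0.235, 0.0],
])

def edge_lengths(mics):
    """Return {(i, j): distance} for every microphone pair, 1-based indices."""
    return {
        (i + 1, j + 1): float(np.linalg.norm(mics[i] - mics[j]))
        for i, j in itertools.combinations(range(len(mics)), 2)
    }

def max_lag_samples(mics, fs, v=SPEED_OF_SOUND):
    """Largest possible TDoA between any two microphones, in samples at rate fs."""
    return max(edge_lengths(mics).values()) / v * fs

for pair, d in edge_lengths(MICS).items():
    print(f"mics {pair}: {d:.3f} m")
for fs in (96_000, 16_000, 12_000):
    print(f"fs = {fs} Hz: max |TDoA| is about {max_lag_samples(MICS, fs):.1f} samples")
```

Under these assumptions the base edges come out near 0.47 m and the lateral edges near 0.38 m, so at 12 kHz the largest delay spans only about 16 samples, which suggests why sub-sample resolution of the correlation peak matters at low sampling rates.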
The sound was then collected by the tetrahedral microphone array and sent to four channels of an analog-to-digital (A/D) conversion board, the PreSonus Firepod. Signals from the A/D board were acquired by a Matlab script at 96k samples per second. The distance between the soundbox and the microphones was set to 10 m, enough for modeling the sound arriving at the microphone array as in the far-field case. Signals were recorded from DoAs of azimuths −60°, −30°, 0°, 30°, and 60°. Grazing angles were set to 86.9°, 90°, and 91.4°. The initially recorded signals were free of noise (the field for recording them had a very low level of noise), but, as they were reproduced in the experiments, they were contaminated by noise. SNR was measured as
Psignal
Psignal window window − Pnoise
,
(1)
window
with a signal window of 5 ms and a noise window of 500 ms. Measuring the SNR of impulsive signals has to be done carefully due to the short duration of the signal window [1]. As such, the measured SNR varied between −13.6149 dB and 13.4723 dB.
III. THE GCC METHOD
We start our review of the DoA methods used herein by defining the signal model, a single-source reverberant signal arriving at microphones $i$ and $j$, as depicted in Fig. 3, where examples of correlation functions of the GCC methods described herein are also shown. In order to improve the estimation of the TDoA, prior to the estimation algorithm, we have multiplied the sequences $x_i(k)$ and $x_j(k)$, discrete-time versions of the microphone signals $x_i(t)$ and $x_j(t)$, by a Hanning window.
Fig. 3. Gunshot signals and GCC functions: (a) time domain signals, (b) classical correlation, (c) ML weighting, and (d) PHAT.
[Fig. 3 panels: (a) $x_i(k)$ and $x_j(k)$ versus $k$; (b), (c), (d) $r_{x_i x_j}(\tau)$ versus $\tau$.]
The cross-correlation between $x_i(k)$ and $x_j(k)$ being defined as $r_{x_i x_j}(\tau) = E[x_i(k)\,x_j(k-\tau)]$, we can write one possible estimate as
$$\hat{r}_{x_i x_j}(\tau) = \sum_{k=-\infty}^{\infty} x_i(k)\,x_j(k-\tau) = x_i(\tau) * x_j(-\tau). \tag{2}$$
Taking into account the definition of the power spectrum density (PSD), $R_x(e^{j\omega}) = \mathcal{F}\{r_x(\tau)\}$, we write the estimated cross power spectrum density between $x_i(k)$ and $x_j(k)$ as
$$\hat{R}_{x_i x_j}(e^{j\omega}) = \mathcal{F}\{\hat{r}_{x_i x_j}(\tau)\} = \mathcal{F}\{x_i(\tau) * x_j(-\tau)\} = X_i(e^{j\omega})\,X_j(e^{-j\omega}), \tag{3}$$
which corresponds to $X_i(e^{j\omega})\,X_j^*(e^{j\omega})$ if $x_j(k)$ is real. Assuming uncorrelated noise signals, we can replace $r_{x_i x_j}(\tau)$ by $E\{[s(k) * h_i(k)][s(k) * h_j(k-\tau)]\}$ such that (assuming real-valued $s(k)$ and $h_j(k)$)
$$\hat{R}_{x_i x_j}(e^{j\omega}) = \underbrace{S(e^{j\omega})\,S^*(e^{j\omega})}_{\hat{R}_s(e^{j\omega}) = |S(e^{j\omega})|^2}\, H_i(e^{j\omega})\,H_j^*(e^{j\omega}). \tag{4}$$
Therefore, the cross-correlation between signals $x_i(k)$ and $x_j(k)$, for the single-source reverberant model, can be expressed as
$$\hat{r}_{x_i x_j}(\tau) = \mathcal{F}^{-1}\{\hat{R}_{x_i x_j}(e^{j\omega})\} = \frac{1}{2\pi}\int_{-\pi}^{\pi} H_i(e^{j\omega})\,H_j^*(e^{j\omega})\,\hat{R}_s(e^{j\omega})\,e^{j\omega\tau}\,d\omega. \tag{5}$$
Considering (4) and (5), a generalized cross-correlation between signals from microphones $i$ and $j$ is defined as [9]
$$r^{G}_{x_i x_j}(\tau) = \frac{1}{2\pi}\int_{-\pi}^{\pi} \psi(\omega)\,\hat{R}_{x_i x_j}(e^{j\omega})\,e^{j\omega\tau}\,d\omega, \tag{6}$$
where the frequency weighting function $\psi(\omega)$ is added in an attempt to improve the time delay estimation and corresponds to one of the approaches that follow. Also note that, in practical implementations, we usually employ the FFT instead of the Discrete-Time Fourier Transform $\mathcal{F}\{\cdot\}$. Finally, the time-delay estimate is given by
$$\tau_{ij} = \arg\max_{\tau}\; r^{G}_{x_i x_j}(\tau). \tag{7}$$
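A minimal Python sketch of (6) and (7) follows, using the FFT in place of the Discrete-Time Fourier Transform as noted above. The function names `gcc` and `tdoa_samples`, the `psi` argument (an array of weights on the non-negative FFT bins), and the zero-padding to a power of two are illustrative choices, not details taken from the paper; with `psi=None` the weighting is ψ(ω) = 1.

```python
import numpy as np

def gcc(xi, xj, psi=None):
    """Generalized cross-correlation r^G(tau) of (6), computed with the FFT.

    xi, xj : real 1-D arrays of equal length L (already windowed if desired).
    psi    : optional weights on the rfft bins (length nfft // 2 + 1);
             None means psi(omega) = 1.
    Returns (lags, r) with lags running from -(L-1) to L-1.
    """
    xi = np.asarray(xi, dtype=float)
    xj = np.asarray(xj, dtype=float)
    L = len(xi)
    if len(xj) != L:
        raise ValueError("xi and xj must have the same length")
    nfft = 1 << (2 * L - 1).bit_length()   # zero-pad to avoid circular wrap-around
    Xi = np.fft.rfft(xi, nfft)
    Xj = np.fft.rfft(xj, nfft)
    R = Xi * np.conj(Xj)                   # cross-spectrum estimate, as in (3)
    if psi is not None:
        R = np.asarray(psi) * R            # frequency weighting of (6)
    r = np.fft.irfft(R, nfft)
    r = np.roll(r, L - 1)[: 2 * L - 1]     # reorder so negative lags come first
    lags = np.arange(-(L - 1), L)
    return lags, r

def tdoa_samples(xi, xj, psi=None):
    """Integer-lag time-delay estimate of (7): the lag that maximizes r^G."""
    lags, r = gcc(xi, xj, psi)
    return int(lags[np.argmax(r)])

# Toy check: the same noise burst arrives 4 samples later at microphone i.
rng = np.random.default_rng(0)
s = rng.standard_normal(256)
xi = np.zeros(300)
xi[7:7 + 256] = s
xj = np.zeros(300)
xj[3:3 + 256] = s
print(tdoa_samples(xi, xj))  # expected: 4, since x_i lags x_j by 4 samples
```

With the sign convention of (2), a positive lag means the signal reaches microphone i later than microphone j. The estimate is limited to whole samples here.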
A. Classical Cross-Correlation
In the discrete-time domain, the microphone signals can be modeled as
$$x_i(k) = s(k) * h_i(k) + n_i(k) \quad \text{and} \quad x_j(k) = s(k) * h_j(k) + n_j(k).$$
If we make $\psi(\omega) = 1$, we end up computing an estimate of the regular, or classical, cross-correlation $\hat{r}_{x_i x_j}(\tau)$.
B. Maximum Likelihood (ML)
The ML uses an estimate of an ideal frequency weighting function which is asymptotically unbiased and efficient for uncorrelated, stationary Gaussian signal and noises and no multipath [11]:
$$\psi = \frac{|X_i(e^{j\omega})|\,|X_j(e^{j\omega})|}{\hat{R}_{n_i}(e^{j\omega})\,\hat{R}_{x_i}(e^{j\omega}) + \hat{R}_{n_j}(e^{j\omega})\,\hat{R}_{x_j}(e^{j\omega})}, \tag{8}$$
where $\hat{R}_{x_i}(e^{j\omega}) = |X_i(e^{j\omega})|^2$, $\hat{R}_{x_j}(e^{j\omega}) = |X_j(e^{j\omega})|^2$, and the PSDs $\hat{R}_{n_i}(e^{j\omega}) = |N_i(e^{j\omega})|^2$ and $\hat{R}_{n_j}(e^{j\omega}) = |N_j(e^{j\omega})|^2$ are estimated during silence intervals.
C. Phase Transform (PHAT)
The PHAT weighting function is given by [8], [9]
$$\psi = \frac{1}{|\hat{R}_{x_i x_j}(e^{j\omega})|}. \tag{9}$$
For this case, replacing (9) in (6), it can be noted that $r^{G}_{x_i x_j}(\tau)$ results in $\frac{1}{2\pi}\int_{-\pi}^{\pi} e^{j(\angle H_i - \angle H_j + \omega\tau)}\,d\omega$ which, assuming a single delay ($h_i(k) = \alpha_i\,\delta(k)$ and $h_j(k) = \alpha_j\,\delta(k - \tau_{ij})$), would lead to $r^{G}_{x_i x_j}(\tau) = \delta(\tau - \tau_{ij})$, a perfect indication of the time delay.
D. Interpolating the correlation function
In order to further improve the accuracy of the TDoA estimate, the correlation function (between each pair of microphones) can be interpolated around its peak value, in search of a new maximum, as shown in Fig. 4. This allows a better TDoA estimation for each pair of microphones and therefore a better overall DoA estimation.
Interpolation was performed by cubic splines, working on three samples: the correlation peak and its immediate neighbors. For a target resolution of 4.8M samples per second, cubic splines interpolated 100, 218, and 800 points between the neighbors of the peak correlation value, for sampling rates of 96 kHz, 16 kHz, and 12 kHz, respectively. In Fig. 4, we can observe the result of the interpolation procedure (cubic spline) carried out on the correlation function; note in this figure that the maximum obtained from the GCC method, $\tau_{ij}$, does not correspond to the real maximum of a continuous-time function, and that the result of the interpolation, $\tau_{max}$, provides a better estimate of the delay $\Delta t = \tau_{max}/f_s$ between signals $x_i(t)$ and $x_j(t)$. The choice of the interpolation interval, from $\tau_{ij} - 1$ to $\tau_{ij} + 1$, seems correct, since any other local maximum above $r_{x_1 x_2}(\tau_{max})$ in the interval immediately to the left or immediately to the right would imply a high frequency which is not present, due to the limited band imposed by the anti-aliasing filter (as will be seen later, the original sampling frequency of 96 kHz is more than enough to sample the gunshot audio correctly).
Fig. 4. Interpolation of the correlation function by cubic splines. The number of points to be interpolated between two samples of the correlation function, for a given frequency resolution, varies according to the sampling rate of the time-domain signal.
E. Obtaining the angles
From the peaks of the weighted cross-correlation, we obtain all TDoAs (for all possible pairs of microphones in the array), $\tau_{ij}$, which, for the case of spatial arrays, do not have a direct relation to the DoA such as in the case of a uniform linear array (ULA).
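Obtaining one TDoA per microphone pair from the peak of the weighted cross-correlation, with the sub-sample refinement described for Fig. 4, might be sketched in Python as below, using the PHAT weighting of (9). Several details are assumptions rather than the authors' implementation: a natural cubic spline is used because the boundary conditions of the spline are not stated, a small `eps` guards the PHAT division, the test signal is synthetic, and all function names are illustrative.

```python
import itertools
import numpy as np

def gcc_phat(xi, xj, eps=1e-12):
    """PHAT-weighted GCC: (6) with psi = 1/|R| from (9). Returns (lags, r)."""
    L = len(xi)
    nfft = 1 << (2 * L - 1).bit_length()
    R = np.fft.rfft(xi, nfft) * np.conj(np.fft.rfft(xj, nfft))
    r = np.fft.irfft(R / (np.abs(R) + eps), nfft)   # eps avoids division by zero
    return np.arange(-(L - 1), L), np.roll(r, L - 1)[: 2 * L - 1]

def refine_peak(r, k, points_per_sample=50):
    """Fractional offset of the maximum of a natural cubic spline through
    r[k-1], r[k], r[k+1], searched on a dense grid over [-1, 1]."""
    if k == 0 or k == len(r) - 1:
        return 0.0                                   # no neighbor on one side
    y_m, y_0, y_p = r[k - 1], r[k], r[k + 1]
    m0 = 1.5 * (y_m - 2.0 * y_0 + y_p)               # spline curvature at the peak sample
    t = np.linspace(-1.0, 1.0, 2 * points_per_sample + 1)
    left = m0 * (t + 1) ** 3 / 6 - y_m * t + (y_0 - m0 / 6) * (t + 1)
    right = m0 * (1 - t) ** 3 / 6 + (y_0 - m0 / 6) * (1 - t) + y_p * t
    spline = np.where(t < 0, left, right)
    return float(t[np.argmax(spline)])

def pairwise_tdoas(x, fs, interpolate=True):
    """x: array of shape (n_mics, n_samples). Returns {(i, j): delay in seconds}."""
    out = {}
    for i, j in itertools.combinations(range(x.shape[0]), 2):
        lags, r = gcc_phat(x[i], x[j])
        k = int(np.argmax(r))
        frac = refine_peak(r, k) if interpolate else 0.0
        out[(i, j)] = (lags[k] + frac) / fs
    return out

def fractional_delay(x, d):
    """Delay x by d samples (possibly fractional) via a linear phase shift."""
    f = np.fft.rfftfreq(len(x))                      # cycles per sample
    return np.fft.irfft(np.fft.rfft(x) * np.exp(-2j * np.pi * f * d), len(x))

# Synthetic impulsive, wideband burst seen by four channels with known delays.
rng = np.random.default_rng(1)
burst = np.zeros(1024)
burst[400:528] = rng.standard_normal(128) * np.exp(-np.arange(128) / 20.0)
delays = [0.0, 4.3, -2.6, 1.0]                       # per-channel delays, in samples
x = np.stack([fractional_delay(burst, d) for d in delays])
fs = 96_000
for (i, j), tau in pairwise_tdoas(x, fs).items():
    true = delays[i] - delays[j]
    print(f"pair ({i},{j}): true {true:+.2f}, estimated {tau * fs:+.2f} samples")
```

In this toy case the refined estimates should move toward the true fractional delays without matching them exactly, which is what one would expect from a three-point fit to a narrow PHAT peak. Setting `points_per_sample` to 50 mirrors the 100 interpolated points quoted for 96 kHz.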
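Let $\bar{\tau}_{ij} = \tau_{ij}/f_s$ (or $\tau_{max}/f_s$, as from the previous subsection, if we use interpolation) be the delay between microphones $i$ and $j$. Considering $\bar{\tau}_i$ the time it takes the plane wave to travel from microphone $i$ to the origin of the coordinate system, we can write that $\bar{\tau}_{ij} = \bar{\tau}_i - \bar{\tau}_j$. Now making explicit the unit vector in the direction of the wavenumber, i.e., in the same direction as the propagation, we write
$$\mathbf{a}_{\theta,\phi} = \begin{bmatrix} -\sin\theta\cos\phi & -\sin\theta\sin\phi & -\cos\theta \end{bmatrix}^T, \tag{10}$$
such that
$$d_{ij} = \mathbf{a}_{\theta,\phi}^T\,\mathbf{p}_i - \mathbf{a}_{\theta,\phi}^T\,\mathbf{p}_j \tag{11}$$
and
$$\bar{\tau}_{ij} = \frac{d_{ij}}{v_{sound}} = \frac{\mathbf{a}_{\theta,\phi}^T\,\mathbf{p}_i - \mathbf{a}_{\theta,\phi}^T\,\mathbf{p}_j}{v_{sound}} = \mathbf{a}_{\theta,\phi}^T\,\Delta\mathbf{p}_{ij}, \tag{12}$$
where $\Delta\mathbf{p}_{ij} = \frac{\mathbf{p}_i - \mathbf{p}_j}{v_{sound}}$, $\mathbf{p}_i$ and $\mathbf{p}_j$ being the coordinates of the microphones.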
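To make (10) to (12) concrete, the sketch below evaluates the model delays for the array of Section II and one of the tested directions. It assumes that the microphone coordinates are in meters and that the speed of sound is 343 m/s, neither of which is stated in the text; the function names are illustrative.

```python
import itertools
import numpy as np

V_SOUND = 343.0  # m/s; assumed value, not given in the paper

# Microphone coordinates p1..p4 from Section II (assumed to be in meters).
MICS = np.array([
    [0.0, 0.0, 0.266],
    [0.272, 0.0, 0.0],
    [-0.136, 0.235, 0.0],
    [-0.136, -0.235, 0.0],
])

def unit_vector(theta, phi):
    """a_{theta,phi} of (10); theta = grazing angle, phi = azimuth, in radians."""
    return -np.array([
        np.sin(theta) * np.cos(phi),
        np.sin(theta) * np.sin(phi),
        np.cos(theta),
    ])

def model_delays(theta, phi, mics=MICS, v=V_SOUND):
    """Delays of (12) in seconds for every pair (i, j), 1-based, with i < j."""
    a = unit_vector(theta, phi)
    return {
        (i + 1, j + 1): float(a @ (mics[i] - mics[j]) / v)
        for i, j in itertools.combinations(range(len(mics)), 2)
    }

# Grazing angle 86.9 degrees and azimuth -60 degrees, one of the tested DoAs.
model = model_delays(np.deg2rad(86.9), np.deg2rad(-60.0))
for fs in (96_000, 12_000):
    print(f"fs = {fs} Hz")
    for (i, j), tau in model.items():
        print(f"  tau_{i}{j} = {tau * 1e3:+.4f} ms = {tau * fs:+.2f} samples")
```

Printing the delays in samples at two rates gives a feel for how coarse an integer-lag estimate becomes at 12 kHz compared with 96 kHz.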
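Fig. 5. Quantization effects on DOA estimations due to discrete-time nature of the correlation coefficients and the effect of interpolation.
[Fig. 5 panels: "Raw Correlation Function" and "Interpolated Correlation Function"; azimuth (°) versus SNR (dB).]
Using a Least-Squares (LS) approach and considering $N$ microphones with
$$\frac{N(N-1)}{2} \tag{13}$$
possible delays, we can find the correct DOA by obtaining the $\mathbf{a}_{\theta,\phi}$ that minimizes the following cost function:
$$\xi(\theta,\phi) = (\bar{\tau}_{12} - \Delta\mathbf{p}_{12}^T\,\mathbf{a}_{\theta,\phi})^2 + (\bar{\tau}_{13} - \Delta\mathbf{p}_{13}^T\,\mathbf{a}_{\theta,\phi})^2 + \cdots + (\bar{\tau}_{(N-1)N} - \Delta\mathbf{p}_{(N-1)N}^T\,\mathbf{a}_{\theta,\phi})^2. \tag{14}$$
Taking the gradient of (14) with respect to $\mathbf{a}_{\theta,\phi}$ and equating the result to zero, we obtain
$$\mathbf{a}_{DOA} = \mathbf{A}^{-1}\,\mathbf{b}, \tag{15}$$
where
$$\mathbf{A} = \Delta\mathbf{p}_{12}\,\Delta\mathbf{p}_{12}^T + \cdots + \Delta\mathbf{p}_{(N-1)N}\,\Delta\mathbf{p}_{(N-1)N}^T, \tag{16}$$
and
$$\mathbf{b} = \bar{\tau}_{12}\,\Delta\mathbf{p}_{12} + \cdots + \bar{\tau}_{(N-1)N}\,\Delta\mathbf{p}_{(N-1)N}. \tag{17}$$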
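A minimal Python sketch of (15) to (17) follows. It builds A and b from a set of pairwise delays, solves for a_DOA, and recovers the angles by inverting (10) as written, which introduces minus signs in the angle formulas. That sign handling, the coordinates in meters, the speed of sound of 343 m/s, and all names are assumptions of this example rather than details from the paper. The demo feeds the solver exact model delays and then delays rounded to whole samples at 12 kHz, to illustrate the kind of quantization effect described for Fig. 5.

```python
import itertools
import numpy as np

V_SOUND = 343.0  # m/s; assumed value, not given in the paper

# Microphone coordinates p1..p4 from Section II (assumed to be in meters).
MICS = np.array([
    [0.0, 0.0, 0.266],
    [0.272, 0.0, 0.0],
    [-0.136, 0.235, 0.0],
    [-0.136, -0.235, 0.0],
])

def delta_p(mics=MICS, v=V_SOUND):
    """Rows are Delta p_ij = (p_i - p_j) / v_sound for all pairs i < j."""
    pairs = itertools.combinations(range(len(mics)), 2)
    return np.array([(mics[i] - mics[j]) / v for i, j in pairs])

def ls_doa(taus, mics=MICS, v=V_SOUND):
    """Solve (15): a_DOA = A^{-1} b, with A from (16) and b from (17)."""
    dp = delta_p(mics, v)
    A = dp.T @ dp                  # sum over pairs of Delta p_ij Delta p_ij^T
    b = dp.T @ np.asarray(taus)    # sum over pairs of tau_ij Delta p_ij
    return np.linalg.solve(A, b)

def angles_from_vector(a):
    """Invert (10): returns (grazing angle theta, azimuth phi) in degrees."""
    a = np.asarray(a) / np.linalg.norm(a)
    theta = np.degrees(np.arccos(np.clip(-a[2], -1.0, 1.0)))
    phi = np.degrees(np.arctan2(-a[1], -a[0]))
    return float(theta), float(phi)

# Exact model delays from (12) for grazing angle 86.9 deg, azimuth -60 deg.
theta0, phi0 = np.radians(86.9), np.radians(-60.0)
a_true = -np.array([np.sin(theta0) * np.cos(phi0),
                    np.sin(theta0) * np.sin(phi0),
                    np.cos(theta0)])
taus = delta_p() @ a_true
print("exact delays:     theta, phi =", angles_from_vector(ls_doa(taus)))

# Same delays rounded to whole samples at 12 kHz (no interpolation).
fs = 12_000
taus_q = np.round(taus * fs) / fs
print("quantized delays: theta, phi =", angles_from_vector(ls_doa(taus_q)))
```

Solving with `np.linalg.solve` avoids forming the inverse of A explicitly. With exact delays the angles are recovered to numerical precision, while the rounded delays should shift the estimate noticeably.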
Assuming $\mathbf{a}_{DOA} = [a_x \;\; a_y \;\; a_z]^T$, the grazing angle ($\pi/2$ − elevation) of the DOA is given by
$$\theta = \cos^{-1} a_z \tag{18}$$
and the azimuth by
$$\phi = \tan^{-1}\frac{a_y}{a_x}. \tag{19}$$
The cross-correlation is calculated through the cross-correlation theorem. Considering $N$ microphones, it is necessary to take $N$ FFTs, which will then be used to calculate the weighted correlation functions. $N$ microphones give rise to $C(N, 2) = \frac{N(N-1)}{2}$ combinations of 2 microphones, and for each of these the IFFT must be calculated once.
IV. EXPERIMENTAL RESULTS
Here we describe results from the analysis of 8 datasets, with varying DoAs. Each dataset contains 300 signals. Each signal is one second long, centered around the peak amplitude of microphone number 1.
The effect of interpolation of the correlation function can be observed in Fig. 5, which shows the estimated DoAs for various sound samples. Without interpolation, the DoA assumes more coarsely quantized values, and even if the estimates have a lower standard deviation, they may be biased due to this quantization effect.
An obstacle to the analysis of the results is the intrinsic lack of precision in measuring the angles of the actual DoA. This lack of precision is overcome by observing that the mean of the estimates given by the three different GCC algorithms, after removal of outliers by considering only the central quartiles, converges with increasing sampling rate. An example of this is given in Fig. 6.
Fig. 6. Mean values of different estimation methods with increasing sampling rate (for dataset 8).
[Fig. 6 plot, titled "Estimated DoA, GCC algorithms": Correlation, ML, and PHAT, each with and without interpolation, at 96 kHz, 16 kHz, and 12 kHz; axis range −60.6 to −59.8.]
Fig. 7 depicts the estimated grazing angles for each dataset. As expected from the geometry of the array, the accuracy of this type of estimation is lower than that of the azimuth estimates. With a sampling frequency of 96 kHz, however, the obtained estimates agree with the measured angles.
Fig. 7. Estimated grazing angles. While, during recordings, azimuth was measured with a theodolite and later aligned using a laser beam, elevation could only be computed as the arctangent of the ratio of two distances.
[Fig. 7 plot, titled "Estimated grazing angle": eight panels, Datasets #1 to #8, with grazing angle below 90° for #1, #5, #6, and #7, equal to 90° for #3 and #4, and above 90° for #2 and #8; each panel shows Correlation, ML, and PHAT, with and without interpolation, at 96 kHz, 16 kHz, and 12 kHz; vertical axis from 86 to 92.]
V. CONCLUSIONS
From the results presented herein, it is possible to conclude that the GCC methods (especially the PHAT, as could be observed from its smaller number of outliers) are adequate for estimating the DoA of a gunshot signal in the addressed case of subsonic bullets, using a spatial microphone array such as the one in Fig. 1: they presented an absolute error lower than one to two degrees even for a sampling frequency as low as 12 kHz. Although the Nyquist rate for sampling from a microphone whose frequency response is null above 20 kHz is 40 kHz, in the case of multiple microphones it could be useful to sample at frequencies higher than that. From our experiments, we were able to observe that, for sampling rates above 40 kHz, the accuracy of the method was within one degree.
REFERENCES
[1] A. Dufaux, “Detection and recognition of impulsive sound signals,” Ph.D. dissertation, Institute of Microtechnology, University of Neuchâtel, Neuchâtel, Switzerland, 2001.
[2] R. C. Maher, “Modeling and signal processing of acoustic gunshot recordings,” in Proc. IEEE Signal Processing Society 12th DSP Workshop, Jackson Lake, USA, pp. 257–261, September 2006.
[3] R. Stoughton, “Measurements of small-caliber ballistic shock waves in air,” Journal of the Acoustical Society of America, vol. 102, pp. 781–787, August 1997.
[4] J. E. Barger, S. D. Milligan, M. S. Brinn, and R. J. Mullen, “Systems and methods for determining shooter locations with weak muzzle detection,” Patent 7 710 828, May 2010. [Online]. Available: http://www.freepatentsonline.com/7710828.html
[5] T. Mäkinen and P. Pertilä, “Shooter localization and bullet trajectory, caliber, and speed estimation based on detected firing sounds,” Applied Acoustics, vol. 71, pp. 902–913, October 2010.
[6] I. L. Freire and J. A. Apolinário Jr., “Gunshot detection in noisy environments,” in Proc. International Telecommunications Symposium (ITS’2010), Manaus, Brazil, pp. 1–4, September 2010.
[7] A. Chacón-Rodríguez and P. Julian, “Evaluation of gunshot detection algorithms,” in Proc. Argentine School of Micro-Nanoelectronics, Technology and Applications (EAMTA 2008), Buenos Aires, Argentina, pp. 49–54, September 2008.
[8] C. H. Knapp and G. C. Carter, “The generalized correlation method for estimation of time delay,” IEEE Trans. Acoust. Speech Signal Process., vol. ASSP-24, no. 4, pp. 320–327, August 1976.
[9] J. Benesty, J. Chen, and Y. Huang, Microphone Array Signal Processing. Berlin Heidelberg, Germany: Springer-Verlag, 2008.
[10] D. Salvati, A. Roda, and G. L. F. S. Canazza, “A real-time system for multiple sources localization based on ISP comparison,” in Proc. 13th International Conference on Digital Audio Effects (DAFx-10), Graz, Austria, pp. DAFX1–DAFX8, September 2010.
[11] M. S. Brandstein and H. F. Silverman, “A robust method for speech signal time-delay estimation in reverberant rooms,” in Proc. IEEE International Conference on Audio, Speech, and Signal Processing (ICASSP’97), Munich, Germany, pp. 375–378, April 1997.
Izabela Lyon Freire was born in Belo Horizonte, Brazil, in 1980. She received her B.S. degree in computer science in 2003 from the Federal University of Minas Gerais, and her M.S. degree in neuroscience from the same university in 2010. From 2003 to 2008 she worked with the US-based company Novamente, which aimed at building an artificial general intelligence. She is currently a PhD student at the Military Institute of Engineering, in Rio de Janeiro, Brazil. Her research interests are signal processing and cognitive science.
José Antonio Apolinário Jr. (SM’04) was born in Taubaté, Brazil, in 1960. He graduated from the Military Academy of Agulhas Negras (AMAN), Resende, Brazil, in 1981 and received the B.Sc. degree from the Military Institute of Engineering (IME), Rio de Janeiro, Brazil, in 1988, the M.Sc. degree from the University of Brasília (UnB), Brasília, Brazil, in 1993, and the D.Sc. degree from the Federal University of Rio de Janeiro (COPPE/UFRJ), Rio de Janeiro, Brazil, in 1998, all in electrical engineering. He is currently an Adjoint Professor with the Department of Electrical Engineering, IME, where he has already served as the Head of Department and as the Vice-Rector for Study and Research. He was a Visiting Professor at the Escuela Politécnica del Ejército (ESPE), Quito, Ecuador, from 1999 to 2000 and a Visiting Researcher and twice a Visiting Professor at Helsinki University of Technology (HUT), Finland, in 1997, 2004 and 2006, respectively. His research interests comprise many aspects of linear and nonlinear digital signal processing, including adaptive filtering, speech, and array processing. Dr. Apolinário has organized and been the first Chair of the Rio de Janeiro Chapter of the IEEE Communications Society. He has recently edited the book “QRD-RLS Adaptive Filtering” (Springer, 2009) and served as the Finance Chair of IEEE ISCAS 2011 (Rio de Janeiro, May 2011).