A Real-Time 3D Sound Localization System with Miniature Microphone Array for Virtual Reality

Shengkui Zhao, Saima Ahmed, Yun Liang, and Kyle Rupnow
Advanced Digital Sciences Center (ADSC), Illinois at Singapore
Email: {shengkui.zhao, saima.a, eric.liang, k.rupnow}@adsc.com.sg

Deming Chen and Douglas L. Jones
Department of Electrical and Computer Engineering, University of Illinois at Urbana-Champaign
Email: [email protected], [email protected]
Abstract—This paper presents a real-time three-dimensional (3D) wideband sound localization system designed with a miniature XYZO microphone array. Unlike conventional microphone arrays for sound localization, which use only omnidirectional microphones, the presented array combines bidirectional (pressure gradient) and omnidirectional microphones. The array is therefore significantly smaller, and it is, to our knowledge, the world's smallest microphone array designed for 3D sound source localization in air. In this paper, we describe the 3D array configuration and perform array calibration. For 3D sound localization, we study the output model of the XYZO array, the widely known direction-of-arrival (DOA) estimation methods, and the direction search in 3D space. To achieve real-time processing at 1° search resolution, we accelerate the parallel computations on a GPU platform with CUDA programming, achieving a 130X speedup over a multi-threaded CPU implementation. The performance of the proposed system is studied under various reverberation times and signal-to-noise ratios. We also present a real-time 3D sound localization demo that shows good applicability to virtual reality.
I. INTRODUCTION

Real-time three-dimensional (3D) sound source localization has attracted great research interest for virtual reality, realistic teleimmersion, and human-computer interaction. For instance, in a virtual meeting a high-resolution camera is often required to steer toward the target speaker with both horizontal and vertical movements, and real-time localization of sound sources is one of the key techniques for directing the camera to provide a better viewing experience. Furthermore, human hearing is extremely sensitive to the perceived direction-of-arrival (DOA) of sounds, so accurate estimation of source DOAs for appropriate reconstruction of their virtual directions is also a critical need for immersive communications. However, conventional sound source localization systems are usually implemented with large, spatially separated 3D microphone arrays [1]-[4]. This makes it difficult both to set up the camera and the microphone array under size-constrained conditions and to achieve real-time processing at high resolution. Realistic teleimmersion will never be widely deployed if each user must have a large physical array of microphones. In this work, we construct a miniature microphone array which approximates the widely used underwater acoustic vector sensors (AVS) [5], as shown in Fig. 1. This 'zero-aperture' miniature microphone array has four precisely placed standard hearing-aid microphones that have useful directivity
Fig. 1. Configuration of the miniature XYZO array. The top microphone has an omnidirectional response and the bottom three microphones have bidirectional figure-eight responses. Each sensor is 6 mm (diameter) × 2.7 mm (height).
between 100 Hz and 20 kHz, covering most sound sources. Each microphone measures only a few millimeters across and is placed about a centimeter from the others. Unlike conventional microphone arrays that use only spaced omnidirectional microphones, the miniature array consists of four collocated microphones: three orthogonally mounted bidirectional pressure-gradient microphones, named X, Y, and Z, whose figure-eight patterns have their directions of maximum response oriented along the X, Y, and Z axes, and one omnidirectional acoustic pressure microphone, named O, which receives sounds from all directions with equal magnitude. The array size is therefore significantly reduced while good localization capability is retained. We refer to this miniature microphone array as the XYZO array in this paper. In contrast to time-delay-based sound source localization approaches, the amplitude differences produced by the pressure-gradient microphones are dominant in the XYZO array and are the main cues for sound source localization. Note that the array response depends on both the frequency and the DOA of sound sources. With knowledge of this dependence, obtained through experimental measurements, we can apply the popular Capon beamformer [6] or Multiple Signal Classification (MUSIC) [7] to estimate the DOA of sound sources. The capabilities of the XYZO array for multi-source 2D sound localization and signal enhancement have been experimentally studied in [8] and [9]. In this paper, we investigate the localization ability of the miniature XYZO microphone array for determining the
Fig. 2. Workflow of the 3D sound localization system: multichannel input → Hamming window & 512-point FFT → per-bin covariance matrix estimate → directional spectra estimate → peak DOA search → output.
direction of origin of a sound source in 3D space, in real time, under various noise and reverberation conditions. To the best of our knowledge, this is the first real-time 3D source localization system based on the XYZO array. The extremely small size of the array makes it advantageous for virtual reality applications.

II. THE PROPOSED 3D SOUND LOCALIZATION SYSTEM WITH THE XYZO ARRAY

A. Overall Workflow of the 3D Sound Localization Algorithm

The overall workflow of the 3D sound localization algorithm is shown in Fig. 2. It consists of applying a Hamming window and a fast Fourier transform (FFT) to each microphone output signal, estimating the covariance matrix and the narrowband directional spectrum for each frequency bin, and searching for the peak of the incoherently combined directional spectra to produce the DOA output. The processing steps are as follows (a code sketch of the complete pipeline is given at the end of Section II-C). The received four-channel microphone output signals are first segmented by a rectangular window of 15360 samples; each segment is termed a frame, and the four parallel frames from the four-channel XYZO array are used for a single DOA estimate. Next, each of the four frames is split into 30 blocks of 512 samples, and a 512-length Hamming window and a 512-point FFT are applied to every block to transform the signals from the time domain into the frequency domain. This yields 512 frequency bins per microphone and 30 complex values per frequency bin. To estimate the directional spectrum for each frequency, the covariance matrix is estimated over the 30 blocks and the narrowband spectrum is computed over the 3D spatial search grid. The narrowband spectra are then summed across all frequency bins, and the DOA output is given by the peak of the combined directional spectra. The DOA output is represented by the horizontal and vertical angles in 3D space ([0°, 360°) × [0°, 180°]).

B. Output Signal Model of the XYZO Array

Let s(n, θ) denote a signal generated by a broadband source at time index n with bearing vector θ = [θ, φ], where θ represents the azimuth angle and φ the elevation angle. The output vector x(n) of the XYZO array is given by

$$x(n) = h(\theta) \otimes s(n, \theta) + \sum_{\phi \neq \theta} r(n, \phi) \otimes s(n, \phi) + v(n) \qquad (1)$$

where h(θ) is the M × 1 impulse response vector of the array at direction θ, with M = 4 channels in the array; r(n, φ) ≜ h(φ) ⊗ g(n, φ) denotes the combined reverberation vector of the array impulse response h(φ) and the room impulse response g(n, φ); v(n) denotes the additive white Gaussian noise vector; and ⊗ denotes discrete-time convolution. For the broadband signal model, analysis is more convenient in a time-frequency representation. Applying a Hamming window and a short-time Fourier transform (STFT) to the output vector, we have

$$X(n, \omega) = e(\omega, \theta) S(n, \omega, \theta) + \sum_{\phi \neq \theta} r(n, \omega, \phi) S(n, \omega, \phi) + V(n, \omega) \qquad (2)$$

where X(n, ω), S(n, ω, θ), r(n, ω, φ), and V(n, ω) are the STFTs of x(n), s(n, θ), r(n, φ), and v(n), respectively, and e(ω, θ) is the M × 1 frequency-domain steering vector, a function of frequency and DOA, which is assumed available and is obtained from the array calibration described in Section II-D. The steering vector e(ω, θ) of the XYZO array contains both the amplitude differences due to the orthogonally placed gradient microphones and the phase delays due to the separated mounting of the microphones, and it is dominated by the amplitude differences. Pure time-delay-based sound localization approaches are therefore inefficient for the XYZO array, and we consider more reliable DOA estimation methods based on knowledge of the steering vectors.
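To make the block-wise STFT representation concrete, the following minimal NumPy sketch (our own illustration, not the authors' code; the name stft_blocks is hypothetical) computes, for one 15360-sample frame, the 30 Hamming-windowed 512-point spectra per channel that serve as the snapshots X(l, ω) averaged in Section II-C.

```python
import numpy as np

M, FRAME_LEN, NFFT = 4, 15360, 512      # channels, frame length, FFT size
N_BLOCKS = FRAME_LEN // NFFT            # 30 blocks per frame

def stft_blocks(frame):
    """frame: (M, FRAME_LEN) time samples -> (N_BLOCKS, M, NFFT) spectra."""
    win = np.hamming(NFFT)
    blocks = frame.reshape(M, N_BLOCKS, NFFT)   # contiguous 512-sample blocks
    return np.fft.fft(blocks * win, axis=-1).transpose(1, 0, 2)
```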
C. 3D DOA Estimation with the XYZO Array

In this section, we consider DOA estimation methods derived from the second-order statistics of the microphone array outputs. The two most popular methods applicable to the XYZO array are the Minimum Variance (MV) or Capon beamformer [6] and Multiple Signal Classification (MUSIC) [7]. We next investigate their performance for directional spatial spectrum estimation with the XYZO array. In general, second-order narrowband DOA estimation can be written as

$$\hat{\theta}_s(\omega) = \arg\max_{\theta \in \Phi} f(R_X(\omega, \theta)) \qquad (3)$$

where θ̂_s(ω) denotes the estimated DOA of the sound source at frequency bin ω, Φ ≜ [0°, 360°) × [0°, 180°] is the set of azimuth and elevation angles in 3D space, f(R_X(ω, θ)) denotes the function for second-order narrowband spatial spectrum estimation, and the spatial correlation matrix R_X(ω, θ) of the array output vector is approximated by

$$\hat{R}_X(\omega, \theta) = \sum_{l=n-N+1}^{n} X(l, \omega) X^H(l, \omega) \qquad (4)$$
where N is the number of time blocks averaged to obtain the short-time estimate and H denotes the Hermitian transpose. By eigen-analysis, the covariance matrix can be decomposed as

$$\hat{R}_X(\omega, \theta) \equiv \sum_{i=1}^{M} \lambda_i(\theta) q_i(\theta) q_i^H(\theta) \qquad (5)$$

where λ_i(θ) denotes the eigenvalues and q_i(θ) the eigenvectors. The Capon directional spatial spectrum is obtained by maximizing the total output power over all candidate directions and is given by

$$f_{\mathrm{Capon}}(\hat{R}_X(\omega, \theta)) = \left( \sum_{i=1}^{M} \lambda_i^{-1}(\theta) \left| q_i^H(\theta) e(\omega, \theta) \right|^2 \right)^{-1} \qquad (6)$$
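As a hedged illustration (not the paper's code), the Capon spectrum of Eq. (6) for one frequency bin can be computed directly from the eigendecomposition; here R_hat is the M × M estimate of Eq. (4) and e the calibrated steering vector:

```python
import numpy as np

def capon_spectrum(R_hat, e):
    """Eq. (6): inverse of the eigenvalue-weighted projection power."""
    lam, Q = np.linalg.eigh(R_hat)          # ascending eigenvalues, eigenvectors
    proj = np.abs(Q.conj().T @ e) ** 2      # |q_i^H e|^2 for every i
    return 1.0 / np.sum(proj / lam)         # assumes R_hat is full rank
```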
The MUSIC approach takes advantage of the orthogonality of the signal subspace and the noise subspace. Assuming that the eigenvalues are sorted as λ_1(θ) ≥ λ_2(θ) ≥ ⋯ ≥ λ_M(θ) ≥ 0, the MUSIC directional spatial spectrum is defined as

$$f_{\mathrm{MUSIC}}(\hat{R}_X(\omega, \theta)) = \left( \sum_{i=2}^{M} \left| q_i^H(\theta) e(\omega, \theta) \right|^2 \right)^{-1} \qquad (7)$$
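Likewise, a minimal single-source sketch of Eq. (7): with eigenvalues sorted, the noise subspace is spanned by the M − 1 eigenvectors with the smallest eigenvalues (illustrative code; note that NumPy's eigh returns ascending order, so the largest-eigenvalue vector is dropped from the end).

```python
import numpy as np

def music_spectrum(R_hat, e):
    """Eq. (7): inverse projection onto the noise subspace (one source)."""
    lam, Q = np.linalg.eigh(R_hat)          # ascending eigenvalue order
    Qn = Q[:, :-1]                          # drop the largest-eigenvalue vector
    return 1.0 / np.sum(np.abs(Qn.conj().T @ e) ** 2)
```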
Both approaches were originally proposed for narrowband signals. For broadband signals, each frequency bin is therefore processed independently and the estimates are combined incoherently across time and frequency bins. The final DOA estimate is given by

$$\hat{\theta}_s = \arg\max_{\theta \in \Phi} \sum_{\omega} f(\hat{R}_X(\omega, \theta)) \qquad (8)$$
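Putting Eqs. (4) and (8) together with the workflow of Section II-A gives the compact end-to-end sketch below. It assumes stft_blocks() and capon_spectrum() from the snippets above and a calibrated steering-vector table steer of shape (n_dirs, NFFT, M); the choice of positive-frequency bins and the direction-grid layout are our illustrative choices, not the paper's exact settings. The nested search over directions and bins is exactly the load that motivates the GPU implementation of Section II-E.

```python
import numpy as np

def estimate_doa(frame, steer, dirs):
    """frame: (M, FRAME_LEN); steer: (n_dirs, NFFT, M); dirs: (n_dirs, 2)."""
    X = stft_blocks(frame)                   # (N_BLOCKS, M, NFFT)
    score = np.zeros(len(dirs))
    for w in range(1, NFFT // 2):            # positive-frequency bins
        Xw = X[:, :, w]                      # 30 snapshots for this bin
        R = Xw.T @ Xw.conj()                 # Eq. (4): sum_l X(l,w) X(l,w)^H
        for d in range(len(dirs)):
            score[d] += capon_spectrum(R, steer[d, w])   # Eq. (8) sum over w
    return dirs[np.argmax(score)]            # (azimuth, elevation) of the peak
```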
D. 3D XYZO Array Calibration

As shown in Section II-C, knowledge of the frequency/spatial responses of the XYZO array is needed for DOA estimation. The beam pattern of each microphone deviates in its own way from the ideal beam pattern, and the XYZO array response also varies across frequency bins. In addition, because the microphones are not perfectly mounted, the gain and phase relationships among the four microphones are not exactly known. A calibration step is therefore needed to measure the time-domain response, and hence the steering vector, of the physical XYZO array. In our calibration setup, the impulse response of each microphone was measured by recording three continuous maximum-length sequence (MLS) signals played from a loudspeaker placed approximately 2 m from the XYZO array. The height of the loudspeaker was adjustable so that measurements could be taken at azimuth angles with 20° resolution and elevation angles with 30° resolution around the XYZO array. The time-domain impulse responses were extracted by taking the circular cross-correlation between each microphone output and the MLS signal. To remove multipath effects due to sound reflections, the measured impulse responses were windowed to 128 samples at the 44.1 kHz sampling rate. The steering vectors of the XYZO array were then obtained by applying a 512-point Fourier transform to the shortened impulse responses. To eliminate the influence of different sound source energy levels on localization, the frequency-domain responses of all microphones were normalized by that of the omnidirectional microphone.
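A hedged sketch of one calibration measurement follows (function and variable names are our own): the impulse response is recovered by circular cross-correlation with the MLS, truncated to 128 samples to cut reflections, transformed with a 512-point FFT, and finally normalized by the O microphone's response.

```python
import numpy as np

def impulse_response(mic_rec, mls):
    """Circular cross-correlation of one MLS period with the recording."""
    L = len(mls)
    spec = np.fft.fft(mic_rec[:L]) * np.conj(np.fft.fft(mls))
    return np.fft.ifft(spec).real             # time-domain impulse response

def steering_vectors(recs, mls, n_keep=128, nfft=512):
    """recs: (4, L) recordings of mics O, X, Y, Z -> (4, nfft) responses."""
    h = np.stack([impulse_response(r, mls)[:n_keep] for r in recs])
    E = np.fft.fft(h, nfft, axis=-1)           # frequency-domain responses
    return E / E[0]                            # normalize by the O microphone
```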
Fig. 3. Frequency-domain response (magnitude and phase of microphones O, X, Y, and Z) of the XYZO array at 1723 Hz for one round of azimuth angles at a fixed elevation angle: (a) before interpolation (measured positions), (b) after interpolation (interpolated positions).
The resolution of the measured steering vectors is usually insufficient to cover all spatial directions in 3D space. Therefore, an interpolation approach is required to refine the measured resolution of 20° in azimuth and 30° in elevation to smaller angles. We used a two-dimensional Fourier-series technique: the measured steering vectors for each microphone m were modeled as

$$e_m(\omega, \theta, \phi) = \sum_{p=-P}^{P} \sum_{q=-Q}^{Q} c_{p,q}(\omega) e^{-ip\theta} e^{-iq\phi} \qquad (9)$$

where the order of the Fourier series is P × Q. By inserting the values of the measured steering vectors, the coefficients c_{p,q}(ω) of the Fourier series were obtained as the least-squares solution of an over-determined system. After solving for c_{p,q}(ω), the expansion was evaluated at the desired higher resolution. Fig. 3 shows an interpolation result based on this two-dimensional Fourier-series fitting.
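Per frequency bin and microphone, the fit of Eq. (9) reduces to a linear least-squares problem. Below is an illustrative NumPy sketch (our own names; the orders P = 8, Q = 3 are example values chosen so that the 20°/30° measurement grid keeps the system over-determined, not the paper's stated orders).

```python
import numpy as np

def fit_fourier_series(az, el, e_meas, P=8, Q=3):
    """az, el (radians) and e_meas: measured steering values -> c[p, q]."""
    p, q = np.arange(-P, P + 1), np.arange(-Q, Q + 1)
    # one design-matrix column per basis term exp(-i p az) exp(-i q el)
    A = (np.exp(-1j * np.outer(az, p))[:, :, None]
         * np.exp(-1j * np.outer(el, q))[:, None, :]).reshape(len(az), -1)
    c, *_ = np.linalg.lstsq(A, e_meas, rcond=None)   # least-squares c_{p,q}
    return c.reshape(len(p), len(q))

def evaluate_series(c, az, el):
    """Resample the expansion of Eq. (9) on a finer (az, el) grid."""
    P, Q = (c.shape[0] - 1) // 2, (c.shape[1] - 1) // 2
    p, q = np.arange(-P, P + 1), np.arange(-Q, Q + 1)
    return np.exp(-1j * np.outer(az, p)) @ c @ np.exp(-1j * np.outer(q, el))
```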
E. Real-Time System Implementation with GPU Speedup

The complete 3D sound localization system consists of sound acquisition, voice activity detection (VAD), 3D DOA estimation, and result display. We use a commercial M-Audio sound card connected to the XYZO array for sound acquisition at a 44.1 kHz sampling rate, with a custom four-channel preamp to increase the microphone gain. A periodicity-plus-energy based VAD detects human voice, and the DOA results are shown on a computer monitor. Meeting the high-resolution requirements of sound localization in 3D space demands a very large number of direction searches and hence a very high computational load, which makes real-time implementation of these high-performance algorithms very challenging. Our experiments showed that a CPU implementation of the 3D localization system falls short of the real-time requirement: it runs about 20X slower than real time at 1° resolution for both azimuth and elevation angles. By exploiting the highly parallel computational structure of the localization system, we realized a real-time implementation on a GPU. Using an NVIDIA GTX480 GPU and CUDA programming, we parallelized the computation by assigning each 3D search direction and each time-frequency bin to different independent thread blocks. Experiments indicate that the GPU implementation achieves 501X and 130X speedups over single-threaded and multi-threaded CPU implementations, respectively.
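We cannot reproduce the CUDA kernels here, but the parallel structure, in which every (direction, frequency-bin) pair is independent, can be conveyed with a batched GPU sketch using CuPy (our own substitution, not the authors' implementation). It uses the direct Capon form 1/(e^H R^{-1} e), which is algebraically equivalent to the eigen-form of Eq. (6):

```python
import cupy as cp

def capon_grid(R, steer):
    """R: (NFFT, M, M) per-bin covariances; steer: (n_dirs, NFFT, M).
    Returns the broadband score of Eq. (8) for every search direction."""
    Rinv = cp.linalg.inv(R)                      # batched matrix inverses
    # denom[d, w] = e^H R_w^{-1} e, computed for all pairs in parallel
    denom = cp.einsum('dwi,wij,dwj->dw', steer.conj(), Rinv, steer).real
    return (1.0 / denom).sum(axis=1)             # sum over frequency bins
```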
Fig. 4. Mean values of the DOA estimations (azimuth vs. elevation, in degrees) of the 3D sound sources with the XYZO array using the Capon and MUSIC approaches under all the tested conditions. The values in brackets are the azimuth and elevation angles of the tested sound sources.
Fig. 6. A real-life demo of the 3D localization system with the XYZO array.
Fig. 5. Mean standard deviations (degrees) of the azimuth and elevation DOA estimations of the 3D sound sources with the XYZO array using the Capon approach, for SNRs of 10-30 dB and RT60 values of 0-300 msec.
III. PERFORMANCE EVALUATION AND REAL-TIME DEMO

We evaluate the performance of the proposed 3D sound localization system in both offline computer simulations and a real-time demo. For the offline scenario, various noise and reverberation levels were tested. To obtain different reverberation times, clean speech signals were convolved with measured impulse responses of different lengths, corresponding to reverberation times (RT60) of 0 msec, 100 msec, 200 msec, and 300 msec. To obtain different noise levels, additive white Gaussian noise was added to the resulting signals at signal-to-noise ratios (SNRs) of 10 dB, 20 dB, and 30 dB. For all conditions, six sound source directions were randomly selected from the 3D space. Over 20 testing realizations, the mean localization results and the mean standard deviations over the tested directions are illustrated in Fig. 4 and Fig. 5. Fig. 4 shows that, for both the Capon and MUSIC approaches, the estimated mean azimuth and elevation angles lie within a small range of the target positions under all tested conditions. Fig. 5 shows that the mean standard deviations of the Capon approach are less than 1° for RT60 = 0 msec and remain below 6° even when the SNR decreases to 10 dB and RT60 increases to 300 msec. Fig. 5 also shows that the tested DOA method offers considerable robustness to additive noise but tends to degrade in reverberant conditions. Overall, the results indicate that the localization accuracy of the XYZO array is within 6° in a relatively noisy and reverberant meeting room.
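The simulated conditions above can be recreated along the following lines (an illustrative sketch with our own names, not the original evaluation script): clean speech is convolved with a truncated multichannel impulse response, and white Gaussian noise is scaled to the target SNR.

```python
import numpy as np

def make_test_signal(speech, h_multi, snr_db, seed=0):
    """speech: (T,); h_multi: (M, L) measured responses -> noisy (M, T+L-1)."""
    x = np.stack([np.convolve(speech, h) for h in h_multi])
    noise = np.random.default_rng(seed).standard_normal(x.shape)
    # scale the noise so that 10*log10(var(x)/var(noise)) equals snr_db
    scale = np.sqrt(x.var() / (noise.var() * 10 ** (snr_db / 10)))
    return x + scale * noise
```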
Next, we show a real-time 3D sound localization demo using the proposed system, implemented with the Capon approach on a GPU platform with CUDA programming. The tested speaker talked to the array, and the system showed the speaker's direction on the monitor in real time. Fig. 6 illustrates the testing scenario and the indication of the speaker's direction. The monitor shows the direction of the speaker relative to the XYZO array, which is represented as a head. The upper red arrow indicates that the speaker is behind the XYZO array, which is facing the camera; the lower red arrow indicates that the speaker's voice is coming from below the level of the XYZO array. The other dots show the speaker's recent trajectory. For the various speakers tested, the localization performance matched the simulated results well. As future work, we are integrating the 3D sound localization system into a telepresence task for virtual reality.

IV. CONCLUSION

In this paper, we demonstrated for the first time a real-time 3D sound localization system using a miniature XYZO array of only four microphones. The XYZO array has a significantly smaller size than conventional arrays and achieves 1° search resolution. We discussed the 3D array calibration and the real-time implementation on GPUs. The simulated system evaluation showed performance that is robust to additive noise but degraded by reverberation, and our real-time demo showed satisfactory localization accuracy in meeting rooms.

REFERENCES

[1] S. Basu, B. Clarkson, and A. Pentland, "Smart headphones: Enhancing auditory awareness through robust speech detection and source localization," in Proc. IEEE ICASSP, vol. 5, pp. 3361-3364, Salt Lake City, UT, 2001.
[2] H. Wang and P. Chu, "Voice source localization for automatic camera pointing system in videoconferencing," in Proc. IEEE ICASSP, New Paltz, NJ, Oct. 1997.
[3] R. Cutler, Y. Rui, A. Gupta, J. Cadiz, et al., "Distributed meetings: A meeting capture and broadcasting system," in Proc. ACM Conf. Multimedia, 2002.
[4] J. H. DiBiase, "A high-accuracy, low-latency technique for talker localization in reverberant environments using microphone arrays," Ph.D. dissertation, Brown Univ., Providence, RI, 2001.
[5] M. Hawkes and A. Nehorai, "Hull-mounted acoustic vector-sensor processing," in Proc. 29th Asilomar Conf. Signals, Systems and Computers, pp. 1046-1050, Nov. 1996.
[6] J. Capon, "High-resolution frequency-wavenumber spectrum analysis," Proc. IEEE, vol. 57, no. 8, pp. 1408-1419, 1969.
[7] R. O. Schmidt, "Multiple emitter location and signal parameter estimation," IEEE Trans. Antennas Propagat., vol. AP-34, no. 3, pp. 276-280, Mar. 1986.
[8] M. E. Lockwood and D. L. Jones, "Beamformer performance with acoustic vector sensors in air," J. Acoust. Soc. Am., vol. 119, no. 1, pp. 608-619, Jan. 2006.
[9] S. Mohan, M. E. Lockwood, M. L. Kramer, and D. L. Jones, "Localization of multiple acoustic sources with small arrays using a coherence test," J. Acoust. Soc. Am., vol. 123, no. 4, pp. 2136-2147, Apr. 2008.