Acoustic Source Localization Using Microphone Arrays Via CNN Algorithms

Zoltán Fodróczi, András Radványi¹ and György Takács²

¹ Analogic and Neural Computing Systems Laboratory, Computer and Automation Institute, Hungarian Academy of Sciences, P.O.B. 63, H-1502 Budapest, Hungary. E-mail: [fodroczi,radvanyi]@sztaki.hu
² Information Technology Department, Pázmány Péter Catholic University, Piarista köz 1, H-1052 Budapest, Hungary. E-mail: [email protected]

Abstract − CNN algorithms have been proved to be efficient in several computationally intensive applications due to the parallel computing architecture of the CNN-UM. In this paper, a new method is presented for acoustic source localization using several microphones. We present an experimental result showing how a CNN Universal Machine chip can be applied to determine the delay of arrival. The algorithm has been tested on a 64x64 CNN-UM chip.

1 INTRODUCTION

Location of a sound source is important in many industrial, scientific, educational and security applications. For example, steering a videoconferencing camera, hands-free, to the current speaker can benefit from location information. Beamforming techniques using microphone arrays can eliminate acoustic echoes and reverberation, which plague speech applications in closed environments [1]. These techniques require the estimation of the speaker's location and its automatic tracking [2]. In the last two decades, several methods have been proposed for solving acoustic source localization problems. Most of these are computationally intensive and difficult to realize in real time [2], [3]. There are some solutions which try to exploit the massive parallelism of neural networks [4], but all of these are difficult to implement due to the abstract neural network models. In this paper, we show a system which utilizes some advantages of Cellular Neural Networks (CNN) [5] and the CNN paradigm [6] in a sound source localization application, by applying the tera-operations-per-second equivalent computing power of a CNN Universal Machine [7] (CNN-UM) chip implementation (ACE4k [8]). The source localization problem can be divided into two parts. The first is to determine the delay of arrival (DOA) of the sound wavefront between a given microphone pair, while the second is to localize the sound source in a predefined area using the delay estimates. Our method can be used to determine the DOA, since the second problem can be solved by a linear system of equations. The algorithm was designed for special circumstances (e.g. a security surveillance system), where the image processing capabilities of the CNN-UM have to be extended with multi-modal sensing, and an acoustic event can be used to direct the attention of a security camera. The presented solution can run on the existing 64x64 CNN-UM chip [8]. This work is organized as follows. Section 2 describes the experimental system. In Section 3, the time delay estimation technique is presented. In Section 4, the conclusions of our work are presented.

2 EXPERIMENTAL SYSTEM

The experimental system has three major parts (see Figure 1). The first part is for capturing the audio signal. It consists of three microphones with preamplifiers and a data acquisition (DAQ) card plugged into a PC. Using three microphones facilitates source localization in the plane of the microphones.

Figure 1. The experimental system. The sound wave originating from the source is received by the microphones with different time delays. [The figure shows the sound source and the monitored region, microphones 1-3 with arrival times t1, t2 and t3, the DAQ card and PC, and the CNN-UM attached via the PCI bus.]

The second part of the system is the Aladdin Pro [9], a platform-independent standard application interface for running analogic algorithms on CNN-UM chips. The third part is the so-called user interface, written in C++. In addition to providing an interface to the users, it also provides the connection between the sensory part and the Aladdin Pro, and carries out the necessary signal conditioning.


3 TIME DELAY ESTIMATION WITH CNN

In time delay estimation calculations it is assumed that the signals are similar apart from scaling factors and a time shift. To some extent, the similarity may be spoiled in a reverberating environment, resulting in a degradation of accuracy. Therefore, in such circumstances the proper selection of the microphones is of crucial importance. The time delay estimate between two microphones (M1 and M2) is (using the notation of Figure 1):

t_{1-2} = t_1 - t_2    (1)

where t1 and t2 are the times required for the sound wave to reach M1 and M2. The traditional method for determining the time shift is based on computing the cross-correlation [10], [11] of the sample sequences x1[n] and x2[n] received by M1 and M2:

y[k] = \sum_{m=k}^{k+N/3-1} x_1[m] \cdot x_2[m + N/3 - k]    (2)

where N is the length of the sequence on which the correlation is taken and 0 ≤ k ≤ 2N/3 is the domain of the correlation. Let y[kmax] = max{ y[k] | 0 ≤ k ≤ 2N/3 } be the maximum value of the cross-correlation. At displacement kmax the signals match best, and the delay estimate can be computed as:

t_{1-2} = (N/3 - k_{max}) / f_s    (3)

where fs is the sampling frequency. Based on its image processing capabilities, we used the CNN-UM to determine the correlation of the signals. At first, characteristic images representing the sound signals are created, which can be used in the cross-correlation calculation. By normalizing the signals to a given maximum amplitude, we can eliminate the differences between the signals caused by the different positions, types and preamplifier sensitivities of the microphones. Using these normalized values, a well-characterizing sampled signal diagram can be created for each microphone signal.

This diagram is mapped into an NxM black and white image (Ik, k = 1..3), in which each column represents the discrete-valued level of the signal at a given sampled time instance (see Figure 2). Our aim is to find the relative displacement of two images that results in their best matching. The measure of matching is the number of differing pixels, calculated using a pixel-wise XOR operation followed by a counting of black pixels. Both the XOR and the counting are single operations provided by the recent CNN chips. The smaller this value, the better the two signals match.

Figure 2. Image created from the sampled signal diagram. The black pixel columns represent the sequence of samples; N = 576, M = 64.
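As a point of reference, the following is a minimal NumPy sketch of the traditional cross-correlation estimate of equations (2) and (3); the function name, the synthetic test signal and the exact windowing convention are illustrative assumptions, not part of the paper, which replaces this computation with the image-based CNN-UM procedure described next.

```python
import numpy as np

def tde_crosscorr(x1, x2, fs):
    """Cross-correlation delay estimate following equations (2)-(3).

    A fixed N/3-sample section from the middle of x2 is compared against a
    sliding N/3-sample window of x1 over the 2N/3 displacement domain.
    """
    N = len(x1)
    w = N // 3                        # correlation window length (192 for N = 576)
    ref = x2[w:2 * w]                 # fixed middle section of x2
    y = np.array([np.dot(x1[k:k + w], ref) for k in range(2 * w + 1)])
    k_max = int(np.argmax(y))         # best-matching displacement
    return (w - k_max) / fs           # delay estimate, equation (3)

# Illustrative usage with a synthetic 576-sample signal at 8 kHz.
if __name__ == "__main__":
    fs = 8000
    x1 = np.random.default_rng(0).standard_normal(576)
    x2 = np.roll(x1, 20)              # x2 lags x1 by 20 samples (2.5 ms)
    print(tde_crosscorr(x1, x2, fs))  # positive value: x2 lags x1
```

The sign of the returned delay depends on which signal is taken as the fixed reference; here a positive value means the second signal lags the first.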

The process of the determination of the best matching displacement, from which the time delay of the two signals can be calculated, is illustrated in Figure 3. One of the images (Ik in the figure) has to be shifted through the correlation domain, and the measure of matching over the marked area has to be calculated in each position. A single operation on the CNN grid can shift all points of the signal image. Taking the minimum measured value, the best matching displacement can be found in 2N/3 steps.

Figure 3. Process of the DOA calculation. The matching of the microphone signals is determined by applying a pixel-wise XOR operation between the appropriate parts of the column diagrams. [The figure shows the column diagrams Ik and Il, the W-wide area of the cover measurement, the XOR and SHIFT of Ik repeated 2N/3 times, the counting and storing of black pixels, the results of the covering at the different displacements, and the search for the minimum covering displacement.]

The flowchart of the Analogic Micro Code (AMC) algorithm performing the above process can be found in Figure 4. At first we have to load the signal images (Ik, Ij; k≠j) to the ACE4k chip. Since the images are larger than the chip size, we have to do the calculations in several consecutive phases, working on chip-sized sections of the images in each phase. By loading such a section of Ik to the first Local Logic Memory (LLM1) and then loading two adjacent sections of Ij to LLM2 and LLM3, the calculation of the matching can be carried out by applying an XOR operation between LLM1 and LLM2. Applying the SHIFT operation on LLM2 while moving the appropriate row from LLM3 performs the displacement. The calculations are made in a nested cycle, the kernel of which is a chip-sized long repetition of the above operations. The embedding frame also loads the appropriate portion of Ij to LLM3, and it runs N-W times. As a result we obtain a measure of the matching of the two signals over a range of displacements. To make our measure more robust against noise and the periodic nature of the signals, we have to repeat the calculations with further parts of Ik.
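The matching itself runs on the ACE4k; as an offline reference, a plain-Python simulation of the same column-image XOR measure might look as follows. The pixel coding rule, the function names and the choice of which image is windowed are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def to_column_image(x, M=64):
    """Map a signal normalized to [-1, 1] onto an M x N binary image.

    Each column carries black (1) pixels up to the quantized level of the
    corresponding sample, analogous to the column diagram of Figure 2.
    The exact pixel coding used on the chip is not specified here.
    """
    levels = np.clip(((x + 1) / 2 * (M - 1)).astype(int), 0, M - 1)
    img = np.zeros((M, len(x)), dtype=np.uint8)
    for n, lvl in enumerate(levels):
        img[:lvl + 1, n] = 1          # fill the column up to the signal level
    return img

def best_matching_displacement(Ik, Ij, W=192):
    """Simulate the XOR + black-pixel-count matching of Figure 3.

    A W-column window of Ik is compared against Ij at every displacement;
    the displacement with the fewest differing pixels is the best match.
    """
    N = Ij.shape[1]
    ref = Ik[:, :W]                   # one W-wide section of Ik
    counts = [int(np.logical_xor(ref, Ij[:, k:k + W]).sum())
              for k in range(N - W + 1)]
    return int(np.argmin(counts)), counts
```

Mirroring the AMC algorithm, the same measurement would be repeated with further W-wide sections of Ik and the per-displacement counts summed, which is what makes the measure robust against noise and the periodic nature of the signals.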

Figure 4. The AMC flowchart of the cover measuring. [The flowchart loads the signal pictures to the ACE4k, loads the appropriate sections of Ik to LLM1, the opposite section of Ij to LLM2 and the section of Ij where Ik will be shifted to LLM3; it then repeatedly XORs LLM1 and LLM2, counts and stores the black pixels, and moves the next row from LLM3 to LLM2; the inner loop runs 64 times, the middle loop 6 times and the outer loop 3 times, after which the results are sent back.]

In our experiments a 192-pixel-wide window of Ik (denoted as W in Figure 3) proved to be sufficient to determine the time differences between 576-sample-long signals. Simple electret microphones with a sampling frequency of 8 kHz were used to record the signals. The acoustic environment was a normal room with 30% diffuse surfaces. The signal-to-noise ratio in the experiments was 20 dB. The acoustic signal was normal male speech in Hungarian. Under these circumstances the best covering displacement could be clearly identified (see Figure 5). Using the best matching displacement (BMD), the DOA can be computed as:

t_{i-j} = (1 / f_s) \cdot BMD    (4)

The coordinates of the source location can be evaluated by solving a system of linear equations that can be derived from the geometry of the microphone positions in Figure 1:

2 x_0 X_{21} + 2 y_0 Y_{21} + X_{12}^2 + Y_{12}^2 = (-\Delta t_2^2 - 2 t_1 \Delta t_2) s_v^2
2 x_0 X_{31} + 2 y_0 Y_{31} + X_{13}^2 + Y_{13}^2 = (-\Delta t_3^2 - 2 t_1 \Delta t_3) s_v^2
2 x_0 X_{32} + 2 y_0 Y_{32} + X_{23}^2 + Y_{23}^2 = (\Delta t_2^2 - \Delta t_3^2 + 2 t_1 (\Delta t_2 - \Delta t_3)) s_v^2    (5)

where Xij = (xi - xj) and Yij = (yi - yj) are the differences of the coordinates of microphones i and j (i, j = 1..3), x0 and y0 are the coordinates of the source, sv is the speed of sound and Δti = t1 - ti (i = 2, 3). In theory the localization error is half of the sampling time multiplied by the speed of sound (~0.02 m with an 8 kHz sampling rate). One of the advantages of this method compared to other neural-network-based solutions (e.g. [4]) is that the localization error is equally distributed over the monitored region.
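A sketch of how the system (5) could be assembled and solved numerically is given below; the microphone coordinates, delay values and the speed-of-sound constant are placeholders, not values from the paper. Note that the left-hand sides of the three equations are pairwise differences and therefore not linearly independent, so a least-squares solver is used here instead of a direct matrix inverse.

```python
import numpy as np

def locate_source(mics, dt2, dt3, sv=343.0):
    """Assemble the three equations of (5) and solve for (x0, y0, t1).

    mics     : sequence of three (x_i, y_i) microphone coordinates, i = 1..3
    dt2, dt3 : delay estimates dt_i = t_1 - t_i (i = 2, 3), in seconds
    sv       : assumed speed of sound in m/s
    """
    (x1, y1), (x2, y2), (x3, y3) = mics
    X12, Y12 = x1 - x2, y1 - y2
    X13, Y13 = x1 - x3, y1 - y3
    X23, Y23 = x2 - x3, y2 - y3

    # Unknowns u = (x0, y0, t1); each equation of (5) is linear in u once the
    # t1-terms are moved to the left-hand side (X21 = -X12, X32 = -X23, ...).
    A = np.array([
        [-2 * X12, -2 * Y12,  2 * dt2 * sv**2],
        [-2 * X13, -2 * Y13,  2 * dt3 * sv**2],
        [-2 * X23, -2 * Y23, -2 * (dt2 - dt3) * sv**2],
    ])
    b = np.array([
        -dt2**2 * sv**2 - (X12**2 + Y12**2),
        -dt3**2 * sv**2 - (X13**2 + Y13**2),
        (dt2**2 - dt3**2) * sv**2 - (X23**2 + Y23**2),
    ])
    u, *_ = np.linalg.lstsq(A, b, rcond=None)  # least-squares solution
    x0, y0, t1 = u
    return x0, y0, t1
```

As a side note, the theoretical error bound quoted above checks out: half the sampling period times the speed of sound is 0.5 × (1/8000) s × 343 m/s ≈ 0.021 m, assuming a speed of sound of about 343 m/s.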

Figure 5. Result of the matching. The minimum marks the best matching displacement. [The plot shows the number of differing pixels versus the displacement, over the range -192 to 192 pixels.]

The running time of the algorithm can be computed from the following operation times:
- SHIFT operation: 5 µs
- computing XOR: 300 ns
- moving 1 row: 5 µs
- counting black pixels: 29 µs

According to the values above, the running time of the whole procedure in Figure 4 is 46 ms. This value is comparable to that of the usual cross-correlation algorithm with similar parameters running on a 1 GHz Pentium-class PC. Using the new 128x128 CNN-UM [12], the time consumption will be significantly reduced, to less than 10 ms. Owing to the parallel computing capabilities of the CNN-UM, the time consumption of the cross-correlation algorithm versus the sequence length stays below the usual quadratic dependence.
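As a rough consistency check on the 46 ms figure (assuming the kernel of Figure 4 performs one SHIFT, one row move, one XOR and one black-pixel count per displacement, and the 64 × 6 × 3 = 1152 kernel iterations implied by the flowchart loops): (5 + 5 + 0.3 + 29) µs × 1152 ≈ 45.3 ms, which is in line with the reported total once the image loading overhead is added.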

4 CONCLUSIONS

We have proposed a novel method to localize an arbitrary sound source in a given environment. The time consumption of the presented algorithm is comparable with that of the existing digital solutions, and it can be used efficiently in systems where a CNN-UM is present.

Acknowledgement

This project has been supported by the Hungarian National Research and Development Program: TeleSense NKFP 2001/02/035 and by the Hungarian National Research Fund: OTKA-TS40858. Special thanks to Ákos Zarándy for his valuable assistance.

References

[1] D. V. Rabinkin, "Digital Hardware and Control for a Beam-Forming Microphone Array", Master's Thesis, January 1994.
[2] M. Jian, A. C. Kot, and M. H. Er, "DOA estimation of speech sources with microphone arrays," in Proc. IEEE Int. Symp. Circuits Syst., vol. 5, June 1998, pp. 293-296.
[3] J.-H. Lee, Y.-M. C., and C.-C. Yeh, "A covariance approximation method for near-field direction-finding using a uniform linear array," IEEE Trans. Signal Processing, vol. 43, pp. 1293-1298, May 1995.
[4] G. Arslan and F. Ayhan Sakarya, "A Unified Neural-Network-Based Speaker Localization Technique," IEEE Trans. on Neural Networks, vol. 11, no. 4, July 2000.
[5] L. O. Chua and L. Yang, "Cellular Neural Networks: Theory," IEEE Trans. on Circuits and Systems (CAS), vol. 35, pp. 1257-1272, 1988.
[6] L. O. Chua and T. Roska, "The CNN paradigm," IEEE Trans. on Circuits and Systems I: Fundamental Theory and Applications (CAS-I), vol. 40, no. 3, pp. 147-156, 1993.
[7] T. Roska and L. O. Chua, "The CNN universal machine: an analogic array computer," IEEE Trans. on Circuits and Systems II: Analog and Digital Signal Processing, vol. 40, no. 3, pp. 163-173, 1993.
[8] S. Espejo, R. Dominguez-Castro, G. Linan, and A. Rodriguez-Vázquez, "A 64x64 CNN Universal Chip with Analog and Digital I/O," Proceedings of the 5th IEEE International Conference on Electronics, Circuits and Systems (ICECS'98), pp. 203-206, Lisboa, 1998.
[9] ALADDIN Pro system, Analogic Computers, 2002. http://www.analogic-computers.com
[10] D. V. Rabinkin, R. J. Renomeron, J. C. French, and J. L. Flanagan, "Estimation of Wavefront Arrival Delay Using the Cross-Power Spectrum Phase Technique," 132nd Meeting of the Acoustical Society of America, December 1996.
[11] D. V. Rabinkin, R. J. Renomeron, A. Dahl, J. C. French, J. L. Flanagan, and M. H. Bianchi, "A DSP Implementation of Source Location Using Microphone Arrays," Proceedings of the SPIE, vol. 2846, pp. 88-99, Denver, Colorado, August 1996.
[12] A. Zarandy, "Measurement results of mixed signal Visual Micro Processors," Proceedings of the European Conference on Circuit Theory and Design (ECCTD'03), 2003.
