Blind Separation of Convolutive Mixtures and an Application in Automatic Speech Recognition in Noisy Environment

F. Ehlers, H.G. Schuster

Abstract — In this paper we propose a two-step algorithm for the blind separation of convolutive mixtures. We show that its application to automatic speech recognition in a noisy environment yields good results.

Keywords — Blind separation of convolutive mixtures, speech recognition in noisy environment.

I. Introduction

Blind source separation is an emerging field of signal processing. Many recent publications focus on the topic of blind separation of convolutive mixtures. In the following we apply a new algorithm [1] to separate acoustic signals which have been recorded with microphones in a room whose furniture and walls produce echoes by reflection. Since we have to deal with echoes, we have to solve the source separation problem for a convolutive mixture described by:

    i1(t) = a1(t) + (h12 * a2)(t)
    i2(t) = a2(t) + (h21 * a1)(t).     (1)

Here a1(t) and a2(t) are the source signals, i1(t) and i2(t) are the measured signals, and h12 and h21 are the mixing impulse response functions ('*' denotes convolution). Algorithms that separate this kind of mixture have been published in [2-17]. Our algorithm [1] uses only second order statistics instead of higher order statistics [7-16].

Since, in addition to the two speaker sources, we also have background noise, we proceed in two steps. First we take advantage of the rigorous reduction of the source separation problem to an eigenvalue problem [1]. We assume that there are only two speakers without noise and determine the mixing coefficients in a first order approximation by solving (after an appropriate Fourier transformation, see below) the eigenvalue problem [1]. In a second step we take care of the noise. We set up a cost function whose minima in coefficient space correspond to decorrelated outputs and determine these minima by a Monte-Carlo procedure. Since we use the coefficients of our first step as a starting point, we are already close to the correct minima.

In the next sections we introduce an approach which was developed to achieve the best recognition rates with the SpeechMaster speech recognition system by the company 'aspect' (Norderstedt, Germany) in a test situation described below. The SpeechMaster system provides continuous speech recognition for control and data gathering applications with a freely definable vocabulary of several hundred words per application [18].

II. Algorithm
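For concreteness, the two-microphone convolutive mixing model of Eq. (1) can be simulated in a few lines. This is only an illustrative sketch: the white-noise "sources" and the short echo filters h12, h21 are invented values, not measurements from the paper's setup.

```python
import numpy as np

def convolutive_mix(a1, a2, h12, h21):
    """Two-channel convolutive mixture as in Eq. (1): each microphone
    records one source directly plus an echo-filtered version of the
    other source."""
    i1 = a1 + np.convolve(h12, a2)[:len(a1)]
    i2 = a2 + np.convolve(h21, a1)[:len(a2)]
    return i1, i2

# Invented example signals and echo paths (illustration only).
rng = np.random.default_rng(0)
a1 = rng.standard_normal(8000)      # stand-in for 1 s of speech at 8 kHz
a2 = rng.standard_normal(8000)      # stand-in for the second source
h12 = np.array([0.0, 0.5, 0.25])    # strictly causal echo path: source 2 -> mic 1
h21 = np.array([0.0, 0.4, 0.2])     # strictly causal echo path: source 1 -> mic 2
i1, i2 = convolutive_mix(a1, a2, h12, h21)
```

Note that the leading zero in h12 and h21 makes the filters strictly causal, which is the assumption the paper later uses to avoid the permutation degeneracy.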

To achieve separation of mixtures of acoustic signals the algorithm must be able to calculate the transfer functions. Our approach is an extension of the formalism published by Molgedey and Schuster in [1].

A. Step I: Calculating Transfer Functions

As in other publications, the algorithm is applied to discrete signals [2-17][20-21]. After a time varying discrete Fourier transformation [13][19-21] of (1) we obtain

    I1(ω, T) = A1(ω, T) + H12(ω) A2(ω, T)
    I2(ω, T) = A2(ω, T) + H21(ω) A1(ω, T),     (2)

where T denotes the position of the window with width ΔT. The transfer functions H12 and H21 do not depend on T. The convolutive mixture is thus transformed into instantaneous complex mixtures [20-22]. In order to simplify the notation we drop the frequency index ω in the following. We introduce the correlation functions

    ⟨A_i(T) A_j(T+τ)⟩ = (1/M) Σ_{T=1}^{M} A_i(T) A_j(T+τ),     (3)

where M ΔT is the measuring time. For uncorrelated source signals, i.e.

    ⟨A_i(T) A_j(T+τ)⟩ = δ_ij F_i(τ),     (4)

we derive, in analogy to [1], the eigenvalue equation

    [M(0) M^{-1}(τ)] H = H [Λ(0) Λ^{-1}(τ)].     (5)

Here M_ij(τ) = ⟨I_i(T) I_j(T+τ)⟩ are the measurable correlation functions, Λ_ii(τ) = ⟨A_i(T) A_i(T+τ)⟩ are the unknown source strengths, and the matrix H is defined by

    H = [ 1    H12 ]
        [ H21   1  ].     (6)

[Footnote: The authors are with the Institut fuer Theoretische Physik, Universitaet Kiel, Olshausenstr. 40, 24098 Kiel, Germany. We thank the company 'aspect' for making available to us their SpeechMaster recognition system, the Deutsche Forschungsgemeinschaft and the Technologie-Stiftung Schleswig-Holstein for financial support, and the unknown referee for helpful comments.]

Note that [Λ(0) Λ^{-1}(τ)] is a diagonal matrix. The solution of (5) for each frequency index ω gives the transfer functions H12(ω) and H21(ω). Because of the experimental setup described below, only transfer functions of strictly causal FIR filters are allowed; therefore we avoid the degeneracy problem (generally the solution of the separation problem of N sources is only known up to N! permutations) encountered by Jutten et al. [8].

B. Step II: Calculating the Impulse Responses

The second step minimizes the cost function

    V = Σ_{τ=-L}^{L} ⟨u1(t) u2(t+τ)⟩².     (7)

The variables u_i(t) are the output signals

    u1(t) = i1(t) - (h12 * i2)(t)
    u2(t) = i2(t) - (h21 * i1)(t),     (8)

where h12 and h21 are strictly causal FIR filters of order L that should be adapted such that u1 depends only on a1 and u2 depends only on a2. This procedure generalizes the potential approach of [1] to convolutive mixtures. We remark that only second order statistics are used. The criterion of minimizing V has been applied successfully by several authors [2-6]. Since we work only with FIR filters, no higher order statistics are needed [2].

For the correct h12 and h21 this leads to

    u1(t) = a1(t) - Σ_{l,l'} h12(l) h21(l') a1(t - l - l')     (9)

and an analogous expression for u2(t). We remark that the post-filtering step [3],

    1 / (1 - H21(ω) H12(ω)),

has no significant effect on the recognition rates.

In order to minimize the potential V we used a Monte-Carlo algorithm starting from the coefficients found in our first step, h_initial = IFFT(H_StepI), where 'IFFT' denotes the inverse fast Fourier transformation. We chose a Monte-Carlo algorithm (instead of a gradient algorithm or the LMS (RLS) algorithms proposed in [2-17]) because optimization procedures of this kind have been shown to be practicable if many variables are involved [23]. In the application we chose strictly causal FIR filters with 80 coefficients for separation.

The algorithm (including step I and step II) works with the assumption of uncorrelated source signals, which is shown to be sufficient for our task of separating FIR filtered mixtures [2-6]. Our goal was not only to produce one good example of separation, but to obtain good recognition rates in many situations and to show that our algorithm is able to work with an amount of about 30000 samples. Moreover, its implementation on a Pentium PC (133 MHz) needs at most 4 seconds for this amount of data.

III. Description of our Application

The recognition system was trained and tested with strings of German spoken digits, including 'zwo', in a 20 m² room filled with chairs, desks and computers, which produce an unknown background noise. The speech signals were recorded by a desk-mounted microphone at a distance of about 1 m from the speaker with a sampling rate of 8 kHz in 8-bit A-law format.

The SpeechMaster system is suitable for use in this quiet environment because of an implemented noise reduction for constant background noise. We followed the 2-stage training algorithm provided by the SpeechMaster system:
1. Creation of speaker-independent initial patterns from isolated words.
2. Speaker-dependent training of whole word patterns from continuously spoken application sentences, each of which includes 5 or 6 spoken digits.
The trained system performed recognition of continuously spoken strings of digits with an error rate of about 6%.

Next, a noisy environment was built up in the same room using a radio, which was placed at several distances and angles relative to the microphone. As disturbing signals we chose any kind of music or speech (like news). Data containing about 50 minutes of these mixtures of spoken digits and disturbing signals was recorded. The whole amount of data is separated into 20 data files, each of which contains 30 sentences, which in turn consist of 5 or 6 spoken digits. The position of the radio is constant only for the duration of one sentence, so that the separation algorithm must be able to work with a low amount of about 30000 samples to perform the separation. The error rate increases depending on the power of the background signals and on the kind of signals: if speech was presented in the background, many insertion errors occurred, while background signals with almost constant frequency characteristics caused fewer errors. In a recorded test file of sentences the error rate was 52% (see below). One-channel speech recognition is not possible in this situation.

Therefore we inserted into this test environment a second microphone, which was placed at a distance of about 20 cm to the left of the first microphone. The signal recorded by this second microphone was sampled in the same way as the first one; that means synchronous stereo data was recorded and each channel was a convolutive mixture of the two sources: radio and speaker.

The described setup is comparable to the problem of separating audio signals in room acoustics considered by Yellin and Weinstein [11]. But due to the following reasons, the situation in which we have applied our algorithm is more realistic:
(i) Other sources (like the computers) produce a ground level of (acoustic) noise in the room.
(ii) The human speaker is moving while speaking, which changes the transfer functions.

IV. Results

After the second step the initial error rate of 52% decreases to 12%. The human listener hears an artifact of the background signal, which may be caused by the echoes that are not suppressed by the algorithm if they are delayed by more than 80 time steps. On the other hand, it is not known whether the minimization algorithm converges to the best possible values, and the errors mentioned above certainly play a non-negligible role. In a last step the artifact of the background signal can be eliminated in the pauses of the human speech signals. After this step the error rate further decreases to about 8%. Tables I and II demonstrate how the separation algorithm decreases the error rate of the automatic speech recognition system. Figures 1-4 show the time signals of the detected mixtures and the output of the source separation system together with the cross correlation functions. One can see in these figures that the output signals of the separation algorithm are decorrelated.

V. Conclusion

We have demonstrated an approach to speech recognition in a noisy environment which is able to achieve recognition rates higher than 90% in a real environment, with very little a priori information about the background signals, using only a small amount of data sampled at a low rate.

Acknowledgments

The authors would like to express their gratitude for the helpful comments of A. Noll and H.-H. Hamer.

References

[1] L. Molgedey and H.G. Schuster: Separation of a Mixture of Independent Signals Using Time Delayed Correlations, Physical Review Letters, Vol. 72, No. 23, pp. 3634-3637, 1994.
[2] E. Weinstein, M. Feder, A.V. Oppenheim: Multi-Channel Signal Separation by Decorrelation, IEEE Transactions on Speech and Audio Processing, Vol. 1, No. 4, pp. 405-413, October 1993.
[3] S. Van Gerven, D. Van Compernolle: Signal Separation by Symmetric Adaptive Decorrelation: Stability, Convergence, and Uniqueness, IEEE Transactions on Signal Processing, Vol. 43, No. 7, pp. 1602-1612, July 1995.
[4] U. Lindgren, H. Sahlin, H. Broman: Source Separation Using Second Order Statistics, EUSIPCO'96, pp. 699-702, 1996.
[5] U. Lindgren, T. Wigren, H. Broman: On Local Convergence of a Class of Blind Separation Algorithms, IEEE Transactions on Signal Processing, Vol. 43, No. 12, pp. 3054-3058, December 1995.
[6] D. Chan, P. Rayner, S. Godsill: Multi-Channel Signal Separation, ICASSP'96, pp. 649-652, 1996.
[7] D. Yellin, E. Weinstein: Criteria for Multichannel Signal Separation, IEEE Transactions on Signal Processing, Vol. 42, No. 8, pp. 2158-2168, August 1994.
[8] C. Jutten, L. Nguyen Thi, E. Dijkstra, E. Vittoz, J. Caelen: Blind Separation of Sources: an Algorithm for Separation of Convolutive Mixtures, International Signal Processing Workshop on High Order Statistics, Chamrousse (France), July 10-12th, 1991.
[9] L. Nguyen Thi, C. Jutten: Blind source separation for convolutive mixtures, Signal Processing 45, pp. 209-229, 1995.
[10] A. Mansour, C. Jutten, P. Loubaton: Subspace Method for Blind Separation of Sources in Convolutive Mixture, EUSIPCO'96, pp. 2081-2084, 1996.
[11] D. Yellin, E. Weinstein: Multichannel Signal Separation: Methods and Analysis, IEEE Transactions on Signal Processing, Vol. 44, No. 1, pp. 106-118, 1996.
[12] X. Ling, W. Tian, B. Liu: Blind Signal Separation for MA Mixture Model, ICASSP'95, pp. 3387-3390, 1995.
[13] W. Jun: Wideband Blind Identification and Separation of Independent Sources, EUSIPCO'96, pp. 2077-2080, 1996.
[14] K. Torkkola: Blind Separation of Convolved Sources Based on Information Maximization, IEEE Workshop on Neural Networks for Signal Processing, Kyoto, Japan, Sept. 4-6, 1996.
[15] R.H. Lambert: A New Method for Source Separation, ICASSP'95, pp. 2116-2119, 1995.
[16] S. Icart, R. Gautier: Blind Separation of Convolutive Mixtures Using Second and Fourth Order Moments, ICASSP'96, pp. 3018-3508, 1996.
[17] A. Souloumiac: Blind Source Detection and Separation Using Second Order Non-Stationarity, ICASSP'95, pp. 1912-1915, 1995.
[18] H. Bergmann, H.-H. Hamer, A. Noll, A. Paeseler, H. Tomaschewski: Modularization in Task-Specific Language Modelling, Eurospeech'95, 4th European Conference on Speech Communication and Technology, Proceedings: Volume 1, September 1995.
[19] A.V. Oppenheim, R. Schafer: Digital Signal Processing, Prentice-Hall, 1975.
[20] V. Capdevielle, C. Serviere, J.-L. Lacoume: Blind Separation of Wide-Band Sources: Application to Rotating Machine Signals, EUSIPCO'96, pp. 2085-2088, 1996.
[21] V. Capdevielle, C. Serviere, J.-L. Lacoume: Blind Separation of Wide-Band Sources in the Frequency Domain, ICASSP'95, pp. 2080-2083, 1995.
[22] E. Moreau: Criteria for Complex Source Separation, EUSIPCO'96, pp. 931-934, 1996.
[23] G. Dueck: New Optimization Heuristics, Journal of Computational Physics 104, pp. 86-92, 1993.

TABLE I
The errors of the automatic speech recognition system on a data set in which the sentences consist of 5 or 6 spoken digits, without (002) and with (002N) separation. Columns S and I contain the numbers of substitution and insertion errors respectively, while column D shows the number of deleted words.

              002       002N                     002       002N
  sentence  S I D     S I D      sentence      S I D     S I D
     0      2 0 0     1 0 0         15         2 3 0     1 1 0
     1      0 0 0     0 0 0         16         1 4 0     0 0 0
     2      2 0 2     1 0 0         17         3 3 0     0 0 0
     3      1 0 2     1 0 0         18         2 3 0     0 1 0
     4      1 0 4     1 0 0         19         2 3 0     0 2 0
     5      0 0 4     0 0 0         20         0 0 0     0 0 0
     6      3 0 0     0 0 0         21         1 2 0     0 0 0
     7      0 0 1     0 0 0         22         0 2 0     0 0 0
     8      1 1 0     0 0 1         23         1 0 0     0 0 0
     9      2 0 0     0 0 0         24         1 0 0     0 0 1
    10      2 4 0     0 0 1         25         2 0 0     1 0 0
    11      1 0 1     0 0 0         26         3 1 0     0 0 0
    12      2 3 0     0 0 0         27         2 1 0     1 0 0
    13      0 1 0     0 0 0         28         3 0 0     0 0 0
    14      3 4 0     1 0 0         29         1 1 0     0 0 0

Fig. 2. Output signals of the first example: Time dependence of the output signals s1, s2, that are calculated by the separation algorithm, together with their crosscorrelation function CCF_{s1,s2}(τ). Time was measured in units of 0.125 ms; s1 and s2 have values from -2048 to 2047. The amplitude of the CCF is calculated in the same units that are used in figure 1. One can see here that the maximum of the crosscorrelation function is drastically reduced in its value (from 1144 to -76), i.e. the input signals have indeed been decorrelated.

TABLE II
The word error rate of three test data sets (002, 003 and 005) without and with separation. Each set consisted of 30 sentences.

                   002      003      005
  no separation  52.2 %   42.6 %   26.1 %
  separation      8.3 %    7.7 %    7.8 %

Fig. 3. Second example of input signals: Time dependence of two detected signals s1, s2, that are mixtures of music and speech, together with their crosscorrelation function CCF_{s1,s2}(τ). Time was measured in units of 0.125 ms; s1 and s2 have values from -2048 to 2047. The amplitude of the CCF is calculated in arbitrary units.

Fig. 1. First example of input signals: Time dependence of two detected signals s1, s2, that are mixtures of two different speech signals, together with their crosscorrelation function CCF_{s1,s2}(τ). Time was measured in units of 0.125 ms; s1 and s2 have values from -2048 to 2047. The amplitude of the CCF is calculated in arbitrary units.

Fig. 4. Output signals of the second example: Time dependence of the output signals s1, s2, that are calculated by the separation algorithm, together with their crosscorrelation function CCF_{s1,s2}(τ). Time was measured in units of 0.125 ms; s1 and s2 have values from -2048 to 2047. The amplitude of the CCF is calculated in the same units that are used in figure 3. Again the maximum of the crosscorrelation function is drastically reduced in its value (from 2182 to 176).