CONTINUOUS SPEECH RECOGNITION IN A MULTI-SIMULTANEOUS-SPEAKER ENVIRONMENT USING DECORRELATION FILTERING IN THE FREQUENCY DOMAIN
Koutras A., Dermatas E. and Kokkinakis G.
Wire Communications Laboratory, Dept. of Electrical Engineering and Computer Technology, University of Patras, 26100 Patras, HELLAS.
Tel: +30 61 991 722, FAX: +30 61 991 855
e-mail:
[email protected]
ABSTRACT
In this paper it is shown experimentally that blind signal separation (BSS) methods in the frequency domain significantly improve the speaker signal-to-interference ratio (SIR) and the phoneme recognition score of a continuous-speech, speaker-independent acoustic decoder in a multi-simultaneous-speaker office room. Two BSS methods are used to perform speech separation by processing the convolutive mixture received by an omni-directional microphone together with the speech signals of multiple interfering speakers. In extensive experiments, the SIR resulting from the output decorrelation filtering method is increased by 119%, while the method based on the information maximization criterion achieves an improvement of 124%. Furthermore, phoneme recognition improvements of 127% and 159% are measured for the two BSS methods. From the computational complexity standpoint, the implemented frequency-domain BSS methods are proven to be faster than their time-domain analogues.
1. INTRODUCTION
State-of-the-art techniques for automatic speech recognition (ASR) are still vulnerable to interference. One of the most difficult problems encountered is interfering speech from competing speakers, or even worse, from moving speakers. In such an environment, robust speech recognition remains a challenging task. The "cocktail party effect" refers to the human ability to focus one's listening attention on a single talker among a din of conversations and background noise, and to recognize a specific voice. Recently [1-2], hands-free speech recognition systems simulating the "cocktail party effect" have been evaluated in a multi-simultaneous-speaker environment using real-life recordings. In particular, phoneme recognition experiments were carried out by processing, in the time domain, the convolved signal of an omni-directional microphone together with the signals obtained from the close-talking microphones of three simultaneous speakers. In this paper, two frequency-domain signal separation algorithms, based on the minimization of the mutual information (FDMI) and the minimization of the cross-correlation of the separated signals (FDODF), are implemented for the same application task. The algorithms are evaluated in an artificially created multi-simultaneous-speaker environment using a subset of the TIMIT speech corpus. It is shown that the SIR is increased from a mean value of -0.16 dB (measured at the omni-directional microphone output) to a mean value of 19.42 dB for the FDODF method and 20.26 dB for the FDMI method.
Moreover, phoneme recognition improvements of 127% and 159% over recognition of the unseparated signal in the multi-simultaneous-speaker environment are measured when the BSS algorithms are used as a front-end processor of a continuous speech recognition system. The structure of this paper is as follows: In the next section, an application-dependent speech separation method and the corresponding frequency-domain implementation are briefly described. Furthermore, we estimate the computational complexity reduction of the proposed frequency-domain methods in comparison to the same methods in the time domain. In section 3 we present the speech recognition system and the experimental conditions for solving the specific "cocktail party" problem. In section 4 we present the evaluation results of the proposed methods. Finally, in the last section, conclusions and proposals for further research are given.
2. SIGNAL SEPARATION IN THE FREQUENCY DOMAIN
The simplest speech separation problem in reverberant rooms is the case of two simultaneous speakers where the speech signal of one speaker is known (Two Input-Two Output network - TITO). The basic TITO separation networks for both the FDMI and the FDODF optimization criteria are shown in figure 1. Lower-case letters denote time-domain signals, while the corresponding capital letters denote their frequency-domain equivalents. Let x1(t) be the speech signal obtained from a speaker's head-mounted microphone (HM) and x2(t) the convolutive mixture of both speakers obtained by an omni-directional microphone (ODM). The network outputs u1(t) and u2(t) denote the speakers' separated signals. By assuming a linearly convolutive mixture between the speakers and the omni-directional microphone, the signal separation process can be approximated by an N-order linear processor with coefficients W. The basic feedforward TITO signal separation network can be described in the frequency domain by the following equations:

U1(f) = X1(f)
U2(f) = X2(f) ± W(f)·X1(f)    (1)

The vector W represents the coefficients of the linear separation filter, X1(f), X2(f) denote the Fourier transforms (FT) of the microphone signals, and U1(f), U2(f) denote the FTs of the separated signals. The addition applies to the FDMI algorithm and the subtraction to the FDODF algorithm.
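As an illustration, the feedforward network of eq. (1) can be sketched for a single signal block. This is a minimal sketch assuming numpy; the helper name, block handling and filter length are illustrative, not the authors' implementation:

```python
import numpy as np

def tito_separate(x1, x2, W, sign=-1):
    """One block of the feedforward TITO network of eq. (1).

    x1   : reference speaker block (head-mounted microphone)
    x2   : convolutive mixture block (omni-directional microphone)
    W    : frequency-domain separation filter (one weight per rfft bin)
    sign : -1 for FDODF (subtraction), +1 for FDMI (addition)
    """
    X1 = np.fft.rfft(x1)
    X2 = np.fft.rfft(x2)
    U1 = X1                      # U1(f) = X1(f)
    U2 = X2 + sign * W * X1      # U2(f) = X2(f) +/- W(f)*X1(f)
    return U1, U2
```

Note that the filter acts as a per-bin multiplication in the frequency domain, which corresponds to convolution with an FIR filter in the time domain.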
Figure 1: The basic two-input two-output signal separation network in the frequency domain for the case of FDODF (top) and FDMI (bottom).

2.1 Output Decorrelation Filtering in the Frequency Domain (FDODF)
In [2] it was shown that efficient time-domain ODF (TDODF) requires the estimation of long FIR filters for speech separation in a real room environment. As the length of the separation filters increases, the learning rate must be set close to zero in order to ensure stability of the adaptation process. This restriction minimizes the convergence speed, which is an obstacle for real-time BSS implementations. However, real-time convergence is required for integrating the separation network into speech recognition systems. In this paper it is shown that a significant reduction in both convergence time and computational complexity can be achieved by performing signal separation in the frequency domain. The FDODF rule can be derived from the Fourier transformation of the TDODF adaptation equation, which results in:

ΔW ∝ μ · U2(f) · X1^H(f)    (2)

where H denotes conjugate transposition and μ is the learning rate.

2.2 Information Maximization in the Frequency Domain (FDMI)
Bell et al. [3] showed that for the FDMI network architecture of figure 1, a learning algorithm for the filter coefficients can be derived in the time domain by maximizing the joint entropy of the network output, H(y(t)), where y^T(t) = (y1(t), y2(t)). The vector y(t) is defined to be a bounded monotonic nonlinear function (e.g. the sigmoid function) of the separated signal components. It was shown that maximizing the joint entropy of the network outputs leads to the maximum of E[ln(|J|)], where J is the Jacobian matrix of the network and |.| denotes the determinant of a square matrix. The natural logarithm of the Jacobian determinant can be simplified to the following equation:

ln(|J|) = ln(∂y1/∂u1) + ln(∂y2/∂u2)    (3)

The adaptation learning rules for the filter coefficients can be calculated using the stochastic gradient descent method. In this case, the derived learning rule in the time domain becomes [4]:

Δw ∝ ∂ln(|J|)/∂w = (∂y2/∂u2)^(-1) · ∂/∂u2(∂y2/∂u2) · u1^T    (4)

We may keep the same form of equation (4) for the separation filter by moving to the frequency-domain representation, where the elements of the above matrices are filter coefficients; the convolution in (4) is then replaced by multiplication. Lambert [5] showed that FIR polynomial matrix algebra can be used as an efficient tool to solve such problems easily in the frequency domain. From the above, the learning rule of equation (4) can be reformulated as follows:

ΔW ∝ μ · FT(ŷ2) · X1^H(f),  with  ŷ2 = (∂p(u2)/∂u2) / p(u2)  and  p(u2) = ∂y2/∂u2    (5)

We must note that in the above formula the nonlinear function operates in the time domain and the Fourier transformation is applied to the network output. From the two learning rules (eq. 2 and 5) it can be observed that they have the form of least mean squares (LMS) adaptive filters. A fast implementation of LMS adaptive filters in the frequency domain can be realized using the overlap-and-save block LMS technique, where two blocks are processed simultaneously and x1(t) is shifted by one block after each iteration:

X1(f) = FT[ x1((k−1)n) … x1(kn−1)  x1(kn) … x1(kn+n−1) ]    (6)
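One FDODF adaptation step combining eq. (2) with the overlap-save block scheme of eq. (6) can be sketched as follows. This is a minimal sketch assuming numpy; the function name, block size handling and initialization are illustrative assumptions, not the authors' code:

```python
import numpy as np

def fdodf_update(W, x1_prev, x1_cur, x2_cur, mu=0.001):
    """One overlap-save block iteration of the FDODF rule.

    Two length-n blocks of x1 are transformed together as in eq. (6),
    and the filter is adapted by dW ~ mu * U2(f) * conj(X1(f)) (eq. 2).
    """
    n = len(x1_cur)
    # 2n-point FFT over the previous and current block of x1 (eq. 6)
    X1 = np.fft.fft(np.concatenate([x1_prev, x1_cur]))
    X2 = np.fft.fft(np.concatenate([np.zeros(n), x2_cur]))
    U2 = X2 - W * X1                  # separated output; FDODF uses subtraction
    W = W + mu * U2 * np.conj(X1)     # LMS-style update of eq. (2)
    return W, U2
```

In a full implementation the output block u2(t) would be recovered by an inverse FFT, discarding the first n samples as usual in overlap-save filtering.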
2.3 Computational Complexity
The computational complexity of the frequency-domain adaptation rules (equations 2 and 5) is significantly lower than that of the time domain in terms of multiplication operations. Analytically, for the TDODF, the number of multiplications required for the re-estimation of M filter coefficients is proven to be 2M^2. The FDODF requires five frequency transformations per iteration [6]. Each N-point FFT/IFFT requires approximately N·log2(N) real multiplications, where N = 2M, so for the FDODF the transformations amount to 5N·log2(N) multiplications. In addition, an extra 8N multiplications are needed for the computation of the u2 output (fig. 1); hence the total number of multiplications performed in the FDODF is 10M·log2(M) + 26M. Therefore the computational complexity of this method is O(M·log(M)), compared to O(M^2) in the time domain; for M = 512, the FDODF is 8.83 times faster than the equivalent time-domain method. By the same reasoning, the total number of multiplications performed by the FDMI method is 10M·log2(M) + 28M, so its estimated computational complexity is also O(M·log2(M)), compared to O(M^2) for the time-domain implementation. For M = 512, the FDMI method runs slightly slower than the FDODF and is 8.67 times faster than the TDMI.
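The multiplication counts above can be verified with a few lines of arithmetic mirroring the formulas in this section:

```python
import math

M = 512
td = 2 * M * M                          # time domain: 2*M^2 multiplications
fdodf = 10 * M * math.log2(M) + 26 * M  # FDODF total per iteration
fdmi = 10 * M * math.log2(M) + 28 * M   # FDMI total per iteration

print(round(td / fdodf, 2))  # -> 8.83, the speed-up quoted in the text
print(td / fdmi)             # ~8.68, matching the ~8.67 figure quoted
```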
3. EXPERIMENTS
In this section we present the experiments that were conducted to demonstrate the efficiency of the implemented frequency-domain algorithms. In our experiments, speech recordings from a subset of the TIMIT database were used to simulate the "cocktail party effect". In particular, recordings from the test set of the TIMIT database were used to formulate a set of 180 sentences in a scenario where three speakers are talking simultaneously, and it is assumed that: (a) the transfer function of all acoustic paths between speakers and microphones can be approximated by linear FIR filters, (b) the speech signal mixture is linear, and (c) the speakers are randomly positioned in a room and are not moving. Each transfer function is modeled by a linear FIR filter with 512 taps (simulating room impulse responses recorded in a real room). The task is to measure the SIR and to perform speech recognition of the unknown speaker's voice by processing the mixture signal of the omni-directional microphone and the two signals recorded on the interfering speakers' microphones. All signals are sampled at 16 kHz.
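Under assumptions (a)-(c), the artificial mixture at the omni-directional microphone can be simulated along these lines. This is an illustrative sketch using numpy, not the actual TIMIT-based setup; the helper name is hypothetical:

```python
import numpy as np

def simulate_mixture(sources, rirs):
    """Linear convolutive mixing: each source is filtered by a fixed
    FIR room impulse response (e.g. 512 taps) and summed at the
    omni-directional microphone."""
    mix = np.zeros(len(sources[0]) + len(rirs[0]) - 1)
    for s, h in zip(sources, rirs):
        mix += np.convolve(s, h)
    return mix
```

With three sources and three 512-tap impulse responses, this yields the ODM signal x2(t) whose SIR is measured before and after separation.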
3.1 Signal Separation
For the separation of the speech signals in the artificial three-simultaneous-speaker environment, we used the cascade topology of the basic TITO separation network, as in [1-2].
Throughout the experiments with both methods, the learning rate was kept constant at 0.001 and separation was accomplished after only 20 iterations. Compared to the same methods in the time domain, which required about 400 iterations, the implemented frequency-domain methods converge faster and, as extensive listening tests of the separated signals showed, more accurately.

3.2 Speech recognition
The phone recognition experiments were carried out on a speaker-independent acoustic decoder based on Continuous Density Hidden Markov Models (CDHMM). Each phoneme-unit HMM was successively modeled as a three-, four- and five-state left-to-right CDHMM with no state skip. The output distribution probabilities were modeled by a single Gaussian component with a diagonal covariance matrix. Classification was achieved by selecting the phoneme model with the maximum forward probability of the observation sequence. In the training process, the segmental K-means algorithm was used to estimate each CDHMM's parameters from multiple observations. A total of 25000 manually labeled phonemes were used for training, while 5268 phonemes were used for testing. A set of 39 different phonemes was employed, created by unifying the 49 original phonemes of the TIMIT database based on their acoustic similarity and the work of Lee and Hon [7].

The speech signal frames were decomposed into critical rectangular bands (the first 20 bands from [8], page 142) using a 32 ms FFT computed every 5 ms. The feature vector consisted of the normalized log-energy of each critical band with respect to the total frame log-energy.

4. RESULTS
In this section we present the experimental results of the BSS methods, in particular the SIR improvement and the mean percent phoneme recognition rate for 3-, 4- and 5-hidden-state CDHMMs.

4.1 SIR improvement
In Figure 2 we present the improvement of the SIR for the FDODF and FDMI methods. It is clearly shown that the SIR improves drastically after the separation process and that the FDMI method gives slightly better results. The SIR of the mixed signal in the simulated multi-simultaneous-speaker environment ranged from values as low as -10 dB up to 10 dB, with a mean value of -0.17 dB. The FDODF achieved a mean improvement of 19.42 dB, while the FDMI method gave 20.26 dB. These results correspond to mean percent improvements of 118.65% and 123.74% for the FDODF and the FDMI, respectively. In both cases, audible tests showed that the separation is nearly perfect. Furthermore, as can be seen from the figure, the FDODF separates the speech signals better when the SIR of the ODM signal is high, while the FDMI method outperforms the FDODF when the SIR of the ODM reaches its minimum. In all cases, however, only minor differences are measured.

Figure 2: SIR improvement for the case of (A) FDODF and (B) FDMI. The bottom line in each plot shows the sorted SIR of the 180 mixed speech signals before separation, while the top line shows the SIR after applying the two BSS methods.
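For reference, the SIR figures reported here follow the usual power-ratio definition, which can be computed as below. The helper is hypothetical, not the authors' evaluation code; it assumes the target and residual-interference components of a signal are available separately, as they are in this artificial-mixing setup:

```python
import numpy as np

def sir_db(target, interference):
    """Signal-to-interference ratio in dB: the power of the target-speaker
    component over the power of the residual interference."""
    return 10 * np.log10(np.sum(target**2) / np.sum(interference**2))
```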
4.2 Phoneme Recognition
In figure 3 we show the phoneme recognition rates achieved in the continuous speech environment for the 3-, 4- and 5-hidden-state CDHMMs. The CDHMMs were trained on the HM speech signals from the TIMIT database, while the implemented phoneme recognition system was evaluated on the ODM signal and on the signals separated by the two methods. The 4-hidden-state models give the best absolute recognition results compared to the other two CDHMMs, achieving a recognition rate of 19.873% for the FDODF and 22.337% for the FDMI method. However, the relative improvement over the ODM signal is maximized for the 3-state HMM, reaching 126.69% for the FDODF and 158.05% for the FDMI (compared to 107.51% and 133.21% for the 4-state CDHMM, and 109.76% and 143.95% for the 5-state CDHMM).

Figure 3: Phoneme recognition rates for the 3-, 4- and 5-hidden-state CDHMMs.
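As a consistency check on the 4-state figures, the ODM baseline rate implied by the FDODF numbers (19.873% at a 107.51% relative improvement) should match the one implied by the FDMI numbers (22.337% at 133.21%). A short sketch of this back-computation:

```python
def implied_baseline(rate_after, pct_improvement):
    """Back out the ODM baseline recognition rate implied by a
    separated-signal rate and its relative improvement, using
    rate_after = baseline * (1 + pct_improvement / 100)."""
    return rate_after / (1 + pct_improvement / 100)

b_fdodf = implied_baseline(19.873, 107.51)
b_fdmi = implied_baseline(22.337, 133.21)
# both imply an ODM baseline of roughly 9.6%
```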
5. CONCLUSIONS
In this paper we tested two blind signal separation techniques and applied the separated signals to a phoneme recognition system. The learning rules of the separation techniques were derived from different perspectives, namely information maximization and output decorrelation in the frequency domain. The recognition rate of the phoneme recognition system increased substantially after separating the artificially mixed speech signals. This shows that the derived frequency-domain BSS methods are very promising in terms of speech recognition accuracy as well as computational complexity, as shown in section 2.3. However, one of the main tasks still to be examined is testing these methods on sources recorded in real room environments.
REFERENCES
[1] Koutras A., Dermatas E., Kokkinakis G.: "Speech recognition through an omni-directional microphone in a multi-simultaneous-speaker environment". 2nd Workshop on Speech and Computers, Cluj-Napoca, Romania, (1997), 99-104.
[2] Koutras A., Dermatas E., Kokkinakis G.: "Speech recognition in a real room and multi-simultaneous-speaker environment". TSD-98, Brno, Czech Republic, (1998), to be published.
[3] Bell A., Sejnowski T.: "An information maximization approach to blind separation and blind deconvolution". Neural Computation, 7(6), (1995), 1129-1159.
[4] Lee T., Ziehe A., Orglmeister R., Sejnowski T.: "Combining time-delay decorrelation and ICA: Towards solving the cocktail party problem". ICASSP-98, (1998), 1249-1252.
[5] Lambert R.: "Multichannel blind deconvolution: FIR matrix algebra and separation of multipath mixtures". PhD Thesis, University of Southern California, Dept. of Electrical Engineering, (1996).
[6] Haykin S.: "Adaptive Filter Theory", Prentice Hall, New Jersey, (1996).
[7] Lee K., Hon H.: "Speaker-independent phone recognition using Hidden Markov Models". IEEE Trans. on ASSP, 37(11), (1989), 1641-1648.
[8] Zwicker E., Fastl H.: "Psychoacoustics: Facts and Models", Springer-Verlag, Berlin Heidelberg, (1990).