Improved Histogram-Based Feature Compensation for Robust Speech Recognition and Unsupervised Speaker Adaptation

Yasunari Obuchi
Advanced Research Laboratory, Hitachi Ltd., Tokyo, Japan
[email protected]

Abstract

Feature compensation for noise-robust speech recognition becomes more effective if normalization of time-derivative parameters is taken into account. This paper describes an implementation of Delta-Cepstrum Normalization (DCN) that runs with only a minimal response delay. The proposed algorithm, referred to as Recursive DCN, provides word error rate improvements comparable to conventional DCN. Since DCN includes a procedure that adjusts the mismatch between the cepstrum part and the delta-cepstrum part, it works effectively even when only a small amount of data is available. We also investigate the possibility of applying DCN to unsupervised speaker adaptation. It is shown that DCN adaptation improves the recognition accuracy even without a reference transcription of the adaptation data. Finally, DCN adaptation is combined with Feature-space Maximum Likelihood Linear Regression (FMLLR). It shows promising results in the batch-mode experiments, although the improvement is rather small in the recursive mode.

1. Introduction

As speech recognition systems make inroads into real-world applications, it becomes more important to investigate various algorithms in terms of robustness and efficiency. The system must work in unpredictable conditions with limited resources. In particular, the computational power is very low in embedded systems such as PDAs and cellular phones, whereas the need for a voice user interface in these small devices is extremely high, and they tend to be used in noisy conditions. Recently we investigated a series of feature compensation algorithms and proposed Delta-Cepstrum Normalization (DCN) [1]. DCN is an extension of Cepstral Mean Normalization (CMN), Mean and Variance Normalization (MVN), and Histogram Equalization (HEQ). These algorithms are simple enough to implement in small devices, yet effective in noisy conditions. In these algorithms, an assumption is made about the distribution of the cepstral coefficients in an utterance, and irrelevant information is removed by normalization. This means that the compensation process is executed in batch mode, where feature compensation and decoding do not start until the user's utterance finishes.

Time is wasted during the utterance, and a response delay is introduced. Recursive CMN [2] was proposed to avoid this problem: the cepstral mean is estimated using the preceding data up to the current point, instead of the whole utterance. It realizes frame-synchronous decoding, and the delay is reduced. Implementations of Recursive CMN have some variations, such as Live CMN [3] and Real-time CMN [4], but they all follow the same principle. This principle can be applied to other algorithms, in which the cepstral variance or the cepstral probability density function is estimated recursively. In this paper, the concept of DCN is extended to avoid the delay by introducing recursive estimation of the parameters. The new algorithm is referred to as Recursive DCN. It is shown that Recursive DCN has a greater advantage when the amount of data is not sufficient for Histogram Equalization. In addition, the advantage of Recursive DCN in unsupervised speaker adaptation is also presented. DCN adaptation does not need the adaptation data to be transcribed, so one can avoid the time required to obtain a reference transcription by running the speech recognizer. Moreover, DCN can be combined with MLLR if the transcription can be obtained. The remainder of this paper is organized as follows. In the next section, the concept of DCN is explained in detail. Section 3 describes the recursive implementation of DCN. In Section 4, the proposed algorithm is studied in the case where some unsupervised adaptation data are available. Experimental results are shown in Section 5, and the last section gives conclusions and future work.
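The recursive-mean principle behind these variants can be illustrated with a short sketch. This is a generic illustration, not the specific implementation of [2], [3], or [4]: each frame is normalized with a mean computed only from the frames seen so far, optionally seeded with a prior mean so that the first few frames are not over-corrected.

```python
import numpy as np

def recursive_cmn(frames, prior_mean=None, prior_weight=100.0):
    """Frame-synchronous CMN using only the preceding data (illustrative sketch)."""
    running_sum = np.zeros(frames.shape[1])
    count = 0.0
    if prior_mean is not None:
        running_sum += prior_weight * prior_mean   # seed with a prior estimate
        count += prior_weight
    normalized = np.empty_like(frames)
    for t, x in enumerate(frames):
        running_sum += x
        count += 1.0
        normalized[t] = x - running_sum / count    # mean over frames up to t only
    return normalized
```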

2. Delta-Cepstrum Normalization

Delta-Cepstrum Normalization (DCN) extends Histogram Equalization (HEQ) [5] to the cepstral time-derivative domain. HEQ is defined so as to equalize the cumulative distribution function (CDF), the integral of the probability density function, of the test data to the CDF of the training data as follows:

$$\hat{c}_t = C_{\mathrm{train}}^{-1}\left(C_{\mathrm{test}}(c_t)\right) \qquad (1)$$

where $C_{\mathrm{train}}$ is the CDF estimated from the training data, $C_{\mathrm{test}}$ is the CDF of the test data, $c_t$ is an observed cepstral coefficient of the $t$-th frame, and $\hat{c}_t$ is the corresponding transformed cepstral coefficient.
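For concreteness, the mapping in eq. (1) can be realized with empirical CDFs and quantile functions. The following is a minimal illustrative sketch (the function and variable names are not from the paper); each cepstral dimension is equalized independently.

```python
import numpy as np

def histogram_equalize(test_feats, train_feats):
    """Apply eq. (1) per dimension: x -> C_train^{-1}(C_test(x)).

    test_feats, train_feats: arrays of shape (num_frames, num_dims).
    """
    num_frames = test_feats.shape[0]
    equalized = np.empty_like(test_feats)
    for d in range(test_feats.shape[1]):
        # Empirical CDF of the test data: rank of each value / number of frames.
        order = np.argsort(test_feats[:, d])
        cdf = np.empty(num_frames)
        cdf[order] = (np.arange(num_frames) + 0.5) / num_frames
        # Inverse training CDF realized as the training-data quantile function.
        equalized[:, d] = np.quantile(train_feats[:, d], cdf)
    return equalized
```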

Figure 1: Schematic diagram of DCN.

Figure 2: Schematic diagram of Recursive DCN. T denotes the interval of normalization.

In DCN, the HEQ transformation of eq. (1) is also applied to the delta-cepstrum, and the mismatch between the normalized cepstrum and the normalized delta-cepstrum is compensated by a procedure called Δ-adjustment. More precisely, Δ-adjustment is defined by the following equations:

$$d_t = \Delta\hat{c}_t - \frac{\hat{c}_{t+1} - \hat{c}_{t-1}}{2} \qquad (2)$$

$$\tilde{c}_t = \hat{c}_t + w\,\frac{d_{t-1} - d_{t+1}}{2} \qquad (3)$$

where $d_t$ is the mismatch at the $t$-th frame, $\tilde{c}_t$ is the adjusted cepstral coefficient, and $w$ is a weight parameter whose value was set to 1 empirically. The procedural flow of DCN is illustrated in Fig. 1.
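As an illustration, the Δ-adjustment step can be sketched as below. This follows the equations as reconstructed above (mismatch $d_t$, weight $w$); the exact formulation in [1] may differ, so the sketch should be read as one plausible realization rather than the authors' implementation.

```python
import numpy as np

def delta_adjustment(cep_eq, delta_eq, w=1.0):
    """Adjust the equalized cepstra so that deltas recomputed from them
    move toward the separately equalized delta-cepstra.

    cep_eq, delta_eq: arrays of shape (num_frames, num_dims) after HEQ.
    """
    # Eq. (2): mismatch between the equalized deltas and the deltas
    # recomputed from the equalized cepstra (simple two-frame difference).
    recomputed = np.zeros_like(cep_eq)
    recomputed[1:-1] = (cep_eq[2:] - cep_eq[:-2]) / 2.0
    d = delta_eq - recomputed

    # Eq. (3): shift each frame by the weighted difference of neighboring
    # mismatches, so the recomputed delta moves toward delta_eq.
    adjusted = cep_eq.copy()
    adjusted[1:-1] += w * (d[:-2] - d[2:]) / 2.0
    return adjusted
```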

3. Recursive DCN

In [1], DCN was implemented and evaluated in batch mode. In this case, decoding is suspended until the end of the user's utterance, which causes a long response delay even if DCN itself is very fast. One approach to avoiding this problem is to apply feature compensation recursively, using the preceding data only. In [2], this approach was adopted for CMN by updating the cepstral mean estimate at each frame. In DCN, however, re-calculating the CDFs frame by frame is not reasonable, because the execution time of DCN is shorter than a typical sentence length but longer than a typical frame length. Therefore, we apply feature compensation with a time interval that is comparable to an acceptable delay. For example, the input data from 0 (sec) to T (sec) are compensated at T (sec) using the data of this period, the data from T (sec) to 2T (sec) are compensated at 2T (sec) using the data of this period and the preceding part of the utterance, and so on. This procedure is illustrated in Fig. 2 (a). In this case, the delay would be the processing time for the last interval only. Considering that a short silence (typically 0.5 or 1 second) is necessary for endpoint detection, the delay would be acceptable in most systems. One can make the delay even shorter by changing the interval length over time. At the beginning, more data are needed to estimate the CDF, and a longer interval is preferred. After a certain part of the utterance has been uttered, the estimation of the CDF is reliable, and a short interval can be used without losing accuracy. However, we do not investigate this approach in this paper, because the best implementation depends heavily upon the application and the hardware specification.
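The interval-based scheme of Fig. 2 (a) can be summarized in a few lines of driver code. The sketch below assumes a batch normalization function dcn() is given (for example, HEQ of cepstra and delta-cepstra followed by Δ-adjustment); the names and the frame rate are illustrative only.

```python
import numpy as np

def recursive_dcn(frames, dcn, frame_rate=100, interval_sec=1.0):
    """Normalize `frames` in a recursive, interval-by-interval manner.

    Every `interval_sec`, all frames observed so far are re-normalized with
    the batch function `dcn`, and only the newest interval is emitted to
    the decoder, as in Fig. 2 (a).
    """
    step = max(1, int(interval_sec * frame_rate))
    outputs = []
    start = 0
    while start < len(frames):
        end = min(start + step, len(frames))
        normalized = dcn(frames[:end])         # use all data up to the current point
        outputs.append(normalized[start:end])  # emit only the newest interval
        start = end
    return np.concatenate(outputs, axis=0)
```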

Figure 3: Schematic diagram of DCN adaptation.

4. DCN adaptation

If we have adaptation data uttered by the same speaker in the same environment, recognition accuracy can be further improved. The adaptation data need not be transcribed, since we can use the output of the speech recognizer as the reference transcription. This scheme is called unsupervised adaptation, and it can be applied to either HEQ or DCN. However, the process of obtaining the reference transcription by running the speech recognizer is time-consuming. In contrast, there is a simpler adaptation technique using HEQ or DCN that does not need to decode the adaptation data. In [6], the histogram of the cepstral coefficients is built from the adaptation data, and a nonlinear transformation is defined so as to realize HEQ for the adaptation data. In the recognition step, the input feature vector is compensated simply by this nonlinear transformation. If we define another nonlinear transformation for the delta-cepstra and introduce Δ-adjustment, this technique can easily be extended to DCN. The extended version is called DCN adaptation, and its procedural flow is shown in Fig. 3. DCN adaptation has the advantage that normalization in the recognition step is realized only by table lookup of pre-defined nonlinear transformations.

It is also possible to combine DCN adaptation and Recursive DCN, even though it takes slightly longer than standard DCN adaptation. As shown in Fig. 2 (b), at each step of Recursive DCN, we can use both the input speech up to the current point and the adaptation data. All of these data are modified by DCN, and only the part corresponding to the input speech is extracted. Finally, as described in [6], histogram-based compensation and maximum likelihood linear regression (MLLR) are expected to work cooperatively. Using these two algorithms together would give more benefit if the time to decode the adaptation data and to calculate the MLLR parameters is not problematic.
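The table-lookup adaptation described above can be sketched as follows. Per-dimension transforms are pre-computed from the adaptation data by pairing its quantiles with the corresponding training-data quantiles; at recognition time, each incoming frame is mapped through the table by interpolation. The helper names and the number of table points are illustrative assumptions, and Δ-adjustment would still be applied to the two output streams as in Fig. 3.

```python
import numpy as np

def build_tables(adapt_cep, adapt_delta, train_cep, train_delta, n_points=256):
    """Pre-compute nonlinear transforms (one table per feature stream)."""
    q = (np.arange(n_points) + 0.5) / n_points
    tables = {}
    for name, adapt, train in [("cep", adapt_cep, train_cep),
                               ("delta", adapt_delta, train_delta)]:
        src = np.quantile(adapt, q, axis=0)   # breakpoints in the adaptation-data space
        dst = np.quantile(train, q, axis=0)   # targets in the training-data space
        tables[name] = (src, dst)
    return tables

def apply_table(feats, table):
    """Normalize frames by table lookup with linear interpolation."""
    src, dst = table
    out = np.empty_like(feats)
    for d in range(feats.shape[1]):
        out[:, d] = np.interp(feats[:, d], src[:, d], dst[:, d])
    return out
```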

Figure 4: Recognition results of recursive normalization (WER (%) versus interval length in seconds, for CMN, MVN, HEQ, and DCN).

Figure 5: Recognition results of standard adaptation (WER (%) versus number of adaptation sentences, for CMN, MVN, HEQ, and DCN).

5. Experimental results

5.1. Experimental setup

A series of recognition experiments were carried out to evaluate the proposed algorithm. The Sphinx-III decoder developed by CMU was used for decoding. Triphone HMMs with 2000 tied states (8 Gaussians per state) were trained using the 5000-word LDC Wall Street Journal database (WSJ0). Speech input was sampled at 11.025 kHz, and 13 MFCCs were computed every 10 ms. We recorded 330 utterances from eight speakers using the built-in microphone of a PDA (HP iPAQ Model 3630). Each speaker uttered 40 to 43 sentences chosen from the WSJ0 test set. The average length of one sentence is 6.58 seconds. In the speaker-independent (SI) experiment, all 330 utterances were used. In the speaker-adaptive (SA) experiment, the utterances of each speaker were divided into two sets. Adaptation sentences were chosen from one set, and the other set was used for evaluation. This was repeated with the roles of the two sets switched, so all 330 utterances were used for evaluation.

5.2. Recursive DCN

Figure 4 shows the word error rates (WERs) obtained when various normalization methods were applied in a recursive manner.


Figure 6: Recognition results of recursive adaptation (WER (%) versus number of adaptation sentences, for CMN, MVN, HEQ, and DCN).

In Fig. 4, "Full" means the batch mode, in which the whole utterance was used to normalize the utterance itself. As shown in the figure, MVN, HEQ, and DCN improve on the WER of CMN, in that order, at every interval length. Moreover, although MVN and HEQ provide only a small improvement with a short interval, DCN gives a significant improvement even at an interval of 1.0 (sec). This can be interpreted as meaning that the erroneous estimation of the CDF in HEQ is compensated by Δ-adjustment in DCN.

5.3. DCN adaptation

The proposed algorithm was also evaluated in unsupervised speaker-adaptive experiments, in comparison with CMN, MVN, and HEQ. First, we carried out experiments that do not require the adaptation data to be transcribed. Figure 5 shows the results obtained by the DCN adaptation described in Fig. 3, compared with CMN, MVN, and HEQ adaptation. Adaptation data were used to estimate the cepstral mean, variance, and CDF. Test data were normalized simply using these pre-estimated parameters. The results show the same tendency as Fig. 4, and WERs are reduced as more adaptation data become available. Next, recursive adaptation was introduced. The test utterance up to the current point was added to the adaptation data, and adaptation was performed with a fixed time interval. In this experiment, the interval was fixed at 1.0 second. The results are shown in Fig. 6. The WERs for zero adaptation sentences correspond to the WERs of Recursive CMN/MVN/HEQ/DCN without adaptation.

Figure 7: Recognition results of FMLLR with batch normalization (WER (%) versus number of adaptation sentences, for CMN-FMLLR, MVN-FMLLR, HEQ-FMLLR, and DCN-FMLLR).

Figure 8: Recognition results of FMLLR with recursive normalization (WER (%) versus number of adaptation sentences, for CMN-FMLLR, MVN-FMLLR, HEQ-FMLLR, and DCN-FMLLR).

In CMN and DCN, a one-second interval is enough to estimate the normalization parameters; therefore, the WERs are not reduced much by using more adaptation data. In contrast, the WERs of MVN and HEQ are reduced when one or more adaptation sentences can be used.


5.4. Combination with FMLLR

If we do not mind the time needed to decode the adaptation data, more elaborate adaptation techniques can be used. We combined our approach with Feature-space MLLR (FMLLR) [7]. FMLLR is a special version of MLLR, equivalent to one-cluster constrained MLLR, and can be implemented as a feature compensation algorithm. First, we combined FMLLR with DCN and the other normalization methods in batch mode. Normalization was applied to the adaptation data, and the output was recognized to produce a reference transcription. A regression matrix was estimated from these data and applied to the test data after normalization. The results are shown in Fig. 7. They show that these normalization methods work quite cooperatively with FMLLR, and that DCN-FMLLR provides the best performance. If we use five sentences for adaptation, the WER of DCN-FMLLR is 20.7%, a 17% relative improvement over CMN-FMLLR. Finally, the normalization was done in a recursive manner. The results are shown in Fig. 8. Unlike the previous experiments, the improvement of DCN-FMLLR is almost the same as that of HEQ-FMLLR. This is difficult to explain, but it may be attributed to the mismatch between the test data, which are normalized in recursive mode, and the adaptation data, which are normalized in batch mode.
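The batch-mode DCN-FMLLR pipeline described above can be outlined as follows. The decoder, the FMLLR estimator, and the batch DCN function are assumed to be given (the paper does not specify them in code); the sketch only shows the order of operations: normalize the adaptation data, decode it to obtain an unsupervised reference transcription, estimate a single feature-space transform, and apply it to the normalized test data.

```python
import numpy as np

def dcn_fmllr(adapt_feats, test_feats, dcn, decode, estimate_fmllr):
    """Batch-mode combination of DCN and FMLLR (pipeline sketch)."""
    adapt_norm = dcn(adapt_feats)
    hypothesis = decode(adapt_norm)                # unsupervised reference transcription
    A, b = estimate_fmllr(adapt_norm, hypothesis)  # one-cluster constrained MLLR transform
    test_norm = dcn(test_feats)
    return test_norm @ A.T + b                     # feature-space affine compensation
```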

6. Conclusions

In this paper, recursive feature normalization using Delta-Cepstrum Normalization (DCN) was introduced. If normalization is applied in batch mode, a delay is unavoidable even if the normalization algorithm itself is fast. However, the delay can be reduced by estimating the normalization parameters recursively. Experimental results showed that this framework works quite effectively with DCN.

DCN was also investigated in the unsupervised speaker adaptation framework. Adaptation is very simple because it does not need any transcription. In the recognition step, DCN adaptation is even faster than DCN and still effective with a small amount of adaptation data. Finally, DCN adaptation was combined with FMLLR. It worked quite effectively in batch mode, but the improvement was rather small in recursive mode. This can be attributed to the mismatch between the adaptation data and the test data, which were normalized differently. Combining these two algorithms effectively is left for future work.

7. References

[1] Y. Obuchi and R. M. Stern, "Normalization of time-derivative parameters using histogram equalization," Proc. EUROSPEECH, Geneva, Switzerland, 2003.
[2] K. Koumpis and S. K. Riis, "Adaptive transition bias for robust low complexity speech recognition," Proc. International Conference on Acoustics, Speech and Signal Processing, Salt Lake City, USA, 2001.
[3] http://cmusphinx.sourceforge.net/sphinx4/
[4] N. Kitaoka et al., "Speech recognition under noisy environments using spectral subtraction with smoothing of time direction and real-time cepstral mean normalization," Proc. International Workshop on Hands-Free Speech Communication, Kyoto, Japan, 2001.
[5] A. de la Torre et al., "Non-linear transformation of the feature space for robust speech recognition," Proc. International Conference on Acoustics, Speech and Signal Processing, Orlando, USA, 2002.
[6] S. Dharanipragada and M. Padmanabhan, "A nonlinear unsupervised adaptation technique for speech recognition," Proc. International Conference on Spoken Language Processing, Beijing, China, 2000.
[7] Y. Li et al., "Incremental on-line feature space MLLR adaptation for telephony speech recognition," Proc. International Conference on Spoken Language Processing, Denver, USA, 2002.
