A SEGMENT-BASED SPEAKER ADAPTATION NEURAL NETWORK APPLIED TO CONTINUOUS SPEECH RECOGNITION

Keiji Fukuzawa†, Yasuhiro Komori†, Hidefumi Sawai‡ and Masahide Sugiyama†
† ATR Interpreting Telephony Research Laboratories, Sanpeidani, Inuidani, Seika-cho, Soraku-gun, Kyoto 619-02, Japan
‡ Research and Development Center, Ricoh Co., Ltd., 16-1, Shin'ei-cho, Kohoku-ku, Yokohama 223, Japan

ABSTRACT

This paper describes a speaker adaptation technique using segment-based neural mapping, applied to continuous speech recognition. The adaptation neural network has a time-shifted sub-connection architecture that maintains the temporal structure of each acoustic segment and reduces the amount of speech data needed for training. The effectiveness of this network has previously been reported for phoneme recognition. In this paper, the speaker adaptation network is combined with a TDNN-LR continuous speech recognizer and evaluated in word and phrase recognition experiments with several speakers. In 500-word recognition experiments, the recognition rate with segment-based adaptation is 92.2%, 28.8% higher than the rate without adaptation. In 278-phrase recognition experiments, the recognition rate with segment-based adaptation is 57.4%, 27.7% higher than the rate without adaptation.
1. INTRODUCTION

Several frame-based speaker adaptation techniques have been proposed, for example, codebook mapping [1] and neural-network-based analog mapping [2]. Nevertheless, differences in voice individuality exist in both the spectral and temporal domains, and frame-based speaker adaptation techniques compensate only for speaker differences in the spectral domain. Segment-based speaker adaptation, on the other hand, compensates for both spectral and temporal differences. A segment-based speaker adaptation neural network has been proposed [3] as such a technique. For phoneme recognition, this network was shown to be effective, and more effective than a frame-based network. The adaptation network can easily be applied to a segment-based recognition system. In this paper, a TDNN (Time-Delay Neural Network)-LR [4] system is used as the word and phrase recognition system.
The performance of the TDNN-LR system has been evaluated in the speaker-dependent mode, but not in the cross-speaker mode. Since TDNN performance drops when it recognizes speech that differs from the training speech, the performance of TDNN-LR for new input speakers can be expected to drop as well. There are two approaches to this problem: speaker adaptation and multi-speaker training. In the first approach, an adaptation neural network can be trained with a small number of input-speaker training samples, but the network requires an adaptation procedure for each input speaker. In the second approach, the TDNN can recognize each input speaker's utterances without any adaptation procedure, but it requires a large number of multi-speaker training samples, a large number of weighting parameters and considerable training time [5]. A modular TDNN has been proposed as a multi-speaker training approach and evaluated for phoneme recognition [6]; however, it is not easy to build a speaker-independent TDNN-LR system using the modular TDNN. The adaptation network, in contrast, can easily be trained and easily combined with the TDNN-LR system. In the following section, segment-based speaker adaptation and the TDNN-LR speech recognition system used for evaluation are described. Next, the evaluations of the speaker adaptation neural network for word and phrase recognition are described.
2. SEGMENT-BASED SPEAKER ADAPTATION NEURAL NETWORK

A three-layer feed-forward neural network is applied to segment-to-segment analog mapping. The network has 27,272 connections (including 392 connections representing the bias values), and its architecture is shown in Fig.1. The input layer, hidden layer and output layer have 16 x 7, 56 x 5 and 16 x 7 units, respectively. The network has a time-shifted sub-connection architecture: a 3-frame x 1-frame x 3-
1-433  0-7803-0532-9/92 $3.00 © 1992 IEEE
Figure 1: Architecture of Speaker Adaptation Neural Network (input layer 16 x 7, hidden layer 56 x 5, output layer 16 x 7; 13,440 connections between input and hidden layers and 13,440 between hidden and output layers)

Figure 2: TDNN-LR Speech Recognition System with a Speaker Adaptation Neural Network (speech input, sequence of segments, adaptation network, sequence of mapped segments, recognition result)
frame type sub-connection, shifted by one frame, to maintain the temporal structure of each segment. This architecture reduces the number of weighting parameters; as a result, the network needs less speech data for training. The number of hidden units in each sub-connection is larger than the number of input and output units, to improve mapping performance. Pairs of input and teaching patterns, which are required to train the network, are generated from the same word utterances by a standard speaker and a new input speaker. Series of segments are produced from the beginning to the end of the training words. Each segment consists of 7 frames of mel-scaled 16-channel FFT outputs. The speech analysis conditions are shown in Table 1. In the evaluation for phoneme recognition, 15-frame segments were used; here, 7-frame segments are applied, since they show higher performance than 15-frame segments in a TDNN-LR phrase recognition system. Time alignment between the two series of segments is performed using the DTW (dynamic time warping) technique. The segments from the standard speaker are used as the teaching patterns for the corresponding segments from the input speaker. Training of the adaptation network was performed using 100 words selected
Table 1: Speech Analysis Conditions
    Sampling Frequency    12 kHz
    Window Length         21.3 ms
    Analysis Interval
from the 2,620 words which are also used to train the TDNN in the TDNN-LR system. The 100 words contain 611 phonemes. For this training, the BP (back-propagation) [7] algorithm is used.
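The 3-frame x 1-frame x 3-frame time-shifted sub-connection scheme can be sketched as a forward pass over one 7-frame segment. This is a minimal illustration, not the authors' code: the random weights and the sigmoid nonlinearity are assumptions (the paper does not specify its activation function), but the layer sizes reproduce the paper's 27,272 weighting parameters.

```python
import numpy as np

# Dimensions from the paper: 16 mel channels, 7-frame input/output segments,
# 5 hidden frames of 56 units, 3-frame sub-connection windows.
CH, IN_F, HID_F, HID_U, WIN = 16, 7, 5, 56, 3

rng = np.random.default_rng(0)
# Hidden frame j sees the 3-frame input window j..j+2 (48 coefficients).
W1 = rng.normal(0.0, 0.1, (HID_F, HID_U, WIN * CH))
b1 = np.zeros((HID_F, HID_U))
# Hidden frame j feeds the 3 output frames j..j+2.
W2 = rng.normal(0.0, 0.1, (HID_F, WIN * CH, HID_U))
b2 = np.zeros((IN_F, CH))
# Total parameters: 13,440 + 280 + 13,440 + 112 = 27,272, as in the paper.

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def map_segment(seg):
    """Map one 7-frame x 16-channel segment of the input speaker
    toward the standard speaker's space (forward pass only)."""
    assert seg.shape == (IN_F, CH)
    hidden = np.empty((HID_F, HID_U))
    for j in range(HID_F):                    # time-shifted input windows
        window = seg[j:j + WIN].ravel()       # frames j..j+2
        hidden[j] = sigmoid(W1[j] @ window + b1[j])
    out = np.zeros((IN_F, CH))
    for j in range(HID_F):                    # each hidden frame drives
        contrib = (W2[j] @ hidden[j]).reshape(WIN, CH)
        out[j:j + WIN] += contrib             # output frames j..j+2
    return out + b2

mapped = map_segment(rng.normal(size=(IN_F, CH)))
print(mapped.shape)  # (7, 16)
```

Because each sub-connection spans only three frames, the weight matrices stay small; this is what lets the network be trained from roughly a hundred words rather than the full TDNN training set.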
3. TDNN-LR SPEECH RECOGNITION SYSTEM

The TDNN-LR system with the adaptation neural network is shown in Fig.2. To recognize an input speaker's utterance, segments of the input speaker are fed into the input layer of the adaptation network and mapped to the standard speaker's segments. The sequence of mapped segments is then recognized by the TDNN-LR system. In phrase recognition, 278 Japanese phrases from a task called "The International Conference Sec-
Table 2: Comparison of Fuzzy and Conventional Training Methods for TDNN-LR Phrase Recognition
Table 3: Complexity of 278 Phrase Recognition Task
    Number of Words
    Phoneme Perplexity    5.91
retary Service" are used. The complexity of the task is shown in Table 3. The phoneme-scanning TDNN in the TDNN-LR system is trained using phoneme segment data extracted, according to hand labels, from 2,620 Japanese words uttered by a standard speaker. The training data contain 9,101 phoneme samples; the number of training samples for adaptation is thus about one-fifteenth of that used for TDNN training. For the TDNN training, a fuzzy training method [8] is applied, in which the target values are given as fuzzy class information. A comparison of the fuzzy and conventional training methods in phrase recognition is shown in Table 2. In the speaker-dependent mode, the averaged recognition rate using the fuzzy training method is 71.2%, 5.5% higher than that using the conventional training method. In the cross-speaker mode, the averaged recognition rate using the fuzzy training method is 29.7%, 11.1% higher than that using the conventional training method. These results show that the fuzzy training method improves phrase recognition performance in both modes.
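The idea of fuzzy class targets can be illustrated with a small sketch. The exact membership function is defined in [8]; the distance-based weighting below (and the distance values themselves) are purely hypothetical, chosen only to show how soft targets replace the one-hot targets of conventional training.

```python
import numpy as np

def fuzzy_targets(dists, sharpness=1.0):
    """Turn distances from a frame to each phoneme class into soft
    target values: a closer class gets a membership nearer 1.
    (Illustrative weighting only; not the paper's actual function.)"""
    w = np.exp(-sharpness * np.asarray(dists, dtype=float))
    return w / w.sum()   # memberships normalized to sum to 1

# A frame near an /a/-/o/ boundary: small distances to both classes.
d = [0.2, 0.3, 2.0, 2.5]        # hypothetical distances to /a/, /o/, /i/, /e/
t = fuzzy_targets(d)
print(t.round(3))               # soft targets instead of one-hot [1, 0, 0, 0]
```

Frames near class boundaries then contribute graded error signals to both competing phonemes, rather than forcing a hard decision during training.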
4. EVALUATION OF SPEAKER ADAPTATION NEURAL NETWORK

The evaluation of the speaker adaptation neural network for word and phrase recognition is performed with two standard male speakers and two input speakers (one male and one female). The word recognition vocabulary contains 500 words, and the number of recognition phrases is 278. The word and phrase recognition performance with the segment-based speaker adaptation neural network is shown in Table 4. In the word recognition experiment, the segment-based speaker adaptation neural network improves the word recognition rates to over 90%, even for female-to-male adaptation. The averaged word recognition rate with speaker adaptation is 92.2%; in contrast, the recognition rate without adaptation is only 63.4%. The adaptation network thus gives a 28.8% higher word recognition rate for input speakers. In the phrase recognition experiment, the adaptation network also improves the recognition performance for each input speaker. The averaged recognition rate with speaker adaptation is 57.4%; in contrast, the recognition rate without adaptation is only 29.7%. The adaptation network thus gives a 27.7% higher phrase recognition rate for input speakers. The performance of the TDNN-LR without adaptation drops remarkably in the case of a female input speaker and a male standard speaker. This must be due to the considerable difference between male and female samples. The adaptation network is effective for both male-to-male and female-to-male adaptation. These results show the effectiveness of the segment-based adaptation neural network for word and phrase recognition. Regarding the phonemes included in the testing sets, the adaptation network is effective especially on vowels; this is because the training words for adaptation contain more vowel samples than consonant samples. Comparing word and phrase recognition, the averaged word recognition rate with adaptation is 5.3% lower than the speaker-dependent rate. In contrast, the averaged phrase recognition rate with adaptation
Table 4: Word and Phrase Recognition Performance Using a Segment-based Speaker Adaptation Neural Network
is 13.8% lower than the speaker-dependent rate. One reason is the use of word data for training the adaptation network; using phrase data to train the adaptation network is expected to improve the adaptation performance for phrase recognition.
5. CONCLUSION

This paper showed the effectiveness of a segment-based speaker adaptation neural network in word and phrase recognition. Nevertheless, the phrase recognition rate with adaptation is still 13.8% lower than the speaker-dependent rate. The following three items are set for future study: 1) improving adaptation performance for phrase recognition by using phrase data to train the adaptation network; 2) applying the segment-based speaker adaptation neural network to a segment-based HMM-LR [9] system; 3) evaluating the performance of mapping multiple input speakers to one standard speaker with the segment-based adaptation neural network.

ACKNOWLEDGMENTS

The authors would like to thank Dr. A. Kurematsu and Mr. S. Sagayama for their support of this research, Mr. H. Hattori for his useful comments and suggestions, and all the members of the Speech Processing Department for their discussions and encouragement.

References

[1] S. Nakamura, T. Hanazawa and K. Shikano, "Phoneme Recognition Evaluation of HMM Speaker Adaptation Based on Vector Quantization," Report of Speech Committee, SP88-106, pp.1-8 (Dec. 1988) (in Japanese).
[2] K. Iso, M. Asogawa, K. Toshida and T. Watanabe, "Speaker Adaptation Using Neural Network," Proc. of Acoust. Soc. of Jpn., Spring Meeting, 1-6-16 (Mar. 1989) (in Japanese).
[3] K. Fukuzawa, H. Sawai and M. Sugiyama, "Segment-based Speaker Adaptation by Neural Network," Proc. of NNSP, pp.442-451 (Sep. 1991).
[4] M. Miyatake, H. Sawai, Y. Minami and K. Shikano, "Integrated Training for Spotting Japanese Phonemes Using Phonemic Time-Delay Neural Networks," Proc. of ICASSP, S8.10, pp.449-452 (Apr. 1990).
[5] H. Sawai, S. Nakamura, K. Fukuzawa and M. Sugiyama, "On Connectionist Approaches to Speaker-independent Recognition," Proc. of Acoust. Soc. of Jpn., 1-5-17, pp.37-38 (Oct. 1987) (in Japanese).
[6] S. Nakamura, H. Sawai and M. Sugiyama, "Speaker-Independent Phoneme Recognition Using Large-scale Neural Networks," to appear in Proc. of ICASSP92 (Mar. 1992).
[7] D. E. Rumelhart, G. E. Hinton and R. J. Williams, "Parallel Distributed Processing: Explorations in the Microstructure of Cognition," Volume 1: Foundations, Cambridge, MA, MIT Press (1986).
[8] Y. Komori, "A Neural Fuzzy Training Approach For Continuous Speech Recognition Improvement," to appear in Proc. of ICASSP92 (Mar. 1992).
[9] K. Ohkura and M. Sugiyama, "Segment-based HMM Applied to Noisy Speech Recognition," Report of Speech Committee, SP91-55, pp.1-6 (Sep. 1991) (in Japanese).