A COMPARISON OF NEURAL NET AND LINEAR CLASSIFIER AS THE PATTERN RECOGNIZER IN AUTOMATIC LANGUAGE IDENTIFICATION

Yonghong Yan

Etienne Barnard

Center for Spoken Language Understanding
Oregon Graduate Institute of Science & Technology
P.O. Box 91000, Portland, OR 97291-1000, USA
Phone: (503) 690-1121 ext. 1637, FAX: (503) 690-1334
[email protected] [email protected]

ABSTRACT

The goal of language identification (LID) is to quickly and accurately identify the language being spoken in a given test utterance. Recent research has shown the importance of acoustic, phonotactic and prosodic information for language identification. How to combine these multiple information sources into the final result is still a research issue. Traditional ways of combining multiple scores amount to a linear classifier. In this paper, experiments were conducted to compare the performance of linear-classifier-based and neural-network-based final score combination. The results showed that the neural-network-based system reduced the errors of the linear-classifier-based system by approximately 15%, which suggests that a non-linear combination of multiple information sources is beneficial for language identification.

1. Introduction

Automatic language identification (LID) has received much renewed attention in recent years. The task of a LID system is to quickly and accurately identify the language being spoken in a given test utterance. Recent research [1, 2, 3] has shown the importance of acoustic, phonotactic and prosodic information for language identification. How to combine these multiple information sources into the final identification result is still an interesting research issue. Previously, the typical ways to combine multiple scores were either by some prior knowledge about the relative merit of the different scores, as in [1], or by hill-climbing optimization of a linear combination of scores [2]. These techniques thus amount to linear classification. A well-trained linear classifier can be viewed as combining the scores with optimal weights to give the best guess. In this paper, comparison experiments were conducted to compare the performance of linear-classifier-based and neural-network-based final score combination. The platform used to conduct these experiments is a language identification system introduced recently [3, 4]. It is a language-dependent phone-recognition-based approach to language identification, exploiting acoustic models, language models and duration models. The comparison experiments were conducted on two commonly used LID tasks: a six-language task and an eleven-language task. In all the experiments, the neural-network-based systems achieved better identification (correct) rates than the linear-classifier-based systems. The rest of this paper is organized as follows: Section 2 gives an overview of the platform (the system) for the comparison experiments, Section 3 presents the database used in this study, Section 4 presents the experiments and results, and Section 5 gives the discussion.
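The linear score combination described above can be sketched as follows. The score values, the weights, and the three-source/three-language setup are illustrative placeholders, not figures from any cited system:

```python
import numpy as np

# Hypothetical scores: rows = information sources (acoustic, phonotactic,
# prosodic), columns = candidate languages.  Values are made up.
scores = np.array([
    [0.2, 0.5, 0.3],   # acoustic scores
    [0.1, 0.6, 0.3],   # phonotactic scores
    [0.4, 0.3, 0.3],   # prosodic scores
])

# A linear combiner weights each source and sums; the weights could come
# from prior knowledge or from hill-climbing optimization, as in [1, 2].
weights = np.array([0.5, 0.3, 0.2])

combined = weights @ scores          # one combined score per language
best = int(np.argmax(combined))      # index of the identified language
```

A neural network generalizes this by making the mapping from the score vector to the decision non-linear, which is the comparison this paper carries out.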

[Figure 1: General structure of the LID system. The input speech signal is decoded in parallel by phone recognizers 1 through M; each recognizer feeds its own score generator (containing LID models 1, ..., N), and all scores are sent to the final classifier, which outputs the LID result.]

2. Overview of the LID system

2.1. General Architecture

Our LID system [3] is composed of three parts: (1) language-dependent phone recognizers (the front end), (2) LID score generators and (3) the final classifier. A general architecture for an N-language task system is given in Figure 1. For an N-language task, M (M ≤ N) language-dependent phone recognizers were implemented. During testing, these phone recognizers run in parallel and independently decode the input utterance into phone strings. The score generator of each recognizer takes the decoded phone string, calculates LID scores and sends them to the final classifier. For each language in the task, a score generator contains a set of LID models, described in terms of the phone inventory of the language-dependent recognizer associated with that score generator. If each score generator contains L sets of LID models, the input to the final classifier is an M × N × L LID score vector. Our main interest in this paper is to compare two different types of pattern recognizer (neural network and linear classifier) as the final classifier.

2.2. Models in the Score Generator

1. Forward language model:

   b_{LF} = \prod_{i=1}^{T} ( \alpha b(p_i | p_{i-1}) + \beta u(p_i) )    (1)

   where b(p_i | p_{i-1}) is the bigram term, u(p_i) is the unigram term, and p_i is the i-th phone in the decoded path.

2. Backward language model:

   b_{LB} = \prod_{i=1}^{T} ( \alpha b(p_i | p_{i+1}) + \beta u(p_i) )    (2)

3. Duration model:

   P_D = \prod_{i=1}^{T} ( (1 - \lambda) P(p_i | p_{i-1} \in S) + \lambda P(p_i) )    (3)

   where P(p_i | p_{i-1} \in S) is the context-dependent model, and S is one of six broad categories: vowel, fricative, stop, nasal, affricate or glide. P(p_i) is the original context-independent monophone duration model, used here as a smoothing factor with weight \lambda.

Language models are used to model the phonotactic differences between languages; the two language models capture the right- and left-context-dependent information. The duration models are used to model the prosodic differences between languages. The backward language model and the duration model were first proposed in [3]. All the language models were optimized by the method proposed in [4].
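As an illustration, the interpolated bigram/unigram language-model score of Eqs. (1)-(2) can be computed in the log domain (for numerical stability) roughly as follows. The interpolation weights, the toy probability tables, and the probability floor are illustrative assumptions, and a backward score would in practice use separately estimated backward bigram probabilities:

```python
import math

def lm_score(phones, bigram, unigram, alpha=0.7, beta=0.3, backward=False):
    """Sum of log[alpha * b(p_i | context) + beta * u(p_i)] over a decoded
    phone string.  Forward (Eq. 1) conditions each phone on the previous
    phone; backward (Eq. 2) conditions it on the following phone."""
    pairs = (zip(phones[1:], phones[:-1]) if not backward
             else zip(phones[:-1], phones[1:]))
    logp = 0.0
    for cur, ctx in pairs:  # bigram is keyed by (phone, context phone)
        p = alpha * bigram.get((cur, ctx), 0.0) + beta * unigram.get(cur, 0.0)
        logp += math.log(max(p, 1e-12))  # floor unseen events to avoid log(0)
    return logp
```

In the system described here, such scores are additionally normalized by the number of phones in the decoded path before reaching the final classifier.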

3. Database

The Oregon Graduate Institute Multi-Language Telephone Speech Corpus (OGI TS) [5] was used. It is a telephone speech database collected in the USA; all the speakers are assumed to be native speakers of their languages. This database is used by the National Institute of Standards and Technology (NIST) as the standard database for language identification research. Currently the database contains speech data for 11 languages: English, Farsi, French, German, Hindi, Japanese, Korean, Mandarin, Spanish, Tamil, and Vietnamese. Among them, the data of six languages have been phonetically transcribed: English, German, Hindi, Japanese, Mandarin and Spanish. Phone recognizers for these six languages were implemented. Four parts of the utterances in the database were used: "story-before-the-tone" (story-bt), "story-after-the-tone" (story-at), "rooms" (room) and "numbers" (num). The training and development sets used in this study contained 1785 utterances in total. Data in the training sets were used to train the language-dependent phone recognizers and to estimate the parameters of the LID models used by the score generators. All the data in the training and development sets were used to train the final classifier. The comparison experiments presented here were conducted on the 11-language and six-language tasks. The test data were the test sets used by NIST in their 1994 evaluation for these two tasks, consisting of 195 45-second utterances and 625 10-second utterances. All the above sets were disjoint. The data of each language in the task were roughly balanced across all the sets.

4. Experiments and Results

4.1. System Implementation and Parameter Estimation

The speech data were sampled from the telephone line at an 8 kHz rate. The speech waveform is parameterized every 20 ms, with 10 ms overlap between contiguous frames. For each frame, a 26-dimensional feature vector is calculated: 12 LPC cepstra, 12 delta cepstra, normalized energy and delta energy. Six general-purpose continuous-HMM-based context-independent phone recognizers for continuous speech were implemented. A three-state left-to-right HMM was used for each phone in each language. The probability density function for each HMM state is represented by a mixture of three Gaussians. The number of HMMs (number of phones) and the phone recognition accuracy for these recognizers are given in Table 1. The implemented recognizers were used to train the LID models in the score generators. Each recognizer was used to decode the data of all the languages in the training set; based on the decoded phone path, all the LID models of each language in the task were estimated for the score generator of that particular language-dependent phone recognizer. During testing, the scores calculated by each score generator were normalized by the number of phones in the decoded best path before being sent to the final classifier.

    Language    No. of Models    Accuracy
    English          41           46.79%
    German           41           46.57%
    Hindi            45           48.13%
    Japanese         26           56.25%
    Mandarin         41           36.33%
    Spanish          30           54.56%

Table 1: General information about the front end

    Approach    45-sec. utterance    10-sec. utterance
    SIX L S          87.50%               76.22%
    SIX N S          88.39%               77.30%
    SIX L M          90.18%               79.19%
    SIX N M          91.96%               81.62%
    11 L S           80.51%               69.12%
    11 N S           82.56%               70.04%
    11 L M           83.59%               70.08%
    11 N M           86.67%               73.76%

Table 2: Results (correct rate) for all the experiments
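For concreteness, the framing stage of the front end can be sketched as below, assuming the common reading of the parameters above: a 20 ms analysis window advanced every 10 ms, so adjacent frames overlap by 10 ms. The function name and the use of NumPy are illustrative:

```python
import numpy as np

SAMPLE_RATE = 8000                       # 8 kHz telephone speech
FRAME_LEN = SAMPLE_RATE * 20 // 1000     # 20 ms window -> 160 samples
FRAME_SHIFT = SAMPLE_RATE * 10 // 1000   # 10 ms shift -> 80 samples

def frame_signal(signal):
    """Slice a waveform into overlapping analysis frames (sketch)."""
    n = 1 + max(0, (len(signal) - FRAME_LEN) // FRAME_SHIFT)
    return np.stack([signal[i * FRAME_SHIFT : i * FRAME_SHIFT + FRAME_LEN]
                     for i in range(n)])

frames = frame_signal(np.zeros(8000))    # one second of (silent) input
# Each frame would then yield the 26-dim vector: 12 LPC cepstra,
# 12 delta cepstra, normalized energy and delta energy.
```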

4.2. Comparison Experiments and Results

To compare the performance of the neural network and the linear classifier as the final classifier in the system, the following two sets of experiments were conducted:

- Only one set of features (the forward language model) is used in each score generator.
- All three sets of features are used in each score generator.

The neural network used was a fully-connected feed-forward net with one hidden layer. The number of output nodes is equal to the number of languages in the task. A conjugate-gradient optimization algorithm [6] was used to train the neural networks. All the experiments were conducted on both the six-language and the 11-language tasks. Neural networks with different numbers of hidden nodes were evaluated; the best results were obtained by nets with 15 to 25 hidden nodes. Results for these experiments are given in Table 2, where SIX denotes the six-language task, 11 the 11-language task, L a linear classifier, N a neural net, S a single feature set being used, and M all feature sets being used. The dimensions of the LID score vectors sent to the final classifier for these experiments are given in Table 3.

    Experiment    Dimension (M × N × L)
    SIX S         6 × 6 × 1 = 36
    SIX M         6 × 6 × 3 = 108
    11 S          6 × 11 × 1 = 66
    11 M          6 × 11 × 3 = 198

Table 3: Input LID score vector dimensions for the final classifier
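The final classifier's forward pass can be sketched in NumPy as below. The weights here are random placeholders (the paper trains them with conjugate-gradient optimization [6], which is not shown), and 20 hidden nodes is just one value in the 15-25 range reported to work best:

```python
import numpy as np

rng = np.random.default_rng(0)

def make_net(n_in, n_hidden, n_out):
    """Fully-connected feed-forward net with one hidden layer
    (weights are random placeholders, not trained)."""
    return {
        "W1": rng.standard_normal((n_hidden, n_in)) * 0.1,
        "b1": np.zeros(n_hidden),
        "W2": rng.standard_normal((n_out, n_hidden)) * 0.1,
        "b2": np.zeros(n_out),
    }

def forward(net, x):
    h = np.tanh(net["W1"] @ x + net["b1"])   # hidden layer
    return net["W2"] @ h + net["b2"]         # one output per language

# Six-language task, all three feature sets: 6 x 6 x 3 = 108-dim score
# vector in, 6 outputs (one per language).
net = make_net(108, 20, 6)
outputs = forward(net, np.zeros(108))
language = int(np.argmax(outputs))           # identified language index
```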

5. Discussion

The neural-net-based final classifier outperforms the linear classifier in all the experiments. More improvement is achieved when multiple information sources are used, indicating that the combination of different information sources is an important part of the success achieved with the neural network. Overall, we achieved approximately 15% error reduction by using the neural network. The results show that non-linear combination of multiple information sources is useful in improving the performance of language identification systems.
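The quoted ~15% figure is a relative error reduction, which can be checked against the correct rates in Table 2; for example, for the 11-language, multiple-feature condition on 45-second utterances:

```python
def relative_error_reduction(acc_linear, acc_nn):
    """Relative reduction in error rate when moving from the linear
    classifier to the neural net (accuracies given in percent)."""
    err_linear = 100.0 - acc_linear
    err_nn = 100.0 - acc_nn
    return 100.0 * (err_linear - err_nn) / err_linear

# 11-language task, all feature sets (Table 2): 83.59% -> 86.67%
# correct on 45-second utterances, i.e. errors drop from 16.41% to 13.33%.
r = relative_error_reduction(83.59, 86.67)
```

Individual conditions range above and below 15%, which is consistent with the paper's "approximately 15%" average.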

6. References

[1] M.A. Zissman. Language Identification using Phoneme Recognition and Phonotactic Language Modeling. In 1995 International Conference on Acoustics, Speech, and Signal Processing Proceedings, Vol. 5, pages 3503-3506, May 1995.

[2] T.J. Hazen and V.W. Zue. Recent Improvements in an Approach to Segment-based Automatic Language Identification. In Proceedings of the International Conference on Spoken Language Processing, pages 1883-1886, September 1994.

[3] Y. Yan and E. Barnard. An Approach to Automatic Language Identification based on Language-dependent Phone Recognition. In 1995 International Conference on Acoustics, Speech, and Signal Processing Proceedings, Vol. 5, pages 3511-3514, May 1995.

[4] Y. Yan and E. Barnard. An Approach to Language Identification with Enhanced Language Model. To appear in EUROSPEECH-95.

[5] Y.K. Muthusamy, R.A. Cole and B.T. Oshika. The OGI Multi-language Telephone Speech Corpus. In Proceedings of the International Conference on Spoken Language Processing 92, Vol. 2, pages 895-898, October 1992.

[6] E. Barnard and R.A. Cole. A Neural-net Training Program based on Conjugate-gradient Optimization. Technical Report CSE 89-014, Oregon Graduate Institute, 1989.

[7] K.P. Li. Experimental Improvements of a Language ID System. In 1995 International Conference on Acoustics, Speech, and Signal Processing Proceedings, Vol. 5, pages 3515-3518, May 1995.

[8] E.S. Parris and M.J. Carey. Language Identification Using Multiple Knowledge Sources. In 1995 International Conference on Acoustics, Speech, and Signal Processing Proceedings, Vol. 5, pages 3519-3522, May 1995.
