15th Indonesian Scientific Conference in Japan Proceedings. ISSN:1881-4034

A Large Vocabulary Continuous Speech Recognition System for Indonesian Language

Dessi Puji Lestari, Koji Iwano, Sadaoki Furui
Department of Computer Science, Tokyo Institute of Technology, Tokyo, Japan
{dessi, iwano, furui}@furui.cs.titech.ac.jp

Abstract. This paper presents our work to build a pioneering Indonesian Large Vocabulary Continuous Speech Recognition (LVCSR) system. Building an LVCSR system requires highly accurate acoustic models and large-scale language models. Since no Indonesian speech corpus was yet available, we collected speech data from Indonesian native speakers to construct a corpus for training acoustic models based on Hidden Markov Models (HMMs). A text corpus collected by ILPS, Informatics Institute, University of Amsterdam, was used to build a 40K-word dictionary and an n-gram language model. Evaluation results of the system are also presented. The best configuration achieved an average word accuracy of 78.2%; with speaker adaptation using the MAP technique, the accuracy increased to 82.3% on average.

Keywords. LVCSR, HMM-based acoustic model, n-gram language model, MAP speaker adaptation.

1 Introduction

An automatic speech recognition (ASR) system is a set of technologies that allows a computer to transform sound recorded through a microphone into a sequence of words. The system includes two main components: an acoustic model and a language model. The acoustic model describes how a given word or "phone" is pronounced, and the language model predicts the likelihood of word sequences appearing in the language. From a user's perspective, speech recognition systems can be classified by the following criteria:

- whether the system can be used by only one user (speaker dependent) or by different speakers (speaker independent),
- whether the system requires users to break up their speech into discrete words (isolated word recognition) or can recognize natural human speech, in which several words may be connected and co-articulated (continuous speech recognition),
- whether the vocabulary that can be recognized is small (tens or at most hundreds of words) or large (thousands of words). A large-vocabulary system that also accepts continuous speech is called a large vocabulary continuous speech recognition (LVCSR) system.

Research on LVCSR has been conducted extensively for many languages and has already been applied to many kinds of applications, such as telephone solutions, voice-controlled applications, dictation, etc. However, little effort has been devoted to the Indonesian language. One of the main problems is the lack of an Indonesian speech corpus. ATR Spoken Language Translation Laboratories (Japan), in collaboration with the R&D Division of PT Telekomunikasi Indonesia (Indonesia) and the Electrical Engineering Department of Bandung Institute of Technology (Indonesia), initiated a project on automatic speech recognition and successfully built a system for digit and simple dialog processing tasks (Sakti et al. 2004). Since an Indonesian LVCSR system has not yet been developed, our purpose is to develop such a system as a baseline for Indonesian speech recognition.

2 Database Development

The most common state-of-the-art acoustic models based on Hidden Markov Models (HMMs) (Furui 2001) and n-gram language models (Huang et al. 2001) are used to develop the LVCSR system for the Indonesian language. Statistical techniques for estimating these models require relatively large amounts of speech and text data for training.

2.1 Text Corpus

Two document collections of newspaper (Kompas online) and magazine (Tempo online) articles compiled by ILPS (Tala 2003) were used as the training set. The initial form of these collections contained grammatically incorrect sentences, article headings, numbers, punctuation, abbreviations, acronyms, names, foreign words, and other symbols, all of which degrade the performance of the language model. Thus a text amendment tool was built in order to:

1. Remove the heading of each article,
2. Convert all upper-case letters into lower case,
3. Change numbers into words, for example "1" to "satu", "103" to "seratus tiga",
4. Remove punctuation symbols other than "," and "."; "!" and "?" are changed into ".", and
5. Add special symbols to mark the beginning and ending of sentences.

After the text amendment tool was applied to the ILPS document collections, manual correction was conducted to split long sentences into several sentences, or to merge two short grammatically incorrect sentences into one grammatically correct sentence. After these processes, the text corpus described in Table 1 was obtained.

Table 1: Text corpus statistics (training set for the language model)

  Number of sentences        615,248
  Number of words            9,853,517
  Vocabulary size            129,919
  Average sentence length    16.02 words
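Steps 2-4 of the amendment tool can be sketched in Python. The `to_words` converter below is our own minimal illustration covering numbers up to 999 only, not the tool actually used:

```python
import re

UNITS = ["nol", "satu", "dua", "tiga", "empat",
         "lima", "enam", "tujuh", "delapan", "sembilan"]

def to_words(n: int) -> str:
    """Spell an integer (0-999) in Indonesian, e.g. 103 -> 'seratus tiga'."""
    if n < 10:
        return UNITS[n]
    if n == 10:
        return "sepuluh"
    if n == 11:
        return "sebelas"
    if n < 20:
        return UNITS[n - 10] + " belas"
    if n < 100:
        head, rest = UNITS[n // 10] + " puluh", n % 10
    elif n < 200:
        head, rest = "seratus", n % 100
    elif n < 1000:
        head, rest = UNITS[n // 100] + " ratus", n % 100
    else:
        return str(n)  # larger numbers left as digits in this sketch
    return head if rest == 0 else head + " " + to_words(rest)

def amend(line: str) -> str:
    """Apply steps 2-4: lowercase, spell out numbers, clean punctuation."""
    line = line.lower()
    line = re.sub(r"\d+", lambda m: to_words(int(m.group(0))), line)
    line = re.sub(r"[!?]", ".", line)            # "!" and "?" become "."
    line = re.sub(r"[^a-z0-9,.\s]", " ", line)   # drop remaining symbols
    return re.sub(r"\s+", " ", line).strip()
```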

2.2 Speech Corpus

Since a large-vocabulary, phonetically balanced speech corpus of the Indonesian language was not yet available, we collected speech data from Indonesian native speakers to construct one.

2.2.1 Text Selection

Three hundred and twenty-eight sentences, taken from the ILPS document collections, were prepared for recording. These sentences were selected to produce a phonetically balanced corpus.
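The paper does not specify the selection algorithm. One common approach to phonetically balanced text selection, shown here purely as an illustration, is a greedy search that repeatedly picks the sentence adding the most uncovered phones; `phones_of` is a hypothetical helper returning the phone set of a sentence:

```python
def greedy_select(sentences, phones_of, target=328):
    """Greedily pick sentences that maximize coverage of unseen phones.

    phones_of(sentence) must return the set of phones in the sentence.
    """
    selected, covered = [], set()
    pool = list(sentences)
    while pool and len(selected) < target:
        # Pick the sentence contributing the most phones not yet covered.
        best = max(pool, key=lambda s: len(phones_of(s) - covered))
        selected.append(best)
        covered |= phones_of(best)
        pool.remove(best)
    return selected
```

In practice the objective is often extended to cover phone bigrams or to balance phone frequencies rather than mere coverage.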

Indonesian Student Association in Japan


2.2.2 Recording

For recording, 20 native speakers (11 males and 9 females) were selected. Each speaker was asked to read the prepared text of 328 sentences; a normal recording session lasted about 50 minutes per speaker. If a reading error occurred, the speaker was asked to read the sentence again. The total size of the speech corpus is 14.5 hours. Speech was recorded in a quiet room using a Sony DTC2000ES DAT recorder and a close-talking Sennheiser HMD 25-1 microphone. The recordings were transferred from DAT tape to files at a 16 kHz sampling rate, and the corpus was then manually segmented using Wavesurfer (KTH 2006).

2.2.3 Phoneme Labeling

To train HMMs, phone labeling is necessary. For labeling, we use the Indonesian phoneme set (Darjowidjodjo 1966) shown in Table 2.

Table 2: Indonesian phoneme set

  Phonetic Category   Phoneme   Example Word   Phoneme Sequence
  Vowels              /a/       saya           /s a y a/
                      /e/       enak           /e n a k/
                      /E/       kEmana         /k E m a n a/
                      /i/       ingin          /i n g i n/
                      /o/       orang          /o r a ng/
                      /u/       untuk          /u n t u k/
  Diphthongs          /ai/      sungai         /s u ng ai/
                      /au/      danau          /d a n au/
                      /oi/      amboi          /a m b oi/
  Semi-vowels         /w/       wanita         /w a n i t a/
                      /y/       saya           /s a y a/
  Consonants:
    Plosives          /b/       berapa         /b e r a p a/
                      /p/       petani         /p e t a n i/
                      /d/       dia            /d i a/
                      /t/       teman          /t e m a n/
                      /g/       giat           /g i a t/
                      /k/       kamu           /k a m u/
                      /kh/      khairul        /kh a i r u l/
    Affricates        /j/       juga           /j u g a/
                      /c/       cinta          /c i n t a/
    Fricatives        /v/       video          /v i d e o/
                      /f/       maaf           /m a a f/
                      /z/       jenazah        /j e n a z a h/
                      /s/       saya           /s a y a/
                      /sy/      syahdu         /sy a h d u/
                      /h/       hujan          /h u j a n/
    Liquids           /r/       ramai          /r a m a i/
                      /l/       lambat         /l a m b a t/
    Nasals            /m/       mana           /m a n a/
                      /n/       mana           /m a n a/
                      /ny/      nyanyian       /ny a ny i an/
                      /ng/      lambang        /l a m b a ng/


In our experiments, we assume that the /e/ and /E/ sounds are the same; thus /E/ is merged into /e/. The same applies to the /f/ and /v/ sounds: /v/ is merged into /f/. We add a phoneme /q/ to cover words of Arabic origin that often appear in Indonesian. Thus, the number of phonetic symbols used in this system is 31. Three special symbols were also added for training the acoustic model: /silB/ and /silE/ (silence beginning and ending) to mark the beginning and ending of a sentence, and /sp/ (short pause) to mark a short pause made by the speaker. Punctuation marks were not used in this experiment. To obtain accurate /sp/ labeling, forced alignment was applied, and the new phoneme labels were used to retrain the HMMs.
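The merging rules above can be written as a small mapping. The labels mirror the paper's notation, but the functions themselves are our illustration, not the authors' labeling code:

```python
# Phoneme merges described in the text: /E/ -> /e/ and /v/ -> /f/.
MERGE = {"E": "e", "v": "f"}

def normalize_phones(seq):
    """Map a phone sequence onto the merged 31-symbol phoneme set."""
    return [MERGE.get(p, p) for p in seq]

def label_sentence(seq):
    """Wrap a sentence-level phone sequence with the silence markers
    used for acoustic model training."""
    return ["silB"] + normalize_phones(seq) + ["silE"]
```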

3 Acoustic Model

HTK Toolkit version 3.2 (Young 1989) was used as the acoustic model training tool. The speech corpus described in Subsection 2.2 was divided into training and testing sets: for each speaker, 293 sentences were used for training and 35 for testing. A leave-one-out scheme was used to conduct 10 experiments; in each experiment the training set contained 18 speakers (10 males and 8 females) and the testing set contained 2 speakers (1 male and 1 female), with no speaker overlap between the two sets (an open-speaker setup). The 1st through 12th order Mel-Frequency Cepstral Coefficients (MFCCs) were computed every 10 ms using a 25-ms window. Temporal differences of the MFCCs (delta MFCC) and of the log power (delta LogPow) were also incorporated. Context-dependent HMMs with 32 Gaussian mixture components were trained.
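Delta features are typically computed by linear regression over a few neighbouring frames; the exact regression window used here is not stated, so the sketch below uses a generic ±k-frame window:

```python
import numpy as np

def deltas(feats: np.ndarray, k: int = 2) -> np.ndarray:
    """Regression-based delta features over a +-k frame window.

    feats: (T, D) array of static features (e.g. 12 MFCCs per frame).
    Returns a (T, D) array of frame-level time derivatives.
    """
    T = feats.shape[0]
    # Replicate edge frames so boundary deltas are well defined.
    padded = np.pad(feats, ((k, k), (0, 0)), mode="edge")
    denom = 2 * sum(i * i for i in range(1, k + 1))
    rows = []
    for t in range(T):
        num = sum(i * (padded[t + k + i] - padded[t + k - i])
                  for i in range(1, k + 1))
        rows.append(num / denom)
    return np.stack(rows)
```

A linear ramp of static features yields a constant delta of 1.0 away from the edges, which is a quick sanity check for the regression weights.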

4 Language Model

CMU-Cambridge SLM Toolkit version 2.0 (Clarkson 1997) was used to train 2-gram and 3-gram language models on the text corpus described in Subsection 2.1. Both were trained with Good-Turing backoff smoothing. With this training set, the 3-gram language model had an average test-set perplexity of 86.6 and an average out-of-vocabulary (OOV) rate of 1.7%. To build the dictionary, words occurring more than three times in the training set were included, yielding 41,436 dictionary entries. An automatic transcription tool was built and used to add pronunciations to this dictionary.
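The frequency cutoff used to build the dictionary can be sketched as follows; "more than three times" corresponds to keeping words with a count of at least 4:

```python
from collections import Counter

def build_vocab(corpus_lines, min_count=4):
    """Return the sorted list of words occurring more than three times
    (count >= min_count) in the whitespace-tokenized training text."""
    counts = Counter(w for line in corpus_lines for w in line.split())
    return sorted(w for w, c in counts.items() if c >= min_count)
```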

5 Evaluation Experiment

The recognition engine Julius version 3.4.2 (Kyoto University 2001) was used as the speech decoder. Column 2 of Table 3 shows the word accuracy for experiments 1 through 10 and their average. To improve the recognition result for each testing speaker, the 293 utterances of that speaker which were not used for training or testing were used for supervised adaptation with the MAP technique (Lee and Gauvain 1993). This technique increased the accuracy by 4.1% on average. Column 3 of Table 3 shows the word accuracy after MAP speaker adaptation.
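Word accuracy, the metric reported in Table 3, is (N - S - D - I)/N, where N is the number of reference words and S, D, I are the substitutions, deletions, and insertions from an edit-distance alignment. A minimal sketch:

```python
def word_accuracy(ref: str, hyp: str) -> float:
    """Word accuracy = 1 - (S + D + I) / N via Levenshtein alignment."""
    r, h = ref.split(), hyp.split()
    # dp[i][j]: minimum edits to turn r[:i] into h[:j]
    dp = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        dp[i][0] = i                       # deletions only
    for j in range(len(h) + 1):
        dp[0][j] = j                       # insertions only
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            sub = dp[i - 1][j - 1] + (r[i - 1] != h[j - 1])
            dp[i][j] = min(sub, dp[i - 1][j] + 1, dp[i][j - 1] + 1)
    return 1.0 - dp[len(r)][len(h)] / len(r)
```

Note that with many insertion errors this measure can go negative, which is why word accuracy is reported alongside, rather than instead of, error counts in some evaluations.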


Table 3: Evaluation results

  Experiment    Word Accuracy (%)
                Baseline    MAP Adaptation
  1             82.7        84.9
  2             82.9        86.8
  3             70.0        77.5
  4             84.0        84.8
  5             79.9        82.5
  6             74.8        80.1
  7             64.0        74.6
  8             78.9        80.8
  9             82.0        85.0
  10            82.8        86.2
  Mean          78.2        82.3

Careful analysis of the recognition results shows that the major recognition errors were caused by the following:

1. OOV words.
2. Names of persons, organizations (with abbreviations), and places.
3. Incorrect word segmentation, e.g. the word "dilepaskan" recognized as "di lepaskan", or "ke arah" recognized as "kearah".
4. Homophones.
5. Strong speaker dialects. Some Indonesian words have similar counterparts in regional languages such as Javanese and Sundanese and were misrecognized with regional pronunciations, e.g. "hijau" misrecognized as "ijo" when uttered by speakers with a strong Javanese dialect (one of the testing speakers in experiments #1, #4, and #7), and "setia" misrecognized as "setiah" when uttered by a speaker with a strong Sundanese dialect (one of the testing speakers in experiment #2).
6. Noise in the middle of utterances produced by the speaker, such as breath or cough sounds (one of the testing speakers in experiment #3).
7. Uncommon speaking rates: speakers who spoke much faster (one of the testing speakers in experiment #5) or much slower (both testing speakers in experiment #7) than the speakers in the training data tended to get lower accuracy.
8. Unclear pronunciation (one of the testing speakers in experiment #6).

6 Conclusion

This paper described the first attempt to build an LVCSR system for the Indonesian language. The evaluation shows lower accuracy than LVCSR systems built for other languages, which have reached around 90% word accuracy (Furui 2001, Digalakis et al. 2003). Therefore, further work is needed to improve the accuracy of the Indonesian LVCSR system. Future work includes adding more training data covering all Indonesian dialects, to cope with their great variety, as well as more application domains; and improving the dictionary and the language model by pre-processing the text corpus to repair grammatically incorrect sentences, incorrect words, etc.

Acknowledgements

The authors would like to thank ILPS, Informatics Institute, University of Amsterdam for providing the Kompas and Majalah Tempo collections.

References

[1] Clarkson, P. R., and Rosenfeld, R. (1997). "Statistical Language Modeling Using the CMU-Cambridge Toolkit", In Proceedings of ESCA Eurospeech.
[2] Darjowidjodjo, S. (1966). "Indonesian Syntax", Ph.D. dissertation, Georgetown University, Washington.
[3] Digalakis, V., Oikonomidis, D., Pratsolis, D., Tsourakis, N., Vosnidis, C., Chatzichrisafis, N., and Diakoloukas, V. (2003). "Large Vocabulary Continuous Speech Recognition in Greek: Corpus and an Automatic Dictation System", In Proceedings of Eurospeech-2003, 1565-1568.
[4] Furui, S. (2001). Digital Speech Processing, Synthesis, and Recognition, Second edition, revised and expanded, Marcel Dekker.
[5] Huang, X., Acero, A., and Hon, H. (2001). Spoken Language Processing: A Guide to Theory, Algorithm, and System Development, Prentice Hall, 558-560.
[6] Kyoto University (2001). Multipurpose Large Vocabulary Continuous Speech Recognition Engine Julius rev. 3.2.
[7] Lee, C. H., and Gauvain, J. L. (1993). "Speaker adaptation based on MAP estimation of HMM parameters", In Proceedings of ICASSP, 2, 558-561.
[8] Sakti, S., Hutagaol, P., Arman, A. A., and Nakamura, S. (2004). "Indonesian Speech Recognition for Hearing and Speaking Impaired People", In Proceedings of Interspeech/ICSLP, 2, 1037-1040.
[9] Tala, F. Z. (2003). "A Study of Stemming Effects on Information Retrieval in Bahasa Indonesia", M.Sc. Thesis, Information Retrieval Resources for Bahasa Indonesia (ILPS), Informatics Institute, University of Amsterdam, http://ilps.science.uva.nl/Resources/.
[10] The Royal Institute of Technology (Kungliga Tekniska Högskolan/KTH) (2006). Wavesurfer, http://www.speech.kth.se/wavesurfer/.
[11] Young, S. (1989). HTK - Hidden Markov Model Toolkit, University of Cambridge, http://htk.eng.cam.ac.uk/.
