Large Vocabulary Continuous Speech Recognition Based on Cross-Morpheme Phonetic Information

In-Jeong Choi*, Su Youn Yoon, Nam-Hoon Kim*

*HCI_LAB, Samsung Advanced Institute of Technology, Korea
Department of Linguistics, Seoul National University, Korea
[email protected], [email protected], *[email protected]

Abstract

In this paper, we present a novel method to regulate lexical connections among morpheme-based pronunciation lexicons for Korean large vocabulary continuous speech recognition (LVCSR) systems. The pronunciation dictionary plays an important role in subword-based LVCSR, in that pronunciation variations such as coarticulation will deteriorate the performance of an LVCSR system if they are not well accounted for. In general, pronunciation variations are modeled by applying phonological rules to all possible phonemic contexts. In order to achieve high recognition performance, current speech recognition systems impose constraints among lexical entries using both morphological and phonetic knowledge. This paper suggests a method both to refine pronunciation variants according to cross-morpheme phonetic information and to regulate the connections between those variants. The method effectively excludes improper connections between pronunciation lexicon entries, and the proposed approach thus gave a 27% relative reduction in word error rate over a recognizer with conventional lexicons.

1. Introduction

In large vocabulary continuous speech recognition (LVCSR), all sorts of pronunciation variation may take place within a word and across word boundaries, resulting in various phonological processes such as assimilation, insertion, reduction, or deletion. In order to model pronunciation variations, it is necessary to generate a lexicon with multiple pronunciations for each word. It has been shown in many studies that simply adding variants to the lexicon does not lead to improvements and in many cases even degrades performance due to increased confusability in the lexicon [1]. In continuous speech recognition (CSR) tasks, many of the pronunciation variations due to coarticulation occur across the boundaries of lexical entries. Many methods have been proposed to model cross-word pronunciation variations; they may be divided into those which infer the rules from a corpus of pronunciation data and those which start from phonologically pre-specified rules based on linguistic knowledge [2]. In general, speech variations generated by applying phonological rules are highly restricted to specific phonemic contexts. A possible drawback of the former approach is that it is very difficult to derive generalized information that can be applied to situations other than the one in question. Furthermore, the sparseness of a representative corpus prevents it from covering all the types of variation found in actual speech.

On the other hand, a possible drawback of the latter approach is that there may be a mismatch between the information found in the literature and the data to which it has to be applied. Many techniques, such as decision trees [3] and cross-word triphone modeling [4], have been developed to handle cross-word phenomena. In English CSR, a set of multi-words provides a solution that satisfies both the ease of modeling at the lexical level and the need to model cross-word variations. A multi-word is a sequence of words joined together to create a new entry in the lexicon. For similar reasons, pronunciation-dependent pseudo-morphemes obtained by concatenating morphemes are widely used for Korean CSR [3]. This approach is generally applied to a limited number of words; otherwise, the rate of out-of-vocabulary (OOV) words will increase in recognizers with the same vocabulary size.

In this paper, we suggest a novel method to refine pronunciation variants according to cross-morpheme phonetic information and to regulate the connections between lexicon entries. In this approach, we can distinguish between the canonical transcriptions of words and their variants due to cross-word coarticulation. We also show that the proposed method effectively excludes improper connections between lexicon entries, and thus a remarkable error reduction can be achieved.

This paper is organized as follows. In Section 2, we describe the Korean LVCSR system. Section 3 describes the proposed method for designing the pronunciation dictionary and regulating lexicon connections. In Section 4, we show the experimental results. Finally, in Section 5, conclusions are drawn.

2. Korean LVCSR System

2.1. Lexical Modeling

In the Korean language, a morpheme is the smallest unit with semantic meaning. The spacing in Korean written text is done at the word-phrase (eojeol) level, where an eojeol results from combining content and function morphemes. Since it is impractical to use all combinations of morphemes as lexical entries of the pronunciation dictionary, most Korean LVCSR systems choose morphemes as the basic units for lexical and language models. When a morphological analyzer extracts morphemes from an eojeol, the pronunciations of the extracted morphemes often differ from their pronunciations within the eojeol. For this reason, sound-based pseudo-morphemes obtained by concatenating morphemes are used for Korean LVCSR [3]. A pseudo-morpheme is characterized by the fact that its pronunciation is preserved. The 11,000 most frequently occurring pseudo-morphemes in political articles were selected as basic lexical units. The pronunciation dictionary includes two kinds of lexical entries: canonical transcriptions for the above 11,000 pseudo-morphemes and their pronunciation variants generated by applying phonological rules across word boundaries.

2.2. Acoustic Modeling

As training data, 600 speakers were asked to read 60,000 Korean sentences in a quiet office environment. Speech signals were sampled at 16 kHz to produce 16-bit data and segmented into 25 ms frames, with each frame advancing every 10 ms. Each frame was parameterized by a 26-dimensional feature vector consisting of 12 Mel-frequency cepstral coefficients (MFCCs) and energy, together with their differential coefficients. We used 44 base phones and trained 4,016 context-dependent models. The hidden Markov model topology used for all subword models except silence and short pause was a 3-state left-to-right model without skip transitions. For the observation probabilities, we used phonetically-tied mixtures (PTMs).

2.3. Language Modeling

A large text corpus is critical for obtaining good language models; the same corpus is also used to generate the pronunciation lexicon. We used political articles from seven years, 1996 to 2002. We converted the eojeol-based text database into a pseudo-morpheme-based database using a modified Korean POS tagging system. We trained a word-class-based trigram language model with bigram and trigram cutoff values of 1. Part-of-speech (POS) bigram language models both within eojeols and across eojeol boundaries were used to convert a recognized pseudo-morpheme sequence into an eojeol sequence.

2.4. Decoding Algorithm

The decoder is composed of two stages. The first pass uses internal-word triphone models with a trigram language model to generate lattices of word hypotheses. Both the language model probability and the lexical connection information are applied at every cross-lexicon transition. In the second-pass search, we use cross-word triphone models and a trigram language model to produce N-best hypotheses. During this rescoring process, language-dependent knowledge, which contains morphological and pronunciation rules, is applied to exclude invalid hypotheses. Finally, we use POS bigram probabilities to convert the recognized pseudo-morpheme sequence into an eojeol sequence.
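For illustration, the following Python sketch reproduces a frontend of the shape described in Section 2.2 (25 ms window, 10 ms shift, 12 MFCCs plus an energy term with their deltas, 26 dimensions in total). The use of librosa and of the 0th cepstral coefficient as a stand-in for the energy term are assumptions made here for the sketch; the paper does not specify its feature extraction toolkit.

```python
# A minimal sketch of a 26-dimensional MFCC frontend (assumption: librosa-based;
# the 0th cepstral coefficient stands in for the energy term).
import numpy as np
import librosa

def extract_features(wav_path):
    y, sr = librosa.load(wav_path, sr=16000)        # 16 kHz speech
    n_fft = int(0.025 * sr)                         # 25 ms analysis window (400 samples)
    hop_length = int(0.010 * sr)                    # 10 ms frame advance (160 samples)
    # 13 static coefficients: 12 MFCCs plus an energy-like c0 term
    static = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13,
                                  n_fft=n_fft, hop_length=hop_length)
    delta = librosa.feature.delta(static)           # differential (delta) coefficients
    return np.vstack([static, delta]).T             # shape: (num_frames, 26)
```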

3. Regulation of Cross-morpheme Lexical Connections

3.1. Cross-morpheme Pronunciation Variations

One characteristic of the Korean language is that pronunciation changes occur between two morphemes according to phonological rules. This is most salient between two consecutive consonants. A phonemic context at a word boundary is defined as an ordered pair of two adjacent consonants: the final consonant of the last syllable in a word and the first consonant of the first syllable in the next word. Pronunciation variations across word boundaries are influenced by morphological information as well as by the phonemic context. Lee et al. have argued that cross-morpheme pronunciation variations need to be modeled separately from within-morpheme variations [5].

By analyzing phonological variations that are frequently found in spoken Korean, we derived 816 pronunciation rules based on both phonemic contexts and morphological information. Prior experiments showed that, given context-dependent multiple lexicons, modeling cross-morpheme pronunciation variation contributes a slight improvement to CSR performance. In a similar manner, we also use both phonemic contexts and morphological information to reflect pronunciation variations across word boundaries.

In order to generate multiple lexical entries for each vocabulary word, we identify the morpheme boundaries to which phonological rules can be applied. According to [6], coarticulation between consonants occurs only within the accentual phrase domain. In other words, a phonological change does not occur between subsequent words when a speaker pauses between them. Accentual phrases arise in the following cases: (1) when a morpheme boundary lies within an eojeol, (2) when a morpheme boundary lies within a compound noun with a short phone sequence, and (3) when a class of nouns immediately follows a determiner at an eojeol boundary.

In order to encode cross-morpheme pronunciation variations into the lexicon, we apply the following two rules to all adjacent morphemes in the morpheme-based text corpus. First, we check whether each pair of adjacent morphemes lies within an accentual phrase region. If they lie outside the accentual phrase region, their lexical entries at the boundary remain their canonical transcriptions, and the silence or short-pause model is appended to the end of the preceding morpheme's entry. Second, only adjacent morphemes within the accentual phrase region are considered. Adjacent consonants at the morpheme boundary are converted into their surface representation according to a grapheme-to-phoneme conversion table. The phonemic variations in the table depend on both the phonemic context and the morphological category: the table checks the morphological categories of the adjacent morphemes to decide whether they generate a non-standard phonetic transcription for the corresponding phonemic context. The selected phonological rule is then applied to the phonemic context to generate the final phonemic transcription.
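As a hedged illustration of the two rules above, the following Python sketch walks over a pair of adjacent morphemes. The Morpheme container, the simplified accentual-phrase test, and the single grapheme-to-phoneme entry are assumptions made for the sketch, not the paper's implementation.

```python
# Sketch of the two cross-morpheme rules from Section 3.1 (assumed data structures).
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class Morpheme:                       # hypothetical container, not from the paper
    phones: List[str]                 # canonical phone sequence
    pos: str                          # part-of-speech tag
    eojeol_id: int                    # index of the eojeol it belongs to

# Illustrative grapheme-to-phoneme entry: a final unreleased k before m
# surfaces as the logical phone NX(1) followed by M.
G2P_TABLE = {("KQ", "M", ("noun", "particle")): ("NX(1)", "M")}

def in_accentual_phrase(left: Morpheme, right: Morpheme) -> bool:
    # Simplified stand-in for the three accentual-phrase conditions in the text.
    return left.eojeol_id == right.eojeol_id

def boundary_pronunciation(left: Morpheme, right: Morpheme) -> Tuple[List[str], List[str]]:
    if not in_accentual_phrase(left, right):
        # Rule 1: outside an accentual phrase, keep the canonical forms and append
        # a short-pause model to the preceding morpheme.
        return left.phones + ["sp"], right.phones
    # Rule 2: inside an accentual phrase, rewrite the boundary consonants according
    # to the phonemic context and the morphological categories of the two morphemes.
    key = (left.phones[-1], right.phones[0], (left.pos, right.pos))
    if key in G2P_TABLE:
        l_surface, r_surface = G2P_TABLE[key]
        return left.phones[:-1] + [l_surface], [r_surface] + right.phones[1:]
    return left.phones, right.phones
```

Applied to "dae-hak" followed by the particle "man", such a rule would rewrite the assumed canonical final KQ into the logical phone NX(1), while a pair outside an accentual phrase would keep its canonical forms with a short pause appended.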

3.2. Regulating Cross-morpheme Lexical Connections

Improving CSR performance requires not only a pronunciation lexicon that accounts well for cross-morpheme variation, but also regulation of the connections between the lexical entries of adjacent words. In this paper, we focus on the method of regulating cross-morpheme lexical connections. In the baseline system, we utilize a base-phone connection table to exclude improper connections between lexical entries, i.e., connections that are unlikely to occur in actual utterances. The table holds a binary representation indicating whether the last phone of the preceding lexical entry may be connected to the first phone of the following entry. In addition, the degree of connection can be estimated by probability values.

Figure 1 shows an example of the possible lexical connections at the word boundaries of the surface representations /dae-hak+man/, which means "only university", and /dae-hak+i/, which means "university is". For the word "dae-hak", there are twelve possible lexical entries. However, each lexical entry of "dae-hak" cannot be connected to all the lexical entries of the next word; only seven connections are allowed as correct ones (solid lines). The entry /T EH HH AA NX/ is shared by the two words "dae-hak" and "dae-hang", which means "opposition". Therefore, their acoustic similarity keeps them from being distinguished when the subject particle "i" follows. The reason why this improper connection cannot be excluded is that, with the conventional approach, it is almost impossible to distinguish between the canonical transcription and the lexicon variants changed by phonological rules.

Figure 1: Possible connections between conventional lexicon entries

Figure 2: Possible connections between the proposed lexicon entries

In this paper, as a remedy for the above-mentioned problem, a new representation in the pronunciation dictionary is introduced. The phonetic transcription /T EH HH AA NX/ of the word "dae-hak" is realized only when specific phonemic contexts are encountered. Figure 2 depicts the alternative surface representations of the phonetic transcriptions. Phones with a parenthesized number indicate lexicon variants governed by specific phonemic contexts. In addition, Table 1 gives examples of phonetic change rules; L3 stands for the last phone of a syllable and R1 for the first phone of the following syllable. Logical phones with a parenthesized number have the same acoustic properties as their corresponding base phones. Because pronunciation variants are restricted to specific phonemic contexts, we can effectively exclude improper connections once we distinguish pronunciation variants from canonical transcriptions. Figure 2 shows that three improper connections between lexicon entries are excluded compared to Figure 1.
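The distinction between base phones and logical (variant) phones can be pictured as in the sketch below. The helper names and the lexicon layout are illustrative assumptions; the key point is that the parenthesized variant index is stripped when acoustic models are looked up, since logical phones share the acoustic properties of their base phones.

```python
# Illustrative handling of logical phones such as NX(1) (assumed names).
def base_phone(phone: str) -> str:
    # NX(1) shares the acoustic model of NX: strip the parenthesized variant index.
    return phone.split("(")[0]

def is_variant(phone: str) -> bool:
    # Variant phones are realized only in specific phonemic contexts.
    return "(" in phone

# Two of the lexicon entries for "dae-hak": the canonical transcription and a
# context-ruled variant whose final phone is the logical phone NX(1).
LEXICON = {
    "dae-hak": [
        ["T", "EH", "HH", "AA", "KQ"],       # canonical form (final KQ assumed here)
        ["T", "EH", "HH", "AA", "NX(1)"],    # variant realized before a nasal context
    ],
}

assert base_phone("NX(1)") == "NX" and is_variant("NX(1)")
```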

Two improper connections in Figure 2 can also be excluded using morphological knowledge. For this purpose, pronunciation variations at morpheme boundaries are divided into three categories: (1) those always governed by phonological rules, (2) those always followed by a short or long pause without phonological changes, and (3) those to which both of the former cases can apply. We determine which kind of pronunciation variation applies based on the POS tags at the morpheme boundary. Based on linguistic and phonetic knowledge and on statistics derived from the labeled speech corpus, we estimate the correct category for every possible POS pair. In the Korean CSR task, the ratios of the three categories are 22%, 71%, and 7%, respectively. Phone connection tables are also estimated separately for each category.
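A minimal sketch of this categorization step follows, assuming a lookup keyed by POS-tag pairs. The category labels, the example entries, and the fallback behavior are illustrative assumptions, not values taken from the paper.

```python
# Sketch: map the POS pair at a morpheme boundary to one of the three categories,
# each of which owns its own phone connection table (entries are illustrative).
ALWAYS_RULED, ALWAYS_PAUSED, EITHER = 1, 2, 3   # reported ratios: 22%, 71%, 7%

POS_PAIR_CATEGORY = {
    ("noun", "particle"): ALWAYS_RULED,   # e.g. "dae-hak" + "i" (illustrative)
    ("noun", "noun"): EITHER,
    ("verb_ending", "noun"): ALWAYS_PAUSED,
}

def boundary_category(left_pos: str, right_pos: str) -> int:
    # Unseen POS pairs fall back to the pause category in this sketch.
    return POS_PAIR_CATEGORY.get((left_pos, right_pos), ALWAYS_PAUSED)
```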

Table 1: Examples of phonetic change rules

                     Phonemic context    Phonetic transcription
                     L3       R1         L3        R1
  Conventional       k        i          K         IY
                     k        m          NX        M
  Proposed method    k        i          K         IY
                     k        m          NX(1)     M

Table 2: Examples of phone connection tables at morpheme boundaries
(rows: R1, columns: L3; O = connection allowed, X = connection excluded)

  Conventional
          KQ   K    KH   NX   N    sil
  IY      O    O    O    O    O    O
  M       O    X    X    O    X    O

  Proposed method (no pause between words)
          KQ   K    KH   KH(1)  NX   NX(1)  N    N(1)  sil
  IY      X    O    X    X      O    X      O    X     X
  IY(1)   X    X    X    O      X    X      X    X     X
  M       X    X    X    X      X    O      X    X     X

Table 2 shows examples of the phone connection tables for category (1), which is always governed by phonological rules, in comparison with the conventional table. During decoding, the recognizer determines which phone connection table is used at each cross-morpheme transition, and the selected table governs the possibility of connection between lexicon entries. This method enhances the ability to regulate cross-morpheme lexicon connections.
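The decoding-time check can then be a simple lookup on the boundary phones. The sketch below assumes each category's table is a boolean mapping keyed by an (L3, R1) pair; the category-2 entries are made up for illustration, while the category-1 entries follow the spirit of Table 2.

```python
# Sketch of the decoding-time connection test with category-specific tables
# (assumed layout; True = connection allowed, False = connection pruned).
CONNECTION_TABLES = {
    1: {("NX(1)", "M"): True, ("KQ", "M"): False, ("K", "IY"): True},   # always ruled
    2: {("KQ", "M"): True, ("KQ", "IY"): False},                        # always paused
    3: {("NX(1)", "M"): True, ("KQ", "M"): True},                       # either case
}

def connection_allowed(category: int, left_phones, right_phones) -> bool:
    table = CONNECTION_TABLES[category]
    key = (left_phones[-1], right_phones[0])   # last phone of left entry, first of right
    return table.get(key, False)               # unknown phone pairs are pruned
```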

4. Experimental Results and Discussion

In this section, we present the results of several recognition experiments that show the effectiveness of the proposed paradigm, which aims at both rescoring with cross-morpheme triphone models and eliminating improper connections between lexical entries. The test set consists of 480 utterances without OOV words from 80 speakers. Table 3 shows the recognition performance under several conditions. In the baseline system, internal-word triphone models are applied, excluding short words of two or fewer phones, and the pronunciation lexicon consists of 44 base phones. In addition, connections between lexical entries are restricted according to a global base-phone connection table. However, with this table alone it is not possible to distinguish between the transcriptions of canonical pronunciations and those of variants. As a result, the baseline system suffers from a large number of errors caused by improper lexicon connections displacing the correct lexical path. The word and sentence error rates of the baseline system are 10.7% and 54.0%, respectively.

First, we performed an experiment to demonstrate the effectiveness of second-pass rescoring with cross-morpheme triphone models (CW-TRI), which reduces the word error rate from 10.7% to 9.0%. This result shows, however, that rescoring alone cannot relieve the errors caused by improper lexicon connections. Additionally, we performed an experiment applying the alternative pronunciation dictionary with cross-morpheme triphones to distinguish between a canonical lexicon entry and its variants (CW-TRI + SURFACE). In this case, a single logical-phone connection table already excludes some improper connections between lexical entries; the word and sentence error rates are 7.8% and 42.9%, respectively. Finally, we conducted an experiment using the three phone connection tables (CW-TRI + SURFACE + MULTI-TABLE). Based on the POS tags of two adjacent morphemes, we determine which phone connection table is used across the morpheme boundary, and the selected table governs the possibility of connection between lexicon entries. As shown in the results, CW-TRI + SURFACE + MULTI-TABLE regulates improper connections between lexicon entries more tightly and achieves a word error rate of 6.5%.

In short, we introduced logical representations of the pronunciation lexicon to effectively exclude improper connections between lexical entries across word boundaries, and we proposed a novel method to decide on the application of phonological rules based on the POS information of two adjacent morphemes. This method enhances the ability to regulate cross-morpheme lexicon transitions. With the proposed method, we achieved a relative word error reduction of up to 27% compared to the recognizer with cross-morpheme triphones and conventional lexical models.

Table 3: Recognition error rates with regulated cross-morpheme connections between lexical entries

  Case                               Word error rate (%)   Sentence error rate (%)
  Baseline                           10.7                  54.0
  CW-TRI                             9.0                   48.3
  CW-TRI + SURFACE                   7.8                   42.9
  CW-TRI + SURFACE + MULTI-TABLE     6.5                   40.0
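The relative reductions quoted in Sections 4 and 5 follow from Table 3; the short computation below checks them against the CW-TRI condition (cross-morpheme triphones with conventional lexicons).

```python
# Relative error reductions of CW-TRI + SURFACE + MULTI-TABLE over CW-TRI (Table 3).
wer_cwtri, wer_full = 9.0, 6.5
ser_cwtri, ser_full = 48.3, 40.0

rel_wer = 100 * (wer_cwtri - wer_full) / wer_cwtri   # about 27.8% (quoted as up to 27%)
rel_ser = 100 * (ser_cwtri - ser_full) / ser_cwtri   # about 17.2% (quoted as 17%)
print(f"relative WER reduction: {rel_wer:.1f}%")
print(f"relative SER reduction: {rel_ser:.1f}%")
```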

5. Conclusion

We have presented a novel method for regulating improper connections between lexical entries in a morpheme-based pronunciation dictionary. The method is based on using both POS identifiers and cross-morpheme phone connection tables, which describe the connections between lexicon entries in more detail. We have evaluated the proposed method on 480 Korean utterances from 80 speakers and demonstrated that it limits improper connections between lexicon entries more effectively. Experimental results show that the combined use of multiple phone connection tables and adjacent POS tags achieves relative reductions of 27% in word error rate and 17% in sentence error rate compared to using only the cross-word triphone model.

References

[1] M. Wester, "Pronunciation modeling for ASR: knowledge-based and data-derived methods," Computer Speech and Language, 17, pp. 69-85, 2002.
[2] H. Strik and C. Cucchiarini, "Modeling pronunciation variation for ASR: a survey of the literature," Speech Communication, 29(2-4), pp. 225-246, 1999.
[3] O.-W. Kwon, K. Hwang, and J. Park, "Korean large vocabulary continuous speech recognition using pseudo-morpheme units," Eurospeech '99, Sep. 1999.
[4] H.-W. Hon and K.-F. Lee, "Recent progress in robust vocabulary-independent speech recognition," Proceedings of the DARPA Speech and Natural Language Processing Workshop, pp. 258-263, 1991.
[5] K.-N. Lee and M. Chung, "Modeling cross-morpheme pronunciation variations for Korean large vocabulary continuous speech recognition," Eurospeech 2003, pp. 261-264, 2003.
[6] S.-A. Jun, "The accentual phrase in the Korean prosodic hierarchy," Phonology, 15(2), pp. 189-226, 1998.