AN EMBEDDED MULTILINGUAL SPEECH RECOGNITION SYSTEM FOR MANDARIN, CANTONESE, AND ENGLISH

Xia Wang, Yang Cao, Feng Ding, Yuezhong Tang
Nokia Research Center, Audio-Visual Systems Laboratory, Beijing
{xia.s.wang, yang.1.cao, feng.f.ding, yuezhong.tang}@nokia.com
ABSTRACT

In this paper, we propose a small-footprint, speaker-independent, multilingual system for isolated word recognition of Mandarin, Cantonese, and English. The baseline system achieved very promising results without any phonemes shared between languages. By sharing phonemes, the memory and computational complexity were reduced by about 40%. Non-native, accented speech recognition and support for mixed-language words are the distinguishing features of our system. Automatic language identification (LID) is one of the key elements in language-independent automatic speech recognition (ASR) systems, so LID performance is analyzed in addition to the engine performance of the proposed system. Supervised Bayesian on-line adaptation proved effective in compensating for accent mismatch and environment mismatch, as well as for the modeling inaccuracy introduced by combined training.

Keywords: Embedded multilingual speech recognition, Non-native speech recognition, Automatic language identification
1. INTRODUCTION

Today, automatic speech recognition is already an attractive feature of many mobile terminals, providing users with a new experience and a friendly interface. Such devices impose distinct requirements on applications: they must run within limited memory and processing power. Under conditions of high mobility and internationalization, multilinguality is also one of the main expectations for speech recognition applications on mobile devices. They should support several languages and be able to cope with non-native speakers, dialects, and accents.

Existing speaker-trained technology inherently supports various languages, as all users train the recognition system to match their language and pronunciation characteristics. However, speaker-trained technology has some disadvantages. For a speaker-dependent name recognizer, the user has to train the system. In mobile phones equipped with a speaker-dependent speech recognition system, the user can typically dial between 10 and 30 names by voice. As the vocabulary size increases, it becomes more difficult for the user to remember what was said during training. At the same time, a large vocabulary makes the technology impractical. To improve the usability of products and make speech recognition more attractive, multilingual speaker-independent speech recognition is needed. The typical usage scenarios for an ASR system in embedded devices are:

1. Phonebook lookup: to enable the user to search the phonebook by voice.
2. Command and control: to enable the user to control the device by voice, with commands like 'send message', 'call', 'pick up call', 'calendar', etc.

Speech recognition technology is in even greater demand in China because of the input barrier posed by Chinese characters. Although multilingual speech recognition has been investigated for several years, Chinese has not been considered as much as Western languages because of the peculiarities of the language [1][2][3]. In our proposed Mandarin-Cantonese-English trilingual system, more effort was put into Chinese and Chinese dialects, building on the language-independent speech recognition architecture we proposed in [4]. Although most Chinese can speak Putonghua to some degree, they speak it with accents. In fact, except for people who live in metropolitan areas, most Chinese still use their own dialect in daily life.
This creates an interesting phenomenon of bilingualism in China. English is also becoming more and more important in China as a consequence of globalization. Therefore, Putonghua and English have the highest priority for a multilingual system to be used in China. Cantonese comes next because of the economic importance of Guangdong province and Hong Kong. Therefore, in this paper, we decided to investigate Cantonese after the bilingual Putonghua (herein referred to as Mandarin) and English system had become mature. English words were also included in the Mandarin and Cantonese testing databases to see how the system works for non-native speech.
2. MULTILINGUALITY IN CHINA

Spoken Chinese comprises many regional variants, called dialects. Although they employ a common written form, they are mutually unintelligible, and for this reason controversy exists over whether they can legitimately be called dialects or whether they should be classified as separate languages. Generally, however, the variants of Chinese are referred to as dialects. Most Chinese speak one of the Mandarin dialects, which are largely mutually intelligible. The dialect spoken in Beijing forms the basis of the standard Mandarin dialect. Chinese also has six other dialect groups, all spoken in China's southeastern provinces: Yue, Xiang, Min, Gan, Wu, and Hakka. The Yue dialects, also called Cantonese, are spoken in Hong Kong, most of Guangdong, southern Guangxi, parts of Hainan, and in many overseas settlements. In this paper, we focus on Cantonese and Putonghua (referred to as Mandarin in this paper), which are the most economically and politically important dialects.

People from different dialectal areas might not be able to communicate with each other simply because they do not speak the same language. Mandarin, or Putonghua, is a good choice as a shared basis, since most people in China speak both their native dialect and Mandarin. Although many people speak Mandarin, they have different accents depending on their native dialect. Some people have problems with tones, and some cannot distinguish between certain pairs of sounds. Well-educated people in China also speak English or other foreign languages, with an accent to some
extent. On the other hand, more and more foreigners in China are learning to speak Mandarin or Cantonese, which adds another dimension to the multilingual Chinese world. By multilinguality, we mean both the case where one user speaks several languages, including mixing two or more languages in one utterance, and the case where different users speak different languages, native and non-native. To cope with the multilingual environment in China, ASR has to be multilingual. Mandarin, Cantonese, and English are the leading languages in our portfolio. The typical multilingual scenarios are:

1. In metropolitan areas like Beijing, Putonghua or standard Mandarin is the daily language for most citizens. They will use Mandarin in daily communication with others, as well as with digital hand-held devices if needed.
2. In cities like Guangzhou, both Cantonese and Mandarin are widely used.
3. In Hong Kong, Cantonese, Mandarin, and English are used in a mixed way. People have Chinese last names and English first names. Even when two Chinese people talk with each other, Chinese (Mandarin or Cantonese) is used most of the time, but English words occur now and then during a conversation.
4. People working in international companies in mainland China are in a similar situation to those in Hong Kong, except that Cantonese is used less. Names like Julia Ma, Jane Zhang, and Leslie Tan are very common, and speaking mixed English and Chinese is also very common.
5. Foreigners working in China will use English, and sometimes Chinese.
3. MULTILINGUAL ASR ARCHITECTURE

Figure 1 illustrates the architecture for multilingual ASR systems proposed in [4]. The multilingual ASR engine consists of three key units: automatic language identification, on-line pronunciation modeling, and multilingual acoustic modeling. The assumption is that vocabulary items are given in textual form. First, the Language Identification (LID) module detects the language of
the vocabulary item. Once this has been determined, an appropriate on-line pronunciation modeling scheme is applied to get the phoneme sequence associated with the written form of the vocabulary item. Finally, the recognition model for each vocabulary item is constructed by concatenating the multilingual acoustic models. With these basic modules, the recognizer can automatically cope with multilingual vocabulary items without the user's assistance.

[Figure 1: Multilingual ASR Architecture. A vocabulary entry in written form, e.g. "Jane Zhang", passes through the Language Identification module (Jane: English, Zhang: Mandarin), the Pronunciation Modeling module, which produces the phoneme sequence /dZ/ /eI/ /n/ /ts`/ /A/ /N/, and the Acoustic Modeling module, which builds the acoustic model for "Jane Zhang". Each module covers English, Mandarin, and Cantonese.]

Significant efforts have been put into the embedded multilingual system to reduce the memory and computational complexity. In acoustic modeling, the number of models decreased dramatically by sharing phonemes between languages. Feature component masking, variable-rate partial likelihood update, and density pruning all result in significant savings in the decoding complexity with marginal impact on the recognition performance [5].
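To make the data flow of Figure 1 concrete, the following Python sketch traces a single vocabulary entry through the three modules. It is a minimal illustration under our own assumptions; all function names, the toy lookup tables, and the HMM placeholders are hypothetical and not part of the actual engine.

```python
# Minimal sketch of the Figure 1 pipeline for one vocabulary entry.
# All names and data below are illustrative placeholders, not the real engine API.

def identify_language(token):
    """Return candidate language IDs for one written token (top candidate first)."""
    if any('\u4e00' <= ch <= '\u9fff' for ch in token):
        # A Chinese character can be either Mandarin or Cantonese.
        return ['Mandarin', 'Cantonese']
    # Roman-letter tokens are classified by a neural network in the paper;
    # a toy lookup stands in for it here.
    toy_nn = {'Jane': ['English'], 'Zhang': ['Mandarin']}
    return toy_nn.get(token, ['English'])

def text_to_phonemes(token, language):
    """On-line pronunciation modeling: token + language ID -> phoneme sequence."""
    toy_t2p = {('Jane', 'English'): ['dZ', 'eI', 'n'],
               ('Zhang', 'Mandarin'): ['ts`', 'A', 'N']}
    return toy_t2p[(token, language)]

def build_word_model(phonemes, monophone_hmms):
    """Concatenate multilingual monophone HMMs into the recognition model for a word."""
    return [monophone_hmms[p] for p in phonemes]

monophone_hmms = {p: 'HMM(' + p + ')' for p in ['dZ', 'eI', 'n', 'ts`', 'A', 'N']}

entry = 'Jane Zhang'
model = []
for token in entry.split():
    lang = identify_language(token)[0]          # take the top LID candidate
    model += build_word_model(text_to_phonemes(token, lang), monophone_hmms)

print(model)   # models for /dZ/ /eI/ /n/ /ts`/ /A/ /N/, as in Figure 1
```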
3.1. Language Identification

The task of the Language Identification (LID) module is to identify the language of each vocabulary item from its text [6]. It is worth clarifying that our LID operates on the textual form, which is quite different from traditional LID from speech. LID from speech identifies the language of the input speech using statistical models trained on a large speech corpus, while LID from text decides the language of the input text using statistical models trained on a large text corpus. The LID decision is used to choose an appropriate text-to-phoneme mapping technique, or pronunciation model, for each vocabulary item, which is further used in the decoding phase. Since the result of the LID module is not always unambiguous, it is important to provide multiple results and pronunciations for certain vocabulary items.

The vocabulary entries can be Chinese characters, Roman letters, or a mixture of the two. For English, Pinyin (the Mandarin romanization scheme), and Jyutping (the Cantonese romanization scheme defined by the LSHK [7]), i.e. for all text forms in Roman letters, the LID is done by neural networks. For Chinese characters, the LID is done by coding-based schemes, i.e., from the Chinese character code range in Unicode. However, from the character itself we cannot tell whether it is Mandarin or Cantonese, because both are possible, so LID gives Mandarin and Cantonese the same probability when it sees a Chinese character. Therefore, two pronunciation variants are generated from the same Chinese entry, one for Mandarin and one for Cantonese. For mixed-language words, each part of the word has its own language ID. For example, in 'Jane Zhang', 'Jane' would most likely be identified as English and 'Zhang' as Mandarin; in 'Gwaan Cathleen', 'Gwaan' would most probably be identified as Cantonese and 'Cathleen' as English. The LID module also supports mixed Chinese character and Roman letter sequences. Of course, there are ambiguous words like 'long', which is valid in all three languages. In this case, all three language IDs are possible, and the pronunciation modeling module outputs pronunciations in all three languages.
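A rough sketch of the coding-based decision and of how per-part language IDs combine for a mixed-language entry is given below. The helper names are hypothetical, and the Roman-letter branch is only a stand-in for the neural-network classifier described above.

```python
from itertools import product

def candidate_languages(part):
    """Coding-based LID for Chinese characters; a placeholder lookup for Roman letters."""
    if any('\u4e00' <= ch <= '\u9fff' for ch in part):
        # From the character code alone, Mandarin and Cantonese are equally likely.
        return ['Mandarin', 'Cantonese']
    # In the real system a neural network scores Roman-letter strings;
    # 'long' is deliberately ambiguous across all three languages.
    toy_scores = {'Jane': ['English'], 'Zhang': ['Mandarin'],
                  'long': ['Mandarin', 'Cantonese', 'English']}
    return toy_scores.get(part, ['English'])

def entry_language_ids(entry):
    """All combinations of per-part language IDs for a (possibly mixed-language) entry."""
    return list(product(*(candidate_languages(p) for p in entry.split())))

print(entry_language_ids('Jane Zhang'))   # [('English', 'Mandarin')]
print(entry_language_ids('long'))         # three candidates, one per possible language
```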
3.2. Automatic Pronunciation Modeling

On-line pronunciation modeling, i.e. Text-to-Phoneme (T2P) mapping, is an obligatory feature in embedded systems with dynamic vocabularies, where it is not feasible to store large pronunciation dictionaries for several languages [8]. If the pronunciation of a language is very regular, e.g. in Finnish, Japanese, or Mandarin Chinese, the T2P mapping module is very compact, as it can be realized with a finite set of rules. There are, however, many languages, English being the best example, whose pronunciation cannot be accurately expressed by a rule set. To obtain high-performance T2P mapping for this type of unstructured language, it is necessary to have large text resources. Decision trees have successfully been used to compress large pronunciation dictionaries [9][10]. The T2P irregularity of the language determines the size and accuracy of the decision-tree-based pronunciation model. If the number of T2P exceptions is small, the decision trees do not become very big; however, the size of the decision-tree-based T2P model increases rapidly if there are many pronunciation exceptions in the language. T2P mapping can also be implemented using neural networks [11], in which case the module becomes very compact.

Multi-pronunciation is an interesting phenomenon worth mentioning. There are words shared by different languages with the same spelling but different pronunciations. There are also words within one language that have multiple pronunciations, especially in grapheme-based languages like Chinese. The pronunciation modeling tries to find all possible pronunciations of an input textual entry; for example, 乐 can be pronounced as 'le4' or 'yue4' in Chinese, and without clear context one cannot decide which is correct.
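As an illustration of the decision-tree idea (not the actual models of [9][10]), a letter-context classifier can be trained with off-the-shelf tools. The tiny training set, the letter alignment, and the context-window size below are purely illustrative; real models are trained on large aligned lexica.

```python
# Toy decision-tree T2P: predict each phoneme from a window of surrounding letters.
from sklearn.tree import DecisionTreeClassifier

window = 1  # letters of context on each side of the current letter
pairs = [('jane', ['dZ', 'eI', 'n', '_']),     # '_' marks a silent letter
         ('zhang', ['ts`', '_', 'A', 'N', '_'])]

def contexts(word):
    padded = '#' * window + word + '#' * window
    return [list(padded[i:i + 2 * window + 1]) for i in range(len(word))]

X = [ctx for word, _ in pairs for ctx in contexts(word)]
y = [ph for _, phones in pairs for ph in phones]

# Encode letters as integers so the tree can split on them.
alphabet = sorted({ch for ctx in X for ch in ctx})
encode = lambda ctx: [alphabet.index(ch) for ch in ctx]

tree = DecisionTreeClassifier().fit([encode(ctx) for ctx in X], y)
print([p for p in tree.predict([encode(c) for c in contexts('jane')]) if p != '_'])
```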
3.3. Multilingual Acoustic Modeling

The performance of any ASR system is highly dependent on the quality of the acoustic models. When multiple languages must be supported within the restricted memory of an embedded system, compromises in modeling accuracy are unavoidable; memory is the main constraint in acoustic modeling. Therefore, some of the most widely used acoustic modeling schemes, such as context-dependent modeling, are not practical in embedded systems due to their large memory requirements. Language-dependent acoustic models are also problematic, particularly if we need to support several languages at the same time. To keep the number of acoustic models reasonable, the monophone is selected here as the basic modeling unit. The monophone models are further shared across different languages, and the parameters of the continuous-density monophone HMMs are trained on multilingual speech corpora to obtain the lowest possible number of models.

We chose the International Phonetic Alphabet (IPA) [12] to define the phoneme inventory for the multilingual ASR engine, for languages included in the IPA handbook. For languages not described in the handbook, such as Mandarin Chinese, we defined the phoneme subset based both on the global set and on the SAMPA-C definitions made by other speech recognition researchers [13]. Some language-specific modifications have nevertheless been made to the phoneme set, either to further reduce the number of models or to increase the modeling accuracy. For the Mandarin-English-Cantonese system, English has 39 phonemes, Mandarin has 46 phonemes, and Cantonese has 41 phonemes. Considering that most Mandarin and Cantonese phonemes are common to both languages, and that English adds 21 more to the set, we end up with only 82 phonemes in total.

Acoustic model adaptation has been found by several researchers to be an efficient method for increasing the speaker-specific recognition rate [14]. Since multilingual acoustic models cannot characterize language-specific details as accurately as their monolingual counterparts, model adaptation is even more important in multilingual than in monolingual ASR systems. Supervised Bayesian on-line adaptation was applied to the system, and substantial improvements were achieved through adaptation in adverse environments. Speaker adaptation also helps considerably for non-native word recognition, e.g. English uttered by Chinese speakers, or U.S. English tested with models trained on U.K. English.
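Bayesian (MAP) adaptation in the style of [14] updates, among other parameters, the Gaussian mean vectors as a weighted combination of the prior mean and the adaptation data. The sketch below shows only a simplified mean update with hard frame assignment; the relevance factor and the data are illustrative, not the engine's actual adaptation routine.

```python
import numpy as np

def map_adapt_mean(prior_mean, frames, tau=10.0):
    """Simplified MAP update of one Gaussian mean (hard frame assignment assumed).

    prior_mean : speaker-independent mean vector
    frames     : adaptation frames assigned to this Gaussian, shape (N, dim)
    tau        : relevance factor controlling how strongly the prior is trusted
    """
    n = len(frames)
    if n == 0:
        return prior_mean                      # no data: keep the prior mean
    data_mean = np.mean(frames, axis=0)
    return (tau * prior_mean + n * data_mean) / (tau + n)

# Example: a few adaptation frames pull the mean toward the speaker's data.
prior = np.zeros(3)
frames = np.array([[1.0, 0.5, -0.2], [0.8, 0.7, 0.0], [1.2, 0.3, -0.4]])
print(map_adapt_mean(prior, frames))           # lies between the prior and the data mean
```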
4. EXPERIMENTAL RESULTS

4.1. Training Databases

The multilingual Mandarin-English-Cantonese speaker-independent ASR system requires large training corpora for Mandarin, English, and Cantonese with sufficient phoneme coverage and speaker variability. The Mandarin database is the Mandarin SpeeCon database, collected by Nokia under the European Union SpeeCon project [15] and covering four major accent areas: Beijing, north-eastern, south-western, and eastern. The Cantonese database comes from the Chinese University of Hong Kong. The English database is the British English Wall Street Journal database.
4.2. Testing Databases

All testing databases are our in-house databases, collected in a quiet office and consisting of speech from 10 female and 10 male speakers. The vocabulary of the Mandarin database contains 100 Mandarin names and 25 English names. The Cantonese database consists of 100 native Cantonese names and 25 English names. The British English vocabulary size is 103. In order to simulate the real usage environment, noisy databases were generated by adding car, cafe, or music noise to the clean utterances. The signal-to-noise ratio of the noisy databases was uniformly distributed between 5 dB and 20 dB.
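The paper does not detail the mixing procedure; a standard way to generate such noisy data is sketched below, scaling a noise segment so that the resulting signal-to-noise ratio matches a value drawn uniformly from 5-20 dB. The dummy signals are placeholders for real speech and noise recordings.

```python
import numpy as np

def add_noise(speech, noise, snr_db):
    """Mix noise into speech at the requested SNR (in dB)."""
    noise = noise[:len(speech)]                          # assume the noise clip is long enough
    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(noise ** 2)
    scale = np.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10.0)))
    return speech + scale * noise

rng = np.random.default_rng(0)
speech = rng.standard_normal(16000)                      # 1 s of dummy "speech" at 16 kHz
noise = rng.standard_normal(16000)                       # dummy car/cafe/music noise
snr = rng.uniform(5, 20)                                 # SNR uniformly distributed in [5, 20] dB
noisy = add_noise(speech, noise, snr)
```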
4.3. Baseline System Performance

The baseline system was built on the three language-dependent phoneme sets, without phoneme sharing between languages. The number of models is therefore the sum of the three language-dependent systems, i.e. 126 models plus one silence model, which is rather large for an embedded system. By sharing phonemes among the three languages, the number of models was reduced to 82, including one silence model.

Language    Voc. Size   Clean (%)   Clean + Adapt. (%)   Noise (%)   Noise + Adapt. (%)
Mandarin    100         94.36       96.14                87.12       91.99
Cantonese   100         89.35       93.66                86.34       92.61
English     103         95.22       97.81                87.64       92.11
Average     --          92.98       95.87                87.03       92.24
Table 1: Native test performance
The native test results shown in Table 1 are promising. The results were close to those of the monolingual systems, with Cantonese slightly worse because the amount of Cantonese training data was smaller than for Mandarin and English. One may wonder how this baseline multilingual system differs from a simple integration of three monolingual systems, since the phonemes are still language-dependent. The difference is that all phonemes are language-specific except the silence: in the multilingual system there is only one silence model, trained on the databases of all three languages. The second column for each environment in Table 1 is the recognition rate with supervised on-line speaker adaptation. The Cantonese baseline performs the worst among the three, partially due to the accent mismatch between the training database (Hong Kong Cantonese) and the testing database (Guangzhou Cantonese), and partially due to the mismatch between continuous (training) and isolated (testing) speech. The Cantonese results drew our interest to accented speech recognition, and more results are presented later for a specially designed Chinese-accented English vocabulary. More Cantonese data containing mainland Cantonese accents and isolated words will be used for acoustic model training in the future to improve the Cantonese performance. Nevertheless, speaker adaptation improves the recognition performance substantially, especially for Cantonese (accent adaptation) and in noisy environments.
4.4. Non-native Test Performance

Another experiment was designed to find out how much non-native speech challenges the system, using the English words spoken by Chinese speakers in the China2002 and Cantonese2001 databases. The results are shown in Table 2. It is obvious from the table that, even with a very small vocabulary (25 words), the recognition rate for non-native speech decreases dramatically compared with the native-speech results. When speaker adaptation was applied, the error rate was reduced by 54% on average.

Native lang.   Clean (%)   Clean + Adapt. (%)   Noisy (%)   Noisy + Adapt. (%)
Mandarin       76.91       92.92                68.16       86.87
Cantonese      82.40       89.35                77.30       87.95
Table 2: Non-native English test performance

It is worth mentioning that no guidance was given to the speakers during recording on how to read these English words; the speakers simply said whatever came to mind when they saw the prompt. Therefore, the poor performance is a result of two factors: accent and mispronunciation.
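The 54% figure is the relative error-rate reduction averaged over the four conditions of Table 2; the short computation below reproduces it from the table's accuracies as a sanity check.

```python
# Accuracies (%) from Table 2: (before adaptation, after adaptation)
conditions = {'Mandarin clean': (76.91, 92.92), 'Mandarin noisy': (68.16, 86.87),
              'Cantonese clean': (82.40, 89.35), 'Cantonese noisy': (77.30, 87.95)}

reductions = [((100 - before) - (100 - after)) / (100 - before)
              for before, after in conditions.values()]
print(sum(reductions) / len(reductions))   # ~0.54, i.e. the 54% average error-rate reduction
```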
                       Cantonese (%)   Mandarin (%)
Clean   Before Adapt.  85.63           84.64
Clean   After Adapt.   89.37           93.88
Noisy   Before Adapt.  81.59           76.89
Noisy   After Adapt.   87.39           88.98
Table 3: Full-set testing performance

Given the non-native test results shown in Table 2, it is not surprising to see performance degradation for the full set relative to the native test results. The full set consists of 25 non-native English words and 100 native words in either Mandarin or Cantonese. Although applying speaker adaptation brought significant improvements, with an error rate reduction of 42%, there is still room for improvement compared with the native tests. It would be interesting to explore the characteristics of non-native speech in order to find a good way to compensate for them.
4.5. Compact Phoneme-set Performance

The baseline system works well, but its complexity is somewhat high for embedded devices. In fact, 27 phonemes can be shared between Mandarin and Cantonese, 16 between Mandarin and English, and 18 between English and Cantonese. The number of distinct phonemes is then 81, which saves a considerable amount of memory by removing the duplicate phonemes. Although we could reduce the phoneme set further by merging allophones, compromises would be required between performance and complexity.

Language    Compact (%)   Compact + Adapt. (%)
Mandarin    83.30         89.52
Cantonese   83.65         90.08
English     94.11         97.23
Average     87.02         92.28
Table 4: Compact phoneme-set performance

However, the performance degradation of the compact set is obvious compared with the baseline. English proved robust to the change of phoneme set, while Mandarin and Cantonese appear more susceptible because they have fewer language-dependent phonemes in the compact set. Taking a closer look at Mandarin, over 60% of its phonemes are shared with Cantonese; for Cantonese, even more are shared with Mandarin. Although the shared phonemes carry the same IPA symbol in both languages, the actual sounds are indeed different, which makes the shared models less accurate. More experiments will be carried out to find the optimal phoneme set for the multilingual system.
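For completeness, the reported pairwise overlaps are consistent with 81 distinct phonemes only if some phonemes are common to all three languages; the small computation below makes that implicit three-way overlap explicit. This is our inference from the reported counts, not a figure stated in the paper.

```python
# Per-language phoneme counts and pairwise overlaps as reported above.
mandarin, cantonese, english = 46, 41, 39
man_can, man_eng, eng_can = 27, 16, 18
distinct = 81

# Inclusion-exclusion: |M u C u E| = |M| + |C| + |E| - pairwise overlaps + three-way overlap
three_way = distinct - (mandarin + cantonese + english - man_can - man_eng - eng_can)
print(three_way)   # 16 phonemes would have to be shared by all three languages
```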
4.6. LID Test Performance

The results shown above concentrated on engine performance with the language ID given explicitly to the recognizer. This section focuses on LID. For monolingual words, the LID module performs well, with an accuracy of about 95% for Roman-letter input, as shown in Figure 2; the accuracy is 100% for Chinese character input. However, for mixed-language words, LID is more likely to make errors. Table 5 shows the recognition performance of the system on a 150-word vocabulary containing 100 Mandarin, 25 English, and 25 mixed-language words like 'Jane Zhang', spoken by 20 native Mandarin speakers, with the recognizer restricted to the top-N candidates given by the LID module. If each name could have only one language ID, i.e. without support for mixed-language words, LID would give at most three results per entry. In our system, however, each part of a name has its own language ID, so the LID results can contain different language-ID combinations for each entry.

[Figure 2: LID performance. LID accuracy (%) for Mandarin, English, and Cantonese on Roman-letter input, in the range of roughly 93% to 97%.]

LID N-best   Adaptation      Clean (%)   Noise (%)
2            Before Adapt.   77.74       69.34
2            After Adapt.    83.72       76.51
3            Before Adapt.   80.37       71.58
3            After Adapt.    86.06       78.35
Table 5: System performance with automatic LID on Mandarin + English + Mandarin-English words spoken by native Mandarin speakers

When the LID output was restricted to the top two candidates, the result was on a par with that for English words spoken by Mandarin speakers. The adaptation result, however, was not as good as for pure English words, because the pronunciations partly follow the Chinese way and partly the English way; the system also suffered from the mismatch and mispronunciation of the English part, as described in Section 4.4. With one more candidate in the LID output, obvious improvements were observed. However, increasing the LID N-best output is not always advisable, because it expands the vocabulary, and thus the search space, roughly linearly. In our experience, N = 3 turned out to be the best setting when both final performance and computational complexity are considered.
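The growth of the search space with N can be seen directly: each additional LID candidate contributes at most one more pronunciation variant per vocabulary entry. The sketch below counts the variants for a toy vocabulary; the entries and their candidate lists are illustrative only.

```python
def variant_count(vocabulary_lid_candidates, n_best):
    """Number of recognition entries when each item keeps its top-N LID candidates."""
    return sum(min(len(cands), n_best) for cands in vocabulary_lid_candidates.values())

# Toy vocabulary: per-entry LID candidates, ordered best first (illustrative values).
vocab = {'Jane Zhang': ['English+Mandarin', 'English+Cantonese', 'Mandarin+Mandarin'],
         'long':       ['Mandarin', 'Cantonese', 'English'],
         '张伟':        ['Mandarin', 'Cantonese']}

for n in (1, 2, 3):
    print(n, variant_count(vocab, n))   # the search space grows roughly linearly with N
```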
5. CONCLUSIONS AND OUTLOOK

In this paper, we proposed a speaker-independent multilingual speech recognition system for Mandarin, English, and Cantonese. The baseline system achieved promising results. The compact system with phoneme sharing showed some performance degradation, but with acoustic model adaptation the results became acceptable, while the complexity was reduced by about 40%. More experiments will be conducted, guided by phonetics and dialectology, to find an optimal phoneme set for the multilingual system. The non-native test was interesting because a huge mismatch was found between native and non-native speech; a contrastive study of the phonetic characteristics of non-native versus native speech would be a good research topic. Speaker adaptation is a good way to compensate for accent and environmental mismatches, with an error rate reduction of 40%. It also helps considerably with the performance degradation introduced by combined training. However, there is still much room for improvement compared with the native test results. Further study of non-native and accented speech will be carried out in the future; it will not be limited to English, as accented Mandarin and Cantonese are even more interesting because of the accent diversity in China. Support for mixed-language words is a distinguishing feature of our system, enabled by the automatic LID and on-line T2P modules.
6. REFERENCES

[1] Schultz T. and Waibel A., "Language-independent and Language-adaptive Acoustic Modeling for Speech Recognition", Speech Communication, Vol. 35(1-2), pp. 31-51, 2001.
[2] Uebler U., "Multilingual Speech Recognition in Seven Languages", Speech Communication, Vol. 35(1-2), pp. 53-69, 2001.
[3] Köhler J., "Multilingual Phone Models for Vocabulary-independent Speech Recognition Tasks", Speech Communication, Vol. 35(1-2), pp. 21-30, 2001.
[4] Viikki O., Kiss I., Tian J., "Speaker- and Language-Independent Speech Recognition in Mobile Communication Systems", Proc. of ICASSP 2001, USA, 2001.
[5] Kiss I., Vasilache M., "Low Complexity Techniques for Embedded ASR Systems", Proc. of ICSLP'02, Denver, U.S.A., 2002.
[6] Tian J., Häkkinen J., Riis S., Jensen K. J., "On Text-based Language Identification for Multilingual Speech Recognition System", Proc. of ICSLP'02, Denver, U.S.A., 2002.
[7] LSHK Jyutping home page: http://www.hku.hk/linguist/lshk/Jyutping/index.htm
[8] Tian J., Häkkinen J., Viikki O., "Multilingual Pronunciation Modeling for Improving Multilingual Speech Recognition", Proc. of ICSLP'02, Denver, U.S.A., 2002.
[9] Pagel V., Lenzo K., Black A. W., "Letter to Sound Rules for Accented Lexicon Compression", Proc. of ICSLP'98, Sydney, Australia, 1998.
[10] Suontausta J., Häkkinen J., "Decision Tree Based Text-to-Phoneme Mapping for Speech Recognition", Proc. of ICSLP'00, Beijing, China, 2000.
[11] Jensen K. J., Riis S., "Self-Organizing Letter Code-Book for Text-to-Phoneme Neural Network Model", Proc. of ICSLP'00, Beijing, China, 2000.
[12] The International Phonetic Association, Handbook of the International Phonetic Association (IPA), Cambridge University Press, Cambridge, U.K., 1999.
[13] SAMPA-C definition developed by the Chinese Academy of Social Sciences: www.cass.net.cn/s18_yys/yuyin/sampac/sampac.htm
[14] Gauvain J. L., Lee C.-H., "Maximum a Posteriori Estimation of Multivariate Gaussian Mixture Observations of Markov Chains", IEEE Transactions on Speech and Audio Processing, Vol. 2, No. 2, pp. 291-298, April 1994.
[15] SpeeCon home page: www.speecon.com