MANDARIN LARGE VOCABULARY SPEECH RECOGNITION. USING THE GLOBALPHONE DATABASE. J. Reichert, T. Schultz, and A. Waibel. Interactive ...
MANDARIN LARGE VOCABULARY SPEECHRECOGNITION USING THE GLOBALPHONE DATABASE J.Reichert,T.Schultz,andAWaibel . Interactive SystemsLaboratories University oKarlsruhe f (Germany),Carnegie MellonUniversit {juergen,tanja,waibel}@ira.uka.de
ABSTRACT Thispaperpresentsourrecenteffortsindevelopin speakerindependentLVCSRengineforMandarin ChineseusingourmultilingualdatabaseGlobalPhone Wedescribeatwopassapproach,inwhichthe recognition first generates Pinyin hypotheses and second transform these into Chinese character hypotheses.Weshowhowthisapproachcanreduce complexityandincreaseflexibility.Weevaluatean comparedifferentsystemsincludingdifferentbase for speech recognition aphoneme s unitsversussyll Furthermore we analyze the influence of tonal information.Ourcurrentlybestsystemshowsvery promising results achieving 15.0%character error
1
-
.
d units ables.
rate.
INTRODUCTION
Withthedistributionofspeechtechnologyproducts overtheworld,thefastandefficientportability targetlanguagesbecameapracticalconcern.OurJa RecognitionToolkit(JRTk)islanguageindependent wehavealreadyshownthattheunderlyingrecogniti methodsandtechniquescanbeappliedtoseveral languages [1].Tosetuparecognizerinanewlanguage theacousticmodels,thepronunciationdictionary a languagemodelhavetobetrainedoradapted.Inth paperwedescribeourworkinbootstrappingaChine LVCSRsystemfromamultilingualrecognizerengine. SincetheinputofChinesecharacterstothecomput averytimeconsumingprocessLVCSRsystemsfor Chineselanguagesareovf eryspecialinterest.In ofportabilitythreeaspectsdistinguishtheChines languagefrom other languages: -
ga
all tonew nus and on
eris
-
-
-
terms e -
TheChineseideographiccharactersdonotreflect thepronunciation ow af ord ModernChinesewrittentextlacksthesegmentation intowords InspokenMandarinthetonalinformationis necessary todistinguishmeanings
Chineseideographiccharactersdonotallowtogene automaticallythepronunciationforanunknownChin character.ChinesetextarewritteninstringsoC f characterswithoutanydelimiterbetweenadjacent words.ThusChineselanguageslacksthe natural segmentationintowordswhichcanbeusedasbasis
unitsforspeechrecognitionpurposesasknownfrom Indo-EuropeanlanguageslikeEnglish.Unitscanbe foundbsyegmentingtextintosinglecharacters,by using prosodicinformation [3]orbydefiningsemantic meaningful strings. Conventional Chinese systems analyzeinformationaboutsyllabicstructureandto ne separately and combine the information by synchronizationinlatersteps [3].Recentsystems integratethetonalstructure [2], [5].Thenumberof phonemesusedforacousticmodelingvariesbetween 33 [3]and100 [5]phonemes.Differencesinthedefinition ofphonemesexistssincetonalinformationiasssig nedto differentpartsofphonemesinonesyllable.Some researchersdefineinitialandfinalpartsosfylla bles [4] andhandletheintra-syllablestructureoamodel. f Most of the current systems use HMM-based acoustic modeling [2][6][7][8][9].Whereasmoststudiesoperate onChinesecharacterswdeecidedtodealwiththe Pinyin transcription.Pinyinarewordsegmentedphoneme transcriptionsaus sedinthepeople’srepublicof China. ThebenefitsoPf inyinbasedspeechrecognitionare the following:
ndthe is se
Complexity:ThemappingbetweenPinyinsyllables andChinesecharactersisveryambiguous.Using Pinyin results in smaller search space with decreaseddictionary size. NoOut-of-Vocabulary(OOV)rate:Onlyabout 1300differentPinyinsyllablesarerequiredforfu coverage. Compatibility:Noadaptationfromexistingtoolsto theChinesecharactersetisnecessary,wecanuse the accustomedsegmentedwordmodellikeinother languages. Modularity:SeparateexaminationofPinyinto Character conversion iapplicable s for text-to-spee Errortracking:Pinyintranscriptionics loserrela toaudiothanChinesecharactertranscriptionand thereforemakes recognition error tracking easier.
2 rate ese hinese
(yUSA)
ll
ch ted
WORDSEGMENTATIONAND PINYINCONVERSION
InMandarinChineseeverycharacterisspokenina monosyllabicmanner.Theover10.000different characterscanbeexpressedbyPinyinsyllableswhi consistofacombinationof408basesyllablesand tones.Sincesomeotfhe2040possiblecombinations
ch 5 do
notappearincommonChinesespeech,wecanlimitt setofPinyinsyllablesto1344. Byputtingall13 Pinyin syllables in the vocabulary every spoken utterancecan bexpressedintermsosegmented f Pi AfterthereversecharacterconversiontheSegmenta doesnotexistanymore....sothattheOut-of-Voca (OOV)problemiseliminatedintherecognitionproc Thisresultsinacompactandefficientrecognition engine. 2.1
he 44 nyin. tion bulary ess.
ThePinyinconversion
Wehaveusedtwoseparatestepstoachievethis conversion.InthefirststepwesegmenttheChines wordsandinthesecondstepwetransformthe segmentedChinesewords intothePinyin representat
! "# $ % &' () * + , -. /0 12 .34
e ion.
System Brute-Force +Lexiconmax length +probabilities Pinyinconverter 1 Finalconverter
Input
Wordsegmentation
Output ge4wei4 ting1zhong4 Mei3guo2 zheng4fu3 jue2ding4jin4yi1bu4 dong4jie2Mei3guo2 jin4chu1kou3yin2hang2 xiang4can1yu4Zhong1guo2 xiang4 mu4di4Mei3guo2 gong1si1ti2gong1de5 dai4kuan3
Phonemetranscription
! "# $ % &' ( )* + , -. /0 12 . 34
Thephonemetranscriptionins otonlysimple a mapp becauseabout13%oftheChinesecharactershasmo than onepronunciation. E.g.:
ing re
~20% ~20% 6% 2.1% 1.5%
2.2
n
add geto fsor
TheChinesecharacters conversion
Forspeech-to-textpurposeslikedictationapplicat ions wehavetoconvertthePinyinhypothesisbackto Chinesecharacterstomaketheoutputreadableand performanceresultscomparable.Thisisthereverse conversionoftheprocessdescribedinFigure1. However,this processim s uchmoreambiguousandve ry largecontextinformationinnecessary.Nevertheles s,in thistaskdatasparsenessins olongeraproblembe cause wecanlabelabigcorpusosfixyearsoPf eoplesD aily newspaper(providedbyLDC1997)withthePinyin converter.Wehaveinventedamethodtoautomatical ly learnminimized a setoftranslationrules.Tocont roland improvetherulelearningprocesswesupervisethe tools feedbackwhichconsistsoperformance f rate,thesi zeof theruleset,andtranslation errorswithfullcont ext. Chinesetextcorpus
= le4 (joy) or yue4 (music)
Inmanycasesadditionalpragmaticknowledgeis requiredtodisambiguatedifferentmeanings.Dueto facts,thatonlyfewdataofhighqualityarepubli available,weintegrateddifferentdatasourceslik lists,statisticalinformationandmanuallyPinyin Chinesetext.Weimplementedforeachstepa3stag back-offprocesstoexploitmostoptimaltheexisti data.Thelengthoftheresultingwordunitsvaries
-~6% 8% 4.9% 3.8%
Inordertogetgoodtranscriptionfortrainingwe rulestohandlelargenumbers,yeardates,percenta thefinalconverter.Wealsointroducemappingrule English abbreviations and acronyms into likewise pronouncedChinesecharacters.
Inthefirstapproachwoe nlysearchedfortheword with themaximallengthilexicon na tosplit caharacte string r lateronweaddedprobabilitiestosupportthespli tting decision.Forthephonemetranscriptionwestartwi tha simplecharactermappingtothemostlikelyPinyin syllable.Buttheresultingerrorratewasunaccept able because both steps are ambiguous and context informationins ecessarytoresolvetheequivocatio ns.In thesegmentationprocessthereexistsnosafedecis ion rulewhentosplittwoChinesecharactersbecausen early everypartofaChinesewordwithmorethantwo charactershasitsownmeaning.Furthermoretheend ing charactercanbecombinedwiththebeginningcharac ter of thefollowing w ord r esulting i c n orrect words. E.g.:
Segmentation Pinyinconversion
Table1Performance : ow f ordsegmentationandPinyi conversion
Figure 1:ThePinyin converter
between1and10syllableswithanaveragelengtho f about2syllablesperword,whichissimilartowor d unitsinIndo-Europeanlanguages.Thislengthias good tradeoffbetweenausefulstringlengthforacoust ic disambiguationandlanguagemodelingcontextaswel l asalimitedsizeorfesultingvocabularyunits.Si ncethe Pinyinreflectsthepronunciationothe f spokencha racter wecanusethePinyinconversiontoolforcreating a pronunciation dictionary by simply adding some pronunciation rules for exceptionalcases. Thisconvertergivesonly1.5%Pinyinerrorrate comparedtohandeditedtextsonPeoplesDailycorp us. Word segmentation faults, mostly resulting from incorrectsplittingpropernames,arenoseriouspr oblem for preparing training dataandpronunciation dicti onary.
the c eword labeled e ng
Pinyinconverter
segmented Pinyintext
rulebased learning
ruleset
Figure 2:TheChinesecharacters conversion Withabouthalfamillionrulesweachieveanerror of3.2%.Asanapproximateover-allerrorratewe
rate can
addthiserrorratetotherecognitionerrorrate. worstcaserecognitionerrorscaninfluencethecon fortheconversionanditispossiblethatwecang largererrorratethanthesumofthetwosingleer rates.Butoftentherecognitionandtheconversion reportsthesameerrorsothatoneerroriscounted TheerrordifferencebetweenthePinyinhypothesis theChinesecharacter outputof thebestrecognizer than 2.6%.
3
Inthe text etan ror twice. and isless
THEGLOBALPHONE DATABASE
Allexperimentshavebeencarriedoutintheframew ork oftheGlobalPhoneproject.Theaim ofthisproject isthe developmentofamultilingualrecognitionengine.F or thispurposealargespeechdatabasehasbeencolle cted whichcurrentlyconsistsof13languages,namely Arabic,MandarinandWuChinese,Croatian,German, Japanese, Korean, Portuguese, Russian, Spanish, Swedish,Tamil,andTurkish.Foreachlanguageabou t 100nativespeakerswereaskedtoread20minuteso f newspaperarticles.Theirspeechwasrecordedinof fice quality, with a close-speaking microphone. The GlobalPhone corpus is fully transcribed including spontaneouseffectslikefalsestartsandhesitatio ns. FurtherdetailsotfheGlobalPhoneprojectaregive nin [1]. TheMandarinpartofthemultilingualdatabasewas collectedinthreeplacesinnorth,middleandsout h mainland china. The different places ensure a widespreadcoloringoMandarin f dialect.Wetriedt oget anuniformdistributionoages f (between18and65) and educationlevels.Ourdatabaseconsistsof10214 utteranceswithatotallylengthof28.6hoursosf peech spokenb1y32speakersoboth f gender.112speakers are usedfortraining,10speakerseachforthedevelop ment andevaluation testset. Totrainthelanguagemodelweused82.5Miowords fromPeoplesDailyandXinhuanewspaper.Thetrigra m perplexityotfhelanguagemodelis207.Thedictio nary has size a o17000 f wordsincludingeachsyllable, which results in aO n OV-rateo0%. f 4
THE RECOGNITIONENGINE
OurgoalistointegratetheresultingChinesereco systemintomultilingual a speechrecognizerframew Todothisinaneasywaywedecidedtousesimilar preprocessing andacousticmodeling for alllanguag Duringthepreprocessingthedimensionalityofthe featuresetisreducedtothefirst24LDAparamete the13melcepstralcoefficients,power,zerocrossi theirfirstandsecondderivativescalculatedfrom sampledinputspeech. Thesystemisafullycontinuous3-stateHMMwith emissionprobabilitymodeledbyamixtureof16 Gaussianswith diagonalvariances.
gnition ork. es. rsof ngand 16kHz
4.1
Bootstrapping
Forbootstrappingwegeneratedamappingfroma multilingualsystemincludingEnglish,German,Span ish andJapanesephonemes [11]andwrotelabelsforthe trainingdata.Inthenextstepwienitializedthe Gaussian codebooks with k-means and trained along the previouslycreatedlabelsandagainwrotelabelswi ththe resultingsystem.Werepeatedthisproceduresevera l timesuntilthiscontextindependentsystemreached its performancemaximum. Foracontextdependentsystemthepolyphonictree of alloccurringquintphones(containingcross-wordmo dels withuptoonephonemelookaheadtoadjacentwords) hasbeenclustereddownto3000codebooksbyusing linguistic motivated questions about the phonetic context. 4.2
Syllablevs.phonemeunits forspeech recognition
Wedevelopedatooltogenerateparametriccontroll phonemesetsandcorrespondingdictionaries.Using toolwecomparedtwopromisingmappingsfrom phoneticinformationtoacousticmodels.Inthefir wehavemappedforeveryPinyinsyllablethebeginn consonant,themiddlevocalconstructandtheendin consonantonitsownacousticmodel.Forthemiddle vowelconstructwedistinguishbetweenfivediffere tonalinformation.
ed this stcase ing g nt
P # honemes Beginnings Middle vowelwithtone Middle diphthongswithtone Middle triphthongswithtone Endings å
21 35 63 19 3 141
Table 2Composition : othe f firstphonemeset TheintersyllablecoarticulationinChineseisonl reflectedinminordegreecomparedwithIndo-Europe languageslikeEnglish.Associatedwiththefactth thereareonlyabout1300frequentlyusedsyllables includingthetonalinformation,wedecidedtouse acousticmodelsbasedonthewholesyllableinthe secondcase. Thefirstcontextdependentsystemoutperformsthe correspondingphonemebasedsystemby29.1%to 30.8%errorrateonthewordbasedPinyinhypothesi Whilebuildingacontextindependentsystemsome severerun-timeandmemoryconsumptionproblems arose,causedbythehugeamountofmorethan1000 acousticmodels.Thisforcedustobreakthefurthe developmentotfhesyllablebasedcontextindepende systemuntilwehavechangedtheJRTk.Theexpectat foraperformanceincreaseisnotaslargeasfort phoneme based system, because the inter-syllable coarticulationismuchsmallerthantheintra-sylla coarticulationfor Mandarin Chinese.
y an at
s.
r nt ion he ble
4.3
Tonalinformation
Additionally,twodifferentacousticmodelingofth tonalinformationwerecompared.Besidestheimplic modelingofthetonalinformationthroughtraining mostly5differentphonemesforavocalconstruct, performedanapproachbyexplicitdetectingpitch information and adding 18 generated pitch characteristicstothefeaturevectorbeforeperfor LDAensuringthatatleast6pitchparametersarel theresultingfeaturevector. 4.4
e it of we
ming eftin
VTLN
InVocalTractLengthNormalization(VTLN)lainear or nonlinearfrequencytransformationcompensatesfor differentvocaltractlengths [10].Findinggoodestimates forthespeakerspecificwarpparametersisacriti cal issue.ForVTLN,wekeepthedimensionconstantand warpthetrainingsamplesoef achspeakersuchthat the LinearDiscriminantisoptimized.Althoughthatcri terion dependsonalltrainingsamplesofallspeakersit can iteratively providespeaker specificwarpfactors. Bytrainingwithspeakerspecificwarpfactorsand estimatinggoodwarpfactorsfortestingwceandec rease theerror aboutmorethan 1%. 4.5
Results
ThetablebelowshowstheprogressoftheMandarin system. PWEisworderrorratebasedonthePinyin hypothesis,while CWEisworderrorratebasedonthe Chinesecharacterhypothesis.Thewordrecognition error rates arereportedfor better comparison tor esultsof otherlanguages.However,tocompareoursystemto otherChinesesystems,thecommonlyusedcharacter basederrorratesarepresentedinthelastcolumn (CCE) which i15.0% s for our currently bestsystem. Systems Firstbootstrappedversion Data correction Pinyintoolimprovements Full training setandlarge LM: 112 train speaker + 82.5 LM Contextdependentsystem Speaker normalization Including explicit pitch Currently bestsystem Bestsyllable basedsystem
PWE 50.0% 43.0% 34.0% 30.8%
CWE -
24.1% 22.9% 21.8% 24.3% 16.1% 20.7% 23.3% 15.0% 27,7% 31.2% 21.3%
CCE -
Table 3System : performance 5
CONCLUSION
Ourexperienceshowsthatanonnativedevelopercan portasystemtoanewlanguagewithinlessthan6 months.Ourcurrentlybestsystemshows15.0%erro rateoncharacteroutput.Furthermore,iturnsout JRTkeasycanbaedaptedtobue sedinseveraldiff languages.Thisisanimportantresultconcerningt
r that erent he
integration into the framework of a multilingual recognizer.WhilebuildingtheChineserecognizerw haveimplementedmethodstoautomateimportantstep buildingnew a recognizerfromscratch.Furthermore portedourJanusRecognitionToolkit(JRTk)tothe Windows platform. 6
e s we
ACKNOWLEDGMENTS
Theauthorswishtothankallmembersothe f Intera SystemsLaboratoriesespeciallyTianshiWei,JingW andJiaxingWengfromtheGlobalPhoneteamfor collectingandvalidatingtheChinesedatabase,and Schubertfor implementing thepitch tracker.
ctive ang Kjell
REFERENCES [1] Schultzeal: t LanguageIndependentandLanguage AdaptiveLVCSR.ICSLP1998. [2] Chen, C.J., Gopinath R.A., Monkowski M.D., PichenyM.A.,andShenK.: NewMethodsin Continuous Mandarin Speech Recognition, Eurospeech 97Rhodes 1543-1546 [3] Lyu,R-Y.eat l: GoldenMandarin(III)-AUseradaptive Prosodic-Segment-Based Mandarin DictationMachineforChineseLanguagewithVery LargeVocabulary.ICASSP95,vol1, pp57. [4] Ma,B.,Huang,T.,Xu,B.,Zhang,X.,andQu,F.: Context-dependent Acoustic Models for ChineseSpeechRecognition,ICASSP96,Atlanta, pp455-458. [5] Zhan,P.,Wegmann,Steven,Lowe,Steve: Dragon Systems1997MandarinBroadcastnewsSystem, DarpaWorkshop,Lansdowne,1998. [6] Gao,Yuqing;Hon,Hsiao-Wuren;Lin,Zhiwei; Loudon,Gareth;Yaganantha,S.;Yuan,Baosheng : Tangerine:ALargeVocabularyMandarinDictation System,ICASSP95,Detroit(Volume1Page , 77) [7] Lyu,Ren-Yuan;u.a. :GoldenMandarin(III)–A User-AdaptiveProsodic-Segment-BasedMandarin DictationMachineforChineseLanguagewithVery LargeVocabulary,ICASSP95,Detroit(Volume1, Page57) [8] Ma,Bin;Huang,Taiyi;Xu,Bo;Zhang,Xijun;Qu, Fei: Context-Dependent Acoustic Models for ChineseSpeechRecognition,ICASSP96,Atlanta (Volume1Page , 455) [9] Ho,Tai-Hsuan;Yang,Kae-Cherng;Huang,KuoHsun;Lee,Lin-Shan Improved : searchstrategyfor large vocabulary continuous mandarin speech recognition,ICASSP98,Seattle(Volume2,Page 825, Paper number 2381) [10] MartinWestphal,TanjaSchultz,AlexWaibel: LinearDiscriminant-anewmethodforspeaker normalisation,ProceedingsoftheICSLP 98, Sydney,Australia,1998. [11] TanjaSchultzandAlexWaibel: FastBootstrapping ofLVCSRSystemswithmultilingualPhoneme Sets,ProceedingsotfheEurospeech'97Vol.1pp 371-373, Rhodes,Greece,September 1997.