mandarin large vocabulary speech recognition using the ... - CiteSeerX

MANDARIN LARGE VOCABULARY SPEECHRECOGNITION USING THE GLOBALPHONE DATABASE J.Reichert,T.Schultz,andAWaibel . Interactive SystemsLaboratories University oKarlsruhe f (Germany),Carnegie MellonUniversit {juergen,tanja,waibel}@ira.uka.de

ABSTRACT Thispaperpresentsourrecenteffortsindevelopin speakerindependentLVCSRengineforMandarin ChineseusingourmultilingualdatabaseGlobalPhone Wedescribeatwopassapproach,inwhichthe recognition first generates Pinyin hypotheses and second transform these into Chinese character hypotheses.Weshowhowthisapproachcanreduce complexityandincreaseflexibility.Weevaluatean comparedifferentsystemsincludingdifferentbase for speech recognition aphoneme s unitsversussyll Furthermore we analyze the influence of tonal information.Ourcurrentlybestsystemshowsvery promising results achieving 15.0%character error

1

-

.

d units ables.

rate.

INTRODUCTION

Withthedistributionofspeechtechnologyproducts overtheworld,thefastandefficientportability targetlanguagesbecameapracticalconcern.OurJa RecognitionToolkit(JRTk)islanguageindependent wehavealreadyshownthattheunderlyingrecogniti methodsandtechniquescanbeappliedtoseveral languages [1].Tosetuparecognizerinanewlanguage theacousticmodels,thepronunciationdictionary a languagemodelhavetobetrainedoradapted.Inth paperwedescribeourworkinbootstrappingaChine LVCSRsystemfromamultilingualrecognizerengine. SincetheinputofChinesecharacterstothecomput averytimeconsumingprocessLVCSRsystemsfor Chineselanguagesareovf eryspecialinterest.In ofportabilitythreeaspectsdistinguishtheChines languagefrom other languages: -

ga

all tonew nus and on

eris

-

-

-

terms e -

TheChineseideographiccharactersdonotreflect thepronunciation ow af ord ModernChinesewrittentextlacksthesegmentation intowords InspokenMandarinthetonalinformationis necessary todistinguishmeanings

Chineseideographiccharactersdonotallowtogene automaticallythepronunciationforanunknownChin character.ChinesetextarewritteninstringsoC f characterswithoutanydelimiterbetweenadjacent words.ThusChineselanguageslacksthe natural segmentationintowordswhichcanbeusedasbasis

unitsforspeechrecognitionpurposesasknownfrom Indo-EuropeanlanguageslikeEnglish.Unitscanbe foundbsyegmentingtextintosinglecharacters,by using prosodicinformation [3]orbydefiningsemantic meaningful strings. Conventional Chinese systems analyzeinformationaboutsyllabicstructureandto ne separately and combine the information by synchronizationinlatersteps [3].Recentsystems integratethetonalstructure [2], [5].Thenumberof phonemesusedforacousticmodelingvariesbetween 33 [3]and100 [5]phonemes.Differencesinthedefinition ofphonemesexistssincetonalinformationiasssig nedto differentpartsofphonemesinonesyllable.Some researchersdefineinitialandfinalpartsosfylla bles [4] andhandletheintra-syllablestructureoamodel. f Most of the current systems use HMM-based acoustic modeling [2][6][7][8][9].Whereasmoststudiesoperate onChinesecharacterswdeecidedtodealwiththe Pinyin transcription.Pinyinarewordsegmentedphoneme transcriptionsaus sedinthepeople’srepublicof China. ThebenefitsoPf inyinbasedspeechrecognitionare the following:

ndthe is se

Complexity:ThemappingbetweenPinyinsyllables andChinesecharactersisveryambiguous.Using Pinyin results in smaller search space with decreaseddictionary size. NoOut-of-Vocabulary(OOV)rate:Onlyabout 1300differentPinyinsyllablesarerequiredforfu coverage. Compatibility:Noadaptationfromexistingtoolsto theChinesecharactersetisnecessary,wecanuse the accustomedsegmentedwordmodellikeinother languages. Modularity:SeparateexaminationofPinyinto Character conversion iapplicable s for text-to-spee Errortracking:Pinyintranscriptionics loserrela toaudiothanChinesecharactertranscriptionand thereforemakes recognition error tracking easier.

2 rate ese hinese

(yUSA)

ll

ch ted

WORDSEGMENTATIONAND PINYINCONVERSION

InMandarinChineseeverycharacterisspokenina monosyllabicmanner.Theover10.000different characterscanbeexpressedbyPinyinsyllableswhi consistofacombinationof408basesyllablesand tones.Sincesomeotfhe2040possiblecombinations

ch 5 do

notappearincommonChinesespeech,wecanlimitt setofPinyinsyllablesto1344. Byputtingall13 Pinyin syllables in the vocabulary every spoken utterancecan bexpressedintermsosegmented f Pi AfterthereversecharacterconversiontheSegmenta doesnotexistanymore....sothattheOut-of-Voca (OOV)problemiseliminatedintherecognitionproc Thisresultsinacompactandefficientrecognition engine. 2.1

he 44 nyin. tion bulary ess.

ThePinyinconversion

Wehaveusedtwoseparatestepstoachievethis conversion.InthefirststepwesegmenttheChines wordsandinthesecondstepwetransformthe segmentedChinesewords intothePinyin representat

! "# $ % &' () * + , -. /0 12 .34

e ion.

System Brute-Force +Lexiconmax length +probabilities Pinyinconverter 1 Finalconverter

Input

Wordsegmentation

Output ge4wei4 ting1zhong4 Mei3guo2 zheng4fu3 jue2ding4jin4yi1bu4 dong4jie2Mei3guo2 jin4chu1kou3yin2hang2 xiang4can1yu4Zhong1guo2 xiang4 mu4di4Mei3guo2 gong1si1ti2gong1de5 dai4kuan3

Phonemetranscription

! "# $ % &' ( )* + , -. /0 12 . 34

Thephonemetranscriptionins otonlysimple a mapp becauseabout13%oftheChinesecharactershasmo than onepronunciation. E.g.:

ing re

~20% ~20% 6% 2.1% 1.5%

2.2

n

add geto fsor

TheChinesecharacters conversion

Forspeech-to-textpurposeslikedictationapplicat ions wehavetoconvertthePinyinhypothesisbackto Chinesecharacterstomaketheoutputreadableand performanceresultscomparable.Thisisthereverse conversionoftheprocessdescribedinFigure1. However,this processim s uchmoreambiguousandve ry largecontextinformationinnecessary.Nevertheles s,in thistaskdatasparsenessins olongeraproblembe cause wecanlabelabigcorpusosfixyearsoPf eoplesD aily newspaper(providedbyLDC1997)withthePinyin converter.Wehaveinventedamethodtoautomatical ly learnminimized a setoftranslationrules.Tocont roland improvetherulelearningprocesswesupervisethe tools feedbackwhichconsistsoperformance f rate,thesi zeof theruleset,andtranslation errorswithfullcont ext. Chinesetextcorpus

= le4 (joy) or yue4 (music)

Inmanycasesadditionalpragmaticknowledgeis requiredtodisambiguatedifferentmeanings.Dueto facts,thatonlyfewdataofhighqualityarepubli available,weintegrateddifferentdatasourceslik lists,statisticalinformationandmanuallyPinyin Chinesetext.Weimplementedforeachstepa3stag back-offprocesstoexploitmostoptimaltheexisti data.Thelengthoftheresultingwordunitsvaries

-~6% 8% 4.9% 3.8%

Inordertogetgoodtranscriptionfortrainingwe rulestohandlelargenumbers,yeardates,percenta thefinalconverter.Wealsointroducemappingrule English abbreviations and acronyms into likewise pronouncedChinesecharacters.

Inthefirstapproachwoe nlysearchedfortheword with themaximallengthilexicon na tosplit caharacte string r lateronweaddedprobabilitiestosupportthespli tting decision.Forthephonemetranscriptionwestartwi tha simplecharactermappingtothemostlikelyPinyin syllable.Buttheresultingerrorratewasunaccept able because both steps are ambiguous and context informationins ecessarytoresolvetheequivocatio ns.In thesegmentationprocessthereexistsnosafedecis ion rulewhentosplittwoChinesecharactersbecausen early everypartofaChinesewordwithmorethantwo charactershasitsownmeaning.Furthermoretheend ing charactercanbecombinedwiththebeginningcharac ter of thefollowing w ord r esulting i c n orrect words. E.g.:

Segmentation Pinyinconversion

Table1Performance : ow f ordsegmentationandPinyi conversion

Figure 1:ThePinyin converter

between1and10syllableswithanaveragelengtho f about2syllablesperword,whichissimilartowor d unitsinIndo-Europeanlanguages.Thislengthias good tradeoffbetweenausefulstringlengthforacoust ic disambiguationandlanguagemodelingcontextaswel l asalimitedsizeorfesultingvocabularyunits.Si ncethe Pinyinreflectsthepronunciationothe f spokencha racter wecanusethePinyinconversiontoolforcreating a pronunciation dictionary by simply adding some pronunciation rules for exceptionalcases. Thisconvertergivesonly1.5%Pinyinerrorrate comparedtohandeditedtextsonPeoplesDailycorp us. Word segmentation faults, mostly resulting from incorrectsplittingpropernames,arenoseriouspr oblem for preparing training dataandpronunciation dicti onary.

the c eword labeled e ng

Pinyinconverter

segmented Pinyintext

rulebased learning

ruleset

Figure 2:TheChinesecharacters conversion Withabouthalfamillionrulesweachieveanerror of3.2%.Asanapproximateover-allerrorratewe

rate can

addthiserrorratetotherecognitionerrorrate. worstcaserecognitionerrorscaninfluencethecon fortheconversionanditispossiblethatwecang largererrorratethanthesumofthetwosingleer rates.Butoftentherecognitionandtheconversion reportsthesameerrorsothatoneerroriscounted TheerrordifferencebetweenthePinyinhypothesis theChinesecharacter outputof thebestrecognizer than 2.6%.

3

Inthe text etan ror twice. and isless

THEGLOBALPHONE DATABASE

Allexperimentshavebeencarriedoutintheframew ork oftheGlobalPhoneproject.Theaim ofthisproject isthe developmentofamultilingualrecognitionengine.F or thispurposealargespeechdatabasehasbeencolle cted whichcurrentlyconsistsof13languages,namely Arabic,MandarinandWuChinese,Croatian,German, Japanese, Korean, Portuguese, Russian, Spanish, Swedish,Tamil,andTurkish.Foreachlanguageabou t 100nativespeakerswereaskedtoread20minuteso f newspaperarticles.Theirspeechwasrecordedinof fice quality, with a close-speaking microphone. The GlobalPhone corpus is fully transcribed including spontaneouseffectslikefalsestartsandhesitatio ns. FurtherdetailsotfheGlobalPhoneprojectaregive nin [1]. TheMandarinpartofthemultilingualdatabasewas collectedinthreeplacesinnorth,middleandsout h mainland china. The different places ensure a widespreadcoloringoMandarin f dialect.Wetriedt oget anuniformdistributionoages f (between18and65) and educationlevels.Ourdatabaseconsistsof10214 utteranceswithatotallylengthof28.6hoursosf peech spokenb1y32speakersoboth f gender.112speakers are usedfortraining,10speakerseachforthedevelop ment andevaluation testset. Totrainthelanguagemodelweused82.5Miowords fromPeoplesDailyandXinhuanewspaper.Thetrigra m perplexityotfhelanguagemodelis207.Thedictio nary has size a o17000 f wordsincludingeachsyllable, which results in aO n OV-rateo0%. f 4

THE RECOGNITIONENGINE

OurgoalistointegratetheresultingChinesereco systemintomultilingual a speechrecognizerframew Todothisinaneasywaywedecidedtousesimilar preprocessing andacousticmodeling for alllanguag Duringthepreprocessingthedimensionalityofthe featuresetisreducedtothefirst24LDAparamete the13melcepstralcoefficients,power,zerocrossi theirfirstandsecondderivativescalculatedfrom sampledinputspeech. Thesystemisafullycontinuous3-stateHMMwith emissionprobabilitymodeledbyamixtureof16 Gaussianswith diagonalvariances.

gnition ork. es. rsof ngand 16kHz

4.1

Bootstrapping

Forbootstrappingwegeneratedamappingfroma multilingualsystemincludingEnglish,German,Span ish andJapanesephonemes [11]andwrotelabelsforthe trainingdata.Inthenextstepwienitializedthe Gaussian codebooks with k-means and trained along the previouslycreatedlabelsandagainwrotelabelswi ththe resultingsystem.Werepeatedthisproceduresevera l timesuntilthiscontextindependentsystemreached its performancemaximum. Foracontextdependentsystemthepolyphonictree of alloccurringquintphones(containingcross-wordmo dels withuptoonephonemelookaheadtoadjacentwords) hasbeenclustereddownto3000codebooksbyusing linguistic motivated questions about the phonetic context. 4.2

Syllablevs.phonemeunits forspeech recognition

Wedevelopedatooltogenerateparametriccontroll phonemesetsandcorrespondingdictionaries.Using toolwecomparedtwopromisingmappingsfrom phoneticinformationtoacousticmodels.Inthefir wehavemappedforeveryPinyinsyllablethebeginn consonant,themiddlevocalconstructandtheendin consonantonitsownacousticmodel.Forthemiddle vowelconstructwedistinguishbetweenfivediffere tonalinformation.

ed this stcase ing g nt

P # honemes Beginnings Middle vowelwithtone Middle diphthongswithtone Middle triphthongswithtone Endings å

21 35 63 19 3 141

Table 2Composition : othe f firstphonemeset TheintersyllablecoarticulationinChineseisonl reflectedinminordegreecomparedwithIndo-Europe languageslikeEnglish.Associatedwiththefactth thereareonlyabout1300frequentlyusedsyllables includingthetonalinformation,wedecidedtouse acousticmodelsbasedonthewholesyllableinthe secondcase. Thefirstcontextdependentsystemoutperformsthe correspondingphonemebasedsystemby29.1%to 30.8%errorrateonthewordbasedPinyinhypothesi Whilebuildingacontextindependentsystemsome severerun-timeandmemoryconsumptionproblems arose,causedbythehugeamountofmorethan1000 acousticmodels.Thisforcedustobreakthefurthe developmentotfhesyllablebasedcontextindepende systemuntilwehavechangedtheJRTk.Theexpectat foraperformanceincreaseisnotaslargeasfort phoneme based system, because the inter-syllable coarticulationismuchsmallerthantheintra-sylla coarticulationfor Mandarin Chinese.

y an at

s.

r nt ion he ble

4.3

Tonalinformation

Additionally,twodifferentacousticmodelingofth tonalinformationwerecompared.Besidestheimplic modelingofthetonalinformationthroughtraining mostly5differentphonemesforavocalconstruct, performedanapproachbyexplicitdetectingpitch information and adding 18 generated pitch characteristicstothefeaturevectorbeforeperfor LDAensuringthatatleast6pitchparametersarel theresultingfeaturevector. 4.4

e it of we

ming eftin

VTLN

InVocalTractLengthNormalization(VTLN)lainear or nonlinearfrequencytransformationcompensatesfor differentvocaltractlengths [10].Findinggoodestimates forthespeakerspecificwarpparametersisacriti cal issue.ForVTLN,wekeepthedimensionconstantand warpthetrainingsamplesoef achspeakersuchthat the LinearDiscriminantisoptimized.Althoughthatcri terion dependsonalltrainingsamplesofallspeakersit can iteratively providespeaker specificwarpfactors. Bytrainingwithspeakerspecificwarpfactorsand estimatinggoodwarpfactorsfortestingwceandec rease theerror aboutmorethan 1%. 4.5

Results

ThetablebelowshowstheprogressoftheMandarin system. PWEisworderrorratebasedonthePinyin hypothesis,while CWEisworderrorratebasedonthe Chinesecharacterhypothesis.Thewordrecognition error rates arereportedfor better comparison tor esultsof otherlanguages.However,tocompareoursystemto otherChinesesystems,thecommonlyusedcharacter basederrorratesarepresentedinthelastcolumn (CCE) which i15.0% s for our currently bestsystem. Systems Firstbootstrappedversion Data correction Pinyintoolimprovements Full training setandlarge LM: 112 train speaker + 82.5 LM Contextdependentsystem Speaker normalization Including explicit pitch Currently bestsystem Bestsyllable basedsystem

PWE 50.0% 43.0% 34.0% 30.8%

CWE -

24.1% 22.9% 21.8% 24.3% 16.1% 20.7% 23.3% 15.0% 27,7% 31.2% 21.3%

CCE -

Table 3System : performance 5

CONCLUSION

Ourexperienceshowsthatanonnativedevelopercan portasystemtoanewlanguagewithinlessthan6 months.Ourcurrentlybestsystemshows15.0%erro rateoncharacteroutput.Furthermore,iturnsout JRTkeasycanbaedaptedtobue sedinseveraldiff languages.Thisisanimportantresultconcerningt

r that erent he

integration into the framework of a multilingual recognizer.WhilebuildingtheChineserecognizerw haveimplementedmethodstoautomateimportantstep buildingnew a recognizerfromscratch.Furthermore portedourJanusRecognitionToolkit(JRTk)tothe Windows platform. 6

e s we

ACKNOWLEDGMENTS

Theauthorswishtothankallmembersothe f Intera SystemsLaboratoriesespeciallyTianshiWei,JingW andJiaxingWengfromtheGlobalPhoneteamfor collectingandvalidatingtheChinesedatabase,and Schubertfor implementing thepitch tracker.

ctive ang Kjell

REFERENCES [1] Schultzeal: t LanguageIndependentandLanguage AdaptiveLVCSR.ICSLP1998. [2] Chen, C.J., Gopinath R.A., Monkowski M.D., PichenyM.A.,andShenK.: NewMethodsin Continuous Mandarin Speech Recognition, Eurospeech 97Rhodes 1543-1546 [3] Lyu,R-Y.eat l: GoldenMandarin(III)-AUseradaptive Prosodic-Segment-Based Mandarin DictationMachineforChineseLanguagewithVery LargeVocabulary.ICASSP95,vol1, pp57. [4] Ma,B.,Huang,T.,Xu,B.,Zhang,X.,andQu,F.: Context-dependent Acoustic Models for ChineseSpeechRecognition,ICASSP96,Atlanta, pp455-458. [5] Zhan,P.,Wegmann,Steven,Lowe,Steve: Dragon Systems1997MandarinBroadcastnewsSystem, DarpaWorkshop,Lansdowne,1998. [6] Gao,Yuqing;Hon,Hsiao-Wuren;Lin,Zhiwei; Loudon,Gareth;Yaganantha,S.;Yuan,Baosheng : Tangerine:ALargeVocabularyMandarinDictation System,ICASSP95,Detroit(Volume1Page , 77) [7] Lyu,Ren-Yuan;u.a. :GoldenMandarin(III)–A User-AdaptiveProsodic-Segment-BasedMandarin DictationMachineforChineseLanguagewithVery LargeVocabulary,ICASSP95,Detroit(Volume1, Page57) [8] Ma,Bin;Huang,Taiyi;Xu,Bo;Zhang,Xijun;Qu, Fei: Context-Dependent Acoustic Models for ChineseSpeechRecognition,ICASSP96,Atlanta (Volume1Page , 455) [9] Ho,Tai-Hsuan;Yang,Kae-Cherng;Huang,KuoHsun;Lee,Lin-Shan Improved : searchstrategyfor large vocabulary continuous mandarin speech recognition,ICASSP98,Seattle(Volume2,Page 825, Paper number 2381) [10] MartinWestphal,TanjaSchultz,AlexWaibel: LinearDiscriminant-anewmethodforspeaker normalisation,ProceedingsoftheICSLP 98, Sydney,Australia,1998. [11] TanjaSchultzandAlexWaibel: FastBootstrapping ofLVCSRSystemswithmultilingualPhoneme Sets,ProceedingsotfheEurospeech'97Vol.1pp 371-373, Rhodes,Greece,September 1997.