mandarin large vocabulary speech recognition using the ... - CiteSeerX

4 downloads 33308 Views 66KB Size Report
MANDARIN LARGE VOCABULARY SPEECH RECOGNITION. USING THE GLOBALPHONE DATABASE. J. Reichert, T. Schultz, and A. Waibel. Interactive ...
MANDARIN LARGE VOCABULARY SPEECHRECOGNITION USING THE GLOBALPHONE DATABASE J.Reichert,T.Schultz,andAWaibel . Interactive SystemsLaboratories University oKarlsruhe f (Germany),Carnegie MellonUniversit {juergen,tanja,waibel}@ira.uka.de

ABSTRACT Thispaperpresentsourrecenteffortsindevelopin speakerindependentLVCSRengineforMandarin ChineseusingourmultilingualdatabaseGlobalPhone Wedescribeatwopassapproach,inwhichthe recognition first generates Pinyin hypotheses and second transform these into Chinese character hypotheses.Weshowhowthisapproachcanreduce complexityandincreaseflexibility.Weevaluatean comparedifferentsystemsincludingdifferentbase for speech recognition aphoneme s unitsversussyll Furthermore we analyze the influence of tonal information.Ourcurrentlybestsystemshowsvery promising results achieving 15.0%character error

1

-

.

d units ables.

rate.

INTRODUCTION

Withthedistributionofspeechtechnologyproducts overtheworld,thefastandefficientportability targetlanguagesbecameapracticalconcern.OurJa RecognitionToolkit(JRTk)islanguageindependent wehavealreadyshownthattheunderlyingrecogniti methodsandtechniquescanbeappliedtoseveral languages [1].Tosetuparecognizerinanewlanguage theacousticmodels,thepronunciationdictionary a languagemodelhavetobetrainedoradapted.Inth paperwedescribeourworkinbootstrappingaChine LVCSRsystemfromamultilingualrecognizerengine. SincetheinputofChinesecharacterstothecomput averytimeconsumingprocessLVCSRsystemsfor Chineselanguagesareovf eryspecialinterest.In ofportabilitythreeaspectsdistinguishtheChines languagefrom other languages: -

ga

all tonew nus and on

eris

-

-

-

terms e -

TheChineseideographiccharactersdonotreflect thepronunciation ow af ord ModernChinesewrittentextlacksthesegmentation intowords InspokenMandarinthetonalinformationis necessary todistinguishmeanings

Chineseideographiccharactersdonotallowtogene automaticallythepronunciationforanunknownChin character.ChinesetextarewritteninstringsoC f characterswithoutanydelimiterbetweenadjacent words.ThusChineselanguageslacksthe natural segmentationintowordswhichcanbeusedasbasis

unitsforspeechrecognitionpurposesasknownfrom Indo-EuropeanlanguageslikeEnglish.Unitscanbe foundbsyegmentingtextintosinglecharacters,by using prosodicinformation [3]orbydefiningsemantic meaningful strings. Conventional Chinese systems analyzeinformationaboutsyllabicstructureandto ne separately and combine the information by synchronizationinlatersteps [3].Recentsystems integratethetonalstructure [2], [5].Thenumberof phonemesusedforacousticmodelingvariesbetween 33 [3]and100 [5]phonemes.Differencesinthedefinition ofphonemesexistssincetonalinformationiasssig nedto differentpartsofphonemesinonesyllable.Some researchersdefineinitialandfinalpartsosfylla bles [4] andhandletheintra-syllablestructureoamodel. f Most of the current systems use HMM-based acoustic modeling [2][6][7][8][9].Whereasmoststudiesoperate onChinesecharacterswdeecidedtodealwiththe Pinyin transcription.Pinyinarewordsegmentedphoneme transcriptionsaus sedinthepeople’srepublicof China. ThebenefitsoPf inyinbasedspeechrecognitionare the following:

ndthe is se

Complexity:ThemappingbetweenPinyinsyllables andChinesecharactersisveryambiguous.Using Pinyin results in smaller search space with decreaseddictionary size. NoOut-of-Vocabulary(OOV)rate:Onlyabout 1300differentPinyinsyllablesarerequiredforfu coverage. Compatibility:Noadaptationfromexistingtoolsto theChinesecharactersetisnecessary,wecanuse the accustomedsegmentedwordmodellikeinother languages. Modularity:SeparateexaminationofPinyinto Character conversion iapplicable s for text-to-spee Errortracking:Pinyintranscriptionics loserrela toaudiothanChinesecharactertranscriptionand thereforemakes recognition error tracking easier.

2 rate ese hinese

(yUSA)

ll

ch ted

WORDSEGMENTATIONAND PINYINCONVERSION

InMandarinChineseeverycharacterisspokenina monosyllabicmanner.Theover10.000different characterscanbeexpressedbyPinyinsyllableswhi consistofacombinationof408basesyllablesand tones.Sincesomeotfhe2040possiblecombinations

ch 5 do

notappearincommonChinesespeech,wecanlimitt setofPinyinsyllablesto1344. Byputtingall13 Pinyin syllables in the vocabulary every spoken utterancecan bexpressedintermsosegmented f Pi AfterthereversecharacterconversiontheSegmenta doesnotexistanymore....sothattheOut-of-Voca (OOV)problemiseliminatedintherecognitionproc Thisresultsinacompactandefficientrecognition engine. 2.1

he 44 nyin. tion bulary ess.

ThePinyinconversion

Wehaveusedtwoseparatestepstoachievethis conversion.InthefirststepwesegmenttheChines wordsandinthesecondstepwetransformthe segmentedChinesewords intothePinyin representat

     ! "# $ % &' () * + , -. /0 12 .34

e ion.

System Brute-Force +Lexiconmax length +probabilities Pinyinconverter 1 Finalconverter

Input

Wordsegmentation

Output ge4wei4 ting1zhong4 Mei3guo2 zheng4fu3 jue2ding4jin4yi1bu4 dong4jie2Mei3guo2 jin4chu1kou3yin2hang2 xiang4can1yu4Zhong1guo2 xiang4 mu4di4Mei3guo2 gong1si1ti2gong1de5 dai4kuan3

Phonemetranscription

       ! "#   $ % &' ( )* + , -.   /0 12 . 34

 

Thephonemetranscriptionins otonlysimple a mapp becauseabout13%oftheChinesecharactershasmo than onepronunciation. E.g.:



ing re

~20% ~20% 6% 2.1% 1.5%

2.2

n

add geto fsor

TheChinesecharacters conversion

Forspeech-to-textpurposeslikedictationapplicat ions wehavetoconvertthePinyinhypothesisbackto Chinesecharacterstomaketheoutputreadableand performanceresultscomparable.Thisisthereverse conversionoftheprocessdescribedinFigure1. However,this processim s uchmoreambiguousandve ry largecontextinformationinnecessary.Nevertheles s,in thistaskdatasparsenessins olongeraproblembe cause wecanlabelabigcorpusosfixyearsoPf eoplesD aily newspaper(providedbyLDC1997)withthePinyin converter.Wehaveinventedamethodtoautomatical ly learnminimized a setoftranslationrules.Tocont roland improvetherulelearningprocesswesupervisethe tools feedbackwhichconsistsoperformance f rate,thesi zeof theruleset,andtranslation errorswithfullcont ext. Chinesetextcorpus

= le4 (joy) or yue4 (music)

Inmanycasesadditionalpragmaticknowledgeis requiredtodisambiguatedifferentmeanings.Dueto facts,thatonlyfewdataofhighqualityarepubli available,weintegrateddifferentdatasourceslik lists,statisticalinformationandmanuallyPinyin Chinesetext.Weimplementedforeachstepa3stag back-offprocesstoexploitmostoptimaltheexisti data.Thelengthoftheresultingwordunitsvaries

-~6% 8% 4.9% 3.8%

Inordertogetgoodtranscriptionfortrainingwe rulestohandlelargenumbers,yeardates,percenta thefinalconverter.Wealsointroducemappingrule English abbreviations and acronyms into likewise pronouncedChinesecharacters.

Inthefirstapproachwoe nlysearchedfortheword with themaximallengthilexicon na tosplit caharacte string r lateronweaddedprobabilitiestosupportthespli tting decision.Forthephonemetranscriptionwestartwi tha simplecharactermappingtothemostlikelyPinyin syllable.Buttheresultingerrorratewasunaccept able because both steps are ambiguous and context informationins ecessarytoresolvetheequivocatio ns.In thesegmentationprocessthereexistsnosafedecis ion rulewhentosplittwoChinesecharactersbecausen early everypartofaChinesewordwithmorethantwo charactershasitsownmeaning.Furthermoretheend ing charactercanbecombinedwiththebeginningcharac ter of thefollowing w ord r esulting i c n orrect words.            E.g.:     



Segmentation Pinyinconversion

Table1Performance : ow f ordsegmentationandPinyi conversion

Figure 1:ThePinyin converter

 

between1and10syllableswithanaveragelengtho f about2syllablesperword,whichissimilartowor d unitsinIndo-Europeanlanguages.Thislengthias good tradeoffbetweenausefulstringlengthforacoust ic disambiguationandlanguagemodelingcontextaswel l asalimitedsizeorfesultingvocabularyunits.Si ncethe Pinyinreflectsthepronunciationothe f spokencha racter wecanusethePinyinconversiontoolforcreating a pronunciation dictionary by simply adding some pronunciation rules for exceptionalcases. Thisconvertergivesonly1.5%Pinyinerrorrate comparedtohandeditedtextsonPeoplesDailycorp us. Word segmentation faults, mostly resulting from incorrectsplittingpropernames,arenoseriouspr oblem for preparing training dataandpronunciation dicti onary.

the c eword labeled e ng

Pinyinconverter

segmented Pinyintext

rulebased learning

ruleset

Figure 2:TheChinesecharacters conversion Withabouthalfamillionrulesweachieveanerror of3.2%.Asanapproximateover-allerrorratewe

rate can

addthiserrorratetotherecognitionerrorrate. worstcaserecognitionerrorscaninfluencethecon fortheconversionanditispossiblethatwecang largererrorratethanthesumofthetwosingleer rates.Butoftentherecognitionandtheconversion reportsthesameerrorsothatoneerroriscounted TheerrordifferencebetweenthePinyinhypothesis theChinesecharacter outputof thebestrecognizer than 2.6%.

3

Inthe text etan ror twice. and isless

THEGLOBALPHONE DATABASE

Allexperimentshavebeencarriedoutintheframew ork oftheGlobalPhoneproject.Theaim ofthisproject isthe developmentofamultilingualrecognitionengine.F or thispurposealargespeechdatabasehasbeencolle cted whichcurrentlyconsistsof13languages,namely Arabic,MandarinandWuChinese,Croatian,German, Japanese, Korean, Portuguese, Russian, Spanish, Swedish,Tamil,andTurkish.Foreachlanguageabou t 100nativespeakerswereaskedtoread20minuteso f newspaperarticles.Theirspeechwasrecordedinof fice quality, with a close-speaking microphone. The GlobalPhone corpus is fully transcribed including spontaneouseffectslikefalsestartsandhesitatio ns. FurtherdetailsotfheGlobalPhoneprojectaregive nin [1]. TheMandarinpartofthemultilingualdatabasewas collectedinthreeplacesinnorth,middleandsout h mainland china. The different places ensure a widespreadcoloringoMandarin f dialect.Wetriedt oget anuniformdistributionoages f (between18and65) and educationlevels.Ourdatabaseconsistsof10214 utteranceswithatotallylengthof28.6hoursosf peech spokenb1y32speakersoboth f gender.112speakers are usedfortraining,10speakerseachforthedevelop ment andevaluation testset. Totrainthelanguagemodelweused82.5Miowords fromPeoplesDailyandXinhuanewspaper.Thetrigra m perplexityotfhelanguagemodelis207.Thedictio nary has size a o17000 f wordsincludingeachsyllable, which results in aO n OV-rateo0%. f 4

THE RECOGNITIONENGINE

OurgoalistointegratetheresultingChinesereco systemintomultilingual a speechrecognizerframew Todothisinaneasywaywedecidedtousesimilar preprocessing andacousticmodeling for alllanguag Duringthepreprocessingthedimensionalityofthe featuresetisreducedtothefirst24LDAparamete the13melcepstralcoefficients,power,zerocrossi theirfirstandsecondderivativescalculatedfrom sampledinputspeech. Thesystemisafullycontinuous3-stateHMMwith emissionprobabilitymodeledbyamixtureof16 Gaussianswith diagonalvariances.

gnition ork. es. rsof ngand 16kHz

4.1

Bootstrapping

Forbootstrappingwegeneratedamappingfroma multilingualsystemincludingEnglish,German,Span ish andJapanesephonemes [11]andwrotelabelsforthe trainingdata.Inthenextstepwienitializedthe Gaussian codebooks with k-means and trained along the previouslycreatedlabelsandagainwrotelabelswi ththe resultingsystem.Werepeatedthisproceduresevera l timesuntilthiscontextindependentsystemreached its performancemaximum. Foracontextdependentsystemthepolyphonictree of alloccurringquintphones(containingcross-wordmo dels withuptoonephonemelookaheadtoadjacentwords) hasbeenclustereddownto3000codebooksbyusing linguistic motivated questions about the phonetic context. 4.2

Syllablevs.phonemeunits forspeech recognition

Wedevelopedatooltogenerateparametriccontroll phonemesetsandcorrespondingdictionaries.Using toolwecomparedtwopromisingmappingsfrom phoneticinformationtoacousticmodels.Inthefir wehavemappedforeveryPinyinsyllablethebeginn consonant,themiddlevocalconstructandtheendin consonantonitsownacousticmodel.Forthemiddle vowelconstructwedistinguishbetweenfivediffere tonalinformation.

ed this stcase ing g nt

P # honemes Beginnings Middle vowelwithtone Middle diphthongswithtone Middle triphthongswithtone Endings å

21 35 63 19 3 141

Table 2Composition : othe f firstphonemeset TheintersyllablecoarticulationinChineseisonl reflectedinminordegreecomparedwithIndo-Europe languageslikeEnglish.Associatedwiththefactth thereareonlyabout1300frequentlyusedsyllables includingthetonalinformation,wedecidedtouse acousticmodelsbasedonthewholesyllableinthe secondcase. Thefirstcontextdependentsystemoutperformsthe correspondingphonemebasedsystemby29.1%to 30.8%errorrateonthewordbasedPinyinhypothesi Whilebuildingacontextindependentsystemsome severerun-timeandmemoryconsumptionproblems arose,causedbythehugeamountofmorethan1000 acousticmodels.Thisforcedustobreakthefurthe developmentotfhesyllablebasedcontextindepende systemuntilwehavechangedtheJRTk.Theexpectat foraperformanceincreaseisnotaslargeasfort phoneme based system, because the inter-syllable coarticulationismuchsmallerthantheintra-sylla coarticulationfor Mandarin Chinese.

y an at

s.

r nt ion he ble

4.3

Tonalinformation

Additionally,twodifferentacousticmodelingofth tonalinformationwerecompared.Besidestheimplic modelingofthetonalinformationthroughtraining mostly5differentphonemesforavocalconstruct, performedanapproachbyexplicitdetectingpitch information and adding 18 generated pitch characteristicstothefeaturevectorbeforeperfor LDAensuringthatatleast6pitchparametersarel theresultingfeaturevector. 4.4

e it of we

ming eftin

VTLN

InVocalTractLengthNormalization(VTLN)lainear or nonlinearfrequencytransformationcompensatesfor differentvocaltractlengths [10].Findinggoodestimates forthespeakerspecificwarpparametersisacriti cal issue.ForVTLN,wekeepthedimensionconstantand warpthetrainingsamplesoef achspeakersuchthat the LinearDiscriminantisoptimized.Althoughthatcri terion dependsonalltrainingsamplesofallspeakersit can iteratively providespeaker specificwarpfactors. Bytrainingwithspeakerspecificwarpfactorsand estimatinggoodwarpfactorsfortestingwceandec rease theerror aboutmorethan 1%. 4.5

Results

ThetablebelowshowstheprogressoftheMandarin system. PWEisworderrorratebasedonthePinyin hypothesis,while CWEisworderrorratebasedonthe Chinesecharacterhypothesis.Thewordrecognition error rates arereportedfor better comparison tor esultsof otherlanguages.However,tocompareoursystemto otherChinesesystems,thecommonlyusedcharacter basederrorratesarepresentedinthelastcolumn (CCE) which i15.0% s for our currently bestsystem. Systems Firstbootstrappedversion Data correction Pinyintoolimprovements Full training setandlarge LM: 112 train speaker + 82.5 LM Contextdependentsystem Speaker normalization Including explicit pitch Currently bestsystem Bestsyllable basedsystem

PWE 50.0% 43.0% 34.0% 30.8%

CWE -

24.1% 22.9% 21.8% 24.3% 16.1% 20.7% 23.3% 15.0% 27,7% 31.2% 21.3%

CCE -

Table 3System : performance 5

CONCLUSION

Ourexperienceshowsthatanonnativedevelopercan portasystemtoanewlanguagewithinlessthan6 months.Ourcurrentlybestsystemshows15.0%erro rateoncharacteroutput.Furthermore,iturnsout JRTkeasycanbaedaptedtobue sedinseveraldiff languages.Thisisanimportantresultconcerningt

r that erent he

integration into the framework of a multilingual recognizer.WhilebuildingtheChineserecognizerw haveimplementedmethodstoautomateimportantstep buildingnew a recognizerfromscratch.Furthermore portedourJanusRecognitionToolkit(JRTk)tothe Windows platform. 6

e s we

ACKNOWLEDGMENTS

Theauthorswishtothankallmembersothe f Intera SystemsLaboratoriesespeciallyTianshiWei,JingW andJiaxingWengfromtheGlobalPhoneteamfor collectingandvalidatingtheChinesedatabase,and Schubertfor implementing thepitch tracker.

ctive ang Kjell

REFERENCES [1] Schultzeal: t LanguageIndependentandLanguage AdaptiveLVCSR.ICSLP1998. [2] Chen, C.J., Gopinath R.A., Monkowski M.D., PichenyM.A.,andShenK.: NewMethodsin Continuous Mandarin Speech Recognition, Eurospeech 97Rhodes 1543-1546 [3] Lyu,R-Y.eat l: GoldenMandarin(III)-AUseradaptive Prosodic-Segment-Based Mandarin DictationMachineforChineseLanguagewithVery LargeVocabulary.ICASSP95,vol1, pp57. [4] Ma,B.,Huang,T.,Xu,B.,Zhang,X.,andQu,F.: Context-dependent Acoustic Models for ChineseSpeechRecognition,ICASSP96,Atlanta, pp455-458. [5] Zhan,P.,Wegmann,Steven,Lowe,Steve: Dragon Systems1997MandarinBroadcastnewsSystem, DarpaWorkshop,Lansdowne,1998. [6] Gao,Yuqing;Hon,Hsiao-Wuren;Lin,Zhiwei; Loudon,Gareth;Yaganantha,S.;Yuan,Baosheng : Tangerine:ALargeVocabularyMandarinDictation System,ICASSP95,Detroit(Volume1Page , 77) [7] Lyu,Ren-Yuan;u.a. :GoldenMandarin(III)–A User-AdaptiveProsodic-Segment-BasedMandarin DictationMachineforChineseLanguagewithVery LargeVocabulary,ICASSP95,Detroit(Volume1, Page57) [8] Ma,Bin;Huang,Taiyi;Xu,Bo;Zhang,Xijun;Qu, Fei: Context-Dependent Acoustic Models for ChineseSpeechRecognition,ICASSP96,Atlanta (Volume1Page , 455) [9] Ho,Tai-Hsuan;Yang,Kae-Cherng;Huang,KuoHsun;Lee,Lin-Shan Improved : searchstrategyfor large vocabulary continuous mandarin speech recognition,ICASSP98,Seattle(Volume2,Page 825, Paper number 2381) [10] MartinWestphal,TanjaSchultz,AlexWaibel: LinearDiscriminant-anewmethodforspeaker normalisation,ProceedingsoftheICSLP 98, Sydney,Australia,1998. [11] TanjaSchultzandAlexWaibel: FastBootstrapping ofLVCSRSystemswithmultilingualPhoneme Sets,ProceedingsotfheEurospeech'97Vol.1pp 371-373, Rhodes,Greece,September 1997.