AN ALGORITHM FOR AUTOMATIC GENERATION OF MANDARIN PHONETIC BALANCED CORPUS

Jyh-Shing Shyuu and Jhing-Fa Wang
Department of Computer Science and Information Engineering, National Cheng Kung University, Tainan, Taiwan, ROC

ABSTRACT

This paper proposes an algorithm for automatic generation of a Mandarin phonetic balanced corpus. The design of a phonetic balanced corpus is particularly important when collecting a continuous speech database, because it reduces the impact of co-articulation effects in continuous speech recognition (CSR) [1,2,3]. Traditionally, a balanced corpus is generated manually or semi-automatically [4]. Our proposed algorithm tries to find a minimum number of sentences from a large text corpus while ensuring that the 408 Mandarin base syllables (without tonal information) and the 38*22 co-articulations between vowels and consonants all appear in the extracted sentences. The automatic generation of a balanced corpus can also be treated as a covering problem. In other words, the objective is to find the smallest set of sentences from a text corpus that covers all the syllables and co-articulations. If the average number of syllables in a sentence is N, each sentence gives 2*N-1 coverings (N syllables and N-1 co-articulations). The theoretical minimum number of balanced sentences is therefore (408+38*22) / (2*N-1), rounded up. For example, with N=6 this is 1244 / 11, or about 113.1, so the minimum number of balanced sentences is 114.
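The bound quoted in the abstract can be checked with a few lines of Python. This is only an illustrative sketch: the function name and its default arguments are our own labels for the quantities in the formula (408 base syllables, 38*22 co-articulations), not part of the original work.

```python
import math

def theoretical_min_sentences(avg_syllables, n_syllables=408,
                              n_coarticulations=38 * 22):
    """Lower bound on the number of balanced sentences.

    A sentence with N syllables contributes at most N syllable coverings
    and N-1 co-articulation coverings, i.e. 2*N-1 coverings in total.
    """
    coverings_per_sentence = 2 * avg_syllables - 1
    total_units = n_syllables + n_coarticulations
    return math.ceil(total_units / coverings_per_sentence)

print(theoretical_min_sentences(6))  # 114
```

The bound assumes that no sentence repeats a unit already covered elsewhere, so a real corpus will normally need more sentences than this.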

1. INTRODUCTION

It is well known that collecting a large speech database, to train the acoustic model or to evaluate system performance, is the first step in developing a large-vocabulary continuous speech recognition system. The design of the phonetic balanced corpus determines which speech sentences should be collected so that each acoustic recognition unit can be well trained. Hence, a phonetic balanced corpus should contain at least the following information. First, each acoustic recognition unit must appear in the balanced corpus uniformly. Second, the co-articulations between acoustic recognition units must be included so that the co-articulation effect can also be trained into each acoustic model. For example, in Mandarin speech recognition, the commonly used recognition units include 22 consonants and 38 vowels. Theoretically, [...] co-articulations should be included in the training corpus. However, only 408 syllables are valid phonic combinations (consonant + vowel). Hence, 408+38*22 combinations should be included in the training corpus.

In the past, the phonetic balanced corpus was designed manually or semi-automatically [4]. When collecting a speech database, it is preferable to provide many training corpus sets so that different co-articulations can be collected. If the balanced corpus is designed manually, designing the training corpus takes a great deal of effort. Hence, automatic generation of the balanced corpus is necessary. The automatic generation of a balanced corpus can also be considered as a covering problem. In other words, the objective is to find the corpus set with the minimum number of sentences from a large text corpus, such that the balanced set covers all the co-articulations between acoustic recognition units.

This paper is organized as follows. In Section 2, we go through the details of our proposed algorithm for automatic generation of a balanced corpus. In Section 3, we show our experimental results. The conclusion is given in Section 4.

2. ALGORITHM FOR AUTOMATIC GENERATION OF BALANCED CORPUS

Because most of the available corpora are composed of text sentences, we convert each text sentence into a phonetic string by a word segmentation algorithm, as shown in Fig. 1. Basically, the word segmentation algorithm uses a Viterbi search to determine the most likely word sequence based on a bigram language model. The phonetic taggings are then found from the word dictionary.

Figure 1. Word Segmentation Algorithm. (Flowchart blocks: Text Sentence, Build Word Lattice, Viterbi Search, Word Sequence with Phonetic Taggings; resources: Word Dictionary, Bigram Model.)

As the phonetically tagged sentences can be obtained from the text corpus by the word segmentation algorithm, we propose an algorithm that tries to find a minimum number of phonetic balanced sentences. The steps of the algorithm are given in the remainder of this section.
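As an aside on the word segmentation stage of Fig. 1, the following Python sketch shows one way a Viterbi search over a word lattice with a bigram model could produce a word sequence with phonetic taggings. It is a minimal illustration, not the authors' implementation: the lexicon entries, the bigram probabilities, the `FLOOR` value for unseen word pairs, and the function name `segment` are all hypothetical.

```python
import math

# Hypothetical toy word dictionary: word -> toneless base syllables.
LEXICON = {
    "我": ["wo"], "們": ["men"], "我們": ["wo", "men"],
    "喜": ["xi"], "歡": ["huan"], "喜歡": ["xi", "huan"],
    "音": ["yin"], "樂": ["le"], "音樂": ["yin", "yue"],
}
# Hypothetical bigram probabilities P(word | previous word).
BIGRAM = {("<s>", "我們"): 0.5, ("我們", "喜歡"): 0.4, ("喜歡", "音樂"): 0.3}
FLOOR = 1e-4  # assumed probability for word pairs missing from BIGRAM

def segment(sentence):
    """Return (words, syllables) for the most likely segmentation."""
    n = len(sentence)
    # best[(end position, last word)] = (log probability, back-pointer)
    best = {(0, "<s>"): (0.0, None)}
    for end in range(1, n + 1):
        for start in range(end):
            word = sentence[start:end]
            if word not in LEXICON:      # lattice edges come from the dictionary
                continue
            for (pos, prev), (score, _) in list(best.items()):
                if pos != start:
                    continue
                cand = score + math.log(BIGRAM.get((prev, word), FLOOR))
                key = (end, word)
                if key not in best or cand > best[key][0]:
                    best[key] = (cand, (pos, prev))
    finals = [k for k in best if k[0] == n]
    if not finals:
        raise ValueError("sentence cannot be segmented with this dictionary")
    key = max(finals, key=lambda k: best[k][0])
    words = []
    while key != (0, "<s>"):             # follow back-pointers to the start
        words.append(key[1])
        key = best[key][1]
    words.reverse()
    syllables = [s for w in words for s in LEXICON[w]]
    return words, syllables

words, syllables = segment("我們喜歡音樂")
print(words)      # ['我們', '喜歡', '音樂']
print(syllables)  # ['wo', 'men', 'xi', 'huan', 'yin', 'yue']
```

Keeping the previous word in the search state is what allows the bigram score to be applied along each lattice path; the dictionary lookup at the end supplies the phonetic tagging of each chosen word.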





1. Set up a covering table that represents the covering status for the currently selected balanced sentences.

2. Find essential sentences in the corpus.
   2.1 Scan all sentences in the corpus and find the essential sentences. For a given syllable or co-articulation, if there is only one sentence that covers it, that sentence is called an essential sentence. If there is no sentence that covers it, the text corpus is insufficient to cover all balanced information. In such a case, we append to the text corpus sentences that cover the given syllable or co-articulation and go back to Step 2.

3. Randomly select sentences into the balanced corpus to form a cover.
   3.1 Select the essential sentences into the balanced corpus and update the covering table.
   3.2 Randomly select a non-essential sentence into the balanced corpus if its redundancy is less than a threshold. The redundancy of a sentence is defined in Step 5.
   3.3 Repeat Step 3.2 until the balanced corpus covers all the balanced information.

4. Select a balanced corpus with the minimum number of sentences.
   4.1 Repeat Step 3 as many times as possible to get a number of balanced corpora.
   4.2 Select the balanced corpus with the minimum number of sentences from these balanced corpora for further processing.

5. Replace redundant sentences in the balanced corpus.
   5.1 For the remaining sentences that are not in the balanced corpus, estimate the redundancy (overhead), and try to replace a sentence in the balanced corpus if the redundancy is less than that of one of the sentences in the balanced corpus. The detailed procedure is given by the following program code:

       for k = 1 to number of balanced sentences
           temporarily remove balanced_sentence(k) from the balanced corpus and update the covering table
           compute overhead(balanced_sentence(k)) for balanced sentence k
           reinsert balanced_sentence(k) into the balanced corpus and update the covering table
       loop k
       sort balanced sentences by the overhead value
       return

       overhead(sentence s)
       {
           overhead = 0
           for i = 1 to number of syllables in sentence s
               if number of distribution for syllable(i) in covering table > 0 then
                   overhead = overhead + 1
               endif
           loop i
           for i = 1 to (number of syllables in sentence s) - 1
               if number of distribution of (vowel(i), consonant(i+1)) in covering table > 0 then
                   overhead = overhead + 1
               endif
           loop i
           return overhead
       }

6. Remove redundant sentences from the balanced corpus.
   6.1 Scan each sentence in the balanced corpus and try to remove the sentence if the remaining sentences still form a cover. Eventually, the set of the remaining sentences is the balanced corpus.

It is noted that in Step 4 we use a heuristic method to construct many balanced corpora (in the experiments, we construct [...] balanced corpora). This is because the number of balanced sentences is highly influenced by the parsing order of the text sentences. If we construct only one balanced corpus in Step 4, the algorithm must be executed recursively to reduce the effect of the sentence parsing order, i.e., by first parsing the balanced sentences obtained from each iteration and then the large text corpus. In fact, we can combine the heuristic method and the recursive method to further reduce the number of balanced sentences.

In Step 5, balanced sentences are sorted by the overhead value. In calculating the overhead of a sentence in the balanced corpus, we must first remove the sentence and then compute the overhead value with the remaining sentences in the balanced or unbalanced corpus.

3. EXPERIMENTS AND DISCUSSIONS

In the experiments, we vary the number of balanced corpora so that we can determine a reasonable number of balanced corpora to use in Step 4. The experimental results are shown in Table 1.

Table 1. Number of balanced corpora used in Step 4 and number of balanced sentences obtained by the algorithm. (Column headings: Number of Balanced Corpora; Number of Balanced Sentences.)

Based on the algorithm, we have found [...] balanced sentences from a text corpus with [...] sentences. As mentioned above, the theoretical minimum number of balanced sentences is [...]. The average utility rate for balanced sentences is [...] in our balanced corpus. The utility rate can be further increased by enlarging the text corpus. For example, a text corpus with more than [...] sentences can give a higher utility rate, because it may include additional co-articulations that do not appear in the small text corpus. The balanced sentences are shown in the Appendix. As a different balanced corpus can be constructed by changing the text corpus, we can generate many different balanced corpus sets for collecting the speech database for continuous speech recognition.

4. CONCLUSION

In this paper, a new algorithm for automatic generation of a balanced corpus is proposed. The experimental result shows that our algorithm can generate an acceptable balanced corpus with [...] average utility rate. Besides, it is very easy to modify the algorithm to generate a balanced corpus in which each item of balanced information has more than one distribution. This is particularly useful for collecting a speaker-independent speech database. The algorithm can also freely generate a number of different balanced corpora for different purposes, such as one for a training database and another for a testing database. Moreover, our algorithm can be easily applied to other languages (English, Japanese, German, etc.) by replacing the text corpus and the phonetic table.

5. REFERENCES

1. Ching-Hsiang, Lin-Shan Lee, "An Initial Study on Large Vocabulary Continuous Mandarin Speech Recognition," Proceedings of ICS 1990, pp. 981-986, Dec. 1990.
2. L.R. Rabiner, J.G. Wilpon, and B-H. Juang, "A segmental K-means training procedure for connected word recognition," AT&T Technical Journal, 65(3), 2131, 1986.
3. Lin-Shan Lee, Chiu-Yu Tseng, Hun-yan Gu, Fu-Hua Liu, Chen-Hao Chang, Yueh-Hing Lin, Yumin Lee, Shih-Lung Tu, Shew-Heng Hsieh, and Chian-Hung Chen, "Golden Mandarin (I)-A Real-Time Mandarin Speech Dictation Machine for Chinese Language with Very Large Vocabulary," IEEE Trans. Speech and Audio Processing, Vol. 1, No. 2, pp. 158-179, 1993.
4. S.M. Wu, J.S. Liau, "On the Creation of Mandarin Phonetic Balanced Sentences," Telecommunication Journal, Vol. 19, No. 1, pp. 79-87, Mar. 1990.
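As a supplement to Section 2, the covering procedure can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the authors' code: a sentence is assumed to be a list of (consonant, vowel) pairs, the target set defaults to every unit found in the corpus rather than the full 408 syllables and 38*22 co-articulations, the redundancy threshold is relaxed whenever no candidate passes it so that the loop always terminates, redundant sentences are tried for removal in descending overhead order, and the replacement pass of Step 5 is omitted. All function names are hypothetical.

```python
import random
from collections import Counter

def units_of(sentence):
    """Units covered by a sentence given as [(consonant, vowel), ...]:
    each base syllable, plus each co-articulation between the vowel of
    syllable i and the consonant of syllable i+1."""
    units = {("syl", c + v) for c, v in sentence}
    units |= {("co", v1, c2) for (_, v1), (c2, _) in zip(sentence, sentence[1:])}
    return units

def essential_sentences(corpus, targets):
    """Step 2: indices of sentences that are the sole cover of some unit."""
    owners = {u: [] for u in targets}
    for idx, sent in enumerate(corpus):
        for u in units_of(sent) & targets:
            owners[u].append(idx)
    if any(not idxs for idxs in owners.values()):
        raise ValueError("text corpus is insufficient; append more sentences")
    return {idxs[0] for idxs in owners.values() if len(idxs) == 1}

def random_cover(corpus, targets, rng, threshold=2):
    """Step 3: essential sentences first, then random low-overhead ones."""
    table, chosen = Counter(), set()          # table is the covering table
    for idx in essential_sentences(corpus, targets):
        chosen.add(idx)
        table.update(units_of(corpus[idx]))
    limit = threshold
    while any(table[u] == 0 for u in targets):
        candidates = [i for i in range(len(corpus)) if i not in chosen
                      and any(table[u] == 0 for u in units_of(corpus[i]) & targets)]
        rng.shuffle(candidates)
        # Overhead: units of the candidate that the table already covers.
        pick = next((i for i in candidates
                     if sum(table[u] > 0 for u in units_of(corpus[i])) < limit), None)
        if pick is None:
            limit += 1    # assumption: relax the threshold to guarantee progress
            continue
        chosen.add(pick)
        table.update(units_of(corpus[pick]))
    return chosen

def prune(corpus, chosen, targets):
    """Step 6: drop sentences whose removal still leaves a cover."""
    table = Counter()
    for i in chosen:
        table.update(units_of(corpus[i]))
    def overhead(i):
        # Same value as removing sentence i and counting its covered units.
        return sum(table[u] > 1 for u in units_of(corpus[i]))
    kept = set(chosen)
    for i in sorted(chosen, key=overhead, reverse=True):
        units = units_of(corpus[i])
        if all(table[u] > 1 for u in units & targets):
            kept.discard(i)
            table.subtract(units)
    return kept

def balanced_corpus(corpus, targets=None, trials=10, seed=0):
    """Steps 3, 4 and 6: build several random covers, keep the smallest, prune."""
    if targets is None:
        targets = set().union(*(units_of(s) for s in corpus))
    rng = random.Random(seed)
    covers = [random_cover(corpus, targets, rng) for _ in range(trials)]
    return sorted(prune(corpus, min(covers, key=len), targets))

corpus = [
    [("w", "o"), ("m", "en")],
    [("w", "o"), ("m", "en"), ("x", "i")],
    [("x", "i"), ("h", "uan")],
    [("h", "uan"), ("w", "o")],
]
print(balanced_corpus(corpus))  # [1, 2, 3]
```

In this toy corpus, sentences 1, 2 and 3 each hold a co-articulation found nowhere else, so they are essential and already form a cover; sentence 0 is never needed because sentence 1 contains all of its units.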

APPENDIX Balanced Sentences Listing

[Listing of the balanced sentences in Chinese; the characters are not recoverable from the source encoding.]