AN ALGORITHM FOR AUTOMATIC GENERATION OF MANDARIN PHONETIC BALANCED CORPUS

Jyh-Shing Shyuu and Jhing-Fa Wang

Department of Computer Science and Information Engineering, National Cheng Kung University, Tainan, Taiwan, ROC
ABSTRACT

This paper proposes an algorithm for the automatic generation of a Mandarin phonetically balanced corpus. The design of a phonetically balanced corpus is particularly important when collecting a continuous speech database, in order to reduce co-articulation effects in continuous speech recognition (CSR) [1,2,3]. Traditionally, a balanced corpus is generated manually or semi-automatically [4]. The proposed algorithm tries to find a minimum number of sentences from a large text corpus while ensuring that the 408 Mandarin base syllables (without tonal information) and the 38*22 co-articulations between vowels and consonants are distributed in the extracted sentences. The automatic generation of a balanced corpus can also be treated as a covering problem: the objective is to find the set with the minimum number of sentences that covers all the syllables and co-articulations from a text corpus. If the average number of syllables in a sentence is N, each sentence gives 2*N-1 coverings (N syllables and N-1 co-articulations). The theoretical minimum number of balanced sentences is therefore (408+38*22) / (2*N-1). For example, for N=6 this is 1244/11, which rounds up to a minimum of 114 balanced sentences.

1. INTRODUCTION

It is well known that collecting a large speech database, to train the acoustic models or to evaluate system performance, is the first step in developing a large-vocabulary continuous speech recognition system. The design of a phonetically balanced corpus determines which speech sentences should be collected so that each acoustic recognition unit can be well trained. Hence, a phonetically balanced corpus should contain at least the following information. First, each acoustic recognition unit must appear in the balanced corpus uniformly. Second, the co-articulations between acoustic recognition units must be included so that the co-articulation effect can also be trained into each acoustic recognition model. For example, in Mandarin speech recognition, consonants and vowels are commonly used as the recognition units. Theoretically, the co-articulations between all of these units should be included in the training corpus. However, only 408 syllables are valid consonant-vowel combinations. Hence, it is the 408 syllables together with the 38*22 vowel-consonant co-articulations that should be included in the training corpus.

In the past, the phonetically balanced corpus was designed manually or semi-automatically [4]. When collecting a speech database, it is preferable to provide many training corpus sets so that different co-articulations can be collected. If the balanced corpus is designed manually, a lot of effort goes into designing the training corpus. Hence, automatic generation of the balanced corpus is necessary. The automatic generation of a balanced corpus can also be considered as a covering problem. In other words, the objective is to find, from a large text corpus, the corpus set with the minimum number of sentences such that the balanced set covers all the co-articulations between acoustic recognition units.

This paper is organized as follows. In Section 2, we go through the details of our proposed algorithm for automatic generation of a balanced corpus. In Section 3, we show our experimental results. The conclusion is given in Section 4.

2. ALGORITHM FOR AUTOMATIC GENERATION OF BALANCED CORPUS

Because most of the available corpora are composed of text sentences, we convert each text sentence into a phonetic string by a word segmentation algorithm, as shown in Fig. 1. Basically, the word segmentation algorithm uses a Viterbi search to determine the most likely word sequence based on a bigram language model. The phonetic taggings are then found from the word dictionary.

[Figure 1. Word Segmentation Algorithm. Blocks: Text Sentence; Build Word Lattice; Word Dictionary; Viterbi Search; Bigram Model; Word Sequence with Phonetic Taggings.]

As the phonetically tagged sentences can be obtained from the text corpus by the word segmentation algorithm, we propose an algorithm that tries to find a minimum number of phonetically balanced sentences.
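The word segmentation stage of Figure 1 can be illustrated with a minimal sketch. It is not the authors' implementation: the dictionary entries, the bigram log-probabilities, the `UNSEEN_LOGP` floor and the function names `segment` and `phonetic_string` are all hypothetical and chosen only to make the example self-contained. The sketch builds the word lattice implicitly (every dictionary word that starts at a given character position), runs a Viterbi search whose state is the last word at each position, and then reads the syllables of each word from the dictionary.

```python
# Hypothetical toy word dictionary: word -> base syllables (tones omitted).
WORD_DICT = {
    "北": ["bei"], "京": ["jing"], "北京": ["bei", "jing"],
    "大": ["da"], "学": ["xue"], "大学": ["da", "xue"],
    "生": ["sheng"], "学生": ["xue", "sheng"],
}

# Hypothetical bigram log-probabilities; "<s>" marks the sentence start.
BIGRAM_LOGP = {
    ("<s>", "北京"): -1.0,
    ("北京", "大学"): -1.0,
    ("大学", "生"): -3.0,
    ("北京", "大"): -4.0,
    ("大", "学生"): -2.0,
}
UNSEEN_LOGP = -10.0  # assumed floor score for word pairs missing from the table


def segment(sentence):
    """Return the most likely word sequence under the toy bigram model."""
    n = len(sentence)
    max_len = max(len(word) for word in WORD_DICT)
    # states[pos][word] = (best log-probability of a path whose last word is
    # `word` and ends at character position `pos`, back-pointer).
    states = [dict() for _ in range(n + 1)]
    states[0]["<s>"] = (0.0, None)
    for start in range(n):
        for prev_word, (prev_score, _) in states[start].items():
            # Lattice edges: every dictionary word that begins at `start`.
            for end in range(start + 1, min(n, start + max_len) + 1):
                word = sentence[start:end]
                if word not in WORD_DICT:
                    continue
                score = prev_score + BIGRAM_LOGP.get((prev_word, word), UNSEEN_LOGP)
                if word not in states[end] or score > states[end][word][0]:
                    states[end][word] = (score, (start, prev_word))
    if not states[n]:
        raise ValueError("sentence cannot be segmented with this dictionary")
    # Backtrack from the best final state to the sentence start.
    pos, word = n, max(states[n], key=lambda w: states[n][w][0])
    words = []
    while word != "<s>":
        words.append(word)
        pos, word = states[pos][word][1]
    return words[::-1]


def phonetic_string(sentence):
    """Segment the sentence, then look up the syllables of each word."""
    return [syl for word in segment(sentence) for syl in WORD_DICT[word]]


print(segment("北京大学生"))
print(phonetic_string("北京大学生"))
```

Keeping one Viterbi state per (position, last word) pair is what lets the bigram score depend on the preceding word; a real system would replace the toy tables with a full dictionary and a trained language model.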
The algorithm is described as follows.

1. Set up a covering table that represents the covering status for the currently selected balanced sentences.

2. Find essential sentences in the corpus.
   Scan all sentences in the corpus and find the essential sentences. For a given syllable or co-articulation, if there is only one sentence that covers it, that sentence is called an essential sentence. If there is no sentence that covers a given syllable or co-articulation, the text corpus is insufficient to cover all balanced information. In such a case, we append to the text corpus sentences that cover the given syllable or co-articulation and scan the corpus again.

3. Select the essential sentences into the balanced corpus and update the covering table.

4. Randomly select sentences into the balanced corpus to form a cover.
   Randomly select a non-essential sentence into the balanced corpus if its redundancy is less than a threshold. The redundancy of a sentence is defined in Step 7. Repeat Step 4 until the balanced corpus covers all the balanced information.

5. Remove redundant sentences from the balanced corpus.
   Scan each sentence in the balanced corpus and try to remove it if the remaining sentences still form a cover. Eventually, the set of remaining sentences is the balanced corpus.

6. Select a balanced corpus with the minimum number of sentences.
   Repeat the preceding steps as many times as possible to obtain a number of balanced corpora. In the experiments, we construct a number of such balanced corpora. Select the balanced corpus with the minimum number of sentences from these balanced corpora for further processing.

7. Replace redundant sentences in the balanced corpus.
   For the remaining sentences that are not in the balanced corpus, estimate the redundancy (or overhead) and try to replace a sentence in the balanced corpus if its redundancy is less than that of one of the sentences in the balanced corpus. The detailed procedure is given by the following program code.

    for k = 1 to #balanced_sentences
        remove balanced_sentence[k] temporarily from the balanced corpus and update the covering table
        compute overhead(balanced_sentence[k]) for balanced_sentence[k]
        reinsert balanced_sentence[k] into the balanced corpus and update the covering table
    loop k
    sort balanced sentences by the overhead value
    return

    overhead(sentence s)
    {
        overhead = 0
        for i = 1 to number of syllables in sentence s
            if number of distribution for syllable[i] in covering table > 0 then
                overhead = overhead + 1
            endif
        loop i
        for i = 1 to number of syllables in sentence s - 1
            if number of distribution of (vowel[i], consonant[i+1]) in covering table > 0 then
                overhead = overhead + 1
            endif
        loop i
        return(overhead)
    }

It is noted that in Step 6 we use a heuristic method to construct many balanced corpora. This is because the number of balanced sentences is highly influenced by the parsing order of the text sentences. If we construct only one balanced corpus in Step 6, the algorithm must be executed recursively to reduce the effect of the sentence parsing order, i.e., by first parsing the balanced sentences obtained from each iteration and then the large text corpus. In fact, we can combine the heuristic method and the recursive method to further reduce the number of balanced sentences.

In Step 7, balanced sentences are sorted by the overhead value. In calculating the overhead of a sentence in the balanced corpus, we must first remove the sentence and then compute the overhead value with respect to the remaining sentences in the balanced or unbalanced corpus.

3. EXPERIMENTS AND DISCUSSIONS

In the experiments, we change the number of balanced corpora so that we can determine a reasonable number of balanced corpora to use in Step 6. The experimental results are shown in Table 1.

[Table 1. Number of balanced corpora used in Step 6 and number of balanced sentences obtained by the algorithm. Columns: Number of Balanced Corpora; Number of Balanced Sentences.]

Based on the algorithm, we have found a set of balanced sentences from the text corpus. As mentioned above, the number of balanced sentences is bounded below by the theoretical minimum, (408+38*22) / (2*N-1). The average utility rate of the balanced sentences in our balanced corpus can be further increased by enlarging the text corpus. A larger text corpus can give a higher utility rate because it may include additional co-articulations that do not appear in the small text corpus. The balanced sentences are shown in the Appendix. As a different balanced corpus can be constructed by changing the text corpus, we can generate many different balanced corpus sets for collecting the speech database for continuous speech recognition.
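The experiment of constructing several balanced corpora and keeping the smallest one can be mimicked at toy scale. The sketch below is only an illustration under stated assumptions, not the authors' program: the corpus, the function names and the use of a seeded `random.Random` are hypothetical, the essential-sentence steps are skipped, and a sentence is accepted whenever it covers anything new instead of being tested against a redundancy threshold. It does keep the two ideas discussed in this section: the result depends on the order in which sentences are parsed, so several random orders are tried and the smallest cover is kept, and the size of any cover can be compared with the covering lower bound.

```python
import math
import random

# Hypothetical toy corpus: each sentence is a list of (consonant, vowel) syllables.
CORPUS = [
    [("b", "a"), ("m", "a")],
    [("m", "a"), ("b", "a")],
    [("b", "a"), ("m", "a"), ("b", "a")],
    [("t", "i"), ("b", "a")],
    [("m", "a"), ("t", "i")],
]


def coverings(sentence):
    """Units covered by a sentence: N syllables plus N-1 co-articulations
    (vowel of syllable i followed by consonant of syllable i+1)."""
    units = {("syllable", c, v) for c, v in sentence}
    for (_, vowel), (consonant, _) in zip(sentence, sentence[1:]):
        units.add(("coart", vowel, consonant))
    return units


def covered_by(corpus, chosen):
    """Union of the units covered by the sentences with the given indices."""
    return set().union(*(coverings(corpus[i]) for i in chosen))


def build_cover(corpus, targets, rng):
    """One balanced corpus: random selection, then redundancy removal."""
    order = list(range(len(corpus)))
    rng.shuffle(order)
    chosen, covered = [], set()
    for idx in order:
        # Simplification: accept a sentence if it covers any new target unit.
        if (coverings(corpus[idx]) & targets) - covered:
            chosen.append(idx)
            covered |= coverings(corpus[idx])
    for idx in list(chosen):
        # Drop a sentence when the remaining ones still form a cover.
        rest = [j for j in chosen if j != idx]
        if targets <= covered_by(corpus, rest):
            chosen = rest
    return sorted(chosen)


def smallest_cover(corpus, trials, seed=0):
    """Build `trials` balanced corpora and keep the one with fewest sentences."""
    targets = covered_by(corpus, range(len(corpus)))
    rng = random.Random(seed)
    return min((build_cover(corpus, targets, rng) for _ in range(trials)), key=len)


def lower_bound(num_targets, avg_syllables):
    """Covering bound: each sentence gives about 2*N-1 coverings."""
    return math.ceil(num_targets / (2 * avg_syllables - 1))


print(smallest_cover(CORPUS, trials=20))
print(lower_bound(408 + 38 * 22, 6))  # 114, the figure quoted in the abstract
```

Because the toy targets are simply every unit that occurs in `CORPUS`, a cover always exists; with the paper's fixed target set of 408 syllables and 38*22 co-articulations, a missing unit would instead signal that the text corpus has to be extended.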
4. CONCLUSION

In this paper, a new algorithm for the automatic generation of a balanced corpus is proposed. The experimental results show that our algorithm can generate a balanced corpus with an acceptable average utility rate. Besides, it is very easy to modify the algorithm to generate a balanced corpus in which each item of balanced information has more than one distribution. This is particularly useful for the collection of a speaker-independent speech database. The algorithm can also freely generate a number of different balanced corpora for different purposes, such as one for a training database and another for a testing database. Moreover, our algorithm can be easily applied to other languages (English, Japanese, German, etc.) by replacing the text corpus and the phonetic table.

REFERENCES
1. Ching-Hsiang, Lin-Shan Lee, "An Initial Study on Large Vocabulary Continuous Mandarin Speech Recognition," Proceedings of ICS 1990, pp. 981-986, Dec. 1990.
2. L.R. Rabiner, J.G. Wilpon, and B-H. Juang, "A segmental K-means training procedure for connected word recognition," AT&T Technical Journal, 65(3), pp. 21-31, 1986.
3. Lin-Shan Lee, Chiu-Yu Tseng, Hun-yan Gu, Fu-Hua Liu, Chen-Hao Chang, Yueh-Hing Lin, Yumin Lee, Shih-Lung Tu, Shew-Heng Hsieh, and Chian-Hung Chen, "Golden Mandarin (I)-A Real-Time Mandarin Speech Dictation Machine for Chinese Language with Very Large Vocabulary," IEEE Trans. Speech and Audio Processing, Vol. 1, No. 2, pp. 158-179, 1993.
4. S.M. Wu, J.S. Liau, "On the Creation of Mandarin Phonetic Balanced Sentences," Telecommunication Journal, Vol. 19, No. 1, pp. 79-87, Mar. 1990.
APPENDIX: Balanced Sentences Listing

[Listing of the balanced Chinese sentences; the characters are not recoverable from the source encoding.]