Constructing Thai Phonetically Balanced Word Recognition Test in Speech Audiometry through Large Written Corpora

A. Munthuli1, P. Sirimujalin1, C. Tantibundhit1,2, C. Onsuwan2,3, K. Kosawat4, N. Klangpornkun1

1 Department of Electrical and Computer Engineering, Thammasat University, Thailand
2 Center of Excellence in Intelligent Informatics, Speech and Language Technology and Service Innovation (CILS), Thammasat University, Thailand
3 Department of Linguistics, Thammasat University, Thailand
4 National Electronics and Computer Technology Center (NECTEC), Thailand

ABSTRACT

Thammasat University Phonetically Balanced Word Lists 2014 (TU PB’14) consist of five lists of 25 Thai monosyllabic words each. TU PB’14 is constructed based on the Thai phoneme distribution [1] derived from the large-scale written Thai corpus InterBEST [2]. The two major criteria are phonetic balance and list equivalency. Phoneme occurrences in the five original Thai monosyllabic word lists (OTL) are also given and discussed. To evaluate the reliability and inter-list equivalence of the TU PB’14 and OTL lists, all 10 lists were given to 30 normal-hearing subjects to obtain discrimination scores. The discrimination scores of the five TU PB’14 lists yield a curve comparable with that of the OTL, with the former being slightly lower (more difficult). Importantly, all TU PB’14 lists exhibit a relatively equal range of difficulty. To further examine the variability of discrimination scores, we plan to conduct similar tests on hearing-impaired subjects.

Index Terms: phonetically balanced word lists, speech audiometry, TU PB’14, Thai, assessment of speech input

1. INTRODUCTION

An audiological examination generally consists of pure-tone audiometry and speech audiometry. Speech audiometry includes a series of tests: one detects a patient’s speech reception threshold (SRT), and another determines the word recognition score (WRS). One of the techniques extensively used to measure WRS is phonetically balanced word lists (PB lists) [3]. A single PB list is usually employed during a hearing examination session. Therefore, to prevent learning effects and memorization, several interchangeable test lists should be available [4]. PB lists have been created for many languages, and in addition to phonetic balance, various criteria have been taken into account, including word frequency, word familiarity, syllable structure, lexical neighbors, and equal range of difficulty [5].

Even though the original Thai monosyllabic word lists (OTL) shown in Table 1 are commonly used in hearing clinics across Thailand, the test lists have several limitations. Firstly, there is a large degree of asymmetry in phoneme occurrences among the OTL word lists (1-5). Tables 2-5 give the number of occurrences of initial consonants, vowels, tones, and final consonants across the OTL and TU PB’14 lists. Shaded areas highlight a large degree of asymmetry among the word lists in each set. It is important to note that every word list has the same number of monosyllabic words, namely 25. Secondly, there are three cases of duplicate words across different lists in OTL. This should be avoided to prevent learning effects and memorization.

Focusing on two major criteria, phonetic balance and list equivalency, our goal is to construct phonetically balanced word lists for Thai, the Thammasat University Phonetically Balanced Word Lists 2014 (TU PB’14), for use in speech audiometry. Good phonetic balance reflects, as closely as possible, the true distributions of speech sounds in the language. To the best of our knowledge, no work reports Thai phoneme distributions based on large-scale spoken corpus data. Due to the lack of a large spoken corpus for the Thai language [1], we obtained phoneme distributions from an existing large-scale written Thai corpus, InterBEST [2] [6]. The TU PB’14 lists are then constructed according to these distributions. To assure test validity and inter-list equivalency, an experiment using the TU PB’14 and OTL lists is conducted on 30 normal-hearing subjects.

The paper is organized as follows: Section 2 presents details of the design and construction of TU PB’14. Section 3 describes the experimental design and setup. Section 4 presents experimental results. Section 5 discusses the findings and future work.

2. TU PB’14 DEVELOPMENT

The design and construction of the TU PB’14 word lists are described below.

2.1 Corpus-based phoneme distributions in Thai

First, we further analyze the Thai phoneme distributions from our previous work [1]. The distributions were derived from InterBEST, the largest Thai corpus publicly available in Thailand, composed of approximately nine million words divided into 12 genres [2]. Since Thai does not use spaces between words, grapheme-to-phoneme (G2P) software [7] is needed to break a sentence into words and generate their pronunciations. Each word was transliterated using G2P. The frequency of words, the frequency of 81 Thai phonemes in each genre, and the 95% confidence intervals (CI) of the average occurrences of each phoneme were calculated. Any phonemes that did not fall within the 95% CI were tallied, and consequently three genres whose distributions were highly incompatible with the others were discarded, leaving approximately 80% of the data. Finally, phoneme distributions of initial consonants, vowels, final consonants, and tones were obtained (Table 2 of [1]).

Table 1: Original Thai monosyllabic word lists (OTL).
List 1: อ่าน (read); List 2: รถ (car); List 3: ซื้อ (buy); List 4: อ้วน (fat); List 5: ปาก (mouth); เป็ด (duck); ตาย (dead); เคาะ (knock); ยอด (peak); อุ้ม (carry); เต่า (turtle); จริง (true); ลาด (slope); ตาม (follow); ญาติ (relative); ไกล (far); ถ้วย (cup); ฉาย (show); ร้อน (hot); เร็ว (fast); ปู่ (grandfather); ผ้า (fabric); เสื้อ (shirt); แว่น (glasses); ผม (hair); ถุง (bag); ว่าว (kite); ขาด (rip); คน (human); เสื้อ (shirt); พัด (fan); บ่อ (pond); แก้ม (cheek); วัด (temple); บ้าน (home); ทำ (do); อิ่ม (full); ชั้น (layer); ไม้ (wood); ม้า (horse); ชัด (clear); ต้น (trunk); ผัด (fry); หก (six); พาน (tray); ซ่อม (fix); ดิน (soil); ดำ (black); กล้วย (banana); ลิง (monkey); แม่ (mother); จบ (end); แถว (row); ดื่ม (drink); ฉิ่ง (small cymbal); ฝ้าย (cotton); ผึ้ง (bee); คือ (be); ห้าม (prohibit); ขุ่น (opaque); แหวน (ring); ขวด (bottle); ผ่าน (pass); สาย (late); งอ (bend); จาน (dish); ปืน (gun); จันทร์ (moon); สร้อย (necklace); ช้าง (elephant); ฝน (rain); อุ่น (warm); ไทย (Thai); บุญ (merit); หนู (mouse); บัว (lotus); ง่าม (prong); ฝูง (group); เป็ด (duck); มาก (numerous); โต๊ะ (table); ช้อน (spoon); ฉาบ (cymbal); ยิง (shoot); เรือ (boat); ไฝ (mole); ราก (root); น้อย (few); ช้ำ (bruise); ยุง (mosquito); ยาม (guard); ชีพ (life); พบ (meet); หลับ (sleep); ควาย (buffalo); เงาะ (rambutan); ฟ้า (sky); ง่าย (easy); นิด (little); เด็ก (child); ดาว (star); ลุง (uncle); สิบ (ten); หอม (aromatic); หอย (clam); ก้าว (step); พระ (monk); ทอง (gold); แท้ (genuine); งู (snake); เล่น (play); ฝิ่น (opium); ฉัน (I, me); พูด (speak); เคย (ever); ฟัน (tooth); ห่าน (goose); กรรม (karma); ฟัง (listen); ฟอง (bubble); ธง (flag); ค่ำ (night); ปลา (fish); ว่าว (kite); จอบ (hoe)

2.2 Phoneme distributions in TU PB’14 word lists

Our aim is to create five lists of Thai monosyllabic words with two main features. First, all lists should be phonetically balanced according to the phoneme distributions in Thai [1]. Second, the word lists should be equivalent in difficulty. TU PB’14 contains five word lists, each composed of 25 monosyllabic words. The relative frequencies (%) of the 81 Thai phonemes are multiplied by 125 and rounded to the nearest integer. Each phoneme is then distributed as equally as possible across the lists, as shown in Tables 2-5. In addition, any extremely low-frequency phoneme that occurs only once, e.g., //, //, //, //, and //, is assigned first to the first word list and then to the others in turn.

2.3 Word selection for TU PB’14 word lists

Each of the five phonetically balanced word lists is composed of 25 meaningful monosyllabic words (CVC or CVV(C)). In word selection, emotional and objectionable words are avoided. The main criterion is to select words that are familiar and well distributed across categories, i.e., words that are learned at elementary school (the minimum educational level required for all Thais) and are used regularly in everyday communication. Therefore, the following steps are carried out:

1) Pooling words from three elementary Thai subject textbooks, resulting in 1,665 distinct words [8][9][10].
2) Picking open-syllable words with a short vowel (which phonetically end with a glottal stop) from the pooled data and distributing them to all 5 word lists. This step has to be performed first because words of this type are sparse.
3) Generating, by computer program, the most probable word from combinations of initial consonant, vowel, final consonant, and tone, and then checking whether the word is available in the pooled words. If it is, the generated word is put in the list under construction and the procedure goes to Step 4; otherwise, construction moves on to the next list.
4) Constructing each list, one at a time, while updating the remaining phonemes in the list and the remaining distinct words.
5) Terminating the program and manually constructing the remaining words (also allowing removal of an already generated word and addition of a new word if needed).
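The allocation rule of Section 2.2 can be illustrated with a short sketch. It converts relative frequencies within one phoneme category into counts out of 125, rounds them, and spreads each count over the five lists as evenly as possible. The function name `allocate_counts`, the round-robin handling of leftover units, and the five example frequencies are assumptions made for illustration; they are not taken from the authors' program or from the corpus.

```python
import math


def allocate_counts(rel_freq, n_words=125, n_lists=5):
    """Map relative frequencies (fractions summing to 1) to per-list counts.

    Assumption: leftover units are dealt round-robin, starting with list 1,
    so that a phoneme occurring only once goes to the next list in turn.
    """
    targets = {}
    next_list = 0  # index of the list that receives the next leftover unit
    for phoneme, freq in rel_freq.items():
        # Round half up; Python's built-in round() rounds halves to even.
        total = math.floor(freq * n_words + 0.5)
        base, extra = divmod(total, n_lists)
        counts = [base] * n_lists
        for _ in range(extra):
            counts[next_list] += 1
            next_list = (next_list + 1) % n_lists
        targets[phoneme] = counts
    return targets


# Hypothetical relative frequencies for one category (illustrative only).
example_freq = {"t1": 0.344, "t2": 0.216, "t3": 0.192, "t4": 0.160, "t5": 0.088}
targets = allocate_counts(example_freq)
per_list_totals = [sum(c[i] for c in targets.values()) for i in range(5)]
```

With these illustrative inputs the category totals are 43, 27, 24, 20, and 11, and every list receives 25 slots, one per word. The real lists also have to balance four categories at once, which this sketch does not attempt.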
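Steps 3 to 5 of Section 2.3 can be approximated by a greedy loop. The sketch below is a simplification rather than the authors' program: instead of generating the most probable phoneme combination and looking it up, it scans the pooled words and picks the one whose four phonemes have the most remaining quota in the current list. The function name `fill_list`, the scoring rule, and the romanized toy words are hypothetical.

```python
from collections import Counter

SLOTS = ("initial", "vowel", "final", "tone")


def fill_list(quota, pool):
    """Greedily pick pooled words whose four phonemes still have quota left.

    quota: Counter keyed by (slot, phoneme) with the remaining count for
           the list under construction; it is updated in place.
    pool:  dict mapping word -> (initial, vowel, final, tone); chosen words
           are removed so that no word is reused in a later list.
    """
    chosen = []
    while True:
        best_word, best_score = None, 0
        for word, phonemes in pool.items():
            keys = list(zip(SLOTS, phonemes))
            if all(quota[k] > 0 for k in keys):
                # Proxy for "most probable": total remaining quota.
                score = sum(quota[k] for k in keys)
                if score > best_score:
                    best_word, best_score = word, score
        if best_word is None:
            # Nothing fits any more; the rest is completed by hand (Step 5).
            return chosen
        for k in zip(SLOTS, pool.pop(best_word)):
            quota[k] -= 1
        chosen.append(best_word)


# Toy data (hypothetical): "-" stands for "no final consonant".
pool = {
    "kat": ("k", "a", "t", "low"),
    "kaa": ("k", "aa", "-", "mid"),
    "maa": ("m", "aa", "-", "mid"),
    "mit": ("m", "i", "t", "high"),
}
quota = Counter({
    ("initial", "k"): 1, ("initial", "m"): 1,
    ("vowel", "aa"): 2, ("final", "-"): 2, ("tone", "mid"): 2,
})
first_list = fill_list(quota, pool)
```

Because chosen words are popped from the shared pool, calling `fill_list` once per list with that list's quota keeps the lists free of duplicate words, which is one of the stated design goals.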

The complete TU PB’14 word lists are shown in Table 6. Notably, 92%, 82%, 80%, 72%, and 80% of the words in the first to the fifth word lists, respectively, are average- to high-frequency words (occurring above the median frequency). The word frequencies were calculated from the number of occurrences in InterBEST.

Table 2: Number of occurrences of initial phonemes in OTL (lists 1-5) and TU PB’14 (lists 1-5). Initial clusters are each considered a single unit.

Table 3: Number of occurrences of vowels, including vocalic diphthongs, in OTL (lists 1-5) and TU PB’14 (lists 1-5).

Table 4: Number of occurrences of tones in OTL (lists 1-5) and TU PB’14 (lists 1-5).

Table 5: Number of occurrences of final phonemes in OTL (lists 1-5) and TU PB’14 (lists 1-5). In isolation, short-vowel syllables with no final consonant phonetically end with a glottal stop; the lack of a final consonant is marked with its own symbol.

3. EXPERIMENTAL SETUP

An experiment was carried out to investigate whether there is any significant difference in level of difficulty among the five word lists of TU PB’14, as well as among those of OTL, and to determine whether TU PB’14 and OTL are, on average, comparable in difficulty. Discrimination scores obtained from TU PB’14 and OTL are compared to answer these questions.

To do so, 250 monosyllabic words from the five word lists of TU PB’14 and of OTL (25×5 + 25×5 = 250) were read three times and recorded at a sampling rate of 44.1 kHz in a sound-attenuated chamber by a 39-year-old Thai male speaker who was born and grew up in Bangkok. One of the three tokens of each word was then selected based on impressionistic listening and spectrographic inspection. The ten lists from TU PB’14 and OTL were carefully paired, resulting in 5 pairs with no repeated word within any pair. Each of the five pairs of word lists was then presented at one of five intensity levels: 15, 25, 35, 45, and 55 dB HL. These intensity levels were chosen based on our preliminary experiments such that floor and ceiling performance would be reached.

The psychoacoustic tests were performed individually on 30 untrained normal-hearing subjects, 15 males and 15 females, ranging in age from 19 to 23 years with a mean of 20.5 years. They were drawn from the student population at Thammasat University. All subjects passed a pure-tone screening test from 125 through 8,000 Hz at 20 dB HL in both ears, and the right ear served as the test ear. Each subject sat in a sound-attenuated chamber in an ENT clinic at Thammasat University Hospital and listened to the playback of the speech stimuli described above via headphones. Subjects had to repeat the word they had just heard. If they did not recognize the word, they had to guess before moving to the next one. The test was divided into one training session (10 stimuli) and five test sessions (50 stimuli each). A short five-minute break was given at the end of each session.

It should be noted that instead of performing a straightforward test of 250 stimuli × 5 intensity levels, which would create a test of 1,250 stimuli (considerably long and likely to cause subject fatigue and learning effects [11]), we decided to increase the number of subjects fivefold. Consequently, the total number of trials for each subject stays at 250. In all, the test sessions were equally divided by the 5 intensity levels across 6 sets of five subjects, as shown in Table 7.

4. EXPERIMENTAL RESULTS

Figure 1 illustrates the average percent correct discrimination curves of TU PB’14 and OTL, with 95% CI values that fit all 5 word lists at each intensity test level. The two curves are comparable, with the former being slightly lower, indicating that TU PB’14 is relatively more difficult than OTL. Separate F-tests were performed to determine whether there is any significant difference in level of difficulty among the 5 word lists of TU PB’14, as well as among those of OTL. Significant differences were found for two pairs of TU PB’14 word lists: lists 3 and 4 [F(1,50) = 4.5, p < 0.05], and lists 3 and 5 [F(1,50) = 5.35, p < 0.05]. In OTL, on the other hand, three pairs are significantly different: lists 1 and 3 [F(1,50) = 5.35, p < 0.05], lists 2 and 3 [F(1,50) = 6.76, p < 0.05], and lists 3 and 4 [F(1,50) = 5.75, p < 0.05].

5. DISCUSSIONS AND FUTURE WORK

Good phonetic balance, symmetrical phoneme occurrence, and relative inter-list equivalency have been achieved in the TU PB’14 lists, with no duplicate words across different lists. Tests in an experimental setting were carried out to confirm the validity of the lists. Since list 3 in TU PB’14 appears to be significantly different from the others, we plan to reanalyze the content of this list to make all the lists truly interchangeable. Finally, it should be noted that all 25 words in list 2 are common words found in the pool drawn from three elementary Thai textbooks. Therefore, they are suitable for use with the preschool population. To further examine the variability of discrimination scores, we plan to carry out similar tests on subjects with sensorineural and conductive hearing impairment, which are less homogeneous groups.

6. ACKNOWLEDGMENTS

We would like to acknowledge with much appreciation the guidance of Dr. Nida Rueangwit at Thammasat University Hospital.

Table 6: TU PB’14 word lists (* marks a word with homophones).
List 1: กัด (bite); List 2: กะ (estimate); List 3: กระ (freckle); กา (raven); กาว (glue); เกม (game); ข้า (me); ขับ (drive); ไข่ (egg); ครีบ (fin); ค้าง (remain); คอ (neck); จำ (remember); ดอย (hill); ตัน (trap); ทัน (in time); เทียม (artificial); จี้ (rob); ชัง (hate); ดื่ม (drink); เตรียม (prepare); โต๊ะ (table); น้า (aunt); แทน (replace); ใน (in); ไทย (Thai); นะ (Thai final particle); นั่ง (sit); ป่วย (sick); ปก (cover); ผ่า (chop); ฝูง (group); เพราะ (because); มือ (hand); รอด (survive); โละ (discard); ผี (ghost); ผุ (decay); ม่าน (curtain); มิด (completely covered)*; สระ (pool); ยาก (difficult); สี่ (four); โรย (sprinkle); ละ (leave); เสื้อ (shirt); หาร (divide); อูฐ (camel); ควาน (grope); หญิง (female); หุ้น (share); ไหว (be able); อวน (seine); จะ (will); List 4: กลั่น (distill); เก๊ะ (small drawer); คัน (itch)*; เจ (vegetarian); เชิง (manner); ช่อ (bouquet); แดน (territory); เตะ (kick); ถ้ำ (cave); ท้า (challenge); นี่ (these)*; บิ (break off); ประ (dot); List 5: กรรม (karma)*; การ (task)*; ขี่ (ride); ขึง (stretch); เงาะ (rambutan); ดีด (flick); จิ๋ว (tiny); ตา (eyes); แช่ (chill); ถ่าง (open); ดัก (trap); ถึก (massive); ต่อ (renew); เถาะ (rabbit year); ไทร (banyan tree); ถ่าน (coal); ธูป (joss stick); นาง (Madame); นา (field); ผัน (fluctuate); เน่า (rotten); นาย (Mister); มด (ant); ม้า (horse); แบบ (model); บ่อ (well); ป่า (forest); ปรุ (perforate); โยน (throw); ผา (cliff); โปะ (pile up); พระ (monk); พัง (collapse); รั้น (stubborn); เรือ (boat); ลอก (imitate); ลอด (pass through); วาบ (flash); สั่ง (command); ส่าย (swing); หน่วง (delay); หาว (yawn); มั่น (confident); ไม้ (wood); ยุ (incite); รั้ง (hold back); ราย (case); รีด (press); เริ่ม (begin); เละ (messy); วิก (wig); วัน (day); สั้น (short); สัตว์ (animal); หมด (clear); ไส้ (stuffing); หลัง (behind); อั้น (restrain)

Table 7: Distribution of intensity test levels (in dB HL) across a set of five subjects (T = TU PB’14, O = OTL; the number is the list number).

Subject   15 dB HL    25 dB HL    35 dB HL    45 dB HL    55 dB HL
I         T-5, O-1    T-4, O-5    O-3, T-1    T-2, O-4    O-2, T-3
II        O-5, T-4    O-2, T-3    T-5, O-1    O-3, T-1    T-2, O-4
III       T-3, O-2    O-4, T-2    O-5, T-4    O-1, T-5    T-1, O-3
IV        O-4, T-2    T-1, O-3    T-3, O-2    T-4, O-5    O-1, T-5
V         O-3, T-1    T-5, O-1    O-4, T-2    O-2, T-3    T-4, O-5
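The counterbalancing idea behind Table 7 is that, within a set of five subjects, every pair of lists is heard once at every intensity level, while each subject hears each pair only once. A minimal sketch of such a design is a 5 × 5 Latin square. The cyclic construction, the pair labels A to E, and the function names below are assumptions for illustration; the authors' actual assignment is the one given in Table 7.

```python
from collections import Counter

LEVELS_DB_HL = [15, 25, 35, 45, 55]
PAIRS = ["A", "B", "C", "D", "E"]  # hypothetical labels for the 5 list pairs
WORDS_PER_PAIR = 50                # two 25-word lists per pair
N_SETS = 6                         # 6 sets of five subjects = 30 subjects


def cyclic_schedule(pairs, levels):
    """Return one dict per subject mapping intensity level -> list pair."""
    n = len(pairs)
    return [
        {levels[j]: pairs[(i + j) % n] for j in range(n)}
        for i in range(n)
    ]


def is_counterbalanced(schedule, pairs, levels):
    """True if each pair occurs once per subject and once per level."""
    rows_ok = all(sorted(row.values()) == sorted(pairs) for row in schedule)
    cols_ok = all(
        sorted(row[lv] for row in schedule) == sorted(pairs) for lv in levels
    )
    return rows_ok and cols_ok


schedule = cyclic_schedule(PAIRS, LEVELS_DB_HL)
trials_per_subject = WORDS_PER_PAIR * len(LEVELS_DB_HL)

# Number of subjects who hear each (pair, level) combination across all sets.
listeners_per_cell = Counter(
    (row[lv], lv)
    for _ in range(N_SETS)
    for row in schedule
    for lv in LEVELS_DB_HL
)
```

Under these assumptions each subject completes 250 test trials, and every combination of list pair and intensity level is heard by 6 of the 30 subjects.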

Figure 1. Average percent correct discrimination of TU PB’14 and OTL with 95% CI values that fit all 5 word lists in each test. (Vertical axis: average percent correct discrimination, 0 to 100; horizontal axis: presentation level in dB HL, 5 to 65; curves: TU PB’14 and OTL.)

7. REFERENCES

[1] A. Munthuli, P. Sirimujalin, C. Tantibundhit, K. Kosawat, and C. Onsuwan, "A corpus-based study of phoneme distribution in Thai," in 10th International Symposium on Natural Language Processing, Phuket, TH, 2013, pp. 114-121.
[2] K. Kosawat, M. Boriboon, P. Chootrakool, A. Chotimongkol, S. Klaithin, S. Kongyoung, K. Kriengket, S. Phaholphinyo, S. Purodakananda, T. Thanakulwarapas, and C. Wutiwiwatchai, "BEST 2009: Thai word segmentation software contest," in 8th International Symposium on Natural Language Processing, Bangkok, TH, 2009, pp. 83-88.
[3] T. Tillman and R. Carhart, "An expanded test for speech discrimination utilizing CNC monosyllabic words," USAF School of Aerospace Medicine, Brooks Air Force, Texas, Northwestern University Auditory Test No. 6, Technical report SAM-TR-66-55, 1966.
[4] G. Lidén and G. Fant, "Swedish word material for speech audiometry and articulation tests," Acta Oto-Laryngol, vol. 116, pp. 189-210, 1954.
[5] R. Sagon, "The development of a phonetically balanced word recognition test in the Ilocano language," Ph.D. dissertation, PACS, WUSM, Washington, 2006.
[6] A. Hammer, B. Vaerenberg, W. Kowalczyk, L. T. Bosch, and M. Coene, "Balancing word lists in speech audiometry through large spoken language," in 14th Annual Conference of the International Speech Communication Association, Lyon, 2013, pp. 3613-3616.
[7] A. Thangthai, C. Hansakunbuntheung, R. Siricharoenchai, and C. Wutiwiwatchai, "Automatic syllable-pattern induction in statistical Thai text-to-phone transcription," in 9th International Conference on Spoken Language Processing, Pittsburgh, 2006.
[8] R. Sangworasin, "Education of Kindergarten's Vocabulary 4 - 5 years old in Bangkok," Bachelor's Thesis, Department of Education, Primary Education, Faculty of Education, Chulalongkorn University, Bangkok, 2003.
[9] R. Sripaiwan, Thai textbooks set "Mana Manee Piti Chujai". Bangkok, Thailand: Department of Curriculum and Instruction Development, Ministry of Education, 1994.
[10] Basic words for teaching Thai language, Primary education. Bangkok, Thailand: Department of Curriculum and Instruction Development, Ministry of Education, 1986.
[11] P. C. Loizou, Speech Enhancement: Theory and Practice. New York: CRC Press, Taylor & Francis Group, 2007.