Stress and Tone Recognition of Polysyllabic Words in Thai Speech

Nuttakorn Thubthong, Boonserm Kijsirikul
Machine Intelligence & Knowledge Discovery Laboratory
Department of Computer Engineering, Faculty of Engineering, Chulalongkorn University, Phayathai Rd., Bangkok 10330, THAILAND
E-mail: Nuttakorn.T1, [email protected]

Sudaporn Luksaneeyanawin
Centre for Research in Speech and Language Processing
Faculty of Arts, Chulalongkorn University, Phayathai Rd., Bangkok 10330, THAILAND
E-mail: [email protected]
Abstract: Tone information is very important for speech recognition in a tone language such as Thai. In this paper, we present a method for tone recognition of polysyllabic words. We focus on two factors that affect the F0 realization of tones: tonal assimilation and stress. The tonal assimilation effect is compensated for by using the tone information of neighboring syllables. For the stress effect, we propose two methods: the separated stress method (SSM) and the incorporated stress feature method (ISFM). Both require a stress detection algorithm and stress features, so we also propose a stress detection algorithm and several sets of stress features. All experiments were evaluated with feedforward neural networks. The best recognition rates are 84.10% for stress detection and 89.52% for tone recognition.

Key words: Thai tone, Tone recognition, Stress detection, Tonal assimilation effect, Stress effect
1. Introduction
Speech recognition technology has undergone substantial research and development in the last two decades. Several applications of speech recognition to computer interfaces have been developed, since speech is the most natural way of human communication and interaction. Most existing methods for speech recognition have been developed mainly for spoken English, and some of them have been adapted to the Thai language as well. However, unlike English, Thai is a tone language: in such a language, the referential meaning of an utterance depends on the lexical tones [1]. Therefore, tone recognition plays an important role in Thai syllable recognition.
[Figure 1 here: F0 (Hz) plotted against duration (%); contours labeled mid, low, fall, high, rise]
Figure 1: F0 contours of the five Thai tones.

In the past few years, some methods for five-tone recognition of isolated Thai syllables have been proposed. Thubthong et al. [4,5] proposed a set of tone features and used neural networks to recognize the five tones. They concentrated on the effect of initial consonants, vowels and final consonants on tone recognition and found correlations between the tones and the phonematic units constructing the syllables. Tungthangthum [6] used raw F0 contours and hidden Markov models to classify the five tones. His study shows that tones and vowels are independent of each other, but his results were obtained for single-speaker tone recognition only. Kongkachandra et al. [7] proposed two techniques, intonation flow analysis and voiced identity calculation, for multi-speaker tone recognition. However, the data used in their experiment were only the ten Thai digits, which do not cover all five Thai tones.
There are five different lexical tones in Thai: mid /M/, low /L/, fall /F/, high /H/, and rise /R/. The following examples show the effect of tones on the meaning of an utterance [2]: M /khaː/ ("a kind of grass"); L /khàː/ ("galangale"); F /khâː/ ("to kill"); H /kháː/ ("to trade"); and R /khǎː/ ("a leg"). The tone information is superimposed on the voiced portion¹ of each syllable. The identification of a Thai tone relies on the shape of the fundamental frequency (F0) contour. Figure 1 shows the average F0 contours of the five tones when syllables are spoken in isolation by a male speaker.
¹ The voiced portion is the portion of sound produced when the vocal cords are tensed together and vibrate in a relaxation mode as the air pressure builds up, forcing the glottis open, and then subsides as the air passes through [3].
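Voiced-portion detection of this kind is typically based on short-time energy and the zero-crossing rate, the cues this paper uses in Section 4.1 [25]. Below is a minimal sketch; the frame parameters and thresholds are assumptions for illustration, not the paper's values.

```python
import numpy as np

# Sketch of voiced-frame detection: a frame is marked voiced when it is
# energetic (vocal-fold excitation) and has a low zero-crossing rate
# (periodic rather than noisy). Thresholds below are illustrative guesses.

def voiced_frames(signal, frame_len=220, shift=66, energy_thr=0.01, zcr_thr=0.25):
    flags = []
    for start in range(0, len(signal) - frame_len + 1, shift):
        frame = signal[start:start + frame_len]
        energy = np.mean(frame ** 2)                       # short-time energy
        zcr = np.mean(np.abs(np.diff(np.sign(frame))) > 0) # fraction of sign changes
        flags.append(energy > energy_thr and zcr < zcr_thr)
    return flags

# Synthetic example at 11 kHz: half a second of a 120 Hz "vowel" followed by
# half a second of white-noise "fricative".
fs = 11000
t = np.arange(fs) / fs
vowel = 0.5 * np.sin(2 * np.pi * 120 * t[:5500])     # periodic, low ZCR
rng = np.random.default_rng(1)
fricative = 0.5 * rng.standard_normal(5500)          # noisy, high ZCR
sig = np.concatenate([vowel, fricative])
flags = voiced_frames(sig)   # True for early (vowel) frames, False for late ones
```

The 220-sample frame and 66-sample shift correspond to 20 ms and 6 ms at 11 kHz, matching the framing the paper reports for F0 extraction.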
[Figure 2 here: waveforms and F0 contours of /mɛ̂ɛ/ /líaŋ/ /jùu/ /lǎan/]

Figure 2: The waveform and F0 contours of the tones in an FHRL sequence when (a) each word is spoken in isolation and (b) the whole utterance is naturally spoken (adapted from [8]).
[Figure 3 here: panels (a) stressed and (b) unstressed; F0 (z-score) plotted against duration (%), contours labeled mid, low, fall, high, rise]
Figure 3: Mean F0 contours of all five tones in (a) stressed and (b) unstressed syllables. Data are normalized within speaker and across the tone and stress categories. F0 contours are smoothed by curve fitting (adapted from [10]).

Although there are only five different tones, tone behavior is very complicated in continuous speech. Figure 2 compares the F0 realization of an FHRL sequence when each monosyllabic word is spoken in isolation (top panel) and when all four tones are spoken naturally in running speech (bottom panel). The tones produced on isolated words are very similar to those in Figure 1, while the tones produced on words in continuous speech are much more difficult to identify. Several interacting factors affect the F0 realization of tones, e.g., syllable structure, declination, tonal assimilation, stress and speaking rate [5,8,9,10,11]. The syllable structure affects an F0 contour in terms of the voiced/unvoiced property of contiguous phones and also through consonantally-induced perturbations on the F0 contour of the following vowel [8]. Initial consonants, vowels and final consonants also affect the shape of the F0 contour [5].
The tone of a neighboring syllable may affect the height and slope of the F0 contour [12]. The effects of the following syllable and of the preceding syllable are called regressive coarticulation and progressive coarticulation, respectively. Gandour et al. [9] studied these effects on Thai tones and showed that Thai tones are more influenced by progressive coarticulation than by regressive coarticulation. In this paper, we refer to these effects as the tonal assimilation effect.

The F0 height is also adjusted to conform to the intonation pattern of the sentence [12,13]. This is called the declination effect: a gradual modification over the course of a phrase or utterance against which the phonologically-specified local F0 targets are scaled [8].

The F0 contours of stressed² syllables are generally quite different from unstressed ones [14]. In standard Thai, despite systematic changes in F0 contours, all five tonal contrasts are preserved in unstressed as well as stressed syllables. However, the F0 contours of stressed syllables approximate the citation-form contours more closely than those of unstressed syllables do [8,10] (see Figure 3). The speaking rate affects F0 contours in unstressed syllables more extensively, in both height and slope, than in stressed ones [11]. The height of F0 contours in unstressed syllables is generally higher at a fast speaking rate than at a slow one, whereas the slope of F0 contours in unstressed syllables varies depending on the range of F0 movement.

² In this paper, "stress" refers to a subjective complex of objective phonetic features such as a higher degree of respiratory effort, length, pitch, loudness etc., as compared with an unstressed syllable [15].

In this paper, we attack the problem of tone recognition of polysyllabic words in Thai speech. We concentrate on two effects: the tonal assimilation and stress effects. We use tonal assimilation features [16], which exploit the tone information of neighboring syllables, to compensate for the tonal assimilation effect, and we propose two methods, the separated stress method (SSM) and the incorporated stress feature method (ISFM), to alleviate the stress effect.

The paper is organized as follows. Section 2 describes the speech corpus used in this paper. Section 3 presents a stress detection method. Section 4 describes a tone recognition method. The conclusion is given in Section 5.

2. The Speech Corpus
The speech corpus used is the Thai Proper Name Corpus (TNC). The corpus comprises 100 Thai names: 3 monosyllabic, 46 bisyllabic, 45 trisyllabic, and 6 tetrasyllabic words (the monosyllabic words are excluded from our experiments). The data were collected from 40 native Thai speakers (20 male and 20 female), ranging in age from 17 to 34 years (mean=21.60, s.d.=3.58). Each speaker read a list of 50 names in one trial at a conversational speaking rate; the corpus therefore comprises 2000 word utterances. The speech signals were digitized by a 16-bit A/D converter at 11 kHz. They were manually segmented and transcribed into phonemes, rhymes³ and syllables using audiovisual cues from a waveform display.

³ The rhyme onset is defined as the onset of voicing in the vowel: after voiceless obstruents, the point coinciding with vertical striations in the second and higher formants; otherwise, after a nasal murmur or after a lateral. For non-nasal-ending syllables, the rhyme offset is defined as the abrupt cessation of the second and higher formants; for nasal-ending syllables, as the abrupt cessation of the nasal murmur [10].

3. Stress Detection
To compensate for the stress effect on tone recognition, a stress detection module is needed. This section presents a stress detection method. First, the accentual system of polysyllabic words in Thai is described. Then, we present several acoustic features of stress. Finally, experiments with several stress feature sets are evaluated.

3.1. Accentual system of polysyllabic words in Thai
Luksaneeyawin [15] proposed an accentual system of polysyllabic words in Thai. In this system, syllables are divided into two types: linker syllables (L) and non-linker syllables (O). Linker syllables are syllables that have a phonemic /a/ and end with a glottal stop; non-linker syllables are all other syllables. Thai is a fixed-accent⁴ language: the last syllable of a word is accented and is always realized as a stressed syllable. Secondary accent is assigned according to the syllable structures of the component syllables. Secondary accents are realized as stressed syllables in formal speech, but as unstressed syllables in rapid conversational or casual speech. The accentual system of polysyllabic words in Thai is described as follows:

1. Bisyllabic word: L′L, L′O, ′O′O, ′O′L
2. Trisyllabic word: (′LO)′O, (′OL)′L, (L′O)′L, (′OL)′O, (′LL)′L, (′OO)′O, (′LL)′O, (′OO)′L
3. Tetrasyllabic word: (L′O)L′O, (′OL)L′O, (′LL)O′O, (′OO)L′L, (L′O)L′L, (′OL)L′L, (′LL)O′L, (′OO)L′O, (L′O)O′O, (′OL)O′O, (′LL)L′L, (′OO)O′O, (L′O)O′L, (′OL)O′L, (′LL)L′O, (′OO)O′L

Note: '′' is an accent marker placed in front of an accented syllable.

⁴ "Accent" refers to the potentiality of a syllable, or of the syllables in a word, to be realized with stress, either when the word occurs by itself in an utterance or together with other words in an utterance [15].

3.2. Acoustic features of stress
Many researchers have studied the acoustic correlates of stress in several languages [10,17,18,19,20,21]. The most successful features for stress detection are duration, energy and F0 associated with syllable sequences within an utterance [17]. Most studies have confirmed that duration is the most important acoustic correlate of stress. The duration can be calculated in a number of ways: for example, one could use the duration of the syllable, the duration of the vowel in the syllable, or the duration of the rhyme in the syllable. Stress is assumed to be a feature of a syllable, but for practical purposes it can be attributed to the vowel [19]. Many studies used the vowel duration to represent stressed/unstressed syllables [17,18,19,20], while some researchers proposed the rhyme duration instead [10,21].
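Since duration is reported to be the primary correlate of stress, the word-normalized DURATION feature can be sketched for any of the three candidate units. The segment boundaries below are hypothetical stand-ins for the manual transcription.

```python
# Sketch of the word-normalized DURATION feature for one linguistic unit
# (vowel, syllable, or rhyme). Each syllable's unit duration is divided by
# the total duration of that unit over the whole word.

def duration_features(word, unit):
    """word: list of syllables; each syllable is a dict with hypothetical keys
    'vowel', 'syllable', 'rhyme' mapping to (start, end) times in ms.
    Returns one normalized duration per syllable for the chosen unit."""
    durations = [end - start for (start, end) in (syl[unit] for syl in word)]
    total = sum(durations)  # normalize by the word total for this unit
    return [d / total for d in durations]

# A hypothetical bisyllabic word whose second (accented) syllable has the
# longer rhyme, as the fixed-accent rule would predict.
word = [
    {"vowel": (80, 160), "rhyme": (80, 200), "syllable": (50, 200)},
    {"vowel": (230, 360), "rhyme": (230, 450), "syllable": (210, 450)},
]
print([round(d, 3) for d in duration_features(word, "rhyme")])  # → [0.353, 0.647]
```

Swapping `"rhyme"` for `"vowel"` or `"syllable"` yields the other two unit variants compared in the experiments.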
In this paper, we use comprehensive sets of acoustic features (related to duration, energy and F0) that have been reported to be related to stress. To find the best linguistic unit for stress detection, we employ all three units: the vowel, syllable and rhyme units. The basic acoustic features are described in the following:

1. Duration. The DURATION is determined manually for each linguistic unit. Since the stressed syllable is the dominant syllable in a word and the DURATION feature is highly context dependent [20], the duration of each unit in a syllable is normalized by the total duration of that unit over the whole word.

2. Energy. The energy of a particular speech frame is computed with a root-mean-square energy (RMSE) algorithm. Two energy features, the maximum energy (MAXENE) and the total energy (TOTENE) [20], are computed for each linguistic unit. MAXENE is defined as the log energy of the frame with the highest energy value within the unit. The TOTENE of a unit is the integration of the energy over all frames within that unit; TOTENE is therefore implicitly sensitive to duration. Like duration, the energy features are normalized by the sum of the energy features of all syllables in the same word.

3. F0. The raw F0 is normalized by transforming the Hertz values to a z-score within each speaker. The average of the normalized F0 of each linguistic unit is used as a stress feature, referred to as ZF0. The details of F0 extraction are described in the next section.

3.3. Experiments
DURATION was used as the primary feature for stress detection. MAXENE, TOTENE and ZF0 were incorporated to generate different stress feature sets (SF1-SF5). Several experiments were run with the different stress feature sets.

To evaluate the stress feature sets, the k-fold cross-validation approach [22] was used. The advantage of this approach is that every sample occurs in a test set exactly once, and the variance of the results is reduced as k is increased. However, the training algorithm has to be rerun from scratch k times, so an evaluation takes k times as much computation. In the experiments, we performed five-fold cross-validation. The utterances were partitioned into five disjoint sets of equal size, each containing the utterances of two male and two female speakers. Five training sets were then constructed from the five disjoint sets by systematically dropping out a different one; the dropped set was used as the corresponding test set. The experimental results are the averages over the five test sets.

The experiments were run using a feedforward neural network with three layers: input, hidden, and output. The number of input units depends on the stress feature set; the numbers of hidden and output units are 20 and 2, respectively. The tanh function was used as the activation function. Since the network learns more efficiently if the inputs are normalized to be symmetrical around 0 [23], all feature parameters were normalized to lie between -1.0 and 1.0. The network was trained by error back-propagation, with initial weights set to random values between -1.0 and 1.0. The NICO (Neural Inference COmputation) toolkit [24] was used to build and train the network.

The recognition rates (%), standard deviations (s.d.) and error reduction rates (%ER) of all stress feature sets are shown in Table 1. The rhyme unit provides the best recognition rates (83.65% on average) while the vowel unit yields the worst (74.74% on average) for all stress feature sets. These results support the work of [10,21], which demonstrated that the rhyme is a better correlate of stress in Thai than either the vowel or the whole syllable. TOTENE slightly increases the recognition rates for the vowel and syllable units. MAXENE provides a slight improvement for the syllable unit only. ZF0 slightly improves the recognition rates for all three units.
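The energy and F0 stress features of Section 3.2 (RMSE-based MAXENE and TOTENE, and ZF0) might be computed as in the following sketch. The frame parameters correspond to 20 ms / 6 ms at 11 kHz as in the paper's F0 framing, but their use for energy here, and the example signals, are assumptions.

```python
import numpy as np

# Sketch of the RMSE-based energy features and the ZF0 feature of Section 3.2.

def rmse_frames(signal, frame_len, shift):
    """Root-mean-square energy of each analysis frame."""
    n = 1 + max(0, (len(signal) - frame_len) // shift)
    return np.array([np.sqrt(np.mean(signal[i * shift:i * shift + frame_len] ** 2))
                     for i in range(n)])

def energy_features(unit_signals, frame_len=220, shift=66):
    """MAXENE (log of the peak frame energy) and TOTENE (energy integrated over
    all frames) per unit, each normalized over the units of one word."""
    maxene = np.array([np.log(rmse_frames(s, frame_len, shift).max())
                       for s in unit_signals])
    totene = np.array([rmse_frames(s, frame_len, shift).sum()
                       for s in unit_signals])
    return maxene / maxene.sum(), totene / totene.sum()

def zf0(unit_f0, speaker_mean, speaker_sd):
    """ZF0: average of the speaker-level z-scored F0 values within one unit."""
    return float((np.asarray(unit_f0, dtype=float) - speaker_mean).mean() / speaker_sd)

# Two hypothetical rhyme segments of one word; the second is longer and louder.
rng = np.random.default_rng(0)
units = [100.0 * rng.standard_normal(2200), 300.0 * rng.standard_normal(4400)]
maxene, totene = energy_features(units)

print(zf0([120, 130, 125], speaker_mean=115.0, speaker_sd=20.0))  # → 0.5
```

Because TOTENE sums over frames, the longer unit contributes more to it even at equal loudness, which is the duration sensitivity noted in the text.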
Table 1: Recognition rates (%), standard deviations (s.d.) and error reduction rates (%ER) of stress detection with different linguistic units and stress feature sets.

Stress feature set                       |  Vowel unit        |  Syllable unit     |  Rhyme unit
                                         |  %     s.d.  %ER   |  %     s.d.  %ER   |  %     s.d.  %ER
SF1. DURATION                            | 74.00  0.66  -     | 82.79  0.41  -     | 83.49  0.50  -
SF2. DURATION + TOTENE                   | 74.30  0.50  1.15  | 82.89  0.16  0.58  | 83.49  0.41  0.00
SF3. DURATION + TOTENE + MAXENE          | 74.26  0.39  1.00  | 82.97  0.43  1.05  | 83.49  0.34  0.00
SF4. DURATION + TOTENE + ZF0             | 75.70  0.31  6.54  | 82.69  0.30  -0.58 | 83.67  0.42  1.09
SF5. DURATION + TOTENE + MAXENE + ZF0    | 75.44  0.42  5.54  | 83.31  0.53  3.02  | 84.10  0.30  3.69
Average                                  | 74.74  0.46  -     | 82.93  0.37  -     | 83.65  0.39  -

Table 2: Confusion matrix of stress detection with SF5 and the rhyme unit.

Reference   | #Tokens | Recognized stressed | Recognized unstressed | Recognition rate (%)
Stressed    | 3580    | 3254                | 326                   | 90.89
Unstressed  | 1440    | 472                 | 968                   | 67.22
Total       | 5020    |                     |                       | 84.10

Table 2 shows the confusion matrix of stress detection when SF5 was used with the rhyme unit. The recognition rate of stressed syllables is quite good (90.89%), while that of unstressed syllables is not (67.22%). This can be explained as follows. First, in the recording process, the subjects spoke each word in their own accentual style, which may differ from the standard accentual system as well as from the styles of other speakers. Second, the number of unstressed syllables is small compared to the number of stressed syllables, so the classifier may be biased.

[Figure 4 here]
Figure 4: Recognition rates of stress detection along the five stress feature sets with the rhyme unit.

Figure 4 shows the recognition rates of stress detection along the five stress feature sets with the rhyme unit, plotted separately by the number of syllables in a word. The bisyllabic words yield the highest results while the trisyllabic words yield the worst. Conversely, the tetrasyllabic words achieve the highest improvement along the stress feature sets, reducing the error from 18.96% to 17.08%, whereas the bisyllabic words achieve the worst improvement. This suggests that energy and F0 are not prominent features for stress detection of bisyllabic words.

4. Tone Recognition
This section presents a Thai tone recognition method. We first describe several tone features: the baseline tone features, the tonal assimilation features, and the stress features. The stress features were designed using the information of the previous section. Experiments with several tone feature sets are then evaluated.

4.1. The baseline tone features
As tone information is superimposed on the voiced portion of a syllable, we first locate the voiced portion before extracting the F0 feature. The portion is detected using energy and zero crossings [25]. The Average Magnitude Difference Function (AMDF) algorithm [26] is then applied for F0 extraction with a 20 ms frame size and a 6 ms frame shift. To correct some of the estimation errors and to enforce continuity of the F0 contour, a search algorithm [27] and moving-average smoothing are applied.

F0 is basically a physiologically determined characteristic and is regarded as speaker dependent [25]. For example, the dynamic F0 range of a male voice is much narrower (90-180 Hz) than that of a female voice (150-240 Hz). Therefore, for speaker-independent tone recognition that uses the relative F0 of each syllable as the main discriminative feature, a normalization procedure is needed to align the range of the F0 height across speakers. In this paper, the raw F0 is normalized by transforming the Hertz values to a z-score [11]; the mean and standard deviation are computed from the raw F0 values of all syllables of each speaker. Since not all syllables are of equal duration, the F0 contour of each syllable is also equalized for duration on a percentage scale [10,28].

The F0-normalized data are then fitted with a third-order polynomial (y = a0 + a1x + a2x² + a3x³), which has been shown to be successful for fitting the F0 contours of the five Thai tones [11]. To capture changes in the F0 height and slope of each syllable, a time-aligned F0 profile is used. The profile is obtained by calculating the F0 heights at five time points from 0% to 100% of each syllable, with an equal step size of 25% (see Figure 5(a)). The slopes at these five points are also computed from the polynomial coefficients. The five F0 heights and the five slopes are used as the baseline features, so the tone feature vector of each syllable has dimension 10.

4.2. Tonal assimilation features
Thai tones are influenced by progressive and regressive coarticulation. As described in [9], progressive effects extend forward over about 75% of the duration of the following syllable, while regressive effects extend backward over about 50% of the duration of the preceding syllable. To alleviate the tonal assimilation effect [16], the F0 heights and slopes of the neighboring syllables (both the preceding and the following syllable) are considered. The F0 heights and slopes at 50% and 75% of the preceding syllable and at 25%, 50% and 75% of the following syllable are used as the tonal
assimilation features (see Figure 5(b)). For the beginning or ending syllable of a word, where no preceding or following syllable exists, the corresponding F0 height and slope are set to 0 and 10000, respectively. This feature set is referred to as AF.

[Figure 5 here: time points at 0, 25, 50, 75 and 100% of the preceding, considered and following syllables]
Figure 5: The time points of the proposed tone features: (a) the baseline tone features, and (b) the baseline tone features + tonal assimilation features.

4.3. Stress features
The F0 contours of stressed syllables are generally quite different from unstressed ones. The largest difference between the tone spaces of stressed and unstressed syllables is the reduction in the excursion size of F0 movements [10] (see Figure 3). In stressed syllables, the F0 contours of the five tones are quite different from each other, so tones in stressed syllables are easy to recognize by F0 height and slope. In unstressed syllables, the F0 contours of the five Thai tones are rather flat; the high and rising tones, with rising F0 contours, stand in opposition to the other tones (mid, low, falling). The contours of unstressed syllables are also affected by the syntactic context and by other factors such as speaking rate, tonal assimilation and declination. This makes tone recognition in unstressed syllables a hard problem. In this paper, we propose two methods to deal with the stress effect:

1. Separated Stress Method (SSM). All syllables are separated into stressed and unstressed groups by the stress detection algorithm described in the previous section. Each group is trained and evaluated with its own classifier. The process is shown in Figure 6.

2. Incorporated Stress Feature Method (ISFM). The stress features (i.e., duration, energy, and F0) are incorporated with the tone features and used together for training and evaluation. In the experiments, stress feature set SF5 with the rhyme unit was selected.

[Figure 6 here: input speech → stress feature extraction → stress detection → tone feature extraction → tone classification for stressed or for unstressed syllables → recognized tone]
Figure 6: The separated stress method (SSM).

4.4. Experiments
A series of experiments with six different tone feature sets (TF1-TF6) was evaluated. As in the previous experiments, five-fold cross-validation was used. The experiments were run using a feedforward neural network with a number of input units depending on the tone feature set, 20 hidden units and 5 output units corresponding to the five Thai tones. The tanh function was used as the activation function. All feature parameters were normalized to lie between -1.0 and 1.0. The network was trained by error back-propagation, with initial weights set to random values between -1.0 and 1.0.

The recognition rates (%), standard deviations (s.d.) and error reduction rates (%ER) of all tone feature sets are shown in Table 3. The results are reported separately for stressed, unstressed and all syllables. The tonal assimilation features (AF) provide better recognition rates than the baseline tone features for both stressed and unstressed syllables. The separated stress method (SSM) also improves the recognition rates for both stressed and unstressed syllables, and outperforms AF. The incorporated stress feature method (ISFM) gives the best improvement for both stressed and unstressed syllables. These results imply that the stress factor influences tones more than the tonal assimilation factor does. When AF and ISFM are employed together (TF6), the highest error reduction rates of 50.93, 44.40 and 47.56% are obtained for stressed, unstressed, and total syllables, respectively.
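As an illustration of the feature pipeline evaluated in these experiments, the 10-dimensional baseline tone feature vector of Section 4.1 (third-order polynomial fit, then heights and slopes at five time points) can be sketched as follows. The contour below is synthetic, not taken from the corpus.

```python
import numpy as np

# Sketch of the baseline tone features: fit a third-order polynomial to a
# duration-equalized, z-scored F0 contour, then sample heights and slopes
# at 0, 25, 50, 75 and 100% of the syllable.

def baseline_tone_features(f0_zscores):
    x = np.linspace(0.0, 1.0, len(f0_zscores))  # duration on a 0..1 scale
    coeffs = np.polyfit(x, f0_zscores, 3)       # y = a3*x^3 + a2*x^2 + a1*x + a0
    poly = np.poly1d(coeffs)
    slope = poly.deriv()                        # analytic derivative of the fit
    points = np.array([0.0, 0.25, 0.5, 0.75, 1.0])
    return np.concatenate([poly(points), slope(points)])  # 5 heights + 5 slopes

# A synthetic falling-tone-like contour: rises slightly, then falls sharply.
t = np.linspace(0, 1, 50)
contour = 0.5 + 0.8 * t - 1.8 * t ** 2
features = baseline_tone_features(contour)
# heights ≈ [0.50, 0.59, 0.45, 0.09, -0.50]; slopes ≈ [0.8, -0.1, -1.0, -1.9, -2.8]
print(features.shape)  # → (10,)
```

The AF variant of Section 4.2 would reuse the same polynomial fits, sampling the neighbors' heights and slopes at the 50%/75% (preceding) and 25%/50%/75% (following) points.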
Table 4 shows the confusion matrices of tone recognition with TF6. The mid and high tones provide the highest recognition rates for stressed and unstressed syllables, respectively. The stressed-syllable result differs from previous studies [5,30], which often reported low recognition rates for the mid tone. Because of the similarity of the F0 contour of the low tone to that of the mid tone, most errors probably come from misclassification between these two tones, and our results still confirm this confusion. The most plausible reason for the high recognition rate of the mid tone is that the number of mid-tone syllables is very large compared to the other tones, so the classifier may be biased. In the confusion matrix of unstressed syllables (see Table 4(b)), the probable errors likewise come from misclassification between the low and high tones. This confusion frequently arises on linker syllables, where speakers often waver between the low and high tones; human listeners can nevertheless recognize the word correctly. For example, even if a speaker produces the linker syllable of the word for "philosopher" with a high tone (/tá/) instead of the correct low tone (/tà/), the listener can still recognize the word.
[Figure 7 here]
Figure 7: Recognition rates of tone recognition with different tone feature sets.

Figure 7 shows the recognition rates of tone recognition along the six tone feature sets, plotted separately for stressed, unstressed and all syllables. The recognition rates of stressed syllables are better than those of unstressed ones for all tone feature sets. However, the unstressed syllables show a greater reduction in error rate along the tone feature sets than the stressed ones: the absolute error rates drop by 6.90, 15.97 and 9.50 percentage points for stressed, unstressed and total syllables, respectively. Tone recognition of unstressed syllables is generally quite hard, and it is a hard problem for continuous speech recognition. The results show that our proposed feature set (TF6) is useful for recognizing tones in unstressed syllables; we therefore believe that our tone feature set will also be useful for recognizing tones in continuous speech.
Table 3: Recognition rates (%), standard deviations (s.d.), and error reduction rates (%ER) of tone recognition with different tone feature sets.

Tone feature set             |  Stressed           |  Unstressed         |  Total
                             |  %     s.d.  %ER    |  %     s.d.  %ER    |  %     s.d.  %ER
TF1. baseline                | 86.45  2.56  -      | 64.03  3.26  -      | 80.02  2.76  -
TF2. baseline + AF           | 87.77  2.30  9.69   | 66.11  3.42  5.79   | 81.55  2.57  7.68
TF3. baseline + SSM          | 87.29  1.53  6.19   | 74.58  3.73  29.34  | 83.65  2.04  18.17
TF4. baseline + AF + SSM     | 89.22  1.16  20.41  | 77.36  4.22  37.07  | 85.82  1.81  29.01
TF5. baseline + ISFM         | 91.62  0.89  38.14  | 77.57  2.12  37.64  | 87.59  0.97  37.89
TF6. baseline + AF + ISFM    | 93.35  0.84  50.93  | 80.00  0.53  44.40  | 89.52  0.57  47.56
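The %ER columns in Tables 1 and 3 are relative reductions in error rate over the first feature set. A quick sketch of the computation, checked against the Total column of Table 3; the small discrepancies against the published figures come from rounding of the reported rates.

```python
# %ER: relative error-rate reduction of a feature set over the baseline set.

def error_reduction(baseline_rate, rate):
    """Percent error reduction of `rate` over `baseline_rate` (both in %)."""
    base_err = 100.0 - baseline_rate
    return 100.0 * (base_err - (100.0 - rate)) / base_err

# TF2..TF6 total recognition rates against the TF1 baseline of 80.02%.
for rate in (81.55, 83.65, 85.82, 87.59, 89.52):
    print(round(error_reduction(80.02, rate), 2))
# → 7.66, 18.17, 29.03, 37.89, 47.55 (Table 3 reports 7.68, 18.17, 29.01, 37.89, 47.56)
```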
Table 4: Confusion matrices of tone recognition of (a) stressed and (b) unstressed syllables with TF6.

(a) Stressed syllables
Reference | #Tokens | mid   low   fall  high  rise | Recognition rate (%)
mid       | 2380    | 2295    42     3    32     8 | 96.43
low       |  420    |   46   366     0     3     5 | 87.14
fall      |   80    |    6     1    71     2     0 | 88.75
high      |  400    |   45    16     0   337     2 | 84.25
rise      |  300    |    7    16     0     4   273 | 91.00
Total     | 3580    |                              | 93.35

(b) Unstressed syllables
Reference | #Tokens | mid   low   fall  high  rise | Recognition rate (%)
mid       |  200    |  151    45     0     4     0 | 75.50
low       |  780    |   67   631     2    64    16 | 80.90
fall      |   20    |    8     0    11     1     0 | 55.00
high      |  440    |   10    69     2   359     0 | 81.59
rise      |    0    |    0     0     0     0     0 | -
Total     | 1440    |                              | 80.00
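The recognition rates in Table 4 follow directly from the confusion counts (row diagonal over row total). This sketch recomputes the stressed-syllable rates from the counts in Table 4(a).

```python
import numpy as np

# Per-tone and total recognition rates recomputed from the stressed-syllable
# confusion matrix of Table 4(a). Rows: reference tone; columns: recognized tone.
conf = np.array([
    [2295,  42,  3,  32,   8],   # mid
    [  46, 366,  0,   3,   5],   # low
    [   6,   1, 71,   2,   0],   # fall
    [  45,  16,  0, 337,   2],   # high
    [   7,  16,  0,   4, 273],   # rise
])

per_tone = 100.0 * np.diag(conf) / conf.sum(axis=1)  # diagonal over row totals
total = 100.0 * np.trace(conf) / conf.sum()
# per-tone rates: 96.43, 87.14, 88.75, 84.25, 91.00
print(round(float(total), 2))  # → 93.35
```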
5. Conclusion
In this paper, we have presented two methods, stress detection and tone recognition, for polysyllabic words in Thai speech. A series of stress detection experiments with several linguistic units (vowel, syllable and rhyme) and several stress feature sets was presented, evaluated with a feedforward neural network. The best result of 84.10% was obtained for the rhyme unit with SF5 (DURATION + TOTENE + MAXENE + ZF0). For tone recognition, the tonal assimilation and stress effects were considered, and we proposed tone features to compensate for these effects, again evaluated with a feedforward neural network. The experimental results show that the proposed features surpass the baseline features. The best result of 89.52% was obtained for TF6 (baseline + tonal assimilation features + incorporated stress feature method).

6. Acknowledgement
This paper is supported in part by the Thailand-Japan Technology Transfer Project.

7. References
[1] Jian, F. H. L., Classification of Taiwanese tones based on pitch and energy movement, International Conference on Spoken Language Processing, 2, 329-332, 1998.
[2] Luksaneeyanawin, S., Intonation in Thai, in Hirst, D. and Cristo, A. C. (eds.), Intonation Systems: A Survey of Twenty Languages, 376-394, 1998.
[3] Owens, F. J., Signal Processing of Speech, Macmillan, 1993.
[4] Thubthong, N., A Thai tone recognition system based on phonemic distinctive features, The 2nd Symposium on Natural Language Processing, 379-386, 1995.
[5] Thubthong, N., Pusittrakul, A., and Kijsirikul, B., An efficient method for isolated Thai tone recognition using combination of neural networks, The 4th Symposium on Natural Language Processing, 224-242, 2000.
[6] Tungthangthum, A., Tone recognition for Thai, The 1998 IEEE Asia-Pacific Conference on Circuits and Systems, 157-160, 1998.
[7] Kongkachandra, R., Pansang, S., Sripra, T., and Kimpan, C., Thai intonation analysis in harmonic-frequency domain, The 1998 IEEE Asia-Pacific Conference on Circuits and Systems, 165-168, 1998.
[8] Potisuk, S., Harper, M. P., and Gandour, J., Classification of Thai tone sequences in syllable-segmented speech using the analysis-by-synthesis method, IEEE Transactions on Speech and Audio Processing, 7(1), 95-102, 1999.
[9] Gandour, J., Potisuk, S., and Dechongkit, S., Tonal coarticulation in Thai, Journal of Phonetics, 22, 477-492, 1994.
[10] Potisuk, S., Gandour, J., and Harper, M. P., Acoustic correlates of stress in Thai, Phonetica, 53, 200-220, 1996.
[11] Gandour, J., Tumtavitikul, A., and Satthamnuwong, N., Effects of speaking rate on Thai tones, Phonetica, 56, 123-134, 1999.
[12] Wang, Y. R., and Chen, S. H., Tone recognition of continuous Mandarin speech assisted with prosodic information, Journal of the Acoustical Society of America, 96(5), 1738-1752, 1994.
[13] Shih, C., Declination in Mandarin, European Speech Communication Association Workshop, 293-296, 1997.
[14] Chen, S. H., and Wang, Y. R., Tone recognition of continuous Mandarin speech based on neural networks, IEEE Transactions on Speech and Audio Processing, 3(2), 146-150, 1995.
[15] Luksaneeyanawin, S., Intonation in Thai, Ph.D. Dissertation, University of Edinburgh, 1983.
[16] Thubthong, N., Pusittrakul, A., and Kijsirikul, B., Tone recognition of continuous Thai speech under tonal assimilation and declination effects using half-tone model, International Conference on Intelligent Technology, 229-234, 2000.
[17] Ying, G. S., Jamieson, L. H., Chen, R., and Michell, C. D., Lexical stress detection on stress-minimal word pairs, International Conference on Spoken Language Processing, 1612-1615, 1996.
[18] van Kuijk, D., van den Heuvel, H., and Boves, L., Using lexical stress in continuous speech recognition for Dutch, International Conference on Spoken Language Processing, 1736-1739, 1996.
[19] van Kuijk, D., and Boves, L., Acoustic characteristics of lexical stress in continuous speech, International Conference on Acoustics, Speech and Signal Processing, 1655-1658, 1997.
[20] van Kuijk, D., and Boves, L., Acoustic characteristics of lexical stress in continuous telephone speech, Speech Communication, 27, 95-111, 1999.
[21] Potisuk, S., Harper, M. P., and Gandour, J., Using stress to disambiguate spoken Thai sentences containing syntactic ambiguity, International Conference on Spoken Language Processing, 2, 805-808, 1996.
[22] Dietterich, T. G., Machine learning research: four current directions, AI Magazine, 4, 97-136, 1997.
[23] Tebelskis, J., Speech Recognition Using Neural Networks, Ph.D. Dissertation, Carnegie Mellon University, 1995.
[24] Ström, N., Phoneme probability estimation with dynamic sparsely connected artificial neural networks, The Free Speech Journal, 1(5), 1997.
[25] Lee, T., Ching, C. P., Chan, L. W., Cheng, Y. H., and Mak, B., Tone recognition of isolated Cantonese syllables, IEEE Transactions on Speech and Audio Processing, 3(3), 204-209, 1995.
[26] Ross, M. J., Shaffer, H. L., Cohen, A., Freudberg, R., and Manley, H. J., Average magnitude difference function pitch extractor, IEEE Transactions on Acoustics, Speech and Signal Processing, ASSP-22, 353-362, 1974.
[27] Jun, L., Zhu, X., and Luo, Y., An approach to smooth fundamental frequencies in tone recognition, International Conference on Communication Technology, 1, S16-10-1 to S16-10-5, 1998.
[28] Chang, P. C., Sue, S. W., and Chen, S. H., Mandarin tone recognition by multilayer perceptron, International Conference on Acoustics, Speech, and Signal Processing, 1, 517-520, 1990.
[29] Gandour, J., Potisuk, S., and Dechongkit, S., Inter- and intraspeaker variability in fundamental frequency of Thai tones, Speech Communication, 10, 355-372, 1991.
[30] Luksaneeyanawin, S., Tone transformation, The 2nd Symposium on Natural Language Processing, 345-353, 1995.