TONE RECOGNITION IN THAI CONTINUOUS SPEECH BASED ON COARTICULAION, INTONATION AND STRESS EFFECTS Nuttakorn Thubthong1 , Boonserm Kijsirikul2 and Sudaporn Luksaneeyanawin3 1
Department of Physics, Faculty of Science, Department of Computer Engineering, Faculty of Engineering, 3 Centre for Research in Speech and Language Processing, Faculty of Arts, Chulalongkorn Uiveristy, Phayathai Rd., Bangkok 10330, THAILAND Tel. +66-2-218-5301, Fax. +66-2-2531150 1
[email protected] 2
[email protected] 3
[email protected] 2
ABSTRACT Tone recognition is a critical component for speech recognition in a tone language. One of the main problems of tone recognition in continuous speech is that several interacting factors affect F0 realization of tones. In this paper, we focus on the coarticulatory, intonation, and stress effects. These effects are compensated by the tone information of neighboring syllables, the adjustment of F0 heights and the stress acoustic features, respectively. The experiments, which compare all tone features, were conducted by feedforward neural networks. The highest recognition rates are improved from 84.07% to 93.60% and 82.48% to 92.67% for Thai proper name and Thai animal story corpora, respectively. 1. INTRODUCTION Tone information is very important to speech recognition in a tone language, such as Thai. There are five different lexical tones in Thai: mid, low, falling, high, and rising. The identification of a Thai tone relies on the shape of the fundamental frequency (F0 ) contour. Although there are only five different tones, the tone behavior is very complicated in continuous speech. Several interacting factors affect F0 realization of tones, e.g., syllable structure, coarticulation, intonation, and stress [1, 2, 3]. Syllable structure affects an F0 contour in terms of the voiced/unvoiced property of contiguous phones and also consonantally-induced perturbations on the F0 contour of the following vowel [1]. The tone of a neighboring syllable may affect the shape and level of the F0 contour [4]. The effects of the following syllable and the preceding syllable are called anticipatory coarticulation and carry-over coarticulation, respectively. Gandour et al. [2] studied these effects on Thai tones. The study shows that Thai tones are more influenced by carryover coarticulation than by anticipatory coarticulation.
Intonation refers to a distinctive pitch of an information unit, either a word or a set of words which are semantically and syntactically unified to other information units in the information whole. The F0 will be adjusted to conform to the intonation pattern of the sentence [4]. An important characteristic of intonation is declination, which refers to the tendency of F0 to gradually decline over the course on an utterance. The F0 contours of stressed syllables are generally quite different from unstressed ones [5]. For standard Thai, despite systematic changes in F0 contours, all five tonal contrasts are preserved in unstressed as well as stressed syllables. However, F0 contours of stressed syllables more closely approximate the contours in citation forms than those of unstressed syllables [1, 3]. In this paper, we propose several tone features to compensate the effects of coarticulation, intonation, and stress and improve tone recognition rates. In the following sections, we first presents the baseline tone features (Section 2). We then describe the proposed tone features for compensating coarticulatory, intonation, and stress effects (Section 3, 4, and 5, respectively). After that we demonstrate experiments and analyze the results in several directions (Section 6). Finally, we summarize this paper (Section 7). 2. BASELINE TONE FEATURES Since the tone information from the rhyme portion (vowel + coda) has been shown to provide better recognition rate than that from the whole syllable [6], our tone features are extracted form the rhyme portion. The Average Magnitude Different Function (AMDF) algorithm [7] is applied for F0 extraction with 60 ms frame size and 12 ms frame shift. The differences in the excursion size of F0 movements related to differences in voice range between speakers are normalized by converting raw F0 values to an ERB-rate scale [8]. An
F0 is basically a physiologically determined characteristic and is regarded as being speaker dependent. Therefore, a z-score normalization is then employed based on the mean and standard deviation computed from the raw F0 values of all utterances for each speaker. Since not all syllables are of equal duration, F0 contour of each syllable is equalized for duration on a percentage scale [2, 3]. F0 -normalized data are then fitted with a third-order polynomial (y = a0 + a1 x + a2 x2 + a3 x3 ). To evaluate changes in the F0 height and slope of each syllable, a time aligned F0 profile is used. The profile is obtained by calculating the F0 heights at five different time points between 0% to 100% throughout each syllable with the equal step size of 25% (see Fig.1 (a)). Moreover, the slopes at these five points are also computed by using the polynomial coefficients. The five F0 heights and five slopes are used as the baseline features. Therefore, the tone feature vector of each syllable has the same dimension of 10.
Fig. 1. The different time points of (a) the baseline tone features, (b) the contextual tone features (for CTF34). The circle points represent five F0 at different time points of a syllable and bar points represent the F0 ’s used for different tone features. The dash lines and dot lines represent syllable boundaries and onset-rhyme boundaries, respectively.
3. COARTICULATION The coarticulatory effect of neighboring syllables will make the F0 contour of a syllable change in level and in shape to interfere with tone discrimination. The coarticulatory effect decreases when the distance between neighboring syllables increases. This means that the coarticulatory effect is inversely proportional to the distance between neighboring syllables. Therefore, we used some features extracted from neighboring syllables and the distance between neighboring syllables to identify tones. They include: (i) the F0 heights and slopes of neighboring syllables (both preceding and following syllables) and (ii) the two inversions of duration: from the ending point of the preceding syllable to the beginning point of the considering syllable (dp ) and from the ending point of the considering syllable to the beginning point of the following syllable (df ) where the duration is measured in second. The former are the primary features for coping with the coarticulation while the latter are employed to implicitly represent the tightness of relationships between the considering syllable and the two neighboring syllables. These features are referred to as contextual tone features. As described in Gandour et al. [2], carry-over effects extend forward to about 75% of the duration of the following syllable, while anticipatory effects extend backward to about 50% of the duration of the preceding syllable. However, they analyzed on a small set of hypothesis sentences. We therefore explored different pairs of F0 points from preceding and following syllables. We refer to these features as CTFmn where m and n are the numbers of time points of the preceding and following syllables used to extract contextual tone features, respectively. For example, CTF34 refers to the features when we employ the three F0 heights
and slopes at 50%, 75% and 100% of the preceding syllable and the four F0 heights and slopes at 0%, 25%, 50%, and 75% of the following syllable as shown in Fig.1 (b). In the case of the beginning or ending syllable, F0 heights and slopes are set to 0 and two duration features are set to 1. 4. INTONATION An important characteristic of intonation is declination that refers to the tendency of F0 to gradually decline over the course on an utterance. The declination plays an important role in tone recognition in term of F0 height adjustment of tones [9]. There is no un-criticized method available to quantitatively determine the slope and domain of intonation contour yet. Lieberman et al. [10] determined three different declination measures, by fitting linear regression lines to local peaks (topline), local valleys (baseline) and all F0 points. In this paper, the all-point approach is applied to determine the intonation line (see Fig.2 (a)). All utterances in a speech corpus are assumed that they have a similar underlying intonation contour [11]. To deal with different utterances with different durations, the time scale of each utterance is normalized by the utterance duration. The mean F0 contour of all utterances is then fitted with a first-order polynomial (y = a0 + a1 x). The straight line represents the intonation contour. To neutralize the intonation effect, the F0 heights will be adjusted. We have obtained an adjusting line called centerpoint adjusting line (see Fig. 2 (b)). Each F0 will be adjusted to confirm the adjusting line. This adjustment is re-
300
Table 1. Tone recognition rates (%) and error reduction rates (%ER) of the baseline tone features and more refined tone features on TPC and TASC.
Declination line
F0 (Hz)
250
200
Tone features
150
100 0 0
0.5
1
1.5
2
Time (sec)
(a) 300
Simple + CTF34 + CIN + ISFM + CTF34 + CIN + CTF34 + CIN + ISFM
TPC % %ER 84.07 92.93 55.65 86.75 16.84 89.23 32.43 93.03 56.28 93.60 59.83
TASC % %ER 82.48 89.91 42.42 86.13 20.81 87.24 27.16 90.59 46.28 92.67 58.18
Declination line 250
F0 (Hz)
tone. This method is referred to as incorporated stress feature method (ISFM).
200
150
Center-point adjusting line
6. EXPERIMENTS
100 0
0.5
1
1.5
2
Time (sec)
(b)
Fig. 2. F0 contours: (a) before and (b) after applying centerpoint intonation normalization.
ferred to as center-point intonation normalization (CIN). 5. STRESS The F0 contours of stressed syllables are generally quite different from unstressed ones. The most difference between the tone space for stressed and unstressed syllables is the reduction in excursion size of F0 movements [3]. It makes the tone recognition of unstressed syllables very hard. To compensate the stress effect, there are usually two approaches: (a) building separate tone models for stressed and unstressed syllables, and (b) incorporating stress information into tone models [12]. We decide to use the second approach because it does not need stressed/unstressed transcription. The stress information used in the paper is a comprehensive set of acoustic features that includes duration, energy and average F0 . The duration is manually determined. The energy of a particular speech frame is computed with a root mean square energy algorithm. Two energy features, i.e., the maximum energy (MAXENE) and the total energy (TOTENE) are used [13]. MAXENE is defined as the log energy of the frame that has the highest energy value, while the TOTENE is an integration of the energy over all frames. The average of F0 of a syllable is also applied. All features are extracted from the rhyme portion of a syllable and normalized by the z-score technique using the mean and standard deviation of all utterances of a speaker in a corpus. All stress features are incorporated with the baseline tone features for recognizing
We conducted several experiments on Thai proverb corpus (TPC) and Thai animal story corpos (TASC) [6]. The first one is a short sentence corpus, and the last one is a read speech corpus. Five-fold cross-validation approach was used. Every experiment was performed using a three-layer feedforward neural network. The network had three layers, i.e., input, hidden, and output layers. The number of input units depended on the number of tone features. The number of hidden and output units were 20 and 5, respectively. All feature parameters were normalized to lie between -1.0 and 1.0. The network was trained using the standard back-propagation method. Initial weights were set to random values between -1.0 and 1.0. The results are summarized in Table 1. The refined tone features achieve high improvements in recognition rate for both corpora. Comparing the refinements with CTF341 , CIN, and ISFM, we found that CTF34 yields the highest error reduction rates, while CIN provides the poorest error reduction rates. We then used the CTF34 refinement as the primary tone models. CIN and ISFM were then incorporated for enhancing the recognition rates. CIN slightly increases the recognition rates from 92.93% to 93.03%, and from 89.91% to 90.59% for TPC and TASC, respectively. The additional refined tone features with ISFM also improves recognition rates. The over all highest recognition rates are 93.60% and 92.67% for TPC and TASC, respectively. Since the tones at sentence initial and final positions seem to behave differently from those at other positions, we further analyzed the effect of intonation on different syllable positions by examining the recognition rates of syllables located in different positions of an utterance. We found that the error rates at the beginning and ending of utterances are 1 We explored several configurations of contextual tone features, i.e., CTF12, CTF22, CTF23, CTF33, CTF34 and CTF44, and found that CTF34 provided the best recognition rates. These results confirm the study of [2].
[2] J. Gandour, S. Potisuk, and S. Dechnongkit, “Tonal coarticulation in Thai,” Journal of Phonetics, vol. 22, pp. 477–492, 1994. [3] S. Potisuk, J. Gandour, and M. P. Harper, “Acoustic correlates of stress in Thai,” Phonetica, vol. 53, pp. 200–220, 1996. [4] Y. R. Wang and S. H. Chen, “Tone recognition of continuous Madarin speech assisted with prosodic information,” Journal of the Acoustical Society of America, vol. 96, no. 5, pp. 1738–1752, 1994.
Fig. 3. Error rates (%) of tone recognition along different rhyme durations for TASC.
dropped, when CIN are employed; while the error rates of the other syllable positions are consistent. This confirms that CIN can partially compensate the effect of the intonation contour. We then analyzed the tone error rates by plotting the error rates of the baseline tone features and the refined tone features with ISMF along different rhyme durations. As shown in Fig. 3, we found that ISMF decreases error rates for all syllables along different rhyme durations. The extreme error reductions are reported for the syllables having rhyme duration below 100 ms. This confirms the usefulness of our features on a very short syllable (such as neutral or unstressed syllable). 7. CONCLUSION In this paper, we have demonstrated a method for Thai continuous tone recognition. The coarticulatory, intonation, and stress effects have been considered. We have proposed several tone features to account these effects. We performed experiments on Thai proverb corpus and Thai animal story corpus using feedforward neural networks. The experimental results show that the proposed tone features highly increase tone recognition rates. The best results are 93.60% and 92.67% for Thai proper name and Thai animal story corpora, respectively. 8. REFERENCES [1] S. Potisuk, M. P. Harper, and J. Gandour, “Classification of Thai tone sequences in syllable-segmented speech using the analysis-by-synthesis method,” IEEE Transactions on Speech Audio Processing, vol. 7, no. 1, pp. 95–102, 1999.
[5] S.-H. Chen and Y.-R. Wang, “Tone recognition of continuous Madarin speech based on neural networks,” IEEE Transactions on Speech Audio Processing, vol. 3, no. 2, pp. 146–150, 1995. [6] N. Thubthong and B. Kijsirikul, “An empirical study for constructing Thai tone models,” in Proc. the 5th Symposium on Natural Language Processing and Oriental COCOSDA Workshop, 2002, pp. 179–186. [7] M. J. Ross, H. L. Shaffer, A. Cohen, R. Freudberg, and H. J. Manley, “Average magnitude difference function pitch extractor,” IEEE Transactions on Acoustics, Speech, Signal Processing, vol. ASSP22, pp. 353– 362, 1974. [8] D. J. Hermes and J. C. van Gestel, “The frequency scale of speech intonation,” Journal of the Acoustical Society of America, vol. 90, no. 1, pp. 97–102, 1991. [9] S. Potisuk, M. P. Harper, and J. Gandour, “Speakerindependent automatic classification of Thai tones in connected speech by analysis-by-synthesis method,” in Proc. Int. Conf. Acoustics, Speech, and Signal Processing, 1995, pp. 632–635. [10] P. Lieberman, W. Katz, A. Jongman, R. Zimmerman, and M. Miller, “Measures of the sentence intonation of read and spontaneous speech in american english,” Journal of the Acoustical Society of America, vol. 77, no. 2, pp. 649–657, 1985. [11] C. Wang and S. Seneff, “Improved tone recognition by normalizing for coarticulation and intonation effects,” in Proc. Int. Conf. Spoken Language Processing, 2000. [12] N. Thubthong and B. Kijsirikul, “Stress and tone recognition of polysyllabic words in Thai speech,” in Proc. Int. Conf. Intelligent Technologies, 2001, pp. 356–364. [13] D. van Kuijk and L. Boves, “Acoustic characteristics of lexical stress in continuous telephone speech,” Speech Communication, vol. 27, pp. 95–111, 1999.