Modeling of Sentence-medial Pauses in Bangla

Interspeech 2010, Makuhari, Japan, September 26-30, 2010

Modeling of Sentence-medial Pauses in Bangla Readout Speech: Occurrence and Duration

Shyamal Das Mandal (1), Arup Saha (1), Tulika Basu (1), Keikichi Hirose (2), Hiroya Fujisaki (3)

(1) Centre for Development of Advanced Computing, Kolkata
(2) Department of Information and Communication Engineering, University of Tokyo
(3) Professor Emeritus, University of Tokyo

{shyamal.dasmandal, arup.saha, tulika.basu}@cdackolkata.in, [email protected], [email protected]

Abstract

Control of pause occurrence and duration is an important issue for text-to-speech synthesis systems. In text-readout speech, pauses occur unconditionally at sentence boundaries and with high probability at major syntactic boundaries such as clause boundaries, but more or less arbitrarily at minor syntactic boundaries. Pause duration tends to be longer at the end of a longer syntactic unit. A detailed analysis is conducted of sentence-medial pauses in readout speech of Bangla. Based on the results, linear models (with syntactic unit length and distance to the directly modified word as variables) are constructed for pause occurrence and duration. The models are evaluated using test data not included in the analyzed data (open-test condition). The results show that the proposed models can predict occurrence probability correctly for 87% of phrase boundaries, and pause duration within ±100 ms for 80% of the cases.

Index Terms: pause occurrence probability, pause duration, Bangla readout speech, text-to-speech synthesis

1. Introduction

Man-machine communication in speech mode involves the integration of all technologies needed for both speech input and output, for all the attributes demanded by the discipline of the associated language. In India, speech synthesis is considered a primary need for empowering not only disabled people but also the functionally illiterate population. A direct speech mode of interaction with the computer may give them mediator-free access to e-knowledge. Various speech synthesis systems are now appearing for some of the major Indian languages [1]; however, all of them can only generate flat and monotonous speech, which makes sustained listening perceptually difficult. Prosody (intonation and rhythm) of spoken language plays an important role in giving both intelligibility and naturalness to synthesized speech, even in the readout mode alone. Hence, applying knowledge of the intonation and rhythm patterns of a language is extremely important for machine-generated speech in spoken language technology.

In Text-to-Speech Synthesis (TTS), the information for prosody generation has to be predicted from the input text alone. Predicting a pause at a proper position with a proper duration is one of the important parts of prosody modeling. Synthesized speech will be unnatural, and sometimes even unacceptable, if pauses of appropriate length are not properly placed [2]. Pauses facilitate comprehension on the part of the listener, and allow the speaker to breathe in when necessary. A good speaker is generally capable of fulfilling these two requirements. Therefore, good modeling of the occurrence and duration of pauses is necessary for TTS systems with acceptable speech quality.

Pauses in speech generally occur at syntactic boundaries [3]. Their occurrence is unconditional at certain boundaries, such as between sentences, but is more or less arbitrary at minor syntactic boundaries [4]. Their duration tends to be longer at the end of larger syntactic units, but shows statistical variation.

Research on pause modeling has received less attention than research on F0 models and segmental duration models. The rule-based method [5] [6] is one of the typical approaches. It uses linguistic expertise to infer pause generation rules from observations on large speech corpora. This approach is simple and convenient, but deriving a large number of rules is quite time-consuming. An ANN model for pause duration prediction has also been reported [7]. It generates better results than rule-based methods, but it needs a very large corpus for training, and the results are still unsatisfactory in most cases. Fujisaki et al. developed linear models for predicting the occurrence probability and duration of pauses in readout speech of Common Japanese [8].

Pauses in text reading can be divided into three categories: (a) pauses between paragraphs, (b) pauses between sentences in a paragraph, and (c) pauses within a sentence. Since (a) and (b) are unconditional, this study concentrates only on (c), which are called sentence-medial pauses. The aim and scope of the current study are limited to the prediction of sentence-medial pauses which occur at syntactic boundaries in Bangla readout speech without specific foci and emotions.

The present paper reports a detailed investigation of the occurrence probability and duration of sentence-medial pauses in readout speech of Bangla. Based on that investigation, models are developed and integrated with the ESNOLA-based concatenative TTS system for Bangla developed by C-DAC, Kolkata [1]. This is the first pause model of its kind developed for the Bangla language.

Copyright © C-DAC, Kolkata

2. The Bangla Language

Bangla is a part of the Indic group of the Indo-Aryan (IA) branch of the Indo-European family of languages. It is the official state language of the Eastern Indian state of West Bengal and the national language of Bangladesh. With nearly 230 million total speakers, Bangla is one of the most spoken languages (ranking fifth) in the world [9]. Dialect-wise, Bangla is divided into two main branches: Western and Eastern. The Western branch consists of the Rarha (South), Varendra (North Central) and Kamrupa (North Bengal) dialect clusters. Rarha is further sub-divided into South Western Bangla (SWB) and Western Bangla, the Standard Colloquial Bengali (SCB) spoken around Kolkata [10]. The present study is based on the official dialect of West Bengal, i.e., Standard Colloquial Bengali (SCB).

3. The Speech Material

The present study aims at developing pause models for a readout-mode TTS system of Bangla. Three pieces of text material, each containing three paragraphs on different topics such as science, sports and general news, were taken from a popular Bengali newspaper. The text material contains 72 Bangla sentences of different lengths. For model construction, the text was read by 8 native Bengali speakers (2 male and 6 female, aged between 20 and 40 years) with 6 repetitions. The average speech rate of the speakers is 5.9 syllables/s. For testing the models, a separate paragraph containing 17 sentences was recorded by 4 informants (out of the above 8) with 6 repetitions, read at the same speech rate. In order to let the speakers decide where and how long to pause, sentence-medial punctuation marks were removed from the text. For fluency of reading, each speaker was instructed to read the text several times before recording. The data were recorded in a speech studio environment in 16-bit, 22050 Hz format. A semi-automatic pause marker based on the state phase algorithm [11] was developed for extracting the pauses from the speech signal. The result was visually verified and aligned with the text to ensure high accuracy.

4. Factors Affecting the Occurrence and Duration of Sentence-Medial Pauses

A preliminary study indicates that the occurrence and duration of intra-sentential pauses in Bangla mainly depend upon the following three factors [6]: the type of the phrase, the phrase length, and the distance between the current phrase and its dependent counterpart.

4.1. Type of the Phrase

Generally, phrases in Bangla are categorized into the following 5 types: (a) noun phrase (NP), (b) adjective phrase (AJP), (c) adverb phrase (ADVP), (d) post-positional phrase (PP) and (e) verb phrase (VP). However, the adjective phrase type is not considered in this study, since out of 190 adjective phrases in the analyzed data, only 4 are followed by a pause. Table 1 shows the average occurrence probability of pauses for the remaining four phrase types. Table 2 shows the results of a t-test of pause duration for the same four phrase types.

Table 1: Average occurrence probability of pause for the four phrase types: NP, ADVP, PP and VP.

Phrase Type    Occurrence Probability
NP             0.25
ADVP           0.38
PP             0.42
VP             0.63

Table 2: Results of a t-test on the significance of differences in the mean of pause duration for the four phrase types: NP, ADVP, PP and VP.

Phrase Type    NP       ADVP               PP
NP             -
ADVP           90.0%    -
PP             90.0%    Not significant    -
VP             99.9%    99.9%              99.0%

These tables indicate that the occurrence probabilities of ADVP and PP are almost equal and that the difference in pause duration between these two categories is not statistically significant. Therefore, ADVP and PP are merged together and henceforth termed ADVP.

4.2. Phrase Length

Phrase length is the length l of the phrase in number of syllables. The phrase length is cumulative if a pause does not occur at the end of the phrase. In Figure 1, phi represents the i-th phrase of the sentence S and li represents the length of the corresponding phrase. In this example, the pause occurs at the 3rd phrase boundary. The cumulative phrase length (Lci) can be defined and calculated as illustrated.

[Figure 1 diagram: sentence S with phrases ph1 (no pause), ph2 (no pause), ph3 (pause) and ph4 (end), their lengths l1 to l4, and the cumulative length lc3.]

Figure 1: Illustration of the cumulative phrase length (Lci).

4.3. Distance between the Current Phrase and its Dependent Counterpart

For the extraction of the above parameters, the text is manually tagged with Parts of Speech (POS) and phrase information. The annotated sentences are represented by a binary tree. The following logic is used to construct the binary tree.

Step 1: If VP is at the sentence-final position, the words are arranged in right-branching order to form the binary tree.
Step 2: If VP is at a sentence-medial position, the tree is divided into two sub-trees and Step 1 is applied to form each of the sub-trees. The sub-trees are then joined together to form the complete tree.

Distance d is calculated from the distance (in words) between the last word of the current phrase and the last word of its dependent phrase. Figure 2 illustrates the calculation of this distance parameter d for a given Bangla sentence, which can be translated into English as "If there are games in winter, they will be physically fit".

[Figure 2 diagram: binary tree S over the words (/ʃite/) (/k ælɑ/) (/t ɑkle/) (/ʃɑririkb ɑbe/) (/õrɑ/) (/pou/) (/t ɑkbe/), with the following word-level values.]

Word No.            1    2    3    4    5    6    7
Related Word No.    3    3    7    7    7    7    -
Distance d          2    1    4    3    2    1    -

Figure 2: Illustration of the distance parameter calculation for a Bangla sentence.
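The two numeric features of Sections 4.2 and 4.3 can be sketched in a few lines. This is a minimal illustration, not the authors' extraction procedure: it assumes that the running length is reset to zero after every phrase boundary where a pause occurs, and that the distance is simply the difference of word positions. The function names and the syllable counts in the example are hypothetical.

```python
from typing import List, Sequence


def cumulative_phrase_lengths(lengths: Sequence[int],
                              pause_after: Sequence[bool]) -> List[int]:
    """Return the cumulative length (in syllables) at each phrase boundary.

    Assumption: the running sum restarts after a boundary with a pause.
    """
    if len(lengths) != len(pause_after):
        raise ValueError("lengths and pause_after must have the same size")
    result = []
    running = 0
    for length, paused in zip(lengths, pause_after):
        running += length          # no pause so far: keep accumulating
        result.append(running)
        if paused:                 # a pause closes the cumulative unit
            running = 0
    return result


def word_distance(current_last_word: int, dependent_last_word: int) -> int:
    """Distance d in words between the last word of the current phrase
    and the last word of its dependent phrase (1-based word numbers)."""
    return dependent_last_word - current_last_word


# Hypothetical syllable counts for ph1..ph4, with a pause after ph3.
example_lengths = cumulative_phrase_lengths([4, 3, 5, 6],
                                            [False, False, True, False])
example_distance = word_distance(1, 3)
```

With these inputs the third boundary receives the sum of the first three phrase lengths, and the fourth phrase starts a new count.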

5. Occurrence Probability and Duration of Pauses

For each of the three phrase types, the occurrence probability is defined as the number of phrase boundaries which are followed by a pause divided by the total number of phrase boundaries having a given set of values of l and d. The pause duration is the length of the pause at the phrase boundary, measured in milliseconds (ms).
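The definition of occurrence probability above amounts to a grouped ratio of counts. The following sketch shows one way to compute it; the record layout (phrase type, l, d, pause flag), the function name and the sample data are assumptions made for illustration and do not come from the paper.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

Key = Tuple[str, int, int]                    # (phrase type, l, d)
Boundary = Tuple[str, int, int, bool]         # key fields + "followed by pause"


def occurrence_probability(boundaries: Iterable[Boundary]) -> Dict[Key, float]:
    """Pauses divided by boundaries, for each (phrase type, l, d) group."""
    total = defaultdict(int)
    paused = defaultdict(int)
    for phrase_type, l, d, has_pause in boundaries:
        key = (phrase_type, l, d)
        total[key] += 1
        if has_pause:
            paused[key] += 1
    return {key: paused[key] / total[key] for key in total}


# Hypothetical observations, not data from the study.
sample = [
    ("NP", 6, 2, True),
    ("NP", 6, 2, False),
    ("NP", 6, 2, False),
    ("NP", 6, 2, True),
    ("VP", 9, 1, True),
]
probabilities = occurrence_probability(sample)
```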


5.1. Effect of Individual Parameters upon Occurrence Probability and Duration

Panels (a) to (c) in Figure 3 show the individual effects of the three factors on the occurrence probability of sentence-medial pauses. These panels show that all three factors have significant effects, and that the occurrence probability increases almost linearly with both l and d.

[Figure 3 plots: occurrence probability (vertical axis) against (a) phrase type (NP, ADVP, VP), (b) phrase length l and (c) distance d.]

Figure 3: The individual effect of the factors on the occurrence probability of sentence-medial pauses: (a) effect of phrase types, (b) effect of phrase length l and (c) effect of distance d.

Panels (a) to (c) in Figure 4 show the individual effects of the three factors on the duration of sentence-medial pauses. The figures show that all three factors have significant effects, and that the pause duration depends almost linearly on both l and d.

[Figure 4 plots: duration in ms (vertical axis) against (a) phrase type (NP, ADVP, VP), (b) phrase length l and (c) distance d.]

Figure 4: The individual effect of the factors on the sentence-medial pause duration: (a) effect of phrase type, (b) effect of phrase length l, (c) effect of distance d.

5.2. Modeling of Pause Occurrence Probability

For each of the three phrase types, the rate of occurrence of a pause is almost 0 for very small values of l and d, and increases almost linearly with l and d until it reaches its highest probability, 1.0. This portion of the data can be approximated by the following linear equation in l and d:

$$P_c(X) = a\,l + b\,d + c, \qquad X \in \{\mathrm{NP}, \mathrm{ADVP}, \mathrm{VP}\} \tag{1}$$

The coefficients a, b and c can be determined by linear regression analysis over the ranges of l and d where the linearity is approximately valid. They are obtained from the training data for each of the three phrase types. Their values are shown in Table 3, along with the R-square values of the model for the respective phrase types. Since the occurrence probability Pc has to lie between 0 and 1, Pc for each of the three phrase types (NP, ADVP, VP) can be approximated by the following equations:

$$P_c(\mathrm{NP}) = \min\{\max(0.077\,l + 0.122\,d - 0.436,\ 0),\ 1\} \tag{2}$$

$$P_c(\mathrm{ADVP}) = \min\{\max(0.072\,l + 0.139\,d - 0.486,\ 0),\ 1\} \tag{3}$$

$$P_c(\mathrm{VP}) = \min\{\max(0.078\,l + 0.162\,d - 0.467,\ 0),\ 1\} \tag{4}$$

Table 3: Model parameters for pause occurrence probability.

Phrase Type    a        b        c         R-square value
NP             0.077    0.122    -0.436    0.902
ADVP           0.072    0.139    -0.486    0.813
VP             0.078    0.162    -0.467    0.791
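As a worked illustration of equations (2) to (4), the clamped linear model can be written as a small lookup plus one expression. The coefficients are those of Table 3; the function and dictionary names are hypothetical, and the sketch makes no claim about how the model is implemented inside the TTS system.

```python
from typing import Dict, Tuple

# (a, b, c) per phrase type, taken from Table 3.
OCCURRENCE_COEFFS: Dict[str, Tuple[float, float, float]] = {
    "NP": (0.077, 0.122, -0.436),
    "ADVP": (0.072, 0.139, -0.486),
    "VP": (0.078, 0.162, -0.467),
}


def pause_probability(phrase_type: str, l: int, d: int) -> float:
    """Occurrence probability Pc = a*l + b*d + c, clamped to [0, 1].

    l: (cumulative) phrase length in syllables, d: distance in words.
    """
    a, b, c = OCCURRENCE_COEFFS[phrase_type]
    return min(max(a * l + b * d + c, 0.0), 1.0)
```

For an NP with l = 9 and d = 2, the linear part is 0.693 + 0.244 - 0.436 = 0.501. A short VP with l = 3 and d = 1 gives a negative linear value, which the clamp maps to 0.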

5.3. Modeling of Pause Duration

For each of the three phrase types, the mean pause duration DX can be approximated by a linear model in l and d over the whole range of l and d observed in the training data:

$$D_X = \alpha\,l + \beta\,d + \gamma, \qquad X \in \{\mathrm{NP}, \mathrm{ADVP}, \mathrm{VP}\} \tag{5}$$

The coefficients α, β and γ can be determined by linear regression analysis over the ranges of l and d. They are obtained from the training data for each of the phrase types. Their values are shown in Table 4, along with the R-square values of the model for the respective phrase types.

Table 4: Model parameters for pause duration.

Phrase Type    α       β        γ         R-square value
NP             5.29    11.06    194.88    0.813
ADVP           5.84    8.46     197.2     0.833
VP             5.72    6.11     237.18    0.846

Figure 6 shows the cumulative percentage of cases in which the error of the predicted pause duration is within the range given by the abscissa. The prediction error is within ±100 ms for about 80% of all the sentence-medial pauses observed in the test data set.

[Figure 6 plot: cumulative percentage (vertical axis, 0 to 100) against prediction error in ms (horizontal axis, 0 to 500).]

Figure 6: Cumulative percentage vs. prediction error of pause duration.
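The duration model of equation (5) and the tolerance-based evaluation behind Figure 6 can be sketched as follows. The coefficients come from Table 4; the function names, the default tolerance argument and the sample values are assumptions for illustration only, and the observed durations below are invented, not measurements from the study.

```python
from typing import Dict, Sequence, Tuple

# (alpha, beta, gamma) per phrase type, taken from Table 4.
DURATION_COEFFS: Dict[str, Tuple[float, float, float]] = {
    "NP": (5.29, 11.06, 194.88),
    "ADVP": (5.84, 8.46, 197.2),
    "VP": (5.72, 6.11, 237.18),
}


def pause_duration_ms(phrase_type: str, l: int, d: int) -> float:
    """Mean pause duration D = alpha*l + beta*d + gamma, in ms."""
    alpha, beta, gamma = DURATION_COEFFS[phrase_type]
    return alpha * l + beta * d + gamma


def fraction_within(predicted: Sequence[float],
                    observed: Sequence[float],
                    tolerance_ms: float = 100.0) -> float:
    """Share of cases whose absolute prediction error is <= tolerance_ms."""
    if len(predicted) != len(observed) or not predicted:
        raise ValueError("need two non-empty sequences of equal length")
    hits = sum(1 for p, o in zip(predicted, observed)
               if abs(p - o) <= tolerance_ms)
    return hits / len(predicted)


# Two model predictions plus one arbitrary value, against invented observations.
predictions = [pause_duration_ms("NP", 9, 2), pause_duration_ms("VP", 12, 1), 300.0]
share = fraction_within(predictions, [250.0, 450.0, 390.0])
```

Evaluating `fraction_within` over a range of tolerances gives one point per tolerance of a cumulative curve of the kind plotted in Figure 6.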

6. Result and Discussion

The performance of the models is evaluated using the test data. The models are used to predict the probability of sentence-medial pause occurrence at each phrase boundary, and the predicted values are compared with the actual occurrence probabilities in the test data. Figure 5 shows the cumulative percentage of cases in which the prediction error is within the range given by the abscissa. The prediction error is within ±10% for about 87% of all the phrase boundaries.

[Figure 5 plot: cumulative percentage (vertical axis, 0 to 100) against prediction error (horizontal axis, 0 to 0.3).]

Figure 5: Cumulative percentage vs. prediction error of occurrence probability.

The total number of phrases in the test set of 6 repetitions is 2304, of which 991 phrase boundaries are followed by a pause. Table 5 shows the results of another method of evaluating pause occurrence prediction on the test set.

Table 5: Result of the pause occurrence prediction.

Pause Occurrence Prediction    Result in Percentages
Correct Detection (Dc)         85%
False Inclusion (FI)           19%
False Rejection (FR)           15%

Here, Dc is defined as the percentage of phrase boundaries where the actual and the model-predicted occurrence probabilities are both above 50%. FI is defined as the percentage of phrase boundaries where the actual probability is below 50% while the model-predicted probability is above 50%. FR is defined as the percentage of phrase boundaries where the actual probability is above 50% while the model-predicted probability is below 50%. Table 5 shows that the percentage of false inclusions is rather high. The test data also show that most of the false inclusions are due to the presence of a short VP after an NP. In these cases false inclusion can be avoided by introducing a simple rule, which brings the rate of false inclusion down to 12%.

The derived duration models are used to predict the mean duration of a sentence-medial pause at each of the phrase boundaries where a pause occurs. The predicted pause duration is compared with the observed duration in the test data set; the resulting error distribution is shown in Figure 6.

7. Conclusions

This paper reported the results of a systematic study of the factors that influence the occurrence of sentence-medial pauses and their duration in readout text of Bangla. The objective of the study is to develop models for pause insertion and pause length assignment in a readout-mode Bangla text-to-speech synthesis system. The results clearly indicate the effect of some of the important factors in the process of pause insertion by a human speaker in text reading mode. Both the occurrence probability and the duration of sentence-medial pauses are much higher after a VP than after the other phrase types. The results are quite satisfactory for the use of these models in Bangla TTS systems. The models have been integrated in an ESNOLA-based text-to-speech synthesis system for Bangla [1].

8. References

[1] Das Mandal, S. K. and Datta, A. K., "Epoch Synchronous Non-Overlapping Add (ESNOLA) Method Based Concatenative Synthesis System for Bangla", Proc. of 6th ISCA Speech Synthesis Workshop, University of Bonn, Germany, pp. 351-355, 2007.
[2] Yu, J. and Tao, J., "The Pause Duration Prediction for Mandarin Text-to-Speech System", 2005 IEEE International Conference on Natural Language Processing and Knowledge Engineering (IEEE NLP-KE 2005), Wuhan, China, pp. 204-208, 2005, ISBN 0-7803-9361-9.
[3] Fujisaki, H. and Omura, T., "Characteristics of Durations of Pauses and Speech Segments in Connected Speech", Annual Report, Engineering Research Institute, Faculty of Engineering, University of Tokyo, vol. 30, pp. 69-74, 1971.
[4] Kaiki, N. and Sagisaka, Y., "Pause Characteristics and Local Phrase-dependency Structure in Japanese", Proc. ICSLP-1992, Banff, Canada, pp. 357-360, 1992.
[5] Lee, L.-S., Tseng, C.-Y. and Ouh-Young, M., "The Synthesis Rules in a Chinese Text-to-Speech System", IEEE Trans. Acoustic, Speech, Signal Processing, vol. 37, no. 9, pp. 269-285, 1989.
[6] Saha, A., Basu, T. and Khan, S., "Analysis of Occurrence and Duration of Intra and Inter Sentential Pauses in Bangla Read Out Speech", Proc. of Oriental COCOSDA, 2008, Kyoto, Japan, pp. 53-58, 2008.
[7] Chen, S.-H., Hwang, S.-H. and Tsai, C.-Y., "A First Study on Neural Net Based Generation of Prosodic and Spectral Information for Mandarin Text-to-Speech", ICASSP'92, San Francisco, 1992.
[8] Fujisaki, H., Ohno, S. and Yamada, S., "Factors Affecting the Occurrence and Duration of Sentence-medial Pauses in Japanese Text Reading", Proc. ICPhS'99, San Francisco, vol. 1, pp. 659-662, 1999.
[9] Lewis, M. Paul (ed.), "Ethnologue: Languages of the World", Sixteenth edition, Dallas, 2009, ISBN 978-1-55671-216-6.
[10] Bhattacharya, K., Bengali Phonetic Reader, Central Institute of Indian Languages, 1999.
[11] Das Mandal, S. K., Gupta, B. and Datta, A. K., "Word Boundary Detection Based on Suprasegmental Features: A Case Study on Bangla Speech", International Journal of Speech Technology, vol. 9, pp. 17-28, 2007.