INTERSPEECH 2007, August 27-31, Antwerp, Belgium

Empirical evidence for prosodic phrasing: pauses as linguistic annotation in Korean read speech

Hyongsil Cho & Daniel Hirst
Laboratoire Parole et Langage, CNRS UMR 6057, Université de Provence, Aix-en-Provence
{hyongsil.cho; [email protected]}

Abstract

This paper looks at the relationship between acoustic cues and judgments of prosodic boundaries. It is argued that in read speech the presence of a silent pause can generally be taken as an indication of a prosodic boundary, whereas in spontaneous speech the presence of a silent pause is neither a necessary nor a sufficient condition for a prosodic boundary. Two experiments on Korean read speech are described. In the first, subjects were asked to say whether extracts of speech (filtered to make them unintelligible) were taken from the same sentence or from two different sentences. The results confirmed that in the majority of cases listeners do not rely on the presence of a silent pause: even when the pause has been removed, the boundary is identified almost as well as when it has not. In the second experiment a number of acoustic cues, temporal and frequential, were used to predict the distribution of pauses without reference to the silence itself.

1. Introduction

Discussions of prosodic phrasing rely crucially on judgments of acceptability. The empirical basis for such judgments is, however, notoriously difficult to establish; yet such evidence is clearly necessary in order to avoid circularity. Even judgments by trained listeners such as those used in the ToBI paradigm [1] are not above suspicion [19]. It is clear that before empirical studies of large oral databases can be carried out, it will be necessary to establish automatic or semi-automatic techniques of empirical annotation. Work in speech synthesis and automatic speech recognition has, in the last few decades, brought to light a number of acoustic cues that correlate with judgments of prosodic boundaries (for a review cf. [15]). Yet, once again, the problem of validating these judgments arises. Asking listeners to indicate prosodic boundaries amounts to eliciting metalinguistic judgments. As is well known, such metalinguistic judgments must be treated with caution: they are highly dependent on the listeners' conception of the task at hand and on the theory-dependent training which they have been given [17]. Ideally we should like to be able to validate acoustic models of prosodic phrases with evidence obtained directly from untrained subjects. In this paper we propose such a technique and present preliminary results from a study of Korean read speech. This study is part of a larger project using similar material for English, French, Russian, Japanese and Korean.

The most often cited acoustic cues for prosodic boundaries are segmental lengthening and the presence of silent pauses. Both of these cues, however, while highly correlated with judgments of prosodic phrasing in read speech, are considerably less reliable in spontaneous conversational speech. In spontaneous speech, both lengthening and pausing can be a sign of hesitation. Speakers regularly produce utterances such as: If you look in the::… right hand corner, where (::) indicates lengthening and (…) a silent pause. Here the word "the" is both lengthened and followed by a silent pause without being perceived as either prominent or phrase-final. In spontaneous speech, then, the presence of a pause is not a sufficient cue for a prosodic boundary. In read speech, on the other hand, the correlation between the presence of a pause and a prosodic boundary is much more reliable, but it can easily be shown that a pause is not a necessary cue either. To demonstrate this, we edited several passages of read speech, removing all silent pauses over 0.25 s long. Informal listening tests gave the impression that the speech was faster but that the prosodic boundaries were unaffected. One of the best definitions of a major intonation boundary is, in fact, "a place where a pause may be inserted". Since, in careful read speech, pauses are rarely present without a prosodic boundary, they may be considered redundant; we may nevertheless use the presence of silent pauses as a linguistic signal indicating the presence of boundaries which are at the same time signaled by other acoustic cues. This leads us to treat the insertion of pauses in read speech as a form of annotation by the speaker, but one which, unlike most types of annotation, is purely linguistic rather than metalinguistic.

In the first experiment reported here, in order to test the hypothesis that silent pauses are essentially redundant, we asked listeners to identify boundaries while listening to extracts from passages of read speech. In half of the extracts, the original pauses produced by the reader were left untouched; in the other half, the extracts were edited so that all silent pauses over 0.25 s were removed. In the second experiment we examine the correlation of a number of other acoustic cues with the presence of a silent pause, in order to see whether it is possible to predict the position of the boundaries without making use of the information concerning the silent pause itself. If this were the case it would provide a useful way of bootstrapping a prosodic boundary recognition system which could be trained on data from read speech and then applied to spontaneous speech, where silent pauses may well not be good indicators of boundaries. While we would not expect the transfer from read speech to spontaneous speech to be perfect, it should at least give us a useful first approximation.



2. Experiment 1: the influence of silent pauses on the perception of speech boundaries


One of the most obvious phonetic cues, in all languages, for the presence of a prosodic boundary is the presence of a silent pause. This has also been specifically claimed for Korean in [7], [8], [9], [10], [11], [12], [13], [18]. After some preliminary studies and informal reports, however, we came to the conclusion that not only in spontaneous speech but also in read speech, the presence of a silent pause is neither a necessary nor a sufficient condition for the identification of a prosodic boundary. To test this hypothesis, we designed a simple experiment on the correlation between silent pauses and the perception of prosodic boundaries.

Figure 1. Distribution of pauses (in seconds) for the selected extracts.

2.1. Material and subjects

To create stimuli for the test, we first made a recorded corpus: the forty continuous passages of the Eurom1 corpus were freely translated into Korean by the first author [4]. They were then recorded by the same author in an anechoic chamber and digitised. The recordings were annotated manually at the word and phoneme level using the Praat software [2]. From the forty recorded passages we selected 42 short extracts of about 2 seconds' duration each. Half of these consisted of speech material containing a silent pause (the duration of which was not counted in the duration of the extract): we call this group of extracts category P (pause). The other half, containing no silent pause and no obvious prosodic boundary, form category N (no pause). Copies of the P extracts were then edited to remove the silent pause; these edited versions form category R (removed). The recordings were subsequently grouped into two blocks, A and B. Block A contained the 21 extracts of category N, plus 10 extracts of category R together with 11 extracts of category P. Block B contained the same 21 extracts of N, the 10 remaining extracts of P and the 11 remaining extracts of R. Each block thus contained all the extracts without pauses and all the extracts originally containing pauses, approximately half of which had been edited to remove the acoustic silence. In each block the order of the stimuli was randomized. In all of the extracts containing a silent pause, the duration of the pause was considerably more than 0.2 s, ranging from 0.348 s to 0.814 s. The distribution of the pauses by duration is shown in the histogram of Figure 1. We decided to present the recordings to native listeners and to elicit judgments on the presence or absence of a prosodic boundary. Twenty native speakers of Korean took part in the test, ten listening to the recordings of block A and ten to those of block B. None of them reported any hearing impairment. Owing to the difficulty, mentioned above, of obtaining reliable metalinguistic judgments, we chose to explain the task without mentioning boundaries or pauses.
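As an illustration of the editing step just described, here is a minimal R sketch of how the category R stimuli could be derived, assuming the pause boundaries are taken from the manual Praat annotation. The tuneR package, the helper remove_pauses and the file names are hypothetical illustrations, not the tool chain actually used (the editing was presumably done in Praat itself).

    # Excise annotated silent pauses from a recording (hypothetical sketch).
    # `pauses` holds start/end times (in seconds) of each silent pause.
    library(tuneR)

    remove_pauses <- function(wave, pauses) {
      keep <- rep(TRUE, length(wave@left))
      for (i in seq_len(nrow(pauses))) {
        from <- max(1, round(pauses$start[i] * wave@samp.rate))
        to   <- min(length(keep), round(pauses$end[i] * wave@samp.rate))
        keep[from:to] <- FALSE                  # drop the silent samples
      }
      Wave(wave@left[keep], samp.rate = wave@samp.rate, bit = wave@bit)
    }

    extract <- readWave("extract_P_01.wav")            # hypothetical file name
    pauses  <- data.frame(start = 1.02, end = 1.53)    # hypothetical pause
    writeWave(remove_pauses(extract, pauses), "extract_R_01.wav")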

Listeners were presented with a forced-choice task: they were asked whether the extract they heard was taken from one sentence or from two different sentences. To clarify the instructions, subjects were first asked to listen to extracts from the following pair of artificial examples:

다시 방으로 들어갔다. 나와 엄마는 청소를 했다.
[(we) went back into the bedroom. Mother and I did the housework.]

다시 방으로 들어갔다 나와 엄마는 청소를 했다.
[(we) went back into the bedroom and came out again, (and then) mother did the housework.]

Here the sequence of syllables is identical, and the placement of the prosodic boundary determines the interpretation. The subjects were told that the first extract was taken from two different sentences whereas the second was part of a single sentence. It is worth noting that in Korean, prosodic boundaries are very often accompanied by specific particles or grammatical morphemes [4], [14]. This makes it difficult to be sure that a listener's identification of a boundary relies on prosodic rather than segmental cues. To ensure that listeners were not using segmental cues, we filtered the speech extracts to make them unintelligible. After some experimentation, we used a 600 Hz low-pass filter, applied with the Praat command Filter (formula)... and the formula: if x > 600 then 0 else self fi. The resulting recordings were unintelligible but the prosody remained clearly audible.

The stimuli were presented in two continuous sound files corresponding to blocks A and B. Stimuli were separated by a silence of 5 seconds. Each stimulus was preceded by a beep and a silence of 500 ms; every fifth stimulus was preceded by two beeps to help subjects keep their place in the test. Subjects were requested to listen to the complete recording (either block A or block B) without stopping and to answer "one" or "two" for each stimulus, depending on whether they identified the stimulus as part of one sentence or of two. They were then permitted to listen to the recording a second time and to correct their initial answers if they wished.
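The 600 Hz filtering step can also be sketched outside Praat. The following R fragment mirrors the spectral formula above by zeroing all components above 600 Hz; tuneR is assumed and the file names are hypothetical.

    # Low-pass at 600 Hz by zeroing spectral components, mirroring
    # Praat's "if x > 600 then 0 else self fi" (hypothetical sketch).
    library(tuneR)

    lowpass600 <- function(wave) {
      s    <- wave@left
      n    <- length(s)
      spec <- fft(s)
      freq <- (0:(n - 1)) * wave@samp.rate / n      # frequency of each bin
      # Zero everything above 600 Hz (and its mirrored image), keeping
      # the spectrum conjugate-symmetric so the output stays real
      spec[freq > 600 & freq < wave@samp.rate - 600] <- 0
      Wave(round(Re(fft(spec, inverse = TRUE)) / n),
           samp.rate = wave@samp.rate, bit = wave@bit)
    }

    writeWave(lowpass600(readWave("extract_R_01.wav")), "stimulus_01.wav")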


2.2. Results

Since the responses were dichotomous ("one" or "two") we fitted a logistic regression (linear logit model) and several ANOVAs to the data. The results show that the difference between blocks (A or B) is not significant (p = 0.675), while the difference between subjects is slightly significant (p = 0.016). On the other hand, as shown in Figure 2, most of the stimuli belonging to category R (removed) were in fact identified, across subjects, as containing a boundary. The difference between categories (P, R or N) was the most significant variable (p < 0.0001); analysis by Scheffé's method (95% CI) shows the difference between P and R (0.145 to 0.312) to be smaller than that between P and N (0.723 to 0.868) or between R and N (0.494 to 0.639). Moreover, ANOVA tests on categories per subject reveal that the difference between P and R is not significant for 15 of the 20 subjects, while the distinction between P and N remains significant for all subjects.
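A minimal sketch of such a logit model in R (the software used for the analyses in Section 3); the exact model specification is not given in the paper, and the data frame resp and its column names are hypothetical.

    # One row per (subject, stimulus): answer ("one" or "two"),
    # category (P/R/N), block (A/B) and a subject identifier.
    # The data frame `resp` and its columns are hypothetical.
    resp$two <- resp$answer == "two"
    model <- glm(two ~ category + block + subject,
                 data = resp, family = binomial)
    summary(model)   # per-factor coefficients and significance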

Figure 2. Percentage of stimuli identified as containing a prosodic boundary (i.e. response = "two") by category: P 88.1%, R 65.2%, N 11.0%.

In conclusion, this study demonstrates that when silent pauses have been edited out of an utterance, speakers of Korean are generally capable of identifying a prosodic boundary, even when the speech has been filtered to render it unintelligible.

3. Experiment 2: the prediction of silent pauses from acoustic data

The same Eurom1 corpus [3] that we used for the perception test was also used to study the acoustic cues accompanying silent pauses in read speech. In the framework of the bilateral Franco-Korean project STAR [6], the 40 continuous passages were each read by ten native speakers of Korean (5 male and 5 female). The manually corrected phonetic transcriptions of the passages were then aligned with the acoustic signal. To make sure that only genuine pauses were included in our analysis, we set a minimum threshold of 0.25 s. A total of 2637 pauses were counted for the entire corpus, with a mean duration of 0.566 s. For most speakers there was a clearly bimodal distribution with a threshold at around 0.750 s (cf. Figure 3). But given the far greater number of pauses shorter than 0.750 s (86%, mean 0.463 s) compared with the number of pauses longer than 0.750 s (14%, mean 0.990 s), we decided to treat all the pauses as a single category. In the rest of this paper, preliminary results are reported on the forty passages of one speaker only (not the author). It will obviously be desirable to apply the same technique to the recordings of the other nine speakers and to check how successfully the prediction of pauses can be carried out on a speaker-independent and perhaps even language-independent basis.

Figure 3. Histogram of pause durations for speaker MHY.

A number of different acoustic parameters have been used for the prediction of pauses. For this demonstration, we selected 12 parameters, 4 in the temporal domain and 8 in the frequency domain. For each pair of adjacent words in the corpus, whether separated by a silent pause or not, we extracted the following parameters:
• the duration (in seconds) of the words to the left and to the right of the word boundary
• the duration of the phonemes to the left and to the right of the word boundary
• the mean, maximum and minimum f0, and the min-to-max interval, during the 250 ms to the left and to the right of the word boundary

In order to avoid variability due to the nature of the individual phonemes, duration was expressed as a lengthening/shortening factor by dividing the measured value by the mean duration of that phoneme calculated over all forty passages. For the same reason, the f0 values were taken from the continuous smooth modelled curve provided as output by the Momel algorithm [5], [6]. In all, this gave a total of 12 parameters, which we used to predict the existence of a silent pause between each pair of adjacent words (a total of 1373 word pairs). The analysis was carried out using discriminant analysis in the R software [16], with the lda function from the MASS package. The percentage of correct predictions was 95.7%, with the following distribution of errors.

Predicted \ Observed    No pause    Pause
No pause                    1207       33
Pause                         26      107

Table 1. Predicted and observed silent pauses using a posteriori predictions.
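A minimal sketch of this analysis using the lda function from MASS, which the paper names explicitly; the data frame word_pairs and its column names are hypothetical reconstructions of the 12 parameters described above.

    # Discriminant analysis predicting a silent pause from the 12 cues:
    # 4 duration measures (expressed as lengthening/shortening factors)
    # and 8 Momel-smoothed f0 measures, plus a factor `pause` ("yes"/"no").
    # The data frame `word_pairs` and its columns are hypothetical.
    library(MASS)

    model <- lda(pause ~ ., data = word_pairs)
    pred  <- predict(model, word_pairs)$class

    table(Predicted = pred, Observed = word_pairs$pause)  # cf. Table 1
    mean(pred == word_pairs$pause) * 100   # percentage correct (95.7% here)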

Of course, this table gives predictions a posteriori, since the discriminant analysis was carried out on all the word pairs and the resulting model was then applied to the same data. In order to check whether these correspond to genuine predictions or are simply a recoding of the data, we split the corpus into two equal halves. Two linear discriminant analyses were then carried out separately on the two halves, and each model was used to predict the pauses in the other half of the data, so that this time none of the predictions was based on a posteriori knowledge of the data in question. The percentage of correct predictions this time was only marginally lower, 95.2%, with the following distribution of errors.

Predicted \ Observed    No pause    Pause
No pause                    1203       35
Pause                         30      105

Table 2. Predicted and observed silent pauses using a priori predictions, obtained by splitting the data into two halves and applying the model trained on each half to the data of the other half.
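The split-half check can be sketched in the same terms, again with the hypothetical word_pairs data frame used above.

    # Two-fold cross-prediction: train on each half, predict the other.
    library(MASS)

    half <- sample(rep(c(TRUE, FALSE), length.out = nrow(word_pairs)))
    m1 <- lda(pause ~ ., data = word_pairs[half, ])
    m2 <- lda(pause ~ ., data = word_pairs[!half, ])

    pred <- character(nrow(word_pairs))
    pred[!half] <- as.character(predict(m1, word_pairs[!half, ])$class)
    pred[half]  <- as.character(predict(m2, word_pairs[half, ])$class)

    table(Predicted = pred, Observed = word_pairs$pause)  # cf. Table 2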

The high number of correct predictions is of course largely due to the high proportion of words not followed by a pause: predicting "no pause" in all cases would already have given 90% correct predictions. A better estimate of the global efficiency of the prediction is the widely used F-measure, the harmonic mean of recall (the percentage of observed boundaries which are predicted) and precision (the percentage of predicted boundaries which are observed), which in this case is just over 75%.
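For concreteness, this value can be recomputed directly from the Table 2 counts:

    # Recall, precision and F-measure from the Table 2 confusion counts.
    recall    <- 105 / (105 + 35)   # observed pauses that were predicted
    precision <- 105 / (105 + 30)   # predicted pauses that were observed
    f_measure <- 2 * precision * recall / (precision + recall)
    f_measure                       # 0.764, i.e. just over 75%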

4. Conclusion

These preliminary results encourage us to pursue the development of the prediction of prosodic boundaries on the basis of acoustic data, without reference to the presence or absence of a pause except to train the model. In particular, it will be interesting to look at the distribution of the "errors" of prediction, keeping in mind the possibility that a prosodic boundary, even in read speech, may be present without a silent pause, or that a silent pause may occur without a clear prosodic boundary. It will also be interesting to compare results from several speakers, and to examine further how far such results are language- or dialect-specific and how far they can be applied to speech of a more spontaneous nature.

Acknowledgements

This work was carried out with the support of the Franco-Korean PAI STAR project AANVIS-KR (Egide).

5. References

[1] Beckman, Mary & Ayers, Gayle 1994. Guidelines for ToBI Labelling. Unpublished ms., Ohio State University. Version 3, March 1997. [http://ling.ohio-state.edu/Phonetics/etobi_homepage.html]
[2] Boersma, Paul & Weenink, David 2005. Praat: doing phonetics by computer (Version 4.4) [computer program]. Retrieved December 22, 2005 from http://www.fon.hum.uva.nl/praat/
[3] Chan, D.; Fourcin, A.; Gibbon, D.; Granström, B.; Huckvale, M.; Kokkinas, G.; Kvale, L.; Lamel, L.; Lindberg, L.; Moreno, A.; Mouropoulos, J.; Senia, F.; Trancoso, I.; Veld, C. & Zeiliger, J. 1995. EUROM: a spoken language resource for the EU. Proceedings of the 4th European Conference on Speech Communication and Speech Technology, Eurospeech '95 (Madrid), vol. 1, 867-880.
[4] Cho, Hyongsil & Hirst, Daniel 2006. The contribution of silent pauses to the perception of prosodic boundaries in Korean read speech. Proceedings of the Third International Conference on Speech Prosody, Dresden, 2006.
[5] Hirst, Daniel 2005. Form and function in the representation of speech prosody. In K. Hirose, D.J. Hirst & Y. Sagisaka (eds) Quantitative prosody modeling for natural speech description and generation (= Speech Communication 46 (3-4)), 334-347.
[6] Hirst, Daniel; Cho, Hyongsil; Kim, Sunhee & Yu, Heunji 2007. Evaluating two versions of the Momel pitch modelling algorithm on a corpus of read speech in Korean. Proceedings of Interspeech 2007 (this conference), Antwerp, 2007.
[7] Jun, Sun-Ah 1993. The phonetics and phonology of Korean prosody. PhD dissertation, Ohio State University.
[8] Jun, Sun-Ah 2000. K-ToBI labelling conventions. http://www.linguistics.ucla.edu/people/jun/ktobi/Ktobi.html (UCLA).
[9] Kim, Hyo Sook; Kim, Chung Won; Kim, Sun Ju; Kim, Seoncheol; Kim, Sam Jin & Kwon, Chul Hong 2002. Prediction of break indices in Korean read speech. Malsori, Journal of the Korean Society of Phonetic Sciences and Speech Technology.
[10] Kim, Sang-hun; Park, Jun & Lee, Young-jik 2001. Corpus-based Korean text-to-speech conversion system. The Journal of the Acoustical Society of Korea, vol. 20, no. 3.
[11] Ko, Hyun-ju; Kim, Sang-hun & Lee, Sook-hyang 2001. Cues to phrase boundary perception in the Korean read speech. Proceedings of Eurospeech 2001, vol. 2, 853-857.
[12] Lee, Chan-Do 1997. A computational study of prosodic structures of Korean for speech recognition and synthesis: predicting phonological boundaries. Journal of Korean Information Processing Society.
[13] Lee, Jong-Ju & Lee, Sook-Hyang 1996. On phonetic characteristics of pause in the Korean read speech. Proceedings of ICSLP '96.
[14] Lee, Sangho & Oh, Yung-Hwan 1998. The modeling of prosodic phrasing and pause duration using CART. KSCSP '98.
[15] Nesterenko, Irina 2006. Analyse formelle et implémentation phonétique de l'intonation du parler russe spontané en vue d'une application à la synthèse vocale. Doctoral thesis, Université de Provence.
[16] R Development Core Team 2006. R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. ISBN 3-900051-07-0, URL http://www.R-project.org.
[17] Scarnà, A. & Ellis, A.W. 2002. On the assessment of grammatical gender knowledge in aphasia: the danger of relying on explicit, metalinguistic tasks. Language and Cognitive Processes, 17 (2), 185-201.
[18] Seong, Cheol-Jae 1996. An experimental phonetic study of the correlation between prosodic phrase and syntactic structure. Journal of the Linguistic Association of Korea, vol. 18.
[19] Wightman, Colin 2002. ToBI or not ToBI? Proceedings of the First International Conference on Speech Prosody, Aix-en-Provence, April 2002.
