Using Probabilistic Methods to predict Phrase Boundaries for a Text-to-Speech System

Eric Sanders
University of Nijmegen, Department of Language and Speech,
P.O. Box 9103, 6500 HD Nijmegen, The Netherlands

Supervisors:

Prof. Dr. L.W.J. Boves (Nijmegen)
S.D. Isard (Edinburgh)
Dr. P.A. Taylor (Edinburgh)

Give me a break

Copyright 1995 by E.P. Sanders. All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by any means, electronic, mechanical, photocopying, recording or otherwise, without the prior written permission of the author.


Summary

This report describes the research that I carried out at the Centre for Speech Technology Research (CSTR) of the University of Edinburgh. The goal of this research was to develop and test different methods of inserting phrase boundaries into arbitrary input text for use in a text-to-speech system. Phrase boundaries strongly affect the prosody of a sentence; they can change the intonation and they often cause a pause in speech. The naturalness of a spoken sentence depends heavily on the right placing of the phrase boundaries (or breaks). The methods investigated in this report are based on probabilities. The Machine Readable Spoken English Corpus (MARSEC), which contains transcriptions of radio broadcasts with part of speech tags of the words and phrase boundaries, was used to try to find a relationship between the part of speech tags of words, the number of words between two breaks and the placing of the breaks. Tests were done to find the best method, with the optimal parameter setting, for implementation in the text-to-speech system developed at CSTR. The tests were done on a different part of the same corpus that was used to train the models. A computer programme has been written that takes text as input, assigns tags to the words with a tagger (by Eric Brill) and inserts the breaks by the best method.


Preface

What you are reading is my thesis, an obligatory part of the study Taal, Spraak & Informatica (TSI), or Computer Linguistics & Speech Technology, of the University of Nijmegen. This report is the result of my practical (research) period, which I spent at the Centre for Speech Technology Research in Edinburgh. I went there for several reasons. My reason to go abroad in the first place was to test myself; I wondered how I would cope staying away from home for 7 months. I went to Scotland because I had a good feeling about the country and its people. The fact that the Scots speak (a sort of) English made it all the easier. Besides, they do know a lot about how to make good drinks (malt whiskies, ales and stouts).

Because I wanted to go in the Erasmus student exchange programme (thanks to the Erasmus coordinators in Edinburgh and Nijmegen!) there was only one place I could go to: the Linguistics Department of the University of Edinburgh. From there I got my research place in the connected CSTR. Although I did not have much of a choice, CSTR proved to be a good place to do my project. I had all the facilities I needed and I had two excellent supervisors who were always willing to help me with my problems and who gave me helpful suggestions during the writing of this thesis. Many thanks go therefore to Steve Isard and Paul Taylor.

Back in the Netherlands, the thesis was still unfinished. My supervisor in Nijmegen was Loe Boves, who also gave a lot of hints on how to improve this report and who eventually approved of my work. Many thanks to him as well.

Finally, I would like to thank everyone who has supported me in one way or another to finish my study. People who deserve to be mentioned are my parents, my sisters, my friends and my best mate Annemarie for their interest, support and patience.


Contents

Summary
Preface
1 Introduction
  1.1 Introduction
  1.2 Phrase boundaries
    1.2.1 Introduction
    1.2.2 Minor & major
    1.2.3 Phrase boundaries in speech systems
    1.2.4 Choice of placing breaks
  1.3 Part of speech tags
  1.4 MARSEC
  1.5 Tagger
  1.6 The CSTR text to speech system
    1.6.1 Text to phoneme
    1.6.2 Phoneme to speech
2 Preparations
  2.1 Introduction
  2.2 MARSEC
  2.3 Tagger
3 Experiment
  3.1 Introduction
  3.2 Choice of files
  3.3 Training and test set
  3.4 Choice of tag set
    3.4.1 Introduction
    3.4.2 Number of tags
    3.4.3 Categories of tags
    3.4.4 Tag set
  3.5 Training method
    3.5.1 Trigram probabilities
    3.5.2 Distance probabilities
    3.5.3 Combining the probabilities
  3.6 Method of inserting breaks
    3.6.1 Method 1
    3.6.2 Method 2
    3.6.3 Method 3
    3.6.4 Method 4
    3.6.5 Method 5
    3.6.6 Method 6
    3.6.7 Method 7
    3.6.8 Variables
  3.7 Computing a score
    3.7.1 Introduction
    3.7.2 Method of computing the score
4 Results
  4.1 Introduction
  4.2 Value of results
  4.3 The hard numbers
    4.3.1 Introduction
    4.3.2 Results of method 1
    4.3.3 Results of method 2
    4.3.4 Results of method 3
    4.3.5 Results of method 4
    4.3.6 Results of method 5
    4.3.7 Results of method 6
    4.3.8 Results of method 7
  4.4 Examples from the field
    4.4.1 Introduction
    4.4.2 Example
    4.4.3 Another example
5 Conclusions
  5.1 Introduction
  5.2 Conclusions on the results
  5.3 Future work

Chapter 1

Introduction

1.1 Introduction

The goal of my project was to improve the prosodic part of the text to speech system of the Centre for Speech Technology Research of the University of Edinburgh (CSTR). This is a real time TTS system, developed in CSTR (described in section 1.6). It works well, but needs improvement in particular areas. The part I concentrated on was the placing of phrase boundaries (or breaks). The way breaks were inserted in the system was based on punctuation and content word - function word occurrences; this, however, is a crude technique and a more sophisticated model was desired. Placing phrase boundaries correctly is important for a natural sounding text to speech system. Incorrectly inserted or deleted breaks may cause synthetic speech to sound unnatural and even produce misinterpretations of the text.

The approach used in this research project was partly based on Veilleux et al. (1990) [14]. In this article an approach is described that uses a Markov model to assign phrase boundaries, using the part of speech (POS) tags of the words. Although this article serves as a good starting point, it is clear that their study was still limited in that they only use five categories of POS tags and that their corpus consists of only 114 sentences (92 for training and 22 for testing).

The research described in this thesis carries on from the work of Veilleux et al. Several probabilistic methods were examined that predict phrase boundaries for arbitrary input texts using the POS tags of the words. The


one that was best suited for the application in the speech synthesiser was chosen for implementation. The decision on the most suitable method is based on judging speed against performance and finding a compromise.

The corpus used for training and testing is the Machine Readable Spoken English Corpus (MARSEC), see Arnsfield (1994) [2]. This is a collection of spoken sentences from BBC radio broadcasts with several word-level transcriptions. Because this corpus is prosodically (phrase breaks) and syntactically (POS tags) transcribed, it was very suitable for our purpose. More about MARSEC can be found in section 1.4.

For the implementation of the phrase break assignment module, a tagger was needed to assign tags to the words. A smaller tag set was needed than the common taggers provide, because too many tags will result in few occurrences in the training data and therefore a low performance of the model. Thoughts on the subject of the tag set are described in section 3.4. Since there are a lot of taggers around, and it is not that easy to write your own, it was decided to make a mapping from the tag set of an existing tagger to this special tag set, used in our experiments. The tagger that was chosen is a high performance tagger developed by Eric Brill [3], described in section 1.5.

1.2 Phrase boundaries

1.2.1 Introduction

A phrase boundary separates different parts (phrases) of a sentence that are not as closely related to each other as the parts within a phrase. There is a prosodic structure as well as a syntactic structure for every sentence. They are not necessarily the same, but they are closely related. The phrase boundaries are often in the same place in syntactic and prosodic analyses. Their consequence for speech is often a (short) pause and a change in intonation.

Paul went to school, | and studied hard, | but didn't pass the exams.

In this example the phrase boundaries separate the conjoined clauses. The breaks can help to indicate the hierarchical structure of a sentence. A number of examples where phrase boundaries resolve unclear cases are:

PP-attachment: Jane hit her friend # with the baseball bat. A phrase boundary at # would indicate that Jane used a baseball bat to hit her friend, whereas no boundary would indicate that Jane hit that friend of all her friends who was holding a baseball bat at that moment.


Multiple coordination: In the UK I used to smoke Benson and Hedges and Silk Cut. If boundaries are placed in the wrong place and you are not aware of the fact that Benson and Hedges is one sort of cigarette and Silk Cut another, you might think I smoked three brands of cigarettes, or that Hedges and Silk Cut is one sort.

Relative clauses: My sister #1 who's going to marry #2 called me yesterday. When boundaries are placed at the # spots, the meaning of the sentence is: My only sister, who happens to be going to marry, called me yesterday. Without boundaries it is: Of (all) my sisters, the one who is going to marry called me yesterday.

In all these examples, the position of the phrase boundary affects the syntax and semantics of the sentence. The models examined and described in this thesis cannot deal with these syntactic and semantic differences directly, but the examples show the importance of the placing of phrase boundaries. More about syntactic disambiguation with the use of prosody can be found in Price et al. (1991) [7].

1.2.2 Minor & major

There is a distinction between minor and major breaks. A minor phrase break is a boundary between two less dependent constituents, often between two words or at a punctuation mark, and usually causes a short pause. A major phrase break is a boundary between two even less dependent constituents, often at a sentence end, and usually causes a longer pause. This is an example from the corpus I worked with:

The usual | and the unusual | are both God 's doing | , and it 's miracles | which confront us with God most clearly || .

Here '|' is a minor phrase break and '||' is a major phrase break. The distinction between these two types of breaks is not so obvious. The break at the comma, in the example above, is larger than the break between 'usual' and 'and'. Still they are labelled with the same phrase boundary. There are phonological indications that a distinction of at least five types of phrase boundaries is justified, see Collier et al. (1993) [8]. In ToBI (Tone and Break Index) even seven different categories of breaks are used, see


Silverman et al. (1992) [11]. In the experiments described in this thesis, only two types of phrase boundaries are used, for the following reasons:

1. The most obvious reason is that the MARSEC corpus was transcribed with only two categories of breaks; therefore, the training could be done with only two categories.

2. Secondly, the more categories are adopted, the fewer occurrences of each there will be in a corpus of a fixed size; this will decrease the reliability of the probabilities and decrease the accuracy of the model.

3. The application of the model is speech synthesis, and so the subtle distinction between five instead of two categories of breaks might not be noticed by a listener. As the parts of the speech synthesis system are not perfect, the use of two levels of phrase breaks should not unduly affect naturalness.

1.2.3 Phrase boundaries in speech systems

The placing of phrase boundaries in text to speech synthesis systems has always been difficult. Although some systems use parsing algorithms to indicate candidate places for the breaks, a lot of them use very straightforward algorithms. In the MITalk system, as described in Allen et al. (1987) [1], the placing of boundaries is determined by punctuation and a function word following a content word. The same goes for the CSTR speech system, where minor phrase breaks are placed after a content word that is followed by a function word and major phrase breaks are placed at commas. Function words are determiners, prepositions, sub- and conjunctives and auxiliary verbs. Content words are the rest, such as nouns, normal verbs, adjectives, adverbs and pronouns. A third class of breaks is reserved for the space between two sentences. That this method of placing breaks is not sufficient is proved by the following example from the corpus:

He says | he and many of his colleagues | left their organisation | just before the Tripoli fighting || .

There are three boundaries in the middle of the sentence, and none of them is at a punctuation mark. Only the first minor break is between a content and a function word. The last two breaks would not have been placed by the old CSTR method; instead they would both be placed one word further ahead.

The issue of phrase boundaries in more theoretical prosodic and syntactic studies is also a struggle. Complex theories have been developed; in most of


the cases a prosodic parse tree is related to a syntactic one. Phrase breaks are placed according to prosodic phrases, which can be derived from syntactic phrases. Most theories like these are too complex for implementation in speech synthesis, as they require accurate syntactic parsing. An example of such a theory can be found in Ladd (1986) [5]. Another routine for inserting phrase breaks can be found in Fujio et al. (1994) [9]. They describe a method of predicting prosodic phrase boundaries using a stochastic context free grammar for Japanese.

1.2.4 Choice of placing breaks

There is, fortunately, more than one way to insert breaks in a sentence. The decision on placing breaks by humans in speech depends on factors like context, moment of breathing and speech rate. If a phrase break is (artificially) inserted, it should be between two phrases, but omitting a break is often harmless. Consider the following sentence:

The angry man #1 with the axe #2 in his hand #3 discovered the blood #4 on his shirt.

Here the # marks are possible places for a phrase break. Some of these places are stronger than others. The position of #3 is the strongest candidate for a phrase break in this sentence, because it separates the two major constituents (NP and VP). The position of #2 is the weakest candidate, but still a valid place to insert a break. There is a hierarchy in the phrasing, though. One could, for example, imagine a break at #1 and none at #2, but the reverse, a break at #2 but none at #1, is very unlikely. This indicates that indeed a kind of tree structure is needed to describe the placing of phrase boundaries. Some places are clearly inappropriate as locations of phrase breaks. For example, a break between an article and a noun is seldom correct. A boundary between 'the' and 'axe' in this example would sound very strange, to say the least.

It is clear that the sentence can be pronounced with different break decisions without being incorrect. This has a couple of consequences with respect to our experiments.

The corpus that is used for our experiments is a collection of sentences, spoken by different speakers under different conditions at different times. The decisions they made as to where to place the breaks are not consistent,


neither with themselves nor with each other. Therefore there will be a lot of 'noise' in the data. In the corpus, phrase breaks are transcribed by two different people, and although they have tried to work in a consistent way, their opinions about hearing a break differed. Evidence for these differences can be found in the parts that were transcribed by both transcribers. These overlap sections contain 4680 words, or 9 % of the corpus. The transcribers agreed on 743 of the 1178 boundaries, which is 63 % of the cases. On 319 occasions (27 %) one transcriber marked a boundary where the other did not. In 116 cases they both transcribed a boundary, but disagreed on the sort.

This also caused some noise. When comparing a sentence with breaks produced by the model against a transcription of the original sentence, a difference in the placing of a break counts as a failure of the model, even though it may sound perfectly all right. The performance of the model may therefore be higher than the results of testing on the corpus indicate.

To reduce the noise produced by the random factors mentioned above, a bigger corpus is needed. When the corpus is bigger, the more consistent patterns will result in more reliable probabilities and so reduce the number of errors.

1.3 Part of speech tags

Grammatical tagging is assigning classes to words. There are various collections of these classes, called tag sets, but they all have some things in common. The label that is attached to the word is called a part of speech tag, or tag. The tag sets differ mainly in how detailed they are. In most tag sets, broad categories such as nouns, verbs, adjectives, etc. can be found. If the tag set is detailed, the class of verbs will be divided into common verbs and auxiliary verbs, or a distinction is made between the several forms (base form, imperative, present participle etc.). An even more detailed tag set can have both distinctions for the verb category.

The tag sets that are used for these experiments are CLAWS4, which was used in the Spoken English Corpus, and the Penn Treebank POS tag set, which is used by the tagger in the implementation. CLAWS4 is considerably more detailed than the Penn Treebank tag set; there are about 175 CLAWS4 tags used in MARSEC, whereas there are only 36 tags in the Penn Treebank tag set (both excluding punctuation).


1.4 MARSEC

The experiments were done on MARSEC (the Machine Readable Spoken English Corpus), compiled at the Unit for Computer Research on the English Language (UCREL) of the University of Lancaster and the Speech Research Group at the IBM UK Scientific Centre in Winchester. The Machine Readable Spoken English Corpus is an extension of the Spoken English Corpus (SEC). It is a compilation of spoken British English (taken mainly from BBC Radio 4 broadcasts) and is available as lexicographically transcribed texts, as grammatically parsed texts, and as prosodically annotated texts. The annotations were produced manually, with the exception of the parsing, which was done semi-automatically. The prosodic transcription was done by two expert transcribers. The total corpus size is about 52,000 words.

The corpus is divided into 11 categories of different styles. These styles are Commentary, News Broadcasts, Lectures type I, Lectures type II, Religious Broadcast, Magazine Style Reporting, Fiction, Poetry, Dialogue, Propaganda and Miscellaneous. These categories are divided into different sections, each from a single recording. Every section is in one (computer) file.

From the parsed text, only the POS tags were needed. These were assigned at the University of Lancaster using their CLAWS tagging programme. From the prosodically annotated texts, only the phrase breaks were needed. A problem with the corpus is that the annotations were done independently, so they are not lined up with each other. For example, where the parsed text says '5.1', the prosodically transcribed text says 'five point one'. Since information from both files was needed, the files had to be compared and made uniform. More information about SEC and MARSEC can be found in Knowles and Taylor (1988) [4] and Arnsfield (1994) [2].

1.5 Tagger

The tagger used in the implementation is the one developed by Eric Brill, described in Brill (1994) [3]. The tags it assigns are from the Penn Treebank POS tag set. It works in the following way: every word is looked up in a lexicon and is assigned its most likely tag; if a word is not in the lexicon, a tag is assigned by rules the tagger has learned.


Subsequently, the tags are reconsidered by contextual rules. The learning of the rules is done in a special way. A set of very basic rule templates is given. These are of the form: change tag a to tag b when the preceding (following) word is tagged z, when the word two before (after) is tagged z, etc. There are six such templates, with only information about the tags. There are also five templates that make reference to words, such as: change tag a to b when the preceding (following) word is w, when the word two before (after) is w, etc. These general templates are filled in by training them on a tagged corpus (the Penn Treebank Tagged Wall Street Journal Corpus). When the performance becomes better by using a certain rule (with the values of the tags and words assigned to the variables 'a', 'b', 'w' and 'z'), that rule is added to the set of rules of the tagger. The rules for assigning tags to unknown words are trained in the same way. Some rules that are learned by the tagger are: change the tagging of a word from noun to verb if the previous word is tagged as a modal, from preposition to adverb if the word two positions to the right is 'as', etc.
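To make the rule mechanism concrete, here is a minimal runnable sketch of transformation-based retagging in the style just described. The rule list and tag names are illustrative only; they are not Brill's actual learned rules.

# Sketch of transformation-based retagging. A rule (a, b, z) means:
# change tag a to tag b when the preceding word is tagged z.
rules = [
    ("NN", "VB", "MD"),  # noun -> verb after a modal, e.g. "can run"
]

def apply_rules(tagged):
    # tagged is a list of (word, tag) pairs with initial most-likely tags
    for i in range(1, len(tagged)):
        word, tag = tagged[i]
        prev_tag = tagged[i - 1][1]
        for a, b, z in rules:
            if tag == a and prev_tag == z:
                tagged[i] = (word, b)
    return tagged

print(apply_rules([("he", "PRP"), ("can", "MD"), ("run", "NN")]))
# -> [('he', 'PRP'), ('can', 'MD'), ('run', 'VB')]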

1.6 The CSTR text to speech system

The CSTR TTS system is meant to deal with totally unconstrained input. The speech synthesiser is implemented entirely in software, which enables it to run on different computer systems. It consists of two components:

1) a text to phoneme system,
2) a phoneme to speech system.


1.6.1 Text to phoneme

This system takes free formatted text and converts it to a phonetic representation along with syllabic, syntactic and stress information. A large lexicon of 27,000 words is used to look up part of speech, lexical stress, syllable boundaries and word pronunciation. The lexicon covers 95 % of the words. When a word is not found in the lexicon, the necessary information is determined by letter to sound rules, syllabification and word stress assignment. The part of speech, which is solely used to determine the phrase breaks, is then assumed to be that of a content word. Phrase breaks are placed at punctuation and after a content word when it is followed by a function word. There is a list of function words to decide whether a word is a function word; all other words are content words. This method was meant to be replaced by something more sophisticated, which is what this thesis is about.

1.6.2 Phoneme to speech

This part of the system takes a set of input instructions relating to phoneme type, duration and intonation, and produces a speech waveform. The synthesiser works by concatenating diphones and then modifying the pitch and duration patterns using the PSOLA algorithm. The intonation assignment is done using Pierrehumbert style notation, see Pierrehumbert (1980) [6]. Accent assignment is similar to a system used by Silverman [12] and is under construction. A list of 2400 diphones, recorded by one male British English RP speaker, is used. For a more detailed description of the CSTR TTS system see Taylor et al. (1991) [13].


Chapter 2

Preparations

2.1 Introduction

This chapter describes the preparations that had to be made before training, testing and implementation could take place. There were two different types of files from MARSEC: those with the prosodic information and those with the parsing information. They had to be merged into one other type of file, which was used for training and testing. This preparation was a time-consuming process. Also, the input of the tagger used in the implementation had to be preprocessed, because the tagger expects text in a special format.

2.2 MARSEC

The files from the Spoken English Corpus that were used are text files that contain the spoken text written with either parsing information or prosodic information. All text was available in both formats. The two files had to be combined into one file before the experiment could take place. The text with the parsing information looks like this:

[J More_RGR important_JJ J] ,_, however_RR ,_, [V is_VBZ [Fn that_CST [N the_AT biblical_JJ writers_NN2 ] themselves_PPX2 N][V thought_VVD [Fn that_CST [N events_NN2 [Fr that_CST [V followed_VVD [N natural_JJ laws_NN2 N]V]Fr]N][V could_VM still_RR be_VB0 regarded_VVN [ as_CSA miraculous_JJ ]V]Fn]V]Fn]V] ._.

The codes attached to the words (by an underscore) are the tags; the rest is parsing information. Only the text and the tag information were needed.


Close examination shows that it is fairly easy to accomplish this. All one has to do is delete any string not containing an underscore ('_'). The result is the text tagged with the CLAWS tag set (the underscore has been replaced by a slash):

More/RGR important/JJ ,/, however/RR ,/, is/VBZ that/CST the/AT biblical/JJ writers/NN2 themselves/PPX2 thought/VVD that/CST events/NN2 that/CST followed/VVD natural/JJ laws/NN2 could/VM still/RR be/VB0 regarded/VVN as/CSA miraculous/JJ ./.
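In practice this filtering step is nearly a one-liner. The following sketch (the actual scripts used are not reproduced in the thesis) keeps only the tokens containing an underscore and rewrites them in word/TAG form:

# Keep only the word_TAG tokens from a parsed line; replace '_' by '/'.
def extract_tagged(line):
    tokens = [tok for tok in line.split() if "_" in tok]
    return " ".join(tok.replace("_", "/", 1) for tok in tokens)

print(extract_tagged("[J More_RGR important_JJ J] ,_, however_RR"))
# -> More/RGR important/JJ ,/, however/RR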

The text with the prosodic information looks like this:

46.23  122  more
46.44  122  im`/portant
46.92  122  how*ever
47.38  123  |
47.54  122  is
47.69  122  that
47.87  122  the
47.98  122  biblical
48.40  122  writers
48.68  122  them`/selves
49.41  123  |
49.63  122  thought

The first column gives information about the duration of the spoken utterance, but is not important for the experiment. The second column is also not important for the experiment. In the third column are the words with prosodic information about intonation (one mark indicates a low level, '`' indicates a fall, etc.). The vertical bars indicate the phrase boundaries: one for a minor break, two for a major break. For our experiments only the words and the phrase break information were needed. To accomplish this, the third column is selected first. Then all alphanumerical characters plus the apostrophe ('), the dot (.) and the vertical bar (|) are kept, as the rest is redundant information (for this purpose). This leaves us with:


more important however | is that
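The selection and filtering of the third column can be sketched as follows; again, the original scripts are not given in the thesis, so this is only an illustration of the operation described above:

import re

# Take the third column of each line and keep only alphanumerics,
# apostrophe, dot and vertical bar; the rest is intonation mark-up.
def strip_prosody(lines):
    words = []
    for line in lines:
        cols = line.split()
        if len(cols) >= 3:
            token = re.sub(r"[^A-Za-z0-9'.|]", "", cols[2])
            if token:
                words.append(token)
    return words

print(strip_prosody(["46.44 122 im`/portant", "47.38 123 |"]))
# -> ['important', '|']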

The most complex part of this preparatory phase was the merging of this information into one file. This procedure is not difficult in itself, but the differences between the two sorts of files caused problems. Once the spoken text was in written format, changes made by labellers in one file were not made in the corresponding files with other formats, causing discrepancies.

For the parsing of the corpus, one could do with the plain text, except that some compound elements had to be separated. Things like "it's" and "won't" had been split up into "it 's" and "will n't" respectively, because the tagging programme assigned different tags to "'s" and "n't". This was not done for the other (among others prosodic) annotations. The basic operation is to split everything containing an apostrophe into two words, the second word beginning with the apostrophe. That covers the largest group, since "'s", in the meaning of "is", can be attached to many words. So:

X'Y -> X 'Y

Exceptions are words with "n't" and normal words containing apostrophes, like "o'clock". These were (for programming purposes) put in one list of exceptions, since the number of such cases was limited.

For the prosodic annotation, some words had been rewritten to make it possible to insert the prosodic mark characters. These changes were in general an expansion of the original term: "2,500" was rewritten as "two thousand five hundred pounds" and "Mr." was rewritten as "Mister". The transformed terms are suitable for transcription with prosodic mark characters, which were not needed for our experiments. These transformations were not done in the parsed version of the text. We had to get rid of these differences by changing one or the other. These


changes could not be made automatically, so they were all made by hand. Most changes were made in the prosodically annotated files, because then the changes were reductions ("three point one" -> "3.1") and the lost prosodic information was not needed anyway. If these changes had been made to the tagged versions, then the new words would have had to be tagged. Since there was never a break within these reduced parts, and the amount of work would have been enormous, it was decided to make the changes in the prosodically annotated files. Some files contained so many of these differences in writing that it was decided to exclude them from the experiment.

Another difference between the two sorts of files is that the parsed version contains punctuation, whereas the prosodically annotated version does not. Fortunately, punctuation symbols were tagged with punctuation tags by the CLAWS tagging programme, so they could easily be recognised (as punctuation) when inserting the breaks.

The prosodically annotated files were originally chunked into a number of smaller files (to prevent the files from becoming too large) with small overlap sections at the end of one file and the beginning of the following file. For the merging of two files into one, the file with the prosodic information was read into an array. The file with the tag information was the basic file; a sentence was read, split into word/tag pairs, and these were split into words and tags. Then the following procedure was used:

WHILE not end of sentence
    IF not a punctuation THEN
        word is compared to word in array
        IF the same word THEN
            next word in array is checked
            IF that is a break THEN insert break
            ELSE (IF not a break) do nothing (go to next word/tag pair)
        ELSE (IF not the same word)
            check next word in array, until matching word is found
ENDWHILE
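In runnable form the merging procedure might look like the sketch below. It assumes the prosodic file has already been reduced to a flat list of words and break marks ('|' minor, '||' major), as described above; error handling for words that cannot be matched at all is omitted.

# Merge a tagged sentence with the break marks from the prosodic word list.
def merge(tagged, prosodic, punct_tags={",", ".", ":", ";"}):
    out, j = [], 0
    for word, tag in tagged:
        out.append((word, tag))
        if tag in punct_tags:   # punctuation is absent from the prosodic file
            continue
        while j < len(prosodic) and prosodic[j] != word:
            j += 1              # skip until the matching word is found
        j += 1                  # move past the matched word
        if j < len(prosodic) and prosodic[j] in ("|", "||"):
            out.append((prosodic[j], "BREAK"))
    return out

tagged = [("more", "RGR"), ("important", "JJ"), (",", ","), ("however", "RR")]
print(merge(tagged, ["more", "important", "however", "|"]))
# -> [..., ('however', 'RR'), ('|', 'BREAK')]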


2.3 Tagger

The Brill tagger expects its input in the same format as the corpus it was trained on (the Penn Treebank corpus): one sentence per line, with punctuation and elements with an apostrophe tokenised. For example:

He 's the winner ( at least , that 's what I was told ) .
The boy said : " I am here . "

To allow unpreprocessed input, this means that all punctuation and attached elements containing an apostrophe, like "n't" and "'s", have to be separated. Also the end of a sentence has to be found, in order to separate the sentence-final mark from its preceding word.

The first problem needs extra attention for the "." and the "'". The "." can appear as a full stop, but also in abbreviations, numbers etc. The "'" acts not only as a quote mark, but also as in words like "I'm" and "hasn't", and in addition in words like "o'clock". Therefore they are both treated separately from the other punctuation.

The "." should only be separated from the preceding word when it is used as a full stop. A "." is considered a full stop if the preceding character is a non-capital and the following character is a space followed by a capital or another space, or if the following character is a double quote. In those cases the "." is separated from the preceding word; in all other cases it is not. In short:

"c. C"  ->  "c . C"
"c.  "  ->  "c .  "
"c.""   ->  "c . ""

Errors will occur with cases like "Dr. Who": "Who" will wrongly be seen as the beginning of a new sentence (because of the first rule).

If an apostrophe appears in "n't", then that part has to be separated from the word. In the case of "won't", "shan't" and "can't" the first part has to be changed to "will", "shall" and "can" respectively. In cases other than "n't", the "'" is separated from the words it is attached to, unless both the preceding and following characters are alphabetic characters (letters) (these are cases like "it's" and "we're"). In the latter


case it is separated from the preceding word only. In short:

wordn't      ->  word n't
word'word    ->  word 'word
word' word   ->  word ' word
word 'word   ->  word ' word
word ' word  ->  word ' word

This rule will cause errors with words that contain an apostrophe, like "o'clock". Fortunately, there is only a small number of such words.

All other punctuation symbols are separated from both the preceding and following words they are attached to. This may not be correct in every case imaginable, but it is the closest to a straightforward solution. Each of the symbols "(){}[]*#/:;!?,"" is separated from the preceding or following word:

word:word    ->  word : word
word: word   ->  word : word
word :word   ->  word : word
word : word  ->  word : word
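Taken together, these rules amount to a small tokeniser. The sketch below implements the main cases; the exception lists and the rarer patterns ("o'clock", abbreviations) are simplified, so it is an illustration rather than the programme actually used:

import re

# Simplified tokeniser for the Brill tagger's input format.
SPECIAL = {"won't": "will n't", "shan't": "shall n't", "can't": "can n't"}

def tokenise(text):
    for k, v in SPECIAL.items():
        text = text.replace(k, v)
    text = re.sub(r"(\w)n't\b", r"\1 n't", text)             # hasn't -> has n't
    text = re.sub(r'([a-z])\.( [A-Z ]|")', r"\1 .\2", text)  # full stop
    text = re.sub(r"([A-Za-z])'(?!t\b)", r"\1 '", text)      # it's -> it 's
    text = re.sub(r'([(){}\[\]*#/:;!?,"])', r" \1 ", text)   # other punctuation
    return " ".join(text.split())

print(tokenise('He said: "it\'s over."'))
# -> He said : " it 's over . "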

Chapter 3

Experiment

3.1 Introduction

The experiment consisted of examining various methods of inserting phrase breaks in the files of MARSEC, with only the tagging information as input. The newly created files were compared with the versions in which the boundaries had been manually transcribed. The best method was chosen for implementation in the text-to-speech system.

However, not every file (collection of data sentences from one category) in the corpus was useful for the training; some files of the corpus, which is divided into different categories, differ so much in their sentence structure that they were left out of the data. The files that were selected as suitable for the experiment had to be divided over the test and training set.

Very crucial for good results is the tag set that is used. The original tag set of the corpus, as well as that of the Brill tagger, was too large to do the experiments with. Most methods use combinations of tags to predict a break. If the tag set is too large, then there are too many possible combinations. This results in few occurrences per combination in the corpus, so that these tag combinations will not provide good predictors for a break. Therefore a mapping had to be made from the original tag sets to a new one, which is used in the experiment.

The experiment itself consisted of a training phase and a test phase. In the training phase, occurrences of breaks in certain tag sequences were counted. The results of this training phase were the probabilities of breaks in a certain sequence of tags. Also probabilities concerning the number of words between two breaks were determined. These probabilities were used


in the test phase to predict the breaks in files from the corpus (without the transcribed breaks, of course). Various methods of inserting the breaks (using the probabilities) were examined. Some methods worked with variable parameters, which were thoroughly tested and optimised. Finally, some words must be said about how to judge the different methods and how to present the results.

3.2 Choice of files

The Spoken English Corpus is divided into 11 categories. These are Commentary, News Broadcasts, Lectures type I, Lectures type II, Religious Broadcast, Magazine Style Reporting, Fiction, Poetry, Dialogue, Propaganda and Miscellaneous.

Some categories are evidently unsuitable for the experiments. Poetry, for example, differs in many ways from the type of speech that a text to speech system aims to produce. The placing of breaks is different, because of the special rhythm in verses. The pauses are also longer than in normal speech, which causes a considerable number of major breaks. Therefore it is unsuitable for use as training data. Fiction is a more difficult case to judge in this respect, and therefore it needs closer examination. A story with many quotes has a different structure from 'normal' sentences. It may contain contradictions in comparison with the other training material (a break in a certain tag sequence where it normally does not appear, or vice versa) and disturb the training results. The best situation is where a certain tag sequence always or never has a break in it, because then it is a good predictor. If there are many tag sequences in a file that violate the trend, then that file might be unsuitable for the training data.

It is best to have as much training data as possible, but it is important that it is in line with the purpose and that it is as homogeneous as possible. An optimum has to be found with as much data and as few contradictions as possible. Some small tests were done, from which it appeared that the categories that are not useful in general are Magazine Style Reporting, Poetry and Dialogue. The data in these categories are not consistent with the rest of the corpus. This is not so surprising for Poetry and Dialogue, but Magazine Style Reporting looks quite normal.


3.3 Training and test set

Once the files that will be included in the experiment have been chosen, they have to be divided over a training set and a test set. An important law of experimenting with corpora says that a model should not be tested on data it was trained on, so the sets have to contain different files. It is furthermore important to keep the training set large. A common distribution of the data is approximately 9/10 for training and 1/10 for testing. The distribution should be random.

The distribution in this experiment was a little different. It was desirable to have several different categories in the test set. From five categories a file was picked at random for the test set. The categories were Commentary, News Broadcasts, Lecture Type II, Propaganda and Miscellaneous. These consisted of 6420 words and punctuation marks in total. The training set consisted of 18131 words in 16 files. The distribution is therefore about 3/4 - 1/4. The training will be a little worse, because there is less data than with the common 9/10. The reason this was done was to avoid chances of 'good' or 'bad' luck in the test set. If there are only 1 or 2 files in the test set, it may well be that these one or two files accidentally score better or worse than the average. The more files one tests on, the more the result tends to the average score.

3.4 Choice of tag set

3.4.1 Introduction

The tag set used in MARSEC is CLAWS4; it has about 200 different tags. Since this number is too large for the experiment, ways of reducing it had to be found. Moreover, a lot of tags can be regarded as equivalent for the present purpose. For example, the distinction between a neutral article (the) and a singular article (a) is too fine and can be narrowed down to one category. To decide what the tag set will look like, one has to consider the number of tags and which tags to put in the same category. The goal is, roughly speaking, to get as many similar tags in as few categories as possible.

3.4.2 Number of tags

The number of categories has a big influence on the results. Too many categories will result in too few occurrences of (sequences of) tags in the


training data, hence unreliable estimates of probabilities. When selecting too few categories, one will miss the finer distinctions that may cause large differences. If the number of tags in the tag set is represented by M, a tag sequence of length N will result in M^N different combinations. Some of these combinations will not appear in the corpus, irrespective of its size; if the category 'determiner' existed, then a sequence of three determiners would not be likely to appear in the corpus. This is inevitable, but it must be made sure that there are sufficient occurrences in the corpus of combinations that are not that strange. Note that it is not possible to provide a precise number as a criterion. The optimal number of tags also depends on the choice of categories and has to be determined by testing different tag sets.

3.4.3 Categories of tags

The problem of which categories to use can be reformulated as the problem of which tags in the original tag set to put together, so that a smaller set of categories is created that satisfies the demands just stated. This problem is doubled, since there are two different tag sets (one for the training and one for the implementation) that have to be mapped onto this third one. There are two different tag sets because the corpus was tagged with the CLAWS tag set, while the tagger in the implementation assigns tags from the Penn Treebank POS tag set. It is easiest to map the tag set of the corpus onto the new tag set first and do the experiments. After the optimal tag set has been determined, the tag set of the Brill tagger has to be mapped onto this ideal tag set.

With respect to the first problem mentioned (grouping tags from the two original tag sets into a reasonable number of basic tags), it is not only necessary to consider the syntactic word class of the tags, but also their place in the context. It is obvious that words that are syntactically related are likely to be grouped together. Singular common nouns and plural common nouns, for instance, should be in one category together instead of apart, although there are some differences between them. A plural common noun can never be preceded by a singular article, whereas a singular common noun can. And the verb of which it is the subject has to agree too.

After the most obviously related tags are put together, the less obvious ones need to be considered. It is important to look at where the words with such a specific tag are usually placed in the sentence. This should be considered even more important than the word class when classifying the tags. The


possessive pronoun, for instance, fits best in the class of determiners, since it appears in roughly the same context, independent of whether a category of pronouns has been created. This can be illustrated by comparing the sentence "This is my house" with the sentence "This is the house", where there is no difference in phrasing at all. Therefore it is better to group the possessive pronoun with the determiners than with the pronouns.

To summarise, one might say that an ideal category is a category in which every element can be replaced by every other element from that category, without a change in break placing, in every sentence that element appears in.

One should also bear in mind the total number of appearances in the corpus of all tags from one category. A category with few appearances in the corpus would result in few appearances of all sequences with this tag in them, which would make their prediction unreliable. A category with many appearances in the corpus will result in a higher reliability of the prediction of sequences with the tag in them, but it may be good to split it up to get finer distinctions, as long as the new categories are big enough. It would, for example, be desirable to have a separate class for the coordinating conjunctions 'and' and 'or', but there are too few occurrences in the data set to make this feasible. Again it is hard to tell what a reasonable number of appearances for the tags of one category is, but it seems best to take care that the tags of one category have a minimum number of appearances in the corpus.

Unfortunately, a word with a certain tag can appear in several contexts, with different grammatical functions. A common noun can be a subject or a direct object or an indirect object etc. This kind of information about grammatical relations could undoubtedly be useful to determine the placing of the phrase breaks. Since this information is not present in the POS tags, it is necessary to look at the context of the words, as we try to do here by looking at sequences of tags. It would, however, be better to have more or other information about the words, like functional categories and semantic information, to get rid of this ambiguity (like subject-object). In the present situation, however, no such additional labels were available, except for the little parsing information in MARSEC, but that is beyond the scope of this study.


3.4.4 Tag set

The tag set that was eventually used in the experiments contains 11 different tags. Experiments were done with other tag sets, but this one gave the best results. Although the tags were sorted out according to the restrictions described above, the classification can be described as follows (there are a few exceptions):

1. major punctuation (f)
2. minor punctuation (p)
3. adjective (j)
4. noun (n)
5. determiner (d)
6. sub- or conjunctive (s)
7. preposition (i)
8. pronoun (o)
9. adverb (a)
10. auxiliary verb (x)
11. normal verb (v)
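The mapping from CLAWS onto this reduced set is not listed in the thesis. As an illustration, a fragment of such a mapping is sketched below, using CLAWS tags that occur in the corpus samples shown earlier; the exact grouping used in the experiments may have differed:

# Illustrative fragment of a CLAWS -> reduced tag set mapping.
CLAWS_TO_REDUCED = {
    "AT":   "d",  # article
    "JJ":   "j",  # adjective
    "NN2":  "n",  # plural common noun
    "PPX2": "o",  # reflexive pronoun
    "VVD":  "v",  # lexical verb, past tense
    "VVN":  "v",  # lexical verb, past participle
    "VB0":  "x",  # 'be', base form
    "VM":   "x",  # modal auxiliary
    "RR":   "a",  # adverb
    "CST":  "s",  # conjunction 'that'
    "CSA":  "s",  # conjunction 'as'
}

def reduce_tag(claws_tag):
    # Defaulting unknown tags to 'n' (content word) mirrors the TTS system's
    # assumption for unknown words; this default is an assumption here.
    return CLAWS_TO_REDUCED.get(claws_tag, "n")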

3.5 Training method

To get optimal results, the training corpus has to be used in a sensible way. Two sorts of probabilities were used in the experiments: the trigram probability and the distance probability. The trigram probability deals with the probability that a break appears in a certain tag sequence, whereas the distance probability deals with the number of words between two breaks. They were used in different methods, alone or in combination with each other.

3.5.1 Trigram probabilities

The corpus has to be searched for phrase boundaries in certain tag sequences, which will provide the probabilities, but there are a number of (reasonable) choices for the number of tags in a sequence and the place where the break is to be found. And again it is the case that the more tags there are in a sequence, the more detailed the information, but the fewer appearances in the corpus. The sequences that were considered useful for the experiments are:

1. two tags preceding a possible break
2. three tags preceding a possible break
3. two tags with a possible break in between
4. three tags with a possible break after the second.

Or, to put this more symbolically:

tag tag |
tag tag tag |
tag | tag
tag tag | tag

There appeared to be enough data to use sequences of three tags instead of two. A four tag sequence would be too long, for there are too many of them, which leads to too few occurrences in the corpus. With our 11 tags, there are 11^4 = 14,641 possible combinations. The total corpus size is about 52,000 words; therefore every combination would have an average of only 4 occurrences in the corpus. It was thus decided to use trigrams. The next thing was to decide whether to look for sequences of the form tag tag tag | or tag tag | tag. Some small tests were done, from which it appeared to be more fruitful to examine one word after a possible break than one word more preceding a possible break. Therefore the most obvious sequence to use is tag tag | tag.

To get the probabilities from the training corpus, all trigrams of tags were counted, together with the number of times there was a break between the second and third tag. These numbers are an indication of the probability of that trigram having a break. A distinction is made between the minor and major phrase breaks. The trigram probability, however, is the sum of the probabilities of a minor break and a major break appearing in a trigram. The distinction becomes useful when it has already been decided to place a break (with the total trigram probability), but a choice still has to be made whether to place a minor or a major break.

Since there are 11 different categories, there are 11^3 = 1331 trigram probabilities. Their value can go from 0 (never a break in that particular trigram) to 1 (always a break in that trigram). If a certain trigram appears often in the training data, then its percentage of containing a break is a reliable predictor. Yet, in case there are only a few occurrences, the percentage is a very unreliable predictor. Chances are small that it will appear often in the use of the text to speech system if it is rare in the corpus. This goes at least for the type of input that is covered by the corpus. If such a trigram appears in the implementation anyway, the


decision of the placing of a break is more likely to be wrong, because of the unreliable predictor.
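A sketch of the counting step: given sentences represented as lists of reduced tags interleaved with break marks ('|' minor, '||' major), it estimates, for every trigram, the probability of a break between the second and third tag. The representation is an assumption made for illustration.

from collections import defaultdict

def train_trigrams(sentences):
    counts = defaultdict(lambda: [0, 0, 0])   # [occurrences, minor, major]
    for sent in sentences:
        tags = [t for t in sent if t not in ("|", "||")]
        breaks = []                           # break (or None) after each tag
        for t in sent:
            if t in ("|", "||"):
                if breaks:
                    breaks[-1] = t
            else:
                breaks.append(None)
        for i in range(len(tags) - 2):
            c = counts[(tags[i], tags[i + 1], tags[i + 2])]
            c[0] += 1
            if breaks[i + 1] == "|":
                c[1] += 1
            elif breaks[i + 1] == "||":
                c[2] += 1
    # probability triples per trigram: (total, minor, major)
    return {tri: ((mi + ma) / n, mi / n, ma / n)
            for tri, (n, mi, ma) in counts.items()}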

3.5.2 Distance probabilities

The other type of probability that is used is the distance probability. The idea behind it is that use of the trigram probabilities alone does not take into account the distribution of the breaks. It is unlikely that three phrase breaks will occur after consecutive words. It is also unlikely that a sequence of 25 words will not have any break in it. These distance probabilities are computed by counting the words between two breaks.

It might also be a good idea to use syllables to compute the distance probabilities. Syllables are a logical prosodic unit with which to measure phrase lengths. This would, however, have made the experiments a bit more complicated, because the words would have to be split up into syllables every time. Therefore it was decided to compute the distance probabilities with words.

To use this distance information, probabilities are needed for the distance between two phrase boundaries. To get these probabilities, all the distances between two breaks in the training corpus were counted and divided by the total number of breaks. Here, distance simply means the number of words, including punctuation. This gives 18 numbers that represent the probabilities of the distances between two breaks. The probabilities of all these numbers sum to 1. The distances that appear most often are 3 and 4. The maximum number of words between two breaks is 17. Figure 3.1 shows the distribution of the phrase lengths from which the distance probabilities are directly derived. The x-axis gives the length of the phrase in words and the y-axis gives the number of phrases of that length in the training set.

Another distance probability that is used, which is derived from the one just described, is the cumulative distance probability. This means that the probability of a break after N words is the sum of the probabilities of a break after any number of words up to N, or:

P_cum(N) = Σ_{i=1}^{N} P(i)

Thus, the greater the distance to the previous break, the higher this probability is. The distribution of this probability is shown in figure 3.2. In this study, no distinction was made between minor and major phrase breaks when computing the distance probabilities, because more information would have to be stored and used, and it might not result in a much better performance.

[Figure 3.1: Distribution of phrase lengths.]

[Figure 3.2: Distribution of the cumulative phrase distance probability.]
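The distance and cumulative distance probabilities can be estimated along the following lines (same sentence representation as in the trigram sketch above, which is an assumption made for illustration):

from collections import Counter

def train_distances(sentences):
    lengths = Counter()
    for sent in sentences:
        since_break = 0
        for t in sent:
            if t in ("|", "||"):
                lengths[since_break] += 1     # phrase length in words
                since_break = 0
            else:
                since_break += 1
    total = sum(lengths.values())
    p = {d: n / total for d, n in lengths.items()}
    p_cum, run = {}, 0.0                      # P_cum(N) = sum of P(i), i <= N
    for d in range(1, max(lengths) + 1):
        run += p.get(d, 0.0)
        p_cum[d] = run
    return p, p_cum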

3.5.3 Combining the probabilities

When combining the two types of probabilities, one has to bear in mind that they might not be totally independent. If sequences of trigrams (thus sequences of sequences) with a high percentage of breaks do not often appear together, then that will automatically prevent many breaks from being inserted close together. Consider, for instance, the sequence w1 w2 w3 w4 w5. If the trigram probabilities for w1 w2 w3 and w2 w3 w4 and w3 w4 w5 are high (above 0.5), then it might be that the sequence w1 w2 w3 w4 w5 seldom or never appears in the corpus. In that case, the trigram probabilities and distance probabilities are dependent. The expectation is that this effect will not be so strong that the distance information provides no optimisation of the results.

Different methods were examined in which both sorts of probabilities are used, but the operation used to combine the probabilities is always multiplication. Since the probabilities are used as predictors for placing a break, not too much attention is given to the actual meaning of this multiplication. The factual meaning of P_tri(X) * P_dist(Y) is the probability of a break in the trigram X times the probability of a break after Y words since the previous break.

3.6 Method of inserting breaks

There are many imaginable methods of inserting phrase breaks. Some methods involve only the trigram probabilities; other methods also include the distance probabilities. A number of them were examined and various parameters were tested.

3.6.1 Method 1

The most elementary way to insert the breaks is just to place them where a punctuation mark occurs. This is very simple, and one does not even need the tags of the words. Two sorts of punctuation marks are recognised: the first group ("(", ")", ",", "-") is assigned a minor phrase break, the second group ("!", ".", "...", ":", ";", "?") is assigned a major phrase break. It


is expected that the breaks it assigns will be pretty good, but that it will assign too few breaks to be worth implementing.

3.6.2 Method 2

This method is nearly the same as the one in the current implementation: minor phrase breaks at punctuation and between a content word and a function word, major phrase breaks at sentence endings. The expectation is that always inserting a break between a content word and a function word will cause too many breaks in these positions, and that quite a few breaks that are not after a content word and before a function word will be omitted.
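A minimal sketch of this method, using the reduced tag set from section 3.4.4; the grouping of 'd', 'i', 's' and 'x' as the function word tags follows the description of function words in section 1.2.3:

# Method 2: minor breaks at minor punctuation and after a content word
# followed by a function word; major breaks at sentence-final punctuation.
FUNCTION_TAGS = {"d", "i", "s", "x"}   # determiner, preposition, conj., aux.

def method2(tagged):
    out = []
    for i, (word, tag) in enumerate(tagged):
        out.append(word)
        nxt = tagged[i + 1][1] if i + 1 < len(tagged) else None
        if tag == "f":                 # sentence-final punctuation
            out.append("||")
        elif tag == "p":               # minor punctuation
            out.append("|")
        elif nxt in FUNCTION_TAGS and tag not in FUNCTION_TAGS | {"f", "p"}:
            out.append("|")            # content word before a function word
    return out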

3.6.3 Method 3

This method makes use of the trigram probabilities only. It considers every trigram in a sentence and places a break when the probability exceeds a certain threshold T. The trigram probability which decides whether a break is placed is the probability of a minor break plus the probability of a major break being in that trigram. The decision as to which break will be placed (once it has been decided that a break will be placed) is made by comparing the minor break probability and the major break probability: if the minor break probability is higher than the major break probability, a minor break is placed, and vice versa.

T can vary from 0 (always a break) to 1 (only a break when all occurrences in the training data had a break). The most obvious looking choice for T is 0.5; in that case a break will only be inserted when over half of the occurrences of the trigram in the corpus were assigned a break. However, when the probability of a certain trigram is 0.5 and T is set to 0.5, an error will occur in 50 percent of the cases in which that trigram appears. The closer to 0 or 1 a probability is, the fewer errors (should) occur, provided that the trigram has a sufficient number of occurrences in the training corpus.

A problem with this method arises from the use of trigrams with a break between the second and third tag, for it is not possible in this way to place a break between the first and second word of a sentence. The solution adopted here is a separate probability matrix for all sentence-starting bigrams. All sequences of two tags at the beginning of a sentence were counted, together with the number of times a phrase boundary was in between them, just like the trigrams. These numbers are again used as the probabilities for predicting a break. From


this study it appeared that only in a few cases is there a break in the second position of a sentence. Although the bigram probabilities are implemented in this and some of the following methods, it was concluded that omitting them would not decrease the performance of the model significantly. This third method will probably be considerably better than the previous two. It is more sophisticated and makes full use of the word class information in an efficient way.
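
The core of the third method could look as follows (a sketch; p_minor and p_major are assumed tables giving the minor and major break probability per tag trigram, and p_minor_bi / p_major_bi the corresponding sentence-initial bigram tables):

    # Method 3: threshold on the total trigram break probability.
    def method3(tags, p_minor, p_major, p_minor_bi, p_major_bi, T=0.5):
        breaks = {}
        # Sentence start: the bigram table decides the break between the
        # first and second word.
        bigram = tuple(tags[:2])
        pmin = p_minor_bi.get(bigram, 0.0)
        pmaj = p_major_bi.get(bigram, 0.0)
        if pmin + pmaj > T:
            breaks[0] = "minor" if pmin > pmaj else "major"
        # Every trigram decides the break between its 2nd and 3rd word.
        for i in range(len(tags) - 2):
            tri = tuple(tags[i:i + 3])
            pmin = p_minor.get(tri, 0.0)
            pmaj = p_major.get(tri, 0.0)
            if pmin + pmaj > T:
                breaks[i + 1] = "minor" if pmin > pmaj else "major"
        return breaks
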

3.6.4 Method 4 When only the information about the trigrams is used, there is a possibility that a considerable number of trigrams that occur close together will be assigned a break. In other words, the third method does not take the density of breaks into account. Therefore, it was decided to take information about the distance to the other boundaries into consideration. There are many ways in which these distance probabilities could be used together with the trigram probabilities. This fourth method uses the cumulative distance probabilities. The algorithm works from left to right; it considers every trigram in the sentence, multiplying the trigram probability and the cumulative distance probability. If the product exceeds a certain threshold T, a break is inserted; otherwise it is not. Again the placing of a minor or a major break is decided by comparing the minor break probability and the major break probability of that particular trigram. As soon as a break is inserted, the distance parameter is reset to 1. In pseudo programming code, it looks like this:

DO UNTIL end of sentence
    P_i(break) = P(trigram_i) * P_cum(distance_i)
    IF P_i(break) > THRESHOLD THEN insert break
ENDDO

This method is expected to be at least as good as the third method, if the threshold parameter is set correctly. There is, however, a chance that the


distance probabilities are bad predictors and disturb the total probabilities in such a way that it would be better to leave them out.
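
In Python the fourth method could be sketched like this (again with assumed probability tables; distances beyond the table are treated as probability 1, since the cumulative probability approaches 1 anyway):

    # Method 4: trigram probability times cumulative distance probability.
    def method4(tags, p_minor, p_major, p_cum_dist, T=0.15):
        breaks = {}
        distance = 1                  # words since the previous break
        for i in range(len(tags) - 2):
            tri = tuple(tags[i:i + 3])
            pmin = p_minor.get(tri, 0.0)
            pmaj = p_major.get(tri, 0.0)
            if (pmin + pmaj) * p_cum_dist.get(distance, 1.0) > T:
                breaks[i + 1] = "minor" if pmin > pmaj else "major"
                distance = 1          # reset the distance after a break
            else:
                distance += 1
        return breaks
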

3.6.5 Method 5

This method is a little more sophisticated than the previous one. The distance probabilities are not cumulative here. It works with a lookahead parameter L. In a sentence, all L trigrams from a certain point are considered. The probability of the trigram (containing a break) and the probability of a break at the given distance since the previous break are multiplied. Only the maximum of these L probabilities is considered for a break. A break will be placed when the corresponding trigram probability is bigger than the threshold T. The decision for a minor or major break depends on which of the two has the higher trigram probability. In pseudo computer code it looks like this:

DO UNTIL end of sentence
    FOR i = 1 TO LOOKAHEAD or UNTIL end of sentence
        P_i(break) = P(trigram_i) * P(distance_i)
    ENDFOR
    IF P(trigram with maximum P_i(break)) > THRESHOLD THEN insert break
    mark new starting point at the position of the maximum
ENDDO

The main difference between this method and the fourth is that this method takes only the maximum probability of a certain number of trigrams (the lookahead), whereas the other method takes every probability higher than the threshold. The advantage of this method is that it looks forward to see if there are higher probabilities. That means that a trigram may not be assigned a break because there is a trigram with a higher probability further on in the sentence. This is good, because there will not be breaks close together. But it can also go wrong, because a trigram that should definitely be assigned a break can be missed if there is a trigram with an even higher probability a little further on in the sentence. It is therefore hard to tell whether this method will be better than the previous method. It is a fact, though, that there is one more parameter to test with,


which is the lookahead parameter.
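
A Python sketch of the fifth method (the helper names and the exact bookkeeping of distances are assumptions; the point is only that a window of L trigrams is scanned and just its maximum is considered):

    # Method 5: within a lookahead window of L trigrams, only the position
    # with the maximum trigram-times-distance probability is considered.
    def method5(tags, p_minor, p_major, p_dist, T=0.4, L=10):
        breaks = {}
        n = len(tags) - 2             # number of trigram positions
        start, last_break = 0, 0      # distances are counted in words
        while start < n:
            best_i, best_p = start, -1.0
            for i in range(start, min(start + L, n)):
                tri = tuple(tags[i:i + 3])
                p_tri = p_minor.get(tri, 0.0) + p_major.get(tri, 0.0)
                p = p_tri * p_dist.get(i + 1 - last_break, 0.0)
                if p > best_p:
                    best_i, best_p = i, p
            tri = tuple(tags[best_i:best_i + 3])
            pmin, pmaj = p_minor.get(tri, 0.0), p_major.get(tri, 0.0)
            if pmin + pmaj > T:       # the trigram probability must exceed T
                breaks[best_i + 1] = "minor" if pmin > pmaj else "major"
                last_break = best_i + 1
            start = best_i + 1        # continue after the chosen position
        return breaks
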

3.6.6 Method 6

This method is an exhaustive search for the best combination of breaks in a sentence, using both the trigram and the distance probabilities. It examines every possible combination of boundaries between words in a sentence and computes the probability of that combination. After all combinations are examined, it picks the maximum and inserts the breaks according to that combination. If a sentence has N words, then 2^(N-2) combinations are possible; because trigrams are used, the place after the second word is the first to be considered for a break. A sentence with 5 words has the following possibilities:

word word word word word
word word word word | word
word word word | word word
word word word | word | word
word word | word word word
word word | word word | word
word word | word | word word
word word | word | word | word

The probability of a certain combination is computed as follows. All trigrams in a sentence are considered from left to right. The total probability contributed by one trigram is the distance probability of the distance to the previous break multiplied by the trigram probability of that trigram, in case of a break, or 1 minus the trigram probability, in case of no break. All these probabilities in the sentence are multiplied, which gives the total probability of one combination. So, for the sentence w1 w2 w3 | w4 w5, the total probability is:

(1 - P_tri(w1 w2 w3)) × P_tri(w2 w3 w4) × P_dis(w2 w3 w4) × (1 - P_tri(w3 w4 w5))

The decision for a minor or major break depends again on the probabilities of a trigram having a minor break and of it having a major break.


A problem is that when one single probability is zero, the total probability will be zero, since all probabilities are multiplied. A zero probability therefore has to be replaced by a very small probability. This must be small enough to decrease the total probability significantly, but not so small that the total probability stands no chance whatsoever of being the maximum. It is, for instance, possible that a trigram appears in a test sentence that occurs only once in the training data, without a boundary. Its probability is then zero, but that should not prevent placing a boundary in that particular trigram in every case. Because of the many (2^(N-2)) possible combinations, this method has practical disadvantages. A sentence with a length of 22 words (which is not extraordinary, certainly not in the corpus) will cause over a million combinations to be examined. This explains why this method is not ideal for a real-time text-to-speech system. If the information about distance proves to be of any use, this method should be the best, since it considers every possible combination and not only a few of them like the other methods. The big disadvantage of this method is the time it takes to compute all combinations if the sentence is not short. Even a sentence of 15 words will take forever. A solution to this problem is presented in method 7.
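
A sketch of the exhaustive search (using a floor value in place of zero probabilities, as discussed above; p_tri here is an assumed table of the total break probability per trigram):

    # Method 6: exhaustive search over all 2^(N-2) break combinations.
    import itertools

    FLOOR = 1e-6                      # stand-in for zero probabilities

    def method6(tags, p_tri, p_dist):
        n = max(0, len(tags) - 2)     # positions that can receive a break
        best_combo, best_p = None, -1.0
        for combo in itertools.product((False, True), repeat=n):
            p, last_break = 1.0, 0
            for i, has_break in enumerate(combo):
                tri = tuple(tags[i:i + 3])
                pt = max(p_tri.get(tri, 0.0), FLOOR)
                if has_break:         # break: trigram prob. times distance prob.
                    p *= pt * max(p_dist.get(i + 1 - last_break, 0.0), FLOOR)
                    last_break = i + 1
                else:                 # no break: 1 minus the trigram prob.
                    p *= max(1.0 - pt, FLOOR)
            if p > best_p:
                best_combo, best_p = combo, p
        return best_combo             # tuple of break/no-break decisions
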

3.6.7 Method 7 Some sentences in the corpus are very long; the longest sentence in the corpus is 124 words including punctuation marks. This gives 2^122 = 5.3 × 10^36 combinations, which would be too much even for a super-computer. To handle this problem a new algorithm was developed. Most of the time there is a break at a punctuation mark in the transcribed corpus, and in the tests every method appeared to always place a break at a punctuation mark, because of the very high trigram probabilities for punctuation marks. This method will therefore always place a break at a punctuation mark, take the clauses between two punctuation marks apart and use the sixth method to assign the breaks within them. This reduction is not enough; there are still parts without punctuation that are too long. There is one long sentence of 53 words in the corpus without any punctuation whatsoever. To solve this problem a maximum M is built in. If the number of words between two punctuation marks is higher than this maximum, another method of break placing is used, namely the third method, with only the trigram probabilities.


Note that every additional trigram doubles the number of possible combinations of break placing (because of the 2^(N-2)). This means that computing the maximum probability for a clause with 11 words takes twice the time of that for a clause with 10 words. Placing a break at every punctuation mark seems to be very acceptable, and all the methods appear to do it anyway. This method should therefore not be much worse than the previous one.
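
Combining the pieces, the seventh method could be sketched as follows (reusing the method3 and method6 sketches above; the punctuation tags and the default value of M are assumptions):

    # Method 7: break at every punctuation mark; run the exhaustive search
    # on each clause in between, falling back to the trigram-only method
    # for clauses longer than M words.
    PUNCT_TAGS = {"p", "f"}

    def method7(tags, p_tri, p_dist, p_minor, p_major, M=10):
        clause_results = []
        clause = []
        for tag in tags + ["p"]:      # sentinel so the last clause is flushed
            if tag in PUNCT_TAGS:
                if len(clause) <= M:
                    clause_results.append(method6(clause, p_tri, p_dist))
                else:
                    clause_results.append(
                        method3(clause, p_minor, p_major, {}, {}, T=0.4))
                clause = []
            else:
                clause.append(tag)
        return clause_results         # one result per clause; punctuation
                                      # positions always receive a break
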

3.6.8 Variables There are two variables that play a role in some or all of the methods described above. These parameters have to be carefully tuned, as they influence the results. The parameters are the threshold T above which a break is inserted and the lookahead parameter L, only used in the fifth method. Before testing, indications can be given of what the influence of these parameters will be. The threshold T is the parameter to which the (trigram or total) probability is compared; if the probability is higher than T, then a break is inserted; if not, no break is inserted. The lower T is, the more breaks will be inserted. This will result in more breaks rightly placed, but also in more breaks wrongly inserted. The best value for T seems to be the one where the difference between the correctly placed breaks and the wrongly inserted breaks is the highest (provided that the first number is the higher). The lookahead parameter L, used in method 5, decides how many trigrams in advance are considered to place a break. The maximum is picked and the algorithm goes on from there. The insertion of a break still depends on T; if the trigram probability is smaller than T, a break will not be inserted. If L is small, only a few trigrams are considered. Many breaks will be inserted and many of them will be right, but some wrong breaks will also be inserted. If L is large, then the maximum of many trigrams is chosen and a lot of possible break candidates will be skipped. This causes fewer insertions, but more deletions. The best value of L has to be determined experimentally. The variables can be tuned so that the test on the test data gives a higher score (this is actually training on test data), but they can of course also be used in the implementation. When one wants many breaks (to simulate a slow speech rate, for example) one can set the threshold T low. More breaks will be inserted, of which a couple in the wrong place, but that might be compensated by fewer errors on breaks that are correctly inserted


that would not have been inserted with a higher threshold.

3.7 Computing a score

3.7.1 Introduction

When the breaks have been inserted during testing, the newly created file has to be compared to the file with the original breaks to deliver the score. There are quite a number of ways to do this and a number of matters to consider. One could look at whole sentences only, scoring them as right when the whole sentence is right and as wrong when there is at least one error. But that does not tell you much about a method's performance. One could compare the newly placed breaks with the original breaks, but that leaves us with a strange interpretation of insertions: the number of breaks in the original file is the number of breaks placed right in the new file plus the number of deletions plus the number of substitutions, so the number of insertions is independent of the number of breaks in the original file. It is unclear how this number of insertions should be interpreted when it is compared to the number of breaks in the original file. The method presented here looks at the interword spaces, the places in between two words where a break can be inserted. In addition, the number of breaks inserted rightly compared to the total number of breaks will also be given.

3.7.2 Method of computing the score Our way to compute the score is to compare every interword space in the created and original files. If it is the same (both a break or both not a break) then it is right. The errors can be divided into insertions, deletions and substitutions. An insertion is a break inserted in the new file where there is not a break in the original file. A deletion occurs when there is not a break in the created file where there is one in the original. A substitution is a major break instead of a minor one, or vice versa. The score (S) is then the number of interword spaces that were done right divided by the total number of interword spaces. The worst error is an insertion in a place where a break is totally out of place. This should not happen too often, because the probability of a break in such a place is in general very low, since there are few or no occurrences in the training corpus. If a break is in a wrong place it will often be in something like a wrong PP attachment, but chances are very small that


it will be between a determiner and a noun. Deletions can cause sentences that sound unnatural, but more often they do not sound strikingly unnatural. The substitution is the least severe error. It is the wrong type of break in the right position, which is often not disastrous for the prosody of the sentence. A disadvantage of this method is that it is dependent on the proportion of breaks to interword spaces. This means that if there are few breaks in the original file, one will score highly by inserting few breaks. If only 10 % of the interword spaces contain a break, one would score 90 % by not placing any breaks at all. The following calculation is used to adjust the method on this point. It is called the kappa statistic; more information about it can be found in Siegel & Castellan (1988) [10]. If the score using the normal method is S and the proportion of no-breaks to the number of interword spaces is B, then the adjusted score is (S - B)/(1 - B). In the corpus, 78 % of the interword spaces do not contain a break (B = 0.78). If a method scores 90 % of the interword spaces correct (S = 0.9), then the adjusted score is (0.9 - 0.78)/(1 - 0.78) = 0.55. Without placing any break, S = B, and the adjusted score would be 0 (zero). A score of 1 is reached when every single interword space is correct, S = 1. An adjusted score smaller than zero, S < B, is worse than the score for not placing any breaks at all. This would mean that many breaks are inserted in the wrong places.
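
The adjustment is trivial to compute; a small sketch with the numbers from the text:

    # Adjusted (kappa) score: S is the proportion of correct interword
    # spaces, B the proportion of interword spaces without a break.
    def adjusted_score(S, B):
        return (S - B) / (1.0 - B)

    print(adjusted_score(0.90, 0.78))   # 0.5454..., the 0.55 quoted above
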

Chapter 4

Results 4.1 Introduction Before presenting the results, one needs to bear in mind that the score is no more than an indication of how good the methods are, because a different placing of breaks does not necessarily mean a wrong placing of breaks. The main purpose of the results is therefore to compare the different methods and variables and pick the best one for the implementation.

4.2 Value of results As far as is known at present, phrase breaks in human speech are not placed according to fixed rules. If that were the case, we would just have to implement these rules to reach our goal. But there is more than one correct way to insert breaks. The following sentence is from the corpus; the phrase boundaries in the first version are as they were originally transcribed, and in the second version they are inserted by one of the methods described here.

The official reason was | that the move was necessary | to encourage exports || .

The official reason was | that the move | was necessary to encourage exports || .

There are two differences in break placing in these sentences. One might think that the first one sounds a little bit better, smoother so to speak, but


the second one sounds quite all right as well, and at least it is not totally unacceptable. However, the second sentence, in spite of sounding acceptable, will be scored as accounting for two errors. Another thing to bear in mind when looking at the scores is the double error that a break placed in the wrong place can cause. When a break is placed close to the spot where it should be, but not on it, there are two errors: an insertion and a deletion. Not only is it possible that the sentence sounds all right, it probably sounds better than with two breaks inserted with only one word in between, which would score only one error, the insertion. With the small distance probability of only one or two words in between two breaks, the methods using the distance probabilities will seldom insert both breaks.

He says | he and many of his colleagues | left their organisation | just before the Tripoli fighting || .

He says he | and many of his colleagues | left their organisation just | before the Tripoli fighting || .

The first sentence is how it was originally transcribed; in the second sentence, the breaks are inserted by one of the methods. The sentence has 4 errors in total: 2 insertions and 2 deletions. But in fact there are only two errors (although one could even argue about the first one). In both cases the break has been placed one word too far to the right. If breaks had been placed in the right spots too, there would have been two fewer errors, but the sentence would not have been any better. This is another reason why the scores may look worse than they actually are. As a result of this, not too much attention should be paid to the number of breaks correctly inserted compared to the number of breaks in the transcriptions. This percentage is easily brought up to 90 % just by inserting a break everywhere. There would then be an enormous number of insertions and the adjusted score would be very low, despite so many breaks being rightly inserted. The substitution is another problem. It could be argued that it should be counted as a rightly placed break, because the position of the break is right. But since two types of breaks were distinguished in the first place, a substitution should be considered a sort of error. In practice, a major break is only placed at the following punctuation marks: ".", ":", ";", "!", "?". This is very reasonable, and if it was done differently in the corpus,


then there must have been special circumstances, such as a high speech rate, which can cause a minor break at sentence ends, or a low speech rate, which can cause major breaks at commas. The score of a certain method tells us more about how good it is at predicting breaks in the way it is done in the training data than about how good it actually sounds to the human ear. There is a maximum possible score that can be reached, which is significantly lower than 100 %. This goes even for the most sophisticated methods, trained on very large corpora, because of the variability in the corpora, as explained before. The score is important, however, to compare the different methods and variables used.

4.3 The hard numbers 4.3.1 Introduction This section describes the influences of the various methods and variables, supported by the scores. Approximately half of the total MARSEC was used for the experiments. There were 18,131 words (including punctuation) in the training set. The test set consisted of 6420 elements. The data for the test set was picked randomly and contained different categories from the corpus. There are 6171 interword spaces in the test set, 1065 minor phrase breaks and 311 major phrase breaks. This means that 22 % of the interword spaces contain a break. The results will be compared to these numbers. Table 4.1 gives the (best) results of the 7 methods with all parameters set to their optimal values. Column 1 gives the number of the method, column 2 the percentage of correctly inserted breaks compared to the total number of breaks in the test data. Columns 3 and 4 give the percentages of insertions and deletions, respectively, of (minor and major) phrase breaks compared to the total number of interword spaces. Column 5 gives the percentage of correct interword spaces and column 6 gives the adjusted (kappa) score as described in section 3.7.2. The percentage of substitutions is not in this table; it is 1 % for every single method.

Method   % Br Cor   % Ins   % Del   % Iws cor   Adj score
1        49         1       11      85          0.36
2        78         16      4       80          0.22
3        70         4       5       89          0.50
4        68         6       6       87          0.40
5        64         3       6       89          0.50
6        72         3       4       90          0.54
7        68         5       4       88          0.45

Table 4.1: Results.

4.3.2 Results of method 1 The method of inserting breaks only at punctuation is independent of the tag set. It is also independent of the lookahead and threshold parameters; it is even independent of training! It is clear that too few breaks will be inserted,


since a break at punctuation is in most cases correct, but there should be a lot more breaks. With this method 393 minor phrase breaks were inserted. That is slightly over a third of the number in the original transcriptions, which means that at least two thirds of the minor phrase breaks are not at punctuation. There are 278 major phrase breaks inserted, which is almost 90 % of the transcribed major breaks. This indicates that most major breaks are at a punctuation mark. There are only 47 insertions, but 752 deletions, which is 11 % of the interword spaces. There were 45 major breaks inserted where a minor break should have been, and 18 minors where a major was expected. The adjusted score of this method is 0.36. This method will evidently not do. The breaks it inserts are quite good, but it inserts far too few breaks. This method is not good enough for implementation, as expected.

4.3.3 Results of method 2 The second method, like the method used in the CSTR text-to-speech system, inserts breaks at punctuation and after content words that are followed by function words. The prediction was that there would be too many breaks, because there should not always be a break between a content and a function word. 1074 breaks are rightly inserted, which is 78 % of the breaks. There are, however, also 1001 insertions, which is 16 % of the interword spaces. There are 236 deletions and the number of substitutions is 66. The score for this method is 0.22. This method behaves in the opposite way to the first method. It covers a large part of the breaks, despite the 236 deletions, but there are far too


many insertions. This is not a surprise at all, because there are many content word - function word occurrences. The high percentage of correctly inserted breaks must not be misinterpreted; it shows that this method finds most places where a break should be, but it finds too many. Looking at the adjusted score shows that this method is not that good at all. In fact, it has the lowest score of all.

4.3.4 Results of method 3 The third method uses the bi- and trigram probabilities only. The threshold, the probability above which a break is added, must be specified. If the threshold is set at 0.5, then the results are the following: 866 breaks are inserted correctly, 62 % of the transcribed breaks. There are 205 insertions, 441 deletions and 69 substitutions. The number of interword spaces right is 88 %, which results in a score of 0.45. This method is obviously better than the previous two. Since this method has a variable parameter, it is possible to fine-tune it to get the optimum result. There are (a lot) more deletions than insertions; therefore the threshold should be set lower. The best threshold appears to be 0.4. With this newly set threshold, 70 % of the breaks are inserted rightly, there are 299 insertions and 330 deletions. 89 % of the interword spaces are correct and the score is 0.50. This method is, of course, preferable over the first two. The sentences are, after the breaks have been inserted, a bit repetitive. This is caused by the fact that a break is inserted in either every pattern of a sort, or in none. The distribution of the breaks is somewhat strange; some parts show many breaks close together where other parts contain no breaks at all. This indicates that the distance probabilities could be useful.

4.3.5 Results of method 4 Method 4 makes use of the distance probabilities. The trigram probabilities are multiplied by the cumulative distance probabilities and if the result is above a set threshold, a boundary is added. Finding the right threshold is a matter of trial and error here, more than elsewhere, because the probability is a multiplied one. The threshold that is intuitively the best is the one with which there are as many deletions as insertions. This appears to be a threshold of 0.15. It results in 375 insertions and 365 deletions. There are 69 substitutions.


87 % of the interword spaces are correct and the score is 0.40. It is not certain that evenly divided insertions and deletions result in the best score. By setting the threshold lower, one might remove more deletions than one gains new insertions, and vice versa for a higher threshold. After testing, it is concluded that a lower threshold gives worse results. A higher threshold results in about the same score, up to a threshold of 0.35, at which there are only 167 insertions but 605 deletions, together only slightly more errors than with the 0.15 threshold. However, the score never exceeds 0.40. Although this method is significantly better than the first two, it scores, to our surprise and disappointment, worse than method 3, which uses only the trigram probabilities.

            Lookahead
Threshold   6      8      10     12     14
0           0.36   0.45   0.45   0.45   0.45
0.1         0.40   0.45   0.45   0.45   0.45
0.2         0.45   0.45   0.45   0.45   0.45
0.3         0.45   0.45   0.45   0.45   0.45
0.4         0.50   0.50   0.50   0.50   0.50
0.5         0.45   0.45   0.45   0.45   0.45

Table 4.2: Adjusted scores of method 5 with different lookaheads and thresholds.

4.3.6 Results of method 5

This method uses the lookahead parameter. The maximum total probability in the scope of the lookahead is computed by multiplying the trigram probability and the (not cumulative) distance probability. If the trigram probability of the trigram with the maximum total probability is higher than the threshold, a break is inserted. The two most important variable parameters are the lookahead L and the threshold T. To find the optimal parameter setting, one parameter was varied while the other stayed the same. L was tested with values of 6, 8, 10, 12 and 14; T was tested with values 0, 0.1, 0.2, 0.3, 0.4 and 0.5. The two parameters have been tested against each other. The results (given as the adjusted score) can be found in table 4.2.


The results show that the threshold parameter is more important than the lookahead parameter. An L of 6 scores worse than the other values for L if T is small, but for a T of 0.2 and higher, the score is the same for all values of L. As expected, a smaller value for L results in more insertions and fewer deletions, but this holds only for an L of 6. The difference between the number of deletions and insertions for L values of 8 and 10 is very small; from 10 onwards, there is no difference at all. The influence of the threshold T is bigger. The best value for T is 0.4, which results in a score of 0.50 for every tested value of L. This score is as high as the highest we got so far, from the third method, which is a bit disappointing for a method that uses more information. We can conclude that the best setting of parameters is an L of 10 and a T of 0.4. This results in 893 breaks rightly inserted and 89 % of the interword spaces correct. The 221 insertions, 412 deletions and 71 substitutions together make 704 errors and lead to a score of 0.50. Disappointingly, this method does not score better than the third one, even with the optimal parameter setting. The breaks that are inserted are all right, but there are too few breaks.

4.3.7 Results of method 6

The method that computes the probabilities of every possible combination of break placing and picks the best is unfortunately too slow to be of any practical use. To test it anyway, 68 sentences from the test data were picked, all containing 15 words or fewer. Because this test set is so small, and the sentences are short, the score will not be a very accurate indication of how good the method is. An indication of how slow this method is, is the two hours it took to insert the breaks. The results are good, though. Of the 166 breaks, 120 are inserted rightly. There are 23 insertions and 51 deletions. These numbers should not be confused with the numbers mentioned for the previous methods, since the test set differs. 683 of the 758 interword spaces are right, which results in a score of 0.54. The other methods, tested on this special test set, do not reach this score. The third method, which uses only the trigram probabilities, scores 0.50, the same as it did on all the test data. The fifth method, which scored 0.50 on the total test set, scores only 0.45. The parameter settings were the same as those used on all the test data when we got the maximum results. If any conclusions may be drawn from the experiments on this specific


test set, it can be concluded that this method is better than the other methods. This may be expected, since it does not leave any possibility out of consideration. The problem is, of course, that it is so slow that it is not useful in practice.

            Maximum
Threshold   6      8      10     12
0.3         0.40   0.40   0.40   0.40
0.4         0.45   0.45   0.45   0.45
0.5         0.45   0.40   0.40   0.40

Table 4.3: Adjusted scores of method 7 with different maxima and thresholds.

4.3.8 Results of method 7 This method places a break at every punctuation mark and computes the probabilities for all possible combinations for the clauses in between two punctuation marks, unless the clause has more trigrams than the maximum M. In that case the third method is used, the one that uses only the trigram probabilities. Apart from M, the other parameter that was tested in this method is the threshold T, used when the third method is applied: if the trigram probability is above T, a boundary is added. The results can be found in table 4.3. The results are again disappointing; the maximum parameter does not make any difference, except for a threshold of 0.5, and in that case the score is better for a low maximum. In other words, the score is better when the third method is used more often. The scores are best, in agreement with earlier results, when a threshold of 0.4 is used. The results are quite different from what was expected. Two methods are used together here, and one would expect the one that considers every possibility and uses the distance probabilities to be the better, so that using it more often, i.e. a high value for M, should give better results. But the results appear to be the same for a high or a low maximum, and in the case of a T of 0.5, the results are even better for a low maximum.


4.4 Examples from the field

4.4.1 Introduction

The numbers and scores give a good indication of which method is better at inserting breaks in the same way as was done in the training data, but they give no idea of what the sentences look like. In this section, sentences from the corpus are given with the original transcription and the transcriptions of each of the methods.

4.4.2 Example

This example is the first sentence from a BBC news broadcast. The originally tagged version looks like this: BBC/NNJ news/NN1 at/II eight/MC o'clock/RA on/II Saturday/NPD1 ,/, the/AT twenty-second/MD of/IO June/NP1 ./. The tags are mapped to one of the tags described in section 3.4.4, which results in: BBC/n news/n at/i eight/j o'clock/a on/i Saturday/n ,/p the/d twenty-second/j of/i June/n ./f The breaks were originally transcribed as follows: BBC news | at eight o'clock | on Saturday | , the twenty-second of June || . With breaks inserted by the first method: BBC news at eight o'clock on Saturday | , the twenty-second of June || . The breaks only appear at the punctuation marks. That means seldom a break in the wrong place, but of course also far too few breaks, as is the case in this sentence. Applying the second method results in: BBC news | at eight o'clock | on Saturday | , the twenty-second | of June || . Placing a break between a content and a function word works wonderfully for this sentence. There is only one difference from the original: the insertion between 'twenty-second' and 'of'. That is an error, because 'the twenty-second of June' really is one phrase that should be without a break. The third method turns the sentence into this:


BBC news | at eight | o'clock | on Saturday | , the twenty-second of June || .

Again the sentence is the same as the original except for one insertion. But the break between 'eight' and 'o'clock' is really disastrous. Apparently the sequence 'at eight o'clock' is of a kind that often has a break in it. The tag sequence is preposition, adjective, adverb. It is dubious whether these are the right tags for the words, but that is one problem of working with a small tag set (although the errors can already have been made by the tagger). The question remains why the sequence 'i j a' gets a break in 40 % or more of the cases, when it should not have a break here. There are only 5 cases of 'i j a' in the training set. Of these 5 cases, 2 contain a break, or exactly 40 %. Five cases is of course too few to be representative, but the trigram probability does not take into account the number of appearances of the trigram in the training set. Breaks inserted by the fourth method: BBC news at eight | o'clock on Saturday | , the twenty-second of June || .

This method does not place a break between 'news' and 'at'. The cumulative distance probability is still quite low there and the trigram probability is not high enough to get the combination above the threshold. Between 'eight' and 'o'clock', however, the distance probability is higher and the trigram probability is also high (0.4, as we saw), so that their combination is higher than the threshold, and a break is (wrongly) placed. No break is placed between 'o'clock' and 'on'. Although the trigram probability is quite high (0.66), the distance probability prevents a break, because a break was placed just one word back. The fifth method gives: BBC news at eight o'clock | on Saturday | , the twenty-second of June || .

This result is better again. The reason that there is no break between 'eight' and 'o'clock' now is that the product of the distance probability (0.18) and the trigram probability (0.40) is lower than that of a break in the place between 'o'clock' and 'on' (0.15 and 0.66), since 0.18 × 0.40 = 0.072 < 0.15 × 0.66 = 0.099. Breaks inserted by the sixth method: BBC news | at eight o'clock on Saturday | , the twenty-second of June || .


The same result as with the third method. Since this is a short sentence, it was in the test set for the sixth method, which computes all possibilities and selects the best. There are 2048 possibilities for this sentence. It is therefore hard to understand why some breaks are placed and others are not. The seventh method gives: BBC news at eight o'clock on Saturday | , the twenty-second of June || . The only difference from the sixth method is that breaks are placed at the punctuation and the phrases in between are done separately. The phrase before the comma is not assigned any break by this method, whereas a break was placed between 'news' and 'at' when the sentence was done as a whole, in the sixth method. Looking at all seven methods, we conclude that they all appear to place a break at punctuation marks, even when they are not directly programmed to do so, as methods 1, 2 and 7 are. There are places where they generally agree on a break (between 'twenty-second' and 'of') and places where they do not agree at all (between 'news' and 'at'). Some decisions are explicable, but others look very strange.

4.4.3 Another example

Another example from the test set is the following. The original tagged version: In/II a/AT1 city/NNL1 where/RRQ there/EX 's/VBZ always/RR something/PN1 to/TO see/VV0 twenty-four/MC hours/NNT2 a/AT1 day/NNT1 ,/, the/AT latest/JJT attraction/NN1 is/VBZ the/AT Federal/JJ courthouse/NN1 in/II Manhattan/NP1 ./. The transformed tagged version: In/i a/d city/n where/a there/n 's/x always/a something/o to/i see/v twenty-four/j hours/n a/d day/n ,/p the/d latest/j attraction/n is/x the/d Federal/j courthouse/n in/i Manhattan/n z/z ./f With the originally inserted breaks: In a city where there 's always something to see | twenty-four hours a day | , the latest attraction | is the Federal courthouse | in Manhattan || . Breaks inserted by the first method:


In a city where there 's always something to see twenty-four hours a day | , the latest attraction is the Federal courthouse in Manhattan || .

It is the same story: breaks at punctuation marks and nowhere else, thus too few breaks. Breaks inserted by the second method: In a city where there | 's always something | to see twenty-four hours | a day | , the latest attraction | is the Federal courthouse | in Manhattan || .

The break within 'there's' is caused by the tagging of 's as a separate word. It would have been wrong even if the text had said 'is', but now it is as wrong as you can get. The break between 'something' and 'to' is also wrong, but that is caused by the placement of 'to' in the class of prepositions. Then we see an example of hierarchical misphrasing. The break between 'hours' and 'a' is wrong, since 'twenty-four hours a day' is one phrase. But it is doubly wrong since there is no break between 'see' and 'twenty-four'. Breaks inserted by the third method: In a city where there 's always something to see twenty-four hours | a day | , the latest attraction | is the Federal courthouse | in Manhattan || .

No trigram in the first part of the sentence appears to have a high trigram probability. The probability of a break in 'to see twenty-four' is only 0.23, so no break is inserted there. The first break is (wrongly) inserted in 'twenty-four hours a'. The tagging of that trigram is 'adjective, noun, determiner', whose break probability is 0.92. These breaks appear in phrases like "Next/j week/n | a/d delegation/n ...", "... the/d manic/j speed/n | which/d ..." and "... we/o heard/v automatic/j fire/n | a/d few/d yards/n away/a ...". A break is right in most of these 'j n d' trigrams, but in special phrases like 'twenty-four hours a day' or 'seven days a week', our trigram method goes wrong. It goes partly wrong because the cardinal number in the original tagging was mapped to an adjective, which is right in many cases, but is part of the error in this case. The rest of the sentence is done well. Breaks inserted by the fourth method: In a city where there 's always something to see | twenty-four hours | a day | , the latest attraction | is the Federal courthouse | in Manhattan || .

The cumulative distance probability at 'see twenty-four' is so big that in combination with the trigram probability it is (justly) decided to place a


break. Between 'hours' and 'a', the cumulative distance probability is still quite small, but the trigram probability is so high that it is still (unjustly) decided to place a break. Breaks inserted by the fifth method: In a city where there 's always something to see twenty-four hours | a day | , the latest attraction | is the Federal courthouse | in Manhattan || . The placing of breaks is the same as with the third method. In the first part of the sentence, a maximum is picked, but its trigram probability is so low that no break is inserted. In the part 'to see twenty-four hours a day', the trigram probability of 'twenty-four hours a' is much bigger than that of 'to see twenty-four', as we have seen. It is so much bigger that even after multiplication with the distance probability, the probability of the trigram 'twenty-four hours a' is still bigger than the probability of 'to see twenty-four'. So a break is wrongly inserted again. The sentence was too long for the sixth method. The seventh method gives: In a city where there 's always something to see twenty-four hours | a day | , the latest attraction | is the Federal courthouse | in Manhattan || . The same again. The 'twenty-four hours a' probability is so high that a break will always be placed between 'hours' and 'a'. The other probabilities are too weak, so that despite a high distance probability no break is placed. Looking at all the methods, we can conclude that the second part of the sentence is not a problem for any of them, except of course the first. What is a problem for all of them (except the first) is the phrase 'twenty-four hours a day'. It is not recognised as a single phrase and the trigram probability is so strong that every method places a break between 'hours' and 'a'. What we would want is for such phrases to be recognised as one phrase. This might be achieved by running a procedure before break insertion that filters out such phrases and tags them with a special phrase tag, as sketched below.
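
A hypothetical sketch of such a filter (the phrase list and the special tag "q" are made up for illustration):

    # Retag known fixed phrases before break insertion so that no break
    # can be placed inside them.
    FIXED_PHRASES = [("twenty-four", "hours", "a", "day"),
                     ("seven", "days", "a", "week")]

    def retag_fixed_phrases(tagged):    # tagged: list of (word, tag) pairs
        out, i = [], 0
        while i < len(tagged):
            for phrase in FIXED_PHRASES:
                words = tuple(w for w, _ in tagged[i:i + len(phrase)])
                if words == phrase:
                    out.extend((w, "q") for w in phrase)  # "q": phrase tag
                    i += len(phrase)
                    break
            else:
                out.append(tagged[i])
                i += 1
        return out
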


Chapter 5

Conclusions 5.1 Introduction After all the tests have been done and all the results have been looked at, it is time to draw conclusions. Some things also need to be said about the possibilities for implementation. Finally, some ideas for future work will be given.

5.2 Conclusions on the results (Part of) the goal of the experiment was to develop a method that is better at inserting phrase breaks than the method used in the CSTR TTS system. This goal seems to have been reached: all methods tested here score better than the second, which is the method used in the CSTR system. The methods that use the distance probabilities do not work better than the third method, without the distance probabilities, despite the extra information that is used. The main reason for this is probably the distribution of the two different sorts of probabilities. The trigram probabilities have a range that goes from 0 to 1, whereas the (non-cumulative) distance probabilities are never higher than 0.25 and the probabilities of the different distances are a lot closer to each other. Therefore, the influence of the distance probabilities is far smaller than that of the trigram probabilities. Also, it is likely that the trigram and distance probabilities are not completely independent, so that the information from the distance probabilities is partly included in the trigram probabilities. In that case the distance probabilities might not add enough information to come to better results. This greater influence of the trigram probabilities and the possible dependence of the two probabilities


lead to the best results for the third method. Since this is also the simplest method (apart from the first two, with their very bad results), it is clearly the best one to use for implementation. However, none of the methods is ideal for implementation and some adjustments seem to be necessary before implementation takes place. One disadvantage that holds for all of the candidates for implementation (only the last five methods are seriously taken into account) is the use of the stand-alone tagger and the necessity of mapping the tag sets. Although the Brill tagger has a very good performance, it is also terribly slow. It would take away the real-time aspect of the TTS system and is therefore unsuitable. Another problem is the mapping that has to be made from the tagger's tag set to the tag set that is used to train the methods. It is an extra step that should be made unnecessary: first the tagger assigns tags from a rather detailed tag set to the words, and then a mapping to a reduced tag set is made. The solution to both problems (slow tagger and extra step) is to build our own tagger that directly assigns the tags that are needed. This would make the system faster. The tagging would also be more accurate, since we would no longer be dependent on the tag set of the tagger we use.

5.3 Future work The work described in this report is not finished. There are some things that need a closer look and there are some ideas that have not been tested yet. The combinations of the trigram and distance probabilities used here are probably not optimal. They have been treated as totally independent, and that might be a false assumption. If one wanted to examine their dependence, one would have to start with a more complex training procedure, with both the trigram and the distance information examined together. One could, for example, look at the distance between two breaks taking into account the POS tags between the two breaks. Methods like this would, however, increase the number of probabilities considerably. Various interesting methods can in principle be examined, but one always has to bear in mind the limited size of the corpus, so that the number of (possible) probabilities does not become too high. Otherwise one will not find enough occurrences in the data, and the probabilities will be unreliable.


Another thing that would be interesting to examine is a different approach to the distance probabilities. Whereas in this experiment the distance probabilities looked solely at words, it seems very useful to compute the distance probabilities over syllables. Syllables are a logical unit with which to measure the distances, because their lengths are more equal to each other than the lengths of words are. However, the method of placing breaks would then become more complex, because the number of syllables needs to be extracted. Finally, the way of testing can be improved. In this study the only test was based on a comparison between the newly inserted breaks and the originally transcribed breaks. It would probably be very interesting to examine how good the different methods sound to listeners. To do this, listeners should give their opinion about sentences spoken (preferably by the text-to-speech system) with the breaks placed according to the different methods. The opinions of the listeners would unfortunately strongly depend on how the breaks are treated in the phonetic part of the TTS system. Therefore, this test would not be ideal either.


List of Figures

3.1 Distribution of phrase lengths.
3.2 Distribution of the cumulative phrase distance probability.

List of Tables

4.1 Results.
4.2 Adjusted scores of method 5 with different lookaheads and thresholds.
4.3 Adjusted scores of method 7 with different maxima and thresholds.

Bibliography

[1] J. Allen, S. Hunnicutt, and D. Klatt. From Text to Speech: the MITalk System. Cambridge University Press, 1987.
[2] Simon Arnfield. Prosody and Syntax in Corpus Based Analysis of Spoken English. PhD thesis, University of Leeds, 1994.
[3] Eric Brill. Some advances in rule-based part of speech tagging. In Proceedings of the Twelfth National Conference on Artificial Intelligence (AAAI-94), 1994.
[4] G. Knowles and L. Taylor. Manual of Information to Accompany the SEC Corpus. 1988.
[5] D. Robert Ladd. Intonational phrasing: the case for recursive prosodic structure. Phonology Yearbook 3, pages 311-340, 1986.
[6] Janet B. Pierrehumbert. The Phonology and Phonetics of English Intonation. PhD thesis, MIT, 1980. Published by Indiana University Linguistics Club.
[7] P. J. Price, M. Ostendorf, S. Shattuck-Hufnagel, and C. Fong. The use of prosody in syntactic disambiguation. Journal of the Acoustical Society of America, 90(6), December 1991.
[8] Rene Collier, Jan Roelof de Pijper, and Angelien Sanderman. Perceived prosodic boundaries and their phonetic correlates. In Human Language Technology 1993, New Jersey, USA, 1993.
[9] Shigeru Fujio, Yoshinori Sagisaka, and Norio Higuchi. Prediction of prosodic phrase boundaries using stochastic context-free grammar. In ICSLP '94, Yokohama, Japan, 1994.


[10] S. Siegel and N.J. Castellan Jr. Non-parametric Statistics for the Behavioral Sciences. McGraw-Hill, 1988.
[11] K. Silverman, M. Beckman, J. Pitrelli, M. Ostendorf, C. Wightman, P. Price, J. Pierrehumbert, and J. Hirschberg. ToBI: a standard for labelling English prosody. In International Conference on Spoken Language Processing '92, Banff, Canada, 1992.
[12] Kim Silverman. The Structure and Processing of Fundamental Frequency Contours. PhD thesis, University of Cambridge, 1987.
[13] Paul A. Taylor, I. A. Nairn, A. M. Sutherland, and M. A. Jack. A real time speech synthesis system. In Proc. Eurospeech '91, Genova, Italy, 1991.
[14] N.M. Veilleux, M. Ostendorf, P. J. Price, and S. Shattuck-Hufnagel. Markov modelling of prosodic phrase structure. In International Conference on Acoustics, Speech and Signal Processing. IEEE, 1990.
