From Data to Speech: A Generic Approach

Mariet Theune*, Esther Klabbers*, Jan Odijk†, Jan Roelof de Pijper*

* IPO: Center for Research on User-System Interaction, P.O. Box 513, 5600 MB Eindhoven, The Netherlands. E-mail: {theune/klabbers/pijper}@ipo.tue.nl.
† Lernout & Hauspie Speech Products, Sint-Krispijnstraat 7, 8900 Ieper, Belgium. E-mail: [email protected].
We present a combination of techniques for the creation of different data-to-speech systems. These are implemented in a generic system called D2S, which is sufficiently general to be used as a basis for applications in different languages and domains. Language and speech generation in D2S is done by means of techniques which aim to find a balance between theoretical attractiveness and practical usability. In our hybrid language generation technique, the use of syntactically enriched templates is guided by knowledge of the discourse context, and for speech generation prerecorded phrases are combined in a prosodically sophisticated manner, using linguistic information provided by language generation. This makes it possible to create linguistically sound but efficient systems with high-quality language and speech output.
1. Introduction

In this paper we propose a combination of techniques for creating data-to-speech systems, i.e., systems which present data in the form of a spoken monologue (sometimes also called `concept-to-speech' systems). Such systems might be part of more general automatic information service applications, for example telephone services. They could also be used in situations where eyes and hands are occupied (e.g., in a car), or in combination with other modes (e.g., textual or graphic modes).
The most important characteristic of data-to-speech is that it combines language and speech generation. Language generation is used to produce a natural language text expressing the system's input data, and speech generation is used to make this text audible. In both language and speech generation, as in many other domains, two extremes can be found: on the one hand there are theoretically motivated but not very practical methods, and on the other there are practical but theoretically less interesting methods. The techniques presented in this paper are intended to achieve a balance between these two extremes, so that they are both theoretically interesting and usable in practical applications.
We also show that data-to-speech is more than simply adding speech output to a natural language generation system, and that speech generation can benefit from its combination with language generation. This benefit stems from the fact that language
generation can be used to provide prosodic information, which is exploited by speech generation to produce natural-sounding speech.
A generic system, D2S, has been developed in which these techniques have been implemented. The main goal in developing this system was to gain knowledge about and experience with text generation and speech output techniques that can be used for the construction of data-to-speech systems for various domains and languages. D2S was initially developed during the construction of the DYD-system, which generates spoken monologues in English, giving information about compositions by Mozart (see van Deemter et al. (1994); Odijk (1995); van Deemter and Odijk (1997)). In the near future D2S will be used in a large travel information system, the OVIS-system.2 In this paper, a simple data-to-speech system, called GoalGetter, will be used to illustrate D2S and the techniques it is based on. GoalGetter generates spoken reports of football matches. The system takes as input data on a football match, derived from a Teletext page.3 An example input Teletext page is given in Figure 1. This particular example contains information about two football matches. The output of the GoalGetter system is a correctly pronounced, coherent monologue in Dutch which conveys the data from the input Teletext page.4
A system which has the same application domain as GoalGetter is the SOCCER system (see André, Herzog, and Rist (1988); Herzog and Retz-Schmidt (1989)). The SOCCER system generates spoken descriptions of image sequences of football scenes. This differs from the GoalGetter system, which generates summaries of football matches, taking tabular information about the match as input. In that respect GoalGetter is more like the STREAK system (Robin (1994); McKeown, Robin, and Kukich (1995)), which also generates sports summaries, though its sports domain is basketball instead of football, and this has consequences for the character of the texts generated. An important difference with GoalGetter is that STREAK produces only written, not spoken, output.

2 Information on OVIS can be found at http://grid.let.rug.nl:4321/.
3 Teletext is a system with which textual information is broadcast along with the television signal and decoded in the receiver. The information is distributed over various `pages', each filling a screen, which are continuously refreshed. Most pages contain textual information, but some contain tables.
4 An interactive on-line demo of the GoalGetter system can be found at http://iris19.ipo.tue.nl:9000/.
Figure 1
Example Teletext Page. (Arbiter=referee; Toeschouwers=spectators; Geel=yellow card)
This paper is organized as follows. In section 2 we describe the extremes of a range of possible techniques for language and speech generation, and show that the techniques used in D2S can be situated about halfway between these extremes, achieving both practical usability and theoretical well-foundedness (2.1 and 2.2). We also point out an important advantage of combining language and speech generation in a data-to-speech
system: linguistic information about the generated text is available from the language generation component, and can be used for sophisticated computation of prosody, thus increasing the naturalness of the speech output (2.3). In this section we also present the general architecture of D2S, and give an overview of the types of information that are needed for adequate prosody computation. In the next section (section 3), we first show an example of the system's input and output, taken from the GoalGetter system, and then give a detailed description of the techniques that are used in the different modules of D2S, illustrated with examples from GoalGetter. In 3.1 we describe the text generation component of D2S, focusing on the use of syntactic templates. In 3.2, we present the rules that are used for the computation of prosody, and finally, in 3.3, the speech output techniques that can be used in D2S are discussed. A discussion of the strengths and weaknesses of D2S follows in section 4, where we also point to future work. Finally, some concluding remarks are given in section 5.
2. Techniques for language and speech generation

There are many different approaches to language and speech generation. It is impossible to discuss all options in this paper, so we will concentrate on the extremes of a range of techniques, scaled between very inflexible approaches, which have the practical advantages of efficiency and ease of development (language generation) or high quality output (speech generation), and flexible approaches, which are theoretically more interesting but are either less efficient in use and development or have a lower quality output. The main goal of our language and speech generation techniques is to find a balance between these extremes, achieving flexibility, efficiency and good output quality. After the discussion of language generation in 2.1 and speech generation in 2.2, we will discuss the role of prosody in our system in 2.3. D2S provides a tight coupling between the language and speech generation modules, which puts certain demands on the generation techniques used. These demands will also be discussed.
2.1 Language generation: linguistic NLG versus canned text
On the most global level, two contrasting approaches to natural language generation can be distinguished: linguistically motivated `deep' generation techniques on the one hand, and the use of `canned text' on the other. In the literature on language generation, attention has been paid almost exclusively to the first approach, whereas in most commercial applications the second approach is taken.
What we call the `canned text' approach is based on pure string manipulation. In the simplest case, ready-made strings are printed without any change. A little more advanced is the use of so-called `templates', string patterns that contain empty slots where other strings must be filled in. Which string patterns are chosen and how slots, if present, are filled in, is usually decided in a simple rule-based fashion (e.g., by means of condition-action pairs). Linguistic notions hardly play any role in this approach; as Reiter (1995) points out, only some of the more sophisticated systems can handle things like agreement and conjunction, though not in a theoretically motivated way.
In the other approach, generation is guided by linguistic notions and often explicitly based on some linguistic theory. Despite theoretical differences, most linguistic generation systems have roughly the same architecture. Abstracting from minor differences with respect to the names of the modules and the division of labour between them, the following components can be distinguished (cf. Dale, Mellish, and Zock (1990), Mykowiecka (1991), Reiter (1994)):

- Content or text planning - the information which must be expressed is selected and ordered. Determination of the text structure is often done through schemata (cf. McKeown (1985)) or rhetorical relations (RST, Mann and Thompson (1985)).

- Sentence planning - in this intermediate stage, information is chunked into sentences and the underlying sentence structure is determined. In many cases, this includes lexical choice. Sometimes no separate sentence planning stage is distinguished, in which case these tasks are dealt with in the linguistic realization module.

- Linguistic realization - deep sentence structures are converted to grammatically correct surface forms. Word order, choice of function words, realization of negation etc. are handled here. Proper morphological forms are chosen.

Generally, linguistic generation techniques are not aimed at achieving high computational speed and efficiency, since they are mostly used in research systems that serve a theoretical rather than a practical purpose. Moreover, developing them is a time-consuming and complex task, requiring specific (linguistic) expertise. In contrast, building generation systems which use canned text is relatively quick and easy, and generation from canned text is both fast and efficient. For this reason, canned text is used in most commercial applications.
However, there are some obvious advantages to the use of linguistic generation techniques as opposed to canned text. First of all, the canned text approach generally produces output texts which are not of very high quality: they tend to be very simple and show almost no variation.5 In contrast, linguistic generation systems can generate more complex texts and allow for more variation in the output. Linguistic generation is context sensitive, i.e., it can take into account the preceding discourse, the user's knowledge etc., and tailor the output accordingly. This is hardly possible in a canned text approach. Also, canned text systems are quite inflexible, so maintenance can be a problem. Any changes in the domain or the required output will involve a lot of recoding of rules or rewriting of strings. Generation systems which are based on more general linguistic techniques are much easier to modify; changes usually involve only one entry in the relevant knowledge base or the lexicon. This makes them less domain- and application-dependent.
So, we can say that canned text approaches offer high computational speed and efficiency at the cost of text quality, flexibility and portability, whereas for linguistic generation techniques the opposite holds. In some circumstances, e.g., applications that require both speed/efficiency and a higher text quality than can be achieved using canned text, it may be fruitful to use a technique which combines the two approaches. We will refer to such techniques as hybrid generation techniques (cf. Reiter (1995), Coch (1996)).
The term `hybrid generation techniques' covers many different kinds of techniques, because canned text and linguistic generation can be combined in various ways. For instance, a generation system could use high-level planning techniques for the determination of text structure and canned text for the realization of surface structure. An example is the Migraine project (Carenini, Mittal, and Moore, 1994), which uses a fairly sophisticated text planner, but does realization by means of simple templates. The opposite is also possible: text planning is done by means of ready-made templates, whereas a grammar is used for linguistic realization. The TEXT system (McKeown, 1985) can be seen as an example of the latter approach.
In the TEXT system, text structure is based on schemata, which are like templates at paragraph level.

5 See Coch (1996) for a discussion of the weak points of such texts, based on a formal evaluation of the quality of business reply letters written by means of different techniques.
Although integration of canned text or templates with linguistic generation can be motivated by the need for computational speed and efficiency, other reasons are possible as well. Reiter and Mellish (1993) describe the IDAS system, which incorporates those portions of text that are difficult to generate linguistically as canned text in the output. This can be done in two ways: canned text can be inserted into a linguistically generated matrix sentence, or linguistically generated referring expressions can be embedded in canned text. Alternatively, canned text may be used for those tasks in a language generation system that are considered to be of secondary importance, and for which the use of linguistic generation techniques would be too costly. This may be the case for generation systems which focus on only a certain aspect of language generation (e.g., planning) or for certain applications that do not require equally high quality on all textual levels (e.g., text structure may be simple but word choice must be varied).
Our motivation for using a hybrid technique was a mixture of all of the above. The aim was to develop a Language Generation Module (from now on abbreviated as LGM) which could serve as a vehicle for research into discourse generation in various domains and languages, but which would also be suitable for use in the `real world', e.g., as part of a working information system. This meant that the LGM should be portable and be able to produce a wide range of linguistic constructions, while still allowing for efficient, real-time text generation. Since the LGM should focus on variation of text structure and choice of referring expressions rather than on surface realization, we decided to make central use of so-called syntactic templates.
Syntactic templates are similar to `standard' templates in that they contain ready-made sentence patterns, with slots that must be filled in with appropriate expressions. However, they are much richer in information. First of all, syntactic templates incorporate a full syntactic tree representation of the sentence they express (hence the name syntactic templates).6 Furthermore, they contain several other kinds of information, such as topic information and conditions on their use. The presence of this information obviates the need for a separate planning component; determination of text and sentence structure is done solely on the basis of the information in the syntactic templates. This will become clearer in section 3.1, where our use of syntactic templates is discussed. Here it suffices to say that none of the three `standard' modules discussed above are distinguished in the architecture of the LGM; the system achieves text and sentence planning and realization without having specific modules for these purposes. However, some other components which are commonly used in linguistic generation systems, such as a discourse model and a knowledge state, are present in the LGM.
In short, on the one hand our generation technique incorporates some aspects of the canned text approach to language generation, notably the use of a kind of template. On the other hand, generation in our system is context sensitive and based on linguistic notions, as we will show in section 3.1, where the LGM is described in some detail.

6 One of the motives for including full syntactic analyses in the templates was that syntactic information is needed for prosody computation (see section 2.3).
2.2 Speech generation: record-and-play-back versus speech synthesis
The most straightforward way to provide a system with speech output is simply to record all utterances that one wants the system to be able to pronounce, and then play them back as required. This is a bit like using canned text in the NLG component (section 2.1), and would in fact be the preferred method in a system using the simplest form of canned text (i.e., using ready-made sentences or even larger chunks of text, which are completely fixed). This `technique' is theoretically not very interesting. The obvious advantage is that
the speech output quality that can be achieved is limited only by the medium through which the speech is transmitted. The most apparent disadvantage is that this approach is impracticable in all but the simplest of applications. For one thing, memory and storage limitations will usually severely restrict the number of possible sentences. For another, this scheme will work only if it is known exactly beforehand what sentences have to be produced.
In a D2S system such as DYD or GoalGetter, the simple record-and-play-back method cannot be used. Memory and disk space are not the principal obstacles, since all the information, including speech databases, can be kept in a central computer, which can be arbitrarily big and powerful. The problem is in the number of potential sentences the system can produce. Take for instance a simple sentence that can be generated by GoalGetter (see Figure 4 in section 3), De wedstrijd PSV - Ajax eindigde in 1-3, `The match PSV - Ajax ended in 1-3'. This is a short and simple sentence, but thousands of different variants of the sentence might also be generated, given different input data. These sentences might vary with respect to the combination of teams and scores. The GoalGetter system accommodates only matches played in the Dutch first division, which contains eighteen teams. If we assume that no team will score more than nine goals in a single match, we can derive 18 (possible home teams) * 17 (possible visiting teams) * 10 (possible numbers of goals, including zero) * 10 (idem) = 30600 different sentences, each of which is perfectly likely to occur. Since the other sentences that can be generated by the system will have a similar number of variations, recording all these sentences would be an impossible task. Clearly, the record-and-play-back approach does not work here.
The other extreme is to use an unrestricted text-to-speech program. Input to such a program is unrestricted text, and the output consists of synthesized speech. This offers great flexibility and has none of the disadvantages inherent to the record-and-play-back scheme. It is not necessary to restrict the number of possible sentences to be generated by the application on the grounds of limited memory resources, or for any other reason. Similarly, the addition of new material to be pronounced presents no problem. Unfortunately, there is a price to be paid: the quality of the speech generated by such systems still leaves a great deal to be desired (cf. de Pijper (1997) and others). Speech technology has reached the stage where unrestricted text-to-speech systems are capable of generating speech which has a high degree of intelligibility, but, in general, the speech still sounds quite unnatural.
Another solution, somewhere between these two extremes, is the concatenation of prerecorded phrases: entire words and phrases are prerecorded, and these are played back in different orders to form complete utterances. This method is particularly well suited to a carrier-and-slot situation, i.e., when there is a limited number of types of utterances (carriers, templates) to be pronounced, with variable information to be inserted at fixed positions (slots) in those utterances. In D2S, the carriers are the syntactic templates (see section 2.1), and these have slots for variable information. In the GoalGetter system, this variable information consists of match results, football team names, names of individual players, and so on.
In order to generate all 30600 possible variants of the example sentence used above (De wedstrijd PSV - Ajax eindigde in 1-3), it is now only necessary to record the carrier sentence itself, the team names and the numerals zero to nine. In this way, a large number of utterances can be pronounced on the basis of a limited number of prerecorded phrases, saving memory and disk space and increasing flexibility. Phrase concatenation thus attempts to reconcile the high quality and inherent naturalness of normal prerecorded speech with the flexibility of speech synthesis. The key merit of the technique is that it becomes possible to generate sentences that have never been produced as such by any human speaker, but with a quality approaching natural speech.
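To make the carrier-and-slot scheme concrete, here is a minimal sketch in Python of such a concatenation step. The recording names, the segmentation of the carrier sentence and the abbreviated team list are invented for illustration; they are not taken from D2S.

```python
# Hypothetical recordings: two carrier fragments, team names, and digits.
RECORDINGS = {
    "carrier_result_1": "de_wedstrijd.wav",   # "De wedstrijd"
    "carrier_result_2": "eindigde_in.wav",    # "eindigde in"
}
for team in ["PSV", "Ajax"]:                  # in reality: all eighteen teams
    RECORDINGS[team] = f"team_{team.lower()}.wav"
for n in range(10):                           # the numerals zero to nine
    RECORDINGS[str(n)] = f"num_{n}.wav"

def result_sentence(home, visitors, goals_home, goals_visitors):
    """Recordings for 'De wedstrijd <home> - <visitors> eindigde in <g1> - <g2>'."""
    keys = ["carrier_result_1", home, visitors,
            "carrier_result_2", str(goals_home), str(goals_visitors)]
    return [RECORDINGS[k] for k in keys]

# 18 * 17 * 10 * 10 = 30600 sentence variants from roughly thirty recordings:
print(result_sentence("PSV", "Ajax", 1, 3))
```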
The use of phrase concatenation in limited-domain applications such as GoalGetter is quite common. Commercial applications in which it is used are, for instance, travel information services (Aust et al., 1995), the speaking clock, telephone banking systems, and market research teleservices. In most applications, a simple concatenation scheme is used, where all the necessary words and phrases are recorded only once, and speech generation consists of concatenation of these fragments to form the required utterance, which is then played. This approach has two major problems:

- Very careful control of the recordings is needed. Usually, insufficient attention is paid to this, so that differences in loudness, rhythm and pitch patterns occur, leading to disfluencies in the speech. Phrases seem to overlap in time, creating the impression that several speakers are talking at the same time, at different locations in the room. These prosodic imperfections are often disguised by inserting pauses, which are clearly audible and make the speech sound less natural.

- In natural speech, the prosody of an utterance varies depending on its linguistic context. However, in the standard approach to phrase concatenation, the words and phrases to be concatenated are recorded in one prosodically neutral version only. This way, contextual variation is not accounted for, and the prosodic quality of the speech output will be suboptimal.
One simple application that does take prosodic properties into account is the telephone number announcement system described in Waterworth (1983). In order to increase the naturalness of the telephone number strings that are output by the system, they are split into smaller chunks. Digits are recorded in three versions with different intonation contours: a neutral form, a continuant with a generally rising pitch, and a terminator with a falling pitch contour. Most digits in a telephone number, e.g., 010 583 15 67, are pronounced using the neutral form. However, the digits occurring before a space, viz. 0, 3 and 5, are pronounced using the continuant form to signal a boundary and indicate that the utterance has not yet finished, and the final 7 is pronounced with a terminator to signal the end of the string. Experiments showed that people preferred this method over the simple concatenation method.
Another application, Appeal, a computer-assisted language learning program, uses a more sophisticated form of word concatenation to deal with prosodic variation (de Pijper, 1997). When making the recordings, the words were embedded in carrier sentences to do justice to the fact that words are shorter and often more reduced when spoken in context. The duration and pitch of the words are adapted to the context using the PSOLA technique (Pitch Synchronous Overlap and Add, Charpentier and Moulines (1989)). This ensures natural prosody, but the coding scheme may degrade the quality of the output speech to some extent.
Our approach to phrase concatenation can be seen as an extension of the simple concatenation approach. Like Waterworth's, it takes prosodic variation into account by recording different prosodic versions of otherwise identical phrases. In this way, no manipulation of the speech signal is required, thus retaining a natural speech quality. The technique will be explained in detail in section 3.3.
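The Waterworth-style selection of digit variants can be sketched as follows; the recording file names are invented. Digits at the end of a non-final group get the continuant version, the last digit of the whole number gets the terminator, and all others the neutral form.

```python
def digit_recordings(number: str) -> list[str]:
    """Map a grouped number, e.g. '010 583 15 67', to prosodic digit variants."""
    groups = number.split()
    files = []
    for g, group in enumerate(groups):
        for d, digit in enumerate(group):
            if g == len(groups) - 1 and d == len(group) - 1:
                form = "terminator"   # falling pitch: end of the string
            elif d == len(group) - 1:
                form = "continuant"   # rising pitch: boundary, not finished yet
            else:
                form = "neutral"
            files.append(f"{digit}_{form}.wav")
    return files

print(digit_recordings("010 583 15 67"))
# ['0_neutral.wav', '1_neutral.wav', '0_continuant.wav', '5_neutral.wav',
#  '8_neutral.wav', '3_continuant.wav', '1_neutral.wav', '5_continuant.wav',
#  '6_neutral.wav', '7_terminator.wav']
```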
2.3 Computation of prosody: the missing link
A data-to-speech system incorporates both language and speech generation. In the two preceding sections, we have discussed several approaches to language and speech generation, and we have given an overview of the advantages and the disadvantages of the
different approaches. What we have not yet discussed is how the different techniques for language and speech generation can best be combined in one data-to-speech system, and which specific demands are made on the techniques used in such a system. Those questions will be addressed in this section.
Most language generation techniques are aimed at producing plain written texts. Since most speech generation techniques work with plain text as input, an obvious way of combining language and speech generation in a data-to-speech system is to incorporate them as two separate modules whose interface consists of plain text. In such an architecture, language and speech generation are quite independent of each other, and different (existing) techniques may be combined depending on the requirements of the application. However, a serious problem of such an architecture is that valuable information for speech generation is lost (cf. Pan and McKeown (1997), Zue (1997)). In the previous section we already briefly remarked that prosody depends on linguistic context. As we saw, in standard phrase concatenation linguistic context is not taken into account at all. In text-to-speech systems, the linguistic context of a word or phrase must be obtained through linguistic analysis of the input text. Such an analysis usually yields unreliable and incomplete results, leading to speech output with a rather low prosodic quality. However, in data-to-speech the text which is to be made audible has been generated by the system itself, so information about linguistic context is present in the language generation component.
In order to exploit this information, different solutions are possible. One solution is a monolithic architecture, where language and speech generation are closely integrated. This design may be efficient, but it has the disadvantage that language and speech generation are usually so intertwined that they cannot be used as separate modules. An alternative solution, proposed by Pan and McKeown (1997), is an architecture in which language and speech generation are independent modules which are interfaced by a general prosodic component. Again, our approach is somewhere in between. In D2S, language and speech generation form separate, re-usable modules, which are connected through a prosodic component. However, the prosodic component is not an independent module in the system, but is embedded in the language generation module, with which it shares a mutual knowledge source containing information about the context. The general architecture of D2S is sketched in Figure 2.
Figure 2
Global architecture of D2S. (Data -> Language Generation Module, with embedded Prosody Module -> Enriched Text -> Speech Generation Module -> Speech Signal)
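In code, the interface of Figure 2 can be pictured as follows. This is only a schematic sketch with invented function names; the enriched-text notation anticipates the markers introduced in section 3 (a double quote for an accent, slashes for phrase boundaries).

```python
def language_generation_module(data: dict) -> str:
    """Generate prosodically annotated ('enriched') text from input data.
    Toy output; the real LGM is described in section 3.1."""
    return 'De "wedstrijd tussen "PSV en "Ajax / eindigde in "een // - "drie ///'

def speech_generation_module(enriched_text: str) -> bytes:
    """Turn enriched text into a speech signal (stub returning silence).
    Only the accent and boundary markers matter, not where they came from."""
    return b""

speech = speech_generation_module(language_generation_module({"team 1": "PSV"}))
```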
The Language Generation Module of D2S (LGM) takes data as input and produces enriched text, i.e., text which is annotated with prosodic markers. This annotation is performed by the prosodic component of the LGM. The enriched text is input to the Speech Generation Module (SGM), which turns it into a speech signal. Although the prosodic component cannot be used independently of the LGM, the LGM may be used without the prosodic component, in which case it returns plain text. The speech generation techniques used in the SGM require prosodically annotated text as input. However, as long as accents and phrase boundaries are marked, the SGM does not care
where the markers came from or in what form they occur. This means that the SGM can be used independently of the LGM, and that other kinds of language generation techniques can be used instead, as long as they provide the required markers.
We finish with a brief overview of the kind of linguistic information that is required for the computation of prosody, in particular accents and phrase boundaries. First of all, information about the preceding discourse should be available. This is important for the assignment of pitch accents. As was observed by Halliday (1967), Chafe (1976), Brown (1983) and others, phrases expressing information that is new with respect to the discourse are normally accented, while phrases expressing given information (i.e., information that is already available from the preceding discourse) are not. As was noted by Chafe (1976), a phrase may be regarded as given if (a) it refers to an object which was introduced earlier in the discourse, or (b) it expresses a concept which has been evoked earlier in the discourse. The small discourse in (1) shows an example of both cases.7

(1)
Last week, Clinton visited Amsterdam. The president loves Holland.
In the second sentence of (1), the phrase the president is not accented because it refers to Clinton, who has already been referred to in the preceding sentence. Holland is deaccented because the concept has already been introduced in the discourse by the NP Amsterdam.
Note that the effect of givenness can be overridden by contrast. The following variant of (1) shows that if Clinton is contrasted with another person, no deaccentuation of the president takes place:

(2) Last week, Clinton visited Amsterdam. His wife doesn't like Holland, but the president really loves it.

We see that accent assignment depends on discourse factors; the system should know which objects and concepts have been mentioned in the preceding discourse, and whether they are contrasted with something else in the discourse. However, it is not sufficient to know that some phrase expresses new or contrastive information. Syntactic knowledge is also required in order to determine the final distribution of accents within the phrase. For example, in (1) and (2) the entire VP visited Amsterdam expresses new information, but the accent lands only on Amsterdam and not on visited.
Syntactic information is also needed for the computation of phrase boundaries. Experiments by Sanderman (1996) show that the presence of phrase boundaries reflects the syntactic structure of an utterance, and may therefore help to disambiguate structurally ambiguous sentences like (3)a. In this example, placement of a phrase boundary (indicated by a slash) after the word policeman indicates that the PP with a gun modifies the VP (as in (3)b), while the absence of a phrase boundary indicates that the PP modifies the NP (as in (3)c). See Pierrehumbert and Hirschberg (1990) for similar examples.

(3) a. John killed a policeman with a gun
    b. John killed [NP a policeman] / with a gun
    c. John killed [NP a policeman with a gun]

The placement of phrase boundaries is also related to accentuation (Gussenhoven (1988), Dirksen and Quené (1993), Marsi et al. (1997)). Although there are different theories on the exact nature of this relation, they agree that, in general, only accented constituents may be separated by a phrase boundary.

7 In this and the following examples, accented words are printed in italics.
So, no phrase boundary occurs in (4), which has the same syntactic structure as (3)b, but where the NP the policeman is deaccented due to givenness.

(4) Q: How did John kill the policeman?
    A: John killed [NP the policeman] with a gun.

Although not very detailed, this overview shows that for the computation of the prosodic properties of a sentence, information is required about the preceding discourse and about the syntactic structure of the sentence. In the next section, we will see in more detail how the language generation technique used in D2S makes such information available and how it is used for the computation of accents and phrase boundaries.
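As a toy illustration of the givenness rule, the following sketch deaccents words that are already in a simple discourse model. All names are invented, and the real D2S rules operate on syntactic trees and handle concept-level givenness (e.g., Amsterdam evoking Holland) and contrast, which this word-based toy cannot.

```python
UNACCENTABLE = frozenset({"the", "a", "in", "of", "last", "week"})

def assign_accents(words: list[str], discourse_model: set) -> str:
    """Mark new content words with the '"' accent marker; deaccent given ones."""
    out = []
    for word in words:
        w = word.lower()
        if w not in UNACCENTABLE and w not in discourse_model:
            out.append('"' + word)        # new information: accented
        else:
            out.append(word)              # given or function word: deaccented
        discourse_model.add(w)            # everything mentioned becomes given
    return " ".join(out)

dm = set()
print(assign_accents("Clinton visited Amsterdam".split(), dm))
# "Clinton "visited "Amsterdam
print(assign_accents("The president loves Holland".split(), dm))
# The "president "loves "Holland  -- the word-identity model misses that
# 'president' is given via Clinton and 'Holland' via Amsterdam
```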
3. Description of D2S

In this section we use examples from a simple data-to-speech system, GoalGetter, to illustrate D2S and the techniques it is based on. Figure 3 shows an example input of the GoalGetter system, derived from a Teletext page (see Figure 1). Because the Teletext pages have a fixed format, it is always clear which team is the home team (team 1) and which team is the visitor (team 2). One possible output text, expressing the data of Figure 3, is given below in Figure 4, together with its translation. In the enriched text, accented words are preceded by double quotes (") and phrase boundaries of different strengths are indicated by a number of slashes (/, // or ///). The other symbols in the text indicate specific pronunciations for certain words, which we will not discuss in this paper.

team 1:
goals 1:
team 2:
goals 2:
goal 2:
goal 2:
goal 2:
goal 1:
referee:
spectators:
yellow 1:

Figure 3
Example input data for the LGM, derived from a Teletext page
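The values of the Figure 3 record can be read off the example output in Figure 4; the following reconstruction is therefore only illustrative, and its notation (in particular the minute of the final goal) is an assumption rather than the actual Teletext-derived format.

```python
# Hypothetical reconstruction of the Figure 3 record; values inferred from
# the example output in Figure 4, notation invented.
match_data = {
    "team 1": "PSV",
    "goals 1": 1,
    "team 2": "Ajax",
    "goals 2": 3,
    # goal-scoring events in chronological order: (scorer, minute)
    "goal 2": [("Kluivert", 5), ("Kluivert", 18), ("Blind", 83)],
    "goal 1": [("Nilis", 90)],    # 'just before the end signal'; minute assumed
    "referee": "Van Dijk",
    "spectators": 25000,
    "yellow 1": ["Valckx"],
}
```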
In section 3.1 we will describe the architecture of the LGM and illustrate it using an example from Figure 4. In section 3.2 we explain how the prosodic markers are assigned, and in section 3.3 we discuss how they are used by the SGM.
3.1 Language generation
The general architecture of the LGM is depicted in Figure 5. The module Generation takes data from outside the system as input; in GoalGetter these are data concerning the results of a particular football match (see Figure 3). It also uses domain data, i.e., a collection of relatively fixed background data on the relevant domain. In GoalGetter these are data about the football teams and their players, such as the hometown of each team and the position of each player. These background data serve as a supplement to the system's input data from Teletext, and are used to achieve more variation in the generated texts. For instance, knowledge about the positions of the players provides the system with more possibilities for the generation of referring expressions than if only the Teletext data were available.
De "wedstrijd tussen "PSV en "Ajax / eindigde in "@een // - "@drie /// "Vijfentwintig duizend "toeschouwers / bezochten het "Philipsstadion /// "Ajax nam in de "vijfde "minuut de "leiding / door een "treer van "Kluivert /// "Dertien minuten "later / liet de aanvaller zijn "tweede doelpunt aantekenen /// De % "verdediger "Blind / verzilverde in de "drieentachtigste minuut een "strafschop voor Ajax /// \Vlak voor het "eindsignaal / bepaalde "Nilis van "PSV de "eindstand / op "@een // - "@drie /// % "Scheidsrechter van "Dijk / "leidde het duel /// "Valckx van "PSV kreeg een "gele "kaart ///
The match between PSV and Ajax ended in 1-3. Twenty-five thousand spectators visited the Philips stadium. In the fifth minute, Ajax took the lead through a goal by Kluivert. Thirteen minutes later the forward had his second goal noted. The defender Blind kicked a penalty home for Ajax in the 83rd minute. Just before the end signal, Nilis of PSV brought the final score to 1-3. Referee Van Dijk led the match. Valckx of PSV received a yellow card.
Figure 4
Example output of the LGM
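The enriched-text notation in Figure 4 is simple enough to parse mechanically. The sketch below (with an invented function name) recovers accents and boundary strengths, ignoring the pronunciation symbols such as @, % and the backslash, which the text does not discuss further.

```python
import re

def parse_enriched(text: str):
    """Yield (word, accented) pairs and ('BOUNDARY', strength) for /, //, ///."""
    tokens = []
    for tok in text.split():
        if re.fullmatch(r"/{1,3}", tok):
            tokens.append(("BOUNDARY", len(tok)))     # boundary strength 1-3
        else:
            word = tok.lstrip('"@%\\')                # strip accent/pron. symbols
            tokens.append((word, tok.startswith('"')))
    return tokens

print(parse_enriched('De "wedstrijd tussen "PSV en "Ajax / eindigde in "@een'))
# [('De', False), ('wedstrijd', True), ('tussen', False), ('PSV', True),
#  ('en', False), ('Ajax', True), ('BOUNDARY', 1), ('eindigde', False),
#  ('in', False), ('een', True)]
```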
Generation additionally uses a collection of syntactic templates, each expressing a part of the input data. Figure 6 shows one such template. The basis of each template is formed by a syntactic tree with variable slots. The syntactic trees are based on the grammatical analysis of Dutch presented in Model (1991). The slots in the syntactic trees are filled with other syntactic trees for the variables. This is done by means of a so-called express function. More details on this function will be given below. Each template is associated with a certain topic, i.e., a label which globally describes what the syntactic template is about, like `goalscoring' or `cards'. During generation, templates sharing the same topic are grouped together in one paragraph, thus ensuring the coherence of the generated text. Additionally, each syntactic template has conditions which determine when the template can be used properly. Most of these conditions are formulated as conditions on the Knowledge State, in which it is recorded which input data have been conveyed and which have not. A typical condition on a template is that the information expressed by the template should not yet have been conveyed. In addition to the Knowledge State, another record is kept during generation: the Context State. In it, various aspects of the linguistic context are recorded. A central part of the Context State is the Discourse Model, which keeps track of the discourse objects that have been mentioned. Finally, the Prosody component computes the prosodic features of each generated sentence.

Figure 5
The architecture of the Language Generation Module (LGM). (Data, Domain Data, the Templates, the Knowledge State and the Context State feed GENERATION; the PROSODY component yields the Enriched Text.)
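Schematically, a syntactic template bundles a tree with slots, a topic and a usability condition. The rendering below uses invented field names, and a flat string stands in for the full syntactic tree of the real templates.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class SyntacticTemplate:
    topic: str                          # e.g. 'goalscoring' or 'cards'
    tree: str                           # stand-in for a syntactic tree with slots
    slots: list[str]                    # variables filled by the express function
    condition: Callable[[dict], bool]   # test on the Knowledge State

# Example: usable only while the match result has not yet been conveyed.
result_template = SyntacticTemplate(
    topic="result",
    tree="De wedstrijd tussen <team1> en <team2> eindigde in <goals1> - <goals2>",
    slots=["team1", "team2", "goals1", "goals2"],
    condition=lambda ks: not ks.get("result_conveyed", False),
)
```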
Now we will describe the generation algorithm. As we already pointed out in section 2.1, the LGM does not have a separate text planner, so no `text structure' or `discourse plan' is formed in advance. The structure of the text is built up during generation, and depends fully on the conditions on the templates and the topics they are associated with. An advantage of this `local condition' approach is that it enables the system to achieve maximal variation in the generated texts (though, if required, explicit text planning can be achieved as well, e.g., by enforcing a predefined ordering of the topics). Sentence planning and linguistic realization are completely driven by the templates as well. At each level, whenever there are multiple possibilities which are equally suitable given the context, a random choice is made. Thus, the variation in the generated texts is increased even further. As a consequence, many different texts may be generated expressing the same input data.
Generation proceeds as follows. From the list of available topics, one topic is picked at random and designated as the current topic. The algorithm determines whether any of the syntactic templates belonging to the current topic can be expressed at this point in the text. This is done by checking whether the conditions associated with the templates evaluate to true. If no syntactic templates can be used (for instance because their conditions specify that a piece of information should have been conveyed, but this has not yet happened), another topic is chosen. If it is possible to express one or more templates from the current topic, one of those templates is picked and the express function is called to fill the variables in the tree with subtrees expressing the function's argument. In most cases, several different fillings of the slots are possible, resulting in different syntactic trees. Sentence-internal syntactic conditions determine which of these trees are well-formed, e.g., whether the Binding Theory (Chomsky, 1981) is not violated. The Generation module then determines for each of the well-formed syntactic structures whether it is an appropriate extension of the Discourse Model. Conditions on the Discourse Model mainly concern the use of referential and quantificational expressions. For example, a definite description may only be used if it specifies a unique discourse object. If more than one syntactic structure satisfies all of these conditions, one is selected arbitrarily. Its prosodic properties are computed and it is used in the text, after which Generation updates the Knowledge State and the Context State. Taking the updated Knowledge State into account, a new template is picked from the current topic if possible, and a new sentence is generated. This continues until there are no more suitable templates within the current topic. Then another topic is chosen as the current topic. The procedure is repeated until there are no more topics with templates that can be used. This loop is sketched below.
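The following is a condensed, runnable sketch of this loop, with invented names; the express function, the well-formedness filtering, the Discourse Model checks and the prosody computation are collapsed into stubs or comments, and an exhausted topic is simply dropped, whereas the real system may return to it later.

```python
import random

def generate(topics, templates, knowledge_state, context_state):
    sentences = []
    open_topics = list(topics)
    while open_topics:
        topic = random.choice(open_topics)           # pick a current topic
        usable = [t for t in templates
                  if t["topic"] == topic and t["condition"](knowledge_state)]
        if not usable:
            open_topics.remove(topic)                # no template applies: next topic
            continue
        template = random.choice(usable)             # random choice -> variation
        # The real LGM calls the express function, filters ill-formed trees,
        # checks the Discourse Model and computes prosody at this point.
        sentences.append(template["express"](knowledge_state))
        template["update"](knowledge_state)          # record what was conveyed
    return sentences

ks = {"result_conveyed": False}
templates = [{
    "topic": "result",
    "condition": lambda ks: not ks["result_conveyed"],
    "express": lambda ks: "De wedstrijd tussen PSV en Ajax eindigde in 1 - 3.",
    "update": lambda ks: ks.update(result_conveyed=True),
}]
print(generate(["result"], templates, ks, {}))
```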
We will now illustrate in more detail how the LGM functions, using a concrete example from GoalGetter with Figure 3 as the input. We will show how the system might generate the fourth sentence of the example text in Figure 4. Suppose the generation system has already generated the first two sentences of the text, i.e., De wedstrijd tussen PSV en Ajax eindigde in een - drie. Vijfentwintig duizend toeschouwers bezochten het Philipsstadion, `The match between PSV and Ajax ended in 1-3. Twenty-five thousand spectators visited the Philips stadium.'
The information concerning the teams (Ajax, PSV), the number of spectators and the result of the match is derived directly from the information contained in the Teletext page, but the information that the match took place in the Philips stadium is derived from the domain-specific data on the teams and their properties, e.g., that PSV is based in the Philips stadium. As was said earlier, these data are part of the fixed domain database of the system.
After the first two sentences have been generated, there are no more appropriate templates for the first, introductory topic, and the system selects a new topic as the
current topic, so that a new paragraph is created. In the example, the topic `goalscoring' is selected. After having selected this topic, the system consults the input data structure, which contains a sequence of goal-scoring events in the order of their occurrence, and a set of syntactic templates with the appropriate topic. One way of presenting the goal-scoring events is in chronological order, and this type of presentation will be adopted in this example. In order to guarantee that the information about the goals is presented in the proper order, each syntactic template that expresses a goal-scoring event has a condition that it can only express the first goal-scoring event that has not yet been conveyed, which can be checked in the Knowledge State. We will assume that the third sentence of Figure 4 has been generated from a template containing the expression de leiding nemen, `to take the lead', and that the Discourse Model is extended with entities corresponding to the phrases Ajax, Kluivert and in de vijfde minuut, `in the fifth minute'. In addition, the Knowledge State is updated to reflect the fact that information about the first goal has been conveyed. The system goes on and attempts to convey the second goal-scoring event. It cannot use the same template as the one used for the third sentence, since the second goal-scoring event of the match does not make one team take the lead. But there are other syntactic templates that express goal-scoring events as well, e.g., the one in Figure 6.
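The chronological-ordering condition just described can be pictured as follows; the Knowledge State layout and the function names are invented for illustration.

```python
def first_unconveyed_goal(knowledge_state):
    """Return the earliest goal-scoring event not yet conveyed, or None."""
    for goal in knowledge_state["goals"]:            # stored in match order
        if goal not in knowledge_state["conveyed"]:
            return goal
    return None

def goal_condition(knowledge_state, goal):
    """A goal template may only express the next unconveyed goal event."""
    return first_unconveyed_goal(knowledge_state) == goal

ks = {"goals": [("Kluivert", 5), ("Kluivert", 18), ("Blind", 83), ("Nilis", 90)],
      "conveyed": {("Kluivert", 5)}}
print(goal_condition(ks, ("Kluivert", 18)))   # True: next event in order
print(goal_condition(ks, ("Blind", 83)))      # False: would skip an event
```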