In: Fankhauser, P. & Ockenfeld, M. (Eds.), Integrated Publication and Information Systems. 10 Years of Research and Development at IPSI. Sankt Augustin: GMD – Forschungszentrum Informationstechnik, 1998, pp. 149-168.

Speech Generation in a Multimodal Interface for Information Retrieval: The SPEAK! System

John A. Bateman, Elke Teich, and Adelheit Stein

Abstract: This article presents the major goals, methods and results of the SPEAK! project. The aim of SPEAK! was to construct a proof-of-concept prototype of a multimodal information system for information retrieval that combines graphical and spoken language output. Supporting goals were to advance the state of the art in the domains of speech synthesis, spoken language generation, and graphical interface design in order to provide enabling technology for higher functionality information systems that are appropriate for general public use. The concrete outcome of the project is a prototype information system in which meta-communication concerning the information retrieval interaction is generated from a deep representation of information and of speaker and hearer goals. To this end, state-of-the-art human-machine dialogue modeling methods have been combined with speech processing techniques and a functionally inspired approach to natural language generation. Particular emphasis was placed on achieving adequate intonation of the synthesized speech output.

1 Introduction

The purpose of this article is to present the major goals, methods and results of the SPEAK! project.[1] The objectives of SPEAK! were two-fold: first, to make a significant contribution to the state of the art in human-machine multimodal interfaces by applying text generation techniques extended to support spoken language generation; and second, to increase acceptance of the underlying natural language processing technologies – particularly speech synthesis technology and text generation technology. Particular emphasis was placed in the project on the development of techniques that demonstrate the practical feasibility and desirability of including spoken language output in multimodal user interfaces.

The first objective was achieved with respect both to the constitutive components necessary for an intelligent multimodal interface and to their integration in a complete system. This complete system, the final product of the project, was a proof-of-concept multimodal interface incorporating spoken language output with intonation appropriate to the context of interaction between system and system-user. The system provides a sophisticated functionality that distinguishes it from other such interfaces currently under research or development. The interface is able to generate automatically, from a deep representation of information and of speaker and hearer goals, meta-communication concerning the information retrieval process. The second objective was met by demonstration of the proof-of-concept prototype in operation and by a detailed evaluation of the complete system that clearly demonstrated the role and value of the functionalities supported.

Preliminary work in this area prior to commencing the SPEAK! project convinced us that the functionality targeted would be crucial for improving the information retrieval process and human-machine interfaces in general. The proof-of-concept interface developed within SPEAK! made possible concrete empirical evaluation of this hypothesis, which had not been feasible hitherto. The results of our evaluation were striking – not only for the choice of output modalities (e.g., spoken vs. written), but also for the degrees of "interactivity" and user-responsiveness desirable for human-machine interfaces. The evaluations also demonstrated that differing interaction patterns are desirable for novice, casual, and expert users, lending support in any case to fully flexible solutions of the kind that SPEAK! has pursued. SPEAK! was initially targeted at improving the computer interaction of novices with information retrieval and knowledge-based systems: for this kind of user our evaluations show that the proof-of-concept prototype was very effective and successful, even given some limits or problems with particular aspects of the system's performance. The construction of an evaluable prototype was therefore vital for convincingly grounding our claim that the techniques embodied in SPEAK! represent an important basis for future research and development, as well as offering a real basis for practical application.

The remainder of the paper is organized as follows. In Section 2 we give a brief overview of the state of the art in speech generation. Section 3 motivates the particular methodology adopted, which combines a formalization of communicative intentions as they are given in information-seeking human-machine dialogues, a functional-linguistically inspired automatic text generation approach, and a state-of-the-art speech synthesis technique. In Section 4 we present the requirements that the concrete information retrieval application places on the modeling of human-machine interaction. Section 5 describes the concrete technical realization of the approach taken, including its empirical evaluation. Section 6 concludes the paper with a brief summary of the major achievements of the SPEAK! project.

[1] The SPEAK! project was funded by the Commission of the European Union under the Copernicus program (CP10393) for cooperation with Eastern Europe. It ran from May 1994 until September 1996. The project combined expertise in natural language generation and dialogue modeling, provided by the Darmstadt University of Technology in close cooperation with GMD-IPSI, with expertise in speech synthesis, provided by the Technical University of Budapest.

2 State of the art in speech generation

It is generally acknowledged that, in spoken language, intonation is crucial for conveying the intended meaning of a stretch of discourse. In the more specific case of information-seeking human-machine dialogue with spoken language output, intonation is often the only means of distinguishing between different dialogue acts, making the selection of the appropriate intonation crucial to the success of the information-seeking process. To illustrate this point, imagine an information-seeking dialogue in which the user wants to know a specific flight connection. At some point in the interaction, the system produces a sentence such as "You want to travel from Frankfurt to Berlin on Monday." There are several interpretations of this utterance, one being that the system is presenting some piece of information to the user. However, the same sentence – spoken with a different intonation – could be part of a clarification dialogue, where the system wants to make sure that it has understood the user's request correctly. In this case, the user would be expected to react, i.e., either confirm or reject this statement. Note that a syntactic question would carry quite different presuppositions, which are inappropriate in a clarification dialogue. Only by means of intonation can the user interpret the system's contribution correctly and react accordingly. Even though current speech synthesizers can support sophisticated variation of intonation, no existing text-to-speech or concept-to-speech system provides the semantic or pragmatic guidance necessary for selecting intonations appropriately.

The major shortcoming is that traditional text-to-speech systems and earlier concept-to-speech systems (e.g., Dorffner et al., 1990) alike use a syntactic encoding of information for controlling prosodic features. Moreover, with text-to-speech systems, where the syntactic structure has to be reconstructed from the written text by means of a syntactic analysis, the resulting analysis is seldom unambiguous and never complete. Concept-to-speech systems avoid the latter problem by generating spoken output from a pre-linguistic conceptual structure. However, many concept-to-speech systems have used the conceptual representation only to avoid syntactic ambiguities: that is, the intonation synthesis is still based on a syntactic representation.

Intonation is evidently more than the mere reflection of the surface linguistic form (see Halliday, 1967a, Pheby, 1969, Selting, 1993): intonation is selected to express particular communicative goals and intentions. Effective control of intonation therefore requires synthesizing from meanings rather than word sequences. Achieving a computational instantiation of this understanding is, however, far from straightforward. One attempt is the SYNPHONICS system (Abb et al., 1996), where prosodic features are assumed to have a function independent of syntax: Abb et al. replace the common, if often implicit, idea of syntax-dependent prosody with the notion of the pragmatic function of prosodic features. This allows prosodic features to be controlled by factors other than syntax, e.g., by the information structure (focus-background or topic-comment). The aspects of the generation task addressed are, however, limited to a detailed account of a mapping between conceptual structure and surface form and, although also of central concern in certain text generation accounts (e.g., Iordanskaja et al., 1991), the concept-to-language mapping paradigm alone is not sufficient for language generation. It is essential not only to address the transitions between areas of conceptual representation in an unfolding text/discourse, but also to consider the form of the conceptual representations themselves. As a consequence, the function of intonation in a SYNPHONICS-like approach is still restricted to what is called the textual function of intonation, without considering aspects such as communicative goals and speaker's attitude, i.e., the interpersonal function of intonation (cf. Halliday, 1967a).

In most dialogue types of the kind realistically pursued between people and machines at present, there is more than an opportunistic sequence of transitions between utterances: the dialogue has an overall goal and accordingly exhibits larger-scale structures, just as is the case with written text. It is insufficient to look at how isolated sentences can be joined together; instead one has to look at the utterance in its context, as part of a larger interaction. Intonation is not only used to mark sentence-internal information structures; it is additionally employed in the management of the communicative demands of the interaction partners. We need to consider the function of intonation with respect to the whole conversational interaction, taking into account the discourse (dialogue) history (see also Fawcett et al., 1988, Selting, 1993). In most speech production systems, therefore, the root of the problem of making intonation assignments is insufficient knowledge about the sources of constraints on intonation.
If meaning as a constraining factor is considered at all, it is propositional meaning and possibly some aspects of textual meaning (e.g., discourse structure) that are acknowledged as influencing intonational selection. It remains difficult, however, to derive such meanings from a string (in the text-to-speech analysis view), and the meanings included remain in any case functionally under-diversified (in the concept-to-speech view). What is left unrepresented – and what is the major constraint on intonation selection – is the interpersonal aspect: speech function (or speech act), speakers' attitudes, hearers' expectations, and speaker–hearer role relations in the discourse. A sufficiently diversified view of the meanings that intonation expresses is lacking.

Natural language (NL) generation is one area where there has been substantial research to uncover the interplay between the various language functions (i.e., textual, interpersonal, ideational), contextual parameters, and discourse history. From the perspective of NL generation, a differentiated view of meaning is usually the starting point. This acts as the major constraint on lexicogrammatical expression and allows selection from the very wide range of "paraphrases" that any reasonably developed sentence generator offers for any particular propositional content. Full-fledged generation systems also have knowledge about the type of discourse or genre that is to be generated, and about the communicative goals of a piece of discourse to be produced (the global context or intentional structure; cf., e.g., Moore & Paris, 1993). Without this knowledge, grammatical and lexical expression are underconstrained, and a generation grammar will fail to produce appropriate output. Additionally, the local context or attentional structure of utterances (cf., e.g., McKeown, 1985) must be taken into account, so that the development of the discourse can be used as a constraint on subsequent choices in the generation process. All of these factors taken together – genre, global and local context, sufficiently fine-grained semantic and lexicogrammatical resources, and knowledge about the constraints between these – provide an NL generator with the grounds to make appropriate choices in generation. In the same way that written text generation would be under-constrained if these factors were not taken into account, in speech production the speech-specific selections (intonation, prosodic focus, etc.) remain under-constrained if they are not explicitly related to their meanings in global and local context. The SPEAK! framework meets this challenge and provides an illustrative implementation in which NL generation and speech are combined in the way full-fledged NL generation systems suggest.

3 Integrated SPEAK! framework

A major reason for the limited performance, and accordingly the restricted marketability, of text-to-speech systems is that they fail to control intonation appropriately in the context of individual utterances. High-quality speech synthesis that is to be accepted by human hearers demands appropriate intonation patterns. As suggested above, the effective control of intonation requires synthesizing from meanings, rather than word sequences, and requires an understanding of the functions of intonation. This is a simple consequence of the fact that intonation is meaningful, i.e., that intonation is selected to express particular communicative goals and statuses and is not a simple reflection of the surface linguistic form (cf., e.g., Halliday, 1967b, Bolinger, 1972, Brown, 1983, Terken, 1984, Prevost & Steedman, 1994). In order to control intonation appropriately for the context, therefore, a system needs access to the meanings that language is trying to express.

In the traditional domain of application of text-to-speech systems, this problem is still insoluble: there is no natural language understanding technology that can be expected, within the foreseeable future, to yield systems that analyze written texts sufficiently deeply for intonation control. However, in the domain of sophisticated human-machine interfaces, we can make use of the increasing tendency to design such interfaces as independent agents that themselves engage in an interactive dialogue (both graphical and linguistic) with their users.

Such agents need to maintain models of their discourses, their users, and their communicative goals, and so already have significant components of the information that research is showing to be necessary for controlling intonation. Combining speech synthesis devices with more functional information concerning communicative goals is likely to improve their marketability and applicability dramatically; until such a combination is achieved, acceptance of synthesizers will remain marginal.

This line of development is equally important to the design and construction of text generation systems. These are systems which rely on the specification of abstract communicative goals (externally provided, either by the researchers themselves or by applications such as expert systems, machine translation components, document generators, etc.), in response to which corresponding surface linguistic forms are generated. Although the design and construction of such text generation systems forms an active area of current research (see, e.g., Reiter et al., 1995, Adorni & Zock, 1996, Bateman, 1998), the breadth of potential practical applicability is here also limited by the restriction to written texts. As we have argued in more detail in Grote et al. (1997) and Teich et al. (1997), approaches to dialogue generation so far have been quite restricted in their uptake of broader NLG techniques. It is clearly necessary to include spoken texts among the capabilities of the more powerful generation systems if they are to become more widely accepted in practical domains. Thus, while text generation systems can in theory provide a sophisticated multimodal human-machine interface with much of the necessary linguistic competence for engaging, where appropriate, in linguistic interaction with the user, they are not at present able to provide the spoken output that would enable their breakthrough to a wider field of real-world application.

Finally, current multimodal interfaces to information systems are now so complex that purely graphical output is insufficient. An interface must be able to produce natural language dialogue if it is to support optimally effective systems; otherwise the complex graphics and the possibilities open to the user for interaction become unintelligible. The practical limitation of text generation to written texts also restricts the general acceptance of text generation technology and is, in any case, inappropriate in a highly visual environment such as that produced in multimodal interfaces. Speech synthesis devices, for their part, are now sufficiently stable and phonetically of high enough quality to support sophisticated variation of intonation, which has been shown to be crucial for creating synthesized speech that human hearers find acceptable, but they lack the kind of semantic/pragmatic guidance necessary for selecting intonations appropriately. Each of these technical bottlenecks restricts the real-world applicability and competitiveness of devices involving the component technologies, either singly or combined. In summary, each of the areas of knowledge involved in the SPEAK! project – graphical interface design, text generation, and speech synthesis – has reached the point where a close synthesis of technologies is possible, necessary, and mutually beneficial.

To succeed in constructing a prototype system conforming to the above requirements, the SPEAK! project assembled at its outset a number of state-of-the-art components that together supported the intended functionality of the prototype.
For conversational control of the information retrieval process it was necessary to have a dialogue manager for interpreting requests for information and for initiating output, both graphical and in natural language. It was also necessary to employ a graphical presentation system, an information retrieval engine, and a knowledge base organized so as to support interfacing with the other system components.

Finally, it was necessary to have an NL generation component that accepts communicative goals from the dialogue manager in order to construct appropriate spoken language output, using information from the knowledge base as required. The effort of the project was then focused directly on an increment in the capabilities of these systems which substantially improved their relevance and usability for advanced information system user interfaces.

The detailed system architecture of SPEAK! is shown in Figure 1. When the mode chosen for meta-communication is spoken language, the text generation system (KOMET-PENMAN) receives input from the dialogue module and from other information sources as available (e.g., a confidence measure from a speech recognition unit if spoken input were to be added). Together, the information from these input sources controls the traversal of a large systemic grammar for German. This grammar can generate two types of output: (i) written text, which can be presented in, for instance, a dialogue box in a graphical user interface, and (ii) text that is marked up with intonational features. The latter is then passed on to the MULTIVOX-SPEAK text-to-speech system (see Olaszy & Németh, 1997) and presented acoustically to the user. In the construction of the complete prototype we have, therefore, drawn both on a broader ongoing line of development in information system technology – particularly in sophisticated models of mixed-initiative information-seeking dialogue – and on previous work involving functional grammar components that cover both written and spoken language generation (cf. Halliday, 1967a, Pheby, 1969, Fawcett, 1990).

[Figure 1 shows the main components and data flow: the User Interface passes meta-communication to the Dialogue Manager (COR model, dialogue history); the Text Generator (KOMET-PENMAN) produces either plain text for the interface or marked-up text for the Speech Synthesizer (MULTIVOX-SPEAK); the Retrieval Engine (INQUERY) performs the database access (DB) and returns retrieved results, which are presented as hypertext.]
Figure 1: System architecture of SPEAK!

The SPEAK! prototype interfaces with the probabilistic full-text retrieval system INQUERY (Callan et al., 1992), providing access to a textual database (a subset of Macmillan's "Dictionary of Art" in electronic form) which consists of about 15,000 biographies of artists and reference articles. A graphical querying interface allows users to successively construct complex queries by direct graphical manipulation, combining search terms with Boolean operators (see also Golovchinsky & Chignell, 1993). The integrated interface thus combines graphical user input with spoken or written meta-communication, in order to particularly support first-time and casual users of the system and novice users of IR systems.
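To make the division of labour in Figure 1 concrete, the following sketch shows how a generated meta-communication message might be routed to the written or the spoken output channel. The class, function, and markup string are illustrative assumptions, not the actual SPEAK! code.

```python
# A minimal sketch, with hypothetical names, of the control flow behind Figure 1:
# the dialogue manager requests a piece of meta-communication, the generator
# supplies it in written and in intonationally marked-up form, and the message is
# routed either to the graphical interface or to the speech synthesizer.
from dataclasses import dataclass
from typing import Callable

@dataclass
class MetaMessage:
    plain_text: str       # written form, e.g. shown in a dialogue box
    marked_up_text: str   # form with tone-group/tonic markup for the synthesizer

def route(message: MetaMessage,
          mode: str,
          show: Callable[[str], None],
          speak: Callable[[str], None]) -> None:
    """Send a generated meta-communication message to the chosen output channel."""
    if mode == "spoken":
        speak(message.marked_up_text)   # in the prototype: MULTIVOX-SPEAK
    else:
        show(message.plain_text)        # written active help

# Example, using the short confirmation mentioned in Section 4 (markup assumed):
msg = MetaMessage("Der neue Artikel ist da.",
                  "//1a der neue artikel ist DA//")
route(msg, "spoken", print, print)
```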


4 SPEAK! application area

The concrete application area within the SPEAK! project concerned the use of spoken language for meta-communication between user and information system. When using such systems, users are typically involved in a process of information seeking or browsing. The possibilities for interaction supported are extensive and, at any given point in an actual situation of interaction, it is often not transparent to a user exactly what possibilities are available and why. To improve the effective use of such systems, SPEAK! provides meta-communication that complements the other modes in which information is presented. This meta-communication provides information concerning the state and development of the information retrieval or browsing process itself. The utility of modeling information system use, particularly information retrieval (IR), as interaction or dialogue has been argued extensively by several researchers in the IR field (cf., e.g., Belkin et al., 1995, and Stein et al., 1998, in this volume, for more detailed discussions).

Figure 2: A SPEAK! screen (spoken active help version)

A screenshot of the SPEAK! interface during use is presented in Figure 2.

In the particular situation shown in the figure, the user is inspecting a full-text biography which was retrieved in response to a query combining three search terms (the respective terms are highlighted in the article window, with AND connectives shown as connecting lines). Before this situation, the user had entered a number of simple-term and complex queries, as shown in the history window ("Anfrage-Geschichte"); after each such query and result presentation, successive queries could be entered in the article window by adding search terms and new connectives in order to refine the previous queries. In response to each partial query, the INQUERY retrieval system returns a ranked list of retrieved items (in the window "Ergebnisse") based on the frequency of the search terms occurring in the full-text documents in the database. Considerable complexity arises in the retrieval system because the user can at any stage solicit further information by replacing or modifying the query, save search results, quit the interface, or ask about the possible actions that the system supports. Without further support of the kind envisioned in SPEAK!, it might be expected that users would easily lose track of where they are in their interactions; this was later supported by the results of our whole-system evaluations.

In order to assist first-time users of SPEAK! and, in particular, novice users of IR systems, the system generates meta-dialogic help[2] at each interaction step, taking the dialogue history into account. In the target interface version, all user actions and system responses are accompanied by spoken help messages. The system draws the user's attention to the interaction options available in each particular dialogue situation but comments on the interaction only as far as needed in a given situation. The help messages, quite detailed at the beginning, get shorter as the retrieval dialogue develops; after several query-result cycles the output is reduced to short sentences, e.g., "Der neue Artikel ist da" (Here's the new article). If users find this insufficient, they may request a more detailed paraphrase or a repetition of the spoken help message. To allow for a comparative evaluation, two additional interface versions were implemented: one of them generates the same messages in the form of written texts, and the other provided no active help but only standard static help texts. These kinds of functionality, and the general framework supporting them, clearly set the SPEAK! system apart from other such interfaces under research or development at this time.

[2] For simplicity, we use the term meta-dialogue/communication for all utterances that address the dialogue itself (e.g., the current state) or the handling of interface objects (see Stein et al., 1999, for a sophisticated logic-based approach to the use of meta-dialogues in conversational IR systems).
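The adaptive shortening of the help messages described above can be pictured roughly as follows. The sketch is a simplified illustration under assumed message wordings and thresholds; it is not the mechanism implemented in SPEAK!, which takes the full dialogue history into account.

```python
# Hypothetical sketch of dialogue-history-sensitive help verbosity: detailed
# instructions early in the session, short confirmations once a situation has
# recurred, and an explicit request for a detailed paraphrase at any time.
def help_message(event: str, history: list[str], detailed: bool = False) -> str:
    """Choose a help text for the current event based on how often it has occurred."""
    messages = {
        "new_article": {
            # long wording assumed for illustration; the short form is from the text
            "long": "Der gefundene Artikel wird jetzt im Artikelfenster angezeigt. "
                    "Sie können die Anfrage dort durch weitere Suchbegriffe verfeinern.",
            "short": "Der neue Artikel ist da.",
        },
    }
    variants = messages.get(event, {"long": "", "short": ""})
    seen_before = history.count(event)
    if detailed or seen_before < 2:
        return variants["long"]
    return variants["short"]

history: list[str] = []
for _ in range(3):
    print(help_message("new_article", history))
    history.append("new_article")
# The third message is reduced to the short confirmation "Der neue Artikel ist da."
```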

5 Technical approach and major results of SPEAK!

5.1 Natural Language (NL) generation and intonational resources

The KOMET-PENMAN generator adopted in the baseline SPEAK! architecture is a multilingual generation system, including a continually growing number of general language grammars, such as components for English, German, and Dutch. KOMET-PENMAN is based on systemic-functional linguistics (SFL), in which the factors we noted above (Section 3) as crucial for NL generation – and now also for speech production – are given recognition by a multidimensional organization of linguistic resources. The main organizing principles are stratification (genre, register or context, semantics, and lexicogrammar) and functional diversification (the ideational, textual, and interpersonal metafunctions). In addition, SFL is inherently use-oriented, i.e., geared to considering language in context (situational and cultural contexts of application), rather than in isolation.


The kinds of knowledge needed for generation can be readily allocated within any generation architecture derived from such a linguistic model. Intentional structure or global context is embodied in the representation of genre and register (cf. Halliday & Hasan, 1985, Martin, 1992). The linguistic resources proper are constituted by the strata of semantics and lexicogrammar. Furthermore, each stratum is functionally diversified in that ideational, textual, and interpersonal kinds of linguistic information are distinguished. All these linguistically relevant resources are uniformly represented as systemic networks (cf. Bateman, 1992), which are representationally similar to the type lattices used in many current computational linguistic models (cf. Henschel, 1994). A systemic network consists of systems (i.e., named disjunctions), which represent the communicative-functionally motivated points of alternation made available by a language; the individual alternations are termed features.

The input to lexicogrammatical processing is given in the form of a "Sentence Plan Language" expression (SPL, cf. Kasper, 1989). This is a logical form enhanced by textual and interpersonal information. Input SPL expressions are typically constructed by a text planning strategy that uses knowledge about the global context, i.e., about the genre and its associated text structure, to constrain choices on the semantic level of the system, and ultimately chunks the information that is to be textualized into sentence-size units. Details of this process are given in Teich & Bateman (1994) and Bateman & Teich (1995).

The task of the SPEAK! project was then to extend these linguistic descriptions to make provision for intonational patterns for spoken output. The most straightforward way of building in the kind of intonational control required, and of providing a mapping to dialogue goals, was to extend our available grammars so that options in intonation are covered. This strategy was first proposed by Halliday (1967a), who details around 40 points of connection between the clause grammar of English and choices that require realization in intonation. Since then, such an approach has been taken further by, for example, Fawcett et al. (1988) and Matthiessen (1995). The "realizations" of such added discriminations are then constraints on the specification of an appropriate intonation contour rather than constraints on structural form. Modeling intonation in the KOMET grammar of German therefore involved the introduction of more delicate systems in those areas of the lexicogrammar where intonational distinctions are also possible, thus specifying the relation between intonation features and the other linguistic resources (lexis and syntax). Here, we restrict ourselves to the description of the system networks reflecting the choices in tone. The networks are primarily based on the descriptive work of Pheby (1969), who first applied Halliday's results to German (in a non-computational context). Three distinct kinds of intonation specification needed to be combined for any clause generated: tonality (the division of a text into a certain number of tone groups), tonicity (the placing of the tonic element within the tone group), and tone (the choice of a tone for each tone group).
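As a concrete illustration of these two notions – systemic networks as named disjunctions of features carrying realization constraints, and SPL-style inputs that add textual and interpersonal information to a logical form – the following sketch shows one possible toy encoding. The data structures, the dictionary rendering of SPL (which is really a Lisp-style notation), and the use of the tone labels 1a/1b introduced below are all simplifying assumptions for illustration.

```python
# Toy encoding (illustrative only, not the KOMET-PENMAN resources) of:
# (1) a systemic network fragment: systems as named disjunctions of features,
#     some features carrying a realization constraint (here: a tone);
# (2) an SPL-like input: propositional content plus textual and interpersonal
#     information, which constrains the traversal of the grammar.
from dataclasses import dataclass

@dataclass
class System:
    name: str
    entry_condition: str        # feature that must already have been chosen
    features: dict[str, str]    # feature -> realization constraint ("" if none)

MOOD = System("MOOD", "clause",
              {"declarative": "", "interrogative": "", "imperative": ""})
KEY_DECLARATIVE = System("KEY-declarative", "declarative",
                         {"neutral": "tone 1a", "emphatic": "tone 1b"})

spl_input = {
    "process": "darstellen",            # ideational: 'present/display'
    "actee": "die Ergebnisse",          # 'the results'
    "location": "unten",                # 'below'
    "speech-act": "statement",          # interpersonal
    "key": "emphatic",                  # speaker wants emphasis
    "theme": "die Ergebnisse",          # textual
}

def tone_for(spl: dict) -> str:
    """Select a tone via the KEY system, driven by the interpersonal input."""
    return KEY_DECLARATIVE.features.get(spl.get("key", "neutral"), "tone 1a")

print(tone_for(spl_input))   # -> tone 1b
```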
Following Pheby (1969), we initially assumed five tones, the primary tones, plus a number of so-called secondary tones that are necessary for the description of German intonation contours. These tones are: fall (tone 1), rise (tone 2), progredient (tone 3), fall-rise (tone 4), and rise-fall (tone 5), where the first four can be further differentiated into secondary a and b tones. The criterion for the distinction of primary tones is the type of tone movement, for instance a rising or falling tone contour, whereas the degree of the movement, i.e., whether it is strong or weak in expression, is considered to be a variation within a given tone contour.

Our approach here is primarily empirical and pragmatic: we provide the distinctions necessary to support the concrete dialogues generated and the meaning distinctions that they need to express. The primary tones are the undifferentiated variants, whereas the secondary tones are interpreted as realizing additional meaning. They are interpreted as follows:

1a = neutral; 1b = emphatic
2a = neutral; 2b = negative
3a = weak contrast; 3b = strong contrast
4a = neutral; 4b = negative
5 = assertive/clarifying

Consider the following example. The computer has retrieved an answer to a query, and this answer is presented graphically to the user. As a default, the system would generate a neutral statement, choosing tone 1a to accompany the presentation, as in //1a die ergebnisse sind unten dargestellt// ("The results are given below"). If, however, the results had so far been presented at a different position on the screen, the system would generate tone 1b in order to place special emphasis on the statement: //1b die ergebnisse sind UNTEN dargestellt//.[3]

The interpersonal part of the grammar provides the speaker with resources for interacting with the listener, for exchanging information, goods and services, etc. (see Halliday, 1985, Martin, 1992). Within the lexicogrammatical stratum, the MOOD systems (declarative vs. interrogative vs. imperative) are the central resource for expressing these speech functions. More delicate speech-functional distinctions – specific to spoken language – are realized by means of tone. The (primary) tone selection in a tone group serves to realize a number of speech-functional distinctions. For instance, depending on the tone contour selected, the system output //sie wollen ein WEITERES dokument sehen// ("You want to see an additional document") can be interpreted either as a question (tone 2a) or as a statement (tone 1a). Equally important is the conditioning of the (secondary) tone by attitudinal options such as the speaker's attitude towards the proposition being expressed (surprise, reservation, ...), the answer that is expected, emphasis on the proposition, etc., referred to as KEY features. If one defines KEY as that part of the speech-functional distinctions that is expressed by means of tone rather than mood alone, one can integrate the MOOD and KEY systems into the grammar by positioning the KEY systems as dependent on the various MOOD options. Figure 3 gives the system networks of the extended KOMET grammar for interrogative sentence mood.

Finally, an add-on component was created for the text generator so that the generated output strings contained appropriate high-level markup for the extended MULTIVOX-SPEAK speech synthesis component. This then supported smooth interfacing between high-level semantic input and speech. It worked by passing the generated structure through a postprocessor that recognized the grammatical constituent coreferential with the focus constituent defined as the Tonic, and applied a presentation method to that constituent to render it in terms of the high-level markup defined for MULTIVOX-SPEAK.

[3] The notation used here is a standard systemic one: "//" marks tone group boundaries, and CAPITAL LETTERS mark the tonic element of a tone group. The number following the "//" at the beginning of a tone group indicates the type of tone contour. The MULTIVOX-SPEAK synthesizer was extended within the project in order to accept such mark-up specifications and to produce appropriate intonation contours for them. This extension is described in detail in Olaszy & Németh (1997).
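A rough sketch of such a postprocessing step is given below. The function and its output format simply follow the systemic notation of footnote [3]; they are an assumption for illustration, not the high-level markup actually defined for MULTIVOX-SPEAK.

```python
from typing import Optional

# Illustrative sketch (not the SPEAK! postprocessor): given the words of a generated
# tone group, the constituent chosen as Tonic, and the selected tone, emit the
# notation of footnote [3]: "//" as tone group boundary, the tone label after the
# opening boundary, and the tonic element in capital letters.
def render_tone_group(words: list[str], tone: str, tonic: Optional[str] = None) -> str:
    rendered = [w.upper() if w == tonic else w for w in words]
    return f"//{tone} {' '.join(rendered)}//"

clause = ["die", "ergebnisse", "sind", "unten", "dargestellt"]
print(render_tone_group(clause, "1a"))                 # neutral statement, unmarked tonic
print(render_tone_group(clause, "1b", tonic="unten"))  # emphatic, marked tonic
# //1a die ergebnisse sind unten dargestellt//
# //1b die ergebnisse sind UNTEN dargestellt//
```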


[Figure 3 shows the KEY system network for interrogative clauses. For YES/NO-type interrogatives the options include an unmarked info-seeking request (tone 2) and involved keys: preemptory (tone 1a) and clarifying (tone 4a). For WH-type interrogatives, wh-tonic options include neutral assessment (tone 2a), strong assessment (tone 2b), neutral (tone 4a), strong (tone 4b), and surprised (tone 2a); wh-nontonic options include neutral (tone 1a) and involved (tone 2a).]

Figure 3: KEY systems in interrogative clauses (simplified)
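Read as a lookup from interrogative type and KEY feature to tone, Figure 3 could be transcribed as in the following sketch; the table merely restates the tone labels in the figure and is not the grammar's actual representation.

```python
# Hypothetical transcription of Figure 3 as a lookup table from (interrogative
# type, KEY feature) to tone; illustrative only.
INTERROGATIVE_KEY_TONES = {
    ("yes/no", "unmarked-info-seek"):   "2",
    ("yes/no", "preemptory"):           "1a",
    ("yes/no", "clarifying"):           "4a",
    ("wh-tonic", "neutral-assessment"): "2a",
    ("wh-tonic", "strong-assessment"):  "2b",
    ("wh-tonic", "neutral"):            "4a",
    ("wh-tonic", "strong"):             "4b",
    ("wh-tonic", "surprised"):          "2a",
    ("wh-nontonic", "neutral"):         "1a",
    ("wh-nontonic", "involved"):        "2a",
}

def tone_for_interrogative(interrogative_type: str, key: str) -> str:
    """Return the tone associated with a mood/key combination in Figure 3."""
    return INTERROGATIVE_KEY_TONES[(interrogative_type, key)]

print(tone_for_interrogative("yes/no", "clarifying"))   # -> 4a
```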

5.2 Dialogue modeling and management

Our initial investigation and analyses of the dialogues to be supported by the SPEAK! prototype, and of their required intonational realizations, showed that all of the parameters necessary for managing dialogue (dialogue move type, dialogue history, speech function, mood, and key) are logically independent and that different combinations of them go together with different selections of intonational contour. This was modeled within the SPEAK! approach in terms of a two-level stratified model (see Figure 4); only such an organization provides the required flexibility in mapping between the different categories. An organization of this kind is also proposed in systemic-functional work on interaction and dialogue (Berry, 1981, Martin, 1992, Ventola, 1987), although that work has not been explicitly concerned with spoken language. The strata assumed in the systemic-functional model are context (extra-linguistic), semantics, and grammar (linguistic). As we saw in Section 5.1 above, the MOOD and KEY systems represent the grammatical realization of an exchange move (again restricting ourselves to the case in which a move is realized as a clause). The ultimate constraint on the selection of features in this interpersonal semantics and grammar is then the information located at the stratum of context, i.e., knowledge about the type of discourse or genre. In the present scenario, this contextual knowledge is provided by the dialogue model, reflecting the genre of information-seeking human-machine dialogue.

[Figure 4 depicts the two strata: the extra-linguistic context, comprising the COR dialogue model and the dialogue history, and the linguistic resources, comprising the semantics (speech function: question, statement, offer, command, negotiation) and the grammar (mood, key, ...).]

Figure 4: The stratified model

Since the stratum of context is extra-linguistic, locating the dialogue model – which was originally designed not as a model of linguistic dialogue but of information-seeking dialogue in general – at this stratum is a straightforward step. We also made good use here of a split proposed within SFL by Martin (1992) between different aspects of context: that between a more abstract level of discourse structuring (called genre) and a less abstract level of structuring (called register). The dialogue model (COR) as introduced here describes the theoretically possible dialogue state transitions, which represent different types of illocutionary moves/acts (Sitter & Stein, 1992). Which transitions are actually preferred is, however, constrained by particular dialogue types. This motivates a further description in terms of abstract scripts (representing 16 types of information-seeking strategies elaborated by Belkin et al., 1995), which can serve a role analogous to that of genre in restricting the options that are to be taken up within register (or, in the current instantiation, within the COR dialogue model). This stratified architecture is depicted graphically in Figure 4.

This abstract model allows us to specify very generally the kinds of resources, and the interactions among those resources, that are necessary for a flexible account of dialogue. Each of the components that help constrain the selection of context-appropriate intonation patterns is naturally placed within a complete view of a functioning linguistic system. The model is, however, often still too underconstrained for particular selections of intonational patterns. Just as the particular lexicogrammatical options that are taken up in a given text type (register) need to be ascertained empirically and do not follow from an abstract, context-neutral account of grammar, so the most likely intonational selections had to be ascertained empirically as well.

The dialogue model used in SPEAK! consists of two interrelated tiers. Scripts represent stereotypical interaction sequences and recommended problem-solving steps for particular tasks (such as searching a database by specification of attributes with the goal of selecting retrieved items versus, for instance, browsing in a hypertext with the goal of learning). Global steps in the first case may be query specification, inspection of results (if found) in overview and/or detail, giving explicit relevance judgements (if relevance feedback facilities are included in the system), consulting a thesaurus for selecting new search terms, etc.

This global organization of dialogue does not include representations of all possible moves and substructures, i.e., the local organization of the discourse, including embedded clarification dialogues and other "deviations" from the current script. The local organization is modeled in terms of the "COnversational Roles" (COR) dialogue model. COR has been used in a range of studies of conversational multimodal interaction. It is specified in terms of a set of related state transition networks for the dialogue and its elements, i.e., the dialogue moves. Transitions in the dialogue net represent possible moves between dialogue states, whereas transitions in the move nets can be (i) simple (atomic) dialogue acts, (ii) complex moves themselves modeled by their own transition networks, or (iii) subdialogues modeled by the dialogue net.
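The following sketch indicates, with assumed names and only a handful of transitions, how such a network of dialogue states and moves could be encoded; it is not the COR implementation, and the transitions shown are limited to some of those listed in Figure 5 below.

```python
# Minimal, illustrative encoding of a COR-style dialogue net (hypothetical code,
# not the SPEAK! implementation). States 1-4 lie within a dialogue cycle; states
# 5-8 are terminal. Only a few of the moves from Figure 5 are included.
TRANSITIONS = {
    ("1", "request"):          "2",    # A requests information from B
    ("1", "offer"):            "2'",   # B offers to provide information
    ("2", "promise"):          "3",    # B promises to act on the request
    ("2'", "accept"):          "3",    # A accepts B's offer
    ("3", "inform"):           "4",    # B presents the information
    ("4", "evaluate"):         "1",    # A evaluates; a new cycle may start
    ("2", "withdraw_request"): "1",    # A withdraws the request
    ("2", "reject_request"):   "7",    # B rejects; unsatisfactory terminal state
}
TERMINAL = {"5", "6", "7", "8"}

def run_dialogue(moves: list[str]) -> str:
    """Follow a sequence of moves through the net and return the final state."""
    state = "1"
    for move in moves:
        state = TRANSITIONS[(state, move)]
        if state in TERMINAL:
            break
    return state

# A full retrieval cycle: request -> promise -> inform -> evaluate, back to state 1.
print(run_dialogue(["request", "promise", "inform", "evaluate"]))   # -> 1
```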

Dialogue act (name, transition) – transition condition – possible follow-up moves (expected; alternatives):

A: request (1→2) – A wants: B does a – expected: promise (B,A,T); alternatives: reject_request (B,A,T) (2→1/7), withdraw_request (A,B,T) (2→1/7), subdialogue (B,A,T)
B: offer (1→2') – B intends: B does a – expected: accept (A,B,T); alternatives: reject_offer (A,B,T) (2'→1/7'), withdraw_offer (B,A,T) (2'→1/7'), subdialogue (A,B,T)
B: promise (2→3) – B intends: B does a* – expected: inform (B,A,T); alternatives: withdraw_promise (B,A,T) (3→1/8), subdialogue (A,B,T)
A: accept (2'→3) – A wants: B does a* – expected: inform (B,A,T); alternatives: withdraw_accept (A,B,T) (3→1/8), subdialogue (B,A,T)
B: inform (3→4) – B believes: p – expected: evaluate (A,B,T) (4→5); alternatives: evaluate (A,B,T), subdialogue (A,B,T)
A: evaluate (4→1) – A believes: p – expected: request (A,B,T), offer (B,A,T); alternatives: dialogue (_,_,m) (1→1), withdraw (_,_,m) (1→6), subdialogue (B,A,T)
A: withdraw_request (2→1) – A wants: [not (B does a*)] – expected: request (A,B,T), offer (B,A,T), [END]; alternatives: dialogue (_,_,m) (1→1), withdraw (_,_,m) (1→6), subdialogue (A,B,T)
B: reject_request (2→7) – B intends: [not (B does a*)] – ...

Legend: A = information seeker; B = information provider; _ = A or B; T = type of dialogue or move; "A: <...>" = atomic act; "<...> (_,_,T)" = complex move; p = proposition; a = action (defined here); a* = action adopted from a previous move. States 1–4 are within a dialogue cycle; states 5–8 are terminal states of a dialogue or subdialogue.

Figure 5: Functions of some COR dialogue acts and responses


Figure 5 shows how two participants, A and B, may perform a complete dialogue (or subdialogue) by moving from the start state (1) to any of the terminal states (5–8). A retrieval dialogue is initiated either by a request for information (transition 1–2) or by an offer to provide some information (transition 1–2'), and it is within these moves that the conditions of action and the global subject matter of the current dialogue cycle are defined. We can see that the roles attributed to the interlocutors may vary, as indicated by the parameters A and B and their relative order. An "ideal" dialogue is indicated by moving to state 5. All dialogues that terminate in states 6–8 are to some degree unsatisfactory or are broken off prior to solving the given information problem. Note that it is essential for an adequate model of natural dialogue that these other possible paths through a dialogue are represented: otherwise the system using such a dialogue model will often not be able to follow the intentions of the human interlocutor. Extensive descriptions of (different versions of) the COR model are available, for example, in Sitter & Stein (1992), Hagen & Stein (1996), and Stein et al. (1999). The global organization of dialogue by scripts described above then constrains the possibilities that COR provides, in order to motivate particular paths through the more local organization.

For the representation of constraints between dialogue move and speech function on the side of the interpersonal semantics, and mood and key on the part of the grammar, the type of dialogue move in context (dialogue history) serves as the ultimate constraint on tone selection. A typical move in the genre of information-seeking dialogue is the request (for information) move, which contains as its nucleus a simple/atomic request act. In terms of speech function, such a request act is often a question. The request–question correlation in the kind of dialogue we are dealing with here constrains the choice of mood to interrogative or declarative, e.g., (a) Was möchten Sie sehen ("What would you like to look at") (interrogative) – (b) Sie wollen die Gemälde von Gauguin sehen ("You want to look at the paintings of Gauguin") (declarative). So, in information-seeking dialogues, the type of move largely constrains the selection of speech function, but it only partially constrains the mapping of speech function onto mood. Deciding between declarative and interrogative as the linguistic realization of a request requires information about the immediate context of the utterance, i.e., about the dialogue history. It is in the dialogue history that the speaker's attitudes and intentions and the hearer's expectations are implicitly encoded. The area of the grammar encoding this kind of interpersonal information is key. The key systems are subsystems of the basic mood options and are realized by tone. Consider the contexts in which (a) or (b) would be appropriate: (a) would typically be used as an initiating act of an exchange, where there is no immediately preceding context – the speaker's attitude is essentially neutral, and tone 1 is appropriate; (b) would typically be used in an exchange as the realization of a responding act. In terms of the COR dialogue model, (b) would be a possible realization of a request initiating a clarification dialogue embedded in an inform or request move – the speaker wants to make sure she understood the preceding inform or request act correctly, and tone 2 is the appropriate intonation.
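The constraint chain just described – dialogue move plus dialogue history constraining speech function, mood, key, and finally tone – can be condensed into a small sketch. The function below is a deliberately simplified illustration covering only examples (a) and (b); the key label for (b) is an assumption, and this is not the SPEAK! decision logic.

```python
# Illustrative simplification (not the SPEAK! implementation) of how a COR request
# act is realized, depending on whether it initiates an exchange or forms part of
# an embedded clarification dialogue.
def realize_request(initiating: bool) -> dict:
    """Map a request act plus minimal dialogue-history information to speech
    function, mood, and tone, following examples (a) and (b) above."""
    if initiating:
        # (a) "Was möchten Sie sehen?" – no preceding context, neutral attitude
        return {"speech_function": "question", "mood": "interrogative",
                "key": "neutral", "tone": "1"}
    # (b) "Sie wollen die Gemälde von Gauguin sehen" – responding act opening a
    # clarification dialogue; key label "clarifying" assumed for illustration
    return {"speech_function": "question", "mood": "declarative",
            "key": "clarifying", "tone": "2"}

print(realize_request(initiating=True))
print(realize_request(initiating=False))
```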

5.3 Evaluation of the complete SPEAK! prototype

The principal aims of the whole-system evaluations of the integrated prototype were to investigate the following questions:

• How does the dialogue guidance by system-generated active help influence the behavior of real end-users and their subjective satisfaction with the interaction?
• How do users perceive the different output modalities (speech vs. writing)?
• Are there any differentiations by user characteristics such as retrieval expertise and professional background?

The study used an experimental setting in order to investigate the users' interactive behavior and their subjective assessments under controlled conditions. We mainly relied on observational and interview techniques, since these are especially suited for cost-effective evaluation of proof-of-concept prototypes. We combined results from qualitative analyses of the observational data collected, the subjects' reports during interaction with the system, and post-experimental interviews; some quantitative data from a questionnaire were analyzed mainly in support of the qualitative analyses. In the following we summarize selected results, concentrating on the comparison of the interface versions and on differential analyses by user background (more detailed findings are discussed extensively in Stein, 1997).

The sample consisted of 24 staff members of the research institute GMD-IPSI, including a number of non-researchers (secretaries and administrative staff). The subjects (Ss) were randomly divided into three groups in order to test the interface versions: "active spoken" (S), "active written" (W), and "control" (C, a control group testing the baseline system with no active help). Each subject was first to solve four retrieval tasks with one version and then to try a second version, solving an additional, more complex task; i.e., each interacted with two interface versions. The test users were exposed to the system versions in one of the orders S → W, W → S, or C → S. They used the system in the presence of an observer in a separate room, with the SPEAK! system running on a SPARC 10 workstation and the MULTIVOX-SPEAK synthesizer connected in the spoken output version. Our general hypotheses were that the active help mode would be (perceived as) more useful than passive help by most or certain users, and that the output modality would have an effect on the users' performance and satisfaction with the interaction. Since synthesized speech is an unusual medium and may provoke quite varied reactions, the latter was not formulated as a specific directional hypothesis but as an open-ended question for detailed investigation.

All test users solved all retrieval tasks, which took them about 30 minutes for the first and 15 minutes for the second version. Concerning the total time spent, no significant differences between the three test conditions were observed. This could be expected, given that the tasks were relatively easy and each version provided detailed help; however, it indicates that there were no system features that drastically impeded retrieval performance. The Ss only very rarely encountered usability problems in the spoken and written interface versions. Most Ss (all IR novices) expressed that the interaction style and the help messages were useful in most situations, and Ss rarely requested additional, more detailed help. A few Ss checked the static help texts once or twice during their session, in case they had forgotten the specific instructions given beforehand or had not paid sufficient attention to them; the repeat button and the detailed help button were used in only two cases. In the control group the situation was quite different.
Here, Ss were often disoriented and complained about missing feedback from the system (although they checked the static help texts frequently and found them quite helpful). Note that the static help texts contained all of the information that was generated in the active help versions, albeit in a different form, and included even more detailed instructions.

Differences in the search and navigation behavior in the three test conditions mainly concerned the timing of user actions. In the active help versions the timing was easier, as most Ss reported. In fact, we observed that most users of the spoken version had fewer orientation problems, being prompted by the acoustic output, whereas in the other versions Ss often did not notice when the system was done, e.g., when a presentation had been changed.

After the experiment, Ss were asked to fill out a questionnaire with some open-ended questions and five-point rating scales in order to evaluate some distinct system features: retrieval (performance and accuracy), user–system interaction (interaction style and user guidance), help texts (quality and usefulness of the text content), and speech understanding (intelligibility of the spoken output). As can be seen in Table 1, the Ss' ratings were similarly positive on the first three rating scales (except the judgements of the interaction by the control group). The fourth feature, speech understanding, was judged less favorably. Almost all Ss stated that the long spoken instructions at the beginning of the session were quite hard to understand, but that they did much better with the shorter comments in later phases of the session. During interaction with the system none of the Ss commented on the quality of the intonation, either positively or negatively. In the post-experimental interview, however, all Ss said that the intonation was "almost perfect", and many Ss claimed that in the spoken version it was much easier to direct the focus of attention to the retrieval task and not to overlook the "important things".

Table 1: Ratings of system properties, by test condition

[Table 1 gives, for each of the test conditions C, W, and S (n=8 each), the distribution of ratings on five-point scales from 1 [– –] to 5 [+ +], together with the medians, for the four features Retrieval, Interaction, Help Texts, and Speech Understanding.]

Note: The ratings refer to the first system version tested, except the ratings of "speech understanding" by the C and W groups, which were made after their second test with the spoken help version. Medians were calculated from grouped data, with linear interpolation of the exact position; Mann-Whitney U tests compared the subsamples C–W, C–S, and W–S for each rating scale: ** p < .01; * p < .05; + p < .10.

There were no significant differences between the S and W conditions, and the other comparisons shown in Table 1 mainly reveal effects of active vs. passive help.

To support the hypothesis that active help is better accepted than passive help, we expected differences between the C–W and C–S conditions mainly in the judgements of the interaction and the help texts, but not concerning retrieval, since the latter addresses the system's performance, which was the same in all conditions. We did indeed find statistically significant differences in the ratings of interaction and help texts between the samples C–W and C–S. Users of the baseline version (C) gave clearly more negative ratings of the interaction and of the usefulness of the help texts (in this case: static texts), probably due to the missing user guidance and the absence of situation-dependent help. In fact, most users of the C version mentioned explicitly during the experiment and in the interview that they missed more feedback and guidance from the system.

After having completed their tests with the second system version, Ss were asked to compare the two versions and to say which output modality (speech or active/passive written help) they preferred, and for what reasons. We assigned the Ss' responses to another scale, as shown in Table 2. It is interesting that clearly more Ss favored spoken output over written (active or passive) help – some of them even if the intelligibility of the synthesized speech could not be improved. Many Ss said during interaction and in the interview that an appropriate combination would be most desirable: the written form for very long and complicated instructions, and speech for simpler, situation-specific explanations or warnings. This result also provides strong support for the adoption of a sufficiently flexible approach to message presentation such as that used in SPEAK!

Table 2: Favored output mode, by test condition and user background

[Table 2 gives the distribution of responses on a five-point scale from 1 [Wo ++] to 5 [So ++], together with the medians, for the total sample (n=24) and broken down by test condition (C, W, S; n=8 each), IR expertise (experts n=11, novices n=13), computer science background (CS n=11, no CS n=13), and researcher status (researchers n=16, non-researchers n=8).]

Note: Wo = written output (C condition: static help texts; W and S conditions: active written help); So = spoken output; ++ = strongly favored; + = moderately favored; 0 = none/mixed mode favored. Medians were calculated from grouped data, with linear interpolation of the exact position.

A number of further differential analyses were carried out in order to explore potential effects of certain user characteristics. Comparing the judgements of interaction and help texts by the users' previous IR expertise, we found that the ratings of IR novices were slightly more positive than the experts' judgements. Although the differences fell just short of conventional significance levels, this tendency was clearly supported by the open-ended interview comments. When discounting the ratings of the C group and considering only those Ss who encountered active help in the primary test condition, we found significant differences concerning the preferred output mode: the IR novices (and non-researchers) were clearly more in favor of speech output than the IR experts and researchers in the sample. Non-researchers (secretaries and persons from the administrative staff) evaluated the quality and usefulness of the help texts clearly better than researchers.

Further splitting up the groups and comparing their ratings, we found that it was the subgroup of persons who were non-researchers, non-computer scientists, and IR novices (n=7) whose ratings and interview comments were particularly positive with respect to the interaction and the usefulness of the generated spoken or written help messages. By contrast, their ratings of the system's retrieval performance and speech understanding were very similar to those of the other Ss. Since the current SPEAK! application was designed to support mainly first-time or casual users of the system and IR novices, the positive judgements of the interaction support by non-researchers and novice users of IR systems indicate that the implemented proof-of-concept prototype was quite effective and successful for this type of user. Frequent users of the system, however, might need more flexible and varying kinds and degrees of adaptive interaction support, and possibly additional task-based facilities and functions for supporting advanced search and retrieval strategies.

6 Conclusions

In this article we have described the motivation, architecture and evaluation of a multimodal interface for information systems. We have shown that, particularly for users who are new to information retrieval and information systems, flexible, context-sensitive help information is highly regarded. Moreover, the provision of spoken help was shown to help users orient their interaction with the information system more effectively, no doubt largely due to the high visual load placed on them when interacting with a multimodal system. The requirement that the information presented be tailored to the dialogue context and to the needs of the user, as well as being presentable in both spoken and written form, clearly argues that simplistic approaches to providing such help information will not be perceived as positively. Maximally reusable components for producing natural language dialogue contributions in both spoken and written form are therefore an obviously desirable research and development direction for the immediate future.

Acknowledgments We thank all those who participated in the SPEAK! project, particularly Eli Hagen for her work on dialogue modeling and its implementation and Brigitte Grote for her contributions to the specification of the intonational resources for German. The work would also not have been possible without the support of our Hungarian project partners, Gábor Olaszy and Géza Németh, and the other researchers at the TU Budapest involved in the project.

References

Abb, B., Günther, C., Herweg, M., Maienborn, C., and Schopp, A. (1996). Incremental syntactic and phonological encoding – an outline of the SYNPHONICS formulator. In Adorni, G., and Zock, M., eds., Trends in Natural Language Generation: An Artificial Intelligence Perspective. Berlin and New York: Springer, pp. 277–299.

Adorni, G., and Zock, M., eds. (1996). Trends in Natural Language Generation: An Artificial Intelligence Perspective. Berlin and New York: Springer.

Bateman, J. A. (1992). Grammar, systemic. In Shapiro, S., ed., Encyclopedia of Artificial Intelligence, Second Edition. New York: Wiley, pp. 583–592.

Bateman, J. A. (1998). Automatic discourse generation. In Kent, A., ed., Encyclopedia of Library and Information Science. New York: Marcel Dekker.

Bateman, J. A., and Teich, E. (1995). Selective information presentation in an integrated publication system: An application of genre-driven text generation. Information Processing & Management 31(5):379–395.

Belkin, N. J., Cool, C., Stein, A., and Thiel, U. (1995). Cases, scripts, and information seeking strategies: On the design of interactive information retrieval systems. Expert Systems with Applications 9(3):379–395.

Berry, M. (1981). Systemic linguistics and discourse analysis: A multi-layered approach to exchange structure. In Coulthard, M., and Montgomery, M., eds., Studies in Discourse Analysis. London: Routledge and Kegan Paul.

Bolinger, D. (1972). Accent is predictable, if you’re a mind-reader. Language 48.

Brown, G. (1983). Prosodic structure and the given/new distinction. In Cutler, A., and Ladd, R., eds., Prosody: Models and Measurements. Berlin and New York: Springer.

Callan, J. P., Croft, W. B., and Harding, S. M. (1992). The INQUERY retrieval system. In Proceedings of the 3rd International Conference on Database and Expert Systems Applications. Berlin and New York: Springer, pp. 78–83.

Davis, J. R., and Hirschberg, J. (1988). Assigning intonational features in synthesized spoken directions. In Proceedings of the 26th Annual Meeting of the Association for Computational Linguistics, pp. 187–193.

Dorffner, G., Buchberger, E., and Kommenda, M. (1990). Integrating stress and intonation into a concept-to-speech system. In 13th International Conference on Computational Linguistics (COLING-90), volume II, pp. 89–94.

Fawcett, R. P. (1990). The computer generation of speech with discoursally and semantically motivated intonation. In Proceedings of the 5th International Workshop on Natural Language Generation (INLG ’90).

Fawcett, R. P., van der Mije, A., and van Wissen, C. (1988). Towards a systemic flowchart model for discourse. In New Developments in Systemic Linguistics. London: Pinter, pp. 116–143.

Golovchinsky, G., and Chignell, M. (1993). Queries-R-Links: Graphical markup for text navigation. In Ashlund, S., et al., eds., Human Factors in Computing Systems: INTERCHI ’93 Conference Proceedings. New York: ACM Press, pp. 454–460.

Grote, B., Hagen, E., Stein, A., and Teich, E. (1997). Speech production in human-machine dialogue: A natural language generation perspective. In Maier, E., Mast, M., and LuperFoy, S., eds., Dialogue Processing in Spoken Language Systems. Berlin and New York: Springer, pp. 70–85.

Hagen, E., and Stein, A. (1996). Automatic generation of a complex dialogue history. In McCalla, G., ed., Advances in Artificial Intelligence. Proceedings of the Eleventh Biennial Conference of the Canadian Society for Computational Studies of Intelligence (AI ’96). Berlin and New York: Springer, pp. 84–96.

Halliday, M. A. K. (1967a). Intonation and Grammar in British English. The Hague: Mouton.

Halliday, M. A. K. (1967b). Notes on transitivity and theme in English, parts 1 and 2. Journal of Linguistics 3:37–81 and 199–244.

Halliday, M. A. K. (1985). An Introduction to Functional Grammar. London: Edward Arnold.

Halliday, M. A. K., and Hasan, R. (1985). Language, Context and Text: A Social Semiotic Perspective. Geelong, Victoria: Deakin University Press (Language and Learning Series). Also: Oxford University Press, London, 1989.

Henschel, R. (1994). Declarative representation and processing of systemic grammars. In Martin-Vide, C., ed., Current Issues in Mathematical Linguistics. Amsterdam: Elsevier, pp. 363–371.

Iordanskaja, L. N., Kittredge, R., and Polguère, A. (1991). Lexical selection and paraphrase in a meaning-text generation model. In Paris, C. L., Swartout, W. R., and Mann, W. C., eds., Natural Language Generation in Artificial Intelligence and Computational Linguistics. Boston: Kluwer, pp. 293–312.

Kasper, R. T. (1989). A flexible interface for linking applications to PENMAN’s sentence generator. In Proceedings of the DARPA Workshop on Speech and Natural Language. Available from USC/Information Sciences Institute, Marina del Rey, CA.

Martin, J. R. (1992). English Text: System and Structure. Amsterdam: Benjamins. Chapter 7, pp. 493–573.

Matthiessen, C. M. I. M. (1995). Lexicogrammatical Cartography: English Systems. Tokyo: International Language Science Publishers.

McKeown, K. R. (1985). Text Generation: Using Discourse Strategies and Focus Constraints to Generate Natural Language Text. Cambridge, UK: Cambridge University Press.

Moore, J. D., and Paris, C. L. (1993). Planning text for advisory dialogues: Capturing intentional and rhetorical information. Computational Linguistics 19(4):651–694.

Olaszy, G., and Németh, G. (1997). Prosody generation for German CTS/TTS systems: From theoretical intonation patterns to practical realisation. Speech Communication 21(1-2):37–60.

Pheby, J. (1969). Intonation und Grammatik im Deutschen. Berlin: Akademie-Verlag (2nd edition: 1980).

Prevost, S., and Steedman, M. (1994). Specifying intonation from context for speech synthesis. Speech Communication 15(1-2):139–153.

Reiter, E., Mellish, C., and Levine, J. (1995). Automatic generation of technical documentation. Applied Artificial Intelligence 9.

Selting, M. (1993). Phonologie der Intonation: Probleme bisheriger Modelle und Konsequenzen einer neuen interpretiv-phonologischen Analyse. Zeitschrift für Sprachwissenschaft 11(1):99–138.

Sitter, S., and Stein, A. (1992). Modeling the illocutionary aspects of information-seeking dialogues. Information Processing & Management 28(2):165–180. See also: Modeling Information-Seeking Dialogues: The Conversational Roles (COR) Model. RIS: Review of Information Science (online journal), 1996, 1(1), available from http://www.inf-wiss.uni-konstanz.de/RIS/...

Stein, A. (1997). Usability and assessments of multimodal interaction in the SPEAK! system: An experimental case study. The New Review of Hypermedia and Multimedia (NRHM), Special Issue on Evaluation 3:159–180.

Stein, A., Gulla, J. A., Müller, A., and Thiel, U. (1998). Abductive dialogue planning for concept-based multimedia information retrieval. In this volume.

Stein, A., Gulla, J. A., and Thiel, U. (1999). User-tailored planning of mixed initiative information-seeking dialogues. User Modeling and User-Adapted Interaction, Special Issue on Computational Models for Mixed Initiative Interaction 8(1-2). To appear.

Teich, E., and Bateman, J. A. (1994). Towards an application of text generation in an integrated publication system. In Proceedings of the Seventh International Workshop on Natural Language Generation, Kennebunkport, pp. 153–162.

Teich, E., Hagen, E., Grote, B., and Bateman, J. A. (1997). From communicative context to speech: Integrating dialogue processing, speech production and natural language generation. Speech Communication 21(1-2):73–99.

Terken, J. (1984). The distribution of accents in instructions as a function of discourse structure. Language and Speech 27:269–289.

Ventola, E. (1987). The Structure of Social Interaction: A Systemic Approach to the Semiotics of Service Encounters. London: Pinter.
