Employing AI Methods to Control the Behavior of Animated Interface Agents

Elisabeth André, Thomas Rist, Jochen Müller
German Research Center for Artificial Intelligence (DFKI)
Stuhlsatzenhausweg 3, D-66123 Saarbrücken, Germany
Email: {andre, rist, [email protected]}

Abbreviated Title: Behavior Control for Animated Agents

Abstract Life-like characters are increasingly gaining the attention of researchers and commercial developers of user interfaces. A strong argument in favor of using such characters in the interface is the rich repertoire of options they offer, enabling the emulation of communication styles common in human-human dialogue. This contribution presents a framework for the development of presentation agents which can be used for a broad range of applications including personalized information delivery from the WWW.


1 Introduction

During the last decade, a growing number of research projects both in academia and industry have embarked on the development of life-like agents as a new metaphor for highly personalized human-machine communication. Our interest in animated presentation agents arose from our previous work on the development of the knowledge-based presentation system WIP (cf. [André et al., 93]). Although the presentations (texts, pictures, animations, and mixed presentations) synthesized by WIP are coherent and tailored to the individual settings of certain presentation parameters (target language, user characteristics, document type, and resource limitations such as screen/page size), WIP did not have the ability to plan when and how to present the generated material to the user. To enhance the effectiveness of presentations, we aimed at an augmented system in which an animated character plays the role of a presenter, showing, explaining, and verbally commenting on textual and graphical output on a window-based interface. Despite the ongoing debate on the sociological effects that life-like characters may have, cannot yet have, and will perhaps never have, it is safe to say that they enrich the repertoire of available options which can be used to communicate information to the user effectively. Among other things, they can be employed to attract the user's focus of attention, to guide him/her through a presentation, to realize new presentation means such as two-handed pointing, and to convey additional conversational and emotional signals which are difficult to communicate using other media. In this contribution, we report on two projects, PPP (Personalized Plan-Based Presenter) and AiA (Adaptive Communication Assistant for Effective Infobahn Access), which are both committed to the development of presentation agents for a broad range of applications including computer-based instruction, guides through information spaces, and web-based product advertisement. In the next subsections, we briefly sketch the application classes for these projects as well as the common and specific requirements on the involved agents. The central part of the paper is a detailed description of the methods employed for determining the behavior of presentation agents. Finally, we report on the


outcome of a recent empirical study which compared objective and subjective ratings of presentations with and without a Persona.

1.1 PPP Persona: A Desktop Agent

The main focus of the PPP project lies in the situated generation of multimedia help instructions presented by an animated agent, the so-called PPP Persona. The three screen shots in Fig. 1 show the system in the process of explaining the elements of a modem circuit board to the user.

INSERT Figure 1 ABOUT HERE

To accomplish this presentation task, the system first creates a window containing a diagram of the circuit board. After the window has appeared on the screen, the PPP Persona takes up a position on the screen suitable for carrying out the necessary pointing gestures. The first step in explaining the graphics is to introduce the names of the depicted objects. While in static graphics this task is usually accomplished by drawing text labels onto the graphics (often together with arrows pointing from the label to the object), the PPP Persona enables the realization of dynamic annotation forms as well. In our example, it points to the individual objects one after the other, verbally uttering their names using a speech synthesizer. As illustrated by the three snapshots, the graphical display may change during a presentation in order to bring the corresponding objects into view. Note that the system maintains an explicit representation of all generated presentation parts and is thus able to refer to them. The example also demonstrates how facial expressions and head movements help to restrict the visual focus. By having the Persona look into the direction of the target object, the user's attention is additionally directed to this object. In the previous example, all pointing acts of the Persona referred to a single window. The next example illustrates how cross-media links can be effectively built up between


several windows. In the example shown in the left part of Fig. 2, the Persona uses two pointing sticks to establish a visual link between an object's graphical and textual representation. The screen shot also shows that the appearance of the Persona is not restricted to cartoon characters only. This time, the presentation system personifies itself as a "real" person (the paper's first author) composed of grabbed video material. Unlike other approaches, e.g., [Ball, 96, Thórisson, 98], we primarily employ life-like characters for presenting information. We do not allow for communication with life-like characters via speech in order to avoid problems resulting from the deficiencies of current technology for the analysis of spoken language. Nevertheless, the user has the possibility of influencing the course of the presentation by making specific choices at runtime. The PPP system supports user interaction too, though in a rather simplified way. The user can directly click on the Persona to obtain the system's main menu. In this way, the user can enter a new goal to be achieved by the system, or interrupt and "criticize" an ongoing presentation by changing generation parameters, such as the degree of detail or the preference for a certain medium. However, the Persona also offers the possibility of guiding the user through input menus. Instead of presenting different input devices (i.e., text boxes, sliders, buttons, etc.) in a form-sheet-style menu, the Persona can request the single values one after the other, giving verbal explanations on each option. The right-hand screen shot in Fig. 2 illustrates a situation where the Persona asks the user to input a numeric value.

INSERT Figure 2 ABOUT HERE

1.2 WebPersona: An Agent for Presenting Information from the WWW

With the advent of web browsers which are capable of executing programs embedded in web pages, the use of animated characters for the presentation of information over the web


has become possible. A web presentation can comprise dynamic media such as video, animation and speech, all of which have to be displayed in a spatially and temporally coordinated manner. Coordination is also needed for dynamic presentations in which a life-like character points to and verbally comments on other media objects, such as graphics, video clips, or text passages. The principle is to pack a web page with: (a) the selected media objects along with a specification of how they have to be arranged and temporally scheduled, and (b) a presentation runtime engine (for example, implemented as a Java applet) which displays the media objects according to the layout specification, and to ship this package to the client. This approach has been applied to a number of different application scenarios in the context of the AiA project. Our most recent application is a personalized travel agent which gathers task-relevant information from web sources and restructures it into self-contained units to be presented by the life-like agent. The screen shots in Fig. 3 illustrate this scenario. Suppose the user wants to travel to Frankfurt and starts a query for typical travel information (via the menu interface shown in the first frame of Fig. 3). To comply with the user's request, the AiA system retrieves information about Frankfurt from the WWW, selects relevant units and groups these units into different sections, such as weather, lodging, latest news and so forth. WebPersona starts the presentation by informing the user about the available information units. That is, WebPersona first explains the purpose of the navigation buttons which are part of the presentation (and displayed along the right borders of the frames shown in Fig. 3). Pressing one of these buttons will trigger the display of a sub-presentation. For instance, if the user presses the hotel button, WebPersona starts presenting hotel offers (middle frame of Fig. 3). Such presentations will vary in content and form depending on what the system finds on the web. In some cases, AiA is even able to retrieve information units from different sources and combine them into a single presentation item. For example, the address entry of a hotel is used as input for another web search in order to generate a map display on which the hotel can be located (right-hand frame of Fig. 3).


INSERT Figure 3 ABOUT HERE
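To make the packaging principle described above more concrete, the following sketch shows the kind of self-contained bundle that could be shipped to the client-side runtime engine. It is an illustration only; the class and field names (MediaObject, ScheduledAct, PresentationPackage) are invented for this sketch and do not correspond to the actual AiA data formats.

from dataclasses import dataclass, field
from typing import Dict, List, Tuple

@dataclass
class MediaObject:
    object_id: str       # e.g., "hotel-photo-1"
    media_type: str      # "text", "image", "video", "speech", ...
    source: str          # URL or inline content retrieved from the web

@dataclass
class ScheduledAct:
    act: str             # e.g., "S-Show", "S-Point", "S-Speak"
    target: str          # id of the media object the act refers to
    start: int           # start time in abstract time units
    end: int             # end time in abstract time units

@dataclass
class PresentationPackage:
    media_objects: List[MediaObject] = field(default_factory=list)
    layout: Dict[str, Tuple[int, int, int, int]] = field(default_factory=dict)  # id -> (x, y, w, h)
    script: List[ScheduledAct] = field(default_factory=list)

# The provider assembles such a package and ships it together with the
# runtime engine (e.g., a Java applet); the engine then displays each media
# object at its layout position and lets the agent execute the script.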

Usually, pointing gestures to illustrations retrieved from the web are tricky because the system would have to apply sophisticated image analysis methods to find out what is depicted in an illustration. In the case above, this problem did not occur because the Mapquest server (http://www.mapquest.com), which our system accesses to find a hotel's location, automatically positions the objects for which a search is started in the center of the map it provides. In other cases, the author of a web page has established links between image regions and concepts. For example, many maps available on the web are already mouse-sensitive, and the system just has to follow the links to find the concepts related to the mouse-sensitive regions. The novelty of WebPersona is that presentation scripts and navigation structures are not stored in advance, but generated automatically from pre-authored document fragments and items stored in a knowledge base. Following a navigation link does not cause paging as in the case of most conventional web presentations. Rather, a new presentation script for the agent, along with the required textual and pictorial material, is transferred to the client-side presentation runtime engine.

2 Design Considerations for the Presentation Agent

The conception of our Persona agent has been guided by a number of requirements, such as multipurpose presentation capabilities, life-likeness, adaptivity, and broad usability including web users with low-end computers. Furthermore, we have been striving for a technical realization which supports the shift to other applications.


2.1 The Persona's Visual Design

Inspired by human presenters, we aimed at a character that is able to perform similar presentation acts. Therefore, a human-like visual appearance seemed quite reasonable for this purpose. However, anthropomorphism as such is not the focus of our work on animated agents. In order to keep the costs for character production low and to be independent of high-speed graphics engines for running presentations, we currently rely on 2D animations for the visual appearance of our Persona(s). As illustrated in the examples of Section 1, these are hand-drawn cartoon figures or prerecorded video clips. The underlying frames for drawing the animations are scalable in size. However, in our applications the characters will always interact with other graphical objects on the display. Therefore, we usually employ relatively small-sized characters (e.g., a character height of between 200 and 300 pixels on a 21" monitor) in order to save screen real estate for other display items.

2.2 The Persona's Behavior

Though a number of similarities may exist, our presentation agent is not just another animated 2D icon in the interface. Rather, the behavior of the character follows the equation:

Persona behavior := directives + self-behavior

By directives we understand a set of tasks which can be forwarded to the character for execution. To accomplish these tasks, the Persona relies on gestures that: express emotions (e.g., approval or disapproval), convey the communicative function of a presentation act (e.g., warn, recommend or dissuade), support referential acts (e.g., look at an object and point at it), regulate the interaction between the Persona and the user (e.g., establishing eye contact with the user during communication) and indicate that the Persona is speaking. Of course, these gestures may also be superimposed on one another. For example, to warn the user, the Persona lifts its index finger, looks towards the user and utters the warning.


Directives are defined externally, either by a human presentation author or by another system which employs the character as part of its user interface. We also use the term presentation script to refer to a temporally ordered set of directives. While directives are determined by an underlying application, self-behaviors are application-independent and strongly shape the character's personality. The self-behaviors of our Persona are compiled from different action types (cf. Fig. 4); they currently comprise the following acts:



Low-level navigation acts: In some cases, the Persona has to move to an appropriate location on the screen before carrying out presentation tasks, such as pointing to an object. The kind of navigation act depends on the chosen metaphor. For example, human-like agents like the Persona walk or jump to an appropriate position on the screen, while agents like Microsoft's parrot Peedy fly (cf. [Ball et al., 97]).



Idle-time acts: To ensure that the Persona exhibits life-like qualities, it has to stay "alive" even in an idle phase. Typical acts to span pauses are breathing or tapping with a foot. Furthermore, the Persona may execute acts to indicate that the system is still active. For instance, it may pull out a book and start turning over pages while the system retrieves information from the web so as to satisfy a user request. However, in order not to distract the user, idle-time gestures with a high visual impact should only be executed in rare cases. Furthermore, monotonous repetitions should be avoided since they destroy a character's believability.



Reactive behaviors: The Persona should be able to react to user interactions immediately and give visual feedback. For instance, if the user drags the Persona across the screen, the Persona starts fidgeting.


INSERT Figure 4 ABOUT HERE

Responses to user interactions usually have the highest priority and may lead to the interruption of a presentation, e.g., if the user signals that he/she is no longer interested in a certain topic. Presentation tasks have a higher priority than idle-time scripts which are only run if the Persona has no other tasks to perform. Though it is certainly possible to extend the set of directives by instructions corresponding to what we called self behaviors, the distinction between both has an important advantage. From a conceptual point of view, we consider it more adequate since a clear borderline is drawn between a “what to present part” which is determined by the application, and a “how to present” part which, to a certain extent, depends on the particular presenter. From the practical perspective, this separation considerably facilitates the exchange of characters, and the reuse of characters for other applications. On the other hand, it is clear that our system design requires a component which has to merge directives and self-behaviors in a reasonable manner. We have called this component the Persona Engine; it will be described in Section 5.
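A minimal sketch of this merging idea, assuming a simple three-level priority scheme (reactive behaviors before directives before idle-time acts), is given below; the class and method names are invented and do not reflect the actual Persona Engine interfaces.

from collections import deque

class BehaviorArbiter:
    """Illustrative arbitration of directives and self-behaviors by priority."""
    def __init__(self):
        self.reactive = deque()    # triggered by user events; highest priority
        self.directives = deque()  # presentation script received from the planner
        self.idle_acts = ["breathe", "tap-foot", "turn-pages"]

    def next_act(self):
        if self.reactive:             # 1. immediate visual feedback to the user
            return self.reactive.popleft()
        if self.directives:           # 2. presentation tasks from the script
            return self.directives.popleft()
        return self.idle_acts[0]      # 3. span the pause with an idle-time act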

2.3 The Need for Automated Script Writing

A Persona Engine with the above-mentioned capabilities could be used by a human author for the production of multimedia presentations which include a life-like character. However, the goal of our work is much more ambitious since we aim at the automation of the upper-level authoring process as well. This goal is motivated by the observation that the manual preparation of presentation scripts becomes:

an error-prone and tedious task as the complexity of a presentation increases - even if directives can be formulated at a high degree of abstraction.


less and less feasible as the need for flexibility grows. For example, to allow for a flexible adaptation of the presentation to a user's specific task, knowledge, and preferences, one would have to prepare a broad variety of different scripts and keep them in stock.



impossible for nearly all time-critical applications such as live reporting services, or the presentation of information packages which are configured on the fly (e.g. individual travel packages).

Based on our previous work on multimedia presentation design (cf. [André & Rist, 93]), we utilize a hierarchical planner for the automated decomposition of high-level presentation tasks into scripts which will be executed by the presentation agent. To flexibly tailor presentations to the specific needs of an individual user, we allow for the specification of generation parameters (e.g., "verbal utterances should be in English", or "the presentation must not exceed five minutes"). Consequently, a number of presentation variants can be generated for one and the same piece of information under different settings of the presentation parameters. Furthermore, we allow the user to flexibly choose between different navigation paths through a presentation. That is, as in a hypermedia document, the course of a presentation changes at runtime depending on user interactions. An important aspect of the script generation task is the temporal coordination of presentation acts. For example, there may be several feasible schedules for presenting media objects. To handle this problem, we have combined the presentation planner with a temporal reasoner that computes a reasonable schedule for the presentation acts. Both the planning approach for hypermedia presentations and the temporal reasoner will be detailed in Section 4.

2.4 The Need for a Configurable Presentation System

From our experience with a number of application scenarios, we learned that one unique system configuration can hardly accommodate the specific requirements imposed by


different applications. Especially for WWW applications, issues such as agent ownership and component residence must be addressed when designing a presentation system. Technically speaking, these issues refer to the distribution of software components. In order to reuse system components for a broad range of applications, the modularization should support the following three basic configurations:

User-owned presentation systems: Here both the upper-level presentation planning part and the Persona Engine reside on the user's machine. This is the classical configuration for a desktop agent which is to provide the user with various assistance and help services. From the user's perspective the configuration has the desirable property that all private data, such as preferences and other settings of presentation parameters, will remain at the user's site. However, a stand-alone version of the presentation system is of little use. Rather, it needs to be connected to an information source (e.g., an intelligent help system) which provides the data to be presented.

Provider-owned presentation systems: In this configuration, a presentation is completely worked out at the site of an information provider before it is shipped to the client in some displayable format. Thus, the situation is very similar to the classical production of multimedia titles which are distributed on CD-ROM. Since such presentations are to be viewed off-line, the user only has limited control over what will be presented.

Provider-side presentation planning, user-side Persona Engine: While the presentation planner resides at the information provider in this configuration, the user downloads the Persona Engine, which receives the generated presentation scripts from the provider. Conversely, user input can be sent back to the provider to be considered in the further design process. This configuration seems to be quite adequate for commercial presentation purposes, such as product advertisement and on-line information services. Of course, the user should be aware of the fact that the character is now remotely


controlled by the provider.

3 Structuring Principles for Interactive Multimedia Presentations

In order to be able to generate multimedia presentations automatically, we need to know how they are structured. We characterize interactive multimedia presentations by the rhetorical and temporal relationships between presentation parts, and by the structure of the entailed navigation space.

3.1 Rhetorical Structure

Following a speech-act theoretic perspective, we consider the composition and presentation of multimedia material as a goal-directed activity (cf. [André & Rist, 93]). That is, a presenter executes communicative acts, such as pointing to an object, commenting on an illustration or playing back an animation sequence, to achieve certain goals. Communicative acts can be performed by creating and presenting multimedia material or by reusing existing document parts in another context. The rhetorical structure of a presentation is determined by these communicative acts and the relations between them. (Our work has been strongly influenced by RST (Rhetorical Structure Theory, [Mann & Thompson, 87]), which characterizes coherent text in terms of rhetorical relations that hold between its parts.) The rhetorical structure can be represented by a directed acyclic graph (DAG) in which communicative acts appear as nodes and relations between acts are reflected by the graph structure. While the top of such a DAG is a more or less complex communicative act (e.g., to introduce an object), the lowest level is formed by specifications of elementary acquisition tasks (e.g., retrieving a photo or drawing a diagram) or presentation tasks (e.g., pointing to an object). Fig. 5 exhibits a part of the rhetorical structure of the WebPersona example given in Section 1.2. The presentation as a whole serves to provide information on the city

of Frankfurt. It consists of an introduction and several elaborating parts, one of them presenting a sequence of hotels. The introduction includes an introductory page and a summary, which is composed of a verbal utterance ("I have found information for your trip to Frankfurt.") and several elaborations which refer to the functions of the buttons the user may press. The hotel presentations comprise an introduction and an elaborating part providing information on the location of the hotels. The hotel introduction consists of an introductory page and a verbal utterance to emphasize hotel attributes that are of relevance to the particular user. The underlined nodes in the figure refer to parts of the presentation that have not yet been expanded.

INSERT Figure 5 ABOUT HERE

3.2 Temporal Structure

The temporal structure of a presentation is represented by a collection of media objects together with a presentation script. Presentation scripts entail directions for the character concerning the display of media objects. As in other animation scripting systems, we visualize presentation scripts by timeline diagrams which position all actions to be executed by the character along a single time axis. An example of a timeline diagram is shown in Fig. 6. According to this timeline diagram, the Persona presents a map, points to an object on the map and verbally provides some additional information. The duration of complex acts corresponds to the length of the white bars, while the darker bars refer to the duration of elementary acts.


INSERT Figure 6 ABOUT HERE
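In the absence of the figure, the timeline idea can be conveyed with a small schematic example. The acts and times below are invented for illustration and do not reproduce Fig. 6.

# Schematic timeline: (act, start, end) triples on a single time axis.
# The complex act spans the intervals of its elementary children.
timeline = [
    ("Present-Map", 0, 10),   # complex act (white bar)
    ("S-Show-Map",  0,  2),   # elementary acts (dark bars)
    ("S-Point",     3,  5),
    ("S-Speak",     3, 10),
]

def acts_active_at(t, schedule):
    """Return all acts whose interval contains timepoint t."""
    return [act for (act, start, end) in schedule if start <= t <= end]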

3.3 Navigational Structure

Inspired by the Amsterdam Hypermedia Model [Hardman et al., 94, Hardman et al., 97], we use state-transition graphs to describe the navigation structure of a presentation. A state-transition graph G is defined by a set of nodes N and a set of edges E, i.e., G = (N, E). With each node n ∈ N, we associate a presentation unit which refers to the presentation of a single web page. If a node is entered, the corresponding presentation script is run. Consequently, being in a certain state means that the corresponding web page is being presented. Furthermore, we assign to each node n ∈ N a default duration, usually the duration of the presentation unit, i.e., n := (unit_n, dur_n). An edge e ∈ E is defined by its connecting nodes, a condition and an action, i.e., e := (n_i, n_j, cond_e, act_e).

A transition is made if one of the predicates associated with the edges leading away from the node is satisfied or if the default duration is over. Predicates usually refer to user interactions, such as clicking on mouse-sensitive items in a presentation. An interesting question is the timepoint of transition. Should the system wait until the presentation is completed, or interrupt it and resume it later? Since a presentation unit may be rather long, we have chosen the second possibility. However, to avoid losing the coherence of a presentation, we do not allow for the interruption of elementary presentation acts that vary in time, such as speaking or pointing, but wait until these acts are executed. When returning to a node, the system continues the presentation by playing only the remaining part of the script. In principle, it would also be possible to have the Persona provide metacomments on user interactions, such as "Now, let's come back to ..." or "Please let me finish ...". Such metacomments are, however, not generated in the current version of
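Under the definitions above, the navigation structure can be encoded directly as a small state-transition graph. The sketch below uses invented names (NavNode, NavEdge, next_node) and simplifies the interruption handling; it only illustrates how default durations, edge predicates and script resumption fit together.

from dataclasses import dataclass, field
from typing import Callable, List, Optional

@dataclass
class NavEdge:
    target: "NavNode"
    condition: Callable[[], bool]        # predicate, e.g., "hotel button pressed"
    action: Optional[Callable[[], None]] = None

@dataclass
class NavNode:
    script: List[str]                    # presentation script of one web page
    default_duration: int                # used if no edge predicate fires
    edges: List[NavEdge] = field(default_factory=list)
    resume_at: int = 0                   # position at which to resume after an interruption

def next_node(node: NavNode, elapsed: int) -> Optional[NavNode]:
    """Return the node to enter next, or None to stay in the current node."""
    for edge in node.edges:
        if edge.condition():             # e.g., the user clicked a mouse-sensitive item
            if edge.action:
                edge.action()
            return edge.target
    if elapsed >= node.default_duration and node.edges:
        return node.edges[0].target      # default transition once the duration is over
    return None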


the system. A path through a presentation graph is defined as a sequence of nodes n_i, with 1 ≤ i ≤ m, where n_1 is the starting node and n_m is the end node. It corresponds to a specific way of viewing the presentation. The concepts introduced above will be illustrated using the WebPersona example again. The navigation graph of this example is partially shown in Fig. 7. The presentation is started by entering the first Introduce node. The Persona tells the user that it has found some information and explains to him/her which buttons he/she may use to get more details. Note that these buttons will remain visible during the whole presentation and the user has the choice to use them at any time. Let's suppose the user presses the hotel button and the Persona starts with the presentation of the first hotel it found. During the presentation, the user presses the location button. As a consequence, the presentation is interrupted and the location script is played. That is, the Persona now informs the user where to find the hotel by presenting him/her with a map. After that, the system returns to the first hotel node and plays back the remaining parts of the script. After the default time is over, a transition is made automatically to the second hotel node.

INSERT Figure 7 ABOUT HERE

4 Planning Presentation Scripts for the Persona

In the following, we will describe how to automate the creation of interactive web presentations. First, we introduce a declarative specification language for representing design knowledge. After that, we describe how presentation scripts are automatically created on the basis of this design knowledge. An important characteristic of our approach is that some parts of the presentation are only created on demand, i.e., if the user decides to follow certain hyperlinks. The last part of the section presents criteria for the integration


of such hyperlinks in a presentation.

4.1 Representation of Design Knowledge

Design knowledge is represented by means of so-called presentation strategies, which may be compared with a library of high-level authoring templates that can be combined in a flexible manner. While some strategies reflect general presentation knowledge and thus can be reused in a wide range of applications, others are more domain-dependent and specify how to present a subject in a certain genre. They are characterized by a header, a set of applicability conditions, a collection of inferior acts, a list of qualitative and metric temporal constraints, and a start and an end interval. The header corresponds to a complex presentation or acquisition act. The applicability conditions specify when a strategy may be used and constrain the variables to be instantiated. The inferior acts provide a decomposition of the header into more elementary presentation or acquisition acts. Qualitative temporal constraints are represented in an "Allen-style" fashion which allows for the specification of thirteen temporal relationships between two named intervals: before, meets, overlaps, during, starts, finishes, equal and the inverses of the first six relationships (cf. [Allen, 83]). Allen's representation also permits the expression of disjunctions, such as (A (before after) B), which means that A occurs before or after B. Metric constraints appear as difference (in)equalities on the endpoints of named intervals. They can constrain the duration of an interval (e.g., (10 ≤ Duration A2 ≤ 40)), the elapsed time between intervals (e.g., (4 < End A1 - Start A2 < 6)) and the endpoints of an interval (e.g., (Start A2 ≤ 6)).

In the following, we list two presentation strategies which may be employed to build up parts of the presentation given in Section 1.2. The first strategy may be used for the design of a hotel description. It only applies if the Persona finds an introductory text in its database which has been provided by one of the hotel servers (see 1.2) or stored in advance. (We use the notation (Bel Persona Fact), (Bel User Fact) and (MB Persona User Fact) to refer to the Persona's beliefs, the user's beliefs and their mutual beliefs.) The strategy prescribes to add the introductory text, a hyperlink that refers to the hotel's location, and an illustration to ?page. While S-Include-Text and S-Include-Link are elementary acts that can be directly forwarded to the Persona for execution, the act Illustrate has to be refined by applying further presentation strategies.

[1]

(defstrategy
  :header (A0 (Design-Intro-Page Persona User ?hotel ?page))
  :applicability-conditions
    (Bel Persona (Introduces ?text ?hotel))
  :inferiors ((A1 (S-Include-Text Persona User ?text ?page))
              (A2 (S-Include-Link Persona User ?page
                    (Elaborate-Location Persona User ?hotel ?page)))
              (A3 (Illustrate Persona User ?hotel ?page)))
  :qualitative ((A1 (meets) A2) (A2 (meets) A3))
  :start A1
  :finish A3)

Strategy [1] will always lead to the creation of a hyperlink. Factors, such as the user's

previous navigation behavior, may be considered by defining appropriate applicability conditions which control the selection of strategies. The qualitative constraints of the strategy specify the temporal order of the inferior acts. Note that they only indicate when the hyperlink is created and not when it is expanded. Information on the hotel's location is only provided on demand, i.e., if the user selects the corresponding mouse-sensitive item at presentation runtime. To comply with the user's request, Strategy [2] may be used.

[2]

(defstrategy
  :header (A0 (Elaborate-Location Persona User ?hotel ?page))
  :applicability-conditions
    (Bel Persona (Includes ?map ?hotel))
  :inferiors ((A1 (S-Include-Map Persona User ?map ?page))
              (A2 (Label Persona User ?hotel ?map ?page)))
  :qualitative ((A1 (before) A2))
  :metric ((End A1 - Start A2 ≤ -2))
  :start A1
  :finish A2)

As specified in the applicability conditions, Strategy [2] only applies if the database contains a map with the hotel, i.e., if the request to the Mapquest server was successful (cf. Section 1.2). In this case, the Persona shows the user where he/she can find the hotel by pointing to its location on the map. Besides a qualitative constraint which expresses that the labeling act should be performed after including the map, the strategy contains a metric constraint which prescribes that the labeling act should start no earlier than two time units after the display of the map. Note that we are not forced to completely specify the temporal behavior of all acquisition and presentation acts at definition time. This enables us to handle acts with an unpredictable duration, start and endpoints, i.e., acts whose temporal behavior can only be determined by executing them. For instance, Strategy [1] does not contain any metric constraints, and Strategy [2] only prescribes that the temporal distance between the labeling and the display act should be at least 2 time units. In contrast to earlier work (cf. [André & Rist, 93]), we do not represent the effect of a strategy in terms of the user's mental state. Instead we record the information that has already been presented. This method has proven to be more adequate in applications which present information retrieved from the web since we cannot rely on a complete


semantic specification of that information.

4.2 Automatic Creation of Presentation Scripts

To automatically create presentation scripts, the strategies introduced above are considered operators of a temporal planner (cf. [André & Rist, 96]) which is based on MATS (Metric/Allen Time System, cf. [Kautz & Ladkin, 91]). The basic idea behind the planning process is as follows: Given a presentation goal, try to find a matching strategy and post the inferior acts of this strategy as new subgoals. For each subgoal, create a local temporal constraint network which contains all qualitative and metric constraints corresponding to the applied strategy. In case a subgoal cannot be achieved or the temporal constraint network proves inconsistent, apply another matching strategy. The goal refinement process terminates if all goals are expanded to elementary acquisition or presentation acts or to goals that will be realized by hyperlinks in the final presentation and only be expanded on demand. The last step of the planning process is the creation of a schedule which reflects the temporal behavior of the presentation. The complete algorithm for designing a presentation script appears in Fig. 8. The planner receives as input a presentation goal. In case this goal is not yet satisfied (line 1), it creates a data structure for it (line 2) which contains the following information (a minimal sketch of such a planning node is given after the list):

The goal to be accomplished



Tried and untried alternatives The slot untried alternatives contains pairs of matching strategies and binding environments for which the applicability conditions of the corresponding strategy hold. In order to enable backtracking, the planner records in tried alternatives which strategies and binding environments have already been tried.



Local temporal constraint network For each node of the presentation plan, the planner creates a local constraint network which includes the temporal constraints of the corresponding plan operators.


In the following, we refer to these networks as Mats Systems. 

Predecessor Node



Successor Nodes
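The following Python fragment is a hedged reconstruction of such a planning node and of the expansion step described below; the actual planner is built on MATS, and all class, field and method names here are invented for illustration.

from dataclasses import dataclass, field
from typing import Any, List, Optional, Tuple

@dataclass
class PlanNode:
    goal: Any                                                    # the (sub)goal to be accomplished
    untried_alternatives: List[Tuple[Any, dict]] = field(default_factory=list)
    tried_alternatives: List[Tuple[Any, dict]] = field(default_factory=list)
    mats_system: Any = None                                      # local temporal constraint network
    predecessor: Optional["PlanNode"] = None
    successors: List["PlanNode"] = field(default_factory=list)

def expand(node: PlanNode) -> bool:
    """Try strategies until one is applicable and temporally consistent."""
    while node.untried_alternatives:
        strategy, bindings = node.untried_alternatives.pop(0)
        node.tried_alternatives.append((strategy, bindings))
        if (strategy.applicable(bindings)                        # applicability conditions hold
                and strategy.temporally_consistent(node.mats_system)):
            for act in strategy.inferiors(bindings):             # post inferior acts as subgoals
                node.successors.append(PlanNode(goal=act, predecessor=node))
            return True
    return False   # no alternative left: the caller backtracks to an earlier choice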

INSERT Figure 8 ABOUT HERE

The data structure for the user's information request forms the root node of the planning tree to be built up. To accomplish the goal, the planner tries to expand the root node (line 3). It iterates over the untried alternatives (line 7) until it finds a strategy whose applicability conditions are satisfied (line 8), whose inferior acts can be executed (line 9) and which is temporally consistent (line 10). To check the temporal consistency of a strategy, the planner copies the Mats System of the node into the MATS environment (line 13), adds all temporal constraints of the strategy (line 14), and updates the Mats System of the node if the resulting network proves consistent (line 15); for a description of this algorithm, see [Kautz & Ladkin, 91]. If the planner does not succeed in expanding a node, it tries the next applicable alternative (line 11 and line 7). Note that the system does not try to make unsatisfied applicability conditions true, e.g., by applying further strategies. When applying a strategy, the system iterates over all applicable binding environments (line 16) until it finds a binding environment which leads to the successful execution of all inferior acts (lines 17-20). If an act has not yet been satisfied, the function ExecuteOneAct is called (line 18). If the applicability conditions of the corresponding operator are satisfied (line 23), the planner starts with the execution of the act (lines 24-29). While complex acts that will not be realized as hyperlinks have to be further refined

(line 27), elementary acts are forwarded to the media-specific generators (currently, our system includes generators for text [Kilger, 94], graphics [Rist & André, 92] and animation [Butz, 97]), the retrieval components, or the Persona Engine (line 25). If a goal is realized as a hyperlink, the system generates a mouse-sensitive item for it. Furthermore, it creates a node in the navigation graph and specifies how this node can be reached from other nodes and vice versa. These conditions then correspond to the predicates associated with the edges of the navigation graph. After the completion of the presentation planning process, the system builds up a schedule which reflects the temporal behavior of a presentation (lines 31-35). It first determines for each planning node when the corresponding communicative act should start and end and how long it should take by propagating the constraints top-down and bottom-up in the planning tree (line 31). In Fig. 9, the left inner boxes refer to the starting point, the right inner boxes to the end point and the middle inner boxes to the duration of a communicative act. In the example, the exact interval endpoints are not known before the propagation process starts. Since the presentation starts at timepoint 0 and the act B1 takes at least 1 time unit, B1 does not end, and B2, which meets B1, does not start, before timepoint 1. By propagating the constraints associated with the subnodes of the B2 node, the planner finds out that the action B2 requires at least 4 time units. Since exactly 3 time units have been calculated for B3 and B3 takes place during B2, B3 cannot start before timepoint 2 or end before timepoint 5, and the minimal duration of B2 is 5 time units. Consequently, B2 cannot end before timepoint 6. Finally, the minimal duration of A1 is 6 because the endpoints of B2 and A1 and the starting points of B1 and A1 coincide.

INSERT Figure 9 ABOUT HERE
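For readers without access to Fig. 9, the propagation in this example can be summarized by the following inequalities (a reconstruction from the prose above, with start, end and dur denoting the start point, end point and duration of an interval):

\begin{align*}
\mathit{start}(B_1) = 0,\ \mathit{dur}(B_1) \ge 1
  \;&\Rightarrow\; \mathit{end}(B_1) \ge 1,\ \mathit{start}(B_2) \ge 1\\
\mathit{dur}(B_3) = 3,\ B_3 \ \text{during}\ B_2
  \;&\Rightarrow\; \mathit{start}(B_3) \ge 2,\ \mathit{end}(B_3) \ge 5,\ \mathit{dur}(B_2) \ge 5\\
\mathit{end}(B_2) \ge 6
  \;&\Rightarrow\; \mathit{dur}(A_1) \ge 6 \quad (\text{since } A_1 \text{ starts with } B_1 \text{ and ends with } B_2)
\end{align*}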


In line 32 of the algorithm, a global temporal constraint network is built up by collecting all intervals of the single planning nodes and the corresponding temporal constraints. This global constraint network is bound to the variable MatsSys (line 33). After that, schedules are built up by resolving all disjunctive temporal relationships between intervals (line 34) and computing a total temporal order (line 35) for each Mats System. For example, the schedules shown in Fig. 10 would be created for a network containing the constraints (A (before after) B), (A (equals) C), (1 ≤ Duration A ≤ 1) and (1 ≤ Duration B ≤ 1).

INSERT Figure 10 ABOUT HERE

Since the temporal behavior of presentation acts may be unpredictable at design time, the system can only build up a preliminary schedule which has to be refined at runtime. That is, for some communicative acts, it only indicates an interval within which they may start or end instead of prescribing an exact timepoint. The temporal behavior of a presentation is controlled by a presentation clock which is set to 0 when the system starts to show the planned material to the user and incremented by the length of one time unit (the length of a time unit can be interactively changed by the user) until the presentation stops. For each timepoint, the planner indicates which events must or can take place. For instance, a communicative act whose starting point is between 0 and 2 may start at timepoint 0 or 1, but must start at timepoint 2 in case it has not started earlier. Whether the event actually takes place or not is decided by the PPP Persona. Currently, our system chooses the earliest possible timepoint. In order to satisfy the temporal constraints set up by the planner, the Persona may have to shorten a presentation, skip parts of it or make a pause. In some cases, this may lead to unsatisfactory presentations, e.g., if the Persona stops speaking in the middle of a sentence. As soon as the system has determined that a certain event should take place, the planner is notified and, if necessary, adds a new metric constraint to the global temporal constraint network and refines the schedule accordingly. Let's assume that the planner informs the Persona that the creation of an illustration should start at timepoint 0 and that the Persona may show it to the user at timepoint 1 or later. However, it turns out that 10 time units are required for the creation of the illustration. The Persona forwards this information to the planner, which adds (10 ≤ Duration Create-Illustration ≤ 10) to the global temporal constraint network. Since Create-Illustration meets Show-Illustration, the display of the illustration can start only at timepoint 10.
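The interplay of presentation clock, Persona and planner described above can be summarized in a short sketch. The names (acts_that_may_start, add_metric_constraint, refine_schedule) are invented for illustration and do not mirror the actual implementation.

def run_presentation(schedule, planner, persona):
    """Illustrative runtime loop that refines a preliminary schedule."""
    clock = 0                                          # presentation clock in abstract time units
    while not schedule.finished(clock):
        for act in schedule.acts_that_may_start(clock):
            if persona.decides_to_start(act, clock):   # currently: the earliest possible timepoint
                actual_duration = persona.execute(act) # may differ from the planned duration
                planner.add_metric_constraint(act, duration=actual_duration)
                schedule = planner.refine_schedule()   # propagate the new constraint
        clock += 1                                     # the real length of a unit is user-configurable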

5 Conception of the Persona Engine

The generated presentation scripts are forwarded to the Persona Engine for execution. However, in contrast to other display components for media objects (e.g., video or audio players, graphics viewers, etc.), the output of the Persona Engine is not only determined by the directives (i.e., presentation tasks) specified in the script. As stated in Section 2, directives have to be smoothly merged with the Persona's characteristic self-behaviors, such as navigation and idle-time acts.

5.1 Context-Sensitive Decomposition of Directives

The directives of a presentation script do not necessarily correspond to primitive Persona acts, but require a further decomposition for their execution at runtime depending on the current situation. For example, during presentation planning, we do not care about the character's possible location on the screen. If the planner decides that the character should perform a pointing act to a certain object, the corresponding presentation script will only contain the directive S-Point with the object and the time interval as parameters. In case the Persona is reasonably close to the object, it may immediately perform a pointing gesture using its left or right arm, depending on whether the object is on the left or on the right. In other situations, however, the Persona may be too far away from the object. Rather than simply using a telescope pointer, the Persona should perform some navigation acts to achieve a more believable behavior (see [Lester et al., 98]). Fig. 11 shows such a context-sensitive decomposition of a pointing act into an animation sequence.

INSERT Figure 11 ABOUT HERE
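A minimal sketch of such a decomposition step, assuming a simple horizontal distance threshold and invented act names (walk-to, look-at, point-with), is given below; the actual Persona Engine derives such sequences from its compiled behavior space rather than from hand-written rules like these.

def decompose_point(persona_pos, target_pos, reach=80):
    """Expand the directive S-Point into a context-dependent sequence of primitive acts."""
    acts = []
    dx = target_pos[0] - persona_pos[0]
    if abs(dx) > reach:                                     # too far away: navigate first
        acts.append(("walk-to", (target_pos[0] - reach, persona_pos[1])))
    arm = "right" if dx >= 0 else "left"                    # choose the arm facing the target
    acts.append(("look-at", target_pos))                    # restrict the visual focus
    acts.append(("point-with", arm, target_pos))
    return acts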

5.2 Defining Persona Behaviors

To facilitate the definition of Persona behaviors, we have developed a declarative specification language relying on standard representation constructs common in AI planning (e.g., see [Allen et al., 90]). For instance, the following definition specifies the pre- and postconditions for the action bottomupjumping.

[1]

(defprimitive bottomupjumping
  :pre ((leftarm standard) (rightarm standard)
        (iconified no) (bodydir front)
        (bodypos stand) (stick off))
  :post ((posy -= 1))
  :gesture 42)

The action can only be performed if both arms are in a standard position, the Persona

is not iconified, faces the user, is standing, and does not hold a stick. If this is the case, the image sequence associated with the action (:gesture 42) is played and the y-position of the Persona is updated as indicated in the postconditions (:post), i.e., decreased by 1. Otherwise, the system tries to achieve the desired preconditions by executing further actions.


While primitive actions like bottomupjumping are directly associated with an image sequence, complex actions are composed of several subactions. An example of a complex action is:

[2]

(defactionseq MoveUp
  :pre ((icon noicon) (bodydir front)
        (leftarm standard) (rightarm standard)
        (bodypos stand) (stick off))
  :prim startbottomupjumping
  :while ((posy ≠ target)
          :prim bottomupjumping)
  :prim endbottomupjumping)

This definition specifies a jump to a given target. The preconditions of this action coincide with the preconditions of bottomupjumping. If they are satisfied, the Persona starts the jump (startbottomupjumping) and continues to jump upwards as long as the target has not been reached ((posy ≠ target)). After that, it finishes the jump (endbottomupjumping).

5.3 Compiling Behaviors

Since animations have to be performed in real time (and our system should run on ordinary PCs/workstations as well), it is not advisable to decompose actions into animation sequences at runtime. Following [Ball et al., 97], we have developed a multi-pass compiler that enables the automated generation of a finite state machine from declarative action specifications; the state machine is then converted to efficient machine code (cf. Fig. 12). That is, we compute beforehand, for all possible situations, which animation sequence to play. As a result, the system just has to follow the paths of the state machine when making a decision at runtime.


INSERT Figure 12 ABOUT HERE

When creating the source code for the finite state machine (we are able to generate both C and Java code), action specifications are translated as follows:

Primitive Actions, such as bottomupjumping, are mapped onto functions in which (1) a function is called to achieve the precondition of the action, (2) a command for playing an image sequence is executed and (3) the settings of the state variables are updated.



Complex Actions, such as MoveUp, are mapped onto functions which may invoke other functions according to their control structures. For example, the middle part of Fig. 12 lists the source code for the action MoveUp. First, the function get_in_stateMOVEUP is called to satisfy the precondition of the action. Next, the function for the primitive action startbottomupjumping is applied. This function is followed by a while-statement which repeatedly invokes the function BOTTOMUPJUMPING until the vertical distance between the Persona's current position and the target position is less than or equal to 50 screen units. Finally, the function for the primitive action endbottomupjumping is called. (A hedged sketch of such a generated function is given after this list.)

Idle-Time Actions are mapped onto functions which apply heuristics to select an idle-time script, play it and update the state variables.
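Since Fig. 12 is not reproduced here, the fragment below gives a rough impression of what the generated code for MoveUp might look like. It is a sketch in Python rather than the generated C or Java code, the threshold of 50 screen units is taken from the description above, and all function names are invented.

def play(gesture, state, **updates):
    """Stand-in for playing an image sequence and updating the state variables."""
    state.update(updates)

def BOTTOMUPJUMPING(state):      play("gesture-42", state, posy=state["posy"] - 1)
def STARTBOTTOMUPJUMPING(state): play("start-jump", state)
def ENDBOTTOMUPJUMPING(state):   play("end-jump", state)
def LOWERLEFTARM1(state):        play("lower-left-arm", state, leftarm="standard")
def LOWERRIGHTARM1(state):       play("lower-right-arm", state, rightarm="standard")

def get_in_stateMOVEUP(state):
    # compiled from the regression-planning tuples (see below): if-then-else
    # blocks that bring an arbitrary start state into the preconditions of MoveUp
    if state["leftarm"] == "up" and state["rightarm"] == "up":
        LOWERRIGHTARM1(state); LOWERLEFTARM1(state)
    elif state["leftarm"] == "up":
        LOWERLEFTARM1(state)
    # ... further cases omitted

def MOVEUP(state, target_y):
    get_in_stateMOVEUP(state)                       # satisfy the preconditions
    STARTBOTTOMUPJUMPING(state)
    while state["posy"] - target_y > 50:            # more than 50 screen units below the target
        BOTTOMUPJUMPING(state)                      # plays the jump frames, decrements posy
    ENDBOTTOMUPJUMPING(state)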

The next step is the definition of functions for achieving the preconditions specified in the action definitions. In particular, we have to compute action sequences which transform each possible state in which the system might be into these preconditions. For instance, we have to generate the code for the function get_in_stateMOVEUP, which establishes the preconditions for the action MoveUp. This is done by regression-based planning. For each precondition specified in an action definition, we apply primitive actions in reverse order until all possible start states have been achieved. The result of this process is a list of tuples that consist of a state and an action sequence which has to be performed in this state to satisfy the precondition. For instance, the following tuples would be generated for the action MoveUp:

(((icon noicon) (bodydir front) (leftarm standard) (rightarm standard) (bodypos stand) (stick off)) ())
((... (Leftarm Up) (Rightarm Standard) ...) LowerLeftArm1)
((... (Leftarm Up) (Rightarm Up) ...) (LowerRightArm1 LowerLeftArm1))

If all preconditions are already satisfied (first tuple), nothing has to be done. Therefore, the corresponding action sequence is empty. In case the Persona has raised its left arm, it has to lower it before moving up (second tuple) and so on. These tuples are converted to if-then-else program blocks which are then compiled into efficient machine code.
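The regression step itself can be sketched as a small backwards search over the primitive actions. This is an illustrative reconstruction under simplifying assumptions (STRIPS-like action models, bounded depth), not a description of the actual compiler.

from collections import deque

def regress(goal_preconds: dict, primitives: dict, max_depth: int = 3):
    """Enumerate (state, action sequence) tuples that reach the goal preconditions.

    primitives maps an action name to a (preconditions, postconditions) pair,
    each given as a dict of state variables; the action models are hypothetical."""
    results = [(dict(goal_preconds), [])]          # the goal state itself needs no actions
    frontier = deque(results)
    while frontier:
        state, plan = frontier.popleft()
        if len(plan) >= max_depth:
            continue
        for name, (pre, post) in primitives.items():
            # an action is relevant if its postconditions hold in the regressed state
            if all(state.get(var) == val for var, val in post.items()):
                prev = dict(state)
                prev.update(pre)                   # state before the action was applied
                entry = (prev, [name] + plan)
                results.append(entry)
                frontier.append(entry)
    return results                                 # later compiled into if-then-else blocks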

5.4 Architecture of the Persona Engine

The compiled state machine forms the so-called behavior monitor. Besides the behavior monitor, the Persona Engine also comprises an event handler, a character composer, and a platform interface which is tailored to the target platform (cf. Fig. 13).

INSERT Figure 13 ABOUT HERE


The task of the event handler is to recognize whether input derived from the platform interface needs immediate responses from the Persona. That is, for each input message the event handler checks whether the message triggers one of the so-called "reactive behaviors" stored in an internal knowledge base. If this is the case, the selected behavior is made accessible to the behavior monitor. Depending on the application, notifications may be forwarded to the application program, too. For example, in our PPP system, some events are interpreted as requests for the satisfaction of new presentation goals and thus activate a presentation planner (hence the dotted line in Fig. 13). The postures determined by the behavior monitor are forwarded to a character composer which selects the corresponding frames (video frames or drawn images) from an indexed database, and forwards the display commands to the window system. Implementations of the Persona Engine are currently available for Unix platforms running X11, and for Java-enhanced WWW browsers. In the X version, the Persona server builds upon the X11 library and the X11 Shape extension (cf. [Packard, 89]), which allows for the definition of non-rectangular regions in an otherwise invisible window. To use this feature for the graphical realization of the Persona, we first create an invisible window that covers the whole screen. This ensures that, in case this window lies on top of the window stack, all other windows below still remain activated for mouse and keyboard input. Second, single postures of the Persona as well as requisites such as pointing sticks are drawn into the invisible window. Since items drawn into the invisible window remain invisible unless the regions they cover were masked before, we create bitmasks of the same shape as the items. For the Persona postures (images or video frames), these masks are computed in a preprocessing phase and stored in the X server's memory. Bitmasks for other items, e.g., pointing sticks, are computed only on demand at runtime.

For WWW applications, the user downloads an instance of the Persona Engine in the form of a Java applet. In contrast to the X version, the spatial action radius of the Persona is restricted to the Java canvas of the web page. The animation is simply done by


bitplotting the corresponding frames onto the canvas. Since Java supports transparency, additional masking as in the X version is not required.

6 Evaluation

Our research on animated interface agents was motivated by the assumption that they make man-machine communication more effective. In order to find empirical support for this conjecture, we conducted an empirical study with 30 adult participants (see also [Mulken et al., 98]). Our study focused on two issues: 1) the effect of a Persona on the subjects' rating of the presentations (a subjective measure), and 2) its effect on the subjects' comprehension of presentations (an objective measure). We assumed that three effects on comprehension and recall might emerge:
1. The Persona contributes to the comprehension and recall of presentations because of its strong motivational impact.
2. The Persona has a negative effect on the comprehension and recall of presentations because it distracts the user.
3. There is no effect of the Persona on comprehension or recall, either because the Persona neither motivates nor distracts the user or because these two factors compensate for each other.
Since earlier studies have already provided evidence that a life-like character has a strong affective impact on the user (e.g., see [Takeuchi & Nagao, 93, Walker et al., 94, Lester et al., 97]), we expected that this effect would also occur in our case.

6.1 Experimental Setting

Participants were 15 females and 15 males, on average 28 years of age, all native speakers of German and recruited from the Saarbrücken university campus. Most of them were not computer specialists, but all of them had some experience in using computers for web surfing and editing purposes. The subjects were confronted with 5 web-based presentations and subsequently asked questions about them. They were allowed to spend as much time as they required to answer the questions, but were not allowed to watch a presentation several times. On average, each subject spent 45 minutes on the experiment. In the experiment, two variables were varied. The first variable referred to the Persona itself: the Persona was either absent or present. In the experiment without the Persona, a voice spoke the same explanations as in the Persona version, and the pointing gestures of the Persona were replaced with an arrow. That is, the Non-Persona version conveyed exactly the same information as the Persona version. This was important because we were interested in the effect of the mere presence of a Persona. The second variable was the information type. Subjects were confronted with technical descriptions of pulley systems and with person descriptions (i.e., information about DFKI employees). In the first case, we showed the subjects illustrations of four different pulley systems and conveyed information concerning the parts of the pulley systems and their kinematics auditorily. Whenever a part of the pulley system was mentioned, a pointing gesture was performed. For the condition with non-technical material, we designed a presentation in which 10 fictitious office employees were introduced. For each employee, his or her photograph was shown and information concerning his or her name and occupation was conveyed auditorily. Furthermore, the employee's office was shown using a map of the office floor. The first variable was manipulated between subjects, while the second variable was manipulated within subjects. Thus, each subject viewed either presentations with or without the Persona, but was confronted with both kinds of presentation. Neither of the two groups knew about the existence of the other. The Persona's learning effect was measured by comprehension and recall questions following the presentations. For the technical scenario, the subjects had to answer questions such as: "Which objects does the red rope touch?" or "In which direction does


the lower pulley move if the free end of the red rope is pulled down?". For the office experiment, we presented the subjects with photographs of office employees and a layout of the office floor. The subjects had to recall the employees' names, occupations and office numbers. The Persona's affective impact was measured through a questionnaire at the end of the experiment. Part A of the questionnaire contained general questions on the presentation, such as "Was the presentation difficult to understand?" or "Did you find the presentation entertaining?", while part B contained specific questions on the Persona, such as "How appropriate was the Persona's behavior?", "Did the Persona help you in concentrating on relevant information?" or "Would you prefer presentations with or without a Persona in the future?".

6.2 Experimental Results

Regarding our first objective, the evaluation of the Persona's affective impact, our study revealed a positive effect (cf. Tables 1 and 2; because of technical problems, the data of two subjects had to be discarded). Only one subject indicated that he/she would prefer presentations without a Persona in the future. T-tests on the data listed in Table 1 show that subjects confronted with technical descriptions found the presentation significantly less difficult to understand (t(26)=-2.51; p=.0186) and more entertaining (t(26)=-2.38; p=.0247) if the material was presented by the Persona.

INSERT Table 1 ABOUT HERE
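The t-values reported in this section can be reproduced with a standard two-sample t-test on the questionnaire ratings. The following sketch is illustrative only and not part of the described system: the rating arrays are hypothetical, and it assumes a pooled-variance (Student) t-statistic with n1+n2-2 degrees of freedom, which matches the reported t(26) for two between-subjects groups of 14 subjects each.

// Minimal sketch (not part of the original system): pooled-variance two-sample
// t-test over questionnaire ratings. The rating arrays below are hypothetical.
public class TTestSketch {

    static double mean(double[] x) {
        double s = 0;
        for (double v : x) s += v;
        return s / x.length;
    }

    static double variance(double[] x, double m) {
        double s = 0;
        for (double v : x) s += (v - m) * (v - m);
        return s / (x.length - 1);           // unbiased sample variance
    }

    // Student's t for independent samples with pooled variance;
    // degrees of freedom are a.length + b.length - 2.
    static double tStatistic(double[] a, double[] b) {
        double ma = mean(a), mb = mean(b);
        double pooled = ((a.length - 1) * variance(a, ma)
                       + (b.length - 1) * variance(b, mb))
                      / (a.length + b.length - 2);
        double se = Math.sqrt(pooled * (1.0 / a.length + 1.0 / b.length));
        return (ma - mb) / se;
    }

    public static void main(String[] args) {
        // Hypothetical "presentation difficult" ratings (0-4 scale),
        // one value per subject in each between-subjects group.
        double[] noPersona = {2, 1, 2, 3, 1, 2, 2, 1, 2, 1, 2, 1, 2, 1};
        double[] persona   = {1, 1, 0, 1, 2, 1, 1, 0, 1, 2, 1, 1, 1, 1};
        System.out.printf("t(%d) = %.2f%n",
                noPersona.length + persona.length - 2,
                tStatistic(noPersona, persona));
    }
}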

In the case of the office experiment, we didn't find a significant difference between the ratings of the difficulty of the presentation and its entertaining value. Subjects also found the Persona's behavior less appropriate in this domain and felt that the Persona was less helpful as a concentration aid (cf. Table 2). We hypothesize that the less positive result for the non-technical domain is due to the fact that the Persona's realization as a workman is more appropriate to technical descriptions than to person-related descriptions.

INSERT Table 2 ABOUT HERE

Concerning the Persona's learning effect, we didn't find a significant difference between the Persona and the No-Persona version (t(26)=-.73; p=.47 for the technical domain, t(26)=.82; p=.42 for the non-technical domain). That is, the Persona contributed neither to the subjects' comprehension of the technical matters in the pulley experiment nor to their recall performance in the second experiment (see Table 3).

INSERT Table 3 ABOUT HERE

A possible reason is that we only exploited Persona behaviors which can easily be replaced with other means of communication that do not necessarily require the presence of a Persona. In our experiments, Persona gestures were restricted to neutral facial expressions (i.e., head and eye movements towards the objects currently being explained and lip movements indicating that the Persona is speaking), pointing gestures and simple idle-time actions, such as breathing or tapping with a foot. On the other hand, initial concerns that people would be distracted by the Persona and concentrate too much on the Persona's facial expressions instead of looking at the referents of the pointing gestures were not confirmed. In the questionnaire, all subjects indicated that they were not distracted by the Persona (third row of Table 2).

7 Related Work

A considerable amount of work on life-like characters has concentrated on computer graphics issues, such as the modeling and simulation of synthetic humans, animals and fantasy creatures. In this section, however, we restrict ourselves to a review of previous approaches for authoring and controlling the behavior of life-like characters.

Closely related to our work is Microsoft's Persona project, which uses a parrot named Peedy as an interface agent (cf. [Ball et al., 97]). Nevertheless, Peedy is an anthropomorphic character in the sense that it interacts with the user in a natural-language dialogue and also mimics some non-verbal (human) communicative acts; e.g., Peedy raises a wing to its ear in case speech recognition fails. Since Peedy is to act as a conversational assistant (at least for the sample application, a retrieval system for music CDs), the system comprises components for processing spoken language, dialogue management and the generation of audio-visual output. However, the system doesn't have to create presentation scripts since the presentation of material is restricted to playing the selected CDs.

Lester and Stone [Stone & Lester, 96] developed a coherence-based behavior sequencing engine to control the behavior of Herman the Bug, the pedagogical agent of Design a Plant. This engine dynamically selects and assembles behaviors from a behavior space consisting of animated segments and audio clips. This material has been manually designed by a multidisciplinary team of graphic artists, animators, musicians and voice specialists. On the one hand, this allows the authoring of high-quality presentations because the human author has much control over the material to be presented. On the other hand, enormous effort by the human author is required to produce the basic repertoire of a course. In contrast to their work, our approach aims at a higher degree of automation. The basic animation units from which a presentation is built correspond to very elementary actions, such as taking a step or lifting one's arm, which are flexibly combined by the Persona Engine. Furthermore, we don't rely on prestored audio clips, but use a speech synthesizer to produce verbal output.

Rickel and Johnson [Rickel & Johnson, 98] developed a pedagogical agent called Steve

based on the Jack software, a tool for modeling 3D virtual humans [Badler et al., 93]. Instead of creating animation sequences for a course offline and putting them together dynamically as in Design a Plant, the 3D character Steve is directly controlled by commands such as “look at”, “walk to” or “grasp an object”. In this case, the character interacts with virtual objects in the same way as a human would do in a real environment with direct access to the objects. In contrast to this, our system strictly distinguishes between domain and presentation objects. That is, the PPP Persona is part of a multimedia presentation and interacts with domain objects via their depictions or descriptions. This resembles a setting in which a tutor presents and comments on slides or transparencies. Similar applications were described by Noma and Badler [Noma & Badler, 97], who developed a virtual human-like presenter based on the Jack software, and by Thalmann and Kalra [Thalmann & Kalra, 95], who produced some animation sequences for a virtual character acting as a television presenter. While the production of animation sequences for the TV presenter requires a lot of manual effort, the Jack presenter receives input at a higher level of abstraction. Essentially, this input consists of text to be uttered by the presenter and commands, such as pointing and rejecting, which refer to the presenter's body language. Nevertheless, the human author still has to specify the presentation script, while our system computes it automatically starting from a complex presentation goal. However, since our presentation planner is application-independent, it may also be used to generate presentation scripts for the Jack presenter or the TV presenter.

Perlin and Goldberg [Perlin & Goldberg, 96] developed an “English-style” scripting language called IMPROV for authoring the behavior of animated actors. To a certain extent, the library of agent scripts in their approach can be compared to the repertoire of presentation strategies in our approach since both allow for the organization of behaviors into groups. However, their scripts are represented as a sequence of actions or other scripts, while we exploit the full set of Allen relationships. A novelty of our system is that it doesn't require the human author to specify the desired temporal constraints between the single presentation acts, but computes this information dynamically from a complex presentation goal. Furthermore, our system not only designs presentation

scripts, but also assembles the multimedia material to be presented to the user.

Besides work on the technical realization of characters, empirical studies on the social interaction between humans and computers are of interest. Nass and colleagues have shown that people tend to treat computers like human beings even if the computer interface is not explicitly anthropomorphic (cf. [Nass et al., 94]). The group around Hayes-Roth analyzed the interaction of children with animated puppets [Huard & Hayes-Roth, 97, Maldonado et al., 98] and the interaction of adult users with a bartender called Erin and other customers in a virtual bar (cf. [Isbister & Hayes-Roth, 97]). Though it is not always possible to directly transfer such findings to a completely different application, they all provide valuable inspiration for the definition of a character's behavior.

8 Technical Data

The Persona Engine is implemented in Java and C++. It relies on about 250 frames for each Persona (currently, we use two cartoon personas and three real personas composed of grabbed video material). To control the behavior of the personas, more than 150 different behaviors have been defined. The presentation planner and the Persona Compiler are implemented in Allegro Common Lisp. To plan presentation scripts, about 70 presentation strategies have been defined.

Concerning the runtime behavior of the whole system, we have to distinguish between the information gathering phase, the script generation phase, and the presentation display phase. Relying on the WWW for information gathering is the most time-consuming and least predictable phase. Like all other WWW users, our system often has to wait rather long until it gets connected to a certain web site and can load data from it (or until it receives a time-out in case a server is down). In our travel agent scenario, up to 10 different web sites are visited, partly in parallel. Information gathering usually takes between one and ten minutes.
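As an illustration of this phase, the following sketch fetches several sources in parallel and simply skips those that time out. It is not part of the described system; the URLs, thread-pool size and timeouts are hypothetical, and it uses the standard Java networking and concurrency APIs rather than the original implementation.

// Minimal sketch (not part of the original system): parallel information
// gathering with per-request timeouts. All URLs below are hypothetical.
import java.io.IOException;
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.*;

public class GatherSketch {

    // Download one page, giving up after the given timeout (milliseconds).
    static String fetch(String url, int timeoutMs) throws IOException {
        HttpURLConnection con = (HttpURLConnection) new URL(url).openConnection();
        con.setConnectTimeout(timeoutMs);
        con.setReadTimeout(timeoutMs);
        try (InputStream in = con.getInputStream()) {
            return new String(in.readAllBytes());
        }
    }

    public static void main(String[] args) throws InterruptedException {
        List<String> sources = List.of(        // hypothetical travel-related sources
                "http://www.example.org/weather/frankfurt",
                "http://www.example.org/hotels/frankfurt",
                "http://www.example.org/events/frankfurt");

        ExecutorService pool = Executors.newFixedThreadPool(4);
        List<Future<String>> pages = new ArrayList<>();
        for (String u : sources) {
            Callable<String> task = () -> fetch(u, 30_000);   // 30 s per request
            pages.add(pool.submit(task));
        }

        for (int i = 0; i < pages.size(); i++) {
            try {
                String page = pages.get(i).get(60, TimeUnit.SECONDS);
                System.out.println(sources.get(i) + ": " + page.length() + " bytes");
            } catch (ExecutionException | TimeoutException e) {
                System.out.println(sources.get(i) + ": skipped (" + e + ")");
            }
        }
        pool.shutdownNow();
    }
}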

Since we treat script generation as a combinatorial problem, the computational complexity of this process can be exponential in the general case. It is important to note that the modeling of the design knowledge, i.e., the way it is encoded in the design strategies, can have a dramatic effect on the runtime behavior of the planning process. For example, rather than striving for a minimal set of presentation strategies with almost no redundancy, it is more advisable to anticipate some of the search steps and to add more dedicated strategies for specific presentation tasks. Such dedicated strategies represent a kind of pre-compiled task decomposition. In the current version of our system, the decomposition of a typical presentation task and the generation of the material (e.g., generating audio, scaling images, etc.) usually require only a few seconds on a Sun 20 workstation. This effort can be neglected in view of the much more time-consuming information gathering phase.

During the presentation display phase, the update of the temporal constraint network is the most expensive subtask. That is, after the execution of presentation acts we have to add the concrete instantiations of their start and end points to the network of temporal constraints and compute new numeric ranges over the time points. By restricting the computation to the freshly changed links in the network, the update procedure is still fast enough to guarantee fluent character animations. Sample presentations of our system are available via the Persona's web page: http://www.dfki.de/imedia/java/applets/persona/.
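To make the idea of such an incremental update concrete, the following sketch propagates metric bounds in a small temporal network after one time point has been instantiated, re-examining only constraints incident to freshly changed points. It is a simplified illustration under our own assumptions (plain difference constraints and a work-list loop), not the temporal reasoner actually used in the system.

// Minimal sketch (not part of the original system): incremental propagation
// of numeric ranges in a temporal network. Each constraint bounds the
// distance t[j] - t[i]; after a time point is fixed to its observed value,
// only constraints incident to changed points are re-examined.
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Deque;
import java.util.List;

public class TemporalNetSketch {

    record Constraint(int i, int j, double min, double max) {}  // min <= t[j]-t[i] <= max

    final double[] lo, hi;                     // numeric ranges of the time points
    final List<Constraint> constraints = new ArrayList<>();

    TemporalNetSketch(int points) {
        lo = new double[points];
        hi = new double[points];
        Arrays.fill(hi, Double.POSITIVE_INFINITY);   // initially every point lies in [0, +inf)
    }

    void addConstraint(int i, int j, double min, double max) {
        constraints.add(new Constraint(i, j, min, max));
    }

    // Fix time point p to its observed value and propagate from the change.
    void observe(int p, double value) {
        lo[p] = hi[p] = value;
        propagateFrom(p);
    }

    private void propagateFrom(int start) {
        Deque<Integer> work = new ArrayDeque<>();
        work.push(start);
        while (!work.isEmpty()) {
            int p = work.pop();
            for (Constraint c : constraints) {
                if (c.i() != p && c.j() != p) continue;   // only freshly changed links
                // tighten t[j] from t[i] and t[i] from t[j]
                if (tighten(c.j(), lo[c.i()] + c.min(), hi[c.i()] + c.max())) work.push(c.j());
                if (tighten(c.i(), lo[c.j()] - c.max(), hi[c.j()] - c.min())) work.push(c.i());
            }
        }
    }

    private boolean tighten(int p, double newLo, double newHi) {
        boolean changed = false;
        if (newLo > lo[p]) { lo[p] = newLo; changed = true; }
        if (newHi < hi[p]) { hi[p] = newHi; changed = true; }
        if (lo[p] > hi[p]) throw new IllegalStateException("inconsistent schedule");
        return changed;
    }

    public static void main(String[] args) {
        // Two acts: Speak has points 0 (start) and 1 (end), Point has points 2 and 3.
        TemporalNetSketch net = new TemporalNetSketch(4);
        net.addConstraint(0, 1, 2, 5);   // Speak lasts between 2 and 5 seconds
        net.addConstraint(1, 2, 0, 0);   // Point starts exactly when Speak ends ("meets")
        net.addConstraint(2, 3, 1, 3);   // Point lasts between 1 and 3 seconds
        net.observe(1, 4.0);             // Speak actually ended at t = 4
        System.out.println("Point start in [" + net.lo[2] + ", " + net.hi[2] + "]");
        System.out.println("Point end   in [" + net.lo[3] + ", " + net.hi[3] + "]");
    }
}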

9 Conclusion

In this article we have described our efforts to develop life-like presentation agents which can be utilized for a broad range of applications including personalized information delivery from the WWW. The central characteristic of our approach is that we distinguish between a character-dependent how-to-present task and an application-dependent what-to-present task. The two tasks are reflected in our system design by a Persona Engine and a Presentation Planner.

The Persona Engine not only executes directives which it receives from an application client, but also implements a basic behavior independent of the applications it serves. Besides character-specific navigation acts, this basic behavior comprises idle-time actions and immediate reactions to events occurring at the user interface. Both action types have to be supported in order to obtain a lively and appealing presentation agent. Due to a built-in mechanism for action specialization and decomposition, application clients can request presentation scripts at a high level of abstraction.

By including the Presentation Planner, we also address the automated generation of presentation scripts, since the manual scripting of agent behaviors is tedious, error-prone, and sometimes even unfeasible. Our current prototype is capable of generating both presentation scripts for life-like characters and navigation structures which allow the user to dynamically change the course of a presentation at runtime. This was achieved by combining our previous work on presentation planning in the WIP system (cf. [André & Rist, 93]) with a module for temporal reasoning. We would like to emphasize that the new system relies on a simpler decomposition mechanism than WIP. In WIP, all parts of the presentation (i.e., text paragraphs and graphical illustrations) were planned from internal representation units (nodes of a semantic network and 3D object models). In particular, the design of graphics frequently required replanning because of unforeseen side effects, such as occlusions after adding an additional object to a scene. In contrast, the system described here fetches most presentation parts from the web and tries to reuse them in a new presentation.

Clearly, the overall quality of the Persona's presentations depends to a large extent on the information gathered from the web. Unfortunately, a presentation agent cannot anticipate which information will be available on the web, i.e., it also has to operate in unknown environments. There are several approaches to tackling this issue. One direction is to rely on sophisticated methods for information retrieval and extraction. However, we are still far away from robust approaches capable of analyzing arbitrary web pages consisting of heterogeneous media objects, such as text, images and video. Another approach uses so-called annotated environments (see [Doyle & Hayes-Roth, 98]), which provide the

knowledge agents need to appropriately perform their tasks. These annotations can be compared to the markup of a web page. Our hope is that, with the increasing popularity of agents, a standard for such annotations will be developed which will significantly ease the presentation planning process.

An empirical study of our system revealed an affective impact of a Persona, even on adult users. Our subjects perceived the Persona as being helpful and entertaining. Furthermore, they rated learning tasks presented by the Persona as less difficult than those without a life-like character. However, this effect obviously does not occur in all applications, and users seem to have clear preferences about when to have a personified agent in the interface. Thus, user interface designers should take into account not only inter-individual, but also intra-individual differences.

10 Acknowledgments

This work has been supported by the German Federal Ministry of Education, Science, Research and Technology (BMBF) under the contracts ITW 9400 7 and 9701 0. We would like to thank Peter Rist for drawing the cartoons, H.-J. Profitlich and M. Metzler for the development of the temporal reasoner, Frank Biringer for implementing the Persona Compiler, Wolfgang Pöhlmann for his help with the design of the presentations for the empirical evaluation, and Susanne van Mulken for supervising the empirical study. In addition, we are grateful for the comments of the anonymous reviewers.

References

[Allen et al., 90] J. Allen, J. Hendler, and A. Tate (eds.). Readings in Planning. San Mateo, California: Morgan Kaufmann, 1990.

[Allen, 83] J. F. Allen. Maintaining Knowledge about Temporal Intervals. Communications of the ACM, 26(11):832–843, 1983.

[André & Rist, 93] E. André and T. Rist. The Design of Illustrated Documents as a Planning Task. In: M. Maybury (ed.), Intelligent Multimedia Interfaces, pp. 94–116. AAAI Press, 1993.

[André & Rist, 96] E. André and T. Rist. Coping with Temporal Constraints in Multimedia Presentation Planning. In: Proc. of AAAI-96, volume 1, pp. 142–147, Portland, Oregon, 1996.

[André et al., 93] E. André, W. Finkler, W. Graf, T. Rist, A. Schauder, and W. Wahlster. WIP: The Automatic Synthesis of Multimodal Presentations. In: M. Maybury (ed.), Intelligent Multimedia Interfaces, pp. 75–93. AAAI Press, 1993.

[Badler et al., 93] N.I. Badler, C.B. Phillips, and B.L. Webber. Simulating Humans: Computer Graphics, Animation and Control. New York, Oxford: Oxford University Press, 1993.

[Ball et al., 97] G. Ball, D. Ling, D. Kurlander, J. Miller, D. Pugh, T. Skelly, A. Stankosky, D. Thiel, M. van Dantzich, and T. Wax. Lifelike Computer Characters: the Persona project at Microsoft. In: J.M. Bradshaw (ed.), Software Agents, pp. 191–222. AAAI/MIT Press, Menlo Park, CA, 1997.

[Ball, 96] G. Ball. Dialogue Initiative in a Web Assistant. In: Proc. of Life-Like Computer Characters '96, Snowbird, Utah, 1996.

[Butz, 97] A. Butz. Anymation with CATHI. In: Proc. of AAAI-97, Providence, 1997.

[Doyle & Hayes-Roth, 98] P. Doyle and B. Hayes-Roth. Agents in Annotated Worlds. In: Proceedings of the Second International Conference on Autonomous Agents (Agents '98), pp. 173–180, Minneapolis/St. Paul, 1998.

[Hardman et al., 94] L. Hardman, D.C.A. Bulterman, and G. van Rossum. The Amsterdam Hypermedia Model: Adding Time and Context to the Dexter Model. Communications of the ACM, 37(2):50–62, 1994.

[Hardman et al., 97] L. Hardman, M. Worring, and D. Bulterman. Integrating the Amsterdam Hypermedia Model with the Standard Reference Model for Intelligent Multimedia Presentation Systems. Computer Standards and Interfaces, 18(6-7):497–507, 1997.

[Huard & Hayes-Roth, 97] R. Huard and B. Hayes-Roth. Character Mastery with Improvisational Puppets. In: Proc. of the IJCAI-97 Workshop on Animated Interface Agents: Making them Intelligent, pp. 85–89, Nagoya, 1997.

[Isbister & Hayes-Roth, 97] K. Isbister and B. Hayes-Roth. Social Implications of Using Synthetic Characters. In: Proc. of the IJCAI-97 Workshop on Animated Interface Agents: Making them Intelligent, pp. 19–20, Nagoya, 1997.

[Kautz & Ladkin, 91] H. A. Kautz and P. B. Ladkin. Integrating Metric and Qualitative Temporal Reasoning. In: Proc. of AAAI-91, pp. 241–246, 1991.

[Kilger, 94] A. Kilger. Using UTAGs for Incremental and Parallel Generation. Computational Intelligence, 10(4):591–603, 1994.

[Lester et al., 97] J.C. Lester, S. Converse, S. Kahler, T. Barlow, B. Stone, and R. Bhogal. The Persona Effect: Affective Impact of Animated Pedagogical Agents. In: Proceedings of CHI'97, pp. 359–366, Atlanta, 1997.

[Lester et al., 98] J. Lester, J.L. Voerman, S.G. Towns, and C.B. Callaway. Deictic Believability: Coordinated Gesture, Locomotion, and Speech in Lifelike Pedagogical Agents. Applied Artificial Intelligence Journal, 1998. this volume.

[Maldonado et al., 98] H. Maldonado, A. Picard, P. Doyle, and B. Hayes-Roth. Tigrito: A Multi-Mode Interactive Improvisational Agent. In: Proceedings of the 1998 International Conference on Intelligent User Interfaces, pp. 29–32, San Francisco, CA, 1998.

[Mann & Thompson, 87] W. C. Mann and S. A. Thompson. Rhetorical Structure Theory: A Theory of Text Organization. Report ISI/RS-87-190, Univ. of Southern California, Marina del Rey, CA, 1987.

[Mulken et al., 98] S. van Mulken, E. André, and J. Müller. The Persona Effect: How Substantial Is It? In: Proceedings of HCI'98, Sheffield, UK, 1998. to appear.

[Nass et al., 94] C. Nass, J. Steuer, and E.R. Tauber. Computers are Social Actors. In: Proc. of CHI-94, pp. 72–77, Boston, MA, 1994.

[Noma & Badler, 97] T. Noma and N.I. Badler. A Virtual Human Presenter. In: Proc. of the IJCAI-97 Workshop on Animated Interface Agents: Making them Intelligent, pp. 45–51, Nagoya, 1997.

[Packard, 89] K. Packard. X11 Nonrectangular Window Shape Extension Version 1.0, X11 R5. Technical report, MIT X Consortium, MIT, Cambridge, Massachusetts, 1989.

[Perlin & Goldberg, 96] K. Perlin and A. Goldberg. Improv: A System for Scripting Interactive Actors in Virtual Worlds. Computer Graphics, 28(3), 1996.

[Rickel & Johnson, 98] J. Rickel and W.L. Johnson. Animated Agents for Procedural Training in Virtual Reality: Perception, Cognition, and Motor Control. Applied Artificial Intelligence Journal, 1998. this volume.

[Rist & André, 92] T. Rist and E. André. Incorporating Graphics Design and Realization into the Multimodal Presentation System WIP. In: M. F. Costabile, T. Catarci, and S. Levialdi (eds.), Advanced Visual Interfaces (Proceedings of AVI '92, Rome, Italy), pp. 193–207. Singapore: World Scientific Press, 1992.

[Stone & Lester, 96] B.A. Stone and J.C. Lester. Dynamically Sequencing an Animated Pedagogical Agent. In: Proc. of AAAI-96, volume 1, pp. 424–431, Portland, Oregon, 1996.

[Takeuchi & Nagao, 93] A. Takeuchi and K. Nagao. Communicative Facial Displays as a New Conversational Modality. In: Proc. of ACM/IFIP INTERCHI'93, pp. 187–193, 1993.

[Thalmann & Kalra, 95] N. Magnenat Thalmann and P. Kalra. The Simulation of a Virtual TV Presenter. In: Computer Graphics and Applications, pp. 9–21. World Scientific, 1995.

[Thórisson, 98] K. Thórisson. A Mind Model for Multimodal Communicative Creatures and Humanoids. Applied Artificial Intelligence Journal, 1998. this volume.

[Walker et al., 94] J.M. Walker, L. Sproull, and R. Subramani. Using a Human Face in an Interface. In: Proc. of CHI-94, pp. 85–91, Boston, MA, 1994.

Figures and Captions

Figure 1: PPP Persona Annotating Graphical Objects Through Pointing and Speech

Figure 2: Left: Cross-Media/Window References with a Video Persona. Right: Persona Requesting User Input

Figure 3: WebPersona in the Role of a Personal Travel Agent

Figure 4: Classification of Persona Self Behaviors
[Diagram not reproducible in text; the self behaviors are grouped into Locomotion, Restlessness, Signaling Activity, and Immediate Reactions.]

Figure 5: Rhetorical Structure of the Hotel Example
[Tree diagram not reproducible in text; the goal Provide-Information is decomposed via acts such as Provide-Summary, Design-Intro-Page, Elaborate, Introduce, Describe, Sequence, Emphasize, Illustrate and Label into elementary acts such as S-Speak, S-Point, S-Include-Map, S-Include-Text, S-Include-Photo and S-Include-Link, with utterances like “I have found information for your trip to Frankfurt.”, “Here you can find the latest weather information.”, “This hotel is located in the heart of Frankfurt.” and “This hotel has a very large fitness room with a sauna.”]
Figure 6: Example of a Timeline Diagram
[Timeline from 0 to 16 showing the temporal extent of the acts Elaborate, S-Include-Map, Label, S-Speak and S-Point.]

Figure 7: Navigation Structure of the Hotel Example
[Graph not reproducible in text; the rhetorical structure of Figure 5 is augmented with navigation links such as “Hotel Link Selected”, “Location Link Selected”, “Default Time Over/Next”, “Default Time Over/Up” and “Previous”, and with an additional act S-DisplayText.]

Figure 9: Starting Points, Endpoints and Duration of Presentation Acts Before and After Constraint Propagation
[Constraint networks (a) before and (b) after propagation over the acts A1, B1, B2, B3, C1 and C2, annotated with qualitative relations such as “B1 meets B2”, “B3 during B2”, “B1 starts A1”, “B2 finishes A1”, “C1 meets C2” (resp. “C1 before C2”), “C1 starts B2” and “C2 finishes B2”, and with numeric bounds on start points, end points and durations.]

Figure 11: Context-sensitive Decomposition of a Pointing Gesture
[Hierarchy not reproducible in text; high-level Persona actions such as Point are expanded context-sensitively into moves such as Point-with-Stick or Walk, further into basic postures such as RTurn, RStep, FTurn, LiftRH and Stick, and finally into individual frames.]

Function TopLevelPlanningLoop(Goal)
(1)  If Not(SatisfiedP(Goal))
(2)  Then RootNode := AddNewNode(Goal,Nil)
(3)       If ExpandNode(RootNode)
(4)       Then BuildUpSchedule(RootNode)
(5)       Endif
(6)  Endif

Function ExpandNode(Node)
(7)  While UntriedAlternatives(Node)
(8)  Unless And(ApplicableP(Node,Op(Node)),
(9)             ExecuteInferiorActs(Node),
(10)            TemporallyConsistentP(Node))
(11) Do UpdateAlternatives(Node) Else Do Return true
(12) Endwhile

Function TemporallyConsistentP(Node)
(13) CopyInMats(MatsSystem(Node))
(14) AddTemporalConstraints(Op(Node))
(15) If Calculate() Then UpdateMatsSys(Node,CopyFromMats()) Endif

Function ExecuteInferiorActs(Node)
(16) While UntriedBindings(Node)
(17) Unless Foreach Act In InferiorActs(Node)
(18)        Unless Or(SatisfiedP(Act),ExecuteOneAct(Act,Node))
(19)        Return nil Finally Return true
(20)        Endforeach
(21) Do UpdateBindings(Node) Else Do Return true
(22) Endwhile

Function ExecuteOneAct(Act,Node)
(23) If ApplicableP(Node,Op(Node))
(24) Then If Or(ElementaryP(Act,Node),HyperlinkP(Act,Node))
(25)      Then ExecuteElementaryAct(Act,Node)
(26)      Else If ComplexP(Act,Node)
(27)           Then ExecuteInferiorActs(ExpandOperator(Act,Node))
(28)           Endif
(29)      Endif
(30) Endif

Function BuildUpSchedule(RootNode)
(31) Traverse(RootNode)
(32) MakeGlobalMats(RootNode)
(33) MatsSys := CopyFromMats()
(34) MatsList := CreateDisjMatsSys(MatsSys)
(35) Foreach Sys in MatsList Do MakeTotalOrd(Sys) Endforeach

Figure 8: Planning Algorithm

Schedule 1: (1) Start A, Start C; (2) End A, End C; (3) Start B; (4) End B
Schedule 2: (1) Start B; (2) End B; (3) Start A, Start C; (4) End A, End C

Figure 10: Schedules after Resolving Disjunctions

Declarative Specification:

(defaction MoveUp
  :pre ((icon noicon) (bodydir front) (leftarm standard)
        (rightarm standard) (bodypos stand) (stick off))
  :prim startbottomupjumping
  :while ((posy target) :prim bottomupjumping)
  :prim endbottomupjumping)

(defprimitive StartBottomUpJumping
  :pre ((icon noicon) (bodydir front) (leftarm standard)
        (rightarm standard) (bodypos stand) (stick off))
  :post (())
  :gesture 45)

(defprimitive BottomUpJumping
  :pre ((icon noicon) (bodydir front) (leftarm standard)
        (rightarm standard) (bodypos stand) (stick off))
  :post ((posy -= 1))
  :gesture 42)
...

Regressive Planning + Code Generation

Source Code:

int actionMOVEUP (int start, int time, int x, int y, char *text)
{
  int res, a, atime, reltime, jobid;
  char c, *buff;
  float l, sl, ss;
  int nlong, nshort, sp1, sp2, wn, i, j, g;
  int cps, al;

  relstart = start;
  get_in_stateMOVEUP (relstart, time, x, y, "");
  res = primitiveSTARTBOTTOMUPJUMPING (relstart, time, x, y, text);
  a = abs (posy - y);
  while (a > 50) {
    res = primitiveBOTTOMUPJUMPING (relstart, time, x, y, text);
    a -= 50;
  }
  res = primitiveENDBOTTOMUPJUMPING (relstart, time, x, y, text);
  return res;
}
...

Source Code Compilation

Executable Machine Code:

...
sethi %hi(25000),%l0
or    %l0,%lo(25000),%l0
mov   1,%l1
mov   %l0,%o0
mov   %l1,%o1
...

Figure 12: Compilation of Persona Actions

Figure 13: Architecture of the Persona Engine
[Block diagram not reproducible in text; an application (e.g., a presentation planner) sends scripts including presentation tasks to the Persona Engine. Within the engine, a Behavior Monitor performs situated selection and decomposition of actions drawing on a behavior space, an Event Handler forwards requests for reactions, and a Character Composer turns primitive actions into output via a Platform Interface (X11 version) on top of a WWW browser (with media APIs), the window system and the device drivers.]

Tables and Captions

                               Technical Info            Person Descriptions
Question                       No Persona    Persona     No Persona    Persona
Presentation Difficult         1.63          1.09        2.07          2.14
Presentation Entertaining      1.28          2.07        1.78          2.00
Test Difficult                 2.00          1.50        2.86          2.93
Presentation Interesting       1.71          2.21        2.00          2.28
Information Overload           1.43          1.14        2.50          2.86

Table 1: Means for the General Questions Asked in the Questionnaire (Part A). Ratings range from 0 (negative answer, i.e., indicating disagreement) to 4 (positive answer, i.e., indicating agreement).

Question                                                               Technical Info    Person Descriptions
Persona's behavior is tuned to presentation                            3.00              1.64
Persona helps concentrate on relevant parts                            2.70              2.00
Persona distracts subject from relevant information                    1.00              0.93
Persona encourages subject to further pay attention to presentation    2.21              2.00

Table 2: Means for the Persona-Specific Questions Asked in the Questionnaire (Part B).

Type of Info             No Persona    Persona
Technical Material       36.14         37.57
Person Descriptions      11.43         10.35

Table 3: Means for Comprehension and Recall Performance by the Conditions Persona and Type of Information.