Presentation Agents for Speech User Interfaces

Jaakko Hakulinen
Human-Computer Interaction Group
Department of Computer Science
University of Tampere
FIN-33014 University of Tampere, Finland
+358-3-2158558
[email protected]

Markku Turunen
Human-Computer Interaction Group
Department of Computer Science
University of Tampere
FIN-33014 University of Tampere, Finland
+358-3-2158559
[email protected]

ABSTRACT

In this paper we introduce a presentation agent framework for speech applications. In this framework, presentation agents are used to produce dynamic, adaptive and prosody-rich speech outputs. Using this framework in our speech-only e-mail reader, we have been able to handle multilingual issues and support different user groups. Our goal is to build unique computer ‘voices’ to make speech outputs more intelligible and pleasant for the users.

Keywords

Speech user interfaces, speech output, user interface agents, dynamic output generation, prosody

INTRODUCTION

In current speech user interfaces speech output is often treated in a very inefficient way. Usually speech output is generated using a fixed string that is sent to a synthesizer. It is not uncommon that the resulting utterance is unpleasant and unintelligible for the user. To improve speech output we should use dynamic, adaptive and highly context-sensitive messages. It is noteworthy that different people prefer different types of output [2]. We can also utilize prosody to improve both intelligibility and pleasantness of speech [1].

In order to provide dynamic speech outputs we can use presentation agents. Presentation agents are a special kind of software module that can present messages dynamically and use a rich and varying set of prosodic features in their speech. In this paper we introduce a framework for presentation agents to be used in speech user interfaces.

The rest of the paper is organized as follows. First we introduce presentation agents. Second we describe a framework for agents. Next we describe what kind of information our framework uses and how it is implemented as a part of our speech-only e-mail client. Finally, conclusions are drawn and ideas for future work are presented.

PRESENTATION AGENTS

Presentation agents are software modules that handle the generation of outputs in speech applications. In other words, presentation agents are presenters of speech messages. Presentation agents act like filters between the application and the user. The application sends a message to the agent and the agent is responsible for interpreting that message. The agent presents it to the user using all of its capabilities to make the message as intelligible and pleasant as possible. A real-world metaphor for a presentation agent could be a radio announcer or an actor in a radio play.

The messages that the agents receive could be conceptual or fixed strings. In the latter case agents are usually unable to interpret the messages and they can usually only forward messages to a synthesizer. In order to present messages efficiently they should be presented on a conceptual level. Agents are able to express conceptual messages in any way that they want to. For example, a greeting to the user could be ignored completely (a reserved agent), it could contain several paragraphs (a chatterbox agent) or contain slang words (a street-credible agent).

Presentation agents can be very specialized: usually one agent is able to present only a certain kind of message. For example, an error-message agent knows how to tell the user that an error happened but does not know how to inform the user about any other issues. Some of the agents are application independent, and application-specific agents can be built on top of these. We believe that small and specialized agents would be more maintainable and reusable than large and omniscient agents would ever be.

Every presentation agent contains a set of attributes that describe what the agent can do and how it will do that. In other words, attributes are features and capabilities of this particular agent. An example of a binary attribute is an ability to speak Finnish: this could be either true or false. Another example is the verbosity of an agent: this could be a floating-point value between zero and one.

A FRAMEWORK FOR PRESENTATION AGENTS
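The agent attributes and conceptual messages described above can be illustrated with a minimal Python sketch. It is not the framework's actual API: the class name `PresentationAgent`, its fields, the message types `"greeting"` and `"error"`, and the verbosity thresholds 0.2 and 0.8 are all assumptions made for this example. It shows a binary attribute, a verbosity value between zero and one, and agents that render the same conceptual greeting differently.

```python
from dataclasses import dataclass
from typing import FrozenSet


@dataclass(frozen=True)
class PresentationAgent:
    """Hypothetical agent; all names and thresholds are illustrative only."""
    name: str
    speaks_finnish: bool        # binary attribute: either true or false
    verbosity: float            # floating-point attribute between 0.0 and 1.0
    handles: FrozenSet[str]     # conceptual message types this agent can present

    def present(self, concept: str, user: str) -> str:
        """Turn a conceptual message into the text handed to a synthesizer."""
        if concept not in self.handles:
            # A specialized agent does not know how to present other messages.
            raise ValueError(f"{self.name} cannot present {concept!r}")
        if concept == "error":
            return "Sorry, an error happened."
        if concept == "greeting":
            if self.verbosity < 0.2:       # assumed threshold: reserved agent
                return ""                  # the greeting is ignored completely
            if self.verbosity > 0.8:       # assumed threshold: chatterbox agent
                return (f"Hello {user}, welcome back! "
                        "It is good to hear from you again.")
            return f"Hello {user}."
        raise ValueError(f"unknown concept {concept!r}")


reserved = PresentationAgent("reserved", False, 0.1, frozenset({"greeting"}))
chatterbox = PresentationAgent("chatterbox", False, 0.9, frozenset({"greeting"}))
error_agent = PresentationAgent("error", True, 0.5, frozenset({"error"}))
```

Here `reserved.present("greeting", "Anna")` yields an empty string while the chatterbox produces a longer greeting, and asking `error_agent` for a greeting raises an exception, mirroring the specialization of agents.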

Since presentation agents are very specialized and their behavior should adapt to the user and the situation, the task of choosing an appropriate agent is not trivial. Our framework uses a scheme in which a presentation agent manager is used to choose an agent for every message. The presentation manager uses evaluators to decide which agent would be the best choice in the current situation. The presentation manager acts like the director of a theater and the evaluators could be seen as headhunters. Figure 1 illustrates our framework.

[Figure 1: diagram showing the user, Input handler, Dialog manager, Output handler, Presentation agent manager, Evaluators and several Presentation agents]

Figure 1. Presentation agent framework

User model has two parts: explicit user model contains the preferences and settings that the user has actively set. Implicit user model contains information that is obtained by following the user’s actions without explicitly requesting information. External context is information about the situation the user is in. This includes the user’s physical position and constraints of that place. Technical features define the capabilities of the system. An important part is the output channel in a wide sense including, for example, features of a speech synthesizer and sound quality of a telephone line. All sources of information can be divided into two parts: generic and application specific information. By splitting the evaluation process into several atomic evaluators we can keep the system manageable and write general selection rules separately. Maximizing the set of generic evaluators and therefore minimizing the application specific evaluators is an interesting challenge for future work.
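The information groups used by the framework (internal context, dialogue history, user model, external context and technical features) could be held in one structure that is reached through a single interface. The Python sketch below is only one possible layout: the class names `UserModel` and `PresentationContext`, the dictionary-based storage and the `preference` helper are hypothetical, and the rule that explicit settings take precedence over implicit ones is an assumption of this example rather than something the paper specifies.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class UserModel:
    # Preferences and settings the user has actively set.
    explicit: Dict[str, Any] = field(default_factory=dict)
    # Information obtained by following the user's actions.
    implicit: Dict[str, Any] = field(default_factory=dict)


@dataclass
class PresentationContext:
    """Hypothetical container for the five information groups."""
    internal_context: Dict[str, Any] = field(default_factory=dict)
    dialogue_history: List[str] = field(default_factory=list)
    user_model: UserModel = field(default_factory=UserModel)
    external_context: Dict[str, Any] = field(default_factory=dict)
    technical_features: Dict[str, Any] = field(default_factory=dict)

    def preference(self, key: str, default: Any = None) -> Any:
        """Look up a user preference.

        Assumption of this sketch: an explicit setting overrides an
        implicitly learned value for the same key.
        """
        if key in self.user_model.explicit:
            return self.user_model.explicit[key]
        return self.user_model.implicit.get(key, default)


context = PresentationContext(
    external_context={"location": "car"},            # illustrative values
    technical_features={"channel": "telephone"},
)
context.user_model.implicit["language"] = "en"
context.user_model.explicit["language"] = "fi"
```

Keeping the groups as separate fields lets an evaluator read only the group it needs, which fits the goal of writing generic evaluators separately from application specific ones.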

When the presentation agent manager gets a message from the dialog manager it consults a set of evaluators. A single evaluator is a software component that compares the attributes of a presentation agent to the information about the current situation and gives a score to the agent. The simplest evaluator can, for example, be one that checks whether the agent under evaluation is able to handle the message to be presented. When all evaluators have done their evaluation work for a single agent, the presentation manager calculates an overall score for the agent by multiplying the individual scores. When all agents have been evaluated, the presentation manager selects the one with the highest score.
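The selection procedure described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the actual implementation: agents and situations are plain dictionaries, and the evaluator functions `can_handle`, `language_match` and `verbosity_match`, together with every key name and value, are hypothetical. Only the two steps taken from the text are essential: multiply the individual scores, then pick the agent with the highest product.

```python
from math import prod
from typing import Any, Callable, Dict, List

Agent = Dict[str, Any]        # attributes of one presentation agent
Situation = Dict[str, Any]    # information about the current situation
Evaluator = Callable[[Agent, Situation], float]


def can_handle(agent: Agent, situation: Situation) -> float:
    """Score 1.0 if the agent can present this message type, else 0.0."""
    return 1.0 if situation["message_type"] in agent["handles"] else 0.0


def language_match(agent: Agent, situation: Situation) -> float:
    """Score 1.0 if the agent speaks the required language, else 0.0."""
    return 1.0 if situation["language"] in agent["languages"] else 0.0


def verbosity_match(agent: Agent, situation: Situation) -> float:
    """Score how close the agent's verbosity is to the preferred one."""
    return 1.0 - abs(agent["verbosity"] - situation["preferred_verbosity"])


def overall_score(agent: Agent, situation: Situation,
                  evaluators: List[Evaluator]) -> float:
    """Overall score: the product of the individual evaluator scores."""
    return prod(evaluate(agent, situation) for evaluate in evaluators)


def select_agent(agents: List[Agent], situation: Situation,
                 evaluators: List[Evaluator]) -> Agent:
    """Return the agent with the highest overall score."""
    return max(agents, key=lambda a: overall_score(a, situation, evaluators))


evaluators = [can_handle, language_match, verbosity_match]
agents = [
    {"name": "terse_fi", "handles": {"greeting"}, "languages": {"fi"}, "verbosity": 0.2},
    {"name": "chatty_fi", "handles": {"greeting"}, "languages": {"fi"}, "verbosity": 0.9},
    {"name": "error_en", "handles": {"error"}, "languages": {"en"}, "verbosity": 0.5},
]
situation = {"message_type": "greeting", "language": "fi", "preferred_verbosity": 0.3}
best = select_agent(agents, situation, evaluators)
```

With these illustrative values `terse_fi` scores about 0.9, `chatty_fi` about 0.4 and `error_en` 0, so `terse_fi` is selected. Because the scores are multiplied, a single evaluator that returns zero rules an agent out regardless of its other scores. The sketch does not handle the case in which every agent scores zero.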

There is both generic and application specific information that can be used when evaluating agents. By designing evaluators in such a way that a maximum number of evaluators use only generic information, we can make a major part of the evaluation process generic.

INFORMATION SOURCES IN THE AGENT FRAMEWORK

There are several sources of information that agents can use when they present messages. This information is also used when the presentation manager is selecting an appropriate agent for each message. This information can be stored in a single storage that is reached through a single interface. However, we can manage this information better by dividing it into groups. We suggest the following groups:

1. Internal context
2. Dialogue history
3. User model
4. External context
5. Technical features

Internal context is the current state of the system including all the information that the system is serving to the user. Dialogue history contains previous steps of the current dialogue with the user.

IMPLEMENTATION

We have built a Java-based implementation of the presentation agent framework. It handles all speech outputs of our multilingual speech-only e-mail reader. Using this framework we have built various presentation agents for different purposes. We have agents for specific tasks, agents speaking English or Finnish, agents using prosody and agents just speaking differently than the others. The presentation agent framework also supports different user groups and individual users since the system always tries to choose the most suitable agent for the current user.

CONCLUSIONS AND FUTURE WORK

We have proposed a presentation agent framework as a solution for dynamic and adaptive speech output generation. Using this framework an application can produce highly context-sensitive speech outputs utilizing advanced presentation techniques like prosody. Using this framework the application can also handle multilingual issues and serve as an adaptive interface for different user groups. In the future we will build natural agents based on real human speakers. We will also expand this framework to cover input agents.

REFERENCES

1. Hakulinen, J., Turunen, M. and Räihä, K.-J. The Use of Prosodic Features to Help Users Extract Information from Structured Elements in Spoken Dialogue Systems. Proceedings of ESCA Tutorial and Research Workshop on Dialogue and Prosody, Eindhoven, The Netherlands, September 1-3, 1999, 65-70.
2. Reeves, B. and Nass, C. The Media Equation: How People Treat Computers, Television and New Media Like Real People and Places. The MIT Press, Cambridge MA, 1997.