Architecture of a Framework for Generic Assisting Conversational Agents

Jean-Paul Sansonnet, David Leray, and Jean-Claude Martin

LIMSI-CNRS, BP 133, F-91403 Orsay Cedex, France
{jps, leray, martin}@limsi.fr
Abstract. In this paper, we focus on the notion of Assisting Conversational Agents (ACA), which are embodied agents dedicated to the function of assistance for novice users of software components and/or web services. We discuss the main requirements of such agents and we emphasize the genericity issue arising in the dialogical part of such architectures. This prompts us to propose a mediator-based framework, using a dynamic symbolic representation of the runtime of the assisted components. We then define three strategies for the development of the mediators, which are validated by implementations in various experimental situations.
1 Introduction

1.1 The Need for Assisted Components and Services

For a few years, the problematics of Embodied Conversational Agents (ECA) [1,2] has stressed the 'embodiment' feature of interactive virtual characters, and significant progress has been made in terms of software architectures [3,4] of course, but more especially in terms of realism [5], dynamicity [6], multimodal expressivity [7], etc. This has made it possible to develop, to a certain extent, the credibility of these tools. Their use is however still restricted to auxiliary and/or optional functions in applications and services, because their credibility is not yet sufficient for designers to take any risk of irritating, or even being rejected by, the users.

At the same time, one notes a strong and fast evolution in the sociology of the users of computer applications: whereas the initial users were experts, the current users are inexperienced beginners (hereafter 'novice users') who are confronted with a field that they do not grasp but that they are committed to use within their professional or family life. If one considers only the field of the Internet, ordinary users are very numerous (nearly one billion individuals are connected), and the applications and services are expanding continuously. What complicates the problem even more is that the use of information and services is more and more sporadic: booking a ticket will be done today on one website and tomorrow on another; buying a particular product will be done on a specific website, browsed only once by the user. In this new context, current static, instruction-manual-based help systems cannot satisfy the requirements of reactivity and dynamicity when confronted with novice users in a situation of failure.
1.2 Assisting Agents

These new needs prompt the revival of the particular issue of Assisting Agents fully dedicated to the semantic mediation between individuals and computer systems [8]. When a beginner is in a situation of failure in front of an unknown application, the Assisting Agent must carry out three major functions:

The Dialogical Agent provides the function of comprehension [9]: it must be able to grasp the problem of the user. It has been shown that although individuals prefer to command machines by means of direct GUI (Graphical User Interface) interactions, as soon as "things go wrong" natural language becomes the first mode of expression of the user's distress. The help system must first be provided with a textual input (optionally an oral input) making it possible for the user to enter help requests, and the agent must also be able to analyze them with appropriate NLP (Natural Language Processing) tools [10,11,12] in order to transform the natural requests into formal ones.

The Rational Agent provides the function of competence [13,14]: in order to handle the formal help requests, the system must be provided with a dynamic symbolic representation ('at runtime') of the structure and the operation of the software component that it is intended to assist, and it must be able to carry out heuristic reasoning (in the sense of the Common Sense Reasoning community [15,16]) upon this representation in order to answer the users' requests in such a way that the user judges the agent competent.

The Embodied Agent provides the function of presence [17,18]: the expression of the assistance via an anthropomorphic entity restores the trust of the user through the feeling of a 'benevolent presence'. Once regarded as very optional when assistance is given to expert users, this function proves to be crucial in the case of ordinary users and has even proven to be effective (see for example the 'persona effect' of Lester [19] or the conventional 'chatterbot effect' [20]).

In this paper we discuss the general principles of a framework dedicated to the development of Assisting Conversational Agents (ACA), based on the requirements associated with the three above-mentioned functions: dialogism, rationality and embodiment. In section 2, we define the general architecture of our framework and point out the genericity issue at stake. This prompts us to choose a mediator-based solution with three basic strategies. In section 3 we present the implementation of concrete examples based on the proposed strategies.
2 Architecture Issues

2.1 Organization of the Framework

The general organization of the framework replicates the three main functions of an ACA, as shown in figure 1.
Fig. 1. The framework is based on the conventional sequence of help systems: formalization of the user's request; resolution of the request; presentation of the assistance. The main difference is that the resolution of the request is not achieved directly over the component but over its symbolic representation (called the mediator), which evolves dynamically with the component's current status.
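To make this sequence concrete, the following minimal Java sketch shows how the three stages could be chained. All names (Analyzer, Mediator, Presenter, FormalRequest, Resolution, AssistingAgent) are illustrative assumptions for this paper, not the actual API of our framework.

```java
import java.util.List;

// Illustrative sketch of the three-stage assistance pipeline (hypothetical names):
// formalization of the user's request, resolution over the mediator, presentation.
record FormalRequest(String speechAct, String predicate, List<String> arguments) {}
record Resolution(String answerText, List<String> widgetsToPointAt) {}

interface Analyzer  { FormalRequest analyze(String utterance); }   // NLP side
interface Mediator  { Resolution resolve(FormalRequest request); } // KR/AI side, works on the model
interface Presenter { void present(Resolution resolution); }       // CHI side, multimodal output

final class AssistingAgent {
    private final Analyzer analyzer;
    private final Mediator mediator;
    private final Presenter presenter;

    AssistingAgent(Analyzer a, Mediator m, Presenter p) {
        this.analyzer = a; this.mediator = m; this.presenter = p;
    }

    // One assistance turn: the request is resolved over the symbolic model,
    // never directly over the component.
    void assist(String utterance) {
        presenter.present(mediator.resolve(analyzer.analyze(utterance)));
    }
}
```

The later sketches in this paper reuse the FormalRequest and Resolution record types introduced here.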
The framework architecture is composed of three main tools, each one dealing with a specific domain of competence: a) the analyzer translates the user's natural language request into a formal one, a task belonging to the Natural Language Processing (NLP) community; b) the mediator is a dynamic symbolic representation of the structure and the functioning of the actual component, on which the rational agent has to resolve the formal request, a task belonging to the Knowledge Representation (KR) and Artificial Intelligence (AI) communities; c) finally, the presenter has to return the reaction of the agent using multiple modalities, as studied in the Computer Human Interaction (CHI) community.

2.2 The Genericity Issue

The three domains of competence involved here immediately point out the multidisciplinary issue that developers of dialogical assisting systems have to overcome. We know that the design of dialog systems can already claim great achievements [10,11,12], but such systems require huge efforts in terms of NLP expertise, human and implementation time, and they result in complex architectures which are specialized to a given application and cannot easily be reused. This is a major drawback for the ACA problematics, because we need to develop assisting systems at the same rate as the development of new components (that is, faster and faster) and at a cost ideally lower than the cost of the component itself (even if we claim that assistance should not be considered optional but, on the contrary, a first-class citizen in the development of applications and services for the public).

This issue has been stated by J. Allen since 2001 [9], when he declared that the genericity of dialogue systems would be the key to their success in the future. Ideally, a generic dialogue system can be defined as a framework that is not designed for a particular application but which a) can be 'plugged' into various applications and b) requires minimal linguistic knowledge and minimal adaptation effort from the applications' developers.
These considerations prompted us to propose two main principles for handling the genericity issue in our framework:

P1: A Mediator-Based Architecture. The assisting tools do not deal with the application itself but with its symbolic representation. The idea is that developers of software components should need very little NLP competence (ideally none). Developers need only fill in the mediator model in a formal form. Once the model of the component is filled in the mediator (even with a gross model that can be refined incrementally), the ACA is operational and can interact with users: the quality of its assisting service is proportional to the accuracy of the dynamic symbolic representation of the component. We are fully aware that the statement that developers should know little if any NLP sounds like "wishful thinking", but a) we have to do something about the genericity problem and discuss solutions, even ones considered ideal, and b) we do not claim that we have fully implemented this principle, just that it has inspired our architecture.

P2: A 'Gradual Semantic' Approach to the Representation. Depending on the degree of precision the developer wants to provide (or is able to provide) the rational agent with, this representation can be a) variable from rough to accurate in terms of the representing structure, and b) variable from static to very dynamic in terms of the synchronization of the status of the variables between the component and the mediator. Actually, in our framework, even if an empty model is provided, the ACA can still work in a simple "chatbot" mode, because it can handle canonical aspects of conversation (e.g. saying Hello or Goodbye) independently from any application.

2.3 Three Mediator Strategies

In this paper, we are mainly concerned with the mediator architecture, so we will not discuss the important issue of request handling as stated in principle P2 (see section 3.2 for a quick insight and [21,22] for more). Our problem can thus be roughly summarized by two questions: a) How can we easily develop mediator models for an application? In the following, we propose three strategies for developing mediators, ranging from the simplest to the most advanced; they are then put to the test in section 3. b) What is the real impact of the mediator-based architecture on the genericity issue? This question is discussed in section 4 in the light of the experiments detailed in section 3.

The three mediator-based strategies proposed are as follows:

Post-creation Stitching. The first strategy deals with the issue of legacy software, corresponding to the situation where an assistant has to be built for an already existing software component.
In this case, the model of the component has to be designed as a separate software entity and we have to face two problems:

― Post-creation: the developer of the component must describe the structure and the dynamics of the component in the mediator. This can be viewed as another programming phase, even if it is a simpler one;

― Stitching: at runtime, the variables of the mediator must be synchronized (much like files can be synchronized between two computers) with their counterparts in the component's runtime. This is achieved by manually installing a "stitching" between the mediator and the component events (see Figure 2).

We have implemented this strategy in our framework. Three examples, corresponding to three variants, are described in section 3.1.
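As a toy illustration of the stitching idea (a sketch under assumed names and a simplified map-based model, not the framework's actual code), the following Java fragment mirrors two widgets of the "counter" applet of figure 2 into a symbolic model and keeps them synchronized in the component → mediator direction:

```java
import java.util.HashMap;
import java.util.Map;
import javax.swing.JSlider;
import javax.swing.JToggleButton;

// Sketch of manual "stitching" for the counter applet: each relevant GUI event
// updates the corresponding entry of a simple symbolic model, so that the mediator
// always reflects the component's current state. The model layout (a map of
// attribute maps) is an illustrative assumption.
final class CounterStitching {

    static Map<String, Map<String, Object>> stitch(JToggleButton onOffSwitch, JSlider speedCursor) {
        Map<String, Map<String, Object>> model = new HashMap<>();

        Map<String, Object> switchEntry = new HashMap<>();
        switchEntry.put("CATEGORY", "WIDGETSWITCH");
        switchEntry.put("HELP", "starts and stops the counting process");
        switchEntry.put("VALUE", onOffSwitch.isSelected());
        model.put("onOffSwitch", switchEntry);

        Map<String, Object> cursorEntry = new HashMap<>();
        cursorEntry.put("CATEGORY", "WIDGETCURSOR");
        cursorEntry.put("HELP", "controls the counting speed");
        cursorEntry.put("VALUE", speedCursor.getValue());
        model.put("speedCursor", cursorEntry);

        // Component -> mediator synchronization, installed by hand on the legacy applet.
        onOffSwitch.addItemListener(e -> switchEntry.put("VALUE", onOffSwitch.isSelected()));
        speedCursor.addChangeListener(e -> cursorEntry.put("VALUE", speedCursor.getValue()));

        // The mediator -> component direction (orders issued by the assisting agent)
        // would be installed symmetrically, e.g. by calling setSelected / setValue.
        return model;
    }
}
```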
Fig. 2. The stitching diagram designed for the component "counter" shown in figure 3 and detailed in table 1. This figure is only intended to illustrate the fact that developing even a simple model for an already existing component can result in complex interaction maps.
Model to Component Design. Model ↔ component stitching is a cumbersome process that can only be achieved manually by competent programmers. In the second strategy, we consider the situation where new components are to be developed. Hence we can take advantage of more advanced software techniques that are emerging, such as Model Driven Architectures (MDA) [23], where the actual code of the component is automatically generated from a meta model. In this strategy, the programmer designs the component only once, in the mediator modeling language. The framework then generates a) a symbolic representation for the mediator, which is dedicated to the assistance introspection tools, and b) the actual runtime of the component. Moreover, this makes it possible to generate the model ↔ component stitching along with the code.
In the current state of the work, we cannot generate a complete symbolic model from the meta model, because this would mean that we can automatically generate symbolic models of computation for any arbitrary software program, which remains an open question. Therefore, we have restricted ourselves to a subpart of the components that is both tractable and interesting from the point of view of user-system interaction: its Graphical User Interface (GUI), together with the internal variables that the GUI controls directly (e.g. for a checkbox its Boolean status, for a textfield its string, for a button the name of the function that it triggers, etc.). We have implemented the GUI part of this strategy in our framework using Mathematica 5.1 as a support for a) the symbolic model, b) the NLP handling and c) the automatic generation of Java Swing GUIs. An example is described in section 3.2, together with a brief overview of the NLP process involved. A sketch of the widget-level generation idea is given at the end of this section.

Model to Multi-environment Design. While we still have to go deeper into the question of modeling software programs, it appears interesting to make use of our existing meta model approach to deploy components not only as stand-alone applications (e.g. Java-based applications) but also as web-based applications and services, which are spreading very rapidly and which mostly concern novice users. This is the reason why we are currently developing an extension of our framework for website assisting agents, which is described in section 3.3.
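Returning to the GUI-restricted generation described above, the following Java sketch illustrates, under assumed names and in a much simplified form (in our implementation the meta model is hosted in Mathematica and generates the Swing code), how a single declarative widget description could yield both a Swing widget and its symbolic counterpart, so that the model ↔ component stitching comes for free:

```java
import java.util.HashMap;
import java.util.Map;
import javax.swing.JCheckBox;
import javax.swing.JComponent;
import javax.swing.JLabel;
import javax.swing.JSlider;

// Illustrative "model to component" sketch restricted to the GUI: one declarative
// description per widget, from which both the Swing widget and its entry in the
// symbolic model are generated, with the stitching listeners installed automatically.
record WidgetSpec(String id, String kind, String help, Object initialValue) {}

final class GuiGenerator {

    static JComponent generate(WidgetSpec spec, Map<String, Map<String, Object>> model) {
        Map<String, Object> entry = new HashMap<>();
        entry.put("CATEGORY", spec.kind());
        entry.put("HELP", spec.help());
        entry.put("VALUE", spec.initialValue());
        model.put(spec.id(), entry);

        return switch (spec.kind()) {
            case "WIDGETCHECKBOX" -> {
                JCheckBox box = new JCheckBox(spec.id(), (Boolean) spec.initialValue());
                box.addItemListener(e -> entry.put("VALUE", box.isSelected()));
                yield box;
            }
            case "WIDGETCURSOR" -> {
                JSlider slider = new JSlider(0, 100, (Integer) spec.initialValue());
                slider.addChangeListener(e -> entry.put("VALUE", slider.getValue()));
                yield slider;
            }
            // Kinds that are not modeled stay atomic (opaque) to the rational agent.
            default -> new JLabel(spec.id());
        };
    }
}
```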
3 Implementation of the Mediator Strategies

3.1 Stitching of Java Applets

The first validation of the mediator strategies has been carried out by developing mediator models for several Java applets, making use of the stitching technique.
Fig. 3. The assisting framework of the LEA agent. Various Java applets can be embedded in its main frame. On the right are the Java GUIs of the three examples detailed in table 1.
Once the symbolic model has been built manually, the applets are embedded into the assisting framework, as shown in figure 3. We implemented three applets (shown in the right part of figure 3, from top to bottom) which belong to three different application domains and which were processed according to three different approaches. This has led to the cross-exploration of a large range of situations, as described in table 1.

Table 1. Cross-exploration of three approaches to Java applet stitching
Component: Coco, a simple counter
Domain: a Java component provided with autonomous processes (threads). The counting process is controlled by an on/off switch and a cursor for speed.
Approach: the mediator and the Java applet are designed and coded at the same time. The mediator reflects exactly the Java objects contained in the application.

Component: Hanoi, a well-known game
Domain: a Java component functioning in a strictly modal way: if the user does not interact, nothing happens.
Approach: the Java applet is coded in an independent context and then modelled a posteriori. The code of the Java applet is filtered in an automatic way so as to be able to send events to the model and to receive orders from the mediator.

Component: AMI, an active web site
Domain: the application is a database displayed as a web service. The user navigates within the base/site and can update it dialogically.
Approach: here the Java component is limited to the role of a display function for the web pages. The site is managed completely in the mediator.
3.2 Generating Swing-Based GUIs from a Symbolic Model

The second mediator strategy deals with the automatic generation of the components' code (or part of it) from its a priori description in the symbolic meta model. For arbitrary components, this problem is not yet considered fully tractable; so we made a first attempt in that direction while restricting ourselves to the dialogisation of the GUI part of the components (that is, we model the widgets and the variables that they control, not the inner application functions, which remain atomic to the rational agent). Figure 4 shows the development framework, with the assisting agent answering questions about the structure and the functioning of a simple "Dice game" while being capable of deictic gestures towards the widgets of the GUI.
Fig. 4. Screenshot of the development framework of the "model to component" strategy. On the right is the Java Swing GUI of a "Dice game". It was deliberately built to be unintuitive, so that users are driven to ask questions about it in the chatbox line. Below the chatbox line, a debugging frame shows the main phases of the NLP processing of the request (the request "A quoi sert le cusreur" is detailed in table 2). On the left, a second debugging frame displays the dynamic status of the symbolic model of the GUI on the right.
The dialogical agent analyzes the utterances of the users in several successive phases, summarized top-down in table 2, where CAPITALIZED words are internal semantic markers:

Table 2. Phases of the semantic analysis of assisting requests prompted by the users (phase sequence, with an example for each phase)

Phase: Input utterance
Example: « à quoi sert le cusreur ? stp » (what's the use of the cusror please?)

Phase: Possible orthographical or abbreviation corrections
Example: "cusreur" → "curseur"; "stp" → "s'il te plait"

Phase: Lemmatization
Example: {/a/, /quoi/, /servir/, /le/, /curseur/, /QUEST/, /PLEASE/}

Phase: Part of Speech (POS) labelling
Example: {GG[/a/], GG[/quoi/], VV[/servir/], DD[/le/], NN[/curseur/], /QUEST/, /PLEASE/}

Phase: Lexical semantic classification of the lemmas
Example: {AT[GG[/a/]], WHAT[GG[/quoi/]], USAGE[VV[/servir/]], DD[/le/], WIDGETCURSOR[NN[/curseur/]], /QUEST/, SPEECHACTPLEASE[/PLEASE/]}

Phase: Global semantic extraction of the three main parts of a typical request
Example: < SPEECHACT, PREDICATE, ARGUMENT* > (here: < ASK, USAGE, REF[Qcursor] >, as detailed below)
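As a toy illustration of these phases (a sketch with a hard-coded lexicon reduced to the example utterance, not the actual analyzer, which relies on larger lexical resources), the following Java fragment reuses the FormalRequest record sketched in section 2.1:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Toy sketch of the formalization phases of table 2 (spelling correction,
// lemmatization, lexical semantic classification, global extraction).
// The three lexicons below are illustrative assumptions.
final class ToyFormalizer {

    static final Map<String, String> SPELLING = Map.of("cusreur", "curseur");
    static final Map<String, String> LEMMA    = Map.of("sert", "servir");
    static final Map<String, String> CLASSES  = Map.of(
            "quoi", "WHAT", "servir", "USAGE", "curseur", "WIDGETCURSOR");

    static FormalRequest formalize(String utterance) {
        List<String> markers = new ArrayList<>();
        for (String token : utterance.toLowerCase().replaceAll("[?!.,]", "").split("\\s+")) {
            String corrected = SPELLING.getOrDefault(token, token);   // "cusreur" -> "curseur"
            String lemma = LEMMA.getOrDefault(corrected, corrected);  // "sert"    -> "servir"
            String marker = CLASSES.get(lemma);                       // lemma -> semantic class
            if (marker != null) markers.add(marker);
        }
        // Toy global extraction of the <SPEECHACT, PREDICATE, ARGUMENT*> triple.
        String speechAct = markers.contains("WHAT") ? "ASK" : "TELL";
        String predicate = markers.contains("USAGE") ? "USAGE" : "UNKNOWN";
        List<String> arguments = new ArrayList<>();
        for (String m : markers)
            if (m.startsWith("WIDGET")) arguments.add("REF[Q" + m.substring(6).toLowerCase() + "]");
        return new FormalRequest(speechAct, predicate, arguments);
    }
}
```

For « à quoi sert le cusreur ? stp », this sketch yields FormalRequest[speechAct=ASK, predicate=USAGE, arguments=[REF[Qcursor]]].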
The global semantic analysis phase seeks to extract the three main components which appear in formal help requests, according to the following general request format:

< SPEECHACT, PREDICATE, ARGUMENT* >

Speech Act (SA): the general category of the request, in the sense of [24]. Several SA ontologies have been proposed in dialog systems; here we use a simplified version dedicated to request handling (not dialog handling). In the example, we have < ASK >.

Predicate: an action, a propositional verb or even an attribute. For the sentence given as an example it is the attribute < USAGE >.

Argument*: zero, one or more Associative Referential Expressions [25], making it possible to locate, via their perceptual properties (and indeed not via their internal programming identifiers), the entities involved in the predicate. For the utterance given as an example we have REF[Qcursor], a reference to any instance, within the application, for which the Qcursor predicate takes the value True. This referential expression is quite simple, but one can find more complex expressions.

From the formal request hence obtained, < ASK, USAGE, REF[Qcursor] >, the rational agent must locate at least one instance of WIDGETCURSOR in the mediator representation. Then, supposing, as is the case here, that there is one and only one instance of this kind, it must consult the instance's HELP attribute and produce a multimodal answer (a toy sketch of this resolution step is given below, after section 3.3). The multimodal answer is expressed through the character-linked modalities and also by optional programmed actions on the model, which are in turn mirrored on the application; optional deictic gestures by the character can also be accompanied by redundant highlighting of the widgets referred to by the assistant.

3.3 Towards Website Assisting Agents

The third strategy has to do with the idea of automatically generating, from a single symbolic description of a component, both a stand-alone version (say in Java) and a web version (say in DHTML/JavaScript). As the stand-alone deployment has been discussed in the previous section, we are concerned here with the web deployment: basically this requires a web server-based architecture with active pages, which is currently being implemented. A JavaScript-based client version is also accessible to the public at [26], where basic examples of agents interacting both with users and with DOM-based components are demonstrated.

Figure 5 shows a screenshot of the WebLea site with the four available cartoon-like characters based on the LEA technology developed by Jean-Claude Martin at LIMSI-CNRS [27] in the IST-NICE project [28]. The WebLea agents are dynamically sizable and interchangeable. They can move over and within the pages of a given dialogized website. They can react to natural language users' requests by a) displaying answers in a speech balloon, b) displaying popup information, c) pointing exactly at the DOM objects of the page and d) activating JavaScript programs.
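Returning to the resolution step announced in section 3.2, the following toy Java sketch (again with assumed names, reusing the map-based model and the FormalRequest/Resolution records of the earlier sketches) shows how the formal request < ASK, USAGE, REF[Qcursor] > could be resolved against the mediator by locating the instances whose category matches the referential expression, reading their HELP attribute and preparing a multimodal answer:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Toy resolution of an < ASK, USAGE, REF[...] > request over the symbolic model:
// locate the widgets that satisfy the referential expression, read their HELP
// attribute, and return the answer text plus the widgets to highlight deictically.
final class ToyResolver {

    static Resolution resolve(FormalRequest request, Map<String, Map<String, Object>> model) {
        if (!"USAGE".equals(request.predicate()) || request.arguments().isEmpty())
            return new Resolution("Sorry, I did not understand your request.", List.of());

        // REF[Qcursor] -> perceptual category WIDGETCURSOR
        String category = "WIDGET" + request.arguments().get(0)
                .replace("REF[Q", "").replace("]", "").toUpperCase();

        List<String> matching = new ArrayList<>();
        String help = null;
        for (Map.Entry<String, Map<String, Object>> e : model.entrySet()) {
            if (category.equals(e.getValue().get("CATEGORY"))) {
                matching.add(e.getKey());
                help = (String) e.getValue().get("HELP");
            }
        }
        if (matching.isEmpty())
            return new Resolution("I cannot find such a widget in this application.", List.of());

        // Here there is exactly one instance, so its HELP attribute is the answer;
        // the widget identifiers are returned so that the character can point at them.
        return new Resolution("This widget " + help + ".", matching);
    }
}
```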
Fig. 5. On the WebLea site [26], one can see and control the LEA agents, displayed on both the Mozilla Firefox and Internet Explorer browsers. An online "movie" editor for creating animations is available, together with a "rule" editor (based on JavaScript regular expressions) for scripting the reactions of the agents to users' questions. The animations are defined in a compact symbolic format and interpreted at the client level, making it possible to have a great number of them without bandwidth problems.
The LEA technology is quite simple, being based on animated-GIF body parts, compared with the state-of-the-art 3D realistic agents of the IVA community (like REA [2], GRETA [17], MAX [3,29], or even virtual reality systems [30]), but LEA agents can easily be displayed on web pages and they can still express quite a large range of cartoon-like expressions and gestures. This is largely sufficient for our purpose, which focuses mainly a) on the genericity of our framework, so that assisting agents can be easily deployed, and b) on their reasoning capabilities over the components, meaning that fine expression of emotions is optional at this stage.
4 Discussion

In this paper, we have first proposed the notion of Assisting Conversational Agent (ACA), inheriting its problematics on one side from Human-Machine Dialogue and Reasoning and on the other side from Embodied Conversational Agents. We have claimed that, with the explosion of new components and services and with the explosion of novice users, there is a real need for new assisting tools and that ACA can be a user-friendly solution.
The second point that we make is that this will not come so easily: J. Allen and others have discussed the large cost involved in existing dialogue systems and placed the issue of genericity at the core of their actual spreading. Where ACA are concerned, genericity is even more crucial; this is the reason why we proposed a mediator-based architecture where request handling works on a model, not directly on the application, making it possible a) to disconnect, to some extent, the NLP world and the programming world and b) to propose a 'gradual semantic' approach to the assistance, where the agent can range from a 'daft' chatbot to a 'smart' companion according to the accuracy of the mediator representation.

In section 2, we could not present the internal features of our framework extensively, so we focused on the principles and the strategies that we have attempted. In section 3, we presented some implementations of these strategies. According to our experience, we can state the following three points:

- The first strategy was implemented in various situations: three applets, belonging to three different domains, processed according to three different approaches. This has proved that, at least for small software components, stitching is tractable and that 'smart' assisting agents can be deployed quite rapidly. However, the third application (an active, dialogically editable website) proved to be difficult to maintain in the end, prompting us towards the second strategy.

- The second strategy is the most promising for the ACA future. Besides the open question of the full introspectability of arbitrary application code, there is another, mental obstacle: conventional programmers consider that dynamic symbolic models are mere gadget applications, "too slow and not professional", but this situation could change with the maturity of web-based scripting (like wikis, active technology, etc.).

- The third strategy is essentially an extension of the second one. Our experience with the WebLea site and its good acceptability encourages us to develop our framework in that direction, so as to propose webpage assisting agents that are easy to develop and easy to install.
References

1. Cassell, J., Sullivan, J., Prevost, S., Churchill, E., Embodied Conversational Agents, MIT Press, ISBN 0-262-03278-3, 2000
2. Cassell, J., Bickmore, T., Billinghurst, M., Campbell, L., Chang, K., Vilhjálmsson, H., Yan, H., Embodiment in conversational interfaces: Rea, Proceedings of the SIGCHI Conference on Human Factors in Computing Systems: the CHI is the limit, pp. 520-527, Pittsburgh, 1999
3. Kopp, S., Wachsmuth, I., Model-based Animation of Coverbal Gesture, Proceedings of Computer Animation, pp. 252-257, IEEE Press, Los Alamitos, CA, 2002
4. Cosi, P., Drioli, C., Tesser, F., Tisato, G., INTERFACE toolkit: a new tool for building IVAs, Intelligent Virtual Agents Conference (IVA'05), Kos, Greece, 2005
5. McGee, D. R., Cohen, P. R., Creating tangible interfaces by augmenting physical objects with multimodal language, Proceedings of the 6th International Conference on Intelligent User Interfaces, pp. 113-119, Santa Fe, CA, 2001
6. Martin, A., O'Hare, G. M. P., Duffy, B. R., Schoen, B., Bradley, J. F., Maintaining the Identity of Dynamically Embodied Agents, Intelligent Virtual Agents Conference (IVA'05), Kos, Greece, 2005
7. Thorisson, K. R., Koons, D. B., Bolt, R. A., Multi-Modal Natural Dialogue, in: Bauersfeld, P., Bennett, J., Lynch, G. (eds.): Proceedings of the ACM CHI 92 Human Factors in Computing Systems Conference, pp. 653-654, Monterey, 1992
8. Maes, P., Agents that reduce workload and information overload, Communications of the ACM, 37(7), 1994
9. Allen, J. F., Byron, D. K., Dzikovska, M. O., Ferguson, G., Galescu, L., Stent, A., Towards conversational Human-Computer Interaction, AI Magazine, 2001
10. Ferguson, G., Allen, J., TRAINS-95: Towards a mixed initiative planning assistant, Proc. Conference on Artificial Intelligence and Planning Systems (AIPS-96), Edinburgh, 1996
11. Ferguson, G., Allen, J., TRIPS: an intelligent problem-solving assistant, in Proc. of the Fifteenth National Conference on Artificial Intelligence (AAAI-98), Madison, WI, 1998
12. Wahlster, W., Reithinger, N., Blocher, A., SMARTKOM: multimodal communication with a life-like character, in Proc. Eurospeech 2001, Aalborg, Denmark, 2001
13. Wooldridge, M., Reasoning about Rational Agents, MIT Press, 2000
14. Rao, A. S., Georgeff, M. P., Modeling rational agents within a BDI architecture, in KR'91, pp. 473-484, San Mateo, CA, USA, 1991
15. McCarthy, J., Hayes, P. J., Some philosophical problems from the standpoint of artificial intelligence, in B. Meltzer and D. Michie (eds.), Machine Intelligence, volume 4, pp. 463-502, Edinburgh University Press, 1969. Reprinted in 1990.
16. Pearl, J., Reasoning With Cause and Effect, in Proc. IJCAI'99, pp. 1437-1449, 1999
17. Pélachaud, C., Some considerations about embodied agents, Proc. of the Workshop on "Achieving Human-Like Behavior in Interactive Animated Agents", in The Fourth International Conference on Autonomous Agents, Barcelona, 2000
18. Buisine, S., Abrilian, S., Martin, J.-C., Evaluation of Individual Multimodal Behavior of 2D Embodied Agents in Presentation Tasks, Proceedings of the Workshop Embodied Conversational Agents, 2003
19. Lester, J., et al., The Persona Effect: Affective Impact of Animated Pedagogical Agents, CHI'97, 1997
20. Laven, S., The Chatterbot webpage of Simon Laven: http://www.simonlaven.com/
21. InterViews project url: http://www.limsi.fr/Individu/jps/interviews/
22. DAFT project url: http://www.limsi.fr/Individu/jps/research/daft/
23. Blanc, X., Bouzitouna, S., Gervais, M.-P., A Critical Analysis of MDA Standards through an Implementation: the ModFact Tool, First European Workshop on Model Driven Architecture with Emphasis on Industrial Applications, 2004
24. Searle, J. R., Speech Acts, Cambridge University Press, 1969
25. Byron, D. K., Allen, J. F., What's a Reference Resolution Module to do? Redefining the Role of Reference in Language Understanding Systems, Proc. DAARC2002, 2002
26. WebLea site url: http://www.limsi.fr/~jps/online/weblea/leaexamples/leawebsite/index.html
27. Abrilian, S., Martin, J.-C., Buisine, S., Algorithms for controlling cooperation between output modalities in 2D embodied conversational agents, ICMI'03, 2003
28. NICE project url: http://www.niceproject.com/
29. Kopp, S., Wachsmuth, I., Synthesizing Multimodal Utterances for Conversational Agents, The Journal of Computer Animation and Virtual Worlds, 15(1), 2004
30. Traum, D., Swartout, W., Marsella, S., Gratch, J., Fight, Flight, or Negotiate: Believable Strategies for Conversing under Crisis, 5th International Working Conference on Intelligent Virtual Agents, September 2005