ENABLING COLLABORATIVE GEOINFORMATION ACCESS AND DECISION-MAKING THROUGH A NATURAL, MULTIMODAL INTERFACE

Alan M. MacEachren1&2, Guoray Cai1&4, Rajeev Sharma1&3, Ingmar Rauschert1&3, Isaac Brewer1&2, Levent Bolelli1&4, Benyah Shaparenko1&3, Sven Fuhrmann1&2, Hongmei Wang1&4

1 GeoVISTA Center, Penn State University
2 Department of Geography, 302 Walker, Penn State University
3 Department of Computer Science and Engineering, Penn State University
4 School of Information Sciences and Technology, Penn State University
[email protected]; http://www.geovista.psu.edu/grants/nsf-itr/intro.html

Accepted (4/12/04) for publication in the International Journal of Geographical Information Science

Abstract

Current computing systems do not support human work effectively. They restrict human-computer interaction to one mode at a time and are designed with an assumption that use will be by individuals (rather than groups), directing (rather than interacting with) the system. To support the ways in which humans work and interact, a new paradigm for computing is required that is multimodal, rather than unimodal, collaborative, rather than personal, and dialogue-enabled, rather than unidirectional. To address this challenge, we have developed an approach to natural, multimodal, multiuser dialogue-enabled interfaces to geographic information systems that make use of large-screen displays and integrated speech-gesture interaction. After outlining our goals and providing a brief overview of relevant literature, we introduce the Dialogue-Assisted Visual Environment for Geoinformation (DAVE_G). DAVE_G is being developed using a human-centered systems approach that contextualizes development and assessment in the current practice of potential users. In keeping with this human-centered approach, we outline a user task analysis and associated scenario development that implementation is designed to support (grounded in the context of emergency response), review our own precursors to the current prototype system and discuss how the current prototype extends upon past work, provide a detailed description of the architecture that underlies the current system, and introduce the approach implemented for enabling mixed-initiative human-computer dialogue. We conclude with discussion of goals for future research.

1 Introduction

Geographic Information Systems (GIS) and related geoinformation technologies have become important tools for science, business, planning, and government policy, cutting across multiple domains and user groups. Despite decades of development, however, their potential is far from being realized fully, because most systems remain hard to use. As a result, a priority research problem identified (repeatedly) for geographic information science (GIScience) is to make GIS and related geospatial technologies easier to use (Egenhofer & Franzosa, 1995; Mark et al., 1999; Mark & Gould, 1991; Muntz et al., 2003; Nyerges et al., 1995; Slocum et al., 2001). More broadly, developing natural interfaces to computer systems has been a long-term goal for human-computer interaction (HCI) research (Allen et al., 2001; Shackel, 1997; Shackel, 2000; Zue & Glass, 2000); thus, efforts to improve GIS usability should be placed in this broader context. While natural (easy to use) interfaces to GIS (and information systems generally) have been investigated for more than a decade (Egenhofer, 1996; Neal et al., 1989; Oviatt, 1997), limited progress has been made. We believe that a fundamental impediment to progress toward natural interfaces, whether for GIS or other computer systems, has been the traditional focus on single modalities (e.g., natural language, sketching) and on one-way communication from human to computer. The approach we are taking, and the interface we are implementing, derives its power to reflect user intentions and meet user needs through support for in-context integration of multiple modalities and active (mixed initiative) human-computer dialogue (HCD), rather than one-way "interaction."

Ease of use, however, goes beyond support for HCD. Productive work is facilitated when teams of individuals participate in multimodal dialogue (using speech and gesture). Most applications of GIS, specifically, are to problems addressed by teams (e.g., crisis management, urban and regional planning, environmental impact assessments, and business location decisions). Current GIS, in contrast, remain focused on support for individual work. Achieving ease of use for the applications that need it most will, thus, also require systems that facilitate (rather than impede) group work by enabling dialogue between or among human collaborators. Below, we sketch a scenario that illustrates the kind of dialogue-assisted system we envision and report progress toward here (this progress is detailed in section 3):

The setting: Imagine the command room of a state emergency management center in which Jane Smith, Center Director, and Paul Brown, chief transportation engineer, are in front of a large-screen display linked to the state emergency management GIS. They are discussing preparations for an approaching hurricane, focusing on predictions of potential inland flooding and on evacuation plans, given different possible storm tracks and different evacuation routes.

The 3-way dialogue below presents a small portion of how this scene might play out given the natural, collaborative interface that our research is directed toward. In the scenario, DAVE_G (Dialogue Assisted Visual Environment for Geoinformation) is the multimodal GIS.

{figure 1 about here}

Progress toward the multimodal, multiuser human-computer-human dialogue-enabled environment envisioned above requires both an understanding of individual modalities for interaction and the fusion of information at various levels. The central objective of the research we report on here is to develop a comprehensive theory of multimodal dialogue that provides a framework from which to design, implement, test,
and refine: (a) natural, human-centered interfaces to complex information systems that facilitate a two-way human-computer dialogue and that support human-human dialogue mediated by the computer; and (more specifically) (b) methods for engaging the modalities of speech, hand gesture, and gaze to support such dialogues with, and mediated by, an interactive dynamic map in the context of a GIS. To support natural human-computer interaction, we focus on the use of computer vision and speech processing as a means of interpreting and integrating information from two modalities, spoken words and free hand gestures. We are concerned specifically with how to enable a human-computer dialogue with an interactive, multi-layered, large-screen, map-based interface to a GIS and with map-mediated dialogue between human collaborators. The research is bootstrapped through use of an existing test-bed (called iMap) that integrates gesture and speech for simple map queries (Kettebekov et al., 2000; Sharma et al., 2000). While the approach we take is extensible to other domains, a core tenet in the theoretical perspective we present is that natural, multimodal systems are inherently context-dependent. The context we are focusing on is geospatial information access and decision-making. Important components of this context are the role of map-based display as a mediator for dialogue, frequent need for location specification, and the potential of the GISystem database to provide a rich source of information to ground HCD. The GIS context for our current work is being used to create the ontology needed for semantic analysis and formalizing speech-gesture-gaze theory for HCD more generally.

The following section sets the context for our work on multimodal, multiuser GIS by highlighting key, recent multidisciplinary developments. Then, in section 3, we
introduce the human-centered systems approach being applied to developing and implementing a Dialogue-Assisted Visual Environment for Geoinformation (DAVE_G) and detail the process of applying this approach and the outcomes from it. Specifically, we outline a user task analysis and associated scenario development (for a use-case grounded in the context of emergency response), introduce two precursors to the current DAVE_G prototype and detail current advances from these earlier efforts, describe the current DAVE_G system architecture, and discuss the approach taken to adding support for mixed-initiative human-computer dialogue. We conclude, in section 4, with a comparison of DAVE_G and earlier efforts to achieve multimodal GIS, a discussion of issues raised by our current DAVE_G implementation, and plans for further research.

2 Background

Developing natural interfaces to geographical information is a multi-faceted problem. To provide a basis for discussing the approach we have taken and the system we have implemented, we begin with a brief review of the relevant aspects of three primary research domains: supporting dialogue through natural-multimodal interfaces (with a focus on map-based environments), semantics to support geospatial information dialogue, and geospatial information access and group work environments.

2.1 Supporting dialogue through natural-multimodal interfaces

Whether designed for individual or collaborative interaction, the user interface is a critical component of any computer system. A frequently stated goal in design of interfaces for GIS is that they become transparent, that is, that they decrease the difference between mental models required to frame a problem in the knowledge domain (e.g., urban planning) and in the GIS tool domain (Nyerges et al., 1995). This goal of
transparent interfaces has its roots in the interface design literature more generally (Ruthkowski, 1982; Shneiderman, 1998; Stieiti et al., 1998) and we draw upon that work in our approach. One component of efforts to achieve interface transparency for GIS has been research on natural language forms of human-system interaction (Mark et al., 1987; Wang, 1994). Frank and Mark (Frank & Mark, 1991) argued that effective natural language interaction with GIS must support a dialogue (a perspective matched by increased attention to dialogue-enabled systems in HCI more generally (Jokinen et al., 2000)). They predicted that dialogue-based GIS would be available "within a few years"; however, the problem has proved harder than anticipated. Several research groups addressing the problem of voice-based interfaces for information access have focused on iterative dialogue with a map. Shapiro (Shapiro et al., 1991) linked a natural language understanding system, SNePS, with Arc/INFO (a commercial GIS) so that a natural language request (typed in by a user) could be translated into an Arc Macro Language (AML) command executable by the GIS. A second research prototype, GeoSpace (Lokuge & Ishizaki, 1995), supports natural language interactions with multi-layered maps. The significance of this system is its demonstration of the role of domain knowledge in enabling more reactive human-computer dialogues, using information-seeking dialogue in the mapping domain as an example. A third, more ambitious project, the MIT Voyager system (Zue et al., 1990; Glass et al., 1995), focused on development of a semantic management framework common to multiple languages, a framework including discourse, dialogue and database components, and showed how it is possible to apply this framework in a limited domain
(query of spatial objects in a limited geographic region). This is arguably one of the most successful efforts thus far in verbal dialogue systems for databases (whether spatial or not); however, Glass and colleagues (Glass et al., 1995) reported performance of only around 50% understanding. A fundamental weakness of the natural language approach to map and GIS interfaces thus far is that verbal descriptions of geographic phenomena are often sufficiently ambiguous that locations being referred to cannot be determined (Egenhofer & Golledge, 1998; Larson, 1996; Woodruff & Plaunt, 1994). Several authors have proposed using gesture as an alternative interface modality that supports more explicit location specification by direct sketching on a map display (Al-Kodmany, 2000; Nobre & Câmara, 1995) or through freehand gestures (Florence et al., 1996). Several sketch-based interfaces have been implemented, particularly as a substitute for formal database query languages (Blaser & Egenhofer, 2000; Egenhofer, 1997; Hopkins et al., 2001). Freehand gesture interfaces to maps remained hypothetical until the recent successes with iMAP by Sharma and colleagues (Sharma et al., 2000; Sharma et al., 1999). To address the weaknesses of using either natural language or gesture alone, several authors have argued for integrated language-gesture interfaces to map-based environments (Blaser & Egenhofer, 2000; Frank & Mark, 1991; Glass et al., 1995), but only a few have implemented them (Corradini, 2002; Oviatt, 1996; Sharma et al., 2000). One system, CUBRICON (Neal & Shapiro, 1991; Neal et al., 1989), combined speech with simplistic pointing "gestures" for referring to objects on a map display (while CUBRICON introduced a novel approach to human-map dialogue, only point indication of locations was supported, and gesture interaction was not actually implemented; only
mouse input was used). Even for non-geospatial information, there have been relatively few attempts to develop integrated gesture/speech interfaces, due to the challenges involved (Chang, 2000; Codella et al., 1992; Corradini, 2002; Fukumoto et al., 1994; Koons & Sparell, 1994; Vo & Waibel, 1994; Wahlster, 2002; Zue & Glass, 2000). In all implementations above, other than by Sharma, electronic pen or data-glove based gestures were used, resulting in tethered interaction through specialized devices rather than "freehand" interaction directly with the information. Advances in computer vision, specifically in hand gesture analysis (Pavlovic et al., 1997) and multimodal (speech-gesture) integration (Sharma et al., 1998), extend the potential of using freehand gestures in a multimodal interface. Using an integration of methods, recognition rates approaching 97% have been reported, achieved partly by using precisely defined static symbolic gestures, such as those of American Sign Language, that follow a rigid syntax and pre-defined grammar (Starner et al., 1998). Continuous, natural gestures are harder to interpret, since they exhibit considerable variability and the boundaries between successive gestures are not precisely defined; thus, recognition techniques still need to be developed that can handle continuous natural gestures (Pavlovic et al., 1997). A component of our research, not reported here, is focused on developing these gesture recognition methods.

2.2 Developing semantic frameworks to support geospatial information dialogue

Solving problems of recognizing, integrating, and interpreting input from different modalities is only part of the solution to supporting a natural human-GIS dialogue about geospatial phenomena (or GIS-mediated dialogue among humans). In addition, attention must be given to the multiple, perhaps very different semantic
frameworks through which geospatial knowledge is constructed, organized, and exchanged by human and computer. From a data modeling perspective, Egenhofer and Franzosa (Egenhofer & Franzosa, 1995) address one aspect of this semantic framework by describing a logic for different topological relationships among geographic objects. In Egenhofer's (Egenhofer, 1997) Spatial-Query-by-Sketch system, this logic is then used as the core of an approach for matching sketches (translated into semantic networks of spatial objects and relations) with the GIS database. In related work, Gargano et al. (1992) and Smith et al. (1992) both present approaches to a formal geospatial semantic algebra. Perhaps the main problem with these approaches is that they attempt to define meaning in an abstract form based on formal logic, algebra or calculus. It seems unlikely, however, that these concepts match well with the mental models used by individuals and communities to describe meaning; consequently they start from a place of poor semantic interoperability between people and machines. This mismatch makes most current approaches to geospatial semantics poorly suited to use in facilitating an effective human-machine dialogue. As an example, the concept of 'near' in natural language is often ambiguous and context dependent (Cai et al., 2003; Mark et al., 1987); thus it is difficult for a machine to exactly match a user's conceptualization of nearness. Such vague concepts are a likely cause for breakdowns in the traditional, query-answer style of human-GIS interaction. The lack of interoperable semantic frameworks may also impede attempts to exchange multi-source data and its associated meaning among individuals working on different GIS platforms.

2.3 Group decision support and collaborative work

Decision-making using GISystems is typically a multi-step, multi-participant
process (Jankowski et al., 1997), thus an ideal application context for use (and testing) of a dialogue-enabled, multiuser interface. Collaborative environments for analysis and decision-making with geospatial data have been an active research area, prompted in part by an NSF-funded National Center for Geographic Information and Analysis (NCGIA) workshop on collaborative spatial decision making (Densham et al., 1995). Much of the research stemming from the NCGIA workshop has focused on group negotiation and decision-making using GIS, issues that are relevant here (Armstrong & Densham, 1995; Nyerges & Jankowski, 1997; Shiffer, 1998). For reviews of subsequent work on geocollaboration, see (Jankowski & Nyerges, 2001; MacEachren, 2000, 2001).

Semantic frameworks for human-GIS interactions must also be expanded to support human-GIS-human dialogue in support of collaborative decision-making. A place to start is provided by recent theory and model development outside of GIScience focused on human-computer collaboration (Terveen, 1995), collaborative discourse theory (Grosz, 1996; Lochbaum, 1998; Grosz & Sidner, 1986, 1990), and agent-based dialogue management techniques dealing with task-oriented collaborative dialogue (Balkanski & Hurault-Plantet, 2000). Here, knowledge about the world is modeled as beliefs that can be modified and negotiated through interactive dialogue. Interacting agents (human or computer) are allowed (initially) to hold different semantic frameworks or knowledge constructs on a communicated concept, with the expectation that they will achieve compatible understanding through system-mediated dialogues to solve any semantic conflicts. Below, we outline how this approach is being adapted to the geospatial context with map-based display as an important part of system mediation.

3 DAVE_G: Initial Multiuser Implementation

Development of DAVE_G is following a human-centered approach that draws
upon the research outlined above and that involves attention to the needs of potential users at all stages of system design, development, and deployment. The approach is a highly iterative one that involves cycles of work with users (or potential users) and implementation (or refinement) of system components designed to support user tasks. Sections below discuss (a) the initial user task analysis and application of information obtained to develop an initial use scenario that prototypes are being developed to support, (b) an overview of system evolution and current functionality, (c) a discussion of the system architecture, and (d) a detailed account of our approach to system support for human-GIS dialogue.

3.1 User Task Analysis and Scenario Development

As detailed above, this research focuses on human-centered design,
implementation, and assessment of a collaborative, multimodal interface to GIS. We are applying cognitive systems engineering (CSE) methods to the process of task analysis and scenario development. CSE for complex system design represents a user-centered approach aimed at capturing practitioner knowledge about the domain and strategies for performing tasks or solving problems as well as information about the practitioners themselves. The theoretical foundations of this approach are grounded in the work of Hollnagel and Woods (Hollnagel & Woods, 1983), Vicente (Vicente, 1999) and Rasmussen et al. (Rasmussen et al., 1994) and adapted to the current state of CSE approaches to system design (see (Potter et al., 2000)). Specifically, this approach constructs models that represent a detailed understanding of how real experts currently work within their domain and includes the experts as participatory designers. Our user-

centered knowledge acquisition process involves a methodology that facilitates flexible access to information (open-ended questions and interviews), provides a medium for shared communications, and involves procedures that are compatible with the way the domain experts think about their work domain. Specifically, as a first stage, we are using CSE to conduct an in-depth work domain analysis of geospatial information use within our initial domain context of emergency management and response, with a focus on map use in hurricane response and management. Building on the AKADAM approach developed by McNeese et al. (McNeese et al., 1995) for designing aircraft pilot targeting systems, the approach uses a combination of questionnaires, individual and group concept mapping, critical incident analysis, and design storyboarding. Our overall objective is to elicit expert knowledge of the hurricane response and management process and the role of geospatial information in that process and to turn that knowledge into design blueprints. The work domain analysis is a multi-step process with each step generating insights that feed directly into system design. Here, we provide a brief synopsis of the initial step in this analysis (full results of the work domain analysis will be reported in a subsequent publication). One goal of this initial step was to develop a constrained but realistic scenario to serve as the basis for the first and the second generation DAVE_G prototypes (described in detail in section 4.2). The initial focus of our work domain analysis has been on the role and use of GIS and map-based displays during emergency situations, thus in the response stage of crisis management. We further narrowed this context to hurricane response as a representative, tractable case study. Work domain analysis began with an e-mail questionnaire sent to 12 emergency managers in Florida, Washington D.C., and Pennsylvania. Four replies were
received and detailed phone interviews were conducted with these individuals. The goal at this stage was to ground our initial prototype development in actual practice, not to develop a comprehensive understanding of GIS use for emergency management. Thus, detailed discussion with four individuals who are active users of GIS in emergency management applications was an adequate starting point to identify GIS-based response activities and operations in the context of disaster events and to determine at least some of the emergency response activities that require real-time map use. Results of the phone interviews and questionnaire guided selection of initial GIS functionality within DAVE_G as well as design of an initial emergency response scenario used as the target application of our first generation implementation. In terms of functionality, participants identified support for zoom, pan, buffer, display, and selection of geospatial data as the core functions that a GIS-based emergency response environment should have. These operations are fundamental to emergency tasks in transportation support, search and rescue, environmental protection, and firefighting. Hurricane response and management was selected as the domain for initial DAVE_G scenario development because it requires all of the emergency tasks reported as the most frequent GIS applications in emergency management. Based on analysis of the four questionnaire and interview results, the following tentative working assumptions about the nature of multimodal dialogue in emergency management underlie our prototype development efforts (it is important to remember that these assumptions, while driving some initial prototype implementation decisions, are likely to be modified over time, as will be the prototypes, as we continue with the iterative CSE process):

1. Human-GIS dialogues in real work settings are likely to use implicit (rather than explicit) commands to express information needs. For example, a user is likely to mention the need to see traffic conditions, rather than tell the system to show specific spatial data layers of streets and highways.

2. Human-GIS dialogues are not composed of grammatically correct sentences, but instead are full of disfluencies, interruption, co-reference, and sentence fragments. Still, they are coherent and understandable when the overall context of a dialogue is considered. For example, an emergency manager might request information about mobile home populations (using the word "trailers" to represent them), emergency shelters (perhaps abbreviated to "shelters"), and adult assisted living facilities (perhaps calling them nursing homes even though the category includes other kinds of facilities), and might mention these in no particular order, but with the overall goal of gaining information to make an evacuation decision.

3. People often make heavy use of task-related knowledge and common sense in communicating their information needs. Again this is observable from the scenario presented in Figure 1 (which was informed by the telephone interviews). For example, Jane and Paul were talking about 'flooded area', 'trouble spots', and 'bottleneck', which all have domain/context-specific meanings.

4. People's expression of their spatial information needs is often vague, not sufficient or detailed enough for a system to act. This happens because most spatial or categorical terms are imprecise, with context-dependent meanings. One strategy to address this is for the system to initiate dialogues with the user when further information, clarification, or confirmation is required.

The phone and questionnaire results also determined the focus for and helped to recruit participants in subsequent onsite, in-depth work domain analysis. Emergency

managers in two state Emergency Operations Centers (Columbia, SC and Tallahassee, FL) and two county Emergency Operations Centers (Horry County, SC and Charleston County, SC) agreed to participate in the onsite component of the work domain analysis (which is still underway and will be reported in a subsequent publication). Preparation for these on-site visits involved several "bootstrapping" activities – activities designed to learn enough about the knowledge domain of interest to develop and carry out productive field activities with domain experts. In this case, the bootstrapping involved one of us (Brewer) studying crisis management training manuals and taking an online course for emergency responders. The combination of domain knowledge from phone interviews, questionnaires, and bootstrapping activities was used to inform development of initial crisis response scenarios (part of one is portrayed in Figure 1). These scenarios serve two purposes. First, they provide the focus for initial system design and assessment activities (our initial goal has been for DAVE_G to support at least the combination of information requests inherent in this hurricane response scenario). Second, scenarios are being used as a device to prompt discussion and to structure concept mapping exercises with domain experts.

3.2 DAVE_G – An Overview

Based on the framework for a multimodal GIS interface proposed in this paper, a
research prototype, DAVE_G, has been developed. The system builds on and extends the earlier multimodal interface framework developed by Sharma et al (Sharma et al., 1998) and two test-bed implementations, iMap and XISM (Kettebekov et al., 2000). DAVE_G has undergone two versions of evolution (identified as DAVE_G1 and DAVE_G2 below). We reported on DAVE_G1 elsewhere (Rauschert et al., 2002) and that version will be

described here only briefly, for the purpose of comparison. Subsections below detail the evolution of DAVE_G1 into DAVE_G2 and outline the current system functionality.

3.2.1 Evolution of DAVE_G System

{figure 2 about here}

In DAVE_G1 (see Figure 2), we achieved two initial goals of the envisioned multimodal GIS interface: (1) understanding well-formed geographical information requests expressed through highly restricted gesture and spoken commands, and (2) supporting (at a rudimentary level) the interaction of multiple users conducting collaborative group work with geospatial data. The system made use of ArcObjects® to provide basic GIS functionality. The interface implemented was able to recognize and respond to typical GIS commands (zoom, pan, buffer, display, and selection) issued through free-hand gesture/speech (within the application context of emergency management for hurricane events). It also supported two users who share control of the system. However, DAVE_G1 dialogue and collaboration functions were extremely limited. The system acted on single requests (utterances), and had no mechanism to represent or use dialogue history. DAVE_G1 only understood the facts or actions communicated from the users to the system (i.e. user-led dialogue). These system limitations derived from an initial decision to rely heavily on grammar-based syntax/semantic parsing techniques for language understanding and the extraction of GIS commands from a single user's requests. Ill-formed and incomplete user requests were not supported.

{figure 3 about here}

DAVE_G2 (see Figure 3) incorporates significant enhancements in dialogue capabilities to the core multimodal interaction system. It also replaced the ArcGIS-based geographical information server with ArcIMS as the GIS back-end. This provides the potential to support interaction with multiple, remote data sources. The system now uses an explicit discourse model to handle incomplete requests by keeping track of information exchange in iterative interactions and by initiating subdialogues for seeking additional information and clarifications from the user, if necessary. Discourse context is represented as semantic frames that are activated and composed during language understanding processes.

This frame-based approach, when combined with mixed-initiative dialogue control, allows more flexibility in user input, since information items needed to carry out a command do not have to be provided in a particular order and a complete command can be incrementally constructed by a combination of user instructions and system-led questions. For a comparison of the two prototype versions of DAVE_G and planned future developments, see Table 1.

Table 1. Comparison of features of DAVE_G across versions of prototype implementation

Speech processing
- DAVE_G1 (Version 1): Full sentence parsing using syntax grammar
- DAVE_G2 (current version): Grammar-based parsing to extract content-bearing phrases
- DAVE_G3 (future version): Grammar to capture content-bearing phrases as well as spontaneous speech (ellipses, pauses, and afterthoughts) explicitly

Semantic interpretation
- DAVE_G1: Semantic grammar (unified with the syntax grammar)
- DAVE_G2: Semantic frame-based input understanding for extracting GIS commands
- DAVE_G3: Plan-based collaborative discourse understanding; use of discourse history and focus stack to increase the efficiency of dialogue

Dialogue control
- DAVE_G1: None
- DAVE_G2: Mainly user-led dialogue, but system asks for missing parameters; frame-based dialogue control
- DAVE_G3: Mixed-initiative dialogues with clarification, confirmation, and proactive response; agent-based dialogue control to manage collaboration among multiple users

3.2.2 Current Functionality

DAVE_G2 provides the user with a variety of data querying, navigating and drawing capabilities. Natural speech and gestures allow the user to express requests that are closely tied to many common GIS queries (see Table 3). We distinguish between requests that rely on spatial references provided by gestures (e.g. pointing or outlining), and requests that can be expressed solely by speech. Some requests can be made either with or without gestures (e.g. "zoom here" or "zoom to Florida"). This offers a user the flexibility to indicate intentions through either modality at any time and thus increases the chance for a satisfactory and efficient interaction with the GIS. DAVE_G2 also supports a free hand drawing mode, which is currently limited to very simple user input, specifically drawing dots, lines or circles on the map. More sophisticated annotation functionality is under development.

Table 2. Supported functionality and possible user requests

Data Query: Show/Hide Features; Select/Highlight Features; Create/Show/Hide Buffers
Navigate: Pan left/right/up/down; Center at; Zoom in/out/area/full extent
Draw: Circle; Line; Free Hand

Deictic and Iconic gestures (following the taxonomy proposed by Rime and Schiaratura (Rime & Schiaratura, 1991)) are currently supported by DAVE_G2. Examples include the ability to: select cities that are on one side of a position (spatial component indicated with a linear gesture), highlight a critical facility (spatial component indicated by a pointing gesture), exclude locations outside of a region from consideration (spatial component indicated with an area gesture). More generally, users can indicate any specific location or extent through gesture (e.g. “zoom to include this area” – with a pointing or area gesture indicating the referent for this and the interpretation of the

request based on which kind of gesture is sensed). While gestures are particularly relevant when spatial references are used to subset displayed features, some requests to DAVE_G2 can be initiated by speech alone. These include requests to display features or themes (e.g. highways, cities, counties), create buffers around named places or kinds of entity (e.g. around cities or selected features), as well as some generic navigation commands (e.g. zoom in, pan left) that do not explicitly specify an extent or a desired magnitude (default values are used instead).

Multimodal dialogues with DAVE_G2 are more flexible and robust than those with DAVE_G1. Real user requests often involve complex database queries and cartographic operations that are difficult to specify in one utterance, even if multimodal. Using system dialogue capabilities, a user can specify a complex request through a series of relatively simple statements/requests that are recognized as part of a dialogue. DAVE_G2 uses domain knowledge (about how common GIS operations are structured and composed) to decide how the user's inputs from multiple dialogue iterations relate to each other in a goal-planning hierarchy. To avoid asking users an excessive number of questions in cases of unspecified parameters, the system weighs the relative importance of each parameter to decide whether an explicit or implicit approach should be used to obtain it. When an explicit approach is adopted, the system asks direct questions. This approach is used when a missing parameter is judged important and the system has little knowledge about how to obtain it otherwise. Implicit approaches involve providing a best guess by the system and asking the user to verify the result through visual or verbal feedback. The latter approach is applied when the system has captured sufficient knowledge about the common practices of a particular application or a user, or when a parameter is given a
low importance. Dialogue management needs to balance the use of these two strategies so that the dialogue is helpful and is not tedious.

DAVE_G2 supports indirect and implicit references to themes and features on the map that were mentioned or used in previous requests, and allows the user to make further GIS queries on these features. It does this with a focus stack representation of the attentional state of a dialogue. For example, the user might select certain features by using a circular gesture and then perform a buffer operation by simply referring to them as "those features" without the need to perform the same gesture again.

DAVE_G2 (again, in contrast to DAVE_G1) implements feedback strategies that can be visual, verbal (textual), or a combination of both. The most basic ones are updates of the GIS display in response to requested actions (e.g., visual feedback that indicates whether the GIS is busy/free or if a map is generated). Gesture feedback is provided by a cursor for each user, displayed on the map to indicate recognized hand movements. The interface also shows recognized gestures (as graphics overlaid on top of the map display) and recognized speech input (in a text box). Such feedback techniques are designed to help the user's understanding of the system's status and the construction of a correct model about the system's behaviors.

Another unique functionality of DAVE_G2 is its ability to handle requests that involve vague spatial concepts. Concepts can be vague spatially in multiple ways; for example, they might include words or phrases that refer to geographical objects with indeterminate boundaries (such as the Rocky Mountains) (Burrough & Frank, 1996) or spatial relationships that exhibit the properties of the Sorites paradox (Fisher, 2000). By incorporating the principles of human-computer collaboration, we have demonstrated
(reported in (Cai et al., 2003)) that DAVE_G2 is capable of communicating the concept of 'near' (a typical vague spatial concept) through iterative dialogues guided by a dynamic system model about the user.
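
The focus-stack mechanism described above can be illustrated with a small sketch. The code below is not part of DAVE_G2; it is a minimal, hypothetical illustration (class names and fields are assumptions) of resolving an anaphoric phrase such as "those features" against a stack of recently discussed map entities.

```python
# Minimal sketch (not DAVE_G code) of resolving an anaphoric reference such as
# "those features" against a focus stack of recently used map entities.

from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class MapEntity:
    """A feature set mentioned or selected in an earlier dialogue turn."""
    description: str          # e.g. "cities selected by circle gesture"
    feature_ids: List[int]    # identifiers in the GIS database (hypothetical)


@dataclass
class FocusStack:
    """Attentional state: the most recently discussed entities sit on top."""
    _stack: List[MapEntity] = field(default_factory=list)

    def push(self, entity: MapEntity) -> None:
        self._stack.append(entity)

    def resolve(self, phrase: str) -> Optional[MapEntity]:
        # Crude heuristic: an anaphoric phrase refers to the most recent entity.
        if phrase.lower() in {"those features", "them", "these"}:
            return self._stack[-1] if self._stack else None
        return None


if __name__ == "__main__":
    focus = FocusStack()
    focus.push(MapEntity("cities selected by circle gesture", [101, 102, 107]))
    referent = focus.resolve("those features")
    print(referent)  # a buffer request can now reuse the earlier selection
```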

3.3 Architecture of DAVE_G

DAVE_G2 is based upon a client-server architecture comprised of three modules:
Human Interaction Handling, Human Collaboration and Dialogue Management, and Information Handling (see Figure 4). Communication among these modules is implemented as request-response message pairs using a predefined XML-encoded protocol. The open, client-server architecture allows scalability and opens the possibility of future distributed and collaborative settings, such as distributed Internet applications in which geographically separated users can participate in a joint decision-making process.

{figure 4 about here}
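
To make the inter-module protocol concrete, the sketch below shows what one request-response exchange of the kind described above might look like. The element and attribute names are illustrative assumptions; the paper specifies only that a predefined XML-encoded protocol is used.

```python
# Sketch of an XML request-response message pair of the kind exchanged between
# the Human Collaboration and Dialogue Management module and the Information
# Handling module. Tag names are hypothetical, not the DAVE_G protocol.

import xml.etree.ElementTree as ET


def build_request(action: str, layers, extent) -> str:
    """Encode a GIS action request as XML."""
    req = ET.Element("REQUEST", attrib={"action": action})
    layers_el = ET.SubElement(req, "LAYERS")
    for name in layers:
        ET.SubElement(layers_el, "LAYER", attrib={"name": name})
    ET.SubElement(req, "EXTENT", attrib={
        "minx": str(extent[0]), "miny": str(extent[1]),
        "maxx": str(extent[2]), "maxy": str(extent[3]),
    })
    return ET.tostring(req, encoding="unicode")


def parse_response(xml_text: str) -> dict:
    """Decode a response carrying a generated map reference or an error."""
    root = ET.fromstring(xml_text)
    return {"status": root.get("status"), "map_url": root.findtext("MAP_URL")}


if __name__ == "__main__":
    print(build_request("ShowMap", ["shelters", "highways"],
                        (-83.0, 24.5, -79.8, 31.0)))
    print(parse_response('<RESPONSE status="ok"><MAP_URL>map42.png</MAP_URL></RESPONSE>'))
```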

3.3.1 Human Interaction Handling module

The human interaction handling module is the client tier and it has two types of components: human reception control (HRC) and a display control (DC). Each HRC component captures a single user’s speech and gesture input and generates proper descriptions of recognized words, phrases, sentences, and gestures to be used for higher level processing. It uses a single, non-calibrated active camera to find and track a user’s head and hand in the current field of view based on skin color and motion cues. The captured hand trajectory is used to recognize pointing, line and circular gestures using Hidden Markov Models (HMM) (see (Krahnstoever et al., 2002) for a more detailed description of the tracking and gesture recognition methods applied). To recognize spoken commands, a microphone captures the user’s utterances, which are then processed

by standard speech recognition software. A context-free grammar defines syntactically correct phrases and sentences that are to be recognized. In a post-processing step, recognized sentences are split into structurally meaningful phrases using annotations associated with each word within the grammar definition. The reception control unit performs an early speech-gesture fusion by identifying the recognized gestures that are most meaningful for a given utterance. It separates and forwards immediate feedback information (such as hand position and recognized utterances) directly to the display control, while recognized utterances and corresponding gesture descriptions are also sent to the Human Collaboration and Dialogue Management module. To enable multiple users collaborating, the system uses several instances of HRC. This approach requires having one camera and one microphone for each user, but minimizes the problem of interference that would occur if users had to share one set of input devices. The DC handles screen rendering for the system. It receives system responses (GIS-generated maps, textual messages, and speech output) and synchronizes the presentation of multimedia dialogues. Theme information displayed includes a legend of available layers, active layers and visible layers. An important component of dialogue handled by the DC is direct feedback in response to user actions (see section 3.2.2 for description of this feedback). The Human Interaction Handling module itself is also designed in a client-server form, separating the individual reception control units from each other and from the display control. Since real-time visual, gesture feedback requires high information flow rates between the Reception Control client and the Human Interaction Handling server

module, communication between those subcomponents is implemented with TCP/IP streaming.
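
The sketch below illustrates, under assumed data structures, how an HRC component's output could be split into the two streams described above: immediate feedback routed to the display control, and recognized utterance and gesture descriptions routed to the HCDM module. It is a simplified stand-in, not the DAVE_G implementation.

```python
# Sketch (hypothetical structures) of per-user HRC output routing: feedback for
# the display control versus dialogue content for the HCDM module.

from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class GestureObservation:
    kind: str                                  # "point", "line", or "circle"
    trajectory: List[Tuple[float, float]]      # normalized screen coordinates
    t_start: float
    t_end: float


@dataclass
class RecognizedUtterance:
    user_id: int
    text: str
    phrases: List[str]       # content-bearing phrases split via grammar annotations
    t_start: float
    t_end: float


def route_hrc_output(utterance: RecognizedUtterance,
                     gestures: List[GestureObservation]):
    """Split HRC output into display feedback and dialogue-manager input."""
    feedback = {"user": utterance.user_id,
                "cursor": gestures[-1].trajectory[-1] if gestures else None,
                "caption": utterance.text}
    dialogue_input = {"utterance": utterance, "gestures": gestures}
    return feedback, dialogue_input


if __name__ == "__main__":
    g = GestureObservation("circle", [(0.4, 0.4), (0.6, 0.6)], 2.1, 3.0)
    u = RecognizedUtterance(1, "zoom to this area", ["zoom to", "this area"], 2.0, 3.4)
    print(route_hrc_output(u, [g]))
```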

3.3.2 Human Collaboration and Dialogue Management (HCDM) module

The Human Collaboration and Dialogue Management (HCDM) module receives recognized gesture and speech components from all active users (clients), derives the meaningful commands intended by individual users, and coordinates the execution of these commands. The HCDM module has two closely coupled components: collaboration manager (CM) and dialogue manager (DM). When multiple users are active on the system, the CM performs necessary fusion and conflict management. Inputs from individual users can be related in two ways. The first is "fusion by merging contributions", and is applied when users work collaboratively toward a common goal. The second approach is "fusion by conflict resolution", and is applicable to situations where users negotiate over conflicting goals and shared resources. As implemented so far, each user's input is interpreted as separate commands, and temporal synchronization of incoming user requests is performed. This is currently being extended by an agent-based collaboration manager that keeps track of unresolved, ambiguous, and conflicting requests to GIS databases. The dialogue manager (DM) does more than simply pass instructions to and receive responses from the GIS server. It implements strategies to deal with ill-formed and ambiguous input, and it makes decisions about what information needs to be collected from the users and what actions need to be initiated on the geographical information server. Using a semantic frame-based dialogue state representation, dialogue flow between users and the system is not predetermined but depends on the content of
current user inputs and the information accumulated in the dialogue history. At any moment, there is one active frame representing the current focused action (GIS command) under discussion. Users' inputs are interpreted as contributions to the current action frame by filling one or more parameter slots, or by modifying values of some slots in the frame. After processing all the input, the system will begin a response planning stage to generate a list of questions and prompts relevant at that moment, and make decisions about what to communicate to the user. When sufficient information is collected on the current action, the active frame is passed on to the information control for the formation and execution of a corresponding GIS query. Information returned from information control can be the results of a successful query (maps and textual messages) or error messages (when queries were not successful).
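
A minimal sketch of the collaboration manager's two fusion modes follows. The contribution structure and the conflict test are illustrative assumptions, intended only to show how "fusion by merging contributions" and "fusion by conflict resolution" differ.

```python
# Sketch of relating inputs from multiple users: merge compatible slot
# assignments toward a shared goal; flag clashing ones for conflict resolution.

from dataclasses import dataclass
from typing import Dict, List


@dataclass
class UserContribution:
    user_id: int
    action: str                 # e.g. "ShowMap"
    slots: Dict[str, object]    # partial parameter assignments


def fuse(contributions: List[UserContribution]):
    """Merge compatible slot assignments; collect conflicting slots."""
    merged: Dict[str, object] = {}
    conflicts: List[str] = []
    for contribution in contributions:
        for slot, value in contribution.slots.items():
            if slot in merged and merged[slot] != value:
                conflicts.append(slot)   # hand off to negotiation via dialogue
            else:
                merged[slot] = value
    return merged, conflicts


if __name__ == "__main__":
    a = UserContribution(1, "ShowMap", {"layers": ("shelters",)})
    b = UserContribution(2, "ShowMap", {"extent": "Horry County"})
    print(fuse([a, b]))   # compatible: contributions merge into one command
```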

3.3.3 Information Handling Module

Finally, the Information Handling module processes all filtered user commands by forming correct GIS queries and issuing the query to a GIS server through proper query interfaces. The process is driven by events (sent from the dialogue control) that specify the information necessary to query a GIS. GIS action control (GAC) takes care of all GIS related queries and maintains information regarding the current status of the GIS. The layer information maintains a list of all visible, active and non-visible layers. It also has information that tells whether the GIS is busy and informs the user that the GIS is processing the query. Moreover, the GAC holds an updated copy of the GIS-generated map (as an image) and it keeps updating it on the basis of the user’s queries. GIS query interface (GQI) is a library that uses ArcXML® to expose the GIS functionality through

the ArcIMS® Server. It takes the GIS queries and returns the map generated by the server. Currently, GAC does direct translation between an action frame and the corresponding GIS queries, and the GQI does simple dispatch of queries to one of the geographical information services available locally or over the network. Future extensions of this module will enable more ontology-driven queries and communications with multiple geographical information servers that may have different capabilities of responding to user queries.
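
The following sketch illustrates the GAC/GQI hand-off described above: an action frame is translated into an XML map request and dispatched to a map service. The request body and service interface are simplified illustrations and do not reproduce the actual ArcXML schema or the ArcIMS API.

```python
# Sketch of a GIS query interface (GQI) dispatch: encode an action frame as an
# XML map request and post it to a map service. Illustrative only; not the
# ArcXML schema used by DAVE_G.

import urllib.request


def frame_to_request(frame: dict) -> str:
    layer_tags = "".join(f'<LAYER name="{name}"/>' for name in frame["layers"])
    minx, miny, maxx, maxy = frame["extent"]
    return (
        "<MAP_REQUEST>"
        f'<ENVELOPE minx="{minx}" miny="{miny}" maxx="{maxx}" maxy="{maxy}"/>'
        f"<LAYERS>{layer_tags}</LAYERS>"
        "</MAP_REQUEST>"
    )


def dispatch(frame: dict, service_url: str) -> bytes:
    """POST the encoded request to a map service and return its raw reply."""
    body = frame_to_request(frame).encode("utf-8")
    req = urllib.request.Request(service_url, data=body,
                                 headers={"Content-Type": "text/xml"})
    with urllib.request.urlopen(req) as resp:   # network call; needs a live service
        return resp.read()


if __name__ == "__main__":
    frame = {"action": "ShowMap",
             "layers": ["assisted_living_facilities", "flood_zones"],
             "extent": (-81.1, 33.9, -80.9, 34.1)}
    print(frame_to_request(frame))  # dispatch(...) would require a reachable URL
```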

3.4 Execution Cycles in DAVE_G2: Human-GIS Dialogue

Users interact with the map display by giving gesture and speech inputs to
DAVE_G2. The system is driven by users' interaction events. Each event is a multimodal utterance that is comprised of a spoken command accompanied (optionally) by one or more gestures. An execution cycle starts with the reception of a user's utterance and ends with the system's response to that event (see Figure 5). During each execution cycle, the system analyzes each user's inputs to infer intended actions, collects parameters required by the action, derives corresponding GIS queries, and generates responses based on the result of the GIS query execution. These stages in an execution cycle are illustrated in Figure 5, and are discussed in more detail below.

{figure 5 about here}
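
The sketch below condenses one execution cycle into code. The helper function is a trivial stand-in for the analysis, interpretation, and planning components detailed in sections 3.4.1-3.4.4; slot names and example values are assumptions, not DAVE_G2 internals.

```python
# Sketch of one execution cycle: interpret a multimodal utterance, update the
# active action frame, then either ask a follow-up question or hand the frame
# to the GIS back-end. Helpers are simplified stand-ins.

def interpret(utterance: str, gestures: list, frame: dict) -> dict:
    """Stand-in for speech/gesture analysis plus command interpretation."""
    if "assisted living" in utterance:
        frame["slots"]["layers"] = ["assisted_living_facilities"]
    if gestures:                          # a gesture supplies the spatial extent
        frame["slots"]["extent"] = gestures[0]
    return frame


def execution_cycle(utterance, gestures, frame):
    frame = interpret(utterance, gestures, frame)
    missing = [s for s, v in frame["slots"].items() if v is None]
    if missing:                           # dialogue planning: ask the user
        return {"type": "question", "ask_for": missing}
    return {"type": "gis_query", "frame": frame}   # hand off to GAC/GQI


if __name__ == "__main__":
    frame = {"action": "ShowMap", "slots": {"layers": None, "extent": None}}
    print(execution_cycle("show the assisted living facilities", [], frame))
    print(execution_cycle("in this area", [("circle", (0.2, 0.3, 0.5, 0.6))], frame))
```

The second call shows the incremental, mixed-initiative character of the cycle: the frame carried over from the first turn is completed by a later utterance and gesture.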

3.4.1 Speech and Gesture Analysis

The general conversational acts in a dialogue with DAVE_G involve three types of user actions: request, reply or inform. Requesting information (e.g., asking for a map,

or modification of a currently displayed map, with certain features) is the most frequently used interaction. A reply action might be observed when a user responds to clarification questions from DAVE_G. Informing actions involve direct communication with DAVE_G to provide facts about the current task or to indicate intentions. Although supported by the speech recognition module, and captured by the grammar, informing actions are only partially supported thus far (complete integration with the dialogue manager is planned for the next version).

The accurate recognition of utterances is crucial to the overall performance of DAVE_G2 and poses a great challenge to available speech recognition systems. The key to reliable performance is a finely tuned context-free grammar that defines syntactically and semantically correct phrases and sentences, constraining the interpretation of continuous speech signals into words and groups of words that constitute meaningful phrases. The most common user request is modeled as an action applied to an entity (see Figure 6). The requested action is, thus, applied to an entity that can be any feature on the map or an attribute of an entity. Each entity can be further expressed with qualifiers, as in "all cities" or "this road". The most complex and challenging part of the grammar definition is prepositional phrases (e.g. "areas that will flood" or "cities which lie within this area").

{figure 6 about here}

Annotations that are associated with each word are used to group words into meaningful phrases. For example, the grammar shown in Figure 6 will recognize a
spoken input "Highlight all assisted living facilities that are likely to flood within this area" and group it into the five phrases {"Highlight", "all assisted living facilities", "that are likely to flood", "within", "this area"}. In this example, the user indicates the area of interest ("this area") by performing a contour or circular gesture. To account for multiple recognized gestures during a single utterance, a temporal alignment is performed on all occurring multimodal input events to maximize their co-occurrence probability. Past empirical studies have found that this alignment differs significantly among users and therefore is hard to model (Kettebekov et al., 2000). To account for this high variance, we define the probability for co-occurrence on a small fixed time window around the recognized utterance in which such a multimodal interaction input is most likely to occur for all users. Gestures that are recognized within this time frame for a given user request are then encoded with the corresponding utterance and further processed once the semantic meaning of the user request has been analyzed.
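
A minimal sketch of the fixed-window co-occurrence test described above is given below; the window size and event structure are illustrative assumptions, not the values used in DAVE_G2.

```python
# Sketch of binding gestures to an utterance via a fixed co-occurrence window.

from dataclasses import dataclass
from typing import List


@dataclass
class TimedEvent:
    label: str
    t_start: float
    t_end: float


def bind_gestures(utterance: TimedEvent, gestures: List[TimedEvent],
                  window: float = 1.5) -> List[TimedEvent]:
    """Return gestures whose time span overlaps the utterance +/- window seconds."""
    lo, hi = utterance.t_start - window, utterance.t_end + window
    return [g for g in gestures if g.t_end >= lo and g.t_start <= hi]


if __name__ == "__main__":
    spoken = TimedEvent("highlight ... within this area", 10.0, 13.2)
    observed = [TimedEvent("circle", 12.8, 14.0), TimedEvent("point", 4.0, 4.6)]
    print(bind_gestures(spoken, observed))   # only the circle gesture is bound
```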

3.4.2 Semantic Analysis of Spoken Utterance

The recognized words and phrases from speech input are further analyzed to derive their meaning representations. Using a simple “concept spotting” technique, a phrase or gesture in an utterance may be mapped into either an action corresponding to one of the GIS commands (see Table 4) or into entities or spatial features that correspond to parameters needed by the GIS command under discussion. The mapping from phrases to meanings is not a one-to-one correspondence, making this a challenging task. For example, depending on the context of the utterance, a linguistic reference to “Philadelphia” can be mapped to either a point feature representing the abstract location

of the city, or to an areal feature representing the spatial extent as well as location of the city. When multiple mappings are possible, we use contextual information (such as the user's task, application domain, and map scale) to narrow down to the most appropriate choice. We use a phrase-to-meaning relational schema to capture the semantic knowledge about phrases. See Table 4 as an example.

Table 4. Semantic meaning of phrases

Phrase                       Category         Meaning Representation
"Show"                       Action           "ShowMap" frame (see Fig. 7)
"Assisted living facility"   Layer Object     "Assisted living facility" layer
"this area"                  Circle gesture   Polygon shape
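
The concept-spotting lookup can be sketched as a small relational table keyed by phrase, mirroring the examples in Table 4. The lookup logic below is an illustration only, not the DAVE_G2 schema.

```python
# Sketch of a phrase-to-meaning lookup for "concept spotting": phrases (or
# gesture categories) map to actions, layer objects, or shapes, as in Table 4.

PHRASE_MEANINGS = {
    "show":                     ("Action",         "ShowMap frame"),
    "assisted living facility": ("Layer Object",   "Assisted living facility layer"),
    "this area":                ("Circle gesture", "Polygon shape"),
}


def spot_concepts(phrases):
    """Map each recognized phrase to its category and meaning representation."""
    meanings = []
    for phrase in phrases:
        entry = PHRASE_MEANINGS.get(phrase.lower())
        if entry is not None:
            meanings.append((phrase, *entry))
    return meanings


if __name__ == "__main__":
    print(spot_concepts(["Show", "assisted living facility", "this area"]))
```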

3.4.3 Command Interpretation

The meanings of multimodal primitives (phrases and gestures) are combined into a coherent meaning of the whole utterance using a semantic frame approach. As mentioned before, the system maintains a set of semantic frame templates, each of which corresponds to one of the GIS commands to be supported. When a phrase corresponding to a GIS command is mentioned, a semantic frame is created based on the corresponding frame’s template. For example, the general action of showing a map with a number of specified layers within a certain extent can be represented as a template shown on the right side of Figure 7. This frame has two parameter slots: “layers” and “extent.” The “layers” slot can be filled with one or more map layers, and the “extent” slot is typed as a rectangular area but can take any input that refers to a shape (including spatial features and gestures). For any recognized multimodal input with a set of phrases and gestures, the command interpretation applies the following algorithm:

Step 1: Scan through all the meaning chunks (coherent units derived from integrating phrases and gestures) and look for an action α. If found, create a semantic frame representation of α based on the template for the action.

Step 2: For all phrases other than the action phrase, define each phrase as providing values to one of the parameter slots in the frame.

Following this two-step procedure, an input, "Show the assisted living facilities in this area (circular gesture)", will be parsed into a semantic frame representing a "ShowMap" action with its two parameter slots ("layers" and "extent") filled by the "assisted living facilities" layer and the spatial extent of the region corresponding to the gesture shape, respectively (see Figure 7).

{figure 7 about here}

The frame-based command interpretation also plays the role of fusing gesture and speech inputs into a coherent meaning representation of utterances and dialogues. As depicted in Figure 7, the system is able to detect the linkage between the verbal reference, "this area," and the circular gesture and then make proper assignment to the parameter slots in the action frame. Since each parameter slot is typed, the system applies type constraints when filling the frame's slots with values derived from speech or gesture phrases. When ambiguities are present, discourse contexts, spatial contexts, and task contexts are used to ensure proper semantic assignment.
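
A compact sketch of the two-step procedure follows. The frame template and type labels are simplified assumptions; DAVE_G2's actual frame definitions are richer.

```python
# Sketch of the two-step command interpretation: Step 1 instantiates a frame
# for the spotted action; Step 2 assigns remaining chunks to typed slots.

FRAME_TEMPLATES = {
    "ShowMap": {"layers": "layer", "extent": "shape"},   # slot name -> expected type
}


def interpret(chunks):
    """chunks: list of (category, value) pairs from semantic analysis."""
    # Step 1: look for an action and create its semantic frame.
    frame = None
    for category, value in chunks:
        if category == "action" and value in FRAME_TEMPLATES:
            slot_types = FRAME_TEMPLATES[value]
            frame = {"action": value,
                     "types": dict(slot_types),
                     "slots": {slot: None for slot in slot_types}}
            break
    if frame is None:
        return None

    # Step 2: assign each remaining chunk to an empty slot with a matching type.
    for category, value in chunks:
        if category == "action":
            continue
        for slot, expected in frame["types"].items():
            if category == expected and frame["slots"][slot] is None:
                frame["slots"][slot] = value
                break
    return frame


if __name__ == "__main__":
    chunks = [("action", "ShowMap"),
              ("layer", "assisted living facilities"),
              ("shape", ("polygon", [(0.1, 0.2), (0.4, 0.2), (0.3, 0.5)]))]
    print(interpret(chunks))
```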

3.4.4 Dialogue Planning and Generation

Dialogue planning is the process of forming proper strategies and action plans for responding to the last input from the user. Dialogue functions in DAVE_G2 rely on the

status of the current semantic frame as a representation of the dialogue state and as the basis for planning responses. After the user's input is properly interpreted, the system will examine the action frame to see whether enough information has been collected from the user. If yes, the information captured in the action frame will be passed to the GAC and GQI for the execution of that action by the GIS. If the query is successfully executed, a map image will be returned from the GIS server.

In many cases, a user's multimodal commands may be underspecified or incomplete in the sense that the user did not provide adequate input for the system to fill all the parameters required by the current semantic frame of action. This is where the power of a system that supports dialogue management, rather than simply query response, becomes apparent. In DAVE_G2, we have implemented two approaches (thus far) to deal with missing parameters: (1) the system generates a question to ask the user for further information, and delays execution of GIS queries until all the information about the current action is collected; or (2) the system fills in default values for those missing parameters, and then immediately executes the GIS query to generate a map M0. This map may not be what the user expected, and hence the user is given an opportunity to verify and correct any deviation. The verification is done by prompting the user's attention to the map M0 produced and asking for further input. It is expected that the user will either confirm the displayed map as correct, or provide the system with instructions about how to adjust parameter values.


The first approach provides no visual feedback until all the required parameters are filled. It results in quick correction when the user knows the value of the missing parameter(s) exactly. In contrast, the second approach uses more visual feedback to prompt the user's specification of parameters, and it is more appropriate in situations where the user has difficulty providing exact parameter values.

We have now described a full cycle of execution (as depicted in Figure 5) that processes one round of user input and generates an appropriate response. Each execution cycle updates the dialogue state (as represented by a semantic frame), which is then used to plan the response and anticipate the next input. DAVE_G2 is event-driven and is capable of engaging in simple mixed-initiative dialogues with the user.
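One way to picture that event-driven cycle is the loop below, again a sketch under the same assumptions, reusing interpret_command and plan_response from the earlier sketches; get_next_input, integrate, and render are hypothetical callbacks rather than DAVE_G module interfaces.

```python
def dialogue_loop(get_next_input, integrate, execute_query, ask_user, render):
    """One execution cycle per user turn (cf. Figure 5): interpret, plan, respond, update."""
    dialogue_state = None                       # the current semantic frame, if any
    while True:
        utterance = get_next_input()            # next fused speech/gesture event
        chunks = integrate(utterance)           # multimodal integration -> meaning chunks
        # Carry over an open frame when no new action is mentioned; a fuller version
        # would also merge the new chunks into that carried-over frame.
        frame = interpret_command(chunks) or dialogue_state
        if frame is None:
            ask_user("Sorry, I did not understand. Could you rephrase?")
            continue
        map_image = plan_response(frame, Strategy.ASK_FIRST, execute_query, ask_user)
        if map_image is not None:
            render(map_image)                   # display the map returned by the GIS
        dialogue_state = frame                  # updated state informs the next turn
```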

9. Discussion and Continuing Research

Developing the natural interface envisioned in this paper requires careful attention to the human user at all stages of design, development, and deployment, and thus calls for a human-centered systems approach. The human-centered, multimodal, collaborative interface to GIS we are developing is based on the natural input modalities of speech and free-hand gestures, and is complemented by a shared semantic framework through which dialogue can take place. As Potter et al. (2000, p. 323) argue, the process of designing such intelligent systems is 'fundamentally an opportunistic bootstrap process,' meaning that the process builds on itself from step to step and thus is full of rich feedbacks within and among the stages of design and prototype versioning. Our experience shows that setbacks in one stage of development can often be turned into the seeds of new, innovative, and fundamentally different design solutions in the next version.


Our first phase of DAVE_G system implementation focused on the use of computer vision for processing free-hand gestures and on the fusion of speech and gesture input in planning and performing intended GIS actions. We successfully created the DAVE_G1 architecture, which supports two-person interaction with a map display by translating users' gesture-speech input into GIS commands. Our second-phase prototype, DAVE_G2, has demonstrated the feasibility of integrating gesture/speech interface technologies and dialogue management with GIS to create a more natural environment for interacting with geospatial information. It is a first step toward supporting task-oriented collaborative dialogues.

Our implementation of a multimodal, dialogue-oriented interface to GIS, and the approach we took, are fundamentally different from other similar efforts (as reviewed in section 2.1) in a number of respects. First, the core goal for DAVE_G is to support human-human collaborative spatial decision-making mediated by interactive map displays. This is reflected in our decision to employ a large-screen display and multiple gesture-speech reception modules that allow multiple people to interact with maps and coordinate through the same dialogue manager. Second, almost all other experimental systems (CUBRICON, GeoSpace, MIT Voyager) took a technology-centric view that focused attention on the interface technology required to support a multimodal interface to a map. In contrast, DAVE_G development has followed a human-centered (or user-centered) approach that, from the beginning, directed attention to identifying (through systematic knowledge elicitation with domain experts and iterative system refinement) the critical features and functionalities that a multimodal GIS interface needs in order to support real work. DAVE_G development has also focused on ways to take full advantage of the rich information about geographic context inherent in the GIS database. We are focusing on developing richer dialogue capabilities that enhance the competence of the system in communicating about geographical space (such as handling vague spatial concepts and semantic interoperability issues). Third, DAVE_G is the first prototype system to achieve full integration of natural spoken language, free-hand gestures (which have been shown to be fundamentally different from sketch-based gestures in terms of speech-gesture integration), knowledge-based human-system dialogue management, and a commercial GIS (supporting complex GIS operations rather than just simple information queries). We have proposed solutions to many of the special challenges (architectural, semantic, and managerial) encountered in such integration. In particular, the entirely deviceless interface that DAVE_G achieves through natural speech and vision-based gesture processing has many advantages over other systems that rely on pens, gloves, or a mouse to supply gestures.

Free-hand gestures are most naturally synchronized with speech and are least demanding of human attentional resources (Kettebekov & Sharma, 2000). More importantly, users are able to use the same modalities to communicate with both the information system and their human collaborators in front of the same map display. This advantage will become more apparent as we proceed into the next stage of DAVE_G development, in which task-oriented dialogues involving multiple users and the system are to be supported.

Extensions to DAVE_G2 have focused on using dialogue to understand and execute commonly used GIS commands, even when those commands are incompletely specified. Planned future extensions to DAVE_G will address four challenges, each of which is outlined below: handling domain-oriented dialogue, speech understanding through concept spotting, more effective integration of speech and gesture, and supporting multiuser collaboration.

Domain-oriented dialogue. The difficulty of processing domain-oriented dialogues about geographic (and other) problems is that the focus of communication is on the goals, intentions, and beliefs associated with advancing domain tasks. GIS commands will therefore be implicit rather than explicit in these dialogues. The need for, and form of, relevant GIS actions must be inferred by anticipating the need for specific kinds of geographical information as part of the user's problem-solving process.

Speech understanding for spontaneous dialogues. The interpretation of spoken language for the extraction of GIS commands and parameters depends heavily on the reliability of the speech recognition software. At present, it is necessary to provide the system with a specific grammar ahead of time and for users to limit their commands to terms included in that grammar. This approach is not likely to work well with spontaneous dialogues, which may include complex geospatial concepts as well as ungrammatical insertions and slips of the tongue. Designing a complete grammar to capture all the potential language phenomena in spontaneous dialogues is extremely difficult, if not impossible. A solution currently being explored is to use bottom-up parsing techniques that extract concept-bearing phrases (phrases that carry information the system can act on) rather than attempting a full parse of the whole utterance.
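As a rough illustration of that bottom-up idea (not an implemented DAVE_G component; the lexicon and function name are invented for this example), concept spotting might look like the following:

```python
import re

# Hypothetical concept lexicon: surface patterns mapped to concept labels.
CONCEPT_PATTERNS = {
    r"\b(?:show|display|put up)\b":            ("action", "ShowMap"),
    r"\bassisted living facilit(?:y|ies)\b":   ("layer", "assisted living facilities"),
    r"\b(?:this|that) (?:area|region)\b":      ("spatial_reference", "deictic area"),
    r"\bzoom (?:in|out)\b":                    ("action", "Zoom"),
}


def spot_concepts(transcript: str):
    """Extract concept-bearing phrases from a transcript without a full parse."""
    found = []
    for pattern, (kind, label) in CONCEPT_PATTERNS.items():
        for match in re.finditer(pattern, transcript, flags=re.IGNORECASE):
            found.append({"kind": kind, "label": label, "span": match.span()})
    # Order by position so later fusion can align phrases with gestures in time.
    return sorted(found, key=lambda c: c["span"][0])


# spot_concepts("um, could you show the assisted living facilities in this area")
# -> action:ShowMap, layer:assisted living facilities, spatial_reference:deictic area
```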

Fusion of gesture and speech inputs. The process of capturing multimodal input and generating meaningful GIS queries from it can be very different from carrying out similar tasks with a keyboard and mouse. Currently, fusion of speech and gesture is implemented naively: it is based on keyword spotting and simple temporal analysis of the two signals, and it lacks effective error resolution for ambiguous and noisy user input. To overcome weaknesses in gesture recognition, we will perform co-analysis of speech and gesture at the low-level feature level, where speech signals help to separate meaningful gestures from ordinary hand movements. To resolve conflicting user actions and disambiguate the user's intention, we will explore approaches that extract key concepts from both input channels and fuse the extracted concepts and features into the current dialogue context. The co-analysis of these low- and high-level features, supported by user studies of the current DAVE_G system, will provide insight into multimodal interaction generally while informing development of the next-generation DAVE_G.
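The "simple temporal analysis" in the current fusion step can be pictured as binding a deictic phrase to the gesture nearest to it in time. The sketch below is illustrative only; the function name and the window size are assumptions rather than values from the DAVE_G implementation.

```python
from typing import List, Optional


def bind_gesture(phrase_time: float,
                 gestures: List[dict],
                 window_s: float = 1.5) -> Optional[dict]:
    """Bind a deictic phrase (e.g., "this area") to the closest co-occurring gesture.

    Each gesture is a dict with a "time" (seconds) and a "shape"; returns None when
    no gesture falls within the temporal window, leaving the slot to be resolved
    through dialogue instead.
    """
    candidates = [g for g in gestures if abs(g["time"] - phrase_time) <= window_s]
    if not candidates:
        return None
    return min(candidates, key=lambda g: abs(g["time"] - phrase_time))
```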


Integration of inputs from multiple simultaneous users. The current version of DAVE_G handles individual requests only and treats successive requests in a task-oriented dialogue as if they were independent of one another. This will be expanded so that recognition and interpretation of a user's gesture/speech input is processed at the discourse level, where multiple utterances in a dialogue are related through an understanding of their linguistic, intentional, and attentional structures (Grosz & Sidner, 1986). This will allow the system to handle a richer set of dialogues that support users' continuous interaction with the system in advancing their joint task.

The overall goal of the DAVE_G project, for which one component is presented here, is to advance our understanding of human interaction with geospatial information, and thus to go beyond a focus on human interaction with computers. While not our focus here, what is learned about enabling dialogue with a GIS should also be applicable, in part, to other environments that display spatially referenced information (e.g., computer-aided design, medical imaging), as well as to contexts in which map metaphors are used in the design of interfaces to non-spatial information (such as those detailed by Chang, 2000; Fabrikant & Buttenfield, 2001; Goodchild, 1999; Wise et al., 1995). We have developed an initial conceptual approach to human-computer dialogue that supports a natural, multimodal style of interaction. Through an iterative series of user studies and system redesigns, we plan to evolve the system toward one that is robust enough to function effectively in limited, real-world problem situations. This process will, in turn, be used to refine our conceptual approach to human-computer dialogue, an approach that we expect to be extensible beyond geospatial information domains.

Acknowledgements

This work is based upon work supported by the National Science Foundation under Grants No. BCS-0113030, IIS-97-33644, IIS-0081935, and EIA-0306845.

References

Al-Kodmany, K. 2000, Extending Geographic Information Systems to Meet Neighborhood Planning Needs: The Case of Three Chicago Communities. URISA Journal, 12(3), 19-37.
Allen, J., Byron, D., Dzikovska, M., Ferguson, G., Galescu, L., & Stent, A. 2001, Towards conversational human-computer interaction. AI Magazine, 22(4), 27-37.
Armstrong, M. P. & Densham, P. J. 1995, A conceptual framework for improving human-computer interaction in locational decision-making. In M. J. Egenhofer (Ed.), Cognitive Aspects of Human-Computer Interaction for Geographic Information Systems (pp. 343-354): Kluwer Academic Publishers.
Balkanski, C. & Hurault-Plantet, M. 2000, Cooperative requests and replies in a collaborative dialogue model. International Journal of Human-Computer Studies, 53, 915-968.
Blaser, A. & Egenhofer, M. 2000, A Visual Tool for Querying Geographic Databases. AVI 2000--Advanced Visual Databases, Salerno, Italy, May 2000, pp. 211-216.
Burrough, P. A. & Frank, A. U. (Eds.). 1996, Geographic Objects With Indeterminate Boundaries. London: Taylor & Francis.


Cai, G., Wang, H., & MacEachren, A. 2003, Communicating Vague Spatial Concepts in Human-GIS Interactions: A Collaborative Dialogue Approach. COSIT 2003: Conference on Spatial Information Theory, Lecture Notes in Computer Science (LNCS) 2825, Ittingen, Switzerland, 24-28 September, pp. 287-300.
Chang, S.-K. 2000, The Sentient Map. Journal of Visual Languages and Computing, 11, 455-474.
Codella, C. et al. 1992, Interactive simulation in a multi-person virtual world. ACM Conference on Human Factors in Computing Systems - CHI'92, pp. 329-334.
Corradini, A. & Cohen, P. R. 2002, Multimodal Speech-Gesture Interface for Handfree Painting on a Virtual Paper using Partial Recurrent Neural Networks as Gesture Recognizer. Proceedings of the International Joint Conference on Artificial Neural Networks (IJCNN'02), Honolulu (HI, USA), May 12-17, 2002, pp. 2293-2298.
Densham, P. J., Armstrong, M. P., & Kemp, K. K. 1995, NCGIA Initiative 17 on Collaborative Spatial Decision-Making. Santa Barbara, NCGIA.
Egenhofer, M. J. 1996, Multi-modal spatial querying. International Symposium on Spatial Data Handling, Delft, The Netherlands, pp. 785-799.
Egenhofer, M. J. 1997, Query Processing in Spatial-Query-by-Sketch. Journal of Visual Languages and Computing, 8(2), 403-424.
Egenhofer, M. J. & Franzosa, R. D. 1995, On the Equivalence of Topological Relations. International Journal of Geographical Information Systems, 9(2), 133-152.
Egenhofer, M. J. & Golledge, R. G. 1998, Spatial and temporal reasoning in geographic information systems. New York: Oxford University Press.
Fabrikant, S. I. & Buttenfield, B. P. 2001, Formalizing Semantic Spaces for Information Access. Annals of the Association of American Geographers, 91(2), 263-280.
Fisher, P. 2000, Sorites paradox and vague geographies. Fuzzy Sets and Systems, 113, 7-18.
Florence, J., Hornsby, K., & Egenhofer, M. J. 1996, The GIS Wallboard: interactions with spatial information on large-scale display. International Conference on Spatial Data Handling, pp. 8A.1-8A.15.
Frank, A. U. & Mark, D. M. 1991, Language issues for GIS. In D. W. Rhind (Ed.), Geographical Information Systems: Principles and Applications (Vol. 1, pp. 147-163). London: Longmans Publishers.
Fukumoto, M., Suenaga, Y., & Mase, K. 1994, Finger-pointer: Pointing interface by image processing. Computers and Graphics, 18, 633-642.
Glass, J., Flammia, G., Goodine, D., Phillips, M., Polifroni, J., Sakai, S., Seneff, S., & Zue, V. 1995, Multilingual spoken-language understanding in the MIT Voyager system. Speech Communications '95.
Goodchild, M. F. 1999, Future directions in geographic information science. Geographic Information Science, 5(1), 1-8.
Grosz, B. J. & Kraus, S. 1996, Collaborative plans for complex group action. Artificial Intelligence, 86, 269-357.
Grosz, B. J. & Sidner, C. L. 1986, Attention, intentions, and the structure of discourse. Computational Linguistics, 12, 175-204.
Grosz, B. J. & Sidner, C. L. 1990, Plans for discourse. In M. E. Pollack (Ed.), Intentions in Communication (pp. 417-444). Cambridge, MA: MIT Press.


Hollnagel, E. & Woods, D. D. 1983, Cognitive systems engineering: New wine in new bottles. International Journal of Man-Machine Studies, 18, 583-600.
Hopkins, L. D., Ramanathan, R., & George, R. V. 2001, Interface for a Planning Workbench. Department of Urban and Regional Planning, University of Illinois at Urbana-Champaign. Available: http://www.rehearsal.uiuc.edu/DesignWorkSpace/ [accessed Nov. 4, 2001].
Jankowski, P. & Nyerges, T. 2001, Geographic Information Systems for Group Decision Making: Towards a participatory, geographic information science. New York: Taylor & Francis.
Jankowski, P., Nyerges, T. L., Smith, A., Moore, T. J., & Horvath, E. 1997, Spatial group choice: a SDSS tool for collaborative spatial decision-making. International Journal of Geographical Information Science, 11(6), 577-602.
Jokinen, K., Sadek, D., & Traum, D. 2000, Introduction to Special Issue on Collaboration, Cooperation and Conflict in Dialogue Systems. International Journal of Human Computer Studies, 53(6), 867-870.
Kettebekov, S., Krahnstöver, N., Leas, M., Polat, E., Raju, H., Schapira, E., & Sharma, R. 2000, i2Map: Crisis Management using a Multimodal Interface. ARL Federated Laboratory 4th Annual Symposium, College Park, MD, March 2000.
Kettebekov, S. & Sharma, R. 2000, Understanding gestures in a multimodal human computer interaction. International Journal of Artificial Intelligence Tools, 9(2), 205-224.
Koons, D. B. & Sparell, C. J. 1994, Iconic: speech and depictive gestures at the human-machine interface. Proceedings of the CHI '94 conference, pp. 453-454.
Krahnstoever, N., Kettebekov, S., Yeasin, M., & Sharma, R. 2002, A Real-Time Framework for Natural Multimodal Interaction with Large Screen Displays. Fourth International Conference on Multimodal Interfaces (ICMI), Pittsburgh, PA, USA, October 14-16, pp. 349-354.
Larson, R. R. 1996, Geographic Information Retrieval and Spatial Browsing. In M. Gluck (Ed.), GIS and Libraries: Patrons, Maps and Spatial Information (pp. 81-124). Urbana-Champaign: University of Illinois.
Lochbaum, K. E. 1998, A collaborative planning model of intentional structure. Computational Linguistics, 24(4), 525-572.
Lokuge, I. & Ishizaki, S. 1995, GeoSpace: An interactive visualization system for exploring complex information spaces. CHI '95 Conference Proceedings, New York, 1995, pp. 409-414.
MacEachren, A. M. 2000, Cartography and GIS: facilitating collaboration. Progress in Human Geography, 24(3), 445-456.
MacEachren, A. M. 2001, Cartography and GIS: extending collaborative tools to support virtual teams. Progress in Human Geography, 25(3), 431-444.
Mark, D. M., Freksa, C., Hirtle, S. C., Lloyd, R., & Tversky, B. 1999, Cognitive models of geographical space. International Journal of Geographical Information Science, 13(8), 747-774.
Mark, D. M. & Gould, M. D. 1991, Interacting with geographic information: A commentary. Photogrammetric Engineering & Remote Sensing, 57(11), 1427-1430.


Mark, D. M., Svorou, S., & Zubin, D. 1987, Spatial terms and spatial concepts: Geographic, cognitive, and linguistic perspectives. International Geographic Information Systems (IGIS) Symposium: The Research Agenda, Proceedings, Vol. I: Overview of Research Needs and the Research Agenda, Nov. 15-18, 1987, Arlington, VA, pp. II-101-112.
McNeese, M. D., Zaff, B. S., Citera, M., Brown, C. E., & Whitaker, R. 1995, AKADAM: Eliciting user knowledge to support participatory ergonomics. The International Journal of Industrial Ergonomics, 15(5), 345-363.
Muntz, R. R., Barclay, T., Dozier, J., Faloutsos, C., MacEachren, A. M., Martin, J. L., Pancake, C. M., & Satyanarayanan, M. 2003, IT Roadmap to a Geospatial Future, report of the Committee on Intersections Between Geospatial Information and Information Technology. Washington, DC: National Academies Press.
Neal, J. G. & Shapiro, S. C. 1991, Intelligent multi-media interface technology. In A. W. Tyler (Ed.), Architectures for Intelligent Interfaces: Elements and Prototypes (pp. 11-44). Reading, MA: Addison-Wesley.
Neal, J. G., Thielman, C. Y., Dobes, Z., Haller, S. M., Glanowski, S., & Shapiro, S. C. 1989, CUBRICON: a multimodal user interface. GIS/LIS '89, Orlando, FL, pp. ???
Nobre, E. & Câmara, A. 1995, Spatial simulation by sketch. Proceedings, 1st Conference on Spatial Multimedia and Virtual Reality, Lisbon, Portugal, Oct. 18-20.
Nyerges, T. L. & Jankowski, P. 1997, Enhanced adaptive structuration theory: A theory of GIS-supported collaborative decision making. Geographical Systems, 4(3), 225-259.
Nyerges, T. L., Mark, D. M., Laurini, R., & Egenhofer, M. (Eds.). 1995, Cognitive Aspects of Human-Computer Interaction for Geographic Information Systems. Dordrecht: Kluwer.
Oviatt, S. 1996, Multimodal interfaces for dynamic interactive maps. Proceedings of the Conference on Human Factors in Computing Systems (CHI'96), pp. 95-102.
Oviatt, S. L. 1997, Multimodal Interactive Maps: Designing for Human Performance. Human-Computer Interaction, 12(1-2), 93-129.
Pavlovic, V. I., Sharma, R., & Huang, T. S. 1997, Visual interpretation of hand gestures for human-computer interaction: A review. IEEE Transactions on Pattern Analysis and Machine Intelligence, 19(7), 677-695.
Potter, S. S., Roth, E. M., Woods, D. D., & Elm, W. C. 2000, Bootstrapping multiple converging cognitive task analysis techniques for system design. In J. M. Schraagen, S. F. Chipman, & V. L. Shalin (Eds.), Cognitive Task Analysis (pp. 317-340). Mahwah, NJ: Lawrence Erlbaum Associates.
Rasmussen, J., Pejtersen, A. M., & Goodstein, L. P. 1994, Cognitive engineering: Concepts and applications. New York: J. Wiley & Sons.
Rauschert, I., Agrawal, P., Fuhrmann, S., Brewer, I., Sharma, R., Cai, G., & MacEachren, A. 2002, Designing a User-Centered, Multimodal GIS Interface to Support Emergency Management. ACM International Symposium on Advances in Geographical Information Systems, McLean, VA, November 8-9, 2002, pp. (to appear).
Rime, B. & Schiaratura, L. 1991, Gesture and speech. In R. Feldman & B. Rime (Eds.), Fundamentals of Nonverbal Behavior (pp. 239-281). New York: Press Syndicate of the University of Cambridge.


Rutkowski, C. 1982, An introduction to the Human Applications Standard Computer Interface, Part 1: Theory and principles. Byte, 7(11), 291-310.
Shackel, B. 1997, Human-computer interaction -- Whence and whither? Journal of the American Society for Information Science, 48(11), 970-986.
Shackel, B. 2000, People and computers -- some recent highlights. Applied Ergonomics, 31, 595-608.
Shapiro, S. C., Chalupski, C. H., & Chou, H. C. 1991, Linking Arc/INFO with SNACTor. Santa Barbara, CA: Technical paper 91-11, National Center for Geographic Information and Analysis.
Sharma, R., Cai, J., Poddar, I., & Chakravarthy, S. 2000, Exploiting speech/gesture co-occurrence for improving continuous gesture recognition in weather narration. Proc. IEEE Conf. on Face and Gesture Recognition, Grenoble, France, March 2000, pp. 422-427.
Sharma, R., Pavlovic, V. I., & Huang, T. S. 1998, Toward multimodal human-computer interface. Proceedings of the IEEE, 86(5), 853-869.
Sharma, R., Poddar, I., Ozyildiz, E., Kettebekov, S., Kim, H., & Huang, T. S. 1999, Toward Interpretation of Natural Speech/Gesture: Spatial Planning on a Virtual Map. Proceedings of ARL Advanced Displays Annual Symposium, Adelphi, MD, pp. 35-39.
Shiffer, M. J. 1998, The evolution of public participation GIS. Cartography and Geographic Information Systems, 25(2), 89-94.
Shneiderman, B. 1998, Designing the User Interface: Strategies for Effective Human-Computer Interaction (3rd ed.). Reading: Addison-Wesley Longman, Inc.
Slocum, T. A., Blok, C., Jiang, B., Koussoulakou, A., Montello, D. R., Fuhrmann, S., & Hedley, N. R. 2001, Cognitive and Usability Issues in Geovisualization. Cartography and Geographic Information Science, 28(1), 61-75.
Smith, T. R., Ramakrishnan, R., & Voisard, A. 1992, Object-based data model and deductive language for spatio-temporal database applications. In H.-W. Six (Ed.), Geographic Database Management Systems (pp. 79-102). Berlin: Springer-Verlag.
Starner, T., Weaver, J., & Pentland, A. 1998, Real-time American Sign Language recognition using desk and wearable computer based video. IEEE Transactions on Pattern Analysis and Machine Intelligence, 20(12), 1371-1375.
Streitz, N. A., Hartkopf, V., Ishii, H., Kaplan, S., & Moran, T. P. 1998, Cooperative Buildings: Integrating Information, Organization and Architecture. CSCW '98, Seattle, WA, 1998, pp. 411-413.
Terveen, L. G. 1995, Overview of human-computer collaboration. Knowledge-Based Systems, 8(2-3), 67-81.
Vicente, K. J. 1999, Cognitive Work Analysis: Toward Safe, Productive, and Healthy Computer-Based Work. Mahwah, NJ: Lawrence Erlbaum Associates.
Vo, M. T. & Waibel, A. 1994, A multimodal human computer interface: Combination of gesture and speech recognition. Proc. of the CHI'94 summary conference on Human Factors in Computing Systems, pp. 69-70.
Wahlster, W. 2002, SmartKom: Fusion and Fission of Speech, Gestures, and Facial Expressions. Proc. of the 1st International Workshop on Man-Machine Symbiotic Systems, Kyoto, Japan, pp. 213-225.


Wang, F. 1994, Towards a natural language user interface: An approach of fuzzy query. International Journal of Geographical Information Systems, 8(2), 143-162.
Wise, J., Thomas, J., Pennock, K., Lantrip, D., Pottier, M., Schur, A., & Crow, V. 1995, Visualizing the non-visual: spatial analysis and interaction with information from text documents. Proceedings, IEEE 1995 Symposium on Information Visualization, Atlanta, pp. 51-58.
Woodruff, A. G. & Plaunt, C. 1994, GIPSY: Geo-referenced Information Processing System. Journal of the American Society for Information Science, 45, 645-655.
Zue, V., Glass, J., Goddeau, D., Goodine, D., Leung, H., McCandless, M., Phillips, M., Polifroni, J., Seneff, S., & Whitney, D. 1990, Recent progress on the MIT Voyager spoken language system. Proceedings of the ICSLP, pp. 1317-1320.
Zue, V. W. & Glass, J. R. 2000, Conversational interfaces: Advances and challenges. Proceedings of the IEEE, 88(8), 1166-1180.


Figure Captions

Figure 1. A scenario of multimodal collaborative dialogues in an emergency response center


Figure 2. Two users interacting with DAVE_G1

Figure 3. User interacting with DAVE_G2 (left) performing a circular gesture to select features on the map (right)


Figure 4. An overview of the current multiuser HCI-interface for GIS applications.

Figure 5. Flow of execution in response to a multimodal GIS Command


Figure 6. A simplified sample of a grammar definition for recognizing meaningful utterances (word annotations are not shown)

Figure 7. Semantic-frame approach to command interpretation
