Interruption, resumption and domain switching in in-vehicle dialogue

Jessica Villing*1, Cecilia Holtelius2, Staffan Larsson1, Anders Lindström3, Alexander Seward4, and Nina Åberg5

1 Department of Linguistics, University of Gothenburg, Sweden
2 Volvo Car Corporation, Sweden
3 Mobility Services R&D, TeliaSonera, Sweden
4 Veridict AB, Sweden
5 Volvo Technology AB, Sweden
http://www.dicoproject.org
Abstract. The use of dialogue systems in vehicles raises the problem of making sure that the dialogue does not distract the driver from the primary task of driving. Earlier studies have indicated that humans are very adept at adapting the dialogue to the traffic situation and the cognitive load of the driver. The goal of this paper is to investigate strategies for interrupting, resuming and changing the topic domain of spoken human-human in-vehicle dialogue. The results show a large variety of strategies being used, and indicate that the choice of resumption and domain-switching strategy depends partly on the topic domain being resumed, and partly on the role of the speaker (driver or passenger). These results will be used as a basis for the development of dialogue strategies for interruption, resumption and domain-switching in the DICO in-vehicle dialogue system.
1 Introduction
The study reported on in this paper is part of the DICO project, the overall purpose of which is to demonstrate how state-of-the-art spoken language technology can enable access to communication, entertainment and information services as well as to environment control in vehicles1. The project group intends to demonstrate this primarily by means of working prototypes which promote safety in driving while at the same time delivering ease-of-use in access to commercially viable sets of on-line as well as in-vehicle services. To this end, the project has developed a working prototype of a speech-based and multimodal dialogue system, which has previously been tested on real users both in simulator tests and while driving in real traffic. One specific question, which has arisen during these trials with the system prototype, concerns how to deal with and even generate interruptions and topic shifts in the spoken dialogue between man and machine, e.g. in order to adapt to the current traffic situation in a timely fashion.

* The authors wish to thank Johan Jarlengrip, Volvo Technology AB.
1 DICO is funded by Vinnova, project 2006-00844.
As can be expected, in human-human communication, this type of regulation is common and constitutes an integrated part of spoken communication. There are studies indicating that vehicle drivers are in fact very good at adapting their interaction to accommodate the cognitive demands of the combined tasks of driving and interacting through spoken language [1]. Researchers within the fields of vehicle safety and ergonomics have also proposed that in-vehicle spoken dialogue systems should adapt to the workload of the driver and suspend and resume dialogue accordingly [2] or even that the dialogue behaviour should be designed in such a way that a “neutral, small talk-like interaction results” [3].

The CHAT project [4] focused on a robust, wide-coverage and cognitive-load-sensitive spoken dialogue interface, addressing issues related to dynamic and attention-demanding environments such as driving. Even though several of the dialogue and presentation strategies of CHAT are based on corpus data, it does not seem that topic shifting, or strategies for suspending and resuming topic threads, have been studied in any detail. A limited set of implicit strategies for topic switching were investigated but not included in the final system. The CHAT system was not designed to monitor the driver’s cognitive load; rather, general methods such as robust interpretation were designed to decrease cognitive load more generally.

It is the goal of this paper to investigate the strategies employed in human-human in-vehicle interaction for interrupting and resuming spoken dialogue, as well as strategies for changing the topic domain of the conversation. For this purpose, dialogues between driver and passenger in real traffic were recorded and videotaped under controlled conditions, where the driver’s cognitive load was simultaneously measured by use of an indirect method.
We will first briefly describe the dialogue system which is being further developed in the project, and describe some shortcomings which motivate the research presented here. We will then describe the experimental setup, as well as the transcription and annotation methods used. Finally, we will point to some future research directions motivated by our results.
2 The DICO dialogue system
The dialogue manager in the DICO system is based on [5]. It enables flexible spoken human-machine dialogue by providing general solutions to several general dialogue management problems:
– Grounding: making sure that the system and the user are able to hear and understand each other
– Accommodation: enabling the user to
  • give information in any order
  • provide information without explicitly stating the task
  • clarify by responding to system clarification questions if there is some problem
– Mixed initiative: the user can take the initiative at any time
– Multitasking: switching between multiple simultaneous tasks
– Multimodality: using speech and/or the GUI to interact

In addition, since DICO uses a domain-independent dialogue manager, knowledge of dialogue is kept separate from domain-specific knowledge, which enables rapid prototyping of new applications.

The current version of the dialogue manager will prompt the user for answers to system questions until the user answers. This is clearly not a good strategy in the in-vehicle environment, since it risks increasing the cognitive load on the user by endlessly repeating e.g. a question when the driver is devoting her attention to the traffic:

USR> Call Lisa please
SYS> OK, Lisa. Do you want to use the home number or the mobile phone number?
User enters roundabout and focuses all attention on the traffic
SYS> Do you want to use the home number or the mobile phone number?
SYS> Do you want to use the home number or the mobile phone number?
SYS> Do you want to use the home number or the mobile phone number?
...

One common way of dealing with this problem in in-vehicle speech systems is to repeat a message once, then wait for a fixed amount of time, and then give up. That this is not ideal either can be seen from the following (made-up) example:

USR> Call Lisa please
SYS> OK, Lisa. Do you want to use the home number or the mobile phone number?
User enters roundabout and focuses all attention on the traffic
SYS> Do you want to use the home number or the mobile phone number?
Driver exits roundabout, and after a while the driver is ready to talk again
USR> Um, the mobile number please.
SYS> Sorry, I don’t understand. What do you want to do?

In addition to lacking strategies for dealing with interruptions and resumptions, the dialogue manager offers rather restricted methods for switching between different topic domains.
For example, to switch from the “telephone” application to the “audio system” application, the user has to provide explicit requests such as “go to the audio system”. In cases where the system initiates a topic domain switch, this is also done in a rather stereotypical way (“returning to the telephone”).
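The two prompting policies contrasted above can be made concrete with a small simulation. This is a minimal sketch and not the DICO implementation: the function names, the 5-second reprompt interval, the 5-second timeout and the placeholder accept message are all illustrative assumptions. It shows that a driver who answers late is either bombarded with repetitions or rejected outright.

```python
from typing import List, Optional, Tuple

QUESTION = "Do you want to use the home number or the mobile phone number?"
ACCEPT = "<answer accepted>"  # placeholder, not an actual system utterance
REJECT = "Sorry, I don't understand. What do you want to do?"

Log = List[Tuple[int, str]]  # (time in seconds, system output)


def repeat_forever(answer_at: Optional[int], interval: int = 5,
                   horizon: int = 30) -> Log:
    """Re-ask every `interval` seconds until the user answers (or the
    simulation horizon is reached)."""
    end = horizon if answer_at is None else min(answer_at, horizon)
    log = [(t, QUESTION) for t in range(0, end, interval)]
    if answer_at is not None and answer_at < horizon:
        log.append((answer_at, ACCEPT))
    return log


def repeat_once_then_give_up(answer_at: Optional[int], interval: int = 5,
                             timeout: int = 5) -> Log:
    """Ask, repeat once after `interval` seconds, then drop the question
    `timeout` seconds later. A later answer no longer has a context."""
    log = [(0, QUESTION), (interval, QUESTION)]
    deadline = interval + timeout
    if answer_at is not None and answer_at <= deadline:
        log = [(t, u) for t, u in log if t < answer_at]
        log.append((answer_at, ACCEPT))
    elif answer_at is not None:
        log.append((answer_at, REJECT))
    return log


if __name__ == "__main__":
    # The driver needs 22 seconds to get through the roundabout.
    for policy in (repeat_forever, repeat_once_then_give_up):
        print(policy.__name__)
        for t, utterance in policy(22):
            print(f"  t={t:>2}s  SYS> {utterance}")
```

Neither policy takes the driver's state into account, which is the gap the strategies studied in this paper are meant to fill.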
It would clearly be useful to (1) add strategies for dialogue interruption and resumption, and (2) provide more convenient and natural means for switching between domains. In the context of the DICO project, the main point of smoothly managing dialogue interruptions, resumptions and domain switches is to minimize the cognitive load of the driver.
3 Method
The goal of the test setup was to elicit driver–passenger dialogue which would feature a substantial and measurable number of instances of the different types of human speech-communicative strategies and linguistic devices known to be employed under cognitive load and other forms of driving-induced stress. One specific challenge was therefore how to make driver and passenger engage in natural dialogue and conversation of sufficient intensity that any additional distractions or increase in the cognitive load, due to driving or the surrounding traffic situation, would immediately compel the subjects to adapt their spoken language in ways which would be detectable from subsequent transcription of the conversation.

3.1 Subjects and tasks
Eight subjects (two female and six male) between the ages of 25 and 36 were recruited internally at one of the project partners, and were divided into driver-passenger pairs. The subjects had no previous experience of using speech technology or dialogue systems. To meet the requirements mentioned above, the subjects were given two separate tasks, one navigation task and one memory task. In the navigation task the passenger simply had to instruct the driver on where to drive. The memory task was constructed so that the driver and passenger were to interview each other regarding personal background and interests during the drive, after which their individual ability to recall this information was scored using a fill-out form. Subjects were informed that their joint score would be the basis for a competition, to further encourage interaction, collaboration and thereby conversation.

All tests were performed under real and challenging conditions, in relatively dense city traffic in central Gothenburg. A previously unknown driving route was given to the passenger at the start, together with the interview sheet. The passenger was told only to give verbal driving instructions, spanning no more than one intersection ahead. Should the team lose track while navigating, they were instructed to find their way back to the pre-determined route and continue. The driver was told to focus on the main tasks and on driving for safety reasons, but was told also to perform the best he or she could in a so-called Tactile Detection Task (TDT), requiring the driver to press a button at irregular intervals. Each team was free to manage and solve the interview task in any way they saw fit. However, they were not allowed to take notes or use any other memory aids. Within the teams, each subject acted as both driver and passenger, since the subjects were instructed to switch roles halfway into the test, which lasted for 60 minutes in total.

3.2 Test environment and data
The test car, a Volvo XC 90 (model year 2004), was equipped with a dual headset microphone setup, enabling recording of driver and passenger on separate channels. Two digital video cameras were mounted inside the vehicle, one capturing a close-up of the driver’s face, and the other capturing a wide-screen view of the road ahead. To measure driver workload, a system for performing a Tactile Detection Task was used in the test. The system consists of a buzzer attached to the driver’s forearm and a response button attached to the index finger. At random intervals, the TDT issues a tactile stimulus to the driver, and the driver is supposed to react as quickly as possible to each stimulus by pressing the response button. Driver distraction can then be measured dynamically in terms of hit rate and reaction latency, according to the method described in e.g. [6]. The TDT furthermore makes it possible to capture driving-unrelated cognitive load, caused by other cognitive processes generated by the dialogue itself or by memory processing, even when the car was not moving, e.g. at stoplights.
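The two TDT measures mentioned above, hit rate and reaction latency, can be computed from time-stamped stimuli and button presses. The following is a minimal sketch rather than the scoring procedure actually used in the study: the function name, the 2-second response window and the rule that each press answers at most one stimulus are assumptions made for illustration.

```python
from typing import List, Optional, Tuple


def tdt_metrics(stimuli: List[float], presses: List[float],
                window: float = 2.0) -> Tuple[float, Optional[float]]:
    """Return (hit_rate, mean_latency) for one recording.

    A stimulus counts as a hit if a button press follows it within
    `window` seconds (assumed value). Presses that precede a stimulus
    are treated as spurious and skipped.
    """
    presses = sorted(presses)
    latencies = []
    i = 0
    for s in sorted(stimuli):
        # Skip presses that occurred before this stimulus.
        while i < len(presses) and presses[i] < s:
            i += 1
        if i < len(presses) and presses[i] - s <= window:
            latencies.append(presses[i] - s)
            i += 1  # a press can answer only one stimulus
    hit_rate = len(latencies) / len(stimuli) if stimuli else 0.0
    mean_latency = sum(latencies) / len(latencies) if latencies else None
    return hit_rate, mean_latency


if __name__ == "__main__":
    # Toy timestamps in seconds, invented for this example.
    stimuli = [1.0, 6.0, 11.0, 16.0]
    presses = [1.5, 6.5, 19.0]
    print(tdt_metrics(stimuli, presses))
```

Computing the same measures over a sliding time window would give the dynamic workload estimate that can be aligned with the transcribed dialogue.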
3.3 Transcription and coding
For the transcriptions, the transcription tool ELAN2 was used. ELAN is able to handle both audio and video resources, and it allows annotation along multiple tiers (i.e. an utterance can be annotated with several independent annotation schemas), both important features for this study. The annotation schema was designed to enable analysis of utterances related to interruption and resumption. The schema uses some notions from the MUMIN schema [7]. The notion of “utterance” we are using here is approximately “maximal syntactic phrase not interrupted by a long silence”; what counts as a “long silence” varies with context and has not been further operationalized.

The domain-switch tier is used for annotating utterances where the domain of the conversation changes. We distinguish three main domains of conversation in this task: navigation, traffic and interview. The following labels are used in the domain-switch tier:
– navi: a phrase which introduces or resumes talk about the navigation domain
– traffic: traffic (other than navigation)
– interview: interview
– other

2 http://www.lat-mpi.eu/tools/elan/
Also, rather than marking whole segments with the domain tier, we only mark the first phrase in each domain segment.

The sequencing schema marks formal aspects of domain-switching utterances. The term “sequencing” refers to the mechanisms whereby a dialogue is structured into sequences corresponding to different domains of conversation, and topics within these domains [7]. (Note that we do not currently annotate for topic switches within domains, as these are less well-defined than the domains.)
– std-phrase (sequencing function, standard phrase): a standardized, domain-independent domain-switching phrase, e.g. “Let’s see”, “Where were we”
– dom-spec (sequencing function, domain-specific phrase): a domain-specific domain-switching utterance, e.g. “Turn right”, “Wolfmother”, “How was I supposed to drive, again?”
– unsure (sequencing function, phrase type not clear): a domain-switching phrase where it is unclear whether the phrase is a standard, domain-independent phrase or a domain-specific phrase

In addition to domain-switching and sequencing, utterances with feedback function were annotated with respect to form. Feedback utterances provide information regarding the perception, understanding and acceptance of an utterance. Three labels were used to distinguish forms of feedback utterances:
– std-phrase (feedback function, standard phrase): a standardized, domain-independent phrase with feedback function, e.g. “Let’s see”, “mhm”, “Okay”, “Huh?”, “What do you mean?”, “Got it”
– dom-spec (feedback function, domain-specific phrase): a domain-specific utterance with feedback function, e.g. “To the left” (in response to “Turn to the left”). Typically contains a repetition or reformulation of the latest preceding utterance.
– unsure (feedback function, phrase type not clear): a phrase with feedback function where it is unclear whether the phrase is a standard, domain-independent phrase or a domain-specific phrase
Note that “sequencing” and “feedback” are independent tiers; an utterance can thus be coded for both functions. For example, “Okay” can have both a feedback and a sequencing function. The annotation schema has not been tested for inter-coder reliability, due to limited resources. Instead, annotators have discussed problematic examples and agreed on consensus decisions, sometimes altering the definitions in the annotation schema and altering previous annotations correspondingly. While full reliability testing would have further strengthened the results presented here, we believe that our results are still useful as a basis for future implementation and experimental work.
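One way to picture the annotation schema is as a record with one optional field per tier. The sketch below is an assumption about how such annotations could be represented in code; it is not ELAN's data model or file format, and the class name, field names and the example utterance are hypothetical. Only the label sets are taken from the schema described above.

```python
from dataclasses import dataclass
from typing import Optional

# Label sets as defined in the annotation schema.
DOMAIN_LABELS = {"navi", "traffic", "interview", "other"}
FORM_LABELS = {"std-phrase", "dom-spec", "unsure"}


@dataclass
class Utterance:
    speaker: str                         # "driver" or "passenger"
    text: str
    domain_switch: Optional[str] = None  # set only on the first phrase of a domain segment
    sequencing: Optional[str] = None     # form of the sequencing move, if any
    feedback: Optional[str] = None       # form of the feedback move, if any

    def __post_init__(self) -> None:
        if self.domain_switch is not None and self.domain_switch not in DOMAIN_LABELS:
            raise ValueError(f"unknown domain label: {self.domain_switch}")
        for label in (self.sequencing, self.feedback):
            if label is not None and label not in FORM_LABELS:
                raise ValueError(f"unknown form label: {label}")


if __name__ == "__main__":
    # Hypothetical example: "Okay" coded on both independent tiers.
    u = Utterance("passenger", "Okay", domain_switch="interview",
                  sequencing="std-phrase", feedback="std-phrase")
    print(u)
```

Keeping the tiers as separate optional fields mirrors the point that sequencing and feedback are coded independently of each other.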
4 Results
As far as the authors are aware, this is the first investigation into the form of sequencing moves in in-vehicle dialogue. Although this was a fairly small-scale experiment, we believe that some tentative conclusions may be drawn from the transcribed data. All in all 3590 driver utterances and 4382 passenger utterances were transcribed and coded. The drivers made 171 sequencing utterances, the passengers made 246.
                         Interview  Navigation  Traffic  Other
oops (oj)                    0          1          8       3
alright (jaha)               6          4          0       0
let’s see (ska vi se)        2          5          0       0

Table 1. Standard phrases for driver sequencing utterances
                         Interview  Navigation  Traffic  Other
let’s see (ska vi se)        7          9          0       0
alright (jaha)               6          1          0       2
okay                         4          1          0       0

Table 2. Standard phrases for passenger sequencing utterances
Tables 1 and 2 show the most common standard (i.e. domain-independent) phrases which were used utterance-initially when switching to a new domain. The data has been normalized for variations in pronunciation and in some cases for variations in exact wording (the phrase “let’s see” (Sw. “då ska vi se”) has a number of variants, roughly paraphraseable as “now let’s see”, “let’s see now” etc.). Table 1 shows that “oops” (Sw. “oj”) is the most common sequencing phrase for the driver, and it is used as a single utterance to comment on something in the traffic domain. It is however never used for switching to interview or navigation issues. “Let’s see” is the most common phrase used by passengers. It is used for switching to the interview and navi domains (e.g. “Now let’s see, sailing...” (Sw. “Nu ska vi se, segling...”) or “Let’s see here, keep right at the bridge” (Sw. “Ska vi se här, håll till höger vid bron”)), but never for traffic or other domains.
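Frequency counts of this kind lend themselves to a simple lookup when a system has to choose a sequencing phrase for a given target domain. The sketch below uses the passenger counts from Table 2; the dictionary layout, the function name and the rule of returning nothing when no phrase was observed for a domain are illustrative assumptions, not a description of how DICO selects its output.

```python
from typing import Dict, Optional

# Passenger counts from Table 2 (column "Navigation" keyed as "navi").
PASSENGER_COUNTS: Dict[str, Dict[str, int]] = {
    "let's see": {"interview": 7, "navi": 9, "traffic": 0, "other": 0},
    "alright":   {"interview": 6, "navi": 1, "traffic": 0, "other": 2},
    "okay":      {"interview": 4, "navi": 1, "traffic": 0, "other": 0},
}


def most_frequent_phrase(counts: Dict[str, Dict[str, int]],
                         domain: str) -> Optional[str]:
    """Return the standard phrase observed most often when switching to
    `domain`, or None if no standard phrase was observed for it."""
    best = max(counts, key=lambda phrase: counts[phrase][domain])
    return best if counts[best][domain] > 0 else None


if __name__ == "__main__":
    for domain in ("interview", "navi", "traffic", "other"):
        print(domain, "->", most_frequent_phrase(PASSENGER_COUNTS, domain))
```

For the traffic domain the lookup returns nothing, which matches the observation that passengers did not use these standard phrases when switching to traffic talk.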
Sequencing phrases that are domain specific, i.e. that can only be understood within a certain domain, are classified by grammatical category according to the following schema4:
– DEC: declarative sentence
– INT: interrogative sentence
– IMP: imperative sentence
– ANS: “yes” or “no” answer
– NP: bare noun phrase
– ADVP: bare adverbial phrase
– INC: incomplete phrase
Fig. 1. Domain specific phrases for domain Interview.
Figure 1 shows the frequencies of different kinds of domain-specific sequencing moves within the interview domain. Most common for both driver and passenger are declarative utterances, e.g. “Enemy of the enemy was the last I read” (re-raising an earlier discussion about books). Second most common for drivers are incomplete phrases, e.g. “That was also favorite”. For passengers, noun phrases are second most common. For example, one passenger re-raises an earlier discussion about favorite music by simply saying “Wolfmother”, which is the name of a previously discussed favorite band of the driver’s.

4 This schema was put together ad hoc based on corpus observations and standard taxonomies of sentence types and grammatical categories.

Fig. 2. Domain specific phrases for domain Navi.

Figure 2 shows the kinds of domain-specific phrases that are used within the navigation domain. Interrogative phrases are most common for drivers, e.g. “Should I go straight ahead here”, while declarative phrases are most common for passengers, e.g. “Now you should turn left in the next crossing”5.
Fig. 3. Domain specific phrases for domain Traffic.
Figure 3 shows categories for domain-specific phrases in the traffic domain. As can be seen, the distribution is the same for both drivers and passengers. Declarative phrases are by far the most common, e.g. “And there you come and I don’t know who is driving” (Sw. “Och där kommer du och jag vet inte vem som kör”) (driver talking to a fellow driver).

5 Note that this sentence has declarative form even though it is pragmatically a request.

Fig. 4. Domain specific phrases for domain Other.

Figure 4 shows categories for all other domains. The distribution is similar to the traffic domain, and is also the same for both driver and passenger. Declarative phrases are most common here too, e.g. “It feels like I’m forgetting to press the button” (driver commenting on the TDT button).

In addition to the phrases and words explicitly tagged as having a sequencing function, as shown above, it was also noted that in many cases topic and domain shifts were audibly distinguishable by virtue of prosodic cues and/or extra-linguistic sounds, such as lip smacks, inhalation noise etc. Two authentic examples from the corpus are shown below.

[inhales] så nu är vi som tillbaks här igen
so now we are sort of back here again

[lipsmack] jaa det var fyra stycken där
yes you had four there

A search was performed across all driver and passenger transcriptions for the extra-linguistic sound categories lip smack, breathing, sighing, coughing and throat clearing immediately preceding a domain shift. Matches were found in 9% (10 instances out of a total of 166 domain shifts) of the driver transcriptions and 16% (18 instances out of a total of 244 domain shifts) of the passenger transcriptions. (It should perhaps be noted that sub-domain or topic shifts are not as yet explicitly coded, and consequently could not be included as contexts of the search.) Regardless of whether these sounds are produced at will or subconsciously, a system able to detect them could use them as potential cues of an upcoming topic or domain shift.
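A search of the kind described above can be sketched as a scan over annotated utterances, counting the domain shifts whose transcription begins with a bracketed sound tag. This is an illustration under stated assumptions and not the procedure used on the corpus: only the tags [inhales] and [lipsmack] appear in the examples above, so the remaining tag names, the tuple representation and the toy data (apart from the two corpus examples) are hypothetical.

```python
import re
from typing import Iterable, Set, Tuple

# Only "inhales" and "lipsmack" are attested above; the rest are assumed names.
SOUND_TAGS: Set[str] = {"lipsmack", "inhales", "sighs", "coughs", "clears throat"}

# A bracketed tag at the very start of the transcribed utterance.
LEADING_TAG = re.compile(r"^\s*\[([^\]]+)\]")


def count_cued_shifts(utterances: Iterable[Tuple[str, bool]],
                      sound_tags: Set[str] = SOUND_TAGS) -> Tuple[int, int]:
    """Return (cued, total): the number of domain-shift utterances that are
    immediately preceded by an extra-linguistic sound, and the total number
    of domain-shift utterances. Each item is (transcription, is_domain_shift)."""
    cued = total = 0
    for text, is_shift in utterances:
        if not is_shift:
            continue
        total += 1
        match = LEADING_TAG.match(text)
        if match and match.group(1) in sound_tags:
            cued += 1
    return cued, total


if __name__ == "__main__":
    toy = [
        ("[inhales] så nu är vi som tillbaks här igen", True),  # corpus example
        ("[lipsmack] jaa det var fyra stycken där", True),      # corpus example
        ("turn right", True),                                   # invented
        ("[coughs] mhm", False),                                # invented, not a shift
    ]
    print(count_cued_shifts(toy))
```

The same scan, run in real time on recognizer output that includes sound events, is one way a system could treat such sounds as cues of an upcoming shift.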
5 Discussion
There are some differences between the tables for driver and passenger standard phrases; for example, “oops” (Sw. “oj”) is the most common standard phrase used for domain switching and dialogue resumption by the driver. It seems clear, since the phrase is mostly used when switching to the traffic domain, that this signal is motivated by real-time events in the environment, rather than planned ahead. We can perhaps make a conceptual distinction between “improvised” and “planned” sequencing moves. In addition, we can see that the improvised sequencing moves are motivated by the navigation task (since this is the domain that the dialogue switches to).

“Let’s see”, on the other hand, is a good example of a “planned” sequencing move. It is frequently used by the passenger in both the interview and the navi domain, as well as in the navi domain by the driver. This phrase seems to be used when the speaker a) believes that it is necessary to change domain (the driver does not know where to go, or the passenger realizes that the driver has not got enough instructions) or b) believes that it is suitable to change domain (the driver knows where to go and the traffic situation is not too heavy, or the passenger believes that the driver should be capable of concentrating on something other than the driving task). “Alright” (Sw. “jaha”) seems to have more of an eliciting function, declaring that the speaker is ready to change domain and encouraging the hearer to make the first move.

As noted, passengers frequently used bare noun phrases when resuming a previous domain topic. Our hypothesis is that these NP re-raisings allude to a previously discussed topic, e.g. a question from the interviewer’s questionnaire which was interrupted by navigation- or traffic-domain dialogue. This is similar to the account of reduced “second-mention” forms for re-raising questions in dialogue put forward in [8].

Passengers usually use declarative phrases in all domains, which can be explained by the fact that it is the passenger who has access to information. In the interview and the navi domains the passenger has all the information about what questions to ask and which way to go. The driver also usually uses declarative phrases, in all domains but the navi domain, where interrogative phrases are more common, since the driver frequently has to ask for information about where to go.
6 Future work
We plan to add dialogue management strategies to the DICO dialogue manager to enable it to deal with phenomena like the ones described in this paper, and to evaluate the effect of these strategies on driver cognitive load in in-vehicle dialogue. The frequency lists are expected to be useful when deciding what to listen for from the user, how to react to sequencing signals from the user, and for generating natural-sounding sequencing moves from the system6.

6 We are not claiming that system utterances should mimic human speakers in every way, only that knowledge of how humans express sequencing moves will be useful when designing the system output.

To fully adapt the dialogue to the driver’s cognitive load, it would be very useful to get an estimate of this load based on available information sources in the in-vehicle environment. We are working on using existing technologies for this, with the aim of connecting these technologies to the dialogue system and using them for optimizing system behavior. A very interesting future research topic would be the detection of cognitive load from the speech signal, and the weighing together of evidence from multiple sources. We envision the following kind of behavior:

USR> Call Lisa please
SYS> OK, Lisa. Do you want to use the home number or the mobile phone number?
User enters roundabout and focuses all attention on the traffic
USR> um... uh...
Driver exits roundabout, and after a while the cognitive load is sufficiently low to allow resuming the dialogue
SYS> Let’s see. Lisa. Do you want to use the home number or the mobile phone number?

A relevant question in this context is whether user initiative should always override the system’s estimation of the user’s cognitive load. That is, if the speaker resumes the dialogue, should the system respond regardless of cognitive load? If so, how should it respond? Should it also take its own initiatives, or only do what is needed to complete the user’s requests? These are questions which we hope to answer in future experiments.
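The envisioned behavior can be sketched as a small state machine that stays silent while the estimated load is high and re-raises the pending question with a sequencing phrase once the load has dropped. This is a minimal sketch under stated assumptions, not the planned DICO design: the class and method names, the load scale from 0 to 1 and the threshold of 0.5 are hypothetical, and the open question of whether user initiative should override the load estimate is deliberately left out.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ResumingPrompter:
    load_threshold: float = 0.5     # assumed cut-off on a 0..1 load estimate
    topic: Optional[str] = None     # e.g. the contact being called
    pending: Optional[str] = None   # question still awaiting an answer
    suspended: bool = False         # True while silenced by high load

    def ask(self, topic: str, question: str) -> str:
        """Raise a new question and remember it until it is answered."""
        self.topic, self.pending, self.suspended = topic, question, False
        return f"OK, {topic}. {question}"

    def tick(self, load: float) -> Optional[str]:
        """Called periodically with the current load estimate. Returns a
        system utterance, or None if the system should stay silent."""
        if self.pending is None:
            return None
        if load >= self.load_threshold:
            self.suspended = True   # do not reprompt under high load
            return None
        if self.suspended:
            self.suspended = False  # load has dropped: resume once
            return f"Let's see. {self.topic}. {self.pending}"
        return None

    def answer(self) -> None:
        """The user has answered; nothing remains to be resumed."""
        self.pending, self.suspended = None, False


if __name__ == "__main__":
    question = "Do you want to use the home number or the mobile phone number?"
    dm = ResumingPrompter()
    print("SYS>", dm.ask("Lisa", question))
    for load in (0.9, 0.8, 0.2):    # roundabout, still busy, calm again
        out = dm.tick(load)
        if out:
            print("SYS>", out)
```

The resumption phrase is hard-coded here; in practice it could be chosen from frequency data such as that in the Results section.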
References

1. Esbjörnsson, M., Juhlin, O., Weilenmann, A.: Drivers Using Mobile Phones in Traffic: An Ethnographic Study of Interactional Adaptation. International Journal of Human Computer Interaction, Special issue on: In-Use, In-Situ: Extending Field Research Methods (2007)
2. Nishimoto, T., Shioya, M., Takahashi, J., Daigo, H.: A study of dialogue management principles corresponding to the driver’s workload. Biennial on Digital Signal Processing for In-Vehicle and Mobile Systems (2005)
3. Vollrath, M.: Speech and driving-solution or problem? Intelligent Transport Systems, IET 1 (2007) 89–94
4. Weng, F., Varges, S., Raghunathan, B., Ratiu, F., Pon-Barry, H., Lathrop, B., Zhang, Q., Bratt, H., Scheideck, T., Xu, K., et al.: CHAT: A Conversational Helper for Automotive Tasks. Ninth International Conference on Spoken Language Processing (2006)
5. Larsson, S.: Issue-based Dialogue Management. PhD thesis, Göteborg University (2002)
6. van Winsum, W., Martens, M., Herland, L.: The effect of speech versus tactile driver support messages on workload, driver behaviour and user acceptance. TNO report TM-99-C043. Technical report, Soesterberg, Netherlands (1999)
7. Allwood, J., Cerrato, L., Dybkjaer, L., Jokinen, K., Navarretta, C., Paggio, P.: The MUMIN multimodal coding scheme. Technical report, Center for Sprogteknologi, Copenhagen University (2004)
8. Cooper, R., Larsson, S.: Accommodation and reaccommodation in dialogue. In Bäuerle, R., Reyle, U., Zimmermann, T.E., eds.: Presuppositions and Discourse. Current Research in the Semantics/Pragmatics Interface. Amsterdam (Elsevier) (2002)