CHI 2008 Proceedings · Mixed-Initiative Interaction
April 5-10, 2008 · Florence, Italy
Mixed-Initiative Dialog Management for Speech-based Interaction with Graphical User Interfaces Andreas Löhr Inmedius Europa GmbH Lindwurmstr. 5a, 81667 München, Germany
[email protected]
Bernd Brügge Institut für Informatik, Technische Universität München Boltzmannstr. 4, 85748 Garching b. München, Germany
[email protected]
ABSTRACT

Controlling graphical user interfaces (GUI) by speech is slow, but proves useful for disabled persons with limitations in operating mouse and keyboard. We present conversation-and-control, a new approach for using speech as an input modality for GUIs, which facilitates direct manipulation of widget functions by spoken commands. Our approach is based on a command language that provides a unique command for each specific widget function. For managing the interaction we propose a mixed-initiative dialog model, which can be generated from widget properties. Using heuristics for inferring the meaning of a recognition result and having the ability to ask clarification questions, our approach avoids the rejection of recognition errors. We hypothesized that conversation-and-control allows for shorter task completion times than conventional command-and-control approaches, due to a reduction of the average number of required commands. The results of a user experiment, which we present and discuss, indicate a 16.8% reduction of task completion time achieved by our approach.

Author Keywords

Command-and-control, speech-controlled GUI

ACM Classification Keywords

H.5.2 User Interfaces: Graphical user interfaces (GUI), Voice I/O. K.4.2 Social Issues: Assistive technologies for persons with disabilities.

INTRODUCTION

Mouse and keyboard are the standard input devices for graphical user interfaces (GUI) today. These devices require manual operation; thus, they cannot be used by persons with physical disabilities that limit the motor functions of the hands. For example, persons suffering from carpal tunnel syndrome cannot use their hands without feeling discomfort or even pain. These persons require tools for controlling the widgets of a GUI hands-free. The research area of speech-controlled GUIs aims at providing such tools by using speech recognition technology to control a GUI with spoken commands. Early studies in this area revealed a 400% performance decrease compared to mouse control (Leeming et al. [13]). Technological advances in today's speech recognition systems have reduced the performance decrease to about 50%, at least for specific application domains like web browsing (Van Buskirk and LaLomia [4], Christian et al. [6]). The performance decrease can be explained as follows: first, existing speech-controlled GUI approaches require a high average number of commands per task (we will give examples later on). Second, recognition errors lead to rejections or misunderstandings of spoken commands, which force users to repeat or correct themselves, which in turn requires more spoken commands.

In this paper we present conversation-and-control, a new approach for speech-based interaction with GUIs. We model the interaction with the GUI as a mixed-initiative dialog by abstracting the functions of widgets as forms. This modeling technique facilitates direct manipulation of widgets through a command language that requires only a single command per function, reducing the average number of required commands per task. Furthermore, our dialog model provides two methods for dealing with recognition errors: first, a set of heuristic procedures for analyzing a recognition result in order to avoid its rejection, i.e., to avoid repetitions of commands; second, the ability for the system to ask clarification questions in ambiguous situations in order to elicit more information from the user.

We begin with a brief overview of speech-controlled GUI technology (a more detailed overview can be found in [14]) and continue by explaining the motivation for our work. Using a sample GUI as an example, we present the characteristics of conversation-and-control. We conducted a user experiment in which we compared conversation-and-control against conventional command-and-control approaches. We present and discuss our experimental results and close the paper with our conclusions.
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. CHI 2008, April 5–10, 2008, Florence, Italy. Copyright 2008 ACM 978-1-60558-011-1/08/04…$5.00.
SPEECH-CONTROLLED GUI TECHNOLOGY
We partition speech-controlled GUI technology into the divisions speech-based mouse/keyboard emulation and command-and-control.

Speech-based Mouse/Keyboard Emulation

Speech-based mouse/keyboard emulation (SMKE) manipulates the text cursor (keyboard emulation) or the mouse cursor (mouse emulation) by spoken commands. Keyboard emulation is straightforward technology: commands like "shift", "space" or "press a" are employed to emulate physical keystrokes. An example is SUITEKeys [15], which provides speech control over a virtual 84-key keyboard. For mouse emulation a variety of approaches exist. While these approaches commonly employ commands like "click" or "press right button" to emulate button clicks, they differ in the way the mouse cursor can be moved.

Direction-based Mouse Emulation

With direction-based mouse emulation the cursor moves in the direction uttered by the user. Karimullah and Sears [11] start a continuous movement of the mouse cursor if one of the commands "Move left", "- right", "- up" or "- down" is recognized. The movement stops upon "stop"; however, users experienced precision problems: after saying "stop" the cursor kept moving until the speech recognizer finished processing. Igarashi and Hughes [9] achieve continuous movement by uttering the direction followed by a non-verbal vocalization. The cursor moves as long as the vocalization lasts, e.g., "move up aaaa". Harada et al. [8] use a similar technique, but assign each direction a specific sound. Using non-verbal vocalizations is more precise, but cumbersome to use for long distances. Brøndsted and Aaskoven [3] achieve discrete cursor movement by displaying eight coordinate axes on the screen (north, north-east, etc.), requiring commands like "go north-east two hundred and ten". Their approach facilitates quick movement over long distances but lacks precision: users tend to estimate the target coordinates incorrectly. Mihara et al. [16] combine discrete movements for long distances with non-verbal continuous movement for short distances. They improved precision, but increased the average duration of the interaction process.

Target-based Mouse Emulation

Target-based mouse emulation defines specific targets on the screen and assigns them speakable identifiers, displayed close to the target. Speaking an identifier places the cursor within the corresponding target. An example is QPointer VoiceMouse [1], which uses widgets as targets; e.g., uttering the name of a button causes the mouse cursor to be placed over the button. However, certain modern widgets behave differently depending on where they are clicked (e.g., a drop-down box); thus, a widget potentially exposes multiple targets. With an increasing number of widgets the screen becomes crowded with target names, which results in layout and usability issues. Dai et al. [5] try to solve this problem by restricting the number of targets. They divide the screen into a coarse-grained grid of named cells (e.g., 3x3) with one virtual target in the center of each cell. By speaking the name of a cell, a finer-grained grid is recursively drawn within that cell. Recursion continues until a virtual target overlaps with the desired target on the screen. Their approach can reach every pixel on the screen in a finite number of recursions, but requires a high average number of commands per task.

In summary, direction-based approaches require one command for emulating the mouse button, but significantly more than one command for cursor movement. Target-based approaches trade a decrease of usability for an increase of required commands and vice versa.

Command-and-Control

Command-and-control approaches do not emulate physical devices. Instead, commands address the specific widget which is currently focused, and perform either navigation or activation. A navigation command focuses a different widget than the one currently focused. Activation commands manipulate the currently focused widget.

Navigation

Restricted navigation assigns each widget a predecessor and a successor. Only the predecessor or successor of the currently focused widget can be focused. An example is Windows XP Tablet PC Edition: it allows navigating objects by speaking "next" and "previous", just like pressing the TAB key. Restricted navigation is slow since every widget on the way to the desired widget must be focused temporarily. Type-based navigation uses the widget's type as a filter for determining predecessor or successor. An example is a speech-controlled GUI for a calendar application described in Olsen et al. [17], providing widgets for months, weeks and days. Users can navigate month-wise ("next month"), week-wise ("next week") or day-wise ("next day"). Arnold et al. [2] present a speech-controlled graphical programming environment which represents syntactic elements of a program by widgets, allowing navigation by syntactic type. For instance, "move down if" navigates to the first successor that represents an if-clause. For heterogeneous GUIs, type-based navigation is faster than restricted navigation, but it requires domain knowledge to determine the type of a widget. Furthermore, the user must be aware of the widget order. Again, widgets of the same type between the focused widget and the desired widget have to be focused temporarily.

Random navigation solves the previous issues by assigning speakable identifiers to widgets. Speaking an identifier focuses the corresponding object; note that this is different from placing the mouse cursor somewhere within the widget. An example is Conversa (see [6]), a web browser which facilitates speech-controlled web browsing. It generates speakable labels for the interactive elements of the currently viewed web page. James et al. [10] have created a speech-controlled GUI for the SAP Workplace product, where users navigate tree menus by speaking the labels of menu items.
Activation

Activation commands manipulate the currently focused widget. Common to the approaches direct activation and indirect activation is the definition of speakable names by which widget functions can be invoked. For instance, the pushing of a button could be mapped to "push". Direct activation only invokes functions of the focused widget: the user first utters navigation commands followed by the desired activating command. With indirect activation one specific widget function is invoked at the moment the widget is focused; all other functions of the widget still need to be invoked by direct activation. Let us consider a button identified by "Cancel". With direct activation the command sequence "Cancel", "push" is required to focus and push the button, whereas with indirect activation the command "Cancel" alone would indirectly push the button.

Obviously, indirect activation is faster than direct activation (no explicit activation command) but cannot be applied to all functions of a widget. An example is Conversay Voice Surfer (see footnote 1), which opens the menu of a drop-down box if the user speaks the corresponding label (indirect activation). A specific option of the drop-down box is selected by speaking its label (direct activation).

Summary and Research Focus

SMKE utilizes spoken commands to emulate physical input devices, whereas command-and-control utilizes spoken commands for focusing widgets and activating their functions. SMKE is applicable to any GUI which accepts mouse and keyboard input, because the GUI treats the emulated mouse and keyboard events as having originated from the physical devices. In contrast, command-and-control must be explicitly supported by the respective GUI, because focusing and activation are specific to the widgets and their state. However, we believe that command-and-control has more potential for improvement. Indeed, command-and-control is not generally applicable like SMKE, but this is ultimately a consequence of GUI technology: today, every GUI widget kit (e.g., Java Swing, MFC, etc.) provides built-in support for mouse and keyboard devices. The situation is different for speech as an input modality. Today's available technology for incorporating speech input into GUIs, such as SALT (see footnote 2) or VoiceXML (see footnote 3), provides an interface for conveniently accessing speech services (speech recognition and speech synthesis). The burden of dealing with recognition grammars, semantic analysis and dialog management, however, remains with the developer. We believe that this burden can be taken away from the developer, just as with mouse and keyboard events. It is furthermore a generally accepted hypothesis that speech, by virtue of being natural, is a natural input modality for computing systems (see footnote 4). In this regard we argue that SMKE adds an unnecessary indirection to the interaction: the user controls a mouse cursor, which abstracts the physical position of a mouse device that is not actually present. As such we believe that command-and-control is the more natural approach, as it has the flavor of performing a conversation with the GUI.

MOTIVATION: QUALIFIED ACTIVATION PRINCIPLE

Indirect activation is fast because it couples the focusing of a widget with the invocation of a specific widget function. As such, speaking the command for explicitly invoking the widget function is no longer required. One could also argue that focusing the widget is no longer required: the widget's identifier is the very command for indirectly activating a selected widget function. Thus, should it not be possible to invoke all widget functions without having to focus? The benefit becomes clear if we consider that after the user has stopped speaking, a speech recognition system takes time to process the recognition result, the so-called recognition delay. Thus, merging n commands for focusing and activating saves n-1 times the recognition delay. An even more significant potential lies in exploiting users' familiarity with a GUI. For example, independently of speech or mouse control, a novice user first opens a menu by focusing it and then selects a menu item. An expert user with a keyboard is provided hot keys for quickly selecting the menu item directly, without having to focus. A similar concept does not exist for command-and-control: even an expert user first has to focus the menu before selecting the item. This problem exists for other widgets as well, e.g., drop-down boxes or trees.

We extended the idea of indirect activation and defined the principle of qualified activation. It denotes that commands for controlling a GUI by speech must contain all the information required for inferring the widget, the function and possibly any function parameters (e.g., the selection value of a drop-down box). Thus, qualified activation does not require focusing. Instead, a widget function can be invoked with a single command. Our theoretical model for comparing speech-controlled GUI approaches [14] predicts an 11% reduction of task completion time for qualified activation.
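The arithmetic behind the recognition-delay argument can be sketched in a few lines of Python. The command sequences and the 0.5 s delay are illustrative assumptions of ours, not values from the paper.

```python
# Illustrative only: command sequences for selecting option "Night" of a list
# named "Config". The 0.5 s delay is an assumed value, not taken from the paper.
RECOGNITION_DELAY = 0.5  # seconds per spoken command

def total_delay(commands, delay=RECOGNITION_DELAY):
    """Each spoken command incurs one recognition delay."""
    return len(commands) * delay

command_and_control = ["config", "select night"]   # focus, then activate
qualified_activation = ["select config night"]     # one qualified command

# Merging n commands into one saves (n - 1) recognition delays:
saved = total_delay(command_and_control) - total_delay(qualified_activation)
assert saved == (len(command_and_control) - 1) * RECOGNITION_DELAY
```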
1 http://www.conversay.com
2 http://www.saltforum.org
3 http://www.voicexml.org
4 Empirical evidence, however, contradicts this hypothesis for specific tasks, such as non-trivial situations in text editing (Karl et al. [12]).
We used the platform-independent GUI toolkit Java Swing to derive a catalog of basic widgets and their functions for investigating qualified activation, shown in Table 1. The first column names the widgets, the second column lists their logical functions and the third column lists the parameters of the widget functions (the last column of the original table gives an exemplary graphical representation of each widget). The functions Append and Delete (input field, spinner) perform text input (refer to SUITEKeys [15]): Append appends the given character to the right of the current character string; Delete deletes the rightmost character. For speaking characters we assumed the NATO phonetic alphabet (e.g., "alpha" for the character 'a'). Some widgets involve scrollbars (e.g., the drop-down box), which can be modeled analogously to sliders.

Name            Functions   Parameter
Button          Push        -
Checkbox        Check       -
                Uncheck     -
Radio Button    Select      button name
List            Select      option name
                Deselect    option name
Drop Down Box   Open        -
                Close       -
                Select      option name
Menu            Open        -
                Close       -
                Select      item name
Tree            Collapse    node name
                Expand      node name
                Select      node name
                Deselect    node name
Input Field     Append      character
                Delete      -
Spinner         Append      digit
                Delete      -
                Increase    -
                Decrease    -
Slider          Slide       tick name

Table 1. Catalog of Basic Widgets and Functions.

CONVERSATION-AND-CONTROL

We used qualified activation as the basis for conversation-and-control. We first describe an abstract model for widget functions, which facilitates a simple command language for the user and a generation schema for the recognition grammar. Then, using a sample GUI as an example, we go into the details of the dialog model according to which we determine a reaction upon an utterance of the user. We conclude the section by pointing out limitations.

A Model for Widget Functions

The functionality of speech-controlled systems can be modeled by mapping functions to forms, as proposed by Papineni et al. [18]. The slots of each form model those bits of information which are required to identify and invoke the underlying function. Slots accept specific representations of the information they stand for. If the user utters a command, an analysis process populates specific slots with accepted information extracted from the corresponding recognition result. Using specific heuristics on the distribution of empty and populated slots among the forms, the system determines an appropriate reaction, e.g., the invocation of a function or the rejection of the recognition result.

Let us examine the information required to identify and invoke a widget function. First, there is the widget name, which identifies the widget of which a function should be invoked. Some widgets directly display a suitable widget name, e.g., the name of a button; others require an additional label, e.g., to the left of an input field. Second, there is the function name, which identifies the particular function, e.g., the function names listed in Table 1. Third, there is an optional parameter, which represents a parameter for the function invocation, e.g., a selection value. Thus, we abstract a widget function by a form consisting of the following slots: the widget name slot, which accepts the widget name of the widget that exposes the underlying function; the function name slot, which accepts the function name of the underlying function; and the optional parameter slot, which accepts representations of the parameter. In the following we refer to this special form as a topic, reflecting the idea that the system offers several conversation topics to the user (i.e., the functions which can be invoked).

Based on the number of populated slots we assign a state to each topic to simplify heuristic evaluation. The state EMPTY denotes that all slots are empty; the state COMPL denotes that all slots are populated (two or three slots); otherwise, the topic is partially complete (PARTCOMPL). Initially, a topic is in the EMPTY state. As slots become populated, the state transitions to the PARTCOMPL or COMPL state. A similar notion of states for forms has been proposed by Denecke [7], who introduces the states NEUTRAL (=EMPTY), SELECTED (=PARTCOMPL) and FINALIZED (=COMPL). We further refine this state concept as shown in Figure 1 (the reason will become clear later on). We decompose PARTCOMPL into the sub-states PARAMBAD and PARAMOK to encode further information about the parameter slot: PARAMBAD indicates an empty parameter slot and PARAMOK indicates a populated parameter slot. The idea behind this is that a parameterized function can be invoked even if it is identified only by the function name, but never without a parameter value. For simplifying further discussions we visualize topics and their respective states as depicted in Figure 2.

Figure 1. Topic States.
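The topic abstraction and its states can be sketched as a small Python data structure. The state names follow the paper; the class layout, field names and slot keys are our own illustration, not the paper's implementation.

```python
from dataclasses import dataclass, field

@dataclass
class Topic:
    """A form abstracting one widget function, e.g. SELECT of a "Config" list."""
    widget_name: str        # value accepted by the widget name slot
    function_name: str      # value accepted by the function name slot
    parameters: tuple = ()  # values accepted by the parameter slot; empty if none
    slots: dict = field(default_factory=dict)  # populated slots: slot key -> word

    def state(self) -> str:
        total = 2 + (1 if self.parameters else 0)  # two or three slots
        filled = len(self.slots)
        if filled == 0:
            return "EMPTY"
        if filled == total:
            return "COMPL"
        if not self.parameters:
            return "PARTCOMPL"
        # PARAMBAD/PARAMOK are the sub-states of PARTCOMPL that encode
        # whether the parameter slot is populated.
        return "PARAMOK" if "parameter" in self.slots else "PARAMBAD"
```

For example, a SELECT-topic of a list starts in EMPTY, moves to PARAMBAD once only its function name slot is populated, to PARAMOK once the parameter slot is also populated, and to COMPL when all three slots are filled.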
Figure 2. Graphical Notation of Conversation Topic States.

Command Language and Recognition Grammar

We derived a command language straightforwardly from the structure of a topic. A qualified activation command must contain the widget name WID, the function name FID and optionally the parameter PID. Thus, if we parameterize commands by the distinct corresponding values for PID, there is one command for each function. We define the following recognition grammar for qualified activation commands QAC:

QAC := (WID | FID | PID)+
WID := all occurring widget names
FID := all occurring function names
PID := all occurring parameter ids

Note that we do not constrain the order or number of constituents at the speech recognizer level; for instance, referring to the checkbox widget from the catalog, it would be valid to say "Check Running" or "Running check". Note that "Check" alone would also be syntactically correct. We leave it to the semantic analysis capabilities (see below) to determine the meaning. Indeed, constraining the command language typically leads to better speech recognition performance, but at the cost of less freedom for the user. Our experiment, which we discuss later on, shows that the above recognition grammar is feasible, at least for the experiment. Allowing users to construct valid sentences requires them to be aware of the language structure, e.g., through training. Widget names and parameters could then be displayed within or next to the widgets (e.g., button names or menu options). Problematic parameter names, e.g., those including special characters, could be substituted by a label that is easy to speak (e.g., an index for the options of a drop-down box). Function names would have to be learned by heart.

Figure 3. Sample GUI for Illustrating Semantic Analysis: (a) Sample GUI, (b) Topics of Sample GUI.

Inferring the Meaning of a Recognition Result

The intended topic is the specific topic which represents the function the user intends to invoke. The core challenge is to infer the intended function from the information contained in the recognition result. Implementations of conversation-and-control are not bound to a specific approach for semantic analysis. However, the simple nature of the command language and the simple values which the slots accept allowed us to use word spotting: our algorithm extracts each distinct word from the recognition result and then matches each extracted word against all available slots. If a slot accepts a word, the slot is populated with the word. We refer to this algorithm as a semantic update in the following. Note that words are not consumed once a slot accepts them, i.e., one specific word might populate multiple slots. The reason is that we cannot assume a specific priority among the topics: different topics might expose slots which accept the same values, e.g., all topics which originate from the same widget accept the same widget name. Thus, a semantic update leads to a specific topic state distribution. We now introduce heuristics for inferring the intended function from topic state distributions, using the sample GUI depicted in Figure 3(a). It consists of a button "Apply", two spinners "Voltage" and "Amps", and a list "Config" with two options "Day" and "Night". Without restricting generality, the labels are the widget identifiers. We name the topics after the functions that they represent. Thus, each spinner provides an INCREASE-topic, a DECREASE-topic and an APPEND-topic; we omit the DELETE-topic as it would not provide further insight. The button defines a PUSH-topic and the list provides a SELECT-topic and a DESELECT-topic. The topics of the sample GUI are graphically depicted in Figure 3(b).

Isolated-Topic-Distribution Heuristic

We call a topic state distribution where a single topic is in state COMPL and all others are in state EMPTY the isolated-topic distribution. For this distribution we define the topic in COMPL state to be the intended topic. Although this definition is straightforward, it is still a heuristic, because upon a misunderstanding the topic in COMPL state is in fact not the topic intended by the user. As an example, assume the recognition result "push apply" becomes available. Its constituents transition the PUSH-topic into the COMPL state. All other topics remain EMPTY, as shown below. Thus, the PUSH-topic becomes the intended topic.
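The semantic update by word spotting and the isolated-topic heuristic described above can be sketched as follows. This is a minimal illustration under our own data layout (dict-based topics); the paper does not prescribe an implementation.

```python
# Each topic records the values its slots accept, plus the populated slots.
def make_topic(widget, function, params=()):
    return {"widget": widget, "function": function,
            "params": set(params), "slots": {}}

def semantic_update(topics, recognition_result):
    """Word spotting: every distinct word populates every slot that accepts it.
    Words are not consumed, so one word may populate slots of several topics."""
    for word in set(recognition_result.lower().split()):
        for t in topics:
            if word == t["widget"]:
                t["slots"]["widget"] = word
            if word == t["function"]:
                t["slots"]["function"] = word
            if word in t["params"]:
                t["slots"]["parameter"] = word

def state(t):
    total = 2 + (1 if t["params"] else 0)
    filled = len(t["slots"])
    return "EMPTY" if filled == 0 else "COMPL" if filled == total else "PARTCOMPL"

def isolated_topic(topics):
    """Isolated-topic distribution: one topic COMPL, all others EMPTY."""
    states = [state(t) for t in topics]
    if states.count("COMPL") == 1 and states.count("EMPTY") == len(topics) - 1:
        return topics[states.index("COMPL")]
    return None

# Sample GUI from the text: "Apply" button, two spinners, "Config" list.
topics = [make_topic("apply", "push"),
          make_topic("voltage", "increase"), make_topic("amps", "increase"),
          make_topic("config", "select", ["day", "night"])]
semantic_update(topics, "push apply")
assert isolated_topic(topics)["widget"] == "apply"
```

The recognition result "push apply" populates both slots of the PUSH-topic while every other topic stays EMPTY, so the PUSH-topic is returned as the intended topic.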
Peering-Topic-Distribution Heuristic

We call a topic state distribution with a single topic in state COMPL and all others either EMPTY or PARTCOMPL the peering-topic distribution, as the topic in state COMPL virtually peers out. For this distribution, again, we define the topic in COMPL state to be the intended topic. This is a heuristic as well, since the information was (at least partially) "understood" by other topics. As an example, assume the recognition result "increase voltage" becomes available, leading to the topic state distribution depicted below. Only the INCREASE-topic of the voltage spinner is in COMPL state, causing it to become the intended topic.

Selecting the single topic in COMPL state is a simple strategy for inferring the intended topic from a recognition result. But what about topic state distributions where no topic or more than one topic is in COMPL state? We call any such distribution an information error, and all topics in COMPL state (or PARTCOMPL state, respectively) are called the ambiguous topics. Speech recognition errors or incorrectly formulated commands are the causes of information errors, and in the following we discuss methods to deal with them.

Automatic Information Error Recovery

We introduce heuristics which assume that all ambiguous topics are in PARTCOMPL state, i.e., no topic is in COMPL state. We handle the case of more than one topic being in COMPL state later on, when we introduce clarification questions. We refer to the widget name slot and the function name slot as the identification slots, as they identify a specific function. We augment our graphical notation with a small circle on the left/right for the widget name slot/function name slot: if the circle is filled, the respective slot is populated; otherwise it is empty.

Relative-Maximum-Identification Heuristic

The relative-maximum-identification heuristic determines the maximum number of populated identification slots per ambiguous topic, called nmax (0, 1 or 2). If there is exactly one topic with nmax identification slots populated, which is additionally in state PARAMOK, then it is the intended topic (the state PARAMBAD requires a clarification question for the missing parameter, see below). Otherwise, the relative-maximum-identification heuristic has failed. As an example, assume the recognition result "select night" becomes available. It contains the function name of the SELECT-topic ("Config" list) and a parameter name of the SELECT- and DESELECT-topics (also "Config" list). This leads to the topic state distribution below, according to which the SELECT-topic becomes the intended topic.

The relative-maximum-identification heuristic is derived from a "human" inference strategy, as the example shows: first, there is nothing other than the "Config" list that can be selected ("select"), and second, there is nothing else that has an option labeled "night". If the relative-maximum-identification heuristic fails, we attempt the following procedure.

Resolution-by-Historical-Topic

The historical topic is the intended topic that was determined most recently. The key idea of resolution-by-historical-topic is to populate selected ambiguous topics with slot values from the historical topic, i.e., to consult the interaction history. First, determine all ambiguous topics T which have nmax identification slots populated. From all topics in T determine the commonly empty slots and collect the corresponding values V from the historical topic. For example, if all topics in T are missing the widget name slot, then V contains the value of the widget name slot of the historical topic. Then perform a temporary semantic update of all topics in T using the elements of V as a "pseudo" recognition result, and re-evaluate using the peering-topic-distribution and relative-maximum-identification heuristics. If this fails, resolution-by-historical-topic has failed. This heuristic is derived from the idea of shortening command sequences on the same widget, e.g., text or number entry into input fields or spinners. As an example, assume that the user has increased the value of the amps spinner by saying "increase amps". The peering-topic-distribution heuristic leads to the invocation of the respective topic, after which all topics are reset to their empty state. Then the user wants to append the digit 5 to the increased value by saying "append five", leading to the topic state distribution below.

The relative-maximum-identification heuristic obviously fails, and the widget name slot is commonly missing. Using "amps" from the historical topic as the pseudo recognition result leads to the peering-topic distribution below, according to which the APPEND-topic becomes the intended topic.
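The recovery heuristics above can be sketched in Python. This is a simplified illustration under our own data layout: `resolve_by_history` treats all PARTCOMPL topics as the ambiguous set T, and `relative_max` approximates the paper's uniqueness test; the paper itself does not prescribe an implementation.

```python
def make_topic(widget, function, params=()):
    return {"widget": widget, "function": function,
            "params": set(params), "slots": {}}

def update(topics, words):
    """Word spotting: each word populates every slot that accepts it."""
    for w in set(words):
        for t in topics:
            if w == t["widget"]: t["slots"]["widget"] = w
            if w == t["function"]: t["slots"]["function"] = w
            if w in t["params"]: t["slots"]["parameter"] = w

def state(t):
    total = 2 + (1 if t["params"] else 0)
    n = len(t["slots"])
    return "EMPTY" if n == 0 else "COMPL" if n == total else "PARTCOMPL"

def peering(topics):
    """One topic COMPL, the rest EMPTY or PARTCOMPL: that topic is intended."""
    compl = [t for t in topics if state(t) == "COMPL"]
    return compl[0] if len(compl) == 1 else None

def relative_max(topics):
    """A unique non-empty topic with the most identification slots populated,
    whose parameter slot is also satisfied, is the intended topic."""
    ident = lambda t: sum(s in t["slots"] for s in ("widget", "function"))
    nmax = max(ident(t) for t in topics)
    best = [t for t in topics if ident(t) == nmax and state(t) != "EMPTY"]
    ok = [t for t in best if not t["params"] or "parameter" in t["slots"]]
    return ok[0] if len(best) == 1 and len(ok) == 1 else None

def resolve_by_history(topics, historical):
    """Populate slots that are empty in ALL ambiguous topics with the
    corresponding values of the historical topic, then re-evaluate."""
    if historical is None:
        return None
    ambiguous = [t for t in topics if state(t) == "PARTCOMPL"]
    pseudo = [historical[s] for s in ("widget", "function")
              if ambiguous and all(s not in t["slots"] for t in ambiguous)]
    update(topics, pseudo)
    return peering(topics) or relative_max(topics)

# Example from the text: "increase amps", then "append five".
spinners = []
for name in ("voltage", "amps"):
    spinners += [make_topic(name, "increase"), make_topic(name, "decrease"),
                 make_topic(name, "append", ["one", "two", "three", "four", "five"])]
update(spinners, "increase amps".split())
historical = peering(spinners)           # the amps INCREASE-topic
for t in spinners: t["slots"].clear()    # topics are reset after invocation
update(spinners, "append five".split())  # both APPEND-topics are PARTCOMPL
intended = resolve_by_history(spinners, historical)
assert intended["widget"] == "amps" and intended["function"] == "append"
```

After "append five", the APPEND-topics of both spinners are ambiguous with the widget name slot commonly empty; borrowing "amps" from the historical topic yields a peering-topic distribution whose COMPL topic is the amps APPEND-topic.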
. Qualified Feedback
We have introduced heuristics for automatically dealing with information errors. There are, however, information
984
CHI 2008 Proceedings · Mixed-Initiative Interaction
April 5-10, 2008 · Florence, Italy
errors where these heuristics fail and we now show how the system can acquire additional information from the user.
Generating Clarification Questions
The failure of resolution-by-historical-topic triggers the generation of a clarification question q according to the following procedure. First, determine all ambiguous topics T which have nmax identification slots populated. Since the relative-maximum-identification heuristic failed, the case |T| = 1 means that the parameter slot of a single topic is not populated; in this case q is simply a list of the available parameter options. Otherwise (|T| > 1), determine the number dw of distinct widget names and the number df of distinct function names represented by the topics in T. Note that T may contain none or more than one topic in COMPL state. There is obviously a limit of |T| above which clarification no longer makes sense, but determining it is left to future usability research. Construct q according to the following pseudo algorithm: q := "Do you mean "; For every topic t in T do • If df > 1 then append to q the function name of t • If dw > 1 then append to q the widget name of t • If there are more topics in T append " or " to q
As an example, consider the recognition result "amps" having led to the topic state distribution below. According to the algorithm above, the clarification question is "Do you mean increase, decrease or append?".
Processing Responses
We now assume that the system has presented q to the user, e.g., via a status window or even via speech synthesis. The system must detect whether the recognition result following the presentation of q is a response to q (the user might decide to do something else) or whether additional information was in fact provided. We take this recognition result for a semantic update on all available topics without resetting their states. We call this semantic update the response update and determine a reaction by comparing the topic state distributions before and after the response update. If the topic state distribution changed, we try to infer the intended topic by applying all heuristics discussed above, with the exception of clarification questions, as these might lead to an infinite loop. If this fails, or if the topic state distribution did not change in the first place, we reset the topic state distribution, perform the response update again and retry the heuristics, this time including clarification questions. Note that if the initial topic state distribution contained two or more topics in COMPL state, the response update will either not change the distribution or will cause all heuristics discussed so far to fail, and will consequently lead to the topic states being reset. As an example, let us assume the situation from above where the recognition result "amps" led to the clarification question "Do you mean increase, decrease or append?". Let us assume that the initial "amps" was a recognition error: the user actually uttered "increase voltage", which the user now repeats as response and which is now correctly recognized. In the figure below we omit the "Config"-list and the "Apply"-button as they are not affected. The initial situation is marked by (0). The situation marked (1) shows the topic state distribution after processing of "amps", which led to the clarification question. The response update with the recognition result "increase voltage" is marked with (2); the resulting topic state distribution contains two topics in state COMPL, thus all heuristics fail. The topic states are reset and the response update is performed again, leading to the distribution marked (1)*. According to the peering-topic-distribution heuristic, the INCREASE-topic of the voltage spinner is now selected as the intended topic.
Legacy Mode
As we pointed out at the beginning, qualified activation is an extension of indirect activation. It is therefore possible for conversation-and-control to operate according to the rules of conventional command-and-control, that is, random navigation and indirect activation. We achieve this by substituting the widget name slot with a separate FOCUS-topic, which has the substituted widget name slot as its only slot. Initially, every widget exposes just a FOCUS-topic. The invocation of a specific FOCUS-topic leads (1) to the remaining topics of the focused widget becoming visible to the system, and (2) to the indirectly activated function being invoked. We call this the legacy mode of conversation-and-control and give a more detailed explanation in [14]. We used legacy mode in order to compare (conventional) command-and-control against our new approach.
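The construction of q described by the pseudo algorithm above can be sketched in Java. The Topic record and its fields are simplified assumptions for this sketch, not the framework's actual API; the separators between alternatives are rendered as in the example question ("increase, decrease or append").

```java
import java.util.List;

// Minimal sketch of the clarification-question construction. The Topic
// record is a simplifying assumption, not the framework's actual API.
public class ClarificationQuestion {

    record Topic(String widgetName, String functionName) {}

    // topics: the ambiguous topics T with the maximum number of populated
    // identification slots; dw and df are derived from T as in the text.
    static String build(List<Topic> topics) {
        long dw = topics.stream().map(Topic::widgetName).distinct().count();
        long df = topics.stream().map(Topic::functionName).distinct().count();
        StringBuilder q = new StringBuilder("Do you mean ");
        for (int i = 0; i < topics.size(); i++) {
            Topic t = topics.get(i);
            StringBuilder part = new StringBuilder();
            if (df > 1) part.append(t.functionName());   // function names differ
            if (dw > 1) {                                // widget names differ
                if (part.length() > 0) part.append(' ');
                part.append(t.widgetName());
            }
            q.append(part);
            // separators as in the paper's example: commas, final "or"
            if (i < topics.size() - 2) q.append(", ");
            else if (i == topics.size() - 2) q.append(" or ");
        }
        return q.append('?').toString();
    }

    public static void main(String[] args) {
        // The "amps" example: three topics share the widget name, so only
        // the function names distinguish them (dw == 1, df == 3).
        List<Topic> t = List.of(
            new Topic("amps", "increase"),
            new Topic("amps", "decrease"),
            new Topic("amps", "append"));
        System.out.println(build(t)); // Do you mean increase, decrease or append?
    }
}
```

When instead the function name is shared and the widget names differ (dw > 1, df = 1), the same procedure yields questions such as "Do you mean amps or voltage?".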
Limitations
Identification slots are prioritized higher than parameter slots, which can lead to unintuitive clarification questions. As an example, assume the recognition result "increase day", which leads to the topic state distribution below.
The clarification question "Do you mean amps or voltage?" would be generated, although the "Config"-list is clearly also a candidate for disambiguation. Furthermore, the dialog model does not verify with the user what it has inferred, e.g., by repeating the recognized command. However, verification can be added to the model if usability studies suggest this. In this context, another reasonable addition might be a global undo function for quickly recovering from misunderstandings; this, however, depends on the transaction rollback abilities of the application backend.
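To make this prioritization concrete, the following toy Java model (our own simplification; the Topic record and the slot counts are illustrative assumptions, not the framework's actual data structures) shows why the "Config"-list never becomes a clarification candidate for "increase day":

```java
import java.util.List;

// Toy model of why identification slots dominate parameter slots when
// selecting clarification candidates. Topic and its fields are
// illustrative assumptions, not the framework's actual API.
public class SlotPriorityDemo {

    record Topic(String widget, String function,
                 int populatedIdSlots, int populatedParamSlots) {}

    // Candidates are the topics with the maximum number of populated
    // identification slots; populated parameter slots are ignored.
    static List<Topic> candidates(List<Topic> all) {
        int max = all.stream().mapToInt(Topic::populatedIdSlots).max().orElse(0);
        return all.stream().filter(t -> t.populatedIdSlots() == max).toList();
    }

    public static void main(String[] args) {
        // "increase day": "increase" populates the identification slot of
        // both spinners' INCREASE topics, while "day" populates only a
        // parameter slot of the "Config" list's SELECT topic.
        List<Topic> topics = List.of(
            new Topic("amps", "increase", 1, 0),
            new Topic("voltage", "increase", 1, 0),
            new Topic("Config", "select", 0, 1));
        for (Topic t : candidates(topics))
            System.out.println(t.widget());
        // Prints "amps" and "voltage": the "Config" list is excluded,
        // yielding the question "Do you mean amps or voltage?".
    }
}
```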
Summary
Using the principle of qualified activation we have developed an abstract model for representing widget functions as forms. From the structure of the forms we derived a command language and a mixed-initiative dialog model for managing the interaction between the user and the system. The dialog model is based on a cascade of heuristics and procedures for dealing with recognition errors, including the generation and processing of clarification questions and their corresponding responses. Because the dialog model is able to take over the initiative in the dialog, the interaction with the system has the flavor of a real conversation, which is why we call the entire approach conversation-and-control.
EXPERIMENT
We describe an experiment by which we compared conversation-and-control against the conventional command-and-control approach. We hypothesized that conversation-and-control would result in a significant reduction of task completion time.
Conversation-and-Control Framework
In [14] we present an implementation of a conversation-and-control framework in the Java programming language. We used the Java Speech API (JSAPI) to access Microsoft Speech SDK 5.1 as speech recognizer (specifically, the engine "Microsoft English Recognizer v5.1, SAPI5, Microsoft"). Our framework implements the dialog model described in this paper and extends all Java Swing widgets from our function catalog with conversation-and-control. The framework configures the speech recognizer and the dialog model automatically by inspecting the properties of the widgets that the respective GUI consists of. The framework also supports the legacy mode of conversation-and-control.
Test Subjects
We used 16 test subjects (5 female, 11 male) between the ages of 20 and 30 with different professions: physician (6), physics student (2), insurance agent (1), software engineer (1), computer scientist (2), architect (1), industrial engineer (1), electrical engineer (1) and business economist (1). All of them had significant experience with controlling GUIs by mouse and keyboard, but none had experience with speech-controlled GUIs. All test subjects were German native speakers but had significant experience in speaking English.
Experiment System
Figure 4. Experiment GUI.
Using the conversation-and-control framework we implemented an experiment system with the GUI depicted in Figure 4. We included two instances of each widget from the widget function catalog and defined a sequence of 32 simple tasks (on average three tasks per widget type). The system displayed instructions regarding the goal of the current task, e.g., a specific option that should be selected. The system detected task completion by observing the widgets' states with respect to the task's goal. If a task was completed, it switched to the next task, which facilitated an
automated measurement of the task completion time. We deliberately designed the qualified activation commands of some widgets to "overlap" in order to provoke misunderstandings (e.g., the options of the radio button lists). The widgets did not trigger any domain-specific functions, which does not imply that misunderstandings had no negative effect on the task completion time: a misunderstanding might lead to a GUI state incompatible with the task goal, e.g., an incorrect selection value. The experiment system supported both conversation-and-control mode and legacy mode. We provided a Logitech Internet Chat Headset for the experiment (microphone frequency range: 100–10,000 Hz; sensitivity: -59 dBV/µbar, -39 dBV/Pa ±4 dB). We used a Dell Dimension 8400 with 1 GB of RAM and a Pentium IV processor at 3.19 GHz, running Windows XP Professional, Service Pack 2.
Procedures
A test run included the completion of all 32 tasks in either mode, in an order which was kept constant throughout the experiment and for all test subjects. We divided the test subjects randomly into two equally sized groups, A and B. Group A first performed test runs in legacy mode, then in conversation-and-control mode; group B performed vice versa. With this setup we did not favor a specific mode by putting it second, and it allowed us to detect order effects, if any. The actual experiment included three parts. First, the test subject was given an introduction to speech-controlled GUIs. Second, the test subject trained an individual speech profile of the speech recognizer using the training wizard provided by Microsoft's Speech SDK (basic training only). Third, the test subject performed two test runs in each mode, i.e., a total of four test runs. The first test run was considered training of the respective first mode determined by the group membership. The second test run was performed in the same mode as the first; however, test subjects were told that the system would now record a protocol. The third and fourth test runs were conducted in the same fashion using the opposite mode.
Results
We measured an average total task completion time for all 32 tasks of 256 s for legacy mode and 213 s for conversation-and-control (16.8% reduction). Within group A the average task completion time was 257 s for legacy mode and 203 s for conversation-and-control (21.0% reduction). Within group B we measured 255 s for legacy mode and 223 s for conversation-and-control (13.0% reduction); also refer to Table 2 (standard deviations in parentheses). Our results are significant at α = 0.05 (paired t-test, p = 6.5 · 10⁻⁷ < α; a Kolmogorov-Smirnov test statistic of 0.2 suggests that our data were sufficiently normally distributed). The overall average recognition error rate was 12%.

        Legacy      Conversation  Reduction
A       257 (41.0)  203 (45.0)    21.0%
B       255 (48.9)  223 (52.3)    13.0%
Total   256 (44.0)  213 (48.2)    16.8%

Table 2. Experimental task completion times in seconds, standard deviation in parentheses.
Discussion
The experiment proves the feasibility of conversation-and-control: users can complete GUI tasks by directly manipulating widgets by speech based on a mixed-initiative dialog model. Furthermore, the results support our claim that significantly shorter task completion times can be achieved. In our specific experiment we measured a reduction of 16.8% (general prediction: 11%). The individual group results suggest an order effect, which we interpret as follows. During the first two test runs, the test subjects became familiar with the GUI. Apparently, group A (first legacy mode, then conversation-and-control) could better exploit GUI familiarity in the third and fourth test runs than group B. This validates our initial motivation: qualified activation commands are similar to "hot keys" on the keyboard and allow for direct manipulation of widgets without having to focus. Consider as an example the selection from a drop-down box. Group A learned about the respective options while interacting in legacy mode. When interacting in conversation-and-control mode they recalled the options, and conversation-and-control allowed them to select one with a single command, even when the drop-down was closed. If the option name was unique within the GUI, users could even select it by simply uttering it (note that if the option name were not unique, the system would have asked for clarification). Group B learned about the options as well; however, legacy mode in the second half forced them to focus the drop-down box before they could select an option, even if they already knew which option to select. The order effect could also imply that conversation-and-control initially imposes more cognitive load for constructing the commands, whereas legacy mode, leaving less flexibility, imposes less cognitive load. As our results show, the increase in cognitive load, if any, is marginal. We believe that the familiarity effect prevails.
Nevertheless, the order effect suggests research towards a hybrid dialog model supporting novice and expert users. Novice users interact in legacy mode, in order to spend less effort on learning the characteristics of the GUI than on controlling it. Once familiar with the GUI, users can switch to expert mode and control the GUI using conversation-and-control.
Several test subjects complained about the necessity to focus in legacy mode. They found it more natural to interact with conversation-and-control, especially with growing experience; this supports our quantitative results and the initial motivation for our work. Two test subjects requested a command that clears out the content of an input field or
spinner to deal with misunderstandings more quickly. The speech recognizer performed well even with non-native English speakers (~10% word error rate). However, while training the speech recognizer, the number of recognition errors was felt to be significantly higher than during the experiment (most certainly because Microsoft's training wizard involved a much larger vocabulary than our experiment). We observed that clarification questions were helpful if they provided only two alternatives to choose from. More than two alternatives were confusing rather than providing guidance. There are (at least) two explanations. First, we observed that clarification questions frequently had nothing to do with what was uttered, most certainly because our system processed recognition results regardless of speech recognizer confidence. Second, we believe that clarification questions were difficult to understand because there was no visual connection between the alternatives and the ambiguous graphical objects. A visualization of the question's meaning promises to be helpful (e.g., highlighting ambiguous widgets).
CONCLUSION
We have presented a new approach for controlling a GUI by speech, called conversation-and-control. It is based on heuristic analysis of speech recognition results for avoiding the rejection of misrecognized commands, which is facilitated by representing widget functions as forms. Additionally, our approach incorporates the generation of clarification questions, which gives the interaction a conversational character. Using Java Swing and the Java Speech API we have implemented a generic conversation-and-control framework which allows for creating conversation-and-control enabled GUIs without having to deal with speech processing details, such as recognition grammar generation or dialog management. With the means of the framework we have conducted an experiment to compare conversation-and-control against conventional command-and-control approaches. Our results indicate a significant decrease (16.8%) in task completion time achieved by conversation-and-control. Our results further suggest research towards a hybrid approach using command-and-control for novice users and conversation-and-control for expert users.
ACKNOWLEDGMENTS
We wish to thank all participants who volunteered to participate in our user experiment. Thanks are also due to all persons who contributed valuable insights into the details of form-based mixed-initiative dialog management.
REFERENCES
1. Hands-free mouse listens for instruction. In ABILITY Magazine (2002). http://abilitymagazine.com/news_voicemouse.html
2. Arnold, S.C., Mark, L., Goldthwaite, J. Programming by voice. In Proc. ASSETS'00 (2000), 149-155
3. Brøndsted, T., Aaskoven, E. Voice-controlled internet browsing for motor-handicapped users: design and implementation issues. In Proc. INTERSPEECH 2005 (2005)
4. Van Buskirk, R., LaLomia, M. A comparison of speech and mouse/keyboard GUI navigation. In Proc. CHI'95, ACM Press (1995), 96
5. Dai, L., Goldman, R., Sears, A., Lozier, J. Speech-based cursor control: a study of grid-based solutions. In SIGACCESS Access. Comput. 77-78 (2004), 94-101
6. Christian, K., Kules, B., Shneiderman, B., Youssef, A. A comparison of voice controlled and mouse controlled web browsing. In Proc. ASSETS'00 (2000), 13-15
7. Denecke, M. Informational characterization of dialogue states. In Proc. ICSLP 2000 (2000)
8. Harada, S., Landay, J.A., Malkin, J., Li, X., Bilmes, J.A. The Vocal Joystick: evaluation of voice-based cursor control techniques. In Proc. ASSETS'06 (2006), 197-204
9. Igarashi, T., Hughes, J.F. Voice as sound: using non-verbal voice input for interactive control. In Proc. UIST 2001 (2001), 155-156
10. James, F., Roelands, J. Voice over Workplace (VoWP): voice navigation in a complex business UI. In Proc. ASSETS'02 (2002), 9-15
11. Karimullah, A.S., Sears, A. Speech-based cursor control. In Proc. ASSETS'02 (2002)
12. Karl, L., Pettey, M., Shneiderman, B. Speech-activated versus mouse-activated commands for word processing applications. In International Journal of Man-Machine Studies 39, 4 (1993), 667-687
13. Leeming, B.W., Porter, D., Jackson, J.D., Bleich, H.L. Computerized radiologic reporting with voice data-entry. In Radiology 138, 3 (1981), 585-588
14. Löhr, A. Conversation-and-control: extending speech-controlled graphical user interfaces. VDM Verlag Dr. Müller e.K., Saarbrücken, Germany, 2007
15. Manaris, B., McCauley, R., MacGyvers, V. An intelligent interface for keyboard and mouse control – providing full access to PC functionality via speech. In Proc. Florida AI Research Symposium (2001), 182-188
16. Mihara, Y., Shibayama, E., Takahashi, S. The migratory cursor: accurate speech-based cursor movement by moving multiple ghost cursors using non-verbal vocalizations. In Proc. ASSETS'05 (2005), 76-83
17. Olsen Jr., D.R., Tam, C.M., Conaty, G., Phelps, M., Heiner, J.M. Speech interaction with graphical user interfaces. In Proc. INTERACT 2001 (2001)
18. Papineni, K.A., Roukos, S., Ward, R.T. Free-flow dialog management using forms. In Proc. Eurospeech '99 (1999)