Patterns of Multimodal Input Usage in Non-Visual Information Navigation

Xiaoyu Chen, Marilyn Tremaine
Department of Information Systems, New Jersey Institute of Technology
{xiaoyu.chen, marilyn.tremaine}@njit.edu
Abstract
Multimodal input is known to be advantageous for graphical user interfaces, but its benefits for non-visual interaction are unknown. To explore this issue, an exploratory study was conducted with fourteen sighted subjects on a system that allows speech input and hand input on a touchpad. Findings include: (1) Users chose between these two input modalities based on the types of operations undertaken. Navigation operations were done primarily with touchpad input, while non-navigation instructions were carried out primarily using speech input. (2) Multimodal error correction was not prevalent. Repeating a failed operation until it succeeded and trying other methods in the same input modality were dominant error-correction strategies. (3) The modality learned first was not necessarily the primary modality used later, but a training order effect existed. These empirical results provide guidelines for designing non-visual multimodal input and create a comparison baseline for a subsequent study with blind users.
1. Introduction
Many information access systems are being developed for blind users. A subset of these systems supports mixed output modalities, such as combined haptic and speech output [7], and combined haptic, audio, and visual output [19]. In addition, two research groups have developed systems allowing multimodal input [3], [5]. Well-designed multimodal input has great potential to improve information systems' accessibility, because complementary input modalities can yield a "highly synergistic blend in which the strengths of each mode are capitalized upon and used to overcome weaknesses in the other" [9, p. 576]. Moreover, multimodal interfaces are expected to support the natural coordination of speech and hand actions because, as linguists have found, there is a close synchrony between speech and hand gestures in human communication. In fact, speech and hand gestures have been found to be inseparable units expressing different aspects of the same conceptual content [10]. However, little is known about how to design a multimodal speech and hand input mechanism so that multimodal advantages can be extended to non-visual
information navigation for visually impaired users. Much prior work investigated speech and hand input when hand input was performed via hand-eye coordination, a skill not readily accessible to blind users. In addition, prior work on dialogues integrating speech and hand input mainly focused on two application domains: interactive map-based tasks, e.g., [13], [14], and cross-modal error correction in speech recognition systems, e.g., [16], [18]. Empirical research is needed to investigate how to promote multimodal advantages in the non-visual information-browsing context.

The study in this paper provides parallel speech and hand input modalities for non-visual information browsing and observes users' input usage patterns. It reports users' choices between the two input modalities based on (1) the types of operations undertaken, (2) the types of input errors, and (3) the order in which the input modalities were taught. The results shed light on how to integrate the two input modalities and on what should be included in training materials to promote multimodal advantages.
2. Related Work
Existing research provides findings in two related domains: multimodal input usage by sighted users on graphical user interfaces (GUIs), and current designs of information browsing systems for blind users.
2.1. GUI Multimodal Input Research
Research on GUI multimodal input indicates that combined hand and speech input provides the following user advantages.

Expressiveness: Combined hand and speech inputs allow more powerful expressions. Hand actions and speech are both semantically rich input modes and have different expressive powers for describing information in different domains. For example, hand input is a powerful mechanism for describing geometrical attributes through movement trajectories, hand distances, palm orientations, hand shapes, and index finger directions [17]. Speech provides a rich vocabulary for expressing nominal information and instructional commands [6] and is not tied to spatial constraints. To obtain this increased expressiveness, users were found to prefer combined speech and hand input to either modality alone when performing interactive map-related tasks [13].
Robustness: Users are more likely to select an input modality that is less error-prone for a specific task [12]. For example, when inputting a foreign surname, users are more likely to write the name than speak it [15]. Moreover, from a usability standpoint, multiple input modalities create a fail-safe system: when one input mode fails, users can switch to the other mode. It was observed that input recognition errors were more likely to be corrected in one attempt if a switch was made to the input modality that did not cause the error [9], [11], [16], [18].

Performance Efficiency: It is known from psychology that well-learned motor skills such as speaking and hand and foot motions can be performed simultaneously [2]. With multimodal interfaces, users can perform visuo-spatial tasks simultaneously with verbal commands with little cognitive interference [2] and, hence, accelerate task completion. Oviatt and her colleagues found that integrated speech and pen input resulted in 23% fewer spoken words and 10% faster completion times compared to speech input only [13]. Suhm and his co-workers confirmed that multimodal error correction for speech user interfaces was faster than unimodal correction involving respeaking [18].

To take advantage of multimodal input, users learned to use the input modalities in a complementary fashion, employing specific tactics. For example, when performing tasks using interactive maps, users used combined speech and pen input for location specification, speech input for general commands, and pen input for drawing and pointing [13]. It was also found in speech recognition systems that speech was more effective for target-oriented commands (e.g., selecting a named word) than for navigation commands (e.g., "move up two lines") [16]. These results indicate that, in the task domains studied, one input modality may be superior to the others for particular types of input operations.

It is unknown whether these multimodal advantages transfer to non-visual interfaces, because no user study on non-visual multimodal input is yet available. But arguments can be made that there should also be advantages in multimodal non-visual input.
2.2. Current Input Designs for the Visually Impaired
Information browsing systems for visually impaired users that are currently available on the market fall into three categories: screen readers (e.g., JAWS [7]), non-visual Internet browsers (e.g., Home Page Reader [1]), and non-visual personal digital assistants (e.g., PAC Mate [7]). In these designs, multimodal input is not implemented, and keystroke input is dominant. Although these systems have significantly increased information accessibility, keystroke input without visual feedback has distinct disadvantages. Users of these systems face one or more of the following challenges: (1) A considerable amount of learning and memorization is required to use system functions, which are accessed through dozens of single or combined keystrokes. An important advantage of GUIs, that the screen serves as a reminder of the available system functions, which can be selected by pointing to and clicking on a menu item, is not available on non-visual systems. (2) Information is typically accessed sequentially via arrow keys. This requires the user to skip through a large amount of unwanted information. (3) The system is often very expensive because of the use of specialized hardware.

Carefully designed multimodal speech and hand input has great potential to solve these issues, because: (1) Hand input via a tangible medium, e.g., a touchpad, allows effective menu selection on non-visual interfaces [4], [5], [8], [20], with low memory requirements for users. An instantiation of such touchpad menu selection is the BullsEye menu [8]. It consists of a set of concentric circles that are divided into quadrants, with each quadrant holding a menu item. Speech output reads the menu item being touched, and non-speech audio indicates the direction of the finger movement and the boundaries between quadrants. Users were found to use this mechanism efficiently to select from a large number of menu items. (2) Speech input provides fast and direct access to system functions. The sequentiality of keystroke-based information access can be overcome with speech. (3) Touch media for hand input can be very affordable, since touchpads are ubiquitous on computers, and speech recognition software is available as open source and thus free.

In short, the issues in current input designs for non-visual information browsing can be addressed with integrated speech and hand input. However, the results of previous multimodal studies cannot be readily used to guide the creation of a non-visual multimodal input mechanism. A key reason is the presence of a visual interface in previous studies that guided the hand input. Hence, the primary goal of the present research is to bridge the gap between current multimodal input research and multimodal input designs for blind users.
3. Description of AudioBrowser
The present study was conducted using AudioBrowser, a non-visual information browser for reading online newspapers. AudioBrowser organizes downloaded newspaper articles into hierarchies using the RSS specifications on the newspaper web site. (RSS is an acronym for Rich Site Summary, an XML format for distributing news headlines on the Web.) The system allows users to access news articles via hand actions on a touchpad and speech input through a microphone.

The touchpad input consists of location-based touches and button presses. The sensing area of the touchpad is divided into three horizontal tracks, each of which is dedicated to the display of information or system commands. Each track is divided into small virtual segments that map onto individual information items or commands. Users search within an information level by gliding their fingers along a track. When a track segment is
touched, the information stored in that segment is read by the system. A click sound is triggered when the boundary between two segments is crossed. The functions of the two touchpad buttons are dynamically assigned. If the information-browsing track is touched, the two buttons zoom in or out of the touched information item. Following the zooming action, a swoosh sound indicates the zoom, and the items on the information-browsing track are re-written dynamically to display the items at the selected information level. If a command track is touched, the buttons execute the touched command in predefined ways. The system's speech is synthesized using the Microsoft Text-to-Speech Engine. Non-speech audio outputs are prerecorded sound clips.

Speech input was added to AudioBrowser to resolve issues such as long search paths to complete tasks and user-confusing switches from one portion of the touchpad to another. A vocabulary of fifty-four speech commands was developed to perform all functions that the touchpad performs. The commands range from one to five words. A "push-to-talk" button is held while a speech command is given, to keep background noise out of the speech input. Navigation is performed using "next" / "previous" commands followed by the name of an information or command object (e.g., "next category"). Non-navigation instructions are performed through direct commands given in a "verb + target" form, e.g., "increase volume." Speech input is processed using the Microsoft Speech Recognition Engine.

As an example of using the system, to access a newspaper article on sports, the user moves a finger along the top track of the touchpad until the category "sports" is spoken by the system. Then, to hear the news stories on sports, the user presses the zoom-in button. The items on the top track are then the sports stories rather than the newspaper categories. Instead of using the touchpad, the user could also say "next category" until the "sports" category is heard, and then say "zoom in" to display the stories about sports.

Touchpad and speech input each have advantages and disadvantages. Using the touchpad, users can search for any forgotten command, but browsing menu options consumes additional time. Speech commands provide direct access to information and functions but have to be memorized. Each input operator, i.e., the smallest input operation unit, can be performed using either touch input or speech input. A user task usually consists of a series of input operators. To accomplish a user task, the user can mix the two input modalities in whatever way he or she needs.
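To make the track-and-segment design concrete, the minimal sketch below shows one way a touch location could be resolved to the item stored in that segment, and how zooming could re-write the information track. The class names (Track, AudioBrowserPad), the normalized coordinate, and the string return values are our own illustrative assumptions, not the actual AudioBrowser implementation.

```python
# Illustrative sketch (not the actual AudioBrowser code): how a touch location on a
# horizontal track might be resolved to the information item stored in that segment.
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Track:
    """One horizontal strip of the touchpad, split into equal virtual segments."""
    items: List[str]          # one information item or command per segment

    def segment_at(self, x: float) -> int:
        """Map a normalized horizontal touch coordinate (0.0-1.0) to a segment index."""
        return min(int(x * len(self.items)), len(self.items) - 1)


@dataclass
class AudioBrowserPad:
    tracks: List[Track]                            # information track plus command tracks
    last_touch: Optional[Tuple[int, int]] = None   # (track index, segment index)

    def on_touch(self, track_idx: int, x: float) -> str:
        """Return what the system would speak for this touch; a click sound would be
        triggered when the finger crosses into a new segment."""
        seg = self.tracks[track_idx].segment_at(x)
        crossed = self.last_touch is not None and self.last_touch != (track_idx, seg)
        self.last_touch = (track_idx, seg)
        return ("[click] " if crossed else "") + self.tracks[track_idx].items[seg]

    def zoom_in(self, children: List[str]) -> None:
        """Re-write the information track with the items one level down, as the
        zoom-in button does (accompanied by a swoosh sound in the real system)."""
        self.tracks[0] = Track(items=children)
        self.last_touch = None


# Example: glide across newspaper categories, then zoom into "sports".
pad = AudioBrowserPad(tracks=[Track(["national", "world", "sports", "business"])])
print(pad.on_touch(0, 0.10))    # -> national
print(pad.on_touch(0, 0.60))    # -> [click] sports
pad.zoom_in(["Game recap", "Trade rumors", "Injury report"])
print(pad.on_touch(0, 0.05))    # -> Game recap
```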
4. Experiment Design

4.1. Research Questions and Predictions
The main research question was how users would mix touch and speech to operate AudioBrowser and perform non-visual information browsing tasks. We were
particularly interested in whether users would choose different input modalities for different tasks, and whether training would affect users' modality choice. We focused on these questions because the answers tell us how to design dialogues that better match how people naturally interact with the computer when browsing information. We were also interested in whether having more than one modality would provide advantages for error recovery via cross-modal error correction, as it does in GUIs. The main research question was broken down into three measurable research questions, listed below with our predictions of the answers and our reasons for posing them.

RQ1: Will users choose an input modality based on the type of input operation they are performing? Our prediction is that users' choice of input modality will depend on the type of operation undertaken, because they will choose the modality that makes the operation easier. If we are wrong, the implication is that the addition of a second modality is not beneficial for task performance.

RQ2: Will users switch input modality when the modality used is failing? Our prediction is that when one input modality fails, the user will switch to the other to recover from the error, because failure is usually a result of the computer having trouble recognizing the input from the modality in error. Recognizing this, the user would switch to the modality with the higher chance of being recognized. If this does not occur, then the advantage of having the backup modality for error correction is mitigated and possibly needs to be part of user training.

RQ3: Will users use the input modality they were trained on first as their primary input method? Our prediction is that the modality taught first will be the primary input modality used later. If we find such a primacy effect, then the training materials need to be carefully designed so that modality choice is based on performance efficacy and not on training order.
4.2. Subjects, Procedure & Tasks
Fourteen people participated in the exploratory study. Their ages ranged from the early 20s through the mid 30s. All spoke fluent English without strong accents. Their backgrounds represented a wide range of computer skills, from proficient programmers to individuals with limited technical backgrounds. Subjects were undergraduate and graduate students who volunteered to participate in the study for no remuneration. They did so because they were interested in learning how user evaluation studies are conducted and because of their interest in the multimodal system being evaluated. The subjects were sighted but were not provided any visual displays. All task performance was through comprehending the auditory output provided in
AudioBrowser. Using sighted subjects establishes a baseline for comparison with the blind users who will participate in a planned subsequent experiment.

The complete study took place over three consecutive days in a lab environment. Each subject participated individually. During the first two days, the subjects were given tutorials on speech input and touchpad input, respectively. Each subject trained the speech recognition engine for about half an hour before beginning the speech tutorial. Tutorials were written documents that guided a subject through interactions with the AudioBrowser system. A practice session was held after each tutorial to ensure sufficient practice of the input modality just learned. Half of the subjects received the speech tutorial on the first day and the touchpad tutorial on the second day. The other half received the tutorials in the reverse order. Subjects were assigned to the two training orders randomly. On the second day, after the subjects had finished both input trainings, they were asked to complete another set of tasks in which they freely mixed the two input methods. The goal of this practice was to allow users to form their individual multimodal patterns.

On the third day, the subjects participated in the experiment session. They were given a set of tasks that covered all system functions and represented a realistic information-browsing scenario using the system. The tasks included browsing the news, comprehending the hierarchical structure of the newspaper, searching for a specific news story, searching for a unique piece of news data inside a story, comprehending a news paragraph, spelling a name, pausing and resuming reading, and changing the audio output settings. The subjects were not restricted to any input modality. A video camera recorded the experiment session with each subject. After the final session, the subjects were interviewed about their experience with each input modality. They were then asked to fill out a post-experiment questionnaire.
4.3. Video Analysis
The user inputs and system responses were transcribed from the experiment videos. The transcribed data were coded for the following variables.

4.3.1. Input Modality Use: For each input operator, i.e., the smallest input operation unit in the operation sequence needed to finish a user task, the input modality used was coded as either speech input or touchpad input.

4.3.2. Type of Input Operation: The input operation types for text browsing using AudioBrowser were classified into two general categories: navigation operations and non-navigation instructions. Navigation operations relate to searching for and locating an item in the information space or the command space. Examples are going to the next article
on the current information level, going one level up in the information hierarchy, or reading the next sentence of the current article. Non-navigation instructions are not tied to locations. They include information-browsing related controls, such as pausing, spelling, and repeating, and auditory setting related commands, such as increasing the reading volume and changing the voice. Information-browsing related operations can be used between nodes in the information hierarchy (e.g., browsing titles of articles on one information level or going one level up) or within nodes (e.g., reading a paragraph in an article or spelling a word). The detailed operation types are listed in Table 1.

Table 1. Operation Types

General Categories            Operation Types
Navigation operations         Browse titles on one level in the information hierarchy (bn)
                              Go to a different information level (bn)
                              Proceed or recede within text (wn)
                              Search for an audio setting
Non-navigation instructions   Non-proceeding or -receding operations within the text,
                              including pause, resume, spell a word, read the current
                              article from the beginning, and repeat (wn)
                              Set reading unit, e.g. word, sentence, paragraph, or
                              complete article (the system reads one unit at a time)
                              Change the value of an audio setting, including reading
                              speed, volume, voice, pitch, and non-speech audio volume

bn: between-node commands; wn: within-node commands
4.3.3. Repetitive Operations: Occurrences in which users repeated commands as a normal part of their interaction.

4.3.4. Error Correction Actions: When an input error occurred, the user took actions to recover from the error. Users' error correction operations were coded at two levels, the operator level and the error correction episode level. The operator level is the "micro level," at which each failed input operator and the following remedial operator were recorded. The error correction episode level is the "macro level"; each error correction episode involves a series of error correction attempts to correct one error. At the operator level, error correction actions were classified into four categories: (1) an unsuccessful speech operator followed by a speech operator for error correction, (2) an unsuccessful speech operator followed by a touchpad operator, (3) an unsuccessful touchpad operator followed by a touchpad operator, and (4) an unsuccessful touchpad operator followed by a speech operator. At the error correction episode level, each series of error correction attempts was coded until the error was corrected or the user gave up.
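As an illustration of the operator-level part of this coding scheme, the sketch below labels each failed operator with the modality of the remedial operator that follows it, yielding the four categories above. The Operator record and the label format are hypothetical stand-ins for the transcription format actually used in the study.

```python
# Hypothetical sketch of the operator-level error-correction coding described above.
from dataclasses import dataclass
from typing import List


@dataclass
class Operator:
    modality: str     # "speech" or "touchpad"
    succeeded: bool   # whether the system recognized and executed the input


def code_remedial_pairs(ops: List[Operator]) -> List[str]:
    """Label each (failed operator, following remedial operator) pair with one of the
    four operator-level categories used in the analysis."""
    codes = []
    for failed, remedy in zip(ops, ops[1:]):
        if failed.succeeded:
            continue                 # only a failed operator opens a remedial pair
        codes.append(f"failed {failed.modality} -> remedial {remedy.modality}")
    return codes


# Example transcript fragment: a failed speech command repeated in speech, then a
# failed touchpad action corrected by switching to speech.
transcript = [
    Operator("speech", False), Operator("speech", True),
    Operator("touchpad", False), Operator("speech", True),
]
print(code_remedial_pairs(transcript))
# ['failed speech -> remedial speech', 'failed touchpad -> remedial speech']
```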
The transcripts and coding were finished by one experimenter. Coding was done twice at an interval of three months. Ninety percent of the items coded were identical.
5. Results
A total of 1642 input operators were performed by the 14 subjects in the experiment sessions. Six hundred thirty-five (635), or 38.67%, of the input operators were speech, and 1007, or 61.33%, were hand actions on the touchpad. The average number of operators performed by each subject was 117.3 (SD = 29.43), ranging from 73 to 200. The average number of speech operators per subject was 45.4 (SD = 24.9), ranging from 5 to 81. The average number of touch inputs per subject was 71.9 (SD = 30.8), ranging from 27 to 129. Individual subjects' behaviors were diverse, but common patterns were discovered.
5.1. Dependence of Input Modality Choices on Operation Types
Figure 1 illustrates that users' choices of input modalities were influenced by the types of input operations undertaken. Navigation operations accounted for 60% and non-navigation instructions accounted for 40% of users' total input operators. Seventy-nine percent (79%) of the navigation operations were performed using touchpad input. A paired t-test comparing the modality usage for navigation operations by each subject produced t (13) = 4.352, p (one-tailed) = 0.0004. This result indicates that navigation operations were performed using significantly more touchpad input than speech input. In contrast, non-navigation instructions were performed using significantly more speech input than touchpad input: 61% of the non-navigation commands were given using speech. The paired t-test produced t (13) = 1.955, p (one-tailed) = 0.036.

Besides their actual use of speech and touch, users' subjective ratings of the two input modes also depended on the type of operation (see Figure 2 and Table 2). For navigation operations, users' Ease of Use ratings for touchpad input were significantly better than their ratings for speech input (paired t (13) = 3.853, p (one-tailed) = 0.001). For non-navigation instructions, the result was reversed: the Ease of Use ratings for speech input were significantly better than those for touchpad input (paired t (13) = 2.014, p (one-tailed) = 0.033). Users were also asked to rate how much they liked using speech and touch input to perform each type of operation, using similar semantic differential scales. The results mirrored their Ease of Use ratings: users preferred the touchpad for navigation operations and speech input for non-navigation instructions.
Table 2. Users' Choice of Input Modes and Ratings on Ease of Use for General Operation Types

                       Navigation Operations    Non-navigation Instructions
Touchpad Input Used    Counts: 779              Counts: 254
                       Avg. EU rating: 5.2      Avg. EU rating: 4.5
Speech Input Used      Counts: 205              Counts: 404
                       Avg. EU rating: 3.8      Avg. EU rating: 5.1

EU (Ease of Use) ratings were based on a 7-point semantic differential scale, with 7 = very easy and 1 = very difficult.

We further investigated users' operations at a more detailed level. The navigation operations included browsing titles on one level of the information hierarchy, traversing the information hierarchy, proceeding or receding within a text unit (e.g., going to the next or previous paragraph), and searching for an audio setting. Paired t-tests on users' input modality choices for each of these operation types all produced significant results; that is, users employed the touchpad significantly more than speech to perform each of these types of operations. The detailed types of non-navigation instructions included within-node non-proceeding operations (e.g., pause, resume, spell, read, repeat), setting the reading unit (e.g., read a word, sentence, or paragraph at a time), and changing audio settings. A paired t-test on within-node non-proceeding operations produced a significant result; that is, users employed significantly more speech than touchpad input to perform these types of operations.

5.1.1. Input Mode Choice for Repetition: From a different perspective, user input can be categorized into repetitive and non-repetitive operations. For certain tasks, users needed to repeat the same command, especially when incrementing or decrementing a value or browsing sequentially through information. An example is the "next sentence" command, which is often given repeatedly until a paragraph is read. Another example is the "increase speed" command, which is issued continuously until the desired reading speed is reached. For repetitive operations, users tended to use more touchpad input than speech input. Each user, on average, used 5.4 speech operations and 31 touchpad operations to perform repetitions. A paired t-test (t (13) = 4.6127, p (one-tailed) = 0.0002) found this difference significant.
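The per-subject comparisons reported in this section are standard paired t-tests on each subject's operation counts. The sketch below shows the form of that computation; the counts are invented for illustration, since the study's raw per-subject data are not published.

```python
# Sketch of the per-subject paired t-test used throughout Section 5. The per-subject
# counts below are invented for illustration; the study's raw data are not published.
from scipy import stats

# For each of the 14 subjects: navigation operations done with the touchpad vs. with speech.
touchpad_nav = [61, 48, 70, 55, 42, 66, 50, 58, 45, 73, 52, 60, 47, 62]
speech_nav   = [12, 20,  8, 15, 25, 10, 18, 14, 22,  6, 19, 11, 24,  9]

t_stat, p_two_tailed = stats.ttest_rel(touchpad_nav, speech_nav)
p_one_tailed = p_two_tailed / 2   # halve only because the effect is in the predicted direction

print(f"paired t(13) = {t_stat:.3f}, one-tailed p = {p_one_tailed:.4f}")
```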
Figure 1. Average Percentages of Speech Input and Touch Input Used Based on Operation Types (panels: Navigation Operations and Non-navigation Instructions; bars show the percentage of speech vs. touch input for each operation type)

Figure 2. Average Ratings on Ease of Use of Speech and Touch Inputs Based on Operation Types (ratings were based on 7-point semantic differential scales, with 1 = very difficult and 7 = very easy)

5.2. User Error Correction Strategies
Errors took place in both speech input and touchpad input. The success rate of speech input during the experiment was 73.7%; that of touchpad input was 95.6%. It was believed that when one input mode failed, the user would switch to the other input mode to recover from the error, as has been observed with multimodal GUIs. However, the data analysis disclosed different results. Errors and correction actions were analyzed at two levels, the operator level and the error correction episode level. At the operator level, each input operator immediately following an input failure was examined. At the error correction episode level, each sequence of error correction operators was examined until the error was corrected.

5.2.1. Analysis at the Operator Level: In speech input, a total of 167 operators failed. One hundred thirty-nine (139), or 83.2%, were followed by another speech operator intended to fix the error, while only 28, or 16.7%, were followed by a remedial touchpad operator. A paired t-test among the subjects produced t (13) = 4.279, p (one-tailed) = 0.0004, indicating that remedial speech operators were used significantly more frequently than remedial touchpad operators following a failed speech input. In touchpad input, a total of 44 operators failed. Thirty-seven (37), or 84.1%, were followed by a touchpad operator and 7, or 15.9%, by a speech operator to recover from the error. The paired t-test among the subjects produced t (13) = 2.760, p (one-tailed) = 0.008. These results indicate that users used significantly more touchpad input than speech input to fix failed touchpad operations.

Figures 3 and 4 illustrate input modality switches for error correction by each subject. The figures show clearly that, in most cases, not switching was the error correction strategy used. In Figure 3, an exception occurred with subject No. 10, who encountered only one error in speech input and turned to the touchpad for error correction. In Figure 4, subjects No. 3, 7, and 11 did not have any touchpad input errors. Subject No. 9 encountered one touchpad input error and switched to speech to correct it.
Figure 3. Remedial Operators Following Failed Speech Operators
Figure 4. Remedial Operators Following Failed Touchpad Operators
5.2.2. Analysis at the Error Correction Episode Level: Overall, 158 errors occurred during the experiment sessions. The subjects attempted to correct 150 of them and did not try to correct the other 8 errors. Error correction operations did not always succeed on the first attempt. Some remedial operators failed, too. It took one to five attempts to correct one error. The distribution of the error correction attempts is shown in Figure 5.
Figure 5. Counts of Cases that an Error was Corrected in One, Two, Three, Four or Five Attempts

One error correction episode includes one input error and the following attempt(s) at error correction until the error is corrected or given up. Results showed that there were significantly more error correction episodes without input modality switches than episodes with input modality switches. A paired t (13) = 4.519 was obtained with a one-tailed p = 0.0003, indicating that users corrected an error within the same input modality significantly more often than by switching modalities.

At a more detailed level, error correction episodes occurred in two situations. In the first situation, to correct the error, the user had to either repeat the failed operator or switch to the other input modality. For example, when the speech command "pause" fails to be recognized by the system, the user can either repeat the command "pause" or switch to the touchpad input to pause. Sixty-two (62) of the 150 error correction episodes belonged to the first situation. In the second situation, the user had more alternative methods to correct the error: repeat the failed operator, use another command in the failed modality, or switch to the other input modality. For example, if the speech command "zoom in" fails, the user can repeat the command "zoom in," use an alternative speech command such as "select," or switch to the touchpad to zoom in. Eighty-eight (88) of the 150 error correction episodes belonged to the second situation.

In the first situation, the user switched input mode in 19 (30.65%) of the cases and did not switch but corrected the error by repeating in 43 (69.35%) of the cases. This situation is illustrated in Figure 6. A paired t-test produced t (13) = -2.604 and p (one-tailed) = 0.01, indicating that subjects chose to stay in the current input modality and repeat the input in error significantly more often than to switch to a new modality.

Figure 6. Users' Error Correction Strategies When Only One Error Correction Method was Available in the Failed Input Modality

In the second situation, when more than one error correction method was available in the initially failed modality, users switched input modality in 15 (17.05%) of the cases and did not switch in 73 (82.95%) of the cases. A paired t-test confirms that input modality switching occurred significantly less often than error correction without switching (t (13) = -4.833, p (one-tailed) = 0.0002). Several types of error correction actions were identified from the video analysis; they are illustrated in Figure 7.

Figure 7. Users' Error Correction Strategies When More than One Error Correction Method was Available in the Failed Input Modality
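As an illustration of the episode-level analysis above, the sketch below groups a transcript of operators into error correction episodes and records, for each episode, the number of correction attempts and whether the user switched modality. The Operator record and function names are illustrative assumptions, not the study's actual analysis code.

```python
# Hypothetical sketch of the episode-level coding: group each input error and the
# attempts that follow it into one episode, then note whether a modality switch occurred.
from dataclasses import dataclass
from typing import List


@dataclass
class Operator:
    modality: str      # "speech" or "touchpad"
    succeeded: bool


def code_episodes(ops: List[Operator]) -> List[dict]:
    """Return one record per error correction episode: number of correction attempts and
    whether the user switched away from the modality that produced the error."""
    episodes = []
    i = 0
    while i < len(ops):
        if ops[i].succeeded:
            i += 1
            continue
        start = i                                  # first failed operator opens an episode
        i += 1
        while i < len(ops) and not ops[i].succeeded:
            i += 1                                 # consume further failed attempts
        attempts = ops[start + 1:i + 1]            # everything tried after the initial failure
        episodes.append({
            "attempts": len(attempts),
            "switched": any(a.modality != ops[start].modality for a in attempts),
        })
        i += 1                                     # move past the operator that resolved it
    return episodes


# Example: a speech error corrected by repeating speech (no switch), then a speech
# error corrected by switching to the touchpad after one failed retry.
ops = [
    Operator("speech", False), Operator("speech", True),
    Operator("speech", False), Operator("speech", False), Operator("touchpad", True),
]
print(code_episodes(ops))
# [{'attempts': 1, 'switched': False}, {'attempts': 2, 'switched': True}]
```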
5.3. Primacy Effect in Input Modality Usage
Half of the subjects were trained on speech input first, while the other half were trained on touchpad input first. We predicted that the input modality learned first would be the primary modality used later (the primacy effect). To investigate this potential training order effect on choice of input method, several statistical tests were conducted. The measures used in the tests were the amount of speech input used, the amount of touchpad input used, the Ease of Use ratings on the speech and touchpad input, respectively, and the ratings on how much the user liked to use the speech and touchpad input (Likeability), respectively.

First, a two-sample t-test was conducted to compare the amount of speech input used by people who received speech input training first with the amount used by people who received touchpad input training first. The amount of speech input used by each subject was standardized using the formula: (No. of speech inputs) / (total No. of inputs). The comparison of means showed that people who received speech training first tended to use more speech and fewer touchpad operations than people who received touchpad training first. This tendency was nearly significant (two-sample t (12) = 1.568, one-tailed p = 0.07).

Second, within each group, a paired comparison was conducted to investigate users' use of each input modality. Results showed that people who received touchpad training first used significantly more touchpad input than speech input during the experiment sessions (paired t (13) = 2.88, one-tailed p = 0.01). People who received speech training first still used more touchpad input than speech input, but the difference was not significant (paired t (13) = 0.48, one-tailed p = 0.33).

Third, the average ratings on Ease of Use and Likeability were compared between the two groups. Results did not show significantly different ratings on touchpad input between the group that received touchpad training first and the group that received speech training first. Ratings on speech input by the two groups were not different either.

Fourth, within each group a paired comparison was conducted to investigate users' ratings on each input mode. Results showed that users who received speech training first did not rate speech significantly better than touch. Users who received touchpad training first did not rate touch significantly better than speech either.

To conclude, a small training order effect existed in users' actual use of the input modalities, but the effect was not reflected in users' subjective ratings of the input modalities. This result suggests that although there is an initial primacy effect, it may disappear with use.
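A minimal sketch of the first test in this subsection, assuming invented per-subject counts: each subject's speech use is standardized as (No. of speech inputs) / (total No. of inputs), and the two training-order groups are compared with a two-sample t-test.

```python
# Sketch of the training-order comparison: each subject's speech use is standardized as a
# proportion of all inputs, then the two training-order groups are compared. Data invented.
from scipy import stats

def speech_proportion(n_speech: int, n_touch: int) -> float:
    """(No. of speech inputs) / (total No. of inputs), as in the standardization above."""
    return n_speech / (n_speech + n_touch)

# (speech inputs, touchpad inputs) per subject, split by which modality was taught first.
speech_first   = [(60, 55), (70, 50), (45, 75), (55, 60), (65, 45), (50, 70), (58, 52)]
touchpad_first = [(30, 90), (25, 95), (40, 70), (20, 100), (35, 85), (45, 65), (28, 88)]

x = [speech_proportion(s, t) for s, t in speech_first]
y = [speech_proportion(s, t) for s, t in touchpad_first]

t_stat, p_two_tailed = stats.ttest_ind(x, y)
dof = len(x) + len(y) - 2
print(f"two-sample t({dof}) = {t_stat:.3f}, one-tailed p = {p_two_tailed / 2:.3f}")
```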
6. Discussion
Through the above data analysis we have developed a definitive picture of users' patterns of multimodal input usage with AudioBrowser. We apply these results to our initially posed research questions and, for each question, discuss possible reasons for the results and the design implications they suggest.
6.1. Dependence of Input Modality Choices on Operation Types
The results suggest a positive answer to the first research question, "Will users choose to use an input modality based on the type of input operation they are performing?" The experiment results indicate that users selected an input modality based on the type of operation being performed, that is, (a) information navigation was performed primarily with touchpad input, (b) non-navigation operations were performed primarily with speech input, and (c) repetitive operations were performed primarily with touchpad input.

There are some possible explanations for these results: (1) Navigating information requires listening to the information and the information structure. Using speech input might interfere with this attention-demanding task; thus, speech was not preferred for navigation operations. (2) Command-based input (speech) is much faster than menu-based input (touch) because it requires no searching operations; its chief problem is the memorization requirement. Thus, simple speech commands were preferred for executing one-time tasks, as long as the user remembered what they were. (3) The deployment of information on the touchpad and the process of navigating the information hierarchy through touch provided a tangible external organization of the information that could aid navigation. The touch, drag, and listen navigation is somewhat akin to using visual cues for navigation in a GUI environment. Thus, touch was preferred for navigation because it gave more location cues. (4) Repetitive operations were cumbersome to perform with speech. The touchpad has a motor performance advantage in this case, that is, it is faster to touch or press a button than to talk, especially when the task is touching or pressing a key continuously versus speaking a command continuously.
(5) When one task was interrupted by another, users could use separate input modalities to keep track of the different tasks. For example, during the experiment, when a navigation action needed to be interrupted by a non-navigation command, e.g., when users needed to decrease the reading speed while reading an article, users switched from touchpad input (for reading the article) to speech input (for decreasing the speed), and then switched back to touchpad input (to continue reading the article). Users kept their finger on the touchpad while giving the speech utterance. In this case, users used the touchpad input to keep their position in the information space, so that after the intervention of the speech utterance, they could efficiently continue from where they had left off.

Design implications derived from the results are as follows: (a) Speech commands should be developed primarily for non-navigation operations. (b) Touchpad input should be developed primarily to facilitate navigation operations. (c) Parallel touch and speech commands should be maintained to allow input modality switching for speech repairs. (d) A repeat key or command that can be accessed quickly could be developed to accelerate repetitive speech or touchpad operations.
6.2. User Error Correction Strategies
The experiment results indicate a negative answer to the second research question, "Will users switch input modality when the modality used is failing?" Users tended to stay with an input modality when an error correction action was needed. Users were observed not to switch modalities even when their current choice of modality continued to fail. When choosing between repeating the failed operation and switching to the other input modality for error correction, users often chose to repeat the problematic input. When there were methods other than repeating the failed operation that could correct the error in the current modality, the tendency to stay in this modality became even stronger.

Below we list some possible explanations for these error correction results: (1) Switching modalities incurs a cognitive load. In the case of moving from speech to touch (since most of the errors occurred in speech), the user had to find the touch command on the touchpad by navigating a command hierarchy. This was cognitively more difficult than simply repeating the command a number of times. (2) Switching modalities may cause the user to lose his or her mental place in the information hierarchy being navigated. Although the place is maintained by the system, the user would need to return to the information track and mentally recall where he or she was. (3) Switching modalities is also more cognitively difficult because the user has already selected one routine problem-solving method, which he or she is executing to perform the task, and now has to bring up a new method.
Obviously, moving to a different modality is cognitively harder than bringing up a new method within the same modality, or, at least, this is the choice our users continued to make. These results indicate the following design suggestions: (a) Users do not appear to use cross-modal error correction strategies on the non-visual interface. Thus, to take advantage of multimodal error recovery, users need to be taught to do so. (b) Different methods within the same modality should be implemented to facilitate users’ preferred error correction strategies.
6.3. Primacy Effect in Input Modality Usage
The results lead to a weak positive answer to the third research question, "Will users use the input modality they were trained on first as their primary input method?" There was a training order effect in users' choice of input modality, but not in users' subjective ratings of the modalities. Users executed more operations in the modality they were trained on first, but users' ratings of ease of use and their liking of a specific modality for a given task were not affected by the training order. Since the study measured the training order effect shortly after the training, the small effect detected might be a transient one that disappears with long-term usage. This result indicates that user-training materials, specifically the order in which each input modality is taught for an operation type, should be designed carefully, so that the advantage of choosing an input modality that matches the operation type can be promoted.
7. Conclusion and Future Work
As the first step in a series of experiments, this study investigated sighted users' multimodal input patterns in a non-visual context. The results lead to design implications that will be verified through studies with blind users. The results also establish a baseline for comparison with the multimodal input patterns of blind users. The subsequent research will be a repetition of the study described in this paper, with the procedures customized for blind participants. This customization will mostly be a transfer of the written materials to oral presentation.

We added speech input to our tactile system to provide the key advantages that speech can offer blind users. However, it was not known how to integrate speech and touch into an effective non-visual dialogue. This research is a first step toward supporting this integration.
8. References
[1] Asakawa, C., Itoh, T., User Interface of a Home Page Reader, Proceedings of the Third International ACM Conference on Assistive Technologies (ASSETS'98), April 15-17, 1998, Marina del Rey, California, United States, pp. 149-156.
[2] Billinghurst, M., Put That Where? Voice and Gesture at the Graphics Interface, Computer Graphics, November 1998, pp. 60-63.
[3] Brewster, S., Lumsden, J., Bell, M., Hall, M., Tasker, S., Multimodal 'Eyes-Free' Interaction Techniques for Wearable Devices, Proceedings of the ACM Conference on Human Factors in Computing Systems (CHI'03), April 5-10, 2003, Ft. Lauderdale, Florida, United States, pp. 473-480.
[4] Chen, X., Chung, J., Lacsina, P., Tremaine, M., Mobile Browsable Information Access for the Visually Impaired, Proceedings of the Tenth Americas Conference on Information Systems (AMCIS'04), New York, New York, United States.
[5] Chen, X., Tremaine, M., User Error Handling Strategies on a Non-Visual Multimodal Interface, Proceedings of the Eleventh Americas Conference on Information Systems (AMCIS'05), Omaha, Nebraska, United States.
[6] Cohen, P., The Role of Natural Language in a Multimodal Interface, Proceedings of the Fifth ACM Symposium on User Interface Software and Technology (UIST'92), November 15-18, 1992, Monterey, California, United States, pp. 143-149.
[7] Freedom Scientific, PAC Mate and JAWS, http://www.freedomscientific.com/index.html.
[8] Friedlander, N., Schlueter, K., Mantei, M., Bullseye! When Fitts' Law Doesn't Fit, Proceedings of the ACM Conference on Human Factors in Computing Systems (CHI'98), April 18-23, 1998, Los Angeles, California, United States, pp. 257-264.
[9] Karat, C., Halverson, C., Horn, D., Karat, J., Patterns of Entry and Correction in Large Vocabulary Continuous Speech Recognition Systems, Proceedings of the ACM Conference on Human Factors in Computing Systems (CHI'99), May 15-20, 1999, Pittsburgh, Pennsylvania, United States, pp. 568-575.
[10] McNeill, D. (Ed.), Language and Gesture, Cambridge University Press, Cambridge, 2000.
[11] Oviatt, S.L., Mutual Disambiguation of Recognition Errors in a Multimodal Architecture, Proceedings of the ACM Conference on Human Factors in Computing Systems (CHI'99), May 18-20, 1999, Pittsburgh, Pennsylvania, United States, pp. 576-583.
[12] Oviatt, S.L., Cohen, P., Multimodal Interfaces that Process What Comes Naturally, Communications of the ACM, March 2000, Vol. 43, No. 3, pp. 45-53.
[13] Oviatt, S.L., Coulston, R., Tomko, S., Xiao, B., Lunsford, R., Wesson, M., Carmichael, L., Toward a Theory of Organized Multimodal Integration Patterns during Human-Computer Interaction, Proceedings of the Fifth International Conference on Multimodal Interfaces (ICMI'03), November 5-7, 2003, Vancouver, British Columbia, Canada, pp. 44-51.
[14] Oviatt, S.L., DeAngeli, A., Kuhn, K., Integration and Synchronization of Input Modes during Multimodal Human-Computer Interaction, Proceedings of the ACM Conference on Human Factors in Computing Systems (CHI'97), March 22-27, 1997, Atlanta, Georgia, United States, pp. 415-422.
[15] Oviatt, S.L., Olsen, E., Integration Themes in Multimodal Human-Computer Interaction, Proceedings of the International Conference on Spoken Language Processing, September 18-22, 1994, Yokohama, Japan, pp. 551-554.
[16] Sears, A., Feng, J., Oseitutu, K., Hands-Free, Speech-Based Navigation during Dictation: Difficulties, Consequences, and Solutions, Human-Computer Interaction, 2003, Vol. 18, pp. 229-257.
[17] Sowa, T., Wachsmuth, I., Coverbal Iconic Gestures for Object Descriptions in Virtual Environments: An Empirical Study, Post-Proceedings of the Conference on Gestures: Meaning and Use, 2000, Porto, Portugal, pp. 365-376.
[18] Suhm, B., Myers, B., Waibel, A., Multimodal Error Correction for Speech User Interfaces, ACM Transactions on Computer-Human Interaction (TOCHI), Vol. 8, No. 1, March 2001, pp. 60-98.
[19] Vitense, H.S., Jacko, J.A., Emery, V.K., Multimodal Feedback: Establishing a Performance Baseline for Improved Access by Individuals with Visual Impairments, Proceedings of the Fifth International ACM Conference on Assistive Technologies (ASSETS'02), July 8-10, 2002, Edinburgh, Scotland, pp. 49-56.
[20] Williams, C., Tremaine, M., SoundNews: An Audio Browsing Tool for the Blind. Proceedings of the International Conference on Universal Access in Human-Computer Interaction (UAHCI’01), August 5-10, 2001, New Orleans, Louisiana, United States, pp. 1029-1033.