Multimodal Interaction under Exerted Conditions in a Natural Field Setting

Sanjeev Kumar, Philip R. Cohen, Rachel Coulston

Oregon Health & Science University
20000 NW Walker Road
Beaverton, OR 97006, USA
+1-503-748-7803
{skumar, pcohen, rachel}@cse.ogi.edu

ABSTRACT
This paper evaluates the performance of a multimodal interface under exerted conditions in a natural field setting. The subjects in the present study engaged in a strenuous activity while multimodally performing map-based tasks using handheld computing devices. This activity made the users breathe heavily and become fatigued during the course of the study. We found that the performance of both speech and gesture recognizers degraded as a function of exertion, while the overall multimodal success rate was stable. This stabilization is accounted for by the mutual disambiguation of modalities, which increases significantly with exertion. The system performed better for subjects with a greater level of physical fitness, as measured by their running speed, with more stable multimodal performance and a later degradation of speech and gesture recognition as compared with subjects who were less fit. The findings presented in this paper have a significant impact on design decisions for multimodal interfaces targeted towards highly mobile and exerted users in field environments.

Categories and Subject Descriptors
H.5.1 (User Interfaces): Evaluation/methodology, Interaction styles, Natural language, Voice I/O

General Terms
Measurement, Human Factors

Keywords
Multimodal Interaction, Mobile, Exertion, Field, Evaluation

1. INTRODUCTION
Multimodal interaction allows people to engage computer systems with the best combination of modalities for the situation and task. Research has documented that because multimodal interaction merges information from multiple sources, it can overcome errors in the individual input modalities, thereby leading to more robust performance [9]. This process, known as mutual disambiguation (MD), has been shown to yield relative error rate reductions in speech recognition of 19-41% in challenging environments, such as with accented speakers, moderate noise, and moderate motion [9, 10]. That is, MD can help stabilize performance in environments in which one or more of the base modalities are individually weak.

It has been reported that automatic speech recognition (ASR) performance during walking degrades relative to stationary use [10], and that ASR performance with read speech degrades after exertion [4]. However, that research has not reported the effects of motion or exertion on pen-based gesture recognition, or on system recognition of unplanned interactive speech. The present study is designed to provide initial information about the relationship of speech, pen-based gesture, and multimodal recognition as a function of the user's state of exertion. One hypothesis of the present research is that multimodal interaction would lead to improved performance over the individual modalities in a challenging environment in which subjects are highly exerted, breathing heavily, and physically tired.

As a separate motivation for the present work, there is substantial interest in both the research and application communities in mobile and wearable systems. Indeed, there has also been significant interest in finding a mobile computing platform that can support a user on foot who is in a rapidly changing and potentially hostile environment. We believe that a multimodal interface is likely to be a part of any such solution. By examining the use of a PDA in an exerted situation, we hope to provide initial data and guidance about the suitability of this platform for field multimodal usage.

The remainder of this paper is organized as follows. Section 2 discusses the details of the study, including the story line, the setup, the multimodal task, and the multimodal system. Section 3 presents the design of the experiment, the measurements, and the procedure. The research findings of the study are presented in Section 4, and the paper concludes in Section 5 with a discussion of related and future work.

2. THE STUDY
We conducted a study in which users engaged in strenuous activity while performing map-based computer tasks using handheld computing equipment. Specifically, while engaged in a competitive "anti-kidnapping" scenario, users ran across an uneven field carrying a baby carrier. At each "station," the user performed a series of multimodal tasks using a handheld PDA, and then ran to another location. This activity made the users breathe heavily, thus potentially interfering with speech recognition. Moreover, because subjects were carrying a heavy object, their arms became fatigued, thus potentially interfering with gesture recognition. In order to motivate performance, subjects were told that the fastest performer would win $100.

The study involved collecting data on speech, gesture, and multimodal recognition rates under these exerted conditions, which were then correlated with measures of physical exertion. The remainder of this section discusses the study and the equipment.

2.1 Storyline
The subjects were told that their task was to rescue a kidnapped baby, Jessie. The police psychological profilers believed that a psychopath had kidnapped her, because various objects arranged in strange shapes were left at the scene. Jessie's older sister Mary saw the kidnapping happen at a distance and decided to follow the kidnapper. At each object left by the kidnapper, she left an object indicating the next place she saw the kidnapper stop, and hence the subject could follow these "arrows" to where the baby might be found. The subjects' first priority was to retrieve the kidnapped baby. Another high priority was to find and arrest the kidnapper, and for this reason the subjects needed to send information about clues from the field to police headquarters for further analysis. The subjects had to follow the clues (shapes and arrows) in order, entering the information into the PDA. These data were sent back to "headquarters" so that the detectives and criminal psychologists could infer the kidnapper's motive and location. At the end of the trail of clues, the subject would find baby Jessie, abandoned by the psychopath. The subjects then had to pick up Jessie and follow the clues back to the starting point, at each point again entering the information into the PDA so that it could be double-checked at headquarters.

2.2 Equipment and Infrastructure
The equipment used in the study consisted of two Compaq iPaq PDAs, one with a 206 MHz Intel StrongARM processor running Pocket PC 2000, and the other with a 400 MHz Intel XScale processor running Pocket PC 2002. Each PDA communicated with the servers over a wireless 802.11b network, provided by installing a mobile antenna at one end of the course and a roof-top wireless antenna at the other end. One iPaq had a built-in wireless card, while the other used an Orinoco PCMCIA wireless card connected to a 2 dB external antenna. This antenna was mounted in a backpack that was carried by a research assistant.

For interacting with the machines, we employed two close-talking noise-canceling microphones (Plantronics M110 and Andrea ANC 550) and a digital voice recorder (Olympus DS-330). One of the iPaqs had a built-in external microphone jack, and the other had to be modified to use its headphone jack as an external microphone jack. The Plantronics microphone was attached to the PDA and was used for speech recognition by the multimodal system. The Andrea microphone was attached to the voice recorder to provide data for scoring the speech recognizer, for calculating the respiration rate as a measure of exertion, and for evaluating other speech recognizers offline for use in field and mobile settings. The subjects would run to a station, take the PDA from the research assistant, plug in one of the microphones, enter data, unplug the microphone, and return the PDA to the research assistant before running to the next station. Section 5.1 discusses the network connectivity issues that precluded the subjects from carrying the PDA as they ran.

Subjects' heart rate was continuously monitored both to gauge their level of exertion and to prevent over-exertion. A Polar heart rate monitor (S610) was used to gather heart rate data. It was set to record heart rate in beats per minute every 5 seconds, and it was synchronized with the server before each subject began. The heart rate data were then uploaded to a database using an infrared connection to the analysis machine.

Each PDA ran a multimodal map-based user interface integrated with ScanSoft's ASR3200™ speech recognition and TTS3000™ text-to-speech engines. Speech recognition was done on the PDA, while the digital ink was transmitted over the wireless network to servers for gesture recognition and for multimodal integration with the output of the speech recognizer. Speech was recognized with the hypothesis list (N-best list) set to 3, meaning the recognizer returned the top three hypotheses along with their recognition scores. The resulting multimodal command was sent to the PDA over the same wireless network and displayed on a map of the campus. Gesture recognition was done by the NISSketch™ recognizer from Natural Interaction Systems, LLC. The doll carried by the subjects during the second running phase weighed roughly 7 lbs.

Figure 1 shows a test subject geared up during pilot testing of the experimental setup; one can see the two microphones, the PDA, and a wireless antenna in the backpack.

Figure 1: A pilot test subject.

2.3 Subjects and Multimodal Task
A total of 14 paid volunteers participated in the study. All subjects were male native speakers of American English, between 20 and 49 years of age, with varying levels of cardiovascular fitness. The subjects did not have any prior experience with speech- and pen-based multimodal systems for PDAs. Previous studies have shown that the performance of speech recognizers is degraded for non-native speakers of English, resulting in higher mutual disambiguation for such speakers [9]. Consequently, only native speakers of American English were used in the present study so as to minimize sources of speech recognition error other than those caused by exertion and environmental conditions. The reason for using only male subjects was our observation during pilot testing that this particular speech recognizer running on the PDA performed poorly for female speakers even under normal conditions. Finally, the gestures of one of the 14 subjects were so poorly recognized (40%) that he was deemed an outlier and removed from the subsequent analyses.

2.4 The Multimodal System
The multimodal system used in this study accepts both continuous, speaker-independent speech and continuous pen-based sketch. Both streams are recognized and understood independently, after which their meanings are fused, resulting in multimodal interpretations [6, 11]. In the general case, there are multiple speech and pen-based gesture interpretations, as well as multiple possible fused interpretations. The system's goal is to determine the best joint interpretation based on the semantics of the domain, as well as the confidence and/or ranking scores from the recognizers. The system recognized speech on the PDA, with utterance hypotheses being sent over the wireless network to a multi-agent architecture running on a server. The PDA also sent the digital ink to the server for interpretation.

Because the experiment involved a speech recognizer running on a PDA, the recognition problem had to be kept manageable in size. Consequently, a small speech recognition grammar was designed that allowed 85 different utterances, while the gesture recognizer allowed 5 different gestures (dot, cross, arrow, line, and area), giving a total of 425 possible multimodal combinations. The multimodal grammar then allowed 87 "legal" multimodal commands out of the 425 possible combinations (11 with the dot gesture, 4 with cross, 45 with arrow, 10 with line, and 17 with area). These legal commands induced semantic constraints, as found in other systems (e.g., [9]). Of course, subjects did not know what was acceptable, but only entered the commands corresponding to the objects they found.

The multimodal integrator used here employed unification as the fusion operation [6]. The multimodal integration algorithm ruled out interpretations that were not licensed by the multimodal grammar, and then ranked the surviving interpretations by the product of the recognizer scores, which ranged between 0 and 1. Since all commands were multimodal, a product combination was satisfactory for these initial explorations. Finally, the statistics of the domain (e.g., which objects were at each station, or which ones had been entered before) were deliberately ignored.
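To make the grammar-filtered, score-ranked fusion concrete, here is a minimal sketch of such an integrator in Python. The n-best lists, the toy "legal command" set, and the function names are illustrative assumptions, not the system's actual data structures or implementation.

```python
from itertools import product as cartesian

# Hypothetical n-best hypotheses: (interpretation label, recognizer score in [0, 1]).
speech_nbest = [("knife this way", 0.62), ("night this way", 0.21), ("knife that way", 0.17)]
gesture_nbest = [("arrow", 0.71), ("line", 0.20), ("dot", 0.09)]

# Toy stand-in for the multimodal grammar: the set of "legal" speech/gesture pairings.
LEGAL_COMMANDS = {
    ("knife this way", "arrow"),
    ("knife that way", "arrow"),
    ("heart of marbles", "area"),
}

def integrate(speech_nbest, gesture_nbest, legal=LEGAL_COMMANDS):
    """Rank grammar-licensed speech+gesture pairs by the product of recognizer scores."""
    candidates = []
    for (s, s_score), (g, g_score) in cartesian(speech_nbest, gesture_nbest):
        if (s, g) in legal:                     # grammar/unification filter
            candidates.append(((s, g), s_score * g_score))
    return sorted(candidates, key=lambda c: c[1], reverse=True)

ranked = integrate(speech_nbest, gesture_nbest)
print(ranked[0])   # best joint interpretation: ('knife this way', 'arrow'), score ~0.44
```

Mutual disambiguation falls out of this ranking: a hypothesis that is not at the top of its own recognizer's n-best list can still win once pairings that the grammar does not license are filtered out.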

3. DESIGN OF THE EXPERIMENT
A within-subject design was used for the study. The three conditions compared were the Stationary (or Control) condition, the Running condition, and the running-with-a-load condition (Running*), in that order. These three conditions provided increasing levels of exertion: the Control condition did not exert the subjects, the Running condition provided some exertion, and the Running* condition offered the most exertion. On average, the subjects spent approximately 45-60 minutes in the study (not counting the time for instructions and practice), with about 15-20 minutes per condition. We did not counterbalance the order of trials because of the unpredictability in the time that would have been needed for subjects' heart rate, respiration, and limb fatigue to return to the control level after exertion. Also, given the outdoor setting, it was infeasible to require subjects to return on subsequent days.

3.1 Procedure
The subjects were first given instructions inside the lab and allowed to become familiar with the system through practice until they felt comfortable using it. The subjects' heart rate was recorded during this practice session in order to obtain a resting level. Thereafter, the subjects were equipped with the rest of the experimental gear and taken outdoors for the actual study. The study was carried out over a period of more than one month under weather conditions that varied from rainy and windy to bright and sunny. It was assumed that the effect of unpredictable random outdoor noise would average out over the period of the study, both within and among subjects, so that exertion would be the dominant factor influencing all types of recognition. For safety reasons, the heart rate monitor was set to beep at a maximum heart rate calculated per subject based on his age, and the experiment would have been discontinued if a subject became over-exerted (as indicated by the beeping of the heart rate monitor).

The subjects first completed ten pairs of multimodal commands while standing in the field where the study was set up. The subjects were shown a set of ten photographs of hypothetical pairings of "shapes" and "arrows" that they had to enter into the PDA. Most of these combinations of shapes and arrows were not used in the running conditions. This part of the experiment served as the control phase, since the subjects were not yet exerted but had had enough practice using the multimodal PDA. We can assume that the external ambient noise conditions (other than occasional, unpredictable sounds) remained the same during this control phase and the subsequent running phases.

The task involved entering two kinds of entities into the PDA using speech and gesture. One of the entities was a "shape" formed by arranging smaller objects such as marbles or pencils. The other entity was an object with an inherent pointing aspect, which we refer to as the "arrow." For example, Figure 2 shows a heart shape made out of marbles, and a knife pointing towards another location. Similarly, Figure 3 shows another entity pair (a triangle of cigarettes, and a screwdriver pointing in a certain direction). For each "shape," the subjects were asked to draw the shape, such as a heart or a triangle, on the PDA using the stylus and say what object the shape was made of (such as "marbles" or "cigarettes"). For the "arrow" object, they were asked to draw an arrow on the PDA while saying "[name of object] this/that way." For example, for Figure 2 the subject might draw an arrow and say "knife pointing this way," and for Figure 3 the subject might draw an arrow and say "screwdriver that way." Subjects were told to attempt no more than three times to enter the data at each station before moving on to the next.

Figure 2: Heart of marbles and knife.

Figure 3: Cigarettes and screwdriver.

There were ten stations, whose numbers were visible on flags raised from the ground. The "arrow" at each location, except for location ten, pointed to the next location. The arrow at location ten pointed to the location of the baby. The stations were arranged diagonally along the arcs of a circle. This allowed the subjects to run directly towards the next station without wasting time or having too much opportunity to lower their heart rate and respiration rate. The total distance run by the participants was 1.6 kilometers, with an average of 90 meters between stations.

After the control phase, the subjects proceeded to the first station to begin the first running condition. At each station, the subjects would enter into the PDA the "shape" and the "arrow" that they found at that station, and then run in the direction of the arrow to reach the next station. This was the Running phase of the experiment. After completing ten pairs of commands in this condition (one pair at each station), the subjects would retrieve the baby in the baby carrier and return to location ten. Subjects were offered cold water or a sports drink at this point to prevent them from becoming dehydrated. Thereafter, the subjects ran back through all ten stations in reverse order, from station ten to station one, re-entering information at each station. However, this time they were required to carry the baby in a baby carrier in their writing hand. This phase (called Running*) was designed to test the effect of arm fatigue on gesture recognition when interacting with the PDA using a stylus. After finishing the course, participants returned to the lab for a short interview.

3.2 Measures and Scoring
Three kinds of data were captured per subject: speech was captured by a digital voice recorder, heart rate data were gathered by a heart rate monitor, and a log file was created that recorded the outputs of speech, gesture, and multimodal recognition along with their timestamps. The voice recording was converted into a WAV file and transcribed for spoken utterances as well as for exhalations (to calculate respiration rates) using the signal analysis tool Praat. The transcribed speech and the heart rate data, along with their timestamps, were uploaded into a database. The log files were parsed and analyzed, and the relevant information for scoring the performance of speech, gesture, and multimodal recognition was uploaded into the same database along with its timestamps and linked to the transcribed speech and heart rate data. A set of GUI tools was developed using Python and Java to help a researcher score the database.

The primary measure of interest for the current study is lexical correctness, i.e., whether the speech or gesture recognizer recovered precisely what the user said or drew, and whether the system's multimodal interpretation was correct. Accordingly, gestures were simply counted as correct or incorrect. For lexical correctness of speech recognition, a standard word error rate was calculated in addition to the utterance (or phrase) recognition rate. Mutual disambiguation (MD) occurs when the top-ranked multimodal command includes an interpretation from speech and/or from gesture that is not itself top-ranked for that modality. The MD rate is thus calculated as the percentage of correct multimodal commands that involved mutual disambiguation between the input modalities. This definition of the MD rate differs from that in the literature [9], because the prior definition counted even those pull-ups that did not rise to the top of the n-best list.
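To illustrate the counting rule, the sketch below computes an MD rate from per-command scoring records. The record fields and the example values are hypothetical; the rule itself follows the definition above (only correct multimodal commands count, and MD is credited when the winning command used a speech or gesture hypothesis that was not that recognizer's top choice).

```python
from dataclasses import dataclass

@dataclass
class ScoredCommand:
    # Hypothetical per-command scoring record.
    multimodal_correct: bool   # was the final multimodal interpretation correct?
    speech_rank_used: int      # rank (1 = top) of the speech hypothesis in the winning command
    gesture_rank_used: int     # rank (1 = top) of the gesture hypothesis in the winning command

def md_rate(commands):
    """Percentage of correct multimodal commands in which speech and/or gesture was pulled up."""
    correct = [c for c in commands if c.multimodal_correct]
    if not correct:
        return 0.0
    pulled_up = [c for c in correct if c.speech_rank_used > 1 or c.gesture_rank_used > 1]
    return 100.0 * len(pulled_up) / len(correct)

# Example: 3 correct commands, one of which used the 2nd-best speech hypothesis -> 33.3% MD rate.
example = [
    ScoredCommand(True, 1, 1),
    ScoredCommand(True, 2, 1),
    ScoredCommand(True, 1, 1),
    ScoredCommand(False, 1, 3),   # incorrect commands are excluded from the MD rate
]
print(round(md_rate(example), 1))  # 33.3
```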

3.3 Offline Re-processing of the Corpus
The corpus recorded from the field study was also processed offline to determine whether the results would have been different had the speech been shipped to a backend machine for recognition instead of being recognized on the PDA, and to test whether increasing the length of the speech recognizer's N-best list made any difference in the results. The speech recorded with the digital voice recorder was manually sliced from the continuous speech stream with about 100 milliseconds of silence on either side of each utterance. The utterances were then processed by the same speech recognizer, versions of which were run on both the PC and the PDA. The post-processed speech was then fed into the multimodal integrator along with the original gesture to produce multimodal results, and all the parameters of interest were analyzed for each case. This gives the following cases: field, PC with N=3, PC with N=5, re-run on PDA with N=3, and re-run on PDA with N=5. Recall that for the field case, speech was recognized locally on the PDA with the size of the N-best list set to 3.
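The slicing step can be approximated programmatically, as in the sketch below, which pads each marked utterance with roughly 100 ms of surrounding audio before writing it out. The file names, segment times, and use of Python's standard wave module are illustrative assumptions; the authors sliced the recordings manually.

```python
import wave

def slice_utterance(in_path, out_path, start_s, end_s, pad_s=0.1):
    """Write one utterance from a continuous recording, padded with ~100 ms of context."""
    with wave.open(in_path, "rb") as src:
        rate = src.getframerate()
        n_frames = src.getnframes()
        first = max(0, int((start_s - pad_s) * rate))
        last = min(n_frames, int((end_s + pad_s) * rate))
        src.setpos(first)
        frames = src.readframes(last - first)
        params = src.getparams()
    with wave.open(out_path, "wb") as dst:
        dst.setparams(params)
        dst.writeframes(frames)

# Hypothetical usage: an utterance transcribed in Praat as spanning 12.40-14.05 seconds.
slice_utterance("subject03_running.wav", "subject03_utt17.wav", 12.40, 14.05)
```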

3.4 Exertion Level
The primary goal of the current study was to determine how speech recognition, gesture recognition, and multimodal system performance behave under exerted conditions. We attempted to use multiple indicators of exertion level, including heart rate in beats per minute (both absolute and relative to a resting rate), heart rate recovery (in beats per minute), total running time, running speed, and respiration rate. The speed between each pair of stations was computed from the running time between those stations and the distance between them, and it was found to vary from 1.8 mph (miles per hour) to 6.5 mph. Studies have found that normal human walking speed is between 2.7 and 3 mph [1, 2, 7]. To be conservative, we classified subjects as "fit" if their average speed in the most strenuous condition was >= 4 mph. Most of the subjects who were not classified as fit were in fact walking by the end.
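As a worked example of the speed measure and the fitness threshold, the sketch below converts station-to-station distances and times into mph and applies the 4 mph cutoff described above; the per-leg numbers are made up for illustration.

```python
# Hypothetical station-to-station legs for one subject in the Running* condition:
# (distance in meters, elapsed time in seconds).
legs = [(90, 45), (85, 50), (95, 60), (90, 75)]

METERS_PER_MILE = 1609.344

def leg_speed_mph(distance_m, time_s):
    return (distance_m / METERS_PER_MILE) / (time_s / 3600.0)

speeds = [leg_speed_mph(d, t) for d, t in legs]
avg_mph = (sum(d for d, _ in legs) / METERS_PER_MILE) / (sum(t for _, t in legs) / 3600.0)
category = "FIT" if avg_mph >= 4.0 else "NSF"
print([round(s, 1) for s in speeds], round(avg_mph, 1), category)  # e.g. [...], 3.5, NSF
```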

4. RESULTS
One subject was removed as an outlier because of his gestures. The results that follow are from the analyses of the 13 remaining subjects' first attempts to interact with the system at each station. Thus, every subject contributes the same number of first attempts (60). We report results only for lexically correct recognitions. It was found that the utterance recognition rate and the word error rate conveyed equivalent statistical information in the present study. Also, multimodal performance depends on utterance recognition rather than on word recognition. Therefore, we report only the utterance recognition rate for speech recognition results.

4.1 Recognition Results in the Field during Exertion
Figures 4 and 5 show the variation of the multimodal success rate (MMS), speech recognition rate (SR), gesture recognition rate (GR), and mutual disambiguation rate (MD) across the three exertion conditions: Control, Running, and running with the baby (Running*).

Figure 4: Recognition performance in the field (success rate, %, of MMS, SR, and GR for the Control, Running, and Running* conditions).

First, the gesture recognizer performance decreases significantly from the Control to the first running condition (paired t=2.536, one-tailed, p=0.013, df=12), and from the Control to the Running* condition (paired t=3.448, one-tailed, p=0.002, df=12). No significant difference was found for gesture recognition between the Running and Running* conditions. Furthermore, no significant difference in speech recognition performance or multimodal success was found as a function of exertion.

Figure 5: MD rate (%, lexically correct first attempts) in the field for the Control, Running, and Running* conditions.

Figure 5 illustrates that the MD rate in the two running conditions is nearly twice the MD rate in the stationary (Control) condition, with the difference between the Control and Running* conditions being significant (one-tailed paired t = -2.53, p=0.013, df=12). Thus, we find that in the Control condition, 8% of the multimodal commands succeed because a recognition result in one of the modalities is pulled up from a lower position in that recognizer's n-best list by virtue of its being paired with a highly scoring recognition result in the other modality. In contrast, mutual disambiguation takes place in 14% (17%) of the Running (Running*) condition inputs, and the MD in the most exerted case is significantly greater than in the stationary case. Thus, for the field results, mutual disambiguation appears to increase with exertion, compensating for the degradation in gesture recognition results. However, contrary to expectations, the performance of the speech recognizer was not found to be affected by running. Likewise, the multimodal success rate was not found to differ among the conditions tested in the field. These surprising results for speech recognition and multimodal success rate led us to further analyze the corpus collected during the study.

4.2 Post-processing of Collected Corpus
We compared speech recognition performance during post-processing in four conditions: with N=3 as well as with N=5, on both the PC and the PDA. No significant differences in recognizer performance were found using paired t-tests. Since gesture recognition was kept constant in these offline runs, the gesture success rate, and thus the multimodal success rate, was exactly the same in all four of the post-processed cases.

4.2.1 Field vs. Offline Comparisons
Comparing the field performance of the speech recognizer to the offline performance, however, significant differences were found in speech recognition performance and also in multimodal success rates. A pair-wise t-test (one-tailed) on speech recognition per subject between the field data and the post-processed data confirms that performance was significantly better during post-processing (Control: t = -3.094, p=0.005, df=12; Running: t = -4.172, p=0.0007, df=12; Running*: t = -2.02, p=0.033, df=12). Likewise, paired t-tests on the multimodal success rate for each of the three conditions (Control: t = -3.448, p=0.002, df=12; Running: t = -3.573, p=0.002, df=12; Running*: t = -2.588, p=0.012, df=12) indicate that the multimodal success rate with the post-processed speech was significantly better than that observed by subjects in the field. Thus, we can conclude that the speech recognizer, and the overall system performance, would have been significantly better if the data had been processed in a way similar to the offline method. It is therefore instructive to examine how exertion affected these results.

4.2.2 Exerted Performance Using Offline-Processed Speech
Figure 6 presents the results of multimodal system performance using the offline-processed speech running on the PC (or PDA, since there was no difference) with the size of the N-best list set to 3. It demonstrates that the performance of all recognizers decreases as the exertion level increases from the Control condition to Running to running with the load. Specifically, speech recognition performance trends downward with exertion, decreasing significantly from Control to Running* (one-tailed paired t=2.622, p=0.011, df=12). The multimodal success rate also shows a downward trend, and finally a significant decrease as a function of exertion from the Control to the highly exerted condition (one-tailed paired t=2.494, p=0.014, df=12).
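For readers who want to reproduce this style of analysis, the sketch below shows how a one-tailed paired comparison of per-subject recognition rates could be computed with SciPy. The per-subject numbers here are invented placeholders for illustration only, not the study's data.

```python
import numpy as np
from scipy import stats

# Hypothetical per-subject speech recognition rates (%) in the field vs. offline post-processing,
# one value per subject for the same condition (paired design).
field   = np.array([72, 80, 75, 78, 70, 82, 76, 74, 79, 73, 77, 81, 75], dtype=float)
offline = np.array([90, 93, 88, 95, 86, 94, 91, 89, 92, 87, 93, 96, 90], dtype=float)

# scipy's paired t-test is two-tailed; halve p for the directional (one-tailed) test,
# matching the one-tailed paired t statistics reported in this section.
t, p_two_tailed = stats.ttest_rel(field, offline)
p_one_tailed = p_two_tailed / 2 if t < 0 else 1 - p_two_tailed / 2
print(f"t = {t:.3f}, one-tailed p = {p_one_tailed:.4f}, df = {len(field) - 1}")
```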

Figure 6: Recognition performance (success rate, %) for post-processing, showing MMS, SR, and GR in the Control, Running, and Running* conditions.

Figure 7 shows a corresponding increase in the MD rate from 8% in the Control condition, to 11% in the Running condition, and finally to 17% in the Running* condition. The difference in MD rate between Control and Running* is significant (one-tailed paired t = -2.541, p=0.013, df=12).

Figure 7: MD rate (%) for post-processing in the Control, Running, and Running* conditions.

4.2.3 Classification of Subjects Based on Fitness Level
We were not able to find reliable clusters of subjects based on heart rate or heart rate recovery measures (except for gesture; see Section 4.2.5). However, running speed cleanly divides the subjects into exertion clusters. Subjects were classified as fit (FIT) if their speed averaged greater than 4 mph in the Running* condition (running with the baby). This classification resulted in 6 subjects in the FIT category and 7 subjects in the NSF (Not-So-Fit) category. Overall, the average speed in the highly exerted phase (Running*) was significantly slower than the average speed in the first running phase (one-tailed paired t=2.71, p=0.0095, df=12). FIT subjects showed no significant difference in speed between the Running and Running* conditions (using a paired t-test). On the other hand, NSF subjects were significantly slower in the Running* condition than in the Running condition (one-tailed paired t=3.23, p=0.009, df=6). Not surprisingly, they were also significantly slower overall and in each condition (for Running, assuming unequal variances: one-tailed unpaired t=4.98, p=0.0003, df=10; for Running*, assuming unequal variances: one-tailed unpaired t=6.026, p=0.00004, df=11) compared to the FIT subjects.

4.2.4 Multimodal Performance Using Offline Speech as a Function of Fitness
Figure 8 shows the performance of the various recognizers for NSF and FIT subjects. Using the offline-processed speech, the multimodal success rate and the speech recognition rate were each found to decrease significantly with increasing exertion level for NSF subjects (multimodal success rate for Control vs. Running*: one-tailed paired t=2.424, p=0.026, df=6; speech recognition for Control vs. Running*: one-tailed paired t=11.214, p=0.00002, df=6), but not for FIT subjects. Gesture recognition decreased significantly for both categories (FIT: one-tailed paired t=2.719, p=0.021, df=5; NSF: one-tailed paired t=2.077, p=0.042, df=6). However, when compared across the two groups (FIT stationary vs. NSF stationary, FIT running vs. NSF running, etc.), the only significant between-group difference was for speech recognition in the first running condition (one-tailed unpaired t = -2.06, p=0.033, df=10, assuming unequal variances), where the FIT subjects outperformed the NSF subjects.

Figure 8: Recognition performance (success rate, %) of MMS, SR, and GR for NSF vs. FIT subjects across the Control, Running, and Running* conditions (Not-So-Fit and Fit panels, PC, N=3).

4.2.5 Correlation of Heart Rate Measures with Multimodal Performance
Measures of heart rate, both absolute and relative to the resting heart rate, as well as heart rate recovery per minute, were correlated with the various speech, gesture, and multimodal success measures. No significant correlations were found, except in one case: gesture recognition was significantly negatively correlated with the relative heart rate of the subjects (calculated for each subject as a percentage over his resting heart rate) for both the FIT (ρ = -14%, critical ρ = 11.6%) and NSF categories (ρ = -21.1%, critical ρ = 13.33%). The critical correlation coefficient, calculated as 2/sqrt(N), gives the magnitude above which a correlation can be considered statistically significant. The negative correlation indicates an inverse relationship between gesture recognition performance and relative heart rate.
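The 2/sqrt(N) rule of thumb is straightforward to apply; the sketch below checks an observed correlation against it. The sample size used here is an illustrative assumption (the paper does not state N), although N = 225 happens to reproduce the 13.33% critical value reported above for the NSF group.

```python
import math

def is_significant(rho, n):
    """Apply the 2/sqrt(N) rule of thumb: |rho| must exceed the critical value."""
    critical = 2.0 / math.sqrt(n)
    return abs(rho) > critical, critical

# Illustrative only: an observed correlation of -0.211 with an assumed N = 225 observations.
significant, critical = is_significant(-0.211, 225)
print(f"critical rho = {critical:.4f}, significant = {significant}")  # critical rho = 0.1333
```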

5. DISCUSSION
In the field, the overall multimodal success rate was 81.5%, the speech recognition rate was 77%, and the gesture recognition rate was 92.9%. The mutual disambiguation (MD) rate was 12.5%, meaning that in 12.5% of the successful commands either speech's or gesture's top candidate hypothesis (or both) was incorrect. The MD rate increased with exertion, and the MD rate in the most exerted condition was twice that in the control condition (8.3% in the control phase, and 13.6% and 16.5%, respectively, in the two running phases), thus providing evidence of the robustness of multimodal interaction in exerted field and mobile environments. These results were confirmed with the offline-processed speech. Overall, our hypothesis that mutual disambiguation would stabilize system performance in the face of exertion was supported, because performance of the system stayed relatively constant despite the increased exertion.¹ Furthermore, based on offline-processed speech, it was found that multimodal success was stabilized for FIT subjects (i.e., there was no significant increase or decrease from the stationary condition to either of the running conditions). However, the multimodal success rate was still significantly worse for NSF subjects during the most exerted running condition, as compared to the stationary condition.

¹ Although this study design did not counterbalance the order of conditions, for the logistical reasons mentioned earlier, any order-dependent "learning" that may have taken place would have weakened the results that we found from their actual magnitudes. Thus, the effects may well be stronger than found here.

As expected, the performance of the gesture recognizer degraded significantly from the control to the first running condition as well as from the control to the most exerted condition (from 97.6% to 92.1% and 88.9%). However, subjects' fitness level differentially affected gesture recognition performance: the exertion from running affected the NSF subjects' gesture recognition performance, but it took carrying a load in one arm to significantly affect the FIT subjects' gesture recognition.

A surprising result was that the performance of the speech recognizer in the field was not significantly different across the three conditions (75.3% in the control phase, 77.8% in the running phase, and 77.5% in the running-with-the-baby phase). One possible explanation was that the speech recognizer running on the PDA did not fail gracefully. This is supported by the observation that in 62% of the errors, the correct recognition did not even appear in the list of hypotheses, in which case mutual disambiguation can be of no help. In order to investigate this finding further, we used the speech corpus collected via the voice recorder. The recorded speech was sliced into individual utterances and processed offline on a PC as well as on a PDA used for the study, with the same speech recognition software as was used in the field. It was found that the performance of the speech recognizer on the offline-processed speech increased dramatically over the field results, but also in fact decreased over all subjects as a function of exertion, from 92.3% in the control condition, to 91.2% in the running condition, and (significantly) to 87.7% in the highly exerted condition. Fit subjects maintained their speech recognition performance during exertion, but were eventually significantly affected by running with a load.

This study did not find that re-run speech recognition on the handheld PDA was significantly different from recognition performed on a backend server machine, provided that the entire speech signal containing the subject's utterance was used. We therefore attribute the increase in performance of the speech recognizer during post-processing to the fact that each utterance had been manually sliced from the continuous speech stream with about 100 milliseconds of silence on either side of the utterance. This hypothesis was confirmed by our observing poor performance of the speech recognizer on randomly selected utterances from the corpus that had been sliced with little or no silence at the beginning of the utterance. Thus, differences in the microphone were not likely the cause. Rather, the most probable cause of the poor performance of the speech recognizer in the field was truncation of the beginning of the speech samples. It is well known that once an HMM-based recognizer gets off track in the early part of an utterance, it has a difficult time recovering. We believe this truncation occurred because the PDA started listening too late once the user touched the screen. Thus, even though many of the processes needed for multimodal interpretation were being run on a separate server, the PDA still could not switch processes fast enough to keep up with users' finely coordinated speech and manual gesture.

In order to overcome the difficulties of the PDA's truncating speech, two potential solutions are apparent: 1) slow the user down, for example by requiring that a button be hit to start speech before drawing [8], or 2) have the system listen continuously, and then process speech that begins within some time window of the pen's touching the screen. We have in fact implemented and used the latter method in tablet PC implementations of the multimodal system, but were hesitant to implement it for the PDA, since having the system listen continually could itself use too many CPU cycles.

We computed a number of correlations of system performance with heart rate, but found only one that was significant. We hypothesize that one reason these measures were not often correlated with heart rate is that subjects were self-managing their physiological responses to exertion. Subjects who were fit may well have pushed themselves harder and run faster, attempting to win the prize, while subjects who were not as fit adapted by slowing down, sometimes walking, with the goal of finishing the course. In general, there may be too many factors affecting heart rate for it to be a predictor of users' speech and gesture recognition performance. Finally, respiration rates were simply too labor-intensive to calculate for each subject.
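As an illustration of the second proposed remedy above (listening continuously and processing only speech that begins near a pen-down event), the sketch below gates hypothetical speech segments by their temporal proximity to pen-down timestamps. The 1.5-second window and the event structures are assumptions; the paper does not specify the parameters of its tablet PC implementations.

```python
# Hypothetical events: speech segments detected by a continuously listening endpointer,
# and pen-down timestamps, all in seconds on a shared clock.
speech_segments = [(10.2, 11.9), (14.0, 15.3), (21.5, 22.8)]   # (start, end)
pen_down_times = [13.8, 21.7]

WINDOW_S = 1.5   # assumed: accept speech starting within +/- 1.5 s of a pen-down event

def speech_for_pen_event(pen_t, segments, window=WINDOW_S):
    """Return the speech segments whose start falls within the window around a pen-down."""
    return [(s, e) for (s, e) in segments if abs(s - pen_t) <= window]

for pen_t in pen_down_times:
    print(pen_t, speech_for_pen_event(pen_t, speech_segments))
# 13.8 -> [(14.0, 15.3)]; 21.7 -> [(21.5, 22.8)]
```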

5.1 Equipment Observations
The study was originally designed and pilot tested using a single PDA that was wirelessly connected to the campus computer network using 802.11b, with the subject carrying the PDA as he ran to each station. However, we found that the various 802.11b receivers in the PDAs could not properly maintain network connections while the subject was running. Moreover, switching of network base stations as the subject ran would often cause a substantial delay (15-20 seconds) in reacquiring the network, often leading to poor multimodal performance. This problem led us to abandon the original method of having subjects carry the PDA from station to station; instead, they ran from one station to another, where they picked up a PDA that was already on the network.

Our multimodal map-based system was enhanced with the ability to communicate with an Emtac Bluetooth GPS receiver, enabling the system to display the position, orientation, and speed of a subject on the map in real time. The objective was to provide the subjects with a sense of their location and orientation with respect to the study area. However, it was found during the pilot studies that the Bluetooth GPS unit and the transmission of GPS data from the PDA were slowing down the system's response. Because the response time needed to be kept within 3-4 seconds, so as not to allow enough time for the subjects' heart rate to come down between commands, it was decided not to use GPS for the study.

Finally, during pilot testing, the equipment setup employed a Plantronics Bluetooth wireless microphone to communicate with the PDA. However, the design of the microphone (which wrapped around the ear) was too unstable to stay positioned correctly when worn by a running subject. In addition, the battery life was too short to last an entire session. It was therefore decided to use wired microphones for the actual study.

5.2 Related Work
There is a large literature on spoken interaction in stressful environmental conditions [3], particularly for speech in combat aircraft with substantial noise and high G-forces. However, very little research into exerted human-computer interaction using speech or pen-based gesture has been reported. Quite recently, Entwistle [4] reported that when subjects become exerted on a stationary bicycle (as measured by self-reports of perceived exertion levels) and then dictate passages to a speech recognition system, the overall dictation recognition rate degrades from 77.8% (rested) to 66.5% (light exertion) to 60% (exerted). No correlation of recognizer accuracy with heart rate or heart rate recovery was found; this lack of a correlation was confirmed in the present work. Recognition accuracy also was not found to correlate with self-reports of fitness. Using a speed-based classification of fitness, we found that fit subjects' recognition accuracy did not degrade under exertion, but subjects classified as "not so fit" did suffer significantly degraded speech recognition. Given that Entwistle's subjects were required to have been enrolled in physical activities courses or intercollegiate athletics, most of those subjects would likely have been classified in our FIT category. Finally, as Entwistle points out, her results are for read speech, not interactive speech with a computer.

The present study is most similar to that of Oviatt [10], who reports on the effects of moderate movement (walking) on multimodal interaction as mediated by a mutually disambiguating architecture. Our study confirms Oviatt's findings for the moderate exertion case, but it also found substantial new results for high exertion levels. In Oviatt's study, MD was found to occur in 11% of the interactions using a noise-canceling microphone during mobile situations, and in 7.5% of the stationary interactions. Likewise, in the present study, MD occurred in 8% and 11% of the stationary and exerted interactions, but jumped to 17% in the highly exerted case.² Regarding gesture, in Oviatt's study a tablet computer was suspended in a harness that was designed to prevent arm fatigue. Not surprisingly from the perspective of the present study, no degradation of gesture recognition was found as a result of interacting with that system while walking. We found the recognition of gestures to degrade under exertion.

Other researchers have developed mobile multimodal systems (e.g., Jameson [5], Krüger [8]). Jameson's system is intended to track eye movement while users are engaged in mobile tasks, but does not attempt to analyze speech and gesture. Krüger's multimodal map system endeavors to perform all multimodal interaction onboard the PDA, but it has yet to be evaluated in mobile contexts.

² Our definition of MD may underestimate its occurrence as compared to Oviatt's, since we only considered cases in which the multimodal result was in fact correct.

5.3 Concluding Remarks
This initial study has demonstrated that exertion affects both speech and gesture recognition, with significantly worse performance for subjects who are not among the most fit. However, overall and especially for fit subjects, mutual disambiguation was found to lead to a stabilization of system performance. To our knowledge, these are the first reported results on gesture recognition and multimodal system performance under exertion, and only the second report on speech recognition in exerted conditions. Future research should investigate this relationship longitudinally, with more subjects, and with more complex recognition problems, particularly since others have found much higher levels of absolute degradation in speech recognizer performance. It will be interesting to discover the extent to which MD with much larger recognition vocabularies can compensate for degradations under exertion.

Lastly, our observations of the PDA platform for multimodal interaction apply to the current generation of processors. The next generation is expected to offer much greater performance, and manufacturers are working to improve system robustness and operating system efficiency. This generation could significantly improve speech and gesture recognition performance, enabling the entire multimodal interface to function standalone.

6. ACKNOWLEDGMENTS
This material is based upon work supported by the Defense Advanced Research Projects Agency (DARPA), through the Department of the Interior, NBC, Acquisition Services, under Contract No. NBCHD030010. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of DARPA or the Department of the Interior. We would also like to thank Helen Ross for helping with the study, and Sharon Oviatt and David McGee for their useful feedback.

7. REFERENCES
[1] Manual of Uniform Traffic Control Devices for Canada (UTCD).
[2] U.S. Manual on Uniform Traffic Control Devices (MUTCD).
[3] Baber, C. and Noyes, J. Automatic speech recognition in adverse environments. Human Factors, 38: 142-156, 1996.
[4] Entwistle, M. S. The Performance of Automated Speech Recognition Systems Under Adverse Conditions of Human Exertion. International Journal of Human-Computer Interaction, 16: 127-140, 2003.
[5] Jameson, A. Usability Issues and Methods for Mobile Multimodal Systems. In Proceedings of the ISCA Tutorial and Research Workshop on Multi-Modal Dialogue in Mobile Environments, Kloster Irsee, Germany, 2002.
[6] Johnston, M., Cohen, P. R., McGee, D. R., Oviatt, S. L., Pittman, J. A., and Smith, I. A. Unification-based Multimodal Integration. In Proceedings of the 35th Annual Meeting of the Association for Computational Linguistics, 1997.
[7] Knoblauch, R., Pietrucha, M., and Nitzburg, M. Field Studies of Pedestrian Walking Speed and Start-up Time. Transportation Research Record, (1538), 1996.
[8] Krüger, A., Butz, A., Müller, C., Stahl, C., Wasinger, R., Steinberg, K.-E., and Dirschl, A. The Connected User Interface: Realizing a Personal Situated Navigation Service. In Proceedings of IUI, 2004.
[9] Oviatt, S. L. Mutual disambiguation of recognition errors in a multimodal architecture. In Proceedings of CHI '99.
[10] Oviatt, S. L. Multimodal System Processing in Mobile Environments. In Proceedings of UIST 2000.
[11] Vo, M. T. and Wood, C. Building an application framework for speech and pen input integration in multimodal learning interfaces. In Proceedings of IEEE-ICASSP, 1996.
