Augmentative and Alternative Communication, December 2010 VOL. 26 (4), pp. 267–277

Evaluation of a Speech Recognition Prototype for Speakers with Moderate and Severe Dysarthria: A Preliminary Report

SUSAN K. FAGER^a*, DAVID R. BEUKELMAN^b, TOM JAKOBS^c, and JOHN-PAUL HOSOM^d


^a Institute for Rehabilitation Science and Engineering, Madonna Rehabilitation Hospital, Lincoln, Nebraska; ^b University of Nebraska, Lincoln; ^c InvoTek Inc., Alma, AR; and ^d Oregon Health and Science University, Portland, OR, USA

*Corresponding author. Institute for Rehabilitation Science and Engineering, Madonna Rehabilitation Hospital, 5401 South Street, Lincoln, NE 68506, USA. Tel: +1 402 483 9459. E-mail: [email protected]

ISSN 0743-4618 print/ISSN 1477-3848 online © 2010 International Society for Augmentative and Alternative Communication. DOI: 10.3109/07434618.2010.532508

This study described preliminary work with the Supplemented Speech Recognition (SSR) system for speakers with dysarthria. SSR incorporated automatic speech recognition optimized for dysarthric speech, alphabet supplementation, and word prediction. Participants included seven individuals with a range of dysarthria severity. Keystroke savings using SSR averaged 68.2% for typical sentences and 67.5% for atypical phrases, a significant difference from using word prediction alone. The SSR correctly identified an average of 80.7% of target stimulus words for typical sentences and 82.8% for atypical phrases. The relations between sentence intelligibility and keystroke savings, and between sentence intelligibility and system performance, did not reach statistical significance. The results suggest that individuals with dysarthria using SSR could achieve comparable keystroke savings regardless of speech severity.

Keywords: Speech recognition; Dysarthria; Supplemented speech; Voice recognition

INTRODUCTION

A range of computer access options for people with severe physical conditions and limited movement capabilities has been developed over the years, including head tracking, eye tracking, ergonomically designed keyboards and mouse options, one-handed keyboards, onscreen keyboards, scanning software applications, and automatic speech recognition (ASR) technology (Beukelman, Garrett, & Yorkston, 2007). While progress in developing new access technologies is encouraging, there is an ongoing need to design technologies that are efficient, minimally fatiguing, and preferred by individuals with severe physical conditions.

Many people with moderate and severe dysarthria retain some natural speaking abilities, which they use to communicate with familiar people about predictable messages. While ASR technology has been useful for those with spinal cord injuries (SCI) and typical speech who experience reduced keyboard efficiency due to limited hand movement, the usefulness of this technology for individuals with severe physical impairments and co-occurring dysarthria has been limited. If ASR technology could recognize the speech of people with dysarthria, it might be an effective interface access strategy for computer text generation. However, for those with severe physical impairments and co-occurring dysarthria, attempts at functional ASR use with standard, commercially available technology (e.g., Dragon NaturallySpeaking[1]) have been either largely unsuccessful or successful for only a limited dictionary of words (Coleman & Meyers, 1991; Ferrier, Jarrell, Carpenter, & Shane, 1992; Ferrier, Shane, Ballard, Carpenter, & Benoit, 1995; Fried-Oken, 1985; Kotler & Thomas-Stonell, 1997; Raghavendra, Rosengren, & Hunnicutt, 2001).

Standard, Commercially Available ASR Technology and Dysarthria



In order to understand ASR technology, one must understand the basic terminology that describes the different variations of these systems. ASR can be described as either discrete-utterance or continuous-speech based, and as speaker-dependent, speaker-independent, or speaker-adaptive (Rosen & Yampolsky, 2000; Venkatagiri, 2002). Discrete-utterance ASR requires the speaker to insert pauses between words when dictating. Continuous-speech systems allow the user to dictate a continuous stream of speech without slowing speech rate or inserting unnatural pauses. Speaker-dependent ASR requires that the technology be trained to recognize each individual's voice. With these systems, the need to train every word to be recognized often results in significantly reduced vocabulary sets. Speaker-independent systems are based on a standard speech template that does not require individual training; they are commonly used in automated telephone menus. Speaker-adaptive systems develop and refine the recognition templates while in use. These systems do not require extensive training before they can be used; however, extremely low recognition rates are common during the first few uses. If the speaker diligently corrects all misrecognitions, the system eventually adapts to the speaker's unique voice and recognition rates begin to increase.

A number of studies have documented trials of standard, commercially available ASR technology by speakers with dysarthria. Many of the earliest reports describe accuracies of 40–96% using discrete-utterance, speaker-dependent systems on limited vocabulary sets (alphanumeric codes and small lists of words from standard reading passages), with individuals with mild dysarthria achieving the highest accuracies (Coleman & Meyers, 1991; Ferrier et al., 1992; Ferrier et al., 1995; Fried-Oken, 1985; Kotler & Thomas-Stonell, 1997). Some have compared the performance of speaker-adaptive and speaker-dependent systems using small vocabulary sets and have found that, for those with mild dysarthria, recognition rates were high with both technologies; however, for those with severe dysarthria, the results varied considerably between participants (Raghavendra et al., 2001). The researchers pointed to differences in the speech characteristics of the participants with severe dysarthria as an explanation for the differences in accuracies. One of the few studies to examine discrete-utterance versus continuous recognition included an individual with mild dysarthria due to traumatic brain injury (TBI) and a typical speaker (Hux, Rankin-Erickson, Manasse, & Lauritzen, 2000). Both the control speaker and the speaker with mild dysarthria achieved higher recognition rates with the continuous-speech system. Other studies of individuals with mild dysarthria using continuous-speech ASR technology have produced similar results (Manasse, Hux, & Rankin-Erickson, 2000).

What is clear from these reports is the difficulty that standard, commercially available ASR technology has in recognizing the speech of individuals with moderate to severe dysarthria. Rosen and Yampolsky (2000) provided an extensive review of ASR technology and described the effect that the acoustic characteristics of dysarthria have on recognition accuracy. Blaney and Wilson (2000) identified intra-speaker variability across several acoustic parameters as a reason for low recognition rates for individuals with moderate dysarthria. While some levels of ASR accuracy with standard, commercially available technology have been reported, these usually involved speakers with mild dysarthria and tasks with very limited vocabulary sets. No reports have documented the functional, daily use of standard, commercially available ASR by individuals with severe dysarthria to support written communication needs.

The authors of this study informally surveyed 10 clinical specialists who were not involved in the current study, but who served people with dysarthria in regional clinics, to determine whether these specialists had served individuals with moderate to severe dysarthria who routinely used ASR technology. Routine use was defined as use of ASR for communication interaction, word processing, email, and/or Internet access. None of the clinical specialists reported having served individuals with moderate to severe dysarthria who routinely used ASR in the past 5 years. Any reports of routine use of standard, commercially available ASR technology related to individuals who had very mild or no dysarthria.

ASR Adapted for Dysarthria

Due to the challenges individuals face using standard, commercially available ASR, extensive efforts have been underway to develop ASR technology based on models of dysarthric speech (Chen & Kostov, 1997; Omar, Morales, & Cox, 2009; Patel & Roy, 1998; Polur & Miller, 2005; Sawhney & Wheeler, 1999; Wan & Carmichael, 2005). The STARDUST (Speech Recognition for People with Severe Dysarthria) and SPECS (Speech-driven Environmental Control Systems) projects developed ASR based on models of dysarthric speech (Green et al., 2003; Hatzis et al., 2003; Hawley et al., 2003).


These projects targeted the development of a speech-controlled environmental control system (ECS) for individuals with severe dysarthria (Hawley, 2002; Hawley et al., 2007; Judge, Robertson, Hawley, & Enderby, 2009). Individuals with even the most severe dysarthria were able to achieve recognition rates sufficient to support routine use of this ECS in the home using a small vocabulary set (mean word recognition rate of 86.9%). Additionally, these projects resulted in the development of a training system to improve the accuracy of the speech of individuals with dysarthria for ASR use (Parker, Cunningham, Enderby, Hawley, & Green, 2006), and a voice-input voice-output communication aid (VIVOCA) for individuals with moderate to severe dysarthria (Hawley, Enderby, Green, Cunningham, & Palmer, 2006).

The ENABL (Enabler for Access to Computer-Based Vocational Tasks with Language and Speech) project adapted an ASR system to the speech of individuals with dysarthria to control an engineering design system (Magnuson & Blomberg, 2000; Rosengren, 2000; Talbot, 2000). This system utilized sentence-level commands to control the engineering design program. For a male participant with dysarthria, the percentage of words correctly identified by the ASR increased from 22.7% to 53%.

Researchers associated with the Rehabilitation Engineering Research Center on Communication Enhancement (AAC-RERC) in the United States have also developed an acoustic model based on the speech of individuals with a range of dysarthria severity due to amyotrophic lateral sclerosis (ALS). Data collection for the development of the model included recordings of participants reading the numbers 0 through 9. Initial tests with the prototype model resulted in 86.24% word recognition for a female with mild dysarthria, 70.27% for a female with moderate dysarthria, 75.28% for a male with moderate dysarthria, and 51.25% for a female with severe dysarthria (Caves, Boemler, & Cope, 2007).

Supplemented Speech Recognition Prototype Development

Preliminary work has been completed by some of the authors on a prototype application based on models of dysarthric speech, which is described in this report. This prototype application, called supplemented speech recognition (SSR), is a "hybrid" approach to ASR that incorporates first-letter identification as well as word prediction, which makes it potentially more efficient for speakers with moderate to severe dysarthria than technology that relies on the speech signal alone. While some commercially available ASR technologies incorporate word prediction (e.g., SpeakQ[2]), the addition of the first-letter cue makes this system unique.


Additionally, using first-letter cues (i.e., alphabet supplementation) is a strategy with which many individuals with dysarthria are familiar, as they use it to compensate for decreased intelligibility (Hanson, Yorkston, & Beukelman, 2004; Hanson, Beukelman, Heidemann, & Shutts-Johnson, 2009; Hustad, Auker, Natale, & Carlson, 2003; Hustad & Beukelman, 2001; Hustad, Jones, & Dailey, 2003).

The following includes a description of (a) the development of the ASR technology incorporated by the SSR system, (b) the word prediction algorithm that was employed and how it interacts with the ASR, (c) the user interface, and (d) how the entire prototype system functions for an individual using the system.

ASR technology incorporated by SSR

The speaker-dependent, discrete-utterance SSR system developed for this work was trained on each individual's voice using a single recording of each of 211 words. The researchers were cognizant of the time and effort individuals with dysarthria require to train speaker-dependent ASR technology; therefore, an effort was made to reduce the number of words required to train the system to recognize the individual's speech. The list of 211 words was chosen to be the smallest possible subset of the initial target list of 500 words, while covering a maximum number of phonemes in different phonetic contexts. A phoneme-level Hidden Markov Model (HMM) was employed, with words represented as sequences of context-independent phonemes. The pronunciation of each word was made as general as possible, to allow for phonetic variation among the participants with dysarthria. Training of the entire system on 211 words required about 10 min on a 2 GHz computer. Recognition was performed using a total vocabulary of 500 words, restricted to those words beginning with the letter indicated by the user. The list of 500 words included the 211 words used in training. More detail about the HMM/ANN framework and feature set can be found in the description of the baseline system described by Hosom for the task of forced alignment (Hosom, 2009). This general approach has yielded state-of-the-art results on tasks such as connected digit recognition (Hosom, Cole, & Cosi, 1998). While the system described in that prior publication was speaker-independent, the system described here is speaker-dependent, because the training data come from a single speaker.
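The paper does not detail how the 211-word training subset was selected; a greedy set-cover heuristic is one natural way to approximate "the smallest subset covering the most phonemes in different phonetic contexts." The sketch below illustrates that idea and is not the authors' algorithm; the lexicon layout, and the use of phoneme bigrams as the "context" unit, are assumptions.

```python
def greedy_training_subset(lexicon):
    """Greedily choose words until every phoneme bigram in the lexicon is covered.

    lexicon: dict mapping word -> phoneme list, e.g. {"ball": ["b", "ao", "l"]}
    (a hypothetical layout; phoneme bigrams stand in for "phonemes in
    different phonetic contexts").
    """
    def units(phones):
        padded = ["#"] + list(phones) + ["#"]  # include word-boundary context
        return set(zip(padded, padded[1:]))

    target = set().union(*(units(p) for p in lexicon.values()))
    chosen, covered, remaining = [], set(), dict(lexicon)
    while covered < target and remaining:
        # Pick the word that covers the most still-uncovered units.
        best = max(remaining, key=lambda w: len(units(remaining[w]) - covered))
        if not units(remaining[best]) - covered:
            break  # no remaining word adds coverage
        covered |= units(remaining.pop(best))
        chosen.append(best)
    return chosen
```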


Word prediction in SSR

The word prediction component of the SSR was custom-developed for the project by InvoTek, Inc., using a bigram language model. The word prediction algorithm worked in conjunction with the SSR system as follows: First, the word prediction software generated two lists based on the last word in the text being generated and the first letter of the target word that was entered by the user. The first list was based on a 5,000-word database that included the probability of every word in the database, given the previous word that was generated in the text. A second list, based on the first letter of the target word that was typed, was then generated. This second list was a subset of the first list and contained only the words known to the SSR system. The SSR system then attempted to match the acoustic signal of the spoken word produced by the individual with dysarthria against this second list. The best match was entered automatically into the text in the message window, and the next five most probable alternative words were displayed on the word-prediction buttons.
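A minimal sketch of the two-list procedure just described. The data structures (a nested dict for the bigram probabilities over the 5,000-word database, and a set for the 500-word SSR vocabulary) are illustrative assumptions, not InvoTek's implementation:

```python
def candidate_words(prev_word, first_letter, bigram, ssr_vocab):
    """Return the recognizer's candidate list, ranked by P(word | prev_word).

    bigram: dict of dicts, bigram[prev][word] = probability (hypothetical layout)
    ssr_vocab: set of the 500 words known to the SSR recognizer
    """
    # List 1: every word in the 5,000-word database, ranked by its
    # probability given the previous word in the text.
    ranked = sorted(bigram.get(prev_word, {}).items(),
                    key=lambda item: item[1], reverse=True)
    # List 2: the subset starting with the typed letter and known to the SSR.
    # The recognizer matches the acoustic signal against this list; the best
    # match fills the message window, and the next five most probable words
    # populate the word-prediction buttons.
    return [word for word, prob in ranked
            if word.startswith(first_letter) and word in ssr_vocab]
```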

The SSR interface

During this preliminary investigation, the participants with dysarthria did not interact with the user interface (other than providing the spoken target word). The following description of the interface is provided so that the reader can understand how the prototype system functioned: The user interface consisted of an onscreen keyboard in an alphabetized layout, a message window, and five word-prediction buttons positioned along the right-hand side of the message window (Figure 1). The interface was limited to five vertically aligned word-prediction buttons, based on previous reports of the cognitive demands related to scanning word-prediction lists, in an attempt to reduce these demands for individuals using the system (Koester & Levine, 1998; Lesher, Moulton, & Higginbotham, 1999). The interface was similar in design to many interface display options in commercially available speech-generating devices (e.g., Vmax by DynaVox Mayer-Johnson[3]).

Functioning of SSR

The SSR software prototype system functioned in the following manner: The first letter of the target word was typed using the onscreen keyboard on a touch-screen tablet computer. This letter was displayed in the message window. The participant would then hear a beep, the auditory prompt to speak the target word. The word that the system interpreted to be the closest match to the word produced by the participant would then appear in the message window, and the next five probable choices would be displayed alphabetically on the word-prediction buttons to the right of the message window. If the word was correctly displayed in the message window, this procedure would be repeated for the next word in the sentence. However, if the word in the message window was not correct, but the correct word was available on one of the word-prediction buttons, that word would be selected and would replace the incorrect word in the message window. If the correct word was displayed neither in the message window nor on any of the word-prediction buttons, subsequent letters of the target word would be entered until the correct word appeared on one of the word-prediction buttons or until the participant had typed it in its entirety. Once the correct word was in the message window, this process would be repeated for subsequent words in the target sentence.
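This word-entry procedure can be expressed compactly as a counting simulation. The sketch below follows the steps above; `window_word` and `button_words` are illustrative stand-ins for the combined recognizer and word-prediction behavior, not the prototype's API, and re-checking the window after every letter is a simplification of the single spoken attempt:

```python
def keystrokes_for_word(target, window_word, button_words):
    """Count the keystrokes needed to enter `target` under the SSR procedure.

    window_word(prefix): word the system would place in the message window
    after `prefix` has been typed and the word spoken (stand-in for ASR).
    button_words(prefix): the five words on the word-prediction buttons.
    """
    keystrokes = 0
    for i in range(1, len(target) + 1):
        prefix = target[:i]
        keystrokes += 1                    # type the next letter
        if window_word(prefix) == target:
            return keystrokes              # correct word in message window
        if target in button_words(prefix):
            return keystrokes + 1          # one more keystroke to select button
    return keystrokes                      # word spelled in its entirety
```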

Figure 1. The interface of the SSR software program including onscreen keyboard, message window and word-prediction buttons.

Purpose

The purpose of this project was to investigate the performance of the SSR prototype system for speakers with a range of dysarthria severity. The specific research questions were: (a) How do keystroke savings derived from using SSR compare to conventional text generation using word prediction only? (b) How does the combination of ASR and word prediction (provided through the SSR system) affect word recognition accuracy compared to ASR alone? and (c) What are the relationships between keystroke savings and speech intelligibility, and between SSR performance and speech intelligibility?


METHOD

Participants

Seven speakers with dysarthria participated in this project. One participant had cerebral palsy and the remaining six had a primary diagnosis of traumatic brain injury (TBI). The participants included six males and one female, with an age range of 24–51 years. The type of dysarthria each participant exhibited was determined by the lead authors using the Darley, Aronson, and Brown (1969) classification of dysarthria. One participant demonstrated spastic dysarthria and the remaining six demonstrated mixed dysarthria. Intelligibility was measured using the Sentence Intelligibility Test (SIT; Yorkston, Beukelman, Hakel, & Dorsey, 2007). Intelligibility scores ranged from 16.4% to 89.1%. Two of the seven participants routinely used augmentative and alternative communication (AAC) devices and strategies to support communication. Table 1 provides the demographic information for all of the participants.

Materials

Software

The software designed for the project was the SSR prototype system described previously. The software interface included an onscreen keyboard, message window, and word-prediction buttons (Figure 1).

Hardware

A Toshiba Satellite R20[4] tablet computer with an Intel Pentium 2 GHz processor hosted the software. The laptop touch screen was used to identify first letters of target words. An Andrea NC-7100 USB head-mounted microphone[5] was used to capture the audio signals from the participants with dysarthria.

Target stimuli

The target stimuli consisted of 17 nonstandardized sentences and phrases developed from the 500 words in the SSR master dictionary (Appendix A).

TABLE 1. Participant Diagnosis, Dysarthria Type, and Sentence Intelligibility Results.

Participant  Diagnosis  Age  Dysarthria type  Sentence intelligibility  AAC to support communication?
P1           CP         35   Spastic          16.4%                     Yes
P2           TBI        42   Mixed-spastic    45.0%                     No
P3           TBI        48   Mixed            50.0%                     Yes
P4           TBI        38   Mixed-flaccid    74.6%                     No
P5           TBI        51   Mixed            77.3%                     No
P6           TBI        24   Mixed            82.7%                     No
P7           TBI        45   Mixed            89.1%                     No

Note. CP = cerebral palsy; TBI = traumatic brain injury.

Figure 2. Average keystroke savings for each speaker participant (ordered by sentence intelligibility) with the researcher-controlled touch-screen computer, using the full SSR on typical sentences and atypical phrases, compared to the maximum keystroke savings possible using only word prediction on typical sentences and atypical phrases.


Figure 3. Percentage of target stimulus words the SSR placed in the message window compared to the percentage of target stimulus words that were placed in the message window or on a word-prediction button for each participant (ordered by sentence intelligibility) for typical sentences and atypical phrases.

Ten of these sentences were typical, in that they were grammatically correct as judged by the researchers. The typical sentence set had a predictability index of 50% using the assessment reported by Cannito, Suiter, Chorna, Beverly, Wolf, Watkin, and Pfeiffer (2008). One side effect of a small dictionary of common words for the ASR was that, in typical sentences, the words were often easily predicted by the word prediction language model alone. To get a better sense of the SSR performance, an atypical stimulus set of seven phrases and incomplete sentences was developed. The researchers judged this set to be phrases and incomplete sentences consistent with spoken conversation. Many individuals with moderate to severe dysarthria rely on AAC strategies to support their communication. These conversational phrases were chosen, in addition to the typical sentences, because in the future some individuals may use the SSR to support spoken communication.

Procedures

The researcher controlled the touch screen, in order to control for participants' varying levels of manual and visual capability and to reduce their learning time. Individuals with dysarthria often present with a myriad of visual and motor deficits. For example, in the current study, three of the participants demonstrated visual field cuts, one required switch scanning to use an AAC device, two required head-tracking technology to control a computer cursor, and one required changes in selection time on a touch-screen device due to ataxic limb movements. At the time of the experiment, the prototype system did not accommodate all of the potential access needs that individuals with severe motor and visual impairments may require (alternative access methods such as head/eye control, switch-scanning capabilities, modifications to touch-screen sensitivity, etc.). Additionally, substantial learning and practice may be required for an individual to become competent using a system that requires accurate first-letter identification as well as the ability to use word prediction. While these are important considerations to be addressed in future studies, the cognitive, visual, and motor requirements of using such a system pose potentially confounding variables. The goal of this preliminary work was to examine how the system performed using real-time dysarthric speech.

Seven individuals with dysarthria participated in the first part of the project. The researcher presented each word of the target sentence orally, then typed the first letter of the target word. Next, the participant heard a beep as a prompt to say the target word. The participant spoke the word, and the SSR placed the most probable word into the message window and provided the next five probable word options on the word-prediction buttons. The researcher then did one of the following: (a) proceeded to the next target word in the sentence if the word was accurately displayed in the message window, (b) selected the target word from the word prediction list, or (c) spelled the word using the onscreen keyboard portion of the interface. A measure of keystroke savings was calculated during this task. The percentages of target words that appeared in the message window and on the word-prediction buttons were also calculated and used to represent SSR system performance.

For the second portion of the project, the researcher entered each target sentence/phrase with the ASR turned off (no speech recognition). Word prediction was active, and when target words appeared on the word-prediction buttons, they were selected. A measure of keystroke savings was calculated during this task.

Measurements


Keystroke savings

Keystroke savings of SSR were calculated using the formula (X − Y)/X, where X represented the total number of keystrokes required to type a word in full plus a word-terminating character (e.g., space), and Y represented the total number of keystrokes required to enter the word with the SSR (Higginbotham, 1992; Koester & Levine, 1998; Magnuson & Hunnicutt, 2002; Trnka, Yarrington, McCoy, & Pennington, 2005; Venkatagiri, 1994). In the SSR prototype, the following scenarios could occur:

Scenario 1: The SSR placed the correct target word in the message window. In this instance, only one keystroke was entered (the first letter of the word to be spoken); the SSR identified an accurate match, placed that word in the message window, and the individual using the system proceeded to the next word. For the target word ball, the keystroke savings formula would yield the following result:

B-A-L-L plus a space = 5 keystrokes (X)
SSR = 1 keystroke (Y)
(X − Y)/X = (5 − 1)/5 = 4/5 = 80%

Scenario 2: The SSR did not place the correct target word into the message window, but the correct word was available on one of the word-prediction buttons. The individual using the system had to enter one keystroke (the first letter of the word to be spoken) and perform the equivalent of another keystroke (selection of the word from the word prediction list). If the target word again was ball, the keystroke savings formula would yield the following result:

B-A-L-L plus a space = 5 keystrokes (X)
SSR = 2 keystrokes (Y)
(X − Y)/X = (5 − 2)/5 = 3/5 = 60%

Scenario 3: The SSR placed the correct target word neither in the message window nor on the word-prediction buttons, so the word had to be spelled either in its entirety or until it appeared on a word-prediction button. If the target word appeared on a word-prediction button after the "a" was typed in the example ball, the individual using the system would have entered one keystroke (the first letter of the word to be spoken), plus one keystroke for the "a", and one more keystroke to select the target word from the word prediction list. The keystroke savings formula would then yield the following result:

B-A-L-L plus a space = 5 keystrokes (X)
SSR = 3 keystrokes (Y)
(X − Y)/X = (5 − 3)/5 = 2/5 = 40%
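For concreteness, the formula and the three worked scenarios above translate directly into a few lines of code (a sketch; the function name is ours, not part of the prototype):

```python
def keystroke_savings(word, keystrokes_used):
    """(X - Y) / X, where X counts typing the word plus a terminating space."""
    x = len(word) + 1          # X: letters plus the word-terminating character
    return (x - keystrokes_used) / x

# The three scenarios for the target word "ball" (X = 5):
assert keystroke_savings("ball", 1) == 0.80   # Scenario 1: word in window
assert keystroke_savings("ball", 2) == 0.60   # Scenario 2: button selection
assert keystroke_savings("ball", 3) == 0.40   # Scenario 3: "b", "a", then button
```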


SSR system performance

To provide a better understanding of the potential benefits of the individual components of the SSR, the percentage of target stimulus words that were displayed in the message window was calculated and compared to the percentage of target stimulus words that were displayed in the message window or on the word-prediction buttons.

Statistical analysis

Data were summarized using descriptive statistics (means). Significance was tested using the Wilcoxon signed-rank test. Relationships were examined using Spearman's rank-order correlation.
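As a sketch of these analyses with SciPy (the per-participant savings below are placeholders within the reported 64.8–70.1% range, not the study's data; the intelligibility scores are from Table 1):

```python
import numpy as np
from scipy.stats import wilcoxon, spearmanr

# Placeholder keystroke savings with the full SSR, one value per participant.
ssr_savings = np.array([64.8, 66.1, 67.0, 67.8, 68.4, 69.3, 70.1])
wp_only = 59.6   # savings with word prediction alone (typical sentences)

# Wilcoxon signed-rank test on differences from the word-prediction-only value.
stat, p = wilcoxon(ssr_savings - wp_only)

# Spearman rank-order correlation between intelligibility (Table 1) and savings.
intelligibility = np.array([16.4, 45.0, 50.0, 74.6, 77.3, 82.7, 89.1])
rho, p_corr = spearmanr(intelligibility, ssr_savings)
print(stat, p, rho, p_corr)
```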


RESULTS

The results include an investigation of the keystroke savings using word prediction only compared to using the full SSR; the accuracy with which the SSR identified the target word (SSR system performance); and the relationships between keystroke savings, SSR performance, and the sentence intelligibility of participants. Data are reported for the typical sentences and atypical phrases.

Keystroke Savings with Word Prediction Only

For typical sentences, with the researcher optimally controlling the keyboard, the keystroke savings were 59.6% with word prediction only (no ASR was employed). For atypical phrases, the keystroke savings were 44% with word prediction only. These percentages represent the maximum keystroke savings possible using word prediction alone.

Keystroke Savings with Full SSR (ASR + Word Prediction)

Keystroke savings using the SSR for typical sentences averaged 68.2% (range = 64.8–70.1%) across all speaker participants when the researcher controlled the touch-screen computer optimally. Keystroke savings for atypical phrases averaged 67.5% (range = 60.4–73.9%) across all speaker participants under the same conditions. Figure 2 displays the individual keystroke savings results on each stimulus set for each speaker participant. There was a significant difference between the full SSR (Mdn = 67.8) and word prediction only (Mdn = 59.6) for typical sentences (Z = 2.366, p = 0.022). There was also a significant difference between the full SSR (Mdn = 64.32) and word prediction only (Mdn = 44) for atypical phrases (Z = 2.197, p = 0.035).

SSR System Performance

The word inserted into the message window of the SSR system reflected the system's most probable choice. For typical sentences, an average of 50.7% of the target stimulus words were displayed in the message window (range = 31.9–75.2%). For atypical phrases, an average of 51.2% of the target stimulus words were displayed in the message window (range = 27.4–73.4%).

As reported above, the most probable word was inserted into the message window on the SSR interface. Words that appeared on the word-prediction buttons represented the next five most probable word choices, ordered alphabetically on the SSR interface. To examine the performance of the full SSR, the percentage of target stimulus words that appeared either in the message window or on the word-prediction buttons was examined. For typical sentences, an average of 80.7% of the target stimulus words were displayed either in the message window or on one of the word-prediction buttons (range = 67.4–97.5%). For atypical phrases, an average of 82.8% of the target stimulus words appeared either in the message window or on one of the word-prediction buttons (range = 69.4–94.4%).

Relationships between Keystroke Savings and Intelligibility, and between SSR System Performance and Intelligibility

The relationships between keystroke savings and speech intelligibility did not reach statistical significance for typical sentences (r(5) = .750, p = .052) or atypical phrases (r(5) = .607, p = .148). There was not a statistically significant relationship between the percentage of target stimulus words displayed in the message window and intelligibility for typical sentences (r(5) = .750, p = .052), but there was a significant relationship for atypical phrases (r(5) = .757, p = .049). There were no statistically significant relationships between the percentage of target stimulus words displayed either in the message window or on the word-prediction buttons and intelligibility, for typical sentences (r(5) = .714, p = .071) or atypical phrases (r(5) = .500, p = .253). A review of these results suggests that, for speakers with dysarthria across a range of speech severity, the keystroke savings and system performance of the SSR prototype were similar across the different stimulus sets.

DISCUSSION

This preliminary investigation examined the impact of an assistive writing prototype designed for individuals with a range of dysarthria severity. This unique system incorporated a variety of components, with encouraging results. The specific components in this prototype system included ASR for dysarthric speech, first-letter identification (alphabet supplementation), and word prediction algorithms. To date, most assistive technology writing options have focused primarily on these components separately. Therefore, it is not possible to directly compare the current results to other ASR technologies based on dysarthric speech, because they do not include first-letter identification in addition to word prediction. Combining these strategies into a hybrid approach may be more efficient than ASR or word prediction alone for speakers with moderate to severe dysarthria.

This preliminary study demonstrated keystroke savings of 59.6% for typical sentences and 44% for atypical phrases using word prediction only (no ASR). While these results were similar to keystroke savings using word prediction in other reports (Lesher, Moulton, & Higginbotham, 1999), they cannot be compared directly to other keystroke savings research due to the differing stimuli and differences in dictionary size. However, keystroke savings were significantly higher when using the full system than with word prediction alone. Using the full SSR, the optimal keystroke savings averaged 68.2% and 67.5% for typical and atypical sentences, respectively.

This hybrid approach promoted a unique, dynamic adjustment in the core dictionary size


that led to the encouraging results. Coupling ASR with first-letter identification (alphabet supplementation) dynamically decreased the dictionary set from which the ASR system had to match the acoustic signal. This set was further reduced by the word prediction algorithms. Word prediction was included so that the system presented multiple options, maximizing the chances of the target word being easily accessible to the speaker. This dynamic reduction utilized several features that are common in assistive technology, as well as strategies with which speakers with dysarthria are familiar (i.e., alphabet supplementation). The dynamic reduction in the core dictionary resulted in significant keystroke savings for all participants, regardless of the severity of their dysarthria as measured by speech intelligibility. A review of these results revealed that, for the speakers whose dysarthria was so severe that they could not use their natural speech to meet their daily communication needs, keystroke savings were within a similar range to those of speakers with more functional natural speech abilities. The SSR prototype thus appeared to provide a significant impact on keystroke savings, regardless of speaker intelligibility, compared to word prediction alone.

While the results were encouraging for this small sample, use of this technology with larger numbers of individuals with dysarthria is required in order to generalize the potential benefit this technology may have for a larger population. In addition, future work with more participants should focus on the impact of dysarthria type on speech recognition performance. A greater range of text-generating tasks is needed to more fully understand how SSR may support functional written expression and face-to-face communication needs. Although this study used tasks that were much larger and less predictable than those used in previous reports (e.g., environmental control commands used by Hawley et al., 2007, 2009; engineering software commands used by Talbot, 2000; and numbers used by Caves et al., 2007), future research needs to investigate the use of the SSR technology with messages that would be characteristic of routine use to support writing and communication interaction.

Finally, these results, while encouraging, reflect optimal performance in that the researcher controlled the touch-screen computer. The first letter was always identified accurately, and the word-prediction buttons were selected immediately when the target word was displayed.


Due to the cognitive demands associated with using word prediction, individuals using the system may not make the same "optimal" selections during functional use of this technology. In future research, it will be important to require participants to physically control the interface, through either touch or head- or eye-tracking, as they use the SSR technology. Such investigations would determine whether the physical demands of interface control change speech characteristics in ways that interfere with speech recognition. In addition, they would allow an assessment of the impact of the cognitive load required to control the interface on overall system performance.

Acknowledgements

This project was supported by the Eunice Kennedy Shriver National Institute of Child Health & Human Development under grant number 1R43HD047044-01. The authors thank the participants for their involvement in this project.

Declaration of interest: This work was completed on the first prototype of the Supplemented Speech Recognition system funded under the NIH SBIR grant 1R43HD047044-01. Tom Jakobs of InvoTek, Inc. continues to develop this technology. The authors alone are responsible for the content and writing of this paper.

Notes

1. Dragon NaturallySpeaking by Nuance, http://www.nuance.com/naturallyspeaking/
2. SpeakQ by Quillsoft Ltd., 2416 Queen Street East, Toronto, Ontario M1N 1A2, Canada, http://www.wordq.com
3. Vmax by DynaVox Mayer-Johnson, 2100 Wharton Street, Suite 400, Pittsburgh, PA 15203, (866) 396-2869, http://www.dynavoxtech.com/products/v/
4. Toshiba Satellite R20, http://us.toshiba.com/computers/laptops/satellite
5. Andrea NC-7100 USB head-mounted microphone by Andrea Electronics, http://www.andreaelectronics.com

References

Beukelman, D. R., Garrett, K. L., & Yorkston, K. M. (2007). Augmentative communication strategies for adults with acute or chronic medical conditions. Baltimore: Brookes Publishing Co.
Blaney, B., & Wilson, J. (2000). Acoustic variability in dysarthria and computer speech recognition. Clinical Linguistics & Phonetics, 14(4), 307–327.
Cannito, M., Suiter, D., Chorna, L., Beverly, D., Wolf, T., & Pfeiffer, R. (2008). Speech intelligibility in idiopathic Parkinson's disease before and after amplitude therapy. American Speech-Language Hearing Association, 13, 105.
Caves, K., Boemler, S., & Cope, B. (2007). Development of an automatic recognizer for dysarthric speech. Proceedings of the RESNA Annual Conference, Phoenix, AZ.


Chen, F., & Kostov, A. (1997). Optimization of dysarthric speech recognition. Proceedings of the IEEE Engineering in Medicine and Biology Society Conference, Chicago, 1436–1439.
Coleman, C. L., & Meyers, L. S. (1991). Computer recognition of the speech of adults with cerebral palsy and dysarthria. Augmentative and Alternative Communication, 7, 34–42.
Darley, F. L., Aronson, A. E., & Brown, J. R. (1969). Differential diagnostic patterns of dysarthria. Journal of Speech and Hearing Research, 12, 246–269.
Ferrier, L. J., Jarrell, N., Carpenter, T., & Shane, H. C. (1992). A case study of a dysarthric speaker using the DragonDictate voice recognition system. Journal for Computer Users in Speech and Hearing, 8, 33–53.
Ferrier, L. J., Shane, H. C., Ballard, H. F., Carpenter, T., & Benoit, A. (1995). Dysarthric speakers' intelligibility and speech characteristics in relation to computer speech recognition. Augmentative and Alternative Communication, 11, 165–174.
Fried-Oken, M. (1985). Voice recognition device as a computer interface for motor and speech impaired people. Archives of Physical Medicine and Rehabilitation, 66, 678–681.
Green, P., Carmichael, J., Hatzis, A., Enderby, P., Hawley, M., & Parker, M. (2003). Automatic speech recognition with sparse training data for dysarthric speakers. Proceedings of the 8th European Conference on Speech Communication and Technology (Eurospeech'03), Geneva, Switzerland, 1189–1192.
Hanson, E. K., Beukelman, D. R., Heidemann, J. K., & Shutts-Johnson, E. (2009). The impact of alphabet supplementation and word prediction on sentence intelligibility of electronically distorted speech. Speech Communication, 52, 99–105.
Hanson, E., Yorkston, K., & Beukelman, D. (2004). Speech supplementation techniques for dysarthria: A systematic review. Journal of Medical Speech-Language Pathology, 12, 9–29.
Hatzis, A., Green, P., Carmichael, J., Cunningham, S., Palmer, R., Parker, M., & O'Neill, P. (2003). An integrated toolkit deploying speech technology for computer based speech training with application to dysarthric speakers. Proceedings of the 8th European Conference on Speech Communication and Technology (Eurospeech'03), Geneva, Switzerland, 2213–2216.
Hawley, M. S. (2002). Speech recognition as an input to electronic assistive technology. British Journal of Occupational Therapy, 65, 15–20.
Hawley, M. S., Enderby, P., Green, P., Brownsell, S., Hatzis, A., Parker, M., . . . Palmer, R. (2003). STARDUST: Speech training and recognition for dysarthric users of assistive technology. In G. M. Craddock et al. (Eds.), Assistive technology – Shaping the future (pp. 959–964). Amsterdam: IOS Press.
Hawley, M., Enderby, P., Green, P., Cunningham, S., Brownsell, S., Carmichael, J., . . . Palmer, R. (2007). A speech-controlled environmental control system for people with severe dysarthria. Medical Engineering & Physics, 29(5), 586–593.
Hawley, M. S., Enderby, P., Green, P., Cunningham, S., & Palmer, R. (2006). Development of a voice-input voice-output communication aid (VIVOCA) for people with severe dysarthria. Lecture Notes in Computer Science, 4061, 882–885.
Higginbotham, D. J. (1992). Evaluation of keystroke savings across five assistive communication technologies. Augmentative and Alternative Communication, 8, 258–272.

Hosom, J. P. (2009). Speaker-independent phoneme alignment using transition-dependent states. Speech Communication, 51(4), 352–368.
Hosom, J. P., Cole, R. A., & Cosi, P. (1998). Improvements in neural-network training and search techniques for continuous digit recognition. Australian Journal of Intelligent Information Processing Systems, 5(4), 277–284.
Hustad, K. C., Auker, J., Natale, N., & Carlson, R. (2003). Improving intelligibility of speakers with profound dysarthria and cerebral palsy. Augmentative and Alternative Communication, 19, 187–198.
Hustad, K. C., & Beukelman, D. R. (2001). Effects of linguistic cues and stimulus cohesion on intelligibility of severely dysarthric speech. Journal of Speech, Language, and Hearing Research, 44, 497–510.
Hustad, K. C., Jones, T., & Dailey, S. (2003). Implementing speech supplementation strategies. Journal of Speech, Language, and Hearing Research, 46, 462–474.
Hux, K., Rankin-Erickson, J. L., Manasse, N. J., & Lauritzen, E. (2000). Accuracy of three speech recognition systems: Case study of dysarthric speech. Augmentative and Alternative Communication, 16, 186–196.
Judge, S., Robertson, Z., Hawley, M., & Enderby, P. (2009). Speech-driven environmental control systems – a qualitative analysis of users' perceptions. Disability & Rehabilitation: Assistive Technology, 4, 151–157.
Koester, H., & Levine, S. (1998). Model simulation of user performance with word prediction. Augmentative and Alternative Communication, 14, 25–35.
Kotler, A., & Thomas-Stonell, N. (1997). Effects of speech training on the accuracy of speech recognition for an individual with a speech impairment. Augmentative and Alternative Communication, 13, 71–80.
Lesher, G. W., Moulton, B. J., & Higginbotham, J. (1999). Effect of ngram order and training text size on word prediction. Proceedings of the RESNA Annual Conference, Arlington, VA.
Magnuson, T., & Blomberg, M. (2000). Acoustic analysis of dysarthric speech and some implications for automatic speech recognition. TMH-QPSR, 41, 19–30.
Magnuson, T., & Hunnicutt, S. (2002). Measuring the effectiveness of word prediction: The advantage of long-term use. TMH-QPSR, 43, 57–67.
Manasse, N., Hux, K., & Rankin-Erickson, J. (2000). Speech recognition training for enhancing written language generation by a traumatic brain injury survivor. Brain Injury, 14(11), 1015–1034.
Omar, S., Morales, C., & Cox, S. J. (2009). Modeling errors in automatic speech recognition for dysarthric speakers. EURASIP Journal on Advances in Signal Processing. Retrieved August 31, 2009, from http://www.hindawi.com/journals/asp/2009/308340.html
Parker, M., Cunningham, S., Enderby, P., Hawley, M., & Green, P. (2006). Automatic speech recognition and training for severely dysarthric users of assistive technology: The STARDUST project. Clinical Linguistics and Phonetics, 20, 149–156.
Patel, R., & Roy, D. (1998). Teachable interfaces for individuals with dysarthric speech and severe physical disabilities. In Proceedings of the AAAI Workshop on Integrating Artificial Intelligence and Assistive Technology (pp. 40–47). Madison, WI: AAAI Press.
Polur, P. D., & Miller, G. E. (2005). Effect of high-frequency spectral components in computer recognition of dysarthric speech based on a Mel-cepstral stochastic model. Journal of Rehabilitation Research and Development, 42, 363–371.


Raghavendra, P., Rosengren, E., & Hunnicutt, S. (2001). An investigation of different degrees of dysarthric speech as input to speaker-adaptive and speaker-dependent recognition systems. Augmentative and Alternative Communication, 17, 265–275.
Rosen, K., & Yampolsky, S. (2000). Automatic speech recognition and a review of its functioning with dysarthric speech. Augmentative and Alternative Communication, 16, 48–60.
Rosengren, E. (2000). Perceptual analysis of dysarthric speech in the ENABL project. TMH-QPSR, 41, 13–18.
Sawhney, N., & Wheeler, S. (1999). Using phonological context for improved recognition of dysarthric speech. Retrieved October 13, 2009, from http://74.125.155.132/scholar?q=cache:fSFdtom_TMJ:scholar.google.com/+Sawhney+and+Wheeler+1999&hl=en
Talbot, N. (2000). Improving the speech recognition in the ENABL project. TMH-QPSR, 41, 31–38.
Trnka, K., Yarrington, D., McCoy, K., & Pennington, C. (2005). The keystroke savings limit in word prediction for AAC. Retrieved March 11, 2009, from http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.72.4944&rep=rep1&type=pdf
Venkatagiri, H. (1994). Effect of window size on rate of communication in a lexical prediction AAC system. Augmentative and Alternative Communication, 10, 105–112.
Venkatagiri, H. S. (2002). Speech recognition technology applications in communication disorders. American Journal of Speech-Language Pathology, 11, 323–332.
Wan, V., & Carmichael, J. (2005). Polynomial dynamic time warping kernel support vector machines for dysarthric speech recognition with sparse training data. Proceedings of INTERSPEECH, 3321–3324.


Yorkston, K., Beukelman, D., Hakel, M., & Dorsey, M. (2007). Sentence Intelligibility Test [Computer software]. Lincoln, NE: Madonna Rehabilitation Hospital.

APPENDIX A

Typical Sentences

1. My mom and dad leave on Monday.
2. She has been sick since Thursday.
3. They like being married.
4. I'll be back this Friday.
5. I've been on the go.
6. I'm looking for a big watch.
7. You can take a picture with the phone.
8. He is crazy about her.
9. Her kids play well together.
10. I do not have paper with me.

Atypical Phrases

1. not to look half bad
2. makes him sure money
3. a make believe movie about Kyle
4. their big idea
5. talk at this meeting
6. anything at that time
7. boy who wants this
