AAC Augmentative and Alternative Communication, Volume 10, September 1994. Copyright © 1994 by ISAAC. 0743-4618/94/1003-0191 $3.00/0

Discourse Comprehension of Synthetic Speech Delivered at Normal and Slow Presentation Rates

D. Jeffery Higginbotham, Anne L. Drazek, Kim Kowarsky, Chris Scally, and Erwin Segal

Department of Communicative Disorders and Sciences, Communication and Assistive Device Laboratory, State University of New York at Buffalo (D.J.H., A.L.D., K.K., and C.S.), and Department of Psychology, State University of New York at Buffalo, Buffalo, New York, USA (E.S.)

The purpose of this investigation was to determine the effects of the quality and speech presentation rate (SPR) of synthetic speech and of textual characteristics (length, complexity, genre) on a listener's ability to summarize paragraph-length texts. Forty able-bodied students and staff members were individually tested over a 3-day period, listening to eight texts produced by one of two synthesizers (DECtalk, Echo+) either at a normal SPR or with 10-second intervals of silence interspersed between individual words. Summaries were scored using a discourse summarization taxonomy developed for this study. Subjects listening to DECtalk speech produced more accurate summaries than did Echo listeners, and synthetic speech presented at a slow rate was summarized more accurately than synthetic speech presented at a normal SPR. Additionally, a significant three-way voice × SPR × text complexity interaction was noted: Echo listeners performed more poorly at normal versus slow SPRs regardless of text complexity level, whereas DECtalk listener performance declined only for complex texts presented at normal SPRs. Discussion focuses on the role of the above variables in the comprehension of voice output communication aids.

KEY WORDS: adults, augmentative and alternative communication (AAC), communication partners, discourse comprehension, speech synthesis, voice output communication aid (VOCA)

Recent advances in speech synthesis technology have resulted in the proliferation of voice output communication aids (VOCAs) to facilitate interpersonal communication. Greater availability and significantly lower costs of VOCA technologies now allow consumers to make choices about what voice output characteristics they need or desire (e.g., age and gender appropriate, natural, durable, cosmetically appropriate, low cost) (Vanderheiden & Lloyd, 1986; Vitale, 1991). Among the major criteria for selecting a VOCA is its ease of understanding by target listeners. Although substantial data on the intelligibility of speech synthesis systems are now available, comparatively little is known about how the use of VOCAs to convey messages via synthesized speech affects a listener's discourse comprehension (Logan, Greene, & Pisoni, 1989; Mirenda & Beukelman, 1987, 1990).1 Further, output characteristics specific to VOCA technologies, such as the slow overall communication rate, frequent use of spelled-out words, and periods of silence between selections, make it difficult to generalize from current research in synthetic speech comprehension to the augmentative communication context (Greene & Pisoni, 1988; Higginbotham, Mathy-Laikko, & Yoder, 1988; Light & Lindsay, 1990; Pisoni, Ralston, & Lively, 1990).

Sentence and Discourse Comprehension of Synthetic Speech

Comprehension of sentences and narratives has been found to be slower and less accurate when materials are presented with synthetic versus natural speech (Pisoni, Manous, & Dedina, 1987; Ralston, Pisoni, Lively, Greene, & Mullenix, 1991; Talbot, 1987). Using a sentence verification task paradigm, Pisoni et al. (1987) collected data on response latency, sentence verification accuracy, and transcription accuracy for sentences presented via natural speech, DECtalk,2 Prose 3.0,3 Infovox,4 and Votrax5 speech synthesizers.

1 For the purpose of this paper, the term discourse comprehension will refer to a listener's ability to construct a representation of discourse upon which various mental operations can be performed, such as recall and inference making (based on Kintsch, 1988).
2 DECtalk is a trademark of the Digital Equipment Corporation.
3 Prose synthesizers are a product of Telesensory Systems, Inc.
4 Infovox was developed by the Swedish Institute of Technology.
5 Votrax is a product of Votrax, Inc.

For all three measures, natural speech was found to be equal to or better than high-quality speech synthesizers (DECtalk and Prose), which were generally better than moderate- to low-quality speech synthesizers (Infovox and Votrax). For the speech synthesis conditions, response times for false sentences were slower than for true sentences. In addition, the synthetic speech sentences were more likely to be incorrectly verified and transcribed than sentences presented using natural speech. From these results, Pisoni et al. (1987) hypothesized that decrements in the segmental intelligibility of synthetic speech require additional processing resources to be allocated for accurate perception of the speech signal and result in decreased comprehension speed. Increases in linguistic complexity (e.g., sentence length, true/false judgments) also compete for the limited processing resources. If such resources are not available because of perceptual processing demands, comprehension errors will result.

Ralston et al. (1991) found that subjects listening to extended texts produced by a Votrax speech synthesizer demonstrated significantly longer listening times and more word and proposition recognition memory errors than subjects who heard natural speech discourse. An interaction between word and proposition recognition by voice type led Ralston et al. to suggest that, because of the degraded quality of the speech signal relative to natural speech, listeners expend more of their processing resources on speech perception than on comprehension processing, resulting in better recognition of words than propositions. Conversely, subjects listening to natural speech are able to devote proportionately more resources to comprehension processing needs.

Discourse Processing of VOCAs

The discourse processing research presented above has substantial implications for the comprehension of VOCA-produced messages. Listening to low-quality synthetic speech not only increases the amount of perceptual-level processing required to parse the acoustic-phonetic characteristics of the synthesized speech signal, but this additional processing effort may also restrict syntactic parsing, inferencing, and other higher-level comprehension processes (Pisoni et al., 1987; Talbot, 1987). One could also speculate that if, at certain times, the listener focuses on higher-level needs, then the processing of incoming phonetic information may be compromised. In either case, comprehension suffers, leading to misunderstanding.

Additional questions arise in the case of augmentative communication with respect to overall communication rate and the organization of the communication output. Current estimations of the overall rate at which

spontaneous communications are produced by the augmented communicator range from approximately 1 to 10 words per minute, depending on user and device characteristics and communication context (Foulds, 1980). This 10- to 200-fold decrease in communication rate compared to that of natural speakers (150–200 wpm) may substantially alter the listener's ability to comprehend the text. Increased silent intervals could facilitate comprehension by allowing more processing time for both perceptual and higher-level cognitive processes (Venkatagiri, 1991). However, if the overall presentation rate of the text is too low, attention to the speech may wane, jeopardizing successful comprehension (Light & Lindsay, 1990).

Another VOCA output factor that may affect comprehension is the output method used to produce the spoken message (Mathy-Laikko, 1992). As a result of the particular technical characteristics of a given VOCA technology, output method may take one or more of the following prototypic message forms:

1. Word method: a sequence of spoken words, with each word separated by a period of silence.
2. Sentence method: sentences separated by periods of silence.
3. Letter method: spoken spellings of words (e.g., "J-E-L-L-O"). Each letter, word, etc., is separated by a period of silence.6
4. Combined method: spoken spellings of words intermixed with spoken words ("Is–G-O-O-D") and/or word-level repetitions of spelled words, phrases, etc.
5. Phoneme method: words spoken as sequences of phonemes (e.g., /d/–/ε/–/l/–/o/) separated by periods of silence (represented here by hyphens). Phonemes may be intermixed with other spoken words, phrases, etc.

Mathy-Laikko (1992) examined the auditory comprehension and processing times of subjects taking a modified version of the Revised Token Test in which the instructions were delivered by either the letter, word, or sentence methods using a DECtalk speech synthesizer. The speech presentation rate (SPR) was set to emulate actual AAC output rates (7.5 wpm). Mathy-Laikko found higher comprehension scores for the word and letter methods, compared to the sentence method, and task completion times were related to both output method and message complexity.

6 Silent periods typically reflect the amount of time taken by the communicator to formulate a message and to physically access the message components.


Thus, when combined with the excessively slow overall communication rates typical of augmentative communicators, the particular output method may substantially affect a listener's ease and ability to process the VOCA's output and comprehend the speaker's intended communication. Because of these unique output characteristics associated with VOCA use, more research is needed to investigate how differences in VOCA quality, rate, and output structure interact with discourse characteristics (e.g., genre, topic, length, complexity) to affect a listener's discourse comprehension performance.

The practical effect of synthetic speech on discourse comprehension is still not well understood, as little work has been done to determine what is actually comprehended by the listener subsequent to discourse presentation. Previous attempts to determine the quality of a listener's discourse comprehension via written recall of the synthesized passages have produced mixed results and have been criticized for the use of inadequate analytic techniques (Jenkins & Franklin, 1982; Lively, Ralston, Pisoni, & Rivera, 1990; Luce, Feustel, & Pisoni, 1983; Raghavendra & Allen, 1993). However, sensitive and reliable techniques for the holistic evaluation of writing proficiency have been developed (Educational Testing Service, 1992). Such procedures could be adapted to provide a functional rating of a listener's recall of a synthesized discourse (i.e., level of summarization accuracy) and to determine the practical consequences of various VOCA output characteristics on discourse comprehension.

The results from research on the discourse comprehension of VOCA-produced speech should further our understanding of augmentative communication processes, help pinpoint the types of modifications required to improve present-day VOCA technologies, and direct the kinds of clinical interventions needed to overcome these technological constraints and permit appropriate and effective communication. It is success at the discourse level of communication (e.g., stories, explanation, conversation) that underlies the social accomplishments of the augmented communicator.

The purpose of this study is to adapt current research methodologies in comprehension processing to determine how various output characteristics of VOCAs affect the comprehension of discourse. Specifically, this study examines the effects of two aspects of VOCA output (voice quality, presentation rate) and three aspects of discourse (text length, text complexity, familiarity of content). To accomplish this goal, a set of categories and procedures for evaluating the quality of listeners' written summaries was developed to determine the practical significance of VOCA- and text-related factors on discourse comprehension.


METHODS

Subjects

Forty university students and staff from the State University of New York at Buffalo served as subjects for this study. Subjects ranged from 18 to 38 years of age (M = 21; SD = 4.13) and were evenly represented across genders. Peabody Picture Vocabulary Test-Revised (Form M) (Dunn & Dunn, 1981) scaled scores ranged from 89 to 141 (M = 107.4; SD = 12.3). All subjects passed a pure-tone screening (25 dB SPL at 0.25, 0.5, 1, 2, and 4 kHz) (ANSI, 1969) and spoke English as their first language.

Materials and Instrumentation

Text Materials. The eight paragraphs used in this study were originally developed by Kintsch, Kozminsky, Streby, McKoon, and Keenan (1975) for experimental studies in the comprehension of spoken and written discourse. As shown in Table 1, the stimulus texts were controlled across several linguistic parameters, which have been shown by Kintsch and his colleagues to affect text comprehension and recall. Texts were classified as being short or long (text length) according to the total number of individual propositions (i.e., word concepts, referring expressions) appearing in each paragraph. Text complexity refers to the number of different arguments or unique referents appearing in each paragraph. Texts were further divided according to topical familiarity (text familiarity). Half of the paragraphs were relatively familiar narrative passages taken from a children's history book, whereas the other half were less familiar expository paragraphs obtained from Scientific American.

DECtalk II7 and the Echo Plus8 speech synthesizers were used to generate the speech stimuli used in this investigation (voice condition). The DECtalk was chosen because of its high degree of segmental intelligibility and its wide use in VOCAs. The Echo Plus synthesizer was chosen because of its relatively low level of segmental intelligibility and its widespread use in education and disability-related fields. The Perfect Paul default voice setting was used for all DECtalk output, and the default settings were used for all Echo speech output. Individual words (slow SPR) and sentences (normal SPR) were first typed into text files and then input through a speech synthesizer.

7 The portable DECtalk II synthesizer was developed by the Institute on Applied Technology, Children's Hospital, 300 Longwood Avenue, Fegan Plaza, Boston, MA 02115, and is an adaptation of the commercial DECtalk III speech synthesis technology.
8 Echo is a registered trademark of Street Electronics Corporation.

TABLE 1: Classification Factors and Quantitative Descriptors for the Eight Experimental Texts

Text Name     Text Length   Text Complexity   Text Version   Number of Words   Number of Propositions   Number of Arguments
Greek         Short         Simple            History        21                8                        3
Babylon       Short         Complex           History        23                8                        8
Tyros         Short         Simple            Science        20                8                        3
Turbulence    Short         Complex           Science        22                8                        7
Joseph        Long          Simple            History        66                23                       7
Assyria       Long          Complex           History        67                24                       17
Astroids      Long          Simple            Science        70                25                       6
Comets        Long          Complex           Science        71                25                       15

These data were obtained from Kintsch et al. (1975).

The synthesized speech output was passed through a 10-kHz low-pass filter and then digitized using an 8-bit Farallon SoundRecorder9 A-D converter at a 22-kHz sampling rate. All digitized stimuli were stored as resource files on a Peripheral Land, Inc. (PLI) 650-megabyte magneto-optical hard drive for later presentation. Delivery of the digitized speech stimuli was controlled by a Macintosh II microcomputer, using software designed to control the presentation rate of single words and sentences stored on the PLI drive10 (Higginbotham, 1991). Paragraph materials were delivered to subjects at two SPRs. For the normal SPR condition, speech stimuli were delivered at a within-sentence SPR of 149 wpm for DECtalk and 128 wpm for Echo. The slow SPR condition was designed to simulate the slow output rate of many VOCA communicators. To accomplish this, a 10-second silent interval followed each spoken word, resulting in a within-sentence SPR of 5.6 wpm for DECtalk and 5.75 wpm for Echo. Verbal prompts were placed at the beginning and end of each paragraph ("Start listening now," "The paragraph is now over") to facilitate the task.

9 SoundRecorder is a trademark of Farallon Computing, Inc., 2201 Dwight Way, Berkeley, CA 94704.
10 The digitized synthetic speech samples and related software developed for this study are available by writing the first author.
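As an illustrative arithmetic check (not part of the original report), the slow-condition SPRs follow approximately from adding the 10-second pause to the per-word duration implied by the normal-rate figures:

\[
\text{slow SPR} \approx \frac{60\ \text{s/min}}{60/\text{normal SPR} + 10\ \text{s}}
\quad\Rightarrow\quad
\frac{60}{60/149 + 10} \approx 5.8\ \text{wpm (DECtalk)},\qquad
\frac{60}{60/128 + 10} \approx 5.7\ \text{wpm (Echo)}.
\]

These values are close to the reported 5.6 and 5.75 wpm; the small differences presumably reflect the actual word durations in the recorded stimuli rather than the sentence-mode averages assumed here.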

Procedure

During testing, the experimenter sat in the control area while the subject was seated on the other side of an opaque partition. The speech stimuli were presented via an amplifier/speaker system (i.e., a Realistic SA-150 amplifier and Realistic Minimus-7 speakers11), which was positioned at head level on either side of and approximately 2.5 feet away from the seated subject. All speech stimuli were delivered at approximately 30 dB SPL above the ambient noise level of the room, which was assessed using a Brüel and Kjær sound level meter (model 1613). Subjects were allowed to modify the loudness level upward to suit their listening needs.

Listeners were randomly assigned to one of the four experimental conditions (i.e., DEC-normal SPR, DEC-slow SPR, ECHO-normal SPR, ECHO-slow SPR). They were tested one at a time for three 1-hour sessions, with each session taking place on a different day over a 5-day period. On the first day, subjects were screened for hearing and language comprehension ability and took two speech intelligibility tests and a short-term memory test. Subjects then practiced listening to and summarizing three practice paragraphs. The first practice paragraph was presented to all subjects at the normal SPR, and a copy of the text was provided to facilitate word recognition. Subjects were then presented with two additional practice paragraphs (no accompanying texts provided) at an SPR appropriate for the subject's group. During sessions 2 and 3, the eight texts were randomly presented for summarization, four per session. In all sessions, the subjects received the following instructions:

    You will listen to a paragraph composed of synthetic speech. It will be spoken at a normal/slow speaking rate. After the paragraph has been presented, please write down what you remember about the paragraph. Be as complete as possible; however, exact recall is not necessary. Prior to the start of the paragraph, you will be prompted to begin listening. You will also be informed when the paragraph has ended. Please start writing as soon as the paragraph has finished.

11 The Realistic amplifier and speakers are produced by Radio Shack Corp.

No time limit was imposed on the recall task, and subjects were provided with short breaks between tasks. The experimenters attempted to answer all questions posed by subjects during the course of the experiment. When questions pertained to their written summaries, the experimenters responded by rephrasing the task directions. Each written sample was later transcribed by a research assistant into a database, preserving the spelling as well as the word and letter spacing and organization of the original text. Transcription accuracy was verified by two additional research assistants.

Qualitative Rating of Discourse Summaries (QRDS). A secondary goal of this investigation was to develop a rating method for analyzing a listener's (or reader's) summarization of paragraph-level discourse. The QRDS was designed to provide a simple and reliable means for assessing the overall consistency of the listener's written summaries of previously heard or read texts. The general structure of this taxonomy relies on previous holistic scoring methods used to evaluate writing quality (Educational Testing Service, 1992; Oller, 1979), discourse analysis (Harris, 1987; Keenan & Schiefflen, 1976; Riley, 1980), consensus-making procedures (Educational Testing Service, 1992; Shriberg, Kwiatkowski, & Hoffman, 1984), and behavioral taxonomy development (Herbert & Attridge, 1975). The QRDS was developed to resemble the overall judgments that the average speaker might make about a listener's comprehension of a spoken or written text, based on the listener's reiteration of that text. Four ordinally related levels of summarization quality were developed to provide an exhaustive and mutually exclusive set of categories for evaluating summarization quality. These categories included full summary (i.e., "I was completely understood"), partial summary (i.e., "My listener understood most of my message"), changed summary (i.e., "My listener misinterpreted my message"), and fragmented summary (i.e., "My listener was unable to make sense of my communication"). The full definitions and defining features of each summarization category are provided in the Appendix.12 To verify the ordinal relationship of the QRDS categories, five master's- and doctoral-level speech-language pathologists were asked to rank order the individual categories (Appendix, titles removed) in terms of their respective summarization quality. All five individuals ranked the categories in the intended order (i.e., full, partial, changed, fragmented).

12 Complete definitions and rating protocol for the QRDS are available by contacting the first author.
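To make the rating scheme and the agreement statistics reported in the next section concrete, the sketch below (not from the original study) codes the four QRDS categories numerically and computes simple agreement indices for two hypothetical raters. The 4-to-1 coding is an assumption made here for illustration; it is consistent with the range of summarization scores reported in Table 2 but is not stated explicitly in the article, and the rating data are invented.

# Hypothetical illustration of QRDS scoring and inter-rater agreement.
# The numeric coding (fragmented = 1 ... full = 4) is assumed, not taken
# from the article; the rater data below are invented for the example.
from collections import Counter

QRDS_SCORE = {"fragmented": 1, "changed": 2, "partial": 3, "full": 4}

def mean_summarization_score(categories):
    """Average numeric score for one listener's set of rated summaries."""
    return sum(QRDS_SCORE[c] for c in categories) / len(categories)

def percent_agreement(rater_a, rater_b):
    """Proportion of texts assigned the same category by both raters."""
    return sum(a == b for a, b in zip(rater_a, rater_b)) / len(rater_a)

def cohens_kappa(rater_a, rater_b):
    """Chance-corrected agreement between two raters (Cohen's kappa)."""
    n = len(rater_a)
    p_obs = percent_agreement(rater_a, rater_b)
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    p_exp = sum(freq_a[c] * freq_b[c] for c in QRDS_SCORE) / (n * n)
    return (p_obs - p_exp) / (1 - p_exp)

# Toy example: eight texts independently classified by two raters.
rater_1 = ["full", "full", "partial", "partial", "changed", "full", "fragmented", "partial"]
rater_2 = ["full", "partial", "partial", "partial", "changed", "full", "fragmented", "changed"]

print(mean_summarization_score(rater_1))            # 3.0
print(percent_agreement(rater_1, rater_2))          # 0.75
print(round(cohens_kappa(rater_1, rater_2), 2))     # 0.65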


Rater Training and Text Classification Procedure. Two graduate students in communicative disorders and the project director (first author) served as raters for this study. All raters had classified transcripts for previous investigations and had collaborated on the design of the QRDS protocol (Higginbotham, Drazek, & Sussman, in review; Higginbotham, Lundy, & Scally, 1993, in review). Text classification was accomplished in the following manner. First, raters independently classified each transcribed text according to the established definitions and rating protocol. Then, using a structured consensus procedure, the raters met as a group and resolved discrepant ratings by reviewing individual ratings against published definitions, central examples, and guidelines, and discussing each judge's rationale for providing a particular rating. All raters were purposely kept uninformed as to each text's associated experimental condition in order to control for experimenter bias during classification. Inter-rater agreement was determined by calculating the ratio of the number of agreements between each pair of judges before structured consensus over the total number of texts. Inter-rater agreement between independent classifications averaged 85%, with an average kappa coefficient of .84 (Suen & Ary, 1989). Using the structured consensus process, agreement was achieved for all but two texts; ratings for these last two texts were decided by majority vote.

Design and Analysis

The two between-subjects variables of voice (DEC/Echo) and SPR (normal/slow) were combined with the three within-subjects variables of text length (short/long), text complexity (simple/complex), and text version (history/science) to form a 2 × 2 × 2 × 2 × 2 mixed-model design. All statistical analyses were performed using JMP statistical software (SAS Institute, 1989). Between-subjects comparisons were analyzed using the subject (voice, SPR) error term, whereas within-subjects comparisons utilized the residual error term. Four- and five-way interactions were not analyzed in this study. The probability of committing a Type I error was set at p < .05.
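For readers who want to see how such a design could be specified with current open-source tools, the following sketch is offered as a rough, hedged illustration only: the original analysis used JMP, the column names and file name here are assumptions, and a random-intercept linear mixed model only approximates (it does not reproduce) the error structure of the original mixed-model ANOVA.

# Hypothetical re-analysis sketch; not the original JMP analysis.
# Assumes a long-format file with one row per subject x text, containing
# the QRDS summarization score and the five design factors as columns.
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("qrds_scores.csv")  # hypothetical data file

model = smf.mixedlm(
    "score ~ voice * spr * length * complexity * version",
    data=df,
    groups=df["subject"],  # random intercept for each listener
)
result = model.fit()
print(result.summary())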

RESULTS

Table 2 presents the weighted means and standard deviations for summarization scores across experimental conditions. Table 3 presents the results of the mixed-model ANOVA. The summarization scores were first analyzed across experimental sessions to determine whether discourse comprehension was affected by short-term experience. Using a within-subjects ANOVA, no significant difference was noted for the sessions variable, nor did it interact with Voice or SPR (F[1, 275] < 1 for all comparisons), indicating that short-term experience did not appreciably affect subjects' ability to recall the texts. Because of the lack of these significant effects, the sessions variable was not considered in the overall analysis of variance.

TABLE 2: Weighted Means and Standard Deviations for Voice, SPR, and Text Length, Complexity, and Version

                         Length                     Complexity                 Version
Voice, Rate              Long         Short         Simple       Complex       History      Science      Summary (Rows)
DECtalk, Slow            3.35 (0.74)  3.56 (0.72)   3.46 (0.72)  3.45 (0.75)   3.72 (0.56)  3.20 (0.79)  3.46 (0.73)
DECtalk, Normal          3.12 (0.72)  3.03 (0.77)   3.33 (0.73)  2.83 (0.68)   3.28 (0.55)  2.88 (0.85)  3.08 (0.74)
Echo+, Slow              2.75 (0.87)  2.67 (0.94)   2.78 (1.00)  2.65 (0.80)   2.25 (0.81)  2.55 (0.93)  2.71 (0.90)
Echo+, Normal            2.12 (0.79)  2.00 (0.82)   2.00 (0.93)  2.12 (0.65)   2.25 (0.81)  1.88 (0.76)  2.06 (0.80)
Summary Voice: DECtalk   3.24 (0.73)  3.29 (0.79)   3.39 (0.72)  3.14 (0.78)   3.49 (0.60)  3.04 (0.83)  3.26 (0.76)
Summary Voice: Echo      2.44 (0.88)  2.34 (0.94)   2.39 (1.04)  2.39 (0.77)   2.56 (0.88)  2.21 (0.91)  2.39 (0.91)
Summary SPR: Slow        3.05 (0.86)  3.11 (0.95)   3.11 (0.93)  3.05 (0.87)   3.29 (0.83)  2.88 (0.92)  3.08 (0.90)
Summary SPR: Normal      2.62 (0.91)  2.51 (0.94)   2.66 (1.07)  2.47 (0.75)   2.76 (0.86)  2.38 (0.95)  2.57 (0.92)
Summary (Columns)        2.84 (0.90)  2.81 (0.99)   2.89 (1.02)  2.76 (0.86)   3.03 (0.89)  2.62 (0.96)

Standard deviations are shown in parentheses.

Listeners summarized texts produced by DECtalk significantly better than those produced by the Echo synthesizer (F[1, 36] = 28.54, p < .0001), with almost a 1-point difference separating the two voices. This difference is notable since the average DECtalk score of 3.26 lies within the interval bounded by full and partial summaries, whereas the average Echo score of 2.39 falls into the partial to changed summarization category range. Listeners summarized slowly presented texts significantly better than those presented at a normal SPR (F[1, 251] = 9.79, p < .003), with approximately a one-half-point difference between normal (2.57) and slow (3.08) SPR. A main effect was also found for text familiarity, with history texts summarized better than science texts by approximately one-half point (F[1, 36] = 36.92, p < .001). No significant main effects were found for the text length or text complexity variables.

A significant three-way interaction was observed for text complexity × voice × SPR (F[1, 251] = 8.09, p < .005). Utilizing tests of simple interaction effects, a significant interaction was found for the difference between voices across text complexity levels in the normal SPR condition (tnormal[1, 251] = 2.68, p < .008) (Fig. 1). However, in the slow SPR condition, no difference was noted for the same contrast (tslow[1, 251] < 1). Both DECtalk and Echo speech appeared relatively stable across complexity levels at slow SPRs, whereas the average summarization scores declined by over half a point across complexity levels at the normal SPR for both the Echo group and the complex DECtalk group. These data suggest that at slow SPRs, text complexity has little effect on a listener's text summarization abilities. However, at a normal SPR, subjects'

summarization abilities decline as a function of text complexity. For DECtalk speech, summarization scores declined only for those subjects listening to complex texts presented at normal rates. For Echo speech, listeners performed more poorly at normal versus slow SPRs, regardless of the text complexity level.

A length × version × complexity interaction was also found (F[1, 241] = 5.172, p < .024). Analysis of the simple means revealed more accurate recall for short, simple history versus short, simple science texts (tShortSmpl[1, 251] = 5.85, p < .001); long, simple history versus long, simple science texts (tLongSmpl[1, 251] = 2.09, p < .038); and long, complex history versus long, complex science texts (tLongCplx[1, 251] = 2.47, p < .015); but not for short, complex history versus short, complex science texts (tShortCplx[1, 251] = 1.71, p = .088).

Descriptive Analysis of the Voice × SPR × Summarization Categories

Although no significant voice × SPR interaction effect was found, a visual analysis of the distribution of the judges' ratings across summarization categories (Fig. 2) displays how listeners' summarization performance was systematically affected by these variables. Inspection of DECtalk across SPR groups shows a shift away from full summaries, which typified the slow SPR group, toward more partial and changed summaries for the normal rate group. The Echo slow SPR group displayed a summarization profile similar to that of the DECtalk normal SPR group, but with a somewhat higher percentage of fragmented summaries. For the normal SPR level, however, 70% of the Echo listeners' texts were rated as either changed or fragmented, and only 3% were rated as full summaries.

TABLE 3: Analysis of Variance

Source                                   SS        MS        DF    F Ratio   Probability
Voice                                    61.036    61.036    1     28.538    .000
SPR                                      20.946    20.946    1     9.793     .004
Voice-SPR                                1.502     1.502     1     0.702     .407
Subject (Voice, SPR)                     77.026    2.140     36    6.189     .000
Length                                   0.049     0.049     1     0.142     .707
Version                                  12.762    12.762    1     36.915    .000
Length-Version                           1.249     1.249     1     3.613     .059
Complexity                               1.249     1.249     1     3.613     .059
Length-Complexity                        1.241     1.241     1     3.590     .059
Version-Complexity                       0.800     0.800     1     2.314     .129
Length-Version-Complexity                1.798     1.798     1     5.200     .023
Length-Voice                             0.451     0.451     1     1.303     .255
Version-Voice                            0.201     0.201     1     0.581     .447
Length-Version-Voice                     0.794     0.794     1     2.296     .131
Complexity-Voice                         1.249     1.249     1     3.613     .059
Length-Complexity-Voice                  0.201     0.201     1     0.581     .447
Version-Complexity-Voice                 1.241     1.241     1     3.590     .059
Length-Version-Complexity-Voice          1.788     1.788     1     5.172     .024
Length-SPR                               0.613     0.613     1     1.773     .184
Version-SPR                              0.013     0.013     1     0.037     .847
Length-Version-SPR                       0.012     0.012     1     0.035     .852
Complexity-SPR                           0.309     0.309     1     0.895     .345
Length-Complexity-SPR                    0.113     0.113     1     0.328     .568
Version-Complexity-SPR                   0.111     0.111     1     0.321     .572
Length-Version-Complexity-SPR            0.313     0.313     1     0.906     .342
Length-Voice-SPR                         0.313     0.313     1     0.906     .342
Version-Voice-SPR                        0.113     0.113     1     0.328     .568
Length-Version-Voice-SPR                 3.592     3.592     1     10.389    .001
Complexity-Voice-SPR                     2.795     2.795     1     8.086     .005
Length-Complexity-Voice-SPR              0.309     0.309     1     0.895     .345
Version-Complexity-Voice-SPR             0.313     0.313     1     0.906     .342
Length-Version-Complexity-Voice-SPR      2.099     2.099     1     6.072     .014
Residual Error                           86.774    0.346     251

DISCUSSION

Summary of Results

In terms of identifying factors affecting comprehension processing, synthesized voice quality, SPR, and text complexity were found to play significant roles. For voice quality, 54% of all Echo summaries were significantly misinterpreted (changed + fragmented categories), compared to only 18% for DECtalk. This finding suggests that perceptual-level differences in synthesized voice quality (e.g., segmental intelligibility, prosodic characteristics) may substantially affect a listener's ability to accurately understand synthetically produced texts. Such a finding is supported by Ralston et al. (1991), who found a moderate relationship between intelligibility scores and comprehension processing measures across different speech synthesizers.

Summarization performance was improved when spoken words were separated by 10-second periods of silence, suggesting that the silent interval afforded listeners additional time to process the speech signal,


rehearse the text, determine the relationship of the word to its linguistic context, and/or elaborate upon their understanding of the text. With respect to the processing issue, these results were consistent with those of Venkatagiri (1991), who found significant increases in transcription accuracy for Echo II-produced sentences with a 250-msec silent interval between words. Pisoni et al. (1987, 1990) and Lively et al. (1990) also have shown that longer processing times are required for synthetic speech compared to natural speech. The better comprehension of slow SPR (word) versus normal SPR (sentence) presentation was also consistent with Mathy-Laikko's (1992) results, in which subjects received higher scores on the Token Test when completing instructions delivered via the word method versus the sentence output method.

The significant interactions of text complexity with voice and SPR also presented evidence for the effect of high-level processing demands on summarization performance. The finding that summarization scores for DECtalk were lowered only for complex texts presented at a normal SPR supports the notion that both low- and high-level cognitive processes compete for a limited pool of resources, and that when processing demands exceed available resources, accurate comprehension becomes problematic. The significant difference between history and science texts also substantiates the influence of high-level processing demands on comprehension, since frameworks for representing history and narrative-based materials are generally more accessible to listeners than is science information expressed through expository discourse genres (Kintsch et al., 1975).

Figure 1. Plots of summarization score means for the interaction between voice, SPR, and text complexity.

Figure 2. Histogram matrix of listeners' summarization scores across voice by SPR.

Discourse Comprehension and Augmentative Communication

The results of the summarization analysis support the model of discourse comprehension processing described earlier (Kintsch, 1988; Pisoni et al., 1990; Ralston et al., 1991). Listeners engage in a constructive process of decoding and interpreting the speech signal within the context of a rich, representational framework. Higher order processing allows the listener to make use of prior discourse and topical and world knowledge to construct meaningful interpretations. However, difficulty performing perceptual analysis of the speech signal or increases in text-based complexity necessitate reallocation of processing resources, at times jeopardizing comprehension.

For augmentative communication, the implications of these results are substantial. First, the relatively poor summarization performance of Echo listeners compared to their DECtalk counterparts across SPR and text complexity conditions reinforces the need for high-quality speech synthesis for social communication purposes. If the Echo group performance is indicative of the performance of other low and


moderately intelligible speech synthesis devices, then users of these technologies are at considerable risk of being misunderstood by their listeners, especially if the listener is relatively unfamiliar with synthesized speech. However, differences in the speech output rate and method (e.g., word by word, spelled words, sentence output) of the augmentative communication device may have a considerable effect on comprehension, particularly for low- to moderate-quality synthetic speech. Because of the demonstrated difficulties in processing synthetic speech approximating natural speaking rates, it may be beneficial for comprehension purposes if devices were designed to slightly delay the output of each word when presented in a phrase or sentence context. The determination of an optimal output delay cannot be made based on this investigation, and further research into this area is required. As demonstrated by this study, significant gains in comprehension can be achieved through the word method, even with significant delays between words. Clinicians may also consider training VOCA users to employ the word output method, particularly if listener miscomprehension is an issue. Again, further work is needed to determine the extent to which these variables improve comprehension for experienced listeners or during social interaction (Drazek, 1992; Higginbotham, Drazek, & Sussman, in review; Scally, 1993).

The QRDS summarization procedure and analyses used in this study provide a relatively simple and credible means for assessing the overall representation of just-heard texts, and the procedure appears capable of discriminating between voice types, SPRs, and levels of linguistic complexity. Notably, the results from this study generally correspond to the body of research involving natural and synthetic speech comprehension presented above, supporting the validity of this global rating method. Although other approaches to comprehension assessment, such as response time and question/answer measures, may provide insights into certain aspects of cognitive processing, the text summarization procedure allows the analyst to gain insight into the listener's understanding of the text through his or her written or spoken recall.

In this investigation, our goal was to provide a global measure of text comprehension. However, the data provided by the listeners' discourse recall can also inform us as to the perceptual, linguistic, and representational basis for certain types of mishearings and misunderstandings, which are so problematic for augmentative communicators and their partners. This could be accomplished by characterizing the types of discrepancies that occurred between the original and recalled texts, such as the intrusion of novel words and semantic changes, as well as the omission of words and sentences. For example, for those texts that were substantially misinterpreted,


listeners usually produced alternate versions based upon misheard words and the intrusion of general knowledge structures. Table 4 illustrates these phenomena. In text 0636, the listener's phonetic substitutions of "lark" for "art," "was" for "loved," and "reeds" for "Greeks" resulted in a rather bizarre interpretation of the text. In text 5035, the listener, after mishearing "leaves" for "Greeks," attempted to incorporate the word "beautiful" with it, resulting in the sentence, "The leaves were beautiful there." Some listeners appeared to construct their interpretations of the texts based on the sense that they could make of the words recovered. For instance, in text 0939, the listener correctly recognized the word "they" but used it to refer to "larks" instead of "Romans," which was not recovered. The original word "learned" was interpreted as "turn," conforming to the behavior patterns of a lark, with "woven pattern" being the result of the lark's turning movements. Such interpretations reflect the listener's application of linguistic and world knowledge for producing a meaningful interpretation of a problematic text.

In addition to the omissions and substitutions of phonemes, lexical items, phrases, etc., the summarizations for the second text ("Joseph") illustrate how specific knowledge about a topic can be used to "fill in" for missed information.

TABLE 4: Examples of Listeners' Text Summaries (Echo Groups)

Stimulus Paragraph A
The Greeks loved beautiful art. When the Romans conquered the Greeks, they copied them, and thus, learned to create beautiful art.

Changed Summaries
0636: There was a beautiful lark / The reeds were asked to create the beautiful lark
0930: the rem of the lark as they . . . turn creates a beautifully woven pattern
5035: The leaves were beautiful there. Somebody asked something.

Stimulus Paragraph B
Although Joseph was a slave in Egypt and it was difficult to rise from the class of slaves to a higher one, Joseph was so bright that he became a ruler in Egypt. Joseph's wicked brothers, who had once planned to kill him, came to Egypt in order to beg for bread. There they found that Joseph had become a great ruler.

Changed Summaries
0626: Joseph, the son of a Hebrew slave, was found by an Egyptian princess floating in a basket in the Nile River. She raised Joseph as her son and he grew to become a prince in Egypt.
0232: John was a ruler of Egypt and had become a great ruler. I'm assuming he was raised up to this "great" ruler. He wasn't that good in the beginning.


In text 0626, the listener recognized the passage as referring to a biblical story but mistakenly integrated the story of Moses into his or her summarization. It is interesting to note how various terms from the original text, such as "Egypt," "slave," and "rise/raised," are changed in the summarization, reflecting a substantially different interpretation provided by the listener compared to the original text. The examples provided here are indicative of the data collected during this investigation. Listeners, by and large, provided meaningful interpretations of the texts, attempting to integrate mishearings and misinterpretations into their written renditions, and used world and specific knowledge, where needed, to fill in the gaps. Current research efforts in our laboratory are directed toward providing more comprehensive qualitative analyses of these transcripts, as well as determining the correspondence between our summarization measure and other types of text analysis procedures (e.g., referent use, propositional structure) (Baird, 1993; Higginbotham, 1993; Higginbotham & Baird, in review).

ACKNOWLEDGMENTS

Thanks to Jenifer Rauck and Bill DeRoo for their efforts during the preliminary phases of this project. The authors also thank Elaine Stathopolous for her critical reading of the first drafts of this manuscript. Finally, the authors are especially grateful for the critical review and recommendations provided by the reviewers. This research was funded in part by the National Institute on Deafness and Other Communicative Disorders, Grant No. DC00034.

Address reprint requests to: D. Jeffery Higginbotham, Department of Communicative Disorders and Sciences, 118 Park Hall, State University of New York at Buffalo, Buffalo, New York 14260, USA.

REFERENCES

American National Standards Institute. (1969). Specifications for audiometers (ANSI S3.6-1969). New York: ANSI.
Baird, L. (1993). Qualitative and quantitative analysis of listeners' recall of synthesized speech texts. Unpublished master's thesis, State University of New York at Buffalo, Buffalo, NY.
Drazek, A. L. (1992). Discourse comprehension of synthetic speech by naive versus experienced listeners. Unpublished master's thesis, State University of New York at Buffalo, Buffalo, NY.
Dunn, L. M., & Dunn, L. M. (1981). Peabody Picture Vocabulary Test-Revised. Circle Pines, MN: American Guidance Service.
Educational Testing Service. (1992). TOEFL Test of Written English guide. Princeton, NJ: Author.
Foulds, R. (1980). Communication rates of nonspeech expression as a function of manual tasks and linguistic constraints. In Proceedings of the International Conference on Rehabilitation Engineering (pp. 83–87). Washington, DC: RESNA, Association for the Advancement of Rehabilitation Engineering.
Greene, B. G., & Pisoni, D. B. (1988). Perception of synthetic speech by adults and children: Research on processing voice output from text-to-speech systems. In L. E. Bernstein (Ed.), The vocally impaired: Clinical practice and research (pp. 206–248). Philadelphia: Grune & Stratton.
Harris, J. (1987). Speech comprehension and lexical failure. In R. Reilly (Ed.), Communication failure in dialogue and discourse: Detection and repair processes (pp. 81–98). New York: North-Holland.
Herbert, J., & Attridge, C. (1975). A guide for developers and users of observation systems and manuals. American Educational Research Journal, 12, 1–20.
Higginbotham, D. J. (1991). The effect of communication device output mode on conversational performance (Continuation report, Grant No. DC00034-03). Bethesda, MD: National Institute on Deafness and Other Communicative Disorders.
Higginbotham, D. J. (1993). Comprehension of VOCA produced texts: Referent recall, intelligibility and short term memory. Manuscript in preparation.
Higginbotham, D. J., Baird, L., & Duchan, J. D. (1994). Discourse analysis of listeners' summaries of synthesized speech passages. Manuscript submitted for publication.
Higginbotham, D. J., Drazek, A., & Sussman, J. (in review). The effect of experience on the single-word intelligibility and discourse comprehension of synthetic speech. Manuscript submitted for publication.
Higginbotham, D. J., & Scally, C. (1994). The effect of voice output method on VOCA mediated social interaction. Manuscript in preparation.
Higginbotham, D. J., Lundy, D. C., & Scally, C. (1993a). The effect of output method on the comprehension of voice output communication aids. Manuscript submitted for publication.
Higginbotham, D. J., Lundy, D. C., & Scally, C. (1993, in review). Qualitative ratings of discourse summaries: Definitions and procedure manual. Unpublished manuscript.
Higginbotham, D. J., Mathy-Laikko, P., & Yoder, D. E. (1988). Studying conversations of augmentative communication system users. In L. E. Bernstein (Ed.), The vocally impaired: Clinical practice and research (pp. 206–248). Philadelphia: Grune & Stratton.
Jenkins, J. J., & Franklin, L. D. (1982). Recall of passages of synthetic speech. Bulletin of the Psychonomic Society, 20, 203–206.
Keenan, E., & Schiefflen, B. (1976). Topic as a discourse notion: The study of topic in the conversations of children and adults. In L. Li (Ed.), Subject and topic (pp. 335–382). New York: Academic Press.
Kintsch, W. (1988). The role of knowledge in discourse comprehension: A construction-integration model. Psychological Review, 95, 163–182.
Kintsch, W., Kozminsky, E., Streby, W. J., McKoon, G., & Keenan, J. M. (1975). Comprehension and recall of text as a function of content variables. Journal of Verbal Learning and Verbal Behavior, 14, 196–214.
Light, J., & Lindsay, P. (1990). Cognitive science and augmentative and alternative communication. Augmentative and Alternative Communication, 7, 186–203.
Lively, S. E., Ralston, J. V., Pisoni, D. B., & Rivera, S. M. (1990). Effects of text structure on the comprehension of natural and synthetic speech (Research on Speech Perception Progress Report No. 16). Bloomington, IN: Speech Research Laboratory, Indiana University.
Logan, J. S., Greene, B. G., & Pisoni, D. B. (1989). Segmental intelligibility of synthetic speech produced by rule. Journal of the Acoustical Society of America, 86, 566–581.
Luce, P. A., Feustel, T. C., & Pisoni, D. B. (1983). Capacity demands in short-term memory for synthetic and natural speech. Human Factors, 25, 17–32.
Mathy-Laikko, P. A. (1992). Comprehension of augmentative and alternative communication device output methods. Unpublished doctoral dissertation, University of Wisconsin-Madison, Madison, WI.
Mirenda, P., & Beukelman, D. R. (1987). A comparison of speech synthesis intelligibility with listeners from three age groups. Augmentative and Alternative Communication, 3, 120–128.
Mirenda, P., & Beukelman, D. R. (1990). A comparison of intelligibility among natural speech and seven speech synthesizers with listeners from three age groups. Augmentative and Alternative Communication, 6, 61–68.
Oller, J. W. (1979). Language tests at school. London: Longman Group.
Pisoni, D. B., Manous, L. M., & Dedina, M. J. (1987). Comprehension of natural and synthetic speech: Effects of predictability on the verification of sentences controlled for intelligibility. Computer Speech and Language, 2, 303–320.
Pisoni, D. B., Ralston, J. V., & Lively, S. E. (1990). Some new directions in research on comprehension of synthetic speech (Research on Speech Perception Progress Report No. 16). Bloomington, IN: Speech Research Laboratory, Indiana University.
Raghavendra, P., & Allen, G. D. (1993). Comprehension of synthetic speech with three text-to-speech systems using a sentence verification paradigm. Augmentative and Alternative Communication, 9, 126–133.
Ralston, J. V., Pisoni, D. B., Lively, S. E., Greene, B. G., & Mullenix, J. W. (1991). Comprehension of synthetic speech produced by rule: Word monitoring and sentence-by-sentence listening times. Human Factors, 33, 471–491.
Riley, P. (1980). When communication breaks down: Levels of coherence in discourse. Applied Linguistics, 1, 201–216.
SAS Institute. (1989). JMP (version 2.05). Cary, NC: Author.
Scally, C. (1993). The effect of voice output method on VOCA mediated social interaction. Unpublished master's thesis, State University of New York at Buffalo, Buffalo, NY.
Shriberg, L., Kwiatkowski, J., & Hoffman, K. (1984). A procedure for phonetic transcription by consensus. Journal of Speech and Hearing Research, 27, 456–465.
Suen, H. K., & Ary, D. (1989). Analyzing quantitative behavioral observation data. Hillsdale, NJ: Lawrence Erlbaum.
Talbot, M. (1987). Reaction time as a metric for the intelligibility of synthetic speech. In J. A. Waterworth (Ed.), Speech and language-based interaction with machines: Towards the conversational computer. Chichester: Ellis Horwood.
Vanderheiden, G. C., & Lloyd, L. L. (1986). Communication systems and their components. In S. Blackstone (Ed.), Augmentative communication: An introduction (pp. 49–162). Rockville, MD: American Speech-Language-Hearing Association.
Venkatagiri, H. S. (1991). Effects of rate and pitch variations on the intelligibility of synthesized speech. Augmentative and Alternative Communication, 7, 284–289.
Vitale, T. (1991). Assistive speech I/O in the 1990s: Current priorities and future trends. In Proceedings of the California State University, Northridge Technology Conference on Voice I/O and Persons with Disabilities (pp. 1–21). Northridge, CA: Office of Disabled Student Services, California State University.


APPENDIX
Definitions for the Qualitative Rating of Discourse Summaries

Full Summary

Gloss: The text was completely understood.

A. Definition
The summary repeated the original text, including all textual elements (actors, relevant actions, events, locations, themes, and causal relations).

B. Defining Features
1. The summary is an exact repetition of the original text, or it includes all elements of the original text. OR
2. The meaning of the summary is equivalent to that of the original text.

C. Additional Considerations
The summary is also considered to be full in the following situations:
1. Change or omission of an element (or phrase) if:
   • the meaning of the altered element is consistent with that of the original text;
   • the altered element does not affect the meaning of other sentences or the central meaning of the summary compared to those of the original text.
2. Introduction of new words into the text, if the meaning of the text remains consistent with the meaning of the original paragraph (e.g., use of synonyms).
3. Omission of assumable elements and phrases (ellipsis).
4. Changes in punctuation or use of numbering (e.g., omission of commas, numbering of sentences), unless these changes appear to affect the meaning of the text.

Partial Summary

Gloss: My listener understood most of the text.

A. Definition
One or more of the substantive elements of the original text were altered or omitted. However, the overall meaning of the summary is similar to the original text.

B. Defining Features
1. One or more elements of the summary are omitted or changed compared to the original text. AND
2. The overall meaning of the summary is similar to that of the original text. The altered elements do not substantially depart from the overall gist of the summary.

C. Additional Considerations
The summary is also considered to be partial in the following situations:
1. Text elements, phrases, and sentences may be changed or omitted if the overall interpretation of the summary is comparable to that of the original text.
2. New elements may be introduced into the text, if they do not significantly change the overall meaning of the summary compared to the original text.

Changed Summary

Gloss: My listener misinterpreted the text.

A. Definition
Alterations made to the elements of the summary significantly change the overall meaning or gist of the original text. The summary is understandable unto itself.

B. Defining Features
1. A number of elements, phrases, and/or sentences of the summary are added, omitted, or changed compared to the original text. AND
2. The altered elements significantly change the overall meaning of the summary compared to the original text. AND
3. The summary makes sense within itself.

C. Additional Considerations
The summary is also considered to be changed in the following situations:
1. The altered elements significantly affect the interpretation of the entire text or subsequent sentences in the text.
2. The summary may be reasonable and even elaborate, but its meaning is significantly different from that of the original text.

Fragmented Summary

Gloss: My listener was unable to make sense of the text.

A. Definition
The summary is incoherent, and/or only fragments of the original message were provided.

B. Defining Features
1. The majority of the elements are omitted or changed. OR
2. It is difficult to make sense of the text. OR
3. The subject may report that they are unable to recall the paragraph.

C. Additional Considerations
The summary is also considered to be fragmented in the following situation:
1. The subject may only comment on the difficulties related to comprehending the original paragraph.

