
Journal of Speech and Hearing Disorders, Volume 54, 209-214, May 1989

COMPARISONS OF PERCEPTUAL AND ACOUSTIC CHARACTERISTICS OF TRACHEOESOPHAGEAL AND EXCELLENT ESOPHAGEAL SPEECH

S. E. SEDORY*
S. L. HAMLET
University of Maryland, College Park

N. P. CONNOR
Speech Motor Control Laboratories, University of Wisconsin, Madison

The results of recent studies have established significant acoustic differences between tracheoesophageal (TE) and conventional esophageal speech. Listener preferences and acoustic differences between TE and excellent esophageal speech were examined in the present investigation. Although, as a group, TE speech was characterized by longer extended phonation, more syllables per breath, and increased intensity, there were no significant differences in listener preference between the groups.

With the development of the tracheoesophageal puncture (TEP) technique (Singer & Blom, 1980), tracheoesophageal (TE) speech has become a widely used method of alaryngeal voice rehabilitation. TE speech is achieved when pulmonary air is directed through a prosthesis into the upper esophagus to vibrate the pharyngoesophageal segment and produce voice. This replaces the need to insufflate the esophagus as in conventional esophageal speech and provides the TE speaker with an increased capacity to support his speech with the respiratory system.

As the number of TE speakers has increased, investigators have begun to define the acoustic characteristics of TE speech and to compare such properties with those of normal and conventional esophageal speech. TE speakers sustain phonation longer, produce more syllables per breath or insufflation, maintain faster speaking rates with less pause time, and speak with greater intensity than conventional esophageal speakers (Baggs & Pine, 1983; Robbins, Fisher, Blom, & Singer, 1984; Wetmore, Krueger, & Wesson, 1981). In addition, TE speech and conventional esophageal speech have been differentiated by jitter ratio, mean shimmer, and percentage of periodicity (Robbins, 1984). These studies suggest that the differentiating acoustic variables may affect the intelligibility and acceptability of esophageal and TE speech and could be considered when developing therapeutic objectives.

Only a small number of investigators have examined whether there are differences in listener perceptions of randomly selected TE and conventional esophageal speech. Williams and Watson (1987) found that naive listeners rated esophageal and TE speakers differently on quality/extraneous noise, visual presentation, speaking rate, pitch, loudness, and intelligibility/overall communicative effectiveness. Dudley (1984) found no significant difference in the intelligibility of conventional esophageal speech and TE speech. Several investigators have described TE speech as often perceptually equal to or better than superior conventional esophageal speech (Singer, Blom, & Hamaker, 1981; Wood, Tucker, Rusnov, & Levine, 1981). However, to date there have been no published reports that objectively define perceptual differences between TE and excellent esophageal speech. Moreover, no published studies have compared the acoustic and perceptual characteristics of these two speech types. Such objective information would be valuable to speech-language pathologists and otolaryngologists considering TEP surgery for their laryngectomized patients. This study was undertaken to directly assess the relationship between selected acoustic measures and perceptual judgments of acceptability of TE speech and excellent conventional esophageal speech.

METHOD

Subjects

Two groups of speakers and one group of listeners participated in this study. The first group of speakers consisted of 5 men who had undergone total laryngectomy 3-20 years earlier (M = 9.5 years). They all began using esophageal speech within 1 year postlaryngectomy, received 3-14 months of speech services (M = 10 months), and were judged as superior conventional esophageal speakers by their speech-language pathologists in terms of general effectiveness of speech (Snidecor & Curry, 1959). The second group consisted of 5 male laryngectomees who had a tracheoesophageal puncture 0-9 years postlaryngectomy (M = 2.5 years). These represented the first 5 volunteers who were wearing the same type of TE prosthesis, the A.V. Mueller Lo-Pressure prosthesis. All had received less than 1 month of speech services and/or instruction in the care and use of a prosthesis and proficiently used TE speech as their primary method of communication. The speakers were recruited from New Voice Clubs in the metropolitan Washington, DC, area and used standard American dialect. None reported a history of communicative disorders other than that associated with their laryngectomy.

Each speaker was tape-recorded at the same record level in a quiet room on a Nagra IV-S recorder (7.5 ips). A head-mounted Sony ECM50 microphone was held at a constant distance of 5 cm in front of the speaker's mouth. Speech tasks consisted of producing a maximum extended phonation of /a/ on one inhalation or insufflation at a comfortable loudness. Three successive trials were recorded to compute a mean maximum phonation time. Subjects also read the first three sentences of "The Rainbow Passage" (Fairbanks, 1944), once silently and then twice while being tape-recorded at a normal conversational loudness and rate.

*Currently affiliated with the National Institutes of Health, Bethesda, MD.
© 1989, American Speech-Language-Hearing Association 0022-4677/89/5402-0209$01.00/0

Listening Procedure

Seven female and three male adults unfamiliar with alaryngeal speech served as listeners. All passed a pure-tone hearing screening at 25 dB or better in both ears, at octave frequencies from 500 to 4000 Hz. Age of the listeners ranged from 19 to 59 years (M = 29 years).

A listening tape was created using a computer to digitize each speaker's utterance of the second sentence of the Rainbow Passage (low-pass filtered at 7 kHz, sampled at 16 kHz with 12-bit accuracy); the digitized sentences were stored in separate files. The sentences were arranged in a randomized paired-comparison format (Weiss, Yeni-Komshian, & Heinz, 1979) so that each speaker was paired twice with every other speaker except himself, once as the first talker and again as the second talker. There were a total of 90 stimulus pairs with 700-ms interstimulus intervals and 5 s between stimulus pairs. Because each speaker pair appeared twice (although not in the same order), it was possible to compute intrajudge reliability from one listening session. The stimuli were recorded on a Revox tape recorder, and a second version was made using block randomization to control order effect.

Listeners participated individually or in small groups. Prior to testing, an instructional tape was played that described alaryngeal speech and provided five samples of esophageal and TE speakers not used in this study. Listeners were instructed to select the sentence in each pair that they "preferred" and were encouraged, in these exact words, to "use their subjective impression . . . which may be based on different aspects of speech, including intelligibility, voice quality, fluency, rate, naturalness, communicative effectiveness, or just which [they] would most like to listen to." These instructions were provided to alert all 10 listeners to the same aspects of speech. Because the end result of the listening study was a dichotomous judgment, it is unlikely that this general definition biased the listeners in any direction. Listeners were encouraged to attempt to maintain the same set of judgment criteria throughout the listening session. One of two listening tapes was played on a Nagra IV-S portable tape player, free field in a quiet room at a comfortable loudness level. Following the listening test, listeners were asked to write down the basis for their judgments and to describe their impressions of the task.

The listeners' scores were tallied as 1 for the sample preferred in each pair and 0 for the unselected sample. The scores were summed for each speaker so that a perceptual rating number could be assigned to each of the 10 speakers (total of 18 points possible for any speaker). Mean intrajudge reliability was 88.5%, ranging from 84% to 94%. This score was considered acceptable for discriminating alaryngeal speakers by naive listeners.
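The pairing and scoring scheme described above can be illustrated with a short sketch. The Python below is not part of the original procedure: the speaker labels, the function names, and the hypothetical listener are assumptions introduced for illustration, and the reliability measure (the share of speaker pairings judged the same way in both presentation orders) is only one plausible reading of how intrajudge reliability was derived.

```python
from itertools import combinations, permutations

# Hypothetical labels: 5 esophageal (E) and 5 tracheoesophageal (TE) speakers.
SPEAKERS = [f"E{i}" for i in range(1, 6)] + [f"TE{i}" for i in range(1, 6)]


def build_pairs(speakers):
    """All ordered (first talker, second talker) pairs, excluding self-pairs."""
    return list(permutations(speakers, 2))


def tally_scores(choices, speakers):
    """Give 1 point to the preferred speaker of every pair, 0 to the other."""
    scores = {s: 0 for s in speakers}
    for preferred in choices.values():
        scores[preferred] += 1
    return scores


def intrajudge_reliability(choices, speakers):
    """Share of speaker pairings judged the same way in both orders (assumed definition)."""
    pairings = list(combinations(speakers, 2))
    consistent = sum(choices[(a, b)] == choices[(b, a)] for a, b in pairings)
    return consistent / len(pairings)


pairs = build_pairs(SPEAKERS)

# Hypothetical listener who always prefers the speaker listed earlier in SPEAKERS.
hypothetical_choices = {(a, b): min(a, b, key=SPEAKERS.index) for a, b in pairs}
scores = tally_scores(hypothetical_choices, SPEAKERS)

print(len(pairs), "stimulus pairs")
print("highest possible score:", max(scores.values()))
print("reliability:", intrajudge_reliability(hypothetical_choices, SPEAKERS))
```

With 10 speakers the design yields 10 x 9 = 90 ordered pairs, and each speaker appears in 18 of them, which is why 18 is the maximum score any speaker can receive from one listener.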

Acoustic Analysis

All acoustic measures were obtained using Micro Speech Lab (MSL) (Dickson, 1985) and an IBM-PC microcomputer. This method of analysis and this hardware/software package were selected because the package is commercially available and capable of performing the selected acoustic measures with little training. All speech samples were played from a Revox tape player at a constant playback level.

For durational measures, the three sentences of the second reading of the Rainbow Passage were sampled at 8 kHz with 10-bit accuracy. Each sentence was displayed as a time waveform and, using a cursor, the length of the sentence and the pause/speech segment durations were measured with 10-ms accuracy. A pause was marked when there was more than 200 ms of continuous silence. This criterion is similar to that used in other studies and prevents stop-closure durations from being interpreted as pauses (Lisker, 1957; Robbins et al., 1984).

Maximum phonation time was measured by sampling each extended phonation at 3 kHz with 8-bit accuracy to allow a sufficient sampling time. The sample was displayed as a time waveform and the length of phonation was determined using cursors. The duration of each of the three trials was measured individually and the durations were then averaged.

An amplitude display of the waveform was used to measure the relative intensity of the continuous speech portions of the second sentence of the reading passage. In MSL, an intensity "energy analysis calculation" sums the amplitude of the sampled data within 20-ms analysis frames. The energy values for each 20-ms frame were averaged to determine the mean relative intensity level of the sentence for each subject.

RESULTS

Acoustic Findings

Descriptive results of the acoustical analyses are presented in Table 1. Preliminary examination of the variables suggested redundancy among the acoustic measures of number of pauses, syllables per breath, mean percentage of pause time, and syllables per second. Because these relationships were supported by high Pearson correlation coefficients (r > -.85), syllables per breath and mean number of pauses were eliminated from further analyses. Nonparametric statistics were used in subsequent analyses, due to differences in the variances of the speaker groups and the ordinal scale of the perceptual data (Cohen & Cohen, 1975). Mann-Whitney U tests (SPSS-X, 1986) were performed on the remaining acoustic variables to examine whether differences existed between the TE and excellent esophageal speakers (Table 2). To control for experimentwise error, an alpha level of .03 was required for significance (Cohen & Cohen, 1975). Significant differences were found for mean length of extended phonation, mean relative intensity, and syllables per breath/insufflation.

TABLE 1. Comparisons of esophageal and tracheoesophageal (TE) speakers on acoustic measures.

Subject     Mean extended  Mean relative  Mean syllables  Mean syllables  % pause  Total number  Mean pause
            phonation (s)  intensity      per second      per breath      time     of pauses     length (ms)
                           (rel. ampl.)

Esophageal
1           0.71           585.8          2.71            3.0             26.8     20            346
2           0.52           546.0          2.78            3.7             31.8     18            445
3           1.27           645.0          2.45            3.0             37.1     20            529
4           1.47           362.6          2.61            4.1             34.6     14            664
5           1.59           511.5          3.77            7.0             16.6     7             440
M           1.11           530.2          2.86            4.2             29.4     15.8          484
SD          0.47           106.0          0.52            1.7             8.1      5.5           119.4

TE
1           9.20           1,205.4        3.32            7.8             13.4     6             471
2           7.39           1,037.5        3.68            8.8             14.6     5             555
3           6.77           554.3          3.38            7.8             22.9     6             792
4           4.71           1,027.4        3.63            8.8             24.5     5             946
5           14.54          662.7          4.08            11.7            16.3     3             931
M           8.52           897.5          3.62            8.9             18.3     5             739
SD          3.73           295.8          0.30            1.6             5.0      1.2           216.9
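As an illustrative check, the group comparisons can be recomputed from the individual values in Table 1. The sketch below is not the SPSS-X procedure used in the study. It assumes that the test values reported in Table 2 are normal-approximation z statistics with a correction for ties and that the significance levels are two-tailed, neither of which is stated in the text. The function name and the dictionary layout are hypothetical.

```python
from math import erfc, sqrt

# Individual speaker values transcribed from Table 1 (speakers 1-5 in each group).
ESOPHAGEAL = {
    "extended /a/ (s)": [0.71, 0.52, 1.27, 1.47, 1.59],
    "relative intensity": [585.8, 546.0, 645.0, 362.6, 511.5],
    "syllables/second": [2.71, 2.78, 2.45, 2.61, 3.77],
    "syllables/breath": [3.0, 3.7, 3.0, 4.1, 7.0],
    "pause length (ms)": [346, 445, 529, 664, 440],
}
TE = {
    "extended /a/ (s)": [9.20, 7.39, 6.77, 4.71, 14.54],
    "relative intensity": [1205.4, 1037.5, 554.3, 1027.4, 662.7],
    "syllables/second": [3.32, 3.68, 3.38, 3.63, 4.08],
    "syllables/breath": [7.8, 8.8, 7.8, 8.8, 11.7],
    "pause length (ms)": [471, 555, 792, 946, 931],
}


def mann_whitney_z(x, y):
    """Return (U for x, tie-corrected z, two-tailed p) by normal approximation."""
    n1, n2 = len(x), len(y)
    # U counts the (x, y) pairs in which the x value is larger; ties count one half.
    u = sum(1.0 if a > b else 0.5 if a == b else 0.0 for a in x for b in y)
    pooled = list(x) + list(y)
    n = n1 + n2
    # Tie correction: sum of (t^3 - t) over each group of tied values.
    tie_term = sum(c ** 3 - c for c in (pooled.count(v) for v in set(pooled)))
    variance = n1 * n2 / 12 * ((n + 1) - tie_term / (n * (n - 1)))
    z = (u - n1 * n2 / 2) / sqrt(variance)
    p = erfc(abs(z) / sqrt(2))  # two-tailed probability under the normal curve
    return u, z, p


RESULTS = {name: mann_whitney_z(ESOPHAGEAL[name], TE[name]) for name in ESOPHAGEAL}
for name, (u, z, p) in RESULTS.items():
    print(f"{name}: U = {u:.1f}, z = {z:.2f}, p = {p:.3f}")
```

Under these assumptions the sketch gives z values of about -2.61, -2.19, -1.78, -2.64, and -1.98, which agree with the values reported in Table 2. With only 5 speakers per group, an exact test would be a reasonable alternative to the normal approximation.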

TABLE 2. Results of Mann-Whitney U test by speaker groups.

Variable                            U test    Significance
Mean ext. /a/                       -2.61     .009*
Mean relative intensity             -2.19     .028*
Syllables/second                    -1.78     .076
Syllables/breath or insufflation    -2.64     .008*
Mean pause length                   -1.98     .047

*Significant at < .03 level.

Perceptual Findings

Results of the Mann-Whitney U test showed no significant differences in perceptual score as a function of the sex of the listener (p > .14), the order of presentation (Tape A or B) (p > .09), or the listener's reliability score (p > .11). Additionally, Kendall's coefficient of concordance (SPSS-X, 1986) showed a very high degree of agreement among the listeners in their ratings of the speakers (χ² = 72.12, p < .0001). Therefore, the perceptual scores were averaged across all 10 listeners to create a "mean perceptual score" (Table 3). There was no significant difference (p = .35) between the perceptual scores of TE and excellent esophageal speech, as assessed using the Mann-Whitney U test.

Perceptual-Acoustic Correlations

In order to further assess the relationship between the acoustic and perceptual measures, Spearman rank correlation coefficients (SPSS-X, 1986) were computed (Table 4). None of the acoustic variables was significantly correlated with the perceptual rank of the speakers. Subjectively, all of the listeners stated that the stimulus pairs were often too similar to make a preference judgment. They described their selection criteria as including control over the "smoothness," clarity of speech, and a "more normal sounding" voice for the preferred speakers. Unacceptable characteristics identified were nonspeech noises, audible inhalations or injections, slurring of words, and lack of control over loudness and rate.
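The rank correlations can likewise be recomputed from the individual acoustic values in Table 1 and the mean perceptual scores in Table 3. This is a minimal sketch rather than the SPSS-X routine used in the study; the helper names are hypothetical, and tied values are assumed to receive the average of their ranks.

```python
from math import sqrt

# Speaker order: esophageal 1-5, then TE 1-5 (values transcribed from Tables 1 and 3).
PERCEPTUAL = [2.8, 11.9, 10.0, 3.2, 11.2, 5.5, 16.4, 14.1, 10.8, 4.1]
ACOUSTIC = {
    "extended /a/ (s)": [0.71, 0.52, 1.27, 1.47, 1.59, 9.20, 7.39, 6.77, 4.71, 14.54],
    "relative intensity": [585.8, 546.0, 645.0, 362.6, 511.5,
                           1205.4, 1037.5, 554.3, 1027.4, 662.7],
    "syllables/second": [2.71, 2.78, 2.45, 2.61, 3.77, 3.32, 3.68, 3.38, 3.63, 4.08],
    "syllables/breath": [3.0, 3.7, 3.0, 4.1, 7.0, 7.8, 8.8, 7.8, 8.8, 11.7],
    "pause length (ms)": [346, 445, 529, 664, 440, 471, 555, 792, 946, 931],
}


def average_ranks(values):
    """Rank from 1 (smallest); tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the run while the next sorted value equals the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1
        for pos in range(start, end + 1):
            ranks[order[pos]] = shared
        start = end + 1
    return ranks


def spearman_rho(x, y):
    """Spearman's rho computed as the Pearson correlation of the rank vectors."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    return cov / sqrt(sum((a - mx) ** 2 for a in rx) * sum((b - my) ** 2 for b in ry))


RHO = {name: spearman_rho(values, PERCEPTUAL) for name, values in ACOUSTIC.items()}
for name, rho in RHO.items():
    print(f"{name}: rho = {rho:.4f}")
```

Under this assumption the five coefficients come out at about .13, .09, .37, .29, and .13, matching the values reported in Table 4.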

TABLE 3. Comparisons of tracheoesophageal and esophageal speakers on measures of listener preference. The mean perceptual score represents the number of "preferred" scores for each speaker, of a possible 18, averaged across 10 naive listeners.

Speakers             Mean perceptual score    Rank    M (SD)          Mann-Whitney U statistic

Esophageal                                            7.80 (4.45)     -0.94, p = .35
  1                  2.8                      1
  2                  11.9                     8
  3                  10.0                     5
  4                  3.2                      2
  5                  11.2                     7

Tracheoesophageal                                     10.18 (5.32)
  1                  5.5                      4
  2                  16.4                     10
  3                  14.1                     9
  4                  10.8                     6
  5                  4.1                      3

TABLE 4. Spearman rank correlation coefficients of perceptual and selected acoustic measures.

Measure                                                    Spearman's rho    p
Mean extended /a/ with perceptual score                    .1273             .36
Mean relative intensity with perceptual score              .0909             .40
Syllables/second with perceptual score                     .3697             .15
Syllables/breath or insufflation with perceptual score     .2936             .21
Mean pause length with perceptual score                    .1273             .36

DISCUSSION

There were three major findings of this study. First, the acoustic differences between excellent conventional esophageal and TE speakers are similar to those found in other studies. The mean length of extended phonation is within reported ranges for conventional esophageal speakers (1.3-5.3 s; Berlin, 1965) and for TE speakers (3-30 s; Wetmore et al., 1981). In the literature, reports of speaking rate approximated 2.25 syllables/second for good esophageal speakers (Filter & Hyman, 1975), with 5.8 syllables/insufflation (Snidecor & Curry, 1959) and 36.1% pause time (Robbins et al., 1984). In TE speech, faster rates of 2.6 syllables/second, 9.2 syllables/breath, and 24.2% pause time have been found (Robbins et al., 1984). As confirmed in this study, the increased frequency of pauses by esophageal speakers seems to affect the total percentage of pause time and speaking rate, whereas the rate of TE speech is more strongly influenced by longer pauses. Measures of mean relative intensity from the present study reflect a 1.7:1 ratio between TE and excellent esophageal speech, a value consistent with ratios derived from previous reports (Baggs & Pine, 1983; Robbins et al., 1984). However, due to the variety of methods used to investigate relative intensity in these studies, it is difficult to make firm comparisons. The statistical analysis clearly demonstrates acoustic differences between the speaker groups. Even when compared with excellent esophageal speakers, the TE speakers had significantly longer extended phonation, spoke more syllables per breath, and produced louder speech during reading.

The second major finding was that the two speaker groups could not be judged as perceptually different by these 10 naive listeners. Although there are no published reports that compare excellent esophageal and TE speech, the literature to date presents conflicting reports of the perceptual differences or similarities between randomly selected TE and esophageal speakers (Dudley, 1984; Williams & Watson, 1985, 1987). The speakers in the present study, although not statistically different, showed some variation in judged communicative preference, even at the excellent ability level. Two subjects of each speech type received perceptual scores lower than the mean score, and 3 of each speech type received higher than average scores. This indicates that the listeners were not able to distinguish between the two types of speech but were able to identify perceptually better and worse speakers. Examination of the listeners' subjective opinions of the speakers and of the task supports these findings. All of the listeners found several pairs to be similar. When asked, none of the listeners identified two different speech types. They could only identify that they preferred some speakers more than others.

The third finding was that no significant relationships were found between any of the acoustic measures and the perceptual ranks. This lack of correlation may be explained in two ways. First, by looking at the individual acoustic data, several of the speakers' values clearly do not follow the same pattern as the perceptual rank. For example, the seventh ranked speaker (highest ranked esophageal speaker), E-2, maintained the shortest maximum phonation of approximately 500 ms, whereas the third ranked speaker (lowest ranked TE speaker), TE-5, sustained the longest extended phonation of 14.5 s. This suggests that the ability to sustain a longer phonation does not in itself make a speaker more preferred in connected speech. A second explanation is that the listeners based their perceptual judgments on properties other than the acoustic variables examined in this study. Although studies of conventional esophageal speech have shown correlations between perceptual acceptability judgments and acoustic measures of rate and intensity (Filter & Hyman, 1975; Shipp, 1967), these studies did not force their listeners to make choices between two types of esophageal speech.
The listeners in this study described the preferred speakers as having control over the smoothness (rate and pauses) of speech, speaking more clearly, and having a more normal voice. They also described the characteristics that they did not find acceptable, such as extraneous nonspeech noises, audible inhalations or injections, slurring words together, and lack of control over loudness and/or rate. It appears that although listeners subjectively believed they identified differences in rate and phrasing, those differences may not have been exclusive to one speech type. Moreover, the other aspects of speech, such as extraneous noises, more normal voice, and articulatory clarity, may have played an equal or more influential role in forming judgments.

Several clinically relevant implications emerge from this study. First, the question of whether a TEP procedure is a valid method of alaryngeal speech rehabilitation is answered favorably in this study. All 5 TE speakers, who had been using TE speech for an average of 2 years, were judged to be perceptually similar to excellent conventional esophageal speakers who had been using esophageal speech as their primary method of communication for an average of 10 years. There is an advantage in being able to achieve excellent alaryngeal communication in such a short period of time. However, based on these 10 speakers and 10 listeners, one cannot say that either type of speech is significantly better from a perceptual perspective. Moreover, the acoustic advantages of these TE speakers do not correlate highly with more or less preferred speech. As this study is based on a relatively small number of speakers and listeners, it is not appropriate to generalize these findings to all laryngectomized patients. However, our findings lead us to speculate that these esophageal speakers who have achieved excellence would not increase significantly in listener preference if they were using TE speech. Current literature suggests that the choice of which type of speech a laryngectomized patient uses should primarily depend upon such factors as the ability to manipulate and care for the TE prosthesis on a daily basis, dissatisfaction with the present method of alaryngeal speech, or certain physical limitations (Andrews, Mickel, Hanson, Monahan, & Ward, 1987; Komisar & Hoch, 1982; Wetmore et al., 1981; Wood et al., 1981).
We agree that speech-language pathologists and otolaryngologists should continue to consider the patient's present method of alaryngeal communication before recommending TEP surgery, particularly in the case of a laryngectomized patient with excellent communication.

The second implication concerns therapeutic strategies. With acoustic variables supporting significant differences between speaker groups in favor of TE speech, many speech-language pathologists serving these populations might be inclined to use such variables as criteria in speech rehabilitation. Certainly, the importance of rate, phrasing, and loudness cannot be ignored in the rehabilitation process. However, the speech-language pathologist should be careful not to overlook other perceptual characteristics that may be limiting the acceptability of both the conventional esophageal speaker and the TE speaker. As Weinberg (1985) notes, alaryngeal speech rehabilitation encompasses more than just the restoration of voice. Clinical strategies recommended for excellence in esophageal speech that are intended to reduce extraneous noises, refine connected speech fluency, and improve articulatory and sound source control (see Keith & Darley, 1979; Weinberg, 1980) would be expected to help improve the overall acceptability of TE speech as well. Because the TE speakers in this study all received less than 1 month of speech services, primarily directed at instruction in care and use of the prosthesis, several sessions seem to be warranted to help the TE speaker achieve optimal alaryngeal communication.

ACKNOWLEDGMENTS

This research is based upon a thesis by the first author, under the guidance of the second author, accepted as partial fulfillment of the requirements for a master of arts degree in speech pathology from the Hearing and Speech Department, University of Maryland, College Park. Portions of this paper were presented at the Annual Convention of the American Speech-Language-Hearing Association in New Orleans, LA, November 1987. The authors wish to gratefully acknowledge Sabyasachi Basu for statistical consulting.

REFERENCES

ANDREWS, J. C., MICKEL, R. A., HANSON, D. G., MONAHAN, G. P., & WARD, P. H. (1987). Major complications following tracheoesophageal puncture for voice rehabilitation. Laryngoscope, 97, 562-567.
BAGGS, T. W., & PINE, S. J. (1983). Acoustic characteristics: Tracheoesophageal speech. Journal of Communication Disorders, 16, 299-307.
BERLIN, C. I. (1965). Clinical measurement of esophageal speech. III. Performance of non-biased groups. Journal of Speech and Hearing Disorders, 30, 174-183.
COHEN, J., & COHEN, P. (1975). Applied multiple regression/correlation analysis for the behavioral sciences. Hillsdale, NJ: Lawrence Erlbaum Associates.
DICKSON, C. (1985). User's manual for Micro Speech Lab. Victoria, British Columbia: Software Research Corporation.
DUDLEY, B. (1984). The comparative intelligibility of tracheoesophageal, esophageal and laryngeal speech. Unpublished doctoral dissertation, Northwestern University, Evanston, IL.
FAIRBANKS, G. (1944). Voice and articulation drillbook. New York: Harper and Row.
FILTER, M. D., & HYMAN, M. (1975). Relationship of acoustic parameters and perceptual ratings of esophageal speech. Perceptual and Motor Skills, 40, 63-68.
KEITH, R. L., & DARLEY, F. L. (Eds.). (1979). Laryngectomee rehabilitation. Houston: College-Hill Press.
KOMISAR, A., & HOCH, L. (1982). Rehabilitation of speech following tracheoesophageal puncture. Laryngoscope, 92, 1322-1323.
LISKER, L. (1957). Closure duration and the intervocalic voiced-voiceless distinction in English. Language, 33, 42-49.
ROBBINS, J. (1984). Acoustic differentiation of laryngeal, esophageal, and tracheoesophageal speech. Journal of Speech and Hearing Research, 27, 577-585.
ROBBINS, J., FISHER, H. B., BLOM, E. C., & SINGER, M. I. (1984). A comparative acoustic study of normal, esophageal, and tracheoesophageal speech production. Journal of Speech and Hearing Disorders, 49, 202-210.
SHIPP, T. (1967). Frequency, duration, and perceptual measures in relation to judgments of alaryngeal speech acceptability. Journal of Speech and Hearing Research, 10, 417-427.
SINGER, M. I., & BLOM, E. C. (1980). An endoscopic technique for restoration of voice after laryngectomy. Annals of Otology, Rhinology, and Laryngology, 89, 529-533.
SINGER, M. I., BLOM, E. C., & HAMAKER, R. C. (1981). Further experience with voice restoration after total laryngectomy. Annals of Otology, Rhinology, and Laryngology, 90, 498-502.
SNIDECOR, J. C., & CURRY, E. T. (1959). Temporal and pitch aspects of superior esophageal speech. Annals of Otology, Rhinology, and Laryngology, 68, 623-636.
SPSS-X user's guide, 2nd ed. (1986). Chicago: SPSS.
WEINBERG, B. (Ed.). (1980). Readings in speech following total laryngectomy. Baltimore, MD: University Park Press.
WEINBERG, B. (1985). Speech rehabilitation following total laryngectomy: Some perspectives. Head and Neck Cancer, 1, 516-521.
WEISS, M. S., YENI-KOMSHIAN, G. H., & HEINZ, J. M. (1979). Acoustical and perceptual characteristics of speech produced with an electronic artificial larynx. Journal of the Acoustical Society of America, 65, 1298-1308.
WETMORE, S. J., KRUEGER, K., & WESSON, K. (1981). The Blom-Singer speech rehabilitation procedure. Laryngoscope, 91, 1109-1116.
WILLIAMS, S. E., & WATSON, J. B. (1985). Differences in speaking proficiencies in three laryngectomee groups. Archives of Otolaryngology, 111, 216-219.
WILLIAMS, S. E., & WATSON, J. B. (1987). Speaking proficiency variations according to method of alaryngeal voicing. Laryngoscope, 97, 737-739.
WOOD, B. G., TUCKER, H. M., RUSNOV, M. G., & LEVINE, H. L. (1981). Tracheoesophageal puncture for alaryngeal voice restoration. Annals of Otology, Rhinology, and Laryngology, 90, 492-494.

Received July 2, 1987
Accepted May 2, 1988

Requests for reprints should be sent to Susan E. Sedory, M.A., Speech & Voice Unit, IRP, National Institute on Deafness and Other Communication Disorders, Building 10, Room 5N226, National Institutes of Health, Bethesda, MD 20892.