6th International Conference on Spoken Language Processing (ICSLP 2000), Beijing, China, October 16-20, 2000
ISCA Archive: http://www.isca-speech.org/archive
Improving Speech Synthesis for High Intelligibility under Adverse Conditions

Davis Pan, Brian Heng, Shiufun Cheung, and Ed Chang
Cambridge Research Laboratory, Compaq Computer Corporation
One Cambridge Center, Cambridge, MA 02142-1612
ABSTRACT

We investigate methods of improving the intelligibility of synthetic speech under noisy or low-fidelity acoustic conditions. The techniques explored improve speech in a natural manner, so that no training is required for the user to understand the enhanced speech. While the improvements are natural in this respect, the changes are not limited to speech that is achievable by a human vocal tract. The modifications fall into three broad classes: increasing phoneme amplitude, altering spectral shape, and lengthening phoneme duration. Listening tests conducted in noisy and noise-free conditions demonstrate significant improvements in intelligibility for most of the subject phonemes.
1. MOTIVATION

We foresee a rapidly increasing need for high-intelligibility speech synthesizers. As computer systems become more miniaturized and mobile, there will be more circumstances where video displays are impractical. Whether because the computing device is too small for an adequate visual display, such as in a cell phone, or because it would be dangerous to divert an operator's visual attention, as when driving an automobile, speech synthesis is an attractive alternative user interface. Unfortunately, many portable systems are typically used under acoustically challenging conditions, for example within the noisy environment of an automobile, or subject to the limited acoustic output and low fidelity of tiny, low-powered speakers. In these cases, it is much more important that the speech be understood than that it sound natural. Our goal is to spur the development of synthesizers that maximize intelligibility, analogous to how modern fonts maximize the legibility of text.
2. BACKGROUND

Work in the area of speech intelligibility arguably spans nearly two centuries. Early work focused on hearing aids [1]. Subsequent studies focused on the intelligibility of specific phonemes [2,3], while more recent work has studied intelligibility in a more general sense [4,5]. With the advent of speech synthesizers, researchers have examined the intelligibility of synthesized speech under both clear and noisy conditions [7,8,9].

There is also a long history of work on methods of improving speech intelligibility. Since World War II, it has been known that speech intelligibility can be improved by increasing the amplitude of consonants relative to that of vowels [6]. Humans likewise tend to increase the consonant-vowel energy ratio when asked to speak clearly [10]. In addition, studies have shown that humans instinctively change the way they speak to improve intelligibility: when speaking in a noisy environment, humans automatically alter their speech, a phenomenon known as the Lombard effect. Lombard speech has been found to increase speech intelligibility in noise in many cases [11]. Information from these studies allows us to improve intelligibility in two ways: we use traditional signal-processing approaches, and we mimic what humans do to improve intelligibility.
3. OVERVIEW

For this paper we explored methods of improving the intelligibility of synthesized speech in noise, focusing on methods that do not require user training to achieve the improved intelligibility. We based our experiments on a DECtalk formant synthesizer [12], version 4.6. This synthesizer allows us to identify phonemes and extract the associated acoustic parameters, which facilitates modification of the synthetic speech output. DECtalk has the additional advantages of high initial intelligibility [7,8] and a low-complexity, memory-efficient realization, making it well suited to portable applications. While our study focused on a specific synthesizer and language, we kept our approach general so that it can be applied to any synthesizer or language.

We first determined the phonemes that need the most improvement in intelligibility, a set of consonants. Next we developed tests to measure the intelligibility of these consonants in the presence of noise. We then proceeded in three phases:

1. Determine whether amplification of the selected consonants is sufficient to improve their intelligibility in noise.
2. For the consonants that do not respond to amplification, determine whether a change in spectral shape improves intelligibility.
3. For the consonants that do not respond to the above two techniques, determine whether time-stretching the consonant improves intelligibility.
4. TARGET CONSONANTS

In order to limit the scope of our work, we targeted the least intelligible phonemes in the American English language. It is widely known that consonants are less intelligible than vowels [5]. This is not surprising since consonant sounds are generally both weaker in strength and shorter in duration than vowels. Figure 1 shows a plot of the average RMS value versus the duration in milliseconds for the 55 DECtalk phonemes. These phoneme values were extracted from representative word utterances containing each phoneme. Since phoneme power and duration vary from word to word, this figure serves only as an example.

[Figure 1: Average phoneme RMS (dB) versus duration (ms) for the 55 DECtalk phonemes, with included and excluded phonemes marked.]

We focused on 16 of the most common, least intelligible consonants (using the DECtalk phonetic alphabet):
• Voiced plosives: b, d, g
• Unvoiced plosives: p, t, k
• Voiced fricatives: v, dh, z
• Unvoiced fricatives: f, th, s, sh, hx
• Nasals: m and n
In order to minimize the impact of adjacent vowels, we concentrated on word-initial consonants. Consequently, we omitted zh, df, dx, dz, tx, nx, en, el, and rx because these rarely, if ever, occur in the word-initial position. Although word-final consonants are generally less intelligible than those in the word-initial or word-medial position [5], we chose to focus on word-initial consonants because they offer a richer combination of phonetically natural-sounding consonant-vowel syllables. We omitted the semivowels r, w, y, ll, and lx because they are generally higher in amplitude and longer in duration, and thus more intelligible, than the consonants [5]. We left out the affricates jh and ch to reduce the length of the intelligibility tests; moreover, our analysis of the TIMIT speech database indicated that affricates have a relatively low frequency of occurrence in speech.

Results of previous studies of synthetic speech intelligibility under noisy and stressful conditions [9] indicate that a synthetic male voice is more intelligible and that a speaking rate of roughly 150 words per minute is favored under these conditions. We therefore used the default male DECtalk voice (Paul) speaking at a rate of 150 words per minute.

5. INTELLIGIBILITY TEST

5.1. Test Description

We formed 80 different syllables by pairing each of the 16 consonants with each of the following 5 vowels:

• Back vowels: aa, uw
• Front vowel: iy
• Diphthong: ay
• Retroflex: er
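As a minimal illustration of how this 80-syllable test set can be enumerated, the sketch below forms the cross product of the consonant and vowel sets. The phoneme labels follow the DECtalk phonetic alphabet used above; the code itself is our own and is not the original Visual C++ test software.

    # A sketch only: enumerate the 80 consonant-vowel test syllables
    # as DECtalk phoneme-label pairs.
    from itertools import product

    CONSONANTS = ["b", "d", "g", "p", "t", "k", "v", "dh",
                  "z", "f", "th", "s", "sh", "hx", "m", "n"]
    VOWELS = ["aa", "uw", "iy", "ay", "er"]

    # Each syllable is a (consonant, vowel) pair, e.g. ("th", "ay") for "thigh".
    SYLLABLES = list(product(CONSONANTS, VOWELS))
    assert len(SYLLABLES) == 80  # 16 consonants x 5 vowels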
For each test we compared the intelligibility of two sets of 80 syllables. The first set was the original data as synthesized by DECtalk. The second set had all the consonants enhanced in some manner to improve intelligibility. Each syllable from both sets was presented in random order and repeated twice during the test, for a total of 320 observations per test. We used an initial training sequence to acclimate test subjects to the test; it consisted of 16 noise-free syllables followed by 10 syllables with noise and was not scored. We wrote a program in Visual C++ for Windows to administer this test.

Fifteen subjects, eleven males and four females, participated in the testing. Not all subjects participated in all the tests: six subjects took the phase 1 tests, nine took the phase 2 tests (including the same six subjects from the phase 1 tests), five took the phase 3 tests, and six took the noise-free test. All were native English speakers with normal hearing ability. Each listened with a pair of Stax Lambda Nova headphones with an SRM-T1S driver unit. We adjusted the volume of the headphone output so that the noise-only component of the test measured 80 dB on a Realistic sound level meter, using the A-weighting and slow-response settings.

After hearing each syllable, subjects were asked to identify which one of 16 consonants they heard by using a mouse to click on the appropriate radio button. The program would update the radio-button labels with plausible consonant-vowel spellings matching the vowel part of the syllable just presented. For example, after the syllable sound "sigh," the th consonant button would be labeled "TH (thigh)." A "don't know" choice was also available in case subjects were unable to determine what was said. The pace of the test was under the subjects' control in that they could take as long as they wanted to enter each consonant identification, but the program would play the next syllable soon afterward. Subjects usually required about 20 minutes to complete the test. No subject took more than one test per day, to avoid fatigue effects.

For tests of intelligibility in noise, each syllable was mixed with white Gaussian noise adjusted to achieve a 5 dB signal-to-noise ratio. Early trials indicated that this amount of noise was sufficient to produce a consonant identification error rate of over 50%. The noise started half a second before the start of the syllable and finished half a second after its end.
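The noise-mixing step can be made concrete with a short sketch. The following Python/NumPy code is our own illustration, not the paper's original tooling; it embeds a syllable in white Gaussian noise at a target signal-to-noise ratio with half-second noise pads on either side. The function name, the float-array signal format, and the 11.025 kHz default sampling rate (taken from the synthesizer discussion below) are assumptions.

    import numpy as np

    def mix_with_noise(syllable, fs=11025, snr_db=5.0, pad_s=0.5, seed=0):
        """Embed a syllable (1-D float array) in white Gaussian noise.

        The noise starts pad_s seconds before the syllable and ends pad_s
        seconds after it; its level is set so that the SNR over the
        syllable region equals snr_db, as in the listening tests above.
        """
        rng = np.random.default_rng(seed)
        pad = int(pad_s * fs)
        noise = rng.standard_normal(len(syllable) + 2 * pad)

        # SNR(dB) = 10*log10(P_signal / P_noise); solve for the noise power.
        sig_power = np.mean(np.square(syllable))
        target_noise_power = sig_power / (10.0 ** (snr_db / 10.0))
        noise *= np.sqrt(target_noise_power / np.mean(np.square(noise)))

        out = noise
        out[pad:pad + len(syllable)] += syllable
        return out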
5.2. Results
The first phase of our study was an extension of previous work that improved intelligibility by indirectly increasing the amplitude of the consonants relative to the vowels, either by clipping or by fast limiting the speech signal [6]. We amplified each consonant by 600% or until the signal level was just below clipping, whichever was less. Half of our set of consonants improved with amplification alone: s, sh, m, n, t, k, b, and d. For the remaining eight consonants this method provided little improvement and in some cases actually introduced more errors. For instance, after amplification, hx was more easily mistaken for p, and v was more easily mistaken for m.

The second phase of our study was designed to test whether the lack of intelligibility improvement for the remaining consonants was due to poor phonetic reproduction by the DECtalk synthesizer. We replaced each of these consonants with an amplified recording of that consonant produced by a male speaker trying to speak the associated syllable clearly. The amount of amplification was comparable to that used in the first phase. Five consonants, p, g, hx, z, and v, showed a significant improvement in intelligibility with this approach.

That there was little improvement for the f, th, and dh phonemes is not too surprising. In natural speech, these fricatives are weaker than the other fricatives [13], and the output of the DECtalk synthesizer is consistent with this observation: Figure 1 shows that f and th are the two weakest consonants we tested. Unamplified, these consonants are inaudible at the noise level we used. Unfortunately, with amplification these consonants no longer sound as expected; in particular, the amplified dh sounded more like a v or a z. Our difficulties in improving the intelligibility of the th and dh consonants may also be partially due to the limited bandwidth of the DECtalk synthesizer. According to notes on spectrogram reading, the major energy onset in the frequency domain for these consonants starts at 6 kHz for an adult male [13]. Perhaps increasing the sampling rate of the synthesizer from 11.025 kHz to 16 kHz would help.

In the third phase we attempted to improve the intelligibility of consonants that apparently could not be improved by amplification or by the substitution of human clear-speech utterances. Studies of human clear speech indicate that people tend to increase the voicing onset time and the duration of frication noise [10]. We therefore tried doubling the duration of the consonants f, th, and dh, enhancing the other consonants as in phase 2. This technique yielded no significant improvement for dh and th, as expected, because the time-stretching aspect was already captured in phase 2 by using clips of human clear speech. The intelligibility of f, however, unexpectedly improved from an error rate of 62% to 44%.
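The phase 1 amplification rule lends itself to a compact sketch. The code below is our own illustration, assuming float samples normalized to [-1, 1] and consonant segment boundaries already known from the synthesizer's phoneme timing; it applies a 600% gain capped just below clipping, whichever is less.

    import numpy as np

    def amplify_consonant(speech, start, end, max_gain=6.0):
        """Amplify speech[start:end] by 600% (a gain of 6.0) or until the
        segment peak is just below full scale, whichever gain is smaller."""
        out = speech.copy()
        segment = out[start:end]
        if segment.size == 0:
            return out
        peak = np.max(np.abs(segment))
        if peak == 0.0:
            return out  # silent segment; nothing to amplify
        clip_gain = 0.999 / peak  # largest gain that stays below clipping
        out[start:end] = segment * min(max_gain, clip_gain)
        return out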
[Figure 2: Confusion matrices for (a) the original data set, (b) the enhanced data set using only amplification, and (c) the enhanced data set using both amplification and human clear-speech replacement for certain consonants.]
Figure 2 summarizes the overall outcome of our tests in the form of confusion matrices. The actual consonants are labeled on the left, while the perceived consonants are labeled across the top. The count in each entry, as a percentage of the total, is depicted according to the shading scale on the right. Detailed scores in the form of Excel spreadsheets are available at http://crl.research.compaq.com/projects/speechsynth.

Figure 2(a) shows the intelligibility results for the unmodified data set, which scored an overall error rate of roughly 71%. It is satisfying to note that, of the consonants we tested, those with the highest intelligibility, m and s, are located highest and widest on the phoneme-power-versus-duration plot in Figure 1, while those with the lowest intelligibility, p and k, are located closest to the origin. Figure 2(b) shows the results for the amplified DECtalk consonants: amplification alone improved the overall error rate to 49%. Figure 2(c) shows the results for the enhanced data set, which combines amplification for most of the consonants with human consonant replacement plus amplification for the phonemes p, g, hx, z, v, f, and th. The enhanced consonants scored an overall error rate of less than 37%. It is clear that the enhancements improved the intelligibility of the synthesized speech.
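As an illustration of the scoring, a confusion matrix like those in Figure 2 can be tallied from (actual, perceived) response pairs as sketched below. This is our own reconstruction of the bookkeeping, not the paper's scoring code: it row-normalizes the matrix for readability (the paper's shading convention may differ) and treats "don't know" responses as an extra column, so they count as errors.

    import numpy as np

    CONSONANTS = ["p", "t", "k", "b", "d", "g", "f", "th",
                  "s", "sh", "hx", "v", "dh", "z", "m", "n"]
    LABELS = CONSONANTS + ["?"]  # "?" stands in for "don't know"

    def confusion_and_error_rate(trials):
        """trials: non-empty iterable of (actual, perceived) label pairs.
        Returns a row-normalized confusion matrix (rows: actual consonant,
        columns: perceived label) and the overall error rate."""
        row = {c: i for i, c in enumerate(CONSONANTS)}
        col = {c: i for i, c in enumerate(LABELS)}
        counts = np.zeros((len(CONSONANTS), len(LABELS)))
        for actual, perceived in trials:
            counts[row[actual], col[perceived]] += 1

        totals = counts.sum(axis=1, keepdims=True)
        matrix = np.divide(counts, totals, out=np.zeros_like(counts),
                           where=totals > 0)
        # Correct responses lie on the consonant-vs-consonant diagonal.
        correct = sum(counts[row[c], col[c]] for c in CONSONANTS)
        error_rate = 1.0 - correct / counts.sum()
        return matrix, error_rate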
Finally, to check whether the enhancements designed to improve speech intelligibility in noise affected intelligibility without noise, we conducted a test of the phase 2 speech enhancements without the interfering noise. Test results, shown in Figure 3, indicate that in the noise-free case, except for g and hx, intelligibility for the enhanced consonants was higher than or roughly equal to that of the original DECtalk consonants. Both of these enhanced consonants were more often mistaken for a k in the noise-free case.

[Figure 3: Noise-free error statistics. Number of errors out of 60 samples for each phoneme, comparing the original and enhanced consonants.]

6. FUTURE WORK

Our preliminary work in this area shows much promise, and there is still much work to be done. We should isolate and identify the clear-speech features that contributed to the increase in intelligibility of p, g, hx, z, and v. We should also conduct intelligibility tests with noise at other levels and with other spectral distributions. This study examined only word-initial consonants; intelligibility enhancements to word-medial and word-final consonants should also be studied. More traits of human clear speech [4,10] should be incorporated in the synthesizer to see if they improve intelligibility. These include extending formant transitions, inserting sounds (a schwa vowel) after voiced consonants, slowing down the speech rate, and expanding the first and second formant ranges to make the vowels aa, iy, and ow more distinct (enlarging the aa-iy-ow vowel triangle).

7. REFERENCES

1. Berger, K.W. (1974), "The Hearing Aid: Its Operation and Development," National Hearing Aid Society, Livonia, MI.
2. Miller, G.A. & Nicely, P.E. (1955), "An Analysis of Perceptual Confusions Among Some English Consonants," J. Acoust. Soc. Am., vol. 27, no. 2, 338-352.
3. Fairbanks, G. (1958), "The Test of Phonemic Differentiation: The Rhyme Test," J. Acoust. Soc. Am., vol. 30, no. 7, 596-600.
4. Bradlow, A.R., Torretta, G.M., & Pisoni, D.B. (1996), "Intelligibility of Normal Speech I: Global and Fine-grained Acoustic-phonetic Talker Characteristics," Speech Communication, vol. 20, no. 3-4, 255-272.
5. Neel, A.T., Bradlow, A.R., & Pisoni, D.B. (1996), "Intelligibility of Normal Speech II: Analysis of Transcription Errors," Research on Spoken Language Processing, Progress Report No. 21.
6. Kretsinger, E.A. & Young, N.B. (1960), "The Use of Fast Limiting to Improve the Intelligibility of Speech in Noise," Speech Monogr., no. 27, 63-69.
7. Wright, J.T., Malsheen, B.J., & Peet, M. (1986), "Comparison of Segmental Intelligibility and Pronunciation Accuracy for Two Commercial Text-to-speech Systems," Proc. AVIOS, 235-261.
8. Logan, J., Greene, B., & Pisoni, D. (1989), "Segmental Intelligibility of Synthetic Speech Produced by Rule," J. Acoust. Soc. Am., vol. 86, no. 2, 566-581.
9. Simpson, C.A. & Navarro, T. (1984), "Intelligibility of Computer Generated Speech as a Function of Multiple Factors," Proc. IEEE 1984 Nat. Aerospace and Elec. Conf., vol. 2, 932-940.
10. Picheny, M.A., Durlach, N.I., & Braida, L.D. (1986), "Speaking Clearly for the Hard of Hearing II: Acoustic Characteristics of Clear and Conversational Speech," J. Speech and Hearing Research, vol. 29, 434-446.
11. Junqua, J.-C. (1993), "The Lombard Reflex and its Role on Human Listeners and Automatic Speech Recognizers," J. Acoust. Soc. Am., vol. 93, no. 1, 510-524.
12. Klatt, D.H. & Klatt, L.C. (1990), "Analysis, Synthesis, and Perception of Voice Quality Variations Among Female and Male Talkers," J. Acoust. Soc. Am., vol. 87, no. 2, 820-857.
13. Zue, V. (1985), "Notes on Spectrogram Reading," Mass. Inst. Tech. Course 6.67s lecture notes, Cambridge, MA.