19th INTERNATIONAL CONGRESS ON ACOUSTICS MADRID, 2-7 SEPTEMBER 2007
POLISH SENTENCE TEST FOR SPEECH INTELLIGIBILITY MEASUREMENTS IN MASKING CONDITIONS
PACS: 43.71 Gv
Ozimek, Edward; Kutzner, Dariusz; Sek, Aleksander; Wicher, Andrzej
Institute of Acoustics, Department of Room Acoustic and Psychoacoustics, 85 Umultowska Street, 61-614 Poznan, Poland;
[email protected]
ABSTRACT The article reports on the development of a Polish Sentence Test for accurate and reliable speech intelligibility measurements in noise. The Polish material is composed of 481 different sentences divided into 37 lists of 13 sentences each. All the lists are phonemically and statistically balanced, i.e. the respective lists have comparable phonemic content and similar psychometric functions. Moreover, the phoneme distribution of each list reflects the mean phoneme content of the Polish language. The test is optimised for speech reception threshold (SRT) measurements in speech-like maskers, i.e. babble noise or speech-shaped noise. For otologically normal subjects, SRT = –6.1 dB. The mean steepness of the psychometric functions at the SRT point (S50test) is equal to 25.6 %/dB, i.e. it is slightly higher than that of comparable sentence tests developed previously for other languages.
INTRODUCTION Different methods have been proposed for measuring speech intelligibility in quiet and in noise [2,3,5,7,9,16,18,19,23,24]. They differ in the following aspects: the structure of the speech material, details of the test procedure, presentation level, range of the signal-to-noise ratio (SNR), type of interfering noise and presentation mode. As far as the structure of the speech material is concerned, one can distinguish three basic types of tests: word intelligibility tests [1,8,12-14], digit intelligibility tests [12,13,16] and sentence intelligibility tests. The sentence tests can be divided into those using meaningful, everyday utterances [7,9,11,17,18] and those using semantically unpredictable (nonsense) sentences [3,19-22]. The main disadvantage of word intelligibility tests, as used in traditional speech audiometry, is the difficulty of determining a precise SRT value (defined as the signal-to-noise ratio that yields a 50% probability of a correct response) in a time-efficient manner. Sentence intelligibility tests, on the other hand, have been shown to be much more accurate [3,7,9,10,18,24]. They have some advantages over word tests in testing the quality of hearing aids with different signal processing algorithms, since they provide a reliable distinction between unaided and aided patient performance. Furthermore, unlike words and digits, sentence materials reflect the natural communication process. Many sentence tests have been developed for accurate measurement of the speech reception threshold (SRT) in noise [5,7,9,10,18]. Since no Polish sentence materials for speech intelligibility assessment in noise had been available, the present study deals with the preparation and evaluation of the Polish Sentence Test, to be used for accurate and reliable measurement of the SRT in noise. The test is structurally similar to the previously developed meaningful sentence tests in Dutch [10], American English [9] and German [7].
The speech test developed is designed for evaluation of speech intelligibility in clinical conditions as well as for laboratory measurements.
DEVELOPING THE SENTENCE MATERIALS The process of developing the sentence test was as follows: 1) initial selection of the sentences (in written form); 2) recordings; 3) measurements of sentence intelligibility in noise and derivation of the psychometric functions; 4) selection of the optimal sentences; 5) composition of statistically and phonemically balanced sentence lists; 6) verification measurements.
Initial preparation of the sentences and recordings In the first stage, about 3500 sentences were selected automatically from a large digitised database containing about 16 million Polish utterances taken from everyday speech, literature, TV and theatre. All of them fulfilled the fundamental definition of a sentence, i.e. they included a subject, an object and a verb, and they referred to normal everyday contexts. The following criteria were used in the automatic selection of the sentences (similar to Versfeld et al. [18]): the total number of syllables in a sentence should be equal to eight or nine (the number of words ranged from 3 to 7, with a mean of 4.6 words per sentence); the words in the sentences should not contain more than three syllables each; the sentences should not contain punctuation characters or capitals (excluding the initial capital). No duplicate sentences were selected. The second stage of sentence selection was carried out manually on the basis of the following criteria: the sentences should be grammatically and syntactically correct and semantically neutral, which excluded political, war or gender-related topics. Questions, proverbs, proper names and exclamations were eliminated. This process reduced the number of sentences to 1200. The selected sentences were read out with natural intonation by a professional male speaker. During the recording, special care was taken to keep approximately the same loudness level over time. Recording was performed in a high-quality radio studio using a Neumann U87 capacitor microphone. The microphone output fed one of the input channels of a Yamaha 02R mixer. In the mixer, the microphone signal was pre-amplified and converted into the digital domain at a sampling rate of 44.1 kHz with a resolution of 24 bits. It was also digitally high-pass filtered at a cut-off frequency of 80 Hz.
The signals were then sent via an optical connection (ADAT-type) to a PC and stored on a hard disc using Samplitude Pro v.8.2 software.
Sentence intelligibility measurements A computer-controlled Tucker-Davis Technologies (TDT) System III with a 24-bit digital real-time signal processor (RP2) and a headphone amplifier (HB7) was used to play back the sentence material with the interfering noise at fixed SNRs. The level (in dB SPL) of the output signal was calibrated with B&K instruments (an artificial ear type 4153 connected to a microphone type 4134, a preamplifier type 2669 and an amplifier type 2610). Speech signals were presented monaurally via Sennheiser HD 580 headphones. The test sentences were mixed digitally with a speech babble noise (masker) at a constant level of 70 dB SPL. The babble noise was generated by summing all of the sentences and normalizing the rms value of the resultant waveform. The waveforms representing single sentences were randomly shifted in the time domain with respect to each other and, additionally, half of them were reversed. In this way, the decrease in masker power resulting from a decline in the speaker's vocal effort during an utterance was reduced. As a result, a 15-s realisation of the speech babble noise was obtained. For this type of masker, the average SNRs in the respective frequency bands (auditory filters) were kept constant for a given speaker. In the experiments, the sentence presentation level was changed to obtain different SNRs. All sentences were equalized in rms [18]. SNR was defined as the ratio of the sentence rms to the masking noise rms. The masking noise was a gated signal (20-ms ramps) that started 300 ms before the onset of the sentence and ended 300 ms after the end of the sentence.
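The masker construction and the SNR definition given above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the circular time shift and the choice of reversing every second waveform are assumptions made for illustration, and the 20-ms gating and calibration to dB SPL are omitted.

```python
import math
import random


def rms(x):
    """Root-mean-square value of a sequence of samples."""
    return math.sqrt(sum(v * v for v in x) / len(x))


def scale_to_rms(x, target_rms):
    """Return a copy of x rescaled so that its rms equals target_rms."""
    gain = target_rms / rms(x)
    return [v * gain for v in x]


def make_babble(sentences, length, rng):
    """Sum randomly shifted sentences (every second one reversed); rms set to 1.

    The circular shift and the choice of which half to reverse are
    assumptions; the text only states that the shifts were random and
    that half of the waveforms were reversed.
    """
    babble = [0.0] * length
    for i, wave in enumerate(sentences):
        if i % 2:
            wave = wave[::-1]                  # time-reverse every second sentence
        shift = rng.randrange(length)          # random offset in samples
        for n, v in enumerate(wave):
            babble[(shift + n) % length] += v
    return scale_to_rms(babble, 1.0)


def scale_sentence_to_snr(sentence, noise, snr_db):
    """Scale the sentence (masker level untouched) to the requested SNR in dB."""
    return scale_to_rms(sentence, rms(noise) * 10.0 ** (snr_db / 20.0))
```

With the masker level fixed, a call such as `scale_sentence_to_snr(sentence, babble, -6.0)` changes only the sentence level, in line with the procedure described in the text.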
All sentences were presented at 5 SNRs: three SNR values were fixed at –9, –6 and –1 dB, and the two other values were adjusted (based on the results of pilot measurements) so as to optimally encompass the SRT point; they lay between –9 and –6 dB and between –6 and –1 dB, respectively. Thirty-five normal-hearing subjects took part in the measurements. Each sentence was presented to a given listener only once; across the 35 listeners, each sentence was thus presented 7 times at each of the 5 SNRs. The subject's task was to type on a keyboard what they had just heard. Apart from the typed response, oral answers were recorded via a Yamaha MG10/2 microphone and saved on a hard disc as a wav file. A comparison of the typed and oral responses turned out to be very useful in cases where listeners made typing errors. During the measurements the subjects were seated in an acoustically insulated booth. The listeners were paid for their participation in the experiments.
[Plot: power spectrum density (dB re: 1, from 10 down to –50 dB) versus frequency (kHz, logarithmic axis from 0.05 to 10 kHz)]
Figure 1. Power spectrum density of the masking noise used in the measurements (solid line). For comparison, the power spectrum of the masker used in the study of Versfeld et al. [18] is presented (dashed line).
Determination of the psychometric functions For each sentence and each SNR the intelligibility data were determined, i.e. the probability of a correct response was computed. A subject's response was scored 1 (100%) if and only if the entire sentence was understood correctly; otherwise it was scored 0 (0%). The psychometric function, i.e. the function that links the probability of a correct response to the SNR at which the signal is presented, can be modelled by an arbitrary S-shaped function [15]. In this case, normalized Gaussian cumulative functions were fitted to the intelligibility data by a least-mean-square procedure. The formula describing the psychometric function was as follows:
$$\varphi(\mathrm{SNR}) = \frac{100}{\sqrt{2\pi}} \int_{-\infty}^{\frac{\mathrm{SNR}-\mathrm{SRT}}{\sigma}} e^{-t^{2}/2}\, dt \qquad \text{(Eq. 1)}$$
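where σ denotes the spread of the data and S50 the steepness of the psychometric function at the SRT point. The two parameters are easily interconvertible:
$$\sigma = \left(\sqrt{2\pi}\, S_{50}\right)^{-1},$$
with S50 expressed as a proportion per dB (i.e. the value in %/dB divided by 100).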
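As an illustration of the cumulative Gaussian model and of the conversion between the spread and the steepness, a short Python sketch is given below. It is not the fitting routine used in the study: the brute-force grid search merely stands in for the least-squares procedure, and the function names and grid limits are assumptions. Here S50 is passed in %/dB, so the conversion carries a factor of 100.

```python
import math


def sigma_from_s50(s50):
    """Spread sigma (dB) from the steepness s50 given in %/dB."""
    return 100.0 / (math.sqrt(2.0 * math.pi) * s50)


def psychometric(snr, srt, s50):
    """Eq. 1: percent correct at SNR `snr` (dB) for a cumulative Gaussian."""
    z = (snr - srt) / sigma_from_s50(s50)
    return 50.0 * (1.0 + math.erf(z / math.sqrt(2.0)))


def fit_psychometric(snrs, scores):
    """Brute-force least-squares fit; returns (srt, s50).

    Grid (an assumption): SRT from -12 to 0 dB in 0.05 dB steps,
    S50 from 1 to 60 %/dB in 0.5 %/dB steps.
    """
    best = (float("inf"), None, None)
    for i in range(241):
        srt = -12.0 + 0.05 * i
        for j in range(119):
            s50 = 1.0 + 0.5 * j
            err = sum((psychometric(x, srt, s50) - y) ** 2
                      for x, y in zip(snrs, scores))
            if err < best[0]:
                best = (err, srt, s50)
    return best[1], best[2]
```

By construction, `psychometric(srt, srt, s50)` returns 50% and the slope of the curve at that point equals `s50`, which is the property the SRT and S50 parameters are meant to capture.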
Consequently, 1200 psychometric functions and 1200 corresponding SRT and S50 parameters were determined.
Selection of the optimal sentences It is widely known that homogenized, i.e. statistically balanced, test materials require stimuli with very similar and steep psychometric functions [7,10,18]. Accordingly, 500 sentences with SRT values within ±1.5 dB of the mean SRT of all sentences and with S50 not less than 15 %/dB were selected. The remaining 700 sentences did not fulfil these conditions and were rejected.
Composition of statistically and phonemically balanced lists The final stage of the development of the sentence test for speech intelligibility measurements was the composition of a set of phonemically and statistically balanced lists. In other words, the respective lists are required to have very similar phonemic content that additionally reflects the reference phoneme distribution for the Polish language. Furthermore, because of learning effects, the lists should be composed of different sentences but are required to 'produce' the same results under given measurement conditions, i.e. SRT and S50 ought not to depend on the list index. It was decided that optimally composed lists should meet the following criteria: the list-specific (mean) SRT should fall within ±0.5 dB of the mean SRT of the 500 selected sentences, and the mean steepness S50 of the respective list should fall within ±1 %/dB of the mean S50 of the selected sentences (homogenisation of the psychometric functions). Moreover, the mean phonemic distribution of each list was determined and compared to the reference distribution for the Polish language [4]. The lists were regarded as phonemically balanced if, for each list and each phoneme, the frequency of occurrence of the phoneme deviated by no more than ±2.5 percentage points from the frequency of occurrence of that phoneme in the reference distribution, i.e. a quality of phonemic balance comparable to that obtained by Kollmeier and Wesselkamp [7].
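The sentence-selection rule and the phonemic-balance criterion described above can be expressed compactly. The following Python sketch is illustrative only: the dictionary keys `srt` and `s50`, the function names and the representation of a transcription as a list of phoneme symbols are all assumptions, not part of the original material.

```python
from collections import Counter
from statistics import mean


def select_optimal(sentences, srt_tol_db=1.5, min_s50=15.0):
    """Keep sentences with SRT within srt_tol_db of the overall mean SRT
    and with steepness S50 (in %/dB) of at least min_s50."""
    mean_srt = mean(s["srt"] for s in sentences)
    return [s for s in sentences
            if abs(s["srt"] - mean_srt) <= srt_tol_db and s["s50"] >= min_s50]


def phoneme_percentages(transcriptions):
    """Relative frequency (in %) of each phoneme over several transcriptions."""
    counts = Counter(p for t in transcriptions for p in t)
    total = sum(counts.values())
    return {p: 100.0 * c / total for p, c in counts.items()}


def is_phonemically_balanced(list_pct, reference_pct, tol_pp=2.5):
    """True if every phoneme lies within tol_pp percentage points of the reference."""
    phonemes = set(list_pct) | set(reference_pct)
    return all(abs(list_pct.get(p, 0.0) - reference_pct.get(p, 0.0)) <= tol_pp
               for p in phonemes)
```

Phonemes absent from a list are treated as occurring with a frequency of 0%, so a list lacking a phoneme that is common in the reference distribution is rejected as well.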
In order to compose such sentence lists automatically, a special algorithm performing Monte Carlo simulations was implemented in Matlab 7.0 (MathWorks). The steps in the compilation were as follows: 1) generation of a random permutation of the N=500 selected sentences; 2) random selection of n=13 sentences; 3) analysis of the statistical properties and phonemic content of the selected group of n=13 sentences; 4) if the mean SRT, the mean S50 and the phoneme distribution for each of the 39 phonemes met the above-mentioned conditions, the selected group of n=13 sentences was stored on a hard disc as a list, the algorithm returned to 1) and the process was repeated for the remaining N-n sentences; 5) if the above-mentioned criteria were not met, the algorithm returned to 1) and the process was repeated for the N sentences. The algorithm terminated when 37 different (but statistically and phonemically homogenized) lists of 13 sentences had been generated. In total, 37*13=481 of the 500 'optimal' sentences were used; thus 19 sentences were rejected.
Properties of the material developed Fig. 2 depicts the psychometric functions for an exemplary sentence list (solid lines) and the so-called list-specific psychometric function (dashed line).
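The list-compilation steps 1) to 5) described above can be sketched in Python as follows. This is a simplified illustration rather than the original Matlab code: the acceptance test is passed in as a function, the example criterion checks only the mean SRT and mean S50 (the phonemic check is left out), and `max_tries` is a safety limit added here that the original description does not mention. All names are hypothetical.

```python
import random
from statistics import mean


def compose_lists(sentences, n_lists, list_size, accept, rng, max_tries=100000):
    """Draw random candidate lists until n_lists accepted lists are found.

    Returns (lists, unused_sentences). `accept` is any function that takes
    a candidate list and returns True when the balance criteria are met.
    """
    pool = list(sentences)
    lists = []
    tries = 0
    while len(lists) < n_lists and len(pool) >= list_size and tries < max_tries:
        tries += 1
        rng.shuffle(pool)                  # step 1: random permutation of the pool
        candidate = pool[:list_size]       # step 2: take n sentences
        if accept(candidate):              # steps 3-4: check criteria, store the list
            lists.append(candidate)
            pool = pool[list_size:]        # continue with the remaining N - n sentences
        # step 5: otherwise the unchanged pool is reshuffled on the next pass
    return lists, pool


def make_statistical_criterion(sentences, srt_tol=0.5, s50_tol=1.0):
    """Acceptance test on mean SRT (dB) and mean S50 (%/dB) relative to the full set."""
    ref_srt = mean(s["srt"] for s in sentences)
    ref_s50 = mean(s["s50"] for s in sentences)

    def accept(candidate):
        return (abs(mean(s["srt"] for s in candidate) - ref_srt) <= srt_tol
                and abs(mean(s["s50"] for s in candidate) - ref_s50) <= s50_tol)

    return accept
```

Passing the acceptance test as an argument keeps the search loop independent of the particular criteria, so a phonemic-balance check could be combined with the statistical one without changing `compose_lists`.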
[Plot, List 21: sentence intelligibility score (%) versus SNR (dB, from –12 to 2 dB); annotation: SRTlist = –6.3 dB, S50list = 26.8 %/dB]
Figure 2. Sentence-specific psychometric functions (solid lines) and the list-specific psychometric function (dashed line) for an exemplary sentence list.
The list-specific SRTlist was computed as the mean of the SRT values of the individual sentences constituting the list (sentence-specific SRTs). The list-specific steepness S50list was determined according to the probabilistic model proposed by Kollmeier [6]:
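$$S_{50\mathrm{list}} \approx \frac{S_{50\mathrm{mean}}}{\sqrt{1 + \dfrac{16\, S_{50\mathrm{mean}}^{2}\, \sigma_{\mathrm{SRT}}^{2}}{\left(\ln\left(2e^{1/2} - 1 + 2e^{1/4}\right)\right)^{2}}}} \qquad \text{(Eq. 2)}$$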
where S50mean is the mean of the S50 values of the sentences in a list and σSRT is the standard deviation of SRT across those sentences. Eq. 2 indicates that a large list-specific steepness S50list, i.e. high accuracy and reliability of the test, requires large sentence-specific steepness parameters and a small spread of SRT values across the respective sentences. Fig. 3 presents a juxtaposition of the list-specific psychometric functions obtained for the 37 lists developed. Fig. 4 depicts a comparison of the mean phoneme distribution of the lists and the reference phoneme distribution for the Polish language determined from a large database of utterances.
Retest measurements To examine the reliability of the designed speech materials, retest experiments were carried out. The measurements were conducted using both the constant stimuli paradigm and an adaptive method (with a 1-up/1-down decision rule converging to the 50%-probability point on the psychometric function, i.e. the SRT).
[Plot: sentence intelligibility score (%) versus SNR (dB, from –12 to 2 dB); annotation: SRT = –6.1 dB, S50test = 25.6 %/dB]
Figure 3. Comparison of the list-specific psychometric functions for 37 sentence lists
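The list-specific steepness underlying curves of this kind follows from Eq. 2 and can be evaluated numerically. The Python sketch below makes two assumptions that the text does not state: the steepness values are supplied in %/dB and converted internally to a proportion per dB, and σSRT is taken as the population standard deviation. The function name is hypothetical.

```python
import math
from statistics import mean, pstdev

# Constant from Eq. 2: ln(2*e^(1/2) - 1 + 2*e^(1/4))
K = math.log(2.0 * math.exp(0.5) - 1.0 + 2.0 * math.exp(0.25))


def list_steepness(s50_values, srt_values):
    """Eq. 2: list-specific steepness in %/dB.

    s50_values: sentence-specific steepness values in %/dB
    srt_values: sentence-specific SRTs in dB
    """
    s50_mean = mean(s50_values) / 100.0      # convert to proportion per dB (assumption)
    sigma_srt = pstdev(srt_values)           # population standard deviation (assumption)
    denom = math.sqrt(1.0 + 16.0 * s50_mean ** 2 * sigma_srt ** 2 / K ** 2)
    return 100.0 * s50_mean / denom
```

When all sentences in a list share the same SRT, the function returns the mean sentence steepness unchanged; any spread in SRT lowers the list steepness, which is the behaviour Eq. 2 is meant to describe.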
[Plot: phoneme distribution (%, from 0 to 12) for each phoneme; legend: reference distribution, mean distribution]
Figure 4. A comparison between the reference phoneme distribution for the Polish language [4] (open circles) and the mean phoneme distribution of the 37 sentence lists (filled circles). The vertical bars show the range between the maximum and minimum percentage of each phoneme across the lists.
For both the constant stimuli and the adaptive measurements, the standard deviations between the list-specific SRTs obtained in the previous part of the study and the list-specific SRT values determined in the retest measurements, i.e. the so-called test-retest standard deviations [10], were less than 1 dB.
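The 1-up/1-down rule used in the adaptive retest measurements can be illustrated with a minimal Python sketch. The step size, the starting SNR, the number of trials and the rule for averaging the track are not given in the text and are assumptions here, as are the function names; `respond` stands for any listener (real or simulated) that returns True for a correctly repeated sentence.

```python
def run_staircase(respond, start_snr_db, step_db, n_trials):
    """1-up/1-down track: lower the SNR after a correct response, raise it otherwise.

    Returns the list of SNRs at which the sentences were presented.
    """
    snr = start_snr_db
    track = []
    for _ in range(n_trials):
        track.append(snr)
        if respond(snr):
            snr -= step_db      # correct response: make the task harder
        else:
            snr += step_db      # incorrect response: make the task easier
    return track


def estimate_srt(track, skip=4):
    """SRT estimate as the mean presented SNR after discarding the first `skip` trials."""
    tail = track[skip:]
    return sum(tail) / len(tail)
```

Because every correct response lowers the SNR and every incorrect response raises it by the same amount, the track settles around the SNR at which correct and incorrect responses are equally likely, i.e. the 50% point of the psychometric function.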
DISCUSSION AND CONCLUSIONS The mean SRT for the test is –6.1 dB, which is comparable to the corresponding SRT values of previously developed sentence tests: the German Göttingen Test [7], –6.2 dB; the German OLSA Test [20-22], –7.1 dB; and the Danish DANTALE II Test [24], –8.4 dB. Since SRT = –6.1 dB was obtained for otologically normal subjects, it might be regarded as a reference value for clinical measurements. The mean steepness of the Polish Sentence Test is S50test = 25.6 %/dB, which is slightly larger than that of other sentence materials, for which this parameter varies between 16 %/dB [3] and 21.2 %/dB, depending on masker type, linguistic structure and scoring method. Nevertheless, when the Polish Sentence Test was used under conditions similar to those of, for example, the Göttingen Test (babble noise masker, semantically predictable sentences, word scoring), the obtained SRT and S50 values turned out to be close: SRTPolish = –7.5 dB and SRTGerman = –6.2 dB; S50Polish = 22.4 %/dB and S50German = 19.2 %/dB. What is more, when the intelligibility of the Polish sentences was measured in speech-shaped noise with sentence scoring, as in the recently designed French Sentence Test, the corresponding parameters turned out to be quite comparable: SRTPolish = –6.7 dB and SRTFrench = –7.8 dB; S50Polish = 22.5 %/dB and S50French = 21.2 %/dB.
Summarizing, the main purpose of this study, i.e. the preparation of Polish Sentence materials for accurate intelligibility measurements under noisy conditions, has been met.
ACKNOWLEDGMENTS This work was financially supported by the European Union FP6, Project 004171 HEARCOM and the State Ministry of Education and Science.
REFERENCES
[1] Bosman, A.J. and G.F. Smoorenburg, Intelligibility of Dutch CVC syllables and sentences for listeners with normal hearing and with three types of hearing impairment. Audiology, 1995. 34: p. 260-284.
[2] Brachmanski, S. and P. Staroniewicz, Fonetyczna struktura materialu testowego stosowanego w subiektywnych pomiarach jakosci mowy, in Speech and Language Technology. 1999: Poznan. p. 71-80.
[3] Hagerman, B., Sentences for testing speech intelligibility in noise. Scandinavian Audiology, 1982. 11: p. 79-87.
[4] Jassem, W., Podstawy fonetyki akustycznej (Bases of acoustic phonetics). 1973, Warszawa: PWN.
[5] Kalikow, D.N., K.N. Stevens, and L.L. Elliot, Development of a test of speech intelligibility in noise using sentence materials with controlled word predictability. Journal of the Acoustical Society of America, 1977. 61(1337): p. 1337-1351.
[6] Kollmeier, B., Messmethodik, Modellierung und Verbesserung der Verständlichkeit von Sprache. 1990, Georg-August-Universität: Göttingen.
[7] Kollmeier, B. and M. Wesselkamp, Development and evaluation of a sentence test for objective and subjective speech intelligibility assessment. Journal of the Acoustical Society of America, 1997. 102(4): p. 1085-1099.
[8] Martin, M., Speech Audiometry. 2nd ed. 1997: Wiley Publishers.
[9] Nilsson, M., S.D. Soli, and J.A. Sullivan, Development of the Hearing in Noise Test for the measurement of speech reception thresholds in quiet and in noise. Journal of the Acoustical Society of America, 1994. 95: p. 1085-1099.
[10] Plomp, R. and A.M. Mimpen, Improving the reliability of testing the speech reception threshold for sentences. Audiology, 1979. 18: p. 43-53.
[11] Plomp, R. and A.M. Mimpen, Speech-reception threshold for sentences as a function of age and noise level. Journal of the Acoustical Society of America, 1979. 66: p. 1333-1342.
[12] Pruszewicz, A., G. Demenko, L. Richter, and T. Wika, New articulation lists for speech audiometry. Part I. Otolaryngol. Pol., 1994. 48: p. 50-55.
[13] Pruszewicz, A., G. Demenko, L. Richter, and T. Wika, New articulation lists for speech audiometry. Part II. Otolaryngol. Pol., 1994. 48: p. 56-62.
[14] Runge, C.A. and H. Hosford-Dunn, Word Recognition Performance with Modified CID W-22 Word Lists. Journal of Speech and Hearing Research, 1985. 28(3): p. 355-362.
[15] Smits, C., Hearing screening by telephone. 2005, Amsterdam: VU University Medical Center.
[16] Smits, C., T. Kapteyn, and T. Houtgast, Development and validation of an automatic speech-in-noise screening test by telephone. International Journal of Audiology, 2004. 43: p. 15-28.
[17] Smoorenburg, G.F., Speech reception in quiet and in noisy conditions by individuals with noise-induced hearing loss in relation to their tone audiogram. Journal of the Acoustical Society of America, 1992. 91(1): p. 421-437.
[18] Versfeld, N.J., L. Daalder, J.M. Festen, and T. Houtgast, Method for the selection of sentence material for efficient measurement of the speech reception threshold. Journal of the Acoustical Society of America, 2000. 107: p. 1671-1684.
[19] Wagener, K., Factors influencing sentence intelligibility in noise, in PhD Dissertation, University of Oldenburg. 2003.
[20] Wagener, K., T. Brand, and B. Kollmeier, Development and evaluation of a German sentence test II: Optimalization of the Oldenburg sentence tests (in German). Z. Audiol., 1999. 38: p. 44-56.
[21] Wagener, K., T. Brandt, and B. Kollmeier, Development and evaluation of a German sentence test I: Design of the Oldenburg sentence test (in German). Z. Audiol., 1999. 38: p. 4-15.
[22] Wagener, K., T. Brandt, and B. Kollmeier, Development and evaluation of a German sentence test III: Evaluation of the Oldenburg sentence test (in German). Z. Audiol., 1999. 38: p. 86-95.
[23] Wagener, K., F. Eeenboom, T. Brandt, and B. Kollmeier, Ziffern-Tripel-Test: Sprachverständlichkeitstest über das Telefon. 8. DGA Jahrestagung, 2005.
[24] Wagener, K., J.L. Josvassen, and R. Ardenkjaer, Design, Optimization, and Evaluation of a Danish Sentence Test in Noise. Journal of International Audiology, 2005. 42(1): p. 10-17.