Automatic Segmentation of Continuous Speech on Word Level based on Supra-segmental Features

Klára Vicsi and György Szaszák
Budapest University for Technology and Economics, Department of Telecommunication and Mediainformatics, Laboratory of Speech Acoustics, Budapest, Hungary
Email: {szaszak, vicsi}@bme.tmit.hu

Abstract: This article presents a cross-lingual study for Hungarian and Finnish on the segmentation of continuous speech at word and phrase level by examining supra-segmental parameters. A word-level segmenter has been developed which can indicate word boundaries with acceptable precision for both languages. The ultimate aim is to increase the robustness of speech recognition at the language modelling level by detecting word and phrase boundaries, which can significantly reduce the search space during decoding. Search space reduction is particularly important for agglutinative languages. In Hungarian and in Finnish, if a word is stressed, the stress always falls on its first syllable. Thus, if stressed syllables can be detected, they must mark the beginning of a word. We have developed different algorithms based on either a rule-based or a data-driven approach, and we compare the rule-based algorithms with HMM-based methods. The best results were obtained by data-driven algorithms using the time series of fundamental frequency and energy together. Syllable length was found to be much less effective and was therefore discarded. Using supra-segmental features, word boundaries can be marked with high accuracy, even if not all of them can be found. The method we evaluated is easily adaptable to other fixed-stress languages. To investigate this, we adapted our data-driven method to Finnish and obtained similar results.

This paper is under publication in the International Journal of Speech Technology.

I. INTRODUCTION

Supra-segmental features are an integral part of every spoken language utterance. They can provide cues to the linguistic structure of the speaker's message, emotional state or communicative intent. Intonation, stress, rhythm, etc. can help to signal how the syntactic structure of utterances is organized into larger discourse segments, and they provide additional information for human speech processing.

Using supra-segmental features in automatic speech recognition to increase its robustness is a trend again. Some trials were conducted in the mid-eighties, but it has not yet been possible to exploit such knowledge in an automatic speech recognition system [1], [7]. This relative failure, according to Philippe Langlais [3], is mainly due to three types of difficulties:
• Significant contextual variability of prosodic knowledge (type of speech, speaker, structure and content of sentences, nature of the environment, etc.);
• Complexity of the relations between prosodic information and the various linguistic organization levels of a message;
• Problems encountered with accurate measurement of prosodic parameters, and their possible integration on a perceptual level.

The use of prosodic knowledge has been much more successful in the field of speech synthesis. For example, J. Venditti and J. Hirschberg summarised the current state of knowledge in intonation and discourse processing for American English [11]. They described an intonation-discourse interface which can be used in speech technology, mainly for speech synthesis. In a number of recent works researchers have focused on temporal information for the detection of speech landmarks, again in American English [9]; A. Salomon, C.Y. Espy-Wilson and O. Deshmukh, for example, used this method in [8] as the front end of an HMM-based system for automatic noisy speech recognition. Other researchers used multiple cues for the detection of phrase boundaries in continuous speech, and integrated these into speech recognition systems [4], [5].

Speech production is a continuous movement of the articulating organs, producing a continuous acoustic signal. In human speech processing, linguistic content and phonological rules help the brain to separate syntactic units, such as sentences, phrases (sections between two intakes of breath), or even words. In our experiments we examined how words can be automatically separated in continuous speech in fixed-stress languages such as Hungarian and Finnish (both of these languages belong to the Finno-Ugrian language family). These two languages are highly agglutinative, so they are characterized by a longer average word length than English and also by a relatively free word order. Due to this, almost all words carry some stress (stronger or weaker depending on the syntactic structure), normally on the first syllable (fixed stress), except for conjunctions and articles.
This means that the word-level intonation units we dealt with in our experiments are composed of a word, stressed on the first syllable, together with any unstressed conjunctions or articles. These word-level intonation units are shorter than prosodic phrases. Hence, in the rest of this article we will strictly use the expression "word unit", and the boundaries of these units will be called "word boundaries".

In our experiments we measured fundamental frequency, energy and time course. These parameters (all or some of them) are necessary for the realisation and perception of stress in the Finno-Ugrian language family [2]. Fundamental frequency and energy level, measured at the middle of the vowels of the utterance, and the duration of the vowels in the syllables are presented for a Hungarian sentence in Fig. 1 as an example. Based on our detailed investigation of Hungarian and Finnish databases (see Section II), the peaks of energy and fundamental frequency clearly mark the first syllables of the words. Syllable length was found not to be greatly influenced by stress.

Fig. 1. Fundamental frequency and energy levels measured at the middle of vowels and duration of the vowels in the syllables in a Hungarian sentence 'Titkárul szerződtette a főkonzul lányát.' The syllable sequence is presented on the X axis.

II. METHODOLOGY

For stress classification, two methods were used and compared after the acoustical pre-processing of speech: a rule-based method and an HMM-based statistical method. The Hungarian BABEL [6] and the Finnish Speech Database [12] continuous read speech databases were used for the examination. The databases were segmented on the prosodic level by an expert: word, phrase and sentence boundaries were also marked. For Hungarian, 1600 sentences from 22 speakers were used, and for Finnish, 250 sentences from 4 speakers.

A. Acoustic pre-processing

For stress detection, fundamental frequency (Hz) and energy level (dB) were measured exactly at the middle of the vowel of each syllable. Fundamental frequency was determined with the autocorrelation method. In addition, median filtering was applied to the fundamental frequency sequence, so the fundamental frequency $F_i$ at the $i$th frame was obtained after median filtering as:

$$F_i = \mathrm{med}\{F_{i-3}, F_{i-2}, F_{i-1}, F_i, F_{i+1}, F_{i+2}, F_{i+3}\} \qquad (1)$$

The energy values $E_i$ were calculated with an integration time of 100 ms using the standard equation:

$$E_i = \frac{1}{M+1} \sum_{n=i-\frac{M}{2}}^{i+\frac{M}{2}} s_n^2 \qquad (2)$$

where $M$ is the number of samples per 100 ms and $s$ is the speech signal. The fundamental frequency and energy level data streams both have a frame rate of one value per 25.6 ms.

B. Boundary detection of the word unit

For the detection of word boundaries, we have tested rule-based and data-driven algorithms. In this subsection we present and compare these two approaches.

1) Rule based approach: This approach is based on the detection of emphasized syllables. Since in Hungarian emphasis is perceived on vowels, we assigned to each syllable an F0 and an energy value measured on the stationary part of the vowel. Additionally, we assigned the first order deltas of these measures to each syllable. To determine emphasized syllables in speech, we used peak-detection algorithms. These methods are based on the calculation of means and variances of the given data streams, in our case F0 and energy. For a data stream $X = \{x_1, x_2, \dots, x_n\}$ we calculate a threshold $T$ in the following way:

$$T = M + c \cdot \sigma \qquad (3)$$

where $c$ is a constant with values usually between 0.5 and 1.5, $M$ is the expected value of the stream (not the sample count of Equation 2) and $\sigma$ is its standard deviation. If a value $x_i$ in the stream is found to be higher than the threshold $T$, stress is detected on the corresponding syllable.

Taking a Hungarian declarative sentence as a reference, we can note that F0 and the energy level fall towards the end of phrases. To compensate for this descending intonation on the phrasal level, the threshold $T$ is computed with a sliding window covering 7 to 17 syllables. In this way the threshold is aligned to the intonation curve of phrases (or sentences). The threshold value assigned to the $i$th syllable is

$$T_i = M(x_{i-A}, x_{i-A+1}, \dots, x_i) + c \cdot \sigma(x_{i-A}, x_{i-A+1}, \dots, x_i) \qquad (4)$$

if $i > A$, and

$$T_i = M(x_1, x_2, \dots, x_A) + c \cdot \sigma(x_1, x_2, \dots, x_A) \qquad (5)$$

otherwise, where $A$ is the length of the sliding window measured in syllables and $x_i$ represents the indexed elements of a data stream of either F0 or energy level values.

We also computed threshold values for data streams obtained by taking the absolute value of the F0 and energy level differences between two neighbouring syllables, as follows:

$$M_i = \frac{1}{A} \sum_{j=i-A}^{i} \mathrm{abs}(x_j - x_{j-1}) \qquad (6)$$

$$\sigma_i^2 = \frac{1}{A} \sum_{j=i-A}^{i} (M_j - x_j)^2 \qquad (7)$$
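The median filtering of Equation 1 and the peak-detection rule of Equations 3 to 5 can be illustrated with a minimal Python sketch. It is not the implementation used in the experiments: the function names, the NumPy dependency, the truncation of the median window at the sequence edges, the use of the population standard deviation for σ, and the example values are all assumptions made for illustration.

```python
import numpy as np


def median_filter_7(f0):
    """7-point median filter of an F0 sequence (Eq. 1).

    Assumption: near the edges the window is simply truncated.
    """
    f0 = np.asarray(f0, dtype=float)
    out = np.empty_like(f0)
    for i in range(len(f0)):
        lo, hi = max(0, i - 3), min(len(f0), i + 4)
        out[i] = np.median(f0[lo:hi])
    return out


def sliding_thresholds(x, A, c):
    """Per-syllable thresholds T_i = mean + c * std over a sliding window.

    For i > A (1-based) the window covers x_{i-A} .. x_i (Eq. 4);
    for the first A syllables the fixed window x_1 .. x_A is used (Eq. 5).
    """
    x = np.asarray(x, dtype=float)
    T = np.empty_like(x)
    for k in range(len(x)):          # k is 0-based, so i = k + 1
        if k + 1 > A:
            window = x[k - A:k + 1]  # x_{i-A} .. x_i
        else:
            window = x[:A]           # x_1 .. x_A
        T[k] = window.mean() + c * window.std()
    return T


def detect_stress(x, A=9, c=0.9):
    """Return 0-based indices of syllables whose value exceeds its threshold."""
    x = np.asarray(x, dtype=float)
    return np.flatnonzero(x > sliding_thresholds(x, A, c)).tolist()


# Hypothetical per-syllable F0 values in Hz, for illustration only.
f0 = [100, 100, 100, 100, 200, 100, 100, 100, 100, 100]
print(detect_stress(f0, A=4, c=1.0))  # -> [4]
```

In this toy stream only the fifth syllable exceeds its local threshold. The same function could be applied to an energy stream, or to a stream of absolute differences between neighbouring syllables such as `np.abs(np.diff(x))`, although Equations 6 and 7 define the statistics of the difference streams directly.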

For these difference streams we again used a sliding window (see Equations 4 and 5) to compensate for the decreasing dynamic range of speech towards the end of the phrase.

The detected stressed vowel may be situated in the onset or in the nucleus of the syllable. Hence, the onset of the word unit is detected by searching for a minimum point preceding the stress mark (placed by the stress detection method described above). This minimum point is defined as the absolute minimum of the discrete F0 and/or energy curves within the interval [t − 100 ms, t], where t refers to the position of the stress mark. The interval length of 100 ms was set empirically for normal Hungarian speech rate.

2) Data-driven approach: In the data-driven approach we used the well-known HMM method [10] to determine word units and their boundaries. The HMM method was applied using F0 and energy level parameters and their first and second order deltas. This type of examination needs a speech database segmented on the prosodic level to train prosodic pattern HMMs. The databases were segmented by an expert using an audio-visual segmentation method relying on fundamental frequency and energy cues. Prosodic phrase segment boundaries were marked so that they coincide with word boundaries.

A set of 6 different Hidden Markov Models was constructed. Five of the models represent 5 types of prosodic (intonation) curves: descending, falling, rising, floating and rise-fall. In other words, we interpreted intonation on the syntagmatic or on the word level. The 6th model is a silence model. Training examples for the different prosodic curves are presented in Fig. 2 between cursors.

The block diagram of the training procedure of the HMMs is shown in Fig. 3. Speech is pre-processed acoustically as described above, and F0 and energy data are computed, which are then used to train the prosodic HMMs. For word-level segmentation, these HMMs are used to recognize prosodic patterns in pre-processed input data. The prosodic model itself may be of interest in the future, but for the moment we use only the boundaries of these "intonation" units for the evaluation.

Fig. 3. Block diagram of the training procedure

C. Evaluation

To evaluate results, the obtained prosodic segmentation is compared with the original one. Two measures were used to present our results. The first one is accuracy, defined as:

$$Acc[\%] = \frac{C_{\text{correctly marked word boundaries}}}{C_{\text{all marked word boundaries}}} \cdot 100 \qquad (8)$$

which expresses how often a word boundary predicted by our algorithm was correct. The second one is effectiveness:

$$Eff[\%] = \frac{C_{\text{correctly marked word boundaries}}}{C_{\text{all word boundaries in reference transcription}}} \cdot 100 \qquad (9)$$

which says how many of all the word boundaries in the reference file were found. This latter measure is expected to be much less than 100%, since not all words in speech are emphasized. In the above formulae, C refers to the count of word boundaries.

For speech recognition tasks, accuracy is more critical than effectiveness, since we require that if a word boundary is predicted it should be accurate (at least around 80%). Of course, the higher the effectiveness, the more robust the system will be, but we cannot allow this at the expense of falling accuracy. In the rule-based case we accepted the prediction if stress was predicted on the first syllable of a word. In the data-driven approach the predicted segment boundary is regarded as correct if it does not deviate more than 100 ms from the real word boundary.

Fig. 2. Examples of the trained prosodic curve types. The time function and spectrogram of the speech signal are shown, with the prosodic labels and the F0 contour below, on Hungarian speech data.

III. RESULTS

In this section, we briefly present how we optimized the algorithms and intonation models to obtain the best segmentation accuracy. Segmentation accuracy means the accuracy of determination of word unit boundaries. Finally, the results obtained by rule-based and data-driven methods are compared. Comparison with earlier results in the field was not possible because, as far as we know, nobody has yet carried out such experiments for Finno-Ugrian languages.

A. Rule-based approach

When using the peak detection algorithm for stress localization, we investigated the performance as a function of the constant c and the sliding window length A. The performance of the system is characterized by the accuracy and effectiveness described in the Evaluation section. In Table I we show the results in six columns depending on the combination of prosodic parameters (only F0, only Energy, F0 + Energy, only ∆F0, only ∆Energy, ∆F0 + ∆Energy) on which stress detection was based. Accuracy may increase if the length A of the sliding window goes over 10 syllables, as can be seen in Table I. As expected, increasing the constant c results in a higher accuracy with a more considerable fall in effectiveness. The more accurate results were obtained by detecting stress on the basis of fundamental frequency and energy level changes (∆F0 + ∆Energy) from syllable to syllable. The overall best results can be achieved by calculating only with fundamental frequency differences (∆F0), and in this latter case effectiveness is more acceptable.

TABLE I
ACCURACY AND EFFECTIVENESS OF STRESS DETECTION BASED ON PEAK DETECTION ALGORITHMS FOR HUNGARIAN DATA IN FUNCTION OF SLIDING WINDOW WIDTH (A) AND OF C CONSTANT PRESENT IN THRESHOLD CALCULATION FOR SIX PROSODIC DATA PATTERNS

Accuracy/Effectiveness (%/%)
A     c      F0       E        F0 + E   ∆F0      ∆E       ∆F0 + ∆E
9     0.5    49/41    59/21    84/11    46/29    45/18    76/24
9     0.9    52/32    61/17    83/9     46/23    47/12    78/21
9     1.1    52/27    62/15    85/8     45/20    47/9     79/19
13    0.5    51/39    61/19    84/9     45/27    46/16    77/22
13    0.9    52/28    64/16    87/8     45/20    46/11    79/19
13    1.1    54/24    65/14    88/7     46/18    49/9     79/17
17    0.5    51/38    64/19    86/9     46/26    46/16    78/21
17    0.9    54/28    65/15    86/7     46/20    49/10    79/18
17    1.3    56/20    65/11    90/6     46/15    52/7     81/15

B. Data-driven method

When training the HMM prosodic recognizer we tried two acoustic preprocessing alternatives. In the first case, we used only F0 or only energy data with appended first and second order deltas. In the second case we used both F0 and energy as input data, each with first and second order deltas appended. In Table II the results are presented as a function of the acoustical preprocessing parameters. The best results were obtained when fundamental frequency-type parameters were used together with the energy-type parameters.

TABLE II
ACCURACY AND EFFECTIVENESS OF STRESS DETECTION WITH DIFFERENT ACOUSTICAL PREPROCESSING

Used parameter(s)             Language     Train set     Acc/Eff [%/%]
F0+dF0+d²F0                   Hungarian    14 persons    67.4/58.4
E+dE+d²E                      Hungarian    14 persons    67.7/66.6
F0+dF0+d²F0 + E+dE+d²E        Hungarian    14 persons    76.5/53.0

We have also tested several training strategies for the constructed HMMs. During the examination, a 14 speaker data set was used for training and 18 speakers for testing. First the size of the training material was changed. The training set was reduced to 4 persons and finally to one person, while the test set consisted of the same 18 speakers in all cases. If the HMMs were trained on few speakers, these speakers were selected carefully in order to ensure a relatively accurate training corpus. Results are shown in Table III. Surprisingly, there is no relevant difference in accuracy if fewer speakers are involved in the training corpus. Effectiveness, however, depends very much on the number of speakers, and to achieve effectiveness over 50% at least four speakers' data should be used for training. If only F0 or only energy patterns are used, effectiveness is higher (58.4% and 66.6%, see Table II), but accuracy drops by about 10 percentage points. The overall best result was 77.4% accuracy with 57.2% effectiveness, obtained with HMMs trained on 4 speakers' F0 and energy data.

TABLE III
ACCURACY AND EFFECTIVENESS OF BOUNDARY DETERMINATION OF WORD UNIT WITH HMM FOR DIFFERENT TRAINING SETTINGS

Used parameter(s)             Language     Train set     Acc/Eff [%/%]
F0+dF0+d²F0 + E+dE+d²E        Hungarian    1 person      77.1/49.6
F0+dF0+d²F0 + E+dE+d²E        Hungarian    4 persons     77.4/57.2
F0+dF0+d²F0 + E+dE+d²E        Hungarian    14 persons    76.5/53.0

The effect of the number of model states was examined too, and it was found that 11 states are the best, as can be seen in Fig. 4. For the further experiments the optimized parameters were applied: fundamental frequency and energy level values together, the speech examples of 4 persons, and models with 11 states.

Fig. 4. Boundary detection accuracy and effectiveness as a function of the number of states of HMMs (4 speakers, 8 Gauss)

The developed system can be convenient for automatic segmentation of word units. An example of how the developed segmentation techniques work on the word level is presented in Fig. 5. The time function of the speech signal and the F0 and energy contours are visible on the screen, while at the bottom the first row contains an expert-made hand segmentation taken as reference and the second row illustrates the output obtained by automatic prosodic segmentation.
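The accuracy and effectiveness measures of Equations 8 and 9, together with the 100 ms tolerance used for the data-driven approach, can be sketched as follows. This is a minimal illustration rather than the evaluation tool used in the experiments: the function name, the greedy one-to-one matching of predicted to reference boundaries, and the example boundary times are assumptions.

```python
def accuracy_effectiveness(predicted, reference, tol=0.1):
    """Return (Acc%, Eff%) for boundary times given in seconds.

    A predicted boundary counts as correct if a not yet matched reference
    boundary lies within `tol` seconds (100 ms by default).
    """
    unmatched = sorted(reference)
    correct = 0
    for p in sorted(predicted):
        # Closest reference boundary that has not been matched yet.
        best = min(unmatched, key=lambda r: abs(r - p), default=None)
        if best is not None and abs(best - p) <= tol:
            correct += 1
            unmatched.remove(best)
    acc = 100.0 * correct / len(predicted) if predicted else 0.0
    eff = 100.0 * correct / len(reference) if reference else 0.0
    return acc, eff


# Hypothetical boundary times in seconds, for illustration only.
reference = [0.50, 1.20, 1.90, 2.60]
predicted = [0.55, 1.95, 3.20]
acc, eff = accuracy_effectiveness(predicted, reference)
print(f"Acc = {acc:.1f}%, Eff = {eff:.1f}%")  # Acc = 66.7%, Eff = 50.0%
```

In this example two of the three predicted boundaries fall within 100 ms of a reference boundary, so accuracy is 2/3, while only two of the four reference boundaries are found, so effectiveness is 2/4. This mirrors the trade-off discussed above: a detector can be accurate on the boundaries it marks and still miss many of them.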

Fig. 5. HMM provided word level segmentation (bottom) versus expert-made hand segmentation (top) on a passage of 3 Hungarian sentences: [s] denotes word boundaries, [sil] silence

TABLE IV
ACCURACY AND EFFECTIVENESS OF BOUNDARY DETERMINATION OF WORD UNIT WITH HMM FOR FINNISH LANGUAGE COMPARED TO HUNGARIAN

Used parameter(s)             Language     Train set    Acc/Eff [%/%]
F0+dF0+d²F0 + E+dE+d²E        Finnish      4 persons    69.2/76.8
F0+dF0+d²F0 + E+dE+d²E        Hungarian    4 persons    77.4/57.2

TABLE V
RESULTS OF DIRECT CROSS-LINGUAL WORD BOUNDARY SEGMENTATION FOR FINNISH AND HUNGARIAN

Train set                 Test set     Acc/Eff [%/%]
Hungarian (4 speakers)    Hungarian    77.4/57.2
Hungarian (4 speakers)    Finnish      67.1/52.1
Finnish (4 speakers)      Finnish      69.2/76.8
Finnish (4 speakers)      Hungarian    70.7/52.3
FI+HU (4+4 speakers)      Hungarian    75.0/68.2
FI+HU (4+4 speakers)      Finnish      69.7/83.7

C. Data-driven method for Finnish language

Results obtained for Finnish with the same data-driven method as for Hungarian are presented in Table IV. We can note a considerable fall in accuracy, while effectiveness is high. The reason for this may be that Finnish speech is much slower than Hungarian, and Finnish words often contain long plosive sounds where the F0 and energy contours behave very similarly to the way they behave at real word boundaries. As a result, many more word boundaries can be found than in Hungarian, but some in-word secondary emphasis is also detected, which explains the lower accuracy.

Cross-lingual prosodic word boundary segmentation results for Finnish and Hungarian are presented in Table V. It can be seen that segmentation of Finnish speech with models trained on a Hungarian database gives nearly the same result as the Finnish models. On the other hand, segmenting Hungarian data with Finnish models yields 70.7% accuracy, which is somewhat better than the accuracy obtained on Finnish data with Hungarian models (67.1%). This is probably due to the sparseness of the Finnish data. Generally this means that a prosodic word boundary segmenter well trained on a Hungarian database can be applied to the automatic segmentation of unknown Finnish speech material and vice versa. Naturally, manual correction is necessary. Using a multi-lingual training strategy ensures a considerably more effective prosodic segmentation performance for Hungarian while preserving good accuracy.

IV. CONCLUSION

Our prosodic segmenter based on measuring fundamental frequency and energy level functions gave promising results. Word boundaries can be marked with acceptable accuracy, even if we are not able to find all of them. Two measures, accuracy and effectiveness, were used to describe the behaviour of this prosodic segmentation system.

The method we evaluated is easily adaptable to other fixed-stress languages. Word boundaries are found with acceptable accuracy and effectiveness for fixed-stress languages like Hungarian and Finnish. In the case of Finnish, we obtained results comparable with Hungarian, with lower accuracy and higher effectiveness, which may be the result of the difference between the two languages and also of data sparseness in the Finnish database.

Moreover, these results suggest that integration of a prosodic recognizer into a CSR system can help reduce the search space and thus improve speech recognition performance. This search space reduction is of great importance in the recognition of agglutinative languages such as Hungarian and Finnish, where the possible number of words may be more than one million. In this domain further investigations are needed.

The cross- and multilingual prosodic word boundary segmentation study of Finnish and Hungarian shows similarity between the two languages in word-level prosody: segmentation of Finnish speech with models trained on Hungarian data gives nearly the same result as models trained on Finnish data. When Hungarian speech is segmented with models trained on Finnish, accuracy is similar to that obtained by models trained on Hungarian data, and effectiveness is improved. These similarities are of course well known from the prosodic description of the two languages.

Summarizing the results of our experiments, it is clearly worth continuing research in this field. We believe that examination of other fixed-stress languages would be useful. We have presented one way of using prosodic information, but there may be several solutions. A practical result emerged from these experiments: this prosodic recognizer can be used as a word-level automatic segmenter for the Hungarian and Finnish languages.

ACKNOWLEDGEMENTS

We would like to thank Toomas Altosaar (Helsinki University of Technology) for his kind help and his contribution to the use of the Finnish Speech Database. The work has been supported by the Hungarian Research Foundations OTKA T 046487 ELE and IKTA 00056.

REFERENCES

[1] Di Cristo: Aspects phonétiques et phonologiques des éléments prosodiques. Modèles linguistiques Tome III, 2:24-83. (1981)
[2] Kassai, I.: Fonetika. Nemzeti tankönyvkiadó, Budapest, pp. 209-230. (1998)
[3] Langlais, P. and Méloni, H.: Integration of a prosodic component in an automatic speech recognition system. 3rd European Conference on Speech Communication and Technology. Berlin, pp. 2007-2010. (1993)
[4] Mandal, S., Datta, A. K. and Gupta, B.: Word boundary Detection of Continuous Speech Signal for Standard Colloquial Bengali (SCB) Using Suprasegmental Features. FRSM (2003)
[5] Peters, B.: Multiple cues for phonetic phrase boundaries in German spontaneous speech. Proceedings 15th ICPhS. Barcelona, CA: ICPhS, pp. 1795-1798. (2003)
[6] Roach, P.: BABEL: An Eastern European multi-language database. International Conference on Speech and Language Processing. Philadelphia. (1996)
[7] Rossi, M.: A model for predicting the prosody of spontaneous speech (PPSS model). Speech Communication, 13:87-107. (1993)
[8] Salomon, A., Espy-Wilson, C.Y. and Deshmukh, O.: Detection of speech landmarks. Use of temporal information. The Journal of the Acoustical Society of America 115:1296-1305. (2004)
[9] Yang, L.: Duration and pauses as phrase and boundary marking indicators in speech. Proceedings 15th ICPhS. Barcelona, CA: ICPhS, pp. 1791-1794. (2003)
[10] Young, S., Evermann, G., Kershaw, D., Moore, G., Odell, J., Ollason, D. et al.: The HTK Book (for version 3.3). Cambridge: Cambridge University, pp. 22-131. (2005)
[11] Venditti, J. and Hirschberg, J.: Intonation and discourse processing. Proceedings 15th ICPhS. Barcelona, CA: ICPhS, pp. 107-114. (2003)
[12] Vainio, M., Altosaar, T., Karjalainen, M., Aulanko, R., Werner, S.: Neural network models for Finnish prosody. Proceedings of ICPhS 1999. San Francisco, CA: ICPhS, pp. 2347-2350. (1999)