Dept. for Speech, Music and Hearing
Quarterly Progress and Status Report

Automatic time alignment of speech with a phonetic transcription
Blomberg, M. and Elenius, K. O. E.

journal: STL-QPSR, volume: 26, number: 1, year: 1985, pages: 037-045

http://www.speech.kth.se/qpsr
AUTOMATIC TIME ALIGNMENT OF SPEECH WITH A PHONETIC TRANSCRIPTION*
Mats Blomberg and Kjell Elenius
Abstract
This paper describes a system for time aligning a phonetic transcription to a speech signal. The phonetic segments are described by broad acoustic parameters and a dynamic programming algorithm is used for optimizing the alignment of segments to the speech signal. In the present study, only two parameters have been used: the intensity of the speech signal below 400 Hz and the intensity above 500 Hz. It is shown that this very coarse information is enough to give a correct segmentation in most cases. The signals have been differentiated in the time domain. This reduces the effects of using different speakers, of varying signal levels, and of altering filter characteristics of the speech channel. A text-to-speech system is used to transcribe an orthographic representation of the utterances to phonetic segments. A small experiment consisted of 30 sentences spoken by one male speaker. The average sentence length was 8 words. The rule-based transcription was correct for 97% of the segments. The boundaries were judged to be correct within 10 ms, the sampling interval, for 87% of the segments.
Introduction
The problem of automatically time aligning a speech wave to a known phonetic transcription has attracted much attention in recent years. Automatic alignment would facilitate or replace the tedious manual labeling and would make it more consistent. The development of speaker-independent large-vocabulary speech recognition systems requires very large amounts of speech data to get quantitative and qualitative measures of the influences of, e.g., coarticulation, reduction, and stress patterns on the acoustic speech signal. Several hours of speech will be necessary to cover a sufficient amount of phonetic variation and to get reliable statistics of the speech data. The data may also be used for improving speech synthesis rules. A detailed study of alignment errors will reveal difficult phonetic contexts and where the acoustic-phonetic rules should be improved to better predict the data. A possible application can also be found in foreign language education, where pronunciation deviations from a teacher's voice could be automatically interpreted in phonetic terms and fed back to the pupil.
* Also presented at the French-Swedish Seminar, Grenoble, April 22-24, 1985.
The alignment procedure itself is an essential part of the verification component in several phonetic speech recognition systems (Blomberg & Elenius, 1974, 1981; Lowerre & Reddy, 1980). It serves the same function as the nonlinear time warping algorithm in standard pattern matching word recognition systems. The alignment process can be performed in different ways and at varying levels of automation. Some of them will be described below, starting with the least automated and gradually increasing the level of automation.

The first level would be the completely manual way of inspecting various representations of the speech wave and entering boundaries by hand. This method requires a skilled phonetician and much time. One second of speech may require several minutes of hand labeling (Leung & Zue, 1984).

A higher level has been used by Bridle & Chamberlain (1983). They start by labeling a recorded utterance by hand. By means of dynamic programming, new repetitions of the same utterance are time aligned to the first recording. This is essentially the same procedure as in conventional word recognition systems, where an unknown word is compared to the reference templates of the vocabulary. A limitation of this method is that all the utterances should be pronounced in the same way. The sensitivity to new speakers and varying speaking modes is also well-known from the word recognition systems based on this principle. A possible advantage is that time alignment is done for each time sample and not only for the phoneme boundaries. This could be of interest for inter-phoneme acoustic analysis.

Aligning the utterance to the phonetic symbols directly, without using any reference utterance, is another way of increasing the level of automation. This can be done at a phonetic level, requiring a phoneme recognition step (Wagner, 1981; Leung & Zue, 1984). It can also be performed at a parametric level. In that case the transcription must be transformed to a sequence of acoustic events, which is matched against the utterance. The phoneme-to-parameter part of a speech synthesis system could be used. This method has been used by Lennig (1983), le Saint-Milon & Stella (1983), and by Bridle & Chamberlain (1983) to improve rules for speech synthesis. It still requires a phonetician for transcribing the spoken utterance, which keeps phonetic labeling of a speech data base a very time consuming process.

A further step to diminish the manual interaction is to replace the transcription process with the text-to-phonemes part of a synthesis program. As in the previous method, the alignment can be done at a phonetic or parametric level. To fully automate the transcription process, the synthesis system must generate all possible ways of pronouncing the given text. The alignment program must then be able to handle the different possibilities and select the correct transcription (Neel, Eskenazi, & Mariani, 1983). This method is much more complex than the previous one. In fact, the optional pronunciations can be represented in a finite state network. This is very similar to the method used in continuous-speech recognition systems such as HARPY (Lowerre & Reddy, 1980).

The ultimate alignment process would be completely automatic phoneme recognition of unknown speech. This would require a speaker-independent, natural language recognition system, which remains a distant goal for researchers in the recognition field and will certainly remain so for several years.
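Such a network of optional pronunciations can be sketched in a few lines. The example below is purely illustrative: the word, the arcs, and the epsilon-arc convention are our assumptions, not taken from any of the cited systems. Each arc carries a phoneme, and enumerating the paths yields every candidate transcription the aligner must consider.

```python
# Hypothetical sketch of a pronunciation network: a DAG of phoneme arcs.
# An arc labeled None is an epsilon transition (an optional segment).

def enumerate_paths(arcs, start, final):
    """Depth-first enumeration of all phoneme strings from start to final."""
    if start == final:
        yield []
        return
    for (src, phone, dst) in arcs:
        if src == start:
            for tail in enumerate_paths(arcs, dst, final):
                yield ([phone] if phone is not None else []) + tail

# Invented example: a word pronounced either fully /v i d/ or
# reduced /v i/ (final /d/ dropped in relaxed speech).
arcs = [
    (0, "v", 1),
    (1, "i", 2),
    (2, "d", 3),   # full form
    (2, None, 3),  # epsilon arc: reduced form drops the /d/
]

for path in enumerate_paths(arcs, 0, 3):
    print(" ".join(path))
```

With larger networks a real system would of course not enumerate paths explicitly but let the dynamic programming search explore the network directly.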
Fig. 1. Block diagram of the phonetic alignment system. (Text is passed through text-phoneme conversion and manual editing to compute reference data; the speech signal is passed through analysis and preliminary segmentation; a dynamic programming stage aligns the two streams and outputs phoneme boundaries and a distance value.)
System description
A block diagram of the system used in this report is shown in Fig. 1. The main principle is a continuation of earlier work on word recognition systems (Blomberg & Elenius, 1978, 1980). The method can be placed between the third and fourth levels described above. The main features of the alignment process are described as follows.

Phonetic transcriptions are generated from text using a module of the text-to-speech synthesis system developed by Carlson & Granström (1982). Only one transcription is generated for each sentence. It must be modified by hand to fit the actual pronunciation. General acoustic descriptions of the phonemes are retrieved from a lexicon. They are transformed to correspond to the parameters computed from the speech wave by the speech analysis block. The speech is recorded using a sampling frequency of 16 kHz and the frame rate of the parameters is 100 Hz. At the moment, only two parameters are used: intensities below 400 Hz and above 500 Hz. Although this is too crude a spectral description for recognition purposes, it is quite usable for the alignment task. A more detailed acoustic description will be used in the future, but it was considered of interest to test the alignment accuracy on these simple parameters. Another reason to use broad band parameters is that they are more robust and less speaker-dependent than more specific ones. To lower the sensitivity to signal level, different microphones, and transmission channels, the parameters are time differentiated. This has previously been used to lower the speaker dependency of the segmentation algorithm (Höhne, 1983; Blomberg & Elenius, 1980, 1981). The low pass 400 Hz intensity is chosen because of its ability to discriminate between voiced and unvoiced sounds, while the high pass 500 Hz band helps the distinction between vowels and voiced consonants as well as between silence and weak fricatives. Difficulties can be expected for phoneme sequences within any of these categories, while transitions between them will be detected more accurately.

Mainly for processing speed reasons, a step of preliminary segmentation is performed. Preliminary segment boundaries are put where changes in either or both of the intensity bands are detected. The sensitivity of this process must be high enough not to lose any correct boundary, since the phoneme boundaries are chosen from these preliminary boundaries. No classification is done of the preliminary segments, since they only serve for selecting boundary candidates among the time samples. A similar method is used by Neel, Eskenazi, & Mariani (1983), but instead of boundaries, regions of maximum stability are used.

The alignment is performed by a dynamic programming algorithm at the parametric level. This means that the acoustic distances may be measured more accurately compared to doing a string comparison at the phonetic level. Time constraints on the time warp are given by durational limitations of each phoneme.
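The parameter handling described above can be illustrated by the following sketch, which time-differentiates two band intensities and marks preliminary boundary candidates wherever either band changes rapidly. The threshold and the toy intensity tracks are invented for demonstration; the paper does not specify these values.

```python
# Illustrative sketch (not the authors' code): differentiate two band
# intensities in time and propose boundary candidates where either band
# changes faster than a threshold.

FRAME_MS = 10          # 100 Hz frame rate, as in the paper
THRESHOLD_DB = 3.0     # hypothetical per-frame change threshold

def differentiate(track):
    """First time difference; removes constant level offsets."""
    return [b - a for a, b in zip(track, track[1:])]

def preliminary_boundaries(low_band, high_band, threshold=THRESHOLD_DB):
    """Frame indices where either differentiated band exceeds threshold."""
    d_low, d_high = differentiate(low_band), differentiate(high_band)
    return [i + 1 for i, (dl, dh) in enumerate(zip(d_low, d_high))
            if abs(dl) > threshold or abs(dh) > threshold]

# Toy intensity tracks in dB: silence -> vowel -> voiced consonant
low  = [10, 10, 40, 42, 41, 15, 14, 14]
high = [10, 11, 35, 36, 35, 30, 31, 30]
print(preliminary_boundaries(low, high))  # candidate frames
```

Because only differences are used, adding a constant offset to either track (e.g., a louder recording) leaves the candidate boundaries unchanged, which is the robustness property the differentiation is meant to provide.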
A beam search algorithm prunes paths with an accumulated distance sufficiently higher than that of the best path. The processing speed is about equal to real time on a 16-bit minicomputer.

An example of the Swedish utterance "Ta tjuren vid hornen" with spectrogram, parameters, and alignment results is shown in Fig. 2. In this example one alignment error was made. The boundary between /ʉ:/ and /r/ in "tjuren" was confused with the amplitude drop due to lip rounding inside the vowel. The dashed line shows the boundary prior to correction.
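The duration-constrained dynamic programming search described above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: it assumes a squared Euclidean frame distance and omits the beam pruning and the restriction to preliminary boundary candidates for clarity. All data are invented.

```python
# Hedged sketch: align a phoneme sequence (each phoneme has a reference
# parameter vector and min/max duration in frames) to observed frames
# by dynamic programming, minimizing the accumulated distance.

INF = float("inf")

def dist(frame, target):
    """Squared Euclidean distance between an observed frame and a
    phoneme's reference parameter vector."""
    return sum((f - t) ** 2 for f, t in zip(frame, target))

def align(frames, phones):
    """phones: list of (name, target_vector, min_dur, max_dur).
    Returns (total_distance, end frame of each phoneme)."""
    n, m = len(frames), len(phones)
    cost = [[INF] * (n + 1) for _ in range(m + 1)]
    back = [[None] * (n + 1) for _ in range(m + 1)]
    cost[0][0] = 0.0
    for p, (name, target, dmin, dmax) in enumerate(phones, start=1):
        for t in range(1, n + 1):
            for d in range(dmin, min(dmax, t) + 1):  # duration constraint
                if cost[p - 1][t - d] == INF:
                    continue
                c = cost[p - 1][t - d] + sum(
                    dist(frames[i], target) for i in range(t - d, t))
                if c < cost[p][t]:
                    cost[p][t], back[p][t] = c, t - d
    # trace back the end frame of each phoneme
    ends, t = [], n
    for p in range(m, 0, -1):
        ends.append(t)
        t = back[p][t]
    return cost[m][n], list(reversed(ends))

# Toy two-band frames: a fricative-like stretch followed by a vowel-like one
frames = [(0, 5), (0, 5), (8, 2), (8, 2), (8, 2)]
phones = [("s", (0, 5), 1, 3), ("a", (8, 2), 1, 4)]
total, ends = align(frames, phones)
print(total, ends)
```

A beam-pruned version would additionally discard any cell whose cost exceeds the current best cost at that frame by more than a fixed margin, which is what keeps the real system near real time.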
Experiment
A preliminary test has been performed. 30 sentences read by a male speaker were recorded. The average sentence length was 8 words and the average number of phonetic events was 44 per sentence. Manual correction of the rule-generated phonetic transcriptions had to be made for 3% of the phonemes. Almost all of these were due to a more relaxed and reduced way of speaking than what was generated by the text-to-phonemes module.
Fig. 2. An example of a test utterance. Spectrogram, parameters, phonetic transcription, and segmentation result are displayed for the sentence "Ta tjuren vid hornen" /ta: ctt:ren vI hu:gen/. The dashed line shows the /ʉ:r/ boundary before manual correction.

Fig. 3. Cumulative distribution of alignment errors (percent of boundaries within a given time deviation, 0-200 ms).
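A curve like the cumulative distribution of Fig. 3 can be obtained by counting, for each tolerance, the share of boundaries whose deviation from the hand-labeled position stays within it. The sketch below uses invented error values, not the paper's data.

```python
# Sketch: cumulative distribution of boundary alignment errors.

def cumulative_distribution(errors_ms, tolerances_ms):
    """Percent of boundaries with |error| <= each tolerance."""
    n = len(errors_ms)
    return [100.0 * sum(abs(e) <= tol for e in errors_ms) / n
            for tol in tolerances_ms]

errors = [0, 10, -10, 20, 0, 40, -30, 10]   # hypothetical errors in ms
print(cumulative_distribution(errors, [0, 10, 20, 50]))
```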
References
Andreewsky, A., Desi, M., Fluhr, C., and Poirier, F. (1983): "Une méthode de mise en correspondance d'une chaîne phonétique et de sa forme acoustique", 11th Int.Congr.Acoust., Toulouse Satellite Symposium on "The Processes for Phonetic Coding and Decoding of Speech", Toulouse, 15-16 July, 1983.
Blomberg, M. and Elenius, K. (1974): "Två försök med automatisk taligenkänning", Technical Report, Dept. of Speech Communication, KTH, Stockholm.

Blomberg, M. and Elenius, K. (1978): "A phonetically based isolated word recognition system", J.Acoust.Soc.Am. 64, Suppl. No. 1, p. S181.

Blomberg, M. and Elenius, K. (1980): "Automatic segmentation of speech controlled by a quasi-phonetic transcription", 10th Int.Congr.Acoust., Sydney, 9-16 July, 1980.

Blomberg, M. and Elenius, K. (1981): "Försök med ett segmentbaserat taligenkänningssystem", TRITA-TLF-81-4, Dept. of Speech Communication & Music Acoustics, KTH, Stockholm.
Bridle, J.S. and Chamberlain, R.M. (1983): "Automatic labeling of speech using synthesis-by-rule and non-linear time-alignment", 11th Int.Congr.Acoust., Toulouse Satellite Symposium on "The Processes for Phonetic Coding and Decoding of Speech", Toulouse, 15-16 July, 1983.

Carlson, R., Granström, B., and Hunnicutt, S. (1982): "A multi-language text-to-speech module", Conference Record, 1982 IEEE-ICASSP, Paris, France.

Chamberlain, R.M. and Bridle, J.S. (1983): "ZIP: A dynamic programming algorithm for time-aligning two indefinitely long sequences", Proc. IEEE Int.Conf. on Acoustics, Speech and Signal Processing, IEEE Catalog No. 83CH1841-6, 2, 17.11, Boston.

Höhne, H.D., Coker, C., Levinson, S.E., and Rabiner, L.R. (1983): "On temporal alignment of sentences of natural and synthetic speech", IEEE Trans. on Acoustics, Speech and Signal Processing, ASSP-31, No. 4, August.

Lennig, M. (1983): "Automatic alignment of natural speech with a corresponding transcription", 11th Int.Congr.Acoust., Toulouse Satellite Symposium on "The Processes for Phonetic Coding and Decoding of Speech", Toulouse, 15-16 July, 1983.

Leung, H.C. and Zue, V.W. (1984): "A procedure for automatic alignment of phonetic transcriptions with continuous speech", Proc. IEEE Int.Conf. on Acoustics, Speech and Signal Processing, IEEE Catalog No. 84CH1945-5, 1, 2.7.1, San Diego.

Lowerre, B. and Reddy, R. (1980): "The HARPY speech understanding system", in W.A. Lea (ed.): Trends in Speech Recognition, Prentice Hall, Englewood Cliffs, New Jersey, USA.

Neel, F., Eskenazi, M., and Mariani, J.J. (1983): "Cadrage automatique pour la constitution de dictionnaire d'entités phonétiques", 11th Int.Congr.Acoust., Toulouse Satellite Symposium on "The Processes for Phonetic Coding and Decoding of Speech", Toulouse, 15-16 July, 1983.
le Saint-Milon, J. and Stella, M. (1983): "Extraction automatique de diphones par programmation dynamique pour des besoins en synthèse de la parole", 11th Int.Congr.Acoust., Toulouse Satellite Symposium on "The Processes for Phonetic Coding and Decoding of Speech", Toulouse, 15-16 July, 1983.

Wagner, M. (1981): "Automatic labeling of continuous speech with a given phonetic transcription using dynamic programming algorithms", Proc. IEEE Int.Conf. on Acoustics, Speech and Signal Processing, IEEE Catalog No. 81CH1610-5, 1, pp. 1156-1159, Boston, USA.