
Semi-Automatic Speech Segmentation for Macedonian Language

Ivan Kraljevski1, Slavcho Chungurski1, Dragan Mihajlov2, Sime Arsenovski1
1 Faculty of ICT, FON First Private University, bul. Vojvodina bb, 1000 Skopje, Macedonia
2 Faculty of Electrical Engineering and Information Technologies, Ss. Cyril and Methodius University, 1000 Skopje, Macedonia
{ivankraljevski, simearsenovski, chungurski}@fon.edu.mk; [email protected]

In the creation of speech corpora intended for the development and testing of systems for automatic speech recognition or speech synthesis, the segmentation and labeling of recorded speech utterances is a very important phase. Segmentation is performed by locating boundaries in the speech signal between distinctive spectral and auditory features and dividing it into segments that correspond to certain phonetic categories. Labeling may include a wider description of the acoustic-auditory features of the speech sequence, such as prosody, accent and pitch, or of phonetic units such as phonemes or diphones.

There are several ways to segment and label speech sequences. The most accurate method, and at the same time the most laborious, is manual segmentation and labeling by human experts, which is not practical for building large speech corpora. The marked boundaries are the result of the expert's subjective decision, made by examining the spectrogram, the speech waveform and the audio reproduction of the sequence. It is very important that the process for manual segmentation and labeling is well defined in order to achieve consistency among all labeled sentences in the corpus under development. After the boundaries are located, the segmented regions are labeled using a textual code, in the case of phonetic units with the corresponding IPA symbols.

Numerous methods and algorithms for automatic or semi-automatic segmentation and labeling have been developed [1]. However, the validation and correction of the results is still performed by human experts, and the achieved accuracy is lower than that of manual segmentation but satisfactory for specific applications and research. Systems for fully automatic segmentation are based on acoustic alignment to a synthetic reference utterance, on trained ASR systems with HMM models, or on a combination of acoustic-articulatory features in phoneme categories (elitist approach).

Automatic segmentation is significant for creating speech corpora for the development of TTS systems, where a large quantity of segmented and labeled speech sentences is required. Currently there are activities for the creation and development of speech corpora in the Macedonian language suited for TTS, where the recorded speech sentences have to be segmented and labeled. There is no functional ASR system for Macedonian, and no existing speech corpus that could be used to train such a system and use it to segment the speech sequences. This paper deals with this problem, and the applicability of a semi-automatic system for segmentation of speech sentences in Macedonian is estimated and justified.

The system uses the spectral transition measure function to detect the boundaries between phonemes, which usually lie at the positions of the function's local maximums [5]. An array of vectors of Mel-cepstral coefficients (MFCC) was used as the spectral representation of the speech sequence. The MFCC were calculated over frames with a Hamming window of 25 ms duration and an offset of ¼ of a frame. The MFCC vector consists of 24 coefficients, including the zeroth-order coefficient and the log energy, as well as the second-order derivatives. For phoneme boundary detection the method uses the spectral transition measure, given by:

$$\mathrm{STM}(m) = \frac{1}{D} \sum_{i=1}^{D} a_i^2(m) \qquad (1)$$

where $D$ is the vector dimension, and $a_i(m)$ is the spectral change coefficient defined as:

$$a_i(m) = \frac{\sum_{n=-I}^{I} \mathrm{MFCC}_i(n+m) \cdot n}{\sum_{n=-I}^{I} n^2} \qquad (2)$$

where n is the frame offset relative to the current frame m, and I is the number of frames taken to the left and to the right of the current frame. Higher values of I may cause some of the spectral transitions to go undetected, whereas smaller values lead to the detection of additional boundary positions.
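Equations (1) and (2) can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' implementation: the function name `spectral_transition_measure`, the `(num_frames, D)` array layout and the handling of the first and last I frames (repeating the edge frame) are assumptions that the text does not specify.

```python
import numpy as np

def spectral_transition_measure(mfcc: np.ndarray, I: int = 4) -> np.ndarray:
    """Compute STM(m) for every frame m.

    mfcc: array of shape (num_frames, D), one MFCC vector per frame.
    I:    number of frames used on each side of the current frame.
    """
    num_frames, _ = mfcc.shape
    n = np.arange(-I, I + 1)            # frame offsets -I .. I
    denom = float(np.sum(n ** 2))       # denominator of eq. (2)

    # Assumption: edge frames are handled by repeating the first/last frame.
    padded = np.pad(mfcc, ((I, I), (0, 0)), mode="edge")

    stm = np.empty(num_frames)
    for m in range(num_frames):
        window = padded[m : m + 2 * I + 1]                 # frames m-I .. m+I
        a = (n[:, None] * window).sum(axis=0) / denom      # eq. (2), one a_i per coefficient
        stm[m] = np.mean(a ** 2)                           # eq. (1)
    return stm
```

Each a_i(m) is the slope of a least-squares line fitted to coefficient i over the 2I+1 frames around m, so a sequence whose coefficients change linearly gives a constant STM away from the edges, and a stationary sequence gives zero.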

Posters Abstracts of the ITI 2008 30th Int. Conf. on Information Technology Interfaces, June 23-26, 2008, Cavtat, Croatia

Fig. 1 shows an example of automatic phoneme boundary detection (automatic: full line, manual: dashed line) in the speech sequence “slušalka” (IPA: [’sluʃalka]).

Several tests were made on a test corpus of 100 manually segmented and labeled speech sentences of isolated words, chosen from a recorded set of the Macedonian-English dictionary [2]. The sequences were coded in MP3 format at a 16 kHz sampling rate and 16-bit resolution, with a 64 kbps bit rate that guarantees good perceptual quality for the given task [3]. Manually placed boundaries were matched with the corresponding automatically detected positions. Some of the manually placed boundaries are missed by the automatic detection (deleted), and some additional boundaries are inserted. The results are shown in Table 1 as the percentage of deleted and inserted boundaries, depending on the value I of the STM function and the threshold value t for the second criterion.
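The matching of manual and automatic boundaries can be illustrated with a short sketch. The text does not state how close two boundaries must be to count as a match, nor the normalization of the percentages, so the 20 ms tolerance, the greedy nearest-neighbor matching and the normalization by the number of manual boundaries are all assumptions, and `score_boundaries` is a hypothetical name.

```python
def score_boundaries(manual, automatic, tolerance=0.02):
    """Return (deleted_pct, inserted_pct) for one utterance.

    manual, automatic: boundary times in seconds.
    tolerance: assumed maximum distance (seconds) for two boundaries to match.
    Percentages are relative to the number of manual boundaries (assumption).
    """
    manual = sorted(manual)
    automatic = sorted(automatic)
    if not manual:
        raise ValueError("at least one manual boundary is required")

    used = [False] * len(automatic)
    matched = 0
    for ref in manual:
        # Pick the closest automatic boundary that is still unused
        # and lies within the tolerance.
        best, best_dist = None, tolerance
        for j, hyp in enumerate(automatic):
            dist = abs(hyp - ref)
            if not used[j] and dist <= best_dist:
                best, best_dist = j, dist
        if best is not None:
            used[best] = True
            matched += 1

    deleted = len(manual) - matched       # manual boundaries with no match
    inserted = len(automatic) - matched   # automatic boundaries with no match
    return 100.0 * deleted / len(manual), 100.0 * inserted / len(manual)
```

With manual boundaries at 0.10, 0.25, 0.40 and 0.60 s and automatic boundaries at 0.11, 0.18, 0.41, 0.59 and 0.70 s, three pairs match, one manual boundary is deleted and two automatic boundaries are inserted.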

Figure 1. Boundary detection in [’sluʃalka]

Table 1. Percentage of inserted and deleted phoneme boundaries

It can be noticed that the spectral transition function has numerous local maximums with different amplitudes, which correspond to the change in intensity of the signal energy. The most intense maximums occur at transitions between phonemes of different types: consonants (fricative, affricate, plosive) and vowels. For example, between “š” (fricative) and “a” (vowel), as well as between “l” (fricative) and “k” (plosive). Increasing the frame size of the MFCC vectors results in a situation where some of the boundaries could be overlooked, whereas decreasing the frame step provides better accuracy in locating boundary positions.

In order to avoid false boundaries that correspond to local maximums with relatively low amplitudes, additional processing is performed. First, all the local maximums that are smaller than a particular threshold are discarded. The threshold is estimated as a fraction of the maximum value of the array of local maximums after some of them have been excluded. From Fig. 1-b it can be noticed that some maximum values are much larger than the rest; they are rejected from the threshold estimation using the standard Student's t-test with a confidence interval of 99.99% [4].

Additionally, more of the false boundary positions can be erased using a criterion that removes a local maximum when the difference between it and the neighboring local minima is under a certain threshold, expressed as a percentage of the maximum value (t). These peaks belong to flat regions of the spectral transition function with no significant spectral changes. Varying these two parameters gives the opportunity to tweak the sensitivity of the boundary detection method according to the quality of the recorded speech sequences.
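The two selection criteria can be sketched as follows. This is a simplified illustration under stated assumptions, not the authors' procedure: the outlier rejection by t-test is omitted (the threshold is taken from the plain maximum), `peak_fraction` is a hypothetical parameter name for the fraction in the first criterion, and the second criterion compares each peak with the higher of its two neighboring local minima, which is only one possible reading of the text.

```python
import numpy as np

def select_boundaries(stm, peak_fraction=0.1, t=0.2):
    """Return indices of STM local maximums kept as boundary candidates."""
    stm = np.asarray(stm, dtype=float)

    # Local maximums: higher than the left neighbor, not lower than the right.
    idx = np.arange(1, len(stm) - 1)
    peaks = idx[(stm[idx] > stm[idx - 1]) & (stm[idx] >= stm[idx + 1])]
    if peaks.size == 0:
        return peaks

    # Criterion 1: discard peaks below a fraction of the largest peak.
    # Simplification: no outlier rejection before taking the maximum.
    threshold = peak_fraction * stm[peaks].max()
    peaks = peaks[stm[peaks] >= threshold]

    # Criterion 2: discard peaks that rise less than t * peak value
    # above the higher of the two neighboring local minima (assumption).
    kept = []
    for p in peaks:
        left = p
        while left > 0 and stm[left - 1] <= stm[left]:
            left -= 1                      # walk downhill to the left minimum
        right = p
        while right < len(stm) - 1 and stm[right + 1] <= stm[right]:
            right += 1                     # walk downhill to the right minimum
        depth = stm[p] - max(stm[left], stm[right])
        if depth >= t * stm[p]:
            kept.append(p)
    return np.array(kept, dtype=int)
```

Raising `t` removes more shallow peaks, which in this sketch mirrors the trade-off described in the text: fewer inserted boundaries at the cost of more deleted ones.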

          Deleted (%)                              Inserted (%)
I    t=0.05  t=0.1  t=0.2  t=0.4  t=0.6      t=0.05  t=0.1  t=0.2  t=0.4  t=0.6
3    2.15    2.33   2.68   3.34   5.00       72.9    71.7   69.2   66.0   61.3
4    1.79    1.79   2.33   3.22   4.65       40.6    39.5   37.5   35.2   33.8
5    2.33    2.68   3.22   4.29   5.91       25.7    25.2   23.6   21.8   19.8

The achieved results justify the use of such a method for automatic phoneme boundary detection as part of a semi-automatic labeling system. The operator has to match the text transcription to the corresponding segments. Moreover, segments with the same neighboring symbols can be merged together, which eliminates most of the falsely inserted boundaries. Using this kind of semi-automatic system decreases the required time compared with laborious manual segmentation and labeling.

Keywords. Speech segmentation and labeling, spectral transition measure, phoneme boundaries

[1] M. J. Makashay, C. W. Wightman, A. K. Syrdal, A. Conkie, “Perceptual evaluation of automatic segmentation in Text-To-Speech Synthesis”, ICSLP 2000, Beijing, China, October 2000
[2] Macedonian-English Dictionary, Zona, 2004
[3] P. Sirum Ng, I. Sanches, “The Influence of Audio Compression on Speech Recognition Systems”, SPECOM’2004, St. Petersburg, Russia, 20-22. XI, 2004
[4] Risto Malceski, Statistika za biznis, Alfa 94, 2006
[5] Sorin Dusan and Lawrence Rabiner, “On the Relation between Maximum Spectral Transition Positions and Phone Boundaries”, ICSLP 2006, 17-21.09, Pittsburgh, Pennsylvania
