S10.12
Segmentation and Broad Classification of Continuous Speech¹
Ronald A. Cole and Lily Hou
Computer Science Department, Carnegie-Mellon University, Pittsburgh, PA 15213
Spectral Change Parameters. The three remaining parameters are measures of spectral change derived from a 256 point DFT computed on a 10 msec window every 3 msec. The spectral change parameters are: (a) "fastchange0-8," the Euclidean distance of spectra (0-8000 Hz) 9 msec apart, (b) "slowchange0-8," the Euclidean distance of the average spectra (0-8000 Hz) in adjacent 30 msec intervals of speech, and (c) "slowchange2-3," the Euclidean distance of average spectra in adjacent 30 msec intervals of speech between 2000 and 3000 Hz. The spectral change parameters were motivated by research on segmentation performed
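The three spectral change measures can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes the magnitude spectra have already been computed (one list of bin values per 3 msec frame), and it assumes a 16 kHz sampling rate so that a 256 point DFT spans 0-8000 Hz. The function names are hypothetical.

```python
import math

FRAME_STEP_MS = 3  # one spectrum every 3 msec, as stated in the text


def band_bins(lo_hz, hi_hz, n_fft=256, sample_rate=16000):
    """DFT bin indices whose centre frequency lies in [lo_hz, hi_hz].

    sample_rate=16000 is an assumption; the paper only states the 0-8000 Hz range.
    """
    spacing = sample_rate / n_fft
    return [k for k in range(n_fft // 2 + 1) if lo_hz <= k * spacing <= hi_hz]


def fastchange(spectra, lag_ms=9):
    """Euclidean distance between each spectrum and the one lag_ms earlier."""
    lag = lag_ms // FRAME_STEP_MS
    return [math.dist(spectra[t], spectra[t - lag]) for t in range(lag, len(spectra))]


def mean_spectrum(frames, bins):
    """Average the selected bins over a list of frames."""
    return [sum(f[k] for f in frames) / len(frames) for k in bins]


def slowchange(spectra, bins, interval_ms=30):
    """Distance between average spectra of the intervals before and after each frame."""
    n = interval_ms // FRAME_STEP_MS
    out = []
    for t in range(n, len(spectra) - n + 1):
        before = mean_spectrum(spectra[t - n:t], bins)
        after = mean_spectrum(spectra[t:t + n], bins)
        out.append(math.dist(before, after))
    return out
```

With these helpers, slowchange0-8 corresponds to `slowchange(spectra, band_bins(0, 8000))` and slowchange2-3 to `slowchange(spectra, band_bins(2000, 3000))`.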
Abstract
This report describes an algorithm that performs speaker-dependent segmentation and broad classification of continuous speech. The algorithm is implemented as a set of knowledge sources that apply rules to speech parameters to locate segment boundaries and assign broad category labels to the resulting segments. The output of the algorithm is a network of segments and broad category labels. We describe the structure of the algorithm, the manner in which acoustic phonetic knowledge is represented, and the performance of the algorithm on 200 utterances.
Segmentation and Classification Introduction
Locating Sonorant Intervals. Sonorant intervals include at least one vowel but may also include weak sonorants ([l], [r], [w], [y]), weak intervocalic voiced fricatives ([dh], [v]), nasals ([m], [n], [ng]) and flaps ([dx], [nx]). A sonorant interval may be as short as a single glottal pulse or as long as the entire utterance.
RBSC is a Rule Based Segmentation and Classification algorithm. RBSC applies empirically derived rules to speech parameters to produce a network of segment paths through an utterance. Each arc in the network is assigned one of six broad labels: closure, stop, obstruent, vowel, weak sonorant and dip. Figure 1 shows a network of segment paths (directly for the utterance "What about a tea
peak in the ptp0-2500 if the peak belonged to a
RBSC uses a top down parsing strategy; the rules produce a refinement of the preceding segmentation. Speech is first parsed into sonorant versus non-sonorant intervals. Sonorant intervals are then parsed to locate nasals, liquids, glides, flaps and intervocalic voiced stops. Non-sonorant intervals are processed to locate closures, stops and fricatives.
f the peak interval
ro crossing count
Representations context sensitive rules. For example, context sensitive rules were used to detect devoiced vowels and to discriminate plosive peaks from sonorant peaks. For example, a peak was assumed to belong to a plosive if a bigger peak was found within the next 120 msec (suggesting a interval between the two zero crossings.
Segmentation and classification are based on rules applied to parameters computed every 3 msec. The parameters are shown in Figure 1, along with the spectrogram and waveform of the utterance. In addition to the parameters, information about periodicity was provided by a pitch period estimation algorithm.

Waveform Parameters. The top four parameters in Figure 1 are: (a) the number of zero crossings of the waveform in a 10 msec window between 0 and 8000 Hz, (b) "ptp0-700," the peak to peak amplitude in a 10 msec window between 0 and 700 Hz, (c) "ptp0-2500," the peak to peak amplitude in a 10 msec window between 0 and 2500 Hz, and (d) "ptp1200-8000," the peak to peak amplitude in a 10 msec window between 1200 and 8000 Hz.
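As a rough illustration of the waveform parameters, the sketch below counts zero crossings and measures peak to peak amplitude in a 10 msec window advanced every 3 msec. It assumes a 16 kHz sampling rate and assumes the samples passed in have already been band-limited to the range of interest (for example 0-2500 Hz for ptp0-2500); the filtering step itself is not shown, and all names are hypothetical.

```python
def zero_crossings(window):
    """Count sign changes between consecutive samples (zero counts as non-negative)."""
    return sum(1 for a, b in zip(window, window[1:]) if (a < 0) != (b < 0))


def peak_to_peak(window):
    """Maximum minus minimum sample value in the window."""
    return max(window) - min(window)


def frame_parameters(samples, sample_rate=16000, window_ms=10, step_ms=3):
    """Return one (zero_crossings, peak_to_peak) pair per 3 msec frame.

    sample_rate=16000 is an assumption, not a value given in the text.
    """
    win = sample_rate * window_ms // 1000
    step = sample_rate * step_ms // 1000
    frames = []
    for start in range(0, len(samples) - win + 1, step):
        w = samples[start:start + win]
        frames.append((zero_crossings(w), peak_to_peak(w)))
    return frames
```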
Sonorant Onset. After the peak was identified, the onset and offset were found. A preliminary onset was located based on ptp0-2500 amplitude relative to the peak. Using this point as an anchor, context sensitive rules were used to find the sonorant onset. These rules examined the surrounding acoustic context for evidence of a closure, obstruent or weak nasal before the sonorant and assigned the sonorant onset based on the
'All measurements of peak to peak amplitude were expressed as the ratio of the maximum amplitude minus the minimum amplitude in a window 300 msec before to 250 msec after the peak
Research supported by DARPA and NSF
CH2561-9/88/0000-0453 $1.00 © 1988 IEEE
Stops In Sonorants. The voiced stops [b] and [d] and stoplike realizations of [dh] occur in sonorant intervals at nasal-vowel boundaries. Intervocalic [d] and [t] are usually flapped, but may be accompanied by a release burst at the offset of the amplitude dip associated with the flap. We attempted to locate stops at each segment boundary within the sonorant interval.
Identification of closures and obstruents used a "global/local" decision strategy. Global decisions were used to evaluate the entire interval as a sure closure, sure obstruent, possible closure or possible obstruent. Global decisions used information about the maximum, minimum and average values in the waveform parameters.
Stop detection was based on peaks in the spectral fastchange parameter. Peaks in this parameter are usually associated with a stop burst or glottalization (e.g., a glottal stop). Each peak before a segment boundary in the fastchange array was assigned a normalized peak value based on the mean and standard deviation of the parameter values 30 msec before the peak. If the peak value was sufficiently large, a sure stop was inserted in the sonorant interval between the burst and the following segment boundary.
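The normalization step can be sketched as a z-score of the peak against the 30 msec (10 frames at 3 msec) that precede it. The text only says the normalized value is "based on" the mean and standard deviation, so the z-score form, the threshold of 3.0 and the function names here are all assumptions made for illustration.

```python
import statistics


def normalized_peak(values, peak_index, context_frames=10):
    """Z-score of values[peak_index] against the preceding context_frames values."""
    context = values[max(0, peak_index - context_frames):peak_index]
    if len(context) < 2:
        return 0.0  # not enough history to estimate a spread
    mean = statistics.fmean(context)
    sd = statistics.stdev(context)
    if sd == 0:
        return float("inf") if values[peak_index] > mean else 0.0
    return (values[peak_index] - mean) / sd


def is_sure_stop(values, peak_index, threshold=3.0):
    """threshold is a hypothetical value; the paper does not state one."""
    return normalized_peak(values, peak_index) >= threshold
```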
If the global decision rules did not produce a sure closure or a sure obstruent, a different set of rules was used to locate successive closures and obstruents. Closures were located by finding 9 msec of speech (3 successive time frames) that satisfied the conditions for a closure onset (low peak to peak amplitudes and zero crossings) followed by 9 msec of speech that satisfied the conditions for closure offset (sufficient increase in these parameters).
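The "3 successive frames" test for closure onset and offset might look like the sketch below. The actual conditions and thresholds are not given in the text, so the fixed thresholds (`low_amp`, `low_zc`, `rise_amp`, `rise_zc`) and the way the two parameters are combined are assumptions.

```python
def find_run(flags, start, length=3):
    """Index where the first run of `length` consecutive True flags begins, or None."""
    run = 0
    for i in range(start, len(flags)):
        run = run + 1 if flags[i] else 0
        if run == length:
            return i - length + 1
    return None


def locate_closure(frames, low_amp, low_zc, rise_amp, rise_zc):
    """frames holds one (peak_to_peak, zero_crossings) pair per 3 msec frame.

    Returns (onset_frame, offset_frame) or None when no closure is found.
    """
    onset_ok = [amp <= low_amp and zc <= low_zc for amp, zc in frames]
    offset_ok = [amp >= rise_amp and zc >= rise_zc for amp, zc in frames]
    onset = find_run(onset_ok, 0)
    if onset is None:
        return None
    # The offset run must begin after the 3 frames that established the onset.
    offset = find_run(offset_ok, onset + 3)
    if offset is None:
        return None
    return onset, offset
```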
Obstruents were identified by a sufficient increase in zero crossings. The obstruent offset was based on a sufficient fall from the peak in the zero crossing parameter found after the closure or obstruent was found, the obs were again invoked to evaluate the the or obstruent offset and the following inte sonorant or stop.
Assigning Labels to Segments. A segment was called a vowel if (a) the segment included the sonorant peak (ptp0-2500), (b) the segment was adjacent to a dip, or (c) the segment had more peak to peak amplitude than its adjacent segments. A segment was called a weak sonorant if (a) the segment was defined by a spectral peak inside a dip, (b) the segment extended from the sonorant onset to a ramp boundary, (c) the segment extended from a ramp boundary to a sonorant offset, (d) the segment had less peak to peak amplitude than its adjacent segments, or (e) the segment was at the onset or offset of the sonorant and its average peak to peak amplitude was less than 50% of the sonorant peak. A segment was called a dip if it was located by the dip detection subroutines, even if the original dip boundaries were moved by boundary adjustment rules. A segment was called a stop if it was located by the stop detection subroutines.
The rules used to identify closures and obstruents provide many examples of the representation of acoustic phonetic knowledge. Global closures took account of the fact that stop closures may be noisy and contain transient bursts; the rules were designed to ignore these short bursts. Decisions about obstruents versus closures were contingent upon the presence or absence of a prevocalic stop at the end of the interval. For example, if an obstruent offset was observed within 18 msec of a stop, a closure was automatically inserted between the obstruent and the stop. If the obstruent offset was within 18 msec of a sonorant onset, the obstruent was extended to the sonorant onset. When the stop was present, we assumed that the offset was caused by a closure in an [s]-stop cluster. When the stop was absent, we assumed that the interval sonorant onset was an epenthetic closure.
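The two 18 msec rules in the example can be written as a small decision function. This sketch treats "within 18 msec" as inclusive and uses hypothetical names (`fill_gap`, `next_kind`); it only covers the two cases described in the text.

```python
GAP_MS = 18  # from the text: the rules apply when the gap is within 18 msec


def fill_gap(obstruent_offset_ms, next_onset_ms, next_kind):
    """Return the (label, start_ms, end_ms) segment that fills the gap, or None.

    next_kind is 'stop' or 'sonorant' (hypothetical labels for the two cases).
    """
    gap = next_onset_ms - obstruent_offset_ms
    if not 0 < gap <= GAP_MS:
        return None
    if next_kind == "stop":
        # A closure is inserted between the obstruent and the stop.
        return ("closure", obstruent_offset_ms, next_onset_ms)
    if next_kind == "sonorant":
        # The obstruent is extended up to the sonorant onset.
        return ("obstruent", obstruent_offset_ms, next_onset_ms)
    return None
```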
Generating the Sonorant Network. The segments were compiled into a network spanning the sonorant interval. Each segment boundary corresponds to a node in the network, and each segment is an arc. All paths in the network were required to converge on a node with sure boundary status, and all paths ired to span nodes with alternate boundary status.
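A network of this kind, with boundaries as nodes and labeled segments as arcs, can be represented and traversed as in the sketch below. The arc tuple layout and the example boundary times are hypothetical; the code simply enumerates every path between two nodes and does not model the sure versus alternate boundary rules.

```python
from collections import defaultdict


def all_paths(arcs, start, end):
    """Enumerate every arc sequence leading from start to end.

    arcs is a list of (from_node, to_node, label) tuples with from_node < to_node,
    where nodes are boundary times, so the network cannot contain cycles.
    """
    outgoing = defaultdict(list)
    for arc in arcs:
        outgoing[arc[0]].append(arc)

    def walk(node):
        if node == end:
            yield []
            return
        for arc in outgoing[node]:
            for rest in walk(arc[1]):
                yield [arc] + rest

    return list(walk(start))


# Hypothetical example: boundaries at 0 and 120 msec, with an optional one at 60 msec.
example_arcs = [
    (0, 120, "vowel"),
    (0, 60, "vowel"),
    (60, 120, "weak sonorant"),
]
example_paths = all_paths(example_arcs, 0, 120)
```

In this example both paths start at node 0 and converge on node 120, and only one of them passes through the boundary at 60 msec.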
Stops Before Sonorants. [b], [d], [g], [p], [t] and [k] are usually ... burst before sonorants. In addition, the ... often exhibit plosive bursts before sonorants were found by detecting the largest peak in the fastchange parameter within 90 msec of a sonorant onset, and evaluating the interval between the peak and the following sonorant for acoustic features characteristic of ... consonant.
The phrase "fancy ..." is often realized with a [t] inserted between the [n] and [s] in "fancy" and a [k] burst before the final [s] in "kicks." The phrase "fifth stripe" contains four successive obstruents: the fricatives [f], [th] and [s] and the stop [t], which is usually realized with a partial closure in an [str] cluster.
The candidate stop was eliminated from consideration if (a) the minimum zero crossing rate 30 msec before the peak was too high, (b) the average zero crossing rate between the plosive burst and the sonorant was too small, (c) evidence for a closure was obtained between the plosive onset and the sonorant, (d) the interval between the plosive onset and the sonorant contained 3 or more consistent pitch pulses, or (e) the average ptp0-2500 amplitude between the plosive onset and the sonorant was greater than 30% of the ptp0-2500 amplitude at the peak in the following sonorant.
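The five elimination tests can be collected into one function, as in the sketch below. The pitch pulse count (3 or more) and the 30% amplitude ratio come from the text; the zero crossing thresholds are not stated there, so `max_zc_before` and `min_mean_zc` are placeholder values, and the class and field names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class StopCandidate:
    min_zc_before_peak: float         # minimum zero crossing rate in the 30 msec before the peak
    mean_zc_burst_to_sonorant: float  # average zero crossing rate from burst to sonorant
    closure_evidence: bool            # closure found between plosive onset and sonorant
    pitch_pulses: int                 # consistent pitch pulses between onset and sonorant
    mean_ptp_0_2500: float            # average ptp0-2500 between onset and sonorant
    sonorant_peak_ptp_0_2500: float   # ptp0-2500 at the peak of the following sonorant


def rejection_reasons(c, max_zc_before=20.0, min_mean_zc=10.0):
    """Return the letters (a)-(e) of every rule that eliminates the candidate.

    max_zc_before and min_mean_zc are placeholder thresholds, not values from the paper.
    """
    reasons = []
    if c.min_zc_before_peak > max_zc_before:
        reasons.append("a")
    if c.mean_zc_burst_to_sonorant < min_mean_zc:
        reasons.append("b")
    if c.closure_evidence:
        reasons.append("c")
    if c.pitch_pulses >= 3:
        reasons.append("d")
    if c.mean_ptp_0_2500 > 0.30 * c.sonorant_peak_ptp_0_2500:
        reasons.append("e")
    return reasons
```

A candidate is kept only when the returned list is empty; returning the letters rather than a single boolean makes it easier to see which rule fired.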
decision about a stop was based on the presence of a burst in the spectral fastchange parameter near the obstruent onset. presence or absence of Obstruent-Obstruen greater than 45 msec
egment before the obstruent.
ies.
Obstruent segments ined for obstruent-obstruent
in the slowchange0-8000 parameter had a sufficient normalized peak values (based on the distribution of values within the
Closures and Obstruents
ptp1200-8000 para
In the final stage, closures and obstruents were identified between the offset of one sonorant interval and the onset of the next sonorant interval, which could be preceded by a stop. Decisions about closures and obstruents used the four waveform parameters.
If two spectral peaks were the peaks was evaluated as a
average ptp1200-8000 between the peaks was less than the average peak to peak amplitude of the adjacent segments.
Insertions. Insertions occurred when two or more segments were located within a single hand labeled segment. Insertions were typically caused by (a) identification of noise bursts as obstruents within noisy closures, (b) identification of epenthetic closures (that were not hand labeled) at obstruent-sonorant boundaries, and (c) incorrect assignment of sure boundaries associated with dips or ramps. Examination of these cases reveals that additional knowledge can be added to RBSC to eliminate most of these insertions.
Segmenting Closures. Finally, closures that spanned the entire interval between sonorants were examined for spectral peaks. If the normalized peak value was sufficiently large, the closure was given an alternate segmentation. The segment with the larger average zero crossings was called "obstruent" and the other segment was called "closure."
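The labeling of the two halves of a split closure can be sketched as a comparison of average zero crossing counts. The function name and the tie handling (a tie labels the second half "obstruent") are assumptions; the text does not say what happens when the averages are equal.

```python
from statistics import fmean


def label_halves(zero_crossings, split):
    """Label the segments before and after frame index `split`.

    zero_crossings holds one count per frame; split must satisfy 0 < split < len.
    The half with the larger average count is called "obstruent".
    """
    left, right = zero_crossings[:split], zero_crossings[split:]
    if fmean(left) > fmean(right):
        return ("obstruent", "closure")
    return ("closure", "obstruent")
```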
Classification. Table 1 shows the percentage of broad category labels in the best path assigned to the hand labels. Stops, affricates, strong fricatives and vowels were assigned an appropriate broad category label about 90% of the time. Examination of the visual displays for the 200 utterances revealed that weak sonorants and weak fricatives were assigned a contextually appropriate label about 70% of the time.
Evaluation
RBSC was evaluated on 200 utterances produced by 30 male and 20 female speakers. The utterances used in the evaluation were not used to train the system. The sentences were drawn from the DARPA TIMIT database designed for acoustic phonetic research. The speech was recorded at Texas Instruments under quiet conditions using a noise canceling microphone, and phonetically labeled by Victor Zue and his colleagues at MIT. RBSC was evaluated by comparing the network output to the phonetic labels.
Table 1. Distribution of broad category labels, insertions and deletions in the best path of the RBSC network for hand labeled segments.
The 200 sentences contained 7,543 hand labeled segments, excluding the labels indicating background noise at the beginning and end of each utterance. The depth of the network was calculated for each hand labeled segment by counting the number of arcs in the network at each 3 msec time frame in the segment and dividing by the number of time frames in the segment. The mean depth per segment was 2.0. The median depth was 1.3. The greater mean depth was caused by a small proportion of sonorant intervals with several alternate boundaries.
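The depth calculation described above can be sketched as follows, with arcs and segments given as frame index ranges (start inclusive, end exclusive). The representation is an assumption made for illustration.

```python
def segment_depth(arcs, seg_start, seg_end):
    """Mean number of network arcs covering each frame of a hand labeled segment.

    arcs is a list of (start_frame, end_frame) pairs; all ranges are half-open.
    """
    frames = range(seg_start, seg_end)
    total = sum(sum(1 for a, b in arcs if a <= t < b) for t in frames)
    return total / len(frames)
```

For example, a segment covered by one arc for its whole length and by a second arc for half of it has a depth of 1.5.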
STOP OBS
The remaining evaluations compared the best path in the RBSC network to the hand labeled segments. The best path was determined automatically and then verified manually by visual inspection of pictures of each utterance. Of the 7,543 hand labeled segments, 6,578 (87%) were mapped to a single broad category label, 383 (5%) were mapped to a sequence of two or more labels, and 582 (8%) of the hand labels were deleted. Boundary alignment was similar for left and right boundaries; 75% of all segments in the best path were within 9 msec of the hand labeled boundary and 90% of the segments were within 15 msec of the hand labeled boundaries.
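The reported percentages follow directly from the segment counts, and the boundary alignment figures are fractions of boundaries falling within a tolerance. The sketch below checks the arithmetic and shows one way such a fraction could be computed; the helper names are hypothetical, and treating "within 9 msec" as inclusive is an assumption.

```python
def percent(part, whole):
    """Percentage rounded to the nearest integer."""
    return round(100 * part / whole)


def fraction_within(auto_ms, hand_ms, tolerance_ms):
    """Fraction of paired boundaries whose times differ by at most tolerance_ms."""
    hits = sum(1 for a, h in zip(auto_ms, hand_ms) if abs(a - h) <= tolerance_ms)
    return hits / len(hand_ms)


TOTAL = 7543
counts = {"single label": 6578, "two or more labels": 383, "deleted": 582}
assert sum(counts.values()) == TOTAL  # the three outcomes account for every segment
shares = {name: percent(n, TOTAL) for name, n in counts.items()}
```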
CLOS DIP
b d g stop-dh 64 10
p t k 75 20
ch jh 39 61
s sh z 3 90
f v dh th hh 3 41 3 20 1 84 1
clos 1
VOC 6 2 3 1 15 1 4 6 18 6 3 4 7 4 8 88 9
vowel
l r w y m n ng 1
DEL 7
dx nx
Sonorant Intervals. Sonorant intervals were accurately identified. Over 2300 sonorant intervals were present in the 200 utterances, excluding vowels that were completely devoiced. Only four of these were not identified as a sonorant. These consisted of very weak sonorants at the end of an utterance. There were about 10 occurrences of stops that were identified as sonorants. The overall error rate for identification of sonorant versus non-sonorant regions was less than 1 percent.
WEAK INS
2 26 12 0
3 38
4 4
3 18
Conclusions
RBSC does not yet approach the performance of human spectrogram readers in segmenting or labeling speech. Nevertheless, we are most encouraged by the present results. During the past year, we have seen a steady improvement in performance as knowledge has been added to the algorithm. The current evaluation has revealed many cases in which the current implementation can be improved and additional knowledge can be added.
Deletions. About 45% of all deletions involved the class of "weak sonorants" composed of /l/, /r/, /w/, /y/, /m/, /n/, /ng/. Boundaries associated with /r/ accounted for over half of the deletions in this class. Vowels in sonorant intervals accounted for about 15% of all deletions (although only 3% of vowel labels were deleted). Vowel deletions were caused by a missed boundary between adjacent sonorants; about half of these involved the reduced vowels /ix/ and /ax/. Boundaries associated with the /q/ label, indicating glottalization or a glottal stop, accounted for about 10% of the deletions. (RBSC was designed to ignore glottalization.) Closures, stops and weak fricatives accounted for the remaining deletions. The vast majority of these occurred within sonorant intervals. Most stop and closure deletions occurred in nasal-vowel contexts.
References
1. M. S. Phillips, (1985) "A Feature-Based Time Domain Pitch Tracker," J. Acoust. Soc. Amer., 77, S9-S10 (A).
2. J. R. Glass and V. W. Zue, (1988) "Multi-level Acoustic Segmentation of Continuous Speech," This Proceedings.