Audio indexing using speaker identification

Lynn Wilcox, Don Kimber, and Francine Chen
Xerox Palo Alto Research Center
3333 Coyote Hill Road, Palo Alto, California 94304

ISTL Technical Report No. ISTL-QCA-1994-05-04

ABSTRACT

In this paper, a technique for audio indexing based on speaker identification is proposed. When speakers are known a priori, a speaker index can be created in real time using the Viterbi algorithm to segment the audio into intervals from a single talker. Segmentation is performed using a hidden Markov model network consisting of interconnected speaker sub-networks. Speaker training data is used to initialize sub-networks for each speaker. Sub-networks can also be used to model silence, or non-speech sounds such as a musical theme. When no prior knowledge of the speakers is available, unsupervised segmentation is performed using a non-real time iterative algorithm. The speaker sub-networks are first initialized, and segmentation is performed by iteratively generating a segmentation using the Viterbi algorithm, and retraining the sub-networks based on the results of the segmentation. Since the accuracy of the speaker segmentation depends on how well the speaker sub-networks are initialized, agglomerative clustering is used to approximately segment the audio according to speaker for initialization of the speaker sub-networks. The distance measure for the agglomerative clustering is a likelihood ratio in which speech segments are characterized by Gaussian distributions. The distance between merged segments is recomputed at each stage of the clustering, and a duration model is used to bias the likelihood ratio. Segmentation accuracy using agglomerative clustering initialization matches accuracy using initialization with speaker labeled data.

Keywords: speaker identification, speaker segmentation, audio indexing, speaker index, speech retrieval, hidden Markov models, agglomerative clustering

1 INTRODUCTION

In order to find specific information in an audio recording, it is usually necessary to listen to it sequentially. While it is possible to fast forward an arbitrary amount, it is difficult to know exactly where to stop. Techniques exist for time compression of audio without pitch distortion,1 but comprehension is lost beyond about a factor of two compression. Arons2 has developed a system which allows skimming of speech by playing only the sections following long pauses. Wilcox and Bush3 use keyword spotting as a means of indexing speech based on its content. Chen and Withgott4 identify emphasized regions in speech. In this paper, we propose the use of speaker identification as a means of indexing audio data. Unlike standard speaker identification, in which the speech is from a single speaker, here the audio contains speech from multiple speakers. We propose to segment the audio into intervals, with each interval containing speech from a single speaker. An index based on these intervals provides the capability to skip

to the next speaker when reviewing audio data, or to play back only those portions of the audio corresponding to a particular speaker. Pauses, or silence intervals, as well as non-speech sounds such as a musical theme, can also be identified for use in indexing. In cases where speakers are known a priori, segmentation can be performed in real time. This is useful, for example, in video annotation systems where annotations are made during the recording process.5 It is also possible to segment the audio when no prior information on the speakers is provided. However, the algorithm for unsupervised segmentation is iterative and cannot be performed in real time.

The basic framework for segmentation of the audio is a hidden Markov model (HMM) network consisting of a sub-network for each speaker and interconnections between speaker sub-networks.6 Speech is represented as a sequence of cepstral vectors. Speaker segmentation is performed using the Viterbi algorithm7 to find the most likely sequence of states, and noting those times when the optimal state sequence changes between speaker sub-networks. A similar technique was used by Sugiyama et al.;8 however, their sub-networks consisted of a single state. The speaker sub-networks used here are multi-state HMMs with Gaussian output distributions. This form of speaker model was used by Wilcox and Bush9 to model non-keyword speech for speaker dependent word spotting. Similar non-phonetic10 and phonetic11 models have been applied to speaker identification, but segmentation was not considered. In addition to modeling speakers, sub-networks are also used to model silence and non-speech sounds such as a musical theme.

In cases where the speakers are known a priori, and where it is possible to obtain sample data from their speech, segmentation of the audio into regions corresponding to the known speakers can be performed in real time, as the speech is being recorded. This is done by pre-training the speaker sub-networks using the sample data, and then using the Viterbi algorithm with continuous traceback12 for segmentation. When no prior knowledge of the speakers is available, unsupervised speaker segmentation is possible using a non-real time, iterative algorithm. Speaker sub-networks are first initialized, and segmentation is achieved by iteratively using the Viterbi algorithm to compute a segmentation, and then retraining the speaker sub-networks based on the computed segmentation.

It is necessary for the iterative segmentation algorithm to have good initial estimates for the speaker sub-networks. When speaker sub-networks are initialized randomly, the results of the segmentation vary and can be quite poor. Thus, agglomerative clustering is used here to obtain an initial segmentation of the speech. This segmentation is then used in Baum-Welch training7 of the speaker sub-networks. The results for initialization based on agglomerative clustering depend on the distance measure between speech segments. We use the likelihood ratio statistic proposed by Gish et al.,13 but extend it by replacing the Gaussian distributions with tied Gaussian mixtures. We also recompute distances between merged segments at each level of the hierarchical clustering and augment the distance with a duration model. When hierarchical clustering using the tied mixtures in the likelihood ratios was used to initialize the speaker sub-networks, speaker segmentation performance equaled that obtained with supervised initialization of the speaker sub-networks.
The following sections describe in more detail the speaker segmentation network, the real time segmentation algorithm, the unsupervised segmentation algorithm, and the use of agglomerative clustering for initialization of the speaker sub-networks. Results of speaker segmentation on a video-taped panel discussion14 are presented for the various algorithms.

2 SPEAKER SEGMENTATION NETWORK

The speaker segmentation network is composed of a sub-network for each speaker. In addition, sub-networks can be supplied for silence, non-speech sounds such as a musical theme, and garbage. Garbage is defined as speech or sound not modeled by the other speaker sub-networks, for example, unknown speakers or sounds in the audio. Figure 1 shows a speaker segmentation network for M speakers and silence. Transition probabilities from the initial null state to the speaker and silence sub-networks are uniform. The transition probability out of the speaker models and silence is set to a constant ε. In principle, these transition probabilities could depend on the speaker, and could be learned during training. However, for simplicity, the prior probabilities of speakers are assumed to be uniform, and the speaker exiting probability ε is selected empirically to discourage speaker changes based on isolated samples.

Figure 1. Speaker segmentation network.

Figure 2. Speaker sub-network.

Figure 3. Silence sub-network.

Figure 2 shows a speaker sub-network. The network consists of L speaker states S_1, ..., S_L, and an optional silence state SIL, connected in parallel. Each state has a Gaussian output distribution, and has a self transition and an exiting transition. Figure 3 shows a silence sub-network. It consists of 3 states, connected serially, to model longer duration silences. Each state has a common, or tied, Gaussian output distribution, as indicated by the label SIL.

The silence sub-network models long periods of silence, but is not appropriate for pauses, or brief silence intervals, in conversational speech. This is due to the fact that, as a separate path in the segmentation network, transitions out of a speaker sub-network into the silence sub-network are penalized equally with transitions to another speaker. To avoid this penalty, pauses are modeled by adding a silence state to each speaker sub-network. The output distribution of the silence state within a speaker sub-network is tied to the output distribution of the states in the silence sub-network, as indicated by the label SIL. Tying the output distributions of the silence states across speakers allows speaker differences to be determined without influence from pauses.
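To make the topology concrete, here is a minimal sketch, ours rather than the paper's, of a flattened transition matrix for M speaker sub-networks of L parallel states each. The per-speaker null states are collapsed into direct state-to-state probabilities, and the values of the self-transition probability and ε are illustrative assumptions.

import numpy as np

def build_transitions(M, L, p_self=0.95, eps=1e-3):
    """Flattened (M*L x M*L) transition matrix for M parallel speaker
    sub-networks of L states each. From any state: stay with p_self;
    move within the same speaker's states with mass (1-p_self)*(1-eps);
    exit to another speaker's states with mass (1-p_self)*eps."""
    n = M * L
    A = np.zeros((n, n))
    for s in range(n):
        spk = s // L
        same = [spk * L + k for k in range(L)]
        other = [t for t in range(n) if t // L != spk]
        A[s, same] += (1 - p_self) * (1 - eps) / L
        A[s, other] += (1 - p_self) * eps / len(other)
        A[s, s] += p_self
    return A  # each row sums to 1

A small ε makes leaving a speaker expensive relative to staying, which is exactly the empirical bias against speaker changes on isolated samples described above.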

3 REAL TIME SPEAKER SEGMENTATION

When the set of speakers is known, and when it is possible to obtain training data for the speakers a priori, a speaker index can be created in real time, as the audio is being recorded. This is the case, for example, for a meeting in which the usual attendees are known. Real time speaker segmentation is performed using the Viterbi algorithm to find the most likely sequence of states through the speaker segmentation network. Whenever the optimal state sequence changes from one speaker sub-network to another, a speaker index is generated. Real time performance is possible by using the Viterbi algorithm with partial traceback.12 With this method, backtracking is performed periodically for all paths, and results are reported for those times when all paths are identical. While there is some delay inherent in this algorithm, in practice the delays are usually less than half a second, and thus are too short to be noticeable.

Before segmentation is possible, each of the speaker sub-networks must be trained. This is done using the standard Baum-Welch training algorithm on approximately a minute of speech from each speaker. The initial parameters for the Gaussian output distributions for the states within a speaker model are derived from the grand mean and covariance of the training data for that speaker. The initial mean for a given state's Gaussian

output distribution is obtained by adding or subtracting one standard deviation to randomly chosen components of the grand mean. The covariance for each Gaussian is diagonal, and is initialized with the grand covariance. All transition probabilities are set uniformly. Using this initialization, Baum-Welch training is performed for each speaker sub-network. Sub-networks representing silence and non-speech sounds are trained similarly.
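As a minimal sketch of this initialization (our reading of the recipe; the number of states, the choice of perturbed components, and the random seed are assumptions, not values from the paper):

import numpy as np

def init_speaker_states(train, n_states=8, seed=0):
    """Initialize per-state Gaussian parameters from one speaker's
    training cepstra (T x D): each state mean is the grand mean with
    one standard deviation added or subtracted on randomly chosen
    components; diagonal covariances start at the grand variances."""
    rng = np.random.default_rng(seed)
    grand_mean = train.mean(axis=0)
    grand_std = train.std(axis=0)
    means = np.tile(grand_mean, (n_states, 1))
    for s in range(n_states):
        chosen = rng.integers(0, 2, size=grand_mean.size)     # which components to perturb
        sign = rng.choice([-1.0, 1.0], size=grand_mean.size)  # add or subtract
        means[s] += chosen * sign * grand_std
    variances = np.tile(grand_std ** 2, (n_states, 1))
    return means, variances

Baum-Welch training would then refine these parameters for each sub-network.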

4 UNSUPERVISED SPEAKER SEGMENTATION

When no prior knowledge of the speakers is available, speaker segmentation is still possible. However, the algorithm for unsupervised speaker segmentation is iterative, and therefore not real time. First, initial estimates for the parameters of the speaker sub-networks must be obtained. These parameters consist of the transition probabilities between states of the speaker sub-network (see Figure 2), and the mean and covariance matrices of the Gaussian output distributions for each of the states. Then, segmentation of the speech according to speaker is performed by repeating the following steps until there is no further change in the segmentation:

1) Use the Viterbi algorithm to find the most likely state sequence through the speaker segmentation network, and mark those times when the optimal state path changes speaker.

2) Use the Baum-Welch algorithm to retrain the parameters of the speaker sub-networks, using the results of the Viterbi segmentation to obtain training data corresponding to each speaker.

In the real time case, speaker training data is available to obtain estimates for the parameters of the speaker sub-networks. The iterative algorithm can also be used with this supervised initialization. The resulting segmentation is more accurate, but real time performance is lost. In the unsupervised case, estimates of the speaker sub-network parameters must be obtained by other means. Two methods for initialization are discussed below. In the first, parameters are initialized randomly. In the second, agglomerative clustering is used to approximately segment the data according to speaker. This approximate segmentation is then used as training data for Baum-Welch training of the speaker sub-networks.
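The sketch below, which is ours and not the paper's code, illustrates this alternation. For brevity it collapses each sub-network to a single Gaussian state and refits the models by Viterbi training (refitting on the frames assigned to each speaker) rather than full Baum-Welch, so it is a simplified stand-in for step 2.

import numpy as np

def viterbi_labels(X, means, variances, stay=0.999):
    """Most likely speaker per frame for a flattened network with one
    diagonal Gaussian per speaker; the switching penalty plays the
    role of the exit probability in the full network."""
    M = len(means)
    logp = -0.5 * (np.log(2 * np.pi * variances).sum(1)
                   + (((X[:, None, :] - means) ** 2) / variances).sum(2))
    T = len(X)
    trans = np.where(np.eye(M, dtype=bool),
                     np.log(stay), np.log((1 - stay) / (M - 1)))
    delta = logp[0].copy()
    back = np.zeros((T, M), dtype=int)
    for t in range(1, T):
        scores = delta[:, None] + trans
        back[t] = scores.argmax(0)
        delta = scores.max(0) + logp[t]
    path = np.zeros(T, dtype=int)
    path[-1] = delta.argmax()
    for t in range(T - 2, -1, -1):
        path[t] = back[t + 1, path[t + 1]]
    return path

def iterate_segmentation(X, means, variances, max_iter=20):
    """Step 1 / step 2 loop: segment with Viterbi, refit each speaker
    model on its assigned frames, stop when the segmentation is stable."""
    labels = viterbi_labels(X, means, variances)
    for _ in range(max_iter):
        for m in range(len(means)):
            frames = X[labels == m]
            if len(frames) > 1:
                means[m] = frames.mean(0)
                variances[m] = frames.var(0) + 1e-6
        new_labels = viterbi_labels(X, means, variances)
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels
    return labels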

4.1 Random initialization

When no training data is available for the speakers, one technique for obtaining initial estimates for the speaker sub-networks is to randomly initialize the speaker models. For two speakers, the sub-networks are initialized as follows. First, the grand mean and standard deviation are computed for the unsegmented speech. A base mean for one speaker is obtained by adding one standard deviation to each component of the grand mean, and for the other speaker by subtracting one standard deviation. The means for the Gaussians in each speaker sub-network are then obtained by adding or subtracting one standard deviation to random components of the base mean, as in the supervised case. The covariance matrices are diagonal, initialized with the grand variances. All transition probabilities are initialized uniformly.
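A sketch of this two-speaker initialization, under the same assumptions as the supervised sketch above:

import numpy as np

def random_two_speaker_init(X, n_states=8, seed=0):
    """Random initialization from unsegmented cepstra X (T x D): base
    means at the grand mean plus/minus one grand standard deviation,
    then per-state perturbation of random components as in the
    supervised case; diagonal covariances from the grand variances."""
    rng = np.random.default_rng(seed)
    grand_mean, grand_std = X.mean(0), X.std(0)
    models = []
    for sign in (+1.0, -1.0):
        base = grand_mean + sign * grand_std
        means = np.tile(base, (n_states, 1))
        perturb = rng.choice([-1.0, 0.0, 1.0], size=means.shape)
        means += perturb * grand_std
        models.append((means, np.tile(grand_std ** 2, (n_states, 1))))
    return models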

4.2 Agglomerative clustering

Another technique for obtaining initial estimates for the speaker sub-networks is to use hierarchical agglomerative clustering to segment the data into approximate speaker clusters.15 This data is then used in Baum-Welch training of the speaker sub-networks. The unsegmented data is first divided into equal length segments consisting of several seconds of speech. These segments are used to initialize a set of clusters, where a cluster $X$ consists of either a single segment, $X = x$, or a set of segments, $X = \{x_1, x_2, \ldots\}$. The distance between clusters $X$ and $Y$ is denoted by $d(X, Y)$. Hierarchical agglomerative clustering proceeds by computing all pairwise distances between clusters, and merging the two clusters with the minimum pairwise distance. This process is repeated until the desired number of clusters is obtained.

In Gish et al.,13 the distance between two segments was derived from a likelihood ratio test for the hypothesis $H_0$ that the segments were generated by the same speaker and the hypothesis $H_1$ that the segments were generated by different speakers. Let $x = v_1, \ldots, v_r$ denote the cepstral vectors in one segment, $y = v_{r+1}, \ldots, v_n$ denote the vectors in another segment, and $z = v_1, \ldots, v_n$ denote the combined collection of vectors. The vectors are assumed to be i.i.d., and are not necessarily time adjacent. Let $L(x : \theta_x)$ be the likelihood of the $x$ segment, where $\theta_x$ denotes maximum likelihood estimates for the mean and covariance matrix based on the samples in the $x$ segment. Let $L(y : \theta_y)$ and $L(z : \theta_z)$ be similarly defined. The likelihood $L_1$ that the two segments were generated by different speakers is $L_1 = L(x : \theta_x) L(y : \theta_y)$. The likelihood $L_0$ that the segments were generated by the same speaker is $L_0 = L(z : \theta_z)$. Let $\lambda$ denote the likelihood ratio, $\lambda = L_0 / L_1$. Thus

\[
\lambda = \frac{L(z : \theta_z)}{L(x : \theta_x)\, L(y : \theta_y)}. \tag{1}
\]

The distance measure between segments $x$ and $y$ used in the hierarchical clustering is $d_L(x, y) = -\log(\lambda)$.
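As a concrete illustration, the following sketch (ours) computes d_L for the single Gaussian case. It uses the standard fact that, at the maximum likelihood estimates, the Gaussian log-likelihood of a segment reduces to a function of its sample covariance determinant; the small ridge added to the covariance is an assumption for numerical stability.

import numpy as np

def gaussian_loglik(seg):
    """log L(seg : theta_seg) for a full-covariance Gaussian with ML
    parameters fit on the segment itself: -n/2 (d log 2pi + log|C| + d)."""
    n, d = seg.shape
    cov = np.cov(seg, rowvar=False, bias=True) + 1e-6 * np.eye(d)
    _, logdet = np.linalg.slogdet(cov)
    return -0.5 * n * (d * np.log(2 * np.pi) + logdet + d)

def d_L(x, y):
    """Equation (1) distance: -log(lambda) = log L1 - log L0."""
    z = np.vstack([x, y])
    return gaussian_loglik(x) + gaussian_loglik(y) - gaussian_loglik(z)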

Rose and Reynolds16 found that a Gaussian mixture model provided a more accurate method for speaker identification than a single Gaussian. Thus we extend the likelihood ratio of equation (1) to tied mixtures of Gaussians. Rather than computing the likelihood of a segment of speech assuming a single Gaussian, the likelihood is based on a mixture of $K$ Gaussians. Let $N_k(v) = N(v : \mu_k, \Sigma_k)$ be the Gaussian distribution for a vector $v$ associated with the $k$th mixture component, for each $k = 1, \ldots, K$. The means $\mu_k$ and covariance matrices $\Sigma_k$ for the components of the Gaussian mixture are estimated using the entire set of unsegmented data. These parameters are then fixed. Let $g_k(x)$ be the weight for the $k$th mixture component estimated using segment $x$. The likelihood of $x = v_1, \ldots, v_r$ is

\[
L(x : \theta_x) = \prod_{j=1}^{r} \sum_{k=1}^{K} g_k(x) N_k(v_j). \tag{2}
\]

The likelihood $L(y : \theta_y)$ is computed similarly.

Since the means and covariance matrices of the Gaussian mixture are fixed, the only free parameters to be estimated from the $x$ segment are the mixture weights. Thus $\theta_x = (g_1(x), \ldots, g_K(x))$. The weight $g_k(x)$ is estimated as the proportion of samples $v$ in the $x$ segment for which the probability of the $k$th component, $N_k(v)$, is maximum. Thus the mixture weights $g_k(z)$ can be derived from the weights $g_k(x)$ and $g_k(y)$ as

\[
g_k(z) = \frac{r}{n}\, g_k(x) + \frac{n - r}{n}\, g_k(y). \tag{3}
\]

The distance measure $d_L(x, y) = -\log(\lambda)$ can then be computed using the mixture model of equation (2) in equation (1).
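The tied-mixture version can be sketched as follows (ours; diagonal-covariance components are a simplifying assumption, and in practice the component means and variances would first be fit on the full unsegmented data, for example by EM or k-means). Because the weight estimate is a per-frame proportion, the merged weights obey equation (3) exactly.

import numpy as np

def component_logps(seg, means, variances):
    """log N_k(v) for each frame and each tied diagonal component."""
    return -0.5 * (np.log(2 * np.pi * variances).sum(1)
                   + (((seg[:, None, :] - means) ** 2) / variances).sum(2))

def mixture_weights(seg, means, variances):
    """g_k(seg): fraction of frames whose most probable component is k."""
    best = component_logps(seg, means, variances).argmax(1)
    return np.bincount(best, minlength=len(means)) / len(seg)

def tied_mixture_loglik(seg, weights, means, variances):
    """Equation (2): sum over frames of log sum_k g_k N_k(v_j)."""
    logp = component_logps(seg, means, variances)
    m = logp.max(1, keepdims=True)  # log-sum-exp over components
    return float((m[:, 0]
                  + np.log((np.exp(logp - m) * weights).sum(1) + 1e-300)).sum())

def d_L_mixture(x, y, means, variances):
    """-log(lambda) with equation (2) likelihoods and equation (3) weights."""
    gx = mixture_weights(x, means, variances)
    gy = mixture_weights(y, means, variances)
    r, n = len(x), len(x) + len(y)
    gz = (r / n) * gx + ((n - r) / n) * gy  # equation (3)
    z = np.vstack([x, y])
    return (tied_mixture_loglik(x, gx, means, variances)
            + tied_mixture_loglik(y, gy, means, variances)
            - tied_mixture_loglik(z, gz, means, variances))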

Typical hierarchical clustering algorithms, for example "hclust" in the S Interactive Environment for Data Analysis and Graphics,17 compute the distance between clusters as the maximum, minimum, or average of the pairwise distances between the segments comprising the clusters. Thus the distance between clusters $X$ and $Y$ using the maximum pairwise distance is

\[
d(X, Y) = \max_{i,j}\, d(x_i, y_j). \tag{4}
\]

The likelihood based distance measure $d_L$ used here relies on estimates of either the Gaussian parameters (for the single Gaussian model) or the mixture weights (for the tied mixture model), which are based on the cepstral vectors contained in the segments. These estimates are more reliable when computed using more vectors. Thus we used a hierarchical clustering algorithm which recomputes distances between clusters. The recomputed distance between clusters $X$ and $Y$ is obtained using (1) with $x$, $y$, and $z$ replaced by $X$, $Y$, and $Z$, where $Z$ is the result of merging $X$ and $Y$.
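A sketch of the clustering loop with recomputed distances (ours; the brute-force distance scan per merge is for clarity rather than efficiency, and `dist` can be either d_L sketch above):

import numpy as np

def agglomerative_cluster(segments, n_clusters, dist):
    """Bottom-up clustering of a list of (T_i x D) frame arrays. After
    each merge, distances involving the new cluster are recomputed on
    the merged frames, per the discussion above, rather than taken as
    the max/min/average of the original pairwise segment distances."""
    clusters = [np.asarray(s) for s in segments]
    while len(clusters) > n_clusters:
        best, pair = np.inf, None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = dist(clusters[i], clusters[j])
                if d < best:
                    best, pair = d, (i, j)
        i, j = pair
        merged = np.vstack([clusters[i], clusters[j]])
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return clusters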

It is usually the case that adjacent segments are from the same speaker. In order to take advantage of this information at the level of the hierarchical clustering, the likelihood ratio of equation (1) was biased using a simple duration model based on speaker changes over the original equal length segments. Let $S_i$ denote the speaker during segment $i$, and $M$ the number of speakers. Assume that $S_i$ is a Markov chain with $\Pr[S_{i+1} = a \mid S_i = a] = p$ for each speaker $a$, and $\Pr[S_{i+1} = b \mid S_i = a] = (1 - p)/(M - 1)$ for each $a$ and $b \neq a$. The probability $\Pr[S_{i+n} = S_i]$, that the speaker for segment $i$ is also speaking for segment $i + n$, may be computed by using a two state Markov chain, where state 1 of the chain represents the speaker at time $i$, and state 2 represents all other speakers. (This reduction of the $M$ state chain to a 2 state chain is only possible because of the complete symmetry.) The transition probability matrix $P$ for this chain is

\[
P = \begin{pmatrix} p & 1 - p \\[4pt] \dfrac{1 - p}{M - 1} & 1 - \dfrac{1 - p}{M - 1} \end{pmatrix}. \tag{5}
\]

In terms of this matrix, $\Pr[S_{i+n} = S_i] = (P^n)_{11}$. By diagonalizing $P$, this may be expressed in closed form as

\[
f(n) \equiv \Pr[S_{i+n} = S_i] = \frac{1 + (M - 1)\left(\frac{Mp - 1}{M - 1}\right)^n}{M}. \tag{6}
\]

Using this equation we can compute the prior probabilities that two given clusters $X$ and $Y$ are produced by either the same speaker (hypothesis $H_0$) or by two different speakers (hypothesis $H_1$). Let $Z$ be the cluster formed by merging $X$ and $Y$. There will be segments $z_j \in Z$ such that $z_j \in X$ and $z_{j+1} \in Y$ (or vice versa), corresponding to intervals in which the beginning and ending speakers are different according to $H_1$. Let $n_i$ be the difference between the time indices of the first and last segments of the $i$th such interval, and let $C$ be the number of such intervals. A duration bias is then defined as

\[
\lambda_D = \frac{\Pr[H_0]}{\Pr[H_1]} = \frac{\prod_{i=1}^{C} f(n_i)}{\prod_{i=1}^{C} \left(1 - f(n_i)\right)/(M - 1)}. \tag{7}
\]

The duration biased distance between clusters $X$ and $Y$, $d_D(X, Y)$, is defined as $d_D(X, Y) = -\log(\lambda) - \log(\lambda_D)$.
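Equations (6) and (7) are straightforward to compute; a sketch follows (ours; the persistence probability p and the extraction of the interval lengths n_i from the cluster contents are left to the caller as assumptions):

import numpy as np

def f(n, M, p):
    """Equation (6): probability that the speaker is unchanged after n
    segment steps in the symmetric M-speaker Markov chain."""
    return (1.0 + (M - 1) * ((M * p - 1) / (M - 1)) ** n) / M

def log_duration_bias(intervals, M, p):
    """log(lambda_D) of equation (7), given the lengths n_i of the
    intervals crossing between the two merged clusters."""
    total = 0.0
    for n in intervals:
        total += np.log(f(n, M, p)) - np.log((1 - f(n, M, p)) / (M - 1))
    return total

# Duration biased distance: d_D(X, Y) = d_L(X, Y) - log_duration_bias(intervals, M, p)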

5 EXPERIMENTS

One set of test data for speaker segmentation consisted of a video-taped panel discussion from Siggraph.14 There were five main speakers: a moderator and four panel members. In addition, there were speakers from the audience who asked questions of the panel members. The moderator and panel members each gave a short talk prior to the discussion. One minute from each speaker's short talk was used to train the speaker models for real time segmentation. The test data for the various experiments discussed below consisted of subsets of the first 18 minutes of the panel discussion. The data was time-aligned and labeled according to speaker. Silences longer than half a second were also labeled. The data was digitized using Sun Sparc-10 audio. The sampling rate was 8 kHz, with mu-law encoding and 8 bits per sample. Twelve cepstral coefficients were computed every 20 ms.

Real time segmentation was tested using the training data described above to train a speaker sub-network for each of the five speakers. Since the data contained speech from members of the audience, a garbage model was used to separate these speakers from the five members of the panel. Training data to initialize the garbage model was obtained by concatenating short portions of data from each of the five speakers. A silence model was also created. As shown in Table 1, using the real time segmentation algorithm without tied silence states, segmentation of the entire 18 minute test data resulted in 26 percent error, where error was the percent of the total time the wrong speaker was chosen. When tied silence states were added to the speaker models, the error rate was 14 percent.

Table 1: Speaker Segmentation Error for Supervised Training.

             no tied silence   tied silence
real time         26%              14%
iterative         6.1%             5.5%

Table 2: Unsupervised Speaker Segmentation Error: Single Gaussian.

             maximum d_L   recomputed d_L   d_D
initial         31.4%          10.1%        6.0%
converged       30.3%           3.6%        0.9%

We also tested the performance of the iterative segmentation algorithm on this data. For non-tied silences, the error rate dropped to 6.1 percent when the non-real time algorithm was used to iteratively retrain the speaker models based on previous segmentations. For tied silences, the error rate using the iterative algorithm dropped to 5.5 percent. Thus the iterative segmentation algorithm provides a substantial improvement in segmentation accuracy, at a loss of real time performance.

Next we compared the schemes for initializing the speaker sub-networks for unsupervised speaker segmentation. To test random initialization of the speaker sub-networks, a 3 minute portion of the test data containing only two speakers was selected. No silence or garbage models were used. The error rates varied greatly depending on the random initialization. In five trials, the error rates were 3.4 percent, 1.5 percent, 48.8 percent, 1.5 percent, and 9.0 percent. The error rate using supervised training for the initial estimates was 0.6 percent. Such sensitivity to initial estimates was also noted by Sugiyama et al.8

To test unsupervised initialization using agglomerative clustering, a 6 minute segment of the panel discussion containing three speakers was selected. No silence or garbage models were used. Five second intervals of speech were used as the initial uniform segmentation. Table 2 shows the results for the distance d_L based on a single Gaussian model. When the distance between clusters was computed as the maximum pairwise distance between the segments comprising the clusters, the segmentation error was initially 31.4 percent. The error after convergence of the iterative resegmentation algorithm was 30.3 percent. When the distance between clusters was recomputed, the error was initially 10.1 percent and dropped to 3.6 percent after iterative resegmentation. When the recomputed distance was biased by the duration model, the initial error was 6.0 percent, and 0.9 percent after convergence. For comparison, the segmentation error using supervised data to initialize the iterative algorithm was 0.5 percent.

Results with the distance d_L using the tied mixture model with 32 components are shown in Table 3. The initial error rate using the maximum pairwise distance was 10.7 percent. After iterative resegmentation, the error rate was 0.8 percent. When the distance between clusters was recomputed, the error was initially 4.2 percent and dropped to 0.5 percent after iterative resegmentation. Adding the duration bias did not further improve the results.

Table 3: Unsupervised Speaker Segmentation Error: Tied Gaussian Mixture.

             maximum d_L   recomputed d_L   d_D
initial         10.7%           4.2%        4.2%
converged        0.8%           0.5%        0.5%

In order to investigate the ability to segment non-speech sounds, we tested the system on a portion of the MacNeil-Lehrer News Hour. In this program, a musical theme is used to separate segments of the show. The ability to index according to these segments would allow the user to easily skip portions of the news. The musical theme (5 seconds) was used to train one speaker sub-network, and an introduction by the newscasters (20 seconds) was used to train another speaker sub-network. Three minutes of the news was segmented according to these models. The musical theme was correctly classified, but a 10 second segment of the regular news was also classified as the musical theme. However, this portion was actually a location piece, in which a band was playing.

6 CONCLUSIONS

A hidden Markov model network for speaker segmentation, consisting of a sub-network for each speaker and interconnections between speakers, has been described. Silence, and non-speech sounds such as a musical theme, can also be segmented. When speakers are known a priori, speaker segmentation can be performed in real time. However, segmentation error can be decreased by iteratively resegmenting and retraining the speaker sub-networks. When no prior knowledge of the speakers is available, unsupervised segmentation can be performed using the iterative algorithm. However, it was found that a good initialization of the speaker sub-networks was critical. We proposed a hierarchical agglomerative clustering algorithm for initialization of the speaker sub-networks. Distance measures for the hierarchical clustering were based on a likelihood function which assumed segments were described by a single Gaussian, or by a tied mixture of Gaussians. Because of this likelihood based distance, performance was improved when the distance between clusters was recomputed, rather than derived as the maximum of the pairwise distances between segments in the clusters, as is common in hierarchical clustering implementations. The use of tied mixtures rather than single Gaussian models resulted in improved performance, as did the addition of a durational bias for the distance between clusters. Initialization using hierarchical agglomerative clustering with the tied mixtures gave the same segmentation performance as initialization using supervised training for each speaker.

7 REFERENCES

[1] Arons, B. "Techniques, Perception, and Application of Time-Compressed Speech". Proc. 1992 Conf. American Voice I/O Society, pp. 169-177, September 1992.
[2] Arons, B. "SpeechSkimmer: Interactively Skimming Recorded Speech". Proc. UIST 1993: ACM Symposium on User Interface Software and Technology, November 1993.
[3] Wilcox, L.D., I. Smith, and M.A. Bush. "Wordspotting for Voice Editing and Audio Indexing". Proc. CHI 1992, ACM SIGCHI, pp. 655-656, May 1992.
[4] Chen, F.R. and M.M. Withgott. "The Use of Emphasis to Automatically Summarize a Spoken Discourse". Proc. Int. Conf. Acoustics, Speech and Signal Processing, San Francisco, CA, vol. 2, pp. 229-232, March 1992.
[5] Weber, K. and A. Poon. "Marquee: A Tool for Real-Time Video Logging". Proc. CHI 1994, ACM SIGCHI, pp. 58-64, April 1994.
[6] Wilcox, L.D., F.R. Chen, D. Kimber, and V. Balasubramanian. "Segmentation of Speech Using Speaker Identification". Proc. Int. Conf. Acoustics, Speech and Signal Processing, Adelaide, Australia, April 1994.
[7] Rabiner, L.R. "A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition". Proc. IEEE, vol. 77, no. 2, pp. 257-285, February 1989.
[8] Sugiyama, M., J. Murakami, and H. Watanabe. "Speech Segmentation and Clustering Based on Speaker Features". Proc. Int. Conf. Acoustics, Speech and Signal Processing, Minneapolis, MN, vol. 2, pp. 395-398, April 1993.
[9] Wilcox, L.D. and M.A. Bush. "Training and Search Algorithms for an Interactive Wordspotting System". Proc. Int. Conf. Acoustics, Speech and Signal Processing, San Francisco, CA, vol. 2, pp. 97-100, March 1992.
[10] Matsui, T. and S. Furui. "Comparison of Text-Independent Speaker Recognition Methods Using VQ-Distortion and Discrete/Continuous HMMs". Proc. Int. Conf. Acoustics, Speech and Signal Processing, San Francisco, CA, vol. 2, pp. 157-160, March 1992.
[11] Gauvain, J.L. and L.F. Lamel. "Identification of Non-Linguistic Speech Features". Proc. ARPA Human Language Technology Workshop, March 1993.
[12] Brown, P.F., J.C. Spohrer, P.H. Hochschild, and J.K. Baker. "Partial Traceback and Dynamic Programming". Proc. Int. Conf. Acoustics, Speech and Signal Processing, Paris, France, pp. 1629-1632, May 1982.
[13] Gish, H., M.-H. Siu, and R. Rohlicek. "Segregation of Speakers for Speech Recognition and Speaker Identification". Proc. Int. Conf. Acoustics, Speech and Signal Processing, Toronto, Canada, vol. 2, pp. 873-876, May 1991.
[14] "Where Do User Interfaces Come From". Panel Discussion from Siggraph, 1987.
[15] Siu, M.-H., G. Yu, and H. Gish. "An Unsupervised, Sequential Learning Algorithm for the Segmentation of Speech Waveforms with Multiple Speakers". Proc. Int. Conf. Acoustics, Speech and Signal Processing, San Francisco, CA, vol. 2, pp. 189-192, March 1992.
[16] Rose, R.C. and D.A. Reynolds. "Text Independent Speaker Identification Using Automatic Acoustic Segmentation". Proc. Int. Conf. Acoustics, Speech and Signal Processing, Albuquerque, NM, pp. 293-296, April 1990.
[17] Becker, R.A. and J.M. Chambers. S: An Interactive Environment for Data Analysis and Graphics. Wadsworth Advanced Book Program, Belmont, CA, 1984.