Applying Speaker Diarization to Costa Rican Trials Speech Data
Technical Report1
Carlos Loría-Sáenz2
December 2012

Abstract. We investigate Speaker Diarization (SD) as a way to process the speech generated in trials in Costa Rica, recordings which present important quality issues. Such data sources deserve to be considered of great social value, which justifies the kind of potential benefits this research pursues. SD deals with the problem of determining “who speaks when” in an audio recording in an unsupervised way. Our intention is to produce SD output that can help to generate indexes facilitating subsequent information retrieval tasks and search accesses, which are quite useful in Web contexts. We review key state-of-the-art SD techniques and test them with a sample of the real data. We use an open-source SD tool in order to provide practical criteria for the potential of its application in our case and to derive guides for future research work. The work proposes and tests an approach based on domain-specific knowledge that could be used in this study case to improve diarization performance given the audio quality issues. The obtained results indicate that this kind of strategy deserves further analysis and testing. We compare our results with another, proprietary SD system in its basic configuration.

1 Introduction

In this work, we research recent Speaker Diarization (SD) methods to provide technical and practical criteria for dealing with the problem of processing the speech generated in trials in Costa Rica. We have two main goals, whose exposition should briefly justify the relevance of this research. First, we want to highlight alternatives and practical possibilities in the field of automatic speech recognition (ASR) for dealing with the data of interest under challenging circumstances: it is a fact that the quality of the recordings is unfortunately far from optimal for achieving good levels of performance by means of highly automated recognition methods. The reason is that these recordings are produced for human consumption, not for computers. In awareness of that, and as a second goal, we want to provide experimental results showing that it is feasible to gain value from the recordings. This is important because it will perhaps require some time before the underlying quality limitations can be improved or eventually avoided.

1 This work is the result of a research visit to the LSV group (www.lsv.uni-saarland.de) of the University of Saarland (Germany) during the Autumn 2012, supported by DAAD (www.daad.de) and UNA (www.una.ac.cr). Author mail address: [email protected]
2 Escuela de Informática, Universidad Nacional de Costa Rica (UNA)


This work is a natural follow-up of some experiments developed in our country when we were trying to produce transcriptions of the mentioned recordings ([21]). That initial work showed us that special treatment and research considerations would be needed due to several factors affecting the audio, like noise, multiple speakers and gender, among others. One solution is obviously to improve the recording conditions and equipment, something that requires investment and justification and hence usually results in a long-term path. In the meantime, a lot of data is produced daily and will keep being generated before that ideal point can be reached. So it seems necessary to us to process the speech at least in some way that aids tasks like audio indexing, searching and navigating, in a form that can be coupled with information systems (already) publicly available through the Web by the justice department, for instance. This is in accordance with the general growth of multimedia stored on the Web, which is common nowadays. The social value of the data is reason enough to justify any attempt that promotes a better use of it. Hence, we consider it very relevant to have some practicable, computationally based approach; this work suggests one that, in our opinion, deserves further attention, and we try to justify our claims through this research.

This work also represents an important change and reorientation of our original strategy ([20]); we believe that we have a better option now. Initially, we wanted to face the problem as research on ways to improve the underlying models used for speech recognition (essentially Hidden Markov Models, HMM [27]). We considered studying alternative approaches that could improve HMM robustness against our quality factors (we pointed, for instance, to variants like Conditional Random Fields (CRF) as an alternative). However, as a result of this research, that way revealed itself as too long and involved in terms of our constrained scope and the temporary conditions surrounding this initiative. After realizing that our initial intention would correspond to a more theoretically based and long-term research program, we rather took another route that could be evaluated in the short term and that accords better with potential applications in realistic scenarios.

As our research path we selected the area of Speaker Diarization3 (SD), whose basic background we review in the following sections. SD deals with the problem of determining “who speaks when” in an audio recording with no previous knowledge about the speakers. In this work, we evaluated the potential effectiveness of SD by using an open-source tool ([22]) and testing it with real sample data of the kind we face in our Costa Rican case. This was done in order to maximize research effort in a new field within a short time period and to prepare future work, as we explain along this document. Our evaluation test is, formally speaking, limited, given the small amount of data available to us (the data is not publicly available, in general) and, as a consequence, the results have to be handled with the corresponding care. However, we provide evidence that SD could really offer several important advantages that might justify its application in a more involved project and realistic scenario. This report provides research to support our beliefs and to promote the need for, and value of, further work in the field.

In general, this research has an exploratory nature, starting from preliminaries, seeking potential applications, devising future work and pointing out key issues.

3 Both words “diarisation” and “diarization” are used in the SD references.


2 Notions of Speaker Diarization (SD)

We review the area of speaker diarization (SD) as required for the scope of this work. We do not attempt to produce a complete introduction; for appropriate sources we refer to, and extensively use here, parts of [3, 24, 32, 31]. Our goal is to introduce a theoretical background and from there to establish a coherent context for the report.

As already mentioned, SD deals with the segmentation and labeling of homogeneous parts of audio according to speakers, perhaps considering additional characteristics of them like gender and emotions, among others. In its pure form, and in contrast with other ASR tasks like Speaker Recognition (SR), the SD task does not assume as input any a priori knowledge about the number of speakers or any reference model representing them. In other words, SD has to provide an unsupervised determination of the speakers in the different turns they take in the audio. Thus, SD has to build a model or informational representation of each speaker in order to fulfill its goal of labeling the segments according to “who speaks when”. Such a representation, as usual in ASR, is strongly based on statistics. As we will see, speakers are essentially described by probabilistically founded models or characterizations which are learned from the audio using machine learning techniques; we review some of them in this report. A successful SD can help to gain more semantic structure from the audio, which can eventually be used for improving subsequent ASR tasks, like rich transcription (RT), that is, the textual version of the audio.

The SD task can vary according to the kind of event/environment that generates the audio, for instance, telephone conversations, broadcast news and, more recently, conference meetings or lectures, where in addition to audio, multimodal integration with other sources like video and scripts, and considerations of gestures, emotions and human interactions, could also be demanded [24]. Each of these scenarios is certainly more involved than the previous one; according to [3], the field has evolved along these stages during the last years. In addition, the recording conditions are naturally important elements, where aspects like the type and location of the microphones used will have an impact or require special so-called preprocessing phases before submitting audio to the SD algorithms proper.

In our case, we have audio that can probably be considered near to meetings but with a more fixed structure, one that results from the formal protocol of a trial, which is controlled by the judge and is regular and paused; such is mostly the case in our sample data. Moreover, the number of speakers can be easily known a priori, a piece of domain-specific knowledge that can be exploited to improve the speaker diarization quality, as we in fact apply in this work. Many of the trial recordings are produced in small offices using one single distant microphone (SDM) put on a desk. In other cases, however, several (usually three or four) microphones are used in an auditorium. In this work, we conducted our tests using data of the first kind only.

As a computational task, SD can be considered a work-flow of processing stages and subtasks, starting from the wave and ending with a clustering, where each cluster represents some hypothesized speaker and should contain all (or as much as possible of) the segments associated with only that speaker. Some of these stages are related to theoretical elements as well as modeling approaches which we analyze and synthesize based on the literature and sources researched. The coverage is oriented by our concrete needs; thus, in some passages it contains notices and elaborations related to our particular study case that can be useful to take into consideration.

Speech Data and Cluster Modeling: GMM and HMM

SD aims at producing speaker representations in an unsupervised way, where such a representation, or cluster, essentially comprises all the speech that can be associated with one speaker. Clusters are built and refined iteratively, starting from previous approximations, working with audio segments, which are sequences of contiguous features (vectors of real numbers of some dimension), as we review below. The standard assumption is that segments, and later clusters, can be modeled by means of underlying Gaussian distributions. Two important models appear as fundamental under such an assumption: Gaussian Mixture Models (GMM) and Hidden Markov Models (HMM). The HMM model is in theory independent of GMM, but in practice HMM and GMM working together is the most usual case in SD. We briefly review the main concepts required for the rest of the report, which are quite standard; we mainly use [10, 29, 27].

A GMM is a parametric multi-dimensional probability density f_GMM represented by a finite weighted sum of, say, m individual multivariate Gaussian densities, called GMM components (mixtures):

    f_GMM(x | λ) = Σ_{i=1}^{m} w_i · N(x | μ_i, Σ_i)

    N(x | μ, Σ) = (2π)^{−d/2} |Σ|^{−1/2} · exp( −(1/2) (x − μ)^T Σ^{−1} (x − μ) )        (1)

Each component is a d-variate Gaussian, where d is the feature (data) dimension; such d-variate densities generalize the standard one-dimensional normal curve to d-dimensional vectors of real numbers. Thus, the GMM parameters are, for j = 1, …, m, the d-dimensional mean vectors μ_j, the d×d covariance matrices Σ_j and the weights w_j, with Σ_{i=1}^{m} w_i = 1. As indicated in [29], one might intuitively consider each component as the model of a hidden variable reflecting some particular set of data characteristics, which the GMM adds according to the weight (or relative importance) of each such variable.

A GMM can be used with a diagonal or with a full covariance matrix. The first case represents a situation where the feature entries are uncorrelated and offers better computational efficiency. As with other model design considerations, this is a trade-off that depends on the speech task. For instance, [29] mentions that diagonal GMMs behave well for SR and indicates that any correlation can be modeled by increasing the number of components. Our tested system ([22]) uses both kinds depending on the SD stage. In general, such model design decisions, like the values of d and m and the type of covariance, normally become system parameters that have to be set and tuned.

A HMM can be considered a kind of stochastic finite automaton that produces output when traversing its state transitions. The transitions among states and the state outputs are both probabilistically determined processes. Thus, we have a_ij, the probability of transitioning from state i to state j, and b_ij(k), the probability of emitting output k when the system transitions from i to j. A special case of the a_ij are the π_i probabilities, which determine the probability that the HMM starts in state i. The HMM states represent unknown (hidden) information, because we can only observe the sequences of outputs that the automaton generated while traversing some state path. One has to predict the most likely path (or state sequence) that could have generated the given observations; the standard method for this task is the (dynamic-programming based) Viterbi algorithm, also called Viterbi decoding, and the path it produces is called the Viterbi path. In SD, an observation (the value k in b_ij(k)) can be a speech feature vector (or a segment); the states are the speaker labels we want to assign to (use to classify) the feature (segment). The usual assumption is that the output probabilities can be described by a GMM; the combined model is denoted HMM-GMM. Different HMM architectures are possible; linear forms are frequent for practical reasons. As one can see, a GMM can be treated as a 1-state HMM, and a HMM can be considered a GMM-based “soft” classification system where the a_ij act as a kind of penalty for transitioning from one GMM to another. For a concrete example, Viterbi realignment is indeed used in SD after the hierarchical clustering in order to re-segment the audio in our studied system. A HMM(1, GMM(8, D)) is used to model each cluster, where we write HMM(n, M) to denote a HMM with n states, each of them modeled by M, this one being of the form GMM(k, D) for a diagonal covariance matrix or GMM(k, F) for the full covariance case, with k the number of Gaussians in the mixture.

Both GMM and HMM are usually trained using the well-known Expectation Maximization (EM) algorithm. Training means learning the GMM parameters (means, covariances and weights) and, in the HMM-GMM case, the state transition probabilities as well. EM can work under a Maximum Likelihood Estimation (MLE) approach (the standard way), but also under a Maximum a Posteriori (MAP) criterion. The latter is appropriate for adaptation purposes, which SD uses for CLR clustering, as we will see. The EM algorithm can be naturally understood as a clustering algorithm with soft assignment: data is probabilistically assigned to every cluster, not just to one as in hard clustering. The cluster label can be considered the missing information part of the data under a common interpretation of EM. EM-MAP of GMMs is used in Speaker Recognition tasks in what can be seen as on-line learning: given an a priori GMM universal model of speakers (called a Universal Background Model, or UBM), adaptation can be used to recognize a speaker from low amounts of data. As we explain later on, and related to a posteriori adaptation, SD operates in iterations leading to a final GMM-based clustering model, from which we can take advantage for producing further SD improvements, as we propose below.
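As a minimal illustration of these modeling ideas, the following Python sketch trains a diagonal-covariance GMM with EM and scores a segment under it, in the spirit of the GMM(8, D) cluster models just described. It uses synthetic data and scikit-learn; the dimensions, component counts and names are illustrative assumptions, not those of the tested system.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Synthetic stand-in for 13-dimensional feature frames of one cluster;
# real frames would come from the front end described in the next section.
rng = np.random.default_rng(0)
cluster_frames = rng.normal(size=(2000, 13))

# EM-trained GMM with 8 diagonal-covariance components, i.e., a GMM(8, D)
# in the notation above (all sizes here are illustrative).
gmm = GaussianMixture(n_components=8, covariance_type="diag", max_iter=100)
gmm.fit(cluster_frames)

# Mean per-frame log-likelihood of a new segment under this cluster model:
# the kind of score used when comparing clusters or assigning segments.
segment = rng.normal(size=(300, 13))
print(gmm.score(segment))
```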

Preprocessing and Speech Features

In the preprocessing stage the signal is converted into sequences of features, possibly after several phases that depend on the kind of recording conditions and processing tasks. That is usually done using a digital signal processing (DSP) front end. Preprocessing can account for noise reduction, handling of multiple channels (microphones) and, finally, the feature extraction. Most of these steps were not considered part of the scope of this work, yet they are fundamental themes of DSP. We do, however, record some important related pointers that can turn out to be relevant for future research, with important applications to our intended scenario. In other words, DSP is certainly not a closed theme in our context.

The essential concept is the representation of the sound (or signal) as a sequence of features which, mostly, are Fourier-based (discrete cosine, or DCT) logarithmic-spectral representations of small wave sections (smoothed/filtered by standard window functions) at some small time rate, for instance, 20ms (milliseconds) windows progressing at a 10ms rate ([29]). In the end, features become vectors of real numbers whose dimension (which is the range or order of the spectrum) and selection are strongly associated with the type of speech processing task. For instance, dimensions like 13 in SD or 40 in ASR are common. Different approaches are used to produce speech features, the (static) Mel Frequency Cepstral Coefficients, or MFCC (e.g., F1 to F12), being the most commonly employed. In addition, so-called dynamic features are computed from the MFCCs: the deltas (first and second order), complemented with the energy coefficient, also called C0, and, eventually, its corresponding deltas. In the study case reported in this work, the employed system actually uses different dimensions and selections of features depending on the specific SD subtask, something that is clearly the result of experimentation. For instance, feature normalization (like warping or cepstral mean subtraction, CMS) is only used during the last clustering stage, when strong hypotheses have already been built. In contrast, for speaker identification (SID), normalized cepstra are mentioned as a better choice from the start ([22, 11, 29]). On the other hand, [32] mentions that the deltas (first and second order) might not be as useful in SD as they are in ASR transcription. An extensive comparative inventory of the variety of features used in SD systems is given in [17].

As indicated, features are usually short-term and usually represent low-level properties of the signal in the form of (logarithmically measured) amounts of energy distributed over some selected frequency bands. They build a spectrum which allows characterizing and recognizing patterns of sounds. For capturing higher-level characteristics, (prosodic) long-term features are also used in SD, in combination with short-term ones, to produce a more effective SD system ([7]). High-level features can also be used for detecting the human level of participation or capturing emotions, semantics that can be used for a more precise discrimination of speakers and data. In [14] the pitch correlogram (see [17] for how to calculate the pitch) was used to precompute clustering candidates for the BIC merging, a so-called fast match, which results in great savings of computational time without affecting the error rate. That kind of optimization of the SD time is key when we need to process huge speech corpora or produce SD live (in real time). The work [4] evaluates several feature alternatives and combinations which are appropriate for speaker overlap detection, something considered central in processing meeting recordings with spontaneous speech and short dialogs. In this work, we ignored the overlapping problem because of the assumed relatively fixed structure of most of the audio we observe in our samples. However, overlap does occur (for instance, at the beginning and end of a trial) and we should expect the problem to become more relevant when considering other kinds of samples in future work. As a consequence, feature selection remains, in general, another potential area that requires more study from us.
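For concreteness, the following sketch extracts 13 static MFCCs with first- and second-order deltas using the librosa library, with 20 ms windows advancing every 10 ms as in the example above. The file name is hypothetical, and this only illustrates the kind of front end discussed, not the Sphinx4-based extraction of the tested system.

```python
import numpy as np
import librosa

# "trial05.wav" is a hypothetical file name; 16 kHz mono audio is assumed.
y, sr = librosa.load("trial05.wav", sr=16000)

# 13 static MFCCs over 20 ms windows (320 samples) advancing every 10 ms
# (160 samples), matching the short-term analysis described above.
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13,
                            n_fft=512, win_length=320, hop_length=160)

# First- and second-order dynamic features (deltas), stacked per frame.
d1 = librosa.feature.delta(mfcc, order=1)
d2 = librosa.feature.delta(mfcc, order=2)
features = np.vstack([mfcc, d1, d2]).T   # shape: (frames, 39)
```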

Speech Activity Detection (SAD)

In this stage, which can be part of the preprocessing, speech and non-speech regions in the audio have to be separated to minimize distortions of the clustering. Typical non-speech events are background speech, noise, music and doors being closed, among many others that need to be removed before clustering proceeds. In some cases, non-speech has been handled as just another cluster, but practice has shown that this turns out to be less effective than eliminating it. For implementing SAD, [22] makes use of HMM-based models of non-speech trained on silences, bandwidth classes, music and jingles (as features, 12 MFCCs (without the energy C0) plus deltas are used); this SAD runs as a stage between clusterings. On the other hand, [29] describes SAD as a (DSP) preprocessing stage and mentions that about 25% of the signal is discarded in this phase in the case of telephone audio.

We did not formally try to estimate the effectiveness of the SAD HMM model of [22] in our study case, but we got the impression that the speech segments are still noisy enough to disturb tasks like transcription (which, as we have experienced, is very sensitive to noise and signal variability). Thus, we will probably require training, or trying to adapt, some background models depending on some general environmental classes of rooms, if possible ([29]). However, obtaining robust enough models of the environment might be difficult to achieve, at least for a sufficiently general situation, due to the great variation among recording places and also the dynamics of a particular meeting ([3]). SAD therefore represents an issue requiring special consideration in any derivation of this work. We finally remark that SAD is an essential subtask strongly affecting SD quality, because it can increase the SD error rate in two ways: deleting actual speech data or accepting non-speech data. According to information from [24], SAD and overlapping together can account for around 45% of the total SD error sources, which highlights its great importance.
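As a toy illustration of the task (and deliberately much simpler than the HMM-based speech/non-speech models just described), the following sketch flags frames as speech by thresholding their log-energy; the threshold value and the smoothing step are illustrative choices, not those of any of the cited systems.

```python
import numpy as np

def simple_sad(frame_energies_db, threshold_db=-35.0):
    """Toy energy-based speech activity detector: a simpler stand-in for the
    HMM-based speech/non-speech models described above.
    frame_energies_db: per-frame log-energies in dB (array-like)."""
    speech = np.asarray(frame_energies_db) > threshold_db
    # Smooth isolated flips so very short gaps or bursts do not fragment turns.
    for i in range(1, len(speech) - 1):
        if speech[i - 1] == speech[i + 1] != speech[i]:
            speech[i] = speech[i - 1]
    return speech  # boolean mask: True where frames are kept as speech
```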

Speaker Change Detection (Initial Segmentation)

SD is sometimes described as the result of two separable subtasks: segmentation and clustering. It is also known that clustering algorithms (for instance, EM-based ones [10]) are very sensitive to initial configurations. Hence, the first segmentation of the features can affect the quality levels of the subsequent diarization stages. The segmentation has to produce a first partition of the signal into segments separated by so-called speaker turns, also called speaker change points (SCP). This entails two runs along the whole signal: a first one detecting turns, and a second one merging contiguous segments if they can be shown to belong to the same speaker (called in some cases linear clustering). The technique used for both phases rests on statistical model selection under the usual assumption that speech segments follow (multivariate) Gaussian probability distributions (GMM). Given a segment s representing a signal window of some small size (for instance, 5 seconds are used in [22]), a change point is a maximum value change according to some criterion calculated over the features of s, in the following sense: any point (feature position) t in s[0:n], where n = |s| denotes the length of s,4 suggests splitting the segment into two new models, namely s_0 = s[0:t] and s_1 = s[t:n]. The decision is which alternative represents a better model of the data: keeping s (with corresponding hypothesis H_0 under model M_0) against splitting s into s_0 and s_1 (with corresponding hypothesis H_1 under model M_1). This naturally reduces to hypothesis testing (or model selection), and for that sake different methods are used in SD, essentially sustained on model comparison metrics or distances.

4 We denote by s[a:b] the segment slice from position a to position b−1 taken from s, in a Python-like style.

Such metrics are based on different forms of (log-)likelihood ratios and are associated with hypothesis testing under different assumptions and conditions; we omit the details involved in their derivations. Commonly accepted criteria that can be used for model/hypothesis comparison are the Bayesian Information Criterion (BIC), the Generalized Likelihood Ratio (GLR), the symmetric Kullback-Leibler divergence (KL2, or Gaussian divergence), the Hotelling T² distance, the Cross Likelihood Ratio (CLR), Cross Entropy (CE) and the Normalized Cross Likelihood Ratio (NCLR), among others (see [5, 30, 18, 28, 12]). CLR, CE and NCLR are more associated with clustering than with SCP detection ([18]). Covering such a long list in detail is beyond the scope of this work; we review the basics of BIC, GLR and NCLR below.

BIC is broadly mentioned in the SD literature; thus, it emerges as a standard in this field, but it is considered computationally expensive for the initial segmentation stage and not so precise, especially when turns are short ([31]). Hence, for instance, our tested system uses GLR for change detection instead and leaves BIC for segment merging (linear clustering) and the later speaker clustering stages. (We have already mentioned the fast match strategy, approximating BIC with a KL2, to cope with BIC costs in the clustering stage.) In order to compare BIC and GLR, let us write the BIC definition for the case of a Gaussian model M representing some data containing n samples of d-dimensional features, with covariance matrix Σ and determinant |Σ|:

    BIC(M) = −(1/2) n log|Σ| − λP        (2)

where P = (d(d+3)/4) log N is the so-called model penalty (derived from the complexity, i.e., the number of free model parameters, with N the total number of data samples) and λ the penalty weight, which should have value 1 but, in general, is a system parameter that needs to be tuned ([17]). Such an expression functions like a regularization term penalizing big models, as usual. According to [22], using the penalty factor with log(n_1 + n_2) instead of log(N) leads to better performance in practice. Even so, in practice each criterion is compared against a threshold system parameter which depends on the specific metric.

On the other hand, GLR(M) is computed in [22] as BIC(M) just without subtracting the model penalization.5 In doing so, the efficiency of both is dominated by the determinant calculation. The difference is that GLR does not constrain the model size, probably allowing bigger segments than BIC does, something useful in the turn detection task. As can be noticed, there is a duality in such metrics, because they can be employed for segmenting and merging in this initial stage and for clustering as well; remarking, however, that the initial segmentation only deals with obtaining segments that are contiguous in the audio, while clustering considers the more general case. In both cases (initial segmentation and clustering), we emphasize the role of the GMM assumptions at the different processing stages, as we will repeatedly see. We return to the theme of metrics in some more detail when we address clustering, later; as an illustration, we will again use BIC-based clustering, it being so common. The principles are, however, generic and metric independent, which is useful for modular implementations.

5 Which seems to differ from GLR as in [12].
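The following minimal sketch implements Eq. (2) and the GLR variant just described for full-covariance Gaussian segment models; λ, the window size and the decision threshold are left as the tunable parameters the text mentions, and the function names are our own.

```python
import numpy as np

def logdet_cov(x):
    # log-determinant of the (full) sample covariance of segment x, shape (n, d)
    return np.linalg.slogdet(np.cov(x, rowvar=False))[1]

def bic(x, lam=1.0):
    """BIC of a single Gaussian model of segment x, following Eq. (2):
    -(n/2) log|Sigma| - lam * P, with P = d(d+3)/4 * log n."""
    n, d = x.shape
    penalty = lam * d * (d + 3) / 4.0 * np.log(n)
    return -0.5 * n * logdet_cov(x) - penalty

def glr_split_score(x, t):
    """GLR as described in the text (BIC without the penalty term): how much
    better two Gaussians over x[:t] and x[t:] explain the data than one
    Gaussian over all of x. Larger values suggest a speaker change at t."""
    n, n0, n1 = len(x), t, len(x) - t
    return 0.5 * (n * logdet_cov(x)
                  - n0 * logdet_cov(x[:t])
                  - n1 * logdet_cov(x[t:]))
```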


To conclude this part, and returning to efficiency, an alternative method using distances based not on parametric models, like BIC, but on information similarity is presented in [32, 33] (where the word information is meant in the sense of Information Theory). These methods have the advantage of offering better computational costs while still keeping results comparable to BIC's, according to the references. Because these methods could provide interesting results in our case, they deserve special attention in our future work; they were not available in our evaluation tool at the time of writing this report. At first sight, we consider it feasible to incorporate them as a prototype and generate what could be very useful comparison elements for our future study case.

Speaker Clustering

Clustering in SD mostly takes the form of an agglomerative (bottom-up) algorithm. A clustering stage starts from a previous cluster hypothesis, probably containing too many clusters (see Table 1 for an instance). On every new iteration the algorithm selects two clusters and decides whether or not they should be merged into one new cluster. The selection of such merge candidates is based on a distance like BIC or CLR, for instance. As in segmentation, each cluster is represented by a model, and distance is meant in terms of such cluster models. The closest clusters according to the distance are the merge candidates. If a new merged cluster is produced, then its corresponding model has to be trained, and the distances from the other clusters to it as well. The symmetry of the distance becomes important here, for correctness and efficiency. Finally, a stopping criterion is established; it might depend on the kind of distance employed and incorporate additional controlling elements like thresholds, reaching a minimal number of clusters, or not exceeding a maximum number of iterations in case the expected level of convergence fails to be achieved. It is clear from this description that a clustering algorithm can be implemented generically, allowing instantiations of its basic components, as [16] actually does. For our experimentation purposes such a modular architecture turned out to be very helpful.

For illustration purposes, let us review the case of BIC-based hierarchical clustering. As described in the key work [5], clustering can be considered a greedy construction of a tree built while optimizing a distance-derived local criterion between the cluster candidates (in other words, a minimal description length (MDL) tree construction constrained by such a criterion). The optimization function is given by the ΔBIC value. Given a clustering C from a previous clustering iteration, say C = {c_0, c_1, ⋯, c_k}, let n_i = |c_i| be the number of segments in each cluster c_i and N = Σ n_i the total number of segments being clustered. Assume, as usual, that each cluster c_i follows a Gaussian with covariance matrix Σ_i, all having the same feature dimension d (and so all cluster models having the same complexity). Using the definition of BIC, we have the following identity (with λ and P as before in the BIC definition (2)):

    BIC(C) = Σ_{i=0}^{k} ( −(1/2) n_i log|Σ_i| − λP )        (3)

Suppose now that clusters c_0 and c_1 are the candidates to merge; then a new clustering hypothesis C' = {c, c_2, ⋯, c_k} would be produced, with c = c_0 ∪ c_1 the new merged cluster and n = |c| its length. In such a case, using (3), we would have:

    ΔBIC(C, C') = BIC(C) − BIC(C')
                = (1/2) n log|Σ| − (1/2) ( n_0 log|Σ_0| + n_1 log|Σ_1| ) − λP        (4)

Hence, c_0 and c_1 should be merged only when the merge does not decrease the criterion, that is, when BIC(C') ≥ BIC(C); in this form, the clustering algorithm maximizes the BIC criterion with each merging iteration while building the tree bottom-up. The corresponding BIC stopping criterion fires when no merge increases the BIC any more (up to a threshold). This illustrates clustering based on BIC, essentially, but the same principle can be used with other criteria. We notice that a “distance” based on BIC can be simply defined in terms of the criterion:

    distance_BIC(c_0, c_1) = ΔBIC( {c_0, c_1}, {c_0 ∪ c_1} )        (5)

Such a distance can be computed for all pairs of clusters and, since it is symmetric, stored in a triangular matrix for efficiency.

Another relevant criterion (and distance) is CLR, which is used for the corresponding CLR clustering. CLR derives from CE and is used in a normalized form called NCLR. Because it is very important in the tested system, we review it here; the description corresponds to NCLR, but we refer to it simply as CLR. Clustering based on CLR uses a UBM to adapt the cluster models, as in [18, 22]. Such a UBM is a gender-bandwidth model trained on a huge amount of audio; it is not a speaker model. The CLR cluster models are generated by adapting the previous clusters (for instance, BIC clusters) from this UBM, where GMM-MAP adaptation is employed. In order to present the distance, we introduce some notation: L(c | M) is the likelihood of the data (cluster) c under the model M, and M^ubm denotes a model M adapted from the UBM. The NCLR distance definition becomes:

    distance_NCLR(c_0, c_1) = (1/|c_0|) log( L(c_0 | M_1^ubm) / L(c_0 | M_ubm) )
                            + (1/|c_1|) log( L(c_1 | M_0^ubm) / L(c_1 | M_ubm) )        (6)

where M_ubm is the UBM itself and M_i^ubm the model of cluster c_i adapted from it. The likelihood ratios are crossed, as in CE and CLR; NCLR normalizes them by the cluster sizes. Again, the same clustering algorithm pattern described for BIC can be used for NCLR. Table 1 shows, as an illustration, a comparison between the different clusterings with respect to the number of clusters generated.
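As a sketch of this generic agglomerative pattern, the following fragment implements ΔBIC-driven bottom-up merging over clusters given as arrays of feature frames. Following Eqs. (4)-(5), merging proceeds while the best candidate pair still improves the overall BIC; the single-Gaussian cluster models, the zero threshold and the function names are simplifying assumptions, not the real systems' choices.

```python
import numpy as np

def logdet_cov(x):
    # log-determinant of the sample covariance of a frame array (n, d)
    return np.linalg.slogdet(np.cov(x, rowvar=False))[1]

def delta_bic(c0, c1, lam=1.0):
    """Merge criterion of Eqs. (4)-(5) for two clusters (frame arrays):
    negative values mean the merged model has the better BIC."""
    c = np.vstack((c0, c1))
    n, n0, n1, d = len(c), len(c0), len(c1), c.shape[1]
    penalty = lam * d * (d + 3) / 4.0 * np.log(n)
    return 0.5 * (n * logdet_cov(c)
                  - n0 * logdet_cov(c0)
                  - n1 * logdet_cov(c1)) - penalty

def agglomerate(clusters, lam=1.0, min_clusters=1):
    """Greedy bottom-up clustering: repeatedly merge the closest pair while
    doing so still improves the BIC, or until min_clusters is reached."""
    clusters = list(clusters)
    while len(clusters) > min_clusters:
        pairs = [(delta_bic(clusters[i], clusters[j], lam), i, j)
                 for i in range(len(clusters))
                 for j in range(i + 1, len(clusters))]
        score, i, j = min(pairs)        # closest pair under the criterion
        if score > 0:                   # no merge improves BIC: stop
            break
        merged = np.vstack((clusters[i], clusters[j]))
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return clusters
```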

Re-Segmentation (Viterbi Decoding, Realignment)

The purpose of this stage is to adjust the clustering hypothesis produced by a previous clustering iteration in order to obtain a more precise segmentation ([22]); it is called re-clustering in [31]. In the first work, a HMM(1, GMM(8, D)) is ML-EM learned from all the segments in each cluster, and the cluster transition penalties are fixed empirically (controllable by a system parameter). That HMM is used to re-classify all the segments, which aims at producing better segment boundaries, but some additional post-processing can also be required, as we explain below.
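A simplified sketch of this realignment with the hmmlearn library is shown below: one HMM state per hypothesized speaker, each state an 8-component diagonal GMM, with the Viterbi path re-labeling every frame. Note the simplifications: the real system trains each state's GMM from its own cluster's segments and fixes the transition penalties empirically, whereas hmmlearn re-estimates everything jointly here, and the data is synthetic.

```python
import numpy as np
from hmmlearn import hmm

rng = np.random.default_rng(0)
frames = rng.normal(size=(5000, 13))  # stand-in for the audio's feature frames

n_speakers = 3                        # clusters from the previous stage
model = hmm.GMMHMM(n_components=n_speakers, n_mix=8,
                   covariance_type="diag", n_iter=10)
model.fit(frames)                     # joint EM re-estimation (simplification)

states = model.predict(frames)        # Viterbi path = new frame labels
# Contiguous runs of the same state become the realigned segments.
```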

Segment Labeling by Gender/Bandwidth

In this stage segments can be labeled according to attributes like gender (male or female) and bandwidth (low/narrow/telephone or high/studio). According to [31, 22], the task is performed using MLE-GMMs previously trained for this kind of detection on specific data. Concretely, the last reference uses four GMM(128, D), one for each combination of the male/female and narrow/wide possibilities. As indicated, the labeling generated by this stage is used as the UBM in a subsequent CLR clustering stage, which produces the definitive cluster model as a refinement of the previous clustering given as input. As we explain later, we take advantage of the gender labeling of segments (bandwidth was not relevant in our sample) as a way to better discriminate in our proposal for clustering refinement; not as a boolean discriminator, however, but as a weight added to a score of nearness from segments to clusters.

Post-Processing

In this stage the (final or partial) results of the SD stages can be modified in some way; we briefly mention some relevant possibilities. As mentioned above, in some cases the segment boundaries can be imprecise, resulting in cut words ([22]). This reference cites boundary adjustments based on Viterbi decoding or on heuristic rules. Another mentioned option is the removal of long segments including silences and fillers; segment compaction falls into this category, too. According to the citation, a 1.03% improvement in diarization error could be obtained by this kind of processing.

Clustering effectiveness

As an illustration integrating the previous concepts, Table 1 shows the clustering evolution after each SD stage in the case of the studied tool applied to our sample data (described in more detail later on). The last row shows the ratio between the average number of clusters in each column and the average in the CLR column. Thus, for instance, the initial number of clusters produced by the GLR segmentation is on average 78.31 times that of the final CLR clustering. This exemplifies the incremental effectiveness of each stage in detecting the number of speakers, relative to CLR.

Sample ID       | GLR Segment. | BIC Merging | BIC Clustering | Speech Detection | CLR Clustering
05              | 585          | 330         | 78             | 57               | 4
07              | 104          | 57          | 20             | 11               | 3
08              | 201          | 110         | 30             | 15               | 3
10              | 128          | 56          | 22             | 3                | 3
Ratio = avg/CLR | 78.31        | 42.54       | 11.54          | 6.62             | 1

Table 1: Cluster number evolution after each SD stage (sample data)


SD Evaluation Criteria (DER)

The National Institute of Standards and Technology (NIST [26]) performs evaluations (so-called rich transcription (RT) campaigns6) to promote SD development and maintain benchmarking data for measuring advances by means of standardized criteria, as explained in [3]. In that reference the figures of the RT'09 evaluation are analyzed, something we briefly review in this section for later comparative use. As a part of that, we first recall the NIST metric for the evaluation of SD results, broadly accepted in the field: the diarization error rate (DER). Given a diarization reference (known as the ground-truth reference (GTR), or simply the reference) and a hypothesized diarization (HD), DER accounts for three sources of error. Missed speech (MISS): percentage of speech in the GTR but not in the HD. False alarm speech (FA): percentage of speech in the HD but not in the GTR. And third, speaker error (SE): percentage of speech in the HD assigned to a different speaker with respect to the GTR; this item includes overlapping errors, when the HD fails to determine the right number of speakers speaking at the same time in some segment of the GTR. Formally, the DER of a segmentation S can be computed as follows, according to its components:

    DER(S) = E_FA(S) + E_MISS(S) + E_SE(S)        (7)

    E_FA(S) = ( Σ_{s ∈ S : N_hyp(s) > N_ref(s)} dur(s) · (N_hyp(s) − N_ref(s)) ) / T_score(S)        (8)

    E_MISS(S) = ( Σ_{s ∈ S : N_ref(s) > N_hyp(s)} dur(s) · (N_ref(s) − N_hyp(s)) ) / T_score(S)        (9)

    E_SE(S) = ( Σ_{s ∈ S} dur(s) · ( min(N_hyp(s), N_ref(s)) − N_correct(s) ) ) / T_score(S)        (10)

    T_score(S) = Σ_{s ∈ S} dur(s) · N_ref(s)        (11)

where T_score(S) is the total scored time of S, dur(s) is the duration of segment s, N_ref(s) the number of speakers in the reference, N_hyp(s) the number of speakers in the hypothesis and N_correct(s) the number of speakers in the hypothesis which are correct.7 DER is speech-time based, as we want to remark.

6 These are denoted as RT'yy, with yy being the year of the evaluation.
7 The tool md-eval.pl (in Perl) was used in this work for calculating the DER. The tool uses the RTTM format from NIST.


We remind the reader that in our evaluation test set speaker overlapping was not considered by us (it does not occur in the references), so for the speaker error (SE) we only take wrong single-speaker assignments into account. In the NIST DER, a ±250ms error margin (collar) around the GTR segment borders is accepted, accounting for potential imprecision during the manual labeling that yields the GTR. In our case, we were forced to use 500ms, due to limitations of the employed tool. Because we consider it a bit simpler, in this report we refer to “missed speech” as “deletions”, to “false alarms” as “inclusions” (insertions) and to “speaker errors” as “renames”; this rephrasing takes the GTR perspective. Finally, with respect to the NIST DER figures, the RT'09 instance indicates a 17.7% average DER (ranging from 7.4% to 49.65%) for the single distant microphone (SDM) scenario, which in this sense is similar to the scenario of our trial data. The multiple distant microphone (MDM) case gets an average DER of 10.1% (ranging from 5.3% to 22.2%). The NIST evaluation for RT'09 used data with higher overlapping indicators than the previous contests did, according to the citation. This is reflected in RT'09's noticeably worse result on one particular meeting (49.65% DER) with respect to RT'07 (22% DER). This last result is relevant, as it suggests that speaker overlapping is an area of improvement.
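The following sketch computes Eqs. (7)-(11) in a frame-based way for the non-overlapping case relevant here (N_ref, N_hyp ∈ {0, 1}). It assumes hypothesis labels have already been optimally mapped to reference speakers and omits the collar, both of which the NIST md-eval.pl tool handles itself; the segment tuples and step size are illustrative.

```python
import numpy as np

def der(ref, hyp, step=0.01):
    """Frame-based DER sketch for the non-overlapping case. ref, hyp: lists
    of (start_sec, end_sec, speaker); hypothesis labels are assumed already
    mapped to reference speakers, and the scoring collar is omitted."""
    t_end = max(e for _, e, _ in ref + hyp)
    n = int(np.ceil(t_end / step))
    r = np.array([""] * n, dtype=object)   # "" marks non-speech frames
    h = np.array([""] * n, dtype=object)
    for s, e, spk in ref:
        r[int(s / step):int(e / step)] = spk
    for s, e, spk in hyp:
        h[int(s / step):int(e / step)] = spk
    scored = np.sum(r != "")                          # T_score, Eq. (11)
    miss   = np.sum((r != "") & (h == ""))            # deletions,  Eq. (9)
    fa     = np.sum((r == "") & (h != ""))            # insertions, Eq. (8)
    rename = np.sum((r != "") & (h != "") & (r != h)) # renames,    Eq. (10)
    return (miss + fa + rename) / scored              # DER, Eq. (7)

# Toy usage: one reference turn split wrongly by the hypothesis -> DER 0.4
print(der([(0.0, 10.0, "judge")],
          [(0.0, 6.0, "judge"), (6.0, 10.0, "witness")]))
```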

3 SD Tool Overview

Having established the required background, we now review the main elements and strategies composing the SD tool LIUM_SPKDIARIZATION. We use reference [22], the source code and our own experience for the analysis, and we refer to it simply as the LIUM system. The LIUM system was developed at the University of Maine (France) as a tool-kit specifically designed to perform SD work. It provides a variety of algorithm implementations (the distances BIC, KL2, GLR, CLR, etc., change detection, GMM/HMM modeling, EM training, modular/generic clustering, etc.). It allows customization for specific SD (sub)tasks by means of parameterization, or access to the functionality through the different entry points (mains) which are available. It is presented as an evolution of a previous, very successful SD system, LIMSI (winner of the NIST RT'04), having Avignon's ALIZE recognition library as its starting point; thus, it incorporates important previous experience in its implementation. We have already highlighted some key elements of the way the LIUM system implements SD tasks during the coverage of SD in the previous section; here, we concentrate more on usability, results and evaluation. An overview of the LIUM work-flow is shown in Drawing 1.

The system is written in Java8 and uses CMU Sphinx4 ([34]) as its preprocessing front-end engine (for the feature extraction). Being implemented in Java was considered appropriate for this work because of proper attributes of this platform (especially portability) and because our previous ASR experience used Sphinx4, which is also Java-based. In general, we consider Java easier to deal with for the kind of research and experimentation that we required in this work, in awareness of eventual future performance or scalability issues, which were not relevant within the scope of this work.

The tool can be used “out of the box” as a console application; it is configured to process broadcast news by default. It provides a wide range of system parameters to take into consideration, which can optionally be set as command line arguments. Correspondingly, there are also several hard-coded default parameter values (including the thresholds of the clustering stopping criteria) which can only be known by studying the code. As we verified in our case, some of them have an important impact on the diarization results. Hence, for system tuning and other important evaluation issues, the source code is, to our knowledge, the only actually detailed documentation available. Again, its being open-source Java was more comfortable for us, for that reason.

8 JDK 1.6, requires the Sun JRE. The version used during this work is 4.2, available at http://www-lium.univ-lemans.fr/fr/content/liumspkdiarization

Drawing 1: LIUM Work-flow

In any case, we emphasize that without the necessary background in the field of SD it is very difficult to work with this kind of application as a mere final user, especially when the diarization results demand tuning for improvement; the system usability was not meant for that, we are certain. On the other side, it is evidently a huge help to have such a system freely available to start from, given the inherent complexity and risks of a from-scratch development. In general, our goal was to use the LIUM system as a guide for SD research, focusing on the thematic areas required for our concrete purposes. In that particular sense, this state-of-the-art system and its companion paper were indeed the right door through which to enter the SD field.

In terms of a general DER evaluation of SD, reference [22] indicates that LIUM reaches average DER levels ranging from 10.01% (on the Ester 2 corpus9) to 6.36% on RT'03 (broadcast news audio) after some tuning of thresholds and some post-processing. We recall the 17.7% SDM DER and 10.1% MDM DER of RT'09 already mentioned above. Hence, the LIUM figures can be considered good. These numbers serve as a basis for comparison in our study, although any raw comparison against our data set deserves questioning, because the data might be of a much different nature.

9 See at ELRA: http://catalog.elra.info/product_info.php?products_id=1167

4 Trials Sample Data

Our data consists of 4 audio recordings, totaling about 1.48 hours, which were recorded in the same judge's office and come from different trials. We believe they are representative of that kind of room: the audio is very noisy; especially strong noise coming from street cars occurs frequently, among other usual sources. A single microphone put on a desk is shared by all participants, who sit around it (at less than a meter's distance). Speakers are male and female. Each trial follows a protocol where the judge starts and decides the order and turns in which the speakers should speak. Such turns are usually regular, representing just one speaker, but in some cases small dialogs occur, producing some overlap. At the beginning and at the end of a session (about 5s), informal speech may occur, like greetings and similar interchanges. The number of speakers was determined by hearing the audio; in general, there exist proceedings records on paper that can be consulted for this and other purposes. Table 2 summarizes the key elements of our data. The sample identification corresponds to the real record number; for simplicity, we refer to the samples by just their first two digits (05, 07, 08 and 10).

Sample identification | Duration (sec) | # of Speakers | Male | Fem.
05-000038-077-PE      | 2850           | 4             | 2    | 2
07-201320-456-PE      | 620            | 3             | 2    | 1
08-001622-175-PE      | 1212           | 3             | 1    | 2
10-000011-515-PE      | 818            | 3             | 2    | 1

Table 2: Sample Data


We remark that trial data are not publicly available; this sample was provided especially for research purposes by the department of justice.

5 Our Experimental Approach

Our first task with the sample data was to establish a baseline using LIUM in its default configuration, in order to determine how much recognition can be achieved. We prepared a reference diarization for each audio where, in some cases, we removed some small parts where noise or speaker overlap were considered too high, or which were irrelevant (greetings at the end of the proceeding, for instance), according to our perception (just one person generated the references). To produce this manual diarization, we heard the audio, advancing in slices of approximately 10 seconds and detecting turns. We remark that the reference segment boundaries we got are probably not as precise as they should be, given the limitations of the tool used. Thus, we estimate a ±0.5s precision error in the boundaries, which is double the margin used by NIST; we reflected such a margin of error in our DER calculations.

In Table 3 we show the results of the SD of the data using LIUM in its default configuration. As we can see, these results reach huge DER rates, around 80%, which are out of proportion with respect to the NIST and LIUM reported results. Hence, analysis and tuning were required in order to get better results. Evidently, we attributed the problem to the high noise present in the audio. Apparently, the speech detection stage of LIUM was discarding a lot of noisy speech, as we can infer from the high deletion amount (78.4%). We also observed that in terms of insertions and renames the results are not bad, so the system seems to form good speaker hypotheses from the available data, which unfortunately is little.

Sample id | Hyp. num. of Clusters/(Real) | Insert (%) (FA) | Delete (%) (Missed) | Rename (%) (Speaker error) | DER (%)
5         | 4(4)                         | 0.1             | 58.3                | 0.9                        | 59.32
7         | 7(3)                         | 1               | 75.2                | 3.4                        | 79.62
8         | 4(3)                         | 0.1             | 85.6                | 1.1                        | 86.72
10        | 2(3)                         | 0.1             | 94.6                | 0                          | 94.61
AVG       | NA                           | 0.33            | 78.4                | 1.4                        | 80.1

Table 3: LIUM_SPKR SD baseline on the sample data

In our strategy for improvement we discarded cleaning the audio of noise in some pre-processing stage, because we considered it out of our scope, and focused instead on the posterior SD stages. We concentrated our efforts on improving the high deletion rate, which was driving the DER. For such a purpose we established the following goals:

• Determine the appropriate LIUM system parameters to tune.

• Use domain-specific knowledge as a strategy for tuning such parameters. We selected as candidates the number of speakers and the regularity of the audio with respect to speaker turns (due to the underlying trial protocol) as the two criteria; we studied how to map these criteria adequately onto the parameter values controlling the diarization.

As already explained, the default values of the system parameters are hard-coded, so achieving the first goal made it necessary to study the implementation at some depth and to modify the code; we tried to be the least intrusive possible, to avoid potential bugs and regressions. We could determine that the number of speakers is actually a command line parameter (cMinimumOfCluster10); however, simply setting this parameter alone does not necessarily guarantee that the given minimum will be reached by the system. This is due to the way the clustering stopping criterion is implemented.11 As we mentioned, the system uses a BIC-based hierarchical clustering and then (optionally, controlled by another parameter, doCEClustering) the final CLR clustering, which yields the definitive diarization result. In both clustering cases a corresponding threshold parameter is used, whose default values are 3.0 and 1.7 for the hierarchical and the CLR methods, respectively. Other clusterings are also available in LIUM and correspondingly have their own default values, but we did not alter them. Likewise, an additional parameter controls the maximum number of merges (iterations), which we have not used in this work either.

Having determined how the diarization control parameters work, we modified the stopping method so that it considers the number of speakers as a dynamic part of the threshold, that is, as a percentage of the clusters still unreached with respect to the expected number; this is done instead of setting the parameter to a static value. We also added new command line parameters: cNNOp, for turning this dynamic threshold strategy on/off, and cNNThreshold, for setting the α parameter (the origin of the name NN becomes clearer in the following). However, as we might expect, running the test samples with this strategy alone did not significantly improve the DER baseline values, because the deletions remained as high as in the baseline.

So we put our attention on the LIUM speech filtering policy, which again required a deeper code review. In this case, we determined that LIUM applies a filtering step that splits the segments generated by the BIC clustering as part of its speech detection stage. Such filtering divides the segments according to another parameter (fltSegPadding) in such a way that a segment s[a : a+n] gets split into the two segments s[a : a+pad] and s[a+n−pad : a+n], where the default value of pad in the code is 25.12 This filter discards the features in the middle (on average, about 73% of the features). We decided to study the discarded segments and discovered that many of them can actually still be considered intelligible to human hearing, so discarding all of them was an important loss.13 In some cases, however, the filtering seemed to work fine in detecting very noisy segments or long silences.

10 See fr.lium.spkDiarization.parameter.ParameterBNDiarization
11 See fr.lium.spkDiarization.program.MClust.stopCriterion
12 See fr.lium.spkDiarization.parameter.ParameterFilter and fr.lium.spkDiarization.tools.Sfilter.filter
13 We believe this stage is so selective because rich transcription might be a posterior LIUM phase. We mailed a question to the LIUM group but had no answer at the time of writing this report (apparently the group might not be as active as two years ago).


Because of this relatively important reduction in the amount of speech, the system can perform its task more quickly, and the obtained clusters are relatively good for the diarization of the surviving data, as insertions and renames are low; a fact that we used in our approach, as we explain in the next section. In general, we concluded that a strategy is needed for reincorporating as much of the filtered features as possible without increasing the DER significantly. We note that just eliminating the padding strategy did not work well, as we tested during the work, and doing so significantly increases the clustering running time.
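To make the filtering behavior concrete, here is a small sketch of our reading of the padding split described above; the (start, length) segment representation and the function name are illustrative, not LIUM's actual interfaces.

```python
def padding_filter(segment, pad=25):
    """Split a segment covering feature indices [a, a+n) into its first and
    last `pad` frames, discarding the middle run (which our approach later
    collects for NN reassignment instead of throwing it away)."""
    a, n = segment
    if n <= 2 * pad:                      # too short: nothing is cut out
        return [(a, n)], None
    kept = [(a, pad), (a + n - pad, pad)]
    discarded = (a + pad, n - 2 * pad)    # the middle run of features
    return kept, discarded

# Example: a 200-frame segment keeps 2*25 frames and discards 150 (75%).
print(padding_filter((1000, 200)))
```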

6 Strategy and Results

Our approach to recovering the filtered-out data lets LIUM perform its filtering work but collects the discarded data as a work-flow by-product. Once the system has obtained the final (CE/CLR) clustering, we apply a nearest-neighbor (NN)-like assignment of the discarded segments to the formed clusters (those from the surviving segments). We do not adapt the clusters as assignments are made to them (that is, no on-line learning is performed). We select the best cluster for the assignment in terms of a metric based on the sum of the following criteria, plus a controlled decision (a sketch follows the results below):

• Likelihood criterion: if s is a segment to be assigned, we use the GMM-based cluster models that produced the final clustering to calculate a model score for s. This score is just the likelihood of s yielded by each GMM.

• Gender criterion: we use the gender of s and the gender of the cluster and add a reward to the score only if they match: the score is doubled.

• Distance criterion: we add a value to the score based on the distance between the borders of s and the cluster borders (in terms of start and end times), where the start of a cluster is calculated using the start time of its first segment, and its end using the end time of its last segment. The distance is scaled by a constant of 1000.14

• Assignment decision: based on a parameter α, we finally decide to assign s to the best cluster c only if its score-gender-distance metric is at least α times the corresponding metric of the second-best cluster (score_1 ≥ α · score_2); we reject the assignment otherwise. We tested different values for this parameter, including α = 1, which means unrestricted assignment for strictly higher scores.

We used the same scoring parameters for every sample audio, though perhaps, in some cases, sample-specific parameters might yield better results. We then tested different α values (2, 1.5, 1.25, 1). The best results, obtained by setting α = 1, are shown in Table 4. As we can see, the NN strategy significantly reduces the deletions (by 63.74% absolute), essentially because more speech is processed. However, as we might expect, the price to pay is an increase in insertions and especially in renames (5.47% and 16.4% absolute, respectively). Still, the total average DER significantly decreases, by 41.96% absolute, yielding 38.14%. This is evidently still a very high DER value by NIST and LIUM standards, but, we think, a much more manageable one than the 80% of the baseline case.

14 Scores are small numbers; we scale distances for this reason. This scale value should probably depend on the average segment duration.
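The following sketch shows the scoring and assignment decision just described. The segment/cluster attribute names, the likelihood form (per-frame likelihood from a scikit-learn-style gmm.score) and the exact shape of the border-distance reward are our illustrative choices; the description above does not fix them precisely.

```python
import numpy as np

def assign_discarded(segment, clusters, alpha=1.0, scale=1000.0):
    """NN-like assignment of one discarded segment to the final clusters.
    segment: object with .frames, .gender, .start, .end; each cluster has
    .gmm (model with a score(X) mean log-likelihood method, sklearn-style),
    .gender, .start, .end. All names here are illustrative."""
    scores = []
    for c in clusters:
        # Likelihood criterion: per-frame likelihood under the cluster's GMM.
        score = np.exp(c.gmm.score(segment.frames))
        # Gender criterion: reward a matching gender label by doubling.
        if segment.gender == c.gender:
            score *= 2.0
        # Distance criterion: nearness of the segment borders to the cluster
        # borders, scaled down so it is comparable to the small likelihoods.
        gap = min(abs(segment.start - c.end), abs(c.start - segment.end))
        score += 1.0 / (scale * (1.0 + gap))
        scores.append(score)
    best, second = np.argsort(scores)[::-1][:2]
    # Assignment decision: accept the best cluster only if it beats the
    # runner-up by at least a factor of alpha; otherwise reject the segment.
    return int(best) if scores[best] >= alpha * scores[second] else None
```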


We want to dedicate some brief comments to the intuitive “soundness” of the NN-strategy. Our reasoning is based on the assumption that the LIUM system can get good clusters for the selected data in terms of low insertions and renames. In some way, we can consider our approach a kind of taskbased learning, where the first task is to learn the initial clustering (labeling) on the basis of a limited amount of selected data and the second task to predict the “best label” (that is the cluster label) of the discarded data15. As a final note in this section and back to our approach, we indicate that prototyping the NN strategy into LIUM appears simple but it required some changes in code that, in light of the complexity, were not so trivial, initially. We had to discover and intercept the filtered out segments in the right place and assure to apply to them the same flow that the surveying data receives in LIUM. As explained, LIUM is multi-staged and changes the feature configuration according to each stage. In addition, we had to recover the GMM model product of the last clustering in order to use it for the metric calculation. Having this model opens opportunities for further evaluation of improvement strategies, we believe. Sample id

Sample id | Insert (%) (FA) | Delete (%) (Missed) | Rename (%) (Speaker error) | DER (%)
5         | 4.2             | 22.8                | 7.9                        | 34.9
7         | 8.8             | 10.1                | 12.3                       | 31.4
8         | 6.1             | 15.3                | 23.1                       | 44.5
10        | 4.2             | 9.5                 | 28.0                       | 41.7
AVG       | 5.8             | 14.4                | 17.8                       | 38.14

Table 4: LIUM-NN Clustering results (ε = 1)
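As a quick consistency check, each per-sample DER in Table 4 is simply the sum of the three NIST-style error components, DER = Insert + Delete + Rename; for sample 5, 4.2 + 22.8 + 7.9 = 34.9. (The average row differs slightly, 5.8 + 14.4 + 17.8 = 38.0 versus 38.14, due to rounding of the displayed components.)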

7 Comparison with Another Tool: ICSI

As an interesting comparison, we show the results of running the data through the ICSI system in its basic form ([8]). Table 5 shows the ICSI diarization results per sample. As we can observe, ICSI (used without any tuning) manages to significantly reduce DER to 43.1% against the LIUM baseline, landing about 5 DER points above LIUM-NN (according to [8], ICSI obtains DER values of 32.09% in its basic form on meeting data). Table 6 compares the different systems tested, where we emphasize that ICSI was used without any tuning (ICSI became available to us at the end of this work, kindly provided by C. Müller (DFKI) and G. Friedland (ICSI) for research purposes, but only as binary code; it is not open-source, and apparently setting the clustering parameters as we would need requires recompiling the application). The results from ICSI confirm our assumption that missing speech drives the DER: ICSI still doubles the deletion rate in comparison with our approach, while the LIUM-NN insertions are, likewise, roughly double those of ICSI. In renames, ICSI is almost 2% better. Thus, the


difference in favor of LIUM is still driven by the deletion rate. In any case, it seems possible to improve these ICSI results with some tuning or, as we did, by means of domain-specific strategies (although, according to [9], experiments using the number of speakers did not improve the error rate). In the tested scenario, over-clustering occurs with ICSI because we do not provide the number of speakers to the system. We recall that ICSI uses BIC clustering, whereas LIUM refines its BIC clustering with an NCLR clustering stage that relies on a UBM model.
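As background for this comparison, the BIC merge test that both systems build on (following [5]) can be sketched as follows, with clusters i and j each modeled by a single full-covariance Gaussian; the penalty weight λ and the sign convention are implementation-dependent details, so this is the textbook form rather than either system's exact criterion:

\Delta \mathrm{BIC}_{ij} = \frac{n_i + n_j}{2}\,\log|\Sigma| \;-\; \frac{n_i}{2}\,\log|\Sigma_i| \;-\; \frac{n_j}{2}\,\log|\Sigma_j| \;-\; \lambda\,\frac{1}{2}\left(d + \frac{d(d+1)}{2}\right)\log(n_i + n_j)

Here n_i and n_j are the frame counts of the two clusters, Σ_i, Σ_j and Σ are the covariances of each cluster and of their union, and d is the feature dimension. Under this convention a positive value is evidence for keeping the clusters apart, and the closest pair is merged while the value stays non-positive.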

Sample id | Insert (%) (FA) | Delete (%) (Missed) | Rename (%) (Speaker error) | DER (%)
5         | 0.3             | 35.5                | 3.6                        | 39.36
7         | 7.0             | 22.6                | 14.3                       | 43.84
8         | 0.7             | 19.3                | 16.5                       | 36.5
10        | 1.5             | 21.7                | 29.4                       | 52.62
AVG       | 2.4             | 24.8                | 16.0                       | 43.1

Table 5: ICSI Baseline with Sample Data

System-Test | Insert (%) (FA) | Delete (%) (Missed) | Rename (%) (Speaker error) | DER (%)
LIUM-Basic  | 0.33            | 78.4                | 1.4                        | 80.1
LIUM-NN     | 5.8             | 14.4                | 17.8                       | 38.14
ICSI-Basic  | 2.4             | 24.8                | 16.0                       | 43.1

Table 6: Systems Test Comparisons (averages from samples)

8 Conclusions and Future Work

In this work, we researched SD as a feasible strategy for dealing with noisy speech in a potentially realistic case. Domain-specific strategies, like the NN strategy explored here, suggest an appealing and applicable way to face the problem and improve the recognition error rates. “Potentially” means that the results are the product of a preliminary, short-term-scope, initial work; they should serve just as a basis and starting point for future and deeper testing, research, and application. However, the obtained figures seem reasonably encouraging at first sight, given the strong audio quality limitations we have. These limitations, and the error rates measured with state-of-the-art systems, imply that a completely automated SD process with near-zero DER rates will certainly not be possible. But, if the results can be confirmed with more data, SD can surely provide great help in the work of processing audios that are considered of great social value. Of course, in combination with better recording conditions, such value should increase the expected benefits substantially. Human aid will surely be required, but the work can be alleviated if the SD results are reasonably manageable.

In terms of future work, we identified several avenues to explore. First of all, to gather more data to test strategies like the one explored here. In the presence of more data, for instance, we would like to incorporate and explore on-line adaptation in the NN strategy, so that the GMMs can be adapted as a function of the proportion of segments being assigned to the corresponding cluster. In our current case, the amount of assigned data was too low to test such a possibility without probably biasing the model. The setting of the NN parameters should also be made dynamic, in a way we still need to determine. An interesting research possibility would be to learn and store prototypes of clusters for known speakers, so that the SD starts with more domain-specific knowledge that could help improve performance. That would be a combination of SD and speaker identification/recognition of known speakers ([24, 29]). We have experimented with using the final GMM clustering model as a speaker representation: Kullback-Leibler approximations between such GMMs ([13]) seem to produce interesting preliminary results with our small sample (whose recordings occur in the same room). However, because only one speaker is common to the samples, the results need to be proven with more data. Another path to explore would be to work on the pre-processing stage, for instance on clean-up tasks operating at or near the feature level. We still need to look for ways to deal with noisy SD clusters; perhaps the alternatives based on inventories of speaker prototypes could be practicable here. Likewise, it would be quite interesting to incorporate the approach of [32] into LIUM in order to evaluate it as an alternative stage in the current clustering work-flow. That could yield advantages because no UBM model would be required, as is the case in the CLR-based clustering; eventually, performance (in terms of processing time) could be improved as well. Finally, if the opportunity to work with ICSI in an open-source fashion could be arranged (at the moment of writing this report such a possibility is being initially analyzed), the testing of the SD could be enriched, and probably its quality as well.
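To illustrate the kind of GMM-to-GMM comparison just mentioned, the sketch below implements one of the closed-form approximations surveyed in [13]: the matching-based approximation of KL(f ‖ g) for diagonal-covariance mixtures. This is only a sketch of the idea; the diagonal-covariance restriction, the function names, and the symmetrization at the end are our assumptions, not a description of the exact experiment performed.

import numpy as np

def kl_diag_gauss(m0, v0, m1, v1):
    # Closed-form KL( N(m0, v0) || N(m1, v1) ) for diagonal covariances
    return 0.5 * float(np.sum(np.log(v1 / v0) + (v0 + (m0 - m1) ** 2) / v1 - 1.0))

def kl_gmm_matched(w_f, mu_f, var_f, w_g, mu_g, var_g):
    # Matching-based approximation of KL(f || g): each component of f is
    # paired with its "closest" component of g (see [13] for alternatives)
    total = 0.0
    for a in range(len(w_f)):
        costs = [kl_diag_gauss(mu_f[a], var_f[a], mu_g[b], var_g[b]) - np.log(w_g[b])
                 for b in range(len(w_g))]
        b_star = int(np.argmin(costs))
        total += w_f[a] * (costs[b_star] + np.log(w_f[a]))
    return total

def gmm_distance(f, g):
    # Symmetrized value, usable as a rough distance between speaker clusters
    return kl_gmm_matched(*f, *g) + kl_gmm_matched(*g, *f)

# Example: two toy one-dimensional GMMs as (weights, means, variances)
f = (np.array([0.6, 0.4]), np.array([[0.0], [3.0]]), np.array([[1.0], [1.0]]))
g = (np.array([0.5, 0.5]), np.array([[0.1], [2.9]]), np.array([[1.0], [1.2]]))
print(gmm_distance(f, g))

A small symmetrized value indicates acoustically similar cluster models, which is the property one would exploit to match clusters against stored prototypes of known speakers.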

Acknowledgments

We want to thank the DAAD for supporting this visit, especially Frau Almut Mester for her help before and during my stay. Thanks to the UNA and to the School of Computing for giving me the opportunity to make this visit through a special license; special thanks for that to Dr. Francisco Mata and MSc. Alberto Segura, and to Lic. Deyanira Amador for administrative assistance. Very special thanks go to Ing. Juan Carlos Gomez, CEO of Grupo Asesor en Informatica (GA, http://www.grupoasesor.net/), for providing us with the speech samples for our experimentation and for showing interest in the future of this research in an eventual joint work with the UNA. Special thanks to Dr. Christian Müller and Dr. Gerald Friedland for granting us access to valuable SD information and to the ICSI system. And last but not least, to Prof. Dr. Dietrich Klakow from the UdS for his invitation and his great support during my stay in Saarland. I want to extend my thanks to his group, particularly to his assistant Frau Diana Schreyer and to Dr. Deepu Vijayasenan. In any case, any technical incorrectness in this work is the author's own responsibility.

9 References

1. R.K. Ando & T. Zhang. A Framework for Learning Predictive Structures from Multiple Tasks and Unlabelled Data. Journal of Machine Learning Research 6. 2005.
2. X. Anguera. Robust Speaker Diarization for Meetings. PhD Thesis. U. Politecnica de Catalunya. 2006.
3. X. Anguera & S. Bozonnet & N. Evans & C. Fredouille & G. Friedland. Speaker Diarization: A Review of Recent Research. IEEE TASLP. 2010.
4. K. Boakye & B. Trueba-Hornero & O. Vinyals & G. Friedland. Overlapped Speech Detection for Improved Speaker Diarization in Multiparty Meetings. ICASSP. 2008.
5. S. Chen & P. Gopalakrishnan. Speaker, Environment and Channel Change Detection and Clustering Using the Bayesian Information Criterion. DARPA Broadcast News Transcription and Understanding Workshop. 1997.
6. E. El Khoury & S. Meignier & C. Senac. Speaker Diarization: Combination of the LIUM and IRIT Systems. Internal Report. 2008.
7. G. Friedland & O. Vinyals & Y. Huang & C. Mueller. Prosodic and Other Long-Term Features for Speaker Diarization. IEEE TASLP 17(5). 2009.
8. G. Friedland. The ICSI Speaker Diarization System: A Tutorial. International Computer Science Institute. 2010.
9. G. Friedland. Personal Communication. 2012.
10. M.R. Gupta & Y. Chen. Theory and Use of the EM Algorithm. Foundations and Trends in Signal Processing 4(3). 2010.
11. V. Gupta & P. Kenny & G. Boulianne & P. Dumouchel. Combining Gaussianized/non-Gaussianized Features to Improve Speaker Diarization of Telephone Conversations. IEEE Signal Processing Letters. 2007.
12. H. Gish & M-H. Siu & R. Rohlicek. Segregation of Speakers for Speech Recognition and Speaker Identification. ICASSP. 1991.
13. J.R. Hershey & P.A. Olsen. Approximating the Kullback-Leibler Divergence Between Gaussian Mixture Models. ICASSP. 2007.
14. Y. Huang & O. Vinyals & G. Friedland & C. Mueller & N. Mirghafori & C. Wooters. A Fast-Match Approach for Robust, Faster than Real-Time Speaker Diarization. IEEE Workshop on Automatic Speech Recognition and Understanding. 2007.
15. F. Jelinek. Statistical Methods for Speech Recognition. MIT Press. 1997.
16. D. Kolossa & R. Nickel & S. Zeiler & R. Martin. Inventory-Based Audio-Visual Speech Enhancement. Interspeech 2012.
17. M. Kotti & V. Moschou & C. Kotropoulos. Speaker Segmentation and Clustering. Signal Processing 88(5). 2008.
18. V.B. Le & O. Mella & D. Fohr. Speaker Diarization Using Normalized Cross Likelihood Ratio. Interspeech 2007.
19. P.C. Loizou. Reasons why Current Speech-Enhancement Algorithms do not Improve Speech Intelligibility and Suggested Solutions. IEEE TASLP 19(1). 2011.
20. C. Loría-Sáenz. On Recent Approaches to Improve HMM Speech Recognition Tasks. Unpublished Report. April 2011.
21. C. Loría-Sáenz & J.C. Gómez & A. Guevara & A. Ujueta. A Study on ASR Transcriptions Using Costa Rican Speech (Extended Abstract). LACNEM 2012.
22. S. Meignier & T. Merlin. LIUM_SPEAKERDIARIZATION: An Open Source Toolkit for Diarization. Report, LIUM, U. Maine. 2010.
23. C. Mueller. Personal Conversation. 2012.
24. C. Mueller. Speaker Diarization Tutorial. SCALE Workshop. 2010. (Facilitated by the author.)
25. A.V. Nefian & L. Liang & X. Pi & L. Xiaoxiang & C. Mao & K. Murphy. A Coupled HMM for Audio-Visual Speech Recognition. ICASSP 2002.
26. NIST. Rich Transcription Evaluation Project. http://www.itl.nist.gov/iad/mig/tests/rt/. 2009.
27. L.R. Rabiner. A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition. Proceedings of the IEEE 77(2). 1989.
28. D.A. Reynolds & E. Singer & B.A. Carlson & G.C. O'Leary & J.J. McLaughlin & M.A. Zissman. Blind Clustering of Speech Utterances Based on Speaker and Language Characteristics. ICSLP. 1998.
29. D.A. Reynolds & T.F. Quatieri & R.B. Dunn. Speaker Verification Using Adapted Gaussian Mixture Models. Digital Signal Processing 10: 19-41. 2000.
30. M. Siegler & U. Jain & B. Raj & R.M. Stern. Automatic Segmentation, Classification and Clustering of Broadcast News. DARPA Speech Recognition Workshop. 1997.
31. S. Tranter & D. Reynolds. An Overview of Speaker Diarization Systems. IEEE TASLP. 2006.
32. D. Vijayasenan. An Information Theoretic Approach to Speaker Diarization of Meeting Recordings. PhD Thesis. EPF Lausanne. 2010. (Chapters 1-3.)
33. D. Vijayasenan. Personal Communications. 2012.
34. W. Walker & P. Lamere & P. Kwok & B. Raj & R. Singh & E. Gouvea & P. Wolf & J. Woelfel. Sphinx-4: A Flexible Open Source Framework for Speech Recognition. Sun Technical Report TR2004-0811. 2004.
35. A.S. Willsky & H.L. Jones. A Generalized Likelihood Ratio Approach to the Detection and Estimation of Jumps in Linear Systems. IEEE Transactions on Automatic Control AC-21(1). 1976.
36. X. Zhu & C. Barras & S. Meignier & J-L. Gauvain. Combining Speaker Identification and BIC for Speaker Diarization. Interspeech 2005.
