Online speaker segmentation and clustering using cross-likelihood ...

5 downloads 384 Views 372KB Size Report
Mar 1, 2010 - Bayesian information criterion (BIC) is often used for the task of model selection ... segmentation approach in online systems or as a pre-/post-.
www.ietdl.org Published in IET Signal Processing Received on 21st September 2009 Revised on 1st March 2010 doi: 10.1049/iet-spr.2009.0235

ISSN 1751-9675

Online speaker segmentation and clustering using cross-likelihood ratio calculation with reference criterion selection M. Grasˇicˇ M. Kos Z. Kacˇicˇ Faculty of Electrical Engineering and Computer Science, University of Maribor, Smetanova ulica 17, SI-2000 Maribor, Slovenia E-mail: [email protected]

Abstract: This study presents a new online method for speaker segmentation and clustering in real-world environments. It analyses and discusses the difficulties of online speaker diarisation and proposes a new segmentation and clustering method, in which the Bayesian information criterion (BIC) and the normalised cross-likelihood ratio (NCLR) are combined into an online speaker diarisation system. A new decision parameter for BIC and NCLR is proposed using normalisation with reference criterion selection (NRCS), together with a window normalisation technique called window-length compensation (WLC), which normalises the criterion value according to analysed window length. The effectiveness of the proposed system and techniques in comparison to the standard offline speaker diarisation system (mClust) is demonstrated on the Slovenian Broadcast News database (BNSI) and an English Broadcast News database (the HUB-4). The online system presented in this study achieves similar performance to the BIC-based offline approach.

1

Introduction

The aim of speaker segmentation and clustering, also referred to as speaker diarisation, is to localise and segment individual homogenous speaker data in a conversational audio file or stream. In general, speaker diarisation answers the question ‘Who spoke when?’. The ways audio streams are processed can be divided into online and offline approaches. Offline approaches have access to all the recording material before they start processing it and are generally better described in the literature (e.g. [1]). Online approaches, on the other hand, only have access to data that have been recorded up to the current instant of processing. In such approaches only a small latency (in seconds) in output is allowed, which means that only a limited amount of input data are available for processing. No information about the complete recording material is available at any time. Online approaches start with one speaker at the beginning of the recording and iteratively increase the number of speakers as they intervene. Many different techniques for segmentation and clustering can be used in an iterative manner for offline speaker IET Signal Process., 2010, Vol. 4, Iss. 6, pp. 673 – 685 doi: 10.1049/iet-spr.2009.0235

diarisation [2]. Most offline clustering algorithms use hierarchical formats, in which speech segments are iteratively split or merged until the actual number of speakers is reached. Top-down clustering techniques [3] start with one or a small number of segments that are further split into the real number of speaker segments. Bottom-up clustering techniques [4] do the opposite because the audio stream is divided into homogenous audio segments, often using the generalised likelihood ratio (GLR) [5], and then grouped/clustered using the selected clustering method. The bottom-up clustering technique is the most popular and widely used clustering technique because of its superior performance [1, 3]. Online clustering is more restrictive compared to offline clustering because the speaker merging/splitting decision has to be made on the fly and iterative methods cannot be applied. A very popular approach in speaker segmentation and clustering is a metric-based approach derived from a statistical formulation that reformulates segmentation as a 673

& The Institution of Engineering and Technology 2010

www.ietdl.org model-selection task between two competing models. The Bayesian information criterion (BIC) is often used for the task of model selection because it is very effective and straightforward [6, 7]. This method has the advantage that no prior knowledge of the acoustic conditions is required and also no prior model training is needed. This method works and achieves good performance in specific environments but can perform very poorly in certain conditions that are common in real-world environments. The main problem of online speaker diarisation is segmentation of short speaker segments of 7 s or less [8]. In these conditions, speaker variations limit proper model generation and comparison because of the different phonetic contexts. This problem is normally absent in offline speaker segmentation in which proper speaker models can be generated on longer acoustic homogenous parts of the same speaker, and can then be used to successfully cluster short segments. This paper examines the problem of online speaker diarisation [4, 9] in conditions with limited speaker data and presents methods that try to compensate for the lack of information in short segments. The problem of absolute threshold selection is discussed because threshold selection represents a shortcoming in all major diarisation and segmentation methods (online and offline). Finally, a novel threshold selection approach is given, in which a reference criterion value is used for the segmentation and clustering decision. The remainder of this paper is organised as follows: Section 2 presents previous work on this topic. Section 3 describes the BIC speaker segmentation method. Section 4 presents an adapted model-based method for segmentation and clustering. Section 5 introduces the proposed segmentation method. Section 6 presents the experimental design. Section 7 introduces the evaluation method. Section 8 discusses the results, and the conclusion is given in Section 9.

2 Previous work in speaker diarisation and speaker recognition In the field of unsupervised speaker diarisation, BIC represents the most widely used segmentation criterion. BIC is an efficient speaker diarisation method with reasonable computational complexity and one tuneable parameter [4]. It is commonly used as either the main segmentation approach in online systems or as a pre-/postsegmentation step during offline segmentation/clustering. Although good performance can be achieved with BIC, the criterion still has some performance disadvantages. The segmentation of short speaker segments (shorter than 7 s) can be particularly problematic using the BIC criterion, as is pointed out in [6]. 674 & The Institution of Engineering and Technology 2010

Statistical model-based approaches that are commonly used for speaker recognition and verification have an advantage in this area. These methods benefit from using a universal background model (UBM) [1, 10 – 12] for normalisation because speaker phonetic variations in short segments can be compensated for using data assembled from a general UBM speaker model. The cross-likelihood ratio (CLR) measure [12] uses UBM models and normalisation to calculate speaker segment diversity. A very similar approach is used in the adapted model-based bilateral scoring-based speaker change detection ABLSSCD statistical framework [8]. These two methods share the cross-probability concept of XBIC [13] with added normalisation using an adapted UBM model and with better model capability because multiple Gaussians are used to model speaker features instead of just one Gaussian. A slightly different CLR-based method called the normalised cross-likelihood ratio (NCLR) measure was presented by the same author [14] in the area of speaker verification. The main difference between these two measures is that, in the case of NCLR, the normalisation term is also adapted. This technique increases performance in conditions in which the training data for the UBM and the testing data differ significantly. Although good segmentation performance can be achieved using all the methods mentioned above, threshold selection (tuning) of criterion parameters remains a problem when achieving optimal performance on a new database. This paper introduces a new online speaker diarisation method that uses BIC for segmentation and NCLR for segmentation point verification and clustering. In addition, we introduce additional normalisation techniques that enhance speaker diarisation performance.

3

Bayesian information criterion

The BIC is an asymptotically optimal likelihood criterion penalised by model complexity, as introduced by Schwarz in 1971 [15]. The BIC likelihood expression is defined as [7] 1 BICj = log P(X |Mj ) − l bj log N 2

(1)

where log P(X|Mj) is the log likelihood of training data X for model Mj , l is the weight of the second term and N represents the number of samples. The item bj represents the number of parameters in the model Mj .

3.1 BIC model selection for speaker segmentation and clustering For the speaker diarisation task, BIC is applied as a model-selection criterion and used to decide which of the parametric models best represent the given data samples IET Signal Process., 2010, Vol. 4, Iss. 6, pp. 673– 685 doi: 10.1049/iet-spr.2009.0235

www.ietdl.org [15]. A model-selection test is carried out in a sliding window using two sub-windows [7] in order to calculate speaker segment similarity at frame i, whereby model M2 is chosen over M1 if the expression DBIC ¼ BIC1 2 BIC2 is negative. M1 is defined using data X drawn from a single full-covariance Gaussian, whereas M2 is defined using data X ¼ [XA; XB] from two full-covariance Gaussians. Full-covariance matrices are employed to define model complexity. Cluster and speaker turn decision making is made according to [7] N i log |S| + log |SA | 2 2   N −i 1 d (d + 1) log |SB | + l d + log N + 2 2 2

DBIC = −

(2) where |S| is the determinant of the tested window’s covariance (model M1), |SA| is the determinant of the first sub-window’s covariance (corresponding to model M2: first Gaussian), |SB| is the determinant of the second subwindow’s covariance (corresponding to model M2: second Gaussian) and d defines the number of dimensions in the samples used. When performing speaker segmentation, a speaker turn candidate is chosen if the expression DBIC , 0 is true. For clustering, a segment is merged with an existing cluster if DBIC . 0. l is used in (2) as a penalty factor (tuning parameter) that tunes segmentation/ clustering sensitivity. The penalty factor is separately determined using experiments on a selected training database, for segmentation and for clustering.

4 Normalised adapted modelbased approach for segmentation and clustering This paper focuses on the NCLR criterion for speaker clustering and for verification of speaker turn candidates. NCLR is based on a statistical approach that involves a cross-log likelihood measure and UBM-based score normalisation to measure speaker differences. We used NCLR in our experiments because a performance increase over CLR was shown in [14]. For the task of speaker segmentation and clustering a NCLR-derived criterion is used as follows NCLR(XA , XB ) = (|L(XB |MA )| − |L(XA |MA )|) + (|L(XA |MB )| − |L(XB |MB )|) (3) where MA represents the model adapted with data extracted from sub-window XA , MB represents the model adapted with data extracted from sub-window XB . In (3) L is represented as L(.) ¼ log(P), where P is the probability. For the purpose of segmentation and clustering we introduce a threshold IET Signal Process., 2010, Vol. 4, Iss. 6, pp. 673 – 685 doi: 10.1049/iet-spr.2009.0235

parameter, to which the criterion value is compared and the segmentation/clustering decision is made. In the case of speaker segmentation, a speaker-turn candidate is chosen if the value of NCLR(XA , XB) is greater than the threshold value. For clustering, a segment is merged with an existing cluster if NCLR(XA , XB) is smaller than the threshold value. The threshold value is chosen separately for segmentation and clustering.

5 Online speaker diarisation using CLR calculation with reference criterion selection This paper introduces a combined segmentation and clustering method that integrates the BIC criterion with the NCLR criterion. This online diarisation system uses BIC for the pre-segmentation of hypothesised speaker change point candidates and NCLR for final speaker change point and clustering decisions. In the proposed approach, BIC is applied to pre-select speaker candidates and to select sub-window boundaries in which only speech of one speaker is to be found. Our previous work [16], in which we analysed cross-likelihood measures that use UBM for normalisation, indicated that such measures are more robust at segmentation than BIC, with better performance in conditions with limited speaker data. In order to take advantage of this ability in our system, longer equal-size subwindows are extracted from the BIC window selection information during the segmentation. Equal-size windows are used with the NCLR criterion values because variable-size window selection would make the Gaussian mixture model (GMM) probability computed in the criterion fluctuate more. The stabilised criterion advantage outweighs the lesser data presented with a fixed-window scenario compared to a variable-size window scenario. Equal-size windows were used only for segmentation but not for clustering because they worsened performance.

5.1 Window selection strategy for speaker segmentation The window selection strategy for speaker diarisation is used to determine the speaker data for comparison. Data are selected in a main window that is later split into two subwindows used to build speaker models when applying the chosen criterion. The proper window selection strategy is crucial for high speaker diarisation accuracy. Window selection has to be performed in a way that prevents a mix-up of more than one speaker’s data in one sub-window. In addition, care has to be taken when selecting the size of a sub-window because too little speaker data in a sub-window results in an imperfect speaker model representation. At this point, false insertions or deletions of boundaries can occur [7]. For the proposed segmentation method, we use variablesize window selection for BIC [4] and variable equal-size 675

& The Institution of Engineering and Technology 2010

www.ietdl.org experimentally determined on the development set. To limit computational complexity, we use the sliding window approach shown in Fig. 1c, if the window size is equal to or greater than MAXWINSIZE. We propose a new variable sub-window selection strategy for the NCLR, in which the size of the sub-window is adapted to the size of the smallest sub-window used in BIC. Fig. 1d shows the adaptation of the window used for the NCLR criterion, where T point represents the BIC hypothesised speaker turn candidate, at which the NCLR speaker turn test is applied using the adapted window and sub-window.

5.2 Window-length compensation Figure 1 Window selection strategy a Sub-window re-selection b and c Window re-selection d Window and sub-window re-selection for NCLR

sub-window selection for NCLR. Fig. 1 presents the subwindow selection strategy used with BIC. Fig. 1a shows sub-window selection in the initial window, where subwindow XA is increased whereas sub-window XB is decreased so that all speaker candidates in the window are analysed. Sub-windows are not selected at the edges of the window because there are insufficient data to build a proper speaker model. In Fig. 1, T point represents the intersection point between sub-windows XA and XB where the BIC criterion is applied. Fig. 1b shows a window selection/re-selection strategy, in which the window size is incremented to the value of MAXWINSIZE, which is

Comparing a log-likelihood-based speaker segmentation criterion value with a threshold value can be troublesome because the criterion value depends on the amount and type of speaker data used. Both these parameters are dependent on the window length. Therefore we propose a method that adapts the criterion value according to the length of the window. To compare the values of the NCLR criterion in various conditions (e.g. the influence of sub-window size), the mean NCLR criterion value with different sub-window sizes was analysed on the BNSI Broadcast News database [17]. The result of this analysis is presented in Fig. 2, in which the relationship of sub-window size and the mean NCLR criterion value is shown using a fourth degree polynomial fit. This polynomial curve was later used to compute the NCLR value in relationship to the sub-window size. The

Figure 2 Sub-window size against NCLR criterion value 676 & The Institution of Engineering and Technology 2010

IET Signal Process., 2010, Vol. 4, Iss. 6, pp. 673– 685 doi: 10.1049/iet-spr.2009.0235

www.ietdl.org NCLR adaptation factor was defined as ⎧ ⎨ PolyFit(LenREF ) if 100 ≤ Len ≤ 800 WLC(Len) = PolyFit(Len) ⎩ 1 else (4) where WLC represents the window-length compensation factor for NCLR, Len represents the analysis window length, LenREF represents the reference window length (defined experimentally) and PolyFit represents the fourth degree polynomial fit. We used the WLC to adapt (3) as follows NCLR(XA , XB ) = WLC(LenA )(|L(XB |MA )| −|L(XA |MA )|) + WLC(LenB)(|L(XA |MB)| −|L(XB |MB )|)

(5)

where WLC represents the adaptation factor, LenA represents the length of the sub-window XA and LenB represents the length of sub-window XB . In our previous work [16] we used a similar approach to compensate for varying window-length analyses in ABLSSCD. When comparing the results, we noticed that NCLR is more robust to changes in the windows analysed because there was less fluctuation in the criterion of the same test data.

5.3 Normalisation technique for speaker segmentation and clustering Almost all methods used for online speaker segmentation rely on absolute threshold selection for speaker turn selection and clustering. Proper thresholds are not always trivial to select because they often require a time-consuming ‘trial and error’ phase to pinpoint the proper threshold value; the process often has to be repeated when the test database is changed. Methods using threshold selection often give poor results under changing acoustic conditions. In this paper we define normalisation with reference criterion selection (NRCS), which improves decision threshold selection and enhances overall segmentation performance. The NRCS criterion is similar to the hypothesis search and selection used in speech recognition systems to select the hypothesis with the best log-likelihood. Normally, a decision regarding speaker segmentation is made using two sub-segments (Fig. 3; sub-window XA , sub-window XB).

The criterion is applied and computed in the intersection of those two sub-segments. In NRCS, we further divide sub-window XA and subwindow XB as XAA = [x1 , x2 , . . . , xHscp /2 ] XAB = [xHscp /2+1 , xHscp /2+2 , . . . , xHscp ] XBA = [xHscp +1 , xHscp +2 , . . . , xHscp +Hscp /2 ]

(6)

XBB = [xHscp +Hscp /2+1 , xHscp +Hscp /2+2 , . . . , xN ] where Hscp represents the hypothesised speaker change point candidate, XAA represents the left part of sub-window XA , XAB represents the right part of sub-window XA , XBA represents the left part of sub-window XB and XBB represents the right part of sub-window XB . The process of window division and testing point selection is shown in Fig. 3. Criterion test points and the main speaker change point criterion are defined as RCSA = f (XA ), RCSB = f (XB ), SCD = f (X ),

XA = XAA + XAB XB = XBA + XBB

(7)

X = XA + XB

where RCS represents the sub-window criterion reference, SCD represents the main speaker change point candidate and f denotes the criterion. Reference criterion testing points are matched against the hypothesised speaker change candidate with  NRCSA =  NRCSB =

True False

if (SCD − RCSA ) ≤ r else

True

if (SCD − RCSB ) ≤ r

False

else

(8)

where r represents the decision threshold used to define the NRCS decision. Using (8) we can normalise the speaker change point with the criterion values obtained from homogeneous surrounding speaker audio data found in the left and right parts of the particular sub-window. The final speaker change point decision can be made using (8) as follows IF NRCSA = True AND NRCSB = True THEN Write speaker change point

Figure 3 Block diagram of reference testing point selection for NRCS normalisation IET Signal Process., 2010, Vol. 4, Iss. 6, pp. 673 – 685 doi: 10.1049/iet-spr.2009.0235

(9)

ELSE Continue segmentation analysis 677

& The Institution of Engineering and Technology 2010

www.ietdl.org The final decision threshold is obtained individually for the particular criterion used. Using (7) we can now define the NRCS formulation for BIC and NCLR segmentation terms. We define the reference criterion terms for BIC as NA N log |SA | + A log |SAA | 2 4   N 1 d (d + 1) + A log |SAB | + l d + log NA 4 2 2

DBICRCSA = −

NB N log |SB | + B log |SBA | 2 4   NB 1 d (d + 1) log |SBB | + l d + log NB + 4 2 2 (10)

DBICRCSB = −

where NA represents the length of sub-window XA , NB represents the length of sub-window XB and |S| represents the determinant of the window/sub-window’s covariance. We also defined the normalisation approach for NCLR as NCLRRCSA (XAA , XAB )

(11)

= (|L(XBB |MBA )| − |L(XBA |MBA )|) + (|L(XBA |MBB )| − |L(XBB |MBB )|) Reference criterion selections are calculated as DBICRCSA and DBICRCSB for BIC and as DNCLRRCSA and DNCLR RCSB for NCLR. These selections are used for comparison with the main speaker change point candidate SCD (8). The speaker change points are defined according to (9).

5.4 Speaker clustering Online speaker clustering is a straightforward process where speaker segments are compared to individual speaker data from clusters. A cluster represents speaker data segments from one speaker that were clustered before the segment currently compared. Based on the comparison, the segment is merged with the cluster or a new cluster is created if the speaker segment belongs to a speaker that is not in the cluster database. Offline speaker diarisation methods are much more complex in this respect because they use iterative approaches where clusters are merged selectively and final speaker change points are selected during the final iteration. In our system we use the NCLR criterion presented in (3) for the speaker cluster merging decisions. Clusters C0, . . . , CN21 are compared to each new speaker segment XA . Clusters are 678 & The Institution of Engineering and Technology 2010

min(NCLR(XA , Ci ) ≤ CLUSTTR ,

Ci [ {C0 , . . . ,CN −1 }) (12)

where min represents the minimal NCLR criterion value between the currently segmented speaker segment XA and the best matching cluster (Ci) for segment XA , N represents the number of clusters and CLUSTTR is used in the expression as a decision threshold parameter for clustering. CLUSTTR is experimentally determined on the development set. When we use NCLR and NRCS together, expression (11) is re-formulated according to the NRCS framework presented in (8). The reformulated clustering condition is defined as (NRCSA) min(NCLR(XA , Ci )) ≤ NCLR(XAA , XAB ) + rTR , Ci [ {C0 , . . . ,CN −1 }

(13)

where NCLR(XAA , XAB) represents the NCLR criterion value adapted and calculated on the same speech segment XA and rTR represents the NRCS threshold parameter applied for NCLR online clustering.

5.5 Cluster merging

= (|L(XAB |MAA )| − |L(XAA |MAA )|) + (|L(XAA |MAB )| − |L(XAB |MAB )|) NCLRRCSB (XBA , XBB )

merged or new clusters are created according to the condition

It is necessary to merge speaker clusters because this optimises the speaker model presented in the cluster. Although only the first long speaker segment can be used for speaker cluster representation, merging or adapting the cluster improves the speaker model and enhances overall diarisation performance. Cluster adaptation is a simple process where speaker segments are added to segments already present in the cluster. During preliminary experiments, we analysed the difference between full adaptation (all speaker segments are added) and an adaptation method in which specific speaker segments were selected. We used the previously defined NRCS expression (Section 5.3) to select individual speaker segments suitable for the adaptation process. The speaker cluster was adapted with the given segment if the given segment satisfied the following rule ADAPTTR ≥ NCLR(XAA , XAB )

(14)

where ADAPTTR is the experimentally defined decision threshold used in the adaptation phase for speaker segment selection. Experimental results were identical when using full or specific segment selection adaptations. Based on the experimental results, we decided to use the adaptation method in which specific speaker segments were selected for the final system configuration. The experimental results also showed that the same speaker voice characteristics, channel characteristics and background noise change during an audio stream in a domain such as broadcast news. This tendency was shown IET Signal Process., 2010, Vol. 4, Iss. 6, pp. 673– 685 doi: 10.1049/iet-spr.2009.0235

www.ietdl.org when analysing the NRCS criterion value between the same speaker segments. We noticed that the same speaker segments that are located closer in time (local segments) achieved better similarity scores, such as when the segments were selected from the beginning and the end of an audio stream. Based on these findings, we decided to use a double clustering setup in which a speaker was presented using two clusters. The speaker’s second cluster was built directly after the first one. Both clusters were adapted using the clustering method described above. Preliminary test results showed an increase in performance when using a double clustering configuration.

6

Experimental framework

6.1 Simplified system flowchart A simplified flowchart of the system presented in this paper is shown in Fig. 4. Mel frequency cepstral coefficient [1] vectors (with length 13) are produced every 10 ms with a window size of 32 ms in an acoustic front-end unit. These vectors already include speech/non-speech tags. When testing, the speech/non-speech information is derived directly from the reference audio transcription, but in actual usage automatic

speech/non-speech information retrieval can be used [17]. During the window-selection process, the tag is used to avoid speaker change detection in non-speech segments. Non-speech segments have to be removed for online speaker clustering. If this is not done, segments are falsely clustered according to the non-speech information. The speech/non-speech classification was made using manual reference transcriptions. Offline methods are not as dependent on non-speech segmentation because clustering is done on small homogeneous parts. The BIC criterion is calculated and the window selection process is repeated until a speaker change point fits the criterion described in (2). The hypothesised speaker change point is then accepted or rejected using the NCLR – NRCS – WLC approach. The same approach is used to assign proper clusters to segments during the clustering stage. Segments that cannot be assigned to an already existing cluster are modelled with a newly created cluster. This process is repeated for the length of the audio stream.

6.2 Slovenian BNSI Broadcast News speech database The Slovenian BNSI Broadcast News speech database was used to evaluate the speaker segmentation and diarisation performance of the proposed methods. The database was designed in cooperation between the University of Maribor and the Slovenian national broadcaster, RTV Slovenia [18]. The BNSI database includes two different types of TV news shows. The first type is the evening news, in which a general overview of daily events is given. The second type is the late-night news, in which major events of the day are analysed. The speech corpus consists of 42 news shows, totalling 36 h of speech material. This material is further grouped into three sets: training, development and evaluation. The size of the training set is 30 h, whereas the sizes of the development and evaluation sets are 3 h each. Two of the most frequently focused conditions in the BNSI database are F0 (read studio speech, 36.6%) and F4 (read or spontaneous speech with background other than music, 37.6%). In all, 16.2% of speech in the database is spontaneous, recorded in a studio environment (F1), and 6.0% is spoken in the presence of background music (F3). Altogether 1565 different speakers are present in the BNSI database. The majority (1069) are male, and 477 are female. The gender of the remaining 19 speakers was classified as unknown.

Figure 4 Flowchart of the online speaker diarisation system using CLR calculation with reference criterion selection IET Signal Process., 2010, Vol. 4, Iss. 6, pp. 673 – 685 doi: 10.1049/iet-spr.2009.0235

The distribution of speaker segments in the BNSI database according to their length is shown in Fig. 5. Approximately 9% of speaker segments are shorter than 3 s. Such short segments are often incorrectly segmented with a speaker diarisation criterion such as BIC. 679

& The Institution of Engineering and Technology 2010

www.ietdl.org reduce the computational complexity. Further on, MAP adaptation was carried out using a relevance factor of 14 [23]. BIC was employed with the advance window selection strategy as described in Section 5.1, in which the MAXWINSIZE was set to 20 s with a 4-s initial window size. The fixed-window size test was done using a 4-s fixed window, with two 2-s sub-windows. A trial and error procedure was used to set NRCS and NCLR thresholds to: ADAPTTR (2.0), r TR (0.5) and CLUSTTR (2.5).

7

Evaluation method

7.1 Evaluation metrics Figure 5 Histogram of speaker segment length in the BNSI database

6.3 English HUB4 Broadcast News speech database To validate speaker segmentation and diarisation performance, the 1996 Broadcast News Speech Corpus was used [19]. The 1996 Broadcast News Speech Corpus contains a total of 104 h of broadcasts from ABC, CNN and CSPAN television networks and NPR and PRI radio networks with corresponding transcripts. From the available data we used the NIST 1996 HUB-4 extracted evaluation test material for continuous speech recognition for evaluation. The evaluation test data consist of approximately 3 h of speech, with 60% data content from television shows and 40% from radio shows. A subset of the training data together with the NIST 1996 development test material was used for UBM training and tuning of segmentation and clustering parameters, respectively. The training material incorporates around 10 h of speech material.

6.4 Evaluation environment The UBM GMM for NCLR and CLR models was trained using 450 mixtures and 2 h of speech data from the training set. The training data only included audio segments labelled as speech, where speech also occurred in adverse acoustic conditions (e.g. background music, overlapping speech). The non-speech segments were excluded from the training phase using manually transcribed reference transcriptions. Only diagonal-covariance matrices were involved in the training and adaptation of the UBM model. The UBM GMM model training required for NCLR and CLR during the development phase was performed using the expectation – maximisation (EM) algorithm [20, 21]. Following (3), adaptation of the UBM GMM was performed using the maximum a posteriori (MAP) adaptation [22, 23]. Only the means were adapted to 680 & The Institution of Engineering and Technology 2010

The primary metric for speaker diarisation – that is, speaker segmentation and clustering – is the overall speaker diarisation error rate (DER) [1]. DER is the percentage of time that the system misattributes speaker segments. Speaker segmentation performance achieved during speaker segmentation is usually expressed by two types of error measures. A false alarm occurs when a speaker change is detected even though it does not exist. This type of error is measured using the precision measure (P). This measure is defined as P=

number of correctly found boundaries total number of boundaries found

(15)

Misdetection occurs when the algorithm does not detect an existing speaker change. This type is measured using the recall measure (R). This measure is defined as R=

number of correctly found boundaries total number of correct boundaries

(16)

The combined F-measure (F ) is also used to evaluate the performance and to compare the results with other systems. This measure is defined as F=

2RP R+P

(17)

A margin is used for evaluation at each detected speaker change point. If the hypothesised speaker change is within the 1-s margin before or after the reference point, then the turn is considered correct. Direct speaker clustering performance is expressed as a cluster purity measure. The metric measures the percentage of improper speaker speech frames in a cluster. The measure we used calculates the speaker cluster purity on a frame basis. Reference cluster purity measures how many of a particular speaker’s data were correctly labelled; this measure is defined IET Signal Process., 2010, Vol. 4, Iss. 6, pp. 673– 685 doi: 10.1049/iet-spr.2009.0235

www.ietdl.org as follows Reference purity =

number of correctly assigned frames all reference speaker frames (18)

Computed cluster purity on the other hand measures how many of the speaker labels were incorrectly assigned. The measure is defined as Computed purity =

number of correctly assigned frames all computed speaker frames (19)

Speaker cluster purity in this article is presented twice: for mapped clusters alone, and for mapped and unmapped clusters together. Unmapped clusters are labelled segments not belonging to a particular speaker; that is, incorrectly assigned clusters.

7.2 Offline reference speaker diarisation system (bottom-up clustering) An offline reference speaker diarisation system (mClust) was used to define a reference result. MClust is a software package dedicated to speaker diarisation [24]. The segmentation and clustering is based on bottom-up clustering, in which the signal is first split into small homogeneous segments using the GLR. These segments are later clustered using the selected clustering method and, finally, the boundaries are adjusted. The tools provided allow BIC hierarchical clustering, Viterbi decoding using GMM models trained by EM or MAP and CLR hierarchical clustering using GMM (clustering based on automatic speaker recognition methods).

8

Results

8.1 Results on the Slovenian BNSI database BIC was used as the online reference method for speaker segmentation and clustering. We also used the offline diarisation system presented in Section 7.2 to compare online and offline speaker diarisation systems’ performance.

Table 1 Segmentation performance of analysed online methods with the same fixed-size window selection strategy on the BNSI database Method

R, %

P, %

F, %

BIC-fixed (l ¼ 2.5) 65.80 51.50 57.78 CLR

72.70 61.00 66.34

NCLR

72.13 60.58 65.85

Table 1 shows the segmentation performance of the methods analysed with the same fixed-size window selection approach on the BNSI database. The data show that CLR achieves better performance than NCLR. We can attribute the performance gain with CLR to the normalisation technique, in which the cross-likelihood term is directly compared to the UMB likelihood and not to the adapted likelihood presented in NCLR. Speaker change candidates with segments that do not match the UBM model are therefore discarded and this improves performance in the case of CLR. When observing the recordings we noticed that a non-adapted UBM normalisation likelihood is better for segmentation on segments with background noise (changing microphone, various sounds). Despite its better segmentation performance, CLR was not used for the system design because preliminary clustering tests gave a DER value of approximately 30%. Table 2 shows segmentation performance of the methods analysed with BIC using variable-size window selection on the BNSI database. A clear performance advantage in the form of better precision is noticed when using a statistical model-based approach such as NCLR. We found better precision in BIC segmentation when using NRCS, and better performance is gained mostly on speaker turns with noisy segments. When comparing offline and online speaker turn selection approaches, similar performance in terms of F-measure value is observed. In our selected test database, mClust-BIC achieved slightly better performance than mClust-CLR. This performance difference can be attributed Table 2 Segmentation performance of the online methods analysed with reference to offline methods on the BNSI database; for BIC a variable-size window selection was used Method

R, %

P, %

F, %

BIC (l ¼ 3.8)

82.80 67.60 74.43

BIC – NRCS (l ¼ 3.8)

82.20 70.80 76.32

BIC – NCLR (l ¼ 3.8)

77.30 77.70 77.53

† speaker segmentation performance (precision, recall, F-measure),

BIC – NCLR– NRCS (l ¼ 3.8)

77.20 79.10 78.14

† clustering performance (cluster purity) and

mClust-BIC (offline method)

92.30 69.50 79.34

mClust-CLR (offline method)

94.30 68.30 79.22

Methods were analysed using various selected diarisation criteria:

† overall performance (DER metric). IET Signal Process., 2010, Vol. 4, Iss. 6, pp. 673 – 685 doi: 10.1049/iet-spr.2009.0235

BIC – NCLR– NRCS – WLC (l ¼ 3.8) 77.20 79.10 78.14

681

& The Institution of Engineering and Technology 2010

www.ietdl.org Table 3 Clustering performance of the online methods analysed (mapped and unmapped) with reference to offline methods on the BNSI database; for BIC a variable-size window selection was used Method

Reference cluster purity, % Computed cluster purity, %

BIC (l ¼ 3.8)

65.12

59.18

BIC – CLR (l ¼ 3.8)

67.10

61.54

BIC – CLR – NRCS (l ¼ 3.8)

69.77

68.43

BIC – NCLR – NRCS – WLC (l ¼ 3.8)

70.13

68.74

mClust-BIC (offline method)

67.13

68.87

mClust-CLR (offline method)

75.34

75.01

Table 4 Clustering performance of the online methods analysed (only mapped) with reference to offline methods on the BNSI database; for BIC a variable-size window selection was used Method

Reference cluster purity, % Computed cluster purity, %

BIC (l ¼ 3.8)

74.43

88.93

BIC – CLR (l ¼ 3.8)

76.02

89.37

BIC – CLR – NRCS (l ¼ 3.8)

78.08

89.46

BIC – NCLR – NRCS – WLC (l ¼ 3.8)

81.18

89.86

mClust-BIC (offline method)

83.11

86.56

mClust-CLR (offline method)

87.54

90.10

to the different internal system parameter setups of the two reference test systems, which we did not examine or tune. Tables 3 and 4 show clustering performance expressed with cluster purity for mapped and unmapped clusters (Table 3) and for mapped clusters only (Table 4). When analysing the results, better clustering purity can be seen in the systems featuring the proposed NRCS and WLC approaches. From the data presented, similar clustering performance can be observed for the proposed online BIC – NCLR – NRCS –WLC and the reference offline mClust-BIC system. Table 5 shows the overall speaker diarisation performance of the methods analysed on the BNSI database. The offline mClustCLR system achieves the highest overall clustering performance. The proposed online system BIC–NCLR–NRCS–WLC achieves better performance when compared to a BIC-based offline speaker diarisation system. From the tests analysis, clear performance advantages of UBM GMM-based cross-likelihood methods are seen compared to BIC-based segmentation methods. We contribute the performance increase mainly to the improved speaker modelling and normalisation capabilities of the NCLR criterion compared with the BIC criterion, which uses only a 682 & The Institution of Engineering and Technology 2010

one-dimensional Gaussian for speaker model representation and diversity calculations. NCLR cross-likelihood criterion uses multiple Gaussians for speaker modelling and speaker normalisation, which helps segmentation and clustering in situations with limited data where phonetic sequences of the same speaker can cause false speaker boundary detection and false creation of a new speaker cluster. The enhancement techniques presented here successfully enhanced system performance. There was a significant

Table 5 Diarisation performance of the online methods analysed with reference to offline methods on the BNSI database; for BIC a variable-size window selection was used Method

DER, %

BIC (l ¼ 3.8)

24.97

BIC – NCLR (l ¼ 3.8)

23.16

BIC – NCLR – NRCS (l ¼ 3.8)

20.95

BIC – NCLR – NRCS – WLC (l ¼ 3.8)

20.52

mClust-BIC (offline method)

21.02

mClust-CLR (offline method)

16.09

IET Signal Process., 2010, Vol. 4, Iss. 6, pp. 673– 685 doi: 10.1049/iet-spr.2009.0235

www.ietdl.org advantage to using NRCS in the clustering stage, in which a high-performance increase was achieved. NRCS also increased segmentation performance. WLC also proved to enhance the system’s performance, although it produced a much lower performance increase than NRCS did. It must be noted that in our final system we only used WLC to adjust the NCLR criterion in situations in which the analysis window was within the range of 100 – 800 frames (1 – 8 s). In this range the influence of the window length on the criterion proved to be linear. In larger windows we determined that the window length depends much more on the type of data in the window itself. Therefore updating windows larger than this resulted in an overall decrease in performance. Experiments were done in which the BIC tuning parameter was decreased (in reference to standalone BIC performance) in order to increase potential speaker-turn candidates presented to NCLR. These experiments failed to give better performance mainly because BIC pre-selection incorrectly selected speaker candidates prior to the real speaker boundary point, which led to segment boundaries being incorrectly selected. Such segments often have speaker information from multiple speakers and are therefore very hard to cluster. The decision to use NCLR instead of CLR for segmentation and clustering was confirmed by our previous experiments, in which NCLR gave a 10% better DER compared to CLR. We attribute the performance disadvantage of CLR to its different normalisation term, in which a generic UBM model without test database adaptation is used. Differences between UBM training and diarisation testing data can therefore influence segmentation and clustering decisions.

8.2 Results on the English HUB4 database Table 6 shows segmentation performance of methods analysed with BIC using variable-size window selection on Table 6 Segmentation performance of the online methods analysed on the HUB-4 database with reference to offline methods; variable-size window selection was used for BIC Method

R, %

P, %

Method

DER, %

BIC (l ¼ 3.8)

22.35

BIC – NCLR (l ¼ 3.8)

18.35

BIC – NCLR – NRCS (l ¼ 3.8)

17.46

BIC – NCLR – NRCS – WLC (l ¼ 3.8)

17.34

mClust-BIC (offline method)

18.55

mClust-CLR (offline method)

13.89

the HUB-4 database. As for the BNSI database, a clear performance advantage in the form of better precision and F-measure is noted when using NCLR and NRCS. When comparing offline and online speaker turn selection approaches, a slightly better performance of offline methods is noted. Overall segmentation performance on the HUB-4 database compared to BNSI decreased. When analysing the HUB-4 we found longer (3-s) silence sections in the database that were not transcribed as non-speech. Databases are normally transcribed so that speaker boundary points are selected in the middle of a short, non-transcribed, non-speech segment between speakers. In a database with longer nontranscribed, non-speech segments (like HUB-4), the likelihood of misdetected speaker turns and falsely selected speaker turns is higher because the calculated speaker turn is not placed in the specified interval to the reference speaker point. Table 7 shows the analysed methods’ overall speaker diarisation performance on the HUB-4 database. The offline mClust-CLR system achieves the highest overall clustering performance. The proposed BIC – NCLR – NRCS –WLC online system achieves better performance when compared to a BIC-based offline speaker diarisation system. Overall diarisation performance on the HUB-4 database compared to BNSI increased. This improved diarisation performance can be attributed to the better signal-to-noise ratio and the better speaker diversity found in the HUB-4 database.

F, %

BIC (l ¼ 3.8)

74.30 64.30 68.94

BIC – NRCS (l ¼ 3.8)

73.80 67.10 70.29

BIC – NCLR (l ¼ 3.8)

71.60 70.60 71.10

BIC – NCLR– NRCS (l ¼ 3.8)

71.20 72.20 71.70

BIC – NCLR– NRCS – WLC (l ¼ 3.8) 70.90 72.80 71.84 mClust-BIC (offline method)

80.30 69.00 74.22

mClust-CLR (offline method)

82.50 69.70 75.56

IET Signal Process., 2010, Vol. 4, Iss. 6, pp. 673 – 685 doi: 10.1049/iet-spr.2009.0235

Table 7 Diarisation performance of the online methods analysed on the HUB-4 database with reference to offline methods; variable-size window selection was used for BIC

9

Conclusions

Online speaker diarisation is attracting attention because of research interests in combining speaker diarisation with large vocabulary online speech recognition. In speech recognition systems, speaker diarisation systems are used to boost recognition performance with acoustic models adapted with individual speakers’ speech data. Most of these approaches are currently run offline, but online speaker diarisation systems are needed for online video subtitling. 683

& The Institution of Engineering and Technology 2010

www.ietdl.org This paper analysed online segmentation and clustering approaches and presented techniques to further enhance performance. The online system presented combines the BIC speaker segmentation criterion and the GMM-based NCLR first used in the domain of speaker verification. The concept of combining these two methods proved to be reasonable, given the performance advantage over BIC alone presented in the previous section. The paper also introduces additional techniques that increase system performance. Performance was enhanced with a specific, less complex cluster adaptation technique and with methods that limit or normalise the influence of threshold selection (NRCS, WLN). The performance of the systems and enhancing methods presented was demonstrated on the Slovenian BNSI and on the HUB-4 broadcast news database.

[8] MALEGAONKAR A.S., ARIYAEEINIA A.M., SIVAKUMARAN P.: ‘Efficient speaker change detection using adapted Gaussian mixture models’, IEEE Trans. Speech Audio Process., 2007, 15, (6), pp. 1859 – 1869 [9] ROUGUI J., RZIZA M., ABOUTAJDINE D., GELGON M., MARTINEZ J.: ‘Fast incremental clustering of Gaussian mixture speaker models for scaling up retrieval in on-line broadcast’. Proc. IEEE Int. Conf. on Acoustics, Speech and Signal Processing, Toulouse, France, 2006 [10] JAYANNA H.S., MAHADEVA PRASANNA S.R.: ‘Multiple frame size and rate analysis for speaker recognition under limited data condition’, IET Signal Process., 2009, 3, pp. 189– 204 [11] PILLAY S.G., ARIYAEEINIA A., PAWLEWSKI M., SIVAKUMARAN P.: ‘Speaker verification under mismatched data conditions’, IET Signal Process., 2009, 3, pp. 236– 246

From these results it can be concluded that online speaker segmentation performance achieves similar results to the BIC-based offline speaker segmentation and clustering system; therefore performance improvements in online large vocabulary speech recognition systems are feasible.

‘Blind clustering of speech utterances based on speaker and language characteristics’. Proc. Interspeech, Sydney, Australia, December 1998

10

[13] ANGUERA X.: ‘Xbic: Real-time cross probabilities measure for speaker segmentation’. ICSI Technical report TR-05-008, 2005

References

[12]

REYNOLDS D., SINGER E., CARLSON B., O’LEARY G., MCLAUGHLIN J.,

ZISSMAN M.:

[1] ANGUERA X.: ‘Robust speaker diarization for meetings’. PhD thesis, Polytechnic University of Catalonia, 2006

[14] BAC L.V., MELLA O. , FOHR D. : ‘Speaker diarization using normalized cross likelihood ratio’. Proc. Interspeech, Antwerp, Belgium, August 2007

[2] TRANTER S. , REYNOLDS D.: ‘An overview of automatic speaker diarisation systems’, IEEE Trans. Speech Audio Process., 2006, 14, (5), pp. 1 – 8

[15] SCHWARZ G.: ‘Estimating the dimension of a model’, Ann. Stat., 1978, 6, pp. 461– 464

[3] TRANTER S., REYNOLDS D.: ‘Speaker diarization for broadcast news’. Proc. ISCA Speaker Recognition Workshop Odyssey, Toledo, Spain, May 2004 [4] AJMERA J., BOURLARD H., LAPIDOT I.: ‘Improved unknownmultiple speaker clustering using HMM’. Technical report RR-02-23, IDIAP Research, 2002 [5] WILLSKY A.S., JONES H.L.: ‘A generalized likelihood ratio approach to the detection and estimation of jumps in linear systems’, IEEE Trans. Autom. Contol, 1976, AC-21, (1), pp. 108– 112 [6] CHEN S.S., GOPALAKRISHNAN P.S.: ‘Speaker, environment and channel change detection and clustering via the Bayesian information criterion’. Proc. DARPA Broadcast News Transcription and Understanding Workshop, 1998 [7] TRITSCHLER A., GOPINATH R. : ‘Improved speaker segmentation and segments clustering using the Bayesian information criterion’. Proc. Eurospeech, Budapest, Hungary, 1999, pp. 679– 682 684 & The Institution of Engineering and Technology 2010

[16] GRASˇ ICˇ M. , KOS M., Zˇ GANK A., KACˇ ICˇ Z.: ‘Two step speaker segmentation method using Bayesian information criterion and adapted Gaussian mixtures models’. Proc. Interspeech, Brisbane, Australia, September 2008 [17] JO Q.-H., CHANG J.-H. , SHIN J.W., KIM N.S.: ‘Statistical model-based voice activity detection using support vector machine’, IET Signal Process., 2009, 3, pp. 205– 210 [18] ZˇGANK A., VERDONIK D., MARKUSˇ A.Z., KACˇICˇ Z.: ‘BNSI Slovenian broadcast news database: speech and text corpus’. Proc. Interspeech, Lisbon, Portugal, September 2005 [19] GRAFF D.: ‘The 1996 Broadcast News speech and language-model corpus’. Proc. DARPA Speech Recognition Workshop, Chantilly, VA, February 1997 [20] DEMPSTER A.P., LAIRD N.M., RUBIN D.B.: ‘Maximum likelihood from incomplete data via the EM algorithm’, J. R. Stat. Soc. B, 1977, 39, (1), pp. 1 – 38 IET Signal Process., 2010, Vol. 4, Iss. 6, pp. 673– 685 doi: 10.1049/iet-spr.2009.0235

www.ietdl.org [21] ZHANG J., WOO W.L., DLAY S.S.: ‘Expectation– maximisation approach to blind source separation of nonlinear convolutive mixture’, IET Signal Process., 2007, 1, pp. 51–65

[23] BIMBOT F., BONASTRE J.F., FREDOUILLE C., ET AL .: ‘A tutorial on text-independent speaker verification’, EURASIP J. Appl. Signal Process., 2004, 4, pp. 430 – 451

[22] GAUVAIN J.L., LEE C.H.: ‘Maximum a-posteriori estimation for multivariate Gaussian mixture observations of Markov chains’, IEEE Trans. Speech Audio Process., 1994, 2, (2), pp. 291– 298

[24] DELEGLISE P., ESTEVE Y., MEIGNIER S., MERLIN T.: ‘The LIUM speech transcription system: a CMU Sphinx III-based system for French broadcast news’. Proc. Interspeech, Lisbon, Portugal, September 2005

IET Signal Process., 2010, Vol. 4, Iss. 6, pp. 673 – 685 doi: 10.1049/iet-spr.2009.0235

685

& The Institution of Engineering and Technology 2010

Suggest Documents