
GMM based clustering and speaker separability in the Timit speech database

Andrew Morris, Dalei Wu and Jacques Koreman

Summary

Speaker recognition on the 630 speaker Timit speech database, using maximum probability selection with a simple Gaussian Mixture Model (GMM) for the data distribution for each speaker, gives above 99% correct recognition. In contrast, a powerful classifier such as a Multi-Layer Perceptron (MLP), trained to estimate speaker probabilities, often performs no better than random selection, even on a small subset of speakers. We hypothesise two effects which could combine to produce this situation. MLPs do badly because the acoustic feature data is primarily clustered around phonemes, so that speaker classes are highly fragmented and interspersed. In contrast, GMMs model speaker data distributions well because variation within the phonetic cluster identified by each Gaussian is primarily due to speaker variation, with the result that when speaker models are trained by adapting only the means from a multi-speaker world model, the resulting GMMs are highly discriminative between speakers. In this article we analyse the distribution of speech and speaker information, both overall and within the cluster identified by each Gaussian in a GMM tuned for speaker recognition on Timit. We show that the results of this analysis support the above hypotheses, and then discuss ways in which the enhanced speaker separability within each Gaussian cluster could be used to harness the discriminative power of MLPs to provide feature data enhancement and improved speaker identification.

Keywords: GMM, MLP, cluster analysis, Timit, speaker recognition

1. Introduction

The most widely used model in ASR is currently the HMM, using GMMs to model the distribution of data for each phone state. Another well-known model is the hybrid HMM/MLP, which uses an MLP instead of a GMM to estimate phone state probabilities; here the MLP is trained to map acoustic data directly onto phone state probabilities [1,8]. This normally achieves about 80% frame-level phoneme recognition, and word-level recognition comparable to that of the HMM/GMM. In speaker recognition, whether text dependent or independent, the model consisting of a simple GMM for each speaker is very successful [11,12]. One would not expect an MLP to be suitable for speaker recognition, if only because the number of speakers, unlike the number of phonemes, is unbounded. However, even when an MLP is trained to estimate speaker probabilities for just a small number of speakers, performance is usually very low. In this article we look at why GMMs do so well for speaker recognition, while MLPs used in this way do so badly. From this analysis we outline an approach in which the powerful modelling capacity of universal classifiers, such as MLPs, could be more effectively applied to improve speaker recognition.

1.1 Why are GMMs so effective for speaker modelling, while MLPs are not?

The performance of any classifier which is trained to estimate class posterior probabilities is generally related to the "separability" of the classes it is being trained to separate, which decreases both with class overlap and with the complexity of the separating class boundaries. One reason that speech sounds are relatively easy to separate could be that it is in the nature of communication that everyone sharing the same language is obliged to approximate the same set of distinctive speech sounds, so that acoustic data clustering largely reflects this set of sounds. This allows GMMs to model the position of each cluster centre as a compact characterisation of "the way someone talks", comprising an effective set of speaker discriminative features. Conversely, the data from any one speaker is systematically distributed over all phoneme clusters, and is therefore highly entangled (fragmented and interspersed) with the data from other speakers, so that speaker classes are very poorly separable. Besides this problem of speaker class separability, there are typically some speakers for whom little speech data is available, so each speaker model is often trained by using this limited data to adapt the parameters of a GMM which has been previously trained on a large number of different speakers. Although adapting speaker models from such a world model does not help with Timit, best results were obtained on several other more challenging databases using MAP adaptation of the Gaussian means (only) from a world model [9]. In having one model to separate all speakers, MLPs are not as well suited as GMMs to adaptive training or to incremental speaker enrolment. However, in Section 5 we discuss other ways, besides direct speaker classification, in which MLPs might be usefully applied to speaker recognition.

1.2 Analysis of clustering in the Timit speech database

For an analysis of the effect of GMM clustering on the separability of speech and speaker classes we selected the Timit speech database. This is because Timit is widely used for both speech and speaker recognition tests, and each utterance is phonetically hand labelled and provided with codes for speaker number, gender and dialect region. A previous study of data clustering in Timit [14] attempted to expose a class hierarchy, but a different analysis of variance was obtained according to the class hierarchy imposed. In our analysis we make no such assumptions about the data. Our goal is to look at ways in which the partitioning of data by a GMM can be exploited to enhance speaker separation.

In Section 2 we describe the GMM training procedure, the choice of data features, the data labelling, and the class separability measures used. In Section 3 we look at speaker separation conditioned on broad phonetic class, where separability is measured by the proportion of correct classification achieved by LDA. In Section 4 we present the Gaussian clustering for the different classes and subclasses, and analyse its effect on speech and speaker separability. This analysis is repeated for two equally high performing but very different feature types, and for GMMs comprising from 2 to 32 Gaussians. In Section 5 we discuss the implications of this analysis for the following three hypotheses.

1. Everyone sharing the same language is obliged to approximate the same set of distinctive speech sounds. Acoustic data clustering largely reflects this set of sounds, making them easily separable.

2. The data from any one speaker is systematically distributed over all phonetic clusters, and is therefore highly fragmented and interspersed, so that speaker classes are very poorly separable.

3. Speaker separability within each of the clusters identified by a GMM is significantly enhanced.

2. Model training, feature analysis and experimental design

The GMM clustering and data features used in this article were selected to be relevant to as wide a readership as possible: they are well known standard techniques which give near state-of-the-art speaker recognition performance. Both the GMM and data feature specifications were taken from [13]. We have analysed the effect of the clustering on class separation for different levels of speech and speaker class, including speaker, dialect region, gender and three levels of phonetic class.

2.1 Data features

In order to check the sensitivity of the data analysis to the data features used, the analysis was repeated on two different types of features, one FFT based and the other wavelet based. Both of these feature types were used in [13], where it was shown that the more recently introduced wavelet based features performed marginally better than the standard MFCC features, also used in [12]. As in [13], all of the Timit signal data was first downsampled to 8 kHz, to simulate telephone line transmission, but no further low- or high-pass filters were applied. The MFCC features used 20ms windows and 10ms shift, with a pre-emphasis factor of 0.97, a Hamming window, 20 Mel scaled feature bands, and all MFCC coefficients except c0. As in [13], all of the optional processing steps commonly used in speaker recognition, including silence frame removal, cepstral mean subtraction, and appended time difference features, were tested, but none showed any advantage, so none were used. One might ask why features designed for speech recognition should be used for speaker recognition; however, other features, including LPC and LP reflection coefficients, were also tested and only reduced performance.
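For concreteness, the MFCC configuration just described can be reproduced along the following lines. This is a minimal sketch using the librosa library (not the toolchain used in [13]); the helper name and file path are illustrative.

    # Sketch of the MFCC settings described above (8 kHz audio, 20 ms window,
    # 10 ms shift, 0.97 pre-emphasis, Hamming window, 20 Mel bands, c0 dropped).
    import librosa

    def timit_style_mfcc(wav_path):
        y, sr = librosa.load(wav_path, sr=8000)        # downsample to 8 kHz
        y = librosa.effects.preemphasis(y, coef=0.97)  # pre-emphasis factor 0.97
        mfcc = librosa.feature.mfcc(
            y=y, sr=sr, n_mfcc=20,
            n_mels=20,                                 # 20 Mel scaled bands
            n_fft=int(0.020 * sr),                     # 20 ms window
            hop_length=int(0.010 * sr),                # 10 ms shift
            window="hamming")
        return mfcc[1:].T                              # drop c0; one row per frame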


The wavelet features used a 24ms sample window, with 10ms shift. Their ability to compete with the MFCC features was due to the use of a special wavelet packet tree which effected a Mel-like frequency scaling, instead of the usual geometric frequency scaling. The wavelet log energy coefficients were also passed through a DCT for orthogonalisation. We will denote these wavelet coefficients WAVC.

2.2 Gaussian mixture models

GMMs were trained using K-means clustering followed by EM iteration, performed by the Torch machine learning API [2]. The Torch variance threshold factor of 0.01 and minimum Gaussian weight of 0.05 were optimal for both MFCC and WAVC features, with performance falling sharply if either was halved or doubled. GMMs with 2, 4, 8, 16 and 32 Gaussians (which we will refer to as GMM2 to GMM32, respectively) were trained separately. Mixture splitting was tested, but gave no improvement. For the purpose of data analysis in this article, the GMM was trained with all of the Timit speech data.

The GMM model and features used give state-of-the-art speaker recognition performance. The relevant train/test divisions and speaker identification results on Timit were as follows. As in [13] and [12], the 10 phrases per speaker (labelled SA1..2, SI1..3 and SX1..5) were divided into a training set (SA1..2, SI1..3 and SX1..3) and a test set (SX4..5), with no data reserved for a development set. One GMM was trained, for each of the 630 speakers, on all the training data from that speaker. Selecting the speaker with the greatest GMM density, our own tests gave 96.1% correct for MFCCs and 96.7% for WAVCs. With no down-sampling, both achieved above 99.7%.
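As an illustration of this recipe, a minimal sketch using scikit-learn in place of Torch might look as follows. The diagonal covariance and the use of reg_covar as a stand-in for the Torch variance threshold are our assumptions, not details given in the text.

    # One GMM per speaker, k-means initialisation followed by EM, and
    # identification by maximum utterance log-density.
    from sklearn.mixture import GaussianMixture

    def train_speaker_gmms(train_feats, n_components=32):
        """train_feats: dict speaker_id -> (n_frames, n_dims) feature array."""
        gmms = {}
        for spk, X in train_feats.items():
            gmm = GaussianMixture(n_components=n_components,
                                  covariance_type="diag",  # assumption
                                  init_params="kmeans",    # k-means then EM
                                  reg_covar=1e-2)          # variance flooring (stand-in)
            gmms[spk] = gmm.fit(X)
        return gmms

    def identify(gmms, X_test):
        # Sum of per-frame log densities = utterance log-likelihood.
        scores = {spk: g.score_samples(X_test).sum() for spk, g in gmms.items()}
        return max(scores, key=scores.get)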

2.3 Data labelling

Timit has 630 speakers (438 male, 192 female) from 8 dialect regions in the USA, each speaking 10 sentences (of which two, the SA sentences, were always the same). Each utterance in Timit is phonetically hand labelled and provided with codes for speaker, gender and dialect region. After feature processing and GMM training, each feature frame x was labelled with the index of the Gaussian Gi for which it gave the greatest posterior probability, P(Gi|x). Broad phone and speaker classes are also of interest, because broad class detection can be relatively robust, and in speech or speaker recognition an initial broad class identification is often used to either select or condition the model used for fine class recognition. A record was therefore compiled for every feature frame, with fields as shown in Table 1.

  Partition              Num. categories
  phone, P61             61
  speaker, SPK           18, 630
  gender, GEN            2
  dialect region, DRE    8
  broad phone 1, P20     20
  broad phone 2, P07     7
  broad phone 3, P04     4
  Gaussian index, GID    2, 4, 8, 16, 32

Table 1. Categories P61 to DRE above were obtained from the Timit labelling. P61 was grouped into 3 broad class partitions, Pnn, where nn is the number of categories. The categories in P04 are (obstruent, sonorant consonant, vowel, silence). Each data frame was also labelled with the index of the Gaussian which gave the highest probability density.
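The per-frame labelling behind Table 1 can be expressed compactly. In the sketch below (scikit-learn API assumed), world_gmm is the GMM trained on all Timit data and labels holds the Timit-derived label arrays; both names are illustrative.

    # Tag every frame with the index of the Gaussian giving it the highest
    # posterior, alongside its Timit labels (field names as in Table 1).
    def label_frames(world_gmm, frames, labels):
        """frames: (T, d) array; labels: dict of per-frame arrays keyed
        P61, SPK, GEN, DRE, P20, P07, P04."""
        gid = world_gmm.predict(frames)  # argmax_i P(G_i | x) for each frame
        return [dict({k: v[t] for k, v in labels.items()}, GID=int(gid[t]))
                for t in range(len(frames))]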

As well as analysing the full set of data frames, in order to observe the clustering of data for individual speakers we also analyse the data from a subset of just 18 distinct speakers, consisting of 3 males and 3 females from each of 3 dialect regions, with 2 sentences per speaker (the two SA sentences).


Fig.1(a,b) Histograms for data frames falling into each Gaussian in GMM32 for (a) gender, (b) dialect region.

Fig.1 shows histograms for gender (a) and dialect region (b), for each Gaussian in MFCC GMM32. Each Gaussian selects a subset of data from a different region of acoustic space, which accounts for a different proportion of data from each class division. For example, some Gaussians, like 0, 12, 23 and 28 in Fig.1a, clearly separate male from female (these are sonorants in Fig.2b), while others, like 3, 15 and 21, do not (voiceless obstruents and silence). Fig.1b shows that each dialect region has almost an equal probability of being in each Gaussian, which reflects the results in Table 2.

Fig.2(a,b) Histograms for data frames falling into each Gaussian in GMM32 for (a) speaker, (b) phone class.

Fig.2a shows that most Gaussians have data from most speakers (for the subset of 18 speakers described above), although each selects a different proportion of data from each speaker, sometimes excluding a number of speakers completely; Gaussian 19, for example, is dominated by a single speaker. Fig.2b shows that vowels and sonorant consonants (which carry more speaker distinguishing information) are clearly distinguished from obstruents and silence. Closer inspection of the data showed further systematic patterns. For instance, when obstruent and silence frames are represented by the same Gaussian, the obstruents are mainly voiceless ones (not shown here).
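Histograms such as those in Figs. 1 and 2 amount to contingency tables of Gaussian index against a class division. A one-line sketch (pandas assumed, records as built above):

    import pandas as pd

    def gaussian_histogram(records, division="GEN"):
        # One row per Gaussian, one column per category in the division.
        df = pd.DataFrame(records)
        return pd.crosstab(df["GID"], df[division])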


Fig.3 Histogram for phone class division P04 for the 2 Gaussians in MFCC GMM2.

Figure 3 shows the P04 phone class division for MFCC GMM2. A two-Gaussian GMM is often used to detect and remove "silence" frames in preprocessing for speaker recognition. This figure shows that the Gaussian most associated with non-speech or silence frames in fact also picks up a lot of speech frames; on closer inspection it was found that these frames are mostly voiceless obstruents, which have little speaker separation value and are therefore best ignored in speaker verification.
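A hedged sketch of this common preprocessing step, under our assumption that the 2-component GMM is fitted to per-frame log energy and the lower-mean component is treated as silence:

    import numpy as np
    from sklearn.mixture import GaussianMixture

    def drop_silence_frames(feats, log_energy):
        e = log_energy.reshape(-1, 1)
        gmm = GaussianMixture(n_components=2).fit(e)
        silence = int(np.argmin(gmm.means_.ravel()))  # low-energy component
        return feats[gmm.predict(e) != silence]       # keep speech-like frames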

2.4 Data analysis performed

The main goal of this article is to analyse the effect of the clustering identified by the GMM on speech and speaker class separabilities. Therefore, for each GMM, we wish to compare the clustering within the full dataset with the clustering within the subset identified by each Gaussian. For a given data subset X, the 7 different class divisions (P04, P07, P20, P61, SPK, GEN, DRE) each define a different clustering. The measures we used to characterise the clustering in a given dataset X, with respect to a given class division D, were as follows.

• Sep(D, X), the "(class) separability index", for a given class division D of data set X, is the ratio of the sum of between-class variances to the sum of within-class variances. Sep (Eq.1) increases as the number of clusters per class, the interlacing of data from different classes, and class boundary complexity decrease.

    Sep = trace(S_b) / trace(S_w)    (1)

• NH(D, X), the "normalised (class) entropy", is a measure in [0,1] of the uncertainty as to which category in class division D the data in a given set X belongs. NH (Eq.2) gives the proportion of maximum classification perplexity.

    NH(D, X) = H(D, X) / log2 |D|    (2)

• RI(D1, D2, X), the "normalised (mutual) information", or Relative Information [10], is a value in [0,1] which measures how strongly, for a given data subset X, the class in D1 is statistically dependent on the class in D2, and vice versa. RI (Eq.3) is obtained by suitable normalisation of the chi-squared statistic, L (Appendix A). It makes no assumptions about the mapping between D1 and D2, so it is independent of the type of relation between the two divisions.

    RI(D1, D2, X) = L(D1, D2, X) / (2 N log_e(min(|D1|, |D2|)))    (3)
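The three measures translate directly into code. The following numpy transcription of Eqs. (1)-(3) and the Appendix A formulae is ours, not the authors' original implementation (degenerate single-category inputs aside):

    import numpy as np

    def sep(X, y):
        """Eq. (1): trace(S_b) / trace(S_w), for data X with class labels y."""
        mu = X.mean(axis=0)
        tr_sw = tr_sb = 0.0
        for c in np.unique(y):
            Xc = X[y == c]
            p = len(Xc) / len(X)                        # P_i, relative frequency
            tr_sw += p * np.trace(np.cov(Xc, rowvar=False, bias=True))
            tr_sb += p * np.sum((Xc.mean(axis=0) - mu) ** 2)
        return tr_sb / tr_sw

    def nh(y):
        """Eq. (2): class entropy normalised by log2 of the category count."""
        p = np.bincount(np.unique(y, return_inverse=True)[1]) / len(y)
        return -(p * np.log2(p)).sum() / np.log2(len(p))

    def ri(y1, y2):
        """Eq. (3): Pearson chi-squared statistic L, normalised to [0,1]."""
        _, i1 = np.unique(y1, return_inverse=True)
        _, i2 = np.unique(y2, return_inverse=True)
        T = np.zeros((i1.max() + 1, i2.max() + 1))
        np.add.at(T, (i1, i2), 1)                       # contingency table
        N = T.sum()
        E = np.outer(T.sum(axis=1), T.sum(axis=0)) / N  # expected counts
        L = ((T - E) ** 2 / E).sum()
        return L / (2 * N * np.log(min(T.shape)))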

Our analysis is repeated for GMMs having 2, 4, 8, 16 and 32 Gaussians (32 Gaussians achieve 100% speaker identification on Timit), and for both MFCC and WAVC features. For speaker recognition, all variation apart from inter-speaker variation is harmful. Noise and channel distortion can be reduced through suitable feature processing, while coarticulation effects can be reduced by modelling with context sensitive speech units. But this still leaves phonetic variation, the main source of acoustic variability, uncorrected. In Section 3 we look at the benefit of reducing phonetic variation by conditioning on broad phonetic class.

3. Speaker subset separation conditioned on broad phonetic class

When data is restricted to one phonetic class, the proportion of inter-speaker acoustic variation relative to other sources of variation is increased, while each speaker is still represented by approximately the same amount of data. Broad phonetic class is often more reliably estimated than fine class, and in some situations, such as text prompted speaker recognition, the phoneme sequence is specified a priori. In this section we perform a systematic analysis of the relative effect on speaker, gender and dialect region separability of conditioning on each broad phonetic class.

The measure of class separability which we use in this section is the proportion of correct classification obtained by linear least mean squares (LMS) classification. With one 0/1 target output per class, the LMS solution (Eq.4) is known to estimate a-posteriori output class probabilities, and in the case of a linear classifier, the transformation so obtained is equivalent to that obtained by linear discriminant analysis (LDA) [8,3,15], which also maximises the ratio of between-class to within-class variances.

    W = X⁺ Y, where X⁺ = (X'X)⁻¹ X', the pseudo-inverse of X    (4)
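In code, the LMS projection of Eq. (4) is a single least-squares solve. The sketch below uses numpy's lstsq rather than forming (X'X)⁻¹X' explicitly, which is equivalent but numerically safer; a bias column could be appended to X, and is omitted here for brevity.

    import numpy as np

    def lms_train(X, y):
        """X: (N, d) features; y: (N,) integer labels in 0..K-1.
        Returns W such that X @ W estimates the K class posteriors."""
        Y = np.eye(y.max() + 1)[y]                 # 0/1 target matrix
        W, *_ = np.linalg.lstsq(X, Y, rcond=None)  # W = X^+ Y  (Eq. 4)
        return W

    def lms_classify(W, X):
        return np.argmax(X @ W, axis=1)            # pick most probable class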

We tried projecting each point, either from its MFCC features or from its associated log Gaussian probabilities, onto class probabilities. While the classification performance of these two projections was similar, results in Table 2 below are reported only for the latter. There is not space here to show the full table of results. The top and bottom three broad phonetic classes for predicting speaker, gender and dialect region are shown in Table 2.

  Speaker               Gender                Dialect region
  nas           27.2    opn fro vow   94.7    opn cen vow   22.9
  opn cen vow   25.3    gli           94.3    pau           21.8
  mid bak vow   23.1    clo bak vow   93.8    nas           21.7
  …                     …                     …
  vcl fri        2.5    vcl plo rel   73.6    vcl plo rel   18.2
  vcl plo rel    2.4    bre           71.2    vcl clo       18.2
  vcl clo        2.2    vcl clo       70.9    vcl fri       18.0
  average       10.7    average       80.9    average       18.0
  random choice  0.2    random choice 50.0    random choice 12.5

Table 2. Percent correct speaker, gender and dialect region classification by LDA [3] on data within the set belonging to each of the 20 broad phonetic classes in set P20; the top and bottom three classes are shown for each task. (nas=nasal, opn=open, cen=central, mid=mid, bak=back, fro=front, vow=vowel, gli=glide, bre=breathing, vcl=voiceless, plo=plosive, rel=release, clo=closure, fri=fricative, pau=pause).

We see in Table 2 that dialect region was hardly separated at all, even when conditioned on broad phone class (P20). This suggests that, within this database at least, dialect is not well characterised by the MFCC features obtained within a 20ms window. Gender is best separated by sonorants (top 3 lines). Speakers were best separated by nasals, which convey the characteristic shape of the nasal cavity through their timbre, as well as pitch. All classes were worst separated by voiceless sounds, which carry least information about vocal tract shape and none about the characteristics of the glottal source. In Section 4 we analyse the effect on speaker separability of conditioning based on knowledge only of the GMM cluster to which each data frame belongs.


4. Speech and speaker separation conditioned on Gaussian class

In the previous section we noted that phoneme classes do not in fact correspond directly to the actual clustering of acoustic data. There are also many situations in which phoneme class is not readily available. We therefore look here at the extent to which class separability can be conditioned on knowledge of which cluster (i.e. which Gaussian in the GMM) each data frame belongs to. Sep and NH values were obtained for every Gaussian within each of GMM2 to GMM32. We report here the average Sep and NH values over all of the Gaussians within each GMM. We also report the RI value for each GMM, because it provides a direct measure of statistical dependence, whereas Sep measures only linear separability. In order to test the sensitivity of the results to the choice of data features, all tests were repeated for WAVC as well as MFCC features.

Fig.4(a,b) Fig.(a) (left) shows speech and speaker separability Sep values for MFCC data against the total number of Gaussians. Fig.(b) (right) shows the same for WAVC features. In both cases speech class separability decreases, and speaker separability increases, with the number of Gaussians in the GMM clustering.

In Fig.4(a,b) we see that as the number of Gaussians increases, phone class separability for MFCC data decreases by about 50% (overall), while speaker separability increases by a factor of about 3. For GMM32, speaker separability is greater than phone separability. Approximately the same is true for WAVC data, although the wavelet features provide greater linear speech separability for GMM1 (the raw data without GMM modelling).


Fig.5(a,b) Fig.(a) (left) shows phone and speaker normalised entropy NH values for MFCC data against the total number of Gaussians. Fig.(b) (right) shows the same for WAVC features. In both cases phone entropy is consistently lower than speaker entropy, and decreases as the number of Gaussians increases.

In Fig.5 it can be observed that phone entropy is lower than speaker entropy for all GMMs, and decreases as the number of Gaussians increases. By contrast, both speaker and dialect region entropies are close to their maximum possible value (1.0) for every GMM, with little decrease as the number of Gaussians increases. This confirms our hypothesis (2) that, unlike phoneme classes, each of which is well clustered and little fragmented, the distribution of speaker data is much more 'holistic', being almost invariant with respect to the region of feature space sampled.

Fig.6(a,b) Fig.(a) (left) shows the RI between each of the speech and speaker partitions and Gaussian index, for MFCC data, against the total number of Gaussians. Fig.(b) (right) shows the same for WAVC features.

In Fig.6 we see that Gaussian index is strongly dependent on phonetic class for all GMMs (confirming hypothesis 1), while speaker RI increases at first, but levels out at a low level (confirming hypothesis 2). RI for dialect region is near zero throughout, showing that neither MFCC nor WAVC features capture the information required for dialect region separation. Gender dependence keeps increasing with the number of Gaussians in the GMM. This indicates an increasing separation of phonetic clusters into separate sets for males and females.


Fig.7(a,b) Fig.(a) (left) shows speaker separability Sep values for MFCC data in each Gaussian of GMM32. Fig.(b) (right) shows NH values for MFCC data in each Gaussian of GMM32.

In Fig.7 we see that speaker separability varies strongly between Gaussian subsets for GMM32, but Sep does not correlate with NH. This suggests that the observed differences in speaker separability within Gaussian clusters are mainly due to differences in speaker class overlap and/or fragmentation, rather than to differences in speaker distribution perplexity (the number of different speakers).

5. Discussion

In classification problems such as phoneme recognition, where classes are not heavily overlapping or entangled, LDA or NLDA [4] can sometimes be used to project all data features onto a new set of features which are better separated. LDA has been applied in this way to speaker separation, but without dramatic improvement. We have tried the straightforward application of NLDA (by MLP) to speaker separation and this also does not help. In [6] an MLP was trained for each speaker by further training of an MLP, previously trained to separate phonemes, on data from this speaker. However, each MLP was very large, and no attempt was made to condition the selection of data processor on the phonetic identity of the data.

In this article we have identified cluster entanglement, rather than perplexity or class overlap, as the major factor limiting speaker separability in the Timit speech database. In Section 3 we used LDA based classification to show that speaker entanglement can be reduced by conditioning on broad phonetic class. In Section 4 we used the three measures Sep, NH and RI to show that as the number of Gaussians in the GMM increases from 2 to 32, the average speech entropy in each Gaussian decreases, while the average speaker entropy remains near constant, with the effect that the ratio of speaker to speech separability increases from 0.12 to 1.75. This establishes the validity of hypothesis (3) in Section 1, i.e. that the entanglement of speaker classes within each Gaussian subset is significantly reduced, thereby increasing speaker separability. This suggests that a GMM could be used to select a different NLDA separation according to which Gaussian each point belongs to, thereby permitting us to harness the discriminative power of MLPs for improved speaker identification.

In [4] an MLP was trained to output phoneme probabilities. For speaker separation the number of speakers is unlimited, but an MLP (or LDA) could be trained to output probabilities for some limited subset of representative speakers, with the expectation that any transformation which improved separation for this group of speakers would also, to some extent, improve separation for most speakers. In [3,8,15] it is shown that training any classifier to estimate class posterior probabilities can be achieved by least mean squares error (LMS) training on the difference between outputs and target 0/1 probabilities, or by maximising the mutual information between them. It is further shown that, in the case of a linear classifier, the LDA solution (which maximises Sep) is equivalent to the LMS solution. It follows that, for a linear classifier, class probabilities are also the best separating features. For a non-linear classifier it was found in [4] that not the probability estimates themselves, but the pre-squashed MLP outputs, or the outputs from the preceding hidden layer, have a more Gaussian distribution and are thereby better suited to GMM modelling.

The class separation power of MLPs could therefore possibly be harnessed for improving speaker separation as follows (a sketch is given after this list).

1. Train a GMM on all of the speech data, as usual for speaker recognition.
2. Train a separate MLP (or LDA) on the data from each Gaussian subset, to output a probability for each speaker from a representative subset of speakers.
3. Replace the original features for each data point by the (pre-squashed) speaker class probabilities from the MLP (or LDA) corresponding to the Gaussian to which this point belongs.
4. Train a new GMM on these new features and perform speaker recognition as usual.
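A minimal sketch of this four-step pipeline follows (scikit-learn assumed). Two assumptions of ours: log speaker probabilities stand in for the pre-squashed MLP outputs, which scikit-learn does not expose, and every representative speaker is assumed to occur in every Gaussian subset.

    import numpy as np
    from sklearn.mixture import GaussianMixture
    from sklearn.neural_network import MLPClassifier

    def fit_piecewise_mlps(X, spk, n_gauss=32):
        world = GaussianMixture(n_components=n_gauss).fit(X)   # step 1
        gid = world.predict(X)
        mlps = {g: MLPClassifier(hidden_layer_sizes=(100,), max_iter=500)
                    .fit(X[gid == g], spk[gid == g])           # step 2
                for g in range(n_gauss)}
        return world, mlps

    def transform(world, mlps, X, n_spk):
        gid = world.predict(X)                                 # step 3
        out = np.zeros((len(X), n_spk))
        for g, mlp in mlps.items():
            m = gid == g
            if m.any():
                out[m] = np.log(mlp.predict_proba(X[m]) + 1e-10)
        return out

    # Step 4: train per-speaker GMMs on transform(...) output, as in Section 2.2.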


It is possible that a soft rather than hard assignment of points to Gaussian subsets would be beneficial. This could be achieved by taking a weighted sum of the outputs from the MLP for each Gaussian, weighting each by the point's Gaussian probability density.
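In symbols (our notation, with f_i and P_i as in Appendix A), the soft combination would replace the hard assignment by posterior-weighted mixing of the per-Gaussian MLP outputs:

    y(x) = \sum_i w_i(x)\,\mathrm{MLP}_i(x), \qquad w_i(x) = \frac{P_i f_i(x)}{\sum_j P_j f_j(x)}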

6. Conclusion

In this article we identified cluster entanglement as the major factor limiting speaker class separability. We showed how a GMM can be used to partition data into subsets within which speaker class separation is significantly enhanced. We proposed a model based on piecewise (N)LDA which could be used to exploit this. These models remain to be tested.

Acknowledgements

This work was supported by the EC SecurePhone project IST-2002-5063. We would also like to thank Dietrich Klakow for developing our wavelet encoding.


Appendix A: Notation and standard formulae

Notation

  x                 a feature data point (vector) from Timit
  X                 data matrix, with x as rows
  Y                 target output matrix, with 0/1 target vectors as rows
  f_i(x)            Gaussian pdf for GMM cluster i
  G_i               {x : f_i(x) ≥ f_j(x), ∀ j ≠ i}, the set of x in Gaussian cluster i
  D                 speech or speaker class division (e.g. phone, speaker, or gender)
  d_i               category within class division D (e.g. male, female)
  |D|               number of categories in class division D
  C_i               set of data points x in category d_i
  X_i               D ∩ G_i, the set of all D in Gaussian i
  S_i               covariance matrix for data set X_i
  S_b               between-class covariance matrix
  S_w               within-class covariance matrix
  |X|               number of data points in set X
  T(D1, D2, X)      contingency table of counts of x for class division D1 against D2
  n_ij              element (i,j) of T
  n_i, n_j          row and column totals of T
  N                 Σ_ij n_ij

Standard formulae

  µ_i = Σ_{x∈C_i} x / |C_i|, mean of data in category C_i

  S_i = E[(x − µ_i)(x − µ_i)'] = C_i C_i' / |C_i| − µ_i µ_i', within-class covariance matrix

  P_i = |C_i| / |G_i|, relative frequency of category C_i

  µ = Σ_i P_i µ_i, overall data mean

  S_b = Σ_i P_i (µ_i − µ)(µ_i − µ)', between-class covariance matrix

  S_w = Σ_i P_i S_i, overall within-class covariance matrix

  H(D) = − Σ_{d∈D} P(d) log₂ P(d), entropy of the probability distribution of D

  L(D1, D2) = Σ_ij (n_ij − n_i n_j / N)² / (n_i n_j / N), Pearson's large sample (or chi-squared) statistic


References

[1] H. Bourlard & N. Morgan (1994), Connectionist Speech Recognition: A Hybrid Approach, Kluwer.
[2] R. Collobert, S. Bengio & J. Mariéthoz (2002), "Torch: a modular machine learning software library", Technical Report IDIAP-RR 02-46. http://www.idiap.ch/%7Ebengio/cv/publications/pdf/rr02-46.pdf ; http://www.torch.ch/credits.php
[3] R.O. Duda, P.E. Hart & D.G. Stork (2001), Pattern Classification, Wiley. (reference for LDA)
[4] D. Ellis & M. Reyes-Gomez (2001), "Investigations into Tandem acoustic modeling for the Aurora task", Proc. Eurospeech 2001, pp. 189-192. (MLPs used for NLDA data enhancement in noise robust ASR)
[5] W.M. Fisher, G.R. Doddington & K.M. Goudie-Marshall (1986), "The DARPA speech recognition research database: specifications and status", Proc. DARPA Workshop on Speech Recognition, February 1986, pp. 93-99. (reference for TIMIT)
[6] D. Genoud, D. Ellis & N. Morgan (1999), "Combined speech and speaker recognition with speaker-adapted connectionist models", Proc. ASRU.
[7] D. George & P. Mallery (2002), SPSS for Windows Step by Step: A Simple Guide and Reference, 4th Ed., Allyn & Bacon.
[8] B. Gold & N. Morgan (2000), Speech and Audio Signal Processing: Processing and Perception of Speech and Music, Wiley.
[9] J. Mariéthoz & S. Bengio (2002), "A comparative study of adaptation methods for speaker verification", Proc. ICSLP 2002. ftp://ftp.idiap.ch/pub/reports/2002/mariethoz-icslp2002.pdf
[10] A.C. Morris (2002), "An information theoretic measure of sequence recognition performance", IDIAP Communication com02-03. (relates RI to the chi-squared statistic) ftp://ftp.idiap.ch/pub/reports/2002/com02-03.pdf
[11] D.A. Reynolds (1995), "Speaker identification and verification using Gaussian mixture speaker models", Speech Communication, 17, pp. 91-108.
[12] D.A. Reynolds, M.A. Zissman, T.F. Quatieri, G.C. O'Leary & B.A. Carlson (1995), "The effect of telephone transmission degradations on speaker recognition performance", Proc. ICASSP'95, pp. 329-332.
[13] R. Sarikaya, B.L. Pellom & J.H.L. Hansen (1998), "Wavelet packet transform features with applications to speaker identification", Proc. IEEE Nordic Signal Processing Symposium, NORSIG'98, pp. 81-84. (reference for the MFCC and wavelet features used)
[14] D.X. Sun & L. Deng (1995), "Analysis of acoustic-phonetic variations in speech", Proc. ICASSP'95.
[15] S. Theodoridis & K. Koutroumbas (2003), Pattern Recognition, 2nd Ed., Academic Press.
[16] S. Young et al., The HTK Book (V2.2), Cambridge University Engineering Dept.


Andrew Morris received a PhD related to ASR from the Institute for Spoken Communication (ICP), INPG, France, in 1992. He has continued his research at the Speech Technology Group (GTH), ETSIT, UPM, Spain; the Speech and Hearing research group (SpandH), DCS, University of Sheffield, UK; and the Dalle Molle Institute for Perceptual Artificial Intelligence (IDIAP), Switzerland. He is now with the Phonetics group at Saarbruecken University, Germany.

Dalei Wu received an MSc in Computer Science from Tsinghua University, China, in 2002. During 2002-2004 he worked on video compression and the implementation of ASR on PDAs. He is now studying for a PhD in the area of speaker recognition with the Phonetics group at Saarbruecken University, Germany.

Jacques Koreman received a PhD in phonetics from Nijmegen University in 1996. He has worked in experimental phonetics, particularly on voice production and speech perception, and has also worked on phonetics in speech technology for several years. He is now a senior lecturer in Phonetics at Saarland University, where he is engaged in research in phonetics and its application to automatic speech recognition.
