J Med Syst DOI 10.1007/s10916-010-9641-6

ORIGINAL PAPER

Classification of Speech Dysfluencies Using LPC Based Parameterization Techniques

M. Hariharan & Lim Sin Chee & Ooi Chia Ai & Sazali Yaacob

Received: 22 October 2010 / Accepted: 8 December 2010
© Springer Science+Business Media, LLC 2011

Abstract The goal of this paper is to discuss and compare three feature extraction methods, Linear Predictive Coefficients (LPC), Linear Prediction Cepstral Coefficients (LPCC) and Weighted Linear Prediction Cepstral Coefficients (WLPCC), for recognizing stuttered events. Speech samples from the University College London Archive of Stuttered Speech (UCLASS) were used for our analysis. The stuttered events were identified through manual segmentation and were used for feature extraction. Two simple classifiers, k-nearest neighbour (kNN) and Linear Discriminant Analysis (LDA), were employed for speech dysfluencies classification. Conventional validation was used to test the reliability of the classifier results. The effects of different frame lengths, percentages of overlapping, values of ã in the first-order pre-emphasizer and different orders p are discussed. The speech dysfluencies classification accuracy was found to improve when statistical normalization was applied before feature extraction. The experimental investigation elucidated that LPC, LPCC and WLPCC features can all be used for identifying stuttered events, and that WLPCC features slightly outperform LPCC and LPC features.

Keywords Stuttering . LPC . LPCC . WLPCC . kNN . LDA

M. Hariharan (*) : L. S. Chee : O. C. Ai : S. Yaacob
School of Mechatronic Engineering, UniMAP, Perlis, Malaysia
e-mail: [email protected]

Introduction

Speech is a verbal means used by humans to express their feelings, ideas and thoughts in communication. Speech consists of articulation, voice and fluency. However,

approximately 1% of the population has a noticeable speech stuttering problem, and stuttering has been found to affect females and males in a ratio of about 1:3 or 1:4 [1, 2]. Stuttering is defined as a disruption of the normal flow of speech by involuntary dysfluencies such as repetitions, prolongations and interjections of syllables, sounds, words or phrases, and involuntary silent pauses or blocks in communication [1, 3]. Stuttering cannot be completely cured, though it may go into remission for some time [1]. Stutterers can learn to shape their speech into fluent speech with appropriate speech pathology treatments; therefore, a stuttering assessment is needed to evaluate the performance of stutterers before and after therapy. Traditionally, a speech language pathologist (SLP) manually counts and classifies occurrences of dysfluencies such as repetition and prolongation in stuttered speech. However, this type of stuttering assessment is subjective, inconsistent, time consuming and prone to error [1, 4–8]. It would therefore be desirable for stuttering assessment to be done automatically, leaving more time for the treatment sessions between the stutterer and the SLP. In the past two decades, researchers have focused on developing objective methods to facilitate the SLP during stuttering assessment. Table 1 depicts some of the significant research works conducted over the last two decades, in chronological order. Based on these previous works, it can be concluded that feature extraction algorithms and classification methods play important roles in the area of stuttered event recognition. In this work, LPC, LPCC and WLPCC were used to extract salient features from the stuttered events, and their effectiveness was evaluated using two simple classifiers, kNN and LDA. In addition, a normalization technique was employed to enhance the classification accuracy. Finally, conventional validation was applied to assess the reliability of the classifiers.

Table 1 Summary of several research works on stuttering recognition, detailing the number of subjects, the features used and the classifiers employed

First Author | Database | Features | Classifiers | Best Results (%)
Howell [10] | – | Autocorrelation function and envelope parameters | ANNs | ≈80%
Howell [7, 8] | 12 speakers (UCLASS) | Duration, energy peaks, spectral of word based and part word based | ANNs | 78.01%
Geetha [11] | 51 speakers | Age, sex, type of dysfluency, frequency of dysfluency, duration, physical concomitant, rate of speech, historical, attitudinal and behavioral scores, family history | ANNs | 92%
Nöth [9] | 37 speakers | Duration and frequency of dysfluent portions, speaking rate | HMMs | –
Czyzewski [12] | 6 normal speech samples + 6 stop-gaps speech samples | Frequency, 1st to 3rd formant frequencies and their amplitudes | ANNs & rough set | 73.25% & ≥90.0%
Prakash [13] | 10 normal + 10 stuttering children | Formant patterns, speed of transitions, F2 transition duration and F2 transition range | – | –
Szczurowska [14] | 8 speakers | Spectral measure (FFT 512) | Multilayer Perceptron (MLP), Kohonen | 76.67%
Wiśniewski [15] | 38 samples for prolongation of fricatives + 30 samples for stop blockade + 30 free of silence samples | Mel Frequency Cepstral Coefficients (MFCC) | HMMs | 70%
Wiśniewski [16] | – | Mel Frequency Cepstral Coefficients (MFCC) | HMMs | Approximately 80%
Tian-Swee [4] | 15 normal speakers + 10 artificial stuttered speech | Mel Frequency Cepstral Coefficients (MFCC) | HMMs | 96%
Ravikumar [6] | 10 speakers | Mel Frequency Cepstral Coefficients (MFCC) | Perceptron | 83%
Ravikumar [5] | 15 speakers | Mel Frequency Cepstral Coefficients (MFCC) | SVM | 94.35%
Świetlicka [17] | 8 stuttering speakers + 4 normal speakers (yielding 59 fluent + 59 non-fluent speech samples) | Spectral measure (FFT 512) | Kohonen, Multilayer Perceptron (MLP), Radial Basis Function (RBF) | 88.1%–94.9%
Sin Chee [18] | 10 speakers | Mel Frequency Cepstral Coefficients (MFCC) | kNN, LDA | 90.91%
Sin Chee [19] | 10 speakers | Linear Prediction Cepstral Coefficients (LPCC) | kNN, LDA | 89.77%

The rest of this paper is organized as follows. The methodology of the system is presented in "Methodology"; this section also covers the database used in the experiments, the feature extraction algorithms and the classification techniques. Experimental results and discussion are presented in "Results and discussion". Finally, conclusions are drawn in "Conclusion".

Fig. 1 Block diagram of stuttering recognition (stuttered speech → segmentation → signal pre-processing (signal normalization) → feature extraction → classification → output)

Methodology

In this paper, dysfluencies classification is divided into four stages as illustrated in Fig. 1: segmentation, signal preprocessing, feature extraction and classification. LPC, LPCC and WLPCC were used as feature extraction algorithms; kNN and LDA were employed to evaluate the effectiveness of the features in the recognition of repetition


and prolongation in stuttered speech. In order to enhance the effectiveness of the features, signal normalization (SN) was applied, as shown by the grey block in Fig. 1. The effectiveness of the features was compared with one another. However, it is hard to compare results when experiments are done using different databases; thus, in this work, the features were compared using the same stuttered speech database, which can be accessed easily on the University College London Archive of Stuttered Speech (UCLASS) website [19].
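As a rough illustration of how these four stages fit together, the following Python sketch outlines the processing flow of Fig. 1. It is not the authors' MATLAB implementation: the feature extractor and classifier are passed in as placeholder callables that stand in for the LPC/LPCC/WLPCC and kNN/LDA steps detailed in the following subsections.

```python
import numpy as np

def stuttering_recognition_pipeline(events, extract_features, classifier):
    """Flow of Fig. 1 after manual segmentation: each segmented dysfluency
    event is normalized (SN), converted to LPC-based features and passed to
    a trained classifier (kNN or LDA).  All helper names are illustrative."""
    labels = []
    for event in events:
        x = (event - np.mean(event)) / np.std(event)    # signal normalization (SN)
        labels.append(classifier(extract_features(x)))  # 'repetition' or 'prolongation'
    return labels
```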

Segmentation

The database was obtained from UCLASS [19]. It consists of three types of recordings: monologues, readings and conversations. There are 43 different speakers contributing 107 "reading" recordings, and Table 2 shows the distribution of the "reading" recording database. In this work, only a subset of the available samples, 39 speech samples, was taken from the UCLASS archive [20] for analysis. It includes 2 female speakers and 37 male speakers whose ages range between 11 years 2 months and 20 years 1 month. The samples were chosen to cover a broad range of both age and stuttering rate. Most of the "reading" samples do not have a text script, and hence only "reading" samples with a text script were chosen for our investigation. In this study, only two types of lexical dysfluencies, namely prolongation and repetition, were investigated. Both types of dysfluency can be detected easily in monosyllabic words; thus only speech samples with the passages "One more week to Easter" and "Arthur the rat" were selected, since 90% of their content consists of monosyllabic words [7]. Each of the two passages consists of more than 300 words. By listening to the stuttered speech signals, the two types of dysfluencies were identified and segmented manually. The segmented speech samples were then subjected to speech signal pre-processing.

Speech signal pre-processing

The original sampling frequency of the speech samples is 44.1 kHz. For speech processing purposes, each of the speech samples was down-sampled to 16 kHz. This is reasonable for the speech processing task in the present work, because most of the salient speech features lie within an 8 kHz bandwidth [21]. Figure 2 displays the speech signal pre-processing; the recognition system was implemented in MATLAB. Before the pre-emphasis stage, all sampled speech signals were normalized. The purpose of normalization is to prevent estimation errors caused by variations in the speakers' volume. The normalization formula used is expressed in Eq. 1.

Fig. 2 Block diagram of speech signal pre-processing (speech signal sampled at 16 kHz → signal normalization (SN) → pre-emphasis → frame-blocking s'(n) → windowing w(n) → x(n))

$$\hat{S}_i = \frac{S_i - \mu_i}{\sigma_i} \qquad (1)$$

where Ŝ_i and S_i are the ith elements of the signal after and before signal normalization respectively, μ_i is the mean of vector S and σ_i is the standard deviation of vector S. The normalized speech signals were pre-emphasized to spectrally flatten the signal and to even out the spectral energy envelope by amplifying the important high-frequency components and removing the DC component of the signal. A digital filter, shown in Eq. 2, was used to boost the high-frequency components:

$$s'(n) = s(n) - \tilde{a}\, s(n-1) \qquad (2)$$

Applying the Z-transform to both sides of Eq. 2, the first-order pre-emphasis filter has the transfer function

$$H(z) = 1 - \tilde{a}\, z^{-1}, \qquad 0.9 \le \tilde{a} \le 1.0 \qquad (3)$$

where ã is a positive parameter used to control the degree of pre-emphasis filtering. Normally ã is chosen between 0.9 and 1.0. For fixed-point implementations a value of ã = 15/16 = 0.9375 is often used with a sampling rate of 8 kHz [22].
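The normalization of Eq. 1 and the pre-emphasis filter of Eqs. 2 and 3 are straightforward to implement. The paper's system was written in MATLAB; the NumPy/SciPy sketch below is an illustrative equivalent under that assumption, with ã = 0.99 (the value found optimal later in the paper) used as a default and a synthetic signal standing in for a UCLASS sample.

```python
import numpy as np
from scipy.signal import freqz

def normalize(s):
    """Statistical normalization (Eq. 1): zero mean, unit standard deviation,
    to reduce the effect of differences in speakers' volume."""
    return (s - np.mean(s)) / np.std(s)

def pre_emphasize(s, a=0.99):
    """First-order pre-emphasis (Eq. 2): s'(n) = s(n) - a*s(n-1),
    i.e. filtering with H(z) = 1 - a*z^{-1} (Eq. 3)."""
    return np.append(s[0], s[1:] - a * s[:-1])

# Example with a synthetic 16 kHz signal standing in for a speech sample.
fs = 16000
t = np.arange(fs) / fs
speech = 0.3 * np.sin(2 * np.pi * 220 * t) + 0.05 * np.random.randn(fs)
pre = pre_emphasize(normalize(speech), a=0.99)

# Magnitude response of the pre-emphasis filter (cf. Fig. 3).
w, h = freqz([1.0, -0.99], worN=512, fs=fs)
```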

Table 2 Age and gender distribution of the reading recordings in the chosen subset of UCLASS databases

Age Range | Gender | Recording | Speaker
7y10m–20y7m | Male | 92 | 38
7y10m–20y7m | Female | 15 | 5


Figure 3 presents the magnitude response of the pre-emphasis filter at a sampling rate of 16 kHz, with ã varied from 0.91 to 0.99. The magnitude responses of the pre-emphasis filter differ only slightly from 0 Hz to 1 kHz. The effect of different ã values on classification was examined.

Fig. 3 Magnitude responses of the pre-emphasis filter

The pre-emphasized signal was blocked into short overlapping frames, and each frame was multiplied by a Hamming window to minimize signal discontinuities. The frame length was varied from 10 ms to 50 ms with different percentages of overlapping for further investigation. The speech features used for classification were extracted from each frame; the feature extraction process is discussed in the next section.
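A frame-blocking and Hamming-windowing step along these lines might look as follows. The 20 ms frame length and 75% overlap defaults reflect the best values reported later in the paper, and the function name is an illustrative assumption rather than the authors' code.

```python
import numpy as np

def frame_signal(signal, fs=16000, frame_ms=20, overlap=0.75):
    """Block a pre-emphasized signal into overlapping frames and apply a
    Hamming window to each frame to reduce edge discontinuities."""
    frame_len = int(fs * frame_ms / 1000)           # e.g. 320 samples at 16 kHz
    hop = max(1, int(frame_len * (1.0 - overlap)))  # frame shift in samples
    window = np.hamming(frame_len)
    n_frames = max(0, 1 + (len(signal) - frame_len) // hop)
    frames = np.empty((n_frames, frame_len))
    for i in range(n_frames):
        start = i * hop
        frames[i] = signal[start:start + frame_len] * window
    return frames

# Usage: frames = frame_signal(pre, fs=16000, frame_ms=20, overlap=0.75)
```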

Feature extraction: Linear prediction

Three acoustical features were extracted from the stuttered speech signals; an overview of the suggested feature extraction process is depicted in Fig. 4. Linear prediction is a speech coding technique used to extract the relevant information that represents the signal. The fundamental idea of linear predictive coding is that a given speech sample at time n, ŝ(n), can be estimated as a linear combination of the past p speech samples, where p represents the order of the LPC [22]:

$$\hat{s}(n) = \sum_{m=1}^{p} a_m\, s(n-m) \qquad (4)$$

Fig. 4 Overview of linear prediction process (autocorrelation analysis → LPC analysis → LPC → LPC parameter conversion → LPCC → parameter weighting → WLPCC)

The prediction error, e(n), is the difference between the actual and the estimated sample values, defined as

$$e(n) = s(n) - \hat{s}(n) = s(n) - \sum_{m=1}^{p} a_m\, s(n-m) \qquad (5)$$

where a_m are the linear prediction coefficients (LPC). The LPC were calculated by minimizing the mean squared error over a frame of the speech signal; the steps are explained in [22]. The autocorrelation method was therefore applied to each frame of the windowed signal x(n), as described by Eqs. 6 and 7:

$$r(m) = \sum_{n=0}^{N-1-m} x(n)\, x(n+m), \qquad m = 0, 1, \ldots, p \qquad (6)$$

Since the autocorrelation function is symmetric, the LPC equations can be stated as

$$\sum_{k=1}^{p} r(|m-k|)\, a_k = r(m), \qquad 1 \le m \le p \qquad (7)$$

Linear Prediction Cepstral Coefficients (LPCC) are the LPC represented in the cepstrum domain; they are the coefficients of the Fourier transform representation of the log magnitude spectrum. LPCC have been proven to be more robust and reliable than LPC [22, 23]. In this study, the number of LPCC used to represent each frame was calculated by applying Q ≈ (3/2)p with Q > p. The LPCC were obtained directly from the LPC using the recursion

$$c_0 = \ln \sigma^2 \qquad (8)$$


$$c_m = a_m + \sum_{k=1}^{m-1} \frac{k}{m}\, c_k\, a_{m-k}, \qquad 1 \le m \le p \qquad (9)$$

$$c_m = \sum_{k=1}^{m-1} \frac{k}{m}\, c_k\, a_{m-k}, \qquad m > p \qquad (10)$$

where σ² is the gain term in the LPC model, c_m are the Linear Prediction Cepstral Coefficients (LPCC), a_m are the Linear Prediction Coefficients (LPC) and p is the order.

Weighted Linear Prediction Cepstral Coefficients (WLPCC) can be easily generated by multiplying the LPCC with the weighting function

$$w_m = 1 + \frac{Q}{2}\,\sin\!\left(\frac{\pi m}{Q}\right), \qquad 1 \le m \le Q \qquad (11)$$

The weighting function acts like a bandpass filter in the cepstral domain that de-emphasizes c_m around m = 1 and around m = Q. In addition, it reduces the sensitivity of the low-order cepstral coefficients to the overall spectral slope (m = 1) and the sensitivity of the high-order cepstral coefficients to noise (m = Q). The WLPCC, ĉ_m, were determined by

$$\hat{c}_m = w_m\, c_m, \qquad 1 \le m \le Q \qquad (12)$$

In the literature, WLPCC have been used as a feature extraction method in other applications [24, 25]. In this study, the numbers of LPC, LPCC and WLPCC used to represent each frame for the two types of dysfluencies classification depended on the value of p. Values of p from 8 to 16 have been widely used with a sampling frequency of 8 kHz [22, 26]; thus the value of p should be decided based on the application and sampling frequency. The effect of p on the classification of the two types of dysfluencies was analyzed. Furthermore, the effects of three other parameters, namely the percentage of overlapping, the frame length and the ã value of the first-order pre-emphasis filter, on the speech dysfluencies classification were investigated. The results of these analyses are discussed in the "Results and discussion" section.
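Putting Eqs. 4-12 together, one possible per-frame implementation of the LPC, LPCC and WLPCC features is sketched below. It uses the Levinson-Durbin recursion to solve the normal equations of Eq. 7; the paper does not prescribe a particular solver, so this choice, the helper names and the random stand-in frame are assumptions.

```python
import numpy as np

def autocorr(frame, p):
    """Autocorrelation r(0..p) of a windowed frame (Eq. 6)."""
    n = len(frame)
    return np.array([np.dot(frame[:n - m], frame[m:]) for m in range(p + 1)])

def levinson_durbin(r, p):
    """Solve the normal equations of Eq. 7 for a_1..a_p; also returns the
    residual energy, used as the gain term sigma^2 in Eq. 8."""
    a = np.zeros(p + 1)
    e = r[0]
    for i in range(1, p + 1):
        k = (r[i] - np.dot(a[1:i], r[i - 1:0:-1])) / e   # reflection coefficient
        a_prev = a.copy()
        a[i] = k
        a[1:i] = a_prev[1:i] - k * a_prev[i - 1:0:-1]
        e *= (1.0 - k * k)
    return a[1:], e

def lpc_to_lpcc(a, sigma2, Q):
    """Cepstral recursion of Eqs. 8-10: p LPC -> Q LPCC (Q > p)."""
    p = len(a)
    c = np.zeros(Q + 1)
    c[0] = np.log(sigma2)
    for m in range(1, Q + 1):
        acc = sum((k / m) * c[k] * a[m - k - 1] for k in range(1, m) if m - k <= p)
        c[m] = acc + (a[m - 1] if m <= p else 0.0)
    return c[1:]

def lpcc_to_wlpcc(c):
    """Cepstral weighting of Eqs. 11-12: w_m = 1 + (Q/2) sin(pi*m/Q)."""
    Q = len(c)
    m = np.arange(1, Q + 1)
    return (1.0 + (Q / 2.0) * np.sin(np.pi * m / Q)) * c

# Per-frame usage with p = 8 and Q = 3p/2 = 12; 'frame' stands in for one
# Hamming-windowed 20 ms frame (320 samples at 16 kHz) from pre-processing.
p, Q = 8, 12
frame = np.hamming(320) * np.random.randn(320)
r = autocorr(frame, p)
a, sigma2 = levinson_durbin(r, p)
wlpcc_features = lpcc_to_wlpcc(lpc_to_lpcc(a, sigma2, Q))
```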

Classification

This section explains the elementary theory of k-Nearest Neighbour (kNN) and Linear Discriminant Analysis (LDA). The input to both classifiers is the set of feature vectors discussed above. Conventional validation was performed with 70% of the data used for training and 30% of the data used for testing. Data in both the training and testing sets were shuffled randomly.

k-nearest neighbour (kNN)

kNN is an uncomplicated classification model that employs lazy learning [27]. It is a supervised learning algorithm that classifies a new query instance based on the majority category of its k nearest neighbours. The minimum distance between the query instance and each member of the training set is calculated to determine the kNN category. Each query instance (test speech signal) is compared against each training instance (training speech signal), and the kNN prediction of the query instance is determined by majority voting among the nearest-neighbour categories. Since each query instance is compared against all training speech signals, kNN incurs a high response time [27]. In this work, for each test speech signal (to be predicted), the minimum distance from the test speech signal to each training speech signal was calculated to locate the kNN category in the training data set. The Euclidean distance dE(x, y) was used to compute the distance between training and testing signals, where x and y are training and testing speech signals composed of N features, namely x = {x1, x2, ..., xN} and y = {y1, y2, ..., yN}. The Euclidean distance is defined in Eq. 13.

$$d_E(x, y) = \sqrt{\sum_{i=1}^{N} (x_i - y_i)^2} \qquad (13)$$

From this kNN category, the class label of the test speech signal was determined by applying majority voting among the k nearest training speech samples. The value of k therefore plays an important role in kNN classification. Generally, the k value should be determined in advance, and the best choice depends on the data: larger values of k reduce the effect of noise on the classification but make the boundaries between classes less distinct [28]. Therefore, in this study, for each experiment, k values ranging between 1 and 10 were applied. For each value of k, the experiment was repeated 10 times, each time with different training and testing sets.
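A minimal sketch of this kNN procedure with the Euclidean distance of Eq. 13 and the 70/30 conventional-validation split might look as follows. The synthetic feature vectors and class labels are placeholders for the WLPCC/LPCC/LPC features of the segmented repetition and prolongation events, not the paper's actual data.

```python
import numpy as np
from collections import Counter

def knn_predict(train_X, train_y, test_X, k=3):
    """Predict class labels by majority vote among the k nearest training
    vectors, using the Euclidean distance of Eq. 13."""
    preds = []
    for x in test_X:
        d = np.sqrt(np.sum((train_X - x) ** 2, axis=1))  # distance to every training sample
        nearest = np.argsort(d)[:k]
        votes = Counter(train_y[i] for i in nearest)
        preds.append(votes.most_common(1)[0][0])
    return np.array(preds)

# Conventional validation: shuffle, then 70% training / 30% testing.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 12))                 # stand-in WLPCC feature vectors
y = np.array([0] * 50 + [1] * 50)              # 0 = repetition, 1 = prolongation
idx = rng.permutation(len(X))
split = int(0.7 * len(X))
train, test = idx[:split], idx[split:]
pred = knn_predict(X[train], y[train], X[test], k=3)
accuracy = np.mean(pred == y[test])
```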

Linear discriminant analysis (LDA)

LDA is used for data classification and feature selection. It has been widely used in many applications such as speech recognition, face recognition and image retrieval. In this work, a discriminant function was used to assign the speech features to one of the two feature vector classes. The aim of LDA is to find a linear transformation that maximizes class separability in a reduced-dimensional space [29]. The linear transformation, also known as the discriminant function, was generated by transforming the speech feature vectors into projection feature vectors, computed as

$$f_i = \mu_i S_w^{-1} x_k^{T} - \frac{1}{2}\, \mu_i S_w^{-1} \mu_i^{T} + \ln(P_i) \qquad (14)$$

where μ_i is the mean feature vector of group i (i = 1, 2), x_k is the kth speech feature vector of the data, and P_i is the prior probability of group i, i.e. the number of samples of each group divided by the total number of samples.

Initially, the means (μ1 of class 1 and μ2 of class 2) of each feature set and the mean μ3 of the total data set were computed, as shown in Eq. 15, where P1 and P2 are the prior probabilities of class 1 and class 2 respectively.

$$\mu_3 = P_1 \times \mu_1 + P_2 \times \mu_2 \qquad (15)$$

In LDA, class discrimination is measured using the within-class scatter Sw and the between-class scatter Sb [29]. The within-class scatter Sw is the expected covariance of both classes and is computed using Eq. 16, where cov1 and cov2 are the (symmetric) covariance matrices calculated using Eq. 17.

$$S_w = P_1 \times \mathrm{cov}_1 + P_2 \times \mathrm{cov}_2 \qquad (16)$$

$$\mathrm{cov}_i = (x_i - \mu_i)(x_i - \mu_i)^{T} \qquad (17)$$

The between-class scatter Sb was calculated using Eq. 18; it can be considered as the covariance of a data set whose members are the mean vectors of each class, measured with respect to the mean vector of the total data set.

$$S_b = \sum_i (\mu_i - \mu_3)(\mu_i - \mu_3)^{T} \qquad (18)$$

All feature data were transformed using the discriminant function, and the training data and the prediction data were drawn into the new coordinates. The class label of the prediction data was determined by comparing the values of f_i.
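The discriminant function of Eqs. 14-18 can be sketched as below. The covariance of Eq. 17 is written here as an average of sample outer products, and the class with the larger f_i wins; these, along with the stand-in feature vectors, are implementation assumptions, since the paper only gives the formulas.

```python
import numpy as np

def lda_train(X1, X2):
    """Class means, priors and the inverse pooled within-class scatter (Eqs. 15-17)."""
    n1, n2 = len(X1), len(X2)
    P1, P2 = n1 / (n1 + n2), n2 / (n1 + n2)
    mu1, mu2 = X1.mean(axis=0), X2.mean(axis=0)
    cov1 = (X1 - mu1).T @ (X1 - mu1) / n1
    cov2 = (X2 - mu2).T @ (X2 - mu2) / n2
    Sw = P1 * cov1 + P2 * cov2                    # within-class scatter, Eq. 16
    return (mu1, mu2), (P1, P2), np.linalg.inv(Sw)

def lda_score(x, mu, P, Sw_inv):
    """Linear discriminant function of Eq. 14 for one class."""
    return mu @ Sw_inv @ x - 0.5 * mu @ Sw_inv @ mu + np.log(P)

def lda_predict(X, means, priors, Sw_inv):
    """Assign each feature vector to the class with the larger f_i."""
    f1 = np.array([lda_score(x, means[0], priors[0], Sw_inv) for x in X])
    f2 = np.array([lda_score(x, means[1], priors[1], Sw_inv) for x in X])
    return np.where(f1 >= f2, 0, 1)

# Usage with stand-in feature vectors for the two dysfluency classes.
rng = np.random.default_rng(1)
X_rep = rng.normal(0.0, 1.0, size=(40, 12))       # repetition features
X_pro = rng.normal(0.5, 1.0, size=(40, 12))       # prolongation features
means, priors, Sw_inv = lda_train(X_rep, X_pro)
labels = lda_predict(np.vstack([X_rep, X_pro]), means, priors, Sw_inv)
```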

Results and discussion

This section describes the experimental results of the two types of speech dysfluencies classification based on kNN and LDA. Furthermore, the effects of the parameters, namely the frame length, the ã value of the first-order pre-emphasis filter, the percentage of overlapping and the order p of the Linear Prediction features, on the classification results are discussed. Four experiments were performed based on the parameter configurations tabulated in Table 3. In the first experiment, the best frame length was determined by fixing the typical values of ã (0.9375), percentage of overlapping (33.33%) and order p (8). In the second experiment, the effect of ã was analyzed by fixing the frame length to the best value determined in the first experiment while the other two parameters were kept at their typical values. The effect of the percentage of overlapping was analyzed in the third experiment by fixing the frame length and ã to the best values found in the previous experiments while the order p was kept at its typical value. Finally, the fourth experiment was conducted using the best parameters (frame length, ã and percentage of overlapping) obtained from the previous three experiments.

Table 3 Parameters setting of experiments

Experiments | Frame Length | ã values | Percentage of overlapping | Order p
Effect of frame length | Varied 10 ms to 50 ms | 0.9375 | 33.33% | 8
Effect of ã values | 20 ms | 0.91–0.99 | 33.33% | 8
Effect of percentage of overlapping | 20 ms | 0.99 | 0%–75% | 8
Effect of order p | 20 ms | 0.99 | 75% | 2–70
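The four experiments of Table 3 amount to a one-parameter-at-a-time sweep. The sketch below encodes that protocol; extract_features and evaluate are hypothetical callables standing in for the LPC/LPCC/WLPCC pipeline and the kNN/LDA conventional-validation step described earlier, so the block is a reading of Table 3 rather than the authors' code.

```python
# One-parameter-at-a-time experiment grid of Table 3; all other parameters
# stay at their typical or previously found best values.
experiments = {
    "frame_length_ms": {"values": [10, 20, 30, 40, 50],
                        "fixed": {"a": 0.9375, "overlap": 1 / 3, "p": 8}},
    "a":               {"values": [0.91, 0.93, 0.95, 0.97, 0.99],
                        "fixed": {"frame_length_ms": 20, "overlap": 1 / 3, "p": 8}},
    "overlap":         {"values": [0.0, 1 / 3, 0.5, 0.75],
                        "fixed": {"frame_length_ms": 20, "a": 0.99, "p": 8}},
    "p":               {"values": list(range(2, 71, 4)),
                        "fixed": {"frame_length_ms": 20, "a": 0.99, "overlap": 0.75}},
}

def run_experiment(name, signals, labels, extract_features, evaluate):
    """Sweep one parameter while holding the others fixed (Table 3)."""
    results = {}
    for value in experiments[name]["values"]:
        params = dict(experiments[name]["fixed"], **{name: value})
        feats = [extract_features(s, **params) for s in signals]
        results[value] = evaluate(feats, labels)   # e.g. mean accuracy over 10 repeats
    return results
```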

Effect of frame length

Generally, the typical frame lengths used in LPC frame analysis for speech recognition systems are 45 ms, 30 ms and 30 ms at the three sampling rates of 6.67 kHz, 8 kHz and 10 kHz respectively [22]. However, some applications use frame lengths of 10–50 ms for frame analysis [30, 31]. In this study, the effect of frame lengths varied from 10 ms to 50 ms was analyzed with the sampling rate of the stuttered speech signals fixed at 16 kHz; the results of the analysis are illustrated in Fig. 5. The experiment was conducted in a consistent manner, with the other parameters fixed to the typical values shown in Table 3. For the kNN classifier, 20 ms frames generated better recognition accuracy than the other frame lengths for the available stuttered data, and WLPCC provided a higher overall recognition accuracy of 97.06% compared to LPCC and LPC with 95.69% and 93.14% respectively. For the LDA classifier, 20 ms frames also obtained the best recognition accuracy, with WLPCC giving a better overall recognition accuracy of 97.45% compared to LPCC and LPC with 95.10% and 90.39% respectively. It can be concluded that 20 ms is the best frame length at a sampling rate of 16 kHz for both classifiers.

Fig. 5 Classification accuracy (%) of kNN and LDA for speech dysfluencies recognition system based on 3 acoustical features, namely, WLPCC, LPCC and LPC for different sizes of frame length

Effect of ã values in first order pre-emphasis filter

In this section, the effect of the ã value was examined by varying ã between 0.91 and 0.99 while the other parameters were fixed to the typical or best values described in Table 3. The accuracies of the kNN and LDA classifiers for the three acoustical features are shown in Figs. 6 and 7 respectively. In Fig. 6, it can be observed that ã = 0.99 produced the best accuracy for the Linear Prediction features: the highest accuracies obtained by WLPCC, LPCC and LPC are 97.45%, 95.88% and 93.73% respectively. Based on Fig. 7, WLPCC, LPCC and LPC provide their highest accuracies of 97.06%, 96.67% and 92.35% respectively when ã = 0.99. As explained in "Speech signal pre-processing", the purpose of the pre-emphasis filter is to boost the high-frequency components, and the ã value controls the degree of pre-emphasis filtering. From this study, it can be concluded that the optimal ã value for controlling the degree of pre-emphasis is 0.99. For both classifiers, the pre-emphasis filter with ã = 0.99 performed well by amplifying the important high-frequency components and suppressing the low-frequency components of the available stuttered speech data.

Effect of percentage of overlapping

In this section, the experiment on the effect of overlapping percentage for the acoustical features and both the kNN and LDA classifiers was conducted based on the parameter configuration in Table 3. In general, 33.33% and 50% overlapping are used for short-time frame analysis [21, 22], but in this study the effects of no overlap and 75% overlap were also investigated; the results are tabulated in Table 4. It can be observed that for kNN, 75% frame overlapping gives the best accuracy for LPC, LPCC and WLPCC, with accuracies of 92.16%, 96.47% and 97.45% respectively. This implies that features extracted with 75% frame overlap are sufficient for kNN to classify the available stuttered data. For the LDA classifier, the LPC and LPCC features extracted with 33.33% frame overlapping obtained better recognition accuracy than with no overlapping, with accuracies of 90.20% and 96.47% respectively. No further improvement was obtained at higher percentages of overlapping for the LPC and LPCC features, whereas the WLPCC features reached their best LDA accuracy of 97.84% at 75% overlapping (Table 4).

Fig. 6 Accuracy of kNN for speech dysfluencies recognition system based on 3 acoustical features, namely, WLPCC, LPCC and LPC for different ã values in pre-emphasis

Fig. 7 Accuracy of LDA for speech dysfluencies recognition system based on 3 acoustical features, namely, WLPCC, LPCC and LPC for different ã values in pre-emphasis

Table 4 The effect of percentage of frame overlapping on speech dysfluencies recognition system based on 3 acoustical features, namely, WLPCC, LPCC and LPC by using LDA and kNN classifiers (values are classification accuracies in %)

Acoustical features | kNN 0% | kNN 33% | kNN 50% | kNN 75% | LDA 0% | LDA 33% | LDA 50% | LDA 75%
LPC | 91.57 | 90.78 | 91.18 | 92.16 | 89.22 | 90.20 | 90.20 | 90.20
LPCC | 95.88 | 96.27 | 96.27 | 96.47 | 96.27 | 96.47 | 96.47 | 96.27
WLPCC | 97.06 | 97.06 | 97.06 | 97.45 | 96.86 | 96.86 | 97.45 | 97.84

Effect of order p

For all the parameter analyses discussed in the previous sections, an 8th-order LP analysis was applied. The typical value of the order p is from 8 to 16 for speech signals sampled at 8 kHz [22]. In this section, the effect of the order p, varied between 2 and 70, on the classification of the speech dysfluencies using the three linear prediction features was investigated in a consistent manner, with the other parameters fixed to the best values found in the previous experiments as shown in Table 3; the results are depicted in Figs. 8 and 9 for kNN and LDA respectively. It can be observed from Fig. 8 that for order p in the range 8–50, the features are able to capture the salient information used to recognize the two types of dysfluencies and are sufficient for kNN to classify the available stuttered data sampled at 16 kHz. The highest accuracies obtained by WLPCC, LPCC and LPC are 98.24%, 97.25% and 93.73% respectively.

From Fig. 9, for order p in the range 8–50, all three acoustical features were able to capture the significant information used to classify the two types of dysfluencies. The highest accuracies obtained by WLPCC, LPCC and LPC are 98.04%, 97.06% and 94.90% respectively. From these studies, it can be concluded that the optimal range of the order p is 8–50 for stuttered speech signals sampled at 16 kHz. For low orders, the LPC spectrum may pick up only the prominent resonance peaks. On the other hand, if a large order p is employed, the LPC spectrum contains several spurious peaks. It is obvious that as the order p increases, the signal spectrum may contain more peaks; however, when the order p increases beyond some value, the spectrum may contain peaks that represent irrelevant spectral resonances. When a proper order p is used, the Linear Prediction features mostly contain sufficient information for kNN and LDA classification. The results of this study indicate that for lower order p (
