Int J Speech Technol (2012) 15:57–64 DOI 10.1007/s10772-011-9104-6

Using speech rhythm knowledge to improve dysarthric speech recognition

S.-A. Selouani · H. Dahmani · R. Amami · H. Hamam

Received: 15 June 2011 / Accepted: 3 August 2011 / Published online: 31 August 2011 © Springer Science+Business Media, LLC 2011

Abstract We introduce a new framework that improves dysarthric speech recognition by using rhythm knowledge. The approach builds speaker-dependent (SD) recognizers with respect to the dysarthria severity level of each speaker. This severity level is determined by a hybrid classifier combining class posterior distributions and a hierarchical structure of multilayer perceptrons. To perform this classification, rhythm-based features are used as input parameters, since preliminary evidence from perceptual experiments shows that rhythm troubles may be the common characteristic of various types of dysarthria. Speaker-dependent dysarthric speech recognition is then performed using Hidden Markov Models (HMMs). The Nemours database of American dysarthric speakers is used throughout the experiments. Results show the relevance of the rhythm metrics and the effectiveness of the proposed framework in improving the performance of dysarthric speech recognition.

S.-A. Selouani, Université de Moncton, Campus de Shippagan, Moncton, NB, Canada. e-mail: [email protected]
H. Dahmani, INRS-EMT, Université du Québec, Montréal, QC, Canada. e-mail: [email protected]
R. Amami, École ESPRIT, Tunis, Tunisia. e-mail: [email protected]
H. Hamam, Université de Moncton, Moncton, NB, Canada. e-mail: [email protected]

Keywords Dysarthria · Speech recognition · Severity level assessment · Neural networks · Hybrid systems · Rhythm metrics · Posterior distributions · Nemours database

1 Introduction

Dysarthria is linked to the disturbance of the brain and nerve stimuli of the muscles involved in the production of speech. This impairment induces disturbances in the timing and accuracy of the movements that are necessary for prosodically normal, efficient and intelligible speech. Rhythm troubles may be the common characteristic of the various types of dysarthria, and all types of dysarthria affect the articulation of consonants, and of vowels in very severe cases, leading to slurred speech (Liss et al. 2009). Although rhythm is identified as the main feature characterizing dysarthria, assessment methods are mainly based on perceptual evaluations. Despite their numerous advantages, including ease of use, low cost and clinicians' familiarity with perceptual procedures, perceptual methods suffer from a number of inadequacies that affect their reliability. They also lack evaluation protocols that could help standardize judgments across clinicians and evaluation tools.

Various approaches have been proposed for the classification of speech disorders linked to dysarthria. These methods fall into three broad categories. The first category includes statistical methods that aim at determining a likelihood function, defined as the probability of observing the data given the model. The probability of an observation is thus estimated as a function of the model, and the maximum likelihood estimate is the parameter of interest in the classification process (Selouani et al. 2009; Tolba and Eltorgoman 2009). The main advantage of statistical methods is their high recognition rate; their downside is the high volume of utterances required for an accurate learning phase. The second category is based on soft computing techniques. Non-linear techniques based on self-organizing maps and feed-forward neural networks have been used successfully to build discriminative models for disabled speech (Rudzicz 2009). The advantages of connectionist classifiers are their simplicity and ease of implementation. Finally, the third category combines the first two, exploiting the advantages of both statistical and soft computing methods. Such hybrid models have been investigated intensively for normal speech recognition applications and for the classification of biological signals (Tsuji et al. 1999). In Polur and Miller (2006), a hybrid technique combining Hidden Markov Models (HMMs) and neural networks was proposed for the analysis of dysarthric speech signals using standard cepstral features; the hybrid approach was found to be more accurate than HMMs alone.

In this paper, we present a hybrid approach that combines class posterior distributions, a hierarchical structure of Multilayer Perceptrons (MLPs) and HMMs to perform dysarthric speech recognition after assessing the severity level of dysarthria. We also propose a novel approach for the analysis of the input speech: in addition to the standard Mel-Frequency Cepstral Coefficients (MFCCs) used to analyze normal speech, we include rhythm metrics that we believe are more relevant for representing dysarthric speech. For the severity-level assessment, we compare the hybrid system with baseline systems, namely Gaussian Mixture Models (GMMs), a single MLP and a standard hierarchical structure of MLPs. For recognition, an HMM phone-based system is used, and the results obtained by speaker-dependent models are compared with the results reported in our previous work (Selouani et al. 2009).

The remainder of the paper is organized as follows. Section 2 describes the characteristics of dysarthric speech. Section 3 gives the definitions of the rhythm metrics used in our system. Section 4 presents the overall hybrid method and the hierarchical structure of MLPs using posterior distributions. Section 5 describes the dysarthric speech recognition process. Section 6 presents and discusses the results. Section 7 concludes the paper.

2 Dysarthric speech characteristics

Dysarthria affects millions of people and covers various speech troubles that are mainly due to neurological disorders. It induces mispronounced phonemes, variable speech amplitude and rate, poor articulation, etc. The neurological troubles disturb the nerve stimuli of the muscles involved in the production of speech. As a result, the organs of speech production may be affected to varying degrees and are characterized by weakness and impaired muscle tone during the production of speech. This disorder therefore has a direct impact on speech intelligibility.

The common classification of the various forms of dysarthria is based on the symptoms of the underlying neurological disorders. An auditory perceptual evaluation of the disturbed speech is often used to assess the severity level of dysarthria. Observations of all types of dysarthria show that the articulation of consonants is considerably affected, causing the slurring of speech; in severe cases, the vowels are also distorted. According to Darley et al. (1975), dysarthria can be categorized into seven types:

– Ataxic dysarthria, also termed ataxic speech, affects many functions including respiration, phonation, resonance and articulation, so that an increased articulation effort is noticed. Persons who suffer from this type of dysarthria tend to place the same excessive stress on all syllables.
– Spastic dysarthria is characterized by harshness of the vocal quality. The voice of the speaker is perceived as strained or strangled. Abnormally long durations are observed in phoneme-to-phoneme transitions and syllables. The fundamental frequency is low and may show some breaks.
– Hypokinetic dysarthria is associated with Parkinson's disease, and hoarseness is common in this type of dysarthria. Intelligibility is reduced because the volume is low. Compulsive repetition of syllables associated with mono-pitch and mono-loudness often occurs.
– Hyperkinetic dysarthria is usually associated with involuntary movement. The vocal quality is harsh sounding. This type of dysarthria is also characterized by hypernasality and sometimes by frequent pauses associated with dystonia. A total lack of intelligibility is noticed.
– Flaccid dysarthria is due to damage to the lower motor neurons involved in articulation. Commonly, a paralysis of one vocal fold is observed. The voice is harsh sounding with low volume and sometimes with inspirational stridency.
– Mixed dysarthria is characterized by various forms of trouble depending on the type and location of the motor neurons that remain functional. Voice harshness is noticed if the upper motor neurons are not functional; conversely, if the lower motor neurons are affected, the voice sounds breathy.
– Unclassified dysarthria covers all types that do not belong to the six categories above.

Clinicians treat dysarthria differently depending on its level of severity. Persons suffering from a moderate form of dysarthria are encouraged to follow recovery strategies that make their speech more intelligible. However, in some cases, people whose dysarthria is more severe may have to learn to use alternative forms of communication. Different methods have been proposed to evaluate the severity level of dysarthria. For instance, the method of Darley involves listeners who identify unintelligible and/or mispronounced phonemes by means of an articulation test (Darley et al. 1975). A widely used method, proposed by Kent, is based on a list of words that the patient pronounces aloud; the listener has to select one of four words depending on what he/she heard. The word lists are designed around the phonetic contrasts that can be disrupted by dysarthria. The Nemours speech database that we use in this paper is based on the Kent method.

3 Speech rhythm metrics

Researchers have developed a number of metrics that quantify speech rhythm in different languages. These rhythm metrics are based on acoustic measures of the duration of vocalic and consonantal intervals in continuous speech; they take into account the variability in these durations, and they can be calculated in both raw and rate-normalized forms. This quantitative approach has contributed new insights into language typology (Arvaniti 2009).

Grabe and Low (2002) calculate the durational variability in successive acoustic-phonetic intervals using Pairwise Variability Indices (PVI). The raw Pairwise Variability Index (rPVI) is defined by:

\mathrm{rPVI} = \frac{1}{T-1} \sum_{t=1}^{T-1} |d_t - d_{t+1}| ,    (1)

where d_t is the length of the t-th vocalic or intervocalic segment and T is the number of segments. A normalized version of the PVI index (noted nPVI) is defined by:

\mathrm{nPVI} = \frac{100}{T-1} \sum_{t=1}^{T-1} \left| \frac{d_t - d_{t+1}}{(d_t + d_{t+1})/2} \right| .    (2)

Ramus et al. (1999) based their quantitative approach to speech rhythm on purely phonetic characteristics of the speech signal. They measured vowel durations and the durations of the intervals between vowels, and computed three acoustic correlates of rhythm from these measurements: (i) %V, the proportion of time occupied by vocalic intervals in the sentence; (ii) ΔV, the standard deviation of the vocalic interval durations; and (iii) ΔC, the standard deviation of the inter-vowel interval durations.

In this work we apply rhythm metrics to analyze the variable speech rhythm patterns in the pronunciation of dysarthric speech. The purpose is to examine whether rhythm metrics are sensitive to the observed durational differences among dysarthric speakers. For each sentence of each speaker, we measured the durations of the vocalic, consonantal, voiced and unvoiced segments. This permits us to calculate seven metrics: %V, ΔC, ΔV, the Vocalic-rPVI, Vocalic-nPVI, Intervocalic-rPVI, and Intervocalic-nPVI.
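As an illustration, the sketch below (ours, not the authors' code) computes these seven metrics from hand-measured segment durations, assumed to be in seconds; the example values are invented.

```python
import numpy as np

def rpvi(durations):
    """Raw Pairwise Variability Index, Eq. (1)."""
    d = np.asarray(durations, dtype=float)
    return np.mean(np.abs(np.diff(d)))

def npvi(durations):
    """Normalized Pairwise Variability Index, Eq. (2)."""
    d = np.asarray(durations, dtype=float)
    pair_means = (d[:-1] + d[1:]) / 2.0
    return 100.0 * np.mean(np.abs(np.diff(d)) / pair_means)

def ramus_metrics(vocalic, consonantal):
    """%V, deltaV and deltaC of Ramus et al. (1999)."""
    v = np.asarray(vocalic, dtype=float)
    c = np.asarray(consonantal, dtype=float)
    percent_v = 100.0 * v.sum() / (v.sum() + c.sum())
    return percent_v, v.std(), c.std()

# Made-up vocalic/intervocalic durations for one sentence:
voc = [0.12, 0.21, 0.09, 0.18]
ivoc = [0.08, 0.15, 0.11]
print(ramus_metrics(voc, ivoc))   # %V, deltaV, deltaC
print(rpvi(voc), npvi(voc))       # Vocalic-rPVI, Vocalic-nPVI
print(rpvi(ivoc), npvi(ivoc))     # Intervocalic-rPVI, Intervocalic-nPVI
```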

4 Hybrid system for dysarthric speech assessment

4.1 Estimation of posterior probabilities

The problem of classifying observed vectors into one of K classes can be formalized by the Bayes rule: a vector is assigned to a specific class if its a posteriori probability of belonging to that class is larger than for the other classes. This a posteriori probability is given by:

P(k|x) = \frac{P(k)\, P(x|k)}{P(x)} ,    (3)

where P(k) is the a priori probability of class k and P(x|k) is the probability density function (pdf) of x given class k. In our application, the pdf of the feature vector is represented by a general Gaussian model with K classes:

P(x) = \sum_{k=1}^{K} N(x; \mu_k, \Sigma_k) ,    (4)

where N(x; \mu_k, \Sigma_k) is the d-dimensional Gaussian distribution, and \mu_k \in \mathbb{R}^d and \Sigma_k \in \mathbb{R}^{d \times d} are respectively the mean vector and the covariance matrix. This distribution can be written as:

N(x; \mu_k, \Sigma_k) = (2\pi)^{-d/2} |\Sigma_k|^{-1/2} \exp\left( -\tfrac{1}{2} (x - \mu_k)^T \Sigma_k^{-1} (x - \mu_k) \right) .    (5)

Then, using (3) and (4), the a posteriori probability P(k|x) can be written as:

P(k|x) = \frac{N(x; \mu_k, \Sigma_k)}{\sum_{k'=1}^{K} N(x; \mu_{k'}, \Sigma_{k'})} .    (6)
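The following sketch illustrates (6); the class means and covariances here are illustrative placeholders, whereas the actual parameters would be estimated from the labeled training data.

```python
import numpy as np
from scipy.stats import multivariate_normal

def posteriors(x, means, covs):
    """P(k|x) of Eq. (6): each Gaussian density normalized by their sum."""
    dens = np.array([multivariate_normal.pdf(x, mean=m, cov=S)
                     for m, S in zip(means, covs)])
    return dens / dens.sum()

# Two illustrative 2-D classes:
means = [np.array([0.0, 0.0]), np.array([2.0, 1.0])]
covs = [np.eye(2), 0.5 * np.eye(2)]
p = posteriors(np.array([1.0, 0.5]), means, covs)
print(p, p.sum())   # posterior vector; sums to 1
```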

The a priori probabilities of all classes are absorbed into a common scale factor. Writing the mean vector as \mu_k = (\mu_{1,k}, \ldots, \mu_{d,k})^T and the inverse covariance matrix as \Sigma_k^{-1} = [\sigma_{ijk}], and using (5), we can write:

N(x; \mu_k, \Sigma_k) = (2\pi)^{-d/2} |\Sigma_k|^{-1/2} \exp\left( -\frac{1}{2} \sum_{j=1}^{d} \sum_{i=1}^{j} (2 - \delta_{ij}) \sigma_{ijk} x_i x_j - \frac{1}{2} \sum_{j=1}^{d} \sum_{i=1}^{d} \sigma_{ijk} \mu_{jk} \mu_{ik} + \sum_{j=1}^{d} \sum_{i=1}^{d} \sigma_{ijk} \mu_{jk} x_i \right) ,    (7)

where \delta_{ij} is the Kronecker symbol. Let us consider the function \psi defined as the logarithm of N(x; \mu_k, \Sigma_k), which can be written as:

\psi = \log\left( N(x; \mu_k, \Sigma_k) \right) = \alpha_k^T \tilde{x} ,    (8)


where \alpha_k and \tilde{x} \in \mathbb{R}^G are defined by the following equations:

\alpha_k = \left( \alpha_{0,k},\ \sum_{j=1}^{d} \sigma_{j1k} \mu_{jk},\ \ldots,\ \sum_{j=1}^{d} \sigma_{jdk} \mu_{jk},\ -\tfrac{1}{2} \sigma_{11k},\ -\sigma_{12k},\ \ldots,\ -\sigma_{1dk},\ \ldots,\ -\tfrac{1}{2} (2 - \delta_{ij}) \sigma_{ijk},\ \ldots,\ -\tfrac{1}{2} \sigma_{ddk} \right)^T ,    (9)

where

\alpha_{0,k} = -\frac{1}{2} \sum_{j=1}^{d} \sum_{i=1}^{d} \sigma_{ijk} \mu_{jk} \mu_{ik} - \frac{d}{2} \log(2\pi) - \frac{1}{2} \log |\Sigma_k| ,    (10)

and

\tilde{x} = (1,\ x^T,\ x_1^2,\ x_1 x_2,\ \ldots,\ x_1 x_d,\ x_2^2,\ x_2 x_3,\ \ldots,\ x_2 x_d,\ \ldots,\ x_d^2)^T .    (11)

The dimension G is given by:

G = 1 + \frac{d(d+3)}{2} .    (12)

Our approach consists of using \alpha_k as the MLP activation function in the hybrid structure. Indeed, the a posteriori probability of (6), taking (8) into account, can be written as:

P(k|x) = \frac{\exp(\alpha_k^T \tilde{x})}{\sum_{k'=1}^{K} \exp(\alpha_{k'}^T \tilde{x})} .    (13)
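The sketch below (ours) builds the augmented vector \tilde{x} of (11), checks the dimension G of (12), and evaluates the softmax posterior of (13); the weight vectors \alpha_k are random placeholders standing in for the values computed by (9) and (10).

```python
import numpy as np

def augment(x):
    """Build x~ = (1, x^T, x1^2, x1x2, ..., xd^2)^T of Eq. (11)."""
    d = len(x)
    quad = [x[i] * x[j] for i in range(d) for j in range(i, d)]
    return np.concatenate(([1.0], x, quad))

def posterior(alphas, x_tilde):
    """Softmax over alpha_k^T x~, Eq. (13)."""
    scores = alphas @ x_tilde
    scores -= scores.max()            # numerical stability
    e = np.exp(scores)
    return e / e.sum()

d, K = 4, 3
G = 1 + d * (d + 3) // 2              # Eq. (12): G = 15 for d = 4
rng = np.random.default_rng(1)
x_tilde = augment(rng.normal(size=d))
assert x_tilde.shape == (G,)
alphas = rng.normal(size=(K, G))      # placeholder weight vector per class
print(posterior(alphas, x_tilde))
```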

4.2 Hierarchical structure of MLPs

The connectionist approach presented here assesses the severity level of dysarthria by using a mixture of neural experts. This structure has proved very efficient in the case of phoneme recognition (Schwarz et al. 2006). In our application, binary classification sub-tasks are individually assigned to each MLP, which is independently trained to discriminate between two severity levels. During the learning phase, a flow of segmented data is presented at the network input. Since the learning is supervised, the data are labeled with respect to the dysarthria severity level. The input vector is preprocessed according to (11); the input layer of each MLP is therefore composed of G units, with G given by (12), and the activation function is defined by (13). We denote by E^1 and O^1 the input and output of the first layer, and by E^2 and O^2 the input and output of the second layer. The following equations define the relations between the inputs and outputs of each layer:

E_k^1 = x ; \qquad O_k^1 = \frac{\exp\left( \sum_{g=1}^{G} \alpha_{gk} E_g^1 \right)}{\sum_{k'=1}^{K} \exp\left( \sum_{g=1}^{G} \alpha_{gk'} E_g^1 \right)} ,    (14)

where the \alpha_k^T play the role of the classical weights in neural networks. For the second layer, we can write:

E_k^2 = O_k^1 ; \qquad O_k^2 = \frac{\exp(\alpha_k^T O^1)}{\sum_{k'=1}^{K} \exp(\alpha_{k'}^T O^1)} .    (15)
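Operationally, the hierarchy amounts to a cascade of three binary decisions (see Fig. 2 and Sect. 6.3). A minimal sketch, assuming each expert exposes a generic predict() method returning 0 or 1 (the paper's experts are the two-layer MLPs of (14) and (15)):

```python
def assess_severity(x, mlp1, mlp2, mlp3):
    """Return the severity level 'L0'..'L3' for feature vector x."""
    if mlp1.predict(x) == 0:     # healthy control vs. dysarthric
        return "L0"
    if mlp2.predict(x) == 0:     # mild vs. severe
        return "L1"
    if mlp3.predict(x) == 0:     # severe vs. most severe
        return "L2"
    return "L3"

# Dummy experts, for illustration only:
class Always:
    def __init__(self, label): self.label = label
    def predict(self, x): return self.label

assert assess_severity(None, Always(1), Always(1), Always(0)) == "L2"
```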

5 Speech recognition of dysarthric speech

The assessment of the dysarthria severity level helps create accurate models for automatic speech recognition of dysarthric speakers. Indeed, specific parameters and configurations can be used for each level of severity. For instance, we have shown in Selouani et al. (2009) that it is suitable to increase the window length for very severe cases, where the speech rate can be very low, as shown in Fig. 1. In fact, the variety of dysarthric persons may require dramatically different speech recognition systems, since the symptoms of dysarthria vary greatly from one subject to another.

Hasegawa-Johnson et al. (2006) present three categories of audio-only and audiovisual speech recognition algorithms for dysarthric users. These systems include phone-based and whole-word recognizers using HMMs, phonologic-feature-based and whole-word recognizers using support vector machines (SVMs), and hybrid SVM-HMM recognizers. The results showed that HMMs are effective in dealing with the large-scale word-length variations produced by some patients, while SVMs brought some degree of robustness against the reduction and deletion of consonants.

The speech recognition system presented in this section extends the speaker-dependent HMM system we developed in Selouani et al. (2009). As illustrated in Fig. 2, dysarthric speech recognition is performed after the severity-level categorization; speakers belonging to the same category share the same HMMs. The analysis frame length has a significant effect on the accuracy of dysarthric speech recognition: as expected, in most cases of dysarthria an increased duration of the utterances is noticed, as shown in Fig. 1. Therefore, the proposed hybrid system is evaluated for different lengths of the analysis frame.

The MFCCs are the acoustic parameters used by the HMM component of the proposed system. MFCCs have been used successfully to classify speech disorders with HMMs. Indeed, to discriminate between normal speech and speech disturbed by various disorders, Godino-Llorente and Gomez-Vilda (2004) use MFCCs and their derivatives as the front-end of a neural-network-based classifier. The reported results lead to the conclusion that short-term MFCCs are a good parameterization for the detection of voice impairments.
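As a concrete illustration of this frame-length choice, the sketch below uses librosa (our choice of library; the paper's actual front-end is HTK-based) to extract 13 MFCCs with a configurable Hamming analysis window, so that the frame can be widened for very slow, severely dysarthric speech. The file name is hypothetical.

```python
import librosa

def extract_mfcc(path, win_ms=30, hop_ms=10, n_mfcc=13):
    y, sr = librosa.load(path, sr=16000)      # Nemours is sampled at 16 kHz
    win = int(sr * win_ms / 1000)             # e.g. 480 samples for 30 ms
    hop = int(sr * hop_ms / 1000)             # 10 ms frame advance
    return librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc,
                                n_fft=win, hop_length=hop,
                                window="hamming")

# mfcc = extract_mfcc("BK_sentence01.wav", win_ms=30)  # hypothetical file
```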


Fig. 1 Examples of a nonsense sentence extracted from the Nemours database, pronounced by a dysarthric speaker (BK) and by a clinician (reference). Speaker BK belongs to the category of speakers suffering from severe dysarthria

Table 1 Frenchay dysarthria assessment scores of the dysarthric speakers of the Nemours database (Polikoff and Bunnell 1999)

Patient        KS    SC     BV     BK     RK     RL     JF     LL     BB     MH    FB
Severity (%)   –     49.5   42.5   41.8   32.4   26.7   21.5   15.6   10.3   7.9   7.1


6 Experiments and results

6.1 Speech material

Nemours is one of the few databases of recorded dysarthric speech. It contains recordings of American patients suffering from different types of dysarthria (Polikoff and Bunnell 1999). The full set of stimuli consists of 74 monosyllabic nouns and 37 bi-syllabic verbs embedded in short nonsense sentences. Speakers pronounced 74 sentences of the form: THE noun 1 IS verb-ING THE noun 2. The recording sessions were conducted by a speech pathologist who served as the healthy control (HC). The speech waveforms were sampled at 16 kHz with 16-bit resolution, after low-pass filtering at a nominal 7500 Hz cutoff frequency with a 90 dB/octave filter.

6.2 Subjects

The speakers are eleven adult males with dysarthria caused by cerebral palsy (CP) or head trauma (HT) and one non-dysarthric adult male. Seven speakers have CP: three have CP with spastic quadriplegia, two have athetoid CP, and two have a mixture of spastic and athetoid CP with quadriplegia. The four remaining subjects are victims of head trauma. A two-letter code was assigned to each patient: BB, BK, BV, FB, JF, KS, LL, MH, RK, RL and SC. Using the Frenchay dysarthria assessment scores (see Table 1 and Enderby and Pamela 1983), the patients can be divided into three subgroups: the first, mild, includes subjects FB, BB, MH and LL; the second includes subjects RK, RL and JF; and the third, severe, includes subjects KS, SC, BV and BK. The perceptual data and the speech assessment did not take into consideration the most severe case (patient KS) or the mildest case (patient FB). For the severity-level tests, all speakers of Nemours were considered. In the dysarthric speech recognition tests, one speaker from each severity-level category is considered: (BB, L1), (JF, L2) and (BK, L3).

6.3 Setup of severity classification and speech recognition tasks

The connectionist component is composed of three MLPs. These MLPs use activation functions that generate posterior distributions for binary discrimination tasks. As illustrated in Fig. 2, the task of the first MLP is to discriminate between the healthy control (HC) and dysarthric speakers, regardless of the severity level of their impairments.


Fig. 2 Hierarchical structure of MLPs for dysarthric speech assessment and recognition. The MLPs are used as experts to assess the dysarthria severity level before accurate HMM-based SD speech recognition is performed

The second MLP processes only dysarthric speech utterances, in order to classify them into the mild or severe level of impairment. The third MLP classifies the severe cases into two sub-categories: severe and most severe. Four levels of classification are considered: L0, L1, L2 and L3; L0 corresponds to the HC and L3 to the most severe cases of dysarthria.

To extract the MFCC features, the frames were 30 ms long with 40% overlap, using a Hamming window. Since the number of MLP inputs is fixed while the number of frames varies, we compressed the features by averaging them over five frame blocks. This number was found optimal for the studied cases after extensive cross-validation tests. These tests also allowed us to select the most relevant rhythm metrics, which constitute the inputs of the single MLP trained with the standard backpropagation algorithm. For the single MLP, the five ΔV values (one per averaged block), %V and nPVI were found to be the best combination, so the number of single-MLP inputs was seven. For the hybrid system, we used the seven rhythm metrics without averaging, and the number of inputs was 72 (13 MFCCs × 5 + 7 rhythm metrics).

In order to evaluate the SD speech recognition system using severity-level information, the HTK HMM-based speech recognition toolkit described in HTK (2009) has been used throughout all experiments.
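A sketch of this input assembly, under our reading of the description above (each utterance compressed to five averaged blocks of 13 MFCCs, concatenated with the seven rhythm metrics); the original feature pipeline is not published:

```python
import numpy as np

def hybrid_input(mfcc, rhythm_metrics):
    """mfcc: (n_frames, 13) array; rhythm_metrics: length-7 vector."""
    n_blocks = 5
    usable = (len(mfcc) // n_blocks) * n_blocks   # trim to a multiple of 5
    blocks = mfcc[:usable].reshape(n_blocks, -1, 13).mean(axis=1)  # (5, 13)
    return np.concatenate([blocks.ravel(), rhythm_metrics])        # 65 + 7

x = hybrid_input(np.random.randn(100, 13), np.zeros(7))
assert x.shape == (72,)   # matches the 72 inputs of the hybrid system
```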


The toolkit was designed to support continuous-density HMMs with any number of states and mixture components. It also implements a parameter-tying procedure allowing the creation of complex HMM topologies. Each phoneme is represented by a 5-state HMM with two non-emitting states (the 1st and 5th). MFCCs and energy are used as parameters to train and test the system. Twelve MFCCs were calculated on a Hamming window advanced by 10 ms per frame. The normalized log energy is added to the 12 MFCCs to form a 13-dimensional static vector, which is then expanded to a 39-dimensional vector by adding the first and second derivatives of the static parameters.

The language model used in this application is a bigram built from statistics generated from the phonetic transcriptions. Statistics are generated using the HLStats function of HTK (HTK 2009). This function computes the occurrences of all labels and then generates the back-off bigram probabilities using the phoneme-based dictionary of the corpus; it counts the occurrences of every consecutive pair of labels in all labeled words of the dictionary. The HBuild function then takes the back-off probability file as input and generates the bigram language model.

6.4 Severity level assessment

The results are given in Table 2, where the performance of the hybrid system is compared with baseline systems, namely Gaussian Mixture Models (GMMs), a single MLP and a standard hierarchical structure of MLPs. Both the single MLP and the standard hierarchical MLPs use the backpropagation algorithm with the standard sigmoid activation function (Schwarz et al. 2006). The task is to discriminate between the four severity levels. The global classification rate of the hybrid system and of the MLP hierarchical structure is calculated as the product of the rates of the three MLPs that compose the structure.

The results show that the proposed system outperforms the baseline ones. An improvement of more than 3% was achieved in comparison with GMMs using the same acoustic analysis. The impact of using rhythm metrics in addition to the MFCCs was very significant (improvements from 3% to 6% were obtained), which confirms the relevance of such features for dysarthria assessment tasks. We noted that BV, whose Frenchay score was 42.5%, is always misclassified: on examining the speech of BV, we found that his speech rate was quite normal and his speech almost intelligible, though nasal. We also noted that FB is categorized as mild by the Frenchay test, but his utterances are mostly classified in the L3 category: his speech is very intelligible, but his speech rate is very slow.

Table 2 Comparison of the proposed hybrid method with baseline systems (classification rates in %)

Features                 GMM     Single MLP   Hierarchical   Hybrid
MFCCs                    80.75   78.58        80.14          82.64
MFCCs + Rhythm metrics   83.05   81.42        82.80          86.35

Table 3 Accuracy (%) with respect to window size, for the SD-HMM system using exclusive speaker-dependent models and for the hybrid system (SD-NN-HMM) using SD-HMMs with prior classification of the severity level

System       Speaker   15 ms   20 ms   25 ms   30 ms
SD-HMM       BB        62.50   63.89   65.28   68.66
             BK        52.08   55.56   56.86   54.17
             JF        61.74   63.82   64.54   67.14
SD-NN-HMM    BB        60.15   61.68   63.24   62.55
             BK        51.28   52.68   53.64   53.26
             JF        58.96   60.45   62.51   60.73

6.5 Evaluation of the dysarthric speech recognition

We compare the proposed hybrid system with the one presented in Selouani et al. (2009), under similar experimental conditions. Three dysarthric speakers (BB, JF, BK) of the Nemours database are used for the evaluation of speaker-dependent dysarthric Automatic Speech Recognition (ASR), and different Hamming window lengths are tested. For each speaker, the training set is composed of 50 sentences (300 words) and the test set of 24 sentences (144 words). The models for each speaker are left-right triphone HMMs with Gaussian mixture output densities, decoded with the Viterbi algorithm on a lexical-tree structure. Due to the limited amount of training data, the HMM acoustic parameters of each speaker-dependent model are initialized with the reference utterances (clinician speech data) as baseline training material.

We carried out experiments to compare the optimal frame size of the acoustic analysis window between systems. The tested window lengths are 15, 20, 25 and 30 ms. It should be noted that the frame size is not exclusively controlled by stationarity and ergodicity constraints, but also by the information contained in each frame. The selection of the analysis frame length is therefore a compromise between frames long enough to give reliable estimates of the acoustic parameters and frames short enough to accurately capture rapid events.

Two sets of experiments were carried out. The first set involves the purely speaker-dependent ASR system (SD-HMM), without prior assessment of the severity level; one SD system is used exclusively for each speaker. In the second set of experiments,

each level of severity (L1, L2, L3) has its own HMMs, created using the speech data of all the speakers belonging to that category; for instance, for speaker BB the SD system of L1 is used. We refer to this latter system as SD-NN-HMM, since it uses the connectionist structure for the classification of the severity level. Table 3 shows the recognition accuracy for the different Hamming window lengths and the best results obtained for the BB, BK and JF speakers. These results show that the window size plays an important role, which confirms the findings in Selouani et al. (2009): the best window sizes are 30 ms and 25 ms for SD-HMM and SD-NN-HMM respectively. The average recognition rate over the three speakers when using prior classification of the dysarthric severity level is about 60%, which is a very satisfactory result in the case of pathological speech. The purely SD systems achieve a correct recognition rate of 64.5%. This leads us to conclude that the SD systems remain better than the SD-NN-HMM-based systems. However, it should be mentioned that exclusive SD ASR systems induce higher costs, since they need more speaker-specific data and training time. The systems operating with respect to the dysarthria severity level share data and models, which is crucial in the field of impaired speech processing because such speech data are scarce.
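As a quick arithmetic check (ours, not from the paper), the two averages quoted above can be recomputed from Table 3, taking each speaker's best window for SD-HMM and the 25 ms column for SD-NN-HMM:

```python
# Per-speaker best accuracies read off Table 3.
sd_hmm_best = {"BB": 68.66, "BK": 56.86, "JF": 67.14}
sd_nn_hmm_25ms = {"BB": 63.24, "BK": 53.64, "JF": 62.51}

print(sum(sd_hmm_best.values()) / 3)     # 64.22, close to the reported 64.5%
print(sum(sd_nn_hmm_25ms.values()) / 3)  # 59.80, the reported "about 60%"
```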

7 Conclusion

In this paper, we presented a hybrid system that uses a connectionist approach to estimate the dysarthria severity level and a speaker-dependent system to recognize dysarthric speech. Through the use of a new activation function based on class posterior distributions, a hierarchical structure of neural networks classifies the severity level of dysarthria prior to speech recognition. The input features are composed of conventional parameters and of rhythm metrics based on the durational characteristics of vocalic and intervocalic intervals, and on the Pairwise Variability Index in both its raw and normalized forms. The advantage of the proposed hybrid approach is that it provides accurate models for a group of speakers instead of building one recognizer for each speaker, which could be costly. Another advantage is that the system could constitute a relevant objective test for the automatic evaluation of dysarthria impairments. Clinicians may find this tool useful, since it can help prevent inaccurate subjective judgments.

References

Arvaniti, A. (2009). Rhythm, timing and the timing of rhythm. Phonetica, 66, 46–63.
Darley, F. L., Aronson, A., & Brown, J. R. (1975). Motor speech disorders. Philadelphia: Saunders.
Enderby, P., & Pamela, M. (1983). Frenchay dysarthria assessment. London: College Hill Press.
Godino-Llorente, J. I., & Gomez-Vilda, P. (2004). Automatic detection of voice impairments by means of short-term cepstral parameters and neural network based detectors. IEEE Transactions on Biomedical Engineering, 51, 380–384.
Grabe, E., & Low, E. L. (2002). Durational variability in speech and the rhythm class hypothesis. Papers in Laboratory Phonology, 7, 515–546.
Hasegawa-Johnson, M., Gunderson, J., Perlman, A., & Huang, T. (2006). HMM-based and SVM-based recognition of the speech of talkers with spastic dysarthria. In International conference on acoustics, speech and signal processing (ICASSP) (pp. 1060–1063).
HTK (2009). The HTK book (Version 3.4.1). Cambridge: Speech Group, Cambridge University.
Liss, J., White, L., Mattys, S., Lansford, K., Lotto, A., Spitzer, S., & Caviness, J. (2009). Quantifying speech rhythm abnormalities in the dysarthrias. Journal of Speech, Language, and Hearing Research, 52, 1334–1352.
Polikoff, J. B., & Bunnell, H. T. (1999). The Nemours database of dysarthric speech: A perceptual analysis. In The XIVth international congress of phonetic sciences (ICPhS) (pp. 783–786).
Polur, D., & Miller, G. (2006). Investigation of an HMM/ANN hybrid structure in pattern recognition application using cepstral analysis of dysarthric (distorted) speech signals. Medical Engineering & Physics, 28, 741–748.
Ramus, F., Nespor, M., & Mehler, J. (1999). Correlates of linguistic rhythm in the speech signal. Cognition, 73, 265–292.
Rudzicz, F. (2009). Phonological features in discriminative classification of dysarthric speech. In International conference on acoustics, speech and signal processing (ICASSP) (pp. 4605–4608).
Schwarz, P., Matejka, P., & Cernocky, J. (2006). Hierarchical structures of neural networks for phoneme recognition. In International conference on acoustics, speech and signal processing (ICASSP) (pp. 325–328).
Selouani, S. A., Yakoub, M., & O'Shaughnessy, D. (2009). Alternative speech communication system for persons with severe speech disorders. EURASIP Journal on Advances in Signal Processing, 2009, 540409. doi:10.1155/2009/540409.
Tolba, H., & Eltorgoman, A. (2009). Towards the improvement of automatic recognition of dysarthric speech. In IEEE international conference ICSIT (pp. 277–281).
Tsuji, T., Fukuda, O., Ichinobe, H., & Kaneko, M. (1999). A log-linearized Gaussian mixture network and its application to EEG pattern classification. IEEE Transactions on Systems, Man, and Cybernetics, 29, 60–72.
