Improved Acoustic Modeling for Automatic Dysarthric Speech Recognition

Sriranjani. R⋆, M. Ramasubba Reddy⋆ and S. Umesh†
⋆Biomedical Engineering Group, Department of Applied Mechanics, †Department of Electrical Engineering,
Indian Institute of Technology Madras, Chennai, India.
Email: [email protected], [email protected] and [email protected]

Abstract—Dysarthria is a neuromuscular disorder that occurs due to improper coordination of the speech musculature. Assistive technology based on automatic speech recognition (ASR) is gaining importance as a way to improve the quality of life of people with such speech disorders. Since it is difficult for dysarthric speakers to provide sufficient data, data insufficiency is one of the major problems in building an efficient dysarthric ASR system. In this paper, we address this issue by pooling data from an unimpaired speech database. A feature-space maximum likelihood linear regression (fMLLR) transformation is then applied to the pooled data and the dysarthric data to normalize the effect of inter-speaker variability. The acoustic model built using the combined features (acoustically transformed dysarthric + pooled features) gives a relative improvement of 18.09% and 50.00% over the baseline system for the Nemours database and the Universal Access speech (digit set) database, respectively.

Keywords—Dysarthric speech recognition, Data pooling, fMLLR, Data insufficiency

I. INTRODUCTION

Dysarthria is a motor speech disorder that occurs due to poor coordination of the speech musculature. Muscle weakness and reduced strength lead to oral communication problems, resulting in unintelligible speech. The degree of dysarthria ranges from very mild to severe, based on the percentage of speech intelligibility. Automatic speech recognition (ASR) systems are built to aid communication and provide a better quality of life for dysarthric people.

Since dysarthric speech is slurred and disordered, ASR systems need to be designed specifically for it, which rules out the use of conventional ASR systems. One of the major challenges in building a robust dysarthric ASR system is data insufficiency. Due to muscle weakness and fatigue, dysarthric speakers find it difficult to provide sufficient amounts of speech data. As a result, only a handful of databases are available for dysarthric speech, such as the Universal Access (UA) speech database [1], TORGO [2] and Nemours [3]. Using these data, different acoustic models, namely speaker dependent (SD), speaker independent (SI) and speaker adaptive (SA) models, have been built in the literature.

An SD system is built using only a specific speaker's data and therefore captures the characteristics of that speaker efficiently. SI models are built using data from all the speakers, while SA models are formed by adapting SI models towards each speaker. Different adaptation techniques such as maximum likelihood linear regression (MLLR) [4], constrained MLLR (cMLLR) [4] and maximum a posteriori (MAP) [5] adaptation have been used in the literature to improve system performance. Since SD models cannot be built reliably due to the lack of per-speaker data, SA models are built for each speaker to obtain better estimates. Several studies have examined the feasibility of using SA and SD models for dysarthric speech recognition [6, 7, 8, 9, 10, 11]. One such study [12] shows that SA models perform better for mild and moderate dysarthria, while SD models perform better for severe dysarthria. Further studies build separate SI models for different severity levels and select the SI model to adapt based on the severity of the test speaker [7, 13].

In this paper, we address the issue of data insufficiency by pooling data from unimpaired speech databases such as TIDigits [14] and the Wall Street Journal (WSJ0) corpus [15]. Models built using the pooled data and the dysarthric data provide improvements only up to a certain limit, depending on the amount of pooled data. To enhance this approach, the feature spaces of the pooled and dysarthric data are transformed using feature-space MLLR (fMLLR) [4]. The acoustic model built using the combined (pooled + dysarthric) fMLLR features provides a relative improvement of 18.09% and 50.00% over the baseline (models built only using dysarthric data) for the Nemours and UA speech (digit set) databases, respectively.

The rest of the paper is organized as follows: Section II describes the data pooling method and Section III the proposed method. Details of the databases used are provided in Section IV. The experimental setup is explained in Section V, and results are discussed in Section VI. Section VII concludes the work.

II. DATA POOLING METHOD

A baseline continuous density hidden Markov model (CDHMM) is built using dysarthric data from all speakers. This model is then adapted using MLLR and fMLLR + speaker adaptive training (SAT) [16] to form SA models for each speaker, as shown in Figure 1.

In data pooling, unimpaired speech data from other databases such as TIDigits and WSJ0 is combined with the dysarthric data. This helps to increase the amount of training data and thereby improves the estimates of the model. The datapooled SI model is formed by combining a varying amount of pooled data with a fixed amount of dysarthric data. These datapooled models are then adapted using MLLR / fMLLR+SAT to form SA models for each speaker, as shown in Figure 2.
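As a rough sketch of this pooling step, assuming the two corpora are available as simple utterance lists (the Utterance record and the pool_fraction parameter below are illustrative, not part of the actual recipes used in the paper):

```python
import random
from typing import List, NamedTuple

class Utterance(NamedTuple):
    utt_id: str
    transcript: str
    wav_path: str

def pool_data(dysarthric: List[Utterance],
              unimpaired: List[Utterance],
              pool_fraction: float,
              seed: int = 0) -> List[Utterance]:
    """Combine all dysarthric utterances with a randomly chosen
    fraction of the unimpaired corpus (roughly 20%-100% in the paper)."""
    rng = random.Random(seed)
    n_pooled = int(pool_fraction * len(unimpaired))
    pooled = rng.sample(unimpaired, n_pooled)
    # The datapooled SI acoustic model is then trained on the combined list.
    return list(dysarthric) + pooled
```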

Figure 1. Schematic diagram showing the implementation of baseline model

Figure 2. Schematic diagram showing the implementation of datapooled model

The amount of data pooling was varied from approximately 20% to 100% of the available pooled data, and the resulting performance is reported in Table I. As the amount of pooled data increases, the performance of the model decreases, because the model leans towards the unimpaired speech data. This limits how far the complexity of the model can be increased. To overcome this issue, we apply an acoustic transformation to the pooled data and the dysarthric data.

III. PROPOSED METHOD - fMLLR ON POOLED AND DYSARTHRIC DATA

In this paper, we use the acoustic transformation technique fMLLR [4], in which the acoustically transformed feature vector ô_t is given by

    ô_t = A o_t + b = W ξ_t

where o_t ∈ R^D is the D-dimensional observation vector at time t in the original feature space, A ∈ R^(D×D) is the transformation matrix and b ∈ R^(D×1) is the bias. The extended (D+1) × 1 dimensional feature vector is ξ_t = [1 o_t^T]^T and the extended D × (D+1) dimensional acoustic transformation matrix is W = [b A]. The transformed feature vector ô_t lies in a normalized feature space, reducing the effect of inter-speaker variability.

In this method, fMLLR is applied to both the dysarthric data and the pooled data, moving data that lie in two different feature spaces into a common normalized feature space. The acoustic models built using this combined data (transformed dysarthric + pooled data) are further adapted using MLLR / fMLLR+SAT to form SA models for each speaker, as shown in Figure 3. The advantage of the proposed method is that the complexity of the model can be increased by providing more pooled data, thereby giving a more robust estimation of the parameters.
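As an illustration of the transform above, the following sketch applies an already-estimated fMLLR matrix W = [b A] to a matrix of feature vectors; estimating W itself (done in this work with standard toolkit routines) is not shown, and the function name is illustrative:

```python
import numpy as np

def apply_fmllr(features: np.ndarray, W: np.ndarray) -> np.ndarray:
    """Apply o_hat_t = A o_t + b = W xi_t to each row of `features`.

    features: (T, D) matrix of D-dimensional observation vectors.
    W:        (D, D+1) acoustic transformation matrix [b A].
    """
    T, D = features.shape
    assert W.shape == (D, D + 1)
    # Extended feature vectors xi_t = [1, o_t^T]^T, one per row.
    xi = np.hstack([np.ones((T, 1)), features])   # shape (T, D+1)
    return xi @ W.T                                # shape (T, D)

# Sanity check with D = 39 as in Section V: an identity transform
# (b = 0, A = I) must leave the features unchanged.
feats = np.random.randn(100, 39)
W_identity = np.hstack([np.zeros((39, 1)), np.eye(39)])
assert np.allclose(apply_fmllr(feats, W_identity), feats)
```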

Figure 3. Schematic diagram showing the implementation of proposed method (fMLLR on pooled data)

IV. DATABASES USED

A. Nemours Database

The Nemours database [3] contains continuous speech from 11 speakers covering degrees of dysarthria from mild to severe, recorded at a 16 kHz sampling rate. The vocabulary contains 113 words, with each speaker uttering 74 sentences. Only 10 speakers were used for our experiments (speaker KS is removed due to incomplete perceptual information [3]). Each sentence is of the form "The X is Ying the Z", where X and Z are monosyllabic nouns and Y is a disyllabic verb. All the sentences from each dysarthric speaker are repeated by an unimpaired (control) speaker, forming the unimpaired speaker's data. The assessment of each speaker is provided using the standard Frenchay dysarthria assessment (FDA) [17] scores. Since this database has a small amount of data covering words with varying acoustic characteristics, it was chosen for our experiments.

B. Wall Street Journal (WSJ0)

The Wall Street Journal (WSJ0) database [15], with a 16 kHz sampling rate and a vocabulary of more than 5K words (WSJ-SI84), was used to pool data with the Nemours database. The speech data contains continuous sentences with an equal number of utterances from male and female speakers.

C. Universal Access (UA) Speech Database

The UA-Speech database [7] contains dysarthric speech sampled at 16 kHz, collected for developing large-vocabulary ASR systems. The database contains data from 15 speakers with spastic dysarthria, of which only 7 speakers with varying speech intelligibility ratings were considered for our experiments, as described in [18]. The intelligibility ratings were measured using 5 native listeners, who transcribed the data orthographically. The details of each speaker [18] are given in Table II. Each speaker recorded 3 blocks of 255 words, of which two blocks were used for training and one block for testing. Each block of 255 words contains 100 non-overlapping words (i.e., different from the 100 words in the other blocks) and 155 overlapping words (i.e., words common to all blocks).

Table I. Improved acoustic model results for various tasks using Nemours and UA Speech (digits) database

Database              Hours of       Total hrs of   Datapooled SI     Proposed method: fMLLR on         % Rel Imp
                      pooled data    train data     model (%WER)      pooled+dysarthric data,           wrt baseline
                                                                      SI model (%WER)
Nemours               -              0.36           35.17             36.42 (Baseline)                  -
                      0.16           0.52           35.58             31.62                             13.17
                      0.29           0.65           35.83             32.04                             12.02
                      0.44           0.80           36.33             29.83                             18.09
                      0.52           0.88           36.75             30.21                             17.05
                      0.73           0.99           36.79             30.42                             16.47
UA Speech (Digits)    -              0.64           10.64             8.94 (Baseline)                   -
                      0.14           0.78           12.55             4.68                              47.65
                      0.23           0.87           12.77             6.38                              28.63
                      0.30           0.95           11.49             5.53                              38.14
                      0.44           1.09           13.19             4.47                              50.00
                      0.55           1.19           12.55             6.81                              23.82
                      0.67           1.32           12.98             5.74                              35.79

Proposed models are statistically significant with p-value < 0.001.
DP - Datapooled, WER - word error rate, Rel Imp - Relative Improvement, wrt - with respect to.
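For reference, the last column is the usual relative WER improvement over the fMLLR baseline; for example, for the 0.44-hour Nemours row:

    Rel Imp (%) = (WER_baseline - WER_proposed) / WER_baseline × 100
                = (36.42 - 29.83) / 36.42 × 100 ≈ 18.09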

The 100 non-overlapping words (like paragraph, sentence, etc.) are collected from children's novels digitized by Project Gutenberg. The 155 overlapping words in each block can be divided into 4 subsets, namely digits, computer commands, radio alphabets and common words. The digits set contains spoken utterances of zero to nine, the computer commands include words such as Enter and Alt, the radio alphabet set contains the 26 radio alphabet letters (Alpha, Bravo, etc.), and the 100 common words (go, in, by, etc.) are taken from Brown's corpus of written English. We have chosen only the digits set for our experiments.

Table II. Information about speakers' intelligibility

Speaker   Speech Intelligibility (%)
M09       High (86)
M05       Mid (58)
M06       Low (39)
F02       Low (29)
M07       Low (28)
F03       Very Low (6)
M04       Very Low (2)

D. TI Digits

The TI Digits corpus [14] contains clean speech, with both isolated and connected digits sampled at 20 kHz. The data was downsampled to 16 kHz to match the sampling rate of the dysarthric speech data. The speech data is collected from adults and children with equal gender representation. Each speaker produces approximately 77 isolated and connected digit utterances covering the digits 0 to 9 and "oh". For our experiments, only isolated digit utterances spoken by adults of both genders were pooled with the UA digit set. Depending on the amount of data pooling, an equal number of utterances was taken from both genders.
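A minimal sketch of the 20 kHz to 16 kHz downsampling mentioned above, assuming single-channel WAV files and using SciPy's polyphase resampler as an illustrative stand-in for whatever resampling tool was actually used:

```python
import numpy as np
from scipy.io import wavfile
from scipy.signal import resample_poly

def downsample_to_16k(in_wav: str, out_wav: str) -> None:
    """Resample a 20 kHz TIDigits waveform to 16 kHz (ratio 4/5)."""
    rate, samples = wavfile.read(in_wav)
    assert rate == 20000, "expected 20 kHz input"
    resampled = resample_poly(samples.astype(np.float64), up=4, down=5)
    wavfile.write(out_wav, 16000, resampled.astype(np.int16))
```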

V. EXPERIMENTAL SETUP

All our experiments were performed using the Kaldi toolkit [19]. Mel-frequency cepstral coefficient (MFCC) features with energy, delta and acceleration coefficients, forming a 39-dimensional feature vector, were computed using a 25 ms window length and a 10 ms shift. fMLLR is applied on the MFCC features to form the fMLLR features. The lexicon contains 39 phonemes for the Nemours database and 41 phonemes for the UA speech digit set (including silence), in the ARPAbet (Advanced Research Projects Agency) symbol set.
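A sketch of the 39-dimensional front end described above (13 cepstra, deltas and accelerations, 25 ms window, 10 ms shift), written with librosa as an illustrative stand-in for Kaldi's MFCC extraction; the first cepstral coefficient stands in for the energy term here:

```python
import librosa
import numpy as np

def extract_39dim_features(wav_path: str, sr: int = 16000) -> np.ndarray:
    """Return a (frames, 39) matrix: 13 MFCCs + deltas + delta-deltas,
    computed with a 25 ms window and 10 ms frame shift."""
    y, _ = librosa.load(wav_path, sr=sr)
    n_fft = int(0.025 * sr)   # 25 ms window -> 400 samples at 16 kHz
    hop = int(0.010 * sr)     # 10 ms shift  -> 160 samples
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13,
                                n_fft=n_fft, hop_length=hop)
    delta = librosa.feature.delta(mfcc)
    delta2 = librosa.feature.delta(mfcc, order=2)
    return np.vstack([mfcc, delta, delta2]).T   # (frames, 39)
```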

Trigram language models were used, and performance is measured in terms of word error rate (WER). The dysarthric baseline model for the Nemours database is built using fMLLR features, with training data containing 340 utterances (0.36 hours) and test data containing 400 utterances (0.44 hours). The number of senones and mixtures was varied empirically, and the best performance is obtained with 200 senones and 12 Gaussian mixtures. Since our main aim is to form a datapooled baseline model, the training data is smaller than the test data. Adaptation experiments were performed only using the Nemours data, with 34 utterances from each speaker. For the UA speech digit set, the training data contains 940 utterances (0.64 hours) and the test data 470 utterances (0.29 hours); the best baseline performance is obtained with 100 senones and 8 Gaussian mixtures. Datapooled models are formed using MFCC features for both the Nemours and UA speech data. In order to have balanced data during pooling, the data from WSJ0 and TIDigits were divided into small chunks to be pooled with the dysarthric data. The actual amount of data (in hours) for each task is given in Table I. To have an equal distribution of phonemes in the dysarthric and pooled data, some phonemes with similar pronunciations were added/modified in the pooled dataset [20].
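Since all results below are reported as word error rate, a minimal sketch of its computation via word-level edit distance is included purely for reference (not part of the original system):

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER (%) = (substitutions + deletions + insertions) / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # Standard Levenshtein dynamic programme over words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution / match
    return 100.0 * d[len(ref)][len(hyp)] / len(ref)

# Example in the Nemours sentence style: one substitution in six words -> 16.67%.
print(word_error_rate("the boat is dipping the shin",
                      "the coat is dipping the shin"))
```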

VI. RESULTS AND DISCUSSION

Nemours database: Table I shows that the datapooled model degrades in performance compared to the baseline model, due to the inter-speaker variability between the pooled data and the dysarthric data. The model with WER 35.17% built without pooling is better than the baseline with WER 36.42%, since only a small amount of data is available to estimate the fMLLR transformation matrix. Increasing the amount of pooled data decreases the performance of the system, since the model leans towards the pooled data. The fMLLR transformation applied to the pooled and dysarthric data moves the features into a normalized feature space. The acoustic model built using the transformed (dysarthric + pooled) fMLLR features with 0.44 hours of pooled data yields the best performance, with WER 29.83%, compared to the other models with varying amounts of data pooling.


Figure 4. Speaker-wise (from mild to severe category) %WER for Baseline model, fMLLR+SAT model and fMLLR on pooled data (proposed) model using NEMOURS dataset

This indicates that a pooled-data amount of about 0.44 hours is optimal for the Nemours database. Compared to the baseline model, a relative improvement of 18.09% is obtained with the proposed method. The proposed method also gives consistently better performance than the datapooled models as the amount of pooled data increases. The matched-pair sentence error test and McNemar's test are used to check statistical significance. The results of the proposed method (for all five datapooled settings) are statistically significant at the 99% level when compared with the dysarthric baseline model (p-value < 0.001).

UA speech database: Results for the digit set are shown in Table I. A similar argument to the Nemours database holds here, and our method is also verified on this database. The baseline model with WER 8.94% is better than the model with WER 10.64% built without pooling; this is due to the larger amount of data available, resulting in better estimation of the transformation matrix. As the amount of pooled data increases, the performance of the pooled model improves and then saturates. A pooling amount of 0.44 hours is found to be optimal from the model performance. The proposed model with WER 4.47% performs better than the other models with varying amounts of data pooling, giving a relative improvement of approximately 50% over the baseline model. The proposed models give consistent improvement over the datapooled models with increasing amounts of pooled data. As with the Nemours database, the proposed models are statistically significant at the 99% level (p-value < 0.001) compared to the baseline, using the matched-pair sentence error test and McNemar's test.

Speaker-wise results for acoustic models: Speaker-wise performance of the various acoustic models, arranged in order of increasing severity, is shown for the Nemours database in Figure 4. The proposed method performs better than the baseline model across all speakers. For the UA speech digit set, the proposed method performs better for all speakers except M05 and M04. It gives larger improvements for the high- and low-intelligibility speakers than for the mid- and very-low-intelligibility speakers, as shown in Figure 5.

Figure 5. Speaker-wise (from high to very low category) %WER for Baseline model, fMLLR+SAT model and fMLLR on pooled data (proposed) model using UA speech (digit set) dataset


Figure 6. Adaptation results on CDHMM, MLLR and fMLLR+SAT model for the JF speaker

Adaptation results for acoustic models: Overall, for the SA models, adaptation using fMLLR+SAT gave better performance than MLLR across all speakers. Compared to adapting the baseline models directly using fMLLR+SAT, the proposed method gives a relative improvement of 17.16%. Adapting the proposed models further using fMLLR+SAT / MLLR does not give any improvement, because the proposed method by itself already represents the dysarthric data efficiently. For example, consider the different acoustic models for speaker JF shown in Figure 6. For the baseline model, the CDHMM is better than the models adapted using fMLLR+SAT / MLLR. On the other hand, adaptation improves performance for the datapooled models. Finally, the proposed CDHMM gives better performance than the fMLLR+SAT / MLLR adapted models. This shows that the proposed method is more effective than adapting the models over the baseline.

FDA scores compared with the proposed acoustic model: The FDA scores for each speaker, expressed in terms of WER [3], are compared with the proposed method for the Nemours database in Figure 7. For mild and moderate dysarthric speakers, human intelligibility is clearly better than the dysarthric ASR system. But for severe speakers (except BK), the ASR system performs well compared with the FDA intelligibility scores (in terms of WER). Hence the results obtained with our approach confirm those reported in [9, 18].


Figure 7. Speaker-wise FDA scores (in terms of WER) vs WER of improved acoustic model

VII. CONCLUSION

Data insufficiency was handled by pooling data from unimpaired speech databases and acoustically transforming the pooled and dysarthric data using fMLLR. The performance improvement of the datapooled model and the proposed model over the dysarthric baseline model was shown for two different databases, one with a small amount of data and one with a large vocabulary. The benefit of the proposed method is that, using only a small amount of dysarthric data, robust SI acoustic models can be built by acoustically transforming the dysarthric and unimpaired speech data.

REFERENCES

[1] H. Kim, M. Hasegawa-Johnson, A. Perlman, J. Gunderson, T. S. Huang, K. Watkin, and S. Frame, "Dysarthric speech database for universal access research," in Proc. Interspeech, pp. 1741-1744, 2008.

[2] F. Rudzicz, A. K. Namasivayam, and T. Wolff, "The TORGO database of acoustic and articulatory speech from speakers with dysarthria," Language Resources and Evaluation, vol. 46, no. 4, pp. 523-541, 2012.
[3] X. Menendez-Pidal, J. B. Polikoff, S. M. Peters, J. E. Leonzio, and H. T. Bunnell, "The Nemours database of dysarthric speech," in Proc. ICSLP, vol. 3, pp. 1962-1965, IEEE, 1996.
[4] M. J. Gales, "Maximum likelihood linear transformations for HMM-based speech recognition," Computer Speech & Language, vol. 12, no. 2, pp. 75-98, 1998.
[5] C. Chesta, O. Siohan, and C.-H. Lee, "Maximum a posteriori linear regression for hidden Markov model adaptation," in Proc. Eurospeech, 1999.
[6] H. Christensen, S. Cunningham, C. Fox, P. Green, and T. Hain, "A comparative study of adaptive, automatic recognition of disordered speech," in Proc. Interspeech, 2012.
[7] M. J. Kim, J. Yoo, and H. Kim, "Dysarthric speech recognition using dysarthria-severity-dependent and speaker-adaptive models," in Proc. Interspeech, pp. 3622-3626, 2013.
[8] K. T. Mengistu and F. Rudzicz, "Adapting acoustic and lexical models to dysarthric speech," in Proc. ICASSP, pp. 4924-4927, IEEE, 2011.
[9] K. T. Mengistu and F. Rudzicz, "Comparing humans and automatic speech recognition systems in recognizing dysarthric speech," in Advances in Artificial Intelligence, pp. 291-300, Springer, 2011.
[10] F. Rudzicz, "Comparing speaker-dependent and speaker-adaptive acoustic models for recognizing dysarthric speech," in Proceedings of the 9th International ACM SIGACCESS Conference on Computers and Accessibility, pp. 255-256, ACM, 2007.
[11] O. Saz, E. Lleida, and A. Miguel, "Combination of acoustic and lexical speaker adaptation for disordered speech recognition," in Proc. Interspeech, pp. 544-547, 2009.
[12] P. Raghavendra, E. Rosengren, and S. Hunnicutt, "An investigation of different degrees of dysarthric speech as input to speaker-adaptive and speaker-dependent recognition systems," Augmentative and Alternative Communication, vol. 17, no. 4, pp. 265-275, 2001.
[13] M. B. Mustafa, S. S. Salim, N. Mohamed, B. Al-Qatab, and C. E. Siong, "Severity-based adaptation with limited data for ASR to aid dysarthric speakers," PLoS ONE, vol. 9, no. 1, p. e86285, 2014.
[14] R. G. Leonard, G. R. Doddington, and LDC, "Studio quality speaker-independent connected-digit corpus: TI Digits," Philadelphia, USA, 1993.
[15] D. B. Paul and J. M. Baker, "The design for the Wall Street Journal-based CSR corpus," in Proceedings of the Workshop on Speech and Natural Language, pp. 357-362, Association for Computational Linguistics, 1992.
[16] T. Anastasakos, J. McDonough, R. Schwartz, and J. Makhoul, "A compact model for speaker-adaptive training," in Proc. ICSLP, vol. 2, pp. 1137-1140, IEEE, 1996.
[17] P. Enderby, "Frenchay dysarthria assessment," International Journal of Language & Communication Disorders, vol. 15, no. 3, pp. 165-173, 1980.

[18] H. V. Sharma, M. Hasegawa-Johnson, J. Gunderson, and A. Perlman, "Universal access: speech recognition for talkers with spastic dysarthria," in Proc. Interspeech, pp. 1451-1454, 2009.
[19] D. Povey, A. Ghoshal, G. Boulianne, L. Burget, O. Glembek, N. Goel, M. Hannemann, P. Motlicek, Y. Qian, P. Schwarz, et al., "The Kaldi speech recognition toolkit," in Proc. ASRU, pp. 1-4, 2011.
[20] C. Lopes and F. Perdigão, "Phone recognition on the TIMIT database," Speech Technologies, vol. 1, pp. 285-302, 2011.