Introducing Objective Acoustic Metrics for the Frenchay Dysarthria Assessment Procedure

JAMES N. CARMICHAEL AUGUST, 2007

DOCTOR OF PHILOSOPHY IN COMPUTER SCIENCE

DEPARTMENT OF COMPUTER SCIENCE UNIVERSITY OF SHEFFIELD

Abstract This study reports on the elaboration of a computerised system of objective acoustic metrics specifically designed for the articulatory and intelligibility evaluation procedures of the Frenchay Dysarthria Assessment (FDA) series of diagnostic tests. These objective measures are based on deviance-from-norm template matching: certain acoustic features, such as pitch contour, are extracted from the patient’s oral response to test stimuli and the discrete values derived from these features are analysed to determine how much they vary from some pre-established norm. It is demonstrated that the pattern and magnitude of variation indicate the type and severity of the dysarthric condition manifested. A novel method of assessing the intelligibility of pre-selected isolated words and phrases is proposed: objective intelligibility measurements are computed from normalised goodness of fit (GOF) likelihood scores returned by a Hidden Markov Model-based (HMM) automatic speech recognition (ASR) system trained, in this particular case, on data obtained from a variety of normal speakers typifying the speech patterns of the southern Yorkshire region of the United Kingdom. It has been found that the respective HMM-derived GOF scores returned for normal and dysarthric speakers form sufficiently distinct cluster groups, thus justifying the use of these scores as a form of standardised intelligibility measurement to supplement the FDA’s current subjective evaluation techniques that are demonstrated to be psychometrically weak. Upon completion of the FDA assessment procedure, the resulting scores and medical notes describing the patient’s performance for each individual test are processed by an expert system along with a pre-trained multi-layer perceptron (MLP); these two classifiers operate in conjunction to return an overall diagnosis of the type and severity of dysarthria which is apparently manifested. When tested on FDA assessment data from 85 individuals whose dysarthric status has been independently confirmed by expert clinicians, this hybrid system’s diagnostic correctness is demonstrated to be 90.6% under certain conditions. When used by clinicians in real-world conditions, the computerised FDA (CFDA) application has proven to be a useful evaluation instrument, making possible the precise acoustic measurement of speech data and therefore facilitating a more consistent and robust diagnosis.


Acknowledgements The following individuals have figured prominently in the realisation of this thesis: Professors Phil Green and Martin Cooke, my joint supervisors, whose patience and commitment in terms of time and resources went far above and beyond the call of duty. Professor Pam Enderby, who patiently sat through hours of interrogation in order that her clinical expertise as the author of the paper-based FDA could be documented. Dr Rebecca Palmer, a colleague in several clinical research projects and one who was personally responsible for gathering much of the dysarthric speech data used in this investigation; furthermore, she was always willing to undertake the daunting task of testing any medical diagnostic software applications which I developed. Doctors Vincent Wan and Stuart Wrigley, painstaking proof-readers, whose invaluable suggestions and insights in the domains of signal processing and machine learning greatly assisted all implementation efforts. Finally, my mother, Mrs. Gladys Carmichael who – at 90 years of age – is a lady not yet totally convinced of the benefits to be accrued from the informatics revolution, but who nevertheless resolutely supports her son in his efforts to make a contribution to this field.


TABLE OF CONTENTS

ABSTRACT
ACKNOWLEDGEMENTS
LIST OF ABBREVIATIONS
FORMATTING AND BIBLIOGRAPHICAL CONVENTIONS

CHAPTER 1: INTRODUCTION
1.1 Dysarthria Diagnosis
1.2 Description of key terms and FDA sub-tasks
1.2.1 Explanation of Key Terms
1.2.2 FDA Task Composition
1.3 Thesis Structure

CHAPTER 2: DYSARTHRIA AND RELATED SPEECH DISORDERS: AETIOLOGY AND CURRENT METRICS
2.1 General Classification of Speech Disorder Types
2.1.1 Aphasia
2.1.1.1 Broca’s Aphasia
2.1.1.2 Anomic Aphasia
2.1.1.3 Assessment of Aphasic Conditions
2.1.2 Apraxia of Speech
2.1.3 Dysarthria
2.1.3.1 Extrapyramidal Hypokinetic Dysarthria
2.1.3.2 Spastic Dysarthria
2.1.3.3 Ataxic Dysarthria
2.1.3.4 Flaccid Dysarthria
2.1.3.5 Mixed Dysarthria
2.1.4 Measurement of Dysarthric Symptoms
2.2 Objectivity and Metric Systems
2.2.1 Feature Selection Criteria
2.3 Grading Scales used in Dysarthria Assessment
2.4 Validity of Statistical Operations on FDA Diagnostic Data
2.5 Task-Specific Measurement Methodologies
2.5.1 Measurement of Articulator Reflex Actions
2.5.2 Measurement of Respiratory Functions
2.5.3 Evaluating Labial, Lingual and Palatal functions
2.5.4 Evaluation of Lip Seal
2.5.5 Evaluation of Diadochokinesis
2.5.6 Evaluation of Velarpalatal competence
2.5.7 Evaluation of Laryngeal functions
2.5.7.1 Evaluation of Voice Quality
2.5.7.2 Spectral Analysis Techniques to describe Voice Quality
2.5.8 Evaluation of Pitch Modulation
2.5.8.1 Pitch Detection Techniques
2.5.9 Evaluation of Volume Control
2.5.9.1 Quantifying Perceptions of Loudness
2.5.10 Evaluation of Speaking Rate
2.5.10.1 Defining Parameters for Speech Rate Measurement
2.5.11 Evaluation of Intelligibility
2.5.11.1 FDA Intelligibility Evaluation Procedures
2.5.11.2 Different Aspects of Intelligibility
2.5.11.3 Intelligibility at the Source
2.5.11.4 Articulatory vs. Communicative Competence
2.5.11.5 Current Intelligibility Evaluation Methodologies
2.5.11.5.1 Effects of Noise on Intelligibility
2.5.11.5.2 Intrinsic Intelligibility Evaluation Systems
2.6 Formulating the Overall Diagnosis
2.6.1 Automating the Diagnostic Procedure
2.7 Chapter Summary

CHAPTER 3: PROTOCOLS FOR EQUIPMENT STANDARDISATION, TESTING CORPORA AND SIGNAL PROCESSING
3.1 Validating Experimental Results: Adopting the Expert System Paradigm
3.2 Creating a bespoke Dysarthric Speech Corpus
3.3 Data Processing Protocols
3.4 Objective measurement to FDA Grade Conversion
3.5 Generic Algorithms
3.5.1 Rate of Change Analysis
3.6 Chapter Summary

CHAPTER 4: MEASUREMENT OF RESPIRATORY FUNCTIONS
4.1 Measurement of Respiratory Pressure in a Non-Speech Context
4.1.1 Perceptual Identification of Respiration Irregularities
4.1.2 Criteria Weighting for Test 4 Grading
4.2 Test 5: Objective Measurement of Respiration during Speech
4.2.1 Regulation of speaking rate
4.2.2 Detecting In-Speech Inhalation
4.2.3 The Influence of Speech Rate
4.2.4 Monitoring Strength of Phonation
4.3 Criteria Weighting and Correlation with Expert Panel

CHAPTER 5: MEASUREMENT OF LABIAL, PALATAL AND LINGUAL FUNCTIONS
5.1 Devising pseudo-words for the ASR-based analysis of articulatory behaviour
5.2 Evaluating the Quality of Lip Seal
5.2.1 Detecting Defective Lip Seal
5.2.2 Task 8 Criteria Weighting and Correlation with Expert Panel
5.3 Evaluation of Diadochokinesis
5.3.1 Quantifying DDK incompetence
5.3.2 Task 24 Criteria Weighting and Correlation with Expert Panel
5.4 Evaluation of Velarpalatal competence
5.4.1 Objective Measurement of Hypernasality
5.5 Chapter Summary

CHAPTER 6: MEASUREMENT OF LARYNGEAL FUNCTIONS
6.1 Speech Rate Estimation
6.1.1 Syllable Rate Estimation for Dysarthric Speech
6.2 Measurement and Description of Voice Quality
6.2.1 Differentiating Between Pathological Voice Quality Types
6.2.2 Diagnosis and Classification of Breathy Voice
6.2.3 Diagnosis and Classification of Hoarse Voice
6.3 Pitch Modulation Measurement
6.3.1 Measuring Control
6.3.1.1 Perceptual vs. Objective Pitch Modulation Assessment
6.3.2 Quantifying Task Evaluation Criteria
6.4 Estimating Loudness
6.4.1 Investigating the Effect of Environmental and Hardware Variables
6.5 Efficacy of CFDA Loudness Measurement
6.6 Assessing the Composite Effect of Voice quality, Volume and Pitch Modulation

CHAPTER 7: ASSESSING INTELLIGIBILITY
7.1 Measuring Naïve Listener Perceptions
7.1.1 Experiment Design: Multi-dimension Intelligibility Measurement
7.1.1.1 Selection of the Utterance Test Set and Testing Protocols
7.1.1.2 Experimental Results for the Percept-based Intelligibility Tests
7.2 Consistency in HMM-based vs. Perceptual Intelligibility Measurement
7.2.1 Investigating the Learning Effect
7.2.2 Implementing a Computerised Naïve Listener Model
7.2.2.1 Intelligibility Estimator HMM Model Architecture and Standardisation Protocol
7.2.3 Intelligibility Estimator Experimental Results
7.3 Chapter Summary

CHAPTER 8: OVERALL DIAGNOSIS OF DYSARTHRIA TYPE
8.1 Combining Pattern Recognition and Expert Knowledge
8.2 Encoding Expert Knowledge
8.2.1 Information capture through dialogue: questioning the clinician
8.2.2 Reviewing the machine-generated output: considering the objective evidence
8.3 Selecting and Training the Appropriate Classifier
8.3.1 Coping with Data Scarcity
8.3.2 Neural Network Configuration and Training Parameters
8.3.3 Performance of Rule-based classifier in stand-alone and hybrid modes
8.3.4 The CFDA Protocol: Diagnosis with a Caveat

CHAPTER 9: DISCUSSION, CONCLUSIONS AND FURTHER WORK
9.1 Evaluation of Respiratory Measurement Systems
9.2 Encoding Expert Knowledge
9.3 Evaluation of Intelligibility Measurement Systems
9.4 Evaluation of CFDA’s Overall Usability
9.5 Future Work
9.5.1 The Self-Adjusting CFDA
9.5.2 Establishing the New Paradigm

REFERENCES
APPENDIX A:
APPENDIX B:

LIST OF ABBREVIATIONS

AI      Articulation Index
AIDS    Assessment of the Intelligibility of Dysarthric Speech
AF      Articulatory Feature
AMDF    Average Magnitude Difference Function
ANN     Artificial Neural Network
ASR     Automatic Speech Recognition
BPI     Breathy phonation index
CIP     Clarity and intensity of phonation
CFDA    Computerised Frenchay Dysarthria Assessment software application
CT      Computed Tomography
dBA     Decibel “A” rating
DDK     Diadochokinesis
DSP     Digital Signal Processing
DTW     Dynamic time warping
EFA     Enderby Frenchay Assessment (FDA diagnostic data corpus)
FDA     Frenchay Dysarthria Assessment (paper-based evaluation procedure)
FDT     Frenchay Dysarthria Test (software application)
FFT     Fast Fourier transform
GOF     Goodness of fit
HI      Hoarseness Index
HMM     Hidden Markov model
HSR     Human Speech Recognition
HTK     Hidden Markov Model Toolkit
ISO     International Standards Organisation
LDA     Linear discriminant analysis
LVCSR   Large vocabulary connected speech recognition
MDVP    Multidimensional Voice Program
MFCC    Mel frequency cepstral coefficient
MLP     Multi-layer Perceptron
MLU     Maximum Likelihood Utterance
MND     Motor Neuron Disease
MRI     Magnetic Resonance Imaging
NIST    National Institute of Standards and Technology
NTIA    National Telecommunications and Information Administration
PDA     Pitch detection algorithm
PDI     Pathological Deviation Index
PII     Perceptual Intelligibility Index
PSQM    Perceptual speech-quality measure
PVI     Pitch Variation Index
NT      Normalcy Threshold
PSOLA   Pitch Synchronous Overlap and Add
RAP     Relative Average Perturbation Index
RDP     Robertson Dysarthria Profile
ROC     Receiver operating characteristic
RP      Received Pronunciation
SII     Speech Intelligibility Index
SLT     Speech and Language Therapist
SNR     Signal-to-noise ratio
SPL     Sound pressure level
STI     Speech Transmission Index
SUFDAR  Sheffield University Dysarthria Assessment Recordings
WER     Word Error Rate

FORMATTING AND BIBLIOGRAPHICAL CONVENTIONS Tables and figures are numbered according to the chapter in which they appear and their sequencing within the given chapter; thus the second table in the fourth chapter would be labelled Table 4.2, for example, and the second figure in the fourth chapter would also have the same ‘4.2’ numbering but would be designated Figure 4.2. A similar numbering convention obtains for all equations and algorithms that are presented. Since this study deals with the computerisation of an existing paper-based evaluation procedure, there are extensive references and quotations within this thesis to this procedure. A light grey background is used to highlight such citations; the excerpt presented below (taken from the second chapter of this study) contains an example of this highlighting: “The grading criteria are the following: Ask the patient to say /may pay/ and /nay bay/ while you listen to the change of quality. The assessor may find that placing his/her fingers on the bridge of the nose or using a mirror under the patient's nose will assist reliable grading. Grade A: Normal resonance. No nasal emission. Grade B: Slight hypernasality/imbalanced nasal resonance and / or occasional slight nasal emission.

In terms of the general document formatting style, the following layout conventions apply:

• Major chapter section headings are formatted in bold font, e.g. “2.1. General Classification of Speech Disorder Types”;

• Secondary level headings appear in italicised font, for instance: “2.1.1. Aphasia”;

• All the lower level headings are underlined, e.g. “2.1.1.2 Anomic Aphasia”;

• All in-line citations appear in brackets with the author’s surname listed along with the cited document’s year of publication, e.g. (Lippmann, 1997);

• The following bibliographical formatting is used for references:

o Authors’ surnames and initials are listed followed by the title of their work; the title is underlined if it is a monograph, otherwise it is placed in double quotation marks.

o If the work cited is a monograph, the publishing house’s particulars are listed along with the year of publication. If the work is an article that has been published in a journal, or conference / workshop proceedings, then such details are presented in italicised font.

o Volume numbers for journal or conference publications are presented in bold font, with the sub-volume numbers in brackets and the article’s page numbers listed immediately after.

o The year of publication appears at the end of the citation.

A typical example of a reference appears below:

Lippmann, R. P., “Speech recognition by machines and humans”, Speech Communication, 22 (1), 1-15, 1997.

In the reference cited above, the article was published in 1997 and appears in volume 22, no. 1, of the Speech Communication journal.

Chapter 1: Introduction The production of well-inflected and clearly enunciated speech is dependent on the ability to exercise a precise and fine-grained physical control over those organs, collectively known as the articulators, which contribute to the process of oral communication. Any disruption to the neuro-muscular mechanisms responsible for manipulating the articulators will usually result in speech which is, in some measure, acoustically malformed and often unusually difficult to understand (Darley et al., 1975; Enderby, 1980). The term dysarthria is used to describe those conditions which are responsible for such neuro-muscular articulatory malfunction and the resultant abnormal speech.

As can be inferred from the above, a dysarthric condition can arise from impairment to any or all of the various speech subsystems, such as those which regulate respiratory, lingual and laryngeal functions. Given the susceptibility of each of these organ groups to disruption by disease or injury, the incidence of dysarthria – either of the permanent or temporary variety – is quite significant in the general population, estimated to be 170 per 100,000 in the United Kingdom (Enderby and Emerson, 1995). Moreover, dysarthric symptoms are manifested in approximately a third of all individuals suffering traumatic brain injuries (Theodoros et al., 1995); similarly, this disorder affects, to some degree, at least 19% of those stricken by degenerative diseases such as multiple sclerosis (MS), Parkinson’s disease and motor neurone disease (MND). A similar prevalence is noted among victims of stroke and other types of cerebrovascular accident (Thompson et al., 1997).

In order to initiate an appropriate program of treatment and rehabilitation for dysarthric conditions, the first phase of intervention is – of course – the identification of the most likely dysarthria sub-type which appears to present itself. This investigation has sought to devise and implement a system of objective metrics to facilitate more robust (and reproducible) diagnostic procedures; the section that follows introduces the empirical basis of these objective measures and highlights certain disadvantages associated with the current dysarthria assessment methodologies and techniques.


1.1 Dysarthria Diagnosis Due to its substantial reliance on subjective assessment, the diagnosis of speech disorders in general can occasionally be complicated by significant disagreement among experts regarding the appropriate classification of actual pathological cases. This inherent inconsistency, or psychometric weakness, of opinion-based evaluation is partly responsible for the continuing polemic concerning the identification and categorisation of various categories of dysarthria (Auzou et al, 1998). In broad terms, the two principal sources of contention centre around matters of taxonomy – how to distinguish the different disease sub-groups based on their respective causative conditions – and parameterisation (the best method of measuring and describing symptoms). This study is not a treatise on speech disorder aetiology and therefore does not pretend to make any significant contributions to the debate concerning optimal categorisation of the various dysarthria sub-types. The investigation focuses, instead, on the elaboration of a system of acoustically-based objective measures which, by quantifying the description of certain dysarthric conditions, facilitates a more accurate and consistent diagnosis. Moreover, the measurement protocols and output of this metric system are reproducible, the intention being to associate some specific empirical value to subjective descriptions of pathological conditions which can often be ambiguous and imprecise. When assessing, for example, a patient’s attempt at demonstrating appropriate pitch variation during speech, it is more accurate to describe the utterance as a range of discrete glottal frequency values as opposed to a vague and minimalist comment such as “patient exhibited minimal pitch change”.
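To make the contrast concrete, the sketch below (not the CFDA’s actual pitch-detection algorithm, which is reviewed in section 2.5.8.1) shows one conventional way of reducing an utterance to a set of discrete glottal frequency values: a simple autocorrelation-based F0 estimate per frame, summarised as a minimum, maximum and semitone range. The function names, frame sizes and voicing threshold are illustrative assumptions only.

import numpy as np

def estimate_f0_autocorr(frame, sample_rate, f0_min=60.0, f0_max=400.0):
    # Estimate the fundamental frequency of one frame by autocorrelation;
    # return 0.0 when no plausible pitch peak is found (treated as unvoiced).
    frame = frame - np.mean(frame)
    if not np.any(frame):
        return 0.0
    corr = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    corr /= corr[0]                      # normalise so lag 0 == 1
    lag_min = int(sample_rate / f0_max)  # shortest period of interest
    lag_max = int(sample_rate / f0_min)  # longest period of interest
    if lag_max >= len(corr):
        return 0.0
    lag = lag_min + int(np.argmax(corr[lag_min:lag_max]))
    return sample_rate / lag if corr[lag] > 0.3 else 0.0

def pitch_contour(signal, sample_rate, frame_ms=40, hop_ms=10):
    # Slice the recording into overlapping frames, one F0 value per frame.
    frame_len = int(sample_rate * frame_ms / 1000)
    hop = int(sample_rate * hop_ms / 1000)
    return np.array([estimate_f0_autocorr(signal[i:i + frame_len], sample_rate)
                     for i in range(0, len(signal) - frame_len, hop)])

def summarise_contour(f0):
    # Reduce the contour to the kind of discrete values a clinician could record.
    voiced = f0[f0 > 0]
    if voiced.size == 0:
        return {"min_hz": 0.0, "max_hz": 0.0, "range_semitones": 0.0}
    return {"min_hz": float(voiced.min()),
            "max_hz": float(voiced.max()),
            "range_semitones": float(12 * np.log2(voiced.max() / voiced.min()))}

if __name__ == "__main__":
    sr = 16000
    t = np.arange(sr) / sr
    glide = np.sin(2 * np.pi * (120 + 30 * t) * t)   # synthetic rising-pitch vowel
    print(summarise_contour(pitch_contour(glide, sr)))

A summary of this kind (“F0 ranged from 120 Hz to 180 Hz, a span of roughly seven semitones”) is precisely the sort of empirical value the metric system aims to attach to otherwise vague clinical comments.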

This empirical approach, implemented as a computerised application, is based on the Frenchay Dysarthria Assessment (FDA) battery of tests (Enderby, 1983), one of the most widely used dysarthria diagnostic evaluation procedures (Aronson, 1993). In addition to providing objective acoustic descriptions of various speech sounds, a new metric is proposed which employs automatic speech recognition (ASR) technology to rate the intelligibility of dysarthric utterances representing the words and short phrases of the FDA intelligibility tests. These intelligibility measurements are computed from the goodness of fit (GOF) likelihood scores returned by a continuous density hidden Markov Model-based (HMM) speech recognition system trained on data obtained from a variety of normal speakers typifying the speech patterns of the subject’s peer group.


The importance of formulating an objective metric to assess intelligibility becomes all the more apparent given the inherent psychometric weakness of current subjective intelligibility evaluation: a listener’s capacity to understand a given speaker is influenced by several factors including the level of previous exposure to said speaker’s accent, shared cultural experiences and personal association with the individual being assessed. Carmichael and Green (2003) have also demonstrated that there is significant discrepancy in speech intelligibility assessment between expert evaluators (those with formal training in a relevant discipline, such as linguistics) and non-expert judges. In addition to this inter-evaluator variability, the incidence of intra-evaluator inconsistency – i.e. the same listener returning differing scores for the same data after re-assessing said data set – can also adversely affect the reliability of traditional intelligibility assessment.
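The inter- and intra-evaluator discrepancies described above can be quantified with nothing more than a correlation between repeated ratings of the same recordings; the sketch below uses invented scores purely to illustrate the calculation and makes no claim about the figures reported by Carmichael and Green (2003).

import numpy as np

def pearson(x, y):
    # Pearson correlation between two sets of ratings of the same recordings.
    return float(np.corrcoef(np.asarray(x, float), np.asarray(y, float))[0, 1])

# Invented 0-100 intelligibility ratings for the same ten recordings.
expert_judge       = [85, 40, 62, 90, 55, 70, 35, 80, 60, 45]
naive_judge        = [95, 60, 75, 92, 70, 85, 55, 88, 78, 65]
naive_judge_retest = [90, 50, 80, 95, 60, 80, 60, 85, 70, 55]

print("inter-evaluator agreement  :", round(pearson(expert_judge, naive_judge), 2))
print("intra-evaluator consistency:", round(pearson(naive_judge, naive_judge_retest), 2))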

The aim of ASR-based intelligibility assessment is the elimination of this element of psychometric weakness by introducing testing procedures and conditions which are reproducible and constant in their output: automatic speech recognition based on HMM technology has the advantage of always returning the same score for the same recording of a speech sample provided that the automatic speech recogniser’s operating parameters remain unchanged.
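As a rough illustration of why such scores are reproducible and comparable, the following sketch assumes that an HMM recogniser (for example one built with HTK) has already produced per-frame acoustic log-likelihoods for a forced alignment of the target word; the GOF is then simply the duration-normalised mean of those values, which can be standardised against scores collected from normal speakers. The actual normalisation protocol is described in chapter 7; the numbers and function names here are hypothetical.

import numpy as np

def goodness_of_fit(frame_log_likelihoods):
    # Duration-normalised acoustic score: mean per-frame log-likelihood of the
    # forced alignment, as might be reported by an HMM recogniser.
    return float(np.asarray(frame_log_likelihoods, dtype=float).mean())

def standardised_gof(gof, normal_speaker_gofs):
    # Express a test subject's GOF as a z-score relative to scores collected
    # from normal speakers reading the same FDA word list.
    reference = np.asarray(normal_speaker_gofs, dtype=float)
    return (gof - reference.mean()) / reference.std(ddof=1)

# Invented numbers purely for illustration.
normal_scores = [-62.1, -60.4, -63.8, -61.2, -59.9, -62.7]
patient_frames = np.random.default_rng(1).normal(-75.0, 4.0, size=120)
z = standardised_gof(goodness_of_fit(patient_frames), normal_scores)
print(f"GOF z-score: {z:.1f}  (large negative values suggest reduced intelligibility)")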

The principal rationale, of course, for proposing objective description of various pathological conditions is to lessen the possibility of incorrect diagnosis of the specific dysarthria disease sub-type and the consequences which such a misdiagnosis might entail, namely the prescribing of an inappropriate or even counterproductive course of treatment. A failure to correctly identify an instance of Extrapyramidal Hypokinetic Dysarthria, for example, might well result in an inappropriate therapy regimen targeting the patient’s motor programming competence when such is unnecessary since this particular category of dysarthria does not affect that domain 1 , merely the muscular dexterity associated with it (Darley et al, 1975; Brookshire, 1992). A misdiagnosis of Extrapyramidal Hypokinetic Dysarthria could also result in a missed opportunity to initiate early treatment of a possible causative agent – Parkinson’s disease.

1 Ataxic Dysarthria, on the other hand, normally does indicate an impairment of motor programming control.


In addition to improving diagnostic accuracy, the introduction of an objective metric system which emulates the analytical behaviour of an expert speech and language therapist (SLT) may well make it possible for general practitioners to perform preliminary assessments – for the purposes of referral to a specialist – on any of their patients whom they suspect to be afflicted with a dysarthric condition. Such preliminary assessments, if performed using the computerised FDA application that has been developed, would also produce invaluable evaluation data (in the form of digital acoustic recordings 2 ) that would enable a consulting specialist to more easily identify and categorise the patient’s dysarthria sub-type. Moreover, the recording of a patient’s responses to the FDA’s intelligibility tests permits independent review and processing of such data by objective intelligibility measurement systems, a particularly important verification procedure in the case of intelligibility evaluation since it eliminates any bias in judgement resulting from a human listener’s prior acquaintance with the speaker being assessed.

Apart from the theoretical issues concerning the formulation of an objective metric system to promote accurate diagnosis and treatment, this research encompasses more utilitarian objectives, namely the production of a commercially available computerised diagnostic tool which automates the actual diagnostic process (i.e. identifying the most likely dysarthria disease sub-type given an FDA test result profile). This level of automation of any speech disorder diagnostic procedure would appear to be unprecedented 3 .

2 Such digital recording and archiving of dysarthric speech samples would also make possible the establishment of a database of FDA-specific audio data, which could prove an invaluable source of clinical reference material. Currently, the Nemours database (Menéndez-Pidal et al, 1996) is the only publicly accessible collection of dysarthric speech data, however the majority of the material therein is not suitable for FDA diagnostic exercises.
3 There have been previous attempts to automate various aspects of the FDA procedure (see chapter 2) but these have met with limited success, both in terms of diagnostic accuracy and acceptance by clinicians.

1.2 Description of key terms and FDA sub-tasks To facilitate a clearer understanding and appreciation of FDA testing conventions and associated terminology, this section incorporates a glossary of terms that have a specialist meaning in the context of dysarthria assessment. A ‘quick reference’ full listing of all the FDA tasks is also presented.

1.2.1 Explanation of Key Terms When discussing FDA testing procedures and their outcomes in this report, frequent reference is made to certain medical terms and phrases, the specific meanings of which are as follows:

(i) Assessment Criteria (also referred to as Evaluation Criteria): When evaluating the performance of an individual undergoing an FDA diagnostic test, the test administrator is expected to grade said performance by consulting a set of guidelines defining a series of requirements or reference values – corresponding to specific letter grades – which indicate the level of competence displayed by the test subject. In the case of the voice quality task, for example, one of the assessment criteria that must be satisfied for the award of an ‘A’ rating requires the speaker to produce a minimum of fifteen seconds of clear phonation; in contrast, a “C” competence rating only requires five to ten seconds of voicing which may be “interrupted by intermittent huskiness or breaks in phonation” (Enderby, 1983). These benchmark values are meant to give the clinician a clear idea concerning how to grade the severity of any apparent abnormalities and/or pathological symptoms.

(ii) Task: a task is the description of an action or series of actions – normally involving manipulation of the articulators – which the patient / test subject is expected to perform in response to some set of stimuli or instructions. To cite an example: the task definition for the FDA voice quality evaluation procedure reads as follows: “Ask the patient to say "AH" for as long as possible. Demonstrate and note the second attempt.” (Enderby, 1983) Overall, the paper-based FDA comprises 28 tasks sub-divided into eight assessment categories, the specifics of which will be discussed in the following chapter.

(iii) Test: in the context of the FDA, a test comprises the description of a task (see above) along with its associated evaluation criteria. It is not uncommon for the terms “test” and “task” to be used interchangeably, although – strictly speaking – it is possible to perform an FDA task without the intention (or result) of being formally assessed, thus the mere execution of the task does not in itself constitute a test.

(iv) Test subject: in the context of this study, the term “test subject” refers to the person undergoing an FDA test or series of tests. It is often the case that “test subject” and “patient” are treated as synonyms; however, such usage is not semantically accurate since a person displaying no evidence of dysarthric symptoms can still take an FDA evaluation 4 (and therefore be a test subject for the duration of the diagnostic interview) but would not be considered a “patient” in the usual sense of the term, i.e. under medical supervision due to some ailment. Such distinctions in semantic nuance notwithstanding, the terms “patient” and “test subject” are often used synonymously in this thesis since the evaluation and task criteria in the paper-based FDA also use them interchangeably.

1.2.2 FDA Task Composition The computerised FDA (CFDA), like its paper-based counterpart, is composed of 28 tasks divided into eight categories as listed in Table 1.1. The thesis chapters which discuss the implementation of the metric systems associated with these tasks are listed in the table’s “Reference Chapter” column; objective signal processing metrics have been introduced for those tasks which are annotated with a single asterisk “*” while a double asterisk “**” indicates that some form of speech or phone recognition technology has been incorporated 5 .

4 Indeed, during the course of this investigation, various comparative analysis experiments required the collection of substantial amounts of non-pathological acoustic data from normal speakers who volunteered to undergo the full FDA diagnostic test battery. None of these normal speakers, of course, could be considered “patients” (in the sense of being under the care of a physician) but they were, most definitely, test subjects.
5 It will be noted that objective acoustic metric systems have been introduced only for those twelve tasks which are assessed primarily by auditory inspection (see chapter 3 for a more detailed discussion of this issue).

Table 1.1: List of FDA Tasks and Associated Levels of Automation

Task Category       Task No. and Name                  Level of Automation   Reference Chapter
1. Reflexes         1: Cough                           None                  --
                    2: Swallow                         None                  --
                    3: Dribble                         None                  --
2. Respiration      4: Respiration at Rest             *                     4
                    5: Respiration in Speech           *                     4
3. Lips             6: Lips (At Rest)                  None                  --
                    7: Lips (Spread)                   None                  --
                    8: Lip seal                        **                    5
                    9: Lip (Alternate Movements)       None                  --
                    10: Lips (In Speech)               None                  --
4. Jaw              11: Jaw (At Rest)                  None                  --
                    12: Jaw (In Speech)                None                  --
5. Soft Palate      13: Soft Palate (Fluids)           None                  --
                    14: Soft Palate (Maintenance)      None                  --
                    15: Soft Palate (In Speech)        **                    5
6. Laryngeal        16: Time/Phonation Quality         *                     6
                    17: Pitch Modulation               *                     6
                    18: Volume Control                 *                     6
                    19: In Speech                      *                     6
7. Tongue           20: Tongue (at Rest)               None                  --
                    21: Tongue (Protrusion)            None                  --
                    22: Tongue (Elevation)             None                  --
                    23: Tongue (Lateral Movements)     None                  --
                    24: Tongue (Alternate Movements)   **                    5
                    25: Tongue (In Speech)             None                  --
8. Intelligibility  26: Isolated Words                 **                    7
                    27: Isolated Phrases               **                    7
                    28: Conversation                   *                     6

1.3 Thesis Structure The structure and sequence of this thesis mirrors that of the FDA assessment procedure on which it is based. The following chapter presents a review of evaluation theories for dysarthria assessment along with a discussion of how signal processing techniques may be adapted to provide objective measurement and classification of pathological acoustic phenomena. Chapters 3 to 7 discuss the actual implementation and performance of the computerised FDA battery of tests, comparing objective with subjective measurements. Chapter 8 reports on improved procedures employed to arrive at an overall diagnosis of disease sub-type; furthermore, issues concerning the usability of the CFDA computer program under real-world conditions are also addressed. The final chapter of this thesis presents a summative assessment of the contribution and role of this software package to the goal of better consistency in the description and diagnosis of dysarthria symptoms and conditions.


Chapter 2: Dysarthria and Related Speech Disorders: Aetiology and Current Metrics The classification of dysarthria sub-types and associated diagnostic methodologies proposed by Darley et al. (1975) is the most widely accepted of all such taxonomies and constitutes the framework upon which the two major dysarthria assessment procedures – the Robertson Dysarthria Profile (RDP) (Robertson, 1982), and the Frenchay Dysarthria Assessment (FDA) (Enderby, 1983) – are based. Although several analytical procedures and studies have been published on the topic of dysarthria diagnosis and rehabilitation, only the RDP and the FDA are designed specifically as diagnostic tools: not only do they include a series of speech and articulatory evaluation exercises, they also incorporate customised measurement systems to allow categorisation and some form of estimation of the severity of pathological phenomena such as abnormal voice quality. Compared to the RDP, the FDA is substantially more ambitious in its diagnostic objectives since it seeks to identify the specific dysarthria sub-type which the patient appears to manifest. Given that this study is not intended to be a review of speech data processing techniques in general, the discussion of speech measurement methodologies presented here will focus primarily on the use – or possible use – of those technologies which are suitable for the particular requirements of the various FDA acoustic assessment tasks. Subsequent sections of this chapter will present various reviews of these metric systems as they relate to the articulatory category (e.g., respiratory function) which they seek to evaluate. Prior to such an analysis of articulator-specific measurement systems, however, it is useful to gain some appreciation of the general aetiology of dysarthria, which is the subject of the section that follows.

2.1 General Classification of Speech and Language Disorder Types Darley et al. (1975) suggest that speech and language disorders may be classified in accordance with two fundamental criteria: the extent to which the ability to physically manipulate the articulators has been compromised and the degree of disruption to the innate capacity for language processing that produces coherent and fluent speech (see Figure 2.1’s schematic diagram showing the vocal tract articulators). Although it is not uncommon for those afflicted to exhibit both pathology types in some combination, it is nevertheless critical to make the distinction between the physical and cognitive domains since dysarthria is defined as the neuromotor impairment of the various speech production organs as opposed to any degradation of the higher level mental processes that drive them. It is quite possible, for example, for a dysarthric to show no signs of any psycho-linguistic incompetence even though his or her speech is hardly intelligible to the naïve listener 6 . Conversely, certain types of aphasia – such as Wernicke’s aphasia – do not occasion any decrease in intelligibility per se despite the fact that speech resulting from this condition lacks normal semantic fluency at the level of appropriate vocabulary selection, verb conjugation, etc. The following summary of the major speech disorder categories is intended to highlight any overlap between these pathologies in terms of articulator dysfunction, resultant symptom profiles and the measurement of associated phenomena, particularly where such metric systems incorporate objective acoustic analysis.

Figure 2.1: Mid-Section of Vocal Tract showing Place of Articulation (taken from Handbook of the International Phonetic Association, 1999)

2.1.1. Aphasia As alluded to in the introductory paragraphs of this chapter, aphasia is broadly defined as the reduction in the capacity for language receptivity and production to the point where communication and comprehension are compromised. Aphasic conditions may be further sub-categorised into the fluent and non-fluent varieties (Darley, 1975; Linebaugh, 1983), of which the following are the most commonly encountered:

6 A very well-known example of this is the physicist Stephen Hawking, who has suffered virtually total loss of articulator control but who certainly is in no way suffering from higher-order language skill impairment.

2.1.1.1 Broca’s Aphasia Resulting from damage to the section of the brain known as Broca’s area, the sufferer’s ability to systematically retrieve the desired word or word sequence is severely disrupted, thus oral fluency is reduced to sentence fragments not exceeding five words with pronunciation adversely affected (Freed, 2000). Language comprehension skills, that is reading and listening, are not normally affected.

2.1.1.2 Anomic Aphasia Anomic aphasia resembles a mild form of the Broca variety since those afflicted manifest a chronic inability to supply the appropriate nouns and verbs to generate succinct sentences. As a result their speech, while fluent and grammatically acceptable, is usually replete with circumlocutions since the desired words are perennially “on the tip of the tongue” just beyond reach. Word retrieval is as problematic in writing as in speech (Brookshire, 1992).

2.1.1.3 Assessment of Aphasic Conditions Since aphasia is essentially a cognitive rather than articulatory impairment, the associated testing procedures do not normally include any significant component of acoustic analysis to determine articulatory competence. The testing emphasis of the major aphasia diagnostic tools – such as the Aphasia Language Performance Scales (ALPS) (Keenan and Brassell, 1975) and the Boston Diagnostic Aphasia Examination (BDAE) (Goodglass et al., 2000) – concentrates primarily on assessing competence at the level of appropriate vocabulary selection in response to visual or aural stimuli. Typical testing procedures include the following:

• Oral and/or written description of an image or series of images;

• Pointing to objects/pictures named by the examiner;

• Choosing from a group of pictures or geometric shapes those which are related in some way;

• Responding to questions and/or directives orally presented by the examiner.

In terms of gauging the severity of pathological conditions, virtually all aphasic diagnostic tests subscribe to a tripartite categorisation system which classifies symptoms as pertaining to one of three categories: mild, moderate or severe. It would appear that the boundaries separating these categories are based on the subjective assessments of the examining clinician and do not always correspond to any particular measurable value.

2.1.2 Apraxia of Speech At the general level, apraxia represents a disruption to either ideational (the ability to preplan sequences of actions to manipulate some object or tool) or ideomotor (the capacity to demonstrate actual competence in tool manipulation) processes (Brookshire, 1992). Unlike the various forms of aphasia, apraxia of speech (also known as dyspraxia 7 ) is not a disorder that causes language production/processing difficulties at the cognitive level (Dogil and Mayer, 1998). Those who suffer from speech apraxia do, however, encounter extraordinary difficulty in controlling articulatory movements in such a way as to consistently produce well-formed meaningful speech. This inadequate articulatory motor control is a characteristic which apraxia shares with dysarthria, however it is to be noted that – unlike dysarthria – apraxic conditions do not occasion any articulator muscle weakness or paralysis. Nevertheless, due to the impaired physical control of the articulators caused by the disease, apraxia is often considered to be both a speech and language disorder (Miller, 1992). The main characteristics of apraxia of speech are:
(a) Anticipatory errors (producing a sound before it occurs in a phrase, e.g. “thoothbrush” for “toothbrush”) (Brookshire, 1992).
(b) Substitution of the voiced counterparts of unvoiced sounds, such as /b/ for /p/ (Brookshire, 1992).
(c) Errors in consonant production, which tend to be more frequent than errors in vowel production (Brookshire, 1992).
(d) Incorrect use of prosody, i.e. inappropriate rhythm and intonation during speech (Dogil and Mayer, 1998).

7 There are two main varieties of apraxia: (i) acquired apraxia of speech, caused by some form of brain damage and (ii) developmental apraxia of speech (DAS), which is congenital (Miller, 1992).

2.1.3 Dysarthria As mentioned previously, dysarthria – amongst all the abovementioned speech disorders – is perhaps the most obvious to diagnose since the observable symptoms constitute the malady: any impairment of articulator control usually results in abnormal speech and this pathological speech production is itself dysarthria. 8 The nature and pattern of symptoms manifested – i.e. the specific type of dysarthria – are strongly correlated to the region of the brain which has been subject to injury or the effects of disease. Darley’s original taxonomy of dysarthria categories has been simplified by Enderby (1983) to five major categories:

2.1.3.1 Extrapyramidal Hypokinetic Dysarthria Damage or deterioration to the brain’s extrapyramidal tract, a typical condition of Parkinson’s disease, normally results in shallow respiration and rigidity of the laryngeal muscles. As Brookshire (1992) observes, voice quality is likely to be strained and breathy with volume greatly reduced. Articulation is noticeably imprecise with frequent bouts of rapid indistinct speech characterised by poor consonant enunciation and disjointed pauses.

2.1.3.2 Spastic Dysarthria Spasticity of the laryngeal and other articulatory muscles usually occurs due to upper motor neuron disease, producing what Darley et al (1969) label as a “strained-strangled-harsh” voice quality characterised by significantly reduced resonation and disruption to prosody. It is not uncommon for the spastic condition to affect the velopharyngeal muscles thus preventing the velum from being properly sealed off from the rest of the vocal tract during speech which is not intended to manifest nasalisation. This hypernasality, or involuntary nasalisation, is often accompanied by diminished ability to vary pitch and volume (Enderby, 1980).

8 It should be noted, however, that the diagnosis of dysarthria does not apply to abnormal speech production with no apparent neuro-motor cause. Stuttering and stammering, for example, cannot be classified as forms of dysarthria since these afflictions do not appear to result from any neurological anomaly, hence the curious phenomenon whereby even the worst stutterers are able to sing or chant without difficulty the very words which they labour to produce in normal speech.

2.1.3.3 Ataxic Dysarthria Ataxic dysarthria is associated with damage to the cerebellar control circuit and is often accompanied by hypotonia (Darley, 1975); voice quality tends to be coarse and tremulous with minimal pitch and volume control. Inhalation is usually shallow while exhalation occurs in uncontrolled bursts which cause the sufferer to prematurely run out of breath even during relatively short sentences. It is not uncommon for the swallowing reflex to be adversely affected causing saliva to pool in the mouth obliging the speaker to make unscheduled pauses in mid-utterance in order to swallow. Normal lateral and alternating movement of the tongue during speech is also affected.

2.1.3.4 Flaccid Dysarthria Flaccid Dysarthria is a common product of stroke and similar vascular incidents which usually cause an asymmetrical – one side of the body more affected than the other – loss of muscular control. In terms of speech production, this asymmetry manifests itself as abnormal positioning of the lips (and consequent dribbling) during rest along with poor lip seal during speech (Enderby, 1980). This inadequate lip seal results in failure to attain the required amount of air pressure build-up necessary to produce bilabial phones such as plosives. Another characteristic of this flaccid dysarthria is the inability to properly execute rapid alternating tongue movements 9 – a condition known as dysdiadochokinesis (DDK) – resulting in poor intelligibility of words or phrases containing quick-fire repetitive sequences, e.g. “Clinical cleaning of contaminated clothes is counselled”.

9 Typically from the back to the front of the oral cavity, such as would be necessary to produce the utterance “Ka-La”.

2.1.3.5 Mixed Dysarthria Despite the all-inclusive nature that its name suggests, the various types of mixed dysarthrias seldom incorporate more than two of the major sub-groups, the most common combinations being Flaccid-Spastic and Ataxic-Spastic (Dauniloff, 1983); one of the neuromuscular disorders commonly associated with mixed dysarthria is Amyotrophic Lateral Sclerosis, commonly known as Lou Gehrig's Disease. The prevailing symptoms of both types of mixed dysarthrias tend to be those associated with spasticity, since this is present to some degree in more than seventy-five per cent of diagnosed cases; although there may be a variety of other underlying conditions, Enderby (1980) has found that the majority of mixed dysarthrias will exhibit some type of reduced maintenance of palatal and tongue elevation with some evidence of Hypernasality but not as pronounced as would obtain for pure spastic dysarthria.

2.1.4 Measurement of Dysarthric Symptoms As previously stipulated, this investigation’s primary objective is to establish a system of acoustically based objective measures for dysarthria diagnosis, with software implementations specific to the FDA. This section discusses current diagnostic metric systems, both perceptual and objective, with the intention of determining which subjective assessment systems should be supplemented or entirely replaced by objective criteria based on digital signal processing and automatic speech recognition technology.

2.2 Objectivity and Metric Systems The very notion of objectivity, in scientific terms, implies the possibility of empirically defining a given entity: it is possible, for example, to define real-world objects in terms of their age or physical dimensions and such descriptions/measurements are themselves based on reproducible phenomena. The unit of time defined as a second, for instance, corresponds to the temporal interval necessary for radiation emitted from the atomic element cesium-133 – a substance readily procurable by any appropriately licensed facility – to execute 9,192,631,770 cycles. The cesium-133 radiation emission rate represents a constant and absolute value which makes it possible to precisely demarcate the period of time corresponding to the passage of one second. This definition of the basic unit of time, adopted and ratified as part of the International System of Units (SI), is a typical example of the philosophy that the essential attributes of any objective measurement system should be the reproducibility and validity of its metric units, “validity” in this case signifying the demonstration of suitability as a measurement tool via comparison with some trusted external point of reference, commonly referred to as the “gold standard” or “ground truth”.

Establishing the validity of a reference value is less straightforward, of course, if this value is in some way dependent on the observer. A class of such observer-dependent phenomena are percepts, i.e. sensations partially or wholly defined by the subjective impressions they create and these impressions will, of course, vary to some extent from individual to individual. In the case of loudness measurement, for example, the ISO has accepted and ratified a set of percept-based reference values known as equal loudness contours. These particular acoustic reference values are based on the concept that certain pure tone energy-frequency combinations will produce identical impressions of loudness for most listeners with normal hearing 10 . As to be expected with any investigation into subjective responses to some set of stimuli, there have been significant discrepancies in the results of equal loudness experiments reported by different research groups, even though all the studies only used individuals with normal hearing. In the strictest sense, these discrepancies indicate a level of perceptual variability which would suggest that it is not possible to devise an objective measure for loudness. Indeed, as seen in Figure 2.2, substantial revisions have been made to the ISO definition of these contours which re-emphasise the fact that this ‘empirical’ metric is founded upon one of the more variable percepts.

10 See section 2.5.10.1 of this chapter.


Figure 2.2: Revision of ISO Equal Loudness Contour Line Standard (Original ISO 40-phon standard shown as blue line)

Such variability notwithstanding, a set of equal loudness contour reference values were eventually adopted as the ISO standard and they represent the most typical trends in the experimental results reported by the various studies. This possibility of defining a quasi-objective reference value based on the typical subjective reaction of some selected population is an important assumption and the principle that underpins the majority of speech metric systems proposed in this thesis. Of course, this quasi-objectivity hypothesis depends on the correct selection of features for measurement, a topic which will now be discussed in greater detail.
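One familiar by-product of the 40-phon equal loudness contour is the A-weighting curve that underlies the dBA ratings listed among the abbreviations; the sketch below evaluates that standard weighting as an example of how a percept-derived reference can nevertheless be applied as a fixed, reproducible formula. It is a generic illustration of the principle, not the loudness metric developed in chapter 6.

import math

def a_weighting_db(freq_hz):
    # IEC 61672 A-weighting in dB, a rough inverse of the 40-phon equal-loudness contour.
    f2 = float(freq_hz) ** 2
    ra = (12194.0 ** 2 * f2 ** 2) / (
        (f2 + 20.6 ** 2)
        * math.sqrt((f2 + 107.7 ** 2) * (f2 + 737.9 ** 2))
        * (f2 + 12194.0 ** 2)
    )
    return 20.0 * math.log10(ra) + 2.0   # normalised so the correction is ~0 dB at 1 kHz

for f in (125, 250, 500, 1000, 2000, 4000):
    print(f"{f:5d} Hz -> {a_weighting_db(f):+6.1f} dB")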

2.2.1 Feature Selection Criteria As posited by Dunn and Everitt (1995), any metric system should seek to quantify the phenomena of interest and only the phenomena of interest. One of the major difficulties in formulating a robust and accurate objective metric system is the implementation of filtering mechanisms to compensate for any environmental factors which could introduce some element of interference to prejudice the observation process. When attempting to measure, for example, the intensity of some sound, variable factors such as atmospheric conditions, the microphone’s distance from the sound source and the responsiveness of the microphone itself (known as the transfer function) can produce inaccurate readings due to the skewing effect of noisy data, i.e. data not originating from the sound source but erroneously interpreted (and processed) by the measuring instrument as representative of said source. Since it is not always feasible or possible to eliminate the physical presence of all variable environmental conditions or instrument configurations which might prejudice measurement computations, it is often the case that some method must be used to compensate, or normalise, for the element of bias that has been introduced.

This normalisation procedure presupposes that the proposed metric system has been subjected to some form of standardisation, the term “standardisation” in this context signifying the specification of environmental conditions which will guarantee accurate measurements. In the case of the Celsius temperature scale, for example, 0° Celsius is defined as the temperature at which pure water freezes provided the atmospheric pressure is exactly 1013.25 millibars (mb); varying the atmospheric pressure from this specified value will also vary the water’s freezing temperature. If, using this aqueous sensitivity as a reference point, temperature measurements were attempted under conditions where the barometric pressure was not 1013.25mb, it would still be possible to produce accurate temperature readings by computationally simulating standard barometric pressure using the correlation that defines the effect of barometric pressure on water’s transition from a liquid to solid state. This computational simulation of a standardised condition is an integral concept of the objective metric systems proposed in this study; it is meaningless to propose some numerical value as a rating of a speech-related task if that value has been derived from an observation where extraneous elements may have exerted effects of unknown magnitude: it is well documented, for example, that using microphones and signal processing equipment featuring different configurations to those used for recording the training data may significantly affect the accuracy of automatic speech recognition systems (Rabiner & Juang, 1993; Lippmann, 1997). Although various algorithms have been proposed, and implemented, to normalise for such equipment configuration differences, such compensatory mechanisms have not always proven entirely satisfactory. Nevertheless, it is the intention in this investigation to specify not only a series of standardised conditions for the proposed objective measures, but also to elaborate algorithms and computational models which attempt to negate the effects of certain non-standard conditions that can be expected during real-life diagnostic observations. In the context of FDA diagnosis, however, although it is feasible to normalise for possible irregularities in the aforementioned environmental conditions or recording equipment configurations, it is certainly not a trivial task to normalise for the other category of non-standard condition often encountered during a diagnostic evaluation: the possibility that the pathological phenomena elicited by the performance of a specific task has not been produced by the articulator(s) under evaluation. In the FDA task testing the ability to maintain an airtight lip seal, for example, a patient’s inability to produce a well-formed bilabial plosive might be entirely due to involuntary nasal emission rather than impaired muscular control of the lips – an issue which is discussed at greater length in a subsequent section of this study. Unfortunately, it would appear that there is no automatic robust compensatory technique capable of normalising for the adverse effect which a non-targeted malfunctioning articulator might have on the performance of the other articulator(s) being assessed during a particular procedure. It is usually the case that some form of mechanical intervention is necessary to temporarily eliminate the source of interference; for the aforementioned FDA lip seal test, an effective method of eliminating the presence of noisy (and potentially deceptive) data caused by involuntary nasal emission is by pinching the nostrils shut while the patient is performing the test.
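By analogy with the barometric-pressure example, the sketch below computationally simulates one standardised recording condition: a sound level measured at an arbitrary microphone distance is corrected back to an assumed reference distance of one metre under an idealised inverse-square (free-field) assumption. The reference distance, calibration offset and function names are illustrative; the standardisation protocols actually adopted are specified in chapters 4 to 6.

import numpy as np

REFERENCE_DISTANCE_M = 1.0   # assumed standard microphone distance (illustrative only)

def measured_spl_db(samples, calibration_offset_db=94.0):
    # Crude sound level estimate: RMS amplitude in dB plus a fixed calibration offset.
    rms = float(np.sqrt(np.mean(np.asarray(samples, dtype=float) ** 2)))
    return 20.0 * float(np.log10(max(rms, 1e-12))) + calibration_offset_db

def normalise_to_reference_distance(spl_db, actual_distance_m):
    # Simulate the standard condition: correct the measured level to the reference
    # distance, assuming idealised free-field (inverse-square) propagation.
    return spl_db + 20.0 * float(np.log10(actual_distance_m / REFERENCE_DISTANCE_M))

rng = np.random.default_rng(0)
clip = 0.05 * rng.standard_normal(16000)      # stand-in for a one-second recorded vowel
raw = measured_spl_db(clip)
print("as recorded at 0.5 m     :", round(raw, 1), "dB")
print("normalised to 1.0 m ref. :", round(normalise_to_reference_distance(raw, 0.5), 1), "dB")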

Despite this lack of automatic normalisation techniques to compensate for prejudicial data coming from non-targeted articulators, a set of standardised operating conditions is proposed to facilitate accurate measurements by the FDA signal processing algorithms. These conditions are detailed in chapters four through six. Moreover, some highly specialised speech processing algorithms – such as those which compute the intelligibility of words and phrases – will require specific standardised conditions and data normalisation protocols. Such requirements are detailed in chapter 7, which discusses the implementation of these metric systems. At this juncture, however, it is more appropriate to consider the current measurement system employed by the FDA and RDP, this being the subject of the section that follows.

2.3 Grading Scales used in Dysarthria Assessment Both of the major diagnostic testing procedures employ some type of interval scale which correlates a patient’s performance in the evaluation tasks with the intensity of symptoms manifested; since the purpose of the assessment is to determine the quality and severity of any abnormality present, the grading is biased towards pathological categorisation, i.e. it is usually the case that the highest grade is reserved to indicate no evidence of impairment while the other grades are intended to show not merely the presence of pathological conditions but to categorise said phenomena with some precision. The RDP uses a five point grade scale with explicit descriptive tags, ranging from “none” (signifying that no response has been elicited or is so negligible as to not merit evaluation) to “normal”. It is to be noted that the RDP scale’s second highest grade would appear to be a misnomer since its descriptive label, “good”, is used to classify performances which rank below that of “normal”, the highest grade 11 . An example of this apparently contradictory ranking is presented below (as applied to one of the RDP’s respiration tasks where the objective is to maintain frication – /s/ – for as long as possible):

20 – 30 seconds = Normal
15 – 19 seconds = Good
10 – 14 seconds = Fair
1 – 9 seconds = Poor
(Robertson, 1982)

Such misleading descriptions notwithstanding, the RDP’s success as a diagnostic instrument is well documented (Wit et al, 1993; Snowden, 1995) and, after completing the test series, it provides the clinician with a visually distinctive vertical bar chart type profile which facilitates diagnosis.
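Because the RDP bands above are expressed as explicit durations, the mapping from an objective measurement to a descriptive grade is trivial to automate, as the sketch below shows. The handling of values outside the published bands (over 30 seconds, or under one second) is an assumption rather than part of the published scale.

def rdp_frication_grade(duration_s):
    # Map a sustained /s/ duration onto the RDP descriptive grades quoted above
    # (Robertson, 1982); durations below one second fall outside the published bands.
    if duration_s >= 20:
        return "Normal"
    if duration_s >= 15:
        return "Good"
    if duration_s >= 10:
        return "Fair"
    if duration_s >= 1:
        return "Poor"
    return "None"

for seconds in (25, 17, 12, 4, 0.5):
    print(f"{seconds:>4} s -> {rdp_frication_grade(seconds)}")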

The Frenchay Dysarthria Assessment (Enderby, 1983) is also a profile-type test that provides the assessor with a distinctive visual representation of the data which can be compared to similar visualisations of pre-assessed and graded data (i.e. templates) so that the patient’s condition can be properly categorised. In its current form, the test is composed of eight assessment categories – Reflexes, Respiration, Lips, Jaw, Palate, Larynx, Tongue and Intelligibility – which are themselves subdivided into 28 specific testing activities that are presented in Figure 2.3.

The underpinning diagnostic principle for both the FDA and the RDP is the concept that, for each of the evaluation tasks, there is some specific identifiable threshold value, or normalcy threshold (NT), which consistently differentiates normal from abnormal speech data; in the case of the FDA pitch modulation task, for example, it is assumed that normal speakers can produce a sequence of utterances featuring at least six variations of pitch, with each consecutive variation augmenting in frequency by at least a half-tone. This assumption that the ability to produce six half-tone pitch increments constitutes a definitive demarcation between non-dysarthric and dysarthric speakers implicitly presupposes that the individual being assessed suffers from no other disorders – such as tone deafness – which could also be at least partially responsible for such an inability. This presumption of the absence of mitigating conditions constitutes, arguably, a significant weakness in the evaluation methodology of both the RDP and FDA: neither of these test series contains any comprehensive instructions detailing how to re-adjust or normalise the threshold value to account for the possible presence and effects of debilitating but non-dysarthric conditions such as the aging process. The FDA makes some attempt to provide general guidelines for age-based threshold normalisation by instructing the test administrator to award the highest grade if the patient’s performance in any given test is considered “normal for age” (Enderby, 1983); however, no advice is proffered concerning the expected performance norms for various age categories (as would be required in other medical diagnostic areas such as the detection of obesity or hypertension).
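For reference, the half-tone steps implied by this normalcy threshold correspond to fixed frequency ratios. Assuming the standard equal-tempered definition of a semitone (the FDA itself does not specify one), the k-th target pitch in the sequence is

f_k = f_0 \cdot 2^{k/12}, \qquad k = 1, 2, \ldots, 6

so that a speaker starting from, say, f_0 = 120 Hz would need to reach approximately 120 \times 2^{6/12} \approx 170 Hz by the sixth increment.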

Figure 2.3: Frenchay Dysarthria Assessment Profile Scoring Template (FDA subtests)

This ambiguity in terms of symptom assessment guidelines is even more evident in the case of evaluation procedures for certain non-articulatory behaviours – collectively referred to as “Influencing Factors” – which, if they indicate unusual stress on the patient’s part, could exert so substantial a skewing effect as to invalidate any associated FDA test score. Although it is clear that factors such as a patient’s emotional state during the FDA interview are important, it is not made explicit how these influencing factors should alter the various individual test scores or the overall diagnostic outcome. It is not evident, for example, how a patient’s mood during an FDA assessment exercise should affect any score he or she receives for Tasks 4 and 5 (Respiration at Rest and Respiration in Speech) or any of the other tasks. The actual FDA guidelines on this matter are somewhat cryptic:

“Comment on whether the patient has insight into the difficulty, whether the patient is cooperative, motivated, and what the emotional state is.” (Enderby, 1983)

Analysing the patient’s attitude in terms of willingness to co-operate and level of motivation does have diagnostic importance, but the lack of explicit guidelines concerning the assessment thereof would appear to be an oversight which should be addressed. Given that the diagnostic role of these Influencing Factor assessments is so vaguely defined, no attempt has been made during this investigation to model their effect upon the FDA’s mechanisms for formulating an overall diagnostic hypothesis; instead, attention shall now be focused on the FDA’s grading system and its role in formulating an overall diagnosis.

Each individual column has nine horizontal lines corresponding to nine letter grades: “A”, “B+”, “B”, etc. to “E”. Assigning a score to a particular sub-task requires the evaluator to shade the area up to the height of the horizontal line corresponding to the grade. Table 2.1 illustrates the shading pattern resulting from a scoring of “B+”, “B” and “D+” assigned to the Cough, Swallow and Dribble tests respectively (Reflex group) 12 .

12. The thick horizontal lines at the top of the shaded areas demonstrate that the examiner is following good practice by making preliminary pencil marks before finally shading in the area after coming to a definitive decision.


Table 2.1: Bar-chart profile for the Reflex subtasks (1. Cough, 2. Swallow, 3. Dribble/Drool), with each column shaded up to its grade on the A-to-E scale.

Filling out the entire series of columns produces a de facto bar chart, such as those in Figures 2.5 and 2.6, that represents a dysarthria sub-type profile. This profile may then be used to make a diagnosis specifying the dysarthria category and stage of development (i.e. mild, moderate or severe) via a visual comparison with the various FDA templates for the five dysarthria types. A definitive identification is then made manually, based on the goodness of fit of the profile to a particular template or template grouping. The FDA’s grading scale, with nine points of differentiation, has twice the precision of that of the RDP (Robertson, 1982) and uses traditional academic-style letter grades with the following general criteria:

A = Normal for age
B = Mild abnormality noticeable to a skilled observer
C = Abnormality obvious but can perform task/movements with reasonable approximation
D = Some production of task but poor in quality, unable to sustain, inaccurate or extremely laboured
E = Unable to undertake task/movement/sound

The four remaining points on the FDA scale are composed of interval grades (“B+”, “C+”, “D+” and “E+”) which designate a performance better than the grade immediately below (i.e. the letter grade without the “+” augmenter) but not sufficiently competent as to merit the grade above. In fact, the term “interval grade” is misleading when applied to these “+” grades because the designers of the FDA have associated all the grades with nominally equidistant numerical values on an integer scale, with the “E” grade mapped to a value of one at the lower end and the “A” grade mapped to a value of nine at the upper end 13. Although this letter-to-integer mapping does not strictly adhere to mathematical principles which would permit statistical calculations to be performed on such data, such calculations have been performed and appear to yield valid results, an anomaly which deserves more detailed discussion.
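A minimal sketch of this nominal mapping is given below; the values are those listed in footnote 13, and the conversion of a profile into an integer vector is included purely to illustrate the kind of data on which the statistical operations of section 2.4 are performed.

# Nominal letter-to-integer mapping adopted by the FDA designers (see footnote 13).
FDA_GRADE_VALUES = {
    "A": 9, "B+": 8, "B": 7, "C+": 6, "C": 5,
    "D+": 4, "D": 3, "E+": 2, "E": 1,
}

def profile_to_vector(profile):
    """Convert a {subtask: letter grade} profile into the integer vector
    on which statistical operations are subsequently performed."""
    return [FDA_GRADE_VALUES[grade] for grade in profile.values()]

print(profile_to_vector({"Cough": "B+", "Swallow": "B", "Dribble": "D+"}))   # [8, 7, 4]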

2.4 Validity of Statistical Operations on FDA Diagnostic Data

Since the FDA employs a grading scale which is not absolute 14, it may be argued that data representing individual FDA scores – such as those which comprise an FDA bar-chart profile – cannot support such mathematical operations as multiplication, division, subtraction or addition. It would therefore be inappropriate to attempt to derive from such data any statistical information which presumes the use of an absolute scale. Despite this apparent ineligibility, Enderby (1986), in an attempt to confirm the validity of the FDA procedure as a reliable and robust diagnostic tool, used linear discriminant analysis (LDA) on a corpus of FDA sub-task scores originating from 107 dysarthric individuals (see Appendix A), eighty-five of whom had previously undergone a conventional diagnostic evaluation and were classified as follows:

• 30 Spastic
• 10 Flaccid
• 14 Ataxic
• 13 Mixed
• 18 Extrapyramidal
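The sketch below illustrates the type of analysis involved, using scikit-learn’s LinearDiscriminantAnalysis on randomly generated stand-in profiles (85 vectors of 28 subtask scores on the 1–9 integer scale, with the class sizes listed above); it does not use, and makes no claim about, Enderby’s actual corpus or results.

import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(0)

# Stand-in data only: 85 profiles of 28 subtask scores on the 1-9 integer scale,
# with arbitrary per-class means. This is NOT Enderby's corpus.
counts = {"Spastic": 30, "Flaccid": 10, "Ataxic": 14, "Mixed": 13, "Extrapyramidal": 18}
X, y = [], []
for label, n in counts.items():
    centre = rng.uniform(3, 7, size=28)
    X.append(np.clip(np.round(centre + rng.normal(0, 1.5, size=(n, 28))), 1, 9))
    y += [label] * n
X = np.vstack(X)

lda = LinearDiscriminantAnalysis()
lda.fit(X, y)
print(f"training-set accuracy: {lda.score(X, y):.3f}")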

Although it may be argued that Enderby’s analysis is flawed since there was no explicit modelling of a normal speaker group 15 in her LDA category definitions, her statistical method nevertheless recorded a classification accuracy of 90.6% on the training data corpus 16. Although the evident success of performing statistical operations – such as discriminant analysis – on a non-scalar data set may be attributed to the fact that these operations apply solely to frequency distribution trends observed within the data (and are thus not influenced by the actual values the data points represent), it is also undeniable that there is some form of non-linear relationship between the FDA grades: an “A” grade performance, for example, may not be nine times superior to an “E” grade (as their corresponding numerical FDA scores of “9” and “1” would suggest), but it is also clear that these two scores are eight units apart in the FDA grade interval spacing. This researcher proposes that the concept of such an interval spacing can support statistical and arithmetical operations which will derive valid results from treating FDA scores as quantifiable values rather than mere labels. This hypothesis is discussed in greater detail, and demonstrated to be valid, in a subsequent section of this thesis. At this juncture, however, a review of conventional measurement systems and methodologies for the various FDA sub-task groups is of more immediate interest.

13. The full letter-to-integer mapping is as follows: “A” = 9, “B+” = 8, “B” = 7, “C+” = 6, “C” = 5, “D+” = 4, “D” = 3, “E+” = 2, “E” = 1.
14. As Fenton (1979) reminds us, values on an absolute scale should be ordered, equally spaced by a fixed unit and have a zero value. In the FDA grading scale, an “E” grade does not always represent a value of zero and – for example – a performance meriting an “A” grade is not necessarily twice as good in some objectively measurable way as a “C” grade performance (although in some instances this does hold true).
15. The three fundamental prerequisites for LDA analysis are that: (i) the values for the discriminatory features should be represented as numerical values; (ii) these integer representations should be part of a scale with a clearly defined maximum and minimum value; and (iii) all eligible groups should be modelled. For robust categorisation, it is important to allow for the possibility of non-pathological cases, i.e. that the incoming data may represent a normal speaker or, at least, one whose abnormality is not due to any variety of the suspected disease. Enderby’s analysis did not include ‘normalcy’ models.
16. It was not possible to report any meaningful results on the pattern recognition system’s performance on the test data (which would comprise FDA bar-chart profiles from 22 of the 107 patients) since said test data had not been reviewed by a third party to confirm the initial diagnostic findings.

2.5 Task-Specific Measurement Methodologies

The twenty-eight FDA tests incorporate a range of evaluation methodologies which include audio-visual and tactile inspection. Sections 2.5.1 to 2.5.12 review the current metric systems for these various FDA tests, comparing them – where appropriate – with the evaluation strategies of the other major paper-based dysarthria assessment procedure, the Robertson Dysarthria Profile (RDP) (Robertson, 1982).

2.5.1 Measurement of Articulator Reflex Actions

The FDA’s evaluation of the function of the upper respiratory tract reflex mechanisms is the more thorough of the two major dysarthria diagnostic procedures 17. The first three of the FDA’s test series are devoted to assessing the reflex actions of coughing and swallowing, which serve to clear the respiratory passages of excess saliva and irritants that can – of course – adversely affect speech production. Essentially, these tests are not suitable for the implementation of any acoustically based objective metric system since they rely primarily upon visual assessment and/or require information gathered over an extended period of time under conditions where recording the test subject’s performance would often be inappropriate. This ineligibility for objective measurement is evident when one examines, for example, the assessment criteria for the Cough Reflex task:

Grade A: No difficulty.
Grade B: Occasional difficulty with choking, or food sometimes going down the wrong way – states that some care must be taken.
Grade C: Patient has to take particular care, chokes once or twice during the day. May have difficulty clearing phlegm from throat.
Grade D: Patient chokes frequently on food or drink or faces danger of inhaling substances. Chokes at times other than mealtimes, such as on saliva.
Grade E: No cough reflex. Patient on nasogastric tube and/or continual choking on food, drink, and saliva.

For the awarding of a “B” grade for the above task, the examining clinician is explicitly instructed to arrive at a diagnostic assessment based on the opinion of the test subject, who must confirm that “some care must be taken” when swallowing food.

Given these rather empirically amorphous guidelines for this and the two other articulator reflex action tests, no attempt has been made to implement an acoustically based objective metric system for these assessment procedures; fortunately, the same is not true of the respiratory function tests, whose measurement is discussed in the section that follows.

17. The revised edition of the RDP no longer includes any specific tests to measure these reflex actions.


2.5.2 Measurement of Respiratory Functions

The respiratory system’s contribution is integral to every stage of the speech production cycle; the evaluation criteria for this articulatory group therefore constitute one of the core components of any dysarthria diagnostic tool. Experts concur (Enderby, 1980; Robertson, 1982; Colton & Casper, 1990; Brookshire, 1992) that the two critical respiratory function components to be measured are respiratory flow and respiratory pressure, the regulation of these at both the conscious and subconscious level being essential to the production of fluent and well-inflected speech.

The RDP includes five tests to evaluate respiratory function, four of them measuring the ability to control air flow to the vocal tract (by requiring the test subject to produce /s/ with varying levels of intensity, frequency and duration), while the fifth exercise assesses competence in regulating breathing patterns to permit fluent normal speech. The FDA devotes two of its 28 tests – referred to as the Respiration at Rest and the Respiration in Speech tests – to respiratory function evaluation: the first of these is for air-flow control measurement in a non-speech context (i.e. audible exhalation through the mouth without any constriction of the vocal tract), and the other gauges breath control competence for a precisely defined speech task, namely reciting the digit sequence from one to twenty.

For the Respiration at Rest test, the NT is a minimum of five seconds of smooth exhalation – “smooth” in this context signifying the expulsion of air at a constant rate, which would indicate good control of the diaphragm. This five-second sustained respiratory pressure NT is widely accepted as the minimum required for most speech tasks (Brookshire, 1992). However, the FDA’s method of measuring this is not as precise as that of Netsell and Hixon (1978), which incorporates the use of a manometer to measure breath pressure as evidenced by the test subject’s ability to displace a column of water (of pre-determined volume) by five centimetres for at least five seconds (see Figure 2.4). The manometer method would appear to offer the most objective means of measuring respiratory pressure since specific pounds-per-square-inch (p.s.i.) values can be calculated for any given test performance. The possibility of calculating breath pressure p.s.i. values by analysing differentials in spectral energy values will be discussed in a subsequent chapter of this study.
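By way of illustration, the sketch below estimates two quantities relevant to the Respiration at Rest task from a recorded exhalation: how long the airflow remains audible above a crude noise floor, and how evenly the energy envelope is maintained. The frame length, the noise-floor rule and the smoothness index are all assumptions made for this sketch; the measures actually implemented are described in chapter 4.

import numpy as np

def exhalation_stats(samples, fs, frame_ms=25):
    """Return (duration_s, smoothness) for a recorded exhalation.
    Duration: time spent above a crude noise floor; smoothness: 1/(1 + coefficient
    of variation of the active frame energies), so values near 1.0 indicate a
    very even airflow."""
    frame_len = int(fs * frame_ms / 1000)
    n_frames = len(samples) // frame_len
    frames = samples[: n_frames * frame_len].reshape(n_frames, frame_len)
    energy = np.sqrt(np.mean(frames ** 2, axis=1))        # RMS energy per frame
    active = energy > 0.1 * np.max(energy)                # crude noise floor
    if not active.any():
        return 0.0, 0.0
    duration = active.sum() * frame_ms / 1000.0
    cv = np.std(energy[active]) / (np.mean(energy[active]) + 1e-9)
    return duration, 1.0 / (1.0 + cv)

# The FDA normalcy threshold for this task is five seconds of smooth exhalation.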


Figure 2.4: Manometer for Respiratory Pressure Measurement (from Brookshire, 1992)

The direct measurement and verification of adequate respiratory pressure levels during speech is a difficult challenge due to the highly dynamic nature of speech itself. The build-up and release of air pressure within the vocal tract is dependent upon the combination of phones which compose the specific utterance the speaker is attempting to produce. Even if it is known beforehand what the speaker intends to say, it would not be a trivial matter to determine what respiratory pressure readings would be normal or abnormal at any given moment during an actual speech task.

Given the complexity of implementing any form of direct measurement system, it is not surprising, therefore, that virtually all dysarthria diagnostic procedures adopt a more intuitive method of assessing competence in maintaining adequate respiratory pressure during speech: the observation of the patient performing some prescribed task which will probably result in disjointed and/or breathy speech if adequate breath pressure cannot be maintained. Both the FDA and RDP incorporate such respiration-in-speech evaluation exercises; however, the RDP’s criteria are somewhat vague, with the assessor simply being instructed to make a completely subjective judgement concerning how well the speaker manages to coordinate phases of inhalation and exhalation during speech such that fluency is maintained. The equivalent FDA test incorporates more objective criteria, with both the speech task – counting from one to twenty – and the normalcy threshold (performing the task without drawing breath after counting has commenced) being well defined. There is, however, an apparent contradiction between the instructions given to the test subject and the marking scheme. The test administrator is directed to “ask the patient to count to 20 as quickly as possible on one breath” but is then instructed to award a ‘C’ grade if the “patient has to speak quickly because of poor respiratory control”. The specific grading criteria are as follows:

Grade A: No abnormality, no breaths taken during execution of task (i.e. no breaths after the initial one before the start of counting).
Grade B: Very occasional breaks in fluency due to poor respiratory control. The patient may state that he or she is conscious of having to stop to take in a deep breath on occasions. An extra breath may be required to complete the task.
Grade C: Patient has to speak quickly because of poor respiratory control; voice may fade. Patient may require up to four breaths to complete the task.
Grade D: Patient speaks on inhalation as well as exhalation, or breath is so shallow that only a few words are managed. Poor coordination and marked variability. Patient may require seven breaths to complete the task.
Grade E: Speech grossly distorted by lack of control over respiration – may only manage one word on each breath (i.e. up to 20 breaths).

Although the intention of the FDA authors in this instance may have been to indicate that the speaker being evaluated should only be penalised if rapid speech resulted from poor breath control, it is very possible that the speaker might produce slurred and indistinct speech due to an overzealous interpretation of the instructions rather than as a result of any incompetence in respiratory function. It would seem advisable that this instruction to the test subject be rephrased to indicate that the speech task should be executed in such a fashion that it does not defeat the underlying purpose, which is to determine whether or not the test subject has sufficient lung capacity, as evidenced by maintaining adequate respiratory pressure during speech for at least six seconds without drawing breath 18.

18. It is debatable, however, whether the task itself sets too ambitious a target even for normal speakers: assuming a typical word rate of 140 to 150 words per minute, the recital of twenty digits should take approximately eight seconds, three seconds more than the usual normalcy threshold of five seconds of sustained speech without drawing breath.
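As an indication of how the breath counts referred to in these criteria might be obtained automatically, the sketch below counts silent gaps long enough to be plausible in-breaths in a recording of the counting task; the pause length, frame size and energy threshold are illustrative assumptions only.

import numpy as np

def count_breath_pauses(samples, fs, min_pause_ms=300, frame_ms=20):
    """Count silent gaps long enough to be plausible in-breaths in a recording
    of the one-to-twenty counting task."""
    frame_len = int(fs * frame_ms / 1000)
    n = len(samples) // frame_len
    energy = np.sqrt(np.mean(samples[: n * frame_len].reshape(n, frame_len) ** 2, axis=1))
    silent = energy < 0.05 * np.max(energy)
    pauses, run = 0, 0
    for s in silent:
        run = run + 1 if s else 0
        if run == int(min_pause_ms / frame_ms):   # gap has just reached the minimum length
            pauses += 1
    return pauses

# Against the criteria quoted above: no extra breaths -> A, one extra breath -> B,
# up to four breaths -> C, around seven -> D, one breath per word (up to 20) -> E.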

Another significant shortcoming of this testing and measurement methodology is the possibility that a sub-normal performance may be the result of other factors – such as velopharyngeal incompetence – which are not directly attributable to poor breath control but which are, nonetheless, not uncommon manifestations among dysarthric individuals. The authors of the FDA recognise this shortcoming and advise the test administrator to adopt appropriate countermeasures:

“As people with velopharyngeal incompetence may be mistaken for patients with poor respiratory control, you could ask the patient to hold his or her nose to discriminate between the two.” (Enderby, 1983)

Such compensatory techniques, however, do not entirely eliminate the underlying fallibility of this testing technique: breathy speech, or speech rendered disjointed by inappropriate pauses for inhalation and/or exhalation, does not necessarily indicate poor respiratory control. This method continues to be used, however, since a better practical alternative is yet to be proposed.

Figure 2.5: FDA Profile Typical of Moderate Ataxic Dysarthria


The other major assessment criterion for this test is the constancy of phonation strength and intensity. It is expected that the speaker’s voice should not substantially decrease in loudness or become noticeably more breathy towards the end of the utterance. Although precise measurements of phonation strength can be obtained via the use of a laryngograph, the prohibitive cost of this device has precluded its widespread use as a diagnostic tool in this context, and perceptual assessment thus remains the preferred method. The major anomaly here, however, is not the measurement technique but the weighting of it: the FDA evaluation criteria for this task are somewhat vague regarding the importance of maintaining phonation loudness when awarding a definitive grade. What severity of penalty should be incurred if voice fade is the only observable abnormality? If, for example, one were to compare two performances, each featuring only one area of incompetence – one in which phonation clarity is maintained throughout the recital but the patient is obliged to draw three breaths, while in the other the test subject completes the task comfortably without drawing breath but suffers much-reduced voice strength – which performance is inferior? If considered in isolation from each other, is phonation strength accorded the same importance as breath control for this particular task? The FDA in its current format does not unambiguously address this issue. The objective metric system proposed for this respiration-in-speech task, the implementation of which is discussed in section 4.2 of chapter 4, seeks to resolve such ambiguities.
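A crude acoustic index of voice fade – one possible way of making the loudness criterion explicit – is sketched below; it simply compares the RMS level of the opening and closing thirds of the utterance and is not the metric developed in section 4.2.

import numpy as np

def voice_fade_db(samples):
    """Drop in level (dB) between the first and final thirds of an utterance;
    positive values indicate that the voice has faded."""
    third = len(samples) // 3
    rms = lambda x: np.sqrt(np.mean(x ** 2)) + 1e-12
    return 20.0 * np.log10(rms(samples[:third]) / rms(samples[-third:]))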


Figure 2.6: FDA Profiles for the other FDA Dysarthria Sub-groups (Extrapyramidal, Flaccid, Spastic and Mixed)

2.5.3 Evaluating Labial, Lingual and Palatal Functions

Acoustic analysis has proven an inexact science in verifying any specific physical behaviour of the articulator groups during speech; for those articulators which are not easily visible, the only failsafe method of verifying their in-speech movements is via the use of intrusive and/or uncomfortable imaging devices – such as Magnetic Resonance Imaging (MRI) (see Figure 2.8) and Computed Tomography (CT) scanners – which by their very presence and/or physical requirements may alter articulatory behaviour 19. The electropalatograph, for example, completely prevents the tongue from making any physical contact with the palate once the device is inserted (Figure 2.7), resulting in some measure of sensory deprivation and restricted movement which are hardly conducive to normal speech production.

Figure 2.7: Electropalatograph and its Insertion into Oral Cavity (from Engwall, 2002)

Figure 2.8: MRI Images (Side Elevation) of Oral Cavity during Speech – /a/ in “Matt”, /ɪ/ in “Vit” (from Engwall, 2002)

19. MRI scanners, for example, require the subject to lie prostrate in a narrow metal tunnel of electromagnets that, when in full operation, can be so noisy as to preclude the speaker from actually hearing his/her own words. In Baer’s 1991 study, test subjects had to produce a sustained vowel for three and a half minutes (with brief pauses to draw breath) to allow the MRI scanner sufficient time to complete the imaging process.


Given the overall objective of this research – the formulation of objective acoustic measures for improved diagnosis – none of these devices which impose such abnormal constraints on articulatory movement is acceptable, since their very presence could worsen the dysarthric symptoms and thus prejudice the diagnostic outcome; no further mention, therefore, will be made of them in this section devoted to the discussion of current measurement methodologies. It is well documented that articulatory movement and placement for a given speech sound do vary to some degree from individual to individual; indeed, some with specialist training – such as singers and ventriloquists – can deliberately manipulate their articulators in an atypical fashion and yet produce normal speech. It is usually the case, therefore, that assessment procedures analysing articulatory function – particularly for the more visible speech organs such as the lips – will incorporate some form of visual or tactile inspection to confirm that the test subject’s articulators are actually executing the expected movements and/or placements.

Due to their close physical proximity and inter-dependence, the review of speech measurement techniques in this section will consider three articulators together: the tongue, lips and palate. Treating these articulators as a tripartite unit is all the more practical given that only three of the FDA’s fourteen labial/lingual/palatal function evaluation tasks are suitable for the introduction and implementation of speech technology-based objective measures; the other eleven do not solicit responses that necessarily require the production of any speech sounds. Given the scope and emphasis of this thesis, the discussion of various metric systems for evaluating this labial/lingual/palatal articulatory group will therefore focus on those assessment procedures which do offer substantial scope for objective acoustic analysis, namely (i) evaluation of the ability to produce lip seal, (ii) evaluation of the ability to execute rapid articulator movements in a co-ordinated fashion, and (iii) testing for any evidence of hypernasality.

In terms of suitability for these particular articulatory assessment tasks, automatic articulatory feature (AF) analysis has evolved as a speech science sub-discipline which specifically attempts to monitor and classify in-speech articulatory behaviour as opposed to performing automatic recognition of word or sub-word units (Kirchhoff, 1999; King and Taylor, 2000). The implication here, of course, is that AF systems are, ideally, both language and speaker independent: an ASR system will not perform correctly if the speech signal it is expected to decode is not of the same language as the speech data it has been trained on; AF analytical techniques, however, are potentially able to identify specific in-speech articulatory movements without any a priori knowledge of the speaker’s language or identity. For such articulatory categorisation, AF systems generally adopt well-established phonology taxonomies such as that proposed by Ladefoged (1982), which is reproduced below:

Table 2.2: Articulatory Feature Classification (from Ladefoged, 1982)

Articulatory Feature    Values
Manner                  approximant, retroflex, fricative, nasal, stop, vowel, silence
Place                   bilabial, labiodental, dental, alveolar, velar, nil, silence
Voicing                 voiced, unvoiced
High-Low                high, mid, low, nil, silence
Front-Back              front, central, back, nil
Round                   rounded, unrounded, nil

For the specific articulatory behaviours under review in this section, the principal categories of interest are place and manner of articulation, specifically relating to lip and tongue movements. It is important, of course, that any AF classifier incorporated into any computerised version of the FDA be particularly sensitive to detecting certain minimal differences in articulatory behaviour – such as the start of voicing – which may result in the production of phone combinations that are often mutually confused, e.g. /t/ and /d/. In the FDA’s lip seal evaluation task, for example, the test subject is required to execute an airtight bilabial lip seal followed by a plosive release of air, an articulatory behaviour which usually produces the unvoiced phone /p/. It is not unlikely that speakers suffering from flaccid dysarthria might produce instead the voiced bilabial /b/. There is also the possibility, as will be discussed in a subsequent chapter, that the test subject may attempt to produce /p/ in a labiodental fashion (i.e. by pressing the lower lip against the upper front teeth to achieve the necessary airtight seal). Ideally, any CFDA articulatory feature classifier should be capable of discerning such aberrant articulatory behaviour, particularly since such aberrations may indicate the presence of a specific incompetence or pathological condition.


In terms of state-of-the-art performance and accuracy, the best-performing AF classifiers have proven to be those based on multilayer perceptron (MLP) (Bishop, 1996) and support vector machine (SVM) technologies (Vapnik, 1999). For binary decision tasks – e.g. detecting voice onset – Juneja (2004) reports between 79 and 95 per cent accuracy using an SVM classifier. King and Taylor (1999) record similar performances using artificial neural network (ANN) systems. Scharenborg and Wan (2006) attempt a more sophisticated approach, opting for multi-level 20 as opposed to Juneja’s binary AF classification. This multi-level technique is a more attractive AF solution in the context of dysarthria diagnosis since it enables the classifier to examine the incoming speech data holistically and possibly notice articulatory irregularities other than the ones which are specifically being screened for. The following sections will include some discussion concerning the possibility of using multi-level AF classification for the three tasks under consideration. The first of these tasks – the evaluation of lip seal – is the subject of the following section.
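To make the classifier technology concrete, the sketch below trains a small MLP on placeholder frame-level feature vectors for a single binary articulatory-feature decision (voiced versus unvoiced). The data, feature dimensionality and network size are assumptions made purely for illustration; they do not reproduce the configurations used in any of the studies cited above.

import numpy as np
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(1)
# Placeholder data: 13-dimensional frame-level feature vectors (e.g. MFCCs),
# each labelled voiced (1) or unvoiced (0) by an arbitrary separable rule.
X = rng.normal(size=(2000, 13))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)

clf = MLPClassifier(hidden_layer_sizes=(32,), max_iter=500, random_state=1)
clf.fit(X[:1500], y[:1500])
print(f"held-out voicing accuracy: {clf.score(X[1500:], y[1500:]):.2f}")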

2.5.4 Evaluation of Lip Seal

The ability to maintain a hermetic lip seal when air pressure within the oral cavity exceeds that of the surrounding environment is critical to the production of bilabial plosives; moreover, from a diagnostic perspective, consistent and chronic failure to achieve an airtight lip seal often indicates some degree of generalised facial muscle weakness normally associated with flaccid dysarthria. Given this importance of lip seal for speech production generally, and for dysarthria diagnosis in particular, it is surprising that the FDA is the only major dysarthria diagnostic tool that dedicates one of its evaluation exercises specifically to assessing competence in this area 21. In rehabilitative dysarthria care, however, clinicians routinely employ several methods to monitor and measure a patient’s competence in achieving lip seal, and such techniques may also serve diagnostic purposes. The most common of these techniques is an audio-visual inspection of the lips while the test subject is attempting to achieve and maintain an airtight seal. An incomplete seal will result in inter-labial gaps which are easily detectable visually; the patient’s effort to increase intraoral air pressure may be confirmed either auditorily – by the sound of air escaping through the lips – or via some pressure-sensing device placed directly in front of the mouth to measure the force of air expelled during the plosion.

20. In multi-level classification, the classifier attempts to identify multiple articulatory features for incoming speech data during a single iteration; such classifiers are thus capable of verifying – for example – that a given speech sound is labiodental, voiced and nasalised. Such a tripartite categorisation would be beyond the scope of a binary classifier.
21. In the RDP evaluation, assessing competence in maintaining lip seal is but a sub-component of one of the rapid articulator movement (diadochokinetic) tests, specifically the one requiring the patient to repeat the phone sequence “/p t k/” rapidly.

Assessing lip seal using a combination of audio-visual and pressure-sensing methods offers yet another advantage: it reduces the possibility of erroneously diagnosing lip seal impairment in certain situations where involuntary nasal emission – as opposed to any lack of labial muscular control – is the main pathological factor responsible for the patient’s inability to sustain elevated intraoral pressure levels. Enderby (1983) recommends certain precautions when evaluating lip seal competence and has incorporated these recommendations in the relevant FDA sub-test, as detailed below:

Task 1: Ask patient to blow air into cheeks and maintain for 15 seconds. Demonstrate and note second attempt. Note time and whether any air leaks from lips – do not penalize for nasal emission in this section. Therapist should pinch the patient's nose between thumb and forefinger if there is nasal emission.

Task 2: Ask the patient to say /p/ /p/ crisply and clearly 10 times. Demonstrate and encourage the patient to exaggerate the plosion. Note the second attempt and observe the consistency of seal for the plosion /p/.

Although it is usually the case that only one of the two abovementioned tasks is administered during any given instance of an FDA diagnostic session, it is occasionally useful to have the test subject perform both of them in order to observe the significantly different articulatory behaviour, and areas of competence, evoked by the two tests 22. The assessment criteria are detailed below:

22. The first test emphasises the maintenance of lip seal under conditions where intraoral pressure far exceeds the levels which typically obtain during normal speech. Requiring the patient to maintain such intense pressure within the oral cavity is useful since any lip seal inadequacies are starkly revealed by virtue of the exertion required. The second test, although not as physically demanding, obliges the patient to make rapid alternating labial movements and thus evaluates bilabial seal in conditions more characteristic of normal speech, where the necessity to produce a series of stop consonants in quick succession is not uncommon.


Grade A: Good lip seal. Retains pressure for 15 seconds or repeats /p/ /p/ with even seal.
Grade B: Occasional air leakage, break in lip seal, lip seal not consistent for plosion on each sound.
Grade C: Patient able to retain pressure for 7 to 10 seconds. Lip seal observed on sound, but auditorily weak.
Grade D: Very poor lip seal, pressure lost from one segment of lips. Patient able to attempt closure but unable to maintain. Not auditorily represented.
Grade E: Patient unable to maintain any pressure. Patient unable to visually or auditorily represent sound.

The abovementioned criteria imply that the intensity and duration of the post-seal air expulsion are important indicators of the quality of the attempted lip seal. These indicators are, however, secondary in importance to verifying that the sound produced is indeed a bilabial plosive, a task which, if attempted solely by acoustic analysis, is by no means trivial. Scharenborg and Wan’s MLP implementation appears to boast the highest rates for correct classification of unvoiced bilabial plosives, as shown in Table 2.3.

Table 2.3: Scharenborg and Wan’s (2006) MLP AF Classification

Articulatory Feature Value    Classification Accuracy (%)
Bilabial                      68.3
Labiodental                   67.4
Dental                        19.7
Alveolar                      78.3
Velar                         63.1
Voiced                        93.8
Unvoiced                      89.8

It is to be noted that, based on the accuracy rates of the two relevant articulatory features (i.e. bilabial and unvoiced), the probability of Scharenborg and Wan’s system correctly classifying a phone as both bilabial and voiceless is approximately 0.61 (0.683 × 0.898 ≈ 0.613, assuming the two classification decisions are independent). Given the context of the FDA task, it may be possible to improve on this performance by considering contextual information in the acoustic signal: since the specific articulatory behaviour solicited by this FDA task is the execution of an airtight lip seal, the detection of premature expulsions of air may – in this context – serve as a reliable articulatory cue indicating a pathological condition. The exploitation of such features as indicators of articulatory behaviour is discussed in greater detail in chapter 5. For the moment, however, the evaluation of diadochokinesis is of more immediate interest.
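A small illustration of how such premature expulsions of air might be flagged acoustically is given below: frames whose energy rises above an assumed noise floor during the supposedly silent cheek-puff hold (Task 1 of the lip seal test) are counted. The thresholds, and the assumption that the quietest frames represent room noise, are illustrative only; the detector actually developed is described in chapter 5.

import numpy as np

def leakage_frames(samples, fs, frame_ms=20):
    """Count frames during the cheek-puff hold whose energy exceeds an assumed
    noise floor, i.e. audible escape of air through the lips or nose."""
    frame_len = int(fs * frame_ms / 1000)
    n = len(samples) // frame_len
    energy = np.sqrt(np.mean(samples[: n * frame_len].reshape(n, frame_len) ** 2, axis=1))
    noise_floor = np.percentile(energy, 20)   # assume the quietest frames are room noise
    return int(np.sum(energy > 3.0 * noise_floor))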

2.5.5 Evaluation of Diadochokinesis

Diadochokinesis (DDK), defined in this context as the conscious/deliberate execution of rapid and precise alternating articulator movements 23, constitutes one of the fundamental diagnostic exercises for both the RDP and the FDA, since sluggish movement of the articulators – particularly in the case of the tongue and lips – is often an indicator of a specific dysarthria subtype, such as spastic dysarthria. Both the RDP and the FDA devote multiple evaluation exercises to DDK assessment, with the RDP’s testing procedures being somewhat more extensive since it requires the test subject to perform three distinct DDK-type tasks involving bilabial, labiodental and alveolar-velar articulatory movements, corresponding to the phone sequences “/u i/”, “/p t k/” and “/k a/ – /l a/”. The FDA omits the “/p t k/” task and, for the labial diadochokinetic task, does not require the patient to actually phonate when executing the rounded and lateral alternating lip movements associated with the “/u i/” vowel combination 24. Since the assessment procedure for the bilabial DDK task relies predominantly on visual inspection, it is not suitable for the implementation of any acoustically based objective metric system and will not be discussed further in this study. The FDA’s alveolar-velar DDK task, however, does require phonation, as stipulated by the instructions to the test administrator:

“Ask the patient to say "ka la" 10 times as quickly as possible. Demonstrate 10 units in 5 seconds. Pay careful attention to observe:
1) Accuracy of articulatory placement from velum to alveolar ridge
2) Number of 'ka-la' segments produced
3) Time taken to produce ten segments
4) Regularity of production of the segments”

23. Diadochokinesis, in its broadest sense, can refer to deliberate alternating repetitive movements of any body part.
24. The instructions to the therapist for the bilabial DDK task state explicitly that “It is not necessary for the patient to use voice” when attempting to make the lip movements corresponding to the target “oo ee” speech sounds. The test administrator is reminded, however, that the patient should demonstrate a full range of lip movement: “Ask patient to exaggerate movement”.

The grading criteria are the following:

Grade A: No difficulty observed. 'ka-la' utterances produced with accurate articulatory placements, 10 segments produced in five seconds or less, segments produced with regular spacing.
Grade B: Some difficulty observed – slight incoordination, slightly slow; task takes 5 to 7 seconds to complete.
Grade C: One sound well articulated, other poorly presented, or task deteriorates; task takes up to 10 seconds to complete.
Grade D: Tongue changes in position, different sounds can be identified.
Grade E: No change in tongue position.

As can be inferred from the above, it is expected that the ten /k a/ – /l a/ utterances should each be of approximately five hundred milliseconds’ duration, with the velar (“ka”) and alveolar (“la”) lingual placements clearly and evenly separated by intervening /a/ vocalisations. It should be noted, of course, that the selection of /a/ as the interval phone is deliberate, since its pronunciation obliges the tongue to assume a lowered position after having been elevated to execute the higher velar /k/ and alveolar /l/ placements; this enforced high-low transition maximises the required articulatory range of motion, making the task suitably challenging and capable of differentiating varying degrees of DDK competence.
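As an illustration of the timing information such an assessment requires, the sketch below takes a sequence of per-frame labels (as might be produced by an AF or phone classifier) together with an assumed frame length, and returns the number of /k a/ – /l a/ cycles and their durations; the labels, the 10 ms frame and the simple run-length logic are assumptions for the sketch, not the custom classifier of chapter 5.

def ddk_segments(frame_labels, frame_ms=10.0):
    """Count /k a/-/l a/ cycles in a sequence of per-frame labels and return
    (cycle_count, per-cycle durations in ms). A cycle is counted each time a
    run of 'k' frames is followed, within the next few runs, by a run of 'l'."""
    # Collapse consecutive identical labels into [label, frame_count] runs.
    runs = []
    for lab in frame_labels:
        if runs and runs[-1][0] == lab:
            runs[-1][1] += 1
        else:
            runs.append([lab, 1])
    cycle_starts = [i for i, (lab, _) in enumerate(runs)
                    if lab == "k" and any(l == "l" for l, _ in runs[i + 1:i + 4])]
    durations = []
    for j, start in enumerate(cycle_starts):
        end = cycle_starts[j + 1] if j + 1 < len(cycle_starts) else len(runs)
        durations.append(sum(n for _, n in runs[start:end]) * frame_ms)
    return len(cycle_starts), durations

# Ten cycles of roughly 500 ms each (i.e. completion within five seconds)
# would correspond to a Grade A performance on the criteria above.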

Apart from eliciting a sufficiently extensive range of articulator motion, the other important objective of any DDK speech measurement task is to assess the speed and rhythm of articulator movement. In this regard, both the FDA and the RDP rely solely upon the test administrator’s subjective on-the-spot judgement to evaluate the patient’s performance; it is somewhat unfortunate that neither the RDP nor the paper-based FDA mandates the recording of the patient’s oral responses for this task, since it is not uncommon for the examining clinician to make significant errors regarding the timing and number of /k a/ – /l a/ segments produced by the test subject 25. Such errors would be reduced, of course, if the test administrator were afforded the possibility of reviewing recordings of the patient’s utterances at leisure, preferably with the assistance of some speech analysis tool – such as an AF classifier – capable of automatically counting the number of alveolar-velar placements produced and the spacing of such segments. As discussed in previous sections of this study, although there exist a number of AF classifiers with rates of recognition accuracy adequate for this task, none of these classifiers is configured to explicitly measure the duration of an articulatory movement, i.e. they may confirm that the desired articulatory behaviour has been observed but will offer no direct information concerning the length of time taken for execution 26. Since such temporal information would be highly useful in the context of this particular DDK assessment, it was necessary to custom-design an AF classifier for this purpose, the details of which are presented in chapter 5. For the moment, however, attention shall be focussed on the evaluation of hypernasality.

25. During several informal interviews with various speech therapists concerning the usage of the paper-based FDA during a typical examination, it was established that one of the major assessment difficulties for this particular DDK task was attempting to keep count of the number of /k a/ – /l a/ segments produced by the patient while simultaneously assessing the quality of each individual segment.
26. Typically, AF classifiers seek to identify the speech information present in each windowed segment of the acoustic signal presented for evaluation (see chapter 3), but they are not usually configured to extract the associated timing information; for example, if a speech signal contained 10 consecutive frames which the AF classifier recognised as /a/, the classifier would not normally be passed information specifying the signal’s sampling frequency or frame length, and thus could not calculate the actual duration of a speech sound occurring over a specified number of frames.

2.5.6 Evaluation of Velopharyngeal Competence

Involuntary nasal air emission during speech, or hypernasality, results from a joint dysfunction of the palate and pharynx whereby the tissues of these two articulators fail to properly regulate airflow between the oral and nasal cavities 27. Owing to this excessive – and oftentimes inappropriate – escape of air from the nostrils during speech, methods of detecting hypernasality are quite similar to those used for detecting poor lip seal: in both cases, unexpected and erratic discharges of air create high-frequency noise which occludes the speech signal. Both the RDP and the FDA hypernasality evaluation tasks require the test subject to perform repetitive DDK-like utterances. The FDA’s version of the task is, however, better suited to exposing any insufficiencies in this area since it specifically elicits the production of a combination of nasalised and non-nasalised phone sequences (“may” and “nay”), the pronunciation of which will almost always provoke an obvious manifestation of the excessive nasal air emission characteristic of hypernasality. The criteria for the FDA hypernasality task are as follows:

Ask the patient to say /may pay/ and /nay bay/ while you listen to the change of quality. The assessor may find that placing his/her fingers on the bridge of the nose or using a mirror under the patient's nose will assist reliable grading.

Grade A: Normal resonance. No nasal emission.
Grade B: Slight hypernasality/imbalanced nasal resonance and/or occasional slight nasal emission.
Grade C: Moderate hypernasality or imbalanced nasal resonance, some nasal emission.
Grade D: Moderate to gross hypernasality or imbalanced nasal resonance, some nasal emission.
Grade E: Speech completely masked by gross hypernasality or nasal emission.

27. This impaired capacity to regulate airflow is usually due to the inability of the palate and pharynx to achieve an airtight seal, a stricture necessary during the production of non-nasalised speech sounds such as vowels.

As indicated by the abovementioned evaluation guidelines, a combination of audio-visual and tactile inspection is the usual method for measuring the severity of symptoms; palpating the patient’s nose bridge to detect nasal airflow during the production of non-nasalised vowels can prove particularly helpful in cases where manifested symptoms are relatively mild. Although the detection of hypernasality purely by auditory inspection would not provide the wealth of diagnostic information which can be extracted from tactile and visual examination, the introduction of some type of objective acoustic metric system to quantify hypernasal symptoms would still be of considerable benefit with regard to the classification and description of the observed abnormalities. It is also desirable, of course, that such a system should be capable of verifying whether the test subject has attempted to execute the solicited articulatory behaviour in order to produce the phones /ay/, /m/, /p/ and /b/. This type of AF analysis would be quite similar to that needed for lip seal evaluation, as discussed in section 2.5.4. Details of the hybrid hypernasality-AF recognition system implemented as a solution for this particular FDA evaluation task are presented in chapter 5. For the moment, however, the measurement and description of laryngeal functions shall be the focus of attention.

2.5.7 Evaluation of Laryngeal Functions

All articulatory activity relating to voicing, or the vibration of the vocal cords during speech, constitutes perhaps the most distinguishing physical characteristic of human oral communication. Even though other animal species – most notably the parrot – may be capable of producing speech sounds, the physical development of their larynx and its associated functions is considerably inferior to that evidenced in Homo sapiens. Given the integral role of phonation in oral communication, it is not surprising that both the RDP and the FDA devote a series of tasks to evaluating laryngeal function, placing special emphasis on voice quality, pitch modulation, the ability to vary the loudness of speech, and the overall co-ordination of the aforementioned laryngeal functions to produce fluent and well-inflected speech. The following section reviews measurement techniques for one of the most fundamental characteristics of phonation, voice quality.

2.5.7.1 Evaluation of Voice Quality

A common symptom of dysarthria is some abnormal change in the quality of phonation, particularly those pathological changes which are described as hoarse, breathy, creaky or strangled. It is often the case – hence the inclusion of a specific FDA phonation task targeting this area – that the extent and specific pathological texture of voice quality degradation is an important indicator of the specific type of dysarthria manifested and its level of severity. Breathy voice, for example, is normally caused by flaccid dysarthria, while strangled voice is associated with spasticity. Table 2.4 presents the voice pathology types associated with the various dysarthria categories:

Table 2.4: Voice Pathology and Associated Dysarthria Sub-Type (Enderby, 1995)

Dysarthria Type     Associated Voice Pathology
Ataxic              Guttural
Extrapyramidal      Strained/strangled and/or breathy
Flaccid             Breathy
Mixed               Strained and/or breathy, occasionally guttural
Spastic             Strangled/strained, harsh

Both the FDA’s and the RDP’s voice quality assessment tasks use purely subjective criteria, with the RDP’s guidelines being the more vague 28. The FDA voice clarity test, even though more precise in its specification of the pathological phenomena to evaluate, is still flawed in one important aspect: there is no definitive or standard clinical interpretation of certain key terms used to describe abnormal voice quality. It is often the case, for example, that the descriptions “hoarse” and “husky” are used interchangeably when classifying unclear phonation; similarly, the FDA voice clarity test’s evaluation guidelines incorporate descriptions such as “guttural” and “strangled” without offering any clear explanation of what they actually signify. A careful perusal of the task rubric and grading instructions – presented in the following paragraph – reveals certain ambiguities:

Task: this task assesses the quality and length of phonation. Thus if the patient’s phonation is continually husky they would score an ‘E’. Ask the patient to say “AH” for as long as possible. Demonstrate and note the second attempt. Only time voice that is clear. Exclude phonation not produced by the vocal folds, e.g. guttural, pharyngeal vibration.

Grade A: Patient can say “AH” clearly for at least 15 seconds.
Grade B: Patient can say “AH” clearly for 10 seconds.
Grade C: Patient can say “AH” for 5 to 10 seconds – phonation interrupted by intermittent huskiness or breaks in phonation.
Grade D: Patient can say “AH” for 3 to 5 seconds clearly.
Grade E: Patient unable to maintain a clear phonation on “AH” for 3 seconds. Voice continually strained/strangled or guttural.

28. The RDP, unlike the FDA, does not incorporate a specific test solely for the evaluation of voice quality; RDP assessment of phonation clarity is incidental, with the test administrator being instructed to grade the patient’s voice clarity based on utterances produced in response to stimuli from other laryngeal tests.

It is not evident what is meant by the descriptor “strained” in relation to phonation quality; its use in conjunction with “strangled” implies that the two terms are synonymous. Presumably, the directive to only “time voice that is clear” implies that one should not consider any portion of an utterance which evidences any type of laryngeal abnormality. In practical terms, this could prove problematic at a conceptual level. If a patient is only mildly husky during the execution of this task, should that individual’s entire phonation effort be discarded, or accorded the same grade as a very hoarse utterance? Moreover, do the grading criteria for this task even accommodate the notion of varying degrees of impairment? Is it possible to be just “mildly” guttural or “moderately” strained in voice quality? Such areas of uncertainty re-emphasise the fundamental problem with this assessment exercise: most of the descriptive terms used to classify pathological voice quality lack sufficient precision, making the introduction of objective acoustic measures particularly appropriate.
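Only the duration component of these criteria is unambiguous enough to encode directly; a minimal sketch of that component is given below (the judgement of what counts as “clear” phonation is precisely what the spectral measures discussed in the section that follows aim to make objective):

def phonation_grade(clear_seconds):
    """Map the timed duration of clear phonation of "AH" onto the FDA letter
    grades quoted above; only the duration component is modelled."""
    if clear_seconds >= 15:
        return "A"
    if clear_seconds >= 10:
        return "B"
    if clear_seconds >= 5:
        return "C"
    if clear_seconds >= 3:
        return "D"
    return "E"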


Fortunately, attempts at formulating objective voice quality metric systems have generally met with considerable success, with the two principal methodologies being spectral analysis of the speech signal and direct mechanical or visual inspection of the larynx. Videography techniques offer the most conclusive diagnostic evidence (as presented in Figure 2.9) but are not practical in the context of a typical FDA examination since most speech therapists would lack the specialised training required to manipulate the filming apparatus within the laryngeal cavity.

Figure 2.9: Vocal Cords at Point of Maximum Closure – Normal, Breathy and Creaky Voice (from Fourcin, 2005)

Figure 2.10: Full Glottal Cycle and Laryngograph Waveform for Normal Voice (from Fourcin, 2005)

Figure 2.11: Laryngograph Waveform for Breathy Voice (from Fourcin, 2005)

Figure 2.12: Laryngograph Waveform for Creaky Voice (from Fourcin, 2005)

Figure 2.13: Laryngograph Waveform for Normal Voice (from Fourcin, 2005)

The laryngograph, pioneered by Fourcin (1974), detects both the frequency and the force of contact between the vocal cords during in-speech excitation and is thus able to gauge the completeness of vocal cord opening and closure. The open-close cycle of healthy vocal cords during voicing will produce a waveform period consisting of just one peak and trough at regular intervals; multiple peaks and/or troughs within the same period usually indicate some form of pathological condition. The laryngograph waveform presented in Figure 2.11, for example, is that of an individual diagnosed with breathy voice. Any pathological features of the phonation can therefore be quantified and described in terms of the number, amplitude and frequency of the peaks observed per cycle. Such a description would clarify the ambiguities of subjective categorisation and simplify the issue of classifying the severity of the symptoms. Unfortunately, the advantages of adopting laryngograph-based metrics for voice quality are counterbalanced – as mentioned previously – by the prohibitive cost of equipment acquisition, which makes this option somewhat unattractive. Furthermore, even though the apparatus is non-invasive, the need for its electrodes to be fitted tightly around the neck during operation may prove problematic for severely dysarthric individuals suffering from vocal cord spasticity. It is thus necessary to explore other less expensive (and less uncomfortable) alternatives which could still offer a viable voice quality classification solution. Various speech signal spectral analysis techniques would appear to offer such a solution, and their accuracy is discussed in the section that follows.

2.5.7.2 Spectral Analysis Techniques to Describe Voice Quality

Spectral analysis of voice quality, which consists in the measurement and description of phonation clarity by reference to the properties of its acoustic spectrum, is a well-researched discipline boasting more than four decades of investigation. Since it extracts information from the speech signal using non-contact techniques, spectral analysis is, in comparison to laryngograph-based analysis, a more suitable technology for FDA diagnostic procedures. Although adequate voice classification accuracy has been achieved using signal-to-noise ratio (SNR) techniques, such as those employed by Nessel (1962) and Emanuel and Sansone (1969), better precision has been obtained using harmonic-to-noise energy ratio (HNR) analysis, this analysis being, essentially, a comparison of the energy intensities of the harmonic and non-harmonic components of the speech signal. The HNR may be defined as:

HNR\,(\mathrm{dB}) = 10 \log_{10} \left( \frac{r'_x(\tau_{max})}{1 - r'_x(\tau_{max})} \right)        (Eq. 2.1, from Boersma, 1993)

where r'_x(\tau_{max}) is the maximum of the normalised autocorrelation of the signal x(t) taken at a time lag greater than zero; this maximum expresses the degree of periodicity across the frequency ranges within the signal (Boersma, 1993).
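A minimal sketch of how Eq. 2.1 can be evaluated on a single voiced frame is given below; the autocorrelation search range, frame length and synthetic test signal are assumptions for illustration, and the sketch omits the windowing corrections of Boersma’s full method.

import numpy as np

def hnr_db(frame, fs, f0_min=75.0, f0_max=500.0):
    """Approximate harmonics-to-noise ratio (dB) of one voiced frame using the
    normalised autocorrelation maximum at a non-zero lag (cf. Eq. 2.1)."""
    frame = frame - np.mean(frame)
    ac = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    if ac[0] <= 0:
        return float("-inf")
    ac = ac / ac[0]                                   # normalise so that r'(0) = 1
    lag_min, lag_max = int(fs / f0_max), int(fs / f0_min)
    r_max = np.max(ac[lag_min:lag_max])
    r_max = min(max(r_max, 1e-6), 1 - 1e-6)           # keep the ratio finite
    return 10.0 * np.log10(r_max / (1.0 - r_max))

# Synthetic 100 Hz "voiced" frame with a little additive noise.
fs = 16000
t = np.arange(int(0.04 * fs)) / fs
frame = np.sin(2 * np.pi * 100 * t) + 0.5 * np.sin(2 * np.pi * 200 * t)
frame += 0.05 * np.random.randn(len(t))
print(round(hnr_db(frame, fs), 1))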

Using a variation of HNR analysis which they dub relative harmonic intensity analysis (Hr), Hiraoka et al (1984) report an accuracy rate of 95% in the classification of hoarse and normal voices for a selected group of speakers with normal and pathological voices 29. Essentially, Hiraoka’s Hr metric is the sum of the energy intensities of the second and higher harmonics expressed as a percentage of the overall harmonic intensity, including that of the fundamental frequency. The relative harmonic frequency energy ratio may be defined as:

H_r = \left( \sum_{i \geq 2} p_i \,/\, P \right) \times 100\,(\%)        (Eq. 2.2, from Hiraoka et al., 1984)

where pi is the spectral energy at the ith harmonic and P is the total energy including that contained in the fundamental. Hiraoka observed that ninety per cent of the speakers in the pathological voice group exhibited an Hr smaller than 67% (i.e. the harmonics contributed less than two thirds of the overall energy) while 94 % of speakers from the non-pathological group manifested Hr values greater than this 67% threshold value. Despite their excellent results 30 , the study of Hiraoka et al. (1984) seems somewhat inadequate in one important aspect: their categorisation of the abnormal features of pathological voice is not sufficiently discriminatory. Hiraoka divides the speakers into just two categories, namely hoarse and normal, and thus implicitly suggests that “hoarseness” should be considered a generic category encompassing all types of abnormal voice texture. The validity of this hypothesis is inadvertently challenged by Hiraoka himself when making the observation that “Some hoarse voices (e.g. breathy, asthenic) manifest a prominent F0”. Hiraoka found this characteristic so distinctive that 29

Hiraoka’s test group consisted of 66 subjects, thirty-six of which were normal and the remainder diagnosed with the following disorders: (i) Recurrent nerve paralysis – 7 cases; (ii) Laryngeal polyp – 6 cases; (iii) Laryngeal cancer – 5 cases; (iv) Polypoid vocal cords – 4 cases; (v) Vocal cord nodules – 3 cases; (vi) Spastic dysphonia – 2 cases; one case each of (vii) Laryngeal papilloma and (viii) Vocal cord cysts 30 Six per cent of the normal speakers exhibited Hr values below the 67% threshold, rendering them indistinguishable from the pathological group using the harmonic intensity threshold.


it required the application of compensatory techniques to prevent this F0 prominence from causing the Hr algorithm to incorrectly classify breathy/asthenic voice types: "To analyze voices manifesting this characteristic, we distinguished the F0 component from the harmonic component in the harmonic intensity analysis". (p. 1650, Hiraoka et al., 1984) This need to treat breathy voices in such a special manner would seem to indicate that these voice types are actually a distinct pathological category. Certainly, in the context of dysarthria assessment and diagnosis, "hoarseness" and "breathiness" are two types of voice abnormality which are not considered to be synonymous (Enderby, 1980).
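For comparison, a correspondingly simplified sketch of Hiraoka's Hr measure (Eq. 2.2) is given below. It assumes that F0 is already known and simply reads the power of each harmonic from the nearest FFT bin; the function name and parameter choices are illustrative assumptions only, not a reconstruction of Hiraoka's procedure.

import numpy as np

def relative_harmonic_intensity(frame, fs, f0, n_harmonics=10):
    """Hr (Eq. 2.2): energy of the 2nd and higher harmonics as a % of total harmonic energy."""
    windowed = frame * np.hanning(len(frame))
    spectrum = np.abs(np.fft.rfft(windowed)) ** 2        # power spectrum
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / fs)
    # take the power in the bin nearest each harmonic of f0 (below the Nyquist frequency)
    powers = [spectrum[np.argmin(np.abs(freqs - k * f0))]
              for k in range(1, n_harmonics + 1)
              if k * f0 < fs / 2]
    total = sum(powers)                                   # includes the fundamental (k = 1)
    return 100.0 * sum(powers[1:]) / total if total > 0 else 0.0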

Hiraoka et al. (1984) are not alone in failing to provide a sufficiently fine-grained categorisation; despite using a variety of features such as jitter, shimmer and spectral slope in their analytical model, Li et al. (2004) only offer a generalised classification, designating the voice quality of a given speech sample as either "normal", "moderately noisy" or "severely noisy". This reluctance to propose a more specific voice quality classification may stem from an awareness that – as discussed in section 2.5.8 – there is still a lack of consensus regarding what is meant by the various terms used to describe abnormal voice quality. Indeed, even commercial state-of-the-art voice analysis software applications such as the Multidimensional Voice Program (MDVP) (Kent, 1996) avoid the use of these contentious terms when offering a diagnosis. The dysarthria-specific version of the MDVP, for example, measures the amount of deviation a vocalisation may exhibit from some pre-determined norm, but no attempt is made to qualify any observed deviation by categorising it, for example, as "creaky", "strangled" or "husky" (Enderby, 1980; Colton and Casper, 1990). Despite the ambiguity surrounding the definition of these terms, an attempt has been made by this researcher to identify any spectral feature patterns which appear to be correlated with specific types of abnormal voice texture. The success, or lack thereof, in finding such patterns is discussed in chapter 5 of this thesis. For the moment, however, a review of current measurement techniques pertaining to another important laryngeal function – pitch modulation – is of more immediate interest.


2.5.8 Evaluation of Pitch Modulation

The ability to purposively vary the frequency at which the vocal cords vibrate during voicing – a phenomenon commonly referred to as pitch modulation – is an essential skill for the production of well-inflected speech; the term "well-inflected speech" in this context being defined as oral communication incorporating tonal cues which appropriately emphasise emotional state and message content 31 (Markel et al., 1973). It is important to note, of course, that pitch, defined in its broadest sense, is a percept encompassing more than a mere measurement of vocal cord vibration during phonation. In the context of dysarthria diagnosis, however, issues concerning the perception of pitch are not relevant and will receive no further consideration in this investigation. The evaluation and measurement of pitch modulation will thus be restricted to its manifestation as a physical behaviour of the vocal cords during a speech task. Both the FDA and RDP incorporate tests which specifically require the patient to demonstrate pitch modulation competence by varying glottal frequency in regular incremental steps. The RDP's test offers a more comprehensive evaluation than its FDA equivalent since the RDP version requires the test subject to produce both an incremental and a decremental series of frequency changes to demonstrate competence. The FDA's pitch modulation task is not as well designed as the RDP's in another aspect: the test instructions imply 32 that the patient should enunciate a multiple-phone utterance – namely the well-known "Doh Ray Me Fah So Lah Tee Doh" sequence or a subset thereof – while attempting to produce at least six incremental pitch changes by "singing up" a musical scale. The implicit recommendation of the "Doh Ray Me Fah So Lah Tee Doh" recitation may make the execution of the task more intuitive, but it also introduces certain complexities which inappropriately solicit articulatory behaviour not strictly relevant to pitch modulation: the extra articulatory exertions needed to produce a series of diverse phones – such as the fricative /f/ in "Fah", or the vowel /i/ in "Tee" – could well constitute a significant distraction from the principal task of producing the required series of incremental pitch changes. Introducing the possibility for such distraction becomes all the more undesirable considering the grading criteria for this task:

31 In tonal languages, pitch contour is also used to convey denotative meaning, i.e. correct interpretation of a spoken word or series of words may depend just as much on pitch contour cues as on correct pronunciation of component phonemic units. In the Mandarin language, for example, the phoneme sequence /m/ followed by /a/ may signify either "horse" or "mother", with the distinguishing feature being that /a/ is spoken with a flat tone for "mother" but with an inflected (i.e. falling and then rising) tone for "horse".

32 The instructions to the test administrator are as follows: "Ask the patient to sing a scale (at least six notes). Demonstrate and make assessment on second attempt. Using visual indication of pitch, e.g. on laryngograph or similar display assists in grading this section reliably."


Grade A: No abnormality.
Grade B: Good, but patient shows some minor difficulty – pitch breaks or labouring.
Grade C: Patient able to represent four distinct pitch changes. Uneven progression.
Grade D: Minimal change in pitch – shows difference between high and low.
Grade E: No change in pitch.

It is somewhat paradoxical that the above-cited assessment criteria should stipulate that the speaker is to be penalised in the event of inadvertent pitch breaks or uneven progression, yet the probability of such irregularities occurring is increased by the unnecessary introduction of multiple-phone sequences that make the task needlessly difficult for patients suffering from speech impairments – such as velopharyngeal incompetence – which make pronunciation of sibilants and fricatives problematic. To further complicate matters, the presence of long-duration unvoiced phones in the exemplar, namely the /f/ of "Fah" and the /s/ of "So", usually results in an utterance with substantial portions of unvoiced speech yielding no pitch information. The RDP's version of the pitch modulation task avoids this unnecessary variability and inclusion of unvoiced sounds by explicitly specifying that only a single speech sound – the vowel /a/ – should be used when attempting to demonstrate pitch modulation competence. This simplification is an example of good test procedure because it adheres to a fundamental principle of diagnostic assessment: a test should be so designed that it targets and measures only the phenomena of interest and excludes, as much as possible, the presence of extraneous elements which could introduce noise. Unfortunately, the FDA's version of the pitch modulation task does not adhere to this "extraneous element" exclusion principle 33, a situation which – as discussed in chapter 6 – has complicated the computerisation and implementation of an objective metric system for this task.

33 The FDA instructions to the test administrator – "Ask the patient to sing a scale" – are deliberately vague, permitting either a multiple- or mono-phone execution of the task (i.e. the test subject could opt to use the traditional "Doh Ray Me…" sequence or simply repeat a single voiced vowel, such as /a/, to attempt the desired incremental pitch changes). The optimal solution, of course, would be to eliminate this ambiguity and specify – as is the case with the RDP – which monophone vowel should be used.


The various merits and demerits of the FDA and RDP versions of the pitch modulation testing procedure notwithstanding, the core engine of any objective pitch modulation assessment tool must be some type of pitch detection algorithm (PDA). Various PDA methodologies and their suitability will now be considered in greater detail:

2.5.8.1 Pitch Detection Techniques

Devising signal processing algorithms to determine vocal cord vibration frequency boasts one of the longest traditions in the speech technology sciences, with the first software-based PDA application implemented by Noll (1964). In order to determine the pitch period (the amount of time it takes for a single open-close cycle of the vocal cords), pitch tracking applications may employ a variety of techniques, a detailed description of which is provided by Rabiner et al. (1976) and Hermes (1993). However, in terms of feasibility of implementation in the context of this investigation, there are two PDA methodologies which are of interest:

• time domain detection methods, where pitch information is extracted by autocorrelation – a technique which compares a segment of the speech signal with adjacent segments to search for any repeating patterns of peaks and troughs in the amplitude envelope which would indicate a pitch period, or

• frequency domain detection methods, where the pitch period is also determined via autocorrelation, but in this instance the speech signal is first decomposed into component frequency bands and separate autocorrelation analyses are then computed for each band to ascertain the minimum period of delay necessary to observe a simultaneous peak in energy across all bands. The summary autocorrelogram in Figure 2.14 presents a typical instance of such frequency autocorrelation analysis. In this particular example, the cross-frequency energy peak observed at a lag of ten milliseconds indicates a pitch of 100Hz.


Figure 2.14: Summary auto-correlogram for 15ms of speech data

Unfortunately, consistent and accurate estimation of glottal frequency, also referred to as the fundamental frequency or F0 (Licklider, 1951; Rabiner et al., 1976; Hermes, 1993), remains an elusive goal, with neither time-domain nor frequency-domain PDAs proving able to extract pitch information reliably under non-optimal conditions, such as in the presence of unpredictable loud background noise or when voice quality is very breathy or hoarse. Given the expectation, however, that the FDA interview should be conducted in quiet conditions, it is reasonable to assume that the speech signal to be evaluated for pitch information should be clean, i.e. exhibiting a signal-to-noise ratio (SNR) of 50 dB or greater. For clean speech, de Cheveigné's autocorrelation-based "YIN" PDA (2002) represents the state of the art in terms of accuracy and is even capable of simultaneously extracting pitch information for multiple speakers active at the same time. This poly-pitch tracking capability comes at a substantial computational cost, however, and makes de Cheveigné's implementation not the most practical solution given the FDA's task requirements in this regard. A computationally less expensive variant of the autocorrelation function is the Average Magnitude Difference Function (AMDF), the core algorithm of which is represented in the following equation:


d(n, l) = Σ (j = 0 … N−1) | y(j + lM) − y(j + lM − n) |

(Eq: 2.3, from Lin and Goubran, 2005)

where n represents the time lag, N corresponds to the window length, l is the time frame index and M is the frame update step in time. As evident from the above equation, the appeal of the AMDF method as an FDA-pitch tracking solution rests principally in its efficiency: it does not require any floating point multiplication. Furthermore, the AMDF subtractive method of amplitude peak-pattern comparison allows it to use shorter speech signal segments to calculate pitch period. These advantages, however, do not lessen the AMDF technique’s susceptibility to a generic weakness of all autocorrelation-based PDAs: pitch doubling/halving.
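A minimal sketch of a single-frame AMDF pitch estimate in the spirit of Eq. 2.3 is given below; the search range, the use of the mean difference rather than the raw sum, and the function name are assumptions made for illustration, not features of any published AMDF implementation.

import numpy as np

def amdf_pitch(frame, fs, f0_min=75.0, f0_max=500.0):
    """Return an F0 estimate (Hz) from the deepest AMDF valley within the search range.

    The frame should span at least two pitch periods of the lowest frequency sought.
    """
    lags = np.arange(int(fs / f0_max), int(fs / f0_min))
    # Eq. 2.3 in essence: absolute differences between the frame and a lag-shifted copy
    d = np.array([np.mean(np.abs(frame[lag:] - frame[:-lag])) for lag in lags])
    best_lag = lags[np.argmin(d)]      # the AMDF minimum marks the pitch period
    return fs / best_lag               # note: octave (doubling/halving) errors remain possible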

As discussed previously, the autocorrelation method determines the pitch period by examining contiguous segments of a speech signal and locating – if present – distinct peak-trough patterns in the amplitude envelope. The pitch period normally corresponds to the distance between the highest amplitude peaks in two adjacent cycles; on some occasions, however, the highest peak in a cycle may appear at a point corresponding to half the pitch period – causing the vocal cord vibration frequency to be erroneously estimated at twice its true value. Conversely, the highest peak may not be detected at all if it is not significantly taller than neighbouring peaks. It is often the case that the non-detection of these demarcatory peaks follows some coherent pattern, such as the suppression of alternate peaks in a periodic series. If the PDA does indeed fail to detect every other peak, then the pitch period will be overestimated and the frequency reported at half its correct value. Of course, the incorrect estimation of amplitude peak position can be any multiple of the correct value, so that pitch quadrupling, for example, is also a possibility. Whatever the magnitude of the miscalculation, the essential issue to consider is that of error recovery: given a series of frequency values extracted by the PDA for some pitch modulation attempt, some type of confidence measure should be implemented in order to (i) correct measurements which seem likely to be inaccurate and (ii) assist the test administrator in gauging the reliability of the returned scores as a whole. The provision of error recovery mechanisms is secondary, of course, to the principal goal of providing the test administrator with objective F0 pitch contour tracking information in order to allow a more accurate description of the patient's competence in this domain. It is important that key diagnostic criteria for this task – such as the notion of a "distinct" pitch change or "even" progression – should be unambiguously defined in objective acoustic terms so as to establish a common point of reference and, ultimately, some form of standardisation. If one considers, for example, that the FDA pitch modulation task


specifies the production of at least six incremental changes while singing a scale, should the term "singing a scale" be rigorously interpreted to mean that the test subject is expected to double the pitch frequency of the initial phonation using between six and eight evenly spaced incremental steps? What degree of variability is allowed – in terms of the timing and frequency changes between steps – beyond which it is considered that the progression is uneven and should be penalised? These issues will receive more careful scrutiny in chapter 6. At this point, however, it is more appropriate to review current methodologies for the evaluation of another important component in the production of well-inflected speech: control of speech volume.
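One possible way of making "even progression" concrete is sketched below: each sung step is expressed as a frequency ratio relative to its predecessor, and the relative spread of those ratios is reported. This is purely an illustrative operationalisation – the function name is invented, and the thresholds that would separate "even" from "uneven" progressions are deliberately left open, since the criteria actually adopted are those discussed in chapter 6.

import numpy as np

def progression_evenness(step_f0s):
    """Given the mean F0 (Hz) of each sung step, return (number of steps, ratio spread as a %)."""
    f0 = np.asarray(step_f0s, dtype=float)
    ratios = f0[1:] / f0[:-1]                            # frequency ratio of each step to its predecessor
    spread = 100.0 * np.std(ratios) / np.mean(ratios)    # relative spread of the step ratios
    return len(f0), spread

# e.g. an eight-note scale that doubles the starting pitch (hypothetical F0 values in Hz)
print(progression_evenness([131, 147, 165, 175, 196, 220, 247, 262]))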

2.5.9 Evaluation of Volume Control

The ability to appropriately increase and decrease vocal acoustic energy during speech is just as important as pitch modulation in enhancing oral communication. The RDP devotes two specific tasks to the evaluation of volume/loudness control, requiring the patient to attempt both an incremental and decremental series of volume changes while repeating the vowel /a/. Conversely, the FDA volume control test only requires the patient to attempt incremental changes via the recitation of the digit sequence "one" to "five". The FDA grading criteria are as follows:

Grade A: Patient able to change volume in controlled manner.
Grade B: Minimal difficulty - occasional numbers sounding similar in volume.
Grade C: Changes in volume but noticeably uneven progression.
Grade D: Only limited change in volume and great difficulty in control.
Grade E: No change in volume. If patient is excessively loud or quiet mark this grade even if makes some minor shifts in volume.

In terms of research objectives, the principal challenge concerning the computerisation of this task is the formulation of some reliable method of objectively measuring loudness, a non-trivial task since the impression of how loud a sound seems to be depends not only on the actual acoustic properties of the sound itself but also on environmental factors and the listener’s receptivity. A review of current loudness measurement systems is presented in the section that follows.


2.5.9.1 Quantifying Perceptions of Loudness

Loudness, like pitch, is a percept which is correlated to, but not solely defined by, the intensity of energy in the acoustic signal and how this energy is distributed across the signal's principal component frequencies. For those frequencies to which the human ear is most responsive (30 – 15000 Hz), it has been demonstrated that, in general, the higher the sound's frequency, the less energy is required to create the same impression of loudness. When presented, for example, with a 100Hz pure tone at 60 dB, most listeners with normal hearing will perceive this 100Hz tone to have the same loudness as a 30Hz tone at 80dB, even though the second tone has an intensity difference of 20dB. This energy-frequency correlation with perceived loudness was first systematically demonstrated by Fletcher and Munson (1933) who expressed this relationship graphically as a series of contour lines – known as equal loudness contours – which identify those pure tone energy-frequency combinations that produce identical impressions of loudness for most listeners with normal hearing.

Figure 2.15: Equal Loudness Contours (ISO 226)

This equal loudness contour hypothesis forms the basis of a simple loudness metric known as the phon (ISO 31-7): one phon is defined as the loudness of a 1000Hz pure tone presented at an intensity of one decibel (Moore, 1997), the implication being that the decibel and phon values for a thousand hertz pure tone are identical.


Unfortunately, equal loudness contours, and by extension the phon, are of only limited applicability in the case of complex sounds (i.e. those sounds – such as speech – which incorporate a range of frequencies manifesting differing energetic intensities). As first confirmed by Fletcher (1940), the perceived loudness of a given complex sound depends on the energy distribution pattern across that sound's component frequencies: the more diffused the energy is across the acoustic spectrum, the weaker the sensation of loudness. Since the equal loudness contour hypothesis does not model the effect of cross-spectrum energy distribution, the phon is not the most suitable metric for describing the loudness of speech. This difficulty in quantifying perceived loudness for diverse and complex sounds remains an issue which has not been definitively resolved. An all-inclusive computational model is yet to be implemented which can simulate the human auditory system's response to such variables as changes in a sound source's frequency, duration, complexity and distance from the listener. In the context of FDA speech volume evaluation, however, this loudness evaluation problem is to some extent simplified since the signal to be evaluated – the human voice – should always be of the same general type and acoustic complexity. In the same vein, it is reasonable to assume that the microphone's distance from the sound source (the test subject's mouth) will not vary significantly, especially if a head-mounted microphone is used. Given this relative conformity in terms of environmental factors and sound source acoustic properties, the sone metric proposed by Stevens would appear to be an appropriate unit of measurement for this particular FDA task. The sone is derived from Stevens' hypothesis that perceptions of loudness, L, are correlated to the signal's physical intensity, I, but scaled by a psychoacoustic constant k:

L = k · I^0.3

(Eq: 2.4, from Moore, 1997)
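Eq. 2.4 can be applied directly once a value for k is fixed. In the sketch below, k is chosen (an assumption made purely for illustration) so that a 1000Hz tone at 40 dB SPL maps to one sone, the conventional anchor point of the scale; with this choice a 10 dB increase approximately doubles the computed loudness.

def sones_from_db_spl(level_db):
    """Stevens' power law (Eq. 2.4): loudness L = k * I^0.3, anchored at 1 sone for 40 dB SPL."""
    i_rel = 10.0 ** (level_db / 10.0)           # intensity relative to the reference intensity
    k = 1.0 / (10.0 ** (40.0 / 10.0)) ** 0.3    # fixes L = 1 sone at 40 dB SPL
    return k * i_rel ** 0.3

print(sones_from_db_spl(40))   # -> 1.0
print(sones_from_db_spl(50))   # -> ~2.0 (a 10 dB increase roughly doubles loudness)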


Figure 2.16: Sone Loudness Units and Corresponding dB Values (from Moore, 1997)

The identification of an appropriate metric represents, of course, only a partial solution to the CFDA voice loudness evaluation problem. It would be very useful if loudness scores output by the CFDA for this task could give some approximation of the absolute loudness of a given speaker's voice. As discussed in preceding sections, microphone quality has a significant effect on the rendering of real-world acoustic phenomena as a digitised signal. The effect of a given microphone's response to a sound source – this effect being known as the transfer function – will depend on the microphone's specific configuration. It is a non-trivial issue, therefore, to devise some method of normalising this transfer function in order to allow the comparison, in absolute terms, of the loudness levels of different vocalisations from different speakers recorded by different microphones on different occasions. The objective is to furnish the CFDA test administrator with a loudness evaluation metric system that could provide a specific standardised range of scores to distinguish normal from abnormal test performances. In attempting to devise such a standardised loudness metric, however, it must be taken into consideration that microphone sensitivity is not the only significant hardware variable in the digital recording process; the other important component in the channel – i.e. the process of digitising the sound – is the sound card, the dedicated processor responsible for converting the analog data received from the


microphone into some digital representation. Countering the effects of the sound card's transfer function is rendered even more problematic given the fact that this device is often user-programmable, i.e. it is possible to change the sound card's signal processing protocols via its driver (the software program that governs the sound card's interaction with other programs and the computer's operating system in general). A possible solution to this problem of standardisation could be the use of some reference speech signal to calibrate the recording equipment. In such a scenario, a recording of a speech sample for which the intensity values are known is played through the computer's sound card via loudspeakers of a specified configuration and then re-recorded by the computer's microphone. The difference in intensity values between the re-recorded signal and the original is then used to compensate for the combined transfer function effects of the specific microphone and sound card installed on the computer being used to administer the CFDA evaluation. The success of these proposed normalisation procedures is discussed in chapter 6. At this point, however, issues pertaining to the measurement of speech rate will become the focus of more detailed scrutiny.
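The calibration idea just described can be reduced to a comparison of levels, as in the following sketch. The function names are assumptions, the signals are taken to be NumPy arrays already loaded from the reference and re-recorded files, and RMS level is used purely for illustration – the CFDA's actual normalisation procedure is the subject of chapter 6.

import numpy as np

def rms_db(signal):
    """RMS level of a signal in dB (relative to full scale)."""
    return 20.0 * np.log10(np.sqrt(np.mean(np.square(signal))) + 1e-12)

def channel_offset_db(reference, rerecorded):
    """Gain offset (dB) introduced by the loudspeaker/microphone/sound-card chain."""
    return rms_db(rerecorded) - rms_db(reference)

def calibrated_level_db(patient_signal, offset_db):
    """Subtract the measured channel offset from a patient recording's level."""
    return rms_db(patient_signal) - offset_db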

2.5.10 Evaluation of Speaking Rate

Speech rate – which may be defined in this instance as the speed at which perceptually distinct word or sub-word speech sounds (phones) are produced during oral communication – can provide important indicators concerning the patient's emotional state 34 and dysarthric condition. It has been noted by Enderby (1986), for example, that the only type of dysarthria which occasionally causes involuntarily rapid speaking rates is Extrapyramidal dysarthria; conversely, all other categories of dysarthria tend to have the opposite effect and cause abnormally slow rates of speech. Given this acknowledged importance of speech rate in diagnostic evaluation, it is somewhat surprising that the FDA relegates speech rate measurement to the status of being merely one of the "influencing factor" tasks, with no separate grade and no clear guidelines as to how influential this feature should be when making the overall diagnosis. The actual FDA instructions for speech rate assessment are quoted below:

34 Universally, a relaxed and informal communication environment (as would obtain among family and friends) usually results in high words per minute rates; conversely, slower and more carefully articulated speech is usually indicative of a more formal communication context.


"The rate of speech is an important diagnostic characteristic which can assist in determining the difference between hypokinetic and spastic dysarthria in particular. However, it is difficult to determine fine grades of 'rate'. Thus having undertaken the task, determine whether the patient is talking at 'normal rate' or too fast or slow."

It is rather perplexing that, although mention is made that speech rate "can assist in determining the difference between hypokinetic and spastic dysarthria", no details are provided specifying how such a distinction can be made based on rate measurements. Furthermore, it is not clear whether the test administrator should evaluate speech rate over the course of the whole FDA diagnostic assessment interview or only during those specific tests where the speaker is likely to produce sustained periods of natural spontaneous speech. In any event, the introduction of objective speech rate measurement techniques – contrary to the protestations of the above-cited FDA test rubric – does indeed allow the test assessor to determine "fine grades of rate", as discussed in the following paragraph.

2.5.10.1 Defining Parameters for Speech Rate Measurement

One of the most popular and well-known speech rate measurement units is the words per minute (wpm) measure; subjective wpm measurements are usually carried out via a manual count of the number of words uttered during a sixty-second time period or some fraction thereof. Unfortunately, the wpm method attempts to measure a unit – the word – which is a rather amorphous and non-standardised entity: conceptually, the notion of a written word usually corresponds to some series of alphabetic characters preceded and followed by blank space separating such a string of characters from other strings. In virtually all Indo-European languages, words may be combined to form compound units – e.g. "black" and "bird" to produce "blackbird" – and these aggregates are themselves considered as single words in the context of wpm-based speech rate measuring techniques. This possibility of aggregation renders it quite unlikely that any two words produced during spontaneous speech would consist of the same number of phones or syllables. Such unavoidable variability precludes the possibility of any sort of word-level normalisation and makes speech rate measurement techniques based on word count unsuitable for adoption as an objective metric. Given these intrinsic word-level normalisation difficulties associated with wpm-based speech metrics, most objective speech rate estimation techniques are based on some mechanism which counts sub-word speech sounds, typically at the syllable level. Over


the last decade, several approaches to syllable rate estimation have been proposed, their principal motivation being the improvement of ASR accuracy by detecting significant speaking rate fluctuations and the resulting change in pronunciation styles. Mirghafori et al. (1996) used HMM-ASR analysis to locate inter-word boundaries and then calculate the syllable-per-second rate based on utterance duration. This technique, as the researchers admit in a subsequent report, was inherently flawed since it erroneously assumed a certain base level of ASR accuracy under all conditions: "…this method requires the assumption that the speaking rate determined by a potentially errorful [sic] recognition hypothesis would be sufficiently accurate…this is often not the case, particularly for unusually fast or slow speech." (Morgan et al., 1997: 2080) Subsequent attempts therefore focused on extracting speech rate directly from the acoustic signal without any reference to speech encoding technologies. Such an approach, of course, has the advantage of being language independent and avoiding the circular logic trap of using ASR methods to detect those conditions which actually defeated said ASR methods in the first place! The enrate measure of Morgan and Fosler-Lussier, a refinement of a similar approach adopted by Kitazawa et al. (1997), represented the first substantive acoustic-based speech rate measure founded on spectral moment analysis. In the context of speech rate measurement, the spectral moment is essentially a measure of the ratio of high-energy sounds (which usually correspond to voiced speech) to lower-energy unvoiced phones such as consonants: the greater the incidence of voiced speech segments, the closer the spectral moment shifts towards the higher frequencies. Morgan and Fosler-Lussier's mrate (1998) refines the enrate measure via the introduction of multiple peak detection estimators: the mrate application divides the speech signal into four distinct frequency bands, and the correlation between energy peaks in contiguous bands provides a better indicator of speech rate. When compared with the manual transcriptions of expert phoneticians for the Switchboard corpus (Godfrey et al., 1992), the mrate speech rate algorithm yielded accuracy rates varying no more than two syllables per second from the manually verified syllable rate. This performance has not been significantly bettered by any subsequent algorithm, thus justifying its implementation – in modified form – for speech rate measurement in the CFDA. Of course, the computation of speech rate is


ancillary to the principal consideration of analysing the intelligibility of the speech itself, which is the subject of the ensuing section.
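Before turning to intelligibility, the envelope-peak-counting principle that underlies enrate/mrate-style estimators can be illustrated in a few lines of Python. The sketch below uses a single band and fixed, arbitrary thresholds, whereas the real mrate combines several estimators across four frequency bands; it is an illustration of the principle only, and the function name and parameter values are this writer's assumptions.

import numpy as np

def rough_syllable_rate(signal, fs, frame_ms=10, min_gap_ms=120):
    """Count peaks in a smoothed energy envelope and return an approximate syllables-per-second rate."""
    hop = int(fs * frame_ms / 1000)
    energy = np.array([np.sum(signal[i:i + hop] ** 2)
                       for i in range(0, len(signal) - hop, hop)])
    envelope = np.convolve(energy, np.ones(5) / 5, mode='same')   # light smoothing
    threshold = 0.3 * np.max(envelope)                            # arbitrary peak threshold
    min_gap = int(min_gap_ms / frame_ms)                          # ignore peaks closer than ~120 ms
    peaks, last = 0, -min_gap
    for i in range(1, len(envelope) - 1):
        if (envelope[i] > threshold and envelope[i] >= envelope[i - 1]
                and envelope[i] > envelope[i + 1] and i - last >= min_gap):
            peaks, last = peaks + 1, i
    return peaks / (len(signal) / fs)                             # syllabic nuclei per second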

2.5.11 Evaluation of Intelligibility

Intelligibility, at its most fundamental level, may be defined as the degree of success in establishing communication between the sender and intended recipient of a message 35. In the context of oral communications, this process may be disrupted at several stages, including during the act of speech production, transmission of the speech signal via its carrier medium, and the decoding of the signal by the receiving party. A more detailed discussion of these factors will be presented in chapter 7 and section 2.5.11.2 of this chapter. At this juncture, however, it is useful to review current dysarthria assessment intelligibility testing procedures.

2.5.11.1 FDA Intelligibility Evaluation Procedures

The inability to produce comprehensible speech is the core pathology of the dysarthric condition, the severity of such a condition usually being expressed as some type of intelligibility score 36, typically derived from an assessment exercise in which a listener attempts to interpret a series of utterances produced by the speaker being evaluated. According to the FDA classification of dysarthria severity, an individual is deemed to be:

• severely dysarthric if scoring less than thirty per cent overall on the intelligibility tests;

• moderately dysarthric if between thirty and fifty-nine per cent of his/her utterances are correctly interpreted;

• mildly dysarthric if obtaining a score between sixty and seventy-nine per cent 37.

35 The National Telecommunications and Information Administration (NTIA) defines it even more broadly as the "capability of being understood". Of course, demonstration of comprehension by the recipient may be manifested in different ways – ranging from a simple gesture to producing various forms of transcription – depending on the context and purpose of communication.

36 Paradoxically, the FDA intelligibility tests in themselves are not designed to offer any extra information to assist in identifying the patient's dysarthric sub-type.

The FDA intelligibility assessment procedure consists of three tests, two of which require the patient to repeat aloud a series of single words – in the case of the first test – or short phrases (in the case of the second). In both instances, the stimuli consist of fifty predefined items, a selection of which is listed in Tables 2.5 and 2.6.

Table 2.5: Selection of FDA Vocabulary for Isolated Words Task (Enderby 1983)

Beautiful | Charity | Huge | Parade | Shout
Broken | Forgive | Justify | Power | Tissue
Cab | Generation | Logical | Ridiculous | Wonderful
Coin | Gallery | Mug | Salary | Zero

Table 2.6: Selection of FDA Vocabulary for Isolated Sentences Task (Enderby 1983)

Can I have soup? | Go to the vet's | Do what you like | Get a cab
I've got a new toy | My daughter is a nurse | That's good | Let's go to the theatre
Put it in a dish | There's my dad | Look at his neck | She looks sad
Who are they? | My new van | You have to pay | Don't be so noisy
Have a bath | Enjoy your food | It's got late | She gave me a coin

When presenting a subset of these items during the course of an actual test, the test administrator should avoid acquiring any specific a priori knowledge of said items by ensuring that only the patient can see and read them at the point of delivery 38 . This is particularly important if the test administrator will be the sole listener responsible for assigning an intelligibility score to the test subject’s utterances, a situation which is not

37 A patient receiving an average score of eighty per cent or above would not be classified as suffering from any significant speech impairment.

38 Of course, there is also the possibility that experienced FDA test administrators will become quite familiar with the vocabulary items and such familiarity may introduce some element of bias.


unlikely since the FDA does not mandate recording of the patient’s performance for this or any other FDA task.

The assessment criteria for the isolated words and phrases/sentences tests are as follows:

Grade A: Ten (10) words correctly interpreted by the therapist, with speech easily intelligible.
Grade B: Ten (10) words correctly interpreted by the therapist, but therapist had to use particular care in listening and interpreting what was heard.
Grade C: Seven to nine (7-9) words interpreted correctly.
Grade D: Three to five words (3-5) interpreted correctly.
Grade E: Two or less (0-2) words interpreted correctly.

As discussed at length in chapter 3, the "particular care in listening" distinction is not consistently presented in the abovementioned evaluation guidelines. Although it is the only factor separating the two highest grades, there is no further mention made of it in the criteria demarcating the 'C', 'D' and 'E' grades, where the only stipulation is the minimum and maximum number of correctly interpreted utterances allowable in the given grade range (e.g. 3-5 correct interpretations to merit a 'D'). In addition to the inconsistencies mentioned in the preceding paragraph, there is further possibility for confusion concerning the evaluation guidelines for these isolated utterance intelligibility exercises: it is not clear how these minimum-maximum boundary values should be applied in the case of the isolated phrases test, where each of the utterances produced by the patient is intended to represent a multiple-word sequence. If, for example, a listener-judge managed to correctly decipher three out of four words for one of these multiple-word vocalisations, should the one word which defeated the listener cause the whole utterance to be deemed as being incorrectly interpreted? Would there be a significant difference if, for the same utterance, the listener had deciphered only two out of the four words correctly? The issue is further complicated in view of the fact that FDA sentence items vary in length from two to five words; thus it is quite possible that a listener could understand sixty per cent of the individual word items in a ten-sentence test set yet be obliged to give the speaker a score of zero if there was at least one misunderstood word in each sentence. Conversely, if the aforementioned percentage of correctly interpreted word items were to have been more densely concentrated in a subset of the same ten utterances, the resulting intelligibility rating could be as high as 60%. The implication here, of course, is that it is possible for two speakers performing the FDA isolated sentences test to score the same percentage of correctly deciphered words and yet


emerge with substantially different intelligibility ratings, varying by as much as two intelligibility classification categories (i.e. from "severe" to "mild"). The FDA test administrator instructions do not address this paradox and – from consultations with various clinician groups – it would appear that there is no standard procedure for scoring multiple-word utterances which are partially correct. The third intelligibility test is less structured and consists essentially of a five-minute dialogue between the test administrator and the patient with no specific format or guidelines, as indicated by the test instructions:

“Engage the patient in conversation for about 5 minutes. Ask about jobs, hobbies, relatives, and so on.”

The assessment criteria for this conversation-style intelligibility test are the following:

Grade A: No abnormality.
Grade B: Speech abnormal but intelligible - patient occasionally has to repeat.
Grade C: Speech severely distorted, can be understood half the time. Very often has to repeat.
Grade D: Occasional words decipherable.
Grade E: Patient totally unintelligible.

This particular type of intelligibility testing, where assessment is based on spontaneous verbal output, is arguably a more naturalistic and truer reflection of the test subject's ability to be understood in a real-world setting. Unfortunately, since the FDA testing procedure does not mandate the recording of oral responses for this task (which would allow review by a third party), there is still some cause for concern regarding the risk of a biased assessment due to familiarity with the patient's speaking style (Thomas-Stonnell et al., 1998; Carmichael and Green, 2003). This form of bias is better controlled in the other major dysarthria-oriented intelligibility test battery, the procedures for which will now be considered.


The RDP in itself does not incorporate any comprehensive intelligibility evaluation, this instead being the domain of a related but separate battery of tests known as the Assessment of the Intelligibility of Dysarthric Speech (AIDS) (Yorkston, 1981). Unlike the FDA, the AIDS evaluation does mandate the tape-recording of utterances made by the test subject 39 ; furthermore, the test administrator’s involvement in the actual grading/scoring of the recorded speech data should be restricted to that of supervising the playback of audio files to a panel of judges and tabulating their scores. The judging panel itself should be composed of experts in the discipline who have had no contact with the individual under assessment. Although the abovementioned “experts-only” evaluation procedure constitutes usual practice for this type of closed-set vocabulary intelligibility test, it has been demonstrated (Tjaden and Liss, 1995; Carmichael and Green, 2003) that familiarity with vocabulary items along with immediate prior exposure 40 to the speaking style of the test subject will engender a learning effect that can cause listeners to vary their assessments by as much as 30% after just four iterations of learning. This learning effect phenomenon will be discussed in greater detail in the sections that follow, along with an overview of the various intelligibility measurement methodologies.

39 With the exception of those responses to stimuli made during practice sessions.

40 In Carmichael and Green's 2003 study, a group of non-expert listeners heard a selection of dysarthric speech samples presented in five rounds; during each round the listeners were instructed to transcribe for each utterance what they thought the speaker was trying to say. At the end of each round, the listeners were apprised of the intended meaning of each utterance for that round. All of the listeners exhibited, to varying degrees, an in-test learning effect enabling them to improve their recognition rates for the dysarthric speech to which they were exposed (see section 7.2.1 of chapter 7 for a more detailed discussion).

2.5.11.2 Different Aspects of Intelligibility

As the foregoing definition implies, the communication process may be divided into three domains:

• Production: the psycho-linguistic and neuro-motor activities responsible for the formulation and encoding of the message; in the context of dysarthria assessment, an equally important factor is the articulatory behaviour responsible for the subsequent acoustic production of the oral message.

• Transmission: the effect which the carrier medium (or media) may have upon the speech signal; examples of such effects are environmental noise, reverberation and the acoustic modifications introduced by recording equipment (see section 2.2.1).

• Reception: the decoding competence of the recipient in relation to the incoming message; this competence is affected by:

  o biological factors, such as hearing ability;

  o the extent to which the speaker and listener share the same or similar socio-linguistic experiences, such as speaking style (accent) and dialect-specific lexicon.

The main focus of this study shall be the measurement of the production and reception aspects of intelligibility, both of which now merit closer inspection.

2.5.11.3 Intelligibility at the Source

Articulatory competence is, of course, merely one of the phases in the process of formulating and producing an oral message. As detailed in section 2.1 of this chapter, there are certain speech disorders – such as apraxia – which do not impair neuro-muscular control of the articulators, yet still make it difficult or impossible for those who suffer from them to produce intelligible speech even at the isolated word level. In such cases, reduced intelligibility arises from some disruption in the cognitive process, resulting in speech which is acoustically and phonetically well-formed but linguistically incoherent, a phenomenon referred to in popular parlance as "talking gibberish". Even though it is not uncommon for dysarthric individuals to also exhibit some signs of linguistic cognitive disruption, it is beyond the scope of this study to attempt to measure intelligibility at the semantic level. The intelligibility measurement systems proposed here are primarily designed to detect and quantify oral communicative competence in terms of acoustic abnormality, under the following assumptions:

i) the oral communication to be evaluated represents a semantically coherent message, i.e. if the dysarthric speaker were to transcribe his/her utterance, it would be perfectly intelligible.

ii) the speech sample for evaluation is "clean", i.e. the speech signal's acoustic quality has not been altered by the transmission medium in such a way as to adversely affect said speech sample's intrinsic intelligibility, the term "intrinsic" in this sense referring to the acoustic properties which the signal would exhibit if produced under standard atmospheric conditions and a 25dB SNR.

iii) the speaker being assessed is fully aware that he/she is under examination and is therefore making an optimal effort to speak clearly.

Prior to a more detailed review of communicative competence intelligibility tests, it is important to formulate a clear definition of the term, particularly as it relates to articulatory competence. These distinctions will now be considered.

2.5.11.4 Articulatory vs. Communicative Competence One of the basic premises of linguistics holds that the speakers of any given language will collectively have certain perceptions concerning what constitutes noticeable and meaningful differences between various types of speech sounds. When categorised according to such criteria, the term phoneme is used to refer to any speech sound which is considered perceptually distinctive (Clark et al., 2006). Phonemes are not only language specific (i.e. a speech sound or combination of speech sounds perceived as being distinctive in one language may not be accorded the same status in another language) but may also vary within the same language depending on such factors as geographical location and socio-cultural milieu. Native speakers of French, for example, often lack articulatory competence vis-à-vis the production of /ð/ (‘the’) while Anglophones tend to show similar incompetence when attempting to produce the French vowel /y/ (as in ‘rue’: ‘street’) which, unlike in English, is recognised by Francophones as distinct from /u/ (e.g. ‘roue’: ‘wheel’). The implication here, of course, is that it is extremely unlikely that any one speaker can exhibit articulatory competence in the production of all the phones (and their associated language-specific phonemic distinctions) for all languages; in practical terms, universal articulatory competence is an impossible proposition. The concept of articulatory competence, therefore, may be defined as the ability to recognise and produce the phonemes of one’s own native language.

The above definition of language-specific articulatory competence, however, presupposes that there is some set of criteria by which one language can be distinguished from another. Identifying such criteria becomes problematic when considering that the linguistic entities labelled as languages are, in themselves, not homogeneous in terms of morphology, phonology or lexicon; such integral components of language will vary


significantly within any given idiom depending on the influence of the following variables:

i) Socio-economic / Cultural grouping: Virtually all societies are stratified – according to some form of socio-economic filter – into distinct sub-groups, referred to as social classes. It is not uncommon for there to be noticeable differences in speaking styles and lexicon between such classes, even if their respective habitats are in close physical proximity (Wells, 1982).

ii) Geographical location: Intra-language differences are also observed between groups residing in different geographical regions, although there is often no systematic relationship between the actual distance separating communities and the extent of their linguistic dissimilarity, e.g. the dialect and accent of a group may bear more resemblance to that of another group living fifty kilometres distant than to a third group only twenty kilometres away (Wells, 1982; Allsopp and Allsopp, 1996).

iii) Temporal / Historical context: As well documented, languages and their associated linguistic conventions change over time. These modifications may be gradual (as in the case of the Great Vowel Shift in English) or precipitous, resulting from a relatively sudden-onset development such as an encounter with a foreign culture 41.

iv) Technological innovation: The advent of new devices and procedures, particularly in the domain of mass communications and information technology 42, often introduces new lexical terms and occasionally changes the meaning of existing ones (Toffler, 1976).

For any given language spoken by a sufficiently diverse and populous community, the influence of all of the above-cited factors usually results in the formation of distinct varieties of said language, particularly in terms of its oral expression. Each of these varieties, or dialects, may have evolved its own pronunciation conventions and phone-phoneme groupings. In most varieties of North American and Caribbean English, for example, the words "bear" and "beer" are homophones (pronounced in the same way); this is not the case, however, for the majority of British English dialects, where "bear" is rendered as /bε:/ while /biə/ is used when referring to the alcoholic beverage.

41 The Gallicisation of the English language resulting from William the Conqueror's 1066 invasion of England constitutes a good example of such precipitous linguistic change.

42 The influence and spread of Anglo-American language and culture on a worldwide scale has been facilitated by the advent of television and radio; this influence is reinforced by the fact that some of the most important news and mass entertainment networks are controlled by Anglo-American interests.


This difference in pronunciation conventions between dialects of the same language can result in the speakers of one dialect experiencing difficulties in readily comprehending speakers of another. Such difficulties are, of course, usually not due to any articulatory incompetence but to dialect-specific communicative incompetence: the speaker may be capable of emulating the speaking style of his or her intended audience but is not accustomed to doing so and – through force of habit – reverts to a more familiar speaking style and is therefore not readily understood. The foregoing implies that the concept of oral communicative competence does not only include language-specific articulatory competence but also encompasses such suprasegmental elements as prosody and inflection. Given that the FDA's intelligibility tests implicitly seek to evaluate communicative as opposed to articulatory competence, the review of intelligibility testing techniques presented in the following sections concentrates on those which also emphasise proficiency in communication.

2.5.11.5 Current Intelligibility Evaluation Methodologies

Although this study is primarily concerned with the evaluation of intelligibility in terms of communicative competence, it is nevertheless important – for a more general appreciation of the topic – to give some consideration to intelligibility assessment methodologies which focus on measuring the carrier medium's effects on the speech signal. These transmission channel-induced acoustic modifications are referred to as noise, a phenomenon which now deserves closer scrutiny.

2.5.11.5.1 Effects of Noise on Intelligibility

Due to evolutionary processes, the optimum transmission medium for human aural/oral communication is that which naturally occurs in the atmosphere at sea level. Significant deviations from these optimum conditions – such as altering the barometric pressure or composition of the atmospheric gases – usually result in some level of acoustic signal distortion and a resulting decrease in intelligibility. A more common source of speech signal degradation is noise contamination; the sources of such contamination include both natural and man-made elements.


The most commonly used metric for representing the level of speech signal contamination by noise is the signal-to-noise ratio (SNR), this being the ratio between the average energy components of the original speech waveform and that of the contaminating element 43. The noise contamination e(t) may be represented as:

e(t) = s(t) − s'(t)

(Eq: 2.5)

where s(t) represents the original speech signal at the instant t and s'(t) is the corresponding corrupted signal. The SNR is normally expressed in decibels (dB):

SNR(dB) = 10 log10 [ E(s) / E(e) ] = 10 log10 [ Σt s²(t) / Σt (s(t) − s'(t))² ]

(Eq: 2.6)

where E(s) and E(e) signify the speech and noise energy respectively (Holmes and Holmes, 2001). Various attempts have been made to fine-tune the SNR metric by placing emphasis on those energy regions in the speech signal which most influence perceptual assessments of speech quality. The perceptual speech-quality measure (PSQM) has emerged as one of the more recognised speech-decoding oriented metrics and has been standardised by the International Telecommunications Union (ITU). Another metric system designed specifically to measure the degrading effects of noise on speech intelligibility is the Articulation Index (AI), initially devised by French and Steinberg (1947) and subsequently refined by Fletcher and Galt (1950), which expresses intelligibility as a percentage (i.e. 0% signifies that the contaminating noise in the speech signal renders it totally unintelligible while an AI rating of 100% indicates that intelligibility has not been affected by any noise which may be present). Essentially, this index is calculated by measuring, for any given speech signal, the noise levels in the 200 – 6300 Hz frequency range. This range is subdivided into sixteen bands (also referred to as bins), each approximately a third of an octave wide 44; energy readings are then computed in dB for each band using the A-weighting method, the resultant readings are multiplied by the AI weightings, and the final percentage is based on averaging the sum of the sixteen values.
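For reference, Eq. 2.6 amounts to only a few lines of code; the sketch below is a direct transcription of that equation (the small epsilon guard against division by zero is an implementation convenience, not part of the definition, and the function name is an assumption).

import numpy as np

def snr_db(clean, corrupted):
    """Global SNR (Eq. 2.6): energy of the clean speech over the energy of the noise."""
    noise = clean - corrupted                      # e(t) = s(t) - s'(t)  (Eq. 2.5)
    return 10.0 * np.log10(np.sum(clean ** 2) / (np.sum(noise ** 2) + 1e-12))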

43 This contaminating element could be another speech signal or, if the physical environment is conducive to echo/reverberation, the same speech signal convolved with itself.

44 See section 6.3.1 of chapter 6 for a more in-depth discussion on the acoustic definition of an "octave".


The Speech Intelligibility Index (SII) is a refinement of the AI and considered by some researchers to be its replacement. The SII, adopted as an ANSI standard in 1997, is more flexible than the AI in the configuration of its input parameters; unlike the AI, which stipulates that the 200 – 6300 Hz frequency range should be divided into sixteen bands (as detailed in the preceding paragraph), the SII permits some degree of variability in bandwidth granularity, ranging from 6 to 21 bands. Additionally, the relative influence – or frequency importance function as it is known in the SII context – which is assigned to the energy present in the signal's component frequency bands is also user-definable to some extent. Such parameter input flexibility, although useful in compensating for diverse environmental conditions, does have the unfortunate result of rendering SII measurements non-uniform: an SII measurement based on a six-band full-octave frequency segmentation cannot be directly compared to another with a finer segmentation granularity (e.g. sixteen segments with a one-third octave bandwidth).

Compared to the AI and SII, the Speech Transmission Index (STI, Steeneken and Houtgast, 1980) adopts a more focused approach to spectral energy analysis, giving greater weighting to the noise levels in the higher frequency ranges which are most likely to cause loss of consonant definition. A closely related evaluation protocol is the measured percentage of Articulation Loss of Consonants (%ALcons). A speech sample featuring five %ALcons – or a 5% loss of consonant definition – is considered to be of good intelligibility; conversely, readings between 10 and 15 %ALcons represent a substantial loss of speech information while any measurement in excess of 15 %ALcons is indicative of unacceptable levels of signal degradation.

Despite their respective levels of effectiveness in terms of detecting and describing acoustic conditions which obscure speech clarity, all of the above-cited intelligibility measures are – in the strictest sense – not actually indicators of the ease with which the listener can extract meaning from an oral message. If, for example, a dysarthric utterance (or one in a foreign language) were produced in optimal environmental acoustic conditions, the AI/SII or STI metric systems would return readings not commensurate with the effort needed by the naïve listener to decipher it. These metric systems are predicated on the implicit assumption, of course, that – from the listener's perspective – the speech to be evaluated is intrinsically intelligible, that the speaker's idiom and accent are familiar and that said speaker is not handicapped by any articulatory incompetence which


could obscure the meaning of any verbal message he or she produces. Such systems are, obviously, not suitable for the intrinsic type of intelligibility evaluation which the FDA requires. A review of systems designed to measure this intrinsic type of intelligibility is presented in the section that follows.

2.5.11.5.2 Intrinsic Intelligibility Evaluation Systems

Any intelligibility assessment procedure which purports to measure the intrinsic intelligibility of an utterance must incorporate some model of the listener's decoding behaviour, i.e. how the intended recipient of the message undertakes the task of deciphering what was said. Up to the time of writing, all objective evaluation systems which try to emulate such listener reaction employ some form of speech recognition technology, the premise being that a recognition engine trained on normal speech would – when processing abnormal speech – return an objective score correlating with the effort required by human listeners to decipher such disordered speech.

The earliest of these ASR-based systems would appear to be Sy and Horowitz's (1993) attempt to model a naïve listener's response using dynamic time warping (DTW) analysis. The relatively poor correlation they observed (only 0.53) between subjective and objective scoring for a dysarthric isolated word corpus may have resulted from the limitations of the DTW technology they employed at the time. Both Menéndez-Pidal et al. (1997) and Carmichael and Green (2003) report improved results using HMM statistical techniques, the former using phoneme-level analysis while the latter opted for word-level modelling. It must be noted, however, that Menéndez-Pidal et al.'s excellent correlation between perceptual and objective dysarthric speech assessment (0.94) may have been partly due to their listener group's explicit a priori knowledge of the items in the test set vocabulary. The listener group in Carmichael and Green's study was not biased with such a priori knowledge and it is therefore likely that their error patterns were more naturalistic. Vijayalakshmi et al. (2006) also employed phone-level analysis. However, their primary objective was the evaluation of the test subject's articulatory competence in general rather than intelligibility assessment in particular.
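The common thread in these ASR-based approaches can be summarised as follows: obtain the recogniser's goodness-of-fit (log-likelihood) score for the utterance under test and express it relative to the scores that normal speakers obtain for the same item. The sketch below assumes such per-utterance scores are already available from an HMM decoder; the z-score normalisation shown is an illustration of the general idea rather than a description of any of the cited systems, and the numerical values in the example are invented.

import numpy as np

def normalised_gof(patient_score, normal_scores):
    """Express a patient's per-utterance log-likelihood relative to a normal-speaker reference set."""
    mu, sigma = np.mean(normal_scores), np.std(normal_scores)
    return (patient_score - mu) / (sigma + 1e-12)   # strongly negative => far from the normal fit

# e.g. hypothetical per-frame-averaged log-likelihoods for one vocabulary item
print(normalised_gof(-72.4, [-61.0, -59.3, -63.8, -60.1, -62.5]))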


Among the various differences in parameter configuration for the abovementioned systems, the most important distinguishing feature is the granularity of their speech analysis: it would appear that phone-level analysis provides a perspective on intelligibility rating which can be substantially different to word or utterance level evaluation. The merits and demerits of each approach will be considered in greater detail in chapter 7 of this thesis. At this juncture, however, a review of techniques used to determine a definitive diagnosis of dysarthria sub-type is of more immediate interest.

2.6 Formulating the Overall Diagnosis

The ultimate objective of the FDA evaluation exercise is, of course, to identify the specific dysarthria sub-type manifested by the person being examined. In this regard, the FDA would appear to be the sole discriminant dysarthria assessment procedure, that is, it explicitly seeks to identify the particular disease sub-type indicated by the symptoms displayed. The RDP testing procedure's objective has a somewhat different emphasis, the intention being to highlight / diagnose the patient's individual articulatory weaknesses (e.g., velopharyngeal incompetence or poor voice quality). The RDP is not so concerned with identifying any overall patterns in the symptoms that would suggest, for example, a dysarthric state caused by spasticity in the articulators as opposed to flaccidity.

Due to this orientation towards differentiation between sub-types, a key component of the FDA overall diagnostic protocol is template matching, or the formulation of a diagnostic hypothesis based on comparing the patterns in a given FDA grade profile with the patterns of other grade profiles for which the diagnosis has been confirmed. These grade profiles are presented in the form of five specialised FDA bar chart profiles showing the average scores and the associated standard deviations for the five dysarthric sub-groups, such as those appearing in Figures 2.5 and 2.6.
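As a rough illustration of this deviance-from-template idea (and not the scoring procedure prescribed by the FDA manual), the sketch below ranks candidate sub-types by the average absolute z-score between a numerically coded grade profile and each sub-type's mean/standard-deviation template; the template values shown are invented.

```python
# Minimal sketch of profile template matching, assuming letter grades have
# already been mapped to numbers (e.g. A=8 ... E=0).  The per-sub-type mean
# and standard-deviation vectors are placeholders, not Enderby's published values.

def profile_distance(patient, means, sds):
    """Average absolute z-score between a patient profile and one template."""
    zs = [abs(p - m) / s for p, m, s in zip(patient, means, sds)]
    return sum(zs) / len(zs)

def rank_subtypes(patient, templates):
    """Return sub-types ordered from best to worst template match."""
    scored = {name: profile_distance(patient, t["mean"], t["sd"])
              for name, t in templates.items()}
    return sorted(scored.items(), key=lambda kv: kv[1])

templates = {                      # illustrative two-task templates only
    "Spastic": {"mean": [3.0, 4.5], "sd": [1.2, 1.0]},
    "Flaccid": {"mean": [6.0, 2.0], "sd": [1.5, 1.1]},
}
print(rank_subtypes([4.0, 4.0], templates))
```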

Accompanying these mean score – standard deviation profile charts are explanatory notes highlighting certain distinguishing characteristics of the various dysarthria categories. The test administrator is reminded, for example, that Lower Motor Neuron dysarthrics have a tendency to score relatively well on the intelligibility tests but usually do not exhibit velopharyngeal incompetence.

These guidelines, in conjunction with the templates, serve as the principal instructions for formulating a definitive overall dysarthria sub-type diagnosis. These criteria are adequate for cases where the patient's FDA profile is a clear match to a particular template; in the event, however, that a patient's profile and symptoms are somewhat ambiguous and could match two or more templates, there are no clear rules or recommendations advising how to decide which is the most likely contender, or even how to conclude that the submitted grade profile is so anomalous that it corresponds to no template at all. Some efforts were made to offer a solution to this ambiguity problem by way of a computer program (Roworth, 1990) which implemented the discriminant analysis algorithms used by Enderby (1986) to identify distinctive profile grade patterns associated with particular dysarthric categories. The success – or lack thereof – of this software application is discussed in the section that follows.

2.6.1 Automating the Diagnostic Procedure

Roworth's FDA diagnosis software implementation (1990), known as the Frenchay Dysarthria Test (FDT), represents the first attempt to automate the dysarthria diagnostic process. The FDT application – which uses Enderby's linear discriminant analysis (1986) – takes as its input the patient's FDA grade profile and returns the dysarthria sub-type which best corresponds to said profile. When tested on the FDA diagnostic data from the 107 confirmed cases in Enderby's 1986 study (see section 2.4), the FDT's overall classification accuracy was found to be 76%, although this accuracy was not uniform across the five dysarthria categories [45].

[45] A more detailed analysis and discussion of the FDT's classification is presented in chapter 8 of this thesis.
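For readers unfamiliar with the technique, the fragment below sketches the kind of linear discriminant classification the FDT performs, implemented here with scikit-learn rather than Roworth's original code; the grade profiles, diagnostic labels and the numeric coding of grades are illustrative assumptions only.

```python
# Sketch of linear discriminant classification of FDA grade profiles.
# All profiles and labels below are invented placeholders; this is not
# Roworth's implementation or Enderby's published discriminant functions.
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Toy training set: three-task grade profiles (numeric, e.g. A=8 ... E=0)
# paired with independently confirmed diagnoses.
X = np.array([[7, 6, 5], [6, 6, 5],     # Ataxic examples
              [2, 4, 2], [2, 3, 2],     # Spastic examples
              [6, 2, 7], [6, 2, 6]])    # Flaccid examples
y = np.array(["Ataxic", "Ataxic", "Spastic", "Spastic", "Flaccid", "Flaccid"])

clf = LinearDiscriminantAnalysis().fit(X, y)
new_profile = np.array([[6, 5, 5]])
print(clf.predict(new_profile))         # single best category, as the FDT returns
print(clf.predict_proba(new_profile))   # per-class posteriors: the n-best view the FDT lacks
```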

Unfortunately, the FDT has not enjoyed the same widespread usage as the paper-based FDA which it supports. This lack of acceptance may well be due to the application's minimalist interface and non-intuitive data formatting requirements: the program requires, for example, that all profile scores be entered as a series of numbers rather than the letter grades to which therapists are accustomed when using the paper-based FDA. In the event that the user does attempt to enter letter grades into the program, there are no error-trapping mechanisms or online help facilities to specify the correct format for data input or how to interpret the program's various information visualisation displays and metrics. When proposing a diagnosis, for example, the FDT outputs both the diagnostic category and some form of confidence measure score, but no explanation is offered concerning the significance of this score (e.g., whether the 0.783 rating assigned to the FDA profile in Figure 2.17 represents a high degree of confidence or otherwise). Moreover, the FDT does not incorporate any data archiving functionality: it is not possible to save a patient's grade profile fed into the program or any resultant diagnosis; this lack of storage or file-writing capability is all the more crippling given that the application's DOS-based printing functionality is not supported on 32-bit Microsoft Windows operating systems.

Figure 2.17: Screen Shots of FDT (pre- and post- diagnostic data input)

Further to the abovementioned user interface and data representation inadequacies, the FDT regrettably persists with the "sole candidate" diagnosis hypothesis flaw mentioned earlier: in cases – such as the one in Figure 2.17 – where the patient's grade profile could well correspond to more than one dysarthria sub-type [46], the FDT will nevertheless return only the most likely category, even if that category outranks the second-most probable hypothesis by a suspiciously narrow margin. An improvement on this information-restrictive procedure, which has been proposed and implemented in the CFDA, is discussed in chapter 8 of this study. The next section of this particular chapter, however, is devoted to the review rather than elaboration of the issues debated so far.

[46] The FDA profile presented in Figure 2.5 – taken from the patient group used in Enderby's study (1986) – is that of an individual confirmed as suffering from severe Ataxic Dysarthria, and not Spastic dysarthria as diagnosed by the FDT program. Of course, it would be interesting to discover what the FDT's n-best list of dysarthria categories would have been for this particular case.

2.7 Chapter Summary

Although summaries are not necessary for every chapter in this study, the sheer volume and scope of information presented in this particular chapter merit an overview of the issues considered. The following paragraphs constitute such an overview, summarising the measurement techniques considered and the various pathological phenomena they attempt to quantify.

Dysarthria is a category of diseases of which, according to Enderby’s classification (1980), there are five sub-types: (i) Ataxic (ii) Extrapyramidal (iii) Flaccid (iv) Mixed and (v) Spastic. The two most successful and well-established dysarthria diagnostic procedures are the Robertson Dysarthria Profile (RDP, 1982) and the Frenchay Dysarthria Assessment (FDA, 1983), both of which incorporate a series of audio-visually assessed speech-oriented tasks designed to expose any articulatory insufficiencies the test subject may have. In the case of the FDA, the performance of all 28 sub-tests will result in the generation of a bar chart profile which can be used to identify the specific dysarthria sub-category apparently manifested. Based on a study of 85 individuals whose various dysarthria diagnoses were independently verified, Enderby (1986) validated the FDA diagnostic procedure by employing linear discriminant analysis to confirm that different dysarthria sub-types evoke substantially different pathological behaviours, which in turn produce distinctive bar chart profile patterns.

Before such bar chart profiles can be generated, however, it is necessary to evaluate the performance of the patient during the execution of the various sub-tasks. In most cases, the paper-based FDA’s assessment protocols for these tests include a combination of objective and subjective criteria, which may be facilitated by a range of analog and digital measuring devices such as manometers and laryngographs. The advantages and shortcomings of perceptual judgements are considered and it is demonstrated that objective metric systems engender a more robust diagnosis, especially in cases where there are inconsistencies and ambiguities in the FDA evaluation criteria.

Objective measurements in themselves, however, are not always meaningful unless they can be standardised in some fashion. Unfortunately, there seem to be no standardised methods or metric units to describe certain articulatory pathological phenomena; this is particularly evident with regard to voice quality and intelligibility assessment. For those sub-tasks suitable for the introduction of acoustic objective measurement systems, there are a number of possible speech technology solutions, and these can be divided into two categories: (i) audio signal processing techniques such as pitch extraction and signal-to-noise ratio estimation, and (ii) ASR-based speech intelligibility evaluation, which seeks to model the effort needed by the listener to decipher the oral communication.

The following five chapters present the solutions proposed by this researcher to these still unresolved issues of dysarthria symptom measurement and diagnosis.


Chapter 3: Protocols for Equipment Standardisation, Testing Corpora and Signal Processing

As specified in the previous chapter, the aim of this research is not purely theoretical; it is intended that the software solutions devised and implemented for automating the FDA evaluation be made available to clinicians as a practical tool. This consideration imposes certain practical limitations on the types of hardware equipment that can be utilised. It would not be feasible, for example, to require the use of the laryngograph, since this apparatus is costly (over £5000) [47] and may therefore not be a realistic acquisition for health workers in public sector health care organisations. Consultation with various speech therapist interest groups in the United Kingdom's Yorkshire and Lancashire counties has made it apparent that clinicians working under the financial constraints typical of the National Health Service would find it difficult to justify the purchase of a computerised speech diagnostic application requiring non-standard high-cost peripherals. Given such restrictions, the computerised application developed during the course of this investigation has been designed so that the input of a head-mounted or desktop microphone constitutes its minimum peripheral equipment requirement. Provision is also made for the archiving of digital photographs and video clips – for the purposes of documenting and measuring facial symmetry – but such functionality is not essential for the FDA diagnostic exercise [48] and, in any event, the cost of the hardware needed for such operations is not prohibitive [49].

[47] This cost is inclusive of the proprietary software necessary for system operation, as detailed on the company's web site at http://www.laryngograph.com. The specific product that would be most convenient and useful for speech therapists – the Speech Studio – is described on the web page http://www.laryngograph.com/pr_studio.htm [Last visited: June 30, 2007].

[48] In fact, the number of FDA tasks which are primarily visually assessed (15) outnumbers those that rely mainly on auditory assessment (12). It is not within the scope of this project, however, to attempt the implementation of any software applications incorporating some form of computer vision. Such computer vision solutions could, however, be the focus of further research, and the possibilities are more thoroughly discussed in the final chapter of this thesis.

[49] At the time of writing, good quality personal computer digital camera peripherals (popularly known as "web cams") typically cost around £50.

3.1 Validating Experimental Results: Adopting the Expert System Paradigm

Due to the scarcity of available dysarthric speech data suitable for testing and validating the computerised FDA application (CFDA), it was not always possible to divide said data into separate training and testing sets for the development and assessment of pattern recognition algorithms. Such separation of training and testing data corpora constitutes the usual technique for developing computerised applications which include some type of supervised or unsupervised machine learning (ML) protocols (Bishop, 1996). In the case of supervised ML, the expected classification output for the training data is pre-specified so that – in a typical training scenario – an ML classifier will attempt to optimise its training data pattern recognition analysis in such a manner as to produce a categorisation result which aligns as closely as possible with the expected outcome. Moreover, in the case of multi-class categorisation, ML classifiers will attempt to identify those patterns in the supplied numerical information (usually referred to as feature vectors) which best distinguish one data class from another according to some pre-supplied ground truth classification criteria.

In the case of unsupervised machine learning, the classifier is not provided with any ground truth classification criteria; thus the final outcome of the training process is not as readily predictable as for supervised learning. Despite this significant difference between supervised and unsupervised ML training protocols, both processes optimise pattern recognition performance in a manner which is essentially dissimilar to subjective discrimination: the mechanisms underpinning automatic speech recognition, for example, bear little resemblance to those that inform human speech recognition (Lippmann, 1997). Moreover, it is normally the case that ML techniques are data-intensive (Bishop, 2006), a requirement which is not compatible with data corpora of limited size [50], such as the dysarthric speech corpus that has been collected for this project. Given this data scarcity problem and the resultant implications for ML algorithm training and testing, the decision was made to implement discriminant analysis techniques which are rule-based rather than ML-based, i.e. they attempt to emulate the analytical behaviour of a human expert by applying the same problem-solving rules that the human would.

[50] There are ML classification techniques, most notable among them Support Vector Machines (SVM, Vapnik, 1999), which have been developed to be specifically tolerant of data sparsity. Indeed, Wan and Carmichael (2005) demonstrated their potential on a dysarthric speech recognition task. It is still important, however, that the data should be representative of the target group even if limited. Given that the dysarthric speech data collected for this study originate from just five individuals (and not all of them exhibit all the symptoms to be classified), the use of SVM classification techniques would not have been appropriate.

In effect, this problem-solving via the application of some computerised model that mimics expert subjective analysis – such models being known as expert systems – is a well-established alternative to machine learning techniques (Jackson, 1998) and has been successfully applied in such varied domains as prediction of stock market trends, meteorology and, of course, medical diagnostics. The deployment of an expert system is particularly advantageous when it is possible to identify a relatively limited set of rules used by the skilled human practitioner, so as to clearly define a procedure for arriving at a definitive conclusion or, at the minimum, a viable hypothesis. Moreover, expert systems do not require any data-hungry 'learning' processes, since the subjective knowledge that informs their decision-making is derived from an extrinsic source that is already 'trained': the collective experience of human experts whose problem-solving techniques have been repeatedly tested and proven under real-world conditions.

Of course, there are certain drawbacks to the expert system analytical approach, chief among which is the inability of these rule-based classifiers to optimise their own decision-making processes if they return below-optimum performances (Jackson, 1998). This inherent inflexibility of the expert system architecture is due to the fact that such systems are, essentially, a collection of inter-linked rules which cannot be readily altered owing to their interdependency (Jackson, 1998). It is also important that the encoding of the human expert's knowledge be as comprehensive and thorough a process as possible. Care must be taken by those responsible for capturing and encoding the expert's knowledge to explore all possible scenarios in order to extract all relevant information; this process should be pre-planned and systematic, since the expert may – through simple oversight – neglect to mention some pertinent problem-solving technique while being interviewed by the knowledge encoder. It is often useful, therefore, to test newly developed expert systems by exposing them to a mixture of typical and atypical (but valid) data in order to lessen the possibility of the expert system failing when it encounters unusual cases.
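The flavour of such a rule base is illustrated by the toy fragment below; the two rules shown are simplified paraphrases of the kind of guideline quoted earlier (e.g. the Lower Motor Neuron note in section 2.6) and are not the clinical rules actually encoded in the CFDA.

```python
# Toy fragment of a rule-based diagnostic step written in the expert-system style
# described above.  Grades are assumed to be on an 8 (A) ... 0 (E) point scale;
# the rules themselves are simplified illustrations, not the CFDA's knowledge base.

def diagnose(profile):
    """profile: dict mapping task name -> numeric grade (8 = A ... 0 = E)."""
    hypotheses = []
    # Illustrative rule 1: relatively good intelligibility combined with no sign
    # of velopharyngeal incompetence points towards a Lower Motor Neuron pattern.
    if profile.get("intelligibility", 0) >= 6 and profile.get("palate", 0) >= 6:
        hypotheses.append(("Lower Motor Neuron",
                           "intelligibility and palate function relatively preserved"))
    # Illustrative rule 2: uniformly poor grades across every sub-test suggest a
    # severe mixed presentation rather than a single localised deficit.
    if all(grade <= 2 for grade in profile.values()):
        hypotheses.append(("Mixed", "severe impairment across all sub-tests"))
    return hypotheses or [("Undetermined", "no rule fired; refer to clinician")]

print(diagnose({"intelligibility": 7, "palate": 6, "respiration": 4}))
```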

In the context of FDA diagnostic analysis, the adoption of an expert system approach would signify the retention, to some degree, of the 'letter grade' assessment methodology which is currently used to determine the severity of manifested symptoms. Essentially, these letter grades will be mapped to an ordinal scale, as detailed in section 3.4 of this chapter. In terms of assessing the diagnostic output of the expert systems implemented in the CFDA, a comparative evaluation protocol has been devised whereby the CFDA's output is compared with that of a group of three highly experienced speech therapists – hereafter referred to as the expert panel – at three levels:

(i) the description of low-level pathological acoustic features, if such are present, in speech data presented for evaluation. For example, in a given attempt at performing the Test 5 'Respiration in Speech' task, the judges in the expert panel may come to the conclusion that three inhalations were heard during the execution of the task. This low-level diagnostic description is then compared to the output of the CFDA.

(ii) the holistic assessment for a specific FDA task: the analysis of low-level features as described in the previous paragraph is meant to facilitate the award of a letter grade for said task (see section 2.3 of chapter 2). The various grades awarded by the expert panel members are compared with each other and also with the grade output by the CFDA application. An inter-judge correlation is then computed, along with the correlation of the CFDA grade with those of the judges.

(iii) the definitive diagnosis of the dysarthria sub-type: as detailed in the previous chapter, the patient's performance in the various FDA task categories plays a pivotal role in identifying and distinguishing the various dysarthria sub-types. Once all the relevant FDA diagnostic data for a patient have been collected (this information being the 28 letter grades awarded by the test administrator along with any additional comments regarding the quality of symptoms manifested), both the paper-based and computerised versions of the FDA incorporate a series of rule-based procedures to assist the test administrator in formulating a diagnosis concerning the dysarthria sub-type apparently displayed by the individual under examination. For a selected sample of FDA test results, a comparison will be made of the dysarthria sub-type diagnostic hypotheses produced by the CFDA and those produced by members of the expert panel. In this instance, the correlation between the CFDA's diagnoses and those of the expert panel will be computed based on criteria which are detailed in chapter 8.

The adoption of the expert system paradigm as described above makes it possible to overcome the limitations of data scarcity by rendering unnecessary the reservation of data exclusively for training purposes. It therefore becomes possible to use all available data


as a valid test set. For the given data set, the assessments rendered by the experts – provided they concur – will be considered as the ‘ground truth’ against which the computerised FDA’s objective measurements will be compared and validated in terms of strength of correlation. In order to give the reader some appreciation of the diagnostic accuracy of the computerised measurements, every chapter which discusses the various implemented objective metric systems will include a table, such as Table 3.1, which presents both human and machine diagnostic descriptions for a selection of dysarthric speech data.

Table 3.1: Example Table Presenting Subjective and Automatic Evaluations for Selected Speech Data Samples

Test Category: Task 4 – Respiration at Rest

Sample No. | Assessor ID | Grade | Observations/Measurements
1 | 1       | E+ | 3.8 sec …
1 | 2       | E  |
1 | 3       | C  |
1 | FDA App | D  |

[…]

… 1.5 times that of the peaks representing consonant-vowel transition. 6 (B): 7-9 plosions, with acoustic characteristics as described for the 'A' grade. 4 (C): 4-6 plosions, with acoustic characteristics as described for the 'A' grade. 2 (D): 3-5 plosions, with acoustic characteristics as described for the 'A' grade. 0 (E): …

… 30 ms and a 2:1 HLF ratio; thus a penalty point is incurred for every instance of such long-duration high-frequency noise immediately preceding a voiced segment which matches this HLF footprint. It would appear that the adoption of these parameters as evaluation criteria has resulted in a strong correlation between automated and perceptual assessments of attempts at lip seal utterances made by the dysarthric speakers under review. Table 5.2 presents the evaluations of the expert panel and the CFDA application for a selection (five samples) of SUFDAR lip seal test data.


Table 5.2: Expert Listener and Automatic Evaluations for Selected Speech Data Samples for Task 8

Test Category: Task 8 – Lip Seal

Sample No. | Assessor ID | Grade | Observations/Measurements
1 | 1    | C  | Plosion sounds weak due to nasal escape
1 | 2    | D+ | Nasal escape, lack of velar pharyngeal competence
1 | 3    | D+ | No Comment
1 | CFDA | B  | 9 plosions meeting required criteria. Air escape noted but not immediately preceding vowel onset, therefore not bilabial. Recognition result = 7 out of 10 attempts recognised as "puh".
2 | 1    | A  | Normal
2 | 2    | A  | No abnormality detected
2 | 3    | A  | No Comment
2 | CFDA | A  | All HLF ratios less than 1.4:1. Amplitude spikes > 1.5 times that of surrounding peaks noted for all plosions. 10 plosions meeting required criteria. Recognition result = 9 out of 10 attempts recognised as "puh".
3 | 1    | A  |
3 | 2    | C  | Lip seal achieved but voicing issue: voiced velar plosive
3 | 3    | B  | No Comment
3 | CFDA | D  | 5 instances of amplitude spikes > 1.5 above surrounding peaks detected. Post-plosive F0 detected in only 3 out of 19 attempts. Recognition result = 2 out of 19 attempts recognised as "puh".
4 | 1    | D+ | Can hear lip seal/plosion but very weak as no voice used
4 | 2    | E  | Sound of some lip closure but not possible to discern any auditory characteristics
4 | 3    | D  | No Comment
4 | CFDA | E  | No plosions detected. Recognition result = 0 out of 10 results interpreted. Log likelihood scores all less than -35.
5 | 1    | C+ | Lip seal variable, some weak, some strong, some plosive present but some nasal emission
5 | 2    | D  | Initial nasal rather than labial, some evidence of /p/ in middle segment but inconsistent. Nasal escape noted
5 | 3    | E+ | No Comment
5 | CFDA | E  | 2 weak plosions detected (8 dB above ambient noise), 4 instances of frication-type noise detected. Recognition result = 0 out of 10 results interpreted. Log likelihood scores all less than -24.

Sound of some lip closure but not possible to discern any auditory characteristics No Comment No plosions detected. Recognition result = 0 out of 10 results interpreted. Log likelihood scores all less than -35 Lip seal variable, some weak, some strong, some plosive present but some nasal emission Initial nasal rather than labial, some evidence of /p/ in middle segment but inconsistent. Nasal escape noted No Comment 2 weak plosions detected (8 dB above ambient noise), 4 instances of frication-type noise detected. Recognition result = 0 out of 10 results interpreted. Log likelihood scores all less than -24


For this particular task, the overall inter-judge correlation is 0.73. The correlations between the CFDA application's assessment and those of the respective expert panel judges are as follows: (i) 0.80 for judge 1, (ii) 0.73 for judge 2 and (iii) 0.87 for judge 3. A similar level of objective-subjective agreement is also noted for the evaluation of the DDK velar-palatal task, as discussed in the section immediately following.
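For transparency, the sketch below shows how agreement figures of this kind can be computed once letter grades are mapped onto the point scale used elsewhere in this chapter (A = 8 … E = 0, with a '+' grade assumed here to add one point). A rank correlation such as Spearman's is used purely for illustration, and the grade lists are the five Task 8 samples from Table 5.2.

```python
# Illustrative agreement computation between one judge and the CFDA.
# Whether the thesis uses Spearman's rho or another coefficient is not
# specified here; the choice below is an assumption for demonstration.
from scipy.stats import spearmanr

GRADE = {"E": 0, "E+": 1, "D": 2, "D+": 3, "C": 4, "C+": 5, "B": 6, "B+": 7, "A": 8}

judge_1 = [GRADE[g] for g in ["C", "A", "A", "D+", "C+"]]   # Table 5.2, judge 1
cfda    = [GRADE[g] for g in ["B", "A", "D", "E", "E"]]     # Table 5.2, CFDA

rho, _ = spearmanr(judge_1, cfda)
print(f"Judge 1 vs CFDA (five lip seal samples): rho = {rho:.2f}")
```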

5.3 Evaluation of Diadochokinesis

This diadochokinesis (DDK) test [66] – by requiring the oral production of multiple evenly-spaced repetitions of the phone sequence /ka la/ – evaluates the speaker's ability to move the tongue rhythmically and rapidly from velar to frontal palatal positions, a task which will usually reveal any velopharyngeal incompetence that may be present. The assessment criteria emphasise speed and accuracy of articulatory placement:

Grade A: No difficulty observed. 'ka-la' utterances produced with accurate articulatory placements, 10 segments produced in five seconds or less, segments produced with regular spacing.
Grade B: Some difficulty observed – slight incoordination, slightly slow; task takes 5 to 7 seconds to complete.
Grade C: One sound well articulated, other poorly presented, or task deteriorates; task takes up to 10 seconds to complete.
Grade D: Tongue changes in position, different sounds can be identified.
Grade E: No change in tongue position.

[66] See chapter 2, section 2.5.5 for a more detailed discussion of diadochokinesis and its diagnostic significance.

Apart from the expected items (i.e. "ka" and "la"), the recogniser vocabulary for this task also included the most probable mispronunciations, namely "ga", "ra", "da", "va", "ja" and "sha"; with these configurations, the ASR implementation for this particular task achieved an overall categorisation correctness of 85.7% for classification of the SUFDAR DDK test utterances. The most frequently occurring ASR error was the confusion of "ga" with "ka", a misrecognition which can be tolerated since it has no particular diagnostic significance. Unfortunately, the second most frequent ASR error – the confusion of "ra" with "la" – is of greater import in terms of determining the speaker's DDK competence, because it implies that the recogniser is occasionally incapable of confirming that the speaker's tongue has made contact with the palate (which would normally be the case if the sound "la" is produced but not "ra"). Of course, it must be emphasised that the recogniser's classification is merely an initial evaluation which is independent of the signal processing techniques that have been implemented to detect any signs of pathology. These techniques will now be considered in greater detail.

Table 5.3: Expert Listener and Automatic Evaluations for Selected Speech Data Samples for Test 24: Alternating Tongue Movements (DDK)

Test Category: Task 24 – DDK Assessment

Sample No. | Assessor ID | Grade | Observations/Measurements
1 | 1    | D+ | Movement from back to front /l/ distorted. /k/ fricated but some stop sounds managed.
1 | 2    | E+ | Variable differentiation of velar plosive, less clarity on liquids. 9 segments in 11 seconds.
1 | 3    | E  | No Comment
1 | CFDA | E  | 9 segments in 10.6 s, none recognisable as /ka/ /la/.
2 | 1    | A  | Sounds normal
2 | 2    | A  | 10 segments in 4 seconds
2 | 3    | B+ | No Comment
2 | CFDA | A  | 10 segments in 5.1 s, all recognisable as /ka/ and /la/.
3 | 1    | A  | Normal
3 | 2    | B  | Voiced velar plosive not voiceless, speed ok
3 | 3    | A  | No Comment
3 | CFDA | B  | 10 segments in 4.8 seconds, 3 /ka/ segments recognised as /ga/.
4 | 1    | C  | Accurate /k/, distorted /l/
4 | 2    | D  | Voiceless velar ok, irregular and late detection of second segment
4 | 3    | D+ | No comment
4 | CFDA | D  | 7 segments in 12.3 s, /la/ recognised as /ra/.
5 | 1    | C  | Good /k/, /l/ slightly distorted, slow and deteriorating in regularity between spacing (getting slower)
5 | 2    | D+ | Slow alteration between placements, change in second vowel. Velar segment variable.
5 | 3    | D+ | No Comment
5 | CFDA | C+ | 9 segments in 6.9 seconds, 3 /ka/ segments recognised as /ga/.


5.3.1 Quantifying DDK Incompetence

As made clear from this test's evaluation instructions, speed and correctness of articulatory placement are the two principal assessment criteria; explicit task execution times are specified that designate various levels of competence. It has also been discovered, however, that there is another criterion, not explicitly stated in the above-cited FDA instructions, which expert assessors consider important when judging a speaker's competence in this task: this indicator of competence is the rhythm and spacing of the /ka/ /la/ segments. A review of the SUFDAR DDK utterances rated as "A" grade performances, compared with others scoring "C" or lower, reveals that the former exhibit an evenly balanced intra-segmental duration and intensity ratio, i.e. any given /ka/ and /la/ pair will be approximately the same length and volume. Failure to maintain this intra-segmental duration-power parity almost invariably resulted in varying degrees of penalisation by the expert assessors. Based on the grades and comments of the expert panel, /ka la/ segments exhibiting an inter-syllable durational disparity exceeding 0.5 are considered unbalanced [67]. The contribution of this disparity metric to the automated DDK assessment is detailed in the section that follows.

[67] This inter-syllable duration disparity metric signifies the difference in syllable-length ratio between the /ka/ and /la/ sub-segments; a 0.5 disparity, for example, would indicate that the /ka/ syllable is more than one and a half times longer than the /la/ syllable, or vice versa.

5.3.2 Task 24 Criteria Weighting and Correlation with Expert Panel

The quantification of the DDK task criteria has proven to be non-trivial, since there are three equally important aspects pertaining to the execution of this particular test: the articulatory movements necessary to produce /k/, those necessary to produce /l/, and the completion of said movements within a certain time limit. Unfortunately, the evaluation rubric for this task does not consistently specify quantifiable output for the award of the various letter grades; the instructions for the award of the "D" and "E" grades make no mention of any time limit or minimum number of correct pronunciations. In light of this absence of objective criteria and the resultant necessity to experiment with various weighting co-efficients to simulate human expert assessment, it emerged that the protocol best approximating expert subjective evaluation is one which – after the initial award of the default maximum of eight points – deducts a half-point for every instance of a malformed /ka/ or /la/. A half-point penalty is also incurred for every observation of an inter-syllable duration disparity exceeding 0.5. Furthermore, in the event that the test subject takes more than the stipulated five seconds to complete the ten /ka la/ segments, a full point is deducted for every extra second of delay. With a rating of 0.88, inter-judge correlation for DDK assessment is superior to that of the lip seal test. Unfortunately, the CFDA's concordance with said expert panel is not as high, with only 0.67 for the first judge, 0.70 for the second and 0.68 for the third. As mentioned in section 5.3, this disparity is due, in part, to the recogniser's higher error rate regarding classification of /la/ utterances. Table 5.3 presents a selection of expert judge and CFDA assessments for the five dysarthric speakers whose data appear in the corpus. The following section considers the introduction of objective measures for Task 15 (Maintenance of Palate), the implementation of which – in terms of achieving high correlation with the expert assessors' judgements – has proven the most challenging of the three testing procedures discussed in this chapter.
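A minimal sketch of the Task 24 weighting protocol just described appears below: the default maximum of eight points, half-point deductions for malformed syllables and for duration-disparity violations, and a one-point deduction per extra second beyond the stipulated five (interpreted pro rata here, which is an assumption, as are the invented segment durations).

```python
# Sketch of the Task 24 weighting protocol described above.  Segment data and
# the pro-rata handling of the per-second penalty are illustrative assumptions.

def duration_disparity(dur_ka: float, dur_la: float) -> float:
    """Difference in syllable-length ratio between the /ka/ and /la/ sub-segments."""
    longer, shorter = max(dur_ka, dur_la), min(dur_ka, dur_la)
    return longer / shorter - 1.0

def ddk_points(segments, total_time_s: float, malformed_count: int) -> float:
    points = 8.0                                   # default maximum
    points -= 0.5 * malformed_count                # malformed /ka/ or /la/
    points -= 0.5 * sum(1 for ka, la in segments
                        if duration_disparity(ka, la) > 0.5)
    if total_time_s > 5.0:
        points -= 1.0 * (total_time_s - 5.0)       # delay beyond five seconds
    return max(points, 0.0)

# Ten /ka la/ pairs, one badly unbalanced, three malformed syllables, 6.9 s total.
pairs = [(0.22, 0.20)] * 9 + [(0.45, 0.18)]
print(ddk_points(pairs, total_time_s=6.9, malformed_count=3))
```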

5.4 Evaluation of Velarpalatal Competence

This test, referred to as the Maintenance of Palate task, has proven to be the most challenging in terms of metric system implementation, since it seeks to evaluate an articulatory dysfunction – hypernasality [68] – which is difficult to unambiguously identify using only acoustic information. The patient is required to repeat the word pairs "may pay" (/me pe/) and "nay-bay" (/ne be/) in a rhythmical fashion. Any tendencies towards hypernasality would become apparent when the speaker attempts to rapidly raise and lower the palate to first produce the nasalised /m/ and /n/ phones followed by an immediate transition to the /e/ vowel. A definitive diagnosis of hypernasality usually requires the use of specialist instruments or – as Enderby (1983) and Brookshire (1992) recommend – that the test administrator should palpate the test subject's nose bridge during execution of the task to detect nasal resonance, defined as the vibration caused by nasal air flow during vowel production (Kent, 1986). Such direct detection methods are, of course, not possible via auditory analysis; furthermore, there is the distinct possibility that any air leakage detected could be the result of improper lip seal rather than nasal emission. It is important, therefore, to devise some method to distinguish between velopharyngeal and velarpalatal insufficiency; from a close examination of the SUFDAR recordings featuring dysarthric utterances made in response to this Maintenance of Palate task, it would appear that such a distinguishing feature is the time lag between the onset of nasal emission and that of the /b/ or /p/ plosion. The importance of this time lag will be explained in greater detail in section 5.4.1. At this juncture, however, it is important to detail the word or pseudo-word items included in the recogniser's vocabulary for this task, which comprises the target words (the first four items) along with the most likely errors, to wit: "may", "pay", "nay", "bay", "fay" (/fe/), "day", "vay" (/ve/), "tay" (/te/) and "say". The recogniser's classification concurred with that of the expert panel members 78.4% of the time. Similar to the /p/-/f/ confusion noted for the severely flaccid speaker in the lip seal task, it has been observed that, for the SUFDAR recordings showing evidence of moderate to severe hypernasality, the recogniser tends to interpret – in 7 out of 10 instances – what the speaker intended to be "pay" or "bay" as "fay" or "say", a classification that appears to be diagnostically significant but which can only be confirmed through testing on a more extensive data set (of flaccid lip seal recordings) than is currently available to this researcher. Indeed, it is at this juncture that attention shall be focused on the design and implementation of the CFDA's spectral analysis measurement systems meant specifically for hypernasality detection.

[68] As discussed in chapter 2, hypernasality is a speech disorder occurring when the palate and pharynx tissues do not close properly. This inadequate closure causes air to escape through the nose during speech instead of coming out of the sides and back of the throat, particularly with certain sounds such as "p", "b", "s" and "k". [Cited from the American Academy of Otolaryngology internet web site at http://www.entnet.org/healthinfo/throat/hypernasality.cfm; last visited on June 15, 2007.]

5.4.1 Objective Measurement of Hypernasality

In terms of its acoustic properties, excessive nasal emission has a broad-band frequency noise profile similar to that resulting from improper lip seal; despite this similarity, there is a substantial difference between the two types of involuntary air leakage: it has been observed that, for the dysarthric speech data in the SUFDAR corpus, the start of hypernasal emission is – on average – further away from plosive onset than would be the case for an incompetent attempt at bilabial seal. As discussed in section 5.2.1, inadvertent bilabial air leakage tends to occur immediately prior to the /p/ plosive burst and often obscures it (see Figures 5.2 and 5.3); inadvertent nasal emission, on the other hand, typically starts twenty to forty milliseconds before the amplitude spike marking the onset of /p/ or /b/, as is the case with the examples shown in Figure 5.4. There is always the possibility, of course, that the test subject may suffer from both velopharyngeal and palatal incompetence, a situation which would be non-trivial to disambiguate based solely on acoustic evidence. Such a dilemma may be avoided, however, if the assumption is made that any evidence of incompetent lip seal would have been previously detected during the test specifically designed for evaluating that articulatory abnormality. Based on the assumption that the hypernasality evaluation task is not meant to be performed in isolation but rather as part of a complete FDA test battery, the CFDA application will treat all evidence of post-vowel/pre-plosive broadband noise lasting longer than thirty milliseconds as evidence of hypernasality. The CFDA's measurement of the severity of hypernasal symptoms employs the same HLF metric as that used to detect bilabial lip seal air leakage; readings derived from this HLF metric can be translated directly into major criterion points as defined in Table 5.4.

Figure 5.4: Examples of Hypernasalised "may-pay" Renditions (three "may pay" attempts are shown; the /p/ plosive bursts are not well defined, unlike those in Figures 5.1 and 5.2)
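The pre-plosive noise check described above can be sketched as follows; the 4 kHz split between the low and high bands follows the 4000 Hz-8000 Hz high band cited at the end of this chapter, while the frame and hop sizes, the onset-detection step and the synthetic test signal are assumptions introduced purely for illustration.

```python
# Sketch of the HLF-based pre-plosive broadband-noise check described above.
# Framing parameters and the test signal are illustrative assumptions only.
import numpy as np

def hlf_ratio(frame: np.ndarray, sr: int, split_hz: float = 4000.0) -> float:
    """High-to-low frequency energy ratio of one windowed frame."""
    spec = np.abs(np.fft.rfft(frame * np.hanning(len(frame)))) ** 2
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / sr)
    high = spec[freqs >= split_hz].sum()
    low = spec[freqs < split_hz].sum() + 1e-12
    return high / low

def noisy_ms_before(signal: np.ndarray, sr: int, plosive_onset_s: float,
                    frame_ms: float = 25.0, hop_ms: float = 10.0,
                    threshold: float = 1.4) -> float:
    """Milliseconds of contiguous high-HLF frames immediately preceding a plosive."""
    frame, hop = int(sr * frame_ms / 1000), int(sr * hop_ms / 1000)
    start = int(sr * plosive_onset_s) - frame
    noisy = 0.0
    while start >= 0 and hlf_ratio(signal[start:start + frame], sr) > threshold:
        noisy += hop_ms
        start -= hop
    return noisy

# Synthetic stand-in for high-frequency frication noise preceding a plosive at 0.5 s.
sr = 16000
t = np.arange(sr) / sr
sig = np.sin(2 * np.pi * 6000.0 * t)
print(noisy_ms_before(sig, sr, plosive_onset_s=0.5))
# An instance of suspected hypernasal emission is logged when this exceeds 30 ms.
```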

Table 5.4: Positive Point Parameterisation for Task 15 Major Criteria

Positive Points (Letter Grade) | Major Criteria parameters for Hypernasality Test
8 (A) | 6 instances of broad band noise of 30 ms duration or more (HLF ratio > 1.4:1)
6 (B) | 5 instances of broad band noise of 30 ms duration or more (HLF ratio > 1.4:1)
4 (C) | 4 instances of broad band noise of 30 ms duration or more (HLF ratio > 1.4:1)
2 (D) | 3 instances of broad band noise of 30 ms duration or more (HLF ratio > 1.4:1)
0 (E) | … 1.4:1)


Table 5.5: Expert Listener and Automatic Evaluations for Selected Speech Data Samples for Test 15: Palate Movement in Speech

Test Category: Task 15 – Palate Movement in Speech ("Nay-Pay" / "May-Bay")

Sample No. | Assessor ID | Grade | Observations/Measurements
1 | 1    | A  | Slight hypernasality on /p/ but within normal limits
1 | 2    | C+ | No regular alteration between +/- voice on plosives; slight hypernasality.
1 | 3    | C  | No Comment
1 | CFDA | A  | 6 correct /nay pay/ segments detected. No excessive breathiness.
2 | 1    | C  | Some hypernasality on /p/, variable, sometimes totally nasal /p/ sound. Sometimes milder hypernasality.
2 | 2    | C  | No differentiation of +/- voice and nasal emissions
2 | 3    | D  | No Comment
2 | CFDA | D  | 4 segments "Nay Pay" or "May Bay". Breathiness (i.e. HLF ratio < 1.4:1) detected in 5 post-/p/ vowels.
3 | 1    | B  | Some imbalance in resonance, slightly so.
3 | 2    | E+ | Unbalanced resonance, lack of clear differentiation, fatigue and poor vocal quality
3 | 3    | C+ | No Comment
3 | CFDA | D  | 3 full segments "Nay Pay" detected, post-/p/ vowel breathiness (i.e. HLF ratios < 1.4:1) in 2nd and 3rd attempt.
4 | 1    | A  | Normal resonance
4 | 2    | C  | Lack of differentiation between /m/ and /n/, clear definition between nasal and labial
4 | 3    | B+ | No Comment
4 | CFDA | A  | 4 segments, no post-/p/ vowel breathiness (i.e. HLF ratios > 1.4:1).
5 | 1    | A  | Normal resonance
5 | 2    | B  | Slow output; clear differentiation: nasal and labial plosive.
5 | 3    | A  | No Comment
5 | CFDA | A  | 4 segments, no post-/p/ vowel breathiness.

A selection of subjective expert judgements and CFDA evaluations for SUFDAR recordings featuring hypernasality evaluations appears in Table 5.5. Both the inter-judge and the machine-human correlations for hypernasality evaluation proved to be lower than for any of the other assessment procedures described in this chapter: the overall inter-judge correlation is 0.61, and the correlations between the CFDA application's assessment and those of the respective expert panel judges are (i) 0.70 for judge 1, (ii) 0.55 for judge 2 and (iii) 0.59 for judge 3. These relatively low levels of machine-human agreement underscore the fact that reliance solely on acoustic means to detect and quantify hypernasality is not the most effective diagnostic method and cannot substitute for more direct means of testing, such as palpating the patient's nose bridge (as discussed in section 5.4). Unfortunately, it is outside the scope of this study to investigate the use of the type of specialist instrumentation which could produce a more reliable diagnostic result.

5.5 Chapter Summary

For the FDA labial, lingual and pharyngeal tasks discussed in this chapter, a two-pronged evaluation technique – using a combination of ASR and low-level signal processing techniques – has been adopted for the assessment of the targeted articulatory behaviour and to quantify certain expected pathologies such as breathiness and hypernasality. The ASR analysis is used to determine whether or not the test subject executed the specified articulatory movements. It is assumed – save in certain exceptional cases not applicable in this context – that there is a direct relationship between the correct pronunciation of certain speech sounds and articulatory actions, e.g. the pronunciation of /p/ implies a voiceless bilabial seal immediately followed by a plosive burst of air from the lips. The ASR approach has worked well as an articulatory place-and-manner classifier, concurring with expert subjective evaluation for 92.4% of all speech samples examined for Task 8 (Lip Seal), 85.7% for the Task 24 (DDK: Alternate Movements) samples and 78.4% for the Task 15 (Palate Movement / Hypernasality). Such relatively good phone recognition rates notwithstanding, the ASR classifier's capacity to detect pathological phenomena was – as expected – neither sufficiently consistent nor precise enough for the purposes of pathological measurement. It was necessary, therefore, to implement a customised spectral analysis objective metric system specifically designed to detect and quantify the anticipated abnormalities via their acoustic signature. All of these systems classify pathology in terms of energy distribution between the high (4000 Hz - 8000 Hz) and low (…
