An Investigation of Acoustic, Linguistic, and Affect Based Methods for Speech Depression Assessment

Brian Stasak

A thesis submitted for the degree of Doctor of Philosophy

School of Electrical Engineering & Telecommunications
Faculty of Engineering

December 2018

Originality Statement ‘I hereby declare that this submission is my own work and to the best of my knowledge it contains no materials previously published or written by another person, or substantial proportions of material which have been accepted for the award of any other degree or diploma at UNSW or any other educational institution, except where due acknowledgement is made in the thesis. Any contribution made to the research by others, with whom I have worked at UNSW or elsewhere, is explicitly acknowledged in the thesis. I also declare that the intellectual content of this thesis is the product of my own work, except to the extent that assistance from others in the project's design and conception or in style, presentation and linguistic expression is acknowledged.’ Signed: Date: December 17th, 2018

Copyright Statement ‘I hereby grant the University of New South Wales or its agents the right to archive and to make available my thesis or dissertation in whole or part in the University libraries in all forms of media, now or here after known, subject to the provisions of the Copyright Act 1968. I retain all proprietary rights, such as patent rights. I also retain the right to use in future works (such as articles or books) all or part of this thesis or dissertation. I also authorize University Microfilms to use the 350-word abstract of my thesis in Dissertation Abstract International. I have either used no substantial portions of copyright material in my thesis or I have obtained permission to use copyright material; where permission has not been granted I have applied/will apply for a partial restriction of the digital copy of my thesis or dissertation.' Signed:

Authenticity Statement ‘I certify that the Library deposit digital copy is a direct equivalent of the final officially approved version of my thesis. No emendation of content has occurred and if there are any minor variations in formatting, they are the result of the conversion to digital format.’ Signed:

Abstract

Given that there is no definitive, widely accepted laboratory test for depression, there is a need for effective new methods to help with patient diagnosis and monitoring. While abnormal speech behaviours are identified as useful indicators for recognizing depression, the effects of depression on speech production warrant further investigation. Automatic approaches for speech-based depression assessment show promise; however, they are still in their early developmental phase, and many questions remain unanswered. This thesis investigates which acoustic, linguistic, and affective aspects of speech are discriminative of depression, towards informing speech elicitation methods and providing more effective knowledge-driven automatic speech-based depression disorder classification system designs.

Through experimentation herein, it is discovered that as articulatory-gestural effort, linguistic complexity, and affect dimensions increase, more discriminative speech-language differences are exhibited between depressed and non-depressed speakers. By examining vowel articulatory parameters, statistically significant differences in acoustic characteristics are found at a paraphonetic level in depressed speakers, such as shorter vowel durations and less variance for ‘low’, ‘back’, and ‘rounded’ vowel positions. Furthermore, when compared with non-depressed speakers, experiments reveal that depressed speakers exhibit reduced linguistic stress attributes. To help guide speech data selection and exploit articulatory differences between depressed and non-depressed speakers, two new measures based on articulatory acquisition mastery and phoneme transition gestural effort are devised.

From a linguistic perspective, experiments herein show that statistically significantly more speech disfluencies occur in depressed speakers than in non-depressed speakers, especially in terms of hesitations and malapropisms. It is discovered that linguistic placement of affect token words at the beginning of elicited read sentences, rather than the middle or end of sentences, provides better acoustic discriminative cues for identifying depressed speakers. From an affective perspective, experiments conducted show that individually computing features on a per-valence affect category (negative, neutral, positive) basis and then fusing them into a multi-affect feature results in significant improvements, up to 100% accuracy, in depression classification relative to separate per-affect-range systems or affect-agnostic approaches.

Acknowledgements

Firstly, I thank my supervisor Associate Professor Julien Epps. Without reservation, Prof. Epps’ keen insight, sincerity, guidance, and clever prowess proved to be invaluable during my studies. His zeal for shaping raw, complicated ideas into elegantly simple solutions serves as motivation for any auspicious PhD student. I am grateful to my co-supervisor Dr. Eliathamby Ambikairajah. His enthusiasm towards applied research and helpful presentation remarks did not go unnoticed. I extend additional thanks to Dr. Vidhyasaharan Sethu, Dr. Phu Ngoc Le, and Dr. Nicholas Cummins for their coursework leadership and positive feedback over the years. I also thank Dr. Roland Goecke for his research contributions. I would like to recognize my exceptional research colleagues, and in particular, give a special thanks to Dr. Zhaocheng Huang, Dr. Ting Dang, and Dr. Jianbo Ma.

I acknowledge that my prior experience with the United States Air Force Research Lab (AFRL-NY) was helpful while pursuing my PhD. My previous work with the Forensic Audio Group provided me with a practical background in applied research techniques across a wide range of speech processing projects. I wish to salute my former AFRL colleagues for their positive encouragement. I especially thank Dr. Aaron Lawson, Ben Pokines, and Bernie Zimmermann for their additional research advice over the last few years. I also wish to thank my former Trinity College Dublin professors and classmate, Dr. Carl Vogel, Dr. Christer Gobl, and Capt. Joe Banner, for their kind academic post-graduate endorsements.

I applaud the Black Dog Institute (Sydney) and Sonde Health (Boston) for their benevolent efforts in the field of healthcare. I am very thankful that they were willing to extend the courtesy of sharing their privately collected speech data.

My doctoral study was made possible by generous financial support from UNSW, including the TFS Award and the Faculty of Engineering Research Award. Additionally, the National ICT Australia and Data61 scholarship provided extra, much appreciated financial support during my studies. I also thank the UNSW School of Electrical Engineering and Telecommunications for providing computer and lab resources.

Lastly, I thank Glenys and her family, and also my family and friends, for their support and for advocating my decision to, once again, pursue academic studies overseas. It has been an honest privilege and rare opportunity for me to study abroad in Australia, where there is an abundance of natural beauty.

Abbreviations and Symbols

ANEW       Affective Norms for English Words
AVEC       Audio/Visual Emotion Challenge
AViD       Audio-Visual Depressive Language Corpus
BDAS       Black Dog Institute Affective Sentences
BDI        Beck Depression Inventory
COVAREP    Cooperative Voice Analysis Repository for Speech Technologies
DAIC-WOZ   Distress Analysis Interview Corpus – Wizard of Oz
DCT        Discrete Cosine Transform
DSM        Diagnostic and Statistical Manual of Mental Disorders
eGeMAPS    Extended Geneva Minimalistic Acoustic Parameter Set
HAMD       Hamilton Rating Scale for Depression
LDA        Linear Discriminant Analysis
LLD        Low-Level Descriptors
LMLR       Log Mean Likelihood Ratio
MAE        Mean Absolute Error
MFCC       Mel-Frequency Cepstral Coefficients
NLP        Natural Language Processing
PHQ        Patient Health Questionnaire
PMR        Psychomotor Retardation
QIDS       Quick Inventory of Depressive Symptomatology
RMSE       Root Mean Squared Error
RVM        Relevance Vector Machine
SALAT      Suite of Linguistic Analysis Tools
SEANCE     Sentiment Analysis and Cognition Engine
SH1        Sonde Health Database
siNLP      Simple Natural Language Processing Tool
SVM        Support Vector Machine
TAALES     Tool for the Automatic Analysis of Lexical Sophistication
TAASSC     Tool for the Automatic Analysis of Syntactic Sophistication and Complexity
VAD        Voice Activity Detection
VTC        Vocal Tract Coordination
WHO        World Health Organization

Table of Contents

Abstract  v
Acknowledgements  vi
Chapter 1  INTRODUCTION  1
   1.1  Motivation  1
   1.2  Research Questions  10
   1.3  Organization  10
   1.4  Major Contributions  12
   1.5  Publication List  14
Chapter 2  CLINICAL DEPRESSION BACKGROUND  16
   2.1  Depression  16
   2.2  Depressive Disorder Subtypes  20
      2.2.1  Non-Melancholic  21
      2.2.2  Melancholic  21
      2.2.3  Unipolar (Major Depressive Disorder)  22
      2.2.4  Bipolar Depression  22
   2.3  Standard Clinical Assessment Techniques  23
      2.3.1  Beck Depression Inventory  25
      2.3.2  Patient Health Questionnaire  26
      2.3.3  Quick Inventory of Depressive Symptomology  27
   2.4  Emerging Assessment Techniques  28
      2.4.1  Biological  28
      2.4.2  Neurobiological  29
      2.4.3  Physiological  30
      2.4.4  Statistical  31
   2.5  Summary  31
Chapter 3  SPEECH-LANGUAGE BACKGROUND  32
   3.1  Acoustic Theory of Speech Production  33
   3.2  Measurements of Speech Processing  39
      3.2.1  History of Acoustic Speech Measurement Tools  39
      3.2.2  Discrete Time Speech Acoustic Analysis  42
   3.3  Speech-Language Affect Measurements - Continuous Ratings  45
      3.3.1  Arousal  47
      3.3.2  Valence  48
      3.3.3  Other Affective Dimensions  48
   3.4  Effects of Depression on Speech  48
      3.4.1  Acoustic Behaviours and Characteristics  49
      3.4.2  Linguistic Behaviours and Characteristics  57
      3.4.3  Affect Behaviours and Characteristics  59
   3.5  Summary  63
Chapter 4  AUTOMATIC SPEECH-BASED DEPRESSION ANALYSIS BACKGROUND  64
   4.1  Data Selection  66
   4.2  Speech Feature Types for Depression Analysis  68
      4.2.1  Acoustic  68
      4.2.2  Linguistic  77
      4.2.3  Affective  79
   4.3  Statistical Classification Techniques  81
      4.3.1  Decision Trees  82
      4.3.2  Linear Discriminant Analysis  85
      4.3.3  Support Vector Machine  88
      4.3.4  Support Vector Regression  90
      4.3.5  Relevance Vector Machine  92
      4.3.6  Neural Network Based Methods  94
      4.3.7  Evaluation Methods  94
   4.4  Summary  96
Chapter 5  DEPRESSION SPEECH CORPORA  97
   5.1  General Speech Depression Database Criteria  97
   5.2  Speech Elicitation Considerations and Modes  98
      5.2.1  Diadochokinetic  99
      5.2.2  Held-Vowel  99
      5.2.3  Automatic  99
      5.2.4  Read  100
      5.2.5  Spontaneous  101
   5.3  Speech Nuisance Factors  102
   5.4  Speech-Based Depression Databases  103
      5.4.1  Audio Visual Depressive Language Corpus  103
      5.4.2  Audio Visual Emotion Challenge 2014  104
      5.4.3  Black Dog Institute Affective Sentences  106
      5.4.4  Distress Analysis Interview Corpus – Wizard of Oz  108
      5.4.5  Sonde Health  109
   5.5  Summary  111
Chapter 6  ACOUSTIC PHONEME ATTRIBUTES  112
   6.1  Overview  112
   6.2  Methods  113
   6.3  Experimental Settings  117
   6.4  Results and Discussion  118
   6.5  Summary  121
Chapter 7  ARTICULATORY STRESS  123
   7.1  Overview  123
   7.2  Methods  127
   7.3  Experimental Settings  129
   7.4  Results and Discussion  131
   7.5  Summary  138
Chapter 8  LINGUISTIC TOKEN WORDS  140
   8.1  Overview  140
   8.2  Methods  142
   8.3  Experimental Settings  144
   8.4  Results and Discussion  145
      8.4.1  Token Words Versus Entire Utterances  145
      8.4.2  Linguistic Baseline System  147
      8.4.3  Acoustic and Linguistic Features Fusion  147
      8.4.4  n-Best Approach  148
   8.5  Summary  149
Chapter 9  ARTICULATION EFFORT, LINGUISTIC COMPLEXITY, AND AFFECTIVE INTENSITY  150
   9.1  Overview  150
   9.2  Methods  151
   9.3  Experimental Settings  153
   9.4  Results and Discussion  155
   9.5  Extension of Results and Discussion Using Affective Intensity  157
   9.6  Summary  160
Chapter 10  SPEECH AFFECT RATINGS  161
   10.1  Overview  161
   10.2  Methods  162
   10.3  Experimental Settings  164
   10.4  Results and Discussion  165
      10.4.1  Manual Affect Ratings  165
      10.4.2  Acoustic Frame Selection Based on Manual Affect Ratings  166
      10.4.3  Acoustic Frame Selection Based on Automatic Affect Ratings  167
   10.5  Summary  168
Chapter 11  ACOUSTIC, LINGUISTIC, AND AFFECTIVE CONSIDERATION FOR SPOKEN READ SENTENCES  169
   11.1  Overview  169
   11.2  Methods  173
      11.2.1  Acoustic Feature Extraction  173
      11.2.2  Speech Voicing Extraction  173
      11.2.3  Manual Speech Disfluency Extraction  174
      11.2.4  Automatic Speech Recognition Disfluency Extraction  175
      11.2.5  Linguistic Measures  176
      11.2.6  Affective Measures  176
   11.3  Experimental Settings  179
   11.4  Results and Discussion  179
      11.4.1  Acoustic Analysis  179
      11.4.2  Speech Voicing Analysis  182
      11.4.3  Verbal Disfluency Analysis  185
      11.4.4  Affect-Based Feature Fusion  190
      11.4.5  Limitations  192
   11.5  Summary  193
Chapter 12  INVESTIGATION OF PRACTICAL CONSIDERATIONS FOR SMARTPHONE APPLICATIONS  195
   12.1  Overview  195
   12.2  Methods  196
   12.3  Experimental Settings  198
   12.4  Results and Discussion  199
      12.4.1  Manufacturer Comparison  199
      12.4.2  Manufacturer Version Comparison  200
      12.4.3  Manufacturer Feature Comparison  201
      12.4.4  Manufacturer Normalization Comparison  202
   12.5  Summary  203
Chapter 13  CONCLUSION  205
   13.1  Summary  205
   13.2  Future Work  209
Appendix A: Beck Depression Inventory II  212
Appendix B: Patient Health Questionnaire-9  215
Appendix C: Quick Inventory of Depressive Symptomology Self-Report  216
Appendix D: List of Acoustic Analysis Depression Speech Studies  219
Appendix E: List of Linguistic Analysis Depression Speech Studies  221
Appendix F: List of Affect Analysis Depression Speech Studies  222
References  223

Chapter 1 INTRODUCTION

1.1 Motivation

Depression has long been prevalent, and it continues to take a toll on the global health agenda as a leading worldwide contributor to disability, suicide, and economic cost (WHO, 2004, 2012). With an already conservative estimate of 400 million individuals afflicted by depression, this disorder is forecast to become the world’s second-most disabling illness by 2020 (WHO, 2001). Even more troubling for both developed and undeveloped nations, by the year 2030, depression is projected to become the leading health concern worldwide (WHO, 2018).

Depression is a significant financial burden on the global economy. For instance, in 2000, it was conservatively estimated that the United States alone amassed over $83 billion (USD) in depression-related expenses (Greenberg et al., 2015). Similar losses attributed to depression were recorded in many other nations, including a combined estimate of €92 billion within European Union countries (Olesen et al., 2012). Current estimates of total global losses stemming from depression exceed a staggering $1 trillion (USD) per year (WHO, 2016). Continual increases in the global financial burden and expenditures related to depression disorders have added to the mountain of statistical evidence that depression is a global epidemic.

It is estimated that 20% of all individuals will have a bout of depression during their lifetime (Burcusa & Iacono, 2007). Moreover, nearly half of all people clinically diagnosed with depression will have recurring depressive episodes that span multiple years (Burcusa & Iacono, 2007; WHO, 2017). Statistically, women are at much higher risk for depression than men, having up to three times the rate of incidence (APA, 2013; Pratt & Brody, 2014). Moreover, depression and suicide are intrinsically linked: in nearly half of all suicides, the victims meet the diagnostic criteria for clinical depression (McGirr et al., 2007; Minkoff et al., 1973).

Depression has biological and/or psychological origins, and most often it is suddenly triggered by a traumatic event or a series of negative events (Brown & Harris, 1978). However, not all people are impacted by traumatic or negative events in the same manner.

The heterogeneous character of depression makes it more elusive to identify correctly, because diagnosis and clinically advised therapy require a person’s full cooperation. Remarkably, some people are completely unaware that they have depression, because it is often masked by or attributed to symptoms found in other illnesses and diseases (Aljarad et al., 2008). Awareness of depression as a disorder is also reduced when it is mistakenly perceived as merely an “extreme” form of sadness rather than a multi-form mood disorder with a range of severities and symptoms (Epstein et al., 2010; Lemelin et al., 1994; Williams et al., 2017).

Depression, regardless of severity, is a frequently studied and highly treatable disorder in the hands of capable clinicians. Moreover, if depression is flagged, correctly diagnosed, and addressed properly by a clinician early in its onset, the duration of prescribed patient therapy is substantially shortened and the rate of remedial success is at its highest (PT, 2018). However, patient interviewing skills differ amongst clinicians, and the medical literature concurs that ineffective interviewing techniques result in less complete knowledge of patients’ backgrounds and their major concerns (Ani et al., 2008; Carter et al., 2010; Enelow & Swisher, 1979; McCabe et al., 2013a, 2013b; Ong et al., 1995). Research asserts that the quantity and quality of information gathered depend greatly on a clinician’s individual elicitation style, range of pertinent enquiries, and protocol guidelines (Howe et al., 2014; Memon & Bull, 1991; Roter et al., 1987). Furthermore, Aljarad et al. (2008) demonstrated that specialized psychiatrists, due to their extensive mental health expertise, revise approximately half of all initial mental health referral diagnoses from general practitioners.

As a mood disorder, depression negatively impacts the normal healthy behaviour of an individual. Unlike many other illnesses that can only be identified through invasive biological or sophisticated imaging evaluation techniques, many effects caused by depression can be directly observed by evaluating basic cognitive-motor behaviours. For instance, increased physical inactivity, tiredness, and poor posture are commonly observed as gross-motor subsymptoms in individuals with depression (Parker et al., 1990; Segrin, 2000; Sobin & Sackeim, 1997). Furthermore, fine-motor behaviours are also impacted by depression, causing depressed populations to exhibit reduced facial expressivity, hand-eye coordination, and speech-language abilities (Ellgring & Scherer, 1996; Girard et al., 2013, 2014; Segrin, 2000).

A telltale social-behavioural sign of depression is the manner in which depressed individuals verbally communicate. Early subjective observational studies of depressed populations (Ferenczi, 1915; Mignot, 1907; Newman & Mather, 1938; Stinchfield, 1933) described them as having abnormal speech patterns, such as weaker loudness, slower rate, flattened pitch, uniform rhythm, less verbosity, and an unusually lifeless or hollow-sounding timbre. Following guidance on speech characteristics from these early studies, later objective studies using recordings and speech analysis of depressed speakers discovered and measured a variety of abnormal changes in acoustic, linguistic, and affective attributes. For instance, acoustic studies (Alpert et al., 2001; Hollien, 1980; Szabadi et al., 1976) have indicated decreases in depressed speakers’ loudness, stress-prosodic characteristics, verbal fluency, and overall rate of speech. When compared with healthy populations, recorded linguistic speech attributes of depressed populations include fewer words per phrase, less lexical diversity, illogical pragmatics, and weak semantic clarity (Cannizzaro et al., 2004a; Lott et al., 2002; Oxman et al., 1988; Rosenberg et al., 1990; Rubino et al., 2011). Linguistic analysis of individuals with depression has also shown that depressed speakers use more first-person pronouns (e.g. I, me, my) and fewer collective pronouns (e.g. we, us, our) (Stirman & Pennebaker, 2001). Additionally, Rude et al. (2004) demonstrated that depressed speakers verbally express more negatively valenced words than healthy populations. While analyses of depressed speakers’ abnormal speech-language attributes have helped to increase knowledge about how depression adversely affects speech processing, additional research is still necessary to better understand how technology-based applications can improve clinical mental health diagnosis and therapy (Cohen & Elvevåg, 2014).

Developments in automatic speech-based depression analysis began less than two decades ago, and the field has only gained momentum in the past few years. Thus, there is still a great deal to learn with regard to automatic data selection, what vital parts of speech are needed to generate reliable depression severity measures, and how best to elicit them. In recent years, speech as a biosignal has received more investigative attention as an objective measure for discovering potential biomarkers that can aid mental health diagnosis and monitoring (e.g. Cohen & Elvevåg, 2014; Girard et al., 2014; Harel et al., 2004; Kiss & Vicsi, 2014; Mundt et al., 2012; Nezu et al., 2009; Töger et al., 2017; Trevino et al., 2011; Quatieri & Malyska, 2012; Williamson et al., 2013, 2014). Automated speech processing, although not currently implemented as a standard biomarker protocol in assessing and monitoring depressed individuals, is perhaps not far from becoming a clinical healthcare application. Examples of speech elicitation in automated assessment and monitoring systems have included smartphone voice participation prompts and virtual human interviewers (Bhatia et al., 2017; Girard et al., 2013; Gonzales et al., 2000; Joshi et al., 2013b; McIntyre et al., 2009; Ogles et al., 1998; Reilly et al., 2004; Roberts et al., 2018; Scherer et al., 2014).

Over the last decade, for example, automated clinical screening and virtual-avatar interview elicitation approaches have been explored more extensively. Many studies (Bhatia et al., 2017; Girard et al., 2013; Joshi et al., 2013b; McIntyre et al., 2009; Scherer et al., 2014) have proposed designs for multimodal behavioural digital biosignal tracking systems specifically for identifying illnesses. These studies, which include multimodal biosignals, have generally relied heavily on monitoring and processing a patient’s voice to generate the most accurate diagnostic readings. While implementation of a multimodal system in a clinical setting is sensible, outside of this environment its practicality is more limited due to costs, sensor/hardware device requirements (e.g. electricity, Internet), and security (e.g. medical privacy, device security).

For medical analysis, speech is unique in that its acquisition does not require an elaborate setup. Speech biosignal analysis also allows for broader, more practical patient access via popular smart devices (e.g. smartphone, tablet, computer). Roughly half of all cellphone owners in developed nations and a third in developing nations possess a smartphone, and these figures are forecast to continue growing rapidly as device affordability and network access increase (Rainie & Perrin, 2017). These estimates do not include people who merely have access to a smartphone and/or another smart device. Audio speech processing, designed to automatically extract and analyze discrete temporal information using multi-tiered speech-language systems (e.g. speech recognition, paralinguistics, transcript analysis), holds significant promise for clinical depression analysis. Such systems can identify depressive patients’ social-behavioural patterns, described earlier in this chapter, that may be easily overlooked or deemed incalculable by a clinician. For articulatory, linguistic, and affect-related approaches to evaluating depression, there is still good opportunity for the exploration and exploitation of new discriminative information for automatic depression classification.

Miller (1963), Chomsky (1959), and Hauser et al. (2014) believe that spoken language has innate biological origins in a physiological and cognitive sense. Miller (1965) noted that any general theory of psychology is inadequate if it fails to take language into account. As shown in Fig. 1.1, the speech biosignal normally contains three main tiers of encoded information, noting that no model of human speech is regarded as complete (Garman, 1990). Hence, Fig. 1.1 illustrates the most basic innate speech production fundamentals and the vital component processes for naturally effective verbal communication. Whether or not these tiers have a biological origin, it is not difficult to comprehend how depression can impact the functions within each of them.

Depression causes relatively predictable abnormal speech-language behaviours that are identified as typical clinical subsymptoms.

[Figure 1.1 — tier contents: AFFECT (arousal, dominance, valence, confidence, empathy, politeness, sincerity); LINGUISTIC (pragmatics, semantics, lexicon, syntax, stress, phonemes); ACOUSTIC (duration, voicing, intensity, pitch, timbre)]

Figure 1.1: Three major analysis tiers contained within the speech biosignal. Examples of information contained in each tier are shown on the right. Speech-language deficiencies can be found across all three tiers and/or selective tiers depending on the individual and the severity of the disorder/disease. Notably, a great deal of speech-based depression research to date has focused primarily on the acoustic tier.

In Fig. 1.1, the acoustic tier contains crucial elements related to neurophysiological-motor function, such as vocal fold activation, the articulators, and oral/nasal resonator modifications. Considered independently, the acoustic tier does not represent any communication-related information that a cognitively developed person would understand, as this tier contains no organized language system. The linguistic tier shown in Fig. 1.1 contains language-specific information defined by a set of language-dependent phoneme prototypes, a great number of meaningful lexical units, and grammar rule expectations. The affect tier in Fig. 1.1 includes purposeful emotional content and expression. It is often regarded as one of the most complex aspects of language because it usually incorporates acoustic and linguistic tier elements. For instance, in the delivery of an ironic joke, vocal pitch can be spoken in a very sad acoustic tone while carrying the linguistic stress of a very affectively positive word selection.

It is fair to say that not all depressed speakers exhibit the same level of abnormal speech-language characteristics across all tiers (e.g. psychomotor abnormalities, cognitive deficits, negative affective bias). Therefore, it is believed that a multi-tier assessment can provide more comprehensive evidence regarding depression severity than information from only a single tier.
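
To make the tier decomposition in Fig. 1.1 concrete, the following minimal sketch (illustrative only; the feature names and values are assumptions for this example, not measurements defined in this thesis) shows one way a per-utterance analysis could bundle separate acoustic, linguistic, and affect measurements so that each tier can be inspected individually or concatenated for later modelling:

    from dataclasses import dataclass, field
    from typing import Dict

    @dataclass
    class UtteranceTiers:
        """Per-tier measurements for a single utterance (illustrative field names only)."""
        acoustic: Dict[str, float] = field(default_factory=dict)    # e.g. pitch, intensity, duration
        linguistic: Dict[str, float] = field(default_factory=dict)  # e.g. lexical diversity, disfluency counts
        affect: Dict[str, float] = field(default_factory=dict)      # e.g. arousal, valence ratings

        def as_vector(self):
            """Concatenate all tier measurements into one flat feature list (acoustic, linguistic, affect order)."""
            return [v for tier in (self.acoustic, self.linguistic, self.affect) for v in tier.values()]

    # Example with made-up values for a single utterance
    utt = UtteranceTiers(
        acoustic={"mean_f0_hz": 118.0, "mean_intensity_db": 62.5, "speech_rate_sps": 3.1},
        linguistic={"type_token_ratio": 0.48, "filled_pause_count": 4.0},
        affect={"valence": -0.6, "arousal": 0.2},
    )
    print(utt.as_vector())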

The comprehensive analysis of these three different tiers may, moreover, provide further diagnostic reliability, and perhaps even allow for recognition of specific depressive disorder subtypes. Recent research (Morales, 2018; Morales et al., 2017; Paulus et al., 2017) has indicated that there is a disconnection between studies of clinical depression and automatic diagnostic detection techniques. For example, many speech-based depression investigations fail to link multiple components of language (e.g. articulation, linguistics, affect) to the abnormal behaviours exhibited by speakers with depression. One reason for this is that the majority of speech-based depression studies have focused primarily on the acoustic tier for automatic analysis. Thus, there are gaps in the understanding of how linguistic and affect information could be used to improve speech-based depression techniques.

A multi-tiered analysis, as illustrated previously in Fig. 1.1, is more akin to the probe-based methodology found in speech-language pathology evaluations, wherein specific verbal language tasks (e.g. articulation, syntax, semantics, context appropriateness) are each isolated and evaluated to clearly establish a speaker's deficiencies or abnormalities in communication. Clinicians use a similar patient diagnostic approach to speech-language pathologists in that, in either occupation, no good clinician relies on a single test to guide his/her diagnosis; rather, a range of individual test batteries is used to aid diagnosis. It is proposed that using a similar multi-tiered speech-language analysis approach for automatic speech-based depression assessment could produce more effective diagnostic results. However, before a multi-tiered approach can be implemented in speech-based depression methods, a better understanding of speech analysis within each tier is necessary.

For automatic speech-based depression classification or prediction, more emphasis should be placed on evaluating all major tier components of spoken language, as shown previously in Fig. 1.1. Further, there is a need to better understand exactly how the different tiers of information are independently and jointly affected by depression disorders. This investigative approach across components of speech, however, requires expertise not just in speech acoustics but also a good understanding of linguistic and paralinguistic foundations. Certainly, for automatic depression analysis, there is a need for more fine-grained speech-based investigation that specifically uses data selection to isolate acoustic, linguistic, and affect tier information. The concept of data selection has proven effective in many other areas of speech processing (e.g. speech recognition, speaker identification, emotion classification).

For example, the most common forms of data selection for automatic speech-based depression analysis include voice activity detection, gender-based selection, and phonologically driven methods. Voice activity detection (VAD) is a standard practice in virtually all automatic speech processing applications because it aids in isolating speech and removing undesirable non-speech noise (Kinnunen & Rajan, 2013). Gender-specific speech-based depression studies (Alghowinem et al., 2012; Cummins et al., 2017; Low et al., 2010; Vlasenko et al., 2017) have also been explored, indicating speech characteristic differences depending on gender. Speech-based depression experiments have also been conducted using phoneme-based data selection methods, such as isolation of individual phonemes (e.g. vowels, consonants) (Scherer et al., 2015; Trevino et al., 2011; Vlasenko et al., 2017), basic vowel groups (e.g. diphthongs/monophthongs) (Scherer et al., 2016; Trevino et al., 2011), and broad articulatory manner groups (e.g. voiced/unvoiced, vowel/fricative/plosive/nasal) (Trevino et al., 2011). However, for automatic depression classification, still very little is known about how data selection of linguistic or affective information alone (or as a reference for selecting acoustic information) might reveal further properties discriminative of depression.
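
As a minimal sketch of acoustic data selection of the kind just described (a generic illustration, not the selection methods developed later in this thesis), the example below retains only frames whose short-time energy exceeds a relative threshold, i.e. a crude energy-based VAD applied before any feature extraction; the frame sizes and threshold value are arbitrary assumptions:

    import numpy as np

    def frame_signal(x, frame_len, hop_len):
        """Slice a 1-D signal into overlapping frames (trailing samples are dropped)."""
        n_frames = 1 + (len(x) - frame_len) // hop_len
        return np.stack([x[i * hop_len: i * hop_len + frame_len] for i in range(n_frames)])

    def energy_vad_select(x, sr, frame_ms=25, hop_ms=10, rel_threshold=0.1):
        """Keep frames whose RMS energy exceeds rel_threshold times the maximum frame RMS."""
        frames = frame_signal(x, int(sr * frame_ms / 1000), int(sr * hop_ms / 1000))
        rms = np.sqrt(np.mean(frames ** 2, axis=1))
        keep = rms > rel_threshold * rms.max()
        return frames[keep], keep

    # Synthetic example: 1 s of noise-like "speech" followed by 1 s of near-silence
    sr = 16000
    signal = np.concatenate([0.5 * np.random.randn(sr), 0.005 * np.random.randn(sr)])
    selected, mask = energy_vad_select(signal, sr)
    print(f"kept {mask.sum()} of {mask.size} frames")

In practice, gender-dependent or phoneme-driven selection would replace or augment the simple energy criterion here with labels from a gender classifier or a phoneme aligner.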

The aforementioned studies evaluating articulatory aspects using data selection on depressed speakers (Scherer et al., 2015, 2016; Trevino et al., 2011; Vlasenko et al., 2017) have yet to explore phoneme-related parameters to their full extent. For instance, only broad vowel groups have been explored for data selection, and not multiple vowel parameter analysis. It is believed that vowel appraisals related to articulatory parameters could be useful in measuring disturbances exhibited by depressed speakers. Furthermore, none of the prior studies have co-examined a language outside of English; thus, they have yet to demonstrate depressive behavioural vowel trends across multiple languages. At the acoustic level, it is still unknown whether the speech biosignal has universal biomarkers that can be generalized and extended across more than one language. Kiss and Vicsi (2014) suggest that acoustic-based depression classification research should place higher priority on discovering a correlation between depression severity and changes in articulatory acoustic phoneme parameters. Recently, phono-articulatory complexity score-based systems have been investigated as a measure of articulatory kinematic effort for the detection of pathological speech diseases/disorders. Yet in this literature (e.g. Stoel-Gammon, 2010; Jakielski et al., 2006; Maccio, 2002; Shriberg et al., 1997; Shuster & Cottrill, 1997), only a small set of articulatory characteristics, speakers, and languages were manually examined. Furthermore, these investigations were primarily designed for speech pathology evaluation applications.

Another promising, practical method to evaluate articulatory-physiological speech parameters is analysis of phoneme occurrences/transitions based on spoken transcript analysis and acoustic-based features. Early influential investigations of physiological articulation parameters concentrated on articulatory descriptors, such as English vowel attributes (Ladefoged, 1993), universal phonetic markedness (Chomsky & Halle, 1968; Henning, 1989), and spoken articulatory gestures (Browman & Goldstein, 1986, 1992). These linguistically motivated investigations provided further insight into natural speech production, including static phoneme analysis and non-static articulatory transitions called articulatory gestures, which are essentially parameterized dynamic systems based on overlapping, coordinated muscular movements. In (Browman & Goldstein, 1989, 1995), a gestural score was proposed to help measure articulatory movements based on the linguistic order of articulatory change. Unlike a discrete sequence of phonemes, a gestural score has the advantage of encoding a set of hidden states present in natural interconnected speech. It is conceivable that for depression, the loss of articulatory control and kinematics can also be measured using similar descriptive, phonologically driven linguistic metrics.

Yet another important aspect of spoken language is affect. During speech, a great deal of information regarding an individual’s state of mind is revealed by his/her affective lexical choices and range of affect. In a study by Ojamaa et al. (2015), it was shown that sentiment analysis of spoken transcripts is a good indicator for estimating the type of affective content expressed in natural speech. However, it is only recently that sentiment and affect-based ratings have been explored to identify and/or predict depression severity (de Choudhury et al., 2013; Huang et al., 2015; Resnik et al., 2015; Wang et al., 2013b). While it is well established that individuals with depression exhibit greater amounts of negative affect (Joorman et al., 2005), the integration of spoken transcripts with their coinciding acoustic information is rare in the depression literature. The absence of transcript-acoustic analysis is particularly surprising, as there is likely rich information in examining the relationship and degree of reciprocity between these two types of information. With advances in speech-to-text processing (e.g. automatic speech recognition), the capacity to automatically transcribe (and even translate) speech allows for greater exploration of natural verbal affective analysis.
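
As a hedged illustration of the transcript-level affect and linguistic measures discussed above (the lexicon values and word lists below are invented toy examples, not the ANEW, SEANCE, or other resources used in later chapters), the sketch computes a first-person pronoun ratio and a mean word valence from an automatically or manually produced transcript:

    import re

    # Toy valence lexicon on a -1 (negative) to +1 (positive) scale; values are illustrative only.
    VALENCE = {"tired": -0.7, "alone": -0.6, "hopeless": -0.9, "fine": 0.3, "happy": 0.8, "calm": 0.5}
    FIRST_PERSON = {"i", "me", "my", "mine", "myself"}
    COLLECTIVE = {"we", "us", "our", "ours", "ourselves"}

    def transcript_measures(text):
        """Return a first-person pronoun ratio and the mean valence of lexicon-covered words."""
        tokens = re.findall(r"[a-z']+", text.lower())
        first_person = sum(t in FIRST_PERSON for t in tokens)
        collective = sum(t in COLLECTIVE for t in tokens)
        covered = [VALENCE[t] for t in tokens if t in VALENCE]
        return {
            "first_person_ratio": first_person / max(first_person + collective, 1),
            "mean_valence": sum(covered) / len(covered) if covered else 0.0,
            "n_tokens": len(tokens),
        }

    print(transcript_measures("I feel tired and alone, but we were fine yesterday."))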

Previous speech-based depression studies have focused their attention heavily on glottal (Ellgring & Scherer, 1996; Moore et al., 2003), pitch (Alpert et al., 2001; Mundt et al., 2007; Quatieri & Malyska, 2012), formant (Cummins et al., 2015a, 2017; Flint et al., 1993; Helfner et al., 2013; Williamson et al., 2014), intensity (France et al., 2000; Kiss & Vicsi, 2015), and spectral (Cummins et al., 2013a; Ozdas et al., 2004a; Sturim et al., 2011; Yingthawornsuk et al., 2007) voice characteristics because depression has been linked with changes in speech motor control (Cummins et al., 2013c, 2015b, 2015c; Sobin & Sackeim, 1997). In addition, suprasegmental (i.e. syllable/word) speech characteristics, such as rate of speech (Ellgring & Scherer, 1996; Flint et al., 1993; Gredin & Carroll, 1980; Trevino et al., 2011) and pauses (Alpert et al., 2001; Esposito et al., 2016; Mundt et al., 2007; Nilsonne, 1987; Trevino et al., 2011), have also been investigated as depression disorder indicators. For speech-based depression severity assessment, none of these individual speech characteristics has proven consistently more discriminative than all others. Furthermore, many of these speech characteristics can vary in usefulness depending on the elicitation mode (Mitra & Shriberg, 2015). In general, speech-based depression studies (Alghowinem et al., 2013a; Hashim et al., 2017; Low et al., 2010, 2011; Moore et al., 2004; Williamson et al., 2013) have indicated that a combination of several of these speech characteristics can provide more discriminative depressed speaker analysis.

Many different machine-learning methods have been used to help classify depressed speakers or predict their severity based on speech characteristics. For instance, Alghowinem et al. (2013a) and Long et al. (2017a) compared several individual and combined machine-learning techniques across various speech characteristics. However, despite the many options for machine-learning methods, whether straightforward or complex in design, so far none has proven to be the most advantageous for depression classification/prediction tasks.
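
To ground the classification step in a concrete form, the sketch below trains a linear support vector machine on utterance-level acoustic feature vectors with five-fold cross-validation; the features and labels are random placeholders, so this is a generic pipeline illustration rather than any system evaluated in this thesis:

    import numpy as np
    from sklearn.model_selection import cross_val_score
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler
    from sklearn.svm import SVC

    # Placeholder data: 60 utterances x 20 utterance-level acoustic features with binary labels
    rng = np.random.default_rng(0)
    X = rng.normal(size=(60, 20))        # e.g. pitch, formant, intensity, and spectral statistics
    y = rng.integers(0, 2, size=60)      # 0 = non-depressed, 1 = depressed (synthetic labels)

    # Standardise features, then fit a linear-kernel SVM; report cross-validated accuracy
    clf = make_pipeline(StandardScaler(), SVC(kernel="linear", C=1.0))
    scores = cross_val_score(clf, X, y, cv=5, scoring="accuracy")
    print(f"5-fold accuracy: {scores.mean():.2f} +/- {scores.std():.2f}")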

As shown previously in Fig. 1.1, speech conveys three unique but intrinsically linked encoded tiers of information (i.e. acoustic, linguistic, affect). It is logical that these key information tiers hold vital clues regarding the abnormal behaviours exhibited by individuals with depression. In addition, the emerging speech-based depression literature briefly presented herein has demonstrated that each of these tiers contains useful information for automatic depression analysis. This thesis explores each of the individual speech biosignal tiers while also studying the relationships between their interdependencies. By individually examining the processes within these interconnected tiers, new knowledge on depression-related verbal behavioural changes is discovered. Furthermore, a combined multi-tiered speech-based depression investigation is conducted, with promising results for automatic applications.

1.2 Research Questions

There is a need to discover more effective and/or alternative methods for identifying the severity of clinical depression. It is currently unknown which kinds of acoustic, linguistic, and affective measures can be applied to automatically guide speech data selection for depression classification applications.

• What different speech-related measures are most useful for data selection for automatic depression assessment, from the perspective of acoustic, linguistic, and affect information?

Features for depression classification can be directly derived from articulatory, linguistic, and affective information; however, many of these new potential features are currently unexplored.

• How can new knowledge-based features stemming from related fields of study (e.g. affective computing, linguistics, speech language pathology) and articulatory, linguistic, and affective information be developed for automatic depression analysis?

It is generally known that elicitation of affect and linguistic content can influence and guide automatic speech-based depression assessment protocols; yet, little investigation has gone into understanding how these factors impact assessment performance.

• How can affective content and linguistic structure be exploited for speech-based depression classification performance and elicitation design?

1.3 Organization

This thesis is organized as follows:

Chapter 2 defines the core patient depression symptoms that must be recorded by a clinician when diagnosing a depression disorder. Literature covering several different depression disorder subtypes and current diagnostic approaches is also discussed.

Chapter 3 briefly reviews speech-language production with a focus on articulatory and linguistic aspects. It also highlights abnormal acoustic, linguistic, and affective characteristics exhibited by individuals with depression.

Chapter 4 contains a short review of basic speech processing terminology and methods for digitally characterising the acoustic biosignal.

Chapter 5 briefly reviews the speech-depression databases available for analysis and provides descriptions and details for all databases used in the experiments herein.

Chapter 6 presents experiments using different articulatory parameters as measures for acoustic speech data selection. Moreover, these experiments investigate the hypothesis that increased articulatory complexity results in enhanced automatic depression classification performance.

Chapter 7 further examines specific vowel articulatory parameters, including the first analysis of linguistic-stress based vocalic features in both English and German.

Chapter 8 investigates data selection to reduce acoustic variability via token word analysis. Experiments explore commonly spoken formulaic language, in particular filler words, to determine whether these words hold more depression-discriminative properties and are useful in predicting depression severity.

Chapter 9 investigates linguistic complexity measures as proxies for more discriminative data selection for depression. A novel measure for quantifying articulation effort is introduced, and experiments are conducted by selecting spoken phrases with higher linguistic complexity.

Chapter 10 presents experiments using manual and automatic affect ratings, wherein emotion-based data selection using novel rating thresholds is investigated.

Chapter 11 explores the three main tiers of spoken language (i.e. acoustic, linguistic, affect) in the form of speech disfluencies, voicing, affect target word location, point-of-view, and valence-based feature sets. Moreover, it examines how the careful, deliberate construction and/or selection of assessment protocol sentences can influence automatic depression classification performance.

Chapter 12 presents a practical, systematic study of smartphone device considerations for automatic speech-based depression classification. Smartphone device manufacturers and specific smartphone series are examined via normalised modelling.

Chapter 13 summarizes the thesis and provides prospective future directions in the field of automated speech mental health research and applications.

1.4 Major Contributions

Research presented in this thesis delivers original contributions in the field of speech processing with a focus on data selection methods for measuring depression severity. The major novel observational and technical contributions are summarized as follows:

Novel Observations

• Depressed speakers are found to exhibit shorter vowel durations and less variance for ‘low’, ‘back’, and ‘rounded’ vowel positions in both English and German.

• Results using a small set of linguistic stress-based features derived from multiple vowel articulatory parameter sets showed statistically significant gains in depression classification performance over baseline approaches.

• Filler words can be equally or more effective for depression prediction than using entire utterances.

• The first investigation of how emotional information expressed in speech (e.g. arousal, valence, dominance) contributes to depression classification.

• A novel examination of elicitation based on linguistic positioning of affective token words demonstrated that for read sentences, affective keywords placed at the beginning of a sentence rather than the middle or end of a sentence resulted in increased depression classification performance for acoustic-based speech features.

• An evaluation of how different types of emotionally charged keywords (e.g. positive, neutral, negative) impact verbal behaviour and automatic depression classification performance in a series of reading tasks. Experimental non-depressed and depressed speaker results indicate that for acoustic and voicing features, neutral and positive sentences provided more discriminatory feature information than negative sentences.

Novel Techniques

• A novel articulatory age-of-mastery norms measure for quantifying articulation effort is introduced, and when applied experimentally to spontaneous speech it produces gains in depression classification accuracy.

• In the first acoustic-transcript insights of their kind for speech in general, experimental results demonstrate that by selecting speech with higher articulation effort, linguistic complexity, or affective arousal/valence, improvements in acoustic speech-based feature depression classification performance can be achieved.

• Disfluency tracking is proposed, including the first application of speech error analysis for depression classification systems using read speech together with affective stimuli. Experimental disfluency results indicate that a statistically significantly larger number of disfluencies occur with depressed speakers during negative-valence read sentences than during neutral or positive ones.

• A new acoustic gestural effort measure is proposed that calculates articulation complexity based on phoneme-to-phoneme transitions and Chomsky-Halle phonetic markedness. By utilising acoustic speech data with higher densities of specific phonetic markedness and gestural effort, improvements in depressed/non-depressed classification accuracy and F1 scores were recorded.

• A new multi-affect (e.g. positive, neutral, negative) speech-based feature fusion technique resulted in considerable depression classification performance gains for both manual and automatic approaches, in some instances reaching 100% accuracy. This was the first time a multi-affect fusion technique had been proposed for speech-based depression analysis (an illustrative sketch of the general idea follows this list).

• The first systematic study of how smartphone device variability affects the performance of speech-based classification of depression, and how normalisation approaches can mitigate this variability. Smartphone devices were found to more negatively impact systems based on spectral features than those based on prosodic/glottal features with regard to depression classification performance.
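
As a rough sketch of the per-valence fusion idea above, see below: the statistics, the pooling rule, and the fixed negative/neutral/positive grouping are illustrative assumptions rather than the exact method used in later chapters. Sentence-level feature statistics are computed separately within each valence group and then concatenated into a single multi-affect feature vector:

    import numpy as np

    def sentence_stats(feature_frames):
        """Summarise one sentence's frame-level features (frames x dims) by mean and standard deviation."""
        arr = np.asarray(feature_frames)
        return np.concatenate([arr.mean(axis=0), arr.std(axis=0)])

    def multi_affect_fusion(sentences):
        """Pool sentence-level statistics within each valence group, then concatenate the three groups."""
        groups = {"negative": [], "neutral": [], "positive": []}
        for frames, valence_label in sentences:
            groups[valence_label].append(sentence_stats(frames))
        pooled = [np.mean(groups[g], axis=0) for g in ("negative", "neutral", "positive")]
        return np.concatenate(pooled)

    # Example: three read sentences (random frame-level features), one per valence category
    rng = np.random.default_rng(1)
    sentences = [(rng.normal(size=(50, 4)), "negative"),
                 (rng.normal(size=(40, 4)), "neutral"),
                 (rng.normal(size=(60, 4)), "positive")]
    print(multi_affect_fusion(sentences).shape)  # (24,) = 3 groups x (4 means + 4 standard deviations)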


1.5 Publication List

Journal Papers

Stasak, B. & Epps, J., 2018. Automatic Depression Classification Based on Affective Read Sentences: Opportunities for Text-Dependent Analysis. Speech Communication, submitted for review. This publication forms the majority of Chapter 11.

Stasak, B., Epps, J., Goecke, R., 2018. An Investigation of Linguistic Stress and Articulatory Vowel Characteristics for Automatic Depression Classification. Computer Speech & Language, vol. 53, pp. 140-155. This publication forms the majority of Chapter 7.

Conference Papers

Stasak, B., Epps, J., Lawson, A., 2018. Pathologic Speech and Automatic Analysis for Healthcare Applications (Batteries Not Included?). In: Proceedings of the 17th Australasian Speech Science & Technology (ASST) Conference, Sydney, Australia, 2018. This publication forms the majority of Chapter 5.

Stasak, B., Epps, J., 2017. Differential Performance of Automatic Speech-Based Depression Classification Across Smartphones. In: Proceedings of the 7th International Conference on Affective Computing and Intelligent Interaction Workshops and Demos (ACIIW), San Antonio – TX, United States of America, pp. 171-175. This publication forms the majority of Chapter 12.

Stasak, B., Epps, J., Lawson, A., 2017. Analysis of Phonetic Markedness and Gestural Effort Measures for Acoustic Speech-Based Depression Classification. In: Proceedings of the 7th International Conference on Affective Computing and Intelligent Interaction Workshops and Demos (ACIIW), San Antonio, United States of America, pp. 165-170. This publication forms the majority of Chapter 6.

Stasak, B., Epps, J., Goecke, R., 2017. Elicitation Design for Acoustic Depression Classification: An Investigation of Articulation Effort, Linguistic Complexity, and Word Affect. In: Proceedings of INTERSPEECH, Stockholm, Sweden, pp. 834-838. This publication forms the majority of Chapter 9.


Stasak, B., Epps, J., Cummins, N., 2016. Depression Prediction Via Acoustic Analysis of Formulaic Word Fillers. In: Proceedings of the 16th Australasian Speech Science & Technology (ASST) Conference, Parramatta, Australia, pp. 277-280. This publication forms the majority of Chapter 8.

Stasak, B., Epps, J., Cummins, N., Goecke, R., 2016. An Investigation of Emotional Speech in Depression. In: Proceedings of INTERSPEECH, San Francisco, United States of America, pp. 485-489. This publication forms the majority of Chapter 10.

Dang, T., Stasak, B., Huang, Z., Jayawardena, S., Atcheson, M., Hayat, M., Le, P., Sethu, V., Goecke, R., Epps, J., 2017. Investigating Word Affect Features and Fusion of Probabilistic Predictions Incorporating Uncertainty in AVEC 2017. In: Proceedings of the 7th International Workshop on Audio/Visual Emotion Challenge (AVEC ’17), San Francisco, United States of America, pp. 27-35.

Huang, Z., Stasak, B., Dang, T., Gamage, K., Le, P., Sethu, V., Epps, J., 2016. Staircase Regression in OA RVM, Data Selection and Gender Dependency in AVEC 2016. In: Proceedings of the 6th International Workshop on Audio/Visual Emotion Challenge (AVEC ’16), Amsterdam, Netherlands, pp. 19-26.

Huang, Z., Dang, T., Cummins, N., Stasak, B., Le, P., Sethu, V., Epps, J., 2015. An Investigation of Annotation Delay Compensation and Output-Associative Fusion for Multimodal Continuous Emotion Prediction. In: Proceedings of the 5th ACM International Workshop on Audio/Visual Emotion Challenge (AVEC ’15), Brisbane, Australia, pp. 41-48.


Chapter 2 CLINICAL DEPRESSION BACKGROUND

This chapter provides background on depression and the requirements for its clinical medical diagnosis. Depression subtypes are identified, noting many differences in their diagnostic prevalence across populations and in their symptomatic behavioural characteristics. Widely used standard clinical depression assessment techniques are discussed, with particular attention to subjective self-assessment tools and their potential weaknesses as diagnostic instruments. In addition, emerging objective technology-based diagnostic techniques for depression are recognized and reviewed. This chapter provides readers with important insights into depression, including prevalent behavioural changes, subtypes, and the diagnostic tools currently available.

2.1 Depression

After a negative experience (e.g. loss of loved one, bad grade, financial setback), it is normal for people to incur an emotional state of temporary sadness. However, clinicians make a careful distinction between temporary kinds of grief versus prolonged debilitating sadness in the form of a mood disorder (WHO, 2009). As a depression severity increases in a person, so does the cognitive and physical burden. For individuals suffering from clinical depression, the disorder adversely affects their mood, behaviour, cognition, and body’s wellbeing. Depression can provoke a person to fixate on a continuous stream of negative thoughts, including those of helplessness and hopelessness (PT, 2018; Peeters et al., 2003; Bylsma et al., 2011). Further, depression often prompts chronic physical discomfort, such as body fatigue, back/limb/joint pain, gastrointestinal issues, and shifts in libido (Mathew & Weinman, 1982; Ohayon & Schatzberg, 2003; Trivedi, 2004). Associated with this, depression is also reported to drastically increase risks for other diseases and disorders. For example, depressed populations have significantly higher risks for osteoporosis (Cizza et al., 2009), coronary heart disease (Barth et al., 2004), lupus (Zhang et al., 2017a), diabetes (Das et al., 2013), multiple sclerosis (Feinstein et al., 2014), and particular kinds of cancer (e.g. oropharyngeal,

pancreatic, breast, lung) (Massie, 2004). Generally, a large number of people have more than one medical condition present. For instance, 25% of adults have multiple chronic conditions, but after the age of 65 years, this rate rapidly increases to 75% (CDC, 2018). Hence, clinicians must be very thorough when initially assessing patients for depression because there are many different disorders and/or illnesses that have overlapping symptoms (Aljarad et al., 2008).

During clinical diagnosis, clinicians initially look for telltale symptoms of depression that are clearly defined by the Diagnostic and Statistical Manual of Mental Disorders (DSM-IV-TR) (ADA, 2000). The DSM-IV-TR, compiled and published by the American Psychiatric Association (APA), is an extensively used descriptive diagnostic resource that outlines standardized medical criteria for the classification of mental disorders (APA, 2000, 2013). Within the DSM-IV-TR, there are nine different major symptoms that denote a depression disorder. If five or more of these symptoms are present in a patient for longer than a two-week period, the patient is considered to have a depression disorder. For a clinical depression diagnosis, at least one of the core symptoms must be a depressed mood or a decrease of interest/pleasure. As a DSM-IV-TR prerequisite, the overall symptoms must cause significant distress or impaired function. Furthermore, these depressive symptoms cannot be directly attributed to another medical disorder, bereavement, or the influence of substances. According to the DSM-IV-TR (ADA, 2000), the fundamental diagnostic symptoms for depression are the following (a schematic sketch of this counting rule is given after the list):

• Depressed mood most of the day or nearly every day.
• Markedly diminished interest or pleasure in all or almost all activities most of the day or nearly every day.
• Significant unintentional weight loss or gain, or increase/decrease in appetite.
• Insomnia or hypersomnia nearly every day.
• Noticeable psychomotor agitation or retardation nearly every day.
• Fatigue or loss of energy nearly every day.
• Feelings of worthlessness or either excessive or inappropriate guilt nearly every day.
• Diminished ability to think, concentrate, or make decisions nearly every day.
• Recurrent thoughts of death, recurrent suicidal ideation without a specific plan, or a suicide attempt and specific plan.
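The counting rule described above can be illustrated schematically as follows. This is only a toy sketch of the five-of-nine, two-week, at-least-one-core-symptom logic; the symptom labels are shorthand, and the DSM-IV-TR exclusion criteria (e.g. symptoms attributable to substances or bereavement) are deliberately omitted.

```python
# Schematic illustration (not a diagnostic tool) of the DSM-IV-TR counting rule
# described above: five or more of the nine symptoms, present for at least two
# weeks, with at least one core symptom (depressed mood or loss of interest).
CORE = {"depressed_mood", "loss_of_interest"}

def meets_symptom_criteria(symptoms_present, duration_weeks):
    """symptoms_present: set of shorthand labels for the nine symptoms above."""
    enough_symptoms = len(symptoms_present) >= 5
    has_core = bool(symptoms_present & CORE)
    long_enough = duration_weeks >= 2
    return enough_symptoms and has_core and long_enough

example = {"depressed_mood", "insomnia", "fatigue", "worthlessness", "poor_concentration"}
print(meets_symptom_criteria(example, duration_weeks=3))  # True
```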

In addition to the DSM-IV-TR depression criteria, usually during an interview with a patient, a clinician will also investigate or consider previous changes in his/her patient’s life. This includes asking questions about home life, work, and other potential stress environments. Per standard practice, a clinician will also take into account a patient’s demographic background and family health-related history. Experts on depression have noted that an individual’s age, attitude, culture, coping strategies, environment, ethnicity, gender, genetics, height, occupation, lifestyle, personality, and socioeconomic status have varying impacts on the incidence of depression among different populations (Brown, 2013; Cusin et al., 2009; Garnefski et al., 2002; Holahan & Moos, 1987; Karasz, 2005; Krupnick & Cherkasova, 2014; Nolen-Hoeksema, 1991; Yorkbik et al., 2014). These predispositions, also referred to as diatheses, contribute strongly to a person’s vulnerability to depression. When a negative life event occurs, each diathesis present increases the likelihood that the person will manifest depression. There are many different internal and external catalysts that spur on symptoms related to depression. The foremost catalysts of depression are shown in Fig. 2.1 below. Depression is often caused by multiple factors and by a series of stressful or negative events over variable time periods. Thus, characterizing the origins of depression in a uniform manner remains difficult.

Figure 2.1. Word-cloud list of common, troubling life-events along with potential diathesis factors known to lead to depression onset according to studies (Patel et al., 2002; Shapero et al., 2014). Common stressful life-events are indicated by a large text size, whereas diathesis factors are indicated by a small text size.

Biological indicators for depression are complex because they often comprise genetic predisposition, prior/current health status (i.e. additional disorder/s, disease), and environment factors. The biological origins of depression have been investigated using neurological imagery data. There is evidence supporting familial genetic ties to depression, and that depression is directly associated with dysfunctions in the cortical-limbic systems (Beck, 2008; Beck & Alford, 2009). One particular brain area of interest in depression is the amygdala, which is largely in command of processing memory, decision-making, and affective responses (Yilmazer-Hanke, 2012). Beck


(2008) discovered that a hypersensitive amygdala is associated with patterns of negative cognitive biases that can constitute a higher-risk factor for depression. Moreover, they ascertained that a combination of a hyperactive amygdala and hypoactive prefrontal region of the brain resulted in patients with a higher incidence of depression and cognitive deficits, such as spoken language difficulties. For some depressed individuals, biological susceptibility to depression is concurrent with deviations in presynaptic/postsynaptic neurotransmitter transition and function (Syvälahti, 1994). Studies (Audenaert et al., 2002; Elliott et al., 1997; Okada et al., 2003) have shown that people with depression exhibit reasoning skill deficits. These depression-related deficits coincide with diminished neural activation in brain regions vital for cognitive and language processing; and further, an absence of activity in neural regions responsible for emotional processing. Over the decades, researchers have studied subsymptom psychomotor disturbances exhibited by individuals with depression. The Motor Agitation and Retardation Scale (MARS) and Salpetriere Retardation Rating Scale (SRRS) (Buyukdura et al., 2011; Sobin et al., 1998) were manual tests designed to specifically measure the degree of psychomotor disturbance severity in depressed patients while also delineating between retardation and agitation subsymptoms. It is important to note that both the MARS and SRRS involve an assessment of a patient’s speech abilities. The effects of psychomotor retardation cause slower cognitive response, motor control, and poorer verbal abilities (Buyukdura et al., 2011; Flint et al., 1993; Kuny & Stassen, 1993; Nilssone, 1988; Sabbe et al., 1996; Szabadi et al., 1976). Furthermore, psychomotor retardation directly influences the proper function of the vocal folds and articulators, leading to a decrease in their controlled agility and increase in their incoordination. According to Caligiuri and Ellwanger (2000), psychomotor retardation is exhibited by approximately 60% of people with mild to severe clinical depression diagnosis. Clinicians also have observed psychomotor agitation within depressed populations, which is the opposite on the spectrum of psychomotor retardation. Psychomotor agitation is often broadly described as excessive rapid gesturing, accelerated motor activity, and verbose activity (Day, 1999; Ulrich & Harms, 1985). According to the study by Day (1999), which included a broad analysis of previous literature on psychomotor agitation, its exhibited prevalence in clinically mild to severely depressed patients is in a range of 17% to 72%, depending on the depression subtype diagnosis. Further, Day (1999) noted that in many previous depressive case studies only psychomotor disturbance was recorded and excluded details as to whether it was retardation or agitation.


The DSM-IV-TR (ADA, 2000) surprisingly does not specifically address certain types of speech and/or language behaviours as diagnostic cues for depression; rather, it uses a broad retardation-agitation descriptor for all physiological behaviours. Therefore, to procure more precise information regarding a patient’s speech and other specific behaviours, clinicians rely on the criteria set by the Mental Status Examination (MSE). The MSE is a structured guide for observing a patient’s behaviours, appearance, attitude, appropriate affect, concentration, cognition, and judgment (Trzepacz & Baker, 1993). As shown in Table 2.1, there is a specific category on speech-related criteria. Clinicians use the MSE to help flag abnormal speaking behaviours for further medical investigation.

Table 2.1 Summary of Mental Status Examination (MSE) speech descriptors commonly used during patient assessments.
Category               Pattern
Rate of Speech         Slow | Rapid
Flow of Speech         Hesitant | Long Pauses | Stuttering
Intensity of Speech    Loud | Soft
Clarity                Clear | Slurred
Liveliness             Pressured | Monotonous | Explosive
Quantity               Verbose | Scant

2.2 Depressive Disorder Subtypes

Various subtypes of depression disorder are generally categorized into two contrasting pairs of subsets: non-melancholic/melancholic and unipolar/bipolar. The DSM-IV-TR guidebook makes a distinction between these depression disorder subsets to help group, differentiate, and identify subtypes based on their symptomatology. Depression subtype symptomatology disparities, as shown in Table 2.2, are crucial in choosing the proper identification, diagnosis, and course of clinical treatment (APA, 2000; Benazzi, 2006) because, depending on the subtype, specific treatments are more effective than others.


Table 2.2 Summary of clinical depression diagnosis subtypes and their characteristics indicated by different filled-in shades: present (dark blue); present sometimes (light blue); and not present (white). Minimal symptom durations are based on information taken from APA (2000), PT (2018) and YBB (2016).
[Table 2.2 rows: Major Depression Disorder, Dysthymic Disorder, Psychotic Depression, Cyclothymic Disorder, Bipolar Disorder, Seasonal Affective Disorder, Perinatal Depression, Atypical Depression, Catatonic Depression. Columns: Low Mood, Loss of Interest, High Degree of Severity, Paranoia and/or Mood Swings, Multiple Physical Symptoms, Psychomotor Agitation, Psychomotor Retardation, Minimal Duration (weeks); listed minimal durations include >104, >26, >13, and >2 weeks.]

2.2.1 Non-Melancholic A non-melancholic subset of depression is primarily psychological, rather than biological. Roughly 90% of people diagnosed with depression have a non-melancholic form. This form of depression is event-driven and occurs during or after stressful times. The key attributes of the non-melancholic form of depression are moody social-behavioural impairments, such as anti-social behaviour, decreases in verbal language and positive outlook, which last longer than a two-week period. In regard to treatment for non-melancholic forms of depression, the most commonly recommended clinical treatments are cognitive behavioural therapy, interpersonal therapy, psychotherapy, mindfulness meditation, and counseling.

2.2.2 Melancholic Melancholic depression is a subset and subtype of depression found in only approximately 2% to 10% of diagnosed cases of depression (APA, 2000; BDI, 2018). However, within the affected population, it claims a higher depression severity than non-melancholic. The root cause of


melancholic depression is biological, stemming from an imbalance of neurotransmitters (i.e. dopamine, serotonin, noradrenaline). The melancholic form of depression affects both mood and physical bodily function. Its general attributes include extremely depressed mood, anhedonia (i.e. the inability to find pleasure), lethargy, low energy, poor concentration, and agitated movements – including psychomotor disturbance. Unlike the non-melancholic form, the melancholic form is generally unresponsive to psychotherapy/counseling (BDI, 2018). Consequently, for melancholic-diagnosed patients, medications (e.g. tranquillizers, antidepressants, mood stabilizers) are often also prescribed as treatment. In some extreme cases, if antidepressants do not work for a patient, Electroconvulsive Therapy (ECT) may be prescribed as a treatment option.

2.2.3 Unipolar (Major Depressive Disorder) Unipolar disorder, also known as Major Depressive Disorder (MDD), encompasses individuals who have prolonged periods of sadness, low self-esteem, fatigue, cognitive interference, and decreased interest in once enjoyable activities. Unipolar is used to describe a depressive episode that remains emotionally flat over a course of time. Dysthymic disorder is similar to MDD; however, it is typically less severe and extends over a much longer time period. A person must have this milder depression documented for over two years to be diagnosed with dysthymia (PT, 2018). The DSM-IV-TR recognizes five further subtypes of unipolar depression: melancholic depression, seasonal affective disorder, perinatal depression, atypical depression, and catatonic depression.

2.2.4 Bipolar Depression Bipolar is a subset and subtype of depression, which affects approximately 2% of the total world population (PT, 2018). Bipolar disorder was once commonly referred to as ‘manic depression’ because it involves intermittent periods of depression, mania, and normal moods (PT, 2018). Mania is described as intensity extremes in behavioural agitation. Unlike unipolar depression, bipolar has behavioural symptoms that include intense elation, bountiful energy, excessive racing thoughts, insomnia, inability to focus on task, and increased rate of speech. Patients from the bipolar depression subset are more likely to involve hypersomnia and psychomotor retardation, whereas by contrast patients from the aforementioned unipolar subset are more likely to experience insomnia and psychomotor agitation (Benazzi, 2006). In extreme cases, an individual with bipolar disorder can have psychosis.


2.3 Standard Clinical Assessment Techniques

An analysis by Parslow & Jorm (2000) found that general practitioners (e.g. family doctors, medical interns) diagnose approximately 76% of depression disorders. However, general practitioners’ overall reliability in properly identifying depression disorders in patients is marginal according to diagnostic studies completed by Mitchell et al. (2009), Parslow et al. (2011), and Young et al. (2001). For instance, in a primary care setting, depression disorders are identified with low sensitivity by general practitioners, having only 33% to 50% correct diagnosis accuracy (Aljarad et al., 2008; Freeling et al., 1985; Mitchell et al., 2009; Perez-Stable et al., 1990; Young et al., 2001). Thus, it appears that the usefulness of any human interview method and/or assessment tool for depression evaluation is only as good as the clinician implementing it and his/her patient’s symptomology input. Segrin et al. (1998) found that several different clinical-related aspects contribute to the poor diagnosis of depression disorders, such as limitations in specialized training (e.g. interview methodology, mental health awareness, cultural sensitivity), costs (e.g. multiple assessments, additional referrals), clinician-patient session time constraints (e.g. length or number of sessions), clinician-patient relationship (e.g. family physician, new patient) and patient cooperation (e.g. veracity, social stigma, occupational impact). The National Institute of Mental Health has suggested that patient assessment diagnostic protocols be re-investigated (NIMH, 2013). Furthermore, noninvasive automatic technologies were recently proposed by Philip et al. (2017) to further ameliorate clinicians’ ability to accurately diagnose and monitor depression disorders. A major difficulty is that similar abnormal symptoms are shared and observed across many different disorders/diseases. For example, both depression and anxiety disorders can co-exist, and be difficult to clearly distinguish during diagnosis (Clark & Watson, 1991). Anxiety and depression are believed to co-exist along a continuum, from normal to severe, allowing potential simultaneous overlap of these two disorders (Endler & Kocovski, 2001). Again, in general populations, many individuals have more than one medical condition present (CDC, 2018). Unlike many other illnesses, depression is particularly difficult to diagnose because clinicians rely heavily on subjective psychological phenomena rather than more objective physiological phenomena. In fact, research by Rogers et al. (1993) and Monaro et al. (2018) indicated that psychological measures could be easily feigned in a clinical setting. Further, Saberi et al. (2013) noted that in legal cases, behavioural symptoms including motor behaviours were the most


frequently feigned (78%). The malingering or concealment of depression symptoms is a relatively new concern in healthcare; wherein due to an increase in falsified insurance claims and depressionrelated negligence incidents, the number of forensic psychiatry feigned detection method studies has also increased (Monaro et al., 2018). An issue with most classical clinical evaluation methods is their low reliability for providing information to help clinicians discern real versus coached symptoms. A study by Nezu et al. (2009) indicated that worldwide, there are nearly 300 different clinically validated measures for calculating depression severity. Among the most common method used by clinicians to help identify depression in individuals are verbal structured interview method or selfreport evaluations (Alexopoulos et al., 1988; Beck et al., 1996; Kroenke et al., 2001, 2008; Rush et al., 2005). In Rettew et al. (2009, 2010), a comparative analysis of nearly 40 depression studies on interview versus self-report evaluations demonstrated weak correlations with each other. Many of these methods used by clinicians for measuring depression severity rely on subjective evaluative discourse observations and non-linear depression rating scales, several of which were designed decades ago (Bachman & O’Malley, 1977; Beck et al., 1996; Hamilton, 1960; Kroenke et al., 2001; Montgomery & Asberg, 1979; Rush et al., 2005). The non-linearity of depression ratings provides smaller differentiation between the clinical non-depressed and depressed categories along with depression subtypes. For instance, an assessment score that is twice the value of another score does not mean that the individual is twice as severely depressed. Furthermore, as noted many years ago in Prusoff et al. (1972), there are inherent discrepancy concerns in terms of self-assessment sensitivity per different individuals. For instance, a depressed patient may underestimate or overestimate his/her degree of disturbance, which can significantly impact the self-assessment severity results, particularly in the case where there are only a few questions asked (see Table 2.3 below). This is all the more reason that in many practical circumstances multiple depression assessment protocols are given, especially for borderline patients. Examples of common depression rating scale criteria structured interview and self-rated evaluations are the following: Rosenberg Self-Esteem Scale (RSE); Hamilton Rating Scale for Depression (HAMD); Beck Depression Inventory II (BDI-II); Montgomery-Asberg Depression Rating Scale (MARSD); Quick Inventory of Depressive Symptomology Self-Report (QIDS-SR); and Patient Health Questionnaire (PHQ-8/PHQ-9). The most popular self-report assessments are summarized below in Table 2.3. Most self-assessment approaches were originally calibrated on adults vetted by


psychiatrists with extensive mental health backgrounds, and not calibrated on non-vetted populations assessed by general practitioners.

Table 2.3 Comparison of clinical diagnostic depression self-assessments: Beck Depression Index (BDI-II) (Beck et al., 1961, 1996), Patient Health Questionnaire-8 (PHQ-8) (Kroenke et al., 2009), Patient Health Questionnaire-9 (PHQ-9) (Kroenke et al., 2001), and Quick Inventory of Depressive Symptomology Self-Report (QIDS-SR) (Rush et al., 2005).

Test      Updated   Criteria   Severity Score Range   Number of Questions   Time     Measure Symptoms Directly
BDI-II    1996      DSM-IV     0 – 63                 21                    8 min.   Yes
PHQ-8     2008      DSM-IV     0 – 24                 8                     1 min.   No
PHQ-9     1999      DSM-IV     0 – 27                 9                     1 min.   No
QIDS-SR   2003      DSM-IV     0 – 27                 16                    3 min.   Yes

In an effort to collect psychological phenomena, self-assessments have become a common tool for helping diagnose the severity of depression (see Table 2.3). Depression self-assessments are popular among healthcare professions due to their low cost, simplicity, quick assessment, and minimal training requirements. While the reliability of depression self-assessments is relatively high (Nuevo et al., 2009), these types of subjective assessment have come under more mental health and medical expert scrutiny over the last two decades (Doward, 2013). Some experts have questioned the reliability of self-assessments due to patient over-familiarity and his/her reading ability (Cusin et al., 2009). Also, considering that depression is such a prevalent and serious illness (i.e. one that can lead to self-harm), it is inherently flawed that during a self-assessment a clinician presumes the patient’s honesty. There are many obvious reasons (e.g. unemployment, certification denial, loss of child custody, social stigma) why an individual might potentially hide or manipulate his/her selfevaluation scores during routine mental health screenings found in occupations related to defense, law-enforcement, medical, and transportation industries (Corrigan & Wassel, 2008; Lai et al., 2000). It has been shown in patients with some psychological disorders (e.g. psychotic depression, obsessional personalities) that systematic discrepancies were reflected during patient self-reporting, including exaggerated or minimized psychiatric symptoms (Boyd et al., 1982; Carter et al., 2010; Paykel et al., 1973; Prusoff et al., 1972).

2.3.1 Beck Depression Inventory In 1961, the Beck Depression Inventory (BDI-I) was originally developed as a self-assessment for measuring depression severity (Beck et al., 1961). Decades later, in 1996, this self-assessment was revised to the BDI-II (Beck et al., 1996) to reflect psychological/somatic manifestations over a two-week period and with reduced linkage to any particular theory of depression. The BDI-II is currently one of the most commonly used worldwide measures of depression severity (Wang et al., 2013a). The modernized BDI-II contains 21 depressive symptom-related items (see Appendix A), wherein a patient provides numerical ratings per item congruent with his/her personal subjective judgment. Each BDI-II item has a possible score range of 0 to 3. These 21 items are also categorized by depressive sub-symptoms, which include cognitive, affective, and somatic behavioural observations. In particular, the BDI-II focuses on core negative self-perception (e.g. self-dislike, self-criticism). It investigates negative self-evaluation symptoms and weights these sub-symptom depressive categories accordingly when calculating the final depression severity score. Depression diagnostic studies on the BDI-II have found a relatively high degree of internal consistency and statistical validity/reliability (Kühner et al., 2007; Lee et al., 2017; Wang et al., 2013a). The total score range for the BDI-II is 0 to 63, wherein four different depression severity classes are defined: minimal (0-9), mild (10-18), moderate (19-29), and severe (30-63). Thus, the higher the BDI-II final score, the greater the depression severity. Generally, any score above 10 indicates a severity level of depression high enough for consideration of MDD diagnosis and referral. The BDI-II is available in many languages for a minimal form-fee cost (e.g. ~$10 USD) and can be completed with little instructional training in a matter of a few minutes.
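A minimal sketch of the BDI-II scoring bands quoted above is given below, assuming 21 item responses each scored 0–3. The item responses are hypothetical and this is not a clinical implementation.

```python
# Minimal sketch: sum 21 BDI-II item scores (each 0-3) and map the total to the
# severity bands quoted above. Item responses are hypothetical toy values.
BDI_II_BANDS = [(0, 9, "minimal"), (10, 18, "mild"), (19, 29, "moderate"), (30, 63, "severe")]

def bdi_ii_severity(item_scores):
    assert len(item_scores) == 21 and all(0 <= s <= 3 for s in item_scores)
    total = sum(item_scores)
    for low, high, label in BDI_II_BANDS:
        if low <= total <= high:
            return total, label

print(bdi_ii_severity([1] * 21))  # (21, 'moderate')
```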

2.3.2 Patient Health Questionnaire The Patient Health Questionnaire (PHQ) was originally part of the much more comprehensive Primary Care Evaluation of Mental Disorders (PRIME-MD) (Spitzer et al., 1999). However, in the mid-1990s, it was devised specifically as the PHQ-8, a self-assessment tool for helping measure depression severity consisting of 8 items (see Appendix B). A few years later, the PHQ-9 was also developed with an additional item (Kroenke et al., 2001). Unlike the BDI-II, the PHQ-8 and PHQ-9 do not investigate depression sub-symptoms. Instead, they are essentially items based on the DSM-IV sub-symptoms (ADA, 2000; Kroenke et al., 2001; Spitzer et al., 1999). Depression diagnostic studies on both the PHQ-8 and PHQ-9 have found a relatively high degree of internal consistency and statistical validity/reliability (Kroenke et al., 2009; Ory et al., 2013; Razykov et al., 2012; Zhang et al., 2013).


The PHQ-8 and PHQ-9 differ only in that the latter has an additional item on feelings of suicide, and thus the score ranges are slightly different. Each PHQ item has a possible score range of 0 to 3, so the PHQ-8 has a score range of 0 to 24, whereas the PHQ-9 has a score range of 0 to 27. The PHQ-8 and PHQ-9 define five different depression severity classes: none – minimal (0-4), mild (5-9), moderate (10-14), moderately severe (15-19), and severe (20-24|27). Again, the higher the PHQ score value, the greater the depression severity. Generally, any score above 10 indicates a severity level of depression high enough for consideration of MDD diagnosis and referral. The PHQ-8 and PHQ-9 are freely available online in many languages and can be completed with little instructional training in a matter of a few minutes.

2.3.3 Quick Inventory of Depressive Symptomology The Quick Inventory of Depressive Symptomology Self-Report (QIDS-SR) was developed as another fast, clinical self-assessment for measuring depression severity (Rush et al., 2005). The QIDS-SR contains 16 depression symptom-related items (see Appendix C), wherein a patient provides numerical ratings per item based on his/her personal subjective judgment. Each QIDS-SR item has a possible score range of 0 to 3. These 16 items are also categorized by depressive sub-symptoms, which include insomnia/hypersomnia, affective, appetite, somatic behaviour, and suicidal ideation observations. Similar to the BDI-II, the QIDS-SR also focuses on negative self-perception (e.g. self-dislike, self-criticism). It investigates negative self-evaluation symptoms and weights these sub-symptom depressive categories accordingly when calculating the final depression severity score. Depression diagnostic studies on the QIDS-SR have found a relatively high degree of internal consistency and statistical validity/reliability (Mergen et al., 2011). The total score range for the QIDS-SR is 0 to 27 (see Table 2.3), wherein five different depression severity classes are defined: normal (0-5), mild (6-10), moderate (11-15), severe (16-20), and very severe (21-27). Thus, the higher the QIDS-SR final score, the greater the depression severity. Generally, any score above 10 indicates a severity level of depression high enough for consideration of MDD diagnosis and referral. The QIDS-SR is freely available online in many languages and can be completed with little instructional training in a matter of a few minutes.


2.4 Emerging Assessment Techniques

As briefly mentioned in Chapter 1, there are several potential non-traditional techniques for depression analysis. These alternative diagnostic approaches, which include biological, neurobiological, physiological, and statistical techniques are more objective in nature when compared with standard clinical depression self-assessment techniques, and generally require a laboratory environment for analysis. It is important to note that none of these non-standard assessment techniques are currently used solely for depression diagnostic purposes, and the majority of these are still regarded as experimental techniques. There are still many unanswered questions as to the effectiveness, reliability, and repeatability of the following techniques mentioned herein. Moreover, due to the costs and high-level of multiple expertise many of these techniques are presently considered unpractical for large-scale field implementation. Also, a general major issue with many alternative depression diagnostic techniques is that for severely depressed patients with suicidal tendencies, time is of the essence and people sometimes cannot wait weeks before they are diagnosed. Many people with severe depression require immediate diagnosis and clinical intervention rather than pre-scheduled intervention and/or multiphase laboratory evaluator process. Also, many laboratory-based in-office approaches often only capture the in-the-moment representation, and thus do not capture morning-to-night, day-to-day, or week-to-week differences unless costly multi-session evaluations are completed. Consequently, in particular, biological and neurological types of techniques for depression analysis are highly contingent on when the test assessment is completed.

2.4.1 Biological Unlike many disease/disorders (e.g. diabetes, hepatitis, heart disease, Lyme disease, HIV-AIDS) that have biological tests with significantly high-degrees of diagnostic accuracy, currently there are no known reliable objective biological diagnostic tests for depression (Belmaker & Agam, 2008). However, researchers have not yet abandoned the possibility of a biological test for depression. Most biological procedures are invasive, require multi-disciplinary expertise, carry high costs, a longer analysis-diagnosis time frame, and currently cannot be analyzed without human assistance. For depression disorder recognition, laboratory biological analysis techniques have been explored with varying degrees of success (Bilello et al., 2014, Smith et al., 2013). One biological approach for identifying depression is hormonal-based suppression testing. An example is the dexamethasone


hormonal test, which involves invasively administering a low-dose injection of dexamethasone. This harmless injection frequently reduces corticotropin and cortisol hormone levels in healthy populations, but not in depressed populations (APATFTP, 1987; Arana et al., 1985). However, the dexamethasone test does not always produce the same discriminative effect for non-depressed/depressed individuals, and it has thus been deemed too inconsistent to provide an accurate reading of depression severity (Fountoulakis et al., 2008; Stewart et al., 2008). More comprehensive biological depression diagnostic multi-biomarker approaches include neurotrophic, metabolic, inflammatory, and hypothalamic pituitary adrenal axis pathway analysis (McDermitt et al., 2001; Redei et al., 2014; Renshaw et al., 2009; Sung et al., 2005). These multi-component biomarker evaluations have also demonstrated promise for depression disorder recognition (Smith et al., 2013). However, the multi-biomarker approach is generally limited to the diagnosis of biologically caused subtypes of depression, and not psychologically derived subtypes. Biological genomic techniques have been examined as well for depression severity analysis. For genomic analysis, an individual’s deoxyribonucleic acid (DNA) sample can be obtained by invasive blood collection or a pulled hair follicle, or via a less invasive swab sample from the inner cheek of the mouth. This approach relies on genetic code anomalies found in the DNA double helix (LeNiculescu et al., 2009; Spijker et al., 2010; Sullivan et al., 2000), wherein non-depressed versus depressed genomic population profiles are compared. In Spijker et al. (2010), it was shown that for depression diagnosis genomic methods had good sensitivity, but poor specificity. Further, the genomic approach is only helpful for biologically caused subtypes of depression.

2.4.2 Neurobiological Neurobiological techniques have been extensively used to analyze depression disorders. Generally, neurobiological approaches require sophisticated scanning machines and proficiency in neurology and image correction for analysis. Clinicians using this approach look for abnormal neuroelectrical activity throughout different neural network regions of the brain. Clinicians use Functional Magnetic Resonance Imaging (fMRI) technology to help identify how individuals with depression process information differently from healthy individuals. fMRI scanning techniques have aided the ability to identify depression, and have also revealed distinctions between subtypes of depression (Landhuis, 2017).


However, Prata et al. (2014) note that in routine clinical practice today there are no tools for diagnosis and/or treatment monitoring based solely on neuroscience methods. Furthermore, Insel and Charney (2012) suggest that novel neurological developments with a specific focus towards clinical application have stalled. The main reasons why neurological imaging techniques for depression have idled are the narrow population with related expertise, the machine operational knowledge required, limited patient accessibility, long assessment time-lines, the number of analysts/technicians needed, and overall costs that are far more expensive than the streamlined verbal interview or self-assessment approach.

2.4.3 Physiological Physiological techniques have also been examined for depression analysis, and usually involve less invasive or ordinarily worn body sensors (e.g. sEMG, smart device band/watch). Many publicly available devices, such as a smart-watch, smartphone, or smart-band have multiple sensors built into them. Examples of commonly integrated sensors include an accelerometer, barometer, gyroscope, global positioning system (GPS), microphone, camera, and magnetometer information capabilities. Examples of physiological techniques for depression analysis include galvanic skin response (Schneider et al., 2012; Vahey & Becerra, 2013), cardiovascular dysregulation (Carney et al., 2005a, 2005b), saccadic eye movement (Carvalho et al., 2014, 2015; Steiger & Kimura, 2010), sleep pattern tracking (Hasler et al., 2004; Paunio et al., 2015), and electrodermal activity (Sarchiapone et al., 2018). Automatic video processing techniques have also received considerable attention for depression behavioural analysis. Some examples of video-based evaluation methods explored include monitoring gross-motor movements (e.g. arms, legs, feet, head) (Balsters et al., 2012; Girard et al., 2013; Parker et al., 1990; Scherer et al., 2013; Sobin & Sackeim, 1997) and fine-motor movements (e.g. handwriting, hand-eye coordination, verbal abilities) (Scherer et al., 2013). Facial emotion analysis has also been studied to evaluate depression severity (Balsters et al., 2012; Cohn et al., 2009, 2017; Joshi et al., 2013a, 2013c; Schelde, 1998; Scherer et al., 2013). Speech-language related behaviour analysis is included as a physiological assessment technique for depression disorders; however, this is discussed in detail later in Chapter 3.4. Generally, physiological techniques demonstrate significant promise for use in depression analysis due to their relatively inexpensive collection devices, built-in consumer product ranges (e.g.


computer, glasses, tablet, phone, watch, band), and non-invasive data recording methods. Moreover, unlike many previously mentioned emerging techniques that only procure a single sample of information, physiological devices are usually built to record and collect behavioural information throughout the day/night. The ability to monitor and/or track behavioural information over multiple time periods is a major advantage of physiological techniques because it improves reliability while also allowing more understanding of how depression behaviours change over time. The measuring of depressive speech behaviours over time has strong implications for potentially allowing careful monitoring of patients, especially those at-risk for suicide.

2.4.4 Statistical Computational psychiatry is an emerging field that exploits mathematics and statistical analysis to investigate psychiatric disorders, generate quantitative predictions, and fuse multiple modes of patient data from a wide variety of collected diagnostic assessment information (i.e. both standard and emerging technologies) (Huys et al., 2016). In Paulus et al. (2017), it is suggested that mental health analysis and diagnosis include input from a diverse group of experts with practical experience from several fields of science (e.g. medical, psychological, engineering) to further improve clinical setting evaluation protocols. The processing of large amounts of commingled data is commonly referred to as ‘big data’ analysis. Notably, Grove et al. (2000) demonstrated evidence using meta-analysis that mechanical prediction is typically as accurate as, or more accurate than, a clinician’s. Due to the complex nature and multivariate phenomena used to diagnose depression and the growing digitalization of hospital records, it is believed that statistical machine learning techniques are more consistent than subjective human observations for the analysis of high-dimensional data.
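As a toy illustration of fusing multiple modes of patient data, the snippet below averages hypothetical per-modality probability estimates. The modality names, probability values, and weighting scheme are invented for demonstration and do not correspond to any specific system described in this thesis.

```python
# Toy sketch of late fusion: combine per-modality probabilities of depression by
# a (weighted) average. All names and numbers below are invented examples.
def late_fusion(probabilities, weights=None):
    """probabilities: dict modality -> P(depressed); returns weighted average."""
    weights = weights or {m: 1.0 for m in probabilities}
    total_weight = sum(weights[m] for m in probabilities)
    return sum(probabilities[m] * weights[m] for m in probabilities) / total_weight

print(late_fusion({"speech": 0.72, "questionnaire": 0.55, "actigraphy": 0.60}))
```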

2.5 Summary

This chapter reviewed the clinical definition of depression, its sub-symptoms, several subtypes, and depression diagnostic approaches, including current assessment techniques and evolving technology-based techniques. Unlike many other specialty health sciences (e.g. neurology, dentistry, oncology, bioengineering), where technology-based diagnostic assessment approaches are commonplace, in the field of psychiatry the use of applicable automated technology has yet to reach its full potential.

Chapter 3 SPEECH-LANGUAGE BACKGROUND

Although spoken language sounds effortless, it is a well-organized cerebral-physiological action involving a cognitively demanding series of complex movements. At a pre-audible stage, multiple areas of the brain simultaneously activate (e.g. Wernicke’s, Broca’s, prefrontal cortex, supramarginal gyrus), wherein each quickly labor to access meaningful streams of selectively arranged words (Miller, 1963; Edwards, 2010). The articulation of a spoken phrase requires a sophisticated degree of simultaneously memorized physiological musculature coordination involving the respiratory system, laryngeal muscles, supra-laryngeal muscles, and fine articulatory positions. Over 100 independently innervated muscles are utilised to produce naturally connective speech sounds (Lenneberg, 1967). From the originating neurological stage to final articulation stage, the overall propagated time interval is extraordinarily brief; for instance, in conversational speech, a person can intelligibly generate up to nine syllables per second (Kent, 2000). What is perhaps more astonishing is the overall degree of targeted articulatory precision exhibited by both adolescent and adult speakers. During spontaneous discourse, speakers produce approximately one unnatural speech or language error per 900 words (Garnham, 1981). Indeed, there are very few other deliberate cerebralphysiological actions people undertake on a daily basis with such keen accuracy. These many hidden cognitive-motor intricacies related to speech processing and production are ordinarily only revealed to others when an individual bears a disorder and/or neurological disease; i.e. wherein some disturbance impedes the body’s ability to properly verbally communicate. Observed social abnormal behaviours and communicative defects are frequently indicators of common health concerns (Hirschberg et al., 2010). This chapter presents a review of information on speech production, including the basic anatomy and physiology involved. Common speech-related vocabulary and measurements are also defined. In addition, a background that encompasses acoustic, linguistic, and affective characteristics found in individuals with depression is discussed, with references to evidence-based diagnostic criteria for speech-language abnormalities.


3.1 Acoustic Theory of Speech Production

To understand the process of speech production, at minimum, a rudimentary knowledge of the vocal anatomy and physiological speech fundamentals is required. Readers with further interest in the anatomy and physiology of speech production are referred to Hixon et al. (2014), Honda et al. (2004), Deller et al. (2000), and Zemlin (1968). Fig. 3.1 shows the basic anatomical contributions to the production of speech sounds, also referred to as phonemes. As shown by Fig. 3.1, there are several body parts that contribute during speech, many of which the reader is probably familiar with (e.g. diaphragm, lungs, trachea, esophagus, jaw, teeth, tongue, lips, nostrils). However, a discussion of some of the lesser known anatomical speech mechanisms is given herein.

Figure 3.1. Labeled illustration of important anatomical and physiological mechanisms involved during speech production. As shown by the illustration, speech requires independent motor control and coordination of many parts of the body. Note that this chart omits the neurological planning mechanisms involved during speech production (see: Grodzinsky et al., 2000). This image was taken from Deller et al. (2000).

The larynx, which is often referred to as the vocal folds, vocal cords, or voice box, is an upper extremity of the trachea (e.g. windpipe). It is located where the Adam’s apple rests on a male or female anterior neck. The larynx is made up of three symmetrical pairs of bone cartilage, the thyroid, cricoid, and arytenoid, in addition to a network of ligaments and membranes (Deller et al., 2000). These cartilages house and protect the vocal folds, whose membranes stretch from front to back across the larynx. During a resting stage, the vocal folds remain apart. However, when activated the vocal folds tighten closely together, yet with flexibility (i.e. similar to attributes of a rubber band). The space in between the vocal fold membranes is called the glottis.


Effectively, the vocal folds form a membranous reed-like instrument within the larynx structure. The vocal folds consist of two elastic plates, which with the help of small muscles can be controlled to stretch to produce a narrow fissure between them. This gap between the folds, the glottis, is where the air stream from the lungs passes through and causes the vocal fold fissures to vibrate (Ward, 1958). During voiced speech, the vibration of the vocal folds resembles rhythmical puffs. As the vocal folds constrict to make a narrower glottis, the frequency of the vibrations increases, whereas if the vocal folds are widened the rate of vibrations decreases. This tightening constriction of the vocal folds is auditorily perceived as changes in a vocal pitch. It is important to remember that the glottal source vibration is not actually physically produced by muscular activation, but rather by the kinetic energetic generated from the lungs, which push a stream of air across the vocal folds resulting in rapid modulated motion of the folds (e.g. Bernoulli effect). The movements of the active organs of speech are independent of each other (Ward, 1958). Consequently, the independent nature of these speech organs allows for a large variety of different speech sounds. One of the most common speech physiological-articulatory combinations involves activation of the vocal folds with movements of the tongue, soft palate, and lips. The activation of the vocal folds usually involves voiced production (i.e. accompanied by vibration of vocal folds), or unvoiced/voiceless production (i.e. no vibrations, as the folds are apart). All English vowels are voiced, whereas English consonants can be produced with or without voicing. Voiced English examples are the phonemes /a/, /i/, /l/, and /r/. On the other hand, English unvoiced sound examples are the phonemes /s/, /sh/, /f/, and /th/. Generally, a person can distinguish whether a phoneme is voiced or unvoiced by placing his/her hand on the throat and repeatedly enunciating the same sound; if a vibration is felt, then there is good indication that this phoneme is voiced. Another trait that makes vowels different from consonants is that vowels have no significant airflow disturbance during production, whereas consonants usually have some point of restriction. This effectively means that vowels generally have greater amplitude than consonants, and that consonants are generally noiser than vowels. The idea of a noise type phoneme relates to the distribution of energy and periodicity of a phoneme. For vowels, the bulk of the acoustic energy is contained below 2 kHz with quasi-periodicity attributes, whereas for consonants the majority of energy is above 2 kHz and their signal is aperiodic. Whether a vowel or consonant, the speech organs involved during speech production are limited to an effective bandwidth of approximately 7 to 8 kHz.
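The energy-distribution claim above (vowel energy concentrated below roughly 2 kHz, consonant energy largely above it) can be illustrated with a rough band-energy calculation. The snippet below uses synthetic "vowel-like" (harmonic) and "consonant-like" (noise) signals rather than real speech, and the sampling rate and cutoff are illustrative assumptions.

```python
# Rough illustration: fraction of spectral energy below vs above 2 kHz for a
# synthetic periodic (vowel-like) and noise (consonant-like) signal.
import numpy as np

fs = 16000
t = np.arange(0, 0.5, 1 / fs)
vowel_like = sum(np.sin(2 * np.pi * f * t) / k for k, f in enumerate([150, 300, 450, 600], start=1))
consonant_like = np.random.default_rng(0).normal(size=t.size)

def low_band_fraction(x, fs, cutoff=2000.0):
    spectrum = np.abs(np.fft.rfft(x)) ** 2
    freqs = np.fft.rfftfreq(x.size, d=1 / fs)
    return spectrum[freqs < cutoff].sum() / spectrum.sum()

print(f"vowel-like:     {low_band_fraction(vowel_like, fs):.2f} of energy below 2 kHz")
print(f"consonant-like: {low_band_fraction(consonant_like, fs):.2f} of energy below 2 kHz")
```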


In addition to voiced and unvoiced positions, there are additional manners of excitation. Examples of other types of excitation include mixed, plosive, and whisper (Deller et al., 2000). One or more of these excitation types may be blended in the excitation of a particular speech sound or class of sounds. For instance, the English phoneme /z/ between words may include a mix of voiced and unvoiced characteristics. Plosive characteristics involve a short region of silence, followed by a region of voiced, unvoiced, or mixed speech. In English, plosives are produced by complete closure of the vocal tract (i.e. usually the lips or tongue), the building of air pressure, and quick release of an air burst. Examples of English plosive phonemes are /b/, /p/, /t/, and /d/. As expected from its title, whispered speech does not involve excitation of the vocal folds. Thus, even normally voiced sounds will become unvoiced – which can make distinctions between sounds perceptually more difficult (e.g. /t/ and /d/, /th/ and /dh/). As shown previously in Fig. 3.1, there are three major cavities that affect vocal resonance and quality (Hixon et al., 2014). First is the pharyngeal cavity that rests directly above the larynx, and connects to the oral and nasal cavities. Second is the oral cavity, which surrounds the main articulators (e.g. tongue, teeth, lips, hard palate). Third, unlike the pharyngeal and oral cavities, which are involved in most speech sounds (e.g. precluding dental clicks or lip smacks), the nasal cavity is uniquely activated by the soft palate. The soft palate is a membranous structure (velum) towards the back roof of the mouth, and when elevated, it opens the nasal passage and elongates the vocal tract air passage generating a nasal-like quality to a speech sound (e.g. /n/, /m/, /ng/). The combination of the length of all three cavities is referred to as the vocal tract length. The average vocal tract length of a child is 10 centimeters, whereas for adult females and males, it is 14 and 17 centimeters, respectively (Deller et al., 2000). The source-filter theory is a basic explanation of the major acoustic components of the process of speech production (Fant, 1960), which also has important pathological and signal processing implications. For instance, an individual with severe depression can incur behavioural-cognitive changes that can be directly observed in his/her speech. Moreover, depression-related psychomotor disturbances adversely affect the proper function of many coordinated components involved during the articulatory speech production process. The physiological origin of normal speech begins as the lungs generate potential energy and changes in air pressure within the vocal tract. For voiced speech sounds, the lungs expel air and the velocity of this air impacts vocal fold periodicity. As the harmonically rich glottal sound energy is travels along the vocal tract, along its laryngeal flow, harmonic amplitudes are modulated by the

pharyngeal, oral, and nasal cavities along with various positions of the articulators (e.g. tongue, teeth, lips, jaw, velum) acting as a filter. As shown in Fig. 3.2, the source-filter theory represents a discrete-time speech model that includes a series of differently sized tubes with highlighted stages during the speech production process. For unvoiced sounds, a random noise generator is represented by N(z), a random noise-like signal component whose maximum amplitude of noise is controlled by the parameter Au. For voiced phonemes, the impulse train generator is represented by E(z), the acoustic source vocal fold vibration activation or ‘pitch period’. The glottal pulse model, represented by G(z), generates its own spectrum output that includes fine spectral components (e.g. harmonics, noise) and a spectral slope that descends roughly -12 dB per octave in a low-pass fashion. Importantly, within the glottal pulse model, there are two main glottal phases: (1) the opened glottal phase, wherein the vocal folds open and the glottal flow increases from the baseline to the maximum value; and (2) the closed glottal phase, wherein the vocal folds are closed and the glottal flow is at the baseline, which is referred to as the glottal closure instant. Glottal pulse parameters include: the fundamental period (T0); the maximum amplitude of glottal flow (Av) and the time at which it is reached (Tp); the maximum excitation (Em); and the glottal closure instant (Tc). The vocal tract model V(z) acts as a filter, which includes the oral and/or nasal cavity, depending on whether the velum is opened or closed and on articulator parameters (e.g. tongue, hard palate, teeth). Additionally, the lip radiation model R(z) has a constant magnitude spectrum and functions as a high-pass filter with roughly a 6 dB per octave rise, with closed/open parameters (i.e. used during labial phoneme production). When these separate models are combined, voiced speech can be represented as S(z) = E(z)G(z)V(z)R(z) and unvoiced speech as S(z) = N(z)V(z)R(z).

Figure 3.2. Discrete-time model for speech production based on Fant’s (1960) ‘Source-Filter Theory of Phonation’. This model demonstrates that the vocal tract can be acoustically modeled as a series of interconnected components (i.e. differently sized tubes). As illustrated, the excitation model has two parallel stages, from which the input to the vocal tract model is a linear combination: (1) for voiced speech, there is a train of glottal pulses spaced at intervals of the pitch period; and (2) for unvoiced speech, excitation is random and noise-like. As shown by this figure, the speech system requires independent, coordinated, sequential actions to efficiently produce spoken language.
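A minimal numerical sketch of the source-filter structure in Fig. 3.2 is given below: a voiced source (an impulse train at the pitch period) or an unvoiced source (noise) is passed through a simple all-pole filter standing in for V(z)R(z). The filter coefficients, F0, and sampling rate are arbitrary toy values, not measured vocal-tract parameters; NumPy and SciPy are assumed to be available.

```python
# Toy sketch of the source-filter model in Fig. 3.2: excitation -> filter.
import numpy as np
from scipy.signal import lfilter

fs = 16000
n = fs // 2                          # half a second of samples

# Voiced source E(z)/G(z): impulse train at an assumed F0 of 100 Hz
voiced_source = np.zeros(n)
voiced_source[::fs // 100] = 1.0
# Unvoiced source N(z): white noise
unvoiced_source = np.random.default_rng(0).normal(scale=0.1, size=n)

# V(z)R(z): a crude, stable all-pole filter standing in for vocal tract + lips
a = [1.0, -1.3, 0.8]                 # arbitrary toy coefficients
voiced_speech = lfilter([1.0], a, voiced_source)
unvoiced_speech = lfilter([1.0], a, unvoiced_source)
print(voiced_speech.shape, unvoiced_speech.shape)
```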

The rate of the base glottal vocal fold vibration is referred to as the fundamental frequency. The fundamental frequency is essentially the lowest frequency component of the quasi-periodic waveform produced by the vocal folds. The fundamental frequency is represented as F0 in the literature, and is defined as the number of glottal cycles per second (Hz). A child’s fundamental frequency is much higher than an adult’s because their body and vocal tract is proportionally much smaller. Furthermore, the approximate F0 range for an adult female is from 165 to 255 Hz, whereas a male’s is from 85 to 180 Hz (Baken & Orlikoff, 2000; Titze, 1994). Abnormalities in the F0 function, such as limited flexibility range and control, can often be signs of a vocal pathology, which may have a physiological or psychological origin. Typically, in healthy speakers, the F0 has audible varied intonation contours. During normal speech, people’s F0 range tends to fall within a habitual range, as opposed to more deliberate and effortful speech related tasks, such as singing. It should be mentioned that in speech science literature the F0 and the term pitch are often used interchangeably. However, pitch specifically refers to human perceptual auditory phenomena (Deller et al., 2000). In addition to the F0, the acoustic resonance causes additional harmonic formants (e.g. F1, F2, F3) to be generated due to the speech cavities and articulators. Another basic concept of the speech production process is often referred to as the speech chain, which is illustrated in Fig. 3.3. The speech chain begins with an idea, or a process called ideation, wherein a rational notion is encoded using language into neurological signals in the brain. Following this, in a very brief moment of time, signals sent from the brain activate a network of motor nerves that control speech mechanisms and their muscular movements. These movements produce speech in the form of linguistic phoneme representations in a sound wave form, which are aurally self-monitored, whilst the speech is also received by another listener’s auditory system and decoded in his/her brain. The self-monitoring or feedback loop allows for a speaker to quickly correct his/her articulatory or grammatical errors during speech production, thus helping to increase the effectiveness of communication exchange. This process is referred to as a chain because to be successful the entire process requires a degree of language understanding and an event-linked series of processes (Denes & Pinson, 1993). During this process a speaker manipulates the speech further by including specific linguistic content, emphasis (e.g. prosody, intensity dynamics, quality dynamics, rate-of-speech), while also providing a listener with non-verbal cues, such as visual facial expressions, eye, and/or hand movements.
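Following the definition of F0 above (the number of glottal cycles per second), a simple autocorrelation-based estimate over a short frame can be sketched as below. The input here is a synthetic 120 Hz periodic signal rather than real speech, and the frame length and search range are illustrative assumptions.

```python
# Simple autocorrelation-based F0 estimate on a short, synthetic voiced frame.
import numpy as np

fs = 16000
t = np.arange(0, 0.04, 1 / fs)                    # 40 ms frame
frame = ((t * 120) % 1.0) - 0.5                   # synthetic 120 Hz periodic signal

def estimate_f0(frame, fs, fmin=70, fmax=300):
    frame = frame - frame.mean()
    ac = np.correlate(frame, frame, mode="full")[frame.size - 1:]
    lo, hi = int(fs / fmax), int(fs / fmin)
    lag = lo + np.argmax(ac[lo:hi])               # strongest periodicity in range
    return fs / lag

print(f"estimated F0: {estimate_f0(frame, fs):.1f} Hz")   # close to 120 Hz
```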


While the speech chain is a serial, high-level (and perhaps oversimplified) model of human speech processing, it still captures the key steps involved.

Figure 3.3. Illustration of the speech chain process, taken from Denes & Pinson (1993). Although a simplified representation of human speech processing, major components involved during speech production and language exchange are included.

While the speech chain model of speech processing in Fig. 3.3 is shown as a set of serial levels (e.g. linguistic, physiological, acoustic), it is believed that many of these processes run in a parallel or hierarchical fashion (Peelle et al., 2010; Miller, 1988). Surprisingly, due to experimental limitations, there is still disagreement concerning how speech-language information is stored, organized, and processed in the brain during speech production (Garman, 1990). In patients exhibiting depression, abnormal speech chain disturbances are often observed at more than one processing level, as previously discussed in Chapter 2.1 and, to a greater extent, later in Chapter 3.4. The source-filter and speech chain concepts are helpful in understanding what mechanisms are required for normal speech production. Moreover, because speech relies on many independent processes, any impeding issue, for example one caused by a disorder or disease, can produce abnormalities in motor control and/or the physiology of a speech mechanism. Therefore, abnormal speech behaviours and articulatory defects can be good indicators of many prevalent illnesses and neurological disorders (Hirschberg et al., 2010). As a result, diagnoses of many prevalent diseases/disorders encompass some degree of subjective and/or objective speech-language behavioural evaluation (Bennabi et al., 2013; Chevrie-Muller et al., 1985; Damico et al., 2010; Duffy, 2012).


3.2 Measurements of Speech Processing

3.2.1 History of Acoustic Speech Measurement Tools

While the theoretical analysis of speech-language dates back to ancient times (Benesty et al., 2008; Gera, 2003; Stemple et al., 2014), analysis using speech-based measuring devices dates back approximately a century. As briefly mentioned in Chapter 3.1, the spectral characteristics of the various phonemes and their rapid change over time produce a continuous speech signal that is nonstationary. Consequently, speech can be evaluated by observing short time segments of the signal that can be considered locally stationary and that contain specific acoustic properties related to a phoneme and/or its transition to another phoneme. Over the years, a few key devices have been developed to measure different aspects of speech. The kymograph was one of the first semi-objective real-time devices used to measure acoustic-phonological events (Tillmann, 1995) and emotional state (Wundt, 1902), as illustrated in Fig. 3.4. Basically, a person transmitted voice sound waves via a tube to a cylindrical rotating drum, which in turn had a connected stylus scribing the vocal vibrations onto paper (Leon & Martin, 1970; Rousselot, 1924). Interestingly, the drum's size acted as a low-pass filter; thus, the broad harmonic vibrations recorded consisted mostly of F1 and F2 formant energy values. The resulting trace is comparable to a modern-day digitally produced waveform, which likewise conveys limited spectral visual information.

Figure 3.4. Illustration of an analogue kymograph speech acoustic measuring device (image taken from Leon & Martin (1970) and Bolinger (1972)). The kymograph had a sampling rate measured in hundredths of a second.


However, the kymograph did allow for a broad analysis of fundamental prosodic speech sound characteristics, such as duration, intensity, and pitch. Duration could be analyzed through the kymograph so long as the cylindrical rotating drum turned at a constant speed. Intensity could be measured based on the calibration and reaction of the stylus apparatus to the propagated acoustic energy. Pitch could also be measured crudely by hand by counting the number of peaks or valleys (i.e. the number of cycles) per constant analysis time period: the higher the pitch, the greater the number of peaks/valleys per time period, and the lower the pitch, the fewer the peaks/valleys. The kymographic tool and process required a great deal of manual calibration and tedious, careful human analysis, often requiring a magnifying glass and diligent patience. Surprisingly, many kymographic studies were conducted until the 1950s (Maak, 1957; Magdits, 1959; Bolinger, 1972), and these are now considered the basis for phonological language and physiological voice studies. Around World War II, new efforts were made to analyze speech with greater resolution using an analog oscilloscope (i.e. oscillograph). Again, the purpose of the oscilloscope was to allow analysis of the duration, intensity, and frequency aspects of speech. Using a transducer (e.g. microphone), the acoustic waveform was converted into an electric signal, which the oscilloscope rendered as an electron-based curve on the screen of a cathode ray tube. The oscilloscope acoustic-signal representation was traced by small movements of a luminous spot created by the impact of electrons on a fluorescent coating (Bolinger, 1972). This display was either filmed or an oscillograph was used to record changes in the acoustic signal. Wartime defense research also led to the development of the spectrogram, a visual spectral speech representation that includes acoustic information regarding time, intensity, and frequency with far greater resolution and clarity than the previously mentioned devices (Joos, 1948; Leon & Martin, 1970; Potter et al., 1947; Shankweilar & Fowler, 2015). As shown in Fig. 3.5, the spectrogram allowed for the examination of individual phoneme differentiation during natural spontaneous speech, and even allowed an experienced expert to visually identify a particular phoneme or phoneme class based solely on the spectrogram.
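To make the cycle-counting idea above concrete in modern terms, the short sketch below estimates F0 by counting upward zero crossings in a voiced segment and dividing by its duration. It is only a minimal illustration under assumed conditions (a clean synthetic tone, 16 kHz sampling, and an illustrative function name); a practical F0 tracker would instead use autocorrelation or cepstral methods.

import numpy as np

def f0_by_peak_counting(x, fs):
    # Crude F0 estimate in the spirit of kymographic analysis: count the
    # positive-going zero crossings (one per glottal cycle for a clean,
    # near-periodic voiced segment) and divide by the segment duration.
    x = x - np.mean(x)                               # remove any DC offset
    crossings = np.sum((x[:-1] < 0) & (x[1:] >= 0))  # upward zero crossings
    duration = len(x) / fs                           # segment length in seconds
    return crossings / duration                      # cycles per second (Hz)

# Example: a synthetic 120 Hz "voiced" segment, 0.5 s long, sampled at 16 kHz
fs = 16000
t = np.arange(int(0.5 * fs)) / fs
segment = np.sin(2 * np.pi * 120 * t)
print(f"Estimated F0: {f0_by_peak_counting(segment, fs):.1f} Hz")  # close to 120 Hz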


Figure 3.5. Spectrogram of a healthy male speaker saying the sentence "There was a spider in the shower", taken from the Black Dog Institute Affective Sentences (BDAS) database (Alghowinem, 2015). This analysis shows time, frequency, and energy. Only 0 to 8 kHz frequencies are shown, as most speech energy can be captured below 4 kHz. The loudness is shown in dB-SPL, wherein darker red indicates a greater density of acoustic energy. These resonance energy bands are known as formants, as previously discussed in Chapter 3.1.

Initially, the analog spectrogram machines were large (i.e. roughly the size of a desk), not very mobile, and had limited recording analysis times (i.e. less than 4-second analysis windows) and frequency ranges (i.e. up to 8 kHz) (Leon & Martin, 1970). However, since the mid-1980s, spectrogram measurements of speech have relied on digital devices along with audio analysis software that offers higher precision and additional signal processing tools (e.g. filters, zoom capability, automatic annotations). These types of digital speech measurement applications are further discussed in Chapter 3.2.2. Currently, the spectrogram is still a staple visual aid in speech processing analysis and voice forensics. The brief history of speech measurement devices presented in this chapter shows that accurate, measurable analysis of acoustic speech did not mature until after the 1950s with the advent of the spectrogram1. While modern methods for measuring acoustic speech have moved fully towards digital speech processing techniques, there is still considerable scope for the development of new methods to automatically measure speech.

3.2.2 Discrete Time Speech Acoustic Analysis

Many acoustic speech-processing techniques are derived from Fourier theory (Cochran et al., 1967). In brief, the Fourier transform operates on the theorem that any periodic signal can also be represented by a set of sinusoidal basis functions (see Fig. 3.6), each with its own amplitude and phase.

Figure 3.6. Illustration of a complex signal approximated using six different sinusoidal signals. The red line represents the original signal, whereas the blue lines represent each of the six sinusoids that together compose the original signal. The original signal is shown in terms of amplitude versus time, while the frequency-domain representation is shown as amplitude versus frequency (still image taken from Barbosa (2018) animation).

Typically, 16 kHz is chosen as the speech audio recording sampling rate because the distributed acoustic waveform energy in speech is mostly below 8 kHz. In many speech-processing applications, this sampling rate is further reduced to 8 kHz. This sampling rate reduction typically has only a minor impact on application performance, as the majority of speech information is located below 4 kHz, and historically, speech processing has focused a great deal on telephone-bandwidth speech. Most speech processing algorithms implement fast Fourier transforms (FFT) based on the discrete Fourier transform (DFT) to quickly and efficiently represent the spectral information in a discrete short-term speech signal.

1 X-ray, sonography, palatography, and nuclear magnetic resonance have also been applied extensively as measurement techniques in speech processing studies (Honda, 2008). However, these technologies are typically intended for understanding the anatomical and physiological mechanisms of speech, rather than the acoustic waveform.


The short-time Fourier transform (STFT) uses a fast Fourier transform (Deller et al., 2000) to produce a time-frequency representation of a windowed segment of the speech. The STFT expresses a set of time-frequency distributions, which specify complex amplitude versus time and frequency signal properties (Cohen, 1995). The STFT is a long-standing method commonly applied in the field of audio signal processing. A window contains only the signal values within a given time period, and excludes surrounding values outside of that chosen window. Windowing can be represented by the following equation:

s[n] = x[n] w[n − b]                                  (3.1)

where x[n] is the input signal, w[n − b] is the finite-length window function shifted by b samples, b is the time shift value (i.e. normally constant), and s[n] is the resulting windowed segment. There are two explicit parameters that greatly impact the windowing process. First, the window must be long enough for proper analysis of the speech, but not so long that the assumption of local stationarity regularly breaks down. Second, the window function should include a degree of overlap. If the window size is too brief, it may fail to capture important characteristics of the speech. As a result, a trade-off exists between time and frequency resolution. This phenomenon is known as the Heisenberg-Gabor property (Quatieri, 2002; Williamson, 2000), wherein a function cannot be both time- and band-limited to an arbitrarily small precision, as represented by the following equation:

ΔT · Δf ≥ 1/(4π)                                  (3.2)

where ΔT and Δf are the uncertainties in time and frequency. Hence, during speech analysis, it is important to balance the time and frequency resolution components by keeping the product ΔT · Δf as small as practical. Fig. 3.7 below shows an example of windowing of a voiced speech sound over a time period of a tenth of a second. As indicated by Fig. 3.7, speech consists of both frequency- and amplitude-modulated signal components. The term 'tiling' is used to describe how time and frequency resolution are traded off. In most speech processing applications, the time-frequency resolution is fixed.
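As a quick numerical illustration of this trade-off, the minimal Python sketch below prints how the analysis window length sets the frequency grid spacing. The 16 kHz sampling rate and the use of the DFT bin spacing as a simple proxy for frequency resolution are assumptions made only for illustration.

fs = 16000                                   # assumed sampling rate (Hz)
for win_ms in (10, 20, 40, 120):             # window lengths mentioned in the text
    n = int(fs * win_ms / 1000)              # window length in samples
    dt = win_ms / 1000.0                     # time resolution (s)
    df = fs / n                              # DFT bin spacing (Hz), a proxy for frequency resolution
    print(f"{win_ms:>4} ms window -> {n:>5} samples, ~{df:5.1f} Hz bin spacing, dt*df = {dt * df:.2f}")

Shorter windows track rapid articulatory changes but blur closely spaced frequency components, while longer windows do the opposite; the product of the two grid spacings stays constant (1.00 here), comfortably above the Gabor bound of 1/(4π) ≈ 0.08.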



Figure 3.7. Example of the recorded speech sound /aa/ taken from the Distress Analysis Interview Corpus – Wizard of Oz (DAIC-WOZ) database that illustrates rectangular windowing of this speech with 50% windowing overlap.

The most common fixed window size for the STFT is between 10 and 40 milliseconds. However, in some suprasegmental speech analysis (i.e. syllable-level or multi-phoneme analysis), the window size can be set as large as 120 milliseconds. The windows for speech processing typically have a degree of overlap (i.e. usually 50% overlap) and a Hamming or von Hann window function. Using automatic speech processing tools and the STFT, Fig. 3.8 shows a spectrogram for the phrase "She gave her daughter a doll". Currently, the majority of acoustic speech properties are measured using digital methods.
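The sketch below ties these choices together: it frames a signal with a 20 ms Hamming window at 50% overlap and applies a 512-point FFT per frame, in the spirit of the spectrogram settings described for Fig. 3.8. It is a minimal Python/NumPy illustration only; the function name and the synthetic test tone are assumptions, not part of any toolkit used in this thesis.

import numpy as np

def stft_frames(x, fs=16000, win_ms=20, overlap=0.5):
    # Split a speech signal into overlapping Hamming-windowed frames and
    # return their magnitude spectra (a simple spectrogram representation).
    n = int(fs * win_ms / 1000)          # e.g. 320 samples at 16 kHz
    hop = int(n * (1 - overlap))         # 50% overlap -> hop of n/2
    window = np.hamming(n)
    frames = []
    for start in range(0, len(x) - n + 1, hop):
        frame = x[start:start + n] * window           # eq. (3.1): x[n] * w[n - b]
        spectrum = np.abs(np.fft.rfft(frame, 512))    # 512-point FFT per frame
        frames.append(spectrum)
    return np.array(frames)              # shape: (num_frames, 257)

# Example: a 1-second synthetic vowel-like tone at 220 Hz
fs = 16000
t = np.arange(fs) / fs
signal = 0.5 * np.sin(2 * np.pi * 220 * t)
print(stft_frames(signal, fs).shape)     # (99, 257)

Stacking the per-frame magnitude spectra over time yields the kind of time-frequency image discussed throughout this sub-chapter.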




Figure 3.8. 3-dimensional spectrogram using a window length of 256 samples (20ms) with 50% overlap and 512 discrete Fourier transform points. The short recording is of a healthy Australian male speaker reading aloud the sentence "She gave her daughter a doll", extracted from the Black Dog Institute Affective Sentence (BDAS) database (for more information about this database see Chapter 5). The three main components of the speech are shown: time in seconds on the x-axis; frequency in kHz on the y-axis; and intensity on the z-axis. Also, four separate formants are observed, approximately one in each 1 kHz region, as indicated by dotted-line markers. The transcript is provided as a reference point to help demonstrate durational, spectral, and intensity differences between phonemes.

3.3 Speech-Language Affect Measurements - Continuous Ratings

Continuous affect ratings examine a single dimension of emotion based on dynamic changes in the voice over time. Continuous affect annotation is mainly concerned with providing moment-to-moment measurements; thus, it covers a wide spectrum of prototypical emotions without directly having to label each with a finite discrete emotion (e.g. happy, sad, bored). The continuous affective rating approach provides valuable insight into how affect metamorphoses (i.e. how it transitions over time). Many affective research scientists have focused their attention on scaled subjective affective characteristics, such as valence, arousal, dominance, direction, and intensity (Barrett, 1998; Bradley & Lang, 1994; Nowlis & Nowlis, 1956; Lang, 1984; Lang et al., 1993; Russell, 1989; Schlosberg, 1954; Tellegen, 1985; Uldall, 1972; Wundt, 1902). In speech, affect is expressed through acoustic and linguistic means. For instance, acoustically, an increase in arousal is typically observed through an increase in loudness in the speech. Further, each lexical word representation is associated with different levels of affect. For example, in Fig. 3.9 below, human-rater mean scores based on linguistic-lexical affective perception of eight different English words are shown. These words are plotted based on their approximate mean arousal and valence ratings. While the words 'alive/dead' and 'love/hate' are opposites in meaning, depending on the affect measure, they may be more different (i.e. in the case of valence) or more similar (i.e. arousal).

Figure 3.9. A two-dimensional comparison of four positive (blue) and four negative (red) text-based word affect scores based on the Affective Norms for English Words (ANEW) (Bradley & Lang, 1994). The words are placed at their approximate arousal-valence coordinate values. This shows the non-acoustic, linguistic affective dimensional aspect of individual words. It should be noted that the affect of an individual word is represented here by a single finite value, whereas a series of words (e.g. sentences, paragraphs) can be represented by a sequence of such values.

Another example of continuous affect ratings is shown in Fig. 3.10, wherein an excerpt of a spoken transcript was evaluated using the Affective Norms for English Words (ANEW). In this example, a string of words (e.g. phrases) can be represented by a finite number of semi-sequential (i.e. some words do not have affective scores) affective token word scores. These affective token word scores can provide information regarding how affect changes over time during speech. Additionally, affective token word scores at the word or phrase level allow the examination of acoustic or linguistic information within selective affect regions (e.g. negative, neutral, positive).

Figure 3.10. Continuous text-based word valence scores based on the Affective Norms for English Words (ANEW) and a partial speaker transcript taken from the Distress Analysis Interview Corpus – Wizard of Oz (DAIC-WOZ) database. By using a series of 207 words transcribed from spoken spontaneous sentences, the progression of individual word-affect information can be shown over time, as indicated by each dot: positive (blue) range greater than 6, neutral (green) range between 4 and 6, and negative (red) range less than 4. In the example shown, the second half of the spoken excerpt contains greater variability and a lower average valence than the first half.
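The per-word scoring and negative/neutral/positive binning used in Fig. 3.10 can be sketched in a few lines of Python. The lexicon values below are illustrative stand-ins only (real ANEW ratings must be obtained from Bradley & Lang's norms), and the thresholds simply follow the figure.

# Hypothetical valence values (1-9 scale) standing in for ANEW entries.
VALENCE = {"love": 8.7, "happy": 8.2, "alive": 7.3, "tired": 3.7,
           "alone": 2.4, "sad": 1.6, "day": 6.2, "work": 5.3}

def score_transcript(words, neg_max=4.0, pos_min=6.0):
    # Map transcript words to valence scores and bin them into the
    # negative / neutral / positive ranges used in Fig. 3.10.
    # Words without a lexicon entry are skipped (semi-sequential scoring).
    scored = [(w, VALENCE[w]) for w in words if w in VALENCE]
    bins = {"negative": [], "neutral": [], "positive": []}
    for word, v in scored:
        if v < neg_max:
            bins["negative"].append((word, v))
        elif v > pos_min:
            bins["positive"].append((word, v))
        else:
            bins["neutral"].append((word, v))
    return scored, bins

words = "i feel tired and alone but work keeps me alive".split()
scored, bins = score_transcript(words)
print(scored)
print({k: len(v) for k, v in bins.items()})  # {'negative': 2, 'neutral': 1, 'positive': 1}

Grouping words this way is what allows acoustic or linguistic features to be computed separately per affect region, as noted above.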

3.3.1 Arousal

Arousal is described as a conscious affective experience based on a varied degree of subjective mental activation or interest. Consequently, low levels of arousal produce a deactivated mental state, which is often perceived simply as sleepiness or disinterest. For speech, arousal has been shown to be related to how an utterance is spoken, and less to word selection (Karadogan & Larsen, 2012). Increased arousal in speech is reflected in greater vocal tension, resulting in a perceivable rise in fundamental frequency and pitch. It has been shown that fundamental frequency, amplitude, and duration can indicate various degrees of arousal (Banse & Scherer, 1996; Bänziger & Scherer, 2005; Bone et al., 2014). Further, it has been recorded that individuals with depression experience and exhibit low levels of arousal, which results in monotonous vocalizations that are characterized by a decrease in formant dynamics (Cummins et al., 2015a; Guidi et al., 2015; Hall et al., 1995; Porritt et al., 2015).


3.3.2 Valence

Valence entails an individual's interpretation of pleasantness or unpleasantness during an affective experience. Generally, valence is measured along a positive-negative axis. For example, emotions like anger or fear have negative valence, whereas happiness and satisfaction have positive valence. The valence dimension has been shown to correlate strongly with the semantics of what is spoken (Karadogan & Larsen, 2012). Often, individuals with depression are perceived by others to use vocabulary and prosodic cues that have lower valence and a more negative bias than healthy populations (Joormann et al., 2005; Joormann & Gotlib, 2008, 2010). Furthermore, people with depression tend to focus on negatively valenced stimuli longer than healthy populations (Bylsma et al., 2011).

3.3.3 Other Affective Dimensions

Dominance relates to an individual's perceived assertiveness, authority, and/or aggressive vocal characteristics. High degrees of dominance are generally useful in emergency or dangerous situations. Research has shown that depressed individuals use more perceptually submissive speech and consequently often exhibit less dominance (Osatuke et al., 2007). Additional speech affect dimensions have been explored; however, these have not been examined as extensively as the affect dimensions studied in visual contexts. For instance, besides the aforementioned arousal, valence, and dominance, dimensional attributes found in everyday speech also include: potency (coping), empathy (understanding), intensity (degree of emotion), persuasiveness, politeness, agreeability, sincerity, patience, and confidence (Uldall, 1972; Hirschberg et al., 2003; Laukka et al., 2005). Less is known about these dimensions of speech and their relation to affect.

3.4 Effects of Depression on Speech

Long before automatic methods were explored as tools for depression diagnosis, early subjective studies (Eldred & Price, 1958; Kraepelin, 1921; Moses, 1954; Newman & Mather, 1938; Stinchfield, 1933) examined the speech patterns found in clinically depressed speakers and documented atypical characteristics when compared to healthy populations. For instance, the aforementioned studies observed that the speech of individuals with depression was lifeless, flat, hollow, and muffled. Further, the 'tone' or subject matter of the depressed patients' speech was often noted as having a negative, hopeless, or distressed perspective. The primary speech indicators exhibited by depressed patients in the aforementioned studies included reductions in prosodic vocal intensity, fundamental frequency (F0), rate-of-speech, and vocal quality (i.e. linguistic stress). In Newman & Mather (1938), although the experiments were based on subjective human-listening analysis of F0 contours, non-expert listener participants were surprisingly able to identify clinically depressed and healthy speakers with 80% classification accuracy. A similar study by Nilsonne & Sundberg (1985), which included F0, speech rate, and pause time, also reported a non-expert human-listening accuracy of 80% for depression classification. Both of these subjective speech-based listening studies (Nilsonne & Sundberg, 1985; Newman & Mather, 1938) helped to establish that unusual verbal behaviours are indicative of changes in a person's mental health, especially concerning depression. Moreover, these two studies provided evidence that general prosodic-related speech cues hold substantial clues regarding depression severity. The following sub-chapters describe typical observable changes in acoustic, linguistic, and affect speech-language behaviours caused by depression. In particular, these studies are important because they help to identify measurable diagnostic indicators of depression based on speech; furthermore, they hint at which parts (e.g. pitch, intensity, duration, timbre) of speech-language could be most useful for automatic speech-based analysis algorithms.

3.4.1 Acoustic Behaviours and Characteristics

In the decades following some of the earliest studies on the speech characteristics of depressed patients, studies by Ostwald (1965), Szabadi et al. (1976), Darby and Hollien (1977), Hollien (1980), Greden et al. (1981), Greden and Carroll (1980), Darby et al. (1984), and Chevrie-Muller et al. (1985) examined speech from individuals with depression disorders using audio recordings and more objectively measured speech analysis methods (e.g. spectrograms). Researchers in these studies confirmed early subjective speech behavioural accounts that critical cues for depression and its severity can be derived from major prosodic acoustic elements, such as fundamental frequency (e.g. pitch), intensity, voice quality, and duration.


Changes in F0 and pitch are generally regarded as a major indicator of depression. Many studies (Breznitz et al., 1992; Darby et al., 1984; Hönig et al., 2014; Hussenbocus et al., 2015; Kuny & Stassen, 1993; Mundt et al., 2007; Nilsonne, 1987; Nilsonne et al., 1988; Porritt et al., 2015; Simantiraki et al., 2017; Stassen et al., 1998; Tolkmitt et al., 1982) have demonstrated that depressed populations have lower pitch with less variation than healthy speakers. The lowering and/or restricted range of depressed speakers' pitch contributes to the 'flat' and 'monotone' perceptual auditory descriptions in the early depression literature. Furthermore, it has been shown in Nilsonne (1987) and Kuny and Stassen (1993) that after depression therapy, patients' normal pitch register and variation increased. The recorded F0/pitch change due to a depression-related disorder and its post-therapeutic return to normalcy in a speaker has particular importance for monitoring depressed patients' progress. Recorded F0/pitch transformations over time can be utilised to help indicate improvements in the patient's health and/or help determine the effectiveness of his/her prescribed clinical depressive disorder therapy (Karam et al., 2016). Although it is widely accepted by clinicians that depression disorders impact individuals' F0/pitch behaviour (i.e. refer to Chapter 2.1 for depressed speaker characteristics), in some studies (Alpert et al., 2001; Cannizzaro et al., 2014a; Mundt et al., 2012; Quatieri & Malyska, 2012), no significant difference in F0/pitch values between 'depressed' and 'non-depressed' speakers was found. However, direct comparisons in many of these studies are difficult to make due to differences in speaker gender, depression subtypes, severity levels, comorbidity, and speech elicitation mode tasks. Also, many of these studies used different F0/pitch analysis algorithms and used data recorded under different circumstances (e.g. environment, recording device, elicitation tasks). In addition to the pitch register of depressed populations, studies (Darby et al., 1984; Ellgring & Scherer, 1996; Fossati et al., 2003; Nilsonne, 1987, 1988; Sobin & Sackeim, 1997) have also indicated F0 laryngeal-glottal incoordination, especially in instances of high severity. In these aforementioned studies, when compared with healthy speakers, depressed speakers exhibited less motor control and had more difficulty maintaining normal speech vocal fold function. It is known that psychomotor retardation can cause interruptions in fine motor control (see previous Chapter 2.1); thus, during speech, the vocalization and articulatory processes are highly sensitive to any degree of motor dysfunction. The severity of depression and the degree of psychomotor retardation have been correlated with increased vocal fold dysfunction (Hicks et al., 2008) along with weaker harmonic formant amplitudes (Simantiraki et al., 2017). According to Sahu & Espy-Wilson (2016) and Flint et al. (1993), in many depressed speakers, the lack of laryngeal control caused by psychomotor retardation adds to a perceived 'breathiness' quality in their speech when compared with healthy populations. Moreover, Ozdas et al. (2004) and Quatieri & Malyska (2012) found that abnormal fluctuations in peak-to-peak F0 frequency/amplitude (e.g. shimmer, jitter) were indicative of depressed speakers, whereas healthy speakers did not exhibit these voice qualities.

Intensity has also been shown to be a strong indicator of depression severity. Researchers have indicated for some time that patients with depression often have weak vocal intensity and perceptually 'monoloud' articulatory attributes when speaking (Darby & Hollien, 1977; Newman & Mather, 1938; Ostwald, 1965; Scherer et al., 2015). Kuny and Stassen (1993) showed that as depressed patients underwent therapy for their disorder, their speech energy dynamics increased after recovery. Again, much like the analysis of pitch by Nilsonne (1987) and Kuny and Stassen (1993), it is believed that changes in speech intensity can provide severity tracking information for depressed patient monitoring. It is believed that low vocal intensity is common in depressed speakers because, as a subsymptom, they often exhibit diminished physical energy (i.e. tiredness, sleepiness) and sleep disorder symptoms (Nutt et al., 2008). Furthermore, depressives often display deficits in social dominance (i.e. increased passiveness during conversation, being less likely to raise volume or interject to demonstrate verbal stance dominance) (Allan & Gilbert, 2011; Osatuke et al., 2007). Conversely, however, it should be noted that for depressed speakers exhibiting heightened psychomotor agitation subsymptoms, their speech could have greater dynamics than usual, especially during a psychotic episode.

In Fig. 3.11, vowel comparisons between a 'non-depressed' and a 'depressed' speaker are shown using spectrograms. In examining these spectrograms for both speakers and their vowels (/i/, /a/), it is observed that the 'non-depressed' speaker demonstrates more power-spectrum dynamics over time than the 'depressed' speaker. The flatness is evident in the visible temporal smoothness of the 'depressed' speaker's spectrogram when compared with the 'non-depressed' speaker. Fig. 3.11 demonstrates that even within a small segment of speech, such as a spoken vowel, acoustic spectral-power differences can easily be visually observed between 'depressed' and 'non-depressed' speakers using spectrograms. With regards to the 'monoloudness' or 'hollowness' characteristics observed in depressed speakers, the speech power shown in Fig. 3.11 also contributes to the 'monopitch' aspects, as an increase in loudness generally correlates with an increase in pitch due to increased glottal flow (i.e. increased lung airflow velocity).


Figure 3.11: 3-dimensional spectrograms using a window length of 256 samples (20ms) with 220 samples of overlap and 512 discrete Fourier transform points. Comparison of two female American-English speakers uttering the /i/ (top) and /a/ (bottom) phonemes; non-depressed (a, c) and depressed (b, d). These vowel segments were extracted from the Sonde Health (SH1) database (for more information, refer to Chapter 5.3.5). The depressed speaker has less spectral dynamics and energy above 5kHz than the healthy speaker.

The overall prosodic 'flatness' (Darby & Hollien, 1977; Newman & Mather, 1938; Ostwald, 1965) can further be explained in terms of linguistic stress. In linguistic terms, speech modulation can be understood as being composed of a mixture of stressed and non-stressed sounds. In natural speech, linguistic stress functions at a phoneme unit level to permit greater segmental distinction between streams of interlinked phonemes (e.g. syllables, words, phrases). Linguistic stress improves speech intelligibility by emphasising how phoneme/syllable units differ from each other. Linguistic stress also provides audible cues as to which sound units (e.g. words, phrases) carry the most important informational content (Hockett, 1958; Ladefoged, 1967; Miller, 1963). In Hitchcock & Greenberg (2001) and Greenberg et al. (2003), it was shown that syllable stress perceptually influences an individual's ability to identify phonetic segments in spontaneous speech, especially temporal aspects of the vocalic nucleus (e.g. the most central part of a syllable unit). Trevino et al. (2011) discovered statistically significant correlations between the psychomotor subsymptom and vowel duration/signal power for most vowels in depressed speakers. By contrast, Trevino et al. (2011) found minimal statistical correlation between the agitation subsymptom and duration/signal power in depressed speakers' vowel production. The decrease in power-spectral dynamics found in depressed speakers contributes to weak linguistic stress, which consequently results in reduced speech clarity and a greater perceptual blur between speech segments (e.g. phonemes, words, phrases) (Cannizzaro et al., 2004a; Helfer et al., 2013; Mundt et al., 2012). These intensity linguistic-stress factors weaken the intelligibility of depressed speakers' speech when compared with healthy populations. For speech-based depression analysis, intelligibility is an area that has been examined broadly using subjective rather than objective, measured means.

Formant vowel studies (Scherer et al., 2016; Vlasenko et al., 2017) have also discovered measurable differences between depressed speakers and healthy speakers in regard to their vowel-space variance ranges. Similarly to Fig. 3.12 below, these studies (Scherer et al., 2016; Vlasenko et al., 2017) have found that the formant ranges of the corner vowels /a/, /u/, and /i/ are narrower and comprise less total area for depressed speakers than healthy speakers. Therefore, depressed speakers' speech has a reduction in vowel F1 and F2 ranges along with a decrease in formant intensities (i.e. higher intensities usually result in wider formant energy peaks and increased formant frequency values). Scherer et al. (2016) found that depression affects vowel frequencies F1 and F2 and the Vowel Space Area (VSA), a formant-based formula established by Bradlow et al. (1996) originally intended for measuring speech clarity. Scherer et al.'s VSA approach uses k-means clustering optimization, Euclidean distances based on the approximated /a, u, i/ clusters, and Heron's formula to compute a vowel space ratio (i.e. speakerVSA/referenceVSA). Using a similar F1 and F2 VSA-based approach to that of Scherer et al. (2016), Vlasenko et al. (2017) used a phoneme recognizer to compare recorded vowels of clinically depressed and non-depressed speakers on a gender-specific basis. Therein, Vlasenko et al. (2017) again found that F1 and F2 values for corner vowels, along with additional vowel distributions, were different for 'non-depressed' and 'depressed' speakers.



Figure 3.12: Similarly to Scherer et al. (2016) and Vlasenko et al. (2017), a comparison of the Vowel Space Area (VSA) using F1 and F2 values based on /a, i, u/ corner vowels for two Distress Analysis Interview Corpus – Wizard of Oz (DAIC-WOZ) database female speakers: (a) 'non-depressed' and (b) 'depressed'. The black triangle is the approximated reference VSA based on Fant (1960), whereas the red triangle is the approximated speaker's VSA. The small gray dots are sample data for each phoneme produced (~11k individual values). The 'depressed' speaker has relatively lower F1 and F2 ranges and a smaller VSA when compared with the 'non-depressed' speaker. The 'healthy' speaker also has a greater dynamic formant shift relative to Fant's corner vowel estimates than the 'depressed' speaker. The VSA ratio of the non-depressed speaker is 0.78, whereas that of the depressed speaker is 0.68.
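The VSA computation described above (triangle area from the corner vowels via Heron's formula, reported as a speaker-to-reference ratio) can be sketched as below. The formant values are hypothetical placeholders; real values would come from formant tracking, optionally after k-means clustering of per-frame F1/F2 samples as in Scherer et al. (2016).

import math

def vsa(corners):
    # Vowel Space Area from mean (F1, F2) coordinates of the corner vowels
    # /a/, /i/, /u/, using Heron's formula on the triangle they form.
    a, i, u = corners                        # each is an (F1, F2) pair in Hz
    d = lambda p, q: math.hypot(p[0] - q[0], p[1] - q[1])
    x, y, z = d(a, i), d(i, u), d(u, a)      # triangle side lengths
    s = (x + y + z) / 2                      # semi-perimeter
    return math.sqrt(s * (s - x) * (s - y) * (s - z))

# Hypothetical mean formant values (Hz) for a speaker and for reference
# corner vowels in the style of Fant (1960); values are for illustration only.
speaker = [(750, 1250), (320, 2300), (350, 850)]     # /a/, /i/, /u/
reference = [(800, 1300), (280, 2450), (320, 750)]   # /a/, /i/, /u/

ratio = vsa(speaker) / vsa(reference)   # speakerVSA / referenceVSA
print(f"VSA ratio: {ratio:.2f}")        # ~0.71 for these placeholder values

A ratio below 1 indicates a compressed vowel space relative to the reference, which is the pattern reported for depressed speakers in the studies cited above.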

Accompanying the speech are auxiliary speech behaviours that provide additional information regarding a speaker's state of mind and health. Auxiliary speech habits include sighs, breaths, laughs, pauses, and repeats. Auxiliary speech behaviours can also include habitual filler words (e.g. "umm", "hmm", "uh"). However, filler words have not been closely examined in the speech-based depression literature. Many studies (Alpert et al., 2001; Bucci & Freedman, 1981; Cannizzaro et al., 2004a; Darby et al., 1984; Ellgring & Scherer, 1996; Fossati et al., 2003; Hartlage et al., 1993; Nilsonne et al., 1987, 1988; Szabadi et al., 1976) have shown that depressed people exhibit a greater number of speech hesitations (i.e. pauses, repeats, false-starts) than non-depressed populations during communication, due to psychomotor agitation/retardation and cognitive deficits. One of the earliest studies evaluating depressed speaker pause behaviour was by Szabadi et al. (1976). Their research indicated that for automatic speech (e.g. counting), speakers have longer pause durations when depressed than after undergoing a two-month treatment. This improvement in speaker fluency and reduction in pauses also holds promise as a potential method to help monitor depressed patient progress over time. Since Szabadi's (1976) investigation, additional speech-based depression studies (Alghowinem et al., 2012; Alpert et al., 2001; Esposito et al., 2016; Mundt et al., 2012; Nilsonne et al., 1988; Stassen et al., 1998) have also evaluated speech pause durations, with further indication that speech production, and more specifically the language retrieval process, is interrupted by cognitive and concentration deficits caused by depression. In Esposito et al. (2016), pauses were examined in mildly to severely depressed speakers. For spontaneous speech, this study found significant pause duration lengthening exhibited by depressed speakers relative to non-depressed speakers. It has also been recorded that depressed patients take a significantly longer time to respond to verbal questions (Scherer et al., 2015), again hinting at abnormal speech processing latencies. Fig. 3.13 illustrates a phrase-level comparison between non-depressed/depressed speakers responding to the same query. Overall, as previously discussed herein, depressed speakers' speech can often be described as less dynamic, quieter, prosodically flatter, and less continuous than healthy populations' speech (Cummins et al., 2015a). Further, it is believed the speech uniformity exhibited in the depressed speaker's speech in Fig. 3.13b contributes to overall weaker intelligibility because it provides less distinction between phonemes, between the linguistic-stress components of words, and between the prosodic paralinguistic clues that help guide the meaning of what is being said.

Figure 3.13: Spectrograms of two female speakers, (a) non-depressed and (b) severely depressed, spontaneously responding to the same interview question, taken from the Distress Analysis Interview Corpus – Wizard of Oz (DAIC-WOZ). These spectrograms use a window length of 256 samples (20ms) with 50% overlap and 512 discrete Fourier transform points. Noticeably, the severely depressed speaker exhibits flat formant characteristics, whereas the non-depressed speaker shows a much higher degree of formant dynamics, especially above 1kHz.

A collection of the major acoustic characteristics that describe depressed speakers is shown in Table 3.1. Noticeably, over the decades, similar and consistent depressive speech characteristics have been recorded across the majority of studies. Remarkably, auxiliary speech behaviours, including pauses and pause durations, have been investigated a great deal. Although pauses are not technically an acoustic phenomenon, the pause behaviours exhibited by depressed speakers are a good indicator of a disruption in the speech processes described earlier in Chapter 3.1.


Table 3.1. List of acoustic-driven speech analysis of depressed speakers. The acoustic characteristic sets are divided into three categories: prosodic, spectral, and suprasegmental. The individual depressive speech attributes are listed below each characteristic category, wherein ↓ indicates a decrease and ↑ indicates an increase in mean or variance when compared with non-depressed populations.

[Table 3.1 body: rows list the studies Szabadi et al. (1976), Greden et al. (1981), Darby et al. (1984), Nilsonne (1987), Nilsonne (1988), Flint et al. (1993), Ellgring & Scherer (1996), France et al. (2000), Alpert et al. (2001), Moore et al. (2003), Cannizzaro et al. (2004a), Ozdas et al. (2004a), Mundt et al. (2007), Low et al. (2010), Sanchez et al. (2011), Trevino et al. (2011), Hashim et al. (2012), Quatieri & Malyska (2012), Horowitz et al. (2013), Bozkurt et al. (2014), Scherer et al. (2014), Williamson et al. (2014), Cummins et al. (2015a), Kiss & Vicsi (2015), Esposito et al. (2016), Yang et al. (2016), Mendiratta et al. (2017), and Vlasenko et al. (2017). Its columns cover prosodic attributes (F0 coordination, intensity, pitch, phone duration), spectral attributes (formant bandwidth, jitter, shimmer, breathiness), and suprasegmental attributes (rate of speech, pauses, phrase duration, articulatory errors, intelligibility); each cell marks a decrease (↓) or increase (↑) for that study.]

3.4.2 Linguistic Behaviours and Characteristics

Speech is inherently comprised of many linguistic attributes, such as the number of phonemes, lexical choice, syntactic variables (e.g. adjectives, adverbs, negation, quantifiers), grammatical complexity, phrase type (e.g. statement, question), context, subject matter, and previous statements (Collier et al., 2014). Linguistic behaviours in depressed speakers have been examined using both spoken transcripts and written analysis. Individuals with depression often exhibit linguistic deficits in their verbal language skills, such as the inappropriate/vague use of words, frequent abandonment of phrases, and use of overly redundant phrases (Breznitz & Sherman, 1987; Greden & Carroll, 1980; Hoffman et al., 1985). For instance, speech and written language studies have demonstrated pragmatic pronoun differences between healthy and depressed individuals based on the frequency of pronoun types. Many studies (Bucci & Freedman, 1981; Ramirez-Esparza et al., 2008; Rude et al., 2004; Smirnova et al., 2018; Stirman & Pennebaker, 2001; Vedula & Parthasarathy, 2017) have indicated that depressed speakers use more first-person singular pronouns (e.g. I, me, my) than collective pronouns (e.g. we, our, us). The excessive use of first-person pronouns and reflexive thoughts (e.g. self-centered topics) is believed to be a consequence of depressives' symptomatic self-focused attention (Pyszczynski & Greenberg, 1987). Often, individuals diagnosed with depression have an inclination towards being preoccupied with their own concerns to an excessive degree (Feldman & Gotlib, 1993). It is believed that this needless extraneous processing is caused by excessive rumination focused on themselves and on the nature or implications of their negative feelings (Nolen-Hoeksema, 1991; Watkins & Brown, 2002). Lott et al. (2002) and Nguyen et al. (2014) showed that depressed populations exhibited less lexical diversity and shorter average utterance lengths (i.e. number of words per phrase) than healthy populations. Further, when compared with healthy populations, many studies (Nguyen et al., 2014, 2015; Rude et al., 2004; Smirnova et al., 2018; Vedula & Parthasarathy, 2017) have shown that depressed individuals have increased negation use (e.g. not, never, cannot). Recently, Smirnova et al. (2018) explored the language use of depressed patients using linguistic analysis of their personal written text. Their study, which included over 100 depressed patients and multiple written self-reports, concluded that depressed patients use significantly more colloquialisms (e.g. informal language), unusual syntax structure (e.g. atypical word order), increased tautologies (e.g. lexical redundancy), and word ellipses (e.g. word omissions) when compared to non-depressed populations. While written language is not identical to how individuals speak, it does give insight into how depressed individuals construct their language and idiolect, such as lexical variety, grammatical structure, and topic content.

Only a handful of linguistic speech-based studies have concentrated on the linguistic errors made by depressed speakers. For example, Rubino et al. (2011) found that, when comparing depressed and healthy speakers, depressed speakers had significantly higher frequencies of referential failures. A referential failure is when a reference word or idea is made unclear, ambiguous, or interjected without proper introduction. In particular, some of the key referential failures exhibited by depressed speakers in Rubino et al. (2011) were vague references, absent content information, ambiguous word meanings, wrong word usage, and structural imprecisions. The increase in malformed linguistic errors contributes to vaguer or less clear speech-language content. Rubino et al. (2011) and Bucci and Freedman (1981) have suggested that linguistic referential communication disturbances are indicators of neurocognitive correlates of psychiatric abnormalities.
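Many of the cues surveyed above reduce to simple surface counts over a transcript. The sketch below is a minimal, illustrative Python example of such counts; the word lists and the example sentence are assumptions for illustration, not a validated clinical lexicon, and practical systems would operate on full session transcripts.

import re
from collections import Counter

# Illustrative marker lists for the lexical cues discussed in this sub-chapter.
FIRST_PERSON = {"i", "me", "my", "mine", "myself"}
COLLECTIVE = {"we", "our", "us", "ours"}
NEGATIONS = {"not", "never", "cannot", "no", "nothing"}

def lexical_markers(transcript):
    # Count first-person pronouns, collective pronouns, and negations, and
    # report lexical diversity (type-token ratio) for a transcript string.
    tokens = re.findall(r"[a-z']+", transcript.lower())
    counts = Counter(tokens)
    return {
        "first_person": sum(counts[w] for w in FIRST_PERSON),
        "collective": sum(counts[w] for w in COLLECTIVE),
        "negation": sum(counts[w] for w in NEGATIONS),
        "type_token_ratio": len(counts) / max(len(tokens), 1),
    }

print(lexical_markers("I never feel like myself and I cannot focus on my work"))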

Discourse analysis information and conversational exchange habits between speakers over time have also been shown to be useful for depression identification. In Williamson et al. (2016), semantic context analysis between depressed/non-depressed speakers and the interviewer revealed differences in the number and types of responses generated by the interviewer. In an earlier acoustic study by Scherer et al. (2014), it was also demonstrated that the interaction between the patient and the clinician or interviewer is inversely related to depression severity. Therefore, according to Williamson et al. (2016), linguistic discourse analysis cues regarding depression are not only contained within the patient's speech, but also within the interviewer's speech.

Table 3.2. List of linguistic-driven (both speech and written) analyses of depressed speakers. The linguistic characteristic sets are divided into three categories: lexical, grammatical, and suprasegmental. The individual depressive speech characteristics are listed below each attribute category, wherein ↓ indicates a decrease and ↑ indicates an increase in mean or variance when compared to non-depressed populations. It is observed that over the years, similar depressive linguistic characteristics have been found across the majority of studies presented here.

[Table 3.2 body: rows list the studies Andreasen & Pfohl (1976), Bucci & Freedman (1981), Oxman et al. (1988), Rosenberg et al. (1990), Stirman & Pennebaker (2001), Lott et al. (2002), Rude et al. (2004), Ramirez-Esparza et al. (2008), Rubino et al. (2011), Park et al. (2012), De Choudhury et al. (2013), Nguyen et al. (2014), Scherer et al. (2014), Resnik et al. (2015), Vedula et al. (2017), and Smirnova et al. (2018). Its columns cover first-person pronouns, collective pronouns, lexical diversity, distress-related words, achievement-related words, reflexive words, word discrepancy, negation, grammar errors, unusual syntax, referential failures, mean utterance length, colloquialisms, rumination, self-focus, intelligibility, and number of responses; each cell marks a decrease (↓) or increase (↑) for that study.]

3.4.3 Affect Behaviours and Characteristics

In Chevrie-Muller et al. (1985), a review of speech and psychopathology indicated that abnormalities of acoustic prosodic elements are important for recognizing depression disorders, as is how these elements function alongside linguistic information (i.e. words, phrase meaning) as an expression of an individual's affective state. Early subjective analyses of depressed populations that described patients' speech and associated behaviours relied heavily on affective expectations. For instance, if a patient says he/she is "doing great", but utters this statement with a negative acoustic affect inflection, it should raise an affective deficit warning flag with a clinician. Moreover, speech affect can be measured via the acoustic speech along with the linguistic word representations. If a patient characterizes his/her mood as "manic", "distressed", "sad", and "unstable" – these terms all carry a negative valence. Thus, it is not just 'how' the patient says something, but also 'what' he/she says that gives weight to the clinician's end diagnosis. The 'how' is generally represented by the acoustic information (e.g. paralinguistic attributes), whereas the 'what' is represented by the linguistic semantic information (e.g. word attributes). Most studies in speech-based depression have yet to tie these two aspects together and clearly establish a systematic affect approach to assess potentially depressed patients.


In Killgore (1999) and Witt et al. (2008), it was demonstrated that depressed patients exhibited low pleasure and arousal (e.g. degree of excitement) when compared with healthy populations. Further, low energy, low motivation, and listlessness are often attributed to individuals with depression, which can be perceived by others as low arousal (Abramson et al., 1978; Grayson et al., 1987). In general, low arousal in speech is indicated by acoustic cues, such as a decreased F0 contour, lower intensity, and flat spectral dynamics (Scherer, 1986, 1989; Johnstone & Scherer, 1999). Further, as expected, a state of depression is often attributed to a sad emotional state. Thus, a slower speech rate, decreased pitch variability, and weak vocal intensity are often synonymous with sadness (i.e. low arousal) as well (Juslin & Laukka, 2003). An example of arousal ratings for a 'depressed' and a 'non-depressed' speaker is given in Fig. 3.14 below. In this figure, it is apparent that, generally over the course of the entire file lengths, the non-depressed speaker has a higher overall arousal level than the 'depressed' speaker. It can also be observed that the arousal range of the 'non-depressed' speaker is larger than that of the 'depressed' speaker.


Figure 3.14. Example of normalized human-annotator arousal-rating evaluations of spoken responses to the same interviewer question from a non-depressed speaker (green) and depressed speaker (red) taken from the Audio-Visual Emotion Challenge 2014 (AVEC 2014) database. The overall arousal range for the nondepressed speaker is dynamically wider in range than the depressed speaker. Also, over the course of the recording, the non-depressed speaker maintained a higher 'arousal' value average than the depressed speaker. The non-depressed speaker file length is slightly shorter than the depressed speaker.

With regards to valence (i.e. the degree of positive to negative emotion), depressed people have difficulty inhibiting irrelevant negative material from their working memory. Studies by Joormann et al. (2005) and Joormann & Gotlib (2008) showed that depressives have an abnormal affinity towards excessive negative bias coinciding with a deficiency in positive affect norms. Negative affect persists longer in individuals with depression than in non-depressed populations (Bylsma et al., 2011; Peeters et al., 2003). For instance, in Murphy et al. (1999), it was shown that depressed patients are slower to respond to positive (e.g. happy) stimuli than negative (e.g. sad) stimuli during affective-shift task experiments. Depressed people also disengage from positive stimuli more quickly than from negative stimuli (Levens & Gotlib, 2015). For example, during memorization tasks, depressed populations have also shown increased recall of negative stimuli over positive stimuli when compared with healthy populations (McDowall, 1984). Cognitive neurological studies (Audenaert et al., 2002; Elliott et al., 1997; Okada et al., 2003) have shown that depressed people's attention performance deficits coincide with decreased neural activation in brain regions critical for cognitive control. Moreover, depressed individuals also show an absence of activity in crucial neural regions responsible for affective processing. During cognitive tasks in Jones et al. (2010) and Ravnkilde et al. (2002), unlike non-depressed populations, depressed individuals engaged in increased intrinsic processing that exhausted their ability to control the processing of information.

Sentiment analysis has primarily gained popularity in text-processing applications (i.e. social media, product reviews, financials) as a form of data mining involving the automatic analysis of affect, opinions, and attitudes. While sentiment analysis of text transcripts has demonstrated good performance in predicting affect from natural speech (Ojamaa et al., 2015), only in the last few years have affect-based ratings also been explored for depression severity prediction (de Choudhury et al., 2013; Devan et al., 2018; Huang et al., 2015). Kanske and Kotz (2012) used 120 words from the Leipzig Affective Norms for German (LANG) to examine how arousal/valence related to patient Depression Anxiety Stress Scores (DASS). In this study, each word had visual and auditory affect ratings (e.g. positive, neutral, negative) judged by non-depressed and depressed patients. Results demonstrated that greater negative valence across all word types correlated with higher DASS scores. Moreover, the results also showed that higher anxiety scores related to an increase in arousal rankings, and that recordings of positive words were longer than those of neutral and negative words. Other affective dimensions have been explored for depression to a lesser extent than arousal/valence. It has been suggested by Wolters et al. (2015) that analysis should include anger, as psychomotor symptoms of depression are complex, ranging from retardation to agitation. Moreover, in a recent affective text-processing study on depressed populations, Nguyen et al. (2014) found that depressed patients expressed more angry affective terms than healthy populations.

A collection of the typical affect characteristics that describe depressed speakers is shown in Table 3.3 below. There have been surprisingly fewer investigations of how depression and affect are linked to speech behaviours (i.e. most existing work comprises text-based (e.g. written) behavioural studies), when compared with the individual acoustic and linguistic studies. As Pampouchidou et al. (2017) concluded from an evaluation of the affect literature, due to advances in technology, such as speech recognition, social media, and video processing, affective behavioural research has increased significantly within only the last few years. Consistent depressive affective characteristics have been recorded across the majority of studies listed in Table 3.3.

Table 3.3. List of affect-driven (both spoken and written) analyses of depressed speakers. The affect characteristic sets are divided into three categories: arousal, valence, and other. The individual depressive speech attributes are listed below each feature category, wherein ↓ indicates a decrease and ↑ indicates an increase in mean or variance when compared to non-depressed populations. It is observed that over the years, similar depressive speech characteristics have been found across the majority of studies presented here.

[Table 3.3 body: rows list the studies McDowall (1984), Hall et al. (1995), Joormann et al. (2005), Osatuke et al. (2007), Joormann & Gotlib (2008), Capecelatro et al. (2013), Wang et al. (2013b), Nguyen et al. (2014), Morales et al. (2016), Fatima et al. (2017), Seabrook et al. (2018), and Stankevich et al. (2018). Its columns cover arousal (high, low), valence (negative, positive), and other dimensions (dominance, anxiety/stress, sadness, well-being, anger); each cell marks a decrease (↓) or increase (↑) for that study.]

3.5 Summary

This chapter has given an overview of the mechanisms involved in speech production, including the main anatomical-physiological components and voiced/unvoiced articulatory requirements. The source-filter theory and speech chain concepts were presented, providing better insight into how the process of speech involves separate but coordinated actions, any of which can be interrupted by psychogenic illnesses, such as depression. A brief background on speech measurement was also given in this chapter, highlighting the transition from analog to digital processing, especially in terms of the spectrogram, discrete speech signal processing methods (e.g. FFT, STFT), and automatic speech analysis toolkits, discussed in more detail in sub-chapters 4.1.1-4.1.3. This chapter also identified acoustic and linguistic abnormalities commonly exhibited by depressed populations when compared with healthy populations. Further, human affect was recognized as a high-level speech trait revealing different dimensions of inferred sentiment, such as arousal, valence, and dominance.


Chapter 4 AUTOMATIC SPEECH-BASED DEPRESSION ANALYSIS BACKGROUND

An interactive, objective virtual human interview or interactive voice response depression assessment tool (Cummins et al., 2015b; DeVault et al., 2013, 2014, 2015; Mundt et al., 2007) has several practical advantages in the diagnosis of depression disorders. As one example, a virtual human interviewer could in future provide a non-biased, uniform, automated prescreening assessment before a patient arrives to meet with his/her clinician. This automation can help reduce medical costs, prioritize at-risk patients, and facilitate expedited diagnosis of clinical depression potentially resulting in a decrease in the number of annual suicides. There are several other dialogue advantages to such an automated approach: effective open-ended enquires without typical clinical session time constraints; questions using consistent verbal and non-verbal behaviours, thus imparting less influence on a patient (Scherer et al., 2014) and attuning enquiry stylization to better associate with different patients’ backgrounds (Gonzalez et al., 2000; Howes et al., 2014; DeVault et al., 2013, 2014; Yu et al., 2013; Yang et al., 2013). However, any such approach will require automatic measurement for depression assessment. The measurement and examination of vocal characteristics is advantageous over other biosignals because of its naturalistic communicative form and non-invasive collection without complex expensive machines that require extensive specialized training (e.g. MRI, PET, SQUID). For example, a state-of-the-art Magnetic Resonance Imaging (MRI) machine costs approximately $1.5m USD (Sarracanie et al., 2015). This stationary, noisy, large device consumes an enormous amount of power/helium and requires a specially designed location. An MRI average annual operational cost is between $500k and $1m USD (PIS, 2014). By contrast, a state-of-the-art smartphone with speech recording capabilities can be purchased and operated yearly for less than $1k USD. Although both an MRI and smartphone can be used to potentially evaluate depression, by comparison, a smartphone is far more practical as a data collection device due to its overall cost and free mobility. Speech has considerable promise in


mobile or smart device healthcare integration, as it can be measured cheaply, conveniently, non-intrusively, and regularly – and its data can be easily recorded, stored, and processed on the device or via a cloud-based server (Ben-Zeev et al., 2015). As smart devices have increased in demand, so has the pervasiveness of routinely used smart device applications, or apps. Many smart devices already have integrated health apps that automatically track physical activity (e.g. walking, running) and sleep patterns and calculate caloric intake. Companies such as AppleTM, GoogleTM, and MicrosoftTM have provided software kits designed exclusively for medical researchers to collect data and develop new health apps (Ventola, 2014). Recently, the Audio Visual Emotion Challenge (AVEC) (Ringeval et al., 2015, 2017) has motivated research in the area of speech and depression, including new linguistic text-based methods, such as topic modeling (Gong & Poellabauer, 2017) and natural language processing (Dang et al., 2017). However, these types of approaches mainly treat acoustic, linguistic, and affective information separately (e.g. fusing the outputs of two independent subsystems) rather than examining interdependencies within the acoustic-linguistic-affect transcript content. Acoustic features related to prosodic, spectral, and suprasegmental characteristics have been examined in many speech-based depression studies, as mentioned previously in sub-chapter 3.4.1. Moreover, studies (Low et al., 2010; Sturim et al., 2011; Mantri et al., 2013; Cummins et al., 2013b) have looked at the relationship between depressed speech acoustic features and patients' clinical HAMD depression scores. Other features such as jitter and glottal flow have also demonstrated higher values for individuals with depression and/or suicidal tendencies (Ozdas et al., 2004). Research evaluating phonological speaking rate has also indicated usefulness in classifying varying degrees of depression (Trevino et al., 2011). In addition to acoustic-based analysis, spoken language also contains explicit linguistic-based information, which can be used to quantitatively analyze an individual’s language structure. By applying Natural Language Processing (NLP) approaches to recorded speech transcripts, differences between healthy and depressed speakers have been discovered, albeit to a minor extent. In several studies (Andreasen & Pfohl, 1976; Cannizzaro et al., 2004a; Lott et al., 2002; Oxman et al., 1988; Rosenberg et al., 1990; Rubino et al., 2011; Rude et al., 2004; Williamson et al., 2016), linguistic information has been used to identify depression in speakers via their word-frequencies, topic coverage range (e.g. word frequency within specific subject categories), n-grams (e.g. bigrams, trigrams), syntax/semantic clarity, and psycholinguistic properties (e.g. age of acquisition,


concreteness, imaginability). Surprisingly, despite the robust and effective depression classification performance reported for text-based speech analysis, much remains to be explored with these approaches in terms of automatic analysis and elicitation design. Chapter 4 contains a brief background on data selection in speech applications and on commonly used acoustic, linguistic, and affect speech features for speech-based depression analysis. Included is a breakdown of open-source voice feature extraction toolkits and machine learning techniques, which are later applied in the experiments described in Chapters 6-12.

4.1 Data Selection

Data selection is the critical process of determining the most suitable data for analysis amongst different data types and sources, while also using appropriate instruments during data collection (ORI, 2018). Data selection has been used in most applications of speech processing, such as speech recognition, speaker identification, and emotion prediction/classification (Boakye & Peskin, 2004; Ishihara et al., 2010; Reynolds et al., 1995; Shriberg et al., 2008). In Zhang et al. (2017b) it was suggested that, for the extraction of information from the speech biosignal for use with automatic speech applications, there is still a high demand for better quality, more diverse data types, and larger amounts of data. While more data is often preferable to less, it is commonly observed that, beyond a certain point, adding data to a speech processing application yields progressively smaller performance improvements. Consequently, a major dilemma for any speech processing application is how much data is enough, or perhaps more importantly, which parts of the data have the least informational redundancy. Moreover, a major obstacle for automatic speech applications is the lack of sufficient labeled data, again in terms of quantity and quality. As discussed later in Chapter 5, many current speech depression databases have data sparsity (e.g. low resource – especially across depression subtypes and languages), mismatched data conditions (e.g. microphone/smartphone, office/home), minimal metadata (i.e. usually only gender/age), and/or unreliable markings (e.g. non-clinical assessment scores, improper transcript entries). Conventionally, data selection and its annotation have been performed manually, which is often expensive and time-consuming, and which has further been shown to yield inconsistent agreement across different annotators (Hayden, 1950; Mines, 1978).


Although human data selection for speech is still popular today, there has been a push towards automatic methods, especially for common speech processing tasks, for example voice activity detection (VAD), as shown in Fig. 4.1 below. VAD is a standard practice in most automatic speech processing applications because it aids in the reduction of undesirable non-speech background noise. If extraneous noise unrelated to speech is included during speech analysis, it can lead to inaccurate measures of the speech biosignal and result in poor performance.

Figure 4.1. Example of speech data selection using voice activity detection (VAD); the waveform (top) and spectrogram (bottom). The dotted circles indicate areas where speech is located. During data selection using VAD, only the circled speech segments will be analyzed. This healthy female speech sample was taken from the Distress Analysis Interview Corpus – Wizard of Oz (DAIC-WOZ) database (for more information see Chapter 5.4.4).
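To make the data selection step concrete, the sketch below shows a minimal energy-threshold VAD over fixed-length frames, written in Python with NumPy. This is only an illustrative assumption about how frame-wise speech/non-speech decisions might be made; it is not the VAD used to produce Fig. 4.1, and the frame length, hop size, and threshold are arbitrary choices.

```python
import numpy as np

def energy_vad(signal, frame_len=400, hop=160, threshold_db=-35.0):
    """Label each frame as speech (True) or non-speech (False) using a
    simple log-energy threshold relative to the loudest frame."""
    n_frames = 1 + max(0, (len(signal) - frame_len) // hop)
    energies = np.empty(n_frames)
    for i in range(n_frames):
        frame = signal[i * hop: i * hop + frame_len]
        energies[i] = np.sum(frame.astype(float) ** 2) + 1e-12
    log_e = 10.0 * np.log10(energies)
    # Frames within `threshold_db` of the maximum log-energy are kept as speech.
    return log_e > (log_e.max() + threshold_db)

# Example: a 3 s, 16 kHz noise-like signal with a louder "speech" burst in the middle.
rng = np.random.default_rng(0)
x = 0.01 * rng.standard_normal(16000 * 3)
x[16000:32000] += 0.5 * np.sin(2 * np.pi * 150 * np.arange(16000) / 16000)
mask = energy_vad(x)
print(f"{mask.sum()} of {mask.size} frames labelled as speech")
```

In practice, energy thresholds are often combined with zero crossing rate or statistical noise models, and only the frames labelled as speech are passed on to feature extraction.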

It is known that, depending on the speech analysis approach (e.g. prosodic, spectral, suprasegmental) and application, some types of data are more discriminative than others. For instance, data selection can also be applied to compare specific phoneme types or specific words. Phoneme-specific data selection approaches help to reduce the phonetic variability found throughout typical speech samples, where all types of phonemes occur. It has been demonstrated that even a single phoneme type (i.e. vowel, diphthong) can provide discriminative information for depression classification/prediction tasks (Boakye & Peskin, 2004; Scherer et al., 2016; Sethu et al., 2008). Studies (Jiang et al., 2017; Long et al., 2017a; Scherer et al., 2015; Trevino et al., 2011; Vlasenko et al., 2017) on speech-based depression analysis have only recently started to explore new data selection techniques beyond traditional VAD and phoneme-dependent methods. There is still much unknown about how acoustic, linguistic, and affective measures can be applied as data selection techniques for depression assessments. These new types of speech-based measures have the potential to automatically extract and exploit important discriminative information from the speech biosignal for depression analysis, while also contributing to the general understanding of speech-based data selection methodology.


4.2 Speech Feature Types for Depression Analysis

4.2.1 Acoustic
The most common acoustic features in the depression literature are based on the speech characteristics discussed earlier in sub-chapter 3.4.1 and shown previously in Table 3.1. Many types of acoustic speech parameters have been explored for speech-based depression classification and prediction tasks. While glottal parameters have been used as features in several studies (Low et al., 2010; Moore et al., 2008; Ooi et al., 2013, 2014) to measure vocal source characteristics (e.g. F0, excitation cycle, and airflow in the vocal tract) of depressed/non-depressed speakers, prosodic parameters remain among the most popular features investigated for speech-based depression analysis (Cummins et al., 2015b; Hönig et al., 2014; Moore et al., 2003, 2008; Trevino et al., 2011). Prosodic parameters include the extraction of pitch and formant frequencies, loudness, duration, and voice quality features (Chevrie-Muller et al., 1985). Spectral parameters (e.g. cepstral) describing the short-term power spectrum of the speech biosignal have also been investigated as features for speech-based depression studies (Alghowinem et al., 2013b; Cummins et al., 2011, 2015b; Low et al., 2009; Stolar et al., 2018; Williamson et al., 2014). For a summary of speech-based depression studies using acoustic features, refer to Appendix D.

Pitch
The speech-processing literature often refers to F0 and pitch interchangeably. F0 is an estimation of the quasi-periodic rate of vibrations caused by the opening and closure of the glottis per speech cycle. Again, as mentioned previously in Chapter 3.1, F0 is the actual measurable fundamental frequency value, whereas pitch is a perceptual value based on perceived intensity and spectral characteristics. Pitch provides valuable insight into the nature of the excitation source; further, it can indicate sentiment phrase transitions and/or voice under stress (e.g. physical, emotional) (Hansen & Patil, 2007; Scherer, 2003). As noted many years ago by Rabiner et al. (1976), there are two main types of pitch detection algorithms: time-domain and frequency-domain based. The time-domain pitch method is estimated using the raw speech waveform, wherein peak and valley measurements are generally recorded over the course of many frames. The frequency-domain pitch method relies on the semi-periodicity of the speech biosignal in the time domain having a frequency spectrum with a series of impulses at the fundamental frequency and its harmonics. A popular time-domain method for calculating pitch is the short-time autocorrelation A, which is computed frame-by-frame and determines the time difference between two signals, where one is a nearly perfect delayed version of the other. The autocorrelation can be defined by equation (4.1):

$$A[l] = \frac{1}{N}\sum_{n=l}^{N-1} s[n]\, s[n-l] \qquad (4.1)$$

where s[n] is the speech window with N samples per window and l is the lag index. The autocorrelation method requires a relatively large time window in order to adequately cover the F0 ranges found in speech.

Intensity
According to Zwicker and Fastl (1999), the actual intensity of a speech biosignal is perceptual and based on psychoacoustic attributes, such as pitch, duration, and spectral shape. Thus, for automatic speech processing applications, a low-level descriptor is often utilised for intensity extraction purposes (Kießling, 1997; Schuller et al., 2010a). Measuring the energy over a frame interval, the squared short-term energy E can be computed as follows (Lokhande et al., 2011):

$$E = \sum_{n=0}^{N-1} s[n]^{2} \qquad (4.2)$$

Voicing
To automatically help determine which speech frames are voiced or unvoiced, the zero crossing algorithm can be used. The zero crossing rate (ZCR) is a calculation of the number of zero crossings per unit time within a frame. Deller et al. (1993) define the zero crossing rate as:

$$ZCR[k] = \sum_{n=1}^{N-1} s_{d}[n], \quad \text{with } s_{d}[n] = \begin{cases} 0 & \text{if } \operatorname{sign}(s[n]) = \operatorname{sign}(s[n-1]) \\ 1 & \text{if } \operatorname{sign}(s[n]) \neq \operatorname{sign}(s[n-1]) \end{cases} \qquad (4.3)$$

It is critical that the DC offset is removed from the signal prior to the ZCR calculation. The ZCR provides summary information about the frequency distribution of the input signal. Thus, a high ZCR implies a high-frequency sound (e.g. voiceless phoneme, random noise), whereas a low ZCR is an indication of a low-frequency sound (e.g. voiced phoneme) (Lokhande et al., 2011).
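As a concrete illustration of equations (4.1)-(4.3), the following Python/NumPy sketch computes the frame energy, zero crossing count, and a crude autocorrelation-based F0 estimate for a single synthetic frame. The frame length, sampling rate, and F0 search range are illustrative assumptions rather than values used elsewhere in this thesis.

```python
import numpy as np

def frame_energy(frame):
    # Squared short-term energy, as in equation (4.2).
    return float(np.sum(frame ** 2))

def zero_crossing_rate(frame):
    # Count of sign changes between adjacent samples, as in equation (4.3).
    signs = np.sign(frame)
    signs[signs == 0] = 1                 # treat exact zeros as positive
    return int(np.sum(signs[1:] != signs[:-1]))

def autocorr_f0(frame, fs, f0_min=70.0, f0_max=400.0):
    """Crude F0 estimate: pick the lag maximising the autocorrelation (4.1)
    within a plausible pitch-period range."""
    frame = frame - frame.mean()          # remove DC offset before analysis
    ac = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    lag_min, lag_max = int(fs / f0_max), int(fs / f0_min)
    lag = lag_min + int(np.argmax(ac[lag_min:lag_max]))
    return fs / lag

fs = 16000
t = np.arange(int(0.04 * fs)) / fs        # one 40 ms analysis frame
frame = np.sin(2 * np.pi * 120 * t)       # synthetic 120 Hz "voiced" frame
print(frame_energy(frame), zero_crossing_rate(frame), autocorr_f0(frame, fs))
```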


Spectral Centroid
The spectral centroid is based on the weighted mean of frequencies within the speech biosignal, wherein the centre of the energy power is calculated. The speech biosignal is divided into a number of frequency bands, wherein the highest energy regions per frequency band are recorded. The spectral centroid has been attributed to the perceptual degree of ‘brightness’ in a speaker’s voice (Low et al., 2010). The spectral centroid is calculated by the following equation (4.4):

$$SC = \frac{\sum_{k=0}^{K-1} f_{k}\, S[k]}{\sum_{k=0}^{K-1} S[k]} \qquad (4.4)$$

where S[k] is the magnitude of the power spectrum for bin number k, with bin centre frequency f_k and K total bins (Low, 2011).

Spectral Flux
The spectral flux measures how quickly power changes within the speech biosignal by comparing adjacent power spectra. In other words, it measures the frame-to-frame spectral shape variance. This measure is calculated as the Euclidean norm of the difference in power between adjacent frames:

$$SF = \lVert S[k] - S[k+1] \rVert \qquad (4.5)$$

where S[k] is the short-time Fourier transform of s[n] at frame index k (Low, 2011). The spectral flux is normalized from 0 to 1.

Spectral Roll-Off
The spectral roll-off is yet another measure of the spectral contour. The spectral roll-off is defined as the frequency point below which 80% of the power spectrum rests.

Shimmer
Shimmer refers to the short-term cycle-to-cycle perturbation in the amplitude of the speech biosignal (Asgari et al., 2014; Godsill & Davy, 2002). Voice shimmer is an objective measure of glottal resistance and noise (Brockmann et al., 2011). For instance, a speaker exhibiting excessive shimmer will have a perceptible breathiness in his/her voiced speech sounds. This feature has been used to evaluate many voice disorders (e.g. nodules, tumors, strain). Generally, a normal shimmer cycle-to-cycle difference is below 0.70 dB, or a variation of less than 7% of the speaker’s mean amplitude (Baken & Orlikoff, 2000; Teixeira et al., 2013).


To compute shimmer, the speech waveform is represented by a harmonic model, expressed in continuous time as:

$$s(t) = a_{0}(t) + \sum_{h=1}^{H} a_{h}(t)\cos(2\pi f_{0} h t) + \sum_{h=1}^{H} b_{h}(t)\sin(2\pi f_{0} h t) \qquad (4.6)$$

In equation (4.6), the coefficients are allowed to vary over time, as noted by $a_{h}(t)$ and $b_{h}(t)$. Therefore, the harmonic approach captures the sample variation in harmonic amplitude within a frame. Generally, for shimmer, continuity constraints are imposed using a small number of basis functions $\psi_{i}$, as shown below in equation (4.7):

$$a_{h}(t) = \sum_{i=1}^{I} \alpha_{i,h}\,\psi_{i}(t), \qquad b_{h}(t) = \sum_{i=1}^{I} \beta_{i,h}\,\psi_{i}(t) \qquad (4.7)$$

There are a few different types of shimmer measurements: mean shimmer dB, mean shimmer percent, and the amplitude perturbation quotient (Teixeira et al., 2013; Teixeira & Fernandes, 2014). The mean shimmer dB examines the mean absolute dB-SPL difference between sequential vocal amplitudes measured during sustained phonation (i.e. usually a held vowel). The mean shimmer dB can be represented as:

$$SH\_dB = \frac{1}{N-1}\sum_{i=1}^{N-1} \left| 20\log_{10}\!\left(\frac{A_{i+1}}{A_{i}}\right) \right| \qquad (4.8)$$

where A_i are the extracted peak-to-peak amplitude values and N is the number of fundamental frequency periods. The mean shimmer percent is the mean absolute cycle-to-cycle difference in the vocal amplitude divided by the mean amplitude, which is then multiplied by 100 to produce an average percent value.

Jitter
Jitter refers to the short-term cycle-to-cycle perturbation in the pitch frequency of the speech biosignal (Asgari et al., 2014). Perturbation analysis is based on the fact that small fluctuations in frequency reflect the inherent noise of the voice. Much like shimmer, jitter is also considered a measure of vocal cord stability and control (Brockmann et al., 2011). Thus, as jitter increases, so does the auditory perception of vocal hoarseness. Jitter is also used to help diagnose many common voice pathologies. Normal healthy voices generally produce less than 1% frequency variability between cycles (Baken & Orlikoff, 2000; Teixeira et al., 2013). Traditionally, the computation of jitter assumes that speech parameters are constant during the frame. These parameters can be reconstructed based on the harmonic structure of the speech biosignal. To estimate jitter, a matched filter, built from a reconstructed segment one pitch period long, is convolved with the original speech waveform.


The distance between the maxima in the convolved signal defines the pitch periods (Teixeira & Fernandes, 2013). The perturbation of the period is normalized with respect to the given pitch period, and its standard deviation is an estimate of the jitter. Thus, this method allows the computation of jitter within the analysis window. There are a few different jitter measurements, such as the mean absolute jitter, mean percent jitter, relative average perturbation jitter, and m-point period perturbation quotient jitter. The mean absolute jitter (MAJ) is the mean absolute difference between consecutive vocal periods measured during a held phonation task. It is calculated as:

$$MAJ = \frac{1}{N-1}\sum_{i=1}^{N-1} \left| T_{i} - T_{i+1} \right| \qquad (4.9)$$

where T_i are the calculated glottal period lengths and N is the number of extracted glottal periods. The mean percent jitter is derived by taking the mean absolute jitter value and dividing it by the mean vocal period during phonation; this value is then multiplied by 100 to yield a percent average. The relative average perturbation jitter (RAPJ) is a measure that minimizes the effects of long-term F0 changes, such as a slow fall/rise in pitch. It compares the average of 3 consecutive cycles with the given period. The average difference is divided by the mean period, after which the result is multiplied by 100 to yield the relative average perturbation. The relative average perturbation jitter is derived from the following equation (4.10):

$$RAPJ = \frac{\dfrac{1}{N-2}\sum_{i=2}^{N-1} \left| T_{i} - \dfrac{1}{3}\left(T_{i-1} + T_{i} + T_{i+1}\right) \right|}{\dfrac{1}{N}\sum_{i=1}^{N} T_{i}} \times 100 \qquad (4.10)$$

Similarly to the relative average perturbation jitter calculation, the m-point period perturbation quotient jitter (PPQJ) uses multiple neighbouring values to calculate its average. Generally, m is set to 5, 9, or 11. A 5-point period perturbation quotient for jitter is expressed mathematically as:

$$PPQJ = \frac{\dfrac{1}{N-4}\sum_{i=3}^{N-2} \left| T_{i} - \dfrac{1}{5}\sum_{m=i-2}^{i+2} T_{m} \right|}{\dfrac{1}{N}\sum_{i=1}^{N} T_{i}} \times 100 \qquad (4.11)$$
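To ground the perturbation measures above, the sketch below computes the mean shimmer dB (4.8), mean absolute jitter (4.9), and relative average perturbation jitter (4.10) from pre-extracted cycle amplitudes and periods. The input arrays are synthetic, illustrative values; in practice the cycle boundaries would come from a pitch-marking step such as the matched-filter procedure just described.

```python
import numpy as np

def mean_shimmer_db(amps):
    # Mean absolute dB difference between consecutive cycle amplitudes, eq. (4.8).
    amps = np.asarray(amps, dtype=float)
    return np.mean(np.abs(20.0 * np.log10(amps[1:] / amps[:-1])))

def mean_absolute_jitter(periods):
    # Mean absolute difference between consecutive periods, eq. (4.9).
    periods = np.asarray(periods, dtype=float)
    return np.mean(np.abs(np.diff(periods)))

def rap_jitter(periods):
    # Relative average perturbation, eq. (4.10): deviation of each period from
    # its 3-cycle local average, normalised by the mean period, as a percentage.
    periods = np.asarray(periods, dtype=float)
    local_avg = (periods[:-2] + periods[1:-1] + periods[2:]) / 3.0
    return np.mean(np.abs(periods[1:-1] - local_avg)) / periods.mean() * 100.0

# Illustrative glottal cycle data: periods in seconds, peak amplitudes (arbitrary units).
periods = 1.0 / 120.0 + 0.00005 * np.random.default_rng(1).standard_normal(50)
amps = 1.0 + 0.03 * np.random.default_rng(2).standard_normal(50)
print(mean_shimmer_db(amps), mean_absolute_jitter(periods), rap_jitter(periods))
```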

Harmonic-to-Noise Ratio
By estimating the parameters of the harmonic model, the noise can be computed by subtracting the reconstructed signal from the original speech biosignal. Given the estimated harmonic model parameters for each frame, the harmonic-to-noise ratio (HNR) and the ratio of the energy in the first and second harmonics (HND) can be computed as follows:

$$HNR = \log \sum_{h=1}^{H}\left(a_{h}^{2} + b_{h}^{2}\right) - \log \sum_{t}\left(y(t) - s(t)\right)^{2} \qquad (4.12)$$

$$HNR = \log \sum_{h=1}^{H} AC_{h} - \log \sum_{t}\left(y(t) - s(t)\right)^{2} \qquad (4.13)$$

$$HND = \log AC_{1} - \log AC_{2} \qquad (4.14)$$

where $a_{h}$ and $b_{h}$ are the estimated harmonic coefficients, and $AC_{1}$ and $AC_{2}$ are the energies associated with the first and second harmonics, respectively. However, Boersma (1993) provides a simpler form for calculating the harmonic-to-noise ratio (HNR):

$$HNR = 10\log_{10}\frac{AC_{h}[T_{0}]}{AC_{h}[0] - AC_{h}[T_{0}]} \qquad (4.15)$$

where $AC_{h}[0]$ is the autocorrelation coefficient at the origin, which consists of all the energy in the signal, and $AC_{h}[T_{0}]$ is the component of the autocorrelation related to the fundamental period energy. The disparity between the total signal energy and the fundamental period energy is presumed to be the noise portion of the energy. For example, if a speech recording has an HNR of 0 dB, the amount of energy in the harmonic part is identical to that in the noisy part. Therefore, a larger HNR value is desirable for speech analysis because it generally means there is less noise. If a speaker exhibits a below-average HNR value, his/her voice will have noticeable hoarseness.

Prosodic Timing Features
Prosodic timing is often used to evaluate the rate-of-speech. The rate-of-speech is known to be highly individual (Goldman-Eisler, 1961, 1968) and, further, may be heavily influenced by familiarity with the language topic, familiarity with the social environment, and/or the speaker's cognitive load (Cichocki, 2015; Haynes et al., 2015; Laan, 1992; Trouvain et al., 2001). Despite these speaker variables within a speech sample, if normalized across speakers, prosodic timing information provides comparative speaker fluency measures. Among the most common measures for prosodic timing are the following:

$$SR = \frac{\text{Total Number of Syllables}}{\text{Total Time Period}} \qquad (4.16)$$

$$PT = \text{Total Duration of Voiced Speech} \qquad (4.17)$$

$$AR = \frac{\text{Total Number of Syllables}}{\text{Total Duration of Voiced Speech}} \qquad (4.18)$$

$$MSD = \frac{\text{Phonation Time}}{\text{Total Number of Syllables}} \qquad (4.19)$$

To determine the speech rate (SR) in equation (4.16), the speech data must be segmented either by hand or by automatic processes (Boersma & Weenink, 2014) to determine the individual syllables. The speech rate is calculated using one-second evaluation time periods, and an average and standard deviation are then derived over the entire speech recording. The phonation time (PT), equation (4.17), is a count based on the number of voiced frames. The numbers of unvoiced and silence frames are often calculated during prosodic timing analysis as well. The voiced, unvoiced, and silence counts are often converted into percentages, as this can help to normalize these features across different speakers’ verbal idiolects and speech tasks (e.g. spontaneous, automatic, read). The articulation rate (AR), as shown by equation (4.18), is slightly different from the speech rate in that it is a ratio based only on voiced frames. The mean syllable duration (MSD), equation (4.19), is another ratio-based feature wherein the phonation time is divided by the number of syllables. In the speech-based depression literature, prosodic timing features are generated over an entire speech recording, rather than over shorter segments, which could indicate exact moments of disfluency or increased fluency.
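The sketch below is a minimal illustration of the prosodic timing measures in equations (4.16)-(4.19), assuming a syllable count and a per-frame voicing decision are already available; the frame hop and example values are illustrative assumptions only.

```python
import numpy as np

def prosodic_timing(n_syllables, voiced_mask, frame_hop_s=0.01):
    """Speech rate, phonation time, articulation rate, and mean syllable
    duration following equations (4.16)-(4.19), computed from a syllable
    count and a boolean per-frame voicing decision."""
    total_dur = len(voiced_mask) * frame_hop_s           # total recording time (s)
    phonation_time = float(np.sum(voiced_mask)) * frame_hop_s
    return {
        "SR":  n_syllables / total_dur,                  # syllables per second overall
        "PT":  phonation_time,                           # seconds of voiced speech
        "AR":  n_syllables / phonation_time,             # syllables per voiced second
        "MSD": phonation_time / n_syllables,             # mean syllable duration (s)
    }

# Illustrative example: 10 s recording, roughly 60% voiced, 38 syllables detected.
voiced = np.random.default_rng(3).random(1000) < 0.6
print(prosodic_timing(38, voiced))
```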

Cepstral Features
Mel-frequency cepstral coefficient (MFCC) features are used in the majority of speech processing applications, such as speaker recognition, language identification, and automatic speech recognition. MFCCs are a robust, compact parametric representation of the acoustic speech biosignal.

Speech Biosignal → Windowing Process → DFT → mel-Frequency Warping → log → DCT → mel-cepstra

Figure 4.2. The mel-frequency cepstral coefficient (MFCC) calculation process begins with segmenting the speech signal into a series of finite consecutive windows; the Discrete Fourier Transform (DFT) is applied to each windowed frame, after which mel-frequency warping converts the signal representation into the mel-spectrum using a log energy per filter. The DCT is then computed to decorrelate the log energies and produce the mel-cepstral coefficients.

A mel is a unit of measurement of perceived pitch, and takes into account that humans have decreased sensitivity to sounds at higher frequencies. The human auditory system does not perceive sounds on a linear scale above approximately 1 kHz; beyond this range a non-linear logarithmic scale applies. This nonlinearity in auditory perception is represented by the mel-scale,


wherein the pitch of a 1 kHz tone 40 dB above the perceptual auditory threshold is defined as 1000 mels. The following equation computes mels for any given frequency f in Hz:

$$\operatorname{mel}(f) = 2595\log_{10}\!\left(1 + \frac{f}{700}\right) \qquad (4.20)$$

To help reconstruct the subjective human auditory spectrum, a set of filterbanks is used per desired mel-frequency component. These filter banks use triangular bandpass frequency responses with mel-frequency scale spacing between each filter bank. The mel filterbank energies S_k are then computed from the resulting sub-band signals, and a DCT is applied to decorrelate the features, resulting in MFCCs:

$$C_{n} = \sum_{k=1}^{K} \log(S_{k})\cos\!\left[n\left(k - \frac{1}{2}\right)\frac{\pi}{K}\right], \qquad n = 1, 2, \ldots, K \qquad (4.21)$$

MFCC features provide information on the power spectral envelope of a sequence of frames. However, to obtain dynamic information on the coefficient trajectories over time, Δ (delta) and Δ-Δ (acceleration) coefficients can be calculated using:

$$D_{t} = \frac{\sum_{\phi=1}^{\Phi} \phi\left(C_{t+\phi} - C_{t-\phi}\right)}{2\sum_{\phi=1}^{\Phi} \phi^{2}} \qquad (4.22)$$

where D_t is the vector of Δ coefficients at time t calculated from the static coefficients C, and Φ denotes the number of frames on each side used to calculate the deltas (Low et al., 2009).

Automatic Acoustic Speech Feature Extraction Toolkits
The Collaborative Voice Analysis Repository for Speech Technologies (COVAREP) (Degottex et al., 2014) acoustic speech feature toolkit was intended as an open-source, research-based set of speech algorithms that would allow more reproducible research and further serve to strengthen automatic speech processing system design and development. While COVAREP has been applied to many areas of speech processing, more recently this toolkit has been used for speech-based depression research (Cummins et al., 2017; Pampouchidou et al., 2016; Scherer et al., 2014; Yang et al., 2016). The COVAREP toolkit extracts frequency, energy, spectral, and glottal speech feature types, as shown in Table 4.1. The Geneva Minimalistic Acoustic Parameter Set (eGeMAPS) (Eyben et al., 2016) is an open-source acoustic feature set based on the openSMILE speech toolkit (Eyben et al., 2013). eGeMAPS features were originally intended as a minimalistic set of voice parameters for use in paralinguistic speech applications. However, more recently the eGeMAPS feature set has been used for speech-based depression research (Cummins et al., 2016; Valstar et al., 2016; Vlasenko et al., 2017).
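As a concrete illustration of automatic cepstral feature extraction, the sketch below uses the librosa Python library (an assumed stand-in for illustration, not one of the toolkits compared in Table 4.1) to compute MFCCs, their delta trajectories, and two of the spectral contour measures discussed above. The file path, window/hop sizes, and number of coefficients are illustrative assumptions.

```python
import librosa
import numpy as np

# Load an utterance (the path is a placeholder) and resample to 16 kHz.
y, sr = librosa.load("utterance.wav", sr=16000)

# 13 MFCCs per 25 ms window with a 10 ms hop, following the pipeline in Fig. 4.2:
# windowing -> DFT -> mel filterbank -> log -> DCT.
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13,
                            n_fft=int(0.025 * sr), hop_length=int(0.010 * sr))

# Delta and delta-delta trajectories, cf. equation (4.22).
d1 = librosa.feature.delta(mfcc, order=1)
d2 = librosa.feature.delta(mfcc, order=2)

# Spectral contour measures discussed above (centroid, 80% roll-off).
centroid = librosa.feature.spectral_centroid(y=y, sr=sr)
rolloff = librosa.feature.spectral_rolloff(y=y, sr=sr, roll_percent=0.80)

# A simple utterance-level feature vector: per-coefficient means and standard deviations.
frame_feats = np.vstack([mfcc, d1, d2])
features = np.concatenate([frame_feats.mean(axis=1), frame_feats.std(axis=1)])
print(features.shape)   # (78,) = 2 statistics x 39 frame-level coefficients
```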


Feature types included in the eGeMAPS set are shown in Table 4.1. According to Eyben et al. (2016), the acoustic feature types included in the eGeMAPS feature set were chosen specifically due to their: (1) established use in previous literature related to paralinguistic evaluation; (2) capacity to indicate affective physiological changes in speech; and (3) automatic extractability. VoiceSauce (Shue, 2010; Shue et al., 2011) was developed to provide comparative automatic voice measurement extraction from speech recordings using a collection of acoustic speech algorithms from publicly available code: Straight (Kawahara et al., 1999), Snack Sound Toolkit (Sjölander, 2004), Sun’s Pitch Determination (Sun, 2002), and Praat (Boersma & Weenink, 2014). As shown in Table 4.1, VoiceSauce extracts mainly frequency, energy, and spectral related features, albeit using more than one algorithm per feature. VoiceSauce toolkit parameters are available per individual feature algorithm, such as window frame size and formant bandwidth minimum/maximum limits. More recently, VoiceSauce acoustic features have been used for speech-based depression analysis (Gillespie, 2017; Kim et al., 2014).

Table 4.1. Comparison of open-source acoustic speech feature sets: eGeMAPS (Eyben et al., 2016), COVAREP (Degottex et al., 2014), and VoiceSauce (Shue, 2010; Shue et al., 2011). [The table indicates, per toolkit, which of the following feature types are included: frequency-related features (pitch/F0, formant frequencies, formant bandwidths, jitter); energy-related features (loudness, shimmer, harmonic-to-noise ratio, subharmonic-to-noise ratio, voicing); spectral-related features (MFCC, spectral flux, spectral slope, creak, alpha ratio, Hammarberg ratio, harmonic differences H1-H2, parabolic spectral, cepstral peak prominence); and glottal-related features (quasi-open quotient, glottis dynamics, maxima dispersion quotient, normalized amplitude quotient). Total feature dimensionalities listed: 81, 88, and 58.]

4.2.2 Linguistic
In psychopathology, lexical content and grammatical structure differences have been recorded in both the spoken and written language of patients with various illnesses (see previous Table 3.2). Since spoken language can be directly interpreted using quantitative linguistic representations, analysis of high-level language components (e.g. word choice, grammar structure) in the form of Natural Language Processing (NLP) provides insight relating to text-based content (Cambria & White, 2014; Kyle & Crossley, 2015) and clinical information (Friedman & Hripcsak, 1999). Some common examples of NLP analyses are word frequencies within a transcript or compared with other texts, n-grams (e.g. bigrams, trigrams), topic coverage range (e.g. bag-of-words), and psycholinguistic properties (e.g. age of acquisition, concreteness, imaginability). In Lott et al. (2002), multi-class psychiatric disorder classification (e.g. schizophrenia, bipolar depression, major depression) was conducted using spontaneous speech transcripts and human evaluators. Their experiments generated a high degree of accuracy based on dozens of linguistic variables (e.g. grammar, syntax, psycholinguistic properties, cohesiveness). The combined linguistic features along with a discriminant analysis classifier generated a three-class classification accuracy of 73%; specifically for bipolar and major depression, 64% and 82% classification accuracy was achieved, respectively. Further, in Smirnova et al. (2018), an analysis of written language patterns based on psycholinguistic methods (e.g. metaphors, similes, sentence type) showed discriminative power even between mildly depressed and normal-sadness (non-depressed) speaker groups. For a summary of speech-text based depression studies using linguistic features, refer to Appendix E.

n-gram
The n-gram is a commonly used analysis method in natural language processing (NLP). n-grams consist of a sequence of n tokens that are usually computed using a total count, ratio-based, or histogram feature representation. n-grams can include various kinds of text-based information such as phonemes, letters, syllables, and words. Moreover, n-grams are often computed in multiple unit segments (e.g. bigrams, trigrams) at the phoneme/letter and word levels. The n-gram feature analysis method has proven useful for text-based depression analysis via clinician-patient cognitive behavioural therapy treatment dialogues (Howes et al., 2014) and social-media communication platforms (Coopersmith et al., 2015).
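As a small illustration of word-level n-gram counting, the sketch below uses scikit-learn's CountVectorizer to build unigram and bigram count features from two hypothetical transcripts; the transcripts themselves are invented examples, not data from any corpus used in this thesis.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical interview transcripts (one string per speaker).
transcripts = [
    "i just feel tired all the time",
    "i feel fine most of the time",
]

# Unigram + bigram counts at the word level; the same idea applies to
# phoneme- or letter-level n-grams by changing the tokenisation.
vectoriser = CountVectorizer(ngram_range=(1, 2))
counts = vectoriser.fit_transform(transcripts)

print(counts.shape)                          # (speakers, distinct n-grams)
print(vectoriser.get_feature_names_out()[:8])
```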


Type Token Ratio
The type token ratio (TTR) is a popular ratio obtained by dividing the total number of unique word types by the total number of word tokens. Generally, a high TTR indicates a high degree of lexical variation, whereas a low TTR indicates less lexical variation. In Manschreck et al. (1981), it was shown that TTR is a reliable and objective indicator of spoken language deviance, especially with psychomotor disturbances.

Bag-of-Words
Bag-of-words is another popular text-based linguistic analysis method in which token word distributions from an input text/transcript are compared with prearranged modeled subject-text categories, with each category containing different word distributions (Zhang et al., 2010). In the bag-of-words method, only matching input token words are recorded per modeled subject-text category wordlist. The input token word distribution is then compared to the category model wordlist via a histogram approach. Using Bayesian probability methods, the more similar the input text and its token words are to a particular modeled subject category, the more likely the input text belongs to that category. The bag-of-words approach has been used previously for detection of negative emotions in speech (Pokorny et al., 2015) and identification of depression via social media (Nadeem et al., 2016). (A small illustrative sketch of the TTR and bag-of-words measures follows Table 4.2.)

Automatic Linguistic Feature Extraction Toolkits
More recently, many open-source text-processing toolkits have been made available. Most of these approaches were originally intended as automatic methods for judging the quality of written texts; however, they can also provide useful information about spoken transcripts. As shown in Table 4.2, examples of linguistic text-processing toolkits are: Custom List Analyzer (CLA) (Kyle et al., 2015), Constructed Response Analysis Tool (CRAT) (Crossley et al., 2016a), Simple Natural Language Processing Tool (siNLP) (Crossley et al., 2014), Tool for the Automatic Analysis of Cohesion (TAACO) (Crossley et al., 2016b), Tool for the Automatic Analysis of Lexical Sophistication (TAALES) (Kyle & Crossley, 2015), and Tool for the Automatic Analysis of Syntactic Sophistication and Complexity (TAASSC) (Kyle et al., 2016). The CLA, CRAT, and siNLP allow custom token word dictionaries, wherein words, wildcards, and n-grams can be easily computed from texts or transcripts. The TAALES uses indices based on causal, connective, overlap, and verb arguments, which are calculated by examining token word-level n-grams and their attributed parts-of-speech (e.g. noun, verb, determiner, preposition).


Furthermore, for the TAALES and TAASSC, a text or transcript can be compared against other specific reference examples (i.e. similar to bag-of-words) to provide specific phrasal, syntactic, and proficiency information regarding how alike or different they are to each other.

Table 4.2. Comparison of open-source text-based linguistic toolkit feature sets: CLA (Kyle et al., 2015), CRAT (Crossley et al., 2016a), siNLP (Crossley et al., 2014), TAACO (Crossley et al., 2016b), TAALES (Kyle & Crossley, 2015), and TAASSC (Kyle et al., 2016). [The table indicates, per toolkit, which of the following feature types are included, with customized parameter feature sizes where applicable: n-gram lexical dictionaries and wildcards; lexical sophistication; lexical, local, and global cohesion; comparative lexical overlap; lemma/parts-of-speech; type-token ratio; letters per word; numbers of words, words per sentence, sentences, and paragraphs; word bigrams/trigrams; clausal, connective, adjacent overlap, and verb argument indices; phrasal complexity; syntactic complexity; and language proficiency. Total feature dimensionalities range from customized sets through roughly 100+ up to 700+ indices.]
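As signposted above, the sketch below illustrates the type token ratio and a bag-of-words category histogram; the category wordlists and the example sentence are hypothetical stand-ins for the reference lexica and transcripts used in practice.

```python
from collections import Counter

def type_token_ratio(tokens):
    # Unique word types divided by total word tokens.
    return len(set(tokens)) / len(tokens)

# Hypothetical subject-category wordlists for a bag-of-words comparison.
categories = {
    "somatic":   {"tired", "sleep", "pain", "headache"},
    "affective": {"sad", "hopeless", "worthless", "crying"},
}

def bag_of_words_histogram(tokens, categories):
    # Count how many input tokens fall into each modelled category wordlist.
    counts = Counter()
    for token in tokens:
        for name, wordlist in categories.items():
            if token in wordlist:
                counts[name] += 1
    return dict(counts)

tokens = "i feel tired and sad and i cannot sleep".split()
print(round(type_token_ratio(tokens), 2), bag_of_words_histogram(tokens, categories))
```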

4.2.3 Affective
For depressed and non-depressed populations, affective content differences have been recorded in both spoken and written language (see previous Table 3.3). In Oxman et al. (1988) and Rosenberg et al. (1990), using spontaneous speech, the systematic quantification of n-gram, transcript-based, one-dimensional affective token word score features based on the General Inquirer (GI) Harvard III/IV Psychosociological Dictionaries (Kelly & Stone; Stone et al., 1969) demonstrated useful discriminative application amongst patients with different illnesses (e.g. somatization,


paranoid, cancer, major depression). The GI is a collection of English token words, each of which has over a dozen rated affective categories (see Table 4.3). In both of these studies (Oxman et al., 1988; Rosenberg et al., 1990), careful attention to a speaker’s lexical choice and its associated subject category greatly enhanced diagnostic sensitivity. Their automated use of n-gram features produced 80-85% classification accuracy for 4 classes, whereas trained psychiatrists obtained an average of 66% (Oxman et al., 1988; Rosenberg et al., 1990). Moreover, within the 4-class task, specifically for major depression classification, the accuracy of the n-gram features surpassed the diagnostic performance of psychiatrists, at 83% and 79%, respectively (Oxman et al., 1988). In addition, Cannizzaro et al. (2004a) examined word n-gram counts with some degree of success for depression classification. However, they indicated that for some purposeful verbal tasks (e.g. picture description, answering a very specific enquiry), rather than open spontaneous tasks, depressed speakers appeared to become self-aware of their recorded verbal output and, therefore, may have taken deliberate actions to mask their depressive speech habits (Cannizzaro et al., 2004b). For a summary of speech-text based depression studies using affective features, refer to Appendix F.

Automatic Affect Feature Extraction Toolkits
As shown in Table 4.3, the Sentiment Analysis and Cognition Engine (SEANCE) (Crossley et al., 2017) contains several of the most popular affective reference token word sets. Among the various affective rating score references shown in Table 4.3, the negative and positive affective words are the most examined. While many of these affective rating score references contain categories related to negative/positive emotions, physical states, or social relations, there are currently none designed specifically for medical diagnostic analysis.
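The sketch below illustrates how token words in a transcript might be scored against an affective lexicon. The tiny valence dictionary is a hypothetical stand-in for real resources such as ANEW or the General Inquirer, and the scores shown are invented for illustration only.

```python
import numpy as np

# A tiny stand-in for an affective lexicon (real resources assign ratings to
# thousands of token words across multiple affect categories).
valence = {"happy": 0.9, "fine": 0.6, "tired": -0.4, "sad": -0.8, "hopeless": -0.9}

def affect_summary(tokens, lexicon):
    """Mean valence of the rated tokens plus negative/positive token counts,
    ignoring words absent from the lexicon."""
    scores = [lexicon[t] for t in tokens if t in lexicon]
    if not scores:
        return {"mean_valence": 0.0, "n_negative": 0, "n_positive": 0}
    scores = np.array(scores)
    return {"mean_valence": float(scores.mean()),
            "n_negative": int((scores < 0).sum()),
            "n_positive": int((scores > 0).sum())}

print(affect_summary("i feel sad and tired but i will be fine".split(), valence))
```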


Table 4.3. Open-source affect-based sentiment SEANCE toolkit features (Crossley et al., 2017). Harvard-IV General Inquirer (GI) (Stone et al., 1969); Lasswell (Lasswell & Namenwirth, 1969; Namenwirth & Weber, 1987); Geneva Affect Label Coder (GALC) (Scherer, 2005); Affective Norms for English Words (ANEW) (Bradley & Lang, 1994); EmoLex (Mohammad & Turney, 2010, 2013); SenticNet (Cambria et al., 2010, 2012); Valence Aware Dictionary for Sentiment Reasoning (VADER) (Hutto & Gilbert, 2014); Hu-Liu Polarity (HLP) (Hu & Liu, 2004). [The table indicates, per lexicon, which of the following category types are included: action, arousal, arts/academics, cognition, communication, dominance/respect/money/power, economics/politics/religion, effort, evaluation, feeling/emotion, negative emotion words, other affect, physical, positive emotion words, quality/quantity, reference, social relations, surprise, time/space, and valence/polarity, together with each lexicon's total feature dimensionality.]

4.3 Statistical Classification Techniques

Generally, supervised techniques aim to maximise the discrimination between the features of training data drawn from different classes. Classification and prediction are the two major groups of supervised machine learning methods. Some classification problems are binary, such as labeling ‘depressed’ and ‘non-depressed’, while others are multiclass, such as ‘bipolar’, ‘unipolar’, ‘perinatal’, and ‘healthy’. Prediction produces a real-numbered value; for example, prediction can estimate the numerical depression severity test score of a patient. There are several considerations to take into account when selecting a machine learning technique, but perhaps the most important, particularly for a small database, is the number of parameters. The training process for any classifier involves the optimization of many parameters, and doing so can


lead to overfitting of the data (i.e. optimization to specific training/test criteria) when the number of parameters is too large relative to the training database size. Over-parameterization and overfitting to the training database lead to poor generalization and, hence, poor machine learning performance on unseen data. There are many different acoustic speech processing modeling and classification techniques found within speech-depression studies. A few of the most commonly used classification/prediction approaches are: decision trees, Linear Discriminant Analysis (LDA), Support Vector Machine (SVM), Support Vector Regression (SVR), and Relevance Vector Machine (RVM) methods. In addition, more recently, deep learning (for more details refer to: Bengio et al., 2013; LeCun et al., 2015) has gained momentum as a state-of-the-art approach for many speech-processing applications (e.g. automatic speech recognition, language identification, speaker recognition). However, deep learning techniques, such as Convolutional Neural Networks (CNN) and Long Short-Term Memory (LSTM) networks, have only recently been explored for automatic speech-based depression classification/prediction tasks (Harati et al., 2018; He & Cao, 2018; Ma et al., 2016; S’adan et al., 2018).

4.3.1 Decision Trees
Decision tree algorithms, including classification and regression trees, are a systematic approach to data classification. Decision trees use straightforward recursive, tree-like hierarchical branch structures to model data. Decision trees are commonly used as a machine-learning tool in many fields of communication, including speech-language, paralinguistic, and text processing applications (Dattatreya & Sarma, 1984; Gosztolya et al., 2013; Hasan et al., 2014). More recently, decision trees have been applied to speech-depression and linguistic text-based depression classification problems (Howes et al., 2014; Mitra et al., 2014; Nadeem et al., 2016; Yang et al., 2016). Decision trees attempt to generalize, or find patterns in data. They do this by determining which tests (or questions) best divide the instances into separate classes, forming a tree. This procedure can be conceived as a greedy search through the space of all possible decision trees, by scanning through the instances in a given node to determine the gain from each split and picking the single split that provides the greatest gain. The instances are then partitioned based on the split, and this procedure is applied recursively until all the instances in the node are of the same class (Kotsiantis,


2013). Decision nodes usually operate using a binary ‘all or nothing’ approach, which determines whether an operation passes further throughout paths in the tree. Decision trees use nodes and decision rules, with each branch representing a link between all subsequent features, eventually leading to a final classification decision (Hoque et al., 2006). The root node or mother node branches out to three other types of nodes found throughout the tree structure: (1) decision nodes; (2) chance nodes; and (3) end nodes. Decision nodes have two or more nodes extending from them. Examples of decision nodes (e.g. choice nodes) are found in Fig. 4.3, wherein the categories in the second top row (e.g. vowels, diphthongs, semivowels, consonants) spread out into other extending nodes referred to as child nodes. Some child nodes are also decision nodes, whereas others are end nodes. Chance nodes are branches in a decision tree representing a group of possible outcomes that are not under control of the decision maker. For instance, some outcomes are assigned to chance nodes because these nodes are reliant on outcomes based from other nodes whose decisions have yet to be made. The end nodes, also known as leaf nodes, terminate with no further children. From a decision tree’s starting root node, it can quickly branch out into a path with a multi-connected series of nodes. This type of structure is commonly found in nature and biological structures (e.g. plants, lightning, lymphatic system, neural pathways).

ENGLISH PHONEMES
  VOWELS
    FRONT: ae ih iy eh
    MID: aa aw er
    BACK: ow uh uw
  DIPHTHONGS: ay ey oy
  SEMIVOWELS
    LIQUIDS: l r
    GLIDES: w y
  CONSONANTS
    NASALS: m n ng
    STOPS
      VOICED: b d g
      UNVOICED: k p t
    FRICATIVES
      VOICED: dh dx v z
      UNVOICED: f hh s sh th
    AFFRICATES
      VOICED: jh
      UNVOICED: ch

Figure 4.3: Basic tree structure example for a standard set of English phonemes, after Rabiner & Schafer (1978). The root node is ‘English Phonemes’, from which several child and leaf nodes emanate. Decision tree classifiers learn such structures automatically from the data.

As discussed in Schuller & Batliner (2014) and Kulkarni & Joshi (2015), a decision tree can be represented by a set of nodes $V$ and a set $E \subseteq V \times V$ of edges $e$, such that $e = (v_i, v_j) \in E$ represents a link from node $v_i \in V$ to node $v_j \in V$. As a tree structure, the condition $|E| = |V| - 1$ holds. A path of length $P$ through the tree is a sequence $v_1, \ldots, v_P$, $v_k \in V$, with $(v_k, v_{k+1}) \in E$, $k = 1, \ldots, P-1$. Each path begins at the root node $r$, which is the single node without incoming edges; thus $E$ contains no element of the form $(v, r)$, $v \in V$. Given a feature space of dimension $N$, a partial function $a : V \longrightarrow \{1, \ldots, N\}$ is defined, mapping every node that has outgoing edges (inner nodes) to a feature. This function examines each feature at most once in a tree.

The edges traversed in a path correspond to branch decisions based on the individual values of these features. Each edge is assigned a single feature interval. To determine the class label of a feature vector $\mathbf{x} = [x_1, \ldots, x_N]^T$, a path through the tree is followed that fulfils specific criteria. Beginning at the root node, successive nodes $v$ in the path are interconnected by the edge for which $x_{a(v)}$ lies within that edge's interval. The number of edges emitted from a node depends on the quantisation of the features into $J_n$ intervals per feature $n$, resulting in a finite number of outgoing edges; the root node and each inner node $v$ have $J_{a(v)}$ outgoing edges. For a binary decision tree, each inner node has two outgoing edges with intervals of the form $(-\infty, \xi]$ and $(\xi, +\infty)$, corresponding to a threshold decision per node. The intervals are usually calculated using basic 'binning' via a histogram distribution (Witten & Frank, 2005). An alternative to the binning approach is to use sigmoid functions for the decisions (Landwehr et al., 2005). Each path terminates at a leaf. The leaves are the nodes $b$ without an outbound edge, i.e. there exists no $(b, v) \in E$ with $v \in V$. Based on a mapping from leaves to class labels, the class of the leaf where the path ends determines the final decision outcome. The parameters of a decision tree are chosen by maximizing the information gain per node along each pathway; essentially, the decision tree classification process examines the remaining features at each subsequent node. To optimize these parameters, the Shannon entropy $H(Y)$ of the distribution of the class probabilities $Y_1, \ldots, Y_C$ is often employed (Schuller & Batliner, 2014). Like any classifier, decision trees have advantages and disadvantages. A major advantage of decision trees is that they employ symbolic logic as human-interpretable sets of rules, which can make the classification process more transparent than other types of classifiers (e.g. SVM, LDA) (Schuller & Batliner, 2014). With regard to language analysis, decision trees follow a rule-based structure naturally found in the majority of known spoken and written languages (e.g. word syntax trees, constituency-based parse trees, dependency-based parse trees). One drawback of using decision trees for classification is that they rely heavily on having accurate early paths (Kotsiantis, 2013; Schuller & Batliner, 2014); if the wrong branch is taken during the early stages, no correction can be made. Also, decision trees are not typically used with large-dimensionality feature sets because they are prone to overfitting. To avoid an over-fitted decision tree model, large-dimensional feature sets may be randomly subsampled, wherein a decision tree is built for each feature subset. During classification, the decisions of all subspace decision trees are fused using majority voting.


This fusion over random feature subspaces is often referred to as the random subspace method, also known as decision forests (Ho, 1998) or random forests (Quinlan, 1996).
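The sketch below contrasts a single shallow decision tree with a random-forest ensemble using scikit-learn. The data are a synthetic stand-in for the depressed/non-depressed feature vectors described earlier in this chapter, and the depth, number of trees, and cross-validation setup are illustrative assumptions rather than settings used in the experiments of this thesis.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data: 60 speakers x 20 acoustic features, binary labels
# (0 = non-depressed, 1 = depressed); real features would come from Section 4.2.
rng = np.random.default_rng(0)
X = rng.standard_normal((60, 20))
y = rng.integers(0, 2, size=60)

# A single shallow tree: depth is restricted to limit over-fitting on small data.
tree = DecisionTreeClassifier(max_depth=3, random_state=0)
print("tree  :", cross_val_score(tree, X, y, cv=5).mean())

# A random-subspace ensemble (random forest): each tree sees a bootstrap sample
# and a random subset of features, and class decisions are fused by voting.
forest = RandomForestClassifier(n_estimators=100, max_features="sqrt", random_state=0)
print("forest:", cross_val_score(forest, X, y, cv=5).mean())
```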

4.3.2 Linear Discriminant Analysis
Sir Ronald Fisher developed Linear Discriminant Analysis (LDA) over eighty years ago (Fisher, 1936). LDA is useful in determining whether a set of variables is effective in predicting membership of a particular pre-defined class. The main function of LDA is to analyze linear combinations of variables that best explain the features and to allocate sets of these values into groups, classes, or categories of the same type. LDA copes well when the within-class occurrences are imbalanced and performance is evaluated on randomly generated test data. LDA maximizes the ratio of between-class variance to within-class variance for a database; this ratio guarantees maximal separation between the given classes according to these criteria. LDA emphasises class separability by determining the best approximate decision region for each given class (Erdogmus et al., 2008). For speech processing applications, LDA has previously been used for Automatic Speech Recognition (ASR), speaker identification, language identification, and emotion identification tasks (Dehak et al., 2011; Haeb-Umbach & Ney, 1992; Jin & Waibel, 2000; Wang & Guan, 2004). More recently, Cummins et al. (2014) and Lopez-Otero et al. (2014) explored LDA as a classification technique for speech-based depression classification. There are two different approaches to LDA: class-dependent and class-independent. If generalization across all data within the classes is important, the class-independent transformation is applied. However, if specific discrimination of each single class is of most importance, a class-dependent approach is preferred. The class-dependent approach maximizes the ratio of between-class variance to within-class variance separately per class, so that separability to a fair extent is achieved; it requires two optimizing criteria for independently transforming the data sets. Unlike the class-dependent approach, the class-independent transformation involves maximizing the ratio of overall variance to within-class variance, and thus only requires a single universal optimization criterion transform applied to each class dataset. An example of the mathematical operations for LDA is shown for a two-class problem below, based on examples from Balakrishnama & Ganapathiraju (1998, 1999). S1 and S2 represent the dataset classes of set one and set two.


$$S_{1} = \begin{bmatrix} a_{1} \\ a_{2} \\ \vdots \\ a_{M} \end{bmatrix}, \qquad S_{2} = \begin{bmatrix} b_{1} \\ b_{2} \\ \vdots \\ b_{N} \end{bmatrix} \qquad (4.23)$$

The features per class dataset are placed into uniform matrices, after which the mean of each of these feature matrix representations is computed, along with the mean of both datasets combined. The mean $\boldsymbol{\mu}$ of both datasets combined is derived from the equation below:

$$\boldsymbol{\mu} = P_{1}\boldsymbol{\mu}_{1} + P_{2}\boldsymbol{\mu}_{2} \qquad (4.24)$$

where P1 and P2 are the a priori probabilities of the two classes. Generally, the Pn value is derived from dividing the value 1 by the total number of classes. Hence, for the 𝜇 above, the probability factor (Pn) is 0.50, as it is the value 1 divided by 2 (i.e. number of classes in above example). An illustrated example for the two given classes is shown in Fig. 4.4 below.


Figure 4.4: Distribution of two dataset classes, red dots and blue diamonds. The mean representations derived from equations (4.23) and (4.24) are also shown ($\boldsymbol{\mu}_{1}$, $\boldsymbol{\mu}_{2}$, $\boldsymbol{\mu}$), along with a green star denoting an unknown-class test data point (x). The LDA two-class regions are designated by the red and blue areas, with the LDA linear decision boundary shown as a gray dotted line.

Within-class and between-class scatter matrices are utilised to formulate the base criteria for class separability. The within-class scatter is the expected covariance of each of the classes. The scatter measures are calculated via equations (4.25) and (4.26):

$$\mathbf{S}_{w} = \sum_{j} P_{j}\,(cov_{j}) \qquad (4.25)$$

For the above two-class problem, this simplifies to:

$$\mathbf{S}_{w} = 0.5\,(cov_{1} + cov_{2}) \qquad (4.26)$$

where $cov_{1}$ and $cov_{2}$ are the covariances of the two sets, which are individually computed using the following equation:

$$cov_{j} = (\mathbf{x}_{j} - \boldsymbol{\mu}_{j})(\mathbf{x}_{j} - \boldsymbol{\mu}_{j})^{T} \qquad (4.27)$$

Here, $T$ denotes the transpose. The between-class scatter is then computed by:

$$\mathbf{S}_{b} = \sum_{j} (\boldsymbol{\mu}_{j} - \boldsymbol{\mu})(\boldsymbol{\mu}_{j} - \boldsymbol{\mu})^{T} \qquad (4.28)$$

The covariance of a dataset whose members are the mean vectors of each class is represented by $\mathbf{S}_{b}$. Again, as mentioned earlier, the ratio of between-class to within-class scatter generates the LDA optimizing criterion. The solution that maximizes the criterion serves to demarcate the axes of the transformed space. However, for the class-dependent transform, optimization is computed on a separate per-class basis using both equations (4.27) and (4.28). Further, if the class-dependent approach to LDA is used, L classes require L separate optimization criteria. The optimizing factor for the class-dependent approach is:

$$\mathbf{c}_{j} = cov_{j}^{-1}\,\mathbf{S}_{b} \qquad (4.29)$$

For the class-independent transform, the optimization criterion is computed as:

$$\mathbf{C} = \mathbf{S}_{w}^{-1}\,\mathbf{S}_{b} \qquad (4.30)$$

There are some known limitations to LDA (Erdogmus et al., 2008; Yan & Dai, 2011). For instance, if the feature dimensionality is too low and/or has minimal variance across class types, then performance can be limited. Additionally, LDA is a parametric approach to classification; thus, it assumes uni-modal Gaussian class likelihoods. If the discriminatory feature information is contained only in the variance, and not the mean, LDA will perform poorly. Also, if dataset features are significantly non-Gaussian, LDA will be unable to preserve the complex structure required for effective classification. Quadratic Discriminant Analysis (QDA) is another variant of discriminant analysis. Unlike LDA, wherein all classes share the same covariance matrix, QDA allows covariance matrices to vary among classes and permits a non-linear decision boundary (Tharwat, 2013; Qin, 2018). The flexibility of QDA is that it does not assume that the covariance of each class is identical; thus, if the matrices substantially differ, observations will tend to be assigned to the class where variability is greater. QDA calculates a separate $\mathbf{S}_{w}$ and $\boldsymbol{\mu}$ for every class, as shown previously in equations (4.25) and (4.24), respectively.
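The sketch below fits LDA and QDA classifiers with scikit-learn on synthetic two-class Gaussian data that satisfies LDA's shared-covariance assumption; the class means, covariance, and test point are illustrative values, not data from this thesis.

```python
import numpy as np
from sklearn.discriminant_analysis import (LinearDiscriminantAnalysis,
                                            QuadraticDiscriminantAnalysis)

# Two synthetic Gaussian classes with a shared covariance (LDA's assumption)
# standing in for depressed / non-depressed feature vectors.
rng = np.random.default_rng(1)
X1 = rng.multivariate_normal([0.0, 0.0], [[1.0, 0.3], [0.3, 1.0]], size=100)
X2 = rng.multivariate_normal([2.0, 1.5], [[1.0, 0.3], [0.3, 1.0]], size=100)
X = np.vstack([X1, X2])
y = np.array([0] * 100 + [1] * 100)

lda = LinearDiscriminantAnalysis().fit(X, y)      # linear decision boundary
qda = QuadraticDiscriminantAnalysis().fit(X, y)   # per-class covariance, quadratic boundary

x_test = np.array([[1.0, 0.5]])                   # an unknown test point, as in Fig. 4.4
print(lda.predict(x_test), lda.predict_proba(x_test))
print(qda.predict(x_test))
```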


4.3.3 Support Vector Machine
In paralinguistics, and in speech processing in general, the Support Vector Machine (SVM) is among the most popular classification techniques. SVM baselines have been included in several annual international challenges for paralinguistics and emotion (Schuller & Batliner, 2014; Valstar et al., 2013). SVM techniques have also been used specifically in depression classification and prediction tasks (Alghowinem et al., 2013; Cohn et al., 2009; Helfer et al., 2013; Scherer, 2013; Valstar et al., 2013; Yu et al., 2016), as well as for text-based classification tasks (Joachims, 1998). The SVM technique is popular because it can support a large feature space and sparse features, and it is robust to over-fitting due to its dependence on a relatively small set of training data near the class boundary, i.e. the support vectors (Schuller & Batliner, 2014). Unlike classification techniques that involve strictly linear decision boundaries (e.g. LDA), SVMs are designed around binary linear classifiers that are optimized to provide the best possible separation between classes in the feature space, including for classification tasks that are non-linear in nature. An example of non-linear SVM decision boundary class separation is illustrated in Fig. 4.5. SVMs rely on a quadratic optimization criterion that identifies support vectors lying in the margin around the decision boundary and employs them to define the classification boundary. To solve classification tasks with non-linear boundaries, a kernel trick is used to map data features into a higher-dimensional space while still maintaining low complexity (Burges, 1998). It should be noted that a poor kernel selection will result in low SVM performance (Burges, 1998). The key aspects of the linear SVM were developed in the 1960s and 1970s (Vapnik & Lerner, 1963; Vapnik & Chervonenkis, 1964, 1974; Vapnik, 1982, 1995). However, for decades the SVM technique remained largely untested and unapplied. In the early 1990s, AT&T Bell Laboratories revisited the SVM technique for written character recognition tasks and also modified it to allow classification problems with non-linear boundaries (Boser et al., 1992; Guyon et al., 1993; Cortes & Vapnik, 1995; Vapnik, 1995). The performance results were so impressively robust that SVM techniques were soon applied to classification and prediction problems found in many fields of science (Blanz et al., 1996; Scholkopf, 1997; Moellar et al., 1997; Drucker et al., 1997; Smola et al., 1997).


Figure 4.5: An example of two nonlinearly separable classes of data, denoted by red dots and blue diamonds. As illustrated, the SVM approach allows a non-linear separation between the two classes, shown by the red and blue coloured areas, while an unknown test feature vector is represented by the green star (x).

Given a set of training input vectors and corresponding class labels

$$D = \{(\mathbf{x}_{n}, c_{n})\}_{n=1}^{N}, \quad \mathbf{x}_{n} \in \mathbb{R}^{K}, \; c_{n} \in \{+1, -1\} \qquad (4.31)$$

where $\mathbf{x}_{n}$ represents a K-dimensional feature vector and $c_{n}$ is the corresponding class label, the aim of SVM training is to learn a model to assign a class label $c_{*}$ to a test vector $\mathbf{x}_{*}$:

$$c_{*} = \operatorname{sign}(f(\mathbf{x}_{*})) \qquad (4.32)$$

The SVM classification function is:

$$f(\mathbf{x}_{*}) = \sum_{n=1}^{N} a_{n} c_{n}\, \varkappa(\mathbf{x}_{*}, \mathbf{x}_{n}) + b \qquad (4.33)$$

where $a_{n}$ is the weight term associated with each training vector (those $\mathbf{x}_{n}$ for which $a_{n} > 0$ are referred to as the support vectors), $b$ is a learnt bias term, and $\varkappa(\cdot,\cdot)$ is a kernel function. A linear SVM has $\varkappa(\mathbf{x}_{*}, \mathbf{x}_{n}) = (\mathbf{x}_{*}, \mathbf{x}_{n})$, where $(\cdot,\cdot)$ represents the inner product, while a non-linear SVM has $\varkappa(\mathbf{x}_{*}, \mathbf{x}_{n}) = \varkappa(\phi(\mathbf{x}_{*}), \phi(\mathbf{x}_{n}))$, where $\phi$ is a non-linear mapping, typically to a higher-dimensional space. Note that, due to the potentially very high dimensionality of the mapping, $\phi(\mathbf{x})$ is normally not explicitly computed. Instead, kernel functions $\varkappa(\mathbf{x}_{*}, \mathbf{x}_{n})$ with suitable closed-form expressions are chosen, which operate directly on $\mathbf{x}_{*}$ and $\mathbf{x}_{n}$ to produce a result equivalent to the inner product of the higher-dimensional vectors. A solution for the SVM classification function is formulated as a soft-margin optimization problem (Burges, 1998) which minimises


$$\frac{1}{2}\,\|\varphi\|^2 + C \sum_{n=1}^{N} \xi_n \qquad (4.34a)$$

subject to

$$c_n\left(\langle \varphi, \phi(\mathbf{x}_n) \rangle + b\right) \geq 1 - \xi_n, \quad \xi_n \geq 0, \quad c_n \in \{+1, -1\} \qquad (4.34b)$$

where $\varphi$ is the separating hyperplane and $C$ is a cost regularization term which trades off model complexity, in terms of the total number of support vectors, against the amount of training error permitted by the slack parameters $\xi_n$. Note that setting $\phi(\mathbf{x}_n) = \mathbf{x}_n$ gives a linear SVM. The hyperplane is expressed in terms of the support vectors:

$$\varphi = \sum_{n=1}^{N} a_n c_n\, \phi(\mathbf{x}_n) \qquad (4.35)$$

with the kernel trick being used to express equations (4.34) and (4.35) as the SVM classification function found in equation (4.33).
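As a concrete illustration of equations (4.31)-(4.35), the minimal sketch below (not the configuration used in this thesis) trains a soft-margin SVM with an RBF kernel on synthetic, non-linearly separable data using scikit-learn; the data, cost value C, and kernel choice are illustrative assumptions only.

```python
# Minimal sketch: a non-linear soft-margin SVM with an RBF kernel (eq. 4.33)
# trained on synthetic 2-class data that cannot be separated by a line.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Two classes: an inner cluster and an outer ring around it.
inner = rng.normal(0.0, 0.5, size=(200, 2))
angles = rng.uniform(0, 2 * np.pi, 200)
outer = np.c_[3 * np.cos(angles), 3 * np.sin(angles)] + rng.normal(0, 0.3, (200, 2))
X = np.vstack([inner, outer])
y = np.array([-1] * 200 + [+1] * 200)

clf = SVC(kernel="rbf", C=1.0, gamma="scale")  # kernel trick: phi(x) never computed
clf.fit(X, y)
print("number of support vectors:", clf.support_vectors_.shape[0])
print("training accuracy:", round(clf.score(X, y), 3))
print("prediction for an unseen point near the origin:", clf.predict([[0.1, -0.2]]))
```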

4.3.4 Support Vector Regression

Support Vector Regression (SVR) employs many of the same principles as SVM, but is applied to prediction problems. The primary objective of SVR is to minimize the prediction error while generating a hyperplane that maximizes the margin (see Fig. 4.6). SVR is a sparse regression technique widely used for multivariate regression in speech processing (Gunes & Schuller, 2013; Schuller et al., 2013; Valstar et al., 2013). Moreover, SVR has specifically been applied to depression severity prediction tasks in Redlich et al. (2016) and Valstar et al. (2013). The SVR model is characterized by a subset of the training feature vectors, the support vectors, as previously discussed in Chapter 4.3.3. As with SVM, the advantages of using SVR include good generalization and non-linear regression capability through careful kernel function selection (Smola & Scholkopf, 2003). Given a set of training input vectors and corresponding targets:

$$D = \{(\mathbf{x}_n, t_n)\}_{n=1}^{N}, \quad \mathbf{x}_n \in \mathbb{R}^K, \; t_n \in \mathbb{R} \qquad (4.36)$$

where $\mathbf{x}_n$ is again a $K$-dimensional feature vector and $t_n$ represents the corresponding continuously-valued target label, the objective of SVR training is to learn a model for predicting the target label, $t_*$, for some test vector $\mathbf{x}_*$:

$$t_* = y(\mathbf{x}_*, \boldsymbol{\omega}) \qquad (4.37)$$

where $\boldsymbol{\omega} = [\omega_1, \ldots, \omega_N]^T$ is an estimated set of regression parameter weights. SVR models this function as a hyper-tube, creating a regression function that has at most $\varepsilon$ deviation from the training targets $t_n$. The SVR regression function is:

$$y(\mathbf{x}_*, \boldsymbol{\omega}) = \sum_{n=1}^{N} \omega_n\, \mathcal{K}(\mathbf{x}_*, \mathbf{x}_n) + b \qquad (4.38)$$

where $b$ represents the learnt bias and $\mathcal{K}(\cdot,\cdot)$ is a kernel function. To compute a solution for this function, an epsilon-SVR approach is formulated as a soft-margin convex optimization problem (Smola & Scholkopf, 2004), which minimises

$$\frac{1}{2}\,\|\boldsymbol{\omega}\|^2 + C \sum_{n=1}^{N} (\xi_n + \xi_n^*) \qquad (4.39a)$$

subject to

$$t_n - \langle \boldsymbol{\omega}, \phi(\mathbf{x}_n) \rangle - b \leq \varepsilon + \xi_n, \quad \langle \boldsymbol{\omega}, \phi(\mathbf{x}_n) \rangle + b - t_n \leq \varepsilon + \xi_n^*, \quad \xi_n, \xi_n^* \geq 0, \quad n = 1, \ldots, N \qquad (4.39b)$$

where $C$ is a cost regularization term that trades model complexity against the amount of deviation beyond $\varepsilon$ permitted by the slack parameters $\xi_n, \xi_n^*$. Note that setting $\phi(\mathbf{x}_n) = \mathbf{x}_n$ gives a linear SVR. As previously stated, both SVM and SVR can be adjusted to accommodate non-linear separability through the choice of a specific kernel function. For both SVM and SVR, the kernel functions are limited to those that satisfy the Mercer conditions, i.e. a continuous symmetric kernel of a positive integral operator (Burges, 1998; Smola & Scholkopf, 2004). Common choices for kernel functions include linear, polynomial, Laplacian and Gaussian kernels (Smola & Scholkopf, 2004). The performance of SVM and SVR depends greatly on the hyperparameters used in training. Hyperparameters are set manually and should be selected to reflect prior knowledge about the distribution of the training data samples. The main hyperparameter is the cost regularization term $C$, which controls model complexity; in SVR, the width $\varepsilon$ of the soft-margin loss function used in training must also be set.
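The short sketch below (again illustrative only, not the thesis setup) fits an epsilon-SVR with an RBF kernel to synthetic noisy data using scikit-learn, showing where the hyperparameters C and ε discussed above enter.

```python
# Minimal sketch: epsilon-SVR with an RBF kernel fitted to noisy 1-D data.
# epsilon sets the width of the "hyper-tube" within which errors are not penalized.
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(0)
X = np.sort(rng.uniform(0, 5, size=(80, 1)), axis=0)
t = np.sin(X).ravel() + rng.normal(0, 0.1, size=80)  # noisy continuous targets

reg = SVR(kernel="rbf", C=1.0, epsilon=0.1)
reg.fit(X, t)
print("number of support vectors:", reg.support_vectors_.shape[0])
print("prediction at x = 2.5:", round(float(reg.predict([[2.5]])[0]), 3))
# Increasing epsilon widens the tube, typically yielding fewer support vectors
# (a sparser model) at the cost of a looser fit.
```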


Figure 4.6: Illustrative example of a set of predictions and ground-truth data with a regression line (solid black line), plotted as average F0 (Hz) against intensity (dB). The region between the support vectors is referred to as the hyper-tube space of width ε. The regression predictive error is the difference between the true data values (blue) and the predicted values (red circles); ξ₁ and ξ₂ are the soft margin-loss errors measured from the support vectors.

4.3.5 Relevance Vector Machine

The Relevance Vector Machine (RVM) is a state-of-the-art regression technique (Candela, 2014; Tipping, 2000, 2001, 2003), which has also been applied to depression severity prediction tasks, such as in Cummins et al. (2015b), Huang et al. (2016), and Mwangi et al. (2017). RVM utilises a Bayesian approach to learning, wherein the weights are governed by a set of hyperparameters whose most probable values are iteratively estimated from the training data. Unlike SVM-related techniques, RVM classification is not particularly associated with data samples nearest to the decision boundaries; instead, it attempts to use prototypical examples found within the individual classes. A notable advantage of RVM is that, unlike SVM-based techniques, it does not necessitate estimation of the trade-off parameter, nor does it require kernels satisfying the Mercer conditions. Moreover, RVM typically requires fewer kernel functions than the SVM technique, while allowing more flexibility in their use (Tipping, 2000). As described in Tipping (2000), given a dataset of sample-target pairs $\{\mathbf{x}_n, t_n\}_{n=1}^{N}$, $p(t_n|\mathbf{x}_n)$ is presumed Gaussian, $\mathcal{N}(t_n \,|\, y(\mathbf{x}_n), \sigma^2)$. The distribution mean for feature $\mathbf{x}_n$ is modeled by $y(\mathbf{x}_n)$, as previously defined in the SVM equation (4.33). The sample dataset likelihood is then defined as:

$$p(\mathbf{t} \,|\, \mathbf{w}, \sigma^2) = (2\pi\sigma^2)^{-N/2} \exp\left\{ -\frac{1}{2\sigma^2}\, \| \mathbf{t} - \boldsymbol{\Phi}\mathbf{w} \|^2 \right\} \qquad (4.40)$$

where $\mathbf{t} = (t_1, \ldots, t_N)^T$, $\mathbf{w} = (w_0, \ldots, w_N)^T$, and $\boldsymbol{\Phi}$ is the $N \times (N+1)$ 'design' matrix with $\Phi_{nm} = K(\mathbf{x}_n, \mathbf{x}_{m-1})$ and $\Phi_{n1} = 1$. Maximum-likelihood estimation of $\mathbf{w}$ and $\sigma^2$ from equation (4.40) often results in overfitting; thus, smoother functions are preferred, and these are obtained through an automatic relevance determination Gaussian prior over the weights (for more details, see MacKay (1994) and Neal (1996)):

$$p(\mathbf{w} \,|\, \boldsymbol{\alpha}) = \prod_{i=0}^{N} \mathcal{N}(w_i \,|\, 0, \alpha_i^{-1}) \qquad (4.41)$$

with $\boldsymbol{\alpha}$ a vector of $(N+1)$ hyperparameters. A trait of this model is that it provides sparsity properties, and it assumes a zero-mean Gaussian for each weight. Bayes' rule is used to calculate the posterior over the weights:

$$p(\mathbf{w} \,|\, \mathbf{t}, \boldsymbol{\alpha}, \sigma^2) = (2\pi)^{-(N+1)/2}\, |\boldsymbol{\Sigma}|^{-1/2} \exp\left\{ -\frac{1}{2} (\mathbf{w} - \boldsymbol{\mu})^T \boldsymbol{\Sigma}^{-1} (\mathbf{w} - \boldsymbol{\mu}) \right\} \qquad (4.42)$$

with

$$\boldsymbol{\Sigma} = (\boldsymbol{\Phi}^T \mathbf{B} \boldsymbol{\Phi} + \mathbf{A})^{-1} \qquad (4.43)$$

$$\boldsymbol{\mu} = \boldsymbol{\Sigma} \boldsymbol{\Phi}^T \mathbf{B} \mathbf{t} \qquad (4.44)$$

where $\mathbf{A} = \mathrm{diag}(\alpha_0, \alpha_1, \ldots, \alpha_N)$ and $\mathbf{B} = \sigma^{-2} \mathbf{I}_N$. The noise variance $\sigma^2$ is additionally treated as a parameter, as it is estimated from the sample data. The marginal likelihood is calculated by integrating out the individual weights (Tipping, 2000), which can be shown to be:

$$p(\mathbf{t} \,|\, \boldsymbol{\alpha}, \sigma^2) = (2\pi)^{-N/2}\, \left|\mathbf{B}^{-1} + \boldsymbol{\Phi}\mathbf{A}^{-1}\boldsymbol{\Phi}^T\right|^{-1/2} \exp\left\{ -\frac{1}{2}\, \mathbf{t}^T \left( \mathbf{B}^{-1} + \boldsymbol{\Phi}\mathbf{A}^{-1}\boldsymbol{\Phi}^T \right)^{-1} \mathbf{t} \right\} \qquad (4.45)$$

In the ideal Bayesian setting, hyperpriors over $\boldsymbol{\alpha}$ and $\sigma^2$ would be defined and, further, the parameters $\mathbf{w}$ integrated out. This marginalization cannot be performed in closed form, so a practical procedure based on MacKay (1992) is used. The marginal likelihood in equation (4.45) is optimized with respect to $\boldsymbol{\alpha}$ and $\sigma^2$, similarly to the maximum likelihood approach; this is equivalent to finding the maximum of $p(\boldsymbol{\alpha}, \sigma^2 \,|\, \mathbf{t})$ under the presumption of a uniform hyperprior. Using these maximizing values in equation (4.42), predictions are then made. The values of $\boldsymbol{\alpha}$ and $\sigma^2$ which maximize equation (4.45) also cannot be derived in closed form; therefore, the weights are interpreted as 'hidden' variables (Tipping, 2000) and an expectation-maximisation (EM) approach is applied:

$$\alpha_i^{new} = \frac{1}{\langle w_i^2 \rangle_{p(\mathbf{w}|\mathbf{t}, \boldsymbol{\alpha}, \sigma^2)}} = \frac{1}{\Sigma_{ii} + \mu_i^2} \qquad (4.46)$$

Furthermore, a direct differentiation of the marginal likelihood yields an update that can be reorganized as:

$$\alpha_i^{new} = \frac{\gamma_i}{\mu_i^2} \qquad (4.47)$$

where the quantities $\gamma_i = 1 - \alpha_i \Sigma_{ii}$ are defined and can be interpreted as a measure of how well each parameter $w_i$ is determined by the sample dataset. The latter update is observed to produce faster convergence. For the noise variance, both approaches lead to the same re-estimate:

$$(\sigma^2)^{new} = \frac{\| \mathbf{t} - \boldsymbol{\Phi}\boldsymbol{\mu} \|^2}{N - \sum_i \gamma_i} \qquad (4.48)$$

During re-estimation, many of the $\alpha_i$ values approach infinity and, via equation (4.42), the corresponding posteriors $p(w_i \,|\, \mathbf{t}, \boldsymbol{\alpha}, \sigma^2)$ become infinitely peaked at zero. In addition, there is an 'Occam' penalty for small values of $\alpha_i$ arising from the determinant in the marginal likelihood of equation (4.45); where a weight can instead be explained as additive noise $\sigma^2$, a lesser penalty is paid by letting $\alpha_i \to \infty$, which forces $w_i$ to 0. The main drawback of RVM is the complexity required at the training phase (Tipping, 2000). Also, for large datasets the RVM process may be considerably slower than an SVR-based approach.
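The following numpy sketch illustrates the iterative re-estimation of equations (4.43)-(4.48) under simplifying assumptions (a Gaussian kernel, a fixed iteration count, capped hyperparameters, and a crude pruning threshold); it is a toy illustration of the RVM update loop rather than Tipping's full algorithm, and the function names and data are hypothetical.

```python
# Toy RVM regression sketch: iteratively re-estimate alpha and sigma^2
# following equations (4.43)-(4.48), with an RBF kernel design matrix.
import numpy as np

def rvm_fit(X, t, gamma_rbf=1.0, n_iter=200, prune_at=1e6):
    N = X.shape[0]
    K = np.exp(-gamma_rbf * ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1))
    Phi = np.hstack([np.ones((N, 1)), K])           # bias column + kernel columns
    alpha = np.ones(N + 1)                           # ARD hyperparameters
    sigma2 = np.var(t) * 0.1                         # initial noise variance
    for _ in range(n_iter):
        A = np.diag(alpha)
        Sigma = np.linalg.inv(Phi.T @ Phi / sigma2 + A)          # eq. (4.43)
        mu = Sigma @ Phi.T @ t / sigma2                           # eq. (4.44)
        gamma = 1.0 - alpha * np.diag(Sigma)                      # well-determinedness
        alpha = np.minimum(gamma / (mu ** 2 + 1e-12), 1e12)       # eq. (4.47), capped
        sigma2 = np.sum((t - Phi @ mu) ** 2) / (N - gamma.sum())  # eq. (4.48)
    relevant = alpha < prune_at                      # weights not pruned toward zero
    return mu, relevant

# toy usage: noisy sinc-like data
rng = np.random.default_rng(0)
X = np.linspace(-3, 3, 60).reshape(-1, 1)
t = np.sinc(X).ravel() + rng.normal(0, 0.05, 60)
mu, relevant = rvm_fit(X, t)
print("relevance vectors retained:", int(relevant[1:].sum()), "of", len(t))
```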

4.3.6 Neural Network Based Methods

More recently, due to computational advances, machine learning approaches such as deep neural networks and other related methods (e.g. recurrent neural networks, convolutional neural networks) have increased in use (Bengio et al., 2013; LeCun et al., 2015). However, these algorithmic approaches typically require a great deal of data in order to avoid overfitting. A literature review by Pampouchidou et al. (2017), covering over 60 automatic depression analysis studies from the last decade, indicated that although a variety of neural network based methods have been explored for automatic depression classification and prediction tasks, they still represent only a small handful of studies compared with other techniques (e.g. SVM, SVR).

4.3.7 Evaluation Methods

For classification performance evaluation, accuracy is often reported and is computed from the confusion matrix:

$$\mathbf{a} = \begin{pmatrix} TP & FN \\ FP & TN \end{pmatrix} \qquad (4.49)$$

where TP equals the number of 'true positives', TN equals the number of 'true negatives', FN equals the number of 'false negatives', and FP equals the number of 'false positives'. The classification accuracy (A) can be computed by the following equation:

$$A = \frac{TP + TN}{TP + TN + FP + FN} \qquad (4.50)$$

Precision (P) is given by:

$$P = \frac{TP}{TP + FP} \qquad (4.51)$$

and recall (R) is given by:

$$R = \frac{TP}{TP + FN} \qquad (4.52)$$

The F1 score is also a common metric that combines precision and recall, thus allowing further evaluation of specific class performance based on true/false positives and true/false negatives, and is a helpful evaluation criterion for unbalanced classification problems. The F1 score is computed by the following equation (a larger F1 score implies better discrimination):

$$F1 = \frac{2PR}{P + R} \qquad (4.53)$$

For prediction evaluation tasks, the two most common performance metrics used to assess predictive accuracy are the mean absolute error (MAE) and the root mean square error (RMSE). The MAE is given by the following formula:

$$\mathrm{MAE} = \frac{1}{n} \sum_{i=1}^{n} |f_i - y_i| = \frac{1}{n} \sum_{i=1}^{n} |e_i| \qquad (4.54)$$

The MAE is an average of the absolute errors $|e_i| = |f_i - y_i|$, where $f_i$ is the prediction score and $y_i$ is the ground truth score. This error measurement is scale-dependent and thus cannot be applied to make comparisons between scores on different types of scales. The RMSE aggregates the magnitudes of the errors over multiple score predictions into a single measure of predictive power. Again, like the MAE, the RMSE is scale-dependent. One distinct advantage the RMSE has over the MAE is that it does not utilise absolute values and is generally better at revealing model performance differences. The RMSE formula is given as:

$$\mathrm{RMSE} = \sqrt{ \frac{1}{n} \sum_{i=1}^{n} (\hat{y}_i - y_i)^2 } \qquad (4.55)$$

where $\hat{y}_i$ denotes the predicted score and $y_i$ the corresponding ground-truth score; the RMSE is computed over $n$ predictions as the square root of the mean of the squared deviations.
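For concreteness, the sketch below computes the metrics of equations (4.50)-(4.55) with scikit-learn and numpy on hypothetical label and score vectors; it is purely illustrative and unrelated to any experiment reported in this thesis.

```python
# Illustrative computation of the classification and prediction metrics above.
import numpy as np
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Hypothetical binary labels: 1 = depressed, 0 = non-depressed.
y_true = np.array([1, 0, 0, 1, 1, 0, 0, 0, 1, 0])
y_pred = np.array([1, 0, 1, 1, 0, 0, 0, 0, 1, 0])

print("accuracy :", accuracy_score(y_true, y_pred))    # eq. (4.50)
print("precision:", precision_score(y_true, y_pred))   # eq. (4.51)
print("recall   :", recall_score(y_true, y_pred))      # eq. (4.52)
print("F1       :", f1_score(y_true, y_pred))          # eq. (4.53)

# Hypothetical severity predictions versus ground-truth scores.
y_score = np.array([3.0, 10.5, 17.0, 22.0])
y_hat = np.array([5.0, 9.0, 15.5, 25.0])
mae = np.mean(np.abs(y_hat - y_score))                 # eq. (4.54)
rmse = np.sqrt(np.mean((y_hat - y_score) ** 2))        # eq. (4.55)
print("MAE :", round(float(mae), 3))
print("RMSE:", round(float(rmse), 3))
```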

4.4 Summary

This chapter summarizes data selection, features, machine learning techniques, and evaluation methods commonly used for speech-based depression classification and prediction. The majority of feature and classifier types used for automatic speech-based depression analysis are based on previously well-investigated approaches stemming from other more established areas of speech processing (e.g. speaker recognition, speech recognition, language identification). This chapter also highlights the differences between classifier techniques, and pros/cons in implementing each. While no specific feature, set of features or classifier type is pre-eminent in depressed speech analysis, research continues to explore new promising methods. It should be noted that while many features and machine learning techniques have been explored for speech-based depression analysis and classification, no particular technique/s has shown significant dominance over all others in the literature, particularly across different databases and studies. Further, there is still ongoing investigation into developing new features and statistical decision-making methods for speech-based depression tasks. As highlighted later in Chapter 5.1, there is a limitation in the number of depression-related speech corpora currently available for research, which considerably limits the opportunity to use emerging advanced machine learning techniques, and requires care even in the application of well-established approaches.


Chapter 5 DEPRESSION SPEECH CORPORA

This chapter begins with a general overview of depression database criteria and describes the pros/cons of several elicitation speech modes. Additionally, nuisance factors are identified and examples are given of how these can impact depressed speech recording collections. The most commonly found database elicitation speech modes are identified and discussed: diadochokinetic, held-vowel, automatic, read, and spontaneous. Thereafter, current experimental depression databases are described in detail, including number of speakers, speaker demographics, depression type/severity, recording lengths, and speech mode/s. Note that all depression databases detailed herein are used in experimental work presented later in Chapters 6-13.

5.1 General Speech Depression Database Criteria

Access to speech-depression corpora with large numbers of speakers (i.e. 300+ speakers) is a major challenge for researchers, mainly due to patient-doctor confidentiality and ethics restrictions. According to Cummins et al. (2015a), there are fewer than two dozen speech-depression corpora worldwide, most of which are publicly unavailable. Moreover, the majority of these speech-depression databases have fewer than 30 depressed speakers and a single depression subtype diagnosis (e.g. bipolar, postpartum, MDD), involving a limited number of speech elicitation tasks. Currently, there is neither a principal guide nor a set of recommended test batteries, including speech modes, that adequately details which methods are most effective for reliably assessing disorders/diseases using automatic speech analysis. Automatic speech processing researchers have advocated more transparent collection methods for clinical speech data (Baghai-Ravary & Beet, 2013; Cummins et al., 2015a). However, in this field they have yet to propose audio collection standards with clear terminology, specific test criteria, and elicitation protocol materials based on solid clinically motivated investigations; this remains an open research problem. Many recent speech-based depression studies (Alghowinem et al., 2013a; Cummins et al., 2015a; Jiang et al., 2017; Kiss & Vicsi, 2017; Liu et al., 2017; Long et al., 2017a) have produced


discordant findings as to whether automatic, diadochokinetic, read, or spontaneous speech modes are best for analysis. Furthermore, the type of affect information contained in speech has also shown divergent speech-based depression performance results in many studies, such as Alghowinem et al. (2013a), Jiang et al. (2017), Liu et al. (2017), Long et al. (2017b), and Shankayi et al. (2012).

5.2 Speech Elicitation Considerations and Modes

Clinicians rely on a battery of evidence-based tests and their subjective experience to help diagnose patients. Depending on the clinician's expertise and the patient's issue, different speech collection modes may be used to elicit speech and test it against healthy speaker norms. In Egan (1948), it was suggested that speech mode test material and speech data collection include at least four criteria: (1) speech sounds representative of those found in spontaneous conversations; (2) considerations of the economy of speech production; (3) recording methods similar in quality across all speakers; and (4) tasks that include meaningless syllable combinations, meaningful words given out of context, or meaningful phrases with contextual relations among words. Affect is an additional speech parameter (Egan, 1948). Emotional analysis of speech has often included opposing situations, such as affective versus non-affective pictures and stressed versus non-stressed stimuli (Chevrie-Muller et al., 1985). In a clinical setting, linguistic and affect measures have been used in the design of speech task materials, allowing more control over linguistic complexity and sentiment. Another important measure often overlooked in clinical speech databases pertains to speaker description metadata. Metadata is essential for moving beyond broad population generalizations, allowing the creation of more specific speech-language speaker models. Examples of metadata speaker traits that can influence clinical speech-language analysis include a speaker's personality; parents' language; accent/dialect; prior pathologies; age; height; weight; drug use (e.g. drinking, smoking); hearing; education; and profession. In addition, metadata information regarding the interviewer can also be of interest, as studies have shown that this influences a speaker's social behaviours (Chevrie-Muller et al., 1985; Egan, 1948). Metadata adds considerable value to a speech data collection because it allows for stronger result validation of specific, isolated speaker criteria. Additionally, metadata also increases the opportunity for new discrete speech-illness correlations to be discovered.


5.2.1 Diadochokinetic

The diadochokinetic (DDK) speech mode is a conventional component of oral clinical neurology and speech-language pathology articulatory assessment protocols (Fletcher, 1972; Kent et al., 1987; Duffy, 2005). DDK speech consists of syllables with rapid articulatory repetitions. During DDK tasks, speakers are instructed to produce a fixed number of repetitive syllable combinations, involving different consonant-vowel combinations, as quickly as possible. A common DDK speech task is /pa-ta-ka/, which contains three different stop consonant positions. Although very constrained, DDK has remained a popular speech mode for clinical speech assessment because of its clinical origin and instructional simplicity, and because it requires a minimal amount of data for assessment. However, Gadesmann & Miller (2008) have suggested that DDK has shortfalls in terms of specificity to disorders, repeatability, languages, and demographics. Other work (Jaeger, 1992; Stemberger, 1989) has revealed that speakers exhibit a greater number of speech errors in the DDK mode than is normally expected in conversational speech, suggesting that the DDK task is unnaturally difficult compared with natural speech tasks.

5.2.2 Held-Vowel

The held-vowel speech mode is effective for providing fundamental information about an individual's voice quality (e.g. voicing, hoarseness, timbre) and respiratory control. In clinical audio analyses, held-vowel tasks often simply comprise the /a, i, u/ corner vowels. Surprisingly, despite only consisting of a single phoneme, the held-vowel mode has demonstrated usefulness in identifying and monitoring depression (Cummins et al., 2015a), dysphonia (Alsulaiman, 2014), oral cancer (De Bruijn et al., 2009), Parkinson's disease (Hemmerling et al., 2016), respiratory disorders (Shrivastava et al., 2018) and other pathological speech disorders (Dibazar et al., 2002). Since the held-vowel speech mode provides trivial linguistic content, it is not very useful for assessing language deficits. It is known that between-session held-vowel performance may vary even with the same speaker due to everyday circumstances (e.g. cold, tiredness, stress) (Cummins et al., 2015a).

5.2.3 Automatic

Automatic speech mode is both articulatory and language driven. Additionally, it deals to a great extent with explicit knowledge and relies on information retrieval. When observing automatic


speech, clinicians are primarily concerned with the articulatory accuracy, speed, and appropriateness of a patient's verbal response. Examples of automatic speech include counting, birthdate, alphabet, months, repeating audible sentences, yes/no responses, image identification, word-opposite, and categorization tasks (Bookheimer et al., 2000). Automatic speech also includes semi-spontaneous and semi-read tasks, such as rule-based unprompted dialog, prompted-question answering, device command-control, and spelling dictation. For example, such a task may include reading aloud part of a common idiom and verbalizing its completion (e.g. "It is raining cats {and dogs}"). In Bookheimer et al. (2000), it was shown that automatic speech typically does not engage the language cortex; however, reiterating prose passages produces activation in both Broca's and Wernicke's language areas. Automatic speech tasks require careful consideration with regard to a speaker's age, native language, and cultural background.

5.2.4 Read

The read speech mode consists of pre-selected words, sentences, or paragraph excerpts that are read aloud by a patient. Similarly to automatic speech elicitation, clinicians select read tasks depending on the patient's articulatory or language concern. Clinicians can easily manipulate the degree of articulatory-linguistic difficulty via the text stimuli, and consequently also elicit larger amounts of the target behaviour of interest (Sidtis et al., 2012). Other advantages of read speech tasks include simple instruction, repeatability, isolated cognitive demand, and phonetic variability control. Read speech tasks have been used to clinically assess patients with aphasia (Duffy, 2005), apraxia (Duffy, 2005), depression (Cummins et al., 2015a), learning disabilities (Gurland & Marton, 2012), and Parkinson's disease (Harel et al., 2004). The most popular read speech excerpts for clinical speech-based analysis are those written by Fairbanks (1960), Van Riper (1963), Townsend (1868), Crystal & House (1982), Tjaden & Wilding (2004), and Patel et al. (2013). However, many of these texts are old and are unnatural in terms of modern-day speech (Reilly & Fisher, 2012). Further, Boaz et al. (2016) found that some of these texts comprise unusual English syntax, which impacts speakers' testing performance. Perhaps the strongest argument against using many of the aforementioned texts is that they were not designed or originally intended for the diagnosis of multiple types of disorders; in fact, many of these texts were only intended as subjective tests of speech intelligibility (Fairbanks, 1960; Van Riper, 1963; Townsend, 1868; Reilly & Fisher, 2012). For read speech tasks there is a need to address age and cultural appropriateness. Additionally, pre-screening evaluations for reading ability and eyesight should be completed to further substantiate read speech task results.

5.2.5 Spontaneous

The spontaneous speech mode is synonymous with spoken open-thought or conversational speech. This mode is often called 'free'; however, in a clinical assessment setting restrictions usually still apply (i.e. structure, topic). Spontaneous speech allows the examination of an individual's explicit and tacit knowledge, together with a higher level of organization, when compared to other speech modes. The most compelling argument for the analysis of spontaneous speech is that it has ecological validity, in that it best represents a person's unrestricted natural social behaviours (Hirschberg et al., 2010). Spontaneous speech is usually collected using the following tasks: interview-type questions, telling personal stories, conversing on a particular topic, summarizing a short video, opinion/review, or the Rorschach test. For spontaneous speech interviewing, even though the same questions are repeatable across different speakers, there is no guarantee that all speakers will produce responses suitable for analysis. It has been shown that interviewers' demeanor, personal bias, interviewing skill, and cultural sensitivity influence how patients verbally socialize (Chevrie-Muller et al., 1985; Egan, 1948). Even within the spontaneous speech mode, there exists a wide range of spoken content, and it is relatively unknown how this varied content impacts behaviour and automatic speech processing performance. For example, spontaneous speech encompasses numerous variables related to socialization, such as: point-of-view accounts; impersonal to personal content; multidisciplinary to highly specialized subject matter; monologue to competitive discussion; unilateral to multilateral discourse; and question-answer to unconstrained formats. Spontaneous speech also contains a high number of natural disfluencies and auxiliary speech behaviours (e.g. sighs, laughs, pauses). During spontaneous speech, speakers often openly discuss their life, family, workplace, and medical history. Therefore, a pervasive complication in using recorded spontaneous speech is maintaining patient confidentiality.


5.3 Speech Nuisance Factors

Nuisance factors are anything that corrupts the originally intended signal information. Nuisance factors influence the quantity and quality of speech information collected; therefore, as expected, they have significant implications for how well a speech-based depression method can identify discriminative information. Ideally, only the depressed person's speech biosignal should be included during analysis, because other noise can lead to poor automatic classification/prediction performance, although this may not be practically achievable for some speech systems. As shown in Fig. 5.1, nuisance factors can influence each tier of speech acoustic, linguistic, and affect information (refer to Chapter 1, Fig. 1.1) individually, or in combination. Inherently, all databases have a degree of nuisance factor attributes. Many database collections try to identify and/or minimize these nuisance factors. It is important to acknowledge, however, that recordings collected in natural environments are often deemed more practical for real-world modeling applications because they are more truthful signal representations.

Figure 5.1: Diagram of unilateral speech biosignal transfer from a speaker to a listener that illustrates two origins for nuisance factors, internal and external. The dashed black arrow indicates a disruption of the speech biosignal information due to internal and external nuisances. Three primary internal speech nuisance factor types are shown in the green area: speech mode, biological, and cultural limitations. Additionally, in the red area, three main external speech nuisance factor types are shown: device, channel, and environmental interference.

As shown previously in Fig. 5.1, the speaker has internal nuisance factors, such as the mode of speech (i.e. some modes are less informative and/or expressive than others), biological aspects (i.e. physiological restrictions, comorbidity) and personal cultural considerations (i.e. socio-personality appropriateness). All of these internal nuisance factors can limit the quantity or quality of information expressed verbally. In addition, external nuisance factors include channel conditions


(e.g. microphone type, sampling rate), environmental interference (e.g. background noise), and device attributes (e.g. specifications of the audio acquisition hardware). As a whole, internal and external nuisance factors can considerably affect the intelligibility of speech. Interestingly, the internal nuisance factors are similar to the kinds of demographic and event diathesis factors that are of high importance to clinicians during assessments (refer to Chapter 2.1 for more details). Again, this demonstrates the importance of identifying these kinds of metadata, so as to incorporate them into speaker models for depression classification/prediction. The nuisance variability also helps explain differences found in classification/prediction performance across various speech-depression databases.

5.4 Speech-Based Depression Databases

5.4.1 Audio Visual Depressive Language Corpus

The Audio-Visual Depressive Language corpus (AViD) (Valstar et al., 2013) was used for experiments found later in Chapter 7. The AViD database contains different speech tasks (e.g. telling a story, reviewing/describing a commercial). The AViD spontaneous speech data was collected during an interview in which an interactive computer screen provided text instructions. As shown in Fig. 5.2, the experimental data subset has a total of 24 male and female speakers that were recorded using a high-quality close-talk microphone in a simulated clinical environment. The average length per AViD speaker file excluding silence was approximately 45 seconds, and speakers whose recording session comprised only a few spoken sentences were excluded. Every AViD speaker was given a Beck Depression Inventory-II (BDI-II) questionnaire (Beck et al., 1996), which is another commonly used clinical depression self-assessment previously described in Chapter 2.3.1. The total score for the BDI-II has a scale of 0 to 63, wherein larger scores imply greater depression severity. For experiments reported later in Chapter 7, this AViD experimental subset omitted speakers with clinically 'mild' depression on the basis that they are less likely to be at risk of self-harm. Therefore, only speakers from the AViD with BDI-II scores of 0-10 (i.e. "minimal depression" symptoms) and 20-63 ("severe depression" symptoms) were evaluated.


Figure 5.2: Distribution of BDI-II severity scores for the AViD database experimental subset of 24 speakers; 13 females (red) and 11 males (blue). BDI-II ranges from 0-10 were 'Non-Depressed' (~70% of speakers), whereas ranges 21-65 were 'Depressed' (~30% of speakers).

5.4.2 Audio Visual Emotion Challenge 2014

A subset of the Audio-Visual Emotion Challenge (AVEC 2014) corpus was used for experimental research found later in Chapter 10. As shown in Fig. 5.3, the AVEC 2014 includes studio-quality speech audio recorded from 84 male and female speakers. Each speaker undertook a Beck Depression Inventory-II (BDI-II) questionnaire (Beck et al., 2004), as previously described in Chapter 2.3.1. The AVEC 2014 data also contains, per speaker, standard baseline acoustic functional features, including 78 Low-Level Descriptors (LLDs), and affective ratings.

Figure 5.3: BDI-II distribution of the AVEC 2014 database experimental subset of 84 speakers; 43 females (red) and 39 males (blue). BDI-II ranges from 0-9 were 'Non-Depressed' (~50% of speakers), whereas ranges 19-46 were 'Depressed' (~50% of speakers).

The affect ratings were gathered from 4-5 listeners individually and they continuously rated every such file for arousal, valence, and dominance. All manually rated values were scaled between -1 and 1 on a per-rater basis. Fig. 5.4 shows the distribution of total ratings per arousal, dominance,


and valence for the AVEC 2014 database. Individual ratings were compiled using a weighted average to produce continuous (30 per second) arousal, valence, and dominance gold-standard ratings per file (see Fig. 5.5). The average length per AVEC 2014 speaker file excluding silence was approximately 2 minutes. For full details of the AVEC 2014 corpus the reader is referred to Valstar et al. (2014).

Figure 5.4: Histograms of all the AVEC 2014 train/development/test arousal, dominance, and valence gold-standard ratings. Note that the gold-standard affect ratings were all normalized between -1 and 1. For all three affect types, '0' was the most common value because it represents the most neutral ratings.

Figure 5.5: 3-dimensional plot of the recording-mean affect ratings for the AVEC 2014 database. The non-depressed speakers are shown in blue, whereas the depressed speakers are shown in red. Distinction between the two classes is observed especially in the arousal and valence axes, wherein depressed speakers have generally lower values.


5.4.3 Black Dog Institute Affective Sentences

The Black Dog Institute Affective Sentences (BDAS) database, containing data similar to that found in Alghowinem et al. (2012, 2013a, 2013b, 2015) and Joshi et al. (2013a, 2013b), was used for experiments found later in Chapter 11. 70 speakers were chosen according to criteria of recording quality, completion of all instructed affective read tasks, and equal cohort balance with regard to depression severity. The majority of the speakers had Australian-English as their primary language, while a few non-Australian-English accents were also included (e.g. Indian-English, Irish-English, American-English) per non-depressed/depressed group. All speakers were recorded during a single session using QuickTime Pro, at 44.1 kHz sampling frequency, subsequently downsampled to 16 kHz. The average length per BDAS speaker file excluding silence was approximately 2 seconds. All audio recordings were made in a clinical setting at the Black Dog Institute in Sydney, Australia. In total, as shown in Fig. 5.6, there were 21 female and 14 male speakers per non-depressed/depressed group. Speaker ages ranged from 21 to 75 years old. While the equal balance between non-depressed and depressed speakers is not representative of larger non-clinical populations (i.e. roughly 10% of the general population has a depression disorder (WHO, 2018)), it should be noted that depression occurs across wide age demographics and that females have a higher rate of depression than males (ABS, 2008). Moreover, as previously mentioned in Chapter 1, the literature suggests that rates of depression are actually higher than currently diagnosed (Mitchell et al., 2009). Initially, depressed speakers were recruited to the Black Dog Institute via tertiary referral from the Black Dog Institute Depression Clinic. The depressed speakers were then re-verified as currently exhibiting a major depression episode based on the Mini International Neuropsychiatric Interview (MINI) (Sheehan et al., 1998). This examination specifically helps clinically categorize melancholic and non-melancholic depression disorders under the guidance of a clinical psychiatrist. Only speakers who met the criteria for these depression categories were included in the recordings. Speakers were also vetted and excluded from the recordings if they had current/prior drug dependencies, neurological disorders, or a history of traumatic brain injury. In addition, because the elicitation mode required sentences to be read aloud, all speakers were given the Wechsler Test of Adult Reading (WTAR) (Venegas & Osok, 2011; Wechsler, 2001), and any speaker who scored below 80 was excluded. The non-depressed speakers were also subject to the same exclusion criteria.


Before each recording session, the Quick Inventory of Depressive Symptomatology Self Report (QIDS-SR) (Rush et al., 2005) was used to evaluate the speaker's current depression severity. For details on the QIDS-SR, see the previous Chapter 2.3.3. It should be noted that only speakers with 'none' severity and those with depression greater than 'moderate' severity were included in experiments herein. Similarly to previous studies (Hashim et al., 2017; Liu et al., 2016; Liu et al., 2017; Solomon et al., 2015; Long et al., 2017b), experimental analysis found later in Chapter 11 concentrated on speakers with higher severities of depression because they are most at-risk for suicide and often exhibit abnormal speech behaviours (Loo et al., 2008; Yorbik et al., 2014).

Fig. 5.6: QIDS-SR distribution of the BDAS database experimental subset of 70 speakers; 42 females (red) and 28 males (blue). QIDS-SR ranges from 0-10 were 'Non-Depressed' (50% of speakers), whereas ranges 11-27 were 'Depressed' (50% of speakers). The average QIDS-SR score for the depressed speaker group was 16.74, which is clinically labeled as 'severely depressed'. Furthermore, in the depressed speaker group, there were 19 melancholic and 16 non-melancholic speakers.

While the BDAS database includes several different participant tasks (Alghowinem, 2015; Cummins et al., 2011; Joshi et al., 2013b), research herein focused specifically on the task of reading the 20 short sentences found in Table 5.1. This task was self-administered by individual speakers via a computer screen without clinician interaction. Speakers viewed each sentence in its entirety, one-by-one, on the computer monitor, pressing the spacebar to proceed to the next sentence. There were no time constraints per sentence, nor intermediary neutralizing stimuli between sentences. The reading task included sentences with 'neutral' and 'negative' target keyword stimuli labels given in Alghowinem (2015), Cummins et al. (2013a), and Joshi et al. (2013b).

Table 5.1. BDAS read sentences 1-20 with the affective target keywords labelled as 'neutral', 'positive', or 'negative' (indicated in the original by bold text coloured dark green, light green, and red, respectively).

{1} He would abuse the children at every party
{2} There was a crowd gathering around the entrance
{3} The teacher made the class afraid
{4} There had been a lot of improvement to the city
{5} The devil flew into my bedroom
{6} The chefs hands were covered in filth
{7} My next door neighbour is a tailor
{8} The pain came as he answered the door
{9} She gave her daughter a slap
{10} There was a spider in the shower
{11} There was a fire sweeping through the forest
{12} The swift flew into my bedroom
{13} There had been a lot of destruction to the city
{14} The teacher made the class listen
{15} There was a crowd gathering around the accident
{16} He would amuse the children at every party
{17} My uncle is a madman
{18} The post came as he answered the door
{19} She gave her daughter a doll
{20} There was a puddle in the shower

The design of the affective-keyword read-aloud sentence elicitation mode was loosely based on the depression studies of Brierley et al. (2007) and Lawson et al. (1999) regarding emotionally charged keywords, wherein researchers demonstrated that depressed speakers display a negative bias in the interpretation of short sentences. The sentence reading tasks allowed for an investigation of affective speech characteristics and their relationship with depression severity index scores. This reading task also allowed for a controlled reduction in phonetic variability and sentence structure per speaker. Similarly to Mitterschiffthaler et al. (2008), the majority of affective keyword sentences were constructed using pairings containing approximately the same lexical frequency, pronunciation likeness, grammar classification, and syllable length. While speakers in the Black Dog database were not assessed for possible dyslexia conditions or visual impairments, the prevalence of these should be relatively similar for both depressed and non-depressed speakers. It is also acknowledged that affective keyword priming from a prior sentence might have influenced results found in succeeding sentences. During reading tasks, there is always the potential for this, as some individuals can fixate on a single phrase over all others. However, in a similar affective reading task study (Barrett & Paus, 2002), little difference caused by affective priming carry-over was found. According to the clinical depression literature (Goeleven et al., 2006; Gotlib & McCann, 1984; Leyman et al., 2007; Williams et al., 1986), a unique symptom that helps characterize an individual with depression is that negative stimuli trigger negative fixation.

5.4.4 Distress Analysis Interview Corpus – Wizard of Oz

For the experiments found later in Chapters 6, 7, 8, and 9, an audio subset of the training and development partitions from the Distress Analysis Interview Corpus – Wizard of Oz (DAIC-WOZ) (Gratch et al., 2014) was used due to its large number of speakers, transcripts, speaker time-markings, and previously published speech depression disorder analysis (Valstar et al., 2016). The DAIC-WOZ was created to examine a variety of language-related behaviours, such as speech patterns, kinesics, psychophysiology, and assisted human-computer spoken dialog. Unlike a human interviewer, the virtual human-computer interviewer provides neutral, unbiased emotion and a limited number of questions/responses. As shown in Fig. 5.7, the experimental data subset has a total of 82 male and female speakers that were recorded using a high-quality close-talking microphone. All recordings contain spontaneous North American English in a simulated clinical environment. Spontaneous speech was purposely selected for experimental analysis because it purportedly captures an individual's state-of-mind and social behaviours more naturally than

other modes of speech (e.g. read, automatic) (Cummins et al., 2015a). The average length per DAIC-WOZ speaker file excluding silence was approximately 7 minutes. Every speaker undertook a Patient Health Questionnaire (PHQ-8), which indicates his/her severity level of clinical depression (for more information on the PHQ-8, refer to the previous Chapter 2.3.2). Similarly to studies that excluded speakers with clinically 'mild' to 'moderate' depression (Liu et al., 2016; Solomon et al., 2015), experiments reported later in Chapters 6, 8, and 9 also omitted speakers within these ranges. Therefore, only speakers from the DAIC-WOZ with PHQ-8 scores of 0-4 (i.e. "no significant depression" symptoms) and 15-24 ("moderately severe" to "severe" symptoms) were included. Furthermore, speakers with "moderately severe" to "severe" PHQ-8 symptoms are also the most likely to exhibit psychomotor retardation (Loo et al., 2008; Yorbik et al., 2014). Of the total 82 speakers in the DAIC-WOZ subset, approximately 20% were labeled as 'Depressed'. While this percentage is a higher representation than is typically found in a primary care setting, where ~10% of patients meet the diagnostic criteria for depression (Luber et al., 2000), it is further estimated that nearly two-thirds of individuals with clinical depression go clinically undiagnosed in a primary care setting (Ani et al., 2008).

Figure 5.7: Distribution of PHQ-8 scores for the experimental DAIC-WOZ subset of 82 speakers; 36 females (red) and 46 males (blue). PHQ-8 ranges from 0-4 were 'Non-Depressed' (~80% of speakers), whereas ranges 15-24 were 'Depressed' (~20% of speakers).

5.4.5 Sonde Health

The Sonde Health (SH1) database is a subset of a speech corpus privately collected via an interactive Android™-based smartphone health app designed by Sonde Health, Inc. The SH1 database was developed in part to explore non-laboratory, natural-environment speech biosignal analysis for depression, including depression classification variation across different speaker


demographics and across various communication devices. All data were collected under a human subjects research protocol reviewed and approved by an Institutional Review Board. The SH1 database has characteristics that are unavailable from other publicly offered corpora: a sizeable group of speakers (males/females), high-quality (16kHz) recordings made on personal smartphones in uncontrolled naturalistic environments, free/read speech examples with nearly 100% same speaker overlap, speaker demographic metadata, audio quality ratings, and Patient Health Questionnaire-9 (PHQ-9) evaluations including the total score for each speaker (see previous Chapter 2.3.2 for more details on the PHQ-9). All audio files had Voice Activity Detection (VAD) (Kinnunen & Rajan, 2013) applied to remove silence. After VAD was applied, the free speech (SHf) and read speech (SHr) files were on average 20 and 120 seconds in length, respectively. For experiments reported later in Chapter 12 which used the SH1 database, and as shown in Fig. 5.8, the non-depressed class was determined as any score 0-7, whereas the depressed class included any score higher than 15, with intermediate scores excluded as suggested in (Hamilton, 1960). Of the total 150 speakers in the SH1 database, approximately 30% were labeled as depressed – an intentional over-representation of depressed individuals compared with the original SH1 database, which much more closely matched the typical depression incidence in the general population; this increase was done to maximize the available training data. In total, there were over 45 different

smartphone manufacturers/versions contained in the SH1 database.

Figure 5.8: Distribution of PHQ-9 scores for the experimental SH1 database of 150 speakers; 81 females (red) and 69 males (blue). PHQ-9 ranges from 0-7 were 'Non-Depressed' (~70% of speakers), whereas ranges 15-27 were 'Depressed' (~30% of speakers).


5.5 Summary

There is still significant uncertainty about which elicitation speech mode or modes should be adopted as a test battery for automatic depression analysis. For automatic speech-based depression analysis, the clinical application of systematically elicited speech modes remains mostly unexplored, unchallenged, and/or under-researched across depression diagnoses. Despite the groundwork set by Egan (1948) and others (Chevrie-Muller et al., 1985; Schiel et al., 2004), the majority of clinical speech studies still contain dissimilar data collected differently in terms of number of speakers, demographics, languages, speech modes, stimuli, cognitive load, recording methods, and number of sessions or durations. With so much variation in speech parameters and test materials, it is hard to establish the optimal speech elicitation mode for analysis, because it is exceedingly rare that two clinical speech databases follow exactly the same protocols. Chapter 5 has also provided an overview of elicitation speech modes, nuisance factors, and speech-based depression databases. As indicated earlier in Chapter 5.1, speech-depression databases are limited by their public availability, speaker size, subtype, and elicitation task attributes. The speech-based depression databases presented in this chapter were specifically described because they are utilised in experiments presented later in Chapters 6-12. Moreover, for experiments described later in Chapters 6-12, these specific databases were collected using read (e.g. BDAS, SH1) and/or spontaneous (e.g. AViD, DAIC-WOZ, SH1) speech modes. The research found in Chapters 11 and 12 is the first to conduct experiments using the BDAS and SH1 databases.


Chapter 6 ACOUSTIC PHONEME ATTRIBUTES

6.1 Overview

As explained in Chapter 1.2, an organized system allowing the examination of particular speech segments must be further developed to ascertain which of these segments provide the most reliable indication of depression. While acoustic-based biosignal links between clinical depression and abnormal speech have been established, there is still little knowledge regarding what kinds of phonological content are most impacted. Moreover, for automatic speech-based depression classification and depression assessment elicitation design protocols, even less is understood as to which phonemes or phoneme transitions provide the best analysis. If more is to be learned regarding phonological abnormalities found in depressed speakers, the concurrent analysis of the acoustic and linguistic tiers found in speech production could provide a more systematic, informative evaluation. By systematically evaluating spoken phrases and their articulatory markedness attributes/transitions, broadened knowledge can be gleaned with respect to which phonetic information differentiates depressed versus non-depressed speakers. As previously mentioned briefly in Chapter 1.1, phono-articulatory complexity score-based systems have been investigated as a measure of articulatory kinematic effort for detection of pathological speech disease/disorder. Yet in this literature, such as Stoel-Gammon (2010), Jakielski et al. (2006), Maccio (2002), Shriberg et al. (1997), and Shuster & Cottrill (1997), only a small set of articulatory characteristics, speakers, and languages were examined. Furthermore, these studies were primarily designed for speech pathology evaluation applications. It is thought that, for depression, the diminished articulatory ability and hypokinetic characteristics observed in speech could be measured using these types of descriptive phonologically-driven metrics. While analysis of articulatory measures in spoken language-specific disorders/diseases is common, fewer studies exist that examine the impact of depression on an individual's articulation, and perhaps more importantly, which phoneme or syllable combinations are most affected by


depression. In Gábor & Klára (2014), it was suggested that acoustic-based depression classification should place more emphasis on finding a correlation between depression severity and changes in articulatory acoustic phoneme parameters. As a follow-up extended analysis, an examination of different sets of articulatory measures grounded in phonetic markedness (Chomsky & Halle, 1968) and gestural effort (Browman & Goldstein, 1989, 1995) was proposed. Trevino et al. (2011) examined the correlation of individual phonemes and broad phoneme classes (i.e. vowels, consonants, nasals) with psychomotor retardation by individually evaluating average phoneme lengths and centroid energy spread. However, in this study, several articulatory attributes were left unexplored (e.g. placement, roundness, tenseness). Moreover, no investigative analysis of dynamic phonetic gestures, including the degree of articulatory transformation from one phoneme to the next, was considered. To date, no investigation has examined the use of phoneme-to-phoneme articulatory gestural effort in a speech-based acoustic feature depression classification framework. Based on the aforementioned depression disorder speech behaviours (Cummins et al., 2015a; Scherer & Zei, 1988; Flint et al., 1993; Darby et al., 1984), it is hypothesized that:

• For depressed speakers, psychomotor retardation symptoms, which reduce articulatory motor control, will be exhibited to a greater extent in the production of some particular phoneme manners (e.g. voiced, round) than others.

• As the demand for articulatory gestural effort increases on a phoneme-to-phoneme level within a spoken phrase, depressed speaker characteristics will become acoustically more evident within these phrases.

6.2 Methods

A promising, practical, and inexpensive method for evaluating articulatory-physiological speech parameters is the analysis of phoneme occurrences/transitions based on spoken transcript analysis and acoustic-based features. Early influential investigations of physiological articulation parameters concentrated on articulatory descriptors, such as English vowel attributes (Ladefoged, 1993), universal phonetic markedness (Chomsky & Halle, 1968; Henning, 1989) and spoken articulatory gestures (Browman & Goldstein, 1986, 1992). These linguistically motivated investigations provided further insight into natural speech production, including static phoneme analysis and non-static articulatory transitions called articulatory gestures, which are essentially parameterized dynamic systems based on overlapping, coordinated muscular movements.


In Browman & Goldstein (1989, 1995), a gestural score was proposed to help measure articulatory movements based on the linguistic order of articulatory change. Unlike a discrete sequence of phonemes, a gestural score has the advantage of encoding a set of hidden states present in natural interconnected speech. For example, the word "bake" can be represented as the sequence of discrete phonemes /B-EY-K/. Alternatively, using gestural representations, "bake" can also be represented as consonant-vowel-consonant and voiced-voiced-voiceless, providing explicit articulatory information regarding this word which might otherwise be overlooked. Two different types of per-phrase articulation measures, motivated by the Chomsky-Halle phonetic model (Chomsky & Halle, 1968), were proposed and investigated: (1) phonetic markedness, which included 17 articulatory manners, and (2) a Hamming distance mean. Phonetic markedness is a term used to describe the physiological-articulatory characteristics of a particular phoneme, whereas the Hamming distance mean is a binary distance measure defined by the amount of articulatory gestural change between each phoneme. The phonetic markedness is a percentage-based distribution calculated by the following simple equation:

$$M_k = \frac{1}{N} \sum_{i=1}^{N} y_{i,k} \qquad (6.1)$$

where $y_{i,k} \in \{0, 1\}$ is the markedness of the $i$th phoneme for the $k$th manner and $N$ is the number of phonemes in the utterance; per utterance, $M_k$ therefore represents the mean markedness for manner $k$. This percentage-based approach provided articulatory markedness distributions within each phrase, and allowed phrases of different lengths to be compared equally to each other. For example, according to Table 6.1, the name "John" /J-AO-N/ is represented by three phonemes that together have the following phonetic markedness distributions: voice ($M_k$ = 1.0), back ($M_k$ = 0.33), and anterior ($M_k$ = 0.66).

Table 6.1. Phonetic markedness distributions for a few English sounds (phonemes AO, AA, JH, N, K, F, SH, HH against the manners Voice, Back, and Anterior), wherein each cell has a binary phonetic markedness value; 0 (white) represents inactive, whereas 1 (black) indicates active.

Considering Table 6.1, the articulatory gestural effort due to the transition between a pair of consecutive phonemes can be reflected by the difference in the Chomsky-Halle phonetic markedness corresponding to each parameter. Since the Chomsky-Halle parameters are binary, this is a bitwise distance. The Hamming distance (Hamming, 1950) is a bitwise metric that measures the minimum number of substitutions required to transform one binary string $Y_i$ into another binary string $Y_j$ of equal length $n$. It can be represented by:

$$d^{HD}(i, j) = \sum_{k=1}^{n} |y_{i,k} - y_{j,k}| \qquad (6.2)$$

The Hamming distance gives the number of mismatches between strings, and in this context, is related to how many changes in manner occur during a particular sequence of phonemes. This is treated herein as a proxy for the articulation effort of phoneme transitions. In Fig. 6.1, the higher the



Hamming distance, the greater the difference between two binary string representations. 110

111

010

011

101

100 000

001

Figure 6.1. Illustration of Hamming distance for a 3-bit string. The Hamming distance between 000 and one of 100, 010, or 001 is only 1, whereas there is a distance of 3 between 000 and 111. This was applied to calculate the articulatory distance between consecutive phoneme pairs, by calculating the distances between the relevant pair of rows found later in Table 6.2.

The example previously shown in Table 6.1 indicates that the phonetic markedness within phoneme /HH/ is least like /AO/, and hence the /HH/-/AO/ transition has larger gestural effort (dHD = 1.0) than transitions from /AO/ to the more similar phonemes /SH, JH, K/ (dHD = 0.66), where these values are expressed as a proportion of the three manners shown. For every phrase, dHD was calculated between all consecutive phoneme pairs across all 17 phonetic markedness articulation manners, and the Hamming distance mean for the phrase was then taken as the mean of these pairwise dHD values.
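To make the two measures concrete, the following sketch (written in Python purely for illustration; it is not the thesis code, and the markedness values shown are illustrative rather than taken from Table 6.2) computes the per-phrase markedness means of Eq. (6.1) and the Hamming distance mean of Eq. (6.2) from a phoneme sequence:

```python
# Minimal sketch (not the thesis code) of the per-phrase phonetic markedness means of
# Eq. (6.1) and the Hamming distance mean of Eq. (6.2). The MARKEDNESS values below are
# illustrative only; the thesis uses the full 39-phoneme x 17-manner matrix of Table 6.2.

MANNERS = ["voiced", "back", "anterior"]      # a subset of the 17 manners, for illustration
MARKEDNESS = {                                # hypothetical binary rows (not Table 6.2 values)
    "JH": [1, 0, 0],
    "AO": [1, 1, 0],
    "N":  [1, 0, 1],
}

def markedness_means(phonemes):
    """Per-phrase mean of each binary manner over the phonemes (Eq. 6.1)."""
    rows = [MARKEDNESS[p] for p in phonemes]
    n = len(rows)
    return {manner: sum(row[k] for row in rows) / n for k, manner in enumerate(MANNERS)}

def hamming(p1, p2):
    """Number of manner mismatches between two phonemes (Eq. 6.2)."""
    return sum(abs(a - b) for a, b in zip(MARKEDNESS[p1], MARKEDNESS[p2]))

def hamming_distance_mean(phonemes):
    """Mean Hamming distance over consecutive phoneme pairs within a phrase."""
    pairs = list(zip(phonemes, phonemes[1:]))
    return sum(hamming(a, b) for a, b in pairs) / len(pairs)

phrase = ["JH", "AO", "N"]                    # "John" as /J-AO-N/
print(markedness_means(phrase))               # voiced -> 1.0, back -> 0.33 (illustrative values)
print(hamming_distance_mean(phrase))
```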


Table 6.2. Articulation effort matrix based on the 17 Chomsky-Halle phonetic markedness attributes. The columns (k) are the manners and the rows (i) are English phonemes; white indicates inactive (y_{i,k} = 0), whereas black indicates active (y_{i,k} = 1). The parentheses indicate the total number of phonemes contained within each manner. During experimental work, in an identical fashion to Fig. 6.1, each phoneme was represented as a 17-bit string.

[The original matrix is graphical and its binary cell values are not reproduced here. Its 17 manner columns (with phoneme counts) are: High (13), Back (9), Low (4), Anterior (21), Coronal (14), Round (7), Tense (7), Voiced (30), Continuant (27), Nasal (3), Strident (8), Sonorant (22), Interrupted (8), Distributed (6), Lateral (1), Consonantal (25), and Vocalic (15). Its 39 phoneme rows are AA, AE, AH, AO, AW, AY, EH, EY, IH, IY, OW, OY, UH, UW, ER, B, D, P, T, V, Z, ZH, S, SH, M, N, NG, DH, CH, F, G, HH, JH, K, L, R, TH, W, Y, with the example words odd, at, hut, ought, cow, hide, Ed, ate, it, eat, oat, toy, hood, two, hurt, be, dee, pee, tea, vee, zee, seizure, sea, she, me, knee, ping, thee, cheese, fee, go, he, jeep, key, Lee, read, theta, we, yield.]

6.3 Experimental Settings

For experiments herein, the open-source COVAREP speech toolkit (Degottex et al., 2014) was used to extract 74 acoustic features (e.g. glottal flow, MFCCs, pitch) by aggregating 20-ms frame-level features across individual utterances. The COVAREP feature set was chosen because it has been used previously for emotion and speech-based depression research on this database (Valstar et al., 2016; Cummins et al., 2017). Each frame-level feature had its mean, standard deviation, kurtosis, and skewness calculated, which generated a total feature vector dimensionality of 296. For a baseline, each speaker's acoustic features were averaged across all phrases without partitioning. During the partitioning experiments, only the acoustic features corresponding to the phrases in a specific partition were averaged. Similarly to Mitra et al. (2014), depression classification was conducted using decision trees, which performed well in preliminary experiments. All experiments used the simple 'coarse' decision tree configuration from the MATLAB toolkit, with four leaves and a maximum of 4 splits. Experiments utilised 2-class classification (depressed/non-depressed) with 5-fold cross validation using a 20/80 training/test split. Classification performance was determined using overall accuracy, individual class F1 scores, and F1 average scores (similarly to Valstar et al. (2016)). All experiments utilised the Distress Analysis Interview Corpus (DAIC-WOZ) (Gratch et al., 2014); for more information on the DAIC-WOZ database, see Chapter 5.4.4. The DAIC-WOZ was chosen for experiments because it had a large number of speakers with clinically diagnosed depression and, importantly, transcripts for each speaker. Herein, two different speaker groups were analyzed based on two restricted PHQ-8 score ranges: non-depressed (0-4) and depressed (15-24) (similarly to Solomon et al. (2015) and Liu et al. (2016)). Speakers within the mid-PHQ-8 range were purposely removed to better establish far-end class trends. According to the descriptors associated with the PHQ-8 scores (Kroenke et al., 2009), the non-depressed group can be considered to have "no current depression", whereas the depressed group has "current moderately severe to severe depression". Nearly 20% of speakers in this experimental subset were in the latter group (i.e. clinically diagnosed as depressed). The DAIC-WOZ transcripts were converted from standard text format to phoneme representations using the Carnegie Mellon University (CMU) phonetic dictionary (CMU, 1993). This dictionary consists of phonetic spellings for over 130k American English words using 39 phoneme
representations (see Table 6.2, first column). Afterwards, for each speaker, his/her audio file was segmented according to the phrase-level time markings (i.e. roughly one spoken sentence per entry) of the transcript. All phrases were sorted and partitioned into low, mid-low, mid-high, and high partitions according to each of the 17 individual phonetic markedness (Mk) measures and the Hamming distance (dHD) measure. An ad hoc method was used to determine the partition cutoffs; this selection method involved obtaining near-equal numbers of phrases per partition, so that classification experiments on each of the four individual partitions would be comparable in terms of training/test size. Thus, there were ~2,500 phrases per partition and all speakers had multiple phrases represented within each partition.
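As an illustration of this data selection step (not the thesis code; the phrase record structure and field names such as 'd_hd', 'features', and 'speaker' are assumptions), phrases can be sorted by a chosen per-phrase measure and split into four near-equal partitions before each speaker's acoustic features are averaged within each partition:

```python
# Sketch of the phrase data selection step under the assumptions stated above. Phrases are
# sorted by a per-phrase measure (an Mk value or the Hamming distance mean) and split into
# four near-equal partitions; each speaker's acoustic features are then averaged per partition.
import numpy as np

def partition_phrases(phrases, key="d_hd", n_parts=4):
    """Sort phrases by the chosen measure and split into near-equal partitions."""
    ordered = sorted(phrases, key=lambda p: p[key])
    base, rem = divmod(len(ordered), n_parts)
    parts, start = [], 0
    for i in range(n_parts):
        end = start + base + (1 if i < rem else 0)
        parts.append(ordered[start:end])
        start = end
    return parts          # [low, mid-low, mid-high, high]

def speaker_means(partition):
    """Average the acoustic feature vectors of each speaker's phrases within one partition."""
    by_speaker = {}
    for p in partition:
        by_speaker.setdefault(p["speaker"], []).append(p["features"])
    return {spk: np.mean(np.vstack(feats), axis=0) for spk, feats in by_speaker.items()}

# Classification experiments are then run separately on speaker_means(part) for each of the
# four partitions, and compared against the all-phrase baseline.
```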

6.4 Results and Discussion

The baseline acoustic system experiment resulted in an average accuracy of 69.5%, with F1 scores of 0.07 and 0.82 for the depressed and non-depressed classes, respectively. In comparison with the baseline, results in Fig. 6.2(a) show depressed/non-depressed classification accuracy for each static phonetic markedness (Mk) parameter for the low (light gray), mid-low, mid-high, and high (dark gray) phrase partitions. With regard to the first hypothesis, for several phonetic markedness parameters, the results indicate that depressed/non-depressed accuracy is better for higher than for lower density phonetic markedness partitions. This was particularly the case for the consonantal, back, round, and voiced phonetic markedness parameters, wherein classification accuracy improved as these parameters had greater prominence in phrases. The opposite effect was recorded for the strident and lateral phonetic markedness parameters, for which phrases containing fewer of these attributes produced higher classification accuracy. Based on prior literature (Buyukdura et al., 2011; Cannizzaro et al., 2004a; Flint et al., 1993; Helfer et al., 2013; Mundt et al., 2012) discussed previously in Chapters 2.1 and 3.4.1, it was anticipated that depressed speakers (i.e. due to the effects of psychomotor retardation and decreased speech clarity) would demonstrate more discriminative acoustic differences in 'voiced' and/or non-neutrally positioned phonemes than in 'unvoiced' or more centrally located phonemes. Non-neutrally positioned phonemes include vowel extremes (e.g. back, rounded, voiced) and consonants that typically require more coordinated gestural effort than more neutral phonemes (e.g. continuants).


Figure 6.2. (a) Classification accuracy, (b) depressed F1 score, and (c) non-depressed F1 score results for acoustic features based on sorted mean markedness (Mk) of all 17 phonetic markedness parameter phrase partitions. Differently shaded green dots per markedness parameter represent the four partitions (i.e. high, mid-high, mid-low, low), which comprise the different range densities of phonetic content analysed for depression classification. Further, the far-end partitions (e.g. high, low) are indicated by an interconnected line between each phonetic markedness parameter.

Fig. 6.2(b) shows the depressed F1 scores for individual phonetic markedness parameters. These results display a great deal of variance in scores across nearly every phonetic markedness parameter, whereas the Fig. 6.2(c) non-depressed F1 classification
score ranges are significantly narrower. The depressed F1 scores are considerably lower than the non-depressed F1 scores. This is presumably due to the 1:5 ratio of depressed to non-depressed speakers, which means significantly less training data are available for the depressed class than for the non-depressed class. Preliminary experimental results (not shown here) using balanced speaker classes showed classification improvements for depressed F1 scores, albeit with minor costs in classification accuracy. However, from a clinically realistic sample standpoint, moderately severely to severely depressed speakers typically make up a smaller percentage of the population (WHO, 2001). Considering articulatory transitions, when the Hamming distance mean (dHD) was used to partition speakers' per-phrase acoustic features, the results in Fig. 6.3 show absolute accuracy gains of around 5% when using the mid-high to high partitions rather than the low to mid-low partitions. While the absolute accuracy gain for the higher Hamming distance partitions over the all-phrase baseline is only a small improvement (~2%), there were considerable absolute gains in the depressed F1 score for the higher partitions (from 0.07 to 0.21). As suggested by the second hypothesis, the depressed F1 score improvement indicates that a greater number of depressed speakers are correctly classified using acoustic features from phrases that have a larger Hamming distance mean. These improvements, in the mid-high and high Hamming distance partitions, suggest that elicitation of speech should contain the widest phoneme-to-phoneme activation range of articulatory gestural transitions, to improve depression discrimination.

Figure 6.3. Classification accuracy (grey bar), depressed (*) and non-depressed (•) F1 score performance results for the baseline (all phrases) versus phrase phonetic markedness dHD Hamming distance partitions.

Differences in the number of phonemes per phonetic markedness parameter suggest that if the 17 different markedness parameters were weighted by natural phoneme occurrence, the results might achieve better classification performance. However, the issue of determining suitable weights per articulatory trait is a problem found in many of the score-based systems mentioned earlier in
Chapter 6.1 (Stoel-Gammon, 2010; Jakielski et al., 2006; Maccio, 2002; Shriberg et al., 1997; Shuster & Cottrill, 1997). Moreover, there is a surprising scarcity of literature related to the kinematic demands and processes of speech (Locke, 2008; Pellegrino et al., 2009; Westbury & Dembowski, 1993). While it is presumed that greater muscular involvement is associated with greater articulatory effort, there appears to be no literature that provides a quantitative guide on whether, for instance, rounding of the lips requires less, equal, or more effort than voicing a phoneme. To gain greater insight into speech affected by depression, along with other common diseases/disorders, additional research in the area of measurable phono-articulatory kinematics is needed. Finally, it is noted that although the experiments herein rely heavily on human-transcribed speech, current Automatic Speech Recognition (ASR) software is achieving near-human word error rates.

6.5 Summary

An association between acoustic and linguistic speech components was proposed to explore new measures for automatic speech data selection. Experiments presented herein explored a new set of articulatory data selection measures derived from Chomsky-Halle phonetic markedness and from articulatory transitions using the Hamming distance mean. These previously unexplored data selection measures were used to gain further insight into how articulation is affected by depression, and how speech articulatory-linguistic information can be used for data selection to improve speech-based depression classification performance. As hypothesized, depression classification accuracy was higher than baseline accuracy for specific phonetic markedness manner types, such as the consonantal, back, round, and voiced manners, more so than for other types. It was surmised that depressed speakers, due to the subsymptoms of psychomotor retardation and cognitive deficiencies, have greater articulatory difficulty than non-depressed speakers during phrases with more demanding articulatory-gestural phoneme-to-phoneme effort. The experimental results support the hypothesis that spoken phrases with more complex articulatory transitions, in terms of markedness Hamming distance means, also produce better depression classification accuracy. The introduction of phonetic markedness into data selection has an important implication for understanding which speech sounds and combinations are altered by depression. Based on these experimental results, more effective data selection and/or elicitation design protocols (e.g. text-dependent read material) for speech-based depression detection can be created using articulatory-gestural effort measures, allowing for the collection of greater amounts of depression-discriminative speech data.


Chapter 7 ARTICULATORY STRESS

7.1 Overview

The effects of psychomotor retardation associated with clinical depression have been linked with a reduction in variability of acoustic parameters in previous work as noted earlier in Chapter 3.4. However, linguistic stress differences between non-depressed and clinically depressed individuals have yet to be investigated. In Chapter 1.2, it was stated that a primary objective of this thesis is to explore how linguistic attributes can be leveraged for acoustic data selection in speech-based depression analysis. An investigation of paraphonetic acoustic-linguistic properties exhibited by depressed speakers when compared with non-depressed speakers could offer new insights into which phonemes contain more discriminative information, and furthermore, impact clinical elicitation criteria and designs. Recently, new linguistic text-based methods, such as topic modeling (Gong & Poellabauer, 2017) and natural language processing (Dang et al., 2017) have been used to investigate depression severity. These types of approaches mainly treat acoustic and linguistic information separately, fusing the outputs of two independent subsystems, rather than employing acoustic analyses that are dependent on the linguistic transcript. Gábor and Klára (2014) suggest that acoustic-based depression classification should place more priority on discovering a correlation between depression severity and changes in articulatory acoustic phoneme parameters. In diagnostic evaluations, clinicians have referred to depressed speech using subjective descriptors such as ‘flat’, ‘monotonous’, and ‘monoloud’ (Cummins et al., 2015a; Darby & Hollien, 1977; Newman & Mather, 1938; Ostwald, 1965). For individuals with depression disorders, psychomotor retardation (Caligiuri & Ellwanger, 2000; Mayer-Gross et al., 1969) is a common key subsymptom that encompasses a measurable decline in neural planning and control of motor movements. Clinical studies on depressed speakers displaying psychomotor retardation (Bennabi et al., 2013; Buyukdura et al., 2011; Cannizzaro et al., 2004a; Darby & Hollien, 1977; Flint et al., 1993; Szabadi et al., 1976) have discovered abnormal recurrent speech production indicators, such as greater

muscle tension and respiratory rate, especially as an individual's depression severity increases (Kreibig, 2010; Scherer, 1986). The increase in overall muscle tension directly impacts the dynamic function and range of the vocal folds. In depressed individuals exhibiting psychomotor retardation, Roy et al. (2009) found that constraints in the vocal folds similarly impact the jaw and facial muscles in a gross manner. This global manifestation of fine motor strain adversely impacts speech production, leading to an increase in speech errors, a decrease in speaking rate, and more hesitant speech patterns (Cannizzaro et al., 2004a; Darby et al., 1984; Ellgring & Scherer, 1996; Fossati et al., 2003; Nilsonne, 1987, 1988; Sobin & Sackeim, 1997; Szabadi et al., 1976). The deterioration of fine motor control due to psychomotor retardation ordinarily results in underarticulation (Darby et al., 1984; Scherer et al., 2015), which is also referred to as hypoarticulation (Lindblom, 1990). Hypoarticulation is a uniform, non-dynamic speech production approach that minimizes the degree of articulatory effort and variability. Therefore, hypoarticulation causes greater perceptual auditory blur between dissimilar sounds while also affecting elements of prosody across syllables. It has been documented in numerous speech and depression studies (Cannizzaro et al., 2004a; Greden et al., 1981; Gruenwald & Zuberbier, 1960; Hargreaves & Starkweather, 1964; Helfer et al., 2013; Kuny & Stassen, 1993; Mundt et al., 2007, 2012; Ostwald, 1965) that clinically depressed speakers exhibit a reduction in vocal emphasis quality and articulatory precision that results in poorer speech intelligibility than is found in healthy populations. In the literature, thus far, no research has investigated English vowel sets with regard to linguistic stress. From a paraphonetic (i.e. individual phoneme timbre) standpoint, in English and most other spoken languages including German, linguistic stress is a perceptual observation of rapid spoken fluctuations in the following: duration (length), loudness, pitch (F0), and quality (Fry, 1955, 1958, 1965; Morton & Jassem, 1965). In linguistic terms, this form of speech modulation can be understood as being composed of a mixture of stressed and non-stressed sounds. In natural speech, linguistic stress functions at a phoneme unit level to permit greater segmental distinction between streams of interlinked phonemes (e.g. syllables, words, phrases). Linguistic stress improves speech intelligibility by emphasising sound units that differ from each other. Linguistic stress also provides audible cues as to which sound units carry the most important information content (Hockett, 1958; Ladefoged, 1967; Miller, 1963). In Hitchcock & Greenberg (2001) and Greenberg (2002), it was shown that syllable stress perceptually influences an
individual's ability to identify phonetic segments in spontaneous speech, especially temporal aspects of the vocalic nucleus. It is believed that depressed English speakers with hypoarticulation exhibit an overall reduction in intensity and length variation across all syllable types (e.g. stressed, unstressed), whereas non-depressed speakers demonstrate greater variability. Due to the tongue mobility limitations and neutral tongue placement found in speakers with hypoarticulation, it is also thought that more severely depressed speakers will have shorter vowel durations when compared with non-depressed speakers. Only limited emphasis has been placed on the effects of depression on English vowel production in prior studies (Scherer et al., 2016; Trevino et al., 2011; Vlasenko et al., 2017). Trevino et al. (2011) discovered statistically significant correlations between the psychomotor retardation subsymptom and duration/signal power for most vowels in depressed speakers. By contrast, Trevino et al. (2011) found minimal statistical correlation between the agitation subsymptom and duration/signal power in depressed speakers' vowel production. Note that this research did not evaluate linguistic stress duration and power as isolated features for speech-depression classification. Scherer et al. (2016) found that depression affected the vowel frequencies F1 and F2 and the Vowel Space Area (VSA), formulated by Bradlow et al. (1996) for speech clarity. Using a similar F1 and F2 VSA-based approach to that of Scherer et al. (2016), Vlasenko et al. (2017) used a phonetic recognizer to compare recorded vowels of clinically depressed and non-depressed speakers, but with the further implementation of gender-dependent modeling. In both Scherer et al. (2016) and Vlasenko et al. (2017), the experimental approaches only investigated articulatory formant VSA-based area traits of the three corner vowels, excluding other potentially useful vowels. Again, these studies did not examine linguistic stress across different vowel characteristics. Fig. 7.1 contains important details regarding how the placement of the tongue normally operates during English vowel sounds. As shown in Fig. 7.1, the vowel space comprises different articulatory positions that affect vowel quality and allow greater distinctions between different vowel sounds. There are three major vocoid articulations that have a significant impact on the shape of the oral cavity, and in turn, vowel quality: tongue height, tongue advancement, and lip position (Hockett, 1958). The tongue height is based on the vertical positioning of the tongue along with the upper and lower jaw positions. The opening of the jaw aids in allowing the tongue to reach its correct placement.


In Fig. 7.1, the y-axis (tongue height) is described in terms of vertical movement: high, mid, and low. The x-axis indicates tongue advancement, which is related to the horizontal positioning of the tongue in terms of the tongue area predominantly involved during vowel production. For example, in the sound /iy/, the whole upper portion of the tongue is high from dorsum to blade, whereas with /uw/ only the dorsum has high placement. The lip position relates to the shape of the lips during articulation. Both the /uw/ and /ow/ vowel sounds are considered rounded, whereas /ah/, for instance, is unrounded. An additional aspect of vowel production is the lax/tense distinction. Lax vowels (/ih, eh, ah, ae/) tend to be shorter in duration than tense vowels and are produced in a less constrained muscular manner. In contrast, tense vowels (/iy, uw, ow, aa/) are usually longer in duration than lax vowels, and are produced with more lip rounding involving greater muscular tension (Hockett, 1958). A monophthong is a fixed pure vowel sound within a single syllable, whereas a diphthong is a single syllable in which one vowel transitions into an adjacent vowel.

Fig. 7.1: Illustration of the vowel quadrilateral and various tongue parameters based on North American English (Hockett, 1958; Ladefoged, 1967, 1975). Note that only the 8 monophthong (in black) and 3 diphthong (in grey) vowels investigated in this paper are shown. The arrows indicate the approximate starting and ending positions for the diphthongs. The superscript ˚ indicates that a vowel is rounded.

It is hypothesized that certain articulatory vowel parameters and/or movements, as illustrated in Fig. 7.1, are more affected by depression than others. This hypothesis is motivated by work from Flint et al. (1993) and Tolkmitt et al. (1982), who previously inferred reduced articulatory effort based on noticeable F2 reductions in depressed speakers. It is also anticipated that many depressed individuals demonstrate hypoarticulation, which influences the dynamic nature of their linguistic stress strategies (e.g. duration, loudness, pitch), producing shorter, more uniform vowel durations and less dynamic vocal intensity. The motivation behind this experimental analysis is to link known English articulatory norms of linguistic stress to the hypoarticulation effects found in speakers with a depression disorder. It is proposed that, by investigating English (along with a German database for comparison) linguistic stress components at a constrained, fine-grained paraphonetic level, acoustic differences between depressed and non-depressed speakers will become more evident.

7.2 Methods

To automatically segment phonemes from the audio files, the Brno Phoneme Recognizer was used (Schwarz et al., 2006). While the Brno recognizer has a moderate degree of error (~30%), it has been widely applied in speech processing research and is considered a de facto standard recognizer (Trevino et al., 2011). Moreover, studies on human-annotated transcripts have noted a degree of error roughly as high as contemporary automatic methods, especially for spontaneous speech (Hayden, 1950; Mines, 1978). For phoneme recognition, a North American English model was used because the DAIC-WOZ English speech data analyzed were of a similar dialect origin. The North American English recognizer was also applied to the German AViD data. Although a German phonetic model would be optimal for labeling spoken German data, it was desirable to be able to directly compare DAIC-WOZ and AViD experiments using the same vowel labels found within the English model. There are limitations in using an English phonetic model to label German phonemes; this results in more generalized phoneme labels. It is noted that the German language has several more monophthong vowels than English, including 'rounded' front vowels, which English does not have (Schultheiss, 2008). The experiments herein focused specifically on the most frequently occurring vowels across all speakers, as listed in Table 7.1. A few vowels and diphthongs were omitted because of their low occurrence or insufficient examples across all 82 DAIC-WOZ and 24 AViD speakers. It should be noted that the histogram of total DAIC-WOZ English phoneme outputs (i.e. the phoneme percentage distribution for all sounds, including consonants) generated by the automatic phoneme recognizer was consistent with prior large-corpus human-transcription phoneme study distributions (French et al., 1930; Hayden, 1950; Voelker, 1937). The AViD German phoneme percentage distributions were similar to those of DAIC-WOZ English and to a comparable German analysis by Delattre (1964). In both languages, /ih/ was the most numerous example, whereas /aw/ was the least.


Table 7.1: Summary of North American English vowels extracted from DAIC-WOZ and German vowels extracted from AViD using an American English phonetic recognizer, along with articulatory parameter descriptions based on Fig. 7.1. Each vowel sound is accompanied by an English word example containing its pronunciation. A total vowel count across all speakers is provided; as expected, some vowels occur more frequently than others. The 12 different articulatory characteristics within this table are high, mid, low, front, central, back, unrounded, rounded, lax, tense, monophthong, and diphthong. In total for DAIC-WOZ, there were approximately 77k English vowels evaluated during experimentation, with approximately 80 examples per vowel per speaker. AViD contained fewer vowel examples (~3k total), with approximately 11 examples per vowel per speaker.

| Phonetic Symbol | English Example | Tongue Height | Tongue Advancement | Lip Position | Contrast (duration) | Transition | DAIC-WOZ # Total | AViD # Total |
|---|---|---|---|---|---|---|---|---|
| ih /ɪ/ | sit | High | Front | Unrounded | Lax (short) | Monophthong | 16,250 | 738 |
| iy /i/ | eat | High | Front | Unrounded | Tense (long) | Monophthong | 14,969 | 554 |
| uw /u/ | boot | High | Back | Rounded | Tense (long) | Monophthong | 6,324 | 252 |
| eh /ɛ/ | bet | Mid | Front | Unrounded | Lax (short) | Monophthong | 4,453 | 103 |
| ah /ʌ/ | cut | Mid | Central | Unrounded | Lax (short) | Monophthong | 7,275 | 633 |
| ow /o/ | over | Mid | Back | Rounded | Tense (long) | Monophthong | 4,035 | 49 |
| ae /æ/ | cat | Low | Front | Unrounded | Lax (short) | Monophthong | 4,935 | 102 |
| aa /ɑ/ | hot | Low | Back | Unrounded | Tense (long) | Monophthong | 3,906 | 49 |
| ey /eɪ/ | bay | Low-Front | High-Front | Unrounded | Tense (long) | Diphthong | 3,426 | 134 |
| ay /ɑɪ/ | hide | Low-Back | High-Front | Unrounded | Tense (long) | Diphthong | 9,346 | 284 |
| aw /ɑʊ/ | brown | Low-Back | High-Central | Unrounded to Rounded | Tense (long) | Diphthong | 2,754 | 33 |

(For the three diphthongs, the Tongue Height and Tongue Advancement columns give the approximate starting and ending positions of the vowel transition, respectively; cf. Fig. 7.1.)
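For illustration, the articulatory descriptions in Table 7.1 can be encoded so that the vowel membership of each parameter set is derived programmatically. The sketch below is not the thesis code, and the way the diphthongs map onto the tongue height and advancement sets is an assumption, since Table 7.1 lists their starting and ending positions rather than a single value:

```python
# Sketch (not the thesis code) encoding the articulatory descriptions of Table 7.1.
# Exact-string matching means the diphthongs fall outside the monophthong height and
# advancement sets, which is an assumption consistent with the set sizes quoted in the text.

VOWELS = {
    # symbol: (tongue height, tongue advancement, lip position, contrast, transition)
    "ih": ("High", "Front",   "Unrounded", "Lax",   "Monophthong"),
    "iy": ("High", "Front",   "Unrounded", "Tense", "Monophthong"),
    "uw": ("High", "Back",    "Rounded",   "Tense", "Monophthong"),
    "eh": ("Mid",  "Front",   "Unrounded", "Lax",   "Monophthong"),
    "ah": ("Mid",  "Central", "Unrounded", "Lax",   "Monophthong"),
    "ow": ("Mid",  "Back",    "Rounded",   "Tense", "Monophthong"),
    "ae": ("Low",  "Front",   "Unrounded", "Lax",   "Monophthong"),
    "aa": ("Low",  "Back",    "Unrounded", "Tense", "Monophthong"),
    "ey": ("Low-Front", "High-Front",   "Unrounded",            "Tense", "Diphthong"),
    "ay": ("Low-Back",  "High-Front",   "Unrounded",            "Tense", "Diphthong"),
    "aw": ("Low-Back",  "High-Central", "Unrounded to Rounded", "Tense", "Diphthong"),
}

def vowel_set(attribute_index, value):
    """Return the vowels whose attribute matches exactly, e.g. vowel_set(1, 'Front')."""
    return [v for v, attrs in VOWELS.items() if attrs[attribute_index] == value]

print(vowel_set(1, "Front"))    # ['ih', 'iy', 'eh', 'ae'] -- the four 'front' vowels
print(vowel_set(2, "Rounded"))  # ['uw', 'ow'] -- the two 'rounded' vowels
```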

For experiments on articulatory characteristics and linguistic stress herein, vowel duration features were computed per speaker based on the mean and standard deviation of various articulatory parameter sets. Additional types of feature calculations were considered (e.g. kurtosis, skewness); however, these did not improve depression classification performance and were therefore omitted from further analysis. For acoustic feature extraction, as discussed previously in Chapter 4.1.1, the open-source openSMILE speech toolkit was used to extract 88 eGeMAPS features (Eyben et al., 2016). The eGeMAPS feature set was chosen because it was used previously for emotion and speech-based depression research (Valstar et al., 2016). The AC features were each obtained by calculating the mean eGeMAPS features within a particular vowel parameter set per speaker. An AC feature vector comprised an 88-dimensional mean eGeMAPS feature vector $\mathbf{V}_n$ along with a duration mean dimension $L_n$ for the $n$th vowel parameter set, which can be represented as:

$$AC = \left[ \mathbf{V}_1^{T} \, L_1 \;\; \mathbf{V}_2^{T} \, L_2 \;\; \dots \;\; \mathbf{V}_n^{T} \, L_n \right]^{T} \qquad (7.1)$$

For the linguistic stress experiments, an LS feature was proposed using only the eGeMAPS loudness mean $L_n$, pitch mean $P_n$, and duration mean $D_n$, together with their corresponding standard deviations $\sigma_{L_n}$, $\sigma_{P_n}$, and $\sigma_{D_n}$, derived from the phonetic recognizer timestamps. The LS feature means and standard deviations were computed per $n$th (of 12 total) vowel parameter set. Thus, the mean-only and standard-deviation-only LS features each comprise a compact 36-dimensional feature vector per speaker (see Table 7.3), and the complete mean-plus-standard-deviation LS feature can be represented as:

$$LS = \left[ L_1 \, P_1 \, D_1 \, \sigma_{L_1} \, \sigma_{P_1} \, \sigma_{D_1} \;\; L_2 \, P_2 \, D_2 \, \sigma_{L_2} \, \sigma_{P_2} \, \sigma_{D_2} \; \dots \; L_{12} \, P_{12} \, D_{12} \, \sigma_{L_{12}} \, \sigma_{P_{12}} \, \sigma_{D_{12}} \right]^{T} \qquad (7.2)$$
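A minimal sketch of assembling the LS features of Eq. (7.2) is given below (again, not the thesis code; the per-vowel token format and the set-membership mapping, which could be built from Table 7.1 as sketched above, are assumptions):

```python
# Minimal sketch (not the thesis code) of assembling the linguistic stress (LS) features of
# Eq. (7.2) for one speaker, under the assumptions stated in the text above.
import numpy as np

VOWEL_SETS = ["High", "Mid", "Low", "Front", "Central", "Back", "Rounded", "Unrounded",
              "Lax", "Tense", "Monophthong", "Diphthong"]        # the 12 vowel parameter sets

def ls_features(vowel_tokens, set_members, use_mean=True, use_std=True):
    """vowel_tokens: list of dicts {'vowel', 'loudness', 'pitch', 'duration'} for one speaker.
    set_members: dict mapping each parameter set name to its member vowels.
    Per vowel set, append the mean and/or standard deviation of loudness, pitch and duration."""
    feats = []
    for name in VOWEL_SETS:
        toks = [t for t in vowel_tokens if t["vowel"] in set_members[name]]
        for component in ("loudness", "pitch", "duration"):
            vals = np.array([t[component] for t in toks], dtype=float)
            if use_mean:
                feats.append(vals.mean())
            if use_std:
                feats.append(vals.std())
    return np.array(feats)   # 36 dimensions for mean-only or std-only; 72 when both are used
```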

Table 7.2: Summary of eGeMAPS acoustic feature articulatory characteristics. Note that only 11 vowels were evaluated in experiments herein; infrequently occurring vowels (indicated by ††) were omitted because they did not provide examples across all speakers. [The original table is a matrix whose columns are the vowels (aa, ae, ah, ay, aw, eh, ey, ih, iy, ow, uw, plus the omitted †† vowels) and whose rows are 'eGeMAPS Vowels Only' and the 12 Articulatory Characteristic (AC) feature sets (High, Mid, Low, Front, Central, Back, Rounded, Unrounded, Lax, Tense, Monophthong, Diphthong); the highlighted cells marking which vowels belong to each AC set follow the articulatory descriptions in Table 7.1 and are not reproduced here.]

7.3 Experimental Settings

DAIC-WOZ speaker files had the interviewer speech removed using the speaker-interviewer speech time annotations provided by the human-annotated transcripts. The AViD data was manually segmented by a human annotator. Both databases had voice activity detection (VAD) (Kinnunen et al., 2013) applied to remove undesirable silence and noise. For the VAD, the optimized parameter settings described in Kinnunen et al. (2013) for clean microphone data were used (e.g. 20-ms frame size with 50% overlap). Afterwards, each speaker file was processed using the phonetic recognizer with the standard parameter settings found in Schwarz et al. (2006). The phonetic recognizer generated a metadata file per speaker with the proposed phoneme output along with the start and end times in milliseconds per phoneme. As a secondary VAD measure, only vowel outputs that contained eGeMAPS formant mean acoustic feature values were analyzed (see Chapter 3.2). This helped further reduce phonetic recognizer labeling errors, because all vowels should have laryngeal voicing, which generates harmonic formant values.
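As a sketch of this secondary screening step (not the thesis code; the segment structure and the formant field names are assumptions rather than the exact eGeMAPS descriptor names), only recognised vowel segments whose formant means are present and non-zero would be kept:

```python
# Sketch of the secondary screening step under the assumptions stated above.
def keep_voiced_vowels(vowel_segments):
    """vowel_segments: list of dicts, each holding an 'egemaps' dict of functionals."""
    kept = []
    for seg in vowel_segments:
        f1 = seg["egemaps"].get("F1_mean", 0.0)
        f2 = seg["egemaps"].get("F2_mean", 0.0)
        if f1 > 0.0 and f2 > 0.0:   # vowels should be voiced, so formant values should be present
            kept.append(seg)
    return kept
```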


As an experimental baseline, the mean eGeMAPS features were extracted from the whole speaker file (i.e. including consonants and vowels) and also from the combined individual vowels (i.e. including only the 11 vowels). In addition, two novel vowel-set-based feature types were proposed based on the eGeMAPS features: Articulatory Characteristic (AC) and Linguistic Stress (LS) features. The AC and LS features only contained features from vowels within a specific articulatory parameter set, as detailed previously in Tables 7.1 and 7.2. Similarly to Mitra et al. (2014), depression classification was conducted using decision trees, which performed consistently well in preliminary experiments involving comparison of different classifier types (e.g. LDA, SVM). A decision tree classifier was also used because of its simple interpretation of low-dimensional feature sets. All experiments used the medium decision tree classifier from the MATLAB toolkit, using a medium leaves setting and a maximum of 20 splits. Each experiment used 10-fold cross validation with a 90/10 training/test split to help maximize the data available for training. Classification performance was determined using overall accuracy and individual class F1 scores (similarly to Cummins et al. (2017) and Valstar et al. (2016)). The system employed in the depression classification experiments is shown in Fig. 7.3. Three different types of feature configurations were evaluated: baseline eGeMAPS features $\mathbf{V}$, vowel articulatory characteristics features (AC), and linguistic stress features (LS). In addition, the effects of data selection were examined specifically for the articulatory vowel parameter and linguistic features. The speech data were preprocessed using VAD and then automatically segmented using a phonetic recognizer (for more details see Chapter 3.1). Data selection was employed for the AC and LS feature experiments, in which specific vowel sets were selected for analysis on a per-set basis, as later shown in Chapter 7.4. A decision tree classifier was then used to produce a decision output, which was compared with the ground-truth test labels.

[Figure 7.3 depicts the processing pipeline: Speech Recording → Voice Activity Detection → Automatic Phoneme Recognizer → Data Selection → Feature Extraction (AC / LS) → Classifier → Depressed / Non-Depressed Decision.]

Figure 7.3: System configuration for the articulatory characteristics and linguistic stress experiments. Dashed lines indicate data selection based on articulatory vowel parameters (e.g. 'front', 'central', 'back').
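For reference, the following sketch illustrates the classification protocol using scikit-learn as a stand-in for the MATLAB decision tree toolkit used in the thesis; the choice of max_leaf_nodes as an approximation of the 20-split 'medium tree' setting is an assumption, and X/y are placeholders for the speaker-level feature matrix and labels:

```python
# Sketch of the 10-fold cross-validated decision-tree protocol, using scikit-learn as a
# stand-in for the MATLAB toolkit referred to in the text. X is a speakers-by-features
# matrix; y holds 1 (depressed) / 0 (non-depressed) labels.
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, f1_score

def evaluate(X, y, n_splits=10, seed=0):
    """10-fold cross-validation returning mean accuracy and per-class F1 scores."""
    accs, f1_dep, f1_non = [], [], []
    cv = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    for train_idx, test_idx in cv.split(X, y):
        clf = DecisionTreeClassifier(max_leaf_nodes=21, random_state=seed)  # ~20 splits (assumed)
        clf.fit(X[train_idx], y[train_idx])
        pred = clf.predict(X[test_idx])
        accs.append(accuracy_score(y[test_idx], pred))
        f1_dep.append(f1_score(y[test_idx], pred, pos_label=1))   # depressed class
        f1_non.append(f1_score(y[test_idx], pred, pos_label=0))   # non-depressed class
    return np.mean(accs), np.mean(f1_dep), np.mean(f1_non)
```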


7.4 Results and Discussion

Based on Fig. 7.1 and Table 7.2, experiments were conducted evaluating how articulatory vowel parameters influence depression classification. Fig. 7.4 shows the articulatory characteristic (AC) feature set depression classification performance results for the DAIC-WOZ English and AViD German experimental subsets. Considering tongue height, the 'mid' vowels yielded the best depression classification and F1 scores for both DAIC-WOZ and AViD. The increased range of tongue activation within this AC feature set improves its depression classification performance over other tongue height sets. Notably, the 'mid' set is the only tongue height parameter AC feature set that contains three different
vowel tongue advancement placements (e.g. /eh/ (front), /ah/ (central), /ow/ (back)).

Figure 7.4: (a) DAIC-WOZ (English) and (b) AViD (German) 'Non-Depressed' and 'Depressed' classification results for vowel articulatory characteristics (AC) features. The colours represent classification accuracy (dark shade), depressed F1 score (medium shade), and non-depressed F1 score (light shade). Apart from the diphthong set, all other sets included only monophthong fixed pure vowels.

For the tongue advancement vowel parameter, the 'front' vowels perform better than the 'central' or 'back' vowels for both DAIC-WOZ and AViD. The 'front' vowels have four vowel sounds (two of which occur most frequently), whereas the 'central' and 'back' sets have fewer. In agreement with the earlier hypothesis that central vowels would perform poorly due to minimized motor demands, the 'central'
AC feature set produced lower depression classification accuracy and poorer F1 scores when compared with the other tongue advancement sets. This weak performance is due to the neutral positioning of 'central' vowel sounds, along with their mild kinematic demand (i.e. similarly to the previous experimental results discussed in Chapter 6.1) and short duration. In Fig. 7.4a, despite its small pool of only two English vowels (/ow/, /uw/), the 'rounded' lip position AC feature set performed slightly better for depression classification than the 'unrounded' AC feature set. Based on the aforementioned literature (Flint et al., 1993; Hockett, 1958; Kreibig, 2010; Roy et al., 2009; Scherer, 1986), it was believed that depressed speakers' tightening of the facial muscles and the auxiliary bilabial muscular demand would increase depression discriminatory characteristics. However, this hypothesis was unsubstantiated by the DAIC-WOZ lip position results. The AViD lip position AC depression classification results showed that 'unrounded' vowels performed better than 'rounded' vowels (Fig. 7.4b). However, the German lip position results differed from the English because there were fewer than 300 total German 'rounded' examples versus over 10k for English. German also has front 'rounded' vowels, which English lacks (Schultheiss, 2008). These additional vowels were not included in the 'rounded' AC feature set because the English phonetic model did not make automatic labeling distinctions between German front 'unrounded' and 'rounded' vowels. Therefore, only back 'rounded' vowels were included in the German AC 'rounded' feature set. As anticipated for English (Fig. 7.4a), the contrast vowel parameter results show that the 'tense' AC feature set surpasses the 'lax' set in performance, especially in terms of F1 depressed classification (0.11 absolute gain). This is due to the increased kinematic effort required when producing tense relative to lax vowels. Based on English phonetic studies (Flemming & Johnson, 2007; Knight, 2012), most lax vowels are produced as a generic schwa vowel sound, whereas tense vowels typically do not follow this pattern and are usually also longer in duration. As shown in Fig. 7.4b, depression classification performed poorly for both the 'lax' and 'tense' AC feature sets in AViD. An explanation for the weak AViD contrast AC feature depression classification performance is that German vowels are generally more stressed than English vowels (Ramers, 1988; Schultheiss, 2008), and therefore provide less distinction between the contrast AC 'tense' and 'lax' vowel feature sets. For DAIC-WOZ, the transitional vowels in the form of the 'diphthong' AC feature set generated better classification performance relative to the fixed pure vowels in the 'monophthong' set.


Diphthongs are inherently more dynamic than monophthongs due to their transition from one vowel sound to another within a single syllable, as well as being longer in duration than monophthongs (Hockett, 1958). This was not the case for German, which performed poorly for both the 'monophthong' and 'diphthong' sets. The small quantity of AViD training data may have influenced the results shown in Fig. 7.4a and 7.4b. As with other speech-depression related investigations, this study was limited by the small number of available databases and the limited numbers of speakers they contain. The LS feature distribution analysis of DAIC-WOZ vowels in Fig. 7.5a shows that the median duration for depressed speakers is shorter for all vowel sets. Fig. 7.5a also shows considerable reductions in depressed speakers' median duration ranges for the 'mid', 'front', 'back', 'round', 'tense', and 'diphthong' vowel sets when compared with non-depressed speakers. The DAIC-WOZ vowel durations also exhibited statistically significant differences in standard deviation between depressed and non-depressed speakers' vowel sets, especially for the 'mid', 'back', 'low', 'unrounded', 'rounded', 'tense', 'monophthong', and 'diphthong' sets. With regard to the DAIC-WOZ vowel median loudness values and ranges shown in Fig. 7.5b, again these were overall lower for depressed speakers than for non-depressed speakers. In Fig. 7.5b, the vowel sets whose mean loudness ranges showed the greatest differences between depressed and non-depressed English speakers were 'high', 'front', 'back', 'rounded', 'lax', and 'diphthong'. With respect to the loudness standard deviations, only minor differences in ranges for the 'high', 'front', and 'diphthong' vowel sets were recorded between non-depressed and depressed English speakers. According to paired t-tests, the loudness standard deviation differences across vowel parameters were generally not statistically significant when compared with their mean loudness or duration counterparts. The LS feature distribution analysis of AViD mean duration in Fig. 7.5c shows that depressed speakers' vowels were shorter in duration only for some vowel sets; these included the 'low', 'back', and 'rounded' sets. As indicated by Fig. 7.5c, the AViD mean vowel durations showed fewer statistically significant differences, possibly due to the small database size (i.e. almost an order of magnitude smaller than DAIC-WOZ). Moreover, the AViD vowel set durations had inconsistent standard deviation differences between depressed and non-depressed speakers' vowel sets. As shown in Fig. 7.5d, the AViD vowel mean loudness ranges show depressed speakers having overall greater median intensity than non-depressed speakers. Additionally, the majority of AViD mean loudness ranges across vowel sets showed statistical significance. Loudness standard deviations for AViD vowel sets indicated similar ranges for both non-depressed and depressed speakers; the
only exception being the 'back' vowels, wherein non-depressed speakers had a much higher value than depressed speakers. It is noted that, due to nuisance attributes discussed previously in Chapter 5.3, loudness can be influenced by recording conditions (e.g. distance from the microphone, type of microphone, room size). As predicted, based on the articulatory characteristics described in Chapter 7.1, for both DAIC-WOZ and AViD the median durations for the 'central' vowel set were the shortest amongst the parameters, whereas the 'back', 'rounded', 'tense', 'low', and 'diphthong' sets contained the longest. Further, the DAIC-WOZ and AViD 'low' and 'diphthong' LS feature sets had the loudest mean values of all vowel articulatory parameters. The pitch mean and standard deviation plots were omitted from Fig. 7.5 because these showed little statistical significance. For DAIC-WOZ, the overall pitch mean showed a trend of depressed speakers having higher values than non-depressed speakers, which was due to there being more females in the 'depressed' group. The AViD pitch mean indicated minimal difference between the 'non-depressed' and 'depressed' speaker groups.

Figure 7.5: DAIC-WOZ English speaker mean distributions for (a) duration and (b) loudness, and AViD German speaker mean distributions for (c) duration and (d) loudness; 'non-depressed' (blue) and 'depressed' (red). The circle and colour bar edges indicate the median and 25th to 75th percentile ranges of each vowel set respectively, whereas the narrower lines indicate the outer ranges and outliers are shown as dots. A starred and double-starred bracket indicates pairs of results that were statistically significantly different using a paired t-test with α = 0.05 and α = 0.01 criteria, respectively.

Depression classification results for the DAIC-WOZ and AViD databases using the Linguistic Stress (LS) features are shown in Table 7.3. For English, the mean duration LS features obtained a high of 76.8% classification accuracy with 0.35 (0.86) F1 scores, while the standard deviation duration features achieved 80.5% classification accuracy with 0.50 (0.88) F1 scores. The latter represents the best accuracy and F1 scores amongst all LS feature combinations for DAIC-WOZ. Based on the DAIC-WOZ depressed and non-depressed duration comparisons shown previously in Fig. 7.5a, it was expected that the duration LS features would perform well. The AViD database also demonstrated improvements using the mean duration LS features, wherein a depression classification accuracy of 83.3% and 0.75 (0.88) F1 scores were achieved. In addition, the mean duration + mean pitch LS features and their combined mean and standard deviation LS features also gave high performance. Loudness LS features alone achieved lower classification performance than the duration LS features. Among the LS features, the variation in depression classification accuracy and F1 scores was wide for both DAIC-WOZ and AViD: across LS features, differences of up to 33% absolute in depression classification accuracy and up to 0.58 absolute in F1 scores were recorded. The best LS feature depression classification and F1 score results across both the DAIC-WOZ and AViD databases were obtained with the combined mean and standard deviation duration LS features, wherein ~78% depression classification accuracy was achieved with high F1 scores for both languages. Results in Table 7.3 broadly concur with previous English studies (Darby & Hollien, 1977; Mundt et al., 2007, 2012; Ostwald, 1965) that indicate reductions in overall duration dynamics in clinically depressed speakers. However, the results shown herein provide further detail with regard to hypoarticulation in connection with kinematic expectations based on vowel articulatory characteristics.


Table 7.3: Summary of 'Depressed' and 'Non-Depressed' classification accuracy results and F1 scores for mean/standard deviation linguistic stress (LS) feature combinations. These results are from LS features based on using all 12 possible vowel parameter sets. The total number of LS features (feature dimension) is shown in parentheses. Statistical significance was calculated using a two-sided Wilcoxon signed-rank test. A single asterisk and a double asterisk indicate depression classification accuracy results that were statistically significantly different from the baseline with the α = 0.05 and α = 0.01 criteria, respectively.

| Statistic | LS Features | DAIC-WOZ (English) % | F1 Depressed | F1 Non-Depressed | AViD (German) % | F1 Depressed | F1 Non-Depressed |
|---|---|---|---|---|---|---|---|
| Mean | Duration (12) | 76.8 | 0.35 | 0.86 | 83.3** | 0.75 | 0.88 |
| Mean | Loudness (12) | 64.6 | 0.17 | 0.78 | 58.3 | 0.17 | 0.72 |
| Mean | Pitch (12) | 61.0 | 0.20 | 0.74 | 67.1** | 0.50 | 0.75 |
| Mean | Duration + Loudness (24) | 69.5 | 0.32 | 0.80 | 79.2** | 0.71 | 0.84 |
| Mean | Duration + Pitch (24) | 75.6 | 0.38 | 0.85 | 83.3** | 0.75 | 0.88 |
| Mean | Duration + Loudness + Pitch (36) | 68.3 | 0.28 | 0.80 | 79.1** | 0.71 | 0.84 |
| Standard Deviation | Duration (12) | 80.5* | 0.50 | 0.88 | 63.3** | 0.47 | 0.71 |
| Standard Deviation | Loudness (12) | 69.5 | 0.26 | 0.78 | 50.0 | 0.25 | 0.63 |
| Standard Deviation | Pitch (12) | 69.5 | 0.32 | 0.80 | 58.2 | 0.29 | 0.71 |
| Standard Deviation | Duration + Loudness (24) | 76.8 | 0.42 | 0.86 | 63.0 | 0.40 | 0.73 |
| Standard Deviation | Duration + Pitch (24) | 73.2 | 0.35 | 0.85 | 79.2** | 0.71 | 0.84 |
| Standard Deviation | Duration + Loudness + Pitch (36) | 76.8 | 0.42 | 0.86 | 79.2** | 0.71 | 0.84 |
| Mean & Standard Deviation | Duration (24) | 78.0* | 0.44 | 0.86 | 78.9** | 0.71 | 0.84 |
| Mean & Standard Deviation | Loudness (24) | 69.5 | 0.24 | 0.81 | 62.5 | 0.31 | 0.74 |
| Mean & Standard Deviation | Pitch (24) | 74.4 | 0.32 | 0.84 | 63.0 | 0.40 | 0.73 |
| Mean & Standard Deviation | Duration + Loudness (48) | 70.7 | 0.40 | 0.81 | 74.7** | 0.67 | 0.80 |
| Mean & Standard Deviation | Duration + Pitch (48) | 72.0 | 0.43 | 0.82 | 83.3** | 0.75 | 0.88 |
| Mean & Standard Deviation | Duration + Loudness + Pitch (72) | 72.0 | 0.30 | 0.82 | 79.1** | 0.71 | 0.84 |

Trevino et al. (2011) found correlations between individual phoneme-specific length/intensity and the major depression psychomotor retardation subsymptom. However, although suggested, they did not further explore feature development based on this insight or extend it to evaluating articulatory vowel parameters as features for depression classification. The research presented herein is the first to examine more than one language and to utilise constrained articulatory vowel groupings with linguistic stress components for speech-based depression classification; Trevino et al. (2011) used a much smaller number of speakers from a single-language database. The depressed and non-depressed speaker comparisons based on the newly proposed AC vowel sets in Fig. 7.4 and the LS feature results in Table 7.3 show performance advantages to utilising articulatory acoustic phoneme-specific parameters, the need for which was proposed by Gábor & Klára (2014). DAIC-WOZ and AViD experiments using the baseline eGeMAPS features combined with the various LS features are shown in Table 7.4. For DAIC-WOZ, when compared to the eGeMAPS baseline, the eGeMAPS + duration standard deviation feature combination achieved a depression classification accuracy improvement of ~4% in absolute terms with similar F1 scores. This combined approach did not perform better than the lower-dimensional stand-alone duration LS
standard deviation feature result (80.5%) in Table 7.3. However, in comparing the DAIC-WOZ results in Table 7.3 and Table 7.4, it is observed that the combined eGeMAPS and LS feature sets are more stable overall in terms of F1 score performance than the LS features alone. For AViD, the depression classification results were inconsistent across the different LS features when combined with the eGeMAPS features. However, as a general rule for both databases, the durational mean LS features contributed to improved depression classification performance. As shown in Table 7.4, for AViD, the combined eGeMAPS + duration mean + loudness mean + pitch mean features yielded a depression classification accuracy of 79.2% and 0.71 (0.84) F1 scores. The latter was considerably higher than the eGeMAPS AViD baseline. Furthermore, in Table 7.4, several other LS feature combinations also performed better in terms of depression classification accuracy, indicating that the LS features provide complementary information to the eGeMAPS features.

Table 7.4: Summary of 'Depressed' and 'Non-Depressed' classification accuracy results and F1 scores for mean/standard deviation feature combinations per LS feature type combined with eGeMAPS. The combined results are based on using eGeMAPS All Sounds (i.e. vowels and consonants) and the 12 vowel parameters listed previously in Fig. 7.4 and 7.6. The total number of eGeMAPS + LS features (feature dimension) is shown in parentheses. Statistical significance was calculated using a two-sided Wilcoxon signed-rank test. A single asterisk and a double asterisk indicate depression classification accuracy results that were statistically significantly different from the baseline with the α = 0.05 and α = 0.01 criteria, respectively.

| Statistic | Feature Combination | DAIC-WOZ (English) % | F1 Depressed | F1 Non-Depressed | AViD (German) % | F1 Depressed | F1 Non-Depressed |
|---|---|---|---|---|---|---|---|
| Baseline | eGeMAPS All Sounds (88) | 73.2 | 0.48 | 0.82 | 63.1 | 0.47 | 0.71 |
| Baseline | eGeMAPS Vowels Only (88) | 72.5 | 0.42 | 0.82 | 66.7 | 0.50 | 0.75 |
| Mean | eGeMAPS + Duration (100) | 74.4* | 0.43 | 0.84 | 79.2** | 0.71 | 0.84 |
| Mean | eGeMAPS + Loudness (100) | 70.7 | 0.40 | 0.81 | 50.0 | 0.25 | 0.63 |
| Mean | eGeMAPS + Pitch (100) | 73.2 | 0.45 | 0.82 | 58.3 | 0.29 | 0.71 |
| Mean | eGeMAPS + Duration + Loudness + Pitch (124) | 74.4** | 0.46 | 0.83 | 79.2** | 0.71 | 0.84 |
| Standard Deviation | eGeMAPS + Duration (100) | 76.8* | 0.46 | 0.85 | 58.3 | 0.38 | 0.69 |
| Standard Deviation | eGeMAPS + Loudness (100) | 72.0 | 0.40 | 0.81 | 58.3 | 0.38 | 0.69 |
| Standard Deviation | eGeMAPS + Pitch (100) | 74.4** | 0.46 | 0.83 | 58.3 | 0.38 | 0.69 |
| Standard Deviation | eGeMAPS + Duration + Loudness + Pitch (124) | 72.0 | 0.44 | 0.81 | 54.2 | 0.15 | 0.69 |
| Mean & Standard Deviation | eGeMAPS + Duration (112) | 74.4 | 0.43 | 0.84 | 75.0** | 0.67 | 0.80 |
| Mean & Standard Deviation | eGeMAPS + Loudness (112) | 76.8** | 0.49 | 0.85 | 54.2 | 0.27 | 0.67 |
| Mean & Standard Deviation | eGeMAPS + Pitch (112) | 68.3 | 0.38 | 0.79 | 62.5 | 0.40 | 0.73 |
| Mean & Standard Deviation | eGeMAPS + Duration + Loudness + Pitch (160) | 73.2 | 0.39 | 0.83 | 75.0** | 0.63 | 0.81 |

7.5 Summary

In this research, articulatory characteristics and features associated with linguistic stress were evaluated for the automatic analysis of clinically depressed and non-depressed speakers. The

investigation of vowel articulatory characteristics using vowel set parameters indicated considerable differences between depressed and non-depressed speakers. In particular, clinically depressed speakers demonstrated statistically significant reductions in duration for the 'low', 'back', and 'rounded' vowel sets, when compared with other sets. Furthermore, based on the linguistic stress feature depression classification results presented herein, it can be suggested that psychomotor retardation and/or cognitive deficit affects a depressed speaker's articulatory ability, causing hypoarticulation, which in turn influences the degree of his/her linguistic stress. Experimental results using various English articulatory characteristic vowel sets indicate that depressed speakers have a reduction in the duration and loudness components of linguistic stress. It is noted that very few papers to date have shown statistically significant differences of any kind between non-depressed and depressed speech for two databases (Cummins et al., 2015b). Moreover, this research has provided evidence that, by utilising vowel set linguistic stress component information as a compact and interpretable feature set, increases in depression classification accuracy can be achieved over baseline approaches.


Chapter 8 LINGUISTIC TOKEN WORDS

8.1 Overview

As stated earlier in Chapter 1.2, there is a need to investigate linguistic measures to help determine what type of speech data is most useful for automatic speech-based depression analysis. Understanding which kinds of word-level speech segments are most effective for predicting depression deserves more attention, since diagnosis and monitoring are often based on limited interview or questionnaire responses. In terms of linguistic content, an examination of commonly uttered language expressions should be made, as they could yield a higher degree of severity prediction reliability due to their habitually high frequency of occurrence and their reduced phonetic variability when compared to more generalised speech samples. Notwithstanding the ease of collecting speech recordings, the dilemma of knowing which vocal segments carry compact informational value remains an open question for researchers. The concept of 'thin slice' data selection was studied by Ambady & Rosenthal (1992), wherein brief social or clinical observations were found to yield more useful compact information than longer observations. Ambady & Rosenthal (1992) surmise several considerations regarding thin slice data selection theory:
(1) The channel of observations, whether verbal or non-verbal, has little effect on the predictive observational results.
(2) A great deal of behavioural affect is generated unintentionally or unconsciously, yet it still contributes to other people's observed predictions or interpretations.
(3) When affective thin slice data selection is proven effective, it can significantly reduce resources without sacrificing performance.
(4) When examining human behaviour, the thin slice approach works particularly well when predicting vital interpersonally oriented criterion variables.


The concept of data selection has proven applicable in many areas of speech processing. Prior research (Boakye & Peskin, 2004; Reynolds et al., 1995) indicates that in speech processing applications, such as speech recognition, speaker identification, and speech emotion classification, data selection that reduces phonetic variability and compares similar phonetic structures can improve performance. For instance, Reynolds et al. (1995) demonstrated that by focusing on specific acoustic classes (i.e. text-dependent words), individual speaker models develop better modeling of short-term variations thereby resulting in higher overall speaker identification performance for short utterances. In Boakye and Peskin (2004), researchers analyzed thirteen token words consisting of less than 10% of all utterances across speakers. Their results showed that speaker identification performance was superior for short fixed token words than for an entire set of words. Researchers have also advocated for the inclusion of linguistic features with acoustic features in paralinguistic applications. For instance, Shriberg and Stolcke (2008) and Ishihara et al. (2010) analyzed text transcripts to generate token word linguistic features that contributed to higher speaker recognition performance. Research has also shown that habitual speaker idiolects and formulaic language use can provide unique speaker information (Boakye and Peskin, 2004). Formulaic language is common in everyday discourse and even monolog narratives, where it contributes to nearly 25% of all conversational speech (Bridges, 2014). Therefore, for any spontaneously spoken language, formulaic language constitutes a significant amount of what is spoken. This sizeable data pool of formulaic language is important for collection and analysis purposes because it means it is easier to naturally obtain than other types of speech content. It includes conventional word expressions, proverbs, idioms, expletives, hedges, bundles, and fillers. In the literature, only a limited number of non-acoustic linguistic text-based studies have specifically examined formulaic word fillers in depressed populations (Bridges, 2014; Pope et al., 1970). The acoustic evaluation of formulaic language for predicting levels of depression has practical advantages. Filler words are commonly found in spontaneous speech, occurring in large numbers across a wide range of speakers irrespective of gender, age, language, and education. Furthermore, the naturally constrained acoustic phonetic variability found in token filler words helps to facilitate intra- and inter-speaker comparisons. Intra-speaker, there are typically multiple examples of each token word per utterance, helping to construct a more focused analysis of phoneme characteristics. Inter-speaker, token word characteristics can also be more readily compared between speakers than entire utterances. Studies have demonstrated that even a single


phoneme type occurring in large quantities can help to reveal discriminative information regarding a speaker’s emotional state (Scherer et al., 2016; Sethu et al., 2008). While formulaic language arguably makes only a minor contribution to semantic/pragmatic content, it contributes importantly to cognitive-emotive speaker internalization (Bridges, 2014). Consequently, a linguistic examination of formulaic token words may provide a new word-level feature for speech-based depression prediction. It is hypothesized that since formulaic fillers are common in number and represent speakers’ mental internalization, they will reveal more about the acoustic effects of depression on speech than utterances in general. Further, an acoustic analysis restricted to a particular set of token words selected on linguistic principles, and/or a linguistic analysis of spoken transcripts, could offer advantages over a generalized acoustic-only approach.

8.2 Methods

Using the DAIC-WOZ transcripts, a selection of ten token words based on formulaic language was evaluated, as shown below in Tables 8.1 and 8.2. Only individual token word entries with a start and end transcript time were evaluated for acoustic-based analysis, as these time-marked segments were identified by the transcriber as containing the single token word alone. Thus, if any of these token words were spoken along with other words in a sentence or phrase, they were not included in the acoustic-based analysis. The token words are listed below in Table 8.2. When combined, these token words were spoken by approximately 95% of the DAIC-WOZ speakers (the other 5% of the training set were omitted due to transcript time marker errors).

Table 8.1. DAIC-WOZ transcript excerpt showing the continuous transcripts for the virtual-human interviewer (white) and participant (green). The linguistic text-based feature extraction included all of the participant’s transcript entries, whereas the acoustic-based feature extraction only included specific token word entries as indicated by the red circles.


Table 8.2. Description of token words evaluated, percentage of training/test speaker coverage, total number of utterances, and number of unique speakers from the DAIC-WOZ database. The token word “umm” was the most commonly spoken filler in the DAIC-WOZ train/test data.

Token Word    Word Type   % Train   % Test   # Total   # of Unique Speakers
"Hmm"         Filler      33%       49%      139       52
"Mhm"         Filler      35%       43%      123       53
"Mm"          Filler      48%       60%      168       72
"Uh"          Filler      52%       60%      298       77
"Umm"         Filler      85%       94%      1276      124
"So"          Filler      27%       54%      119       48
"No"          Polar       77%       74%      230       109
"Yeah"        Polar       60%       66%      305       88
"Okay"        Polar       27%       43%      56        44
"You Know"    Bundle      21%       23%      57        31
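To make the selection procedure concrete, the following is an illustrative Python sketch (not the code used in this thesis) of how single-token-word entries could be pulled from a DAIC-WOZ-style transcript; the tab-separated layout and the column names ("start_time", "stop_time", "speaker", "value") are assumptions about the transcript format.

import pandas as pd

TOKEN_WORDS = {"hmm", "mhm", "mm", "uh", "umm", "so", "no", "yeah", "okay", "you know"}

def select_token_segments(transcript_path):
    """Return (start, stop, word) tuples for participant turns consisting of
    exactly one target token word with usable start/stop times."""
    df = pd.read_csv(transcript_path, sep="\t")
    participant = df[df["speaker"].str.lower() == "participant"]
    segments = []
    for _, row in participant.iterrows():
        text = str(row["value"]).strip().lower()
        # Keep an entry only if the transcriber marked the token word on its own,
        # with start/stop times, as required for the acoustic-based analysis.
        if text in TOKEN_WORDS and pd.notna(row["start_time"]) and pd.notna(row["stop_time"]):
            segments.append((float(row["start_time"]), float(row["stop_time"]), text))
    return segments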

Each token word example had acoustic-based features extracted, as discussed later in Chapter 8.3. In addition, since the transcripts were available, linguistic-based features were also extracted using each participant’s entire transcript, excluding the virtual-human interviewer portion. Text-based linguistic features, such as average utterance length, average syllables per second, percentage of unique words, percentage of articles/pronouns/prepositions, and readability scores, were extracted on a per-participant basis. The readability scores were based on the Flesch-Kincaid Grade Level and Gunning Fog Index (Wu et al., 2013). While these readability scores are typically derived from written passages, they can also be useful when applied to verbal transcripts.
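As a minimal illustration of the two readability scores named above, the following Python sketch computes the Flesch-Kincaid Grade Level and Gunning Fog Index from a transcript string using their standard formulas; the vowel-group syllable counter is a rough heuristic rather than the exact procedure used in the thesis.

import re

def count_syllables(word):
    # Approximate syllable count as the number of consecutive-vowel groups.
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def readability_scores(text):
    """Return (Flesch-Kincaid Grade Level, Gunning Fog Index) for a transcript."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()] or [text]
    words = re.findall(r"[A-Za-z']+", text)
    n_sent, n_words = len(sentences), max(1, len(words))
    n_syll = sum(count_syllables(w) for w in words)
    n_complex = sum(1 for w in words if count_syllables(w) >= 3)
    flesch_kincaid = 0.39 * (n_words / n_sent) + 11.8 * (n_syll / n_words) - 15.59
    gunning_fog = 0.4 * ((n_words / n_sent) + 100.0 * (n_complex / n_words))
    return flesch_kincaid, gunning_fog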

[Figure 8.1 block diagram: Speech and Transcript inputs → Acoustic Data Selection and Linguistic Data Selection → Acoustic Feature Extraction and Linguistic Feature Extraction → Acoustic & Linguistic Feature Fusion → Prediction → Result (output).]

Figure 8.1. Acoustic data selection involves chosen token words, whereas linguistic data selection uses all the words. Feature selection was applied during the feature extraction stages.

The proposed system design for these experiments, shown in Fig. 8.1, involves two main inputs: speech and spoken transcript data. During the acoustic and linguistic feature extraction stages, feature selection was performed. For the depression severity prediction experiments shown herein, token words (see the previous Table 8.2) were individually and collectively evaluated based on their acoustic features. Additionally, text-based linguistic features based on each speaker’s entire transcript were also evaluated. The filler token words and linguistic features were later concatenated together as an acoustic-linguistic fused feature set. Afterwards, a statistical regression


method (i.e. SVR) using acoustic/linguistic features was applied to generate a depression score prediction output and compared with the ground-truth depression PHQ-8 score.

8.3 Experimental Settings

The baseline experiments used eGeMAPS features (Eyben et al., 2016) and VoiceSauce features extracted using the VoiceSauce toolkit (Shue et al., 2011). For more details on these feature sets, refer to the previous Chapter 4.1. For all acoustic features, windows of 20 ms (with 10 ms overlap) were applied, and mean functionals were computed across the entire file length for both the full-length files and the segmented token word files. Again, in addition to the acoustic-based features, text-based linguistic features were also extracted from the DAIC-WOZ database transcripts. For comparison with this baseline, and as previously discussed in Chapter 4.2.4, Support Vector Regression (SVR) was used to predict the depression scores for the experiments herein. SVR is known for effective statistical generalization and has previously been applied to speech depression/emotion prediction tasks (Cummins et al., 2015b; Smola et al., 2004). Based on the SVR output, two standard performance metrics were used to evaluate overall predictive accuracy due to their application in recent speech depression prediction challenges: mean absolute error (MAE) and root-mean-squared error (RMSE). The audio portion of the training and development sets from the Distress Analysis Interview Corpus – Wizard of Oz (DAIC-WOZ) (Gratch et al., 2014) was used for all experiments herein. The DAIC-WOZ database was chosen for these experiments because it provides a fixed set of utterances spoken by a computer-generated interviewer, a large group of speakers, PHQ-8 scores per participant, and phrase-level transcripts with beginning/ending time markers, making the extraction of single token words possible with minimal error. For more information regarding the DAIC-WOZ transcription conventions, refer to Chapter 5.4.4 and Gratch et al. (2014). As shown in Fig. 8.2, the token words made up less than 1% of the entire data in the DAIC-WOZ database.
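The following Python sketch illustrates the prediction pipeline described above under stated assumptions: mean functionals of frame-level acoustic features are computed per file, an SVR model is trained against PHQ-8 scores, and MAE/RMSE are reported. The eGeMAPS/VoiceSauce frame-level extraction is assumed to have been done elsewhere, and scikit-learn stands in for the actual toolchain used in the thesis.

import numpy as np
from sklearn.svm import SVR
from sklearn.metrics import mean_absolute_error, mean_squared_error

def mean_functionals(frame_features):
    """frame_features: (n_frames, n_lld) array of 20 ms / 10 ms-overlap descriptors."""
    return np.asarray(frame_features).mean(axis=0)

def predict_depression(train_frames, train_phq8, test_frames, test_phq8):
    """Train SVR on per-file mean functionals and report MAE/RMSE on the test files."""
    X_train = np.vstack([mean_functionals(f) for f in train_frames])
    X_test = np.vstack([mean_functionals(f) for f in test_frames])
    model = SVR(kernel="linear").fit(X_train, train_phq8)
    pred = model.predict(X_test)
    mae = mean_absolute_error(test_phq8, pred)
    rmse = np.sqrt(mean_squared_error(test_phq8, pred))
    return mae, rmse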


[Figure 8.2 pie chart: interviewer/noise/silence 67%, participant speech 32%, participant token words 1%.]

Figure 8.2. DAIC-WOZ database audio breakdown - approximately 37 total hours of speech-audio. The total token word portion consisted of less than 30 minutes of participants’ speech data.

8.4 Results and Discussion

8.4.1 Token Words Versus Entire Utterances

It should be noted that, for comparison herein, the AVEC 2016 depression prediction sub-challenge baseline (Valstar et al., 2016) utilised all training and development data in the DAIC-WOZ, applied similar acoustic features, and employed a support vector machine for regression analysis. The baseline acoustic features were created using entire utterances, and a depression prediction baseline of 5.35 Mean Absolute Error (MAE) and 6.74 Root-Mean-Squared Error (RMSE) was achieved. For comparison with the AVEC 2016 baseline, Support Vector Regression (SVR) was also used to predict the depression scores for the experiments herein. The purpose of these token-word experiments was to examine how smaller, linguistically specific segments perform when compared with entire utterances, and to determine which features or feature combinations contribute to better token word depression prediction. Formulaic filler word analysis nearly matched the baseline result for the complete database (‘entire utterances’) when identically partitioned comparisons were made based on speaker groupings, as shown in Table 8.3. When evaluated against the entire utterances baseline, fillers gave lower overall MAE and RMSE. Depression severity prediction results similar to those produced by the eGeMAPS acoustic features were found for the VoiceSauce features, but with lower error in general. The VoiceSauce features may have performed better due to containing a greater number of total features than eGeMAPS. In addition,


VoiceSauce applies more than one acoustic analysis method (i.e. Praat, Straight, Snack Sound Toolkit) for estimating frequency and energy formant-related features. VoiceSauce features, shown in Table 8.4, produced competitive results when compared with the entire utterances baseline. Using these features, only one of the token word fillers "umm" achieved a higher MAE than that of the baseline. Note that "hmm" performed particularly well for depression prediction, generating a 2.91 MAE versus the 5.22 MAE in the baseline.

Table 8.3. Token word and entire utterances (baseline) depression prediction using eGeMAPS acoustic features and SVR prediction. Each token word and baseline experimental set contains the same speakers for equivalent speaker comparison. The all average includes all 10 token words, whereas the filler average includes only filler token words (hmm, mhm, mm, uh, umm, so).

                     Token Words         Baseline (all utt.)
Word/Phrase          MAE      RMSE       MAE      RMSE
"Hmm"                3.85     5.07       5.22     6.90
"Mhm"                4.08     4.18       4.10     4.80
"Mm"                 5.58     6.95       5.70     6.77
"Uh"                 5.37     6.50       5.96     6.90
"Umm"                6.56     8.15       5.35     6.67
"So"                 6.31     8.07       6.31     8.17
"No"                 5.00     6.08       4.71     5.79
"Yeah"               7.17     8.87       6.10     7.10
"Okay"               5.73     6.74       5.55     6.54
"You Know"           8.67     9.92       3.07     4.31
All Average          5.83     7.05       5.21     6.40
Filler Average       5.29     6.49       5.44     6.70

Table 8.4. Token word and entire utterances (baseline) depression prediction using VoiceSauce acoustic features and SVR prediction. Each token word and baseline experimental set contains the same speakers for equivalent speaker comparison. The all average includes all 10 token words, whereas the filler average includes only filler token words (hmm, mhm, mm, uh, umm, so).

                     Token Words         Baseline (all utts.)
Word/Phrase          MAE      RMSE       MAE      RMSE
"Hmm"                2.91     4.13       5.22     6.90
"Mhm"                3.76     4.62       4.10     4.80
"Mm"                 5.56     7.09       5.70     6.77
"Uh"                 4.82     6.03       5.96     6.90
"Umm"                6.27     7.96       5.35     6.67
"So"                 6.06     8.57       6.31     8.17
"No"                 4.87     6.24       4.71     5.79
"Yeah"               7.38     10.91      6.10     7.10
"Okay"               5.08     6.68       5.55     6.54
"You Know"           7.52     8.96       3.07     4.31
All Average          5.42     7.11       5.21     6.40
Filler Average       4.90     6.40       5.44     6.70

8.4.2 Linguistic Baseline System

Trends in the linguistic features were discovered for depressed speakers with higher ‘moderately depressed’ to ‘severely depressed’ PHQ-8 scores (e.g. 15-23), whereas for scores below this range the linguistic features demonstrated relatively minimal differences. For instance, ‘moderately’ to ‘severely’ depressed speakers tended to exhibit a reduction in overall average syllables per word, reduced preposition usage, increased usage of pronouns, and overall simpler sentence structure based on average readability scores. While ‘depressed’ versus ‘non-depressed’ female speakers did not indicate a difference in average words per sentence, depressed males showed an overall reduction, especially at higher PHQ-8 scores. In experiments with linguistic features, the average MAE and RMSE using linguistic features were slightly lower than those of the entire utterances acoustic-based baseline features, at 5.17 and 6.30, respectively. These linguistic feature results indicate that text-based derived features are competitive with acoustic-based features for depression prediction.

8.4.3 Acoustic and Linguistic Feature Fusion

Experiments utilising all acoustic features from token words along with linguistic features were completed in an attempt to attain the lowest possible MAE and RMSE. The eGeMAPS, VoiceSauce, and linguistic features were concatenated as a single vector per utterance before prediction using SVR. In Table 8.5, fusion experiments using a combination of acoustic token word and linguistic features produced the overall lowest average MAE and RMSE for filler token words. In the results presented to this point, only subsets of the training/test data could be used, i.e. subsets that contained the relevant individual token words. To understand depression prediction performance across the entire data, all 10 token words were collectively merged into training and test sets, which allowed every speaker to be represented, much like the baseline results found in Valstar et al. (2016). Using the collective all-utterances baseline eGeMAPS features, prediction errors of 5.51 MAE and 6.83 RMSE were attained. When collective token word sets were examined using all filler token words (e.g. hmm, mhm, mm, uh, umm, so) with eGeMAPS features, prediction errors of 6.07 MAE and 7.52 RMSE were achieved. While the collective filler token word results did not produce errors as low as the entire utterances baseline, these filler words were surprisingly accurate considering that most comprise less than a second of speech. The collective token filler word results may be an indication that some particular fillers and their acoustic-phonetic attributes are better for depression prediction than others.
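The feature-level fusion step described above amounts to concatenating the per-utterance feature vectors before regression; a minimal sketch (with illustrative array names) is shown below.

import numpy as np

def fuse_features(egemaps_vec, voicesauce_vec, linguistic_vec):
    """Concatenate per-utterance eGeMAPS, VoiceSauce, and linguistic vectors into
    one acoustic-linguistic fused feature vector (early, feature-level fusion)."""
    return np.concatenate([np.asarray(egemaps_vec),
                           np.asarray(voicesauce_vec),
                           np.asarray(linguistic_vec)])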


Table 8.5. Token word SVR depression prediction using fused eGeMAPS + VoiceSauce + linguistic features. Each token word and baseline experimental set contains the same speakers for equivalent speaker comparison. The all average includes all 10 token words, whereas the filler average includes only filler token words (hmm, mhm, mm, uh, umm, so).

                     Fusion              Baseline (all utts.)
Token Words          MAE      RMSE       MAE      RMSE
"Hmm"                2.89     4.35       5.22     6.90
"Mhm"                3.45     4.63       4.10     4.80
"Mm"                 5.90     7.13       5.70     6.77
"Uh"                 5.00     6.32       5.96     6.90
"Umm"                6.18     7.82       5.35     6.67
"So"                 5.05     7.09       6.31     8.17
"No"                 5.06     6.31       4.71     5.79
"Yeah"               6.42     8.50       6.10     7.10
"Okay"               4.70     6.24       5.55     6.54
"You Know"           8.08     9.42       3.07     4.31
All Average          5.27     6.78       5.21     6.40
Filler Average       4.75     6.22       5.44     6.70

8.4.4 n-Best Approach

An n-best approach was experimented with, using the four lowest MAE/RMSE token words that, when combined, allowed score prediction for every test utterance, thus creating a fair comparison with the entire utterances baseline. In Table 8.6, the best test MAE/RMSE performance was achieved using n-best eGeMAPS and linguistic features with feature reduction. The absolute improvements in MAE/RMSE over the entire utterances baseline were 0.95 and 1.24, respectively. The 4-best token words ("hmm", "mhm", "no", "uh") were short words containing nasal phonetic elements. For the 4-best token words, the VoiceSauce features attained results relatively similar to eGeMAPS.

Table 8.6. Comparison of entire utterances acoustic eGeMAPS and VoiceSauce feature baseline versus 4-best token words ("hmm", "mhm", "no", "uh") on all test speakers. Note * indicates feature selection applied. Each token word and baseline experimental set contains the same speakers for equivalent speaker comparison.

                                                    eGeMAPS           VoiceSauce
                                                    MAE     RMSE      MAE     RMSE
All Utterances (similar to Valstar et al., 2016)    5.51    6.83      5.48    6.72
All Fillers                                         6.07    7.52      6.08    7.59
4-Best                                              4.72    5.76      4.71    5.71
4-Best + Linguistic                                 4.74    5.70      4.71    5.71
4-Best* + Linguistic*                               4.56    5.59      4.71    5.71

8.5 Summary

This linguistic-driven data selection research on formulaic language demonstrates that thin slice speech data selection can be competitive for depression prediction when compared with using whole utterances. Spoken formulaic language, and in particular filler words, holds acoustically discriminative properties that are useful when predicting different ranges of depression. Results from the analysis of the DAIC-WOZ database indicate that filler words are as effective as, or more effective than, entire utterances for depression prediction. Furthermore, experimental results herein show that among the token words selected for this evaluation, fillers consistently provided the lowest depression score prediction error. Filler words are advantageous for speech-based depression analysis because they are naturally repeated in abundance. In addition, filler words are generally located between phrase clauses, meaning they typically begin or end a phrase, which makes them potentially easier to identify using automatic speech recognition. It was also shown that linguistic information can help to isolate specific token words, or token word groupings, resulting in improved prediction performance over a more generalized data selection approach. The linguistic evaluation of speaker transcripts presented herein demonstrated that text-based linguistic features are competitive with acoustic features for speech-based depression severity prediction tasks. Moreover, the fusion of acoustic and linguistic feature components generated further prediction improvements, indicating that the acoustic and linguistic speech components contain complementary depression-severity-discriminative information, leading to more competitive depression prediction results.


Chapter 9 ARTICULATION EFFORT, LINGUISTIC COMPLEXITY, AND AFFECTIVE INTENSITY

9.1 Overview

Despite prior speech-based depression studies having used a variety of speech elicitation design methods, there has been little evaluation of which elicitation methods are best. Equally vital for automatic speech-based depression diagnosis, many questions remain regarding how depressed patients’ speech can be more discriminatively identified from that of healthy speakers using articulatory, linguistic, and affective measures. Thus, these different speech-related measures warrant further examination. One approach to understanding this better is to analyze an existing database from the perspective of articulation effort and linguistic complexity measures as proxies for the depression sub-symptoms of psychomotor retardation, negative stimulus suppression, and cognitive impairment. While prior studies have shown that acoustic features can be used as indicators of depression (Cummins et al., 2011, 2015a), there has been little investigation into articulation effort as a depression indicator, despite its likely link with psychomotor retardation. One impediment could be the lack of quantified measures of articulation effort, as briefly discussed previously in Chapter 6.1.1. In addition, for depression classification, the application of linguistic complexity measures based on transcripts, and their relation to the acoustic speech biosignal, has received minimal attention, despite a probable link with cognitive impairment. More recently, however, Williamson et al. (2016) strongly advocated that all speech-based depression analysis should integrate speech-to-text linguistic analysis due to its considerable depression classification performance. In this chapter, the use of articulation effort and linguistic complexity measures is studied to help select acoustic-based speech segments for depression classification. It is believed that speech representing higher overall articulatory effort and/or linguistic complexity will provide increased


discrimination of depressed speakers. This hypothesis is driven by abnormal acoustic/linguistic speech characteristics and spoken language deficits exhibited in depressed populations, as discussed previously in Chapters 3.4.1 and 3.4.2. Thus far, no speech-based depression classification study has used phoneme acquisition mastery information based on clinical speech language pathology norms, and/or text-based linguistic complexity information derived from speaker transcripts, as a possible tool for automatic acoustic speech feature data selection.

9.2 Methods

It is generally agreed that American English has between 38 and 46 phonemes, depending on the regional accent (Ladefoged & Disner, 2012). During conversational speech, the respiratory system, larynx, velum, jaw, tongue, and lips are coordinated in an essentially unconscious manner just to generate a single phrase. For more than a century, there have been studies describing which phonemes and/or phonemic syllables are more complex than others (Chomsky & Halle, 1968; Wellman et al., 1931). Without delving into too much articulatory detail (cf. Hardcastle, 1976; Ladefoged & Maddieson, 1996), the effort complexity of consonants depends on several factors, such as the formant transition from the preceding vowel, formant location, and voicing type. While Chapter 6 investigated phoneme-to-phoneme transitions and groupings based on articulatory manner parameters, the production effort of specific individual phonemes was not evaluated. The production of voiced consonants is generally more complex than that of voiceless consonants. Thus, due to their reduced physical motor planning effort, voiceless consonants tend to be acquired earlier in age and occur in greater numbers when compared with voiced consonants. Indeed, studies across many different languages have indicated that voiceless plosive consonants occur more frequently than voiced ones (Zipf, 1935). Furthermore, there is a highly significant positive correlation between a larger consonant phoneme inventory size and greater syllable complexity across languages throughout the world (Maddieson, 2006). For decades, speech pathologists have recognized that several phonemes in the English language require more time to master than others (Wellman et al., 1931; Sander, 1972; Smit et al., 1990) (see Fig. 9.1). Adult-like phonemic recognition/categorization mastery is achieved beyond early childhood, especially for consonants (Hazan & Barrett, 2000). One of the reasons for the early mastery of particular groups of sounds is that in monosyllabic words (i.e. usually the first types of


words learned), stop/nasal consonants are more common than other phonemes, and acquisition of phonemes in longer words increases with age (Farwell, 1976; Shriberg et al., 1993).

[Figure 9.1 chart: English consonant groups (b, d, h, m, p, w; f, k, n; g, j, t; v; l; ch, dh, dz, sh; r, th; ng, s, z) plotted against age of articulation mastery from 0 to 108 months.]

Figure 9.1. Recommended age at which 90% of children acquire articulation mastery of English consonants based on Smit et al. (1990). Stop and nasal consonants are among the first sounds mastered by children.

Many studies (Bennabi et al., 2013; Cummins et al., 2015a; Scherer & Zei, 1988) have indicated unusually lax articulation and sluggish verbal motor coordination (e.g. psychomotor retardation) in individuals with depression disorders. Moreover, Flint et al. (1993) and Darby et al. (1984) found significant changes in motor articulation transitions in depressed speakers, including poor intelligibility and imprecise consonant production. Due to the aforesaid articulatory abnormalities exhibited by depressed speakers, an acquisition-based phoneme articulation effort measure is proposed to help further identify depressed speakers. The proposed novel indicator of articulation effort is based on the age of articulation mastery for males shown previously in Fig. 9.1. For phoneme effort scoring, vowels/diphthongs were given a score of 0, as they are typically acquired before consonants, whereas the eight different consonant groupings were given scores based on the age of mastery indicated in Fig. 9.1. A phonetic dictionary was utilised to convert all database transcript text into English phoneme representations. This allowed scoring based on spoken phonemes rather than letters (i.e. ‘three’ as ‘TH-R-IY’), as further illustrated by Fig. 9.2.

Figure 9.2. Proposed articulation effort measure values for a spoken phrase, phonetic conversion, and corresponding articulation effort values; blue indicates only consonants calculated in the articulatory effort mean measure.
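The following is a minimal Python sketch of the proposed articulation effort measure. The consonant groupings and their numeric scores are placeholders that only approximate Fig. 9.1; the thesis scores each group by its actual age of mastery (Smit et al., 1990) and uses a full phonetic dictionary for the word-to-phoneme conversion.

# ARPAbet-style symbols; the group memberships and ordinal scores 1-8 below are
# illustrative stand-ins for the mastery-age groupings read off Fig. 9.1.
VOWELS = {"AA", "AE", "AH", "AO", "AW", "AY", "EH", "ER", "EY",
          "IH", "IY", "OW", "OY", "UH", "UW"}

CONSONANT_GROUPS = [
    ({"B", "D", "HH", "M", "P", "W"}, 1),
    ({"F", "K", "N"}, 2),
    ({"G", "Y", "T"}, 3),
    ({"V"}, 4),
    ({"L"}, 5),
    ({"CH", "DH", "JH", "SH"}, 6),
    ({"R", "TH"}, 7),
    ({"NG", "S", "Z"}, 8),
]

def phoneme_effort(phoneme):
    """Vowels/diphthongs score 0; consonants score according to their mastery group."""
    if phoneme in VOWELS:
        return 0
    for group, score in CONSONANT_GROUPS:
        if phoneme in group:
            return score
    return 0  # symbols outside the table are ignored

def articulation_effort(phonemes):
    """Mean effort over consonants only, as in the proposed measure."""
    scores = [phoneme_effort(p) for p in phonemes if p not in VOWELS]
    return sum(scores) / len(scores) if scores else 0.0

# e.g. 'three' -> ['TH', 'R', 'IY'] via a phonetic dictionary such as CMUdict
print(articulation_effort(["TH", "R", "IY"]))  # averages the TH and R scores only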


Articulatory effort produces an acoustic signal, which is intrinsically linked to quantitative linguistic representations. It is well established that aspects of high-level language (e.g. lexical choice, syntax, pragmatics) are affected by depression (Andreasen & Pfohl, 1976; Rude et al., 2004; de Choudhury et al., 2013). While computational text-processing techniques were originally developed for written language, they have been successfully transitioned for application to spoken language (Chomsky & Halle, 1968; Williamson et al., 2016). For instance, in Williamson et al. (2016), features derived from semantic analysis of spoken transcripts performed better than acoustic and video features for depression recognition. As shown in Fig. 9.3 below, the focus of this experimental work involves acoustic speech-based data selection by leveraging text-based measures (e.g. articulation effort, linguistic complexity) using partitioned data analysis. As proposed in Fig. 9.3, the speech data input is evaluated via articulation effort and linguistic complexity based on a speaker’s transcript, after which acoustic feature selection is conducted based on the degree of articulation effort and/or linguistic complexity. Therefore, only phrases that meet specific articulation effort and/or linguistic complexity scores were evaluated during classification. The grey arrow shown in Fig. 9.3 is an optional step, wherein the elicitation design could be adapted to help achieve a larger collection of higher articulation effort and/or linguistic complexity data. Thus, in this proposed system, the elicitation protocol could be changed depending on whether or not the effort or complexity met the desired threshold.

[Figure 9.3 block diagram: Speech (input) → Data Selection (informed by Articulation Effort and Linguistic Complexity) → Acoustic Feature Extraction → Classification → Depressed/Non-Depressed (output), with an optional feedback path to the ground-truth elicitation method.]

Figure 9.3. Experimental design for comparing data selection measures for acoustic depression classification. The dashed line suggests the implications of the experimental results for elicitation design.

9.3 Experimental Settings

During experiments, each speaker’s per-phrase articulation effort measures were averaged to create a non-partitioned all-phrase feature set to accompany the acoustic COVAREP features; this articulation effort feature set consisted of the mean, standard deviation, 50th percentile, and variance calculated per speaker transcript for consonants only, allowing disparate utterance lengths to be compared equally. Additionally,


these phrases were partitioned into low, mid-low, mid-high, and high groups based on their degree of articulation effort (see Fig. 9.4). For the research herein, linguistic measures, such as lexical sophistication, syntax phrase indices, grammar structure, and genre-related language, were extracted from each transcript using the Simple Natural Language Processing Tool (SiNLP) (Crossley et al., 2014), the Tool for the Automatic Analysis of Lexical Sophistication (TAALES) (Kyle & Crossley, 2015), and the Tool for the Automatic Analysis of Syntactic Sophistication and Complexity (TAASSC) (Kyle, 2016). Note that this is the first research to apply these particular toolkits for depression classification, and all transcripts were processed in their original text format. These automatic text-processing toolkits were discussed previously in Chapter 4.1.2. For linguistic complexity experiments, each speaker’s linguistic complexity measures per phrase were averaged to create a non-partitioned feature set. However, for the partitioned experiments, a sorted mean per-utterance word letter length measure was used to separate the data into four partitions of low, mid-low, mid-high, and high linguistic complexity values (see Fig. 9.4). Afterwards, per partition, each speaker’s linguistic complexity values were averaged to obtain a partitioned feature set. Sorting by average word length was treated as indicative of greater linguistic complexity: it was expected that longer words were less likely to be determiners/conjunctions (e.g. the, a, an, and), contained greater variance in syntactic construction, and carried more meaningful semantic information. It has been shown in Piantadosi et al. (2011), for several languages including English, that words and phonemes are assigned shorter forms (e.g. pronunciations) in linguistic contexts where language is highly predictable and/or conveys less information. Hence, word length is associated with its degree of informativeness.

[Figure 9.4 chart: total minutes of DAIC-WOZ data (0-450) per low, mid-low, mid-high, and high partition for the articulation effort and linguistic complexity measures.]

Figure 9.4. Total number of minutes evaluated in the DAIC-WOZ database based on articulation (blue) and linguistic (green) measures and their partitions. As expected based on articulatory acquisition (Smit et al., 1990) and word-length frequency (Piantadosi et al., 2011) norms, the majority of DAIC-WOZ data is partitioned within the mid-low group, whereas the high group partition consists of the least amount of data.
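A minimal sketch (illustrative only, not the thesis code) of the word-length-based linguistic complexity partitioning described above is given below: utterances are ranked by mean word length and split into four near-equal groups.

import numpy as np

def mean_word_length(utterance):
    words = utterance.split()
    return float(np.mean([len(w) for w in words])) if words else 0.0

def partition_by_complexity(utterances, n_partitions=4):
    """Return [low, mid-low, mid-high, high] partitions of roughly equal size,
    ordered by increasing mean word length."""
    ranked = sorted(utterances, key=mean_word_length)
    return [list(chunk) for chunk in np.array_split(ranked, n_partitions)]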


For comparison, non-partitioned all-phrase experimental results were also included. For experiments herein, the COVAREP speech toolkit (Degottex et al., 2014) was used to extract acoustic features. Each COVAREP feature had its mean, standard deviation, kurtosis, and skewness calculated by aggregating 20-ms frame-level features with 10-ms overlap across individual utterances. COVAREP was chosen as an acoustic feature set since it is the baseline for the AVEC 2016 challenge (Williamson et al., 2016) and other depression analysis research. For more information on the COVAREP feature set, see the previous Chapter 4.1.1. As previously indicated in Fig. 9.4, partitioned experiments used the mean per-utterance measure to separate the data into four partitions of low, mid-low, mid-high, and high articulation effort values. Similarly to Mitra et al. (2014) and Nadeem et al. (2016), depression classification was conducted using decision trees, which performed well in preliminary experiments. All experiments used the MATLAB simple decision tree classification toolkit with a small number of leaves, coarse decision-making, and a maximum of 4 splits. Experiments utilised 2-class classification (non-depressed/depressed) with 5-fold cross-validation using a 20/80 training/test split. The Distress Analysis Interview Corpus – Wizard of Oz (DAIC-WOZ) (Gratch et al., 2014) was used for all experiments. The DAIC-WOZ was chosen because it has a relatively large number of speakers (82 male/female), spontaneous speech, a virtual human interviewer, phrase-level spoken transcriptions with time stamps, and a Patient Health Questionnaire (PHQ-8) score per participant. See the previous Chapter 5.4.4 and Gratch et al. (2014) for more details about the DAIC-WOZ speaker content and transcription conventions.
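As an illustration of the classification set-up just described, the sketch below trains a shallow decision tree (at most 4 splits) and averages accuracy over five 20/80 train/test partitions; scikit-learn is used here purely for illustration in place of the MATLAB toolkit, and the feature/label arrays are assumed to be prepared beforehand.

import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import ShuffleSplit

def evaluate_partition(features, labels, seed=0):
    """features: (n_speakers, n_dims) array; labels: 0 = non-depressed, 1 = depressed."""
    splitter = ShuffleSplit(n_splits=5, train_size=0.2, test_size=0.8, random_state=seed)
    accuracies = []
    for train_idx, test_idx in splitter.split(features):
        tree = DecisionTreeClassifier(max_leaf_nodes=5)  # at most 4 splits -> 5 leaves
        tree.fit(features[train_idx], labels[train_idx])
        accuracies.append(tree.score(features[test_idx], labels[test_idx]))
    return float(np.mean(accuracies))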

9.4 Results and Discussion

It was hypothesized that depressed speakers would demonstrate a decrease in articulatory precision for consonants that are acquired later in childhood, thus resulting in better discrimination of these speakers. Indeed, the partitioned results shown in Fig. 9.5 confirm that using acoustic features from phrases with greater articulation effort or linguistic complexity produces higher depression classification results than using all utterances. In the partitioned experimental results shown in Fig. 9.5, absolute gains in depression classification were recorded when compared with the non-partitioned acoustic-based results shown later in Fig. 9.6; however, this comparison is not straightforward because these accuracies are derived from different data subsets (i.e. all phrases versus specifically partitioned phrases).


[Figure 9.5 bar chart: depression classification accuracy (60-85%) for the low, mid-low, mid-high, and high articulation effort and linguistic complexity partitions.]

Figure 9.5. Average depression classification accuracy results for acoustic COVAREP features per articulation effort (blue) and linguistic complexity (green) measure partitions shown from left to right: low, mid-low, mid-high, and high. These partitioned results indicate that the 66% baseline acoustic feature depression classification accuracy using all speaker data can be generally surpassed by utilising acoustic features derived only from higher articulatory and linguistic complexity speech segments.

Although the partitions shown in Fig. 9.5 were chosen by retaining near-equal numbers of phrases per partition, there was variation in the total phrase duration per partition, as shown previously in Fig. 9.4. Based on a more detailed investigation, among the measures shown in Fig. 9.5, the high linguistic complexity partition had the smallest duration (see Fig. 9.4), which might explain its decrease in classification accuracy. The DAIC-WOZ transcripts are limited because they do not contain exact phonetic representations. A speaker could pronounce a word with deleted, added, or substituted phonemes, but the transcript does not indicate these speech attributes. For example, if a speaker said ‘runnin’, the DAIC-WOZ transcripts would only render ‘running’. Consequently, a major advantage of referencing text-based measure values onto the acoustic features is that if abnormal articulatory production and/or affective states manifest in depressed speakers as complexity increases, these characteristics will still be recorded acoustically. Currently, commercial automatic speech recognition may have sufficient performance to justify text-based analysis in combination with acoustic speech-based features for depression classification. Using the DAIC-WOZ corpus, results demonstrated that only an approximate text transcript (e.g. the gist of what was spoken; broad transcription) is required to generate effective text-based measures as features and/or to use as data selection references for acoustic features. Non-partitioned experiments comparing acoustic-based features with text-based features are shown in Fig. 9.6. Similarly to Williamson et al. (2016), the linguistic complexity features yielded superior classification accuracy to the acoustic-based features. The articulation effort and SiNLP features performed surprisingly well despite having much smaller dimensionalities than the other feature sets. It is important to note that accuracies based on linguistic features in Fig. 9.6 may be optimistic relative to those based on an automatic transcript.

[Figure 9.6 bar chart: depression classification accuracy (60-85%) for the COVAREP (296), Articulation Effort (4), Linguistic Complexity (864), SiNLP (7), TAALES (476), and TAASSC (377) feature sets.]

Figure 9.6. Average depression classification accuracy results for acoustic-based COVAREP (red) baseline and linguistic-based (green) features; the dark shade indicates results for all similar feature sets combined, while the number of total features is indicated in parentheses.

9.5 Extension of Results and Discussion Using Affective Intensity

An extension of the articulation effort and linguistic complexity depression classification experiments presented in Chapter 9.4 was conducted by investigating affective intensity (e.g. degree of affect). The hypothesis was that while word affect ratings are a qualitative measure, they could be useful in broadly interpreting text-based sentiment content. Thus, from each speaker’s transcript, an evaluation of text-based sentiment could provide further insight into the broad emotional context expressed. In de Choudhury et al. (2013), it was shown that word affect information can be transformed into a feature to help recognize individuals with depression. Therefore, the Sentiment Analysis and Cognition Engine (SEANCE) (Crossley et al., 2017) was used to extract token word affect-based features per individual transcript phrase. Several different affective word-rating references are included in SEANCE, such as the Affective Norms for English Words (ANEW), EmoLex, Lasswell, and Hu Liu Polarity (refer to the previous Chapter 4.1.3 for more information on these). Note that this is the first research to apply many of these affectively rated token word references (see Chapters 3.3 and 4.2.3 for more details) for depression classification. Much like the previous partitions described in Chapter 9.4, a sorted mean per-utterance SEANCE ANEW arousal/valence word affect measure was employed to separate the phrase data into four partitions of low, mid-low, mid-high, and high based on arousal/valence values. Again, depression


classification was conducted using decision trees performing 2-class classification (non-depressed/depressed) with 5-fold cross-validation using a 20/80 training/test split. As shown in Fig. 9.7, the low valence and low arousal partitions had the greatest amounts of data. It should be noted that this affective distribution may not always be the case, as the interviewer’s questions reflect to a large degree the context of what is discussed and the data collected. Per partition, each speaker’s acoustic COVAREP features per phrase were averaged to obtain a partitioned acoustic feature set. Also, for the entire-speaker-transcript experiments, each speaker’s token word affect measures per phrase were averaged to create a non-partitioned affective feature set.

[Figure 9.7 chart: total minutes of DAIC-WOZ data (0-250) per low, mid-low, mid-high, and high partition for the token word valence and arousal measures.]

Figure 9.7. Total number of minutes evaluated in the DAIC-WOZ database based on token word valence (light orange) and arousal (dark orange) measures and their partitions. Note that although there are differences in number of minutes per affect partition, there was roughly the same number of recordings per partition.

As shown in Fig. 9.8, phrases with higher arousal and/or more positive valence were more discriminative for acoustic COVAREP features. This is likely due to depressed speakers having greater difficulty than non-depressed speakers in producing happier (e.g. more positively valenced) and/or more energetic (e.g. aroused) speech (Bylsma et al., 2011; Cummins et al., 2015b). Furthermore, it has been shown in Barrett & Paus (2002) that even non-depressed speakers prosodically adjust their speech to have a more ‘negative’ sounding tone (e.g. longer duration, reduced loudness, lower pitch) for negatively valenced phrases than for positively valenced phrases. Thus, for negatively valenced phrases, the prosodic adjustments normally exhibited by non-depressed speakers make their speech more similar to a depressed patient’s overall abnormal speech habits. In Fig. 9.8, the decrease in the mid-high word affect arousal partition could be related to the narrower range of that partition. Nonetheless, for both the valence and arousal partitions, the low partition produced results that performed only roughly as well as the COVAREP all-phrases acoustic baseline.


[Figure 9.8 bar chart: depression classification accuracy (60-85%) for the low, mid-low, mid-high, and high word affect (valence) and word affect (arousal) partitions.]

Figure 9.8. COVAREP acoustic features average depression classification accuracy results per affect measure and partition shown from left to right: low, mid-low, mid-high, and high. These partitioned results suggest that the 66% baseline COVAREP acoustic feature depression classification accuracy using all speaker data can be generally surpassed by utilising acoustic features derived only from speech segments with high degrees of affect.

Non-partitioned experiments comparing affective text-based features are shown in Fig. 9.9. As shown in Fig. 9.9, when affect word-rating references such as the GI, ANEW, and Lasswell were applied as features (i.e. extracted from the speaker transcripts), depression classification performance of up to 84% accuracy was achieved. It is important to note that accuracies based on linguistic features in Fig. 9.9 may be optimistic relative to an automatic transcript.

[Figure 9.9 bar chart: depression classification accuracy (60-85%) per affect word-rating reference feature set.]

Figure 9.9. Average depression classification results using several different affect reference text-based features. The number of features included per affect word analysis reference is shown in parentheses. The COVAREP acoustic baseline was 66% classification accuracy.


9.6 Summary

A study evaluating different speech-related measures for depression analysis yielded multiple insights regarding how acoustic, linguistic, and affective information impacts speech-based depression analysis. Previous groundwork in the related fields of speech-language pathology, computational linguistics, and affective computing motivated the development of new measures for automatic acoustic feature data selection. The newly derived articulation effort, linguistic complexity, and affective intensity measures produced partitioned gains in depression classification accuracy of up to 11% absolute over the all-phrase baseline. Moreover, the text-based articulation effort, linguistic complexity, and affective features based on all phrases were superior to acoustic-based features, with up to an absolute gain of 18% in depression classification performance (e.g. Lasswell affect features). It should, however, also be noted that low-dimensional feature sets, such as the articulation effort and linguistic SiNLP features, were relatively competitive when compared with the larger feature sets explored. In this chapter, a novel measure for quantifying articulation effort based on mastery of consonants was developed, which, when applied experimentally to the DAIC-WOZ corpus, shows promise for identifying speech data and analyzing acoustic-based feature regions that are more discriminative of depression. The application of acoustic speech-based analysis in conjunction with text-based processing shows considerable promise for guiding clinical elicitation protocol design and interview speech analysis. These experimental results demonstrate that by selecting speech with higher articulation effort, linguistic complexity, and degree of affect, improvements in acoustic speech-based depression classification performance can be achieved, serving as a speech collection guide for future elicitation design.


Chapter 10 SPEECH AFFECT RATINGS

10.1 Overview

Depression, as a mood disorder, induces changes in response to emotional stimuli, which motivates investigation into the relationship between different measures of affect and depressed speakers’ speech behaviours. Typical speech-based depression detection systems in the literature focus on eliciting a paralinguistic marker directly from a speech biosignal (Cummins et al., 2013a, 2015a; Helfer et al., 2013; Low et al., 2010). However, despite strong links between changes in continuous affective measures and depression (Bradley & Lang, 1994; Tellegen, 1985; Lang et al., 1993), there is little research studying the benefits that features derived from affect rating scores could introduce to a speech-based depression classification system. Depression prediction results presented by Perez et al. (2014) suggest that a small number of dimensional affect scores provided by human listeners can produce performance comparable with large brute-forced acoustic feature spaces when predicting a speaker’s level of depression. However, this research neither provided individual details on affect rating-based feature performance, nor explored using these ratings to investigate emotional information for the improvement of automatic depression classification systems. Speech depression recognition using continuous affect ratings has potential as a future clinical non-invasive mental health assessment tool, assisting in discerning healthy individuals from those suffering from clinical depression (Cummins et al., 2015a). It is believed that affect will be an important part of speech-based depression analysis and identification. For example, in Goeleven et al. (2006), it was discovered that depressed patients neglect to suppress pessimism when relating to negative stimuli. This abnormal change in affect caused by a depressive state could be quantitatively measured by examining affective-computing-related dimensions, such as arousal and valence, using a word-by-word rating scale. While a word-level affect approach is quite common in data-mining sentiment analysis (Pang & Lee, 2008), its application is emerging quickly in other areas, such as medicine (Denecke & Deng, 2015).


Due to the aforesaid links between affective states and depression, emotion features derived from continuous affect rating measures are proposed for investigating depressed speech. Furthermore, evaluating explicitly controlled speech segments in terms of affect might reveal which informational regions are of greater value for depression classification.

10.2 Methods

It is reasonable to ask whether depression discrimination is improved by selecting segments of speech residing exclusively in particular emotional regions. A convenient means of selecting these emotional subsets is to threshold features based on manually rated arousal, valence, or dominance per frame. In order to select emotional subsets from mild to severe, four thresholding approaches were applied (Fig. 10.1): upper bounds, where lower emotion values are retained; lower bounds, where higher emotion values are retained; extroverted bounds, where mild (centre) emotion values are retained; and introverted bounds, where severe (fringe) emotion values are retained.


Fig. 10.1: Emotion-based data selection using four types of bounds: (a) Upper Bounds; (b) Lower Bounds; (c) Extroverted Bounds; and (d) Introverted Bounds. The red area indicates the region of excluded rating values, whereas the white area indicates region wherein rating values are retained. The direction of the area demonstrates the direction in which the bound parameter increases.
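A minimal Python sketch of the four bound types is given below, assuming per-frame ratings centred around zero (as for the continuous AVEC 2014 ratings) and a non-negative threshold parameter t; the exact bound definitions used in the thesis may differ in detail.

import numpy as np

def select_frames(ratings, t, bound):
    """Return a boolean mask over frames whose rating survives the given bound type."""
    r = np.asarray(ratings, dtype=float)
    if bound == "upper":        # retain lower emotion values
        return r <= t
    if bound == "lower":        # retain higher emotion values
        return r >= -t
    if bound == "extroverted":  # retain mild (centre) values
        return np.abs(r) <= t
    if bound == "introverted":  # retain severe (fringe) values
        return np.abs(r) >= t
    raise ValueError("unknown bound type: " + bound)

Frames for which all three ratings survive their respective bounds (as used later for the 'LLD based on all MR' condition) can then be obtained by combining the masks with a logical AND.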

The scope of the proposed depression classification experiments is shown in Fig. 10.2. As shown there, features based on the acoustic speech biosignal can be selected based on manually rated


observations with set thresholds, previously shown in Fig. 10.1. Further, in an approach not yet reported in the speech-based depression literature, manually rated affect types (e.g. arousal, valence, dominance) can also be applied directly as features for depression analysis.

[Figure 10.2 block diagram with components: Speech, Low-Level Descriptors, Functionals, Manual Ratings, Data Selection, Feature Extraction, and Depression Classification.]

Fig. 10.2: System configuration for experiments applying arousal/valence/dominance thresholds as a means of data selection for depression classification. Dashed lines indicate components that may or may not be used in given experiments.

To explore whether manual ratings and their derived features could be replaced with an automatic process related to affective computing prediction techniques, the following emotion prediction system was proposed (Fig. 10.3). To predict the arousal, valence, and dominance ratings, Geneva Minimalistic Acoustic Parameter Set (GeMAPS) features were extracted from training/development audio files for prediction of arousal (similarly to the AV+EC 2015 reference emotion prediction system) (Eyben et al., 2016). As in Huang et al. (2015), these were passed to a Relevance Vector Machine (RVM) system (see the previous Chapter 4.2.5 for details) trained iteratively on a leave-one-file-out basis to predict arousal ratings per file. Smoothing and delay compensation (6 seconds) were also applied. The automatic RVM arousal ratings achieved a correlation coefficient of 0.33 and an 11.2 Root Mean Square Error (RMSE) when tested on the AVEC 2014 database. This RMSE value was close to the audio accuracy of the AVEC 2014 baseline system (11.52 RMSE) (Valstar et al., 2014) (note that valence and dominance had correlation coefficients of less than 0.10).

[Figure 10.3 block diagram with components: Speech, eGeMAPS + RVM Prediction, Arousal/Valence/Dominance Prediction, Data Selection, Feature Extraction, Baseline Features, and Depression Classification.]

Figure 10.3: Fully automatic depression classification using emotion prediction and emotion-based data selection.
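The post-processing applied to the predicted arousal contour can be sketched as below; the moving-average window length, frame rate, and direction of the shift are assumptions, with only the 6-second delay compensation taken from the description above.

import numpy as np

def smooth_and_compensate(pred, frame_rate_hz=25.0, delay_s=6.0, win_s=2.0):
    """pred: per-frame predicted arousal. Returns a smoothed, delay-compensated contour."""
    pred = np.asarray(pred, dtype=float)
    win = max(1, int(win_s * frame_rate_hz))
    smoothed = np.convolve(pred, np.ones(win) / win, mode="same")  # moving-average smoothing
    shift = min(int(delay_s * frame_rate_hz), len(smoothed) - 1)
    # Advance the contour by `shift` frames to offset annotator reaction delay,
    # padding the tail with the final smoothed value.
    return np.concatenate([smoothed[shift:], np.full(shift, smoothed[-1])])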


10.3 Experimental Settings

All experiments evaluated baseline, LLD, and manual ratings-based individual features along with various combinations. In addition, the effects of data selection were examined specifically for ratings-based and LLD features. The interaction between emotion and depression is investigated using combinations (concatenations) of these feature sets:
• Baseline features (2068 functionals total) – Baseline Functionals (BF) (Valstar et al., 2014)
• Low-Level Descriptor features (78 total) – LLD (Valstar et al., 2014)
• Manually generated features (arousal, valence, and dominance gold-standard continuous ratings, aggregated on a per-file basis using means, medians, and standard deviations; 9 total) – Manual Ratings (MR)
Depression classification was performed using a Support Vector Machine (SVM) in MATLAB 2016b. SVM approaches have been shown to give good speech depression classification performance (Cummins et al., 2015a; Sethu et al., 2014). All experiments utilised an SVM linear kernel with default hyper-parameter settings. Five-fold cross-validation, performed using the two-class division described below, was used in all classification tests, with the average accuracy reported. A subset of the Audio-Visual Emotion Challenge (AVEC) 2014 corpus was used for this research due to its continuous affect ratings, depression scores, feature sets, and previous use in the aforementioned speech depression classification studies (Cummins et al., 2013a, 2013b; Helfer et al., 2013; Perez et al., 2014; Valstar et al., 2014). AVEC 2014 includes spontaneous speech audio recordings from 84 male and female speakers. Similarly to Cummins et al. (2014), two-class labels were determined by assigning BDI-II 0-9 severities as ‘none to low depression’ and BDI-II 19-65 severities as ‘moderate to high depression’. Also, for each speaker file, the AVEC 2014 database contained continuous affect ratings (e.g. arousal, valence, dominance). Note that continuous affect ratings were previously explained in Chapter 3.3. For more information regarding the AVEC 2014 database, refer to Chapter 5.4.2.
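A hedged sketch of how the Manual Ratings (MR) functionals and their concatenation with the Baseline Functionals (BF) could be assembled and scored is given below; scikit-learn replaces the MATLAB SVM used in the thesis, and the input arrays are assumed to be prepared elsewhere.

import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

def manual_rating_features(arousal, valence, dominance):
    """Nine MR functionals: mean, median, and standard deviation per rating dimension."""
    feats = []
    for ratings in (arousal, valence, dominance):
        feats.extend([np.mean(ratings), np.median(ratings), np.std(ratings)])
    return np.array(feats)

def classify(bf_matrix, mr_matrix, labels):
    """labels: 0 = none/low (BDI-II 0-9), 1 = moderate/high (BDI-II 19-65)."""
    X = np.hstack([bf_matrix, mr_matrix])  # BF + MR concatenation per file
    clf = SVC(kernel="linear")             # linear kernel, default hyper-parameters
    return cross_val_score(clf, X, labels, cv=5).mean()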


10.4 Results and Discussion

10.4.1 Manual Affect Ratings

As seen in Fig. 10.4 (black dotted line ‘All’), for standalone ratings-based features, arousal demonstrated better depression classification results than valence and dominance. Similarly, prior studies have indicated a strong relationship between speech arousal and BDI prediction (Valstar et al., 2014; Cummins et al., 2014). Notably, three manual arousal ratings-based features alone achieved 70% accuracy, while the 2068 acoustic baseline features attained 77.5% accuracy. This indicates that ratings-based features can compactly summarize some key depression information.

[Figure 10.4 charts: depression classification accuracy (0.50-0.85) versus data selection threshold for the upper, lower, extroverted, and introverted bounds, with separate panels for arousal, valence, and dominance; legend: Baseline Functionals (BF), Manual Ratings-based (MR), LLD (based on single MR), BF + MR, MR + LLD (based on single MR), MR + LLD (based on all MR).]

Figure 10.4: Depression classification accuracies versus threshold value (four threshold types) for arousal (red), valence (blue), and dominance (yellow) ratings-based features, speech features, and a combination. The LLD based on single manual rating (MR) is the selection of any LLD feature frame whose corresponding manual rating was within the given data selection threshold. The LLD based on all MR selects only those LLD frames for which all three manual ratings were within the threshold.

Manual ratings-based features concatenated with acoustic baseline features (Fig. 10.4, black dashed line) consistently produced depression classification results equal to or better than the baseline functionals (arousal + valence + dominance producing the same result as standalone arousal). Dominance ratings-based features performed the worst for depression classification. This was expected based on previous literature showing the difficulty of procuring above-chance emotion classification results using ‘freeform’ speech dominance information (Valstar et al., 2014). When compared with using all rating features, improvements in classification performance were recorded by applying data selection to standalone arousal, valence, and dominance ratings-based features (Fig. 10.4, black dotted lines). Using manual ratings-based and baseline features, the upper, lower, and extroverted data selection with any threshold between 0.40 and 0.20 (retaining roughly 95% to 75% of the ratings) produced equal or better performance than the baseline results. The introverted bounds, when paired with baseline features, did not exhibit wide threshold parameter ranges that consistently resulted in better classification performance. This suggests that discarding features (both ratings-based and acoustic) corresponding to severe (fringe) emotions was helpful in general. Initially, for this set of results, it was believed that the threshold parameters aided in removing ratings that were outliers.

10.4.2 Acoustic Frame Selection Based on Manual Affect Ratings

When identifying which acoustic frames to keep or drop, manual ratings were successfully used as a data selection criterion for acoustic features. Performance gains over LLD (all data retained) are noticeably observed for thresholds applied to ratings-based features and LLD based on all MR features (Fig. 10.4, light gray dashed line). In many instances, these performed better than or equal to the baseline results. Thus, LLD frames for which all three manual ratings were within the threshold were more advantageous for classification accuracy gains than those based on a single rating (Fig. 10.4, light gray dashed line). There was a sizeable performance impact when ratings-based features were combined with a small set of acoustic LLD features (Fig. 10.4, solid gray line versus light gray dashed line). Without data selection, arousal or valence ratings-based features together with LLD based on all manual ratings demonstrated 5-7% absolute gains. However, dominance ratings-based features only received a boost in performance after thresholds were applied to the dominance ratings-based features and LLD features. This finding suggests that dominance requires data selection for a gain in depression classification performance.

A heightened level of depression usually results in reduced vocal dynamic energy (Cummins et al., 2015a), and in turn generally lower perceived speech arousal ratings. As thresholds rose and more of the lower rating scores were removed from the LLD and/or arousal ratings-based features, a falling trend in arousal LLD performance was noted. Individuals with depression generally speak with perceptibly less dominance (Osatuke et al., 2007). Therefore, it was speculated that as additional lower-threshold dominance ratings were removed, depression classification performance would decrease. The standalone dominance ratings-based features and LLD features generally had a steep drop-off in classification performance when less than 80% of the ratings were retained. Speakers with depression typically have lower perceived valence ratings than healthy populations (Joorman & Gotlib, 2010). Consequently, introverted thresholds on standalone valence ratings-based features demonstrated up to 5% improvements in speech depression classification performance by retaining the outermost ratings, which included the severe (fringe) values. These results also suggest that there may be important affective-acoustic information contained within multiple different affective score range regions, each potentially independently useful for depression assessment. This discovery warrants further investigation (see Chapter 11 for an experimental follow-up).

10.4.3 Acoustic Frame Selection Based on Automatic Affect Ratings

After automatic arousal ratings were generated from GeMAPS features and an RVM system, data selection was applied using the upper, lower, and extroverted bounds. Results showed that standalone automatically generated arousal ratings-based features attained depression classification accuracy similar to that of standalone manually generated arousal ratings-based features (see Table 10.1). Furthermore, the results indicate that when combining automatically generated arousal and GeMAPS features, subsequent thresholding can produce fully automated systems whose performance is equal to that of manual ratings-based features. The automatically predicted arousal ratings-based features improved with upper, lower, and extroverted data selection using narrower threshold settings of 0.35 to 0.15 (retaining roughly 95% to 75% of the ratings).


Table 10.1. Depression classification percentage accuracy based on arousal (A) from either manual ratings or emotion prediction and/or acoustic baseline features (BF). Rows report A and A + BF results under the UPPER, LOWER, EXTROVERTED, and INTROVERTED data selection bounds; columns report results using all manual ratings, all automatic ratings, and automatic ratings thresholded at 0.35, 0.30, 0.25, 0.20, 0.15, and 0.10. [Individual cell values could not be reliably recovered from the converted source.]

10.5 Summary Emotion-based data selection was shown to provide improvements in depression classification, and a range of threshold methods was explored. The experiments presented demonstrate that automatically predicted emotion ratings can be incorporated into a fully automatic depression classification system to produce a 5% absolute accuracy improvement over an acoustic-only baseline system. Experiments based on the AVEC 2014 database also show for the first time that manual emotion ratings alone are discriminative of depression, and that combining ratings-based emotion features with acoustic features improves classification between mild and severe depression. Given continuous affect ratings or scores, this research suggests that features derived from them carry information complementary to conventional acoustic features when classifying depression via speech. By applying different thresholds to manual or automatic ratings, ratings-based features can achieve modest performance gains. Results herein suggest that constraining affective regions using rating thresholds helps narrow the affective identities considered (e.g. low, medium, high), effectively isolating specific types of emotion (e.g. for arousal: bored, neutral, excited) along with their associated acoustic speech characteristics. The application of automatically predicted ratings-based arousal features was useful even though their correlation with the manual ratings was not high. The results show that, for arousal, automatic ratings predicted from GeMAPS acoustic features provide a boost in performance when emotion-based data selection is applied.


Chapter 11 ACOUSTIC, LINGUISTIC, AND AFFECTIVE CONSIDERATIONS FOR SPOKEN READ SENTENCES 11.1 Overview In the near future, automatic speech-based analysis of mental health will become widely available to augment conventional healthcare evaluation methods. For speech-based patient evaluations of this kind, elicitation protocol design is an important consideration. Furthermore, as noted in Chapter 1.2, an examination of articulatory, linguistic, and affective measures and their related features is key to better understanding how individuals with depression can be more readily identified via their speech behaviors. As discussed earlier in Chapter 5.2.4, read speech provides key advantages over other verbal modes (e.g. automatic, spontaneous) because it provides a stable and repeatable protocol. Further, text-dependent speech analysis holds the prospect of reducing phonetic variability and delivering linguistic/affective ground-truth that is known a priori, thus allowing a more specific or targeted examination of depressed and non-depressed speakers' speech. For psychogenic voice and neurological disorders (e.g. anxiety, dementia, depression, Parkinson's disease, traumatic brain injury), observation of spoken language is standard clinical practice. Psychogenic voice disorders are associated with disturbances of an individual's autonomic system and personality (Perepa, 2017). For psychogenic voice disorders, the most common articulatory manifestations are aphonia, dysphonia, spasmodic dysphonia, and stuttering-like behaviour (Duffy, 2008). At the acoustic level, and of particular relevance to aphonia/dysphonia, are aspects of speech voicing used to identify abnormal motor behaviour. Speech voicing refers to the activation or non-activation of the vocal folds. Generally, for speech processing applications, there are four types of

speech voicing, with the first and second being the most crucial for acoustic speech analysis: (1) voiced (i.e. speech sound with vibration of the vocal folds); (2) unvoiced (i.e. speech sound without vibration of the vocal folds); (3) mixed voiced-unvoiced (i.e. transition between speech voicing); and (4) silence (i.e. no speech sound). Literature such as Alghowinem et al. (2013b) and Orozco-Arroyave et al. (2016) has hinted at voiced versus unvoiced articulatory differences between healthy speakers and speakers with disorders involving abnormal motor function. While speech voicing (voiced/unvoiced) was examined for depression classification in Alghowinem et al. (2013b), it was primarily used as a data selection criterion for other acoustic-based features. Spontaneous speech results in Alghowinem et al. (2013b) indicated that separating voiced and unvoiced frame-based feature analysis increased the accuracy of depression classification. However, speech voicing in terms of frame count distributions or voiced-unvoiced ratio features was not investigated. In Hashim et al. (2012), speech was evaluated based on an entire read paragraph from a small set (19) of depressed/non-depressed male speakers. Therein, a probability density function along with a Markov modeling approach was applied to compute a histogram of consecutive voiced/unvoiced/silence frame labels per speaker. While Hashim et al. (2012) reported depression classification results for voiced and silence histogram-based features, the unvoiced feature results were omitted on the basis of poor performance. It is known that individuals with depression often demonstrate a decline in their verbal language skills. Recorded exemplars of speech-language disruptions in depressed speakers include disfluent speech patterns, abandonment of phrases, and unusually long response latencies (Breznitz & Sherman, 1987; Greden & Carroll, 1980; Hoffman et al., 1985). Depressed people also exhibit a greater number of speech hesitations (i.e. pauses, repeats, false-starts) than non-depressed populations during communication, due to psychomotor agitation/retardation and cognitive processing delays (Alpert et al., 2001; Cannizzaro et al., 2004a; Darby et al., 1984; Ellgring & Scherer, 1996; Fossati et al., 2003; Hartlage et al., 1993; Nilsonne et al., 1987, 1988; Szabadi et al., 1976). Much of the literature on speech-based depression disfluencies has focused on spontaneous and/or entire-paragraph read speech samples. However, speech-based depression and other psychogenic illness studies have yet to closely monitor the speech hesitation (e.g. word-repeat, phrase-repeat, pause) and error (e.g. morphological, substitution, deletion) types occurring during text-dependent read aloud speech. One of the earliest studies regarding speakers with depression and hesitation analysis was


by Szabadi et al. (1976). Their research indicated that for automatic speech (e.g. counting), speakers have longer pause durations when depressed than after undergoing two months of treatment. However, the Szabadi et al. (1976) study should be considered cautiously because it included only four participants and automatic single-word counting tasks, and all measurements were completed by hand using a tape measure. Since the Szabadi et al. (1976) investigation, many other speech-based depression studies (Alghowinem et al., 2012; Alpert et al., 2001; Esposito et al., 2016; Mundt, 2012; Nilsonne et al., 1988; Stassen et al., 1998) have also evaluated speech pause durations and frequency ratios with varied success. For example, in Alghowinem et al. (2012), average speech pause durations were found to be longer in depressed speakers than in non-depressed speakers. However, the recordings analyzed in Alghowinem et al. (2012) were only of spontaneous speech, not read speech. Therein, because each speaker's spontaneous recording contained variable time lengths and natural pause moments (i.e. breaths, interviewer speech), the total sums of speakers' pause counts and pause durations could not be comparatively evaluated. As a result of the natural speaker pauses commonly found in spontaneous speech, speech pause and speech rate ratios (e.g. total number of pauses divided by total number of words, total pause time divided by total recording time) are usually used to help normalize speakers (Liu et al., 2017; Wolters et al., 2015). More recently, in Esposito et al. (2016), pauses were examined using both read and spontaneous speech from a set of two-dozen mildly to severely depressed speakers. For spontaneous speech, this study found significant pause duration lengthening exhibited by depressed speakers relative to non-depressed speakers. However, raw pause counts for the text-dependent read speech were again left unexamined. Instead, in the experiments by Esposito et al. (2016), several different ratio-based pause rate features (e.g. filled-pause rate, empty-pause rate) were used. The ratios found in Esposito et al. (2016) and other aforementioned similar studies do not allow for discrimination between pause durations and counts. Further, the pauses therein were not evaluated on a sentence-by-sentence basis, which would allow insight into which kinds of sentences had the most pauses and whether a speaker's pauses occurred intermittently or in clusters. It can be contended that further depression-discriminative information can be obtained by generating separate features based on the number of pauses and the pause durations measured from single sentence utterances. Automatic speech-based depression classification in diagnostic applications is often considerably reliant on the uniformity of the patient elicitation procedure, as each elicitation method can include wide ranges of linguistic structure, affect information, and task requirements (Howe et al., 2014;


Stasak et al., 2017, 2018a). To date, there is no clinically approved automatic speech-based depression diagnosis protocol in widespread use, and there is not even consensus on elicitation methods among researchers. By utilising constrained elicitation designs, such as text-dependent read aloud speech, the aforementioned cognitive-motor and affective abnormalities exhibited by speakers can be monitored and compared under tightly constrained conditions that favor detailed analysis. The aim of the experiments herein is to investigate text-dependent constrained read aloud speech, and to examine the performance impact that affective target words, speech voicing, speech disfluencies, affective target word syntactic placement, and linguistic point-of-view have on automatic depression classification. An affective investigation of speech disfluencies and their potential discriminative features for automatic depression classification has received remarkably little consideration (Rubino et al., 2011). It is hypothesized that text-dependent read sentences with affective reactivity (i.e. affective target words) will provide more accurate ground-truth for disfluency analysis as a result of their phonetic and affective constraints, further affording directly comparable stimulus speech data across different speakers. Based on the aforementioned research into hesitations (e.g. pauses), it is expected that depressed speakers will have higher rates of hesitations and will require more time to read each sentence aloud than non-depressed speakers. While natural speech disfluencies are common in spontaneous speech (Johnson et al., 2004; Gósy, 2003), due to the abnormal cognitive-motor effects of depression on cognitive skills, it is believed that during simple sentence reading tasks depressed speakers will demonstrate greater numbers of speech errors (e.g. omissions, substitutions) than non-depressed speakers. Similarly, based on structural affect theory (Brewer & Lichtenstein, 1982) and linguistic syntax measures, it is predicted that analysis of sentences wherein the main affective stimuli are declared clearly at the beginning of a phrase will prove more discriminative for depression detection than those with stimuli at the end of a sentence. It is also hypothesized that placing the affective target word at the beginning of a sentence will give the speaker earlier cognitive cues for mood processing and acoustic paralinguistic phrasing. Additionally, based on empathic point-of-view (i.e. understanding of another person's situation), it is thought that features extracted from first-person read narrative sentences will generate better classification results than third-person narrative sentences. The read speech experiments herein are further inspired by the spontaneous speech experiments of Rubino et al. (2011). They discovered that depressed speakers exhibited significantly greater


numbers of referential failures (including word replacement errors) than non-depressed speakers. It is hypothesized that, similarly to the spontaneous speech findings of Rubino et al. (2011), the affectively charged read tasks herein will reveal upward trends in referential failures, especially in the form of malapropisms, for depressed speakers when compared with non-depressed speakers. A malapropism is an incorrect word substituted for an intended word; by definition, a malapropism is unrelated in meaning to the intended word but has a similar pronunciation, grammatical category, word stress, and syllable length (Fay & Cutler, 1977). Due to the passive avoidance strategies exhibited by people with depression (Holahan & Moos, 1987; Holahan et al., 2005), it is also anticipated that depressed speakers will attempt fewer self-corrections after speech errors or malapropisms than non-depressed speakers. Lastly, due to the constrained nature of the text-dependent stimuli, the use of automatic speech recognition to help demarcate speech disfluency attributes, such as the number of speech disfluencies and pause durations, is also investigated. It is anticipated that the text-dependent constraints will improve the precision of the automatic speech recognition output, since the phonetic variability of the elicited speech is smaller than that of, for example, spontaneous speech.

11.2 Methods

11.2.1 Acoustic Feature Extraction For experiments herein, the openSMILE speech toolkit was used to extract 88 eGeMAPS acoustic speech features from all 20 sentences in the BDAS database. The eGeMAPS features were calculated by extracting features from 20 millisecond frames with 50% frame-overlap, wherein an aggregated mean functional was computed per eGeMAPS feature. The eGeMAPS feature set was chosen because it has been used previously as a baseline for speech-based depression research (Valstar et al., 2016).
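As an illustration of this step, the sketch below uses the Python opensmile wrapper (an assumption for illustration; the thesis experiments used the openSMILE toolkit directly) to produce an 88-dimensional eGeMAPS functional vector per read sentence; the file path is a placeholder.

import opensmile

# eGeMAPS functionals: one 88-dimensional vector per input sentence recording.
smile = opensmile.Smile(
    feature_set=opensmile.FeatureSet.eGeMAPSv02,
    feature_level=opensmile.FeatureLevel.Functionals,
)

def sentence_egemaps(wav_path):
    # Returns a pandas DataFrame with one row of eGeMAPS functionals for the file.
    return smile.process_file(wav_path)

features = sentence_egemaps('speaker01_sentence01.wav')  # placeholder path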

11.2.2 Speech Voicing Extraction For speech voicing extraction, the COVAREP toolkit (Degottex et al., 2014) voice activity detection (VAD) analysis was used, as described previously in Chapter 4.1.1. This particular VAD was chosen because it uses frame-level probabilistic decisions based on more than one VAD


algorithm (e.g. MFCC-based, summation of the residual harmonics, multi-voicing measures). Also, the COVAREP VAD has been applied in previous speech-based depression studies (Scherer et al., 2014; Valstar et al., 2016). For more details on the COVAREP VAD and its algorithmic components, the reader is referred to Degottex et al. (2014), Drugman et al. (2011, 2016), Eyben et al. (2013), and Sadjadi et al. (2013). Based on the VAD, counts of unvoiced and voiced speech frames were determined for every speaker's individual sentence. Silence frames, as well as any frames with fundamental frequency (F0) values below 65 Hz or above 400 Hz, were omitted from the unvoiced/voiced counts. For both adult females and males, this F0 range covers normal speech while excluding extreme vocal modes (e.g. singing, shouting). Per sentence, 'voiced' and 'unvoiced' frames were summed into their corresponding groups based on their VAD labels, in addition to a combined 'speech' group comprising the sum of both 'voiced' and 'unvoiced' frames.
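The frame counting described above can be summarised with the short sketch below; the per-frame label and F0 arrays are assumed inputs (in the thesis they come from the COVAREP VAD), and the label strings are placeholders.

import numpy as np

def voicing_counts(vad_labels, f0, f0_min=65.0, f0_max=400.0):
    """Per-sentence 'voiced', 'unvoiced', and combined 'speech' frame counts.

    vad_labels: per-frame labels, assumed to be 'voiced', 'unvoiced', or 'silence'
    f0:         per-frame fundamental frequency estimates in Hz
    """
    vad_labels = np.asarray(vad_labels)
    f0 = np.asarray(f0, dtype=float)
    in_range = (f0 >= f0_min) & (f0 <= f0_max)      # drop frames with out-of-range F0
    voiced = (vad_labels == 'voiced') & in_range
    unvoiced = (vad_labels == 'unvoiced')
    return {'voiced': int(voiced.sum()),
            'unvoiced': int(unvoiced.sum()),
            'speech': int(voiced.sum() + unvoiced.sum())}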

11.2.3 Manual Speech Disfluency Extraction All speakers' read sentences were manually analysed to identify any clearly audible hesitations (e.g. abrupt pauses, word repeats, phrase repeats) and speech errors (e.g. malapropisms, omissions, spoonerisms, substitutions). For each read sentence, three different methods for calculating speech disfluencies were explored. Firstly, a single broad binary value was recorded per sentence, indicating whether any disfluency occurred during the read aloud sentence, even if a self-correction was attempted. Secondly, as shown in Fig. 11.1, a binary decision tree method was used to tally specific disfluency types, allowing more insight into hesitation/error types and series-based decision consequences, such as self-correction. Thirdly, a count of the total number of disfluencies per sentence (across all of the speech disfluency decision criteria nodes in Fig. 11.1) was made. Differences in disfluency methodology and labeling terminology exist across previous studies, making it difficult to subjectively or automatically label each disfluency with undisputed precision (Garman, 1990; Shriberg, 1994). Hence, as a novel investigation, both a broad and a fine-grained disfluency analysis are proposed.


[Figure 11.1 decision tree: each sentence input is checked at a Hesitation? node, which branches to Word Repeat? and Phrase Repeat? nodes, and at a Speech Error? node, which branches to an Uncorrected? node; each node has (0) No / (1) Yes outcomes.]

Figure 11.1: Speech disfluency feature binary tree showing how each disfluency type relates to each other. The text in italics indicates the five verbal disfluency decision nodes (i.e. disfluency features).
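The three per-sentence disfluency encodings described in this section can be summarised with the sketch below; the annotation dictionary format and node names are assumptions made for illustration, mirroring the decision nodes of Fig. 11.1.

NODES = ('hesitation', 'word_repeat', 'phrase_repeat', 'speech_error', 'uncorrected')

def disfluency_features(annotation):
    """annotation: dict mapping each node name to a raw count for one read sentence."""
    counts = {n: int(annotation.get(n, 0)) for n in NODES}
    broad_binary = int(any(counts.values()))              # method 1: any disfluency at all
    node_binaries = [int(counts[n] > 0) for n in NODES]   # method 2: one binary value per node
    total_count = sum(counts.values())                    # method 3: raw disfluency count
    return broad_binary, node_binaries, total_count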

11.2.4 Automatic Speech Recognition Disfluency Extraction Each recorded sentence was processed using forced alignment via automatic speech recognition (ASR) with known text ground-truth. The Bavarian Archive for Speech Signals (BAS) EMU Magic ASR system (Kisler et al., 2017) was used to obtain the estimated transcript timings for words and pauses. The BAS EMU Magic is a free, cloud-based ASR system with many available language models, including Australian English. According to Kisler et al. (2017), BAS generates a word alignment accuracy of ~97% across six example languages using open large-vocabulary ASR models (for more details, refer to Kisler et al. (2017)). The automatic monitoring of hesitations was investigated using the ASR transcripts. Any pauses labeled at the start or end of a sentence file (i.e. leading/trailing silence) were removed, and the total number of ASR entries per speaker was calculated on a per-sentence basis. This calculation included hesitations, false-starts, word repeats, word additions, and/or pauses. A higher number of segmented transcript token word entries per sentence was considered indicative of a speaker's speech disfluency errors (as shown in Fig. 11.2).




[Spectrogram panel; word-level pseudo-phoneme transcript: DE w@s @ pVdl @nD@ S{O@ ("There was a puddle in the shower"), with pause tokens between words.]

Figure 11.2: An example spectrogram generated from the BDAS database of a severely depressed (i.e. QIDS-SR score of 23) female reading aloud the 'neutral' sentence "There was a puddle in the shower". The text-transcript pseudo-phoneme convention is shown along the top of the spectrogram. The vertical lines indicate segmented word tokens (i.e. corresponding to a phoneme-level ASR transcript). Speech hesitations (e.g. pauses) are denoted by pause tokens in the transcript. The loudness (dB-SPL) is indicated by the color bar on the right.
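A sketch of the per-sentence ASR-derived measures follows; the segment tuple format and the pause label are assumptions made for illustration (the actual BAS output is an alignment file), and leading/trailing pauses are assumed to have been removed already.

def asr_sentence_measures(segments, pause_label):
    """segments: list of (start_sec, end_sec, label) tokens from the forced alignment,
    covering words, pauses, repeats, and false-starts for one read sentence.
    pause_label: the label string used for pause tokens in the alignment (system-specific).
    """
    token_count = len(segments)                       # more token entries -> more disfluent reading
    pause_duration = sum(end - start
                         for start, end, label in segments
                         if label == pause_label)
    return token_count, pause_duration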

11.2.5 Linguistic Measures For automatic speech processing and depression classification, no previous studies have evaluated the impact of affective keyword location and narrative point-of-view on read speech. Therefore, linguistic text-based measures centered on the location of the affective keyword and the read narrative point-of-view were used to group each of the 20 sentences from the Black Dog Institute Affective Sentences (BDAS) corpus discussed earlier in Chapter 5.4.3. The affective keyword positions comprised the following sentence groups: beginning {1,5,8,12,16,18}, middle {4,10,11,13,20}, and end {2,3,6,7,9,14,15,17,19}. For the linguistic narrative point-of-view measures, each sentence was placed into one of three groups: first-person (i.e. I, my, me) {5,7,12,17}, third-person (i.e. he, she) {1,8,9,16,18,19}, and ambiguous {2,3,4,6,10,11,13,14,15,20}.
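For reference, these groupings can be expressed directly as sentence-index sets (a small organisational sketch; the variable names are illustrative only):

# Affective keyword position groups (1-based sentence numbers from Table 11.1).
KEYWORD_POSITION = {
    'beginning': {1, 5, 8, 12, 16, 18},
    'middle':    {4, 10, 11, 13, 20},
    'end':       {2, 3, 6, 7, 9, 14, 15, 17, 19},
}

# Narrative point-of-view groups.
POINT_OF_VIEW = {
    'first_person': {5, 7, 12, 17},
    'third_person': {1, 8, 9, 16, 18, 19},
    'ambiguous':    {2, 3, 4, 6, 10, 11, 13, 14, 15, 20},
}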

11.2.6 Affective Measures Affective text-based measures were estimated per sentence using the Sentiment Analysis and Cognition Engine (SEANCE), which is a free affective computing text processing toolkit (Crossley et al., 2017). As discussed previously in Chapter 4.1.3, the SEANCE text-processing toolkit contains several standard manually rated affective target word score references, such as the Affective Norms for English Words (ANEW). ANEW was used to expand the Brierley et al. (2007)


and Lawson et al. (1999) based 'negative' and 'neutral' affective sentence groups. Therefore, similarly to the negative-neutral affective threshold used by Brierley et al. (2007), a valence threshold value of greater than 6.0 was used to produce a 'positive' sentence group. The expanded 'positive' valence group consisted of the sentence set {4,12,16,19}. Although arousal, dominance, and fear-disgust were also evaluated per sentence, experiments herein focused on valence measures. It is worth mentioning that the analysis shown in Table 11.1 indicates that any fear-disgust score greater than '0' assured that a 'negative' mood emotion was present. Based solely on the fear-disgust affective sentence scores, it is proposed that this affective measure is useful for automatically labeling sentences as 'negative' or 'neutral'. Sentences {2} and {20} did not contain any affective target words found in the ANEW affect rating reference; thus, they generated '0' values. The relatively moderate size of the affective target word rating reference (i.e. approximately 12k English words) is a limitation concerning the use of automatic affective text-processing applications.

Table 11.1: Read phrases with 'positive', 'neutral', and 'negative' affective target words emboldened in light green, dark green, and red, respectively. Additionally, sentence affect values for arousal, valence, dominance, and fear-disgust are shown based on the SEANCE text-processing analysis toolkit (Crossley et al., 2017). This table shows that even sentences within the same 'neutral' or 'negative' group have varied levels of affect. The most extreme valence sentences {5,16} are indicated in bold.

#    Sentence                                           Arousal  Valence  Dominance  Fear-Disgust
{1}  He would abuse the children at every party          6.35     5.58     4.87       0.72
{2}  There was a crowd gathering around the entrance     0.00     0.00     0.00       0.00
{3}  The teacher made the class afraid                   5.36     3.84     4.55       0.21
{4}  There had been a lot of improvement to the city     5.24     6.03     5.74       0.00
{5}  The devil flew into my bedroom                      6.07     2.21     5.35       1.26
{6}  The chef's hands were covered in filth              4.76     4.21     4.58       0.69
{7}  My next door neighbor is a tailor                   3.80     5.13     4.69       0.00
{8}  The pain came as he answered the door               5.27     4.63     4.75       0.59
{9}  She gave her daughter a slap                        6.46     2.95     4.21       0.57
{10} There was a spider in the shower                    5.71     3.33     4.75       0.57
{11} There was a fire sweeping through the forest        7.17     3.22     4.49       0.08
{12} The swift flew into my bedroom                      5.39     6.46     6.29       0.00
{13} There had been a lot of destruction to the city     5.53     4.59     4.83       0.37
{14} The teacher made the class listen                   4.05     5.68     5.11       0.00
{15} There was a crowd gathering around the accident     6.26     2.05     3.76       0.59
{16} He would amuse the children at every party          6.12     7.47     5.47       0.00
{17} My uncle is a madman                                5.56     3.91     4.79       0.74
{18} The post came as he answered the door               4.60     5.88     5.27       0.00
{19} She gave her daughter a doll                        4.24     6.09     4.61       0.00
{20} There was a puddle in the shower                    0.00     0.00     0.00       0.00

For the speech voicing and disfluency features, values were initially averaged over all 20 sentences from each speaker. However, it was later proposed that concatenating the individual feature


values from each of the 20 sentences into a 20-dimensional feature vector might retain more sentence-specific feature information. Thus, the concatenated sentence feature approach was used for all experiments herein, except for the individual sentence experiments. For all 'negative' and 'neutral' sentences shown previously in Table 11.1, and for the additional extended 'positive' valence sentence group, the summation of the manual speech disfluency, speech voicing, ASR token word count, and ASR pause duration features was calculated using the following valence-based feature groupings:

S_{all} = \sum_{i=1}^{20} S_i    (11.1)

S_{neg} = \sum_{i \in \{1,3,5,6,8,9,10,11,13,15,17\}} S_i    (11.2)

S_{neut} = \sum_{i \in \{2,4,7,12,14,16,18,19,20\}} S_i    (11.3)

S_{pos} = \sum_{i \in \{4,12,16,19\}} S_i    (11.4)

S_all contains summed feature values for all 20 sentences, whereas S_neg, S_neut, and S_pos are specific to the respective valence affect groups. Note that S_neg and S_neut follow the 'negative' and 'neutral' sentence labels of Brierley et al. (2007) and Lawson et al. (1999). S_pos is an affective extension of S_neut based on sentence valence values above a score of 6.0, described previously and shown in Table 11.1. These four valence groups were chosen because they allow system performance to be compared across a broad set of affect ranges. Below, Fig. 11.3 shows the block diagram for the manual and automatic methods used. Note that ASR and manual analysis were required only for the speech disfluency features.

[Figure 11.3 block diagram: Speech Recording → Voice Activity Detection → Affective/Linguistic Text Processing → Sentence Selection {S1, S2, …, Sn} → Automatic Feature Extraction → Classification → Depressed/Non-Depressed Decision, with a parallel Automatic Speech Recognition (ASR) path and a Manual Feature Extraction path fed by Manual Labels.]

Figure 11.3: System configuration, with dashed lines indicating experimental configurations employing data selection based on valence affective text-processing parameters (e.g. ‘negative’, ‘neutral’, ‘positive’). The thin dotted lines indicate manually-analysed disfluency.
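As a concrete illustration of Eqs. (11.1)-(11.4), the sketch below sums a per-sentence feature value (e.g. a disfluency count, voicing frame count, ASR token count, or pause duration) over each valence sentence group; the input format is an assumption for illustration.

# Valence sentence groups from Table 11.1 (1-based sentence numbers).
NEGATIVE = (1, 3, 5, 6, 8, 9, 10, 11, 13, 15, 17)
NEUTRAL = (2, 4, 7, 12, 14, 16, 18, 19, 20)
POSITIVE = (4, 12, 16, 19)

def valence_group_sums(per_sentence):
    """per_sentence: dict {sentence_number: feature value} for one speaker."""
    total = lambda idx: sum(per_sentence[i] for i in idx)
    return {
        'S_all': total(range(1, 21)),   # Eq. (11.1)
        'S_neg': total(NEGATIVE),       # Eq. (11.2)
        'S_neut': total(NEUTRAL),       # Eq. (11.3)
        'S_pos': total(POSITIVE),       # Eq. (11.4)
    }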


11.3 Experimental Settings All experiments utilised 10-fold cross validation with a 90/10 training/test split to help maximize the data available for training and reduce overfitting. For the acoustic, speech voicing, and fusion experiments presented in Chapters 11.4.1, 11.4.2 and 11.4.4, LDA was applied because these features generally had higher dimensionality than the majority of the disfluency features. Depression classification for the speech disfluency features, both manual and automatic, was conducted using decision trees (similarly to Mitra et al. (2014)) due to their close resemblance to the speech disfluency tree structure shown previously in Chapter 11.2.3 in Fig. 11.1. Further, the decision tree allowed for a simple, interpretable treatment of the low-dimensional disfluency feature sets. All speech disfluency experiments found in Chapter 11.4.3 used the simple ('coarse') decision tree classifier from MATLAB with a maximum of 4 splits. For all classification experiments, performance was determined using overall accuracy and individual class F1 scores (similarly to Valstar et al. (2016)). For all experiments herein, the Black Dog Institute Affective Sentences (BDAS) corpus, as described previously in Chapter 5.4.3, was used on account of its clinically validated depression diagnoses and its constrained speech elicitation mode, which comprised read sentences with deliberately placed affective target words (see Table 11.1). While speakers from the BDAS database were not assessed for possible dyslexia or visual impairments, the prevalence of such conditions should be roughly equal for depressed and non-depressed speakers. As mentioned previously in Chapter 5.4.3, a WTAR assessment was given, which helped reduce the likelihood of including speakers with these types of reading disabilities.
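A minimal sketch of this evaluation setup, assuming scikit-learn rather than the MATLAB implementation actually used, is given below; the coarse-tree setting is approximated with a leaf-node limit.

import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

def evaluate(features, labels, low_dimensional=False):
    """10-fold cross-validated accuracy: LDA for the higher-dimensional acoustic/voicing
    features, a shallow decision tree for the low-dimensional disfluency features."""
    if low_dimensional:
        clf = DecisionTreeClassifier(max_leaf_nodes=5, random_state=0)  # roughly a max of 4 splits
    else:
        clf = LinearDiscriminantAnalysis()
    cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
    scores = cross_val_score(clf, np.asarray(features), np.asarray(labels),
                             cv=cv, scoring='accuracy')
    return scores.mean()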

11.4 Results and Discussion

11.4.1 Acoustic Analysis The all-sentence eGeMAPS baseline with the LDA classifier achieved a depression classification accuracy of 65%, with F1 scores of 0.68 and 0.63 for the depressed and non-depressed classes, respectively. The accuracies and F1 scores for individual sentences are shown in Table 11.2. Generally, single-sentence depression classification performance for eGeMAPS features was relatively low when compared with the all-sentence eGeMAPS baseline.


In comparing the classification performance produced by specific sentences, the sentence with the most positive valence affect score, {16}, achieved the best classification accuracy and F1 scores. Furthermore, based on the sentence valence affect scores previously shown in Table 11.1, two of the three best individual sentence results {5,6,16} corresponded to the most extreme valence scores. As hypothesized earlier in Chapter 11.1, sentence {1} (i.e. containing both highly positive and negative valence affective words) did not perform as well for depression classification as its oppositely paired affective sentence {16} (containing only positive valence affective words), or as other sentences without opposing valence words.

Table 11.2: Depressed (D) and Non-Depressed (ND) classification accuracy and F1 score performance using eGeMAPS features. Based on Brierley et al. (2007) and Lawson et al. (1999), affective sentences are listed with 'negative' (red) and 'neutral' (dark green) marked words, whereas the 'positive' (light green) marked words are based on valence scores above 6.0, derived previously in Table 11.1. {} indicates the sentence number.

#    Sentence                                           Accuracy  F1 D   F1 ND
{1}  He would abuse the children at every party          60%      0.66   0.52
{2}  There was a crowd gathering around the entrance     59%      0.63   0.52
{3}  The teacher made the class afraid                   57%      0.64   0.46
{4}  There had been a lot of improvement to the city     59%      0.60   0.57
{5}  The devil flew into my bedroom                      64%      0.67   0.62
{6}  The chef's hands were covered in filth              66%      0.69   0.61
{7}  My next door neighbor is a tailor                   53%      0.56   0.49
{8}  The pain came as he answered the door               47%      0.52   0.41
{9}  She gave her daughter a slap                        51%      0.53   0.50
{10} There was a spider in the shower                    57%      0.61   0.53
{11} There was a fire sweeping through the forest        56%      0.58   0.54
{12} The swift flew into my bedroom                      56%      0.61   0.49
{13} There had been a lot of destruction to the city     56%      0.56   0.55
{14} The teacher made the class listen                   56%      0.58   0.54
{15} There was a crowd gathering around the accident     53%      0.55   0.51
{16} He would amuse the children at every party          70%      0.72   0.68
{17} My uncle is a madman                                57%      0.60   0.55
{18} The post came as he answered the door               56%      0.55   0.56
{19} She gave her daughter a doll                        61%      0.63   0.60
{20} There was a puddle in the shower                    59%      0.58   0.59

The training and testing of affective valence target word sentence opposites using their concatenated eGeMAPS features was also investigated, to examine how valence extremes affect depression classification performance, as shown in Table 11.3. Again, the three best depression classification performances were produced by the sentence pairs {1,16}, {5,12}, and {8,18}, two of which have the most extreme valence affect scores, as indicated previously in Table 11.1.


Table 11.3: Depressed (D) and Non-Depressed (ND) classification accuracy and F1 scores using concatenated eGeMAPS features. Specific sentences that are grouped together are shown in {}. † indicates that these sentences do not have valence affective target word opposites and similar syntactic structure to each other.

Sentence Opposites   Accuracy  F1 D   F1 ND
{1,16}                64%      0.66   0.63
{2,15}                54%      0.57   0.52
{3,14}                49%      0.53   0.44
{4,13}                60%      0.62   0.59
{5,12}                67%      0.70   0.64
{6,11}†               56%      0.62   0.48
{7,17}                54%      0.56   0.53
{8,18}                66%      0.66   0.66
{9,19}                57%      0.62   0.54
{10,20}               56%      0.56   0.55

As shown in Table 11.4, training and testing of affective sentence groups (e.g. 'negative', 'neutral', 'positive') using their concatenated eGeMAPS features demonstrated that the 'neutral' eGeMAPS valence sentence group did not surpass the depression classification performance of the all-sentence eGeMAPS baseline. However, the 'neutral' group results were still quite similar (2% absolute difference) despite comprising less data. Notably, the designated 'negative' valence eGeMAPS group performed the worst when compared with the all-sentence eGeMAPS baseline (9% absolute difference). Upon extending the labeled affect sentences into an additional 'positive' affective group, based on the extracted valence scores shown previously in Table 11.1, the higher valence score sentence group resulted in a slight accuracy improvement (2% absolute gain) over the all-sentence baseline. Most importantly, the 'negative' valence affective sentence group performed the poorest, indicating that for broad acoustic-based feature sets, 'negative' valence sentences are less effective than 'neutral' or 'positive' valence sentences for automatic depression classification.

Table 11.4: Depressed (D) and Non-Depressed (ND) classification accuracy and F1 score performance using eGeMAPS features. Specific sentences that are grouped together are shown in {}.

Groups Based on Affective Measures        Accuracy  F1 D   F1 ND
Positive {4,12,16,19}                       67%      0.69   0.66
Neutral {2,4,7,12,14,16,18,19,20}           63%      0.66   0.59
Negative {1,3,5,6,8,9,10,11,13,15,17}       56%      0.59   0.52

With respect to training and testing read sentences grouped by linguistic affective keyword position, using their concatenated eGeMAPS features as shown in Table 11.5, sentences with the affective target word towards the beginning of the sentence performed better than those with the affective target word in the middle or at the end. Herein, the sentences with affective keywords at the beginning achieved a 5% absolute improvement in depression classification accuracy over the all-sentence eGeMAPS baseline. Results in Table 11.5 support the previously mentioned hypothesis found in


Chapter 11.1 that read sentences with affective target words located at the beginning are more effective for depression classification using acoustic-based features. These results provide evidence that, for acoustic features, an early affective target word position induces more distinctive affective paralinguistic cues throughout the remainder of the sentence, whereas this is not the case for other affective target word positions.

Table 11.5: Depressed (D) and Non-Depressed (ND) classification accuracy and F1 score performance using eGeMAPS features. Specific sentences that are grouped together are shown in {}.

Groups Based on Affect Keyword Position   Accuracy  F1 D   F1 ND
Beginning {1,5,8,12,16,18}                  70%      0.70   0.70
Middle {4,10,11,13,20}                      60%      0.62   0.58
End {2,3,6,7,9,14,15,17,19}                 61%      0.65   0.57

Although not shown herein, for the linguistic point-of-view concatenated eGeMAPS features, results showed that the ambiguous point-of-view sentences {2,3,4,6,10,11,13,14,15,20} achieved 63% depression classification accuracy with F1 scores of 0.66 (0.59). The first-person {5,7,12,17} and third-person {1,8,9,16,18,19} point-of-view groups attained depression classification accuracies and F1 scores of 57%, 0.61 (0.52), and 59%, 0.59 (0.58), respectively. This may be due to the larger feature dimensionality of the ambiguous group (i.e. it contains more sentences).

11.4.2 Speech Voicing Analysis An initial analysis comparing depressed and non-depressed speakers demonstrated marked differences in the average number of speech frames in the 'voiced' and 'unvoiced' groups, as shown in Fig. 11.4. In Fig. 11.4(a), for all sentences, the mean number of 'speech' (i.e. 'voiced' plus 'unvoiced') frames for non-depressed speakers was statistically significantly lower on a sentence-by-sentence basis. Interestingly, Fig. 11.4(b) indicates that the mean number of 'voiced' frames for non-depressed and depressed speakers is relatively similar; only one sentence {12} showed a significant increase in the number of 'voiced' frames for depressed speakers. The increased speech duration for depressed speakers shown in Fig. 11.4(a) is thus a consequence of depressed speakers producing a greater number of 'unvoiced' frames than non-depressed speakers. Fig. 11.4(c) shows several statistically significant differences in the 'unvoiced' frame counts between non-depressed and depressed speakers.



Figure 11.4: (a) Number of 'Speech' (voiced + unvoiced); (b) 'Voiced'; and (c) 'Unvoiced' frames per sentence for non-depressed speakers (blue) and depressed speakers (red). The order of the sentences is indicated by the x-axis value (e.g. 1-20). The mean is indicated by a circle, with the 25th to 75th percentile range shown as a thick bar. The narrower line indicates the outer percentile ranges, while outliers are indicated by small individual dots. Starred and double-starred brackets indicate pairs of results that were statistically different based on a paired t-test at p = 0.05 and p = 0.01, respectively.

For both non-depressed and depressed speakers, sentences {1} and {2} likely have a greater number of 'speech' frames than other sentences because speakers were still adjusting to the reading task expectations. According to Barrett & Paus (2002), for depressed and non-depressed speakers, read sentences with negative (e.g. sad) affect were significantly longer in duration than


positive (e.g. happy) affect sentences. The affect group results summarized in Table 11.6 agree with Barrett & Paus' findings, as the overall average speech duration is longer for 'negative' valence sentences than for 'neutral'. However, for the depressed speakers, the 'neutral' valence sentence durations were found to be slightly longer than their 'negative' sentence durations (see Table 11.6). This analysis of 'voiced' and 'unvoiced' frames helps to substantiate previous observations in studies (Flint et al., 1993; France et al., 2000; Hashim et al., 2012; Kiss & Vicsi, 2015; Sahu & Espy-Wilson, 2016) that reported depressed speakers as having lower intensity and breathy speech characteristics. Also, unlike previous speech voicing analysis (Alghowinem et al., 2013b) that utilised spontaneous speech and speech voicing ratios, the text-dependent read speech material in the BDAS database allowed for a much more controlled, identical-sentence comparison of speakers' raw voiced and unvoiced frame counts.

Table 11.6: Comparison of average non-depressed (ND) and depressed (D) 'voiced', 'unvoiced', and 'speech' frame counts per sentence group. Pairs with a statistically significant t-test difference are shown by * (p = 0.05) and ** (p = 0.01). Specific sentences are shown by {}.

Speech Voicing & Valence Groups          ND Voiced  D Voiced  ND Unvoiced  D Unvoiced  ND Speech  D Speech
Positive {4,12,16,19}                      133**      148**     116**        162**       249**      310**
Neutral {2,4,7,12,14,16,18,19,20}          131**      144**     115**        158**       246**      303**
Negative {1,3,5,6,8,9,10,11,13,15,17}      130**      142**     123**        155**       252**      296**

Results using the speech voicing features are shown in Tables 11.7, 11.8, and 11.9 below. In comparison with the eGeMAPS features, including the baseline, the 'speech' voicing (i.e. 'voiced' plus 'unvoiced') and 'unvoiced' features performed considerably better (up to 12% absolute gain). The 'voiced' feature results are not shown in Tables 11.7-11.9 because they performed no better than chance level. In Table 11.8, similarly to the eGeMAPS features, the 'speech' voicing features performed best for sentences with the affective target word occurring at the beginning position. Further, also shown in Table 11.8, for the 'speech' voicing features, sentences with the affective target word towards the end of the sentence performed the poorest. These results again support the idea that, for acoustic-based features especially, elicitation of the affective target word at the beginning of a sentence provides better depression classification performance. For the linguistic narrative point-of-view groups, results in Table 11.9 show that for the 'speech' voicing and 'unvoiced' features, the first-person point-of-view group produced slightly better depression classification results than the other groupings. Note that the first-person point-of-view group had considerably less data (i.e. fewer sentences) than the third-person and ambiguous point-of-view groups.


Table 11.7: Depressed/Non-Depressed classification performance using 'speech' and 'unvoiced' frame count features. The affective sentence groups were based on the 'neutral' and 'negative' valence affective labels from Brierley et al. (2007) and Lawson et al. (1999), along with the extended 'positive' sentence grouping based on valence scores. Specific sentences are shown by {}.

'Speech' Feature                          Classification Accuracy  F1 Depressed  F1 Non-Depressed
All {1-20}                                  70%                      0.67          0.73
Positive {4,12,16,19}                       77%                      0.75          0.79
Neutral {2,4,7,12,14,16,18,19,20}           76%                      0.74          0.77
Negative {1,3,5,6,8,9,10,11,13,15,17}       67%                      0.64          0.70

'Unvoiced' Feature                        Classification Accuracy  F1 Depressed  F1 Non-Depressed
All {1-20}                                  70%                      0.67          0.73
Positive {4,12,16,19}                       73%                      0.71          0.75
Neutral {2,4,7,12,14,16,18,19,20}           71%                      0.69          0.74
Negative {1,3,5,6,8,9,10,11,13,15,17}       66%                      0.63          0.68

Table 11.8: Comparative depressed/non-depressed classification performance using 'speech' and 'unvoiced' frame count features with linguistic affective target word location groups.

'Speech' Feature                  Classification Accuracy  F1 Depressed  F1 Non-Depressed
Beginning {1,5,8,12,16,18}          77%                      0.77          0.78
Middle {4,10,11,13,20}              71%                      0.70          0.73
End {2,3,6,7,9,14,15,17,19}         63%                      0.58          0.67

'Unvoiced' Feature                Classification Accuracy  F1 Depressed  F1 Non-Depressed
Beginning {1,5,8,12,16,18}          70%                      0.69          0.71
Middle {4,10,11,13,20}              69%                      0.67          0.70
End {2,3,6,7,9,14,15,17,19}         69%                      0.72          0.65

Table 11.9: Depressed/Non-Depressed classification performance using 'speech' and 'unvoiced' frame count features with linguistic point-of-view measure groups.

'Speech' Feature                                  Classification Accuracy  F1 Depressed  F1 Non-Depressed
Speech First-Person {5,7,12,17}                     74%                      0.72          0.76
Speech Third-Person {1,8,9,16,18,19}                70%                      0.66          0.73
Speech Ambiguous {2,3,4,6,10,11,13,14,15,20}        67%                      0.65          0.69

'Unvoiced' Feature                                Classification Accuracy  F1 Depressed  F1 Non-Depressed
Unvoiced First-Person {5,7,12,17}                   71%                      0.68          0.74
Unvoiced Third-Person {1,8,9,16,18,19}              69%                      0.66          0.71
Unvoiced Ambiguous {2,3,4,6,10,11,13,14,15,20}      67%                      0.65          0.69

11.4.3 Verbal Disfluency Analysis A comparison between non-depressed and depressed speakers of the BDAS database indicated a statistically significant (based on a t-test at a significance level of p = 0.01) higher prevalence of hesitations and spoken word errors for depressed speakers. The average sentence hesitation prevalence across all read sentences was ~9% for depressed speakers, whereas it was only ~4% for non-depressed speakers (see Fig. 11.5). The non-depressed speaker average is close to the average


percentage of hesitation/repetition occurrences found in Gósy et al. (2003), wherein a disfluency rate of approximately 5% was recorded for interview-driven spontaneous conversational speech. In general, for conversational speech, the Goldman-Eisler (1968) and Gósy et al. (2003) experiments suggest that hesitations occur naturally roughly every 7-9 words. However, these studies (Goldman-Eisler, 1968; Gósy et al., 2003) evaluated speech generated from spontaneous conversational interviews, which is expected to be more cognitively demanding than read speech. It is also known that, unlike read speech, conversational speech contains a considerable number of natural interruptions (Garman, 1990; Shriberg, 2005). In a comparative study on read versus spontaneous speech, Howell & Kadi-Hanifi (1991) found that read speech had fewer hesitations.

[Figure 11.5 bar chart; y-axis: Total Number (0-70); x-axis categories: Hesitation, Hesitation Word Repeat, Hesitation Phrase Repeat, Speech Error, Speech Error Uncorrected.]

Figure 11.5: Total number of manually annotated disfluencies recorded for all non-depressed (blue) and depressed (red) speakers for all sentences shown previously in Chapter 11.2 in Table 11.1. Note that the hesitation category includes speech pauses and word/phrase repeats.

Additionally, as shown in Fig. 11.5, the depressed speaker group had nearly four times as many speech errors (~8%) as the non-depressed speaker group (~2%). As further evidence of the rarity of read speech errors, in a three-month study by Cowie (1985), entire word-level 'slips of the eye' (i.e. speech errors) were exceedingly rare among adult participants. Furthermore, of particular interest was the unusually high number of recorded malapropisms generated by depressed speakers when compared with non-depressed speakers. Examples of malapropisms produced (many more than once) by depressed speakers in the BDAS database were: abuse/amuse, accident/incident, chef/chief, destruction/disruption, she/he, tailor/traitor, and madman/madam. It is known that depression disorders limit the degree of cognitive planning and strategy use during multitasking (Hartlage et al., 1993). Non-depressed speakers are more focused than depressed speakers, and thus more likely to utilise a silent pre-reading strategy (i.e. read the sentence internally, then verbalize it). Based on the read speech error analysis herein, and previous studies on cognitive decline in speakers with depression disorders (Levens & Gotlib, 2015; Silberman et al., 1983; Roy-Byrne et


al., 1986; Rubino et al., 2011; Weingartner et al., 1981), the high rate of word errors produced by depressed speakers further substantiates that they are more prone to mental-lexicon word retrieval failures, referential failures, or concentration restrictions than non-depressed speakers. Unexpectedly, the analysis of speech error corrections was contrary to the initial hypothesis that depressive speakers would omit verbal self-corrections due to their reliance on avoidance coping strategies (Holahan & Moos, 1987; Holahan et al., 2005). During the sentence reading tasks, as shown previously in Fig. 11.5, the depressed and non-depressed speakers made approximately the same amount of effort to identify and verbally correct their speech errors. Non-depressed speakers attempted to correct ~93% of their speech errors, whereas depressed speakers' attempted corrections were slightly lower at ~86%. This correction percentage was calculated as the number of corrected speech errors divided by the total number of speech errors. Although the overall occurrence of hesitations differed between the non-depressed and depressed speaker groups, sentences {6,12,15} had the highest number of hesitations for both groups. The similarity in hesitations and speech errors for specific sentences could indicate that these sentences are generally more difficult to read than others due to their linguistic syntactic or lexical content. For both non-depressed and depressed speakers, sentences {2,15} exhibited the most recorded speech errors. It is believed that this was mainly due to the word "around"; its pronunciation by speakers was often phonetically reduced to "round". This kind of articulatory deletion indicates that careful consideration should be taken in designing read sentence tasks for speech elicitation and speech error analysis; e.g. avoiding sentence structures with potential colloquialisms, contractions, and acronyms to help maintain consistency of the text-dependent material. Moreover, based on evidence from the experimental analysis of disfluencies herein, it is suggested that more attention should be placed on elicitation design for speech-based depression analysis, especially if the purpose is to ascertain a greater collection of specific types of speech errors from speakers. In regard to eliciting particular kinds of speech errors, Gable et al. (1992) found that hesitation frequency increases as task demand increases, whereas speech errors occurred more frequently under low task demand. Shriberg (1994) also noted that speech disfluencies occur at a higher rate during longer sentences than shorter ones, and that a variety of errors are often clustered together rather than spread out evenly over time. Considerations from the aforementioned studies (Gable et al., 1992; Shriberg, 1994) and the new findings herein on speech disfluencies provide guidance for future speech data collection protocols for the analysis of individuals with depression disorders.


For all combined sentences, as shown by the analysis in Fig. 11.6, the depressed speaker group showed a larger number of ASR token word entries and longer pause durations than most of the non-depressed speaker group. These findings indicate that the greater number of token word entries for depressed speakers is due to their increase in overall speech disfluencies.

Figure 11.6: (left) the total duration of ASR ‘pauses’ based on transcripts; (right) the total number of ASR word token counts (i.e. includes words, pauses, disfluencies). These boxplots are based on combined totals for all sentences previously defined in Table 11.1.

Prior studies (Alghowinem et al., 2012; Esposito et al., 2016; Szabadi et al., 1976) that evaluated speech hesitations, especially pauses, have generally focused heavily on rate-of-speech type features. Ultimately, the rate of speech is calculated as the number of phonemes produced over a designated time period. A potential drawback of generalized speech rates is that they do not provide information regarding the moments at which a speaker was most fluent or disfluent, unless the data is segmented at the sentence/phrase level. In addition, two speakers could achieve the same rate even though their number of disfluencies, or the distribution of those disfluencies over time, could be very different. For instance, one speaker could have five recorded hesitations that in total equal the same duration as another speaker's single hesitation. It is known that the rate of articulation in spontaneous speech is highly idiosyncratic (Goldman-Eisler, 1961, 1968). Studies have shown that spontaneous speech is faster and has significantly greater interval variability than read speech (Cichocki, 2015; Trouvain et al., 2001). Additionally, spontaneous speech uses different hierarchical acoustic-prosodic cues when compared with read speech, such as intonation, phoneme duration, and spectral features (Haynes et al., 2015; Laan, 1992). Therefore, speech rate ratio type features may be less effective for read speech unless examined at incremental sentence/phrase levels. During experimentation herein, speech phoneme rate ratios were briefly explored. However, the preliminary results on the BDAS read


speech performed poorly for depression classification, for both individual sentence evaluation and concatenated sentence speech rate ratio features. By using a manually annotated broad binary disfluency feature (a single value per sentence), a depression classification accuracy of 67% was obtained with F1 scores of 0.61 (0.68). Furthermore, evaluating the five binary speech disfluency features using a decision tree classifier, a depression classification accuracy of 67% and F1 scores of 0.58 (0.73) were achieved, indicating that the monitoring and generic identification of any kind of speech disfluency type (e.g. hesitation, repeat, speech error) is useful for identifying depressive characteristics. By summing the raw disfluency values per disfluency feature for all sentences and using a decision tree approach, an average depression classification accuracy of 69% and F1 scores of 0.69 (0.69) were attained. Further, Table 11.10 shows the speech disfluency feature results for all individual disfluency raw counts and valence affect groups. Combinations of many of the speech disfluency feature types were explored; however, the 'neutral' hesitations contributed the most to higher classification performance. The 86% classification accuracy produced by manual hesitation features was unsurpassed, even using many variations of other speech disfluency and valence affective sentence groupings. Interestingly, for speech errors, the 'negative' valence sentences generated the best depression classification results (70%), whereas the 'positive' sentences performed the worst (37%). The linguistic narrative point-of-view was also examined for the speech disfluency features; however, it appeared to have little impact on their performance.

Table 11.10: Depressed (D) and Non-Depressed (ND) classification performance using manually annotated hesitation and speech error features; summed raw counts for all sentences; simple tree classifier.

Speech Disfluency Feature Types                       Classification Accuracy  F1 D   F1 ND
Hesitation (H)                                          63%                      0.57   0.68
Hesitation Word Repeat (HWR)                            55%                      0.24   0.69
Hesitation Phrase Repeat (HPR)                          60%                      0.42   0.70
Speech Error (SE)                                       71%                      0.71   0.72
Speech Error Uncorrected (SEU)                          70%                      0.69   0.71

Speech Disfluency Types & Valence Groups              Classification Accuracy  F1 D   F1 ND
H Positive {4,12,16,19}                                 63%                      0.54   0.69
H Neutral {2,4,7,12,14,16,18,19,20}                     86%                      0.88   0.83
H Negative {1,3,5,6,8,9,10,11,13,15,17}                 61%                      0.60   0.63
SE Positive {4,12,16,19}                                37%                      0.37   0.37
SE Neutral {2,4,7,12,14,16,18,19,20}                    64%                      0.59   0.68
SE Negative {1,3,5,6,8,9,10,11,13,15,17}                70%                      0.64   0.74
Neutral {H} + Negative {SE} + Positive {SE}             83%                      0.85   0.81
Neutral {SE} + Negative {H} + Positive {H}              63%                      0.57   0.68
Neutral {H,SE} + Negative {H,SE} + Positive {H,SE}      83%                      0.84   0.81

As shown in Table 11.11, the ASR approaches to identifying speech disfluencies require further investigation and refinement, as the general depression classification performance of these features was poorer than that of the manual methods. The best automated speech disfluency feature result was observed for the 'positive' valence affect grouping, which attained 69% depression classification accuracy with F1 scores of 0.62 (0.73). The ASR token word count results run counter to the previously shown manual results in Table 11.10, wherein 'negative' valence affect sentences were more discriminative of depression. It is believed that a parameter threshold on the number of token word entries allowed per sentence may be a factor (i.e. token insertions may have been too aggressive).

Table 11.11: Depressed/Non-Depressed classification performance using ASR token word count and ASR pause duration (PD) features; summed raw counts for sentences; simple tree classifier.

Automatic Disfluency Feature                          Classification Accuracy  F1 D   F1 ND
Token Word Count All (TWC)                              53%                      0.42   0.60
Pause Duration All (PD)                                 60%                      0.61   0.59

Automatic Disfluency Feature & Valence Group          Classification Accuracy  F1 D   F1 ND
TWC Positive {4,12,16,19}                               69%                      0.62   0.73
TWC Neutral {2,4,7,12,14,16,18,19,20}                   50%                      0.34   0.60
TWC Negative {1,3,5,6,8,9,10,11,13,15,17}               64%                      0.53   0.71
PD Positive {4,12,16,19}                                60%                      0.63   0.56
PD Neutral {2,4,7,12,14,16,18,19,20}                    60%                      0.64   0.55
PD Negative {1,3,5,6,8,9,10,11,13,15,17}                57%                      0.58   0.56

11.4.4 Affect-Based Feature Fusion The fusion approaches using the n-best features include the summed manual speech disfluency, speech voicing, and ASR-derived features. The final fusion experiments focused on these particular features because, in the experimental results previously reported in Chapters 11.4.2 and 11.4.3, they provided the highest accuracies. In the proposed affect-based fusion approach, the per-affect features were concatenated into a single feature vector, which differs from the other features tested previously in that it contains information about each separate affect group. Results shown in Table 11.12 indicate that, used individually, some feature types perform better than others, or better than using all sentences. For instance, for manually annotated speech errors, the 'negative' affect sentences performed best (70%), whereas for the automatic 'speech' voicing features the 'neutral' affect sentences performed best (79%). The fusion of separate sentence-valence affect groups generated the best result for a single feature type based on the manual hesitation features (91%). With the exception of the 'speech' voicing

190

features, the affect-fusion approach strongly demonstrates the benefit of fusing features from different affective sentence groups. Moreover, the automatically derived speech features (‘speech’ voicing, ASR word token entries, ASR Pause Durations) using the multi-feature fusion approach were able to slightly surpass (~4% absolute gain) the eGeMAPS baseline performance (65%). However, by fusing ‘speech’ voicing and ASR token automatic features using only the ‘positive’ affect sentence group, a depression classification accuracy of 84% and F1 score of (0.83) 0.85 was achieved. Further, by only fusing the manual hesitations with the ASR pause durations, an accuracy of 100% was attained for 2-class depression classification. The fusion results herein indicate that there is more discriminative information within each type of valence affective sentence grouping, and when employed collectively, this information is more powerful for depression classification than any individual valence and/or non-specific valence grouping. Table 11.12: Depressed/Non-Depressed 2-class classification performance based on raw summed hesitation, speech error, speech voicing, ASR hesitation, and ASR pause length features. Individual feature results are shown for all, ‘negative’, ‘neutral’, and ‘positive’ affect sentence groups. In addition, multi-feature type fusion results that include these four types of features (e.g. all, ‘negative’, ‘neutral’, ‘positive’) are shown. Manual (blue) and automatic (orange) feature methods have their results in bold to indicate the best results per individual feature type. Feature Types Hesitations Speech Errors Hesitations + Speech Errors Speech Voicing ASR Word Token Entries ASR Pause Durations Speech Voicing + ASR Word Token Entries + ASR Pause Durations

ALL 70% 0.64 (0.74) 66% 0.61 (0.69) 70% 0.67 (0.73)

NEGATIVE 61% 0.60 (0.63) 70% 0.64 (0.74) 70% 0.64 (0.74)

NEUTURAL 70% 0.63 (0.75) 64% 0.59 (0.68) 69% 0.62 (0.73)

POSITIVE 61% 0.60 (0.63) 40% 0.36 (0.43) 60% 0.56 (0.63)

FUSED 91% 0.92 (0.91) 70% 0.63 (0.75) 87% 0.88 (0.87)

70% 0.67 (0.73) 59% 0.52 (0.63) 60% 0.53 (0.65) 73% 0.70 (0.75)

70% 0.68 (0.72) 57% 0.50 (0.63) 59% 0.51 (0.64) 66% 0.61 (0.69)

79% 0.78 (0.80) 60% 0.53 (0.65) 64% 0.59 (0.68) 70% 0.67 (0.73)

77% 0.75 (0.79) 67% 0.64 (0.70) 64% 0.59 (0.68) 77% 0.76 (0.78)

76% 0.75 (0.77) 67% 0.63(0.70) 70% 0.68 (0.72) 69% 0.68 (0.70)
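A minimal sketch of the multi-affect fusion idea described above: per-speaker features are computed separately for the negative, neutral, and positive sentence groups and then concatenated into one vector, rather than pooling all sentences into a single affect-agnostic feature. The valence group sentence indices follow Tables 11.10-11.11; the per-group feature (a summed raw count) and the example data are illustrative.

```python
import numpy as np

VALENCE_GROUPS = {
    "positive": [4, 12, 16, 19],
    "neutral":  [2, 4, 7, 12, 14, 16, 18, 19, 20],
    "negative": [1, 3, 5, 6, 8, 9, 10, 11, 13, 15, 17],
}

def per_group_feature(counts_per_sentence, sentence_ids):
    """Sum a raw disfluency count over the sentences in one valence group."""
    return sum(counts_per_sentence.get(sid, 0) for sid in sentence_ids)

def multi_affect_vector(counts_per_sentence):
    """Concatenate the per-valence sums into a single fused feature vector."""
    return np.array([per_group_feature(counts_per_sentence, ids)
                     for ids in VALENCE_GROUPS.values()], dtype=float)

# Example: hesitation counts keyed by sentence number for one speaker.
hesitations = {3: 2, 7: 1, 12: 1, 16: 0, 17: 3}
print(multi_affect_vector(hesitations))   # [1. 2. 5.] for positive/neutral/negative
```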

Due to the competitive 2-class fusion classification results achieved, a 3-class classification experiment was conducted to examine whether this type of affect feature fusion could also offer good discrimination between the melancholic and non-melancholic subtypes of depression discussed previously in Chapter 2.2. The 4-best feature fusion (i.e. manual hesitations, speech errors, speech voicing, ASR pause durations) was applied to 3-class (i.e. non-depressed, non-melancholic, melancholic) depression subtype classification. An average classification accuracy of 80% was achieved for the 3-class task, using LDA with 10-fold cross-validation as per earlier experiments in this chapter. The 80% result for the 3-class task is strong: previous speech-based depression studies using a similar database (Alghowinem et al., 2013a, 2013b; Cummins et al., 2011) achieved between 60% and 80% accuracy for 2-class classification. Furthermore, as shown in Table 11.13, a 2-best feature fusion (i.e. manual hesitations, ASR pause durations) provided a slight boost to 83% 3-class classification accuracy. These improved 3-class results produced no false alarms in the non-depressed group. The 3-class fusion results in Table 11.13 indicate performance advantages for classifying depression subtypes using different feature types along with affect considerations.

Table 11.13: Confusion matrix using the 2-best feature fusion combination of manual hesitations and ASR pause durations from Table 11.12, but with a 3-class experiment distinguishing non-depressed, non-melancholic, and melancholic depression subtypes. Note the low number of false alarms for non-depressed speakers; the majority of speakers were identified with the correct class label (diagonal entries).

 | Non-Depressed | Non-Melancholic | Melancholic
Non-Depressed | 35 | 0 | 0
Non-Melancholic | 0 | 14 | 7
Melancholic | 0 | 5 | 7
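A minimal sketch of the 3-class subtype experiment described above, assuming scikit-learn's LinearDiscriminantAnalysis with 10-fold cross-validation and a confusion matrix over non-depressed, non-melancholic, and melancholic labels; the synthetic arrays stand in for the fused feature vectors and true labels.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import cross_val_predict

rng = np.random.default_rng(0)
X = rng.normal(size=(68, 2))      # e.g. [summed hesitations, summed pause duration]
y = rng.integers(0, 3, size=68)   # 0 = non-depressed, 1 = non-melancholic, 2 = melancholic

lda = LinearDiscriminantAnalysis()
y_hat = cross_val_predict(lda, X, y, cv=10)
print(confusion_matrix(y, y_hat))
```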

11.4.5 Limitations
Like other studies of this kind, a relatively small quantity of speech data was used. Despite this limitation, other health disorders have been shown to be correctly identified and classified from very small amounts of speech. For example, Rusz et al. (2017) demonstrated that, using only a single held-vowel sound, Parkinson's disease classification accuracy of up to 90% could be achieved. Other speech-based neurological classification studies (e.g. amyotrophic lateral sclerosis, dementia, autism, traumatic brain injury) that achieved high automatic classification performance using very small amounts of speech data include Bonneh et al. (2011); Orozco-Arroyave et al. (2016); Wang et al. (2003); and Yunusova et al. (2016).


11.5 Summary
A major advantage of text-dependent read speech over spontaneous speech is that feature differences can be measured between speakers' read utterances with constrained phonetic variability and knowledge of the ground-truth text. Based on experimental results herein, applying multi-dimensional affect (e.g. negative, neutral, positive) features was shown to improve depression classification relative to an affect-agnostic approach. Furthermore, of particular interest regarding elicitation design, results show that linguistic information (e.g. affective keyword choice, position, point-of-view) can also help to boost classification performance, especially for acoustic-based features.

This study is the first to examine the manner in which depressed and non-depressed individuals produce disfluencies during spoken affective sentence reading tasks. The newly proposed speech error features were shown to provide very strong depression classification accuracy. Although the best of these results depend on manual annotation, they still have important practical implications, as observing speech errors would be a very quick and simple test to administer in a clinical setting or via a web-based service, including by a non-clinician/non-expert. This 'all or nothing' speech disfluency evaluation technique is already common practice in many speech-language pathology assessments (e.g. articulation, verbal recall, stammering). Moreover, it was shown that automatic transcription can be used to a reasonable extent to extract speech disfluency features, in the form of extraneous speech token word and pause entries, and thereby improve classification accuracy.

Results using automatically derived 'speech' voicing, token word, and/or pause duration features show major advantages to analyzing these features per affective group. At best, selective automatic features with specific affect sentence grouping analysis achieved up to 19% absolute depression classification accuracy gains over the affect-agnostic acoustic baseline. By investigating sentences with affective target words and particular linguistic characteristics, new insights were gained into how the careful construction of assessment protocol sentences can influence automatic depression classification performance. The proposed monitoring of word-level speech errors in read speech is a new area of exploration for automatic speech depression classification, and the results presented in this study suggest a new direction for speech elicitation protocol design for depression severity classification.


Experimental results presented herein demonstrate, in a collective sense, that all types of affect (e.g. negative, neutral, positive) are important in the speech-based analysis of depression. Previously, in Chapter 11, gains in depression classification accuracy were achieved by constraining feature data analysis within specific affective regions. The affective read sentence depression classification results presented in this chapter demonstrated similar accuracy improvement trends, with most features based on specific affective sentence groups (e.g. negative, neutral, positive). However, the proposed multi-affect feature fusion technique demonstrated even better depression classification accuracy (i.e. up to 100%) than any single affective sentence group. Remarkably, the majority of the features presented in these depression classification experiments have a relatively low feature dimension (fewer than 20), yet were highly effective. In some instances, sentence-based features were compactly combined into a single 1-dimensional feature per speaker that yielded a high degree of classification accuracy (79%). Lastly, while speech-based depression classification studies have thus far only rarely reported on depression subtypes, the fusion results presented herein for melancholic/non-melancholic/non-depressed multi-class classification show much promise. The future of automatic speech-based depression severity analysis and its clinical diagnostic utility will be enhanced significantly if subtypes of depression can be adequately identified and separated from other disorders/diseases with potentially overlapping symptoms.


Chapter 12
INVESTIGATION OF PRACTICAL CONSIDERATIONS FOR SMARTPHONE APPLICATIONS

12.1 Overview
It is anticipated that engineers who create speech-based depression classification systems for smartphones will have to consider optimizing their designs for a very large number of differing handsets. While there have been many speech studies investigating the effects of cross-channel degradation on speaker, language, and emotion identification (Richardson et al., 2016), the impact of smartphone audio channel conditions on paralinguistic classification problems such as depression recognition has received surprisingly little consideration from researchers. As mentioned earlier in Chapter 5.3, the quality of speech information can be impacted by nuisance factors, such as device limitations and channel conditions found in modern smartphones. Therefore, questions remain concerning how analogous smartphone manufacturers/versions are to each other in terms of their speech signal representations, and, more importantly for speech-based depression classification, which kinds of acoustic features are most impacted by smartphone device channel variability and whether these features behave consistently across different devices.

The differing audio acquisition techniques employed by different mobile devices introduce unwanted channel and environmental noise variability into the speech-based depression classification problem. Even when using the same audio recording application, the acoustic recording properties of smartphones vary greatly depending on the manufacturer, and can even vary within same-manufacturer smartphone versions (Brown & Evans, 2011; Kardous & Shaw, 2014; Robinson & Tingay, 2014). While AppleTM and Google AndroidTM currently represent the majority of all smartphone software operating system platforms, the latter has a staggering 87% global market dominance (IDC, 2016). Despite this popularity, research has indicated less acoustic signal


conformity among AndroidTM devices (Kardous & Shaw, 2014), attributed to the greater number of different manufacturers' hardware and software designs. Based on the aforementioned cross-channel studies, it is hypothesized that grouping speech data from similar smartphone manufacturers and/or manufacturer versions for specific speech-based depression modeling will achieve superior classification performance when compared with a device-agnostic approach. In terms of robustness, acoustic feature analysis and normalization methods are investigated using several smartphone groupings, with the intention of uncovering which acoustic feature types have the greatest channel-independence based on depression classification accuracy.

12.2 Methods
Smartphones can easily provide metadata about their manufacturer and device version (e.g. iPhoneX, Samsung-SG, LG-3G), and this device information can be used to automatically group speaker data for analysis. Even with such device-grouped speaker data, normalization techniques are commonly applied to acoustic speech features in speech processing applications to help reduce channel nuisance effects found in real-world speech collection scenarios, wherein speaker and environmental noise conditions can vary unpredictably. Here, a method using histogram equalization of features (see Fig. 12.1), otherwise known as cumulative distribution mapping, was proposed to address the inherent speech signal and acoustic feature variation found across different manufacturers and their device series. Histogram equalization of features has previously been used as an acoustic feature normalization process in speech recognition, speaker verification, language identification, and emotion recognition applications (Allen et al., 2006; Pelecanos et al., 2001; Segura et al., 2002; Sethu et al., 2007).

Figure 12.1: Cumulative distribution mapping example showing an initial raw feature distribution, in which its feature values are then sorted from low to high; and then feature value percentile distributions are calculated generating a Gaussian representation.


To apply the histogram equalization of features, the feature distribution is modified by an invertible transform, wherein the mapping is performed on a dimension-by-dimension basis. For an original feature value x mapped to a value q, this can be represented as:

\int_{-\infty}^{x} f(y)\,dy = \int_{-\infty}^{q} h(z)\,dz , \qquad (12.1)

where f(y) is the probability density function of y, and h(z) is the desired probability density function, usually a Gaussian distribution, as in experiments herein. In practice, the mapping is achieved by ranking the features over the time interval to be normalized, e.g. frames with indices from 1 to N. The rank R of a particular feature value is then used to obtain a mapped value q from the desired probability density function. This can be shown by the following equation:

\frac{R - \beta}{N} = \int_{-\infty}^{q} h(z)\,dz , \quad (0 < \beta < 1) \qquad (12.2)

Prior research on speech emotion classification (France et al., 2000; Sethu et al., 2007) has demonstrated effective mitigation of unwanted speaker variability by applying histogram equalization, which uses the distribution of the feature vectors per speaker to map training and test features to a Gaussian distribution. Herein, it was proposed to modify the technique to normalize on a per-smartphone manufacturer/version basis rather than on a per-speaker basis. The proposed technique allows features from each smartphone, irrespective of manufacturer or device version, to be mapped to a uniform distribution scale, thereby reducing this unwanted variability. As shown in Fig. 12.2, two normalization grouping approaches were investigated: (1) 'per speaker', wherein all smartphone types are included in a single generic model; and (2) 'per manufacturer', wherein three different manufacturer-based models are used and input speech data is matched with the model of the same manufacturer. The 'per speaker' approach does not require knowledge of the smartphone manufacturer (i.e. if metadata is unavailable), as all speakers and their device series are collectively mixed. In contrast, the 'per manufacturer' approach requires specific knowledge of the speaker's device manufacturer. Further, the 'per manufacturer' approach presumes that manufacturers use much of the same hardware/software across different series of smartphones, and thus that these devices will have similar speech signal response outputs. However, little is known regarding the acoustic signal similarity within particular smartphone manufacturers and their many device versions; thus, more investigation is needed.
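A minimal sketch of the rank-based histogram equalization described by Eqs. (12.1)-(12.2): within each normalization group (a speaker, or all recordings from one manufacturer/version), every feature dimension is mapped to a standard Gaussian via its empirical rank. The beta offset of 0.5 and the use of a standard Gaussian target are assumptions; the thesis does not state the exact settings.

```python
import numpy as np
from scipy.stats import norm

def histogram_equalize(frames, beta=0.5):
    """Map each column of frames (N x D) to a Gaussian via rank mapping."""
    frames = np.asarray(frames, dtype=float)
    n = frames.shape[0]
    # argsort of argsort yields 0-based ranks; +1 gives ranks R = 1..N as in the text.
    ranks = frames.argsort(axis=0).argsort(axis=0) + 1
    return norm.ppf((ranks - beta) / n)        # inverse Gaussian CDF of (R - beta)/N

def normalize_per_group(frames, group_ids, beta=0.5):
    """Apply histogram equalization separately within each group label."""
    frames = np.asarray(frames, dtype=float)
    group_ids = np.asarray(group_ids)
    out = np.empty_like(frames)
    for g in np.unique(group_ids):
        idx = np.where(group_ids == g)[0]
        out[idx] = histogram_equalize(frames[idx], beta)
    return out

# 'Per speaker' versus 'per manufacturer' differs only in the labels passed:
# normalize_per_group(X, speaker_ids) or normalize_per_group(X, manufacturer_ids).
```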



Figure 12.2: Illustrated examples of (a) ‘per speaker’ versus (b) ‘per manufacturer’ normalization using a histogram equalization of features based on a small amount of speaker training data.

To evaluate the acoustic spectral differences between manufacturers’ smartphones, and same manufacturer but different smartphone device versions, analysis was conducted using the long-term average speech spectrum (LTASS). The LTASS provides spectral distribution information derived from the speech biosignal by averaging short-time FFT spectra over time (Cornelisse et al., 1991; Jovicic et al., 2015; Lofquist & Mandersson, 1987).
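A minimal sketch of an LTASS computation as described above: short-time FFT magnitude spectra over 20 ms frames (matching the Fig. 12.3 setting) are averaged across the recording and expressed in dB. The Hann window and 10 ms hop are assumptions.

```python
import numpy as np

def ltass_db(signal, fs, frame_ms=20, hop_ms=10):
    """Long-term average speech spectrum of a 1-D signal, in dB."""
    frame = int(fs * frame_ms / 1000)
    hop = int(fs * hop_ms / 1000)
    window = np.hanning(frame)
    spectra = []
    for start in range(0, len(signal) - frame + 1, hop):
        seg = signal[start:start + frame] * window
        spectra.append(np.abs(np.fft.rfft(seg)))     # short-time magnitude spectrum
    mean_mag = np.mean(spectra, axis=0)              # average over all frames
    freqs = np.fft.rfftfreq(frame, d=1.0 / fs)
    return freqs, 20 * np.log10(mean_mag + 1e-12)
```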

12.3 Experimental Settings
Two baseline feature sets were used for comparison: eGeMAPS and COVAREP, as previously described in Chapter 4.1.1. The eGeMAPS acoustic features (Eyben et al., 2016) were computed using a whole-file analysis; 88 low-level descriptors were calculated per 20ms frame and aggregated across the utterance using functionals including the mean, standard deviation, and skewness. In addition, a set of 74 acoustic features was extracted using the COVAREP speech toolkit (Degottex et al., 2014), with mean, standard deviation, and skewness functionals calculated by aggregating the 20ms frame features. Similarly to France et al. (2000), depression classification was performed using Quadratic Discriminant Analysis (QDA), covered briefly in Chapter 4.2.2. All experiments utilised 5-fold cross-validation with two classes (non-depressed/depressed) and a 20/80 training/test split. The Sonde Health (SH1) corpus was used for this research due to its non-laboratory natural environments (i.e. recorded at home via privately owned devices), sizeable number of female/male speakers (160), read/free speech tasks, PHQ-9 depression scores, and smart device metadata (e.g. manufacturer,


phone series). All audio files had Voice Activity Detection (VAD) (Kinnunen et al., 2013) applied to remove silence. After VAD was applied, the free speech (SH1f) and read speech (SH1r) files were on average 20 and 120 seconds in length, respectively. There had been no previous database of depressed speakers recorded from smartphones in non-laboratory natural environments (i.e. at home via privately owned devices) in conjunction with a smartphone health app designed to self-assess depression severity. For more information regarding the SH1 database, refer to Chapter 5.4.5.
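A minimal sketch of the utterance-level pipeline described above: frame-level low-level descriptors (eGeMAPS/COVAREP, extracted by external toolkits) are aggregated with mean, standard deviation, and skewness functionals, then classified with QDA under cross-validation. The random arrays stand in for real LLD matrices and PHQ-9-derived labels, and the QDA regularization value is an assumption.

```python
import numpy as np
from scipy.stats import skew
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis
from sklearn.model_selection import cross_val_score

def functionals(llds):
    """Aggregate an (n_frames x n_llds) matrix into one utterance-level vector."""
    return np.concatenate([llds.mean(axis=0), llds.std(axis=0), skew(llds, axis=0)])

rng = np.random.default_rng(0)
# 80 utterances, each with 500 frames of 10 placeholder low-level descriptors.
X = np.vstack([functionals(rng.normal(size=(500, 10))) for _ in range(80)])
y = rng.integers(0, 2, size=80)               # 0 = non-depressed, 1 = depressed

qda = QuadraticDiscriminantAnalysis(reg_param=0.1)
print(cross_val_score(qda, X, y, cv=5).mean())
```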

12.4 Results and Discussion

12.4.1 Manufacturer Comparison
Fig. 12.3 shows the Long-Term Average Speech Spectrum (LTASS) for the most common smartphone manufacturers in the SH1 database, calculated using all speakers' SH1r utterances. While smartphones from all three manufacturers (M1, M2, M3) exhibited significant acoustic speech energy below 1 kHz, each had its own unique signal response characteristics. In particular, speech recorded from devices made by manufacturer M2 had a greater magnitude above 1 kHz compared with the responses of M1 and M3.


Figure 12.3: Long-Term Average Speech Amplitude Spectrum comparison of M1 (blue), M2 (red), and M3 (green) smartphones using a 20ms frame length Discrete Fourier Transform (DFT).

In most instances, for both SH1f and SH1r, the depression classification accuracy within same-manufacturer smartphone groupings was equal to or better than that of the mixed category (see Fig. 12.4). It was hypothesized that similarly branded group models would perform better than a mixed model due to channel constraints, and this was found to be true. Despite the reduction in training model size, similarly branded smartphone models achieved an average of 18% (SH1f) and 17% (SH1r) relative


improvement over the mixed (i.e. unspecified) model. In some instances, similarly branded smartphone models generated an increase in relative depression classification performance of more than 30%. In Fig. 12.4, the eGeMAPS features produced slightly better (~3%) depression classification gains than the COVAREP features.


Figure 12.4: Per-manufacturer and per-feature comparison of mixed (black), eGeMAPS (blue), and COVAREP (red) relative accuracy improvement for SH1f (dark shade) and SH1r (light shade). The number of features is indicated in parenthesis.

12.4.2 Manufacturer Version Comparison
In the SH1 database, as shown in Fig. 12.5 (again calculated using all speakers' SH1r utterances), the LTASS analysis reveals magnitude-frequency differences even among smartphones of the same manufacturer (M3) but different versions. Version 1 (V1) has a greater magnitude at higher frequencies than version 2 (V2) and version 3 (V3).


Figure 12.5: Long-Term Average Speech Amplitude Spectrum comparison of M3 Version (V) smartphones using a 512-point DFT; V1 (blue); V2 (red); and V3 (green).


In Fig. 12.6, for both eGeMAPS and COVAREP features, the M3 version-specific smartphone models, despite their smaller training set sizes, achieved an average of 16% (SH1f) and 14% (SH1r) relative improvement over the M3 version-mixed models.


Figure 12.6: M3 versions and feature comparison of eGeMAPS (blue) and COVAREP (red) relative accuracy improvement for depression classification for SH1f (dark shade) and SH1r (light shade).

12.4.3 Manufacturer Feature Comparison
The COVAREP features were evaluated as separate glottal, spectral, and prosodic sets. The relative depression classification performances for these feature sets are shown in Fig. 12.7. Among the feature sets, the spectral set demonstrated greater variation in performance accuracy gains across the different smartphones (>10%), whereas the prosodic and glottal features showed less variation (