International Conference on Systems and Electronic Engineering (ICSEE'2012) December 18-19, 2012 Phuket (Thailand)

Comparative Study of Pairwise Classifications by ML and NN on Unvoiced Segments in Speech Sample

Suphakthinee Thongdee, Siwat Suksri and Thaweesak Yingthawornsuk

Abstract - This paper presents experimental results supporting a quantitative link between acoustical properties extracted from the speech signal and severe levels of depression. Data acquisition and experiments were carried out on two clinically categorized groups of females, one with depression and one in remission. The mel-scale cepstral coefficients from a sixteenth-order triangular filter bank were extracted and used in pairwise classification based on ML and NN. The classification results show the classifiers' performance in correctly separating the two classes at 55% and 52% when testing ML with the means and SDs of MFCC from voiced speech, while NN provides a higher rate of 61% when classifying MFCC of unvoiced speech. Overall, the best rate of 64% was obtained from ML classifying the SDs of MFCC from unvoiced speech.

Keywords-- Classification, Speech, MFCC, PCA, ML, NN.

I. INTRODUCTION

CLINICAL depression is a psychiatric disorder that can lead to a risk of suicide in a person who experiences it recurrently without treatment. This type of emotional illness has been widely studied and reported to be a prominent precursor of suicidal risk. Suicide is a public health problem with an increasing rate every year, and it has an obvious impact on society, healthcare, quality of life and even economic growth. Preventing suicide is therefore a critical task that must begin early, by screening persons for depression and admitting depressed persons, whether or not their symptoms are severe, to a treatment program. The major objective of this study is to investigate the relation between the acoustical properties of speech and the severity of mental state in speakers clinically diagnosed by psychiatrists with depression or suicidal risk. Earlier experimental studies have proposed that acoustical parameters estimated from the speech signal can be used for recognizing patterns and assessing the degree of mental severity in depressive speakers.

Clinical procedures to identify persons who are severely depressed are in great demand for practical use in healthcare centres and clinics. The most common methods for assessing whether a patient is in a severe state of depression, or even at elevated risk of suicide, are self-scored patient surveys, reports by others, clinical interviews and rating scales [1], such as the Hamilton depression rating scale. Diagnosis, and deciding which clinical category a patient belongs to, is a time-consuming procedure in which practitioners must complete several steps, such as information gathering, background profile checking, review of hospital admission and visiting records, diagnostic progress, crime-related reports, and hotline call-ins for healthcare consultation with a psychiatrist. A clinical diagnosis requiring an immediate judgment of whether a patient is psychologically safe from suicidal risk, or belongs to one of the symptom categories, makes it critical for the physician to reach the correct decision on admission and treatment for that patient. As reported in published studies [2], several analytical techniques have been proposed to measure the particular changes in the speech acoustics of depressed patients that result from the underlying symptoms of depression. It has been concluded that the suicidal speech of a severely depressed speaker is very similar to that of a commonly depressive one, but that the tonal quality of speech changes significantly when the symptom of near-term suicidal risk is at its strongest.

Acoustical parameters extracted by several speech processing techniques, such as statistical properties of the fundamental frequency, speech jitter and shimmer, formants, pitch contour, glottal spectral tilt, and the frequency distribution of PSD energies, have been suggested as effective indicators for monitoring the symptoms of major depression in speakers [3] through their vocal output. Another effectively discriminative parameter group is the Mel-Scale Frequency Cepstral Coefficients (MFCC) [4,5], used here as input to ML and NN classifiers in an attempt to separate suicidal patients from depressed patients and control subjects. The results of classifying such speech samples suggest which speech acoustics affect class separation, corresponding to the ways subjects produce speech, in addition to identifying the optimal speech parameters.

Suphakthinee Thongdee is with the Department of Media Technology, King Mongkut's University of Technology Thonburi - BangKhunthian Campus, Thailand ([email protected]). Siwat Suksri is with the Department of Media Technology, King Mongkut's University of Technology Thonburi - BangKhunthian Campus, Thailand ([email protected]). Thaweesak Yingthawornsuk is with the Department of Media Technology, King Mongkut's University of Technology Thonburi - BangKhunthian Campus, Thailand ([email protected]).



II. METHODOLOGY

2.1 Database Information
The database consists of female speech samples recorded by a pre-session physician. The subjects are categorized into two groups: 14 females with depression and 14 females with remission. All subjects admitted to the program are between 23 and 65 years old. Each subject is asked to complete the Beck Depression Inventory-II (BDI-II) before participating in an interview session with a psychiatrist. The BDI-II used in this program is a mood-weighting measure in the form of a standard, brief, self-scored questionnaire with details for clinical evaluation, covering both mental and physical depression [6]. The total BDI-II score ranges from 0 to 64, where a larger score implies that the person may suffer more severe depression than one with a lower score. In this work, only the interview speech samples from the recording database are studied for MFCC and their statistical significance in discriminating between symptom categories.

Fig. 1 Speech processing and classification procedure.

Pre-processing is carried out by first digitizing all speech signals through a 16-bit analog-to-digital converter at a sampling rate of 10 kHz via a 5 kHz anti-aliasing low-pass filter. Prior to the detection of voiced, unvoiced and silent segments in the speech files, monitoring and screening for any sound artifacts that may have appeared during the interview are performed offline using GoldWave, and silences longer than 0.5 seconds are manually removed as well. All speech signals of depressed and remitted females are carefully processed under the same pre-processing conditions, and similar control of the acoustic environment is applied while recording the interview conversations.

2.2 Voiced/Unvoiced Detection
The detection exploits the fact that unvoiced segments of the speech signal consist mostly of high-frequency components, whereas voiced speech is low-frequency and quasi-periodic. To classify the segments, the Dyadic Wavelet Transform (DWT) of the speech samples is computed in segments of 256 samples per frame and the energy at each scale is compared. The lowest scale is δ1 = 2^1 and the highest is δ5 = 2^5. Any segment whose largest energy is estimated at scale δ1 = 2^1 is classified as an unvoiced segment; otherwise it is found to be a voiced

segment. The following equation defines the unvoiced-segment detection rule:

uv = { n | E(n, δ1) = max_i E(n, δi) };   n = 1, …, N   (1)

where uv is the set of speech segments classified as unvoiced, i.e. the segments n whose energy at scale δ1 is the largest.

Fig. 2 Original speech signal (upper), voiced segment of speech (middle) and unvoiced segments (lower).
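The energy comparison of the detection rule can be sketched in a few lines. The paper's processing is implemented in Matlab; the Python below is only an illustrative stand-in, using a hand-rolled Haar decomposition in place of the dyadic wavelet transform, and the synthetic voiced/unvoiced frames are our own assumptions:

```python
import numpy as np

def haar_detail_energies(frame, levels=5):
    """Energy of Haar detail coefficients at dyadic scales 2^1 .. 2^levels."""
    a = frame.astype(float)
    energies = []
    for _ in range(levels):
        approx = (a[0::2] + a[1::2]) / np.sqrt(2.0)   # low-pass half-band
        detail = (a[0::2] - a[1::2]) / np.sqrt(2.0)   # high-pass half-band
        energies.append(float(np.sum(detail ** 2)))
        a = approx
    return energies  # energies[0] -> finest scale 2^1, energies[-1] -> 2^5

def is_unvoiced(frame, levels=5):
    # Unvoiced if the largest energy sits at the finest scale (delta_1 = 2^1)
    e = haar_detail_energies(frame, levels)
    return int(np.argmax(e)) == 0

# Synthetic 256-sample frames at the paper's 10 kHz sampling rate
fs = 10_000
t = np.arange(256) / fs
rng = np.random.default_rng(0)
voiced_like = np.sin(2 * np.pi * 150 * t)   # low-frequency, quasi-periodic
unvoiced_like = rng.standard_normal(256)    # broadband, noise-like
```

With these test frames the noise-like signal concentrates its energy at the finest scale and is flagged unvoiced, while the 150 Hz tone does not.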

2.3 Vocal Feature Extraction
The pre-processed speech samples are classified into voiced, unvoiced and silent groups. Only the voiced and unvoiced segments of the speech signal are further processed for MFCC


extraction. The procedure to determine the MFCC [7] is described as follows:
- Window the concatenated voiced segments into individual frames of 25.6 ms.
- Compute the logarithm of the discrete Fourier transform (DFT) of each windowed speech frame.
- Pass the log-magnitude spectrum through a filter bank of 16 triangular band-pass filters with center frequencies corresponding to the mel frequency scale.
- Normalize for the variation of the vocal tract across subjects.
- Compute the inverse discrete Fourier transform (IDFT) to obtain the 16th-order cepstral coefficients.
- Calculate the mean and standard deviation (SD) of every 20 samples, which are used as the input feature vectors to the classifiers.
- Reduce each extracted MFCC dataset to its first two principal components, which are then used as the input vectors for training and testing the ML and NN classifiers.

All processes are implemented in Matlab scripts.
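The steps above can be sketched as follows. The paper's implementation is in Matlab; this Python version is only an approximation in which the filter placement (equal spacing on the mel scale) is our assumption and the vocal-tract normalization step is omitted:

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters=16, n_fft=256, fs=10_000):
    """Triangular filters with centers equally spaced on the mel scale."""
    mels = np.linspace(hz_to_mel(0.0), hz_to_mel(fs / 2.0), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mels) / fs).astype(int)
    fb = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        lo, c, hi = bins[i - 1], bins[i], bins[i + 1]
        for k in range(lo, c):           # rising edge of triangle i
            fb[i - 1, k] = (k - lo) / max(c - lo, 1)
        for k in range(c, hi):           # falling edge of triangle i
            fb[i - 1, k] = (hi - k) / max(hi - c, 1)
    return fb

def mfcc(frame, n_filters=16, fs=10_000):
    """16 cepstral coefficients from one 256-sample (25.6 ms) frame."""
    n_fft = len(frame)
    windowed = frame * np.hamming(n_fft)
    power = np.abs(np.fft.rfft(windowed)) ** 2
    energies = mel_filterbank(n_filters, n_fft, fs) @ power
    log_e = np.log(np.maximum(energies, 1e-12))
    # DCT of the log filter energies, without any leading scale factor
    n = np.arange(1, n_filters + 1)[:, None]
    i = np.arange(1, n_filters + 1)[None, :]
    return np.cos(np.pi * n * (i - 0.5) / n_filters) @ log_e
```

A single frame, e.g. `mfcc(np.sin(2 * np.pi * 1000 * np.arange(256) / 10000))`, yields a length-16 coefficient vector.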

The mel scale maps the linear frequency scale of the speech signal to a logarithmic scale for frequencies higher than 1 kHz. This makes the spectral frequency characteristics of the signal correspond closely to human auditory perception [8]. The mel-scale frequency mapping is formulated as:

f_mel = 2595 * log10(1 + f_lin / 700)   (2)

in which f_mel is the perceived frequency and f_lin is the real linear frequency in the speech signal. In the filtering phase, a series of N_f = 16 triangular band-pass filters is used as a filter bank whose center frequencies and bandwidths are selected according to the mel scale. Together they span the entire signal bandwidth [0, f_s/2]. The center frequency of each filter is defined as

F_{C,i} = k_i * f_s / N';   i = 1, 2, 3, …, N_f   (3)

and its bandwidth is consequently computed by

B_i = F_{C,i+1} - F_{C,i-1};   i = 1, 2, 3, …, N_f   (4)

Here, N' is the FFT length, equal to 256, k_i is the DFT index of the center frequency of filter i, and B_i and F_{C,i} are the bandwidth and the center frequency of filter i, respectively. It is also important to note that F_{C,0} = 0 and F_{C,N_f} < f_s/2. Once the center frequencies and bandwidths of the filters are obtained, the log-energy output of each filter i is computed and encoded into the MFCC by performing a Discrete Cosine Transform (DCT) defined as follows:

C_n = sum_{i=1}^{N_f} log(E_i) cos[n (i - 1/2) π / N_f];   n = 1, 2, …, N_f   (5)

where E_i is the energy output of filter i. For computational simplicity, equation (5) is employed in our algorithm without the superfluous factor 2/N' in the computation of the mel-cepstral filter bank coefficients.

2.4 Principal Component Analysis (PCA)
We applied the PCA technique [9] to the MFCC features to extract their most significant components. The main concept of PCA is to project the original feature vector onto the principal component axes. These axes are orthogonal and correspond to the directions of greatest variance in the original feature space. Projecting the input vectors onto the principal subspace reduces both the redundancy and the dimensionality of the original feature space. The analyzed MFCC features are projected onto a two-dimensional space, which is adequate for training and testing the data in the subsequent classification stage.

2.5 Feature Classification
The Maximum Likelihood (ML) and NN classifiers are trained and tested on the two-dimensional MFCC dataset and then compared with each other for performance on correct classification. First, 20% of the samples are randomly selected and used to train a classifier, and 35% or 50% of the remaining dataset is used later for testing the same classifier. Several trials of random selection with 20%, 35% and 50% of the samples for training and the rest for testing are further conducted to examine how the classifier performance may be affected by sample size. In the case of ML classification, the Bayes decision rule has been approximated from our samples by the following equation [5]:

P(mfcc_i | ω1) P(ω1) > P(mfcc_i | ω2) P(ω2)   (6)

If this inequality holds, mfcc_i is in class ω1; otherwise mfcc_i is identified as class ω2. Here P(ω_i) is the prior probability of class i and P(mfcc_i | ω_i) is the state-conditional probability for class i. Furthermore, the inequality can be rearranged to obtain another decision rule:

L_R(mfcc_i) = P(mfcc_i | ω1) / P(mfcc_i | ω2) > P(ω2) / P(ω1) = τ_c   (7)

Then mfcc_i is in class ω1. The ratio on the left of Equation (7) is called the likelihood ratio and the quantity on the right is the threshold: if L_R > τ_c, we decide that the case belongs to class ω1. If the priors are equal, the threshold is one (τ_c = 1). Thus, when L_R > 1, we assign the observation or pattern to ω1, and if L_R < 1, to ω2.
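The PCA projection and the likelihood-ratio decision rule of Equation (7) can be illustrated together with a short sketch. The Gaussian class-conditional model, the equal priors (τ_c = 1), and the synthetic data below are our assumptions for illustration and not the paper's Matlab code:

```python
import numpy as np

def pca_project(X, n_components=2):
    """Project rows of X onto the top principal axes (greatest variance)."""
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)  # rows of Vt are axes
    return Xc @ Vt[:n_components].T

def gaussian_ml_classifier(train1, train2):
    """Fit one Gaussian per class; decide by the likelihood ratio with tau_c = 1."""
    params = []
    for tr in (train1, train2):
        mu = tr.mean(axis=0)
        cov = np.cov(tr, rowvar=False)
        params.append((mu, np.linalg.inv(cov), np.linalg.det(cov)))

    def log_pdf(x, mu, icov, det):
        d = x - mu  # 2-D Gaussian log-density up to the shared constant
        return -0.5 * (d @ icov @ d) - 0.5 * np.log(det) - np.log(2 * np.pi)

    def predict(x):
        l1 = log_pdf(x, *params[0])
        l2 = log_pdf(x, *params[1])
        return 1 if l1 > l2 else 2  # log L_R > 0  <=>  L_R > 1

    return predict

# Synthetic 16-D "MFCC" features for two classes, reduced to 2-D by PCA
rng = np.random.default_rng(1)
c1 = rng.standard_normal((100, 16)) + 2.0
c2 = rng.standard_normal((100, 16)) - 2.0
Z = pca_project(np.vstack([c1, c2]))
predict = gaussian_ml_classifier(Z[:100], Z[100:])
acc = np.mean([predict(z) == (1 if i < 100 else 2) for i, z in enumerate(Z)])
```

Equal priors make the threshold one, so comparing the two log-densities directly implements the rule "assign to ω1 when L_R > 1".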