FCSR - Fuzzy Continuous Speech Recognition Approach for Identifying Laryngeal Pathologies Using New Weighted Spectrum Features

Rania M. Ghoniem (1) and Khaled Shaalan (2)

(1) School of Computer Science, Mansoura University, Mansoura, Egypt, [email protected]
(2) Faculty of Engineering and IT, The British University in Dubai, Dubai, United Arab Emirates, [email protected]

Abstract. Speech processing technologies have contributed substantially to the identification of laryngeal pathology, in which samples of normal and pathologic voice are evaluated. In this paper, a novel fuzzy continuous speech recognition approach, termed FCSR, is proposed for laryngeal pathology identification. First, new speech weighted spectrum features based on Jacobi–Fourier Moments (JFMs) are presented for characterizing larynx pathologies. This is primarily motivated by the assumption that the energy represented by the spectrogram changes markedly under some larynx pathologies, such as physiological and neuromuscular pathologies, compared with normal speech. This phenomenon strongly influences the distribution of local spectrogram energy along both the time axis and the frequency axis. Consequently, the JFMs computed from local spectrogram regions are used to characterize the distribution of local spectrogram energy. In addition, a multi-class fuzzy support vector machine (FSVM) model is constructed to classify larynx pathologies, where partition index maximization (PIM) clustering and particle swarm optimization (PSO) are employed to calculate the fuzzy memberships and to optimize the parameters of the FSVM kernel function, respectively. Finally, the experiments validate the proposed approach in terms of laryngeal pathology recognition accuracy.

Keywords: Speech recognition · Laryngeal pathology identification · Fuzzy logic

© Springer International Publishing AG 2018. A.E. Hassanien et al. (eds.), Proceedings of the International Conference on Advanced Intelligent Systems and Informatics 2017, Advances in Intelligent Systems and Computing 639, DOI 10.1007/978-3-319-64861-3_36

1 Introduction

Pathological changes of the larynx are characterized by an increase in the mass of the vibrating vocal folds, a deficiency of closure, and a considerable change in the elasticity of the folds. As a consequence, the harmonic structure may be altered, and the energy would likewise be maximized at the components of higher energy because of the turbulence caused by incomplete closure of the glottal cleft. In brief, this will evidently be reflected in
the speech during voiced sounds, since the folds vibrate in such sound segments [1]. Automated acoustic processing of speech is widely used for screening laryngeal perturbations. Various acoustic parameters have been proposed to objectively evaluate the perceptual qualities of pathologic voice, using sustained vowels as acoustic material. However, pathologies that dominate the quality of the voice during conversation may have an insignificant influence on the relative quality of a vowel sustained by a speaker at a comfortable pitch. Moreover, the parameters and methods reviewed in Sect. 2 cannot simply be applied to continuous speech because of articulation, onset and offset effects, and suprasegmental changes.

The principal objective of this paper is to propose a fuzzy continuous speech recognition approach for identifying laryngeal pathology. The paper advances the state of the art in laryngeal pathology identification and classification from continuous speech through the following contributions:

1. The major contribution of this study is the proposed speech weighted spectrum features based on local JFMs. Under different speech perturbations, the articulation, the speaking rate, the strength of pronunciation, and the degree of variation of the pitch frequency change distinctly. These variations change the degree to which energy is localized at certain spectrogram frequencies. For instance, energy is largely localized at particular spectrogram frequencies when the utterances are better articulated and pronounced with greater strength. Thus, when the JFMs are calculated in local spectrogram regions, they assess the degree to which energy is localized in the frequency regions of the spectrogram. This makes JFMs well suited to estimating the differences between pathologies. Moreover, JFMs can reduce the variation introduced by the speakers, the sentences, and the speaking style, since JFMs are invariant to scaling, rotation, and translation.

2. Another contribution of this study is the use of a fuzzy approach, which is effective for problems involving uncertainty. In particular, an enhanced multi-class FSVM model is presented for laryngeal pathology identification. Unlike the traditional laryngeal pathology classification methods (see Table 4), the proposed approach is robust to noise and outliers in the training set. The classifier is built with the one-against-other (OAO) approach. In addition, PIM clustering and PSO are applied to calculate the fuzzy memberships and to optimize the parameters of the FSVM kernel function, respectively.

2 Related Work

To measure the perturbations that appear in the presence of larynx pathology, current acoustic analysis makes it possible to compute many long-term averaged acoustic measures. These measures include the Signal-to-Noise (SN) ratio, Normalized Noise Energy, Harmonics-to-Noise (HN) ratio, Voice Turbulence Index, jitter, shimmer, Glottal-to-Noise Excitation (GNE) ratio, Frequency Amplitude Tremor, and others [2–4]. They were developed for measuring the relative
quality and "normality grade" of voice recordings from sustained phonations of vowels. The underlying shortcoming of such measures is that they depend on an accurate estimate of the fundamental frequency, which is a rather complicated task in the presence of pathology. Other works addressed this issue from acoustic signals by inverse filtering [5]. However, inverse filtering assumes a linear model, so these methods cannot behave properly when pathology is present, because of the non-linearity introduced by the pathology itself. By contrast, few works address the identification of speech perturbations from continuous speech. In [6], two systems were constructed, one based on a support vector machine (SVM) and the other on a Gaussian mixture model (GMM), both using Mel-frequency cepstral coefficients (MFCC) calculated from (1) the sustained vowel /a/ and (2) continuous speech. For continuous speech, the GMM-based system achieved an accuracy of 74% and the SVM-based system an accuracy of 72%. For the sustained vowel /a/, accuracies of 66% and 69% were achieved by the GMM and SVM systems, respectively, which are lower than those obtained with running speech.

3 Materials and Methods

3.1 Database

This study uses the Massachusetts Eye and Ear Infirmary (MEEI) database [7], created by the Voice and Speech Laboratory of MEEI and used by several studies (see Table 4). Three classes were obtained: physiological larynx pathologies, with 59 speech samples (vocal fold nodules or edemas); neuromuscular larynx pathologies, with 59 speech samples (unilateral vocal fold paralysis); and normal signals, with 36 samples.

3.2 Proposed FCSR Approach

The steps of FCSR are shown in Fig. 1 and are summarized as follows.

Fig. 1. Block diagram of the FCSR approach.

Extraction of the new weighted spectrum features. This step presents the proposed weighted spectrum features based on JFMs, termed JFWSF. The JFWSF algorithm is outlined in Table 1 and comprises seven main steps:

1. Framing: The speech signal q is segmented into frames of 23.22 ms duration with an 11.61 ms overlap.

2. Windowing: Each frame q_k(z) of the time signal q is windowed by multiplying it by a sliding Hamming window D(z), which reduces spectral distortion, where k denotes the frame index, o is the time location, and Z is the number of samples in each frame:

$$u(z) = q(z)\, D(z - o), \quad 0 \le z \le Z - 1 \tag{1}$$

$$D(z) = 0.54 - 0.46 \cos\left[\frac{2\pi z}{Z - 1}\right], \quad 0 \le z \le Z - 1 \tag{2}$$

3. Short Time Fourier Transform (STFT): The STFT R_k is calculated as

$$R_k = \sum_{z=0}^{Z-1} u(z) \exp\left(-\frac{j 2\pi z k}{Z}\right) \tag{3}$$

Table 1. The algorithm of JFWSF.

4. Mel filter and log energy: The Mel frequency M_k for a given frequency f and the log energy G_k(y) are calculated by Eqs. 4 and 5, respectively,

$$M_k = 2595 \cdot \log_{10}\left(1 + \frac{R_k}{700}\right) \tag{4}$$

$$G_k(y) = \ln\left(\sum_{f=f_l}^{f_h} \lvert M_k(f) \rvert^2\, U_y(f)\right), \quad 0 \le y < Y \tag{5}$$

where f_l and f_h, set to f_l = 0 Hz and f_h = 4000 Hz, are the lower and upper limits of the calculated speech frequency, U_y(f) denotes the triangular filter described in Eq. 6, and Y is the number of Mel filters.

$$U_y(f) = \begin{cases} 0, & f < J(y-1) \\ \dfrac{2\,(f - J(y-1))}{(J(y+1) - J(y-1))\,(J(y) - J(y-1))}, & J(y-1) \le f < J(y) \\ \dfrac{2\,(J(y+1) - f)}{(J(y+1) - J(y-1))\,(J(y+1) - J(y))}, & J(y) \le f < J(y+1) \\ 0, & f > J(y+1) \end{cases} \tag{6}$$

where $\sum_{y=0}^{Y-1} U_y(f) = 1$, J(y), y = 1, 2, ..., Y, is the center frequency of the y-th triangular filter, and the interval between two adjacent J(y) increases with y.

5. Jacobi–Fourier moments: The JFMs JB are calculated as follows. The log energy spectrum is first decomposed into (K − C + 1) × (Y/C) blocks Q_{ab} by Eq. 7, where C is the block size. The blocks are square so that the STFT frequency resolution b can balance the relation between the time and frequency domains. Then, the Jacobi orthogonal polynomials of zero (JP0) and first (JP1) normalized orders of Q_{ab} are computed by Eqs. 8 and 9, respectively. Finally, all JP0(a, b, v) and JP1(a, b, v) are used to construct $JB \in \mathbb{R}^{(K - C + 1) \times (Y/C)}$, where a − b > −1, b > 0, h(a, b, v) is the general weight function, and b_0(a, b) denotes the normalization constant.

$$Q_{ab} = \begin{bmatrix} G_a(C \cdot b) & \cdots & G_a(C \cdot b + C - 1) \\ \vdots & \ddots & \vdots \\ G_{a+C-1}(C \cdot b) & \cdots & G_{a+C-1}(C \cdot b + C - 1) \end{bmatrix}, \quad a = 1, \ldots, K - C + 1; \; b = 1, \ldots, Y/C \tag{7}$$

$$JB_0(a, b, v) = \sqrt{\frac{h(a, b, v)}{b_0(a, b)}} \tag{8}$$

$$JB_1(a, b, v) = JB_0(a, b, v)\, \sqrt{\frac{(a + 2)\, b}{a - b + 1}} \left(\frac{a + 1}{b}\, v - 1\right) \tag{9}$$

6. Discrete time mapping cepstrum: The discrete cosine transform $DC_k \in \mathbb{R}^{1 \times (Y/C)}$ is calculated by Eq. 10 from the local JFMs LOJ_k. The 2nd to 13th coefficients are retained, since noise distorts the high-order cepstral coefficients.

$$DC_k(y) = D_y \sum_{n=1}^{Y/C} LOJ_k(n) \cos\left(\frac{(2n - 1)\, y\, \pi}{2\,(Y/C)}\right) \tag{10}$$

where D_y is defined as

$$D_y = \begin{cases} \sqrt{\dfrac{1}{Y/C}}, & y = 1 \\ \sqrt{\dfrac{2}{Y/C}}, & y = 2, \ldots, Y/C \end{cases} \tag{11}$$

7. Speech weighted spectrum features: The JFWSF are produced by combining LOJ with DC, giving a dimension of (K − C + 1) × (Y/C + 12).
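To make Steps 1 to 3 concrete, the following is a minimal sketch in plain Python of framing, Hamming windowing (Eq. 2), and a naive per-frame discrete Fourier transform (Eq. 3). It is an illustration under stated assumptions, not the authors' implementation: the function names, the toy signal, and the toy frame length of 16 samples with a hop of 8 are all hypothetical, and a real system would use an FFT routine rather than the O(Z^2) sum shown here.

```python
import cmath
import math


def hamming(Z):
    """Hamming window D(z) = 0.54 - 0.46 cos(2*pi*z / (Z - 1)), as in Eq. 2."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * z / (Z - 1)) for z in range(Z)]


def frame_signal(q, frame_len, hop):
    """Split q into overlapping frames; samples that do not fill a frame are dropped."""
    return [q[s:s + frame_len] for s in range(0, len(q) - frame_len + 1, hop)]


def dft(u):
    """Naive discrete Fourier transform of one windowed frame, as in Eq. 3."""
    Z = len(u)
    return [sum(u[z] * cmath.exp(-2j * math.pi * z * k / Z) for z in range(Z))
            for k in range(Z)]


def spectrogram(q, frame_len, hop):
    """Magnitude spectrum of each windowed frame (non-negative frequency bins only)."""
    window = hamming(frame_len)
    rows = []
    for frame in frame_signal(q, frame_len, hop):
        u = [x * w for x, w in zip(frame, window)]  # Eq. 1: window the frame
        rows.append([abs(c) for c in dft(u)[:frame_len // 2 + 1]])
    return rows


# Toy usage (assumed values): a pure tone with 4 cycles per 16-sample frame.
signal = [math.sin(2 * math.pi * 4 * n / 16) for n in range(64)]
spec = spectrogram(signal, frame_len=16, hop=8)
```

With a 50% hop, as in the 23.22 ms frames with 11.61 ms overlap described in Step 1, each sample away from the signal edges contributes to two frames. The later steps (Mel filtering, block-wise JFMs, and the DCT) would operate on the rows of `spec`.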

Fig. 2. Flowchart of the proposed multi-class FSVM model.


Table 2. PIM objective function.

The JFWSF algorithm is given as Algorithm 1 (see Table 1).

Feature statistics (FT). FT estimates global statistics of the computed features in order to construct a feature vector for the speech signal. The following feature statistics are chosen: mean and standard deviation (SD).

Fuzzy classification of laryngeal pathologies using the proposed multi-class FSVM. The FSVM algorithm assigns fuzzy membership values to individual samples to reflect their importance to their own classes. As the FSVM kernel function, the radial basis kernel function (RBKF) is adopted for speech signal classification and is defined as

$$KF(s_h, s_j) = \exp\left(-\frac{\lVert s_h - s_j \rVert^2}{2 d^2}\right)$$

where s_h is the feature vector and d denotes the width parameter. The computational steps of the proposed multi-class FSVM model, illustrated in Fig. 2, are as follows.

Construction of the OAO-FSVM classifier. For a dataset of K laryngeal pathology classes, (x_j, y_j, z_j), j = 1, ..., v, x_j ∈ R^u, y_j ∈ {1, ..., K}, the steps of the OAO scheme are as follows. First, a top-precedence class is taken as the positive class, whereas the remaining K−1 classes are assigned to the negative class, and a binary FSVM classifier FSVM_1 is established. Second, leaving aside the top-precedence class, the previous step is repeated until the last binary FSVM (FSVM_{K−1}) is created.
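The three ingredients described above (mean and SD statistics, the radial basis kernel, and the sequence of binary problems in the OAO scheme) can be sketched as follows. This is an illustrative sketch only: the function names and toy labels are hypothetical, the standard deviation is assumed to be the population form (division by the number of frames), and no FSVM training is performed here.

```python
import math


def feature_statistics(features):
    """Column-wise mean and SD of a frames-by-coefficients matrix, concatenated.

    Assumption: population standard deviation (divide by the number of frames).
    """
    n = len(features)
    cols = list(zip(*features))
    means = [sum(c) / n for c in cols]
    sds = [math.sqrt(sum((x - m) ** 2 for x in c) / n) for c, m in zip(cols, means)]
    return means + sds


def rbf_kernel(s_h, s_j, d):
    """Radial basis kernel KF(s_h, s_j) = exp(-||s_h - s_j||^2 / (2 d^2))."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(s_h, s_j))
    return math.exp(-sq_dist / (2.0 * d ** 2))


def oao_binary_problems(labels, class_order):
    """Build the K-1 binary problems of the scheme described in the text.

    Each problem takes the current top-precedence class as positive (+1) and all
    remaining classes as negative (-1); the positive class is then set aside.
    Returns a list of (positive_class, [(sample_index, +1 or -1), ...]).
    """
    problems = []
    remaining = list(class_order)
    while len(remaining) > 1:
        positive = remaining.pop(0)
        kept = [(i, 1 if y == positive else -1)
                for i, y in enumerate(labels)
                if y == positive or y in remaining]
        problems.append((positive, kept))
    return problems


# Toy usage with hypothetical labels for the three classes of Sect. 3.1.
labels = ["physiological", "neuromuscular", "normal", "physiological"]
order = ["physiological", "neuromuscular", "normal"]
problems = oao_binary_problems(labels, order)
```

For three classes this yields two binary problems, matching the K−1 classifiers FSVM_1 to FSVM_{K−1}. The precedence order used here is an assumption; the text does not state how the top-precedence class is chosen.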


Computing fuzzy memberships. For each binary FSVM classifier, the two classes (S+ denotes the positive class and S− the negative class) are clustered with the PIM algorithm to obtain the structure of the fuzzy distribution of each class (see Table 2). First, for a given speech signal x_b, the nearest pair of clusters is found, where two centers, one from the positive class and one from the negative class, are chosen to decide the fuzzy membership of the signal. Second, if the distance between the signal and the i-th cluster center E_a is smaller than b_i (the radius of the cluster core), the sample belongs to the i-th cluster with a membership value of 1. Under this restriction, each cluster has a boundary inside which all signals have membership values of 1, while signals outside this boundary have memberships l_{ab} ∈ (0, 1).

Parameter optimization using PSO. The PSO algorithm is used to optimize the following parameters of each binary FSVM: f, which maintains a balance between computational complexity and separation error, and d, which reflects the characteristics of the training samples. The optimized FSVM parameters are then taken from the particle with the maximal fitness value, computed by Eq. 15. The velocity v_{gt}(r) and position x_{gt}(r) of each particle are updated by Eqs. 16 and 17. The inertia weight W balances the global and local exploration capabilities and is given by Eq. 18.

$$F = \frac{1}{K} \sum_{t=1}^{K} \left(\frac{l_t^R}{l_t}\right) \times 100 \tag{15}$$

$$v_{gt}(r + 1) = W v_{gt}(r) + c_1\, rand_1 \cdot \left(p_{gbest,t}(r) - x_{gt}(r)\right) + c_2\, rand_2 \cdot \left(p_{gcbest}(r) - x_{gt}(r)\right) \tag{16}$$

$$x_{gt}(r + 1) = x_{gt}(r) + v_{gt}(r) \tag{17}$$

$$W = W_{max} - \frac{W_{max} - W_{min}}{R}\, r \tag{18}$$

where l_t is the number of samples in the validation set, l_t^R is the number of correctly identified samples in the validation set, p_{gbest,t}(r) is the best solution that particle t has found up to iteration r, p_{gcbest}(r) is the best solution found among all particles of the swarm, rand_1 and rand_2 are random variables in the range [0, 1], the velocity v_{gt}(r) is restricted to the range [−v_max, v_max], v_max denotes the boundary value, c_1 and c_2 are positive acceleration constants that adjust the velocity relative to the best local and global positions, W_min and W_max are the minimal and maximal inertia weights, respectively, r denotes the current iteration, and R is the maximum number of iterations.
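The PSO update of Eqs. 16 to 18 and the fitness of Eq. 15 can be sketched as below. This is a hedged illustration rather than the authors' code: the function names are hypothetical, the default swarm settings are taken from Table 3, the value of v_max and the clamping of positions to the search bounds are assumptions, the position update uses the freshly updated velocity (the common PSO convention), and a simple quadratic stands in for the cross-validated FSVM accuracy so that the example runs on its own.

```python
import random


def fitness_eq15(correct, total):
    """Eq. 15: mean percentage of correctly identified validation samples."""
    return 100.0 * sum(c / n for c, n in zip(correct, total)) / len(total)


def inertia_weight(r, R, w_max=0.8, w_min=0.1):
    """Eq. 18: inertia weight decreasing linearly with the iteration r."""
    return w_max - (w_max - w_min) * r / R


def pso_maximize(fitness, bounds, n_particles=20, n_iter=100,
                 c1=1.4, c2=1.6, v_max=1.0, seed=0):
    """Maximize `fitness` over a box; returns (best_position, best_fitness)."""
    rng = random.Random(seed)
    dim = len(bounds)
    pos = [[rng.uniform(lo, hi) for lo, hi in bounds] for _ in range(n_particles)]
    vel = [[0.0] * dim for _ in range(n_particles)]
    best_pos = [p[:] for p in pos]               # personal best positions
    best_fit = [fitness(p) for p in pos]
    g = max(range(n_particles), key=lambda t: best_fit[t])
    g_pos, g_fit = best_pos[g][:], best_fit[g]   # swarm best
    for r in range(n_iter):
        w = inertia_weight(r, n_iter)
        for t in range(n_particles):
            for k in range(dim):
                v = (w * vel[t][k]
                     + c1 * rng.random() * (best_pos[t][k] - pos[t][k])
                     + c2 * rng.random() * (g_pos[k] - pos[t][k]))
                vel[t][k] = max(-v_max, min(v_max, v))   # keep within [-v_max, v_max]
                lo, hi = bounds[k]
                pos[t][k] = max(lo, min(hi, pos[t][k] + vel[t][k]))  # assumed clamp
            fit = fitness(pos[t])
            if fit > best_fit[t]:
                best_fit[t], best_pos[t] = fit, pos[t][:]
                if fit > g_fit:
                    g_fit, g_pos = fit, pos[t][:]
    return g_pos, g_fit


def toy_fitness(p):
    """Stand-in objective with its maximum (zero) at p = (3, 1)."""
    return -((p[0] - 3.0) ** 2 + (p[1] - 1.0) ** 2)


best, best_value = pso_maximize(toy_fitness, bounds=[(0.0, 10.0), (0.0, 10.0)])
```

In the setting of the paper, each particle would encode a candidate pair (f, d) for one binary FSVM, and `toy_fitness` would be replaced by a function that trains the classifier and returns the validation accuracy of Eq. 15.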


4 Experimental Results

This section presents the results of applying the proposed approach to the MEEI database. The parameter values that produced the best accuracy in the experiments are listed in Table 3.

Table 3. Enhanced values of parameters influencing classification.

| Method | Parameter | Value |
|---|---|---|
| JFWSF | Frame length | 1024 |
| | Frame overlap | 512 |
| | Window type | Hamming |
| | Number of Mel filters | 160 |
| Multi-class FSVM model | Optimal cluster number | (2, 2) |
| | f | 256 |
| | d | 2 |
| | Number of particles | 20 |
| | Number of iterations | 100 |
| | Acceleration c1 | 1.4 |
| | Acceleration c2 | 1.6 |
| | Maximal inertia weight Wmax | 0.8 |
| | Minimal inertia weight Wmin | 0.1 |

4.1 Performance Evaluation

To evaluate the proposed approach, 10-fold cross-validation was employed on the database. The final accuracy of the FSVM is expressed as the ratio of the number of correct classifications (hits) made by the classifier to the total number of signals.

4.2 Comparing the Results to Other Related Work

For comparison, FCSR is contrasted with other approaches from the state of the art. Table 4 presents the most recent work on identifying laryngeal pathology. As seen in Table 4, FCSR attains the highest best-accuracy figure (97.6%) among the listed approaches. Therefore, FCSR is well suited to identifying and classifying laryngeal pathology in continuous speech, for the reasons discussed above.

Table 4. Brief summary of previous works on identifying laryngeal pathologies.

| Reference | Database (pathologic + normal) | Feature set | Feature statistics | Classifier | Best accuracy (%) |
|---|---|---|---|---|---|
| [8] | MEEI (638 + 53) | Noise, perturbation | - | Linear discriminant analysis, K-nearest neighbors (KNN) | 96.1 |
| [9] | MEEI (173 + 53) | MFCC | - | GMM | 94 |
| [10] | MEEI (118 + 36) | Line spectral frequencies, MFCC, Mel line spectral frequencies | - | GMM, support vector machines, discriminant analysis | 84.4 |
| FCSR | MEEI (118 + 36) | Proposed JFWSF | Mean, SD | Proposed multi-class FSVM | 97.6 |

5 Conclusions

In this research, a fuzzy continuous speech recognition approach, termed FCSR, is proposed for laryngeal pathology identification and classification. First, a novel type of speech spectrum feature, termed JFWSF, which employs local JFMs, is presented. JFWSF have the following merits. Firstly, JFWSF assess the degree to which energy is localized at the spectrogram frequencies, which changes considerably with the kind of laryngeal pathology. Secondly, JFWSF can reduce the variance caused by differences between speakers, sentences, and speaking styles. Additionally, a multi-class FSVM model is proposed for laryngeal pathology classification. PIM clustering is employed to determine the fuzzy membership values of the FSVM training set, and the parameters of each binary FSVM in the multi-class FSVM are optimized using the PSO algorithm. The experiments validate the combination of the new JFWSF with the proposed multi-class FSVM.

References

1. Niedzielska, G.: Acoustic analysis in the diagnosis of voice disorders in children. Int. J. Pediatr. Otorhinolaryngol. 57, 189–193 (2001)
2. Michaelis, D., Gramss, T., Strube, H.W.: Glottal-to-noise excitation ratio - a new measure for describing pathological voices. Acustica/Acta Acustica 83, 700–706 (1997)
3. de Krom, G.: A cepstrum-based technique for determining a harmonics-to-noise ratio in speech signals. J. Speech Hear. Res. 36, 254–266 (1993)
4. Feijoo, S., Hernandez-Espinosa, C.: Short-term stability measures for the evaluation of vocal quality. J. Speech Hear. Res. 33, 324–334 (1990)
5. Ritchings, R.T., McGillion, M.A., Moore, C.J.: Pathological voice quality assessment using artificial neural networks. Med. Eng. Phys. 24, 561–564 (2002)
6. Cordeiro, H., Meneses, C., Fonseca, J.: Continuous speech classification systems for voice pathologies identification. In: Camarinha-Matos, L.M., Baldissera, T.A., Di Orio, G., Marques, F. (eds.) DoCEIS 2015. IFIPAICT, vol. 450, pp. 217–224. Springer, Cham (2015). doi:10.1007/978-3-319-16766-4_23
7. Massachusetts Eye and Ear Infirmary: Voice Disorders Database (Version 1.03, CD-ROM). Kay Elemetrics Corp., Lincoln Park, NJ (1994)
8. Hadjitodorov, S., Mitev, P.: A computer system for acoustic analysis of pathological voices and laryngeal disease screening. Med. Eng. Phys. 24, 419–429 (2002)
9. Godino-Llorente, J.I., Gomez-Vilda, P., Blanco-Velasco, M.: Dimensionality reduction of a pathological voice quality assessment system based on Gaussian mixture models and short-term cepstral parameters. IEEE Trans. Biomed. Eng. 53, 1943–1953 (2006)
10. Cordeiro, H., Fonseca, J., Guimarães, I., Meneses, C.: Hierarchical classification and system combination for automatically identifying physiological and neuromuscular laryngeal pathologies. J. Voice 31(3), 384.e9–384.e14 (2016)