2017 International Conference on Advances in ICT for Emerging Regions (ICTer), 07th - 08th September 2017
Automated Speech Recognition for Children with Cleft Lip and Palate Issues

Daminda Herath#1, Lakmal Vithanage*2, Laxman Jayarathne#3

# University of Colombo School of Computing (UCSC), No 35, Reid Avenue, Colombo 7, Sri Lanka
1 [email protected]
3 [email protected]

* Lanka Orix Leasing Company, Rajagiriya, Sri Lanka
2 [email protected]
Abstract— Children who are born with cleft lip and palate commonly have speech problems at some point in their lives, and over half of them will require speech therapy during childhood. The speech-language pathologist has to assess the child's speech production and language development and make appropriate therapy recommendations. Once the palate has been repaired, the child may be able to learn more consonant sounds and say more words, but speech may still be delayed and certain sounds may remain difficult to produce. Such children are therefore treated at a speech therapy clinic: the child, accompanied by the parents, has to attend the clinic and receive speech therapy, and this is a continuous process. During the therapy period the child's speech may be understandable, but some sounds may be distorted. To address this issue, a biometric solution can be implemented. Speech recognition is used to identify the speech produced by the child; it aims at understanding and comprehending WHAT is spoken. In this research, an automated speech recognition system is proposed that identifies certain sounds from the child, extracts those sounds and matches them against the correct sounds. In the final stage, comparison and decision making take place.

Keywords: Speech Recognition, Biometric, Hidden Markov Model, Mel-Frequency Cepstral Coefficient, Word Error Rate
I. INTRODUCTION

In a prospective study that surveyed 51,542 live births and 5,263 stillbirths, the incidence of cleft lip with or without palate (CL+P) in Sri Lanka was determined, and isolated cleft palate (CP) was found to occur in 0.19 per thousand births. Distribution by sex, type of cleft, site and family history were also factors in cleft palate issues. The survey further revealed that, of the three major ethnic groups inhabiting Sri Lanka, the incidence was significantly greater among the Moors than among the Sinhalese and Tamils, while social status was found to have no association with the occurrence of clefts [1]. Based on the statistics of the Lady Ridgeway Hospital in Sri Lanka, generally around 20-25 children with cleft lip with or without palate, or palate only, are registered [2].

Before the palate is repaired, there is no separation between the nasal cavity and the mouth. This means the child cannot build up air pressure in the mouth, because air escapes through the nose and there is less tissue on the roof of the mouth for the tongue to touch. Both of these problems can make it difficult for the child to learn how to make some sounds [3]. Therefore, about half of these children are treated at a speech therapy clinic on a weekly basis over a certain period. During the therapy period the child's speech may be understandable, but some sounds may be distorted, and some children may simply develop speech more slowly than others. The speech-language pathologist carefully listens to the sounds and examines the child's sound patterns, and can thereby identify whether the problem lies in the child's soft palate, cleft, throat, nasal cavity, mouth and so on. This is not an easy task, however: both the pathologist and the parents have to work hard to elicit sounds from the child, because a set of specific words or letters is used to identify the problems (refer to Fig. 1 in the Appendix).

In order to minimise the above hindrances, we propose a biometric solution in which the child can be either at the clinic or at home. The system can record the child's sounds at any convenient time, and the recorded sounds can then be compared with training sounds through the speech recognition system. The system can thus identify whether the child still has a pronunciation problem and, if so, where the problem occurs.
II. LITERATURE REVIEW

Shivakumar P.G., Potamianos A., Lee S. and Narayanan S. have carried out research on speech recognition for children, "Improving Speech Recognition for Children using Acoustic Adaptation and Pronunciation Modeling". In this research they discussed feature extraction techniques such as the Mel-Frequency Cepstral Coefficient (MFCC) to extract features from speech; normalization techniques such as Cepstral Mean and Variance Normalization (CMVN), which reduces the raw cepstral features to zero mean and unit variance, and Vocal Tract Length Normalization (VTLN), which reduces inter-speaker variability; and the Hidden Markov Model (HMM) to match patterns in a statistical pattern recognition approach [4].

Potamianos A. and Narayanan S.S. have presented a paper titled "Automatic speech recognition for children". In this paper they used a statistical model framework, namely HMMs, for pattern matching, and a speaker normalization algorithm combining frequency warping and spectral shaping was introduced to reduce acoustic variability and significantly improve recognition performance for child speakers [5].

Gray S.S., Willett D., Lu J., Pinto J., Maergner P. and Bodenstab N. have carried out research named "Child Automatic Speech Recognition for US English: Child Interaction with Living-Room-Electronic-Devices". In this research they used MFCCs and delta coefficients as speech feature input, and the acoustic model makes use of context-dependent tree-based clustered Gaussian mixture HMMs. Finally, they used the word error rate (WER) metric to measure the performance of the speech recognition system [6].

III. SPEECH RECOGNITION

Speech recognition refers to the process and related technology for converting speech signals into a sequence of words or other linguistic units by means of an algorithm implemented as a program. There are two variants of speech recognition: speaker dependent and speaker independent [7]. In the speaker-independent mode the computer ignores the speaker-specific characteristics of the speech signal and extracts the useful words, phrases or letters. In speaker recognition, on the other hand, the machine should extract the speaker characteristics from the acoustic signal; the speech signal from an unknown speaker is then compared against a database of known speakers [7]. Speech recognition can also be divided into two methods, text-dependent and text-independent. In the text-dependent method the speaker speaks key words or sentences with the same text for both training and testing trials, whereas the text-independent method does not rely on a specific text being spoken [8]. In this research the child's speech is spontaneous: pathologists or parents cannot force the child to utter a specific word, phrase or letter. Therefore the system, with its spontaneous-speech ability, should be able to handle a variety of natural speech features. In this case, we used speaker-independent, spontaneous speech recognition with the text-dependent method.

IV. DESIGN OF THE SYSTEM

Children's speech has to be recorded at any possible time, that is, at the clinic or at home.
Recording at home is preferable, because children do not speak on demand; another factor is that the children are infants most of the time, so it is not easy to record their speech at a specific time, and when they do try to speak, an adult has to record it. Nowadays recording devices are widely available, and even a smartphone can be used to record the speech. Once recorded, the sample speech file can be uploaded into the system through a simple, user-friendly interface, and the processing can begin. The system has to remove environmental noise, filter the signal, block it into frames and apply windowing. The sample speech file is then passed through feature extraction, which transforms the incoming sample speech into an internal representation from which it is possible to reconstruct the original signal; the speech sample is then stored in the database. After that, the sample speech file and the training pattern are compared: the system uses a pattern-matching technique to distinguish correct from incorrect pronunciation, and decisions are made on that basis. Fig. 2 shows the high-level design of the proposed system. In this research we used a sequence of steps to carry out the children's speech recognition process and make decisions; for each step, the important methods and techniques were identified. The steps are described in order in the subsections that follow.
A. Speech Capturing
The input speech is captured with the help of a microphone, and the analog signal is converted into digital form by sampling at a frequency of 16 kHz. Sampling means recording the speech signal at regular intervals.

B. Preprocessing
Once the speech-capturing process is complete, the speech is available as continuous samples. The next step is to preprocess these samples so that they are ready for feature extraction and recognition. Preprocessing involves the following steps:
1) Background noise and silence removal
2) Pre-emphasis filter
3) Blocking into frames
4) Windowing

C. Feature Extraction
The collected speech samples are then passed through the feature extraction, feature training and feature testing stages. Feature extraction transforms the incoming speech into an internal representation (feature vector) such that it is possible to reconstruct the original signal from it, which can be used for the recognition purpose. There are various techniques to extract features, but the most widely used is MFCC [9][10][12]. MFCC is used in speech recognition as follows.

The signal is first divided into overlapping frames. Let each frame consist of N samples, and let adjacent frames be separated by M samples, where M < N.

In the next step, each frame is converted from the time domain to the frequency domain by subjecting it to the Fourier Transform. The Discrete Fourier Transform (DFT) of a signal is defined by

X(k) = \sum_{n=0}^{N-1} x(n)\, e^{-j 2\pi k n / N}, \qquad k = 0, 1, \ldots, N-1.

The frequency-domain signal is then converted to the Mel frequency scale, which is more appropriate for human hearing and perception. This is done with a set of triangular filters that compute a weighted sum of spectral components, so that the output of the process approximates a Mel scale. Each filter's magnitude frequency response is triangular in shape, equal to unity at its centre frequency and decreasing linearly to zero at the centre frequencies of the two adjacent filters. The following equation is used to calculate the Mel value for a given frequency f:

\mathrm{Mel}(f) = 2595 \log_{10}\!\left(1 + \frac{f}{700}\right).

In the next step the log Mel-scale spectrum is converted back to the time domain using the Discrete Cosine Transform (DCT). The DCT is defined by

c(k) = \alpha(k) \sum_{n=0}^{N-1} s(n) \cos\!\left(\frac{\pi (2n+1) k}{2N}\right), \qquad k = 0, 1, \ldots, N-1,

where \alpha(k) is a constant dependent on k (\alpha(0) = \sqrt{1/N} and \alpha(k) = \sqrt{2/N} for k > 0) and s(n) is the log Mel-scale spectrum.

The result of the conversion is the set of Mel-Frequency Cepstral Coefficients. The set of coefficients is called an acoustic vector; therefore, each input utterance is transformed into a sequence of acoustic vectors [12]. Here, the MFCC technique has been used as the feature extraction method for speech recognition.
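To make the correspondence between these steps and actual computation explicit, the following NumPy sketch implements the chain framing, windowing, DFT, triangular Mel filter bank, log and DCT. It is a minimal illustration, not the configuration used in this work: the 25 ms frame length, 10 ms frame shift, 512-point FFT, 26 Mel filters and 13 cepstral coefficients are assumed default choices.

```python
import numpy as np

def mfcc(signal, sample_rate=16000, frame_len=0.025, frame_step=0.010,
         n_filters=26, n_ceps=13, n_fft=512):
    """Minimal MFCC sketch: framing -> windowing -> DFT -> Mel filter bank -> log -> DCT.
    Assumes the signal contains at least one full frame of samples."""
    signal = np.asarray(signal, dtype=float)
    # Pre-emphasis filter to boost the higher frequencies.
    signal = np.append(signal[0], signal[1:] - 0.97 * signal[:-1])

    # Block the signal into overlapping frames of N samples, shifted by M samples (M < N).
    N = int(round(frame_len * sample_rate))
    M = int(round(frame_step * sample_rate))
    n_frames = 1 + max(0, (len(signal) - N) // M)
    idx = np.arange(N)[None, :] + M * np.arange(n_frames)[:, None]
    frames = signal[idx] * np.hamming(N)                      # windowing

    # DFT: power spectrum of each windowed frame.
    power = (np.abs(np.fft.rfft(frames, n_fft)) ** 2) / n_fft

    # Triangular Mel filter bank, using Mel(f) = 2595 * log10(1 + f / 700).
    mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
    inv_mel = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    mel_points = np.linspace(mel(0.0), mel(sample_rate / 2.0), n_filters + 2)
    bins = np.floor((n_fft + 1) * inv_mel(mel_points) / sample_rate).astype(int)
    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        l, c, r = bins[i - 1], bins[i], bins[i + 1]
        fbank[i - 1, l:c] = (np.arange(l, c) - l) / max(c - l, 1)   # rise to unity at centre
        fbank[i - 1, c:r] = (r - np.arange(c, r)) / max(r - c, 1)   # fall to zero at next centre

    # Log Mel spectrum, then DCT back to the cepstral (time) domain.
    log_mel = np.log(power @ fbank.T + np.finfo(float).eps)
    k = np.arange(n_ceps)[:, None]
    dct_basis = np.cos(np.pi * k * (2 * np.arange(n_filters) + 1) / (2 * n_filters))
    dct_basis *= np.sqrt(2.0 / n_filters)
    dct_basis[0] = np.sqrt(1.0 / n_filters)       # alpha(k) normalisation from the DCT definition
    return log_mel @ dct_basis.T                  # one acoustic vector (13 coefficients) per frame
```

In practice an established library such as librosa or python_speech_features would normally be used for this stage; the hand-written version is only meant to mirror the equations above, returning one acoustic vector per frame as described.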
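The design in Section IV then compares the extracted acoustic vectors against the training pattern and makes a correct/incorrect decision; the pattern matching discussed in this work is HMM-based. Purely as a simplified stand-in for that comparison-and-decision stage, the sketch below scores a child's MFCC sequence against a reference pronunciation with dynamic time warping (DTW); the distance threshold is an arbitrary placeholder, not a tuned value from this study.

```python
import numpy as np

def dtw_distance(a, b):
    """Dynamic time warping distance between two MFCC sequences (frames x coefficients)."""
    n, m = len(a), len(b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = np.linalg.norm(a[i - 1] - b[j - 1])      # frame-to-frame distance
            cost[i, j] = d + min(cost[i - 1, j],          # insertion
                                 cost[i, j - 1],          # deletion
                                 cost[i - 1, j - 1])      # match
    return cost[n, m] / (n + m)                           # length-normalised alignment cost

def pronunciation_decision(child_mfcc, reference_mfcc, threshold=15.0):
    """Return 'correct' if the child's utterance is close enough to the reference.
    The threshold is an illustrative placeholder only."""
    return "correct" if dtw_distance(child_mfcc, reference_mfcc) <= threshold else "incorrect"
```

A deployed system would replace this template comparison with the HMM-based matching cited in the literature review, but the decision logic (compare the sample against a reference, then threshold the score) is the same.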