Feature Extraction and Classification Techniques for Speech ...

8 downloads 9026 Views 429KB Size Report
2Asst.Professor, Department Of Electronics and Communication Engineering, C.G.P.I.T, Bardoli, Gujarat. 3Asst. ... communication and speech processing has been one of the ..... MFCC and Neural Networks”, International Journal of Modern .
International Journal of Emerging Technology and Advanced Engineering Website: www.ijetae.com (ISSN 2250-2459, ISO 9001:2008 Certified Journal, Volume 3, Issue 12, December 2013)

Feature Extraction and Classification Techniques for Speech Recognition: A Review Nidhi Desai1, Prof.Kinnal Dhameliya2, Prof.Vijayendra Desai3 1

M.Tech. [Electronics and Communication] Student, Department Of Electronics and Communication Engineering, C.G.P.I.T, Bardoli, Gujarat 2 Asst.Professor, Department Of Electronics and Communication Engineering, C.G.P.I.T, Bardoli, Gujarat 3 Asst.Professor, Department Of Electronics and Communication Engineering, CKPCET, Surat, Gujrat The main intension of speech recognition area is to evolve techniques and system for speech input to machine. Speech is the primary means of communication between humans, and it is the dominancy of this medium that motivates research efforts to allow speech to become a viable human computer interaction [5].So, Automatic speech recognition (ASR) is viewed as inherent part of human-computer interfaces, that are visualized to use speech, among other means, to achieve natural, prevalent and ubiquitous computing. There are usually two categories for isolated and continuous speech recognition: speakerdependent and speaker-independent. Speaker dependent method involves training a system that recognize each of the lexicon words uttered single or multiple times by specific set of speakers [7], while for speaker independent training methods are generally unworkable and words are recognized by analyzing their inherent acoustical properties [7]. With the rapid evolution of computer hardware and software, speech recognition technology is moderately key technology in computer information processing technology. It is widely used in voiced-activated telephone exchange, medical services, banking services, industrial control every side of society and people’s lives. Fig.1 shows basic representation of speech recognition system in simple equation which contains feature extraction, database, network training and testing or decoding. The recognition process is shown below (Fig.1).

Abstract— Speech is the most natural form of human communication and speech processing has been one of the most inspiring expanses of signal processing. Speech recognition is the process of automatically recognizing the spoken words of person based on information in speech signal. Automatic Speech Recognition (ASR) system takes a human speech utterance as an input and requites a string of words as output. This paper introduce a brief survey on Automatic Speech Recognition and discuss the major subjects and improvements made in the past 60 years of research, that provides technological outlook and a respect of the fundamental achievement that has been accomplished in this important area of speech communication. Definition of various types of speech classes, feature extraction techniques, speech classifiers and performance evaluation are issues that requires attention in designing of speech recognition system. The objective of this review paper is to summarize some of the well known methods used in several stage of speech recognition system. Keywords—Acoustic Phonetic Approach, Artificial Intelligence, Feature extraction, LPC, MFCC, Pattern Recognition Approach, Speech Recognition, Word Error Rate

I. INTRODUCTION Speech Recognition is the ability of machine or program to identify words and phrase from spoken language and convert them in to machine readable format. It is also known as Automatic Speech Recognition or computer speech recognition and speech to text conversion.

Input Speech

Feature

Database

Extraction

Testing or

Training

Decoding

FIGURE1.BASIC MODEL OF SPEECH RECOGNITION SYSTEM

367

Output Speech

Network

International Journal of Emerging Technology and Advanced Engineering Website: www.ijetae.com (ISSN 2250-2459, ISO 9001:2008 Certified Journal, Volume 3, Issue 12, December 2013) A. Linear prediction coding(LPC) LPC is one of the most powerful signal analysis methods for linear prediction. It is predominant technique for determining the basic parameters of speech and provides precise estimation of speech parameters and computational model of speech. Speech sample can be approximated as a linear combination of past speech samples is the basic idea behind LPC. Following figure.2 shows the steps involved in LPC feature extraction.

II. TYPES OF SPEECH UTTERED Separation of speech recognition system in different classes can be made based on what type of utterance they have ability to recognize. A. Isolated speech Isolated word recognizer usually set necessary condition that each utterance having little or no noise on both sides of sample window. It requires single utterance at a time. Often, these types of speech have “Listen/Not-Listen states”, where they require the speaker to have pause between utterances. Isolated word might be better name for this type. B. Connected word Connected word require minimum pause between utterances to make speech flow smoothly. They are almost similar to isolated words. C. Continuous speech Continuous speech is basically computer’s dictation. It is normal human speech, without silent pauses between words. This kind of speech makes machine understanding much more difficult.

Figure2.Steps involve in LPC Feature extraction [8]

B. Mel frequency Cepstral Coefficient(MFCC) MFCC is the most evident and popular feature extraction technique for speech recognition. It approximates the human system response more closely than any other system because frequency bands are placed logarithmically here. They are obtained from a Mel-frequency cepstrum where frequency bands are equally spaced on the Mel scale. Computation technique of MFCC is based on the shortterm analysis and thus from each frame MFCC vector is computed. MFCC can be computed by using the formula:

D. Spontaneous speech Spontaneous speech can be thought of as speech that is natural sounding and no tried out before. An ASR system with spontaneous speech ability should be able to handle a diversity of natural speech features such as words being run at the same time.

Mel (f) =2595*log10 (1+f/700)

III. FEATURE EXTRACTION TECHNIQUES-AN OVERVIEW

Following figure.3 shows the steps involved in MFCC feature extraction.

Feature extraction is the most important part of speech recognition as it distinguishes one speech from other. The utterance can be extracted from a vast range of feature extraction techniques suggested and successfully utilized for speech recognition task, but extracted feature should meet some criteria while negotiating with the speech signal such as [5]:  Easy to measure extracted speech feature  It should not be receptive to mimicry  It should show less variation from one speaking environment to another  It should be balanced over time  It should occur normally and naturally in speech Different techniques for feature extraction are LPC, MFCC, AMFCC, PLP, PCA, cepstral analysis, RAS etc.

Figure.3 Steps involve in MFCC Feature extraction[8]

368

International Journal of Emerging Technology and Advanced Engineering Website: www.ijetae.com (ISSN 2250-2459, ISO 9001:2008 Certified Journal, Volume 3, Issue 12, December 2013) C. Linear prediction cepstral coefficient(LPCC) The goal of feature extraction is to demonstrate speech signal by finite number of measures of the signal. Linear Predictive Coding is used to obtain the LPCC coefficients from the speech tokens. The LPCC coefficients are then translated to cepstral coefficients.The cepstral coefficients are regularized in between 1 and -1[6]. LPCC was implemented using autocorrelation method. LPCC are highly sensitive to quantization noise is the main drawback of it.

Steps included in acoustic phonetic approach are as follows: The first step is the spectral analysis of speech which describes the broad acoustic properties of different phonetic units. The next step is segmentation and labeling the speech, that results in a phoneme lattice characterization of the speech. The last step is determination of string of words or a valid word from phonetic label sequences brought out by the segmentation to labeling. This approach has not been most extensively used in most commercial application [3]. B. Pattern recognition approach Two essential steps involves in pattern recognition approach are, pattern training and pattern comparison. Using a well formulated mathematical framework and initiates consistent speech pattern representation for reliable pattern comparison, from a set of labeled training samples through formal training algorithm is essential feature of this approach. In this, there exist two methods: Template base approach and stochastic approach. Stochastic model are more suitable approach to speech recognition as it uses probabilistic models to deal with undetermined or incomplete information [1].there exits many methods in this approach like HMM, SVM, DTW, VQ etc, among these hidden markov model is most popular stochastic approach today. Hidden markov model (HMM): A hidden markov model is signalizing by a finite state markov model and a set of output distributions. The alteration parameter in the Markov chain models are temporal variability’s, while the output distribution model parameters are spectral variability. These two types of variability are essential for speech recognition. Hidden Markov modeling is more general and has a secure mathematical foundation compared to template based approach. Compared to knowledge base approach, HMM enables easy incorporation of knowledge sources into organized architecture.HMM do not provide much insight on the recognition process, is negative side effect of HMM. To improve performance of HMM system, analyses of errors of system is made, but it is quite difficult. However, judicious incorporation of knowledge has significantly improved HMM based system. Dynamic time warping (DTW): Dynamic time warping is an algorithm for measuring similarity between two sequences which may vary in time or speed[4].DTW has been applied to video,audio,graphic, infect any data which can be develop into a linear representation can be analyzed with DTW.

IV. SPEECH RECOGNITION TECHNIQUE CLASSIFICATION Basically there are three approaches to recognition. A. Acoustic Phonetic Approach B. Pattern Recognition Approach C. Artificial Intelligence Approach

speech

Figure4.Speech recognition Technique Classification [2]

A. Acoustic phonetic approach Acoustic phonetic approach for speech recognition is based on finding speech sound and providing appropriate labels these sounds. The basis of acoustic phonetic approach based on the fact that, there exist finite and exclusive phonemes in spoken language and these are broadly characterized by a set of acoustic properties that are demonstrated in the speech signal over time. With speaker and co articulation effect, the acoustic properties of phonetic units are highly variable, it is assumed in this approach that, the criteria governing the instability are straightforward and can be readily learned by machine. 369

International Journal of Emerging Technology and Advanced Engineering Website: www.ijetae.com (ISSN 2250-2459, ISO 9001:2008 Certified Journal, Volume 3, Issue 12, December 2013) In general, DTW allows a computer to search an optimal match between two time series if one time series may be “warped” non-linearly by pulling or shriveling it along its time axis. This warping between two time series can then be used to find equivalently regions among the two time series or to diagnose the similarity between two times series [4].Continuity is not much important in DTW than in other pattern matching algorithms.

The WER is acquired from the Levenshtein distance, working at the word level alternatively to the phoneme level [9].Word error rate can be computed as

WER 

S DI N

Where, S is the number of substitutions. D is the number of the deletions. I is the number of the insertions. N is the number of words in the reference. When reporting the performance of a speech recognition system, sometimes word recognition rate (WRR) is used instead of WER,

A. Artificial intelligence approach (knowledge based approach) Combination of acoustic phonetic approach and pattern recognition approach makes The Artificial Intelligence approach. Acoustic phonetic knowledge is used to developed classification rules for speech sound where template based methods provide less insight about human speech processing, but these methods have been very productive in the design of a diversity of speech recognition system. This approach is not much successful as complexness in quantifying skilful knowledge. Integration of levels of human knowledge i.e. phonetics, lexical access, syntax and semantics, is the another problem of this approach. Artificial Neural Network method is more reliable method for this approach. Artificial neural networks (ANN): An artificial neural network contains potentially large number of simple processing element that is called units or neurons, which impact each other’s performance via a network of excitatory or repressive weights [5]. It is a feed-forward artificial neural network which has more than one layer of hidden units between its inputs and its outputs. Neural Network provides three types of learning methods namely supervised, unsupervised and reinforced.

WRR  1  WER 

N S DI H I  N N

Where is N-(S+D), the number of correctly recognized word. VI. CONCLUSION Different feature extraction techniques and recognition techniques are discussed in this paper and it can be concluded that performance of MFCC technique is superior to LPCC performance[6].This paper attempts to provide a comprehensive survey on speech recognition and to deliver some year wise progress to this date and it is challenging and interesting problem in and of itself. It has been found that HMM is the best technique in developing language model. Speech recognition has attracted scientist as an important regulation and has created a technological influence on society. It is hoping that this paper bring out understand and inspiration amongst the research group of ASR.

V. PERFORMANCE MEASURING PARAMETERS The performance of speech recognition system is often described in terms of accuracy and speed.Accuracy may be meausured in terms of word error rate(WER),where speed is measured with the real time factor.Single Word Error Rate(SWER) and Command Success Rate(CSR) are another measures of accuracy[3].

REFERENCES [1]

[2]

A. Word error rate (WER) Word error rate is familiar measurement of the performance of a speech recognition or machine translation system. The recognized word sequence can have different length from the reference word sequence is more general difficulty in performance measurement.

[3]

[4]

370

M.A.Anusuya, “Speech Recognition by Machine,” International Journal of Computer Science and Information security, Vol.6, No.3, 2009 S.J.Arora and R.Singh, “Automatic Speech Recognition: A Review, “International Journal of Computer Applications, vol60-No.9, December 2012 Santosh K.Gaikward and Bharti W.Gawali, “A Review on Speech Recognition Technique,” International Journal of Computer Applications, vol 10, No.3, November 2010 Lindasalwa Muda, “Voice Recognition Algorithms using Mel Frequency Cepstral Coefficient (MFCC) and Dynamic Time Warping (DTW) Techniques “, Journal Of Computing, Volume 2, Issue 3, March 2010

International Journal of Emerging Technology and Advanced Engineering Website: www.ijetae.com (ISSN 2250-2459, ISO 9001:2008 Certified Journal, Volume 3, Issue 12, December 2013) [5]

[6]

[7]

Nidhi Srivastava and Dr.Harsh Dev“Speech Recognition using MFCC and Neural Networks”, International Journal of Modern Engineering Research (IJMER), march 2007 Dr.R.L.K.Venkateswarlu, Dr.R.Vasantha Kumari and A.K.V.Nagavya, “Efficient Speech Recognition by Using Modular Neural Network”, Int. J. Comp. Tech. Appl., Vol 2 (3) Bishnu Prasad Das and Ranjan Parekh, “Recognition of Isolated Words using Features based on LPC, MFCC, ZCR and STE, with Neural Network Classifiers”, International Journal of Modern Engineering Research (IJMER) , Vol.2, Issue.3, May-June 2012

[8]

[9]

371

Om Prakash Prabhakar and Navneet Kumar Sahu, “A Survey On: Voice Command Recognition Technique,” International Journal of Advanced Research in Computer Science and Software Engineering, Volume 3, Issue 5, May 2013 Milind U. Nemade and Prof. Satish K. Shah, “Survey of Soft Computing based Speech Recognition Techniques for Speech Enhancement in Multimedia Applications”, International Journal of Advanced Research in Computer and Communication Engineering Vol. 2, Issue 5, May 2013

Suggest Documents