Chapter 2 REVIEW OF PREVIOUS WORK

2.1 Introduction

Automatic Speech Recognition (ASR) is a critical core technology in the field of intelligent communication between humans and machines. Despite the long history of research on the acoustic characteristics of Vowel / Consonant-Vowel (V/CV) unit waveforms [Sakai and Doshita(1963)], [Schaffer and Rabiner(1970)], [D. Dutta Majumder and Pal(1976)], [Broad(1972)], current state-of-the-art ASR systems are still incapable of accurately recognizing this class of sounds. Beginning in 1910, Campbell and Crandall of AT&T and Western Electric Engineering initiated a series of experiments to explore the nature of human speech perception. From 1918 these experiments were continued by Fletcher and his colleagues at the Bell Telephone Laboratories (Western Electric Engineering until 1925). These studies led to a speech intelligibility measure called the articulation index, which accurately characterizes intelligibility under conditions of filtering and noise. All these experiments began with normal conversational speech over a modified telephone channel [Fletcher(1922)] [Fletcher and Munson(1937)] [Fletcher and Galt(1950)]. In the 1930's Homer Dudley, influenced greatly by Fletcher's research, developed a speech synthesizer called the VODER (Voice Operating Demonstrator), an electrical equivalent (with mechanical control) of Wheatstone's mechanical speaking machine [H. Dudley and Watkins(1939)]. These two speech pioneers thoroughly established the importance of the signal spectrum for reliable identification of the phonetic nature of a speech sound. In the 1940's Dudley developed mathematical models for speech, based on linguistic research, that viewed spoken language as a system with the impulses from the larynx and the vocal folds as the input, the shape of the vocal tract as the filter parameters, and the speech waveform as the output [Dudley(1940)].

In 1952, Davis et al., of Bell Laboratories, built a system for isolated digit recognition for a single speaker, using the formant frequencies measured (or estimated) from the vowel regions of each digit. These formant trajectories served as reference patterns for determining the identity of an unknown digit utterance as the best matching digit [K. H. Davis and Balashek(1952)]. In the 1960's, several Japanese laboratories demonstrated their capability of building special purpose hardware to achieve a speech recognition task. Most important were the vowel recognizer of Suzuki and Nakata at the Radio Research Lab in Tokyo [Suzuki and Nakata(1961)], the phoneme recognizer of Sakai and Doshita at Kyoto [Sakai and Doshita(1962)], and the digit recognizer of NEC Laboratories [K. Nagata and Chiba(1963)]. The work of Sakai and Doshita involved the first use of a speech segmenter for analysis and recognition of speech in different portions of the input utterance.

In another early recognition system, Fry and Denes, at University College in England, built a phoneme recognizer to recognize 4 vowels and 9 consonants [Fry and Denes(1959)]. By including statistical information about allowable phoneme sequences in English, they increased the overall phoneme recognition accuracy for words consisting of two or more phonemes. This work marked the first use of statistical syntax (at the phoneme level) in automatic speech recognition. An alternative to the use of a speech segmenter was the concept of adopting a nonuniform time scale for aligning speech patterns. This concept started to gain acceptance in the 1960's through the work of Tom Martin at RCA Laboratories [T. B. Martin and Zadell(1964)] and Vintsyuk in the Soviet Union [Vintsyuk(1968)]. Martin recognized the need to deal with the temporal non-uniformity of repeated speech events and suggested a range of solutions, including detection of utterance endpoints, which greatly enhanced the reliability of recognizer performance. Vintsyuk proposed the use of dynamic programming for time alignment between two utterances in order to derive a meaningful assessment of their similarity [Vintsyuk(1968)]. His work, though largely unknown in the West, appears to have preceded that of Sakoe and Chiba [Sakoe and Chiba(1978)] as well as others who proposed more formal methods, generally known as dynamic time warping, for speech pattern matching. Since the late 1970's, mainly due to the publication by Sakoe and Chiba, dynamic programming, in numerous variant forms (including the Viterbi algorithm, which came from the communication theory community), has become an indispensable technique in automatic speech recognition [Viterbi(1967)].

In the late 1960's, Atal and Itakura independently formulated the fundamental concepts of Linear Predictive Coding (LPC), which greatly simplified the estimation of the vocal tract response from speech waveforms [Atal and Hanauer(1971)] [Itakura and Saito(1970)]. By the mid 1970's, the basic ideas of applying fundamental pattern recognition technology to speech recognition, based on LPC methods, were proposed by Itakura [Itakura(1975)], Rabiner and Levinson and others [L. R. Rabiner and Wilpon(1979)]. In the early 1980's at Bell Laboratories, the theory of the HMM was extended to mixture densities, which have since proven vitally important in ensuring satisfactory recognition accuracy, particularly for speaker independent, large vocabulary speech recognition tasks [Juang(1985)] [B. H. Juang and Sondhi(1986)]. Another technology that was (re)introduced in the late 1980's was the idea of artificial neural networks (ANN). Neural networks were first introduced in the 1950's, but failed to produce notable results initially [McCullough and Pitts(1943)]. The advent, in the 1980's, of the parallel distributed processing (PDP) model, a dense interconnection of simple computational elements, and a corresponding training method, called error backpropagation, revived interest in the old idea of mimicking the human neural processing mechanism [Lippmann(1987)] [Kohonen(1988)] [Pal and Mitra(1988)].

In the 1990's, a number of innovations took place in the field of pattern recognition. The problem of pattern recognition, which traditionally followed the framework of Bayes and required estimation of distributions for the data, was transformed into an optimization problem involving minimization of the empirical recognition error [Juang(1985)]. This fundamental change of paradigm was driven by the recognition that the distribution functions for the speech signal could not be accurately chosen or defined, and that Bayes decision theory becomes inapplicable under such circumstances. After all, the objective of recognizer design should be to achieve the least recognition error rather than the best fit of a distribution function to the given (known) data set, as advocated by the Bayes criterion. The concept of minimum classification or empirical error subsequently spawned a number of techniques, among which discriminative training and kernel-based methods such as Support Vector Machines (SVM) have become popular subjects of study [B.H. Juang and Chou(1997)] [Vapnik(1998)].

This chapter presents a review of previous work in the areas of linear and nonlinear speech processing, multi resolution analysis and the wavelet transform, and neural networks and statistical learning algorithms, and is organized as follows. Sections 2.2, 2.3 and 2.4 provide summaries of research findings in the areas of traditional speech processing, the wavelet transform and nonlinear speech processing, respectively. Section 2.5 reviews previous work on the application of neural networks to speech recognition, and section 2.6 reviews previous work on the application of k-Nearest Neighbor (k-NN) and SVM classifiers. Finally, section 2.7 concludes this review.

2.2 Review on Traditional Features for Speech Recognition

Linear Predictive Coding (LPC) and Mel Frequency Cepstral Coefficient (MFCC) features are regarded as the traditional basic speech features in the sense that the speech model underlying many of these applications is the source-filter model, which represents the vocal characteristics of the speech signal. The linear prediction (LP) model for speech analysis and synthesis was first introduced by Saito and Itakura and by Atal and Schroeder [Itakura and Saito(1970)] [Atal and Schroeder(1967)]. Saito and Itakura at NTT, Japan, developed a statistical approach for the estimation of the speech spectral density using a maximum likelihood method [Itakura and Saito(1970)]. Their work was originally presented at conferences in Japan and was therefore not known worldwide. The theoretical analysis behind their statistical approach was slightly different from that of linear prediction, but the overall results were identical. Based on their statistical approach, Itakura and Saito introduced new speech parameters such as the partial autocorrelation (PARCOR) coefficients for efficient encoding of the linear prediction coefficients. Later, Itakura discovered the line spectrum pairs, which are now widely used in speech coding applications.

In 1975, John Makhoul presented a tutorial review of linear prediction [Makhoul(1975)], giving an exposition of linear prediction in the analysis of discrete signals. The major part of the paper is devoted to all-pole models, whose parameters are obtained by a least squares analysis in the time domain. Two methods result, depending on whether the signal is assumed to be stationary or nonstationary. The same results are then derived in the frequency domain. The resulting spectral matching formulation allows for the modeling of selected portions of a spectrum, for arbitrary spectral shaping in the frequency domain, and for the modeling of continuous as well as discrete spectra.
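
As an illustration of the all-pole analysis discussed above, the following minimal numpy sketch estimates LP coefficients of a single frame by the autocorrelation method, solving the normal equations with the Levinson-Durbin recursion; the reflection coefficients it produces are the PARCOR parameters mentioned earlier. The frame length, order and test signal are illustrative choices, not values from the cited papers.

```python
import numpy as np

def lpc_autocorr(frame, order):
    """All-pole (LPC) analysis by the autocorrelation method with the
    Levinson-Durbin recursion. Returns predictor coefficients a (a[0]=1),
    the PARCOR (reflection) coefficients and the final prediction error."""
    r = np.correlate(frame, frame, mode="full")[len(frame) - 1:][:order + 1]
    a = np.zeros(order + 1)
    a[0] = 1.0
    err = r[0]
    parcor = np.zeros(order)
    for i in range(1, order + 1):
        k = -(r[i] + np.dot(a[1:i], r[i - 1:0:-1])) / err
        a[1:i] = a[1:i] + k * a[i - 1:0:-1]   # update earlier coefficients
        a[i] = k
        parcor[i - 1] = k
        err *= 1.0 - k * k                    # prediction error shrinks
    return a, parcor, err

# Toy usage: a 12th-order analysis of a synthetic vowel-like frame.
fs = 8000
t = np.arange(240) / fs
frame = np.hamming(240) * np.sin(2 * np.pi * 700 * t)  # single "formant"
a, parcor, err = lpc_autocorr(frame, order=12)
```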

The method of linear prediction has proved quite popular and successful for use in speech compression systems [Markel and A.H.Gray(1974)] [Itakura(1972)] [Atal and Hanauer(1971)]. An efficient method for transmitting the linear prediction parameters was found by Sambur using the techniques of differential PCM [Sambur(1975)]. Using this technique, speech can be transmitted with fewer than 1500 bits/second. Further reduction in the linear prediction storage requirements can be realized, at the cost of higher system complexity, by transmitting the most significant eigenvectors of the parameters; in combination with differential PCM this technique can lower the bit rate to 1000 bits/sec. Sambur and Jayant discuss several manipulations of LPC parameters for providing speech encryption [Sambur and Jayant(1976)]. They consider temporal rearrangement or scrambling of the LPC code sequence, as well as the alternative of perturbing individual samples in the sequence by means of pseudorandom additive or multiplicative noise. The latter approach is believed to have greater encryption potential than the temporal scrambling technique, in terms of the time needed to break the security code. The encryption techniques are assessed on the basis of perceptual experiments, as well as by qualitative assessment of speech spectrum distortion, as given by an appropriate distance measure.

2.3 Review on Wavelet Transform for Speech Recognition

Over the last decades, wavelet analysis has become very popular and new interest is emerging in the topic. It has turned out to be a standard technique in the areas of geophysics, meteorology, audio signal processing and image compression [Hongyu Liao and Cockburn(2004)], [Soman and Ramachandran(2005)], [Mallat(2009)]. The Wavelet Transform is a tool for Multi Resolution Analysis (MRA) which can be used to efficiently represent the speech signal in the time-frequency plane.
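
For concreteness, the following sketch shows a multi-resolution decomposition of a speech-like frame with the PyWavelets library; the wavelet family ("db4"), the number of levels and the synthetic frame are illustrative choices.

```python
import numpy as np
import pywt  # PyWavelets

# A stand-in "speech" frame: any 1-D sample array can be used here.
fs = 16000
t = np.arange(int(0.032 * fs)) / fs
frame = np.sin(2 * np.pi * 440 * t) + 0.1 * np.random.randn(t.size)

# Three-level multi-resolution analysis with a Daubechies-4 wavelet:
# coeffs = [cA3, cD3, cD2, cD1] -- a coarse approximation plus detail
# coefficients covering successively higher frequency bands.
coeffs = pywt.wavedec(frame, "db4", level=3)
for name, c in zip(["cA3", "cD3", "cD2", "cD1"], coeffs):
    print(name, len(c))

# The decomposition is invertible (perfect reconstruction):
restored = pywt.waverec(coeffs, "db4")[: len(frame)]
```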

Martin Vetterli [Vetterli(1992)] compared the wavelet transform with the more classical short-time Fourier transform approach to signal analysis, and pointed out the strong similarities between the details of these techniques. Gianpaolo Evangelista [Evangelista(1993)] explored a new wavelet representation based on a pitch-synchronous vector representation that adapts to the oscillatory or aperiodic characteristics of signals. Pseudo-periodic signals are represented in terms of an asymptotically periodic trend and aperiodic fluctuations at several scales, and the transform reverts to the ordinary wavelet transform over totally aperiodic signal segments. The pitch-synchronous wavelet transform is particularly suitable for the analysis, rate reduction in coding, and synthesis of speech signals, and may serve as a preprocessing block in automatic speech recognition systems. Separation of voice from noise in voiced consonants is easily performed by means of partial wavelet expansions.

In [Xia and Zhang(1993)], the authors studied the properties of Cardinal Orthogonal Scaling Functions (COSF), which provide the standard sampling theorem in multiresolution spaces with scaling functions as interpolants. They presented a family of COSF with exponential decay, which are generalizations of the Haar functions. With these COSF, one application is the computation of the Wavelet Series Transform (WST) coefficients of a signal by the Mallat algorithm. They also presented numerical comparisons of different scaling functions to illustrate the advantage of COSF. For signals which are not in multiresolution spaces, they estimated the aliasing error in the sampling theorem using uniform samples.

In [M Lang and R.O.Wells(1996)], Lang et al. presented a new nonlinear noise reduction method using the discrete wavelet transform. They employed thresholding in the wavelet transform domain, following a suggestion by Coifman, using an undecimated, shift-invariant, nonorthogonal wavelet transform instead of the usual orthogonal one. This approach can be interpreted as a repeated application of the original Donoho and Johnstone method for different shifts. The main feature of the algorithm is significantly improved noise reduction compared with the original wavelet based approach. This holds for a large class of signals, and is shown theoretically as well as by experimental results.
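
A minimal sketch of the underlying Donoho-Johnstone scheme is given below, using soft thresholding with the universal threshold on an ordinary decimated transform; Lang et al. instead apply the idea over all shifts via an undecimated transform (available in PyWavelets as pywt.swt). The wavelet and decomposition level are illustrative.

```python
import numpy as np
import pywt

def wavelet_denoise(x, wavelet="db8", level=4):
    """Soft-threshold wavelet shrinkage in the spirit of Donoho and
    Johnstone: estimate the noise level from the finest detail band
    (median absolute deviation), threshold all detail bands, invert."""
    coeffs = pywt.wavedec(x, wavelet, level=level)
    sigma = np.median(np.abs(coeffs[-1])) / 0.6745      # noise estimate
    thresh = sigma * np.sqrt(2.0 * np.log(len(x)))      # universal threshold
    shrunk = [coeffs[0]] + [pywt.threshold(c, thresh, mode="soft")
                            for c in coeffs[1:]]
    return pywt.waverec(shrunk, wavelet)[: len(x)]

# Toy usage: recover a smooth signal buried in white Gaussian noise.
t = np.linspace(0, 1, 1024)
clean = np.sin(2 * np.pi * 5 * t)
denoised = wavelet_denoise(clean + 0.3 * np.random.randn(t.size))
```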

A multiwavelet design criterion known as omnidirectional balancing is introduced by James E. Fowler and Li Hua to extend to vector transforms the balancing philosophy previously proposed for multiwavelet based scalar-signal expansion [Fowler and Hua(2002)]. It is shown that the straightforward implementation of a vector wavelet transform, namely the application of a scalar transform to each vector component independently, is a special case of an omnidirectionally balanced vector wavelet transform in which the filter-coefficient matrices are constrained to be diagonal. Additionally, a family of symmetric-antisymmetric multiwavelets is designed according to the omnidirectional balancing criterion. In empirical results for a vector-field compression system, the performance of vector wavelet transforms derived from these omnidirectionally balanced symmetric-antisymmetric multiwavelets is observed to be far superior to that of transforms implemented via other multiwavelets, and can exceed that of diagonal transforms derived from popular scalar wavelets.

O Farooq and S Datta [Farooq and S.Datta(2004)] proposed a subband feature extraction technique based on an admissible wavelet transform, with the features modified to make them robust to Additive White Gaussian Noise (AWGN). The performance of this system is compared with conventional mel frequency cepstral coefficients (MFCC) under various signal to noise ratios. The recognition performance based on the eight sub-band features is found to be superior to that of MFCC features under noisy conditions.

Elawakdy et al. proposed a speech recognition algorithm using the wavelet transform, combining feature extraction by wavelet transform, subtractive clustering and an adaptive neuro-fuzzy inference system (ANFIS). The extracted features are used as input to the subtractive clustering, which groups the data into clusters, and as input to the neural network in the ANFIS. The initial fuzzy inference system is trained by the neural network to obtain the least possible error between the desired output (target) and the fuzzy inference system (FIS) output, yielding the final FIS. The performance of the proposed speech recognition algorithm (SRA) is evaluated on different samples of speech signals of isolated words with added background noise, and testing on different isolated words obtains a recognition rate of about 99%.

A multi-resolution hidden Markov model using class-specific features is proposed by Baggenstoss [Baggenstoss(2010)]. He applied the PDF projection theorem to generalize the hidden Markov model (HMM) to accommodate multiple simultaneous segmentations of the raw data and multiple feature extraction transformations. Different segment sizes and feature transformations are assigned to each state. The algorithm averages over all allowable segmentations by mapping the segmentations to a "proxy" HMM and using the forward procedure. A by-product of the algorithm is the set of a posteriori state probability estimates that serves as a description of the input data. These probabilities have simultaneously the temporal resolution of the smallest processing windows and the processing gain and frequency resolution of the largest processing windows. The method is demonstrated on the problem of precisely modeling the consonant "T" in order to detect the presence of a distinct "burst" component, and the algorithm is compared against standard speech analysis methods using data from the TIMIT Speech Database.

In short, the literature survey shows an emerging research trend over the past few years in the application of Multi Resolution Analysis using the Wavelet Transform to human speech recognition. Malayalam V/CV speech unit recognition using MRA based Wavelet Transform features is therefore of great importance for capturing the non-stationary nature of the speech signal.

2.4 Review on Non-Linear Dynamical System Models for Speech Recognition

Nonlinear speech processing is a rapidly growing area of research. Naturally, it is difficult to define a precise date for the origin of the field, but it is clear that rapid growth in this area started in the mid-nineteen eighties. Since that time, numerous techniques have been introduced for nonlinear time series analysis, ultimately aimed at engineering applications.

Within the nonlinear dynamics community, a budding interest emerged in the 1980's in applying theoretical results to the analysis of experimental time series data. One of the profound results established in chaos theory is the celebrated Takens' embedding theorem, which states that under certain assumptions the phase space of a dynamical system can be reconstructed through the use of time-delayed versions of the original scalar measurements. This new state space is commonly referred to in the literature as the Reconstructed State Space (RSS), or equivalently the Reconstructed Phase Space (RPS), and has been proven to be topologically equivalent to the original state space of the dynamical system.
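
As a concrete sketch, the delay-coordinate map can be written in a few lines of numpy; the embedding dimension and delay below are illustrative, and their principled selection is discussed later in this section.

```python
import numpy as np

def delay_embed(x, dim, tau):
    """Takens delay-coordinate reconstruction: each row of the result
    is the vector (x[n], x[n + tau], ..., x[n + (dim - 1) * tau])."""
    n = len(x) - (dim - 1) * tau
    return np.column_stack([x[i * tau: i * tau + n] for i in range(dim)])

# A periodic signal reconstructs as a closed orbit in the state space.
x = np.sin(np.linspace(0, 20 * np.pi, 2000))
rss = delay_embed(x, dim=3, tau=25)   # shape (1950, 3)
```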

Packard et al. first proposed the concept of phase space reconstruction in 1980 [Packard.N.H and Shaw.R.S(1980)]. Soon after, Takens showed that a delay coordinate mapping from a generic state space to a space of higher dimension preserves topology [Takens(1980)]. Sauer and Yorke modified Takens' theorem to apply to the analysis of experimental time series data [Sauer.T and Casdagli.M(1991)].

Conventional linear digital signal processing techniques often utilize the frequency domain as the primary processing space, obtained through the Discrete Fourier Transform (DFT) of a time series. For a linear dynamical system, the signal representation in the frequency domain takes the form of sharp resonant peaks in the spectrum. For a nonlinear or chaotic system, however, no such representation appears in the frequency domain, because the spectrum is usually broadband and resembles noise. In the RPS, a signal representation emerges in the form of complex, dense orbits that form patterns known as attractors. These attractors contain the information about the time evolution of the system, which means that features derived from an RPS can potentially contain more, or different, information.

The majority of the literature that utilizes an RSS for signal processing applications revolves around its use for control, prediction, and noise reduction, reporting both positive and negative results. There is only scattered research using RPS features for classification and/or recognition experiments.

In contrast to the linear source-filter model of the speech production process, a large number of research works reported in the literature demonstrate nonlinear effects in the physical process. Koizumi et al. showed in 1985 that the vocal tract and the vocal folds do not function independently of each other, but that there is in fact some form of coupling between them when the glottis is open [Koizumi.T and Hiromitsu.S(1985)]. This can cause significant changes in formant characteristics between open and closed glottis cycles [Brookes.D.M and Naylor.P.A(1988)].

Teager and Teager [Teager.S.M(1989)] have claimed that voiced sounds are characterised by highly complex airflows in the vocal tract, rather than well-behaved laminar flow. Turbulent flow of this nature is also accepted to occur during unvoiced speech, where the generation of sound is due to a constriction at some point in the vocal tract. In addition, the vocal folds will themselves be responsible for further nonlinear behaviour, since the muscle and cartilage which comprise the larynx have nonlinear stretching qualities [Fletcher and Munson(1937)].

Such nonlinearities are routinely included in attempts to model the physical process of vocal fold vibration, which have focused on two or more mass models [Fletcher and Galt(1950)], [H. Dudley and Watkins(1939)], [Dudley(1940)], in which the movement of the vocal folds is modeled by masses connected by springs, with nonlinear coupling. Observations of the glottal waveform reinforce this evidence: it has been shown that this waveform can change shape at different amplitudes [Schoentgen(1990)]. Such a change would not be possible in a strictly linear system, where the waveform shape is unaffected by amplitude changes.

The extraction of invariant parameters from the speech signal has attracted researchers designing speech and speaker recognition systems. In 1988, Narayanan and Sridhar [Narayanan.N.K and Sridhar.C.S(1988)] used dynamical system techniques from nonlinear dynamics to extract invariant parameters from the speech signal. The dynamics of the speech signal were experimentally investigated by extracting the second order attractor dimension D2 and the second order Kolmogorov entropy K2 of the speech signal. The fractal (non-integer) value of D2 and the non-zero value of K2 confirm the contribution of deterministic chaos to the behavior of the speech signal. The attractor dimension D2 and Kolmogorov entropy K2 are then used as a powerful tool for voiced/unvoiced classification of speech signals.

The dimension of the trajectories, or the dimension of the attractor, is an important characteristic of a dynamical system. The estimation of the dimension gives a lower bound on the number of parameters needed to model the system. The goal is to find whether the system under study occupies all of the state space or whether it stays most of the time in a subset of the space, called an attractor. The correlation dimension [Tishby.N(1990)] is a practical method to estimate the dimension from an empirical time series.
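
The correlation dimension can be sketched via the Grassberger-Procaccia correlation sum, as below. This is the naive O(n^2) version (the fast algorithm mentioned later in this section avoids the full distance matrix), and the choice of radii inside the scaling region is left to the user.

```python
import numpy as np

def correlation_dimension(points, radii):
    """Grassberger-Procaccia estimate of D2: the correlation sum C(r)
    is the fraction of point pairs closer than r, and D2 is the slope
    of log C(r) versus log r over a scaling region."""
    n = len(points)
    diff = points[:, None, :] - points[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    pairs = dist[np.triu_indices(n, k=1)]        # each pair counted once
    c = np.array([np.mean(pairs < r) for r in radii])
    slope, _ = np.polyfit(np.log(radii), np.log(c), 1)
    return slope

# Usage on a delay-embedded signal (see the embedding sketch above);
# radii must lie inside the scaling region so that C(r) > 0 for all r.
```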

There is a large variety of techniques in the literature on nonlinear methods, and it is difficult to predict which techniques will ultimately be more successful in speech processing. However, commonly observed methods in the speech processing literature are various forms of oscillators and nonlinear predictors, the latter being part of the more general class of nonlinear autoregressive methods. The oscillator and autoregressive techniques are themselves closely related, since a nonlinear autoregressive model in its synthesis form becomes a nonlinear oscillator if no input is applied. For the practical design of a nonlinear autoregressive model, various approximations have been proposed [Farmer.J.D and Sidorowich.J.D(1988)] [Casdagli.M and Gibson.J(1991)] [Abarbanel.H.D.I and Tsimring.L.S(1993)] [Kubin.G(1995)]. These can be split into two main categories: parametric and nonparametric methods.

Phase space reconstruction is usually the first step in the analysis of dynamical systems. An experimenter obtains a scalar time series from one observable of a multidimensional system, and state-space reconstruction is then needed for the indirect measurement of the system's invariant parameters such as dimension and Lyapunov exponents. Takens' theorem gives little guidance about practical considerations for reconstructing a good state space. It is silent on the choice of time delay (τ) to use in constructing m-dimensional data vectors; indeed, it allows any time delay as long as one has an infinite amount of infinitely accurate data. For reconstructing state spaces from real-world, finite, noisy data, however, it gives no direction [Casdagli.M and Gibson.J(1991)]. Two heuristics have been developed in the literature for establishing a time lag [Kantz and Schreiber.T(2003)]: the first zero of the autocorrelation function, and the first minimum of the auto mutual information curve [Fraser.A.M and Swinney.H.L(1986)]. Andrew M Fraser and Harry L Swinney examined the mutual information for a model dynamical system and for chaotic data from an experiment on the Belousov-Zhabotinskii reaction, and presented an N log N algorithm for calculating the mutual information I. A minimum in I is found to be a good criterion for the choice of time delay in phase space reconstruction from time series data, and this criterion is shown to be far superior to choosing a zero of the autocorrelation function. Both heuristics are sketched below.
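
A minimal numpy sketch of the two lag-selection heuristics follows; the mutual information estimate uses a simple joint histogram rather than the adaptive partitioning of Fraser and Swinney, and the bin count and maximum lag are illustrative.

```python
import numpy as np

def first_autocorr_zero(x):
    """First lag at which the autocorrelation of x crosses zero."""
    x = x - x.mean()
    ac = np.correlate(x, x, mode="full")[len(x) - 1:]
    below = np.where(ac <= 0)[0]
    return int(below[0]) if below.size else None

def first_mi_minimum(x, max_lag=100, bins=32):
    """First local minimum of the auto mutual information
    I(x[n]; x[n + tau]), estimated from a joint histogram."""
    mi = []
    for tau in range(1, max_lag + 1):
        joint, _, _ = np.histogram2d(x[:-tau], x[tau:], bins=bins)
        p = joint / joint.sum()
        px, py = p.sum(axis=1, keepdims=True), p.sum(axis=0, keepdims=True)
        nz = p > 0
        mi.append(np.sum(p[nz] * np.log(p[nz] / (px @ py)[nz])))
        if len(mi) >= 3 and mi[-3] > mi[-2] < mi[-1]:
            return tau - 1                    # previous lag was a minimum
    return int(np.argmin(mi)) + 1             # fall back to global minimum
```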

There have been many discussions on how to determine the optimal embedding dimension from a scalar time series based on Takens' theorem or its extensions [Sauer.T and Casdagli.M(1991)]. Among the different geometrical criteria, the most popular seems to be the method of False Nearest Neighbors, which concerns the fundamental condition that the reconstructed attractor have no self-intersections [Kennel.M.B and Abarbanel.H.D.I(1992)].
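
The idea can be sketched as follows: a neighbour that is close in dimension dim but far in dimension dim + 1 is "false", and the embedding dimension is taken where the fraction of such neighbours drops to near zero. The tolerance rtol is an illustrative value, and Kennel et al.'s second, attractor-size criterion is omitted for brevity.

```python
import numpy as np

def false_neighbour_fraction(x, dim, tau, rtol=15.0):
    """Fraction of false nearest neighbours when moving from an
    embedding of dimension `dim` to `dim + 1` (after Kennel et al.)."""
    n = len(x) - dim * tau
    emb = np.column_stack([x[i * tau: i * tau + n] for i in range(dim)])
    extra = x[dim * tau: dim * tau + n]      # coordinate added in dim + 1
    false = 0
    for i in range(n):
        d = np.linalg.norm(emb - emb[i], axis=1)
        d[i] = np.inf                        # exclude the point itself
        j = int(np.argmin(d))                # nearest neighbour in dim
        if d[j] > 0 and abs(extra[i] - extra[j]) / d[j] > rtol:
            false += 1                       # it jumps apart in dim + 1
    return false / n

# Scan dimensions and pick the first where the fraction is ~ 0:
# fractions = [false_neighbour_fraction(x, m, tau=25) for m in range(1, 8)]
```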

Work by Banbrook and McLaughlin [Banbrook and McLaughlin(1994)], Kumar et al. [Kumar.A and Mullick.S.K(1996)] and Narayanan et al. [Narayanan.S.S and Alwan.A.A(1995)] has attempted to use nonlinear dynamical methods to answer the question: "Is speech chaotic?" These papers focused on calculating theoretical quantities such as Lyapunov exponents and the correlation dimension. Their results are largely inconclusive and even contradictory. A synthesis technique for voiced sounds was developed by Banbrook et al., inspired by the technique for estimating the Lyapunov exponents.

In work presented by Langi and Kinsner [Langi.A and Kinsner.W(1995)], speech consonants are characterised using a fractal model for speech recognition systems. Characterization of consonants has been a difficult problem because consonant waveforms may be indistinguishable in the time or frequency domain. The approach views consonant waveforms as coming from a turbulent constriction in the human speech production system, and thus exhibiting a turbulent and noise-like time domain appearance. However, it departs from the usual approach by modeling consonant excitation using chaotic dynamical systems capable of generating turbulent and noise-like excitations. The scheme employs the correlation fractal dimension and the Takens embedding theorem to measure the fractal dimension from time series observations of the dynamical systems. It uses the linear predictive coding (LPC) excitation of twenty-two consonant waveforms as the time series. The correlation fractal dimension is calculated using a fast Grassberger algorithm [Grassberger and Procaccia(1983)].

The criterion in the False Nearest Neighbor approach for determining the optimal embedding dimension is subjective in the sense that different values of its parameters may lead to different results [Cao.L(1997)]. Cao proposed a practical method to determine the minimum embedding dimension from a scalar time series that contains no subjective parameters except for the time delay of the embedding, does not depend strongly on how many data points are available, and is computationally efficient. Several time series were tested to demonstrate these advantages of the method. For real time series data, however, different optimal embedding dimensions are obtained for different threshold values, and with noisy data the method gives spurious results [Kantz and Schreiber.T(2003)].

Narayanan presented an algorithm for voiced/unvoiced speech signal classification using the second order attractor dimension and second order Kolmogorov entropy of the speech signals. The nonlinear dynamics of the speech signal were experimentally analyzed using this approach, and the proposed techniques were further used as a powerful tool for the classification of voiced/unvoiced speech signals in many applications [Narayanan(1999)].

In [N K Narayanan and Sasindran(2000)], Narayanan et al. investigated the application of phase space maps and phase space point distribution parameters to the recognition of Malayalam vowel units, with the features extracted using nonlinear/chaotic signal processing techniques. Andrew et al. presented phase space features for the classification of the TIMIT corpus and demonstrated that the proposed technique outperforms frequency domain based MFCC feature parameters [Andre C Lingren and Povinelli(2003)].

Petry et al. [A Petry and Barone.C(2002)] and Pitsikalis et al. [Pitsikalis.V and Maragos.P(2003)] have used Lyapunov exponents and the correlation dimension in unison with traditional features (cepstral coefficients) and have shown minor improvements over baseline speech recognition systems. Central to both sets of papers is the importance of Lyapunov exponents and the correlation dimension, because they are invariant metrics that are the same, regardless of initial conditions, in both the original and reconstructed phase spaces. Despite their significance, several issues exist in measuring these quantities on real experimental data. The most important is that the measurements are very sensitive to noise. Secondly, the automatic computation of these quantities through a numerical algorithm is not well established, and this can lead to drastically differing results. The overall performance of these quantities as salient features remains an open research question.

In [P Prajith and Narayanan(2004)], P Prajith et al. proposed feature parameters that utilize nonlinear or chaotic signal processing techniques to extract time domain based phase space features. Two sets of experiments are presented. In the first, exploiting theoretical results derived in nonlinear dynamics, a processing space called the phase space is generated and a recognition parameter called the Phase Space Point Distribution (PSPD) is extracted. In the second experiment, the Phase Space Map at a phase angle of π/2 is reconstructed and the PSPD is calculated. The output of a neural network with the error back propagation algorithm demonstrates that phase space features contain substantial discriminatory power.

Kevin M Indrebo et al. introduced a method for calculating speech features from third-order statistics of subband filtered speech signals for use in robust speech recognition [Kevin M. Indrebo(2005)]. These features have the potential to capture nonlinear information not represented by cepstral coefficients. Also, because the features are based on third-order moments, they may be more immune to Gaussian noise than cepstral features, as Gaussian distributions have zero third-order moments.

Richard J Povinelli et al. introduced a novel approach to the analysis and classification of time series signals using statistical models of reconstructed phase spaces [Povinelli.R.J(2006)]. With sufficient dimension, such reconstructed phase spaces are, with probability one, guaranteed to be topologically equivalent to the state dynamics of the generating system, and therefore may contain information that is absent in analysis and classification methods rooted in linear assumptions. Parametric and nonparametric distributions are introduced as statistical representations over the multidimensional reconstructed phase space, with classification accomplished through methods such as Bayes maximum likelihood and artificial neural networks (ANNs). The technique is demonstrated on heart arrhythmia classification and speech recognition, and is shown to be a viable and effective alternative to traditional signal classification approaches, particularly for signals with strong nonlinear characteristics.

In [Prajith and Narayanan(2006)], P Prajith and N K Narayanan introduced a flexible algorithm for pitch calculation utilizing methodologies developed for analyzing chaotic time series. The experimental results showed that the pitch estimated using reconstructed phase space features agrees with that obtained using conventional pitch detection algorithms.

Marcos Faundez-Zanuy compared the identification rates of a speaker recognition system using several parameterizations, with special emphasis on the residual signal obtained from linear and nonlinear predictive analysis [Zanuy(2007)]. It is found that the residual signal is still useful even when a high dimensional linear predictive analysis is used. If the residual signal of a nonlinear analysis is used instead of that of a linear analysis, the combined signals are more uncorrelated; although the discriminating power of the nonlinear residual signal is lower, the combined scheme outperforms the linear one for several analysis orders.

P Prajith explored in his thesis the applications of nonlinear dynamical theory, proposing a nonlinear system as an alternative to the traditional model of speech production. The problem of whether speech (especially vowel sounds) is chaotic was examined through discussion of previous studies and experiments, and nonlinear invariant parameters, chiefly attractor dimensions and the Kolmogorov entropy, were calculated for Malayalam vowels. The non-integer attractor dimension and non-zero value of the Kolmogorov entropy confirm the contribution of deterministic chaos to the behavior of the speech signal [Prajith(2008)].

In a recent study, Kar proposed a novel criterion for the global asymptotic stability of fixed-point state-space digital filters under various combinations of quantization and overflow nonlinearities [Kar(2011)]. Yucel Ozbek et al. proposed a systematic framework for accurate estimation of articulatory trajectories from acoustic data based on multiple-model dynamic systems via a state-space representation [Yucel Ozbek and Demirekler(2012)]. The acoustic measurements and articulatory positions are considered the observable (measurement) and hidden (state) quantities of the system, respectively. To improve the performance of state space based articulatory inversion they used a jump-Markov linear system (JMLS). Comparison of the performance of their method with those reported in the literature shows that the proposed method improves on the performance of state-space approaches.

It is seen that the majority of the literature that utilizes nonlinear techniques for signal processing applications revolves around their use for control, prediction and noise reduction, reporting both positive and negative results. There is only scattered research using these methods for classification or recognition experiments. It is also important to note that very few works have yet been reported on nonlinear speech processing for Malayalam, and no such work has been reported for other Indian languages. The succeeding section of this chapter is focused on a review of the applications of artificial neural networks to speech recognition.

2.5 Review on Applications of ANN for Speech Recognition

Artificial neural net (ANN) algorithms have been designed and implemented for speech pattern recognition by a number of researchers. ANNs are of interest because the algorithms used in many speech recognizers can be implemented using highly parallel neural net architectures, and because new parallel algorithms are being developed that make use of newly acquired knowledge of the working of biological nervous systems. Hutton compared neural network and statistical pattern comparison methods for pattern recognition [Hutton.L.V(1992)]. Neural network approaches to pattern classification problems complement and compete with statistical approaches. Each approach has unique strengths that can be exploited in the design and evaluation of classifier systems. Classical (statistical) techniques can be used to evaluate the performance of neural net classifiers, which often outperform them. Neural net classifiers may have advantages even when their ultimate performance on a training set can be shown to be no better than the classical ones, and they can be implemented in real time using special purpose hardware.

Personnaz et al. present an elementary introduction to networks of formal neurons, covering the state of the art in basic research and applications [Personnaz.L and Dreyfus.G(1990)]. First, the most usual models of formal neurons are described, together with the most commonly used network architectures: static (feedforward) nets and dynamic (feedback) nets. Secondly, the main potential applications of neural networks are reviewed: pattern recognition (vision, speech), signal processing and automatic control. Finally, the main achievements (simulation software, simulation machines, integrated circuits) are presented.

William Huang et al. present neural net approaches to the problems of static pattern classification and time alignment [William Huang and Gold(1988)]. For static pattern classification, multilayer perceptron classifiers trained with back propagation can form arbitrary decision regions, are robust, and are trained rapidly for convex decision regions. For time alignment, the Viterbi net is a neural net implementation of the Viterbi decoder used very effectively in recognition systems based on Hidden Markov Models (HMMs).

Waibel et al. [Weibel.A and Lang.K(1988)] proposed a time delay neural network (TDNN) approach to phoneme recognition, which is characterized by two important properties. Using a three-level arrangement of simple computing units, it can represent arbitrary nonlinear decision surfaces, which the TDNN learns automatically using error back propagation. The time delay arrangement enables the network to discover acoustic-phonetic features and the temporal relationships between them independent of position in time, and hence not blurred by temporal shifts in the input. For comparison, several discrete Hidden Markov Models (HMM) were trained to perform the same task, i.e. the speaker dependent recognition of the phonemes "B", "D" and "G" extracted from varying phonetic contexts. The TDNN achieved a recognition rate of 98.5% correct, compared with 93.7% for the best of the HMMs. They showed that the TDNN learned well known acoustic-phonetic features (e.g., F2-rise, F2-fall, vowel-onset) as useful abstractions, and that it developed alternate internal representations to link different acoustic realizations to the same concept.
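
Structurally, the tied, delayed connections of a TDNN correspond to one-dimensional convolutions over the time axis, as in the PyTorch sketch below; the layer sizes, context widths and feature count are illustrative rather than those of the original paper.

```python
import torch
import torch.nn as nn

class TinyTDNN(nn.Module):
    """Sketch of a time delay neural network: each Conv1d applies the
    same weights at every time shift, mirroring the TDNN's tied delays."""
    def __init__(self, n_features=16, n_classes=3):   # e.g. B, D, G
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(n_features, 8, kernel_size=3),  # delays 0..2
            nn.Sigmoid(),
            nn.Conv1d(8, 3, kernel_size=5),           # wider temporal context
            nn.Sigmoid(),
        )
        self.out = nn.Linear(3, n_classes)

    def forward(self, x):                 # x: (batch, n_features, n_frames)
        h = self.net(x)                   # (batch, 3, remaining_frames)
        h = h.mean(dim=2)                 # integrate evidence over time
        return self.out(h)                # shift-invariant class scores

logits = TinyTDNN()(torch.randn(4, 16, 15))   # four 15-frame inputs
```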

Yoshua Bengio and Renato De Mori used the Boltzmann machine algorithm and the error back propagation algorithm to learn to recognize the place of articulation of vowels (front, center or back), represented by a static description of spectral lines [Bengio and Mori(1988)]. The error rate is shown to depend on the coding. The results are comparable to or better than those they obtained on the same data using hidden Markov models. They also show a fault tolerance property of the neural nets: the error on the test set increases slowly and gradually as an increasing number of nodes fail.

Mah and Chakravarthy examined the key features of simple networks and their application to pattern recognition [Mah.R.S.H and Chakravarthy.V(1992)]. Beginning with a three-layer back propagation network, the authors examine the mechanisms of pattern classification. They relate the number of input, output and hidden nodes to the problem features and parameters; in particular, each hidden neuron corresponds to a discriminant in the input space. They point out that the interactions between the number of discriminants, the size and distribution of the training set, and numerical magnitudes make it very difficult to provide precise guidelines. They found that the shape of the threshold function plays a major role in both pattern recognition and quantitative prediction and interpolation, and that tuning the sharpness parameter can have a significant effect on neural network performance. This feature is under-utilized in many applications, and for some applications a linear discriminant is a poor choice.

Janssen et al. developed a phonetic front-end for speaker-independent recognition of continuous letter strings [Janssen.R.D.T and Cole.R.A(1991)]. A feedforward neural network is trained to classify 3 msec speech frames as one of the 30 phonemes in the English alphabet. Phonetic context is used in two ways: first, by providing spectral and waveform information before and after the frame to be classified, and second, by a second-pass network that uses both acoustic features and the phonetic outputs of the first-pass network. This use of context reduced the error rate by 50%. The effectiveness of the DFT and the more compact PLP (perceptual linear predictive) analysis is compared, and several other features, such as the zerocrossing rate, are investigated. A frame-based phonetic classification performance of 75.7% was achieved.

Ki-Seok-Kim and Hee-Yeung-Hwang present the results of their study on the recognition of Korean phonemes using recurrent neural network models [Ki-Seok-Kim and Hee-Yeung-Hwang(1991)], applying the recurrent multilayer perceptron model to learn the temporal characteristics of speech for phoneme recognition. The test data consist of 144 vowel+consonant+vowel (V+CV) speech chains made up of 4 Korean monophthongs and 9 Korean plosive consonants. The input parameters of the artificial neural network model are the FFT coefficients, the residual error and zerocrossing rates. The baseline model showed a recognition rate of 91% for vowels and 71% for plosive consonants of one male speaker. The authors obtained better recognition rates in various other experiments compared with the existing multilayer perceptron model, showing the recurrent model to be better suited to speech recognition; the possibility of using recurrent models was tested by changing the configuration of this baseline model.

Ahn and Holmes propose a voiced/unvoiced/silence classification algorithm for speech using 2-stage neural networks with delayed decision input [Ahn.R and Holmes.W.H(1996)]. This feed forward neural network classifier determines voiced, unvoiced and silence in the first stage and refines the unvoiced and silence decisions in the second stage. The delayed decision from the previous frame's classification, along with the preliminary decision by the first stage network, the zerocrossing rate and the energy ratio, enables the second stage to correct mistakes made by the first stage in classifying unvoiced and silence frames. Comparisons with a single stage classifier demonstrate the necessity of two-stage classification and show that the proposed classifier performs excellently.

Sunilkumar and Narayanan investigated the potential use of zerocrossing based information of the signal for Malayalam vowel recognition, and developed a vowel recognition system using an artificial neural network. The highest recognition accuracy obtained for normal speech is 90.62% [Sunilkumar(2002)], [R K Sunilkumar and Narayanan(2004)].

Dhananjaya et al. proposed a method for detecting speaker changes in a multi-speaker speech signal [Dhananjaya.N and Yagnanarayana.B(2004)]. The statistical approach to a point phenomenon (a speaker change) fails when the given conversation involves short speaker turns (< 5 sec duration). They used auto associative neural network (AANN) models to capture the characteristics of the excitation source present in the linear prediction (LP) residual of the speech signal; the AANN models are then used to detect the speaker changes.

In [P Prajith and Narayanan(2004)], P Prajith et al. presented the implementation of a neural network with the error back propagation algorithm for speech recognition with the Phase Space Point Distribution as the input parameter. A method is suggested for speech recognition that utilizes nonlinear or chaotic signal processing techniques to extract time domain based phase space features. Two sets of experiments are presented in this paper. In the first, exploiting theoretical results derived in nonlinear dynamics, a processing space called the phase space is generated and a recognition parameter called the Phase Space Point Distribution (PSPD) is extracted. In the second experiment, the Phase Space Map at a phase angle of π/2 is reconstructed and the PSPD is calculated. The output of the neural network demonstrates that phase space features contain substantial discriminatory power.

In [R K Sunilkumar and Narayanan(2004)], the speech signal is modeled using zerocrossing based features. These features are used for recognizing Malayalam vowels and consonant-vowel units using the Kolmogorov-Smirnov statistical test and a multilayer feed forward artificial neural network. The average vowel recognition accuracy using the ANN based method is 92.62% for a single speaker and 91.48% for three female speakers; the average consonant-vowel recognition accuracy for a single speaker is 73.8%. The advantage of this method is that the network performs better than other conventional techniques and requires less computation than conventional parameterizations of the speech signal such as FFT and cepstral methods.
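
As a sketch of the kind of zerocrossing based measurement used in these studies, the few numpy lines below collect the zerocrossing positions of a frame and the intervals between them; the toy frame is illustrative.

```python
import numpy as np

def zerocrossing_intervals(frame):
    """Positions where the waveform changes sign, and the sample
    intervals between successive crossings; histograms of these
    intervals give zerocrossing interval distribution features."""
    signs = np.signbit(frame)
    crossings = np.where(signs[1:] != signs[:-1])[0] + 1
    return crossings, np.diff(crossings)

frame = np.sin(np.linspace(0, 12 * np.pi, 400))    # toy voiced-like frame
crossings, intervals = zerocrossing_intervals(frame)
zcr = len(crossings) / len(frame)                  # zerocrossing rate
```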

Xavier Domont et al. proposed a feed forward neural network for syllable recognition [Xavier Domont and Goerick(2007)]. The core of the recognition system is a hierarchical architecture initially developed for visual object recognition. In this work, they showed that, given the similarities between the primary auditory and visual cortices, such a system can successfully be used for speech recognition. Syllables are used as the basic units for recognition. Their spectrograms, computed using a Gammatone filter bank, are interpreted as images and subsequently fed into the neural network after a preprocessing step that enhances the formant frequencies and normalizes the length of the syllables.

P Prajith investigated in his work the application of a Multi Layer Feed Forward Neural Network (MLFFNN) with the error back propagation algorithm to the classification of Malayalam vowel units. To evaluate the credibility of the classifier he used the reconstructed phase space approach in combination with Mel Frequency Cepstral Coefficients (MFCC). An overall recognition accuracy of 96.24% is obtained in the simulation experiment, and a significant boost in recognition accuracy is reported using the ANN with the hybrid features [Prajith(2008)].

Anupkumar et al. [Anupkumar Paul and Kamal(2009)] studied Linear Predictive Coding Coefficients (LPCC) and Artificial Neural Networks (ANN) for the recognition of Bangla speech. They presented different neural network architecture designs for the pattern at hand, and concluded that neural networks with more hidden layers are able to solve the problems more easily. By comparing error curves and digit recognition accuracy, they concluded that a Multi Layer Perceptron with 5 layers is a more generic approach than a Multi Layer Perceptron with 3 hidden layers.

Hanchate et al. investigated the application of neural networks with one hidden layer with sigmoid functions and an output layer with linear functions. There are 10 output neurons for all the networks, while the number of hidden neurons varies from 10 to 70. The inputs of the network are the features of 4 selected frames with 256 samples per frame, each frame represented by 12 MFCC coefficients of the signal in the frame. A comparatively good recognition accuracy is obtained using this approach [D B Hanchate and Mourya(2010)].
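
A feature pipeline of the shape described above can be sketched with the librosa library as follows; the audio is synthetic, and the frame and coefficient counts simply mirror those quoted from the cited work.

```python
import numpy as np
import librosa

# Synthetic stand-in utterance: a 1-second 220 Hz tone at 16 kHz.
sr = 16000
y = np.sin(2 * np.pi * 220 * np.arange(sr) / sr).astype(np.float32)

# 12 MFCCs per 256-sample frame (non-overlapping frames here).
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=12,
                            n_fft=256, hop_length=256)

# Select 4 frames and concatenate -> a 48-dimensional network input.
features = mfcc[:, :4].T.flatten()
```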

In [Yong and Ting(2011)], Yong and Ting investigated speaker independent vowel recognition for Malay children using the Time Delay Neural Network (TDNN). Because little research has been done on children's speech recognition, the temporal structure of children's speech was not fully understood. A TDNN with two hidden layers was proposed to discriminate six Malay vowels: /a/, /e/, /i/, /o/ and /u/. The speech database consisted of vowel sounds from 360 child speakers. Cepstral coefficients were normalized for the input of the TDNN, and frame rates of 10ms, 20ms and 30ms were tested. It was found that the 30ms frame rate produced the highest vowel recognition accuracy, 81.92%. The TDNN also showed a higher speech recognition rate than previous studies that used the Multilayer Perceptron.

The zerocrossing interval distribution of vowel speech signals is studied by Sunilkumar and Lajish [Sunilkumar and Lajish(2012)] using 5 Malayalam short vowel units. The classification of these sounds is carried out using a multilayer feed forward artificial neural network. After analyzing the distribution patterns and the vowel recognition results, they report that zerocrossing interval distribution parameters can be effectively used for speech phone classification and recognition. The noise robustness of this parameter is also studied by adding additive white Gaussian noise at different signal to noise ratios. The computational complexity of the proposed technique is also lower than that of the conventional spectral techniques, including the FFT and cepstral methods used in the parameterization of speech signals.

In [Battacharjee(2012)], Bhattacharjee discussed a novel technique for the recognition of Assamese phonemes using a Recurrent Neural Network (RNN) based phoneme recognizer. A Multi-Layer Perceptron (MLP) is used as a phoneme segmenter for segmenting phonemes from isolated Assamese words. Two different RNN based approaches were considered for recognition of the phonemes and their performances evaluated, with MFCC used as the feature vector for both segmentation and recognition. With the RNN based phoneme recognizer, a recognition accuracy of 91% was achieved. The recognizer was tested under speaker mismatch and channel mismatch conditions and observed to be robust to unseen speakers; however, its performance degrades under channel mismatch, and Cepstral Mean Normalization (CMN) was used to overcome this performance degradation effectively.

In this thesis, the application of linear and non-linear dynamical system models and multi resolution analysis using wavelet transform features of V/CV speech units to recognition using a brain-like computing algorithm, namely Artificial Neural Networks, is explored in detail.

2.6 Review on Statistical Learning Algorithms for Speech Recognition

Support Vector Machines (SVM) are learning techniques considered an effective method for general purpose pattern recognition because of their high generalization performance without the need for domain specific knowledge [Vapnik(1995)]. Intuitively, given a set of points belonging to two classes, an SVM finds the hyperplane that separates the largest possible fraction of points of the same class on the same side, while maximizing the distance from either class to the hyperplane. This is the optimal separating hyperplane, which minimizes the risk of misclassifying not only the examples in the training set but also the unseen examples of the test set.

The main characteristics of SVMs are that they minimize a formally proven upper bound on the generalization error, and that they work in high dimensional feature spaces by means of a dual formulation in terms of kernels. The prediction is based on hyperplanes in these feature spaces, which may correspond to quite involved classification criteria on the input data. Outliers in the training data set can be handled by means of soft margins.
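
These ingredients (kernel choice, soft margin) appear directly as parameters of off-the-shelf implementations, as in the scikit-learn sketch below; the Gaussian feature vectors are synthetic stand-ins for speech features such as MFCC or wavelet subband energies.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic two-class data standing in for 13-dimensional speech features.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, (100, 13)),
               rng.normal(1.5, 1.0, (100, 13))])
y = np.repeat([0, 1], 100)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3,
                                          random_state=0)

# The RBF kernel realizes the implicit high dimensional feature space;
# C trades margin width against training errors (the soft margin).
clf = SVC(kernel="rbf", C=1.0, gamma="scale")
clf.fit(X_tr, y_tr)
print("test accuracy:", clf.score(X_te, y_te))
```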

In work by Clarkson and Moreno [Clarkson and Moreno(1997)], the authors explore the issues involved in applying SVMs to phonetic classification as a first step toward speech recognition. They present results on several standard vowel and phonetic classification tasks, showing better performance than Gaussian mixture classifiers, along with an analysis of the difficulties they foresee in applying SVMs to continuous speech recognition problems. The paper represents a preliminary step in understanding the problems of applying SVMs to speech recognition.

As a preliminary analysis of speech signals, Anil K Jain et al. [Anil k Jain and Mao(2000)] presented a robust speech recognizer based on features obtained from the speech signal. The authors explored the issues involved in applying SVMs to phonetic classification, presented results on several standard vowel and phonetic classification tasks showing better performance than Gaussian mixture classifiers, and analyzed the difficulties in applying SVMs to continuous speech recognition problems.

In work presented by Aravind Ganapathiraju et al., the use of a support vector machine as a classifier in a continuous speech recognition system is addressed. The technology has been successfully applied to two speech recognition tasks. A hybrid SVM/HMM system was developed that uses SVMs to postprocess data generated by a conventional HMM system. The results obtained in the experiments clearly indicate the classification power of SVMs and affirm their use for acoustic modeling. The oracle experiments reported in their work clearly show the potential of this hybrid system while highlighting the need for further research into the segmentation issue [Aravind Ganapathiraju and Picone(2004)].

In work by Tsang-Long et al. [Tsang-Long Pao and Li(2006)], SVM and NN classifiers and a feature selection algorithm were used to classify five emotions in Mandarin emotional speech, and the experimental results were compared. The overall results reveal that the SVM classifier (84.2%) outperforms the NN classifier (80.8%) and detects anger perfectly, but confuses happiness with sadness, boredom and neutral. The NN classifier achieves better performance in recognizing sadness and neutral, and differentiates happiness and boredom perfectly.

In [Jing Bai(2006)], to improve the learning and generalization ability of the machine-learning model, a new compound kernel is proposed that takes into account the degree of similarity between the sample space and the feature space. The author applied the new compound kernel support vector machine to a speech recognition system for Chinese isolated words, non-specific speakers and a middle-sized vocabulary, and compared the results with SVMs using traditional kernels and with an RBF network. Experiments showed that the SVM with the new compound kernel performs much better than with traditional kernels, has higher recognition rates than the RBF network at different SNRs, and requires shorter training time.

Sandhya Arora et al. [Sandhya Arora and Basu(2010)] discussed the characteristics of some classification methods that have been successfully applied to handwritten Devnagari character recognition, and presented results of SVM and ANN classification methods applied to handwritten Devnagari characters. After preprocessing the character image, they extracted shadow features, chain code histogram features, view based features and longest run features. These features are then fed to a neural classifier and to a Support Vector Machine for classification. In the neural classifier, they explored three ways of combining the decisions of four MLPs, designed for the four different features.

In work by Zhuo-ming Chen et al. [Zhuo-ming Chen and tao Yao(2011)], a new feature (DWTMFC-CT) of consonants is extracted using the wavelet transform, and it is explained that the difference between similar consonants can be described more accurately by this feature. The algorithm used for classification is a multi-class fuzzy support vector machine (FSVM). In order to reduce the computational complexity caused by using the standard fuzzy support vector machine for multi-class classification, the paper proposes a two-stage algorithm. Experimental results show that the proposed algorithm obtains better classification results while greatly reducing the training time.

In [Ruben Solera-urena and de MariA(2012)], the authors suggest the use of a weighted least squares (WLS) training procedure that facilitates imposing a compact semiparametric model on the SVM, resulting in a dramatic complexity reduction. This reduction with respect to a conventional SVM, between two and three orders of magnitude, allows the hybrid WLS-SVC/HMM system to perform real-time speech decoding on a connected-digit recognition task (the Spanish SpeechDat database). The experimental evaluation of the proposed system shows encouraging performance levels in clean and noisy conditions.

In short, SVMs have been widely used for speech recognition applications over the last few years because of their high generalization performance even without domain specific knowledge. In this thesis, the application of non-linear and wavelet based features of V/CV speech units to recognition using SVM is studied in detail.

2.7 Conclusion

In summary, the current stage in the evolution of speech recognition research results from a combination of several elements, such as the versatility of the databases used, the credibility of the different feature selection strategies, environmental conditions, and the performance of different classifiers and their combinations. It is clear that few well known attempts have been reported toward the recognition of Malayalam V/CV speech units, and hence more research is needed to improve the recognition rates of V/CV units in both clean and noisy conditions. The studies performed in the following chapters of this thesis reveal that multi resolution analysis and the non-linear dynamical system approach have a very good role in providing flexible information processing capability, by devising methodologies and algorithms on a massively parallel system capable of handling infinite intra-class variations for the representation and recognition of V/CV speech units.
