A Hybrid ANN/HMM Model for Arabic Speech Recognition Using an Optimal Codebook

Mohamed ETTAOUIL

Mohamed LAZAAR

Zakariae EN-NAIMANI

Modelling and Scientific Computing Laboratory, Faculty of Science and Technology, University Sidi Mohammed Ben Abdellah, Fez, MOROCCO

[email protected], [email protected], [email protected]

Abstract—Thanks to Automatic Speech Recognition (ASR), many machines can nowadays emulate the human ability to understand and speak natural language. The ASR problem, however, is as interesting as it is difficult. Its difficulty is due precisely to the complexity of speech processing, which must take many aspects into consideration: acoustic, phonetic, syntactic, etc. The technology most commonly used for speech recognition is therefore based on statistical models, in particular the Hidden Markov Models, which can simultaneously model the frequency and temporal characteristics of the speech signal. Neural networks are an alternative; another interesting framework applied in ASR is the hybrid Artificial Neural Network (ANN) and Hidden Markov Model (HMM) recognizer, which improves on the accuracy of either model alone. In the present work, we propose an Arabic digit recognition system based on a hybrid Artificial Neural Network and Hidden Markov Model (ANN/HMM). The main innovation of this work is the use of an optimal neural network to determine the optimal class, unlike the classical Kohonen approach. The numerical results are encouraging and satisfactory.

Keywords—Automatic Speech Recognition, Hidden Markov Models, Artificial Neural Network, Vector Quantization.

I. INTRODUCTION

Speech recognition is a topic that is very useful in many applications and environments of our daily life, and it is one of the most challenging fields that scientists have faced for a long time. Because speech is the most essential medium of human communication, we have always tried to replicate this ability on machines. However, the processing of the speech signal has to consider many aspects that complicate the task. Indeed, the signal information can be analyzed from different sides: acoustic, phonetic, phonologic, morphologic, syntactic, semantic and pragmatic [3].

The Hidden Markov Model has become one of the most powerful statistical methods for modeling speech signals. Its principles have been successfully used in automatic speech recognition, formant and pitch tracking, speech enhancement, speech synthesis, statistical language modeling, part-of-speech tagging, spoken language understanding, and machine translation. The Hidden Markov Model makes it possible to model the temporal variation of speech, with the local spectral variability modeled by flexible distributions such as mixtures of Gaussian densities. New techniques have emerged to increase the performance of speech recognition systems; these are based on the hybridization of Artificial Neural Networks and Hidden Markov Models (ANN/HMM), and such approaches represent a competitive alternative to standard systems [5].

We describe in this paper an original and successful use of unsupervised ANNs and Hidden Markov Models for hybrid speech recognition systems. Although conceptually weaker than continuous supervised systems, the recognition approach presented here offers performance and learning speedup advantages when compared to other, more classic, discrete systems. The unsupervised ANNs used here are the Self-Organizing Maps (SOM) of Kohonen. Since the training stage is very important for SOM performance, the selection of the architecture of the Kohonen network associated with a given problem is one of the most important problems in neural network research [2]. More precisely, the choice of the number of neurons and of the initial weights has a great impact on the convergence of the learning methods. The optimization of artificial neural network architectures, particularly Kohonen networks, is a recent problem. In this work, we discuss a new method of Automatic Speech Recognition using a new hybrid ANN/HMM model; the latter contains an optimal architecture for the Kohonen networks, which makes it possible to determine the optimal codebook and to reduce the coding time.

This paper is organized as follows. Section 2 presents feature extraction from the speech signal. In Section 3, vector quantization by Kohonen networks is described. The Hidden Markov Model is presented in Section 4. In Section 5, the proposed approach to determine the optimal codebook is presented. Experimental results are given in Section 6, and the last section concludes our work.

II. FEATURE EXTRACTION FROM SPEECH (MFCC)

Voice signal identification is the process of converting a speech waveform into features that are useful for further processing. Many algorithms and techniques are in use; the choice depends on the capability of the features to capture time, frequency and energy information in a set of coefficients for cepstrum analysis [1].

Despite many differences between individuals, and the existence of many languages, speech follows general patterns and on average has well-defined characteristics such as volume, frequency distribution, pitch rate and syllabic rate [3]. These characteristics are adapted to the environment and to the limitations of hearing and voice production: speech characteristics fit the speech-generating abilities of the body, although the rapid changes in society over the past century have outpaced this adaptation. The human voice is converted into digital form, producing data that represent the signal level at every discrete time step. A wide range of possibilities exists for parametrically representing the speech signal for speech processing tasks, such as Linear Prediction Coding (LPC), Mel Frequency Cepstral Coefficients (MFCC), and others.

Generally speaking, a conventional speech processing system can be organized in two blocks: feature extraction and the modeling stage. MFCCs are the most commonly used acoustic features in speech processing, and they are the features used in this paper. MFCCs are based on the known variation of the human ear's critical bandwidths with frequency [5].
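To make the feature-extraction stage concrete, the following minimal sketch computes MFCCs for one utterance. It assumes the Python library librosa and a hypothetical recording digit.wav; the sampling rate, frame settings and number of coefficients are illustrative defaults, not the configuration of the original experiments.

```python
# Minimal MFCC extraction sketch (librosa assumed; settings illustrative).
import librosa

signal, sr = librosa.load("digit.wav", sr=16000)   # hypothetical utterance, resampled to 16 kHz
mfcc = librosa.feature.mfcc(
    y=signal, sr=sr, n_mfcc=13,                    # 13 cepstral coefficients per frame
    n_fft=400, hop_length=160,                     # 25 ms analysis windows, 10 ms hop
)
print(mfcc.shape)                                  # (13, T): one acoustic vector per frame
```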

III. VECTOR QUANTIZATION

Vector quantization is considered a data compression technique in the field of speech coding. It is well suited to applications in which the statistical distribution of the original data changes over time, since it supports adaptive learning from the data [13]. Moreover, a neural network has a massively parallel structure and offers the possibility of high-speed processing, and quantization methods based on artificial neural networks, in particular Self-Organizing Maps, have been increasingly applied to reduce problem complexity, for example in pattern recognition.

The Self-Organizing Map (Kohonen neural network) is, of all artificial neural network architectures and learning schemes, the closest to the biological neural network. The aim of Kohonen learning is to map similar signals to similar neuron positions. The Kohonen network has a single layer of neurons arranged in a two-dimensional plane, called the output layer (see Fig. 1). The additional input layer just distributes the inputs to the output layer, and the number of neurons in the input layer is equal to the dimension of the input vector. A defined topology means that each neuron has a defined number of neurons as nearest neighbors, second-nearest neighbors, etc. The neighborhood of a neuron is usually arranged in squares, which means that each neuron has four nearest neighbors. Kohonen proposed various alternatives for automatic classification and presented the Kohonen topological map; these models belong to the category of artificial neural networks that learn in an unsupervised way, without human intervention: little information is necessary to respect the characteristics of the input data, and the model simply computes a Euclidean distance between its input and the weights.

Fig. 1: Kohonen topological map.

To look at the training process more formally, let us consider the input data as consisting of n-dimensional vectors $X = \{x_1, x_2, \ldots, x_t\}$. Meanwhile, each of the $N$ neurons has an associated reference vector $w_i = (w_{i1}, w_{i2}, \ldots, w_{in})$. During training, one $x$ at a time is compared with all the $w_i$ to find the reference vector $w_k$ that satisfies a minimum-distance or maximum-similarity criterion. Though a number of similarity measures are possible, the Euclidean distance is by far the most common:

$$k = \arg\min_{1 \le i \le N} \|x - w_i\|$$

The best-matching unit (BMU) and the neurons within its neighborhood are then activated and modified:

$$w_i(t+1) = w_i(t) + P_{k,i}(t)\,(x - w_i(t))$$

One of the main parameters influencing the training process is the neighborhood function $P_{k,i}(t)$, which defines a distance-weighted model for adjusting the neuron vectors. It is defined by the following relation:

$$P_{k,i}(t) = \exp\!\left(-\frac{d_{k,i}^2}{2\,\sigma^2(t)}\right)$$

One can see that the neighborhood function depends both on the distance $d_{k,i}$ between the BMU and the respective neuron, and on the time step $t$ reached in the overall training process. In the Gaussian model, the size of the neighborhood appears as the kernel width $\sigma$ and is not a fixed parameter. The neighborhood radius is used to set the kernel width with which training starts: one typically starts with a neighborhood spanning most of the SOM, in order to achieve a rough global ordering, and the kernel width then decreases during the later training cycles.
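A minimal NumPy sketch of this training loop is given below. The map size, the learning-rate factor lr (which the formulas above omit) and the linear shrinking of the kernel width are assumptions of the illustration; for example, train_som(mfcc.T, rows=6, cols=8) would produce a 48-word codebook, the size of the largest dictionary used later in this paper.

```python
# Sketch of Kohonen training: BMU search, Gaussian neighborhood, weight update.
import numpy as np

def train_som(X, rows=6, cols=6, epochs=20, lr=0.5, sigma0=3.0):
    """Train a 2-D map on X (samples x n); returns the rows*cols code words."""
    rng = np.random.default_rng(0)
    W = rng.random((rows * cols, X.shape[1]))              # reference vectors w_i
    grid = np.array([(r, c) for r in range(rows) for c in range(cols)], dtype=float)
    total, step = epochs * len(X), 0
    for _ in range(epochs):
        for x in X:
            k = np.argmin(np.linalg.norm(W - x, axis=1))   # BMU: arg min_i ||x - w_i||
            d2 = ((grid - grid[k]) ** 2).sum(axis=1)       # squared map distances d_{k,i}^2
            sigma = sigma0 * (1.0 - step / total) + 1e-3   # shrinking kernel width sigma(t)
            P = np.exp(-d2 / (2.0 * sigma ** 2))           # neighborhood function P_{k,i}(t)
            W += lr * P[:, None] * (x - W)                 # w_i(t+1) = w_i(t) + lr*P_{k,i}(t)(x - w_i(t))
            step += 1
    return W
```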

IV. HIDDEN MARKOV MODELS (HMM)

This section is dedicated to the particular problem of modeling speech units. The Hidden Markov Model is a finite set of states, each of which is associated with a probability distribution; transitions among the states are governed by a set of probabilities called transition probabilities. An HMM is a stochastic automaton with a stochastic output process attached to each state. Thus we have two concurrent stochastic processes: an underlying (hidden) Markov process modeling the temporal structure of speech, and a set of state output processes modeling the locally stationary character of the speech signal. For more detailed information about HMMs used in speech recognition, several texts can be recommended, such as [6].

More precisely, an HMM is defined by

$$\lambda = (Q, O, P, B, \pi)$$

where:
- $Q$: the finite set of states of the model, of cardinal $N$;
- $O$: the finite set of observations of the model, of cardinal $M$;
- $P = (p_{ij})_{1 \le i,j \le N}$: the matrix of transition probabilities, $p_{ij}$ being the probability of taking the transition from state $q_i$ to state $q_j$;
- $B = (b_j(o_t))$: the matrix of the probabilities of emitting observation $o_t$ in state $q_j$;
- $\pi = (\pi_1, \ldots, \pi_N)$: the initial distribution of states, where $\pi_j = P(q_0 = j)$ for all $j \in [1, N]$, $q_0$ representing the initial state of the model.

HMMs represent a learning paradigm, in the sense that examples of the event to be modeled can be collected and used in conjunction with a training algorithm in order to learn proper estimates of $P$, $B$ and $\pi$. The most popular such algorithms are the forward-backward (or Baum-Welch) [7] and Viterbi [7] algorithms. Whenever continuous emission probabilities are considered, both of them are based on the general maximum-likelihood (ML) criterion, i.e. they aim at maximizing the probability of the samples given the model at hand. In particular, the Viterbi algorithm concentrates only on the most likely path through all the possible state sequences of the model.

The aim of the model is to find the probability of observing a sequence $O$, and the model has to provide a maximal probability. That is why we use the HMM with its three fundamental problems: the evaluation problem, the decoding problem and the learning problem.

The basic unit for describing how speech conveys linguistic meaning is the phoneme. The standard Arabic language has basically 34 phonemes: 28 consonants, 3 short vowels and 3 long vowels. Several factors affect the pronunciation of a phoneme within a syllable, such as its position as initial, closing, intervocalic or suffix. There are 19 ways of pronouncing an Arabic letter: a consonant can be pronounced with 6 vowels and with signs such as tanween, tachdid (gemination) and sukune. Learning to read Arabic text aloud is a difficult task, because the correct pronunciation of a letter depends on the context in which the letter appears. The presence of the diacritic marks in Arabic text is essential for the implementation of an automatic text-to-speech system; unfortunately, most modern written Arabic, as in books and newspapers, is at best partially vowelized. A verbal sentence in canonical order is the more straightforward and the less complex to process using optimized neural networks.

Fig. 2: Automatic speech recognition. The pipeline is: Speech, Feature Analysis, sequence of acoustic features, Vector Quantization, sequence of quantification features, Pattern Classification, sequence of models, Language Processing, text of the sentence.
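The evaluation problem, which the recognizer needs in order to score an observation sequence against each digit model, can be sketched with the forward algorithm in log space. This is an illustration under the discrete-observation setting used in this paper, not the authors' implementation; models is a hypothetical dictionary holding one trained HMM (pi, P and B in log form) per digit.

```python
# Forward algorithm in log space for a discrete HMM (illustrative sketch).
import numpy as np
from scipy.special import logsumexp

def log_forward(obs, log_pi, log_P, log_B):
    """Return log p(o_1..o_T | lambda) for an observation (label) sequence."""
    alpha = log_pi + log_B[:, obs[0]]                              # initialization
    for o in obs[1:]:
        # alpha_j(t) = b_j(o_t) * sum_i alpha_i(t-1) * p_ij, in log space
        alpha = log_B[:, o] + logsumexp(alpha[:, None] + log_P, axis=0)
    return logsumexp(alpha)                                        # termination

def classify(obs, models):
    """Pick the digit whose HMM assigns the sequence the highest likelihood."""
    return max(models, key=lambda digit: log_forward(obs, *models[digit]))
```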

V. PROPOSED APPROACH FOR OPTIMAL CODEBOOK

A vector quantization is performed at the output of the feature extraction, using an optimal neural network that maps the acoustic vector sequence into a sequence of reference labels. The performance of vector quantization techniques depends largely on the quality of the codebook used to code the data: if the codebook contains vectors that represent the sequences of acoustic features well, the compression quality will be good; conversely, code words that are not similar to the blocks to be coded will not produce a good-quality coded signal.

Many algorithms exist to generate the codebook [2]. Here the Kohonen map is used to create the codebook: the training sequence contains the blocks of training data, and after training the weight vectors represent the final code words. In general, this method uses a learning database, which is a set of sequences of acoustic features extracted from the available speech signal. The learning base therefore contains sequences of acoustic features that are representative of the speech to be coded. The objective is to create an optimal number N of vectors (code words) that best represent the blocks of this learning base.

After the learning phase by the Kohonen algorithm, the map in general presents two types of classes: a first type that does not represent any observation (empty class), and a second type that represents the information. The first type of neurons is a useless part of the Kohonen map; these neurons will be called "useless neurons".

The main idea proposed in this work is to build a dictionary whose size is adequate to the number of phonemes. In this context, we generate three dictionaries using three neural networks: the first with 34 neurons, the second with 36, and the third with 48 neurons, each neuron representing a phoneme.
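The quantization step that links the optimal map to the HMMs can be sketched as a nearest-code-word lookup: each acoustic vector is replaced by the index of its closest code word. W and mfcc refer to the earlier sketches and are assumptions of this illustration.

```python
# Turn the MFCC sequence into the discrete label sequence consumed by the HMMs.
import numpy as np

def quantize(frames, W):
    """frames: (T, n) acoustic vectors; W: (N, n) codebook; returns (T,) labels."""
    dists = np.linalg.norm(frames[:, None, :] - W[None, :, :], axis=2)
    return dists.argmin(axis=1)                 # index of the nearest code word

# labels = quantize(mfcc.T, W)                  # mfcc is (n_coeffs, T) in the earlier sketch
```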

VI. EXPERIMENTAL RESULTS

A. Dataset Description

The experiments were performed using the Arabic digit corpus collected by the Laboratory of Automatics and Signals, University of Badji Mokhtar, Annaba, Algeria. A total of 88 native Arabic speakers (44 male and 44 female) were asked to utter all ten digits ten times [8]. The database therefore consists of 8800 tokens (10 digits x 10 repetitions x 88 speakers). In this experiment, the data set is divided into two parts: a training set with 75% of the samples and a test set with 25% of the samples. In this research, the speaker-independent mode is considered.

TABLE I. ARABIC DIGITS

Arabic | English | Symbol
صفر | ZERO | '0'
واحد | ONE | '1'
اثنان | TWO | '2'
ثلاثة | THREE | '3'
أربعة | FOUR | '4'
خمسة | FIVE | '5'
ستة | SIX | '6'
سبعة | SEVEN | '7'
ثمانية | EIGHT | '8'
تسعة | NINE | '9'

Table I shows the Arabic digits: the first column presents the digits in Arabic, the second column presents the digits in English, and the last column shows the symbol of each digit.

B. Numerical Results

We generate a codebook of 34, 36 and 48 vectors for each digit; these optimum sizes are obtained using the proposed approach. The results of this first speech recognition experiment are satisfactory. Table II shows the numerical results of classification obtained by the proposed ANN/HMM approach.

TABLE II. NUMERICAL RESULTS OF CLASSIFICATION

Digit | T=34 | T=36 | T=48
0 | 0.92 | 0.90 | 0.85
1 | 0.93 | 0.90 | 0.89
2 | 0.90 | 0.86 | 0.90
3 | 0.78 | 0.84 | 0.82
4 | 0.85 | 0.79 | 0.84
5 | 0.82 | 0.89 | 0.88
6 | 0.83 | 0.87 | 0.81
7 | 0.72 | 0.76 | 0.80
8 | 0.82 | 0.84 | 0.87
9 | 0.87 | 0.80 | 0.91
mean | 0.84 | 0.85 | 0.86

We remark that the result of the classification depends on the size of the dictionary: with a neural network of size 34 the recognition rate is 84%, with 36 neurons it is 85%, and with 48 neurons it is 86%.

Fig. 3: Rate of recognition (%) for the three dictionary sizes T=34, T=36 and T=48.

From a numerical point of view, Fig. 3 shows the importance of choosing the dictionary using the neural network. The latter plays an important role, and this choice is based on the number of phonemes.
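As a quick check of the table, the "mean" row is simply the column average of the per-digit rates; the snippet below recomputes the reported 84%, 85% and 86%.

```python
# Recompute the "mean" row of Table II from the per-digit recognition rates.
import numpy as np

rates = np.array([
    [0.92, 0.90, 0.85], [0.93, 0.90, 0.89], [0.90, 0.86, 0.90],
    [0.78, 0.84, 0.82], [0.85, 0.79, 0.84], [0.82, 0.89, 0.88],
    [0.83, 0.87, 0.81], [0.72, 0.76, 0.80], [0.82, 0.84, 0.87],
    [0.87, 0.80, 0.91],
])                              # rows: digits 0-9; columns: T=34, T=36, T=48
print(rates.mean(axis=0))       # ~[0.844 0.845 0.857], i.e. 84%, 85%, 86%
```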

VII. CONCLUSION

In this paper, we have presented an approach that determines the optimal codebook by Self-Organizing Maps (SOM) for speech recognition using a hybrid ANN/HMM model. The main innovation is to use the optimal map of the Kohonen network to generate the optimal codebook. The optimal codebook is then used for the classification of Arabic digits with the Hidden Markov Model. The robustness of the proposed method for speech recognition is provided by the optimization of the Kohonen architecture, which determines the optimal codebook.

REFERENCES

[1] H. Bourlard, N. Morgan, "Hybrid HMM/ANN Systems for Speech Recognition: Overview and New Research Directions", in Adaptive Processing of Sequences and Data Structures, C.L. Giles and M. Gori (Eds.), Lecture Notes in Artificial Intelligence (1387), Springer Verlag, pp. 389-417, 1998.
[2] M. Ettaouil, M. Lazaar, "Improved Self-Organizing Maps and Speech Compression", International Journal of Computer Science Issues (IJCSI), Volume 9, Issue 2, No 1, pp. 197-205, 2012.
[3] N. Hammami, M. Bedda, "Improved Tree Model for Arabic Speech Recognition", Proc. IEEE ICCSIT'10, 2010.
[4] R. P. Lippmann, "Review of neural networks for speech recognition", Neural Computation, vol. 1, 1989, pp. 1-38.
[5] L. Muda, M. Begam, and I. Elamvazuthi, "Voice Recognition Algorithms using Mel Frequency Cepstral Coefficient (MFCC) and Dynamic Time Warping (DTW) Techniques", Journal of Computing, Volume 2, Issue 3, March 2010.
[6] L. Rabiner, B.H. Juang, "Fundamentals of Speech Recognition", Prentice Hall PTR, 1993.
[7] L.R. Rabiner, "A tutorial on hidden Markov models and selected applications in speech recognition", Proc. IEEE 77 (2) (1989) 257-286.
[8] http://archive.ics.uci.edu/ml/datasets/Spoken+Arabic+Digit
