A Hybrid ANN/HMM Model for Arabic Speech Recognition Using an Optimal Codebook

Mohamed ETTAOUIL, Mohamed LAZAAR, Zakariae EN-NAIMANI
Modelling and Scientific Computing Laboratory, Faculty of Science and Technology, University Sidi Mohammed Ben Abdellah, Fez, MOROCCO
[email protected], [email protected], [email protected]
Abstract-Thanks to Automatic Speech Recognition (ASR), many machines can nowadays emulate the human ability to understand and speak natural language. However, the ASR problem is as interesting as it is difficult. Its difficulty is precisely due to the complexity of speech processing, which must take many aspects into consideration: acoustic, phonetic, syntactic, etc. Thus, the technology most commonly used in speech recognition is based on statistical models, especially the Hidden Markov Models, which are capable of simultaneously modeling the frequency and temporal characteristics of the speech signal. There is also the alternative of using Neural Networks for modeling. But another interesting framework applied in ASR is the hybrid Artificial Neural Network (ANN) and Hidden Markov Model (HMM) speech recognizer, which improves on the accuracy of the two models taken separately. In the present work, we propose an Arabic digit recognition system based on a hybrid Artificial Neural Network and Hidden Markov Model (ANN/HMM). The main innovation in this work is to use an optimal neural network to determine the optimal class, unlike in the classical Kohonen approach. The numerical results are encouraging and satisfactory.

Keywords-Automatic Speech Recognition, Hidden Markov Models, Artificial Neural Network, Vector Quantization.

I. INTRODUCTION

Speech recognition is a topic that is very useful in many applications and environments of our daily life. This field is one of the most challenging that scientists have faced for a long time. Indeed, because speech is the most essential medium of human communication, man has always tried to replicate this ability on machines. However, the processing of the speech signal has to consider many aspects, which complicate the task: the signal information can be analyzed from different sides: acoustic, phonetic, phonologic, morphologic, syntactic, semantic and pragmatic [3].

The Hidden Markov Model has become one of the most powerful statistical methods for modeling speech signals. Its principles have been successfully used in automatic speech recognition, formant and pitch tracking, speech enhancement, speech synthesis, statistical language modeling, part-of-speech tagging, spoken language understanding, and machine translation. The Hidden Markov Model makes it possible to model the temporal variation of speech, with local spectral variability modeled using flexible distributions such as mixtures of Gaussian densities. New techniques have emerged to increase the performance of speech recognition systems. These are based on the hybridization of Artificial Neural Networks and Hidden Markov Models (ANN/HMM). ANN/HMM approaches represent a competitive alternative to standard systems [5].

We describe in this paper an original and successful hybrid unsupervised ANN and Hidden Markov Model speech recognition system. Although conceptually weaker than continuous supervised systems, the hybrid unsupervised speech recognition approach presented here offers performance and learning speedup advantages when compared to other (more classic) discrete systems. The unsupervised ANNs used here are the so-called Self-Organizing Maps (SOM) of Kohonen. Since the training stage is very important for SOM performance, the selection of the architecture of the Kohonen network associated with a given problem is one of the most important research problems in the neural network field [2]. More precisely, the choice of the number of neurons and of the initial weights has a great impact on the convergence of learning methods. The optimization of artificial neural network architectures, particularly Kohonen networks, is a recent problem. In this work, we discuss a new method of Automatic Speech Recognition using a new hybrid ANN/HMM model; the latter contains an optimal architecture for the Kohonen networks, which permits determining the optimal codebook and reducing the coding time.

This paper is organized as follows. Section 2 presents feature extraction from the speech signal. In Section 3, vector quantization by Kohonen networks is described. The Hidden Markov Model is presented in Section 4. In Section 5, the proposed approach to determine the optimal codebook is presented. Experimental results are given in Section 6. The last section concludes our work.
II. FEATURE EXTRACTION FROM SPEECH (MFCC)

Voice signal identification consists of the process of converting a speech waveform into features that are useful for further processing. Many algorithms and techniques are in use; their value depends on the ability of the features to capture time, frequency and energy information in a set of coefficients for cepstral analysis [1].

Despite many differences between individuals, and the existence of many languages, speech follows general patterns, and on average has well-defined characteristics such as volume, frequency distribution, pitch rate and syllabic rate [3]. These characteristics have adapted with regard to the environment and to the limitations of hearing and voice production: speech characteristics fit the speech-generating abilities of the body, but the rapid changes in society over the past century have exceeded our ability to adapt. The human voice is converted into digital form to produce digital data representing the level of the signal at every discrete time step.

A wide range of possibilities exists for parametrically representing the speech signal for the speech processing task, such as Linear Prediction Coding (LPC), Mel Frequency Cepstral Coefficients (MFCCs), and others. Generally speaking, a conventional speech processing system can be organized in two blocks: the feature extraction and the modeling stage. MFCCs are the most commonly used acoustic features in speech processing, and this feature has been used in this paper. MFCCs are based on the known variation of the human ear's critical bandwidths with frequency [5].
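As an illustration of this feature extraction stage, the short sketch below computes MFCC frames from a speech file. It is a minimal example, not the exact configuration used in the paper: the file name, sampling rate and the choice of 13 coefficients with a 25 ms window are assumptions, and it relies on the third-party librosa library.

    # Minimal MFCC extraction sketch (assumed parameters, using librosa).
    import librosa

    # Load a speech file; "digit.wav" is a hypothetical example file.
    signal, sr = librosa.load("digit.wav", sr=16000)

    # Compute 13 MFCCs per frame with a 25 ms window and a 10 ms hop,
    # a common (assumed) configuration for speech recognition front ends.
    mfcc = librosa.feature.mfcc(
        y=signal, sr=sr, n_mfcc=13,
        n_fft=int(0.025 * sr), hop_length=int(0.010 * sr),
    )

    # mfcc has shape (13, n_frames): one 13-dimensional acoustic vector
    # per frame, ready for vector quantization (Section III).
    print(mfcc.shape)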
III. VECTOR QUANTIZATION

Vector quantization is considered a data compression technique in the field of speech coding. Quantization using an artificial neural network, in particular the Self-Organizing Map, is well suited to applications in which the statistical distribution of the original data changes over time, since it supports adaptive learning from the data [13]. The neural network also has a highly parallel structure and the possibility of high-speed processing, and it has increasingly been applied to reduce problem complexity in tasks like pattern recognition.

The Self-Organizing Map (Kohonen neural network) is the closest of all artificial neural network architectures and learning schemes to the biological neural network. The aim of Kohonen learning is to map similar signals to similar neuron positions. Kohonen has proposed various alternatives for automatic classification and presented the Kohonen topological map; these models belong to the category of artificial neural networks that learn unsupervised, without human intervention: little information is necessary to respect the characteristics of the input data. The model simply computes a Euclidean distance between its input and its weights.

The Kohonen network has a single layer of neurons arranged in a two-dimensional plane, which we call the output layer (see Fig. 1). The additional input layer just distributes the inputs to the output layer; the number of neurons in the input layer is equal to the dimension of the input vector. A defined topology means that each neuron has a defined number of neurons as nearest neighbors, second-nearest neighbors, etc. The neighborhood of a neuron is usually arranged in squares, which means that each neuron has four nearest neighbors.

Fig. 1: Kohonen topological map.

To look at the training process more formally, let us consider the input data as consisting of n-dimensional vectors X = {x_1, x_2, ..., x_T}. Meanwhile, each of the N neurons has an associated reference vector w_i = (w_i1, w_i2, ..., w_in). During training, one x at a time is compared with all the w_i to find the reference vector w_k that satisfies a minimum-distance criterion. Though a number of similarity measures are possible, the Euclidean distance is by far the most common:

    k = arg min_{1 <= i <= N} ||x - w_i||

The best-matching unit (BMU) and the neurons within its neighborhood are then activated and modified:

    w_i(t + 1) = w_i(t) + P_{k,i}(t) (x - w_i(t))

One of the main parameters influencing the training process is the neighborhood function P_{k,i}(t), which defines a distance-weighted model for adjusting the neuron vectors. It is defined by the following relation:

    P_{k,i}(t) = exp(-d_{k,i}^2 / (2 sigma^2(t)))

One can see that the neighborhood function depends both on the distance between the BMU and the respective neuron (d_{k,i}) and on the time step reached in the overall training process (t). In the Gaussian model, the neighborhood size appears as the kernel width (sigma) and is not a fixed parameter. The neighborhood radius is used to set the kernel width with which training starts: one typically starts with a neighborhood spanning most of the SOM, in order to achieve a rough global ordering, and the kernel width then decreases during later training cycles.
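The following sketch implements the training rules above for a small SOM. It is a minimal NumPy illustration under assumed settings (map size, number of epochs, decay schedule), not the optimized architecture proposed in this paper; an explicit learning-rate factor multiplying the neighborhood term is a common practical addition and is included here.

    import numpy as np

    def train_som(data, rows=6, cols=6, n_epochs=20, lr0=0.5):
        """Train a 2-D SOM on data of shape (n_samples, dim).
        Map size, epochs and learning rate are illustrative assumptions."""
        rng = np.random.default_rng(0)
        weights = rng.normal(size=(rows * cols, data.shape[1]))
        # Grid coordinates of each neuron, for the neighborhood distance d_{k,i}.
        grid = np.array([(r, c) for r in range(rows) for c in range(cols)], float)
        sigma0 = max(rows, cols) / 2.0      # initial kernel width spans the map
        n_steps = n_epochs * len(data)
        t = 0
        for _ in range(n_epochs):
            for x in rng.permutation(data):
                # Best-matching unit: k = arg min_i ||x - w_i||
                k = np.argmin(np.linalg.norm(weights - x, axis=1))
                # Kernel width and learning rate decay over training cycles.
                sigma = sigma0 * np.exp(-t / n_steps)
                lr = lr0 * np.exp(-t / n_steps)
                # Gaussian neighborhood: P_{k,i}(t) = exp(-d_{k,i}^2 / (2 sigma^2))
                d2 = np.sum((grid - grid[k]) ** 2, axis=1)
                p = np.exp(-d2 / (2.0 * sigma ** 2))
                # Update rule: w_i(t+1) = w_i(t) + lr * P_{k,i}(t) * (x - w_i(t))
                weights += lr * p[:, None] * (x - weights)
                t += 1
        return weights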
IV. HIDDEN MARKOV MODELS (HMM)

This section is dedicated to the particular problem of modeling speech units. A Hidden Markov Model is a finite set of states, each of which is associated with a probability distribution; transitions among the states are governed by a set of probabilities called transition probabilities.

An HMM is a stochastic automaton with a stochastic output process attached to each state. Thus we have two concurrent stochastic processes: an underlying (hidden) Markov process modeling the temporal structure of speech, and a set of state output processes modeling the locally stationary character of the speech signal. For more detailed information about HMMs used in speech recognition, several texts can be recommended, such as [6].

More precisely, an HMM is defined by

    lambda = (Q, O, P, B, pi)

where:
- Q: the finite set of states of the model, of cardinal N;
- O: the finite set of observations of the model, of cardinal M;
- P = (p_ij), 1 <= i, j <= N: the matrix of transition probabilities, p_ij being the probability of choosing the transition from state q_i to state q_j;
- B = (b_j(o_t)), 1 <= j <= N, 1 <= t <= T: the matrix of probabilities of emitting observation o_t in state q_j;
- pi = (pi_1, pi_2, ..., pi_N): the initial distribution of states, where pi_j = P(q_0 = j) for all j in [1, N] and q_0 represents the initial state of the model.

HMMs represent a learning paradigm, in the sense that examples of the event to be modeled can be collected and used in conjunction with a training algorithm in order to learn proper estimates of P, B and pi. The most popular such algorithms are the forward-backward (or Baum-Welch) [7] and the Viterbi [7] algorithms. Whenever continuous emission probabilities are considered, both of them are based on the general maximum-likelihood (ML) criterion, i.e. they aim at maximizing the probability of the samples given the model at hand. In particular, the Viterbi algorithm concentrates only on the most likely path through all the possible sequences of states in the model.

The aim of this model is to find the probability of observing a sequence O, so the model has to provide a maximal probability. That is the reason we use the HMM with its three fundamental problems: the evaluation problem, the decoding problem and the learning problem.

Fig. 2: Automatic speech recognition (speech -> feature analysis -> sequence of acoustic features -> vector quantization -> sequence of quantification features -> pattern classification -> sequence of models -> language processing -> text of the sentence).
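To make the decoding problem concrete, here is a minimal Viterbi sketch for a discrete-observation HMM as defined above. The two-state, three-symbol model at the bottom and all its probability values are assumptions for illustration only.

    import numpy as np

    def viterbi(pi, P, B, obs):
        """Most likely state path for a discrete HMM.
        pi: (N,) initial distribution; P: (N, N) transitions;
        B: (N, M) emission probabilities; obs: list of observation indices."""
        N, T = len(pi), len(obs)
        logd = np.log(pi) + np.log(B[:, obs[0]])          # delta_0(j)
        back = np.zeros((T, N), dtype=int)
        for t in range(1, T):
            # delta_t(j) = max_i [delta_{t-1}(i) + log p_ij] + log b_j(o_t)
            scores = logd[:, None] + np.log(P)
            back[t] = np.argmax(scores, axis=0)
            logd = scores[back[t], np.arange(N)] + np.log(B[:, obs[t]])
        # Backtrack from the best final state.
        path = [int(np.argmax(logd))]
        for t in range(T - 1, 0, -1):
            path.append(int(back[t][path[-1]]))
        return path[::-1], float(np.max(logd))

    # Toy 2-state, 3-symbol example (all values assumed).
    pi = np.array([0.6, 0.4])
    P = np.array([[0.7, 0.3], [0.4, 0.6]])
    B = np.array([[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]])
    print(viterbi(pi, P, B, [0, 1, 2]))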
V. PROPOSED APPROACH FOR OPTIMAL CODEBOOK

Learning to read real Arabic text aloud is a difficult task, because the correct pronunciation of a letter depends on the context in which the letter appears. The presence of diacritic marks in Arabic text is essential for the implementation of an automatic text-to-speech system. Unfortunately, most modern written Arabic, as in books and newspapers, is at best partially vowelized. A verbal sentence in straightforward canonical order presents the least complexity when using optimized neural networks.

The basic unit for describing how speech conveys linguistic meaning is the phoneme. Standard Arabic has basically 34 phonemes: 28 consonants, 3 short vowels and 3 long vowels. Several factors affect the pronunciation of a phoneme, such as its position in the syllable: initial, closing, intervocalic or suffix. There are 19 ways of pronouncing an Arabic letter: a consonant can be pronounced with 6 vowels and with signs such as tanween, tachdid (gemination) and sukune.

The performance of Vector Quantization techniques depends largely on the quality of the codebook used to code the data. If the codebook contains vectors that represent the sequences of acoustic features well, the compression quality will be good. Conversely, code words that are not similar to the blocks to be coded will not produce good-quality quantized features.

A vector quantization is therefore performed at the output of the feature extraction using an optimal neural network, mapping the acoustic vector sequence into a reference label vector sequence.

Many algorithms exist to generate the codebook [2]. Here, the Kohonen map is used to create the codebook: the training sequence contains the blocks of training data, and after training the weight vectors represent the final code words. In general, this method uses a learning database, which is a set of sequences of acoustic features extracted from the available speech signal; the learning base therefore contains sequences of acoustic features that are representative of the speech to be coded. The objective is to create an optimal number N of vectors (code words) that best represent the blocks of this learning base.

After the learning phase by the Kohonen algorithm, the resulting map presents in general two types of classes: classes that do not represent any observation (empty classes) and classes that represent the information. Neurons of the first type form a useless part of the Kohonen map; these neurons will be called "useless neurons".

The main idea proposed in this work is to build a dictionary whose size is adequate to the number of phonemes. In this context we generate three dictionaries using three neural networks: the first with 34 neurons, the second with 36 and the third with 48 neurons, each neuron representing a phoneme. (A sketch of this codebook construction follows below.)
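As a hedged sketch of this codebook construction, the fragment below removes the "useless neurons" (code words to which no training vector is mapped) from a trained SOM and encodes an acoustic vector sequence as a label sequence. The data shapes are assumptions, and the random arrays in the usage example are stand-ins for SOM weights trained as in Section III and for real MFCC frames.

    import numpy as np

    def nearest(frames, codebook):
        """Index of the nearest code word for every acoustic vector."""
        d = np.linalg.norm(frames[:, None, :] - codebook[None, :, :], axis=2)
        return np.argmin(d, axis=1)

    def prune_codebook(codebook, train_frames):
        """Drop 'useless neurons': code words that represent no observation."""
        hits = np.bincount(nearest(train_frames, codebook),
                           minlength=len(codebook))
        return codebook[hits > 0]

    # Usage with stand-in data (random arrays instead of real weights/frames).
    rng = np.random.default_rng(0)
    som_weights = rng.normal(size=(48, 13))    # e.g. trained as in Section III
    train_frames = rng.normal(size=(500, 13))  # MFCC frames from Section II
    codebook = prune_codebook(som_weights, train_frames)
    labels = nearest(train_frames, codebook)   # discrete observations for HMMs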
VI. EXPERIMENTAL RESULTS

A. Dataset Description

The experiments were performed using the Arabic digit corpus collected by the Laboratory of Automatic and Signals, University of Badji-Mokhtar, Annaba, Algeria. A number of 88 individual Arabic native speakers (44 males and 44 females) were asked to utter all digits ten times [8]. Accordingly, the database consists of 8800 tokens (10 digits x 10 repetitions x 88 speakers). In this experiment, the data set is divided into two parts: a training set with 75% of the samples and a test set with 25% of the samples. In this research, speaker-independent mode is considered.

Table I shows the Arabic digits: the first column presents the digit in Arabic, the second column presents the digit in English, and the last column shows the symbol of each digit.

TABLE I. ARABIC DIGITS

Arabic  | English | Symbol
صفر     | ZERO    | '0'
واحد    | ONE     | '1'
اثنان   | TWO     | '2'
ثلاثة   | THREE   | '3'
أربعة   | FOUR    | '4'
خمسة    | FIVE    | '5'
ستة     | SIX     | '6'
سبعة    | SEVEN   | '7'
ثمانية  | EIGHT   | '8'
تسعة    | NINE    | '9'

B. Numerical Results

We generate a codebook of 34, 36 and 48 vectors for each digit; this optimum size is obtained using the proposed approach. The first experimental results of speech recognition are satisfactory. Table II shows the numerical results of classification obtained by the proposed ANN/HMM approach.

TABLE II. NUMERICAL RESULTS OF CLASSIFICATION (recognition rate per digit)

Digit | T=34 | T=36 | T=48
0     | 0.92 | 0.90 | 0.85
1     | 0.93 | 0.90 | 0.89
2     | 0.90 | 0.86 | 0.90
3     | 0.78 | 0.84 | 0.82
4     | 0.85 | 0.79 | 0.84
5     | 0.82 | 0.89 | 0.88
6     | 0.83 | 0.87 | 0.81
7     | 0.72 | 0.76 | 0.80
8     | 0.82 | 0.84 | 0.87
9     | 0.87 | 0.80 | 0.91
Mean  | 0.84 | 0.85 | 0.86

We remark that the result of the classification depends on the size of the dictionary: with a neural network of size 34 the recognition rate is 84%, with 36 neurons it is 85%, and with 48 neurons it is 86%.

Fig. 3: Rate of recognition (%) for T=34, T=36 and T=48.

From a numerical point of view, Fig. 3 shows the importance of choosing the dictionary using the neural network: the latter plays an important role, and this choice is based on the number of phonemes.
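To tie the pipeline of Fig. 2 together, the hedged sketch below classifies one utterance by scoring its code word sequence against one discrete HMM per digit and picking the highest likelihood. Only the maximum-likelihood decision rule is taken from the text; the per-digit model parameters shown are assumed toy values, which would normally be estimated with Baum-Welch as described in Section IV.

    import numpy as np

    def log_likelihood(pi, P, B, obs):
        """Evaluation problem: forward algorithm in the log domain."""
        alpha = np.log(pi) + np.log(B[:, obs[0]])
        for o in obs[1:]:
            # log-sum-exp over predecessor states, then emit observation o.
            m = alpha.max()
            alpha = m + np.log(np.exp(alpha - m) @ P) + np.log(B[:, o])
        m = alpha.max()
        return m + np.log(np.exp(alpha - m).sum())

    def classify(obs, models):
        """models: dict digit -> (pi, P, B). Returns the most likely digit."""
        return max(models, key=lambda d: log_likelihood(*models[d], obs))

    # Toy usage: two hypothetical digit models (all values assumed).
    pi = np.array([0.5, 0.5])
    P  = np.array([[0.8, 0.2], [0.3, 0.7]])
    B0 = np.array([[0.6, 0.3, 0.1], [0.1, 0.2, 0.7]])
    B1 = np.array([[0.2, 0.5, 0.3], [0.4, 0.4, 0.2]])
    models = {0: (pi, P, B0), 1: (pi, P, B1)}
    print(classify([0, 0, 2], models))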
VII. CONCLUSION

In this paper, we have presented an approach to determine the optimal codebook by Self-Organizing Maps (SOM) for speech recognition using a hybrid ANN/HMM model. The main innovation is to use the optimal Kohonen map to generate the optimal codebook, which is then used for the classification of Arabic digits by Hidden Markov Models. The robustness of the proposed method for speech recognition is provided by the optimization of the Kohonen architecture, which determines the optimal codebook.

REFERENCES

[1] H. Bourlard, N. Morgan, "Hybrid HMM/ANN Systems for Speech Recognition: Overview and New Research Directions", in Adaptive Processing of Sequences and Data Structures, C. L. Giles and M. Gori (Eds.), Lecture Notes in Artificial Intelligence (1387), Springer Verlag, pp. 389-417, 1998.
[2] M. Ettaouil, M. Lazaar, "Improved Self-Organizing Maps and Speech Compression", International Journal of Computer Science Issues (IJCSI), Volume 9, Issue 2, No 1, pp. 197-205, 2012.
[3] N. Hammami, M. Bedda, "Improved Tree Model for Arabic Speech Recognition", Proc. IEEE ICCSIT'10, 2010.
[4] R. P. Lippmann, "Review of Neural Networks for Speech Recognition", Neural Computation, vol. 1, 1989, pp. 1-38.
[5] L. Muda, M. Begam, I. Elamvazuthi, "Voice Recognition Algorithms Using Mel Frequency Cepstral Coefficient (MFCC) and Dynamic Time Warping (DTW) Techniques", Journal of Computing, Volume 2, Issue 3, March 2010.
[6] L. Rabiner, B. H. Juang, "Fundamentals of Speech Recognition", Prentice Hall PTR, 1993.
[7] L. R. Rabiner, "A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition", Proc. IEEE 77 (2), 1989, pp. 257-286.
[8] http://archive.ics.uci.edu/ml/datasets/Spoken+Arabic+Digit