2012 International Conference on Information Technology and e-Services

Voice Command System Based On Pipelining Classifiers GMM-HMM
Mohamed FEZARI1, Mohamed Seghir Boumaza2
1. Laboratory of Automatic and Signals, Annaba, BP. 12, Annaba, 23000, ALGERIA
2. Department of Electronics, University of 8 Mai 1945, Guelma
[email protected]

Abstract—Details of designing and developing a voice guiding system for a robot arm are presented. A feature-combination technique is investigated, and then a hybrid classification method is applied. Research and experimental results show that adding features increases the recognition rate in automatic speech recognition (ASR). Thus, combining classical ASR components such as zero crossing, energy, and Mel frequency cepstral coefficients with the wavelet transform (to extract meaningful formant parameters), followed by pipelined GMM and HMM classifiers applied in order, reduces the error rate considerably. To implement the approach in a real-time application, a PC interface was designed to control the movements of a four-degree-of-freedom robot arm by transmitting the orders via RF circuits. The voice command system for the robot was built, and tests showed an improvement from combining techniques.

Keywords—Speech recognition, wavelet transform, pipelining techniques, GMM/HMM, robot arm, wireless command.

I. INTRODUCTION
Robotics has achieved its greatest success to date in the world of industrial manufacturing. Robot arms, or manipulators, comprise a 2-billion-dollar industry. Bolted at its shoulder to a fixed position in the assembly line, a robot arm can move with great speed and accuracy to perform repetitive tasks such as spot welding and painting. In the electronics industry, manipulators place surface-mounted components with superhuman precision, making the portable telephone and laptop computer possible. Tele-operated robot arms are widely used in space robotics, bomb disposal, and remotely operated vehicles [1], [2]. Yet, for all of their successes, these commercial robots suffer from a fundamental disadvantage: the lack of human voice control. A human-robot voice interface is an interesting application for elderly and impaired users. A fixed manipulator offers only a limited range of command inputs, mainly a joystick, keyboard, or mouse. This paper proposes a new voice guiding system (VGS) for tele-operating a robot arm. It is based on the recognition of isolated words, using a set of traditional pattern recognition approaches and a discrimination approach that pipelines classical methods [2], [5], [7] in order to increase the rate of

978-1-4673-1166-3/12/$31.00 ©2012 IEEE

Ali Aldahoud Faculty of IT Al-Zaytoonah University, Amman, Jordan [email protected]

recognition. The increase in complexity compared with using only the traditional approach is negligible, but the system achieves considerable improvement in the matching phase, thus facilitating the final decision and reducing the number of decision errors made by the VGS. Speech recognition is also the focus of a large research effort in Artificial Intelligence (AI), which has led to many new theories and techniques. However, it is only recently that the fields of robot arm control and AGV (Automatic Guided Vehicle) navigation have started to import some of the techniques developed in AI for dealing with non-stationary information. The hybrid method is a simple, robust technique that groups the advantages of several basic techniques and therefore increases the recognition rate. The selected features are: zero crossing and extremes (CZEXM), the continuous wavelet transform (CWT), specifically Gabor wavelets, energy segments (ES), and Mel frequency cepstral coefficients (MFCC) [6]. The application uses a set of Arabic command words to control the directions of a four-DOF robot arm. It has to be implementable on programmable integrated circuits [8] and robust to any background noise confronting the system. As an application, a voice command for a four-DOF robot arm is chosen; the robot arm is the "TERGANE TR45". There have been many research projects dealing with robot control, among them some that build intelligent systems [10-12]. A voice command system needs the recognition of isolated words from a limited vocabulary used in guiding a Robot Arm Control System (RACS) [14].
The paper is organized as follows: Section 2 presents the selected feature computation; Section 3 details the speech recognition agent based on GMM and HMM; Section 4 presents the hardware part; Sections 5 and 6 present the results of simulation tests and a conclusion.

II. GABOR WAVELETS AND SPEECH ANALYSIS
In recent years, the wavelet transform has been successfully applied to image and signal processing. Unlike the Fourier transform, the wavelet transform can detect the characteristics of local signals. Therefore, the wavelet transform is particularly helpful for the analysis and processing

of speech signals, where local formant information is important. Research on the wavelet transform for speech recognition and signal processing has been active for a decade, and different schemes have been proposed. Early work includes applications of the one-dimensional (1-D) wavelet transform to the phase analysis of the grid method. Wavelets have become a popular tool for speech processing tasks such as speech analysis, pitch detection, and speech recognition [12]. Wavelets are successful front-end processors for speech recognition because they exploit the time resolution of the wavelet transform. One difficulty in using the wavelet transform for speech recognition is the increase in the amount of data presented to the recognition system: the number of coefficients generated by the DWT is much larger than for the short-time Fourier transform. Gabor wavelets, which are similar to Mexican hat wavelets, provide a more powerful tool for analyzing speech and music. We first give their definition and then illustrate their use with some examples. A Gabor wavelet with width parameter w and frequency parameter v is the following analyzing wavelet:

Ψ(x) = w^{-1/2} e^{-π(x/w)^2 + i·2πvx/w}        (2.1)

This wavelet is complex valued. Its real part Ψ_R(x) and imaginary part Ψ_I(x) are:

Ψ_R(x) = w^{-1/2} e^{-π(x/w)^2} cos(2πvx/w)        (2.2)

Ψ_I(x) = w^{-1/2} e^{-π(x/w)^2} sin(2πvx/w)        (2.3)

The width parameter w plays the same role as for the Mexican hat wavelet: it controls the width of the region over which most of the energy of Ψ(x) is concentrated. The value v/w is called the base frequency for a Gabor CWT. One advantage of Gabor wavelets when analyzing sound signals is that they contain cosine and sine factors, as shown in (2.2) and (2.3). These factors allow Gabor wavelets to produce easily interpretable scalograms of signals that are combinations of cosines and sines; the most common instances of such signals are recorded music and speech. More details are given in [6-8], but first we need to say a little more about the CWT defined by a Gabor analyzing wavelet. Because a Gabor wavelet is complex valued, it produces a complex-valued CWT. For many signals, it is often sufficient to examine the magnitudes of the Gabor CWT values and take them as features. The CWT is used here to select some principal features within the uttered signal that cannot be extracted by classical techniques; these components increase the true-positive count, but they add to the computation time.
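As an illustration (not the authors' implementation), the magnitudes of a Gabor CWT built from the wavelet of (2.1) can be computed in a few lines of NumPy; the width values, the base-frequency parameter v, and the two-frequency test signal below are arbitrary choices for the sketch:

```python
import numpy as np

def gabor_wavelet(x, w, v):
    """Complex Gabor analyzing wavelet of Eq. (2.1):
    psi(x) = w**-0.5 * exp(-pi*(x/w)**2 + i*2*pi*v*x/w)."""
    return w ** -0.5 * np.exp(-np.pi * (x / w) ** 2 + 2j * np.pi * v * x / w)

def gabor_scalogram(signal, fs, widths, v=8.0):
    """Magnitudes of the Gabor CWT: one row per width (scale)."""
    t = (np.arange(signal.size) - signal.size // 2) / fs
    rows = []
    for w in widths:
        psi = gabor_wavelet(t, w, v)
        # Correlate the signal with the conjugate wavelet at this scale.
        row = np.convolve(signal, np.conj(psi[::-1]), mode="same")
        rows.append(np.abs(row))
    return np.array(rows)

# Test signal with two main frequencies, as in Fig. 1.
fs = 8000
t = np.arange(0, 0.5, 1 / fs)
sig = np.sin(2 * np.pi * 200 * t) + 0.5 * np.sin(2 * np.pi * 800 * t)
S = gabor_scalogram(sig, fs, widths=[0.01, 0.02, 0.04])
print(S.shape)  # (3, 4000)
```

Each row of the magnitude matrix plays the role of one scale line of the scalogram in Fig. 1.b; taking magnitudes discards phase, as described above.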

Fig. 1. (a) Mexican hat CWT (scalogram) of a test signal with two main frequencies; (b) magnitudes of the Gabor scalogram of the test signal.

III. HYBRID TECHNIQUES
Among the different ASR paradigms proposed over the years, Hidden Markov Models with Gaussian mixtures (HMM/GMM) [19] are doubtless the most widely accepted approach. Alternatively, frameworks based on Artificial Neural Networks (ANN) have also been proposed [2], but despite their high discrimination ability in short-time classification tasks, they have proved inefficient when dealing with long-term speech segments. To solve the long-term modeling problem of the ANN framework, one of the most successful alternatives to HMM/GMM was later proposed, commonly known as the hybrid ANN/HMM or connectionist paradigm [20]. In general, hybrid architectures seek to integrate the ANN's ability to estimate Bayesian posterior probabilities into a classical HMM structure that models long-term speech evolution.

A. GMM Technique
The GMM can be viewed as a hybrid between parametric and non-parametric density models. Like a parametric model, it has structure and parameters that control the behavior of the density in known ways. Like a non-parametric model, it has many degrees of freedom that allow arbitrary density modeling. The GMM density is defined as a weighted sum of Gaussian densities:

P_{G,M}(x) = Σ_{m=1}^{M} W_m g(x, μ_m, C_m)        (2.4)

where M is the total number of Gaussian components and the W_m are the component probabilities (Σ_m W_m = 1), also called weights. We consider K-dimensional densities, so the argument is a vector x = (x_1, ..., x_K)^T. The component probability density function (pdf) g(x, μ_m, C_m) is the K-dimensional Gaussian pdf given in (2.5):

g(x, μ_m, C_m) = (2π)^{-K/2} |C_m|^{-1/2} e^{-(1/2)(x-μ_m)^T C_m^{-1} (x-μ_m)}        (2.5)

where μ_m is the mean vector and C_m is the covariance

matrix. Organizing the data for input to the GMM is important, since the components of the GMM play a vital role in building the word models. For this purpose, we use the K-means clustering technique to break the data into 256 cluster centroids. These centroids are then grouped into sets of 32 and passed into each component of the GMM; as a result, we obtain a GMM of 8 components. Once the component inputs are decided, the GMM modeling can be implemented as in Figure 2.a; more details are in [21].

B. HMM Basics
Over the past years, Hidden Markov Models have been widely applied to problems such as pattern recognition [6] and speech recognition [6], [9]. HMMs are suitable for classification from one- or two-dimensional signals and can be used when the information is incomplete or uncertain. To use an HMM, we need a training phase and a test phase. For the training stage, we usually work with the Baum-Welch algorithm to estimate the parameters λ = (π, A, B) of the HMM [7], [9]. This method is based on the maximum likelihood criterion. To compute the most probable state sequence, the Viterbi algorithm is the most suitable. An HMM is basically a stochastic finite-state automaton that generates an observation string, i.e., the sequence of observation vectors O = O_1, ..., O_t, ..., O_T. Thus, an HMM consists of N states S = {S_i} and of the observation string produced by emitting a vector O_t at each successive transition from a state S_i to a state S_j. O_t has dimension d and, in the discrete case, takes its values from a library of M symbols. The state transition probability distribution between states S_i and S_j is A = {a_ij}, the observation probability distribution of emitting a vector O_t at state S_j is B = {b_j(O_t)}, and the probability distribution of the initial state is π = {π_i}:

a_ij = P(q_{t+1} = S_j | q_t = S_i)        (3.1)

b_j(O_t) = P(O_t | q_t = S_j)        (3.2)

π_i = P(q_0 = S_i)        (3.3)
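Returning to the GMM of Section III.A, the density in (2.4)-(2.5) can be evaluated directly. Below is a minimal NumPy sketch (not the authors' implementation) with a toy two-component, two-dimensional model; all parameter values are illustrative, not trained word models:

```python
import numpy as np

def gaussian_pdf(x, mu, C):
    """K-dimensional Gaussian pdf g(x, mu, C), Eq. (2.5)."""
    K = mu.size
    d = x - mu
    norm = (2 * np.pi) ** (-K / 2) * np.linalg.det(C) ** -0.5
    return norm * np.exp(-0.5 * d @ np.linalg.inv(C) @ d)

def gmm_density(x, weights, means, covs):
    """Weighted sum of M Gaussian components, Eq. (2.4)."""
    return sum(w * gaussian_pdf(x, mu, C)
               for w, mu, C in zip(weights, means, covs))

# Toy 2-D model with two components of equal weight.
weights = [0.5, 0.5]
means = [np.zeros(2), np.array([3.0, 3.0])]
covs = [np.eye(2), np.eye(2)]
p = gmm_density(np.zeros(2), weights, means, covs)
print(round(p, 4))  # ≈ 0.0796
```

In practice the means would come from the K-means centroids described above, and the weights and covariances from an EM re-estimation step.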

Given an observation sequence O and an HMM λ = (A, B, π), the probability P(O|λ) of the observed sequence can be computed by the forward-backward procedure [10]. The forward variable α_t(i) is defined as the probability of the partial observation sequence O_1 O_2 ... O_t (up to time t) and state S_i at time t, given the model λ; the backward variable β_t(i) is defined as the probability of the partial observation sequence from t+1 to the end, given state S_i at time t and the model λ. The probability of the observation sequence is computed as follows:

P(O|λ) = Σ_{i=1}^{N} α_t(i) β_t(i) = Σ_{i=1}^{N} α_T(i)        (3.4)

and the probability of being in state i at time t, given the observation sequence O and the model λ, is:

γ_t(i) = α_t(i) β_t(i) / P(O|λ)        (3.5)
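The recursions behind (3.4) and (3.5) can be sketched for a discrete-observation HMM as follows; the two-state toy model parameters are illustrative only, not taken from the paper:

```python
import numpy as np

def forward_backward(A, B, pi, obs):
    """Forward variables alpha_t(i) and backward variables beta_t(i)
    for a discrete HMM lambda = (A, B, pi)."""
    N, T = A.shape[0], len(obs)
    alpha = np.zeros((T, N))
    beta = np.zeros((T, N))
    alpha[0] = pi * B[:, obs[0]]
    for t in range(1, T):                      # forward recursion
        alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]
    beta[T - 1] = 1.0
    for t in range(T - 2, -1, -1):             # backward recursion
        beta[t] = A @ (B[:, obs[t + 1]] * beta[t + 1])
    return alpha, beta

# Toy 2-state, 2-symbol model.
A = np.array([[0.7, 0.3], [0.4, 0.6]])
B = np.array([[0.9, 0.1], [0.2, 0.8]])
pi = np.array([0.5, 0.5])
obs = [0, 1, 0]
alpha, beta = forward_backward(A, B, pi, obs)

# Eq. (3.4): the sum alpha_t(i)*beta_t(i) gives P(O|lambda) for any t.
probs = (alpha * beta).sum(axis=1)
print(np.allclose(probs, probs[0]))  # True

# Eq. (3.5): state occupancy gamma_t(i).
gamma = alpha * beta / probs[0]
```

Each row of `gamma` sums to one, as expected for a probability distribution over states at a given time.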

An ergodic (fully connected) HMM has all states linked together: every state can be reached from any state. The Bakis HMM is a left-to-right HMM with a correspondingly constrained transition matrix.

IV. PIPELINING RECOGNITION AGENTS
A speech recognition agent is based on a traditional pattern recognition approach. The main elements are shown in the block diagram of Figure 2.b. The pre-processing block adapts the characteristics of the input signal to the recognition system. It is essentially a set of filters whose task is to enhance the characteristics of the speech signal and minimize the effects of the background noise produced by the external conditions and the motor. A Kalman filter is used to enhance the command word relative to the other words [14]. The speech locator (SL) block detects the beginning and end of the word pronounced by the user, thus eliminating silence.

Fig. 2.a: Pattern matching, GMM then HMM, with a weighting vector (feature extraction EN, MFCC, CZEM + CWT parameters → GMM N-class separator → HMM N-class separator → decision: 8-bit action transmitted by RF).
Fig. 2.b: Block diagram of the proposed design (input word → preprocessing → SL → feature extraction ES, MFCC, CZEXM, CWT → pipelining recognition system block (PRS) with reference word patterns → decision block).


The SL processes the samples of the filtered input waveform, comprising useful information (the word pronounced) and any noise surrounding the PC. Its output is a vector of samples of the word (i.e., those included between the detected endpoints). The implemented SL is based on analysis of zero-crossing points and the energy of the signal; the linear-prediction mean-square-error computation helps delimit the beginning and the end of a word, which makes it computationally quite simple. The parameter extraction block analyses the signal, extracting a set of parameters with which to perform the recognition process. The signal is analysed over 20-millisecond frames, at 256 samples per frame. Five types of parameters are extracted: normalized extremes rate with normalized zero-crossing rate (CZEM), Gabor wavelet coefficients, energy segments (ES), and MFCCs [4]. These parameters were chosen for computational simplicity (CZEM, ES), robustness to background noise (12 cepstral parameters), and, for the Gabor wavelets, the extraction of useful parameters such as formants. The reference pattern block is created during the training phase of the application, where the user is asked to utter each command word ten times. For each word, ten parameter vectors are extracted from the segments and stored. The matching block applies the GMM classifier and then the HMM models, in order, comparing the reference patterns with those extracted from the input signal in both methods. The matching decision integrates a hybrid recognition block based on the five methods and a weighting vector; the value of this vector is taken from the results of these methods. For example, if the GMM and then the HMM recognise the first word in Table 1, the decision block will report the first word as uttered. Tests were made using the GMM and HMM methods separately; from the results obtained, a weighting vector is extracted based on the recognition rate of each method.
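The paper does not give the exact combination rule, but one plausible reading of the pipelined GMM-then-HMM decision can be sketched as follows; the word list, the per-classifier scores, and the weight values are illustrative assumptions, not the paper's trained values:

```python
import numpy as np

# Candidate vocabulary, as in Table 1 (transliterations from the paper).
WORDS = ["Diraa", "Saad", "Iftah", "Ighlak", "Kif"]

def pipelined_decision(gmm_scores, hmm_scores, weights=(0.45, 0.55)):
    """Two-stage decision: if GMM and HMM agree, accept the word;
    otherwise fall back on a weighted vote. The weights would be
    derived from each method's measured recognition rate
    (the 0.45/0.55 split here is purely illustrative)."""
    gmm_pick = int(np.argmax(gmm_scores))
    hmm_pick = int(np.argmax(hmm_scores))
    if gmm_pick == hmm_pick:  # both classifiers agree
        return WORDS[gmm_pick]
    combined = (weights[0] * np.asarray(gmm_scores)
                + weights[1] * np.asarray(hmm_scores))
    return WORDS[int(np.argmax(combined))]

gmm = [0.6, 0.1, 0.1, 0.1, 0.1]
hmm = [0.5, 0.3, 0.1, 0.05, 0.05]
print(pipelined_decision(gmm, hmm))  # Diraa
```

When both stages agree (as in the example, mirroring the "first word in Table 1" case in the text), the weighting vector plays no role; it only breaks ties between disagreeing classifiers.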
Figure 2 shows the elements making up the main blocks of the hybrid recognition system (HRS). A voice command system and an interface to the robot arm are implemented.

V. ROBOT ARM HARDWARE DESCRIPTION
A. Robot Arm Description
As shown in Figures 3.a and 3.b, the structures of the mechanical hardware and the computer board of the four-DOF robot arm in this paper are similar to those in [5-6]. The system is composed of:
- a robot arm, reference T45, from TEREL, France;
- a command module;
- a power supply box.
Four movements of the robot (base, shoulder, limb and hand) can be controlled. Each motor is driven through a moto-reductor block (1/500) powered by +12 and -12 volts. The position feedback voltage is provided by a linear rotary potentiometer fixed on the moto-reductor block. The grip, however, operates in open loop.

B. The New Design for Tele-Guiding the TR-45
A parallel-port interface was designed to show the real-time commands. It is based on two TTL 74HCT573 latches and 10 light-emitting diodes (LEDs): 4 green LEDs to indicate each recognized robot arm part word ("Dirrar", "Saad", "Meassam", "Mekbadh"), 5 yellow LEDs to indicate each recognised command word ("fawk", "tahta", "Yamine", "Yassar", "Kif"), and a red LED to indicate a wrong or unrecognized word. Other LEDs can be added for future insertion of new command words into the vocabulary, for example "Iftah" and "Ighlak" for the grip, as shown in Figure 4. The order to be executed by the TR45 is transmitted by the wireless transmission circuit TXM-433-10, which provides a transmission rate of 10 kb/s.

Fig. 3.a: Block diagram of the parallel port interface (microphone → personal computer running the HRS → TXM-433-10 transmitter with Tx antenna; set of green, yellow and red LEDs).

Fig. 3.b: Robot Arm Control System block diagram (reception antenna → RF receiver → PIC16F876 with 4 MHz quartz → four H-bridges driving motors M1-M4; +12 V DC power supply).

C. TR45 Interface Description
Since the robot arm in this paper needs to perform simpler tasks than those in [5-6], the robot arm electronic control system can be designed simply. The computer board consists of a PIC16F876 with 8K-instruction Flash memory, analogue inputs, timers and a watchdog timer, and USART and I2C serial communication interfaces [15]; three H-bridge drivers using BD134 and BD133 transistors for the DC motors; and an RF receiver module from RADIOMETRIX, the SILRX-433-10 (modulation frequency 433 MHz, transmission rate 10 kb/s) [16]. In order to protect the microcontroller from power feedback signals, a set of opto-transistors was added. Each motor within the robot arm T45 is designated by a name and performs the task corresponding to a received command, as in Table 1. Commands and their corresponding tasks assigned to the robot arm motors may be changed in order to enhance or adapt the application. In the recognition phase, the application gets the word to be processed, treats it, then takes a decision by setting the corresponding bit in the parallel-port data register, so the corresponding LED turns on. The code is also transmitted in serial mode to the TXM-433-10.
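A minimal sketch of this decision-to-code step, assuming a hypothetical one-hot bit assignment (the paper does not list the actual 8-bit codes, so the mapping below is illustrative only):

```python
# Hypothetical mapping of recognized command words to bit positions in
# the 8-bit code placed on the parallel port and sent to the TXM-433-10.
COMMANDS = {"fawk": 0, "tahta": 1, "Yamine": 2, "Yassar": 3, "Kif": 4}

def command_byte(word):
    """One-hot 8-bit code: setting bit n lights LED n and selects action n.
    Bit 7 is used here as the red-LED 'not recognized' flag."""
    if word not in COMMANDS:
        return 0x80  # red LED: wrong or unrecognized word
    return 1 << COMMANDS[word]

print(format(command_byte("Yamine"), "08b"))   # 00000100
print(format(command_byte("unknown"), "08b"))  # 10000000
```

The same byte drives both the LED latches on the parallel port and the serial frame handed to the RF transmitter, which keeps the PC-side logic to a single table lookup.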

5. The rate of recognition with enhanced parameters (ES, MFCC, CWT) and the combined GMM+HMM classifier.
6. The effect on computation time of combining the different techniques.

For the first four tests, each command word was uttered 50 times. The recognition rate for each word is presented in Figs. 5.a, 5.b, and 5.c. Fig. 5.b presents the improvement from using combined features and classifiers. Fig. 5.d represents the computation time for each combination (hybridized classifiers); clearly, the computation time grows as more data parameters are processed or more classifiers are pipelined. Table 1 shows the results of the five combinations; each column contains the results for the five selected words. The word "Kif" gives the best results, since it is quite different from the others in length and pronunciation.

Table 1. Recognition rate (%) of each method.

Word               | First | Second | Third | Fourth | Fifth
-------------------|-------|--------|-------|--------|------
Diraa (upper limb) |  60   |   70   |  72   |   75   |  87
Saad (limb)        |  65   |   71   |  74   |   82   |  87
Iftah (open)       |  70   |   75   |  75   |   76   |  86
Ighlak (close)     |  68   |   72   |  76   |   76   |  89
Kif (stop)         |  75   |   78   |  80   |   84   |  92

Fig. 4. Overview of the Robot Arm T45.

VI. EXPERIMENTS ON THE SYSTEM
The corpus is based on 50 students from the 1st and 2nd year of the Master's degree in our department; they had already taken a course on robotics and a lecture on human-machine communication. 40 students were used for the training phase and 10 students for the tests. The developed system was tested within the L.A.S.A. laboratory. The tests were done only on the five command words. Six different conditions were tested:
1. The rate of recognition using the classical methods with classical ASR parameters (ES, MFCC) and GMM as classifier.
2. The rate of recognition using the classical methods with classical ASR parameters (ES, MFCC) and HMM as classifier.
3. The rate of recognition with enhanced parameters (ES, MFCC, CWT) and GMM as classifier.
4. The rate of recognition with enhanced parameters (ES, MFCC, CWT) and HMM as classifier.

Fig. 5.b. Recognition rate of each method: bar-graph results of the pipelining techniques and combined features for the words Ammame, Wara, Yamine, Yassar and Kif (recognition rate in %, methods 1-5).

Fig. 5.c. Results of combining techniques (EN+MFCC+CWT, GMM+HMM) for the words Ammame, Wara, Yamine, Yassar and Kif.

Fig. 5.d. The effect of the combinations on computation time (EN+MFCC, GMM; EN+MFCC, HMM; EN+MFCC+CWT, GMM; EN+MFCC+CWT, HMM; EN+MFCC+CWT, GMM+HMM).

VII. CONCLUSION AND FUTURE DIRECTIONS
Evaluations show the interest of a tele-operated arm manipulator for users, including handicapped people. Since the designed robot consists of a microcontroller and other low-cost components, namely RF transmitters, the hardware design can easily be carried out. The results of the tests show that a better recognition rate can be achieved inside the laboratory, especially if the phonemes of the words selected for voice command are quite different. A good position of the microphone and additional filtering may further enhance the recognition rate. Several interesting applications of the proposed system, different from previous ones, are possible, as mentioned in Section 5. Beside the designed application, a hybrid approach to the implementation of an isolated-word recognition agent (HSR) was used. This approach can be implemented easily within a DSP, a CMOS DSP microcontroller, or even FPGA modules. The use of a hybrid technique based on classical recognition methods makes it easier to separate the classes represented by the various words, thus simplifying the task of the final decision block. The tests carried out have shown an improvement in performance in terms of misclassification of the words pronounced by the user. Segmenting the word into three principal frames for the zero crossing and extremes method gives a better recognition rate. The idea can be implemented easily within a hybrid design using a DSP with a microcontroller, since it does not need much memory capacity. Finally, we note that by simply changing the set of command words, we can use this system to control other objects by voice command, such as an electric wheelchair or a set of home appliances (TV, curtains, lights on/off). One line of research under way is the implementation of force-feedback sensors for the grip, with adapted algorithms for grasping smooth objects.
Another line of research to be investigated in the future relates to natural language recognition based on semantics and a large database.

REFERENCES
[1] Nguyen L., Belouchrani A., Abed-Meraim K. and Boashash B., "Separating more sources than sensors using time-frequency distributions," in Proc. ISSPA, Malaysia, 2001.
[2] Mokhtari M., Ghorbal M. and Kadouche R., "Mobilité et Services : application aux aides technologiques pour les personnes handicapées" [Mobility and services: application to assistive technologies for handicapped people], in Proceedings 7th ISPS, Algiers, May 2005, pp. 39-48.
[3] Arbeille P., Ruiz J. and Chevillet M., "Teleoperated robotic arm for echographic diagnosis in obstetrics and gynecology," Ultrasound in Obstetrics and Gynecology, Vol. 24, No. 3, pp. 242-243, August 2004.
[4] Gu L. and Rose K., "Perceptual harmonic cepstral coefficients for speech recognition in noisy environment," Proc. ICASSP 2001, Salt Lake City, Utah, May 2001.
[5] Pasca C., Payeur P., Petriu E.M. and Cretu A.M., "Intelligent haptic sensor system for robotic manipulator," in Proc. IMTC/2004, IEEE Instrum. Meas. Technol. Conf., pp. 279-284, Italy, 2004.
[6] Chan T.F. and Shen J., Image Processing and Analysis - Variational, PDE, Wavelet, and Stochastic Methods, Society for Industrial and Applied Mathematics, ISBN 089871589X, 2005.
[7] Daubechies I., Ten Lectures on Wavelets, Society for Industrial and Applied Mathematics, 1992, ISBN 0-89871-274-2.
[8] Hongyu L., Zhao Y., Dai Y. and Wang Z., "A secure voice communication system based on DSP," IEEE 8th International Conf. on Control, Automation, Robotics and Vision, Kunming, China, 2004, pp. 132-137.
[9] Ferrer M.A., Alonso I. and Travieso C., "Influence of initialization and stop criteria on HMM based recognizers," IEE Electronics Letters, Vol. 36, pp. 1165-1166, 2000.
[10] Vitrani M., Morel G. and Ortmaier T., "Automatic guidance of a surgical instrument with ultrasound based visual servoing," in Proc. IEEE Int. Conf. on Robotics and Automation, Barcelona, Spain, April 2005.
[11] Reynolds D., "Automatic speaker recognition, acoustics and beyond," MIT Lincoln Laboratory, JHU CLSP, 10 July 2002.
[12] Addison P.S., "Wavelet transforms and the ECG: a review," Physiol. Meas., Vol. 26, R155-R199, 2005.
[13] Kwee H. and Stanger C.A., "The Manus robot arm," Rehabilitation Robotics Newsletter, Vol. 5, No. 2, 1993.
[14] Evers H.G., Beugels E. and Peters G., "MANUS toward new decade," Proc. ICORR 2001, Vol. 9, pp. 155-161, 2001.
[15] PIC16F876 data sheet, Microchip Inc., User's Manual, 2003. Available at: http://www.microchip.com.
[16] Radiometrix components, TXM-433 and SILRX-433 Manual, HF Electronics Company. Available at: http://www.radiometrix.com.
[17] Kim W.J. et al., "Development of a voice remote control system," Proceedings of the 1998 Korea Automatic Control Conference, Pusan, Korea, Oct. 1998, pp. 1401-1404.
[18] Mahoney R.M., Jackson R.D. and Dargie G.D., "An interactive robot quantitative assessment test," Proceedings of RESNA '92, pp. 110-112, 1992.
[19] Bourlard H. and Morgan N., Connectionist Speech Recognition: A Hybrid Approach, Kluwer Academic, Norwell, MA (USA), 1994.
[20] Neukirchen Ch. and Rigoll G., "Advanced training methods and new network topologies for hybrid MMI-connectionist/HMM speech recognition systems," Proc. IEEE-ICASSP, Munich, 1997, pp. 3257-3260.
[21] El-Emary I., Fezari M. and Atoui H., "Hidden Markov model/Gaussian mixture models (HMM/GMM) based voice command system," Scientific Research and Essays, Vol. 6(2), pp. 341-350, 18 January 2011.