An Integrated Environment for Hidden Markov Models: A Scilab Toolbox

Tarik AL-ANI and Yskandar HAMAM
SC2I Laboratory, Control Department, E.S.I.E.E., Cité Descartes, BP 99, 93162 Noisy-le-Grand Cedex, FRANCE

[email protected] and [email protected]

Abstract. A Hidden Markov Model toolbox is presented within the Scilab environment. In this toolbox, popular methods for the resolution of HMM problems are incorporated. These methods cover the training and recognition phases. Models may be used with discrete or continuous observations. The toolbox includes conventional methods as well as extensions.

1. Introduction

Hidden Markov models (HMM) have been widely applied in automatic speech recognition. In this field, signals are encoded as temporal variations of a short-time power spectrum [12]. HMM applications are now being extended to many other fields such as pattern recognition, signal processing, and modeling and control of dynamic systems. They are well suited for the classification of one- or two-dimensional signals. A HMM is a double stochastic process with one underlying process that is not observable but may be estimated through a set of processes that produce a sequence of observations. HMMs may be used for the treatment of problems where information is uncertain and incomplete. Their use necessitates two stages: a training stage, where the stochastic process is estimated through extensive observation, and an application stage, where the model may be used in real time to obtain sequences of maximum probability. HMMs owe their success to the existence of many efficient and reliable algorithms. For the training stage, the Baum-Welch algorithm [2], [3] has become very popular due to its reliability and efficiency. This algorithm has since been extended to many different probability distribution functions, and proofs of convergence are now available for these models. This method is based on the maximum likelihood criterion. Many works are now available on algorithms accounting for this as well as other criteria

[10], [4], [6]. Due to the non-convexity of the criterion, methods such as simulated annealing are now used [5], [7]. The use of a trained HMM in real time necessitates an efficient algorithm which gives the state sequence of maximum probability. The Viterbi algorithm [13] fulfills this need. It is a polynomial-time dynamic programming algorithm, and it is very efficient, reliable and robust. In order to make the HMM techniques readily usable, the authors use the Scilab environment. In Section 2 we introduce the conventional hidden Markov models. In Section 3, the different toolbox functions are introduced. The conclusions of this paper are presented in Section 4.

2. Basic Hidden Markov Models (HMMs)

A Hidden Markov Model is defined by the triplet $\lambda = \{\pi, A, B\}$, where $\pi$, $A$ and $B$ are the initial state distribution vector, the matrix of state transition probabilities and the matrix of measurement probability distributions, respectively:

$\pi = [\pi_1, \pi_2, ..., \pi_N]$, $\pi_i = P(q_0 = s_i)$;
$A = [a_{ij}]$, $a_{ij} = P(q_{k+1} = s_j \mid q_k = s_i)$;
$B = [b_j(z(k))]$, $b_j(z(k)) = P(z(k) \mid q_k = s_j)$;
$i, j \in \{1, 2, ..., N\}$, $k \in \{1, 2, ..., K\}$.

The observation $z(k)$ may be discrete, $O(k) = v_k$, $v_k \in V = \{v_1, v_2, ..., v_M\}$, or continuous, $O(k) = v_k \in R^r$.

In general, at each instant of time $k$, the model is in one of the states $q_i$. It outputs $v_k$ (a discrete (DHMM) or continuous (CHMM) observation) with probability $b_j(O(k))$ and then jumps to state $q_j$ with probability $a_{ij}$. The model $\lambda$ may be obtained off-line by training [12] and, more recently, by an on-line scheme [9]. The state transition matrix represents the structure of the HMM. A typical HMM used for speech recognition [12] and other applications such as fault detection tasks [1] is

shown in Fig. 1.

Fig. 1. A typical state diagram of a 4-state left-to-right HMM.
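To make this generative reading of the model concrete, the following short Scilab sketch draws a state and an observation sequence from a given DHMM. It is an illustrative example only, not a toolbox function (the toolbox provides dobsgen for this purpose); the function names and the argument layout are assumptions.

function i=draw(p)
  // draw an index according to the discrete probability vector p
  i=find(rand(1,1)<=cumsum(p),1);
endfunction

function [X,O]=hmm_sample(Pinit,A,B,K)
  // Draw a state sequence X and an observation sequence O of length K
  // from a DHMM with initial distribution Pinit, transition matrix A
  // and observation matrix B (one row of emission probabilities per state).
  X=zeros(1,K); O=zeros(1,K);
  X(1)=draw(Pinit);
  O(1)=draw(B(X(1),:));
  for k=2:K
    X(k)=draw(A(X(k-1),:));   // jump to the next state with probability a_ij
    O(k)=draw(B(X(k),:));     // emit a symbol from the new state with probability b_j(v_m)
  end
endfunction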

In speaker-independent isolated-word recognition, we usually assume a vocabulary of W words to be recognized, a training set of L tokens of each word, and an independent testing set. We build one HMM for each word in the vocabulary. In fault detection of continuous dynamic systems [1], we may represent the evolution of one system component by three hidden states: a nominal, a transient and a faulty state. For this process, the observations may be the estimated system parameters or the system input-output signals. The use of a HMM includes two phases:

- Training phase: we build one or more HMMs for the process (or processes $w = 1, ..., W$) to be modeled. For example, in speaker-independent isolated-word recognition, we use the observations from the set of L tokens to estimate the optimum parameters for each word, i.e. we create $\lambda_w$ for the w-th vocabulary word, $1 \le w \le W$.

- Recognition phase: for each hidden sequence in the test set, characterized by an observation sequence $O = (O_1, O_2, ..., O_K)$, the occurrence probability $P_w = P(O \mid \lambda_w)$ is calculated for each process model $\lambda_w$. The unknown observation $O$ is then classified as the process $w^* = \arg\max_{1 \le w \le W} P_w$.

For the recognition phase, the forward probability function, which evaluates the probability of a partial observation sequence ending at node $i$, may be used [12]:

$\alpha_k(i) = P(O_1 O_2 ... O_k, q_k = s_i \mid \lambda_w)$, $1 \le i \le N$, $1 \le k \le K$,

i.e. the joint probability of the partial observation sequence and of being in state $s_i$ at time $k$, given the model $\lambda_w$. $\alpha_k(i)$ may be recursively computed [12]. The occurrence probability is obtained by

$P(O \mid \lambda_w) = \sum_{i=1}^{N} \alpha_K(i)$.
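The following Scilab sketch evaluates $P(O \mid \lambda_w)$ for a DHMM by the forward recursion. It is illustrative only, not a toolbox function (the toolbox provides dprob for this computation); the function name and argument layout are assumptions.

function [PO,alpha]=forward_prob(Pinit,A,B,O)
  // Forward algorithm for a DHMM: Pinit (N x 1) initial distribution,
  // A (N x N) transition matrix, B (N x M) observation matrix,
  // O vector of K observed symbol indices.
  [N,M]=size(B); K=length(O);
  alpha=zeros(N,K);
  // initialisation: alpha_1(i) = pi_i * b_i(O_1)
  alpha(:,1)=Pinit(:).*B(:,O(1));
  // induction: alpha_k(j) = [sum_i alpha_{k-1}(i) a_ij] * b_j(O_k)
  for k=2:K
    alpha(:,k)=(A'*alpha(:,k-1)).*B(:,O(k));
  end
  // termination: P(O|lambda) = sum_i alpha_K(i)
  PO=sum(alpha(:,K));
endfunction

For word recognition, $P(O \mid \lambda_w)$ would be evaluated for every word model and the word with the largest probability selected; in practice the recursion is scaled or carried out in the log domain to avoid underflow on long sequences.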

In some applications of HMMs, the most probable state sequence $Q_K = q_1 q_2 ... q_K$ that generates a given observation sequence $O = O_1 O_2 ... O_K$ is desired. To estimate the optimum state sequence, the dynamic programming Viterbi algorithm [13] may be used. The Viterbi algorithm finds the state sequence $Q^*$ such that

$P(O, Q_K^* \mid \lambda_w) = \max_Q P(O, Q \mid \lambda_w)$.

We define the probability function $\delta_k(i)$ at the $i$-th state and time $k$ of the multi-stage trellis by

$\delta_k(i) = \max_{q_1, ..., q_{k-1}} P(q_1 ... q_{k-1}, q_k = s_i, O_1 O_2 ... O_k \mid \lambda_w)$, $k \ge 1$,

i.e. the maximum partial observation probability associated with a state sequence of the HMM at time $k$. In order to allow for path tracking, $\psi_k(i)$ is also introduced to keep track of the most probable state sequence visited by an observation $O$. We then compute $\delta_k(i)$ and $\psi_k(i)$ by the following steps (a complete Scilab implementation, dviterb, is given in the script example at the end of the Appendix):

1. Initialisation ($k = 1$):
$\delta_1(i) = \pi_i b_i(O_1)$, $\psi_1(i) = 0$, $i = 1, 2, ..., N$.

2. Induction:
$\delta_k(j) = \max_{1 \le i \le N} [\delta_{k-1}(i)\, a_{ij}]\, b_j(O_k)$,
$\psi_k(j) = \arg\max_{1 \le i \le N} [\delta_{k-1}(i)\, a_{ij}]$,
$1 \le j \le N$, $2 \le k \le K$.

3. Termination:
$P^* = \max_{1 \le i \le N} [\delta_K(i)]$,
$q_K^* = \arg\max_{1 \le i \le N} [\delta_K(i)]$.

Backtracking:
$q_k^* = \psi_{k+1}(q_{k+1}^*)$, $k = K-1, K-2, ..., 1$.

For the training phase, we briefly introduce two methods: Baum-Welch based and Viterbi based training.

- Baum-Welch based training

The Baum-Welch reestimation algorithm is the standard method used to reestimate the HMM parameters [2], [3]. In this method, a new set of parameters $\tilde{\lambda}_w$ is chosen such that $P(O \mid \tilde{\lambda}_w)$ is maximized for a given observation sequence $O_1, O_2, ..., O_K$. The reestimation formulas [12] may be derived directly by maximizing

(using standard constrained optimization techniques) Baum's auxiliary function

$Q(\lambda_w, \tilde{\lambda}_w) = \sum_Q P(Q \mid O, \lambda_w) \log P(O, Q \mid \tilde{\lambda}_w)$

over $\tilde{\lambda}_w$. It has been proven by Baum et al. [2], [3] that maximization of $Q(\lambda_w, \tilde{\lambda}_w)$ leads to increased likelihood, i.e.

$\max_{\tilde{\lambda}_w} [Q(\lambda_w, \tilde{\lambda}_w)] \Rightarrow P(O \mid \tilde{\lambda}_w) \ge P(O \mid \lambda_w)$.

Eventually the likelihood function converges to a critical point. Let us define the backward probability

$\beta_k(i) = P(O_{k+1}, O_{k+2}, ..., O_K \mid q_k = s_i, \lambda_w)$,

which evaluates the probability of the partial observation sequence from time $k+1$ to the end. $\beta_k(i)$ may be recursively computed [12]. The computation of the forward and backward probabilities can be described by a trellis graph [12]. Using $\alpha_k(i)$ and $\beta_k(i)$, the Baum-Welch reestimation equations [2], [3], [11] are as follows:

$\tilde{\pi}_i = \dfrac{\alpha_1(i)\,\beta_1(i)}{\sum_{i=1}^{N} \alpha_K(i)}$, $1 \le i \le N$;

$\tilde{a}_{ij} = \dfrac{\sum_{k=1}^{K-1} \alpha_k(i)\, a_{ij}\, b_j(O_{k+1})\, \beta_{k+1}(j)}{\sum_{k=1}^{K-1} \alpha_k(i)\, \beta_k(i)}$, $1 \le i, j \le N$.

For a discrete HMM, we have

$\tilde{b}_j(v_m) = \dfrac{\sum_{k=1,\, O_k = v_m}^{K} \alpha_k(j)\, \beta_k(j)}{\sum_{k=1}^{K} \alpha_k(j)\, \beta_k(j)}$, $1 \le j \le N$.

For a continuous HMM with a single gaussian observation density

$b_j(O_k) = P(O_k \mid q_k = s_j) = \dfrac{1}{\sqrt{2\pi}\,\sigma_j} \exp\left[-\dfrac{(O_k - \mu_j)^2}{2\sigma_j^2}\right]$,

we obtain

$\tilde{\mu}_j = \dfrac{\sum_{k=1}^{K} \alpha_k(j)\, \beta_k(j)\, O_k}{\sum_{k=1}^{K} \alpha_k(j)\, \beta_k(j)}$, $\qquad \tilde{\sigma}_j^2 = \dfrac{\sum_{k=1}^{K} \alpha_k(j)\, \beta_k(j)\, (O_k - \tilde{\mu}_j)^2}{\sum_{k=1}^{K} \alpha_k(j)\, \beta_k(j)}$.

Liporace [10] has extended these formulas to the multivariate observation case. For all these reestimation equations, the following constraints must be respected, $1 \le i, j \le N$:

$\sum_{i=1}^{N} \pi_i = 1$; $\quad \sum_{j=1}^{N} a_{ij} = 1$, $1 \le i \le N$;

$\sum_{m=1}^{M} b_j(v_m) = 1$, $1 \le j \le N$ (discrete case); $\quad \int_{-\infty}^{\infty} b_j(x)\, dx = 1$, $1 \le j \le N$ (continuous case).
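As an illustration of how these formulas translate into Scilab, a minimal sketch of one reestimation iteration for a DHMM follows. It is not a toolbox function (the toolbox provides drecur and dreest for this task); the function name, the argument layout, the absence of numerical scaling and the lack of handling for zero-occupancy states are simplifying assumptions.

function [Pinit2,A2,B2]=bw_step(Pinit,A,B,O)
  // One Baum-Welch reestimation step for a DHMM.
  // Pinit: N x 1 initial distribution, A: N x N transitions,
  // B: N x M observation probabilities, O: vector of K symbol indices.
  [N,M]=size(B); K=length(O);
  // forward and backward probabilities (unscaled, for clarity only)
  alpha=zeros(N,K); beta=zeros(N,K);
  alpha(:,1)=Pinit(:).*B(:,O(1));
  for k=2:K
    alpha(:,k)=(A'*alpha(:,k-1)).*B(:,O(k));
  end
  beta(:,K)=ones(N,1);
  for k=K-1:-1:1
    beta(:,k)=A*(B(:,O(k+1)).*beta(:,k+1));
  end
  PO=sum(alpha(:,K));          // P(O | lambda)
  gam=(alpha.*beta)/PO;        // gamma_k(i) = alpha_k(i)*beta_k(i)/P(O|lambda)
  // reestimated initial distribution: pi_i = alpha_1(i)*beta_1(i)/P(O|lambda)
  Pinit2=gam(:,1);
  // reestimated transition matrix
  num=zeros(N,N);
  for k=1:K-1
    num=num+(alpha(:,k)*(B(:,O(k+1)).*beta(:,k+1))').*A;
  end
  A2=(num/PO)./(sum(gam(:,1:K-1),'c')*ones(1,N));
  // reestimated observation probabilities
  B2=zeros(N,M);
  for m=1:M
    idx=find(O==m);
    if ~isempty(idx) then
      B2(:,m)=sum(gam(:,idx),'c');
    end
  end
  B2=B2./(sum(gam,'c')*ones(1,M));
endfunction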

- Viterbi based training

Starting from an initial model $\tilde{\lambda}_w^0$, calculate

$\tilde{\lambda}_w^{k+1} = \arg\max_{\tilde{\lambda}_w} \max_Q P(O, Q \mid \tilde{\lambda}_w)$

iteratively until $\tilde{\lambda}_w^{k+1} = \tilde{\lambda}_w^k$. The superscript here indicates the iteration number. The Viterbi algorithm and the new parameters are used at each iteration to trace the optimal state sequence $Q^*$ for each training sequence. An observation vector is reassigned a state if its original state assignment is different from the tracing result, i.e. we assign $O_t$ to state $i$ if $q_t^* = i$. If any observation vector is reassigned a new state, the new state assignment is used and the procedure is repeated until $\tilde{\lambda}_w^{k+1} = \tilde{\lambda}_w^k$. A minimal sketch of this scheme is given below.
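The following Scilab sketch illustrates this scheme for a DHMM with a single training sequence, using the dviterb function listed at the end of the Appendix to trace the best path and simple counting to reestimate the parameters. The function name, the count-based reestimation and the fixed iteration limit are illustrative assumptions, not the toolbox implementation (the toolbox provides dtrajest for this purpose).

function [Pinit,A,B]=viterbi_train(Pinit,A,B,O,niter)
  // Viterbi-based training of a DHMM on one observation sequence.
  // O: column vector of K observed symbol indices (as expected by dviterb).
  [N,M]=size(B); K=length(O);
  Xold=zeros(1,K);
  for it=1:niter
    // trace the optimal state sequence with the current model
    [LP,X]=dviterb(Pinit,A,B,O);
    if and(X(:)==Xold(:)) then break; end   // no observation was reassigned
    Xold=X;
    // reestimate the parameters by counting along the traced path
    A2=zeros(N,N); B2=zeros(N,M);
    for k=1:K-1
      A2(X(k),X(k+1))=A2(X(k),X(k+1))+1;
    end
    for k=1:K
      B2(X(k),O(k))=B2(X(k),O(k))+1;
    end
    for i=1:N     // normalise row by row; unvisited states keep the old values
      if sum(A2(i,:))>0 then A(i,:)=A2(i,:)/sum(A2(i,:)); end
      if sum(B2(i,:))>0 then B(i,:)=B2(i,:)/sum(B2(i,:)); end
    end
    Pinit=zeros(N,1); Pinit(X(1))=1;
  end
endfunction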

3. Scilab Toolbox Functions

The first version of this toolbox includes 58 functions, see the Appendix. These functions cater for both discrete (DHMM) and continuous (CHMM) observations. Continuous, strictly log-concave probability distribution functions are either simple multivariate gaussians [10] or mixtures of multivariate gaussians [8]. A Continuous Autoregressive HMM (CAHMM) is also available. A freely distributed version of this toolbox is now available by anonymous ftp from:

ftp.inria.fr/INRIA/Projects/Meta2/Scilab/Contrib/hmm

The toolbox is distributed both in source code format and as binary. The Scilab HMM Toolbox functions, see the Appendix, are essentially based on the following algorithms:

- State estimation algorithms

1. State estimation at time k: the forward probability may be used to estimate the state at time k.

2. Best state sequence estimation by the Viterbi algorithm: this algorithm gives the single best state sequence for a given observation sequence.

- Training algorithms: the following training schemes are included

1. Baum-Welch and Liporace algorithms: based on the maximum likelihood principle, these algorithms are used to estimate the parameters of the HMM.

2. Training based on the Viterbi algorithm: functions are available based on reestimation with an embedded Viterbi algorithm. The matrices are estimated to maximize the probability of the sequence obtained by the embedded Viterbi algorithm.

3. Simulated annealing: the authors have developed a simulated annealing method for the training of HMMs [7]. This method is based on the choice of the optimal trajectory of the discrete state. It is applied to both discrete and continuous observations. The program developed needs no specific initialization of the algorithm by the user, the cooling schedule being general and applicable to any specific model. The method has been applied to randomly generated data and compared to the initial model. The user may either fix the structure of the stochastic matrices or keep them free. He does not need to make decisions such as the initial or final temperature and the chain length. He may, however, select the initial rate of acceptance of the algorithm at high temperatures.

- Various functions: functions for manipulating HMMs are also available, see the Appendix. These functions cover manipulating sequences of observations and states, calculating probabilities and maximum likelihoods, and other elements needed for writing customized algorithms. In many cases, a need for the generation of test sequences arises; functions for generating test sequences for both continuous and discrete observations are available.

4. Conclusions

In this paper, a brief description of the elements of an HMM toolbox for Scilab is presented. This toolbox is now being tested on many different problems for robustness and reliability. The toolbox is structured so that the user may extend it by adding new functions built from basic elements. Many other methods for HMMs are available, and the authors are working on extending the toolbox to cover some of them.

References

[1] T. Al-ani and Y. Hamam, "Fault detection in continuous dynamic systems using Hidden Markov Models," to appear.

[2] L. E. Baum, T. Petrie, G. Soules, and N. Weiss, "A maximization technique occurring in the statistical analysis of probabilistic functions of Markov chains," Ann. Math. Stat., 41(1), pp. 164-171, 1970.

[3] L. E. Baum, "An inequality and associated maximization technique in statistical estimation of probabilistic functions of Markov processes," Inequalities, 3, 1972.

[4] L. R. Bahl, P. F. Brown, P. V. de Souza, and R. L. Mercer, "Maximum mutual information estimation of hidden Markov model parameters for speech recognition," Proc. ICASSP'86, Tokyo, pp. 49-52, April 1986.

[5] B. P. Douglas, "Training of HMM recognizers by simulated annealing," Proc. IEEE Int. Conf. Acoustics, Speech and Signal Process., pp. 13-16, 1985.

[6] Y. Ephraim, A. Dembo, and L. R. Rabiner, "A minimum discrimination information approach for hidden Markov modeling," Proc. IEEE Int. Conf. Acoust. Speech and Signal Process., Dallas, pp. 25-28, 1987.

[7] Y. Hamam and T. Al-ani, "Simulated annealing approach for Hidden Markov Models," 4th WG-7.6 Working Conference on Optimization-Based Computer-Aided Modeling and Design, ESIEE, France, May 28-30, 1996.

[8] B. H. Juang, "Maximum likelihood estimation for mixture multivariate stochastic observations of Markov chains," AT&T Tech. J., vol. 64, no. 6, pp. 1235-1249, July-Aug. 1985.

[9] V. Krishnamurthy and J. B. Moore, "On-line estimation of Hidden Markov Model parameters based on the Kullback-Leibler information measure," IEEE Trans. Signal Processing, vol. 41, no. 8, pp. 2557-2573, 1993.

[10] L. R. Liporace, "Maximum likelihood estimation for multivariate observations of Markov sources," IEEE Trans. Inform. Theory, IT-28, pp. 729-734, 1982.

[11] S. E. Levinson, L. R. Rabiner and M. M. Sondhi, "An introduction to the application of the theory of probabilistic functions of a Markov process to automatic speech recognition," The Bell System Technical Journal, 62(4), 1983.

[12] L. R. Rabiner, "A tutorial on Hidden Markov Models and selected applications in speech recognition," Proc. IEEE, vol. 77, no. 2, pp. 257-286, February 1989.

[13] A. J. Viterbi, "Error bounds for convolutional codes and an asymptotically optimum decoding algorithm," IEEE Trans. Inform. Theory, IT-13, pp. 260-269, April 1967.

Appendix

List of the main Scilab Toolbox functions:

armulikes - calculates the likelihood that a Continuous Autoregressive HMM (CAHMM) has created the given output observation sequence.
armurecurs - one step of Baum-Welch learning for a CAHMM.
armureests - iterations of one-step Baum-Welch learning for a CAHMM.
armulike1 - calculates the probability that a CAHMM has created the given output observation sequence.
arlprob - calculates the likelihood needed in the function armulike1.
arprob - calculates the likelihood that a given observation was generated by a CAHMM.
bgauss - calculates the probability that an observation vector (Ot) is generated by a single gaussian pdf.
bmgauss - calculates the probability of generating the observation vector (Ot) in any one state by a single or a mixture of M gaussian functions.
bsgauss - calculates the probability of generating the observation vector (Ot) in any one state for a fixed mixture m of the M gaussian functions.
canneal - reestimates a CHMM model by simulated annealing.
cannealhmm - reestimates a left-right CHMM by simulated annealing.
cchkpar - checks the matrices Pinit, A, Me, Seg and C.
ccreate - initialisation of a CHMM with the x-windows dialog.
cgenom - generates observation matrices for left-right CHMM reestimation.
chmmdist - calculates the non-symmetric distance between two CHMMs.
chmmlist - puts the parameters of a CHMM in a list (used e.g. by chmmdist).
cinhmm - an example of a CHMM initialisation.
cobsgen - generates a state and an observation sequence for a CHMM.
crecur - reestimates a CHMM.
creest - recursive learning of a CHMM.
ctrajest - reestimates a CHMM.
cviterb - estimates the hidden state sequence of a CHMM.
ctrmo - reestimates a left-right CHMM.
danneal - reestimates a DHMM by simulated annealing.
dannealhmm - reestimates a left-right DHMM by simulated annealing.
dchkpar - checks the matrices Pinit, A and B.
dcreate - initialisation of a DHMM with the x-windows dialog.
dgenom - generates observation matrices for left-right DHMM reestimation.
dhmmdist - calculates the non-symmetric distance between two DHMMs.
dindanneal - an example of the initialisation of the function danneal.
dinhmm - an example of a DHMM initialisation.
dobsgen - generates an observation sequence from a given DHMM.
dprob - calculates the probability of a given observation sequence.
drecur - reestimates a DHMM.
dreest - recursive learning of a DHMM.
dtrajest - reestimates a DHMM by Viterbi.
dviterb - estimates the hidden state sequence of a DHMM.
dtrmo - reestimates the left-right DHMM.

Scilab script example:

function [LP,X]=dviterb(Pinit,A,B,O)
// This function (Viterbi algorithm) generates,
// for a discrete observation sequence, the
// best state sequence.
[lhs,rhs]=argn(0);
// check that the number of input arguments is correct
if rhs ~= 4 then
  printf("Wrong number of input arguments in function DVITERB. Function call is:\n");
  printf("[LP,X]=dviterb(Pinit,A,B,O)\n");
  abort;
end
// check output arguments
if lhs ~= 2 then
  printf("Wrong number of output arguments.\n");
  printf("continue? type y else n : ");
  t=read(%io(1),1,1,'(a)');
  if (t~='y') & (t~='Y') then
    abort;
  end
end
// check the consistency of the model matrices
if ~dchkpar(Pinit,A,B) then
  abort;
end
// T : observation length
[T,rO]=size(O);
// N : number of states
[N,cA]=size(A);
// M : number of observations per state
[lB,M]=size(B);
// Initialisation
t=1;
Phi=zeros(N,T);
// Psi : array to keep track of the best path
Psi=zeros(N,T);
for i=1:N
  Phi(i,t)=Pinit(i)*B(i,O(t));
end
// Recursion
for t=2:T
  for j=1:N
    xx=Phi(:,t-1).*A(:,j);
    [x,i]=max(xx);
    Phi(j,t)=x*B(j,O(t));
    Psi(j,t)=i;
  end
end
// Termination
[LP,xT]=max(Phi(:,T));
X(T)=xT;
// Backtracking
for k=1:T-1
  t=T-k;
  X(t)=Psi(X(t+1),t+1);
end
endfunction
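A minimal usage example of this script follows; the model values are invented purely for illustration, and dchkpar is assumed to be loaded from the toolbox together with the dviterb function above.

// Toy 2-state, 3-symbol DHMM (values invented for illustration only).
Pinit=[0.6; 0.4];                 // initial state distribution
A=[0.7 0.3; 0.4 0.6];             // state transition probabilities
B=[0.5 0.4 0.1; 0.1 0.3 0.6];     // observation probabilities per state
O=[1; 2; 3; 3; 1];                // observed symbol indices (column vector)
// dchkpar comes from the toolbox; dviterb is the script listed above
[LP,X]=dviterb(Pinit,A,B,O);
disp(X');                         // most probable state sequence
disp(LP);                         // probability of the best path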