Training Hidden Markov Models on Incomplete Sequences

Alexander A. Popov, Tatyana A. Gultyaeva, Vadim E. Uvarov
Novosibirsk State Technical University, Novosibirsk, Russia

Abstract – This paper deals with algorithms for training hidden Markov models on sequences with missing observations. Two methods are studied: imputation of missing observations using the Viterbi algorithm and marginalization of missing observations. They are compared to two standard techniques for dealing with missing data: imputation of gaps with the mode (the most frequent value) of the nearest observations and gluing together the observed parts of the sequence. The studied methods proved more effective than the standard ones, with marginalization performing slightly better than imputation using the Viterbi algorithm.

Index terms – Hidden Markov models, machine learning, sequences, Baum-Welch algorithm, missing observations, incomplete data.

This research has been supported by the Russian Ministry of Education and Science (project 2.541.2014K).
I. INTRODUCTION

Hidden Markov models (HMMs) are widely used in machine learning. They have been successfully applied in many areas, including speech recognition [1] and image recognition [2], where an object or process can be represented by an observable sequence of features produced according to some hidden Markov process. Despite the popularity of this approach, no unified and effective method has yet been developed to cope with missing data. In this paper we consider the problem of missing observations in training sequences when training HMMs.

Let us review the techniques for dealing with missing observations that have been used so far. Data not organized in sequences are processed, for example, with the Expectation-Maximization algorithm [3]. Some standard techniques are also used, such as casewise deletion of missing data, imputation with the mean (or mode) of the nearest observations, interpolation, etc. Approaches for dealing with missing data in HMMs were also presented in a number of papers [4], [5]; however, they were applied only to the task of classifying incomplete sequences with HMMs trained on clean data. This paper continues research carried out at the Department of Theoretical and Applied Informatics of Novosibirsk State Technical University [6]-[8].
II. PROBLEM DEFINITION

The aim of this work is to study a number of approaches to training hidden Markov models on sequences that contain missing observations.
III. THEORY

A. Hidden Markov Model

A hidden Markov model describes a random process that at each time $t = 1, \dots, T$ is in one of $N$ hidden states $s_1, \dots, s_N$ and at each following time either remains in the previous state or moves to another hidden state according to some transition probabilities. The hidden states are characterized by features that appear in the observation sequence $O = \{o_t\}, \; t = \overline{1, T}$.

We now define the parameters that completely describe an HMM with discrete observations. The hidden state of the HMM at time $t$ is denoted $q_t$ and the observation generated at time $t$ is denoted $o_t$. The discrete HMM is characterized by the initial distribution $\pi_i = p(q_1 = s_i), \; i = \overline{1, N}$, the transition probability matrix $A = \{a_{ij}\} = \{p(q_{t+1} = s_j \mid q_t = s_i)\}, \; i, j = \overline{1, N}$, the discrete set of symbols $V = \{v_1, \dots, v_M\}$, and the observation probability matrix $B = \{b_i(m)\} = \{p(o_t = v_m \mid q_t = s_i)\}, \; i = \overline{1, N}, \; m = \overline{1, M}$ [9].
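To make the notation concrete, the sketch below shows how the parameters of such a discrete HMM could be held in code. It is only an illustration of the definitions above; the variable names and the small numeric values are our own, not values prescribed by the paper.

```python
import numpy as np

# A discrete HMM with N hidden states and M observable symbols is fully
# described by (pi, A, B); the values below are illustrative only.
N, M = 3, 3
pi = np.array([1.0, 0.0, 0.0])        # initial distribution, pi[i] = p(q_1 = s_i)
A = np.array([[0.2, 0.2, 0.6],        # transition matrix, A[i, j] = p(q_{t+1} = s_j | q_t = s_i)
              [0.2, 0.3, 0.5],
              [0.1, 0.7, 0.2]])
B = np.array([[0.1, 0.1, 0.8],        # emission matrix, B[i, m] = p(o_t = v_m | q_t = s_i)
              [0.1, 0.8, 0.1],
              [0.8, 0.1, 0.1]])

# Each row of A and B must be a probability distribution.
assert np.allclose(A.sum(axis=1), 1.0) and np.allclose(B.sum(axis=1), 1.0)
```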
B. Forward-Backward Algorithm

Classification of sequences with HMMs is usually based on the probability of the observation sequence given the model. This value is calculated for the observation sequence under each competing HMM, and the sequence is assigned to the class whose HMM yields the highest probability. To calculate the probability of the sequence $O$ given the model $\lambda$, the forward-backward algorithm is usually applied [9]. The first part of the algorithm calculates the forward variables
$$\alpha_t(i) = P(o_1, o_2, \dots, o_t, q_t = s_i \mid \lambda), \quad t = \overline{1, T}, \; i = \overline{1, N}.$$
The forward variables are computed as follows:
1) initialization:
$$\alpha_1(i) = \pi_i b_i(o_1), \quad i = \overline{1, N}; \quad (1)$$
2) induction:
$$\alpha_{t+1}(i) = b_i(o_{t+1}) \sum_{j=1}^{N} \alpha_t(j) a_{ji}, \quad i = \overline{1, N}, \; t = \overline{1, T-1}; \quad (2)$$
3) termination:
$$p(O \mid \lambda) = \sum_{i=1}^{N} \alpha_T(i). \quad (3)$$
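As an illustration of equations (1)-(3), a minimal forward pass could look like the sketch below. The function name, the use of 0-based symbol indices, and the (pi, A, B) arrays from the previous listing are our own conventions, not code from the paper.

```python
import numpy as np

def forward(obs, pi, A, B):
    """Compute forward variables alpha[t, i] and the likelihood p(O | lambda).

    obs is a sequence of symbol indices o_1..o_T (0-based here).
    In practice, scaling or log-space arithmetic is needed for long sequences.
    """
    T, N = len(obs), len(pi)
    alpha = np.zeros((T, N))
    alpha[0] = pi * B[:, obs[0]]                      # (1) initialization
    for t in range(1, T):
        alpha[t] = B[:, obs[t]] * (alpha[t - 1] @ A)  # (2) induction
    return alpha, alpha[-1].sum()                     # (3) termination: p(O | lambda)
```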
The second part of the forward-backward algorithm calculates the backward variables
$$\beta_t(i) = P(o_{t+1}, o_{t+2}, \dots, o_T \mid q_t = s_i, \lambda), \quad t = \overline{1, T}, \; i = \overline{1, N}.$$
The backward variables are computed as follows:
1) initialization:
$$\beta_T(i) = 1, \quad i = \overline{1, N};$$
2) induction:
$$\beta_t(i) = \sum_{j=1}^{N} \beta_{t+1}(j) b_j(o_{t+1}) a_{ij}, \quad i = \overline{1, N}, \; t = \overline{1, T-1}. \quad (4)$$
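A matching sketch for the backward recursion in equation (4), under the same assumed conventions as the forward pass above:

```python
import numpy as np

def backward(obs, A, B):
    """Compute backward variables beta[t, i] for a discrete HMM."""
    T, N = len(obs), A.shape[0]
    beta = np.zeros((T, N))
    beta[-1] = 1.0                                      # initialization: beta_T(i) = 1
    for t in range(T - 2, -1, -1):
        beta[t] = A @ (B[:, obs[t + 1]] * beta[t + 1])  # (4) induction
    return beta
```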
C. Baum-Welch Algorithm

The efficient Baum-Welch algorithm is usually applied for HMM training [10]. It is a modification of the expectation-maximization (EM) algorithm. Since the algorithm is iterative, one has to choose an initial approximation $\lambda_0$ of the model parameters. We define the additional values $\gamma$ and $\xi$ that are used for training:
$$\gamma_t(i) = P(q_t = s_i \mid O, \lambda) = \frac{\alpha_t(i) \beta_t(i)}{P(O \mid \lambda)}, \quad i = \overline{1, N}, \; t = \overline{1, T}, \quad (5)$$
$$\xi_t(i, j) = P(q_t = s_i, q_{t+1} = s_j \mid O, \lambda) = \frac{\alpha_t(i) a_{ij} b_j(o_{t+1}) \beta_{t+1}(j)}{P(O \mid \lambda)}, \quad i, j = \overline{1, N}, \; t = \overline{1, T-1}. \quad (6)$$

We note that the forward variables, the backward variables and the values of $\gamma$, $\xi$ are calculated for each of the $K$ training sequences; they are labelled with the corresponding index $k = 1, \dots, K$. The new estimate of the model parameters of a discrete HMM has the following coordinates [9]:
$$\hat{\pi}_i^* = \frac{1}{K} \sum_{k=1}^{K} \gamma_1^{(k)}(i),$$
$$\hat{a}_{ij}^* = \frac{\sum_{k=1}^{K} \sum_{t=1}^{T_k - 1} \xi_t^{(k)}(i, j)}{\sum_{k=1}^{K} \sum_{t=1}^{T_k - 1} \gamma_t^{(k)}(i)},$$
$$\hat{b}_i^*(m) = \frac{\sum_{k=1}^{K} \sum_{t=1,\; o_t^k = v_m}^{T_k} \gamma_t^{(k)}(i)}{\sum_{k=1}^{K} \sum_{t=1}^{T_k} \gamma_t^{(k)}(i)}, \quad i, j = \overline{1, N}, \; m = \overline{1, M}. \quad (7)$$

The iterative process described above is repeated until some standard stopping criterion is met.
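The sketch below illustrates one Baum-Welch re-estimation step over several complete training sequences, following equations (5)-(7). It reuses the forward and backward helpers assumed above; scaling and convergence checks are omitted, so it is a schematic rather than a production implementation.

```python
import numpy as np

def baum_welch_step(sequences, pi, A, B):
    """One re-estimation step of Baum-Welch over a list of complete sequences."""
    N, M = B.shape
    pi_num = np.zeros(N)
    a_num, a_den = np.zeros((N, N)), np.zeros(N)
    b_num, b_den = np.zeros((N, M)), np.zeros(N)

    for obs in sequences:
        alpha, p_obs = forward(obs, pi, A, B)
        beta = backward(obs, A, B)
        gamma = alpha * beta / p_obs                      # (5), shape (T, N)
        # (6): xi[t, i, j] = alpha[t, i] * a_ij * b_j(o_{t+1}) * beta[t+1, j] / p(O|lambda)
        xi = (alpha[:-1, :, None] * A[None, :, :] *
              (B[:, obs[1:]].T * beta[1:])[:, None, :]) / p_obs

        pi_num += gamma[0]                                # numerator of pi_i*
        a_num += xi.sum(axis=0)                           # numerator of a_ij*
        a_den += gamma[:-1].sum(axis=0)                   # denominator of a_ij*
        for m in range(M):                                # numerator of b_i(m)*
            b_num[:, m] += gamma[np.array(obs) == m].sum(axis=0)
        b_den += gamma.sum(axis=0)                        # denominator of b_i(m)*

    return (pi_num / len(sequences),
            a_num / a_den[:, None],
            b_num / b_den[:, None])
```

Iterating this step and monitoring the relative increase of the total log-likelihood would give the kind of stopping rule mentioned above.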
IV. TRAINING HMMS ON INCOMPLETE SEQUENCES

We consider that a sequence $O$ has missing observations if some randomly chosen observations are replaced with a gap symbol $\emptyset$.

A. Marginalization

Consider the most straightforward approach to handling sequences with missing observations in HMMs. One can notice that the calculation of $b_i(o_t)$, $i = \overline{1, N}$, $t = \overline{1, T}$ in formulas (1)-(4) and (6) cannot be performed directly if $o_t = \emptyset$, since we cannot choose a column of the symbol probability matrix $B$ that corresponds to the missing symbol. To use these formulas we must define the value of $b_i(\emptyset)$, $i = \overline{1, N}$, which is used for missing observations. Essentially, a missing observation can stand for any symbol $v_1, \dots, v_M$ from the alphabet $V$ of the original HMM. We expand the $b_i(\emptyset)$ component using its probabilistic definition:
$$b_i(\emptyset) = p(o_t = v_1 \vee o_t = v_2 \vee \dots \vee o_t = v_M \mid q_t = s_i) = \sum_{m=1}^{M} p(o_t = v_m \mid q_t = s_i) = 1, \quad i = \overline{1, N}.$$
Hence, if a gap occurs at time $t$, the components $b_i(o_t)$, $i = \overline{1, N}$ in (1)-(4) and (6) are replaced by ones. In addition, the estimation formula (7) for the components of the emission matrix is modified as follows:
$$\hat{b}_i^*(m) = \frac{\sum_{k=1}^{K} \sum_{t=1,\; o_t^k = v_m}^{T_k} \gamma_t^{(k)}(i)}{\sum_{k=1}^{K} \sum_{t=1,\; o_t^k \neq \emptyset}^{T_k} \gamma_t^{(k)}(i)}, \quad i = \overline{1, N}, \; m = \overline{1, M}.$$
This approach was described in [4], [5] only for the classification task, but as shown in this paper it can also be applied to HMM training.
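Under the same assumed conventions as the earlier listings, marginalization amounts to substituting 1 for the emission probability whenever the observation is a gap. A minimal way to express this is an emission lookup such as the following; the gap marker GAP and the function name are our own:

```python
import numpy as np

GAP = -1  # our own marker for the gap symbol in an observation sequence

def emission_column(B, obs_symbol):
    """Return the vector of b_i(o_t) over all states, as used in (1)-(4) and (6).

    For a gap the factor is 1 for every state, which is exactly the
    marginalization rule b_i(gap) = 1 derived above.
    """
    if obs_symbol == GAP:
        return np.ones(B.shape[0])
    return B[:, obs_symbol]

# In the re-estimation of b_i(m), the denominator is likewise restricted to
# non-gap positions, e.g.:
#   b_den += gamma[np.array(obs) != GAP].sum(axis=0)
```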
B. Imputation Using Viterbi Algorithm

The following approach is also mentioned in [4], [5], but only for the recognition task. It consists of the following steps (a sketch is given after the list):
1) A hidden Markov model is trained on the sequence with missing observations $O = \{o_t\}, \; t = \overline{1, T}$ using the marginalization approach.
2) The most probable sequence of hidden states $Q = \{q_t\}, \; t = \overline{1, T}$ is found using the Viterbi algorithm. The components $b_i(o_t)$, $i = \overline{1, N}$ are set to 1 for missing observations when calculating the steps of the Viterbi algorithm.
3) Gaps are imputed based on the hidden states found in the previous step: each gap is replaced by the most probable symbol that can be generated in the corresponding hidden state. Thus, a gap at time $t$ with corresponding hidden state $q_t = s_{i^*}$ is replaced with $o_t = \arg\max_{v \in V} b_{i^*}(v)$.
4) A hidden Markov model is trained on the imputed sequence $O^* = \{o_t^*\}, \; t = \overline{1, T}$ using conventional methods (e.g. the Baum-Welch algorithm, which was used in our research).
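A compact sketch of steps 2) and 3) is given below: a Viterbi pass that treats gaps as an emission factor of 1, followed by replacement of each gap with the most likely symbol of its decoded state. Steps 1) and 4) would reuse the marginalized and standard Baum-Welch training sketched earlier; the names and conventions (GAP, emission_column) are again our own.

```python
import numpy as np

def viterbi_impute(obs, pi, A, B, gap=GAP):
    """Impute gaps in obs with the most likely symbol of the decoded state."""
    T, N = len(obs), len(pi)
    delta = np.zeros((T, N))
    psi = np.zeros((T, N), dtype=int)
    delta[0] = pi * emission_column(B, obs[0])
    for t in range(1, T):
        trans = delta[t - 1][:, None] * A               # trans[i, j] = delta[t-1, i] * a_ij
        psi[t] = trans.argmax(axis=0)
        delta[t] = emission_column(B, obs[t]) * trans.max(axis=0)

    # Backtrack the most probable state sequence q_1..q_T.
    states = np.zeros(T, dtype=int)
    states[-1] = delta[-1].argmax()
    for t in range(T - 2, -1, -1):
        states[t] = psi[t + 1, states[t + 1]]

    # Replace each gap with argmax_v b_{q_t}(v).
    return [int(B[states[t]].argmax()) if o == gap else o for t, o in enumerate(obs)]
```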
C. Gluing

It is advisable to compare the previous two methods with some standard methods. One standard way to cope with missing data is to delete the missing observations from the sequence and glue the remaining parts together. Thus, for example, a sequence $O = \{\emptyset, o_2, \emptyset, o_4, \emptyset, o_6, o_7, \dots\}$ is converted to the sequence $O^* = \{o_2, o_4, o_6, o_7, \dots\}$. This method is essentially a variation of casewise deletion.
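For completeness, under the conventions of the earlier sketches gluing reduces to filtering out the gap positions:

```python
def glue(obs, gap=GAP):
    """Drop gap positions and concatenate the remaining observations."""
    return [o for o in obs if o != gap]
```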
D. Imputation with the Mode of Neighbors

We also compared the algorithms from sections A and B with a standard imputation method based on the mode of the $k$ neighboring observations. After this imputation is applied, some gaps may still remain (e.g. gaps whose $k$ neighbors are missing as well). In that case the imputation is applied again with the number of neighbors $k$ increased to match the length of the whole sequence $T$. In this study we consider the 10 nearest neighbors of each gap (5 neighbors to the left and 5 to the right).
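A sketch of this baseline, with our own helper names, might look as follows; the fallback pass that widens the window to the whole sequence mirrors the description above.

```python
from collections import Counter

def impute_mode(obs, k=5, gap=GAP):
    """Replace each gap with the mode of up to k observed neighbors on each side."""
    obs = list(obs)
    out = list(obs)
    for t, o in enumerate(obs):
        if o != gap:
            continue
        window = [x for x in obs[max(0, t - k):t + k + 1] if x != gap]
        if window:
            out[t] = Counter(window).most_common(1)[0][0]
    # Gaps whose whole window was missing are filled in a second pass that
    # widens the window to the entire sequence (if anything is observed at all).
    if any(o == gap for o in out) and any(o != gap for o in out):
        out = impute_mode(out, k=len(obs), gap=gap)
    return out
```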
V. EXPERIMENTAL RESULTS

To evaluate the marginalization method and imputation using the Viterbi algorithm, we studied their training performance on sequences with various numbers of gaps. The results were compared to the well-known approaches: the standard imputation method based on the mode (the most frequent value) of the nearest observations and the gluing method.

For the evaluation we used two discrete density HMMs with $N = 3$ hidden states and $M = 3$ observable symbols. The original HMMs $\lambda_1^*$ and $\lambda_2^*$ had initial state distribution vector $\pi^* = (1, 0, 0)$, transition probability matrix
$$A^* = \begin{pmatrix} 0.2 & 0.1 + dA & 0.7 - dA \\ 0.2 & 0.2 + dA & 0.6 - dA \\ 0.1 & 0.8 - dA & 0.1 + dA \end{pmatrix}$$
and observation probability matrix
$$B^* = \begin{pmatrix} 0.1 & 0.1 & 0.8 \\ 0.1 & 0.8 & 0.1 \\ 0.8 & 0.1 & 0.1 \end{pmatrix}.$$
The first HMM $\lambda_1^*$ had difference coefficient $dA = 0$ and the second HMM $\lambda_2^*$ had difference coefficient $dA = 0.1$; thus the two HMMs differed only in their transition probability matrices.

Two sets $O_T^K$ (one for each HMM) of $K = 50$ training sequences of length $T = 600$ were generated according to the HMM parameters. The sequences $O_g^K$ were constructed from $O_T^K$ by randomly removing $g$ observations from the latter. The two HMMs $\lambda_1$ and $\lambda_2$ were then trained on the $O_g^K$ sequences from the corresponding sets using the methods under study. The iterative process was terminated when the relative increase of the likelihood function became less than $10^{-5}$ or when 1000 iterations were reached. The norms of the difference between the true model parameters and the estimated ones, $\|A^* - A_g\|$, were calculated, as well as the logarithm of the probability of the full sequences $O_T^K$ given the trained model $\lambda_g$: $\ln p(O_T^K \mid \lambda_g)$.

The test sequences generated by the first and the second HMM were then classified using the maximum likelihood criterion. There were no missing observations in the test sequences. We used two test sets (one for each HMM) of $K_t = 100$ sequences of length $T_t = 600$. The percentage of correctly classified sequences was registered.

The results presented below are averages over 5 runs of the experiment on randomly generated training and test sequences. The results obtained by varying the number of gaps from 0 to 600 are presented in Fig. 1.
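The sketch below illustrates how training material of this kind could be produced: sequences are sampled from a discrete HMM and then $g$ randomly chosen observations are replaced with gaps. The sampling helper, the seed handling and the GAP marker from the earlier listings are our own conventions, not the authors' code.

```python
import numpy as np

rng = np.random.default_rng(0)  # illustrative seed

def sample_sequence(pi, A, B, T=600):
    """Sample one observation sequence of length T from a discrete HMM (pi, A, B)."""
    state = rng.choice(len(pi), p=pi)
    obs = []
    for _ in range(T):
        obs.append(int(rng.choice(B.shape[1], p=B[state])))
        state = rng.choice(A.shape[0], p=A[state])
    return obs

def remove_observations(obs, g, gap=GAP):
    """Replace g randomly chosen positions of obs with the gap symbol."""
    obs = list(obs)
    for t in rng.choice(len(obs), size=g, replace=False):
        obs[t] = gap
    return obs

# Usage sketch: 50 training sequences per model, then knock out g observations each.
# train_clean = [sample_sequence(pi, A, B) for _ in range(50)]
# train_gapped = [remove_observations(seq, g) for seq in train_clean]
```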
[Fig. 1. Training effectiveness of marginalization, gluing, imputation using the Viterbi algorithm (Viterbi) and imputation based on the mode of neighbors (Mode) versus the percentage of missing observations in the training sequences: a) likelihood function logarithm; b) norm of the difference between transition probability matrices; c) percentage of correctly classified sequences.]
As can be seen, the best results were achieved by the marginalization method. The method of imputation using the Viterbi algorithm shows similar results up to 70% of gaps, but then begins to fall behind marginalization. The gluing and mode-imputation methods show very poor results: the gluing method completely failed to classify correctly beyond 60% of missing observations, and imputation by the mode failed beyond 25% of gaps.
VI. DISCUSSION OF RESULTS

The studied methods demonstrated an advantage over the standard methods of coping with missing data: deletion of missing observations followed by gluing the remaining parts of the sequences together, and imputation with the mode of neighbors. With the proposed marginalization and Viterbi imputation methods, the ability to discriminate between the two HMMs used in the experiment remained very high up to 70% of gaps.
VII. CONCLUSION

The study has shown that the methods of marginalization and imputation using the Viterbi algorithm merit further investigation. In the future we plan to study methods of training continuous density hidden Markov models on sequences with missing observations. Another question of interest is the classification of incomplete sequences using HMMs that were trained on data with missing observations.
REFERENCES

[1] M. Gales, S. Young, "The Application of Hidden Markov Models in Speech Recognition," Foundations and Trends in Signal Processing, vol. 1, no. 3, pp. 195-304, 2007.
[2] T.A. Gultyaeva, A.A. Popov, "Izvlechenie nabliudenii v skrytykh markovskikh modeliakh dlia zadachi raspoznavaniia lits" ["Observation extraction for the task of face recognition using hidden Markov models"], Materialy VIII mezhdunar. nauch.-tekhnich. konf. "Aktual'nye problemy elektronnogo priborostroeniia" - APEP 2006 [Proceedings of the 8th International Scientific Conference "Actual Problems of Electronic Instrument Engineering" - APEIE 2006], Novosibirsk, vol. 6, pp. 22-27, 2006.
[3] Z. Ghahramani, M. Jordan, "Supervised learning from incomplete data via an EM approach," Advances in Neural Information Processing Systems, pp. 120-127, 1993.
[4] M. Cooke, P. Green, L. Josifovski, A. Vizinho, "Robust automatic speech recognition with missing and unreliable acoustic data," Speech Communication, vol. 34, no. 3, pp. 267-285, 2001.
[5] D. Lee, D. Kulic, Y. Nakamura, "Missing motion data recovery using factorial hidden Markov models," IEEE International Conference on Robotics and Automation, Pasadena, CA, pp. 1722-1728, 2008.
[6] T. Gultyaeva, A. Popov, V. Kokoreva, V. Uvarov, "Classification of observation sequences described by hidden Markov models," Applied Methods of Statistical Analysis. Nonparametric Approach. Proceedings of the International Workshop, pp. 136-144, 2015.
[7] T.A. Gultyaeva, A.A. Popov, V.E. Uvarov, "Ispol'zovanie gibridnykh vychislenii dlia optimizatsii protsessa raspoznavaniia posledovatel'nostei, opisyvaemykh skrytymi markovskimi modeliami" [Using hybrid computation to optimize the process of sequence recognition based on hidden Markov models], Sbornik nauchnykh trudov NGTU [Transactions of Novosibirsk State Technical University], no. 4 (82), pp. 42-55, 2015.
[8] A.A. Popov, T.A. Gultyaeva, V.E. Uvarov, "A comparison of some methods for training hidden Markov models on sequences with missing observations," 11th International Forum on Strategic Technology (IFOST '16), IEEE, 2016, to appear.
[9] L.R. Rabiner, "A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition," Proceedings of the IEEE, vol. 77, pp. 257-285, 1989.
[10] L.E. Baum, J.A. Eagon, "An inequality with applications to statistical estimation for probabilistic functions of a Markov process and to a model for ecology," Bull. Amer. Math. Soc., vol. 73, pp. 360-363, 1967.

Alexander A. Popov. Novosibirsk State Technical University, Department of Theoretical and Applied Informatics, [email protected], D.Sc.(Eng.), professor. Main areas of scientific interest: statistical methods of data analysis and experimental design. He is author and co-author of more than 150 papers, including 3 monographs.

Tatyana A. Gultyaeva. Novosibirsk State Technical University, Department of Theoretical and Applied Informatics, [email protected], PhD in engineering, associate professor. Main areas of scientific interest: structural and statistical methods of recognition. She is author and co-author of more than 150 papers, including 3 monographs.

Vadim E. Uvarov. Novosibirsk State Technical University, Department of Theoretical and Applied Informatics, [email protected], PhD student in engineering. Main areas of scientific interest: structural and statistical methods of recognition. He is author and co-author of more than 10 papers.