Training Hidden Markov Models on Incomplete Sequences

Alexander A. Popov, Tatyana A. Gultyaeva, Vadim E. Uvarov
Novosibirsk State Technical University, Novosibirsk, Russia

Abstract – This paper deals with algorithms for training hidden Markov models on sequences with missing observations. Two methods are studied: imputation of missing observations using the Viterbi algorithm and marginalization of missing observations. They are compared to two standard techniques for dealing with missing data: imputation of gaps with the mode (the most frequent value) of the nearest observations and gluing together the observed parts of the sequence. The studied methods proved more effective than the standard ones, with marginalization performing slightly better than imputation using the Viterbi algorithm.

Index terms – Hidden Markov models, machine learning, sequences, Baum-Welch algorithm, missing observations, incomplete data.

This research has been supported by the Russian Ministry of Education and Science (project 2.541.2014K).
I. INTRODUCTION

Hidden Markov models (HMMs) are widely used in machine learning. They have been successfully applied in many areas, including speech recognition [1] and image recognition [2], where an object or process can be represented by an observable sequence of features produced according to some hidden Markov process. Despite the popularity of this approach, no unified and effective method has yet been developed to cope with missing data. In this paper we consider the problem of missing observations in training sequences when training HMMs.

Let us review the techniques for dealing with missing observations that have been used so far. Data not organized in sequences are processed, for example, with the Expectation-Maximization algorithm [3]. Some standard techniques are also used, such as casewise deletion of missing data, imputation with the mean (or mode) of the nearest observations, interpolation, etc. Approaches for dealing with missing data in HMMs were also presented in a number of papers [4], [5]; however, they were applied only to the task of classifying incomplete sequences with HMMs trained on clean data. This paper continues research carried out at the Department of Theoretical and Applied Informatics of Novosibirsk State Technical University [6]-[8].
II. PROBLEM DEFINITION

The aim of this work is to study a number of approaches to training hidden Markov models on sequences that contain missing observations.
III. THEORY

A. Hidden Markov Model

A hidden Markov model describes a random process that at each time $t = 1, \dots, T$ is in one of $N$ hidden states $s_1, \dots, s_N$ and at each following time either remains in the previous state or moves to another hidden state according to some transition probabilities. The hidden states are characterized by features that appear in the observation sequence $O = \{o_t\}, \; t = \overline{1, T}$.

We now define the parameters that completely describe an HMM with discrete observations. The hidden state of the HMM at time $t$ is denoted $q_t$ and the observation generated at time $t$ is denoted $o_t$. The discrete HMM is characterized by the initial distribution $\pi_i = p(q_1 = s_i), \; i = \overline{1, N}$, the transition probability matrix $A = \{a_{ij}\} = \{p(q_{t+1} = s_j \mid q_t = s_i)\}, \; i, j = \overline{1, N}$, the discrete set of symbols $V = \{v_1, \dots, v_M\}$, and the observation probability matrix $B = \{b_i(m)\} = \{p(o_t = v_m \mid q_t = s_i)\}, \; i = \overline{1, N}, \; m = \overline{1, M}$ [9].
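To make the notation concrete, the sketch below shows how the parameters of such a discrete HMM could be held in code. It is only an illustration of the definitions above; the variable names and the small numeric values are our own, not values prescribed by the paper.

```python
import numpy as np

# A discrete HMM with N hidden states and M observable symbols is fully
# described by (pi, A, B); the values below are illustrative only.
N, M = 3, 3
pi = np.array([1.0, 0.0, 0.0])        # initial distribution, pi[i] = p(q_1 = s_i)
A = np.array([[0.2, 0.2, 0.6],        # transition matrix, A[i, j] = p(q_{t+1} = s_j | q_t = s_i)
              [0.2, 0.3, 0.5],
              [0.1, 0.7, 0.2]])
B = np.array([[0.1, 0.1, 0.8],        # emission matrix, B[i, m] = p(o_t = v_m | q_t = s_i)
              [0.1, 0.8, 0.1],
              [0.8, 0.1, 0.1]])

# Each row of A and B must be a probability distribution.
assert np.allclose(A.sum(axis=1), 1.0) and np.allclose(B.sum(axis=1), 1.0)
```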
B. Forward-Backward Algorithm

Classification of sequences with HMMs is usually based on the probability of the observation sequence given the model. This value is calculated for the observation sequence under each competing HMM, and the sequence is assigned to the class whose HMM yields the highest probability. To calculate the probability of the sequence $O$ given the model $\lambda$, the forward-backward algorithm is usually applied [9]. The first part of the algorithm calculates the forward variables
$$\alpha_t(i) = P(o_1, o_2, \dots, o_t, q_t = s_i \mid \lambda), \quad t = \overline{1, T}, \; i = \overline{1, N}.$$
The forward variables are computed as follows:
1) initialization:
$$\alpha_1(i) = \pi_i b_i(o_1), \quad i = \overline{1, N}; \quad (1)$$
2) induction:
$$\alpha_{t+1}(i) = b_i(o_{t+1}) \sum_{j=1}^{N} \alpha_t(j) a_{ji}, \quad i = \overline{1, N}, \; t = \overline{1, T-1}; \quad (2)$$
3) termination:
$$p(O \mid \lambda) = \sum_{i=1}^{N} \alpha_T(i). \quad (3)$$
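As an illustration of equations (1)-(3), a minimal forward pass could look like the sketch below. The function name, the use of 0-based symbol indices, and the (pi, A, B) arrays from the previous listing are our own conventions, not code from the paper.

```python
import numpy as np

def forward(obs, pi, A, B):
    """Compute forward variables alpha[t, i] and the likelihood p(O | lambda).

    obs is a sequence of symbol indices o_1..o_T (0-based here).
    In practice, scaling or log-space arithmetic is needed for long sequences.
    """
    T, N = len(obs), len(pi)
    alpha = np.zeros((T, N))
    alpha[0] = pi * B[:, obs[0]]                      # (1) initialization
    for t in range(1, T):
        alpha[t] = B[:, obs[t]] * (alpha[t - 1] @ A)  # (2) induction
    return alpha, alpha[-1].sum()                     # (3) termination: p(O | lambda)
```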
The second part of the forward-backward algorithm calculates the backward variables
$$\beta_t(i) = P(o_{t+1}, o_{t+2}, \dots, o_T \mid q_t = s_i, \lambda), \quad t = \overline{1, T}, \; i = \overline{1, N}.$$
The backward variables are computed as follows:
1) initialization:
$$\beta_T(i) = 1, \quad i = \overline{1, N};$$
2) induction:
$$\beta_t(i) = \sum_{j=1}^{N} \beta_{t+1}(j) b_j(o_{t+1}) a_{ij}, \quad i = \overline{1, N}, \; t = \overline{1, T-1}. \quad (4)$$
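A matching sketch for the backward recursion in equation (4), under the same assumed conventions as the forward pass above:

```python
import numpy as np

def backward(obs, A, B):
    """Compute backward variables beta[t, i] for a discrete HMM."""
    T, N = len(obs), A.shape[0]
    beta = np.zeros((T, N))
    beta[-1] = 1.0                                      # initialization: beta_T(i) = 1
    for t in range(T - 2, -1, -1):
        beta[t] = A @ (B[:, obs[t + 1]] * beta[t + 1])  # (4) induction
    return beta
```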
C. Baum-Welch Algorithm

The efficient Baum-Welch algorithm is usually applied for HMM training [10]. It is a modification of the expectation-maximization (EM) algorithm. Since the algorithm is iterative, one has to choose an initial approximation $\lambda_0$ of the model parameters. We define the additional values $\gamma$ and $\xi$ that are used for training:
$$\gamma_t(i) = P(q_t = s_i \mid O, \lambda) = \frac{\alpha_t(i) \beta_t(i)}{P(O \mid \lambda)}, \quad i = \overline{1, N}, \; t = \overline{1, T}, \quad (5)$$
$$\xi_t(i, j) = P(q_t = s_i, q_{t+1} = s_j \mid O, \lambda) = \frac{\alpha_t(i) a_{ij} b_j(o_{t+1}) \beta_{t+1}(j)}{P(O \mid \lambda)}, \quad i, j = \overline{1, N}, \; t = \overline{1, T-1}. \quad (6)$$

We note that the forward variables, the backward variables and the values of $\gamma$, $\xi$ are calculated for each of the $K$ training sequences; they are labelled with the corresponding index $k = 1, \dots, K$. The new estimate of the model parameters of a discrete HMM has the following coordinates [9]:
$$\hat{\pi}_i^* = \frac{1}{K} \sum_{k=1}^{K} \gamma_1^{(k)}(i),$$
$$\hat{a}_{ij}^* = \frac{\sum_{k=1}^{K} \sum_{t=1}^{T_k - 1} \xi_t^{(k)}(i, j)}{\sum_{k=1}^{K} \sum_{t=1}^{T_k - 1} \gamma_t^{(k)}(i)},$$
$$\hat{b}_i^*(m) = \frac{\sum_{k=1}^{K} \sum_{t=1,\; o_t^k = v_m}^{T_k} \gamma_t^{(k)}(i)}{\sum_{k=1}^{K} \sum_{t=1}^{T_k} \gamma_t^{(k)}(i)}, \quad i, j = \overline{1, N}, \; m = \overline{1, M}. \quad (7)$$

The iterative process described above is repeated until some standard stopping criterion is met.
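The sketch below illustrates one Baum-Welch re-estimation step over several complete training sequences, following equations (5)-(7). It reuses the forward and backward helpers assumed above; scaling and convergence checks are omitted, so it is a schematic rather than a production implementation.

```python
import numpy as np

def baum_welch_step(sequences, pi, A, B):
    """One re-estimation step of Baum-Welch over a list of complete sequences."""
    N, M = B.shape
    pi_num = np.zeros(N)
    a_num, a_den = np.zeros((N, N)), np.zeros(N)
    b_num, b_den = np.zeros((N, M)), np.zeros(N)

    for obs in sequences:
        alpha, p_obs = forward(obs, pi, A, B)
        beta = backward(obs, A, B)
        gamma = alpha * beta / p_obs                      # (5), shape (T, N)
        # (6): xi[t, i, j] = alpha[t, i] * a_ij * b_j(o_{t+1}) * beta[t+1, j] / p(O|lambda)
        xi = (alpha[:-1, :, None] * A[None, :, :] *
              (B[:, obs[1:]].T * beta[1:])[:, None, :]) / p_obs

        pi_num += gamma[0]                                # numerator of pi_i*
        a_num += xi.sum(axis=0)                           # numerator of a_ij*
        a_den += gamma[:-1].sum(axis=0)                   # denominator of a_ij*
        for m in range(M):                                # numerator of b_i(m)*
            b_num[:, m] += gamma[np.array(obs) == m].sum(axis=0)
        b_den += gamma.sum(axis=0)                        # denominator of b_i(m)*

    return (pi_num / len(sequences),
            a_num / a_den[:, None],
            b_num / b_den[:, None])
```

Iterating this step and monitoring the relative increase of the total log-likelihood would give the kind of stopping rule mentioned above.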
IV. TRAINING HMMS ON INCOMPLETE SEQUENCES

We consider that a sequence $O$ has missing observations if some randomly chosen observations are replaced with a gap symbol $\emptyset$.

A. Marginalization

Consider the most straightforward approach to handling sequences with missing observations in HMMs. One can notice that the calculation of $b_i(o_t)$, $i = \overline{1, N}$, $t = \overline{1, T}$ in formulas (1)-(4) and (6) cannot be performed directly if $o_t = \emptyset$, since we cannot choose a column of the symbol probability matrix $B$ that corresponds to the missing symbol. To use these formulas we must define the value of $b_i(\emptyset)$, $i = \overline{1, N}$, which is used for missing observations. Essentially, a missing observation can stand for any symbol $v_1, \dots, v_M$ from the alphabet $V$ of the original HMM. We expand the $b_i(\emptyset)$ component using its probabilistic definition:
$$b_i(\emptyset) = p(o_t = v_1 \vee o_t = v_2 \vee \dots \vee o_t = v_M \mid q_t = s_i) = \sum_{m=1}^{M} p(o_t = v_m \mid q_t = s_i) = 1, \quad i = \overline{1, N}.$$
Hence, if a gap occurs at time $t$, the components $b_i(o_t)$, $i = \overline{1, N}$ in (1)-(4) and (6) are replaced by ones. In addition, the estimation formula (7) for the components of the emission matrix is modified as follows:
$$\hat{b}_i^*(m) = \frac{\sum_{k=1}^{K} \sum_{t=1,\; o_t^k = v_m}^{T_k} \gamma_t^{(k)}(i)}{\sum_{k=1}^{K} \sum_{t=1,\; o_t^k \neq \emptyset}^{T_k} \gamma_t^{(k)}(i)}, \quad i = \overline{1, N}, \; m = \overline{1, M}.$$
This approach was described in [4], [5] only for the classification task, but as shown in this paper it can also be applied to HMM training.
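Under the same assumed conventions as the earlier listings, marginalization amounts to substituting 1 for the emission probability whenever the observation is a gap. A minimal way to express this is an emission lookup such as the following; the gap marker GAP and the function name are our own:

```python
import numpy as np

GAP = -1  # our own marker for the gap symbol in an observation sequence

def emission_column(B, obs_symbol):
    """Return the vector of b_i(o_t) over all states, as used in (1)-(4) and (6).

    For a gap the factor is 1 for every state, which is exactly the
    marginalization rule b_i(gap) = 1 derived above.
    """
    if obs_symbol == GAP:
        return np.ones(B.shape[0])
    return B[:, obs_symbol]

# In the re-estimation of b_i(m), the denominator is likewise restricted to
# non-gap positions, e.g.:
#   b_den += gamma[np.array(obs) != GAP].sum(axis=0)
```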
B. Imputation Using Viterbi Algorithm

The following approach is also mentioned in [4], [5], but only for the recognition task. It consists of the following steps (a sketch is given after the list):
1) A hidden Markov model is trained on the sequence with missing observations $O = \{o_t\}, \; t = \overline{1, T}$ using the marginalization approach.
2) The most probable sequence of hidden states $Q = \{q_t\}, \; t = \overline{1, T}$ is found using the Viterbi algorithm. The components $b_i(o_t)$, $i = \overline{1, N}$ are set to 1 for missing observations when calculating the steps of the Viterbi algorithm.
3) Gaps are imputed based on the hidden states found in the previous step: each gap is replaced by the most probable symbol that can be generated in the corresponding hidden state. Thus, a gap at time $t$ with corresponding hidden state $q_t = s_{i^*}$ is replaced with $o_t = \arg\max_{v \in V} b_{i^*}(v)$.
4) A hidden Markov model is trained on the imputed sequence $O^* = \{o_t^*\}, \; t = \overline{1, T}$ using conventional methods (e.g. the Baum-Welch algorithm, which was used in our research).
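A compact sketch of steps 2) and 3) is given below: a Viterbi pass that treats gaps as an emission factor of 1, followed by replacement of each gap with the most likely symbol of its decoded state. Steps 1) and 4) would reuse the marginalized and standard Baum-Welch training sketched earlier; the names and conventions (GAP, emission_column) are again our own.

```python
import numpy as np

def viterbi_impute(obs, pi, A, B, gap=GAP):
    """Impute gaps in obs with the most likely symbol of the decoded state."""
    T, N = len(obs), len(pi)
    delta = np.zeros((T, N))
    psi = np.zeros((T, N), dtype=int)
    delta[0] = pi * emission_column(B, obs[0])
    for t in range(1, T):
        trans = delta[t - 1][:, None] * A               # trans[i, j] = delta[t-1, i] * a_ij
        psi[t] = trans.argmax(axis=0)
        delta[t] = emission_column(B, obs[t]) * trans.max(axis=0)

    # Backtrack the most probable state sequence q_1..q_T.
    states = np.zeros(T, dtype=int)
    states[-1] = delta[-1].argmax()
    for t in range(T - 2, -1, -1):
        states[t] = psi[t + 1, states[t + 1]]

    # Replace each gap with argmax_v b_{q_t}(v).
    return [int(B[states[t]].argmax()) if o == gap else o for t, o in enumerate(obs)]
```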
C. Gluing

It is advisable to compare the previous two methods with some standard methods. One standard way to cope with missing data is to delete the missing observations from the sequence and glue the remaining parts together. Thus, for example, a sequence $O = \{\emptyset, o_2, \emptyset, o_4, \emptyset, o_6, o_7, \dots\}$ is converted to the sequence $O^* = \{o_2, o_4, o_6, o_7, \dots\}$. This method is essentially a variation of casewise deletion.
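For completeness, under the conventions of the earlier sketches gluing reduces to filtering out the gap positions:

```python
def glue(obs, gap=GAP):
    """Drop gap positions and concatenate the remaining observations."""
    return [o for o in obs if o != gap]
```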
D. Imputation with the Mode of Neighbors

We also compared the algorithms from sections A and B with a standard imputation method based on the mode of the $k$ neighboring observations. After this imputation is applied, some gaps may still remain (e.g. gaps whose $k$ neighbors are missing as well). In that case the imputation is applied again with the number of neighbors $k$ increased to match the length of the whole sequence $T$. In this study we consider the 10 nearest neighbors of each gap (5 neighbors to the left and 5 to the right).
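A sketch of this baseline, with our own helper names, might look as follows; the fallback pass that widens the window to the whole sequence mirrors the description above.

```python
from collections import Counter

def impute_mode(obs, k=5, gap=GAP):
    """Replace each gap with the mode of up to k observed neighbors on each side."""
    obs = list(obs)
    out = list(obs)
    for t, o in enumerate(obs):
        if o != gap:
            continue
        window = [x for x in obs[max(0, t - k):t + k + 1] if x != gap]
        if window:
            out[t] = Counter(window).most_common(1)[0][0]
    # Gaps whose whole window was missing are filled in a second pass that
    # widens the window to the entire sequence (if anything is observed at all).
    if any(o == gap for o in out) and any(o != gap for o in out):
        out = impute_mode(out, k=len(obs), gap=gap)
    return out
```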
V. EXPERIMENTAL RESULTS

To evaluate the marginalization method and imputation using the Viterbi algorithm, we studied their training performance on sequences with various numbers of gaps. The results were compared to the well-known approaches: the standard imputation method based on the mode (the most frequent value) of the nearest observations and the gluing method.

For the evaluation we used two discrete density HMMs with $N = 3$ hidden states and $M = 3$ observable symbols. The original HMMs $\lambda_1^*$ and $\lambda_2^*$ had initial state distribution vector $\pi^* = (1, 0, 0)$, transition probability matrix
$$A^* = \begin{pmatrix} 0.2 & 0.1 + dA & 0.7 - dA \\ 0.2 & 0.2 + dA & 0.6 - dA \\ 0.1 & 0.8 - dA & 0.1 + dA \end{pmatrix}$$
and observation probability matrix
$$B^* = \begin{pmatrix} 0.1 & 0.1 & 0.8 \\ 0.1 & 0.8 & 0.1 \\ 0.8 & 0.1 & 0.1 \end{pmatrix}.$$
The first HMM $\lambda_1^*$ had difference coefficient $dA = 0$ and the second HMM $\lambda_2^*$ had difference coefficient $dA = 0.1$; thus the two HMMs differed only in their transition probability matrices.

Two sets $O_T^K$ (one for each HMM) of $K = 50$ training sequences of length $T = 600$ were generated according to the HMM parameters. The sequences $O_g^K$ were constructed from $O_T^K$ by randomly removing $g$ observations from the latter. The two HMMs $\lambda_1$ and $\lambda_2$ were then trained on the $O_g^K$ sequences from the corresponding sets using the methods under study. The iterative process was terminated when the relative increase of the likelihood function became less than $10^{-5}$ or when 1000 iterations were reached. The norms of the difference between the true model parameters and the estimated ones, $\|A^* - A_g\|$, were calculated, as well as the logarithm of the probability of the full sequences $O_T^K$ given the trained model $\lambda_g$: $\ln p(O_T^K \mid \lambda_g)$.

The test sequences generated by the first and the second HMM were then classified using the maximum likelihood criterion. There were no missing observations in the test sequences. We used two test sets (one for each HMM) of $K_t = 100$ sequences of length $T_t = 600$. The percentage of correctly classified sequences was registered.

The results presented below are averages over 5 runs of the experiment on randomly generated training and test sequences. The results obtained by varying the number of gaps from 0 to 600 are presented in Fig. 1.
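The sketch below illustrates how training material of this kind could be produced: sequences are sampled from a discrete HMM and then $g$ randomly chosen observations are replaced with gaps. The sampling helper, the seed handling and the GAP marker from the earlier listings are our own conventions, not the authors' code.

```python
import numpy as np

rng = np.random.default_rng(0)  # illustrative seed

def sample_sequence(pi, A, B, T=600):
    """Sample one observation sequence of length T from a discrete HMM (pi, A, B)."""
    state = rng.choice(len(pi), p=pi)
    obs = []
    for _ in range(T):
        obs.append(int(rng.choice(B.shape[1], p=B[state])))
        state = rng.choice(A.shape[0], p=A[state])
    return obs

def remove_observations(obs, g, gap=GAP):
    """Replace g randomly chosen positions of obs with the gap symbol."""
    obs = list(obs)
    for t in rng.choice(len(obs), size=g, replace=False):
        obs[t] = gap
    return obs

# Usage sketch: 50 training sequences per model, then knock out g observations each.
# train_clean = [sample_sequence(pi, A, B) for _ in range(50)]
# train_gapped = [remove_observations(seq, g) for seq in train_clean]
```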
[Fig. 1. Training effectiveness of marginalization, gluing, imputation using the Viterbi algorithm (Viterbi) and imputation based on the mode of neighbors (Mode) versus the percentage of missing observations in the training sequences: a) likelihood function logarithm; b) norm of the difference between transition probability matrices; c) percentage of correctly classified sequences.]
As can be seen, the best results were achieved by the marginalization method. The method of imputation using the Viterbi algorithm shows similar results up to 70% of gaps, but then begins to fall behind marginalization. The gluing and mode-imputation methods show very poor results: the gluing method completely failed to classify correctly beyond 60% of missing observations, and imputation by the mode failed beyond 25% of gaps.
VI. DISCUSSION OF RESULTS

The studied methods demonstrated an advantage over the standard methods of coping with missing data: deletion of missing observations followed by gluing the remaining parts of the sequences together, and imputation with the mode of neighbors. With the proposed marginalization and Viterbi imputation methods, the ability to discriminate between the two HMMs used in the experiment remained very high up to 70% of gaps.
VII. CONCLUSION

The study has shown that the methods of marginalization and imputation using the Viterbi algorithm merit further investigation. In the future we plan to study methods of training continuous density hidden Markov models on sequences with missing observations. Another question of interest is the classification of incomplete sequences using HMMs that were trained on data with missing observations.
REFERENCES

[1] M. Gales, S. Young, "The Application of Hidden Markov Models in Speech Recognition," Foundations and Trends in Signal Processing, vol. 1, no. 3, pp. 195-304, 2007.
[2] T.A. Gultyaeva, A.A. Popov, "Izvlechenie nabliudenii v skrytykh markovskikh modeliakh dlia zadachi raspoznavaniia lits" ["Observation extraction for the task of face recognition using hidden Markov models"], Materialy VIII mezhdunar. nauch.-tekhnich. konf. "Aktual'nye problemy elektronnogo priborostroeniia" - APEP 2006 [Proceedings of the 8th International Scientific Conference "Actual Problems of Electronic Instrument Engineering" - APEIE 2006], Novosibirsk, vol. 6, pp. 22-27, 2006.
[3] Z. Ghahramani, M. Jordan, "Supervised learning from incomplete data via an EM approach," Advances in Neural Information Processing Systems, pp. 120-127, 1993.
[4] M. Cooke, P. Green, L. Josifovski, A. Vizinho, "Robust automatic speech recognition with missing and unreliable acoustic data," Speech Communication, vol. 34, no. 3, pp. 267-285, 2001.
[5] D. Lee, D. Kulic, Y. Nakamura, "Missing motion data recovery using factorial hidden Markov models," IEEE International Conference on Robotics and Automation, Pasadena, CA, pp. 1722-1728, 2008.
[6] T. Gultyaeva, A. Popov, V. Kokoreva, V. Uvarov, "Classification of observation sequences described by hidden Markov models," Applied Methods of Statistical Analysis. Nonparametric Approach. Proceedings of the International Workshop, pp. 136-144, 2015.
[7] T.A. Gultyaeva, A.A. Popov, V.E. Uvarov, "Ispol'zovanie gibridnykh vychislenii dlia optimizatsii protsessa raspoznavaniia posledovatel'nostei, opisyvaemykh skrytymi markovskimi modeliami" [Using hybrid computation to optimize the process of sequence recognition based on hidden Markov models], Sbornik nauchnykh trudov NGTU [Transactions of Novosibirsk State Technical University], no. 4 (82), pp. 42-55, 2015.
[8] A.A. Popov, T.A. Gultyaeva, V.E. Uvarov, "A comparison of some methods for training hidden Markov models on sequences with missing observations," 11th International Forum on Strategic Technology (IFOST '16), IEEE, 2016, to appear.
[9] L.R. Rabiner, "A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition," Proceedings of the IEEE, vol. 77, pp. 257-285, 1989.
[10] L.E. Baum, J.A. Eagon, "An inequality with applications to statistical estimation for probabilistic functions of a Markov process and to a model for ecology," Bull. Amer. Math. Soc., vol. 73, pp. 360-363, 1967.

Alexander A. Popov. Novosibirsk State Technical University, Department of Theoretical and Applied Informatics, [email protected], D.Sc.(Eng.), professor. Main areas of scientific interest: statistical methods of data analysis and experimental design. He is author and co-author of more than 150 papers, including 3 monographs.

Tatyana A. Gultyaeva. Novosibirsk State Technical University, Department of Theoretical and Applied Informatics, [email protected], PhD in engineering, associate professor. Main areas of scientific interest: structural and statistical methods of recognition. She is author and co-author of more than 150 papers, including 3 monographs.

Vadim E. Uvarov. Novosibirsk State Technical University, Department of Theoretical and Applied Informatics, [email protected], PhD student in engineering. Main areas of scientific interest: structural and statistical methods of recognition. He is author and co-author of more than 10 papers.