Applying the Hidden Markov Model Methodology for Unsupervised Learning of Temporal Data

Cen Li
Department of Computer Science, Middle Tennessee State University, Box 48, Murfreesboro, TN 37132
email: [email protected]

Gautam Biswas
Department of Electrical Engineering and Computer Science, Vanderbilt University, Box 1824 Station B, Nashville, TN 37235
email: biswas@vuse.vanderbilt.edu

Abstract The Hidden Markov Modeling (HMM) methodology has been successfully employed in the modeling and analysis of sequence data, for example, continuous speech recognition in the temporal domain and analysis of protein structures in the sequencing domain. However, these domains assume that the model structure has been provided by human experts. In a number of other domains, such as physiology, economics, and the social sciences, model structures of dynamic phenomena are not readily available. In this context, this paper develops a clustering methodology that automatically partitions temporal data into homogeneous groups and constructs an HMM for each group when the model structure is unknown. Our methodology makes two new contributions to data clustering using HMMs. First, an explicit HMM model size selection procedure is incorporated into the clustering process, i.e., the sizes of the individual HMMs are dynamically determined for each cluster. This improves the interpretability of the cluster models and the quality of the final clustering partition. Second, a partition selection method is developed to ensure an objective, data-driven selection of the number of clusters in the partition. The result is a heuristic sequential search control algorithm that is computationally feasible. Experiments with artificially generated data show that the HMM model size selection algorithm is effective in re-discovering the structure of the generating HMMs, and that HMM clustering with model size selection significantly outperforms HMM clustering with uniform HMM model sizes in re-discovering the clustering partition structure. Experiments with real world ecological data show interesting results.

Introduction Unsupervised classification, or clustering, derives structure from data by using objective criteria to partition the data into homogeneous groups, so that the within-group object similarity and the between-group object dissimilarity are optimized simultaneously (Jain & Dubes 1988). Categorization and interpretation of structure are achieved by analyzing the models constructed in terms of the feature value distributions within each group. In the past, cluster analysis techniques have focused on data described by static features, i.e., features whose values do not change, or change negligibly, over time. In many real world applications that cover diverse domains, such as engineering systems, human physiology, and economic and social systems, the dynamic characteristics, i.e., how a system interacts with the environment and evolves over time, are of interest. The dynamic characteristics of these systems are best described by temporal features whose values change significantly during the observation period. Clustering temporal data is inherently more complex than clustering static data because: (i) the dimensionality of the data is significantly larger, and (ii) the model structures that describe the individual clusters are more complex and harder to analyze and interpret.

We assume that the temporal data sequences that define the dynamic characteristics of the phenomenon under study satisfy the Markov property, and that the data generation may be viewed as a probabilistic walk through a fixed set of states. When state definitions directly correspond to feature values, a Markov chain model representation of the data may be appropriate (Sebastiani et al. 1999). When the state definitions are not directly observable, or it is not feasible to define states by exhaustive enumeration of feature values (e.g., when the data is described by multiple continuous valued temporal features), a Hidden Markov model (HMM) representation is more appropriate. In this paper, we apply the HMM methodology to unsupervised learning of temporal data. Consider the example of continuous speech recognition. Techniques have been developed (Rabiner & Juang 1993) for analyzing the speech spectrum and extracting temporal features from a continuous speech signal.

HMMs (usually 2-10 states per HMM) define feature sequences that correspond to sub-word units. Lexical and semantic rules are then employed to recognize full sentences. A large body of background knowledge, in the form of phonemes and syllables for isolated word recognition, phone-like units (PLUs) for sub-word recognition, and syllable and acoustic units for sentence-level recognition, has been developed through years of rigorous analysis. Therefore, the HMM structures used for recognition are well-defined, and the speech recognition task is reduced to learning model parameters for efficient and robust recognition. On the other hand, when one considers problems such as the therapeutic effects of a new medical procedure or the analysis of stock market data, not much is known about models that can be employed to structure, analyze, and interpret the available data. To address these problems, we need to develop methodologies that enable us to derive dynamic models from temporal data. Our goal is to develop, through HMM models derived from the available data, an accurate and explainable representation of system dynamics in a given domain. It is important for our clustering system to determine the best partition, i.e., the number of clusters in the data, and the best model structure, i.e., the number of states in each cluster model. The first step allows us to break up the data into homogeneous groups, and the second step provides an accurate state-based representation of the dynamic phenomena corresponding to each group. We approach these tasks by (i) developing an explicit HMM model size selection procedure that dynamically modifies the size of the HMMs during the clustering process, and (ii) casting the HMM model size selection and partition selection problems in the Bayesian model selection framework. We illustrate the clustering and model interpretation process on an ecological data set in the last section of this paper.

Modeling with HMMs An HMM is a non-deterministic, stochastic finite state automaton model. The basic structure of an HMM consists of a connected set of states, S = (S_1, S_2, ..., S_n). We use first order HMMs, where the state of a system at a particular time t depends only on the state of the system at the immediately previous time point, i.e., P(S_t | S_{t-1}, S_{t-2}, ..., S_1) = P(S_t | S_{t-1}). In addition, we assume that all the temporal feature values of the dynamic phenomena under study are continuous; therefore, our HMM models use the continuous density HMM (CDHMM) representation. A CDHMM with n states for data having m temporal features can be characterized in terms of three sets of probabilities (Rabiner 1989): (i) the initial state probabilities, (ii) the transition probabilities, and (iii) the emission probabilities. (We assume the continuous features are sampled at a pre-defined rate, and the temporal feature values are defined as a sequence of values.) The initial state probabilities, a vector π of size n, define the probability of each state being the initial state of a given sequence. The transition probability matrix, A, of size n × n, defines the probability of transitioning from state i at time t to state j at the next time step. The emission probability matrix, B, of size n × m, defines the probability of generating feature values in any given state. For the CDHMM, the emission probability density function of each state is defined by a multivariate Gaussian distribution.

There are a number of advantages to using the HMM representation for data driven profiling of system dynamics. First, the hidden states of an HMM can be used to effectively model the set of potentially valid states in a dynamic process. While the complete set of states and the exact sequence of states a system goes through may not be observable, they can be estimated by studying the observed behaviors of the system. Second, HMMs represent a well-defined probabilistic model, whose parameters can be determined in a well-defined manner. Furthermore, HMMs provide a graphical, state based representation of the underlying dynamic processes that govern system behavior. Graphical models may aid the cluster analysis and model interpretation tasks.
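As a concrete illustration of these three parameter sets, the following minimal sketch (our own, in Python with numpy; the variable names and values are hypothetical, not from the paper) lays out π, A, and the Gaussian emission parameters for a CDHMM with n states and m continuous features.

```python
import numpy as np

n, m = 3, 2  # hypothetical: 3 hidden states, 2 continuous temporal features

# Initial state probabilities: pi[i] = P(first state is S_i); entries sum to 1.
pi = np.full(n, 1.0 / n)

# Transition probabilities: A[i, j] = P(S_j at t+1 | S_i at t); each row sums to 1.
A = np.full((n, n), 1.0 / n)

# Emission model: one multivariate Gaussian per state over the m features.
means = np.zeros((n, m))              # state-conditional feature means
covars = np.stack([np.eye(m)] * n)    # state-conditional covariance matrices

def emission_density(x, state):
    """Gaussian density of an m-dimensional observation x in a given state."""
    diff = x - means[state]
    cov = covars[state]
    norm = np.sqrt(((2 * np.pi) ** m) * np.linalg.det(cov))
    return np.exp(-0.5 * diff @ np.linalg.solve(cov, diff)) / norm
```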

The HMM Clustering Problem Earlier work on HMM clustering (Rabiner et al. 1989; Dermatas & Kokkinakis 1996; Kosaka, Masunaga, & Kuraoka 1995; Smyth 1997) does not use an objective criterion measure to automatically select the cluster partition structure from the data. Pre-determined threshold values on data likelihood, or post-clustering Monte-Carlo simulations, have been used instead. In addition, a pre-specified and uniform HMM model size has been used for all intermediate and final cluster definitions. Our clustering methodology does not require the model size for individual clusters to be pre-defined; a key objective of our system is to derive this information directly from the available data during the clustering process. There is a separate line of work that derives the HMM model structure from homogeneous data (Stolcke & Omohundro 1994; Ostendorf & Singer 1997). The task there is to derive a model structure, not a partition structure, i.e., no clustering is involved. These methods cannot be employed in our HMM clustering task because they do not apply to continuous valued temporal data, and they do not have an objective way of comparing the quality of models of different sizes. To overcome these limitations, we have adopted a Bayesian decision making scheme that enables the algorithm to derive the best partition structure for the data during the clustering process. Our HMM clustering system, which we call Matryoshka, can be described in terms of four nested search loops that:

• loop 1: derive the number of clusters in a partition;

• loop 2: determine the assignment of objects to clusters for a given partition;

• loop 3: assign HMM model sizes for the individual clusters in the partition; and

• loop 4: derive the HMM parameter configuration for the individual clusters.

For any chosen size (i.e., number of states) of the HMM, the innermost search loop, loop 4, is invoked to select model parameters that optimize a chosen criterion. We employ a well known Maximum Likelihood (ML) parameter estimation method, the Baum-Welch procedure (Rabiner 1989), for this task. The Baum-Welch procedure is a variation of the more general EM algorithm (Dempster, Laird, & Rubin 1977). It iterates between an expectation step (E-step) and a maximization step (M-step). The E-step assumes the current parameter configuration of the model and computes the expected values of the necessary statistics. The M-step updates the parameter values by maximizing their expected likelihood. The iterative process continues until the parameter configuration converges (we consider an HMM to have converged when the log data likelihood for two consecutive model configurations differs by less than 10^-6).
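A minimal sketch of this innermost loop is shown below, assuming the hmmlearn library, whose GaussianHMM.fit method implements Baum-Welch (EM) for continuous-valued observation sequences; the data arrays here are hypothetical placeholders.

```python
import numpy as np
from hmmlearn.hmm import GaussianHMM

# Hypothetical cluster data: a list of multivariate temporal sequences,
# each of shape (sequence_length, n_features).
sequences = [np.random.randn(50, 2), np.random.randn(60, 2)]
X = np.concatenate(sequences)            # hmmlearn expects stacked sequences
lengths = [len(s) for s in sequences]    # plus the length of each sequence

# Loop 4: Baum-Welch (EM) estimation of the ML parameters for a chosen size.
model = GaussianHMM(n_components=4,      # chosen number of hidden states
                    covariance_type="full",
                    n_iter=200,
                    tol=1e-6,            # convergence threshold from the paper
                    random_state=0)
model.fit(X, lengths)                    # alternating E-step / M-step iterations

log_likelihood = model.score(X, lengths) # log P(X | lambda, theta_hat)
```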

The Bayesian Clustering Methodology

Bayesian Clustering In model-based clustering, it is assumed that the data is generated by a mixture of underlying probability distributions. The mixture model, M, is represented by K component models and a hidden, independent discrete variable C, where each value i of C represents a component cluster, modeled by λ_i. Given observations X = (x_1, ..., x_N), let f_k(x_i | θ_k, λ_k) be the density of an observation x_i from the kth component model, λ_k, where θ_k are the corresponding parameters of the model. The likelihood of the mixture model given the data is expressed as:

\[ P(X \mid \theta_1, \ldots, \theta_K, P_1, \ldots, P_K) = \prod_{i=1}^{N} \sum_{k=1}^{K} P_k \, f_k(x_i \mid \theta_k, \lambda_k), \]

where P_k is the probability that an observation belongs to the kth component (P_k ≥ 0, \sum_{k=1}^{K} P_k = 1). Bayesian clustering casts the model-based clustering problem into a Bayesian model selection problem. Given partitions with different component clusters, the goal is to select the best overall model, M, that has the highest posterior probability, P(M | X).
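As a worked sketch of the mixture likelihood above (our own illustration, not the authors' code), the log of the product over observations can be accumulated with a log-sum-exp over components; per_component_loglik and weights are hypothetical inputs.

```python
import numpy as np
from scipy.special import logsumexp

def mixture_log_likelihood(per_component_loglik, weights):
    """
    per_component_loglik: array (N, K) with log f_k(x_i | theta_k, lambda_k).
    weights: array (K,) with the mixture probabilities P_k (non-negative, sum to 1).
    Returns log P(X | theta_1..K, P_1..K) = sum_i log sum_k P_k f_k(x_i | ...).
    """
    log_weighted = per_component_loglik + np.log(weights)  # broadcast over rows
    return logsumexp(log_weighted, axis=1).sum()
```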

Model Selection Criterion From Bayes theorem, the posterior probability of the model, P(M | X), is given by:

\[ P(M \mid X) = \frac{P(M)\, P(X \mid M)}{P(X)}, \]

where P(X) and P(M) are the prior probabilities of the data and the model, respectively, and P(X | M) is the marginal likelihood of the data. For the purpose of comparing alternate models, we have P(M | X) ∝ P(M) P(X | M). Assuming none of the models considered is favored a priori, P(M | X) ∝ P(X | M). That is, the posterior probability of a model is directly proportional to the marginal likelihood. Therefore, the goal is to select the mixture model that gives the highest marginal likelihood. Given the parameter configuration θ of a model M, the marginal likelihood of the data is computed as

\[ P(X \mid M) = \int_{\theta} P(X \mid \theta, M)\, P(\theta \mid M)\, d\theta. \]

When the parameters involved are continuous valued, as in the case of the CDHMM, this integration often becomes too complex to obtain a closed form solution. Common approaches include Monte-Carlo methods, e.g., Gibbs sampling (Chib 1995), and various approximation methods, such as the Laplace approximation. It has been well documented that although Monte-Carlo methods and the Laplace approximation are quite accurate, they are computationally intensive. Next, we look at an efficient approximation method: the Bayesian information criterion (BIC) (Heckerman, Geiger, & Chickering 1995).

Bayesian Information Criterion BIC is derived from the Laplace approximation (Heckerman, Geiger, & Chickering 1995):

\[ \log P(M \mid X) \approx \log P(X \mid M, \hat{\theta}) - \frac{d}{2} \log N, \]

where d is the number of parameters in the model, N is the number of data objects, and θ̂ is the ML parameter configuration of model M. The first term, log P(X | M, θ̂), the data likelihood, tends to promote larger and more detailed models of the data, whereas the second term, -(d/2) log N, is a penalty term that favors smaller models with fewer parameters. The larger and more complex the model, the more parameters it has, the larger the value of d, and thus the larger the model complexity penalty. BIC selects the best model for the data by balancing these two terms.
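The BIC score can be computed directly from a fitted model, as in the sketch below (our own, assuming hmmlearn's GaussianHMM with full covariances; the parameter count d here is a simple free-parameter count, not the paper's count of "significant" parameters).

```python
import numpy as np

def bic_score(model, X, lengths):
    """BIC approximation: log P(X | M, theta_hat) - (d / 2) * log N."""
    n = model.n_components
    m = model.means_.shape[1]
    # Free parameters of a full-covariance Gaussian HMM (an upper bound on d):
    # initial probs (n-1) + transitions n*(n-1) + means n*m + covariances n*m*(m+1)/2
    d = (n - 1) + n * (n - 1) + n * m + n * m * (m + 1) // 2
    log_likelihood = model.score(X, lengths)   # data likelihood term
    N = len(lengths)                           # number of data objects (sequences)
    return log_likelihood - 0.5 * d * np.log(N)
```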

Bayesian HMM Clustering Recently, graphical models have been incorporated into the Bayesian clustering framework. Components of a Bayesian mixture model, which are typically modeled as multivariate normal distributions (Cheeseman & Stutz 1996), are replaced with graphical models, such as Bayesian networks (Thiesson et al. 1998) and Markov chain models (Sebastiani et al. 1999). We have adapted Bayesian clustering to CDHMM clustering, so that: (1) the components of the Bayesian mixture model are represented by CDHMMs, and (2) f_k(x_i | θ_k, λ_k), the data likelihood, computes the likelihood of a multi-feature temporal sequence given a CDHMM.

[Figure 1: HMM model size selection — the data likelihood, the BIC penalty term, and the resulting BIC score plotted for HMM model sizes 2 through 10.]

We first describe how the general Bayesian model selection criterion is adapted for the HMM model size selection and the cluster partition selection problems. Then we describe how the characteristics of these criterion functions are used to design our heuristic clustering search control structure.

Criterion Functions

Criterion for HMM Size Selection The HMM model size selection algorithm picks the HMM whose number of states best describes the data for that cluster. We use the Bayesian model selection criterion to select the best HMM model size given the data. From our discussion earlier, for a model λ trained on data X, the best model size is the one that, when coupled with its ML parameter configuration, gives the highest posterior probability. Since P(λ | X) ∝ P(X | λ), the marginal likelihood can be efficiently approximated using the BIC measure. Applying the BIC approximation, the marginal likelihood of the HMM, λ_k, for cluster k is computed as:

\[ \log P(X_k \mid \lambda_k) \approx \sum_{j=1}^{N_k} \log P(x_{kj} \mid \lambda_k, \hat{\theta}_k) - \frac{d_k}{2} \log N_k, \]

where N_k is the number of objects in cluster k, d_k is the number of significant parameters in λ_k (the significant parameters include all the parameters of the emission probability definitions, plus only those initial and transition probabilities greater than a threshold value t; t is set to 10^-6 for all experiments reported in this paper), and θ̂_k are the ML parameters of λ_k.

Figure 1 illustrates how BIC works for HMM model size selection. Data generated from a 5-state HMM is modeled using HMMs with sizes ranging from 2 to 10, and the resulting BIC measures are plotted. The dashed line shows the likelihood of the data for the different size HMMs, the dotted line shows the penalty value for each model, and the solid line shows BIC as the combination of the two terms. We observe that as the size of the model increases, the model likelihood increases and the model penalty decreases monotonically. BIC attains its highest value at the size of the generating HMM for the data.
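The size selection procedure can then be sketched as a sequential search that reuses the bic_score helper from the previous sketch (again assuming hmmlearn; this is our illustration, not the authors' implementation):

```python
import numpy as np
from hmmlearn.hmm import GaussianHMM

def select_hmm_size(X, lengths, max_states=10):
    """Sequential BIC search over HMM sizes: stop once the score drops."""
    best_model, best_bic = None, -np.inf
    for n_states in range(1, max_states + 1):
        model = GaussianHMM(n_components=n_states, covariance_type="full",
                            n_iter=200, tol=1e-6, random_state=0)
        model.fit(X, lengths)
        score = bic_score(model, X, lengths)   # helper from the previous sketch
        if score <= best_bic:
            break           # passed the BIC peak; keep the previous (best) model
        best_model, best_bic = model, score
    return best_model, best_bic
```

The early stop mirrors the heuristic search control described later: the BIC curve rises to a single peak and then falls, so a full sweep over all sizes is unnecessary.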

Criterion for Cluster Partition Selection In the Bayesian framework, the best clustering mixture model, M, has the highest partition posterior probability (PPP), P(M | X). We approximate the PPP with the marginal likelihood of the mixture model, P(X | M). For a partition with K clusters, modeled as λ_1, ..., λ_K, the PPP computed using the BIC approximation is:

\[ \log P(X \mid M) \approx \sum_{i=1}^{N} \log\left[ \sum_{k=1}^{K} P_k \, P(x_i \mid \hat{\theta}_k, \lambda_k) \right] - \frac{K + \sum_{k=1}^{K} d_k}{2} \log N, \]

where θ̂_k and d_k are the ML model parameter configuration for λ_k and the number of significant model parameters of cluster k, respectively, and P_k is the probability that object x_i belongs to cluster k. When computing the data likelihood, we assume that the data is complete, i.e., each object is assigned to exactly one cluster in the partition; therefore, P_k = 1 if object x_i is in cluster k, and P_k = 0 otherwise. The best model is the one that balances the overall data likelihood and the complexity of the entire cluster partition.

Figure 2 illustrates how the BIC measure works for cluster partition selection: given data consisting of an equal number of objects from four generating HMMs, the BIC scores are measured when the data is partitioned into 1 to 10 clusters. At first, when the number of clusters is small, the improvement in data likelihood dominates the change in the BIC values, so BIC increases monotonically as the size of the partition increases. It reaches its peak value when the size of the partition corresponds to the true partition size, i.e., four. Subsequently, the improvement in data likelihood becomes less and less significant, the penalty on model complexity and the model prior terms dominate the change in the BIC measure, and BIC decreases monotonically as the size of the partition continues to increase.
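A sketch of the partition-level score is given below (our own illustration; cluster_models, cluster_data, and param_counts are hypothetical containers, and the hard-assignment assumption P_k ∈ {0, 1} from the text lets each object contribute only its own cluster's likelihood):

```python
import numpy as np

def partition_bic(cluster_models, cluster_data, param_counts):
    """
    cluster_models: list of fitted HMMs, one per cluster (lambda_1 .. lambda_K).
    cluster_data:   list of (X, lengths) pairs holding each cluster's sequences.
    param_counts:   list of significant parameter counts d_k, one per cluster.
    Returns the BIC approximation of log P(X | M) for the whole partition.
    """
    K = len(cluster_models)
    N = sum(len(lengths) for _, lengths in cluster_data)   # total number of objects
    # With complete-data assignments, sum_i log sum_k P_k P(x_i | ...) reduces to
    # the log-likelihood of each cluster's own data under its own model.
    log_lik = sum(model.score(X, lengths)
                  for model, (X, lengths) in zip(cluster_models, cluster_data))
    penalty = 0.5 * (K + sum(param_counts)) * np.log(N)
    return log_lik - penalty
```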

The Clustering Search Control Structure The exponential complexity associated with the four nested search loops of our HMM clustering scheme prompts us to introduce heuristics into the search process. Figures 1 & 2 illustrate the characteristics of the BIC criterion in HMM model size selection (loop 3) and in partition selection (loop 1). Both measures increase monotonically as the size of the HMM or the cluster partition increases, until they reach a peak, and then begin to decrease because the penalty term dominates. The peak corresponds to the optimal model size. Given this characteristic, we employ the same sequential search strategy for both search loops 1 and 3. We start with the simplest model, i.e., a one cluster partition for loop 1 and a one state HMM for loop 3. Then we gradually increase the size of the model, i.e., by adding one cluster to the partition or one state to the HMM, and re-estimate the model. After each expansion, we evaluate the model using the BIC measure.

[Figure 2: Cluster partition selection — the data likelihood, the BIC penalty term, and the resulting BIC score plotted for partition sizes 1 through 10.]

Table 1: The Matryoshka system

Initial partition construction:
    Select the first seed
    Apply HMM model construction based on the seed
    Form the initial partition with the HMM
do
    Partition expansion:
        Select one additional seed
        Apply HMM model construction based on the seed
        Add the HMM to the current partition
    Object redistribution:
        do
            Distribute objects to the clusters with the highest likelihood
            Estimate the best model size and parameter values for each cluster
        while there are objects that change cluster memberships
    Compute the PPP of the current partition
while the PPP of the current partition > the PPP of the previous partition
Accept the previous partition as the final cluster partition
Apply HMM model construction to the final clusters

If the BIC score of the current model decreases from that of the previous model, we may conclude that we have just passed the peak point, and we accept the previous model as our final model. Otherwise, we continue with the model expansion process. To expand a partition, we first select a seed based on the cluster models in the current partition. A seed includes a set of k objects (k = 3 for all experiments shown here). The purpose of including more than one object in each seed is to ensure that there is sufficient data to build a reliable initial HMM. The first object in a seed is selected by choosing the object that has the least likelihood given all cluster models in the current partition. The remaining objects in the seed are the ones that have the highest likelihood given the HMM built from the first object. A seed selected in this way is more likely to correspond to the centroid of a new cluster in the given partition. We apply the HMM model size selection procedure to each chosen seed to estimate the optimal model size. Once a partition is expanded with a seed, search loop 2 distributes objects to the individual clusters such that the overall data likelihood given the partition is maximized. We assign object x_i to cluster (θ̂_k, λ_k) based on its sequence-to-HMM likelihood measure (Rabiner 1989), P(x_i | θ̂_k, λ_k). Individual objects are assigned to the clusters whose models provide the highest data likelihood. If, after one round of object distribution, any object changes its cluster membership, the models for all clusters are updated to reflect the current data in the clusters, and all objects are redistributed using the set of new models. Otherwise, the current distribution is accepted. After the objects are distributed into clusters, search loops 3 and 4 are invoked to estimate the best HMM model size and the ML model parameters for the chosen size for each cluster.

Table 1 describes the Matryoshka algorithm for HMM clustering. For comparative purposes, we also build a simplified Matryoshka system where the HMM model size selection is applied only during the initial seed selection and to the set of clusters in the final partition. This differs from the complete Matryoshka system, where the HMM model size selection procedure is applied to the seed selection, as well as to the intermediate clusters after each object redistribution procedure.
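The seed selection and object redistribution steps might be sketched as follows (our illustration; select_hmm_size is the helper sketched earlier, the models are assumed to be hmmlearn GaussianHMMs, and the data layout is hypothetical):

```python
import numpy as np

def select_seed(sequences, cluster_models, seed_size=3):
    """Pick the worst-fitting object plus its closest companions as a new seed."""
    # First seed object: lowest likelihood under every current cluster model.
    fit = [max(m.score(s) for m in cluster_models) for s in sequences]
    first = int(np.argmin(fit))
    seed_model, _ = select_hmm_size(sequences[first], [len(sequences[first])])
    # Remaining seed objects: highest likelihood under the first object's HMM.
    others = sorted((i for i in range(len(sequences)) if i != first),
                    key=lambda i: seed_model.score(sequences[i]), reverse=True)
    return [first] + others[:seed_size - 1]

def redistribute(sequences, cluster_models):
    """Assign each object to the cluster whose HMM gives it the highest likelihood."""
    return [int(np.argmax([m.score(s) for m in cluster_models])) for s in sequences]
```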

Experimental Results We begin our discussion of the experiments by describing the method used to generate data from synthetic models. The experimental results are then analyzed using a set of performance indices.

Synthetic Data Synthetic data are generated in such a manner that the underlying HMM models satisfy predetermined similarity values. We define the similarity between pairs of HMMs using the Normalized Symmetrized Similarity (NSS) measure. The similarity between two HMMs, λ_1 and λ_2, d(λ_1, λ_2), is computed as:

\[ d(\lambda_1, \lambda_2) = \frac{\log P(S^2 \mid \lambda_1) + \log P(S^1 \mid \lambda_2) - \log P(S^1 \mid \lambda_1) - \log P(S^2 \mid \lambda_2)}{2}, \tag{1} \]

and the NSS is computed as:

\[ \bar{d}(\lambda_1, \lambda_2) = \frac{\log P(S^2 \mid \lambda_1) + \log P(S^1 \mid \lambda_2) - \log P(S^1 \mid \lambda_1) - \log P(S^2 \mid \lambda_2)}{2 \cdot |S^1| \cdot |S^2|}, \tag{2} \]

where |S^1| and |S^2| are the number of data objects generated using λ_1 and λ_2, respectively. The higher the similarity value, the more similar the two models involved. Note that the NSS is a log measure, and since the likelihood values are
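For reference, Eq. (2) can be computed as in the sketch below (our own illustration, assuming hmmlearn models; S1 and S2 are the lists of sequences generated from λ_1 and λ_2):

```python
def normalized_symmetrized_similarity(model1, model2, S1, S2):
    """NSS between two HMMs, following Eq. (2)."""
    def total_loglik(model, sequences):
        # Sum of log P(sequence | model) over a list of sequences.
        return sum(model.score(s) for s in sequences)

    cross = total_loglik(model1, S2) + total_loglik(model2, S1)
    self_ = total_loglik(model1, S1) + total_loglik(model2, S2)
    return (cross - self_) / (2.0 * len(S1) * len(S2))
```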
