Qualitative Hidden Markov Models for Classifying Gene Expression Data

Zina M. Ibrahim, Ahmed Y. Tawfik, Alioune Ngom
{ibrahim,atawfik,angom}@uwindsor.ca
University of Windsor, 401 Sunset Avenue, Windsor, Ontario N9B 3P4
Abstract. Hidden Markov Models (HMMs) have been successfully used in tasks involving prediction and recognition of patterns in sequence data, with applications in areas such as speech recognition and bioinformatics. Variations of traditional HMMs have proved practical in applications where it is feasible to obtain the numerical probabilities required to specify the parameters of the model and where the available probabilities are descriptive of the underlying uncertainty. The capabilities of HMMs, however, remain unexplored in applications where this convenience is not available. Motivated by such applications, we present a HMM that uses qualitative probabilities instead of quantitative ones. More specifically, the HMM presented here captures the order of magnitude of the probabilities involved instead of numerical probability values. We analyze the resulting model by using it to perform classification tasks on gene expression data.
1 Introduction and Motivation
Hidden Markov Models (HMMs) [16] are probabilistic graphical models that capture the dependencies between random variables in time-series data. They have been successfully applied to perform predictive and recognitive tasks in several areas of artificial intelligence, such as speech recognition (e.g. [18]), robotics (e.g. [3]), pattern recognition [12], and several areas of bioinformatics, such as transmembrane protein classification (e.g. [10]). The power of HMMs stems from the provision of efficient and intuitive algorithms that grant HMMs their predictive and recognitive capabilities by computing quantities of interest described by the model [16]. For example, given the specification of the model, there exist efficient algorithms for computing the probability of observed events [20]. HMMs, however, remain unexplored in application domains where they could be useful, owing to the unavailability of the statistical data necessary to specify the parameters of the model. Although overcoming the lack of real data by means of approximation [17] or synthesis [6] is possible for some applications, it is not an option for many others. For example, epidemiological data describing factors influencing the occurrence of illnesses
cannot be approximated or synthesized when insufficient. Another example is the problem of predicting the topological structure of proteins: the topologies of very few proteins are currently known, the available data is in general incomplete and uncertain, and HMMs have only been successfully used in the prediction of a special class of proteins called transmembrane proteins [10].

In response to this problem, formalisms of qualitative probability [22, 5, 15] have been proposed as alternatives for when numerical probabilities are difficult to obtain. These formalisms aim at capturing the likelihood of events in a way which mimics that of probability theory without resorting to numerical values. Indeed, there exists evidence in the literature for the use of qualitative probabilities in complex problems, such as the protein topology prediction problem [14] and image interpretation [9].

Moreover, qualitative methods for dealing with uncertainty are not only an alternative for when data is unavailable, but are also useful where quantitative approaches have already been proposed. For example, in bioinformatics, a wealth of high-throughput data is available, and its sheer volume has made formulating mechanisms that extract biological insight from it an ongoing effort [4]. We believe that qualitative equivalents of the available quantitative methods can serve as a guide for a better analysis. In other words, they can be used to perform an initial analysis that filters the available data, which helps reduce the complexity of the full analysis performed by the quantitative methods.

In this paper, we present a qualitative HMM, that is, a HMM that replaces traditional probabilities with the qualitative framework found in [5], which captures the order of magnitude of probabilities instead of their numerical values. We use the resulting model to conduct a qualitative analysis of gene expression data. Traditional HMMs have been used to cluster time-series gene expression data with the aim of finding correlations among different genes (e.g. [19], [23]). The qualitative HMMs we propose here are applied to the same problem, and serve to create pre-clusters that the existing quantitative HMMs can use as a guide for a better analysis. This is of special interest to the pharmaceutical industry, for which any new insight into the dynamics of genes can have a great impact on designing drugs for currently hard-to-analyze diseases [13]. On a side note, it is essential that the reader keep in mind that the provision of better qualitative knowledge about massive data is of use not only in healthcare applications, but also in various other domains (e.g. economics [7]).

We have previously formulated a qualitative equivalent of HMMs [8] that was specifically tailored to use qualitative probability values in spatiotemporal applications. Another work in this regard is that of [17], which uses estimates of the HMM parameters. However, there does not exist a general model which uses a qualitative abstraction of probability theory to formulate a qualitative equivalent of HMMs.

In the remainder of the paper, we present the qualitative model along with its application to the analysis of gene expression data. In the first section, we present
an overview of the main constituents of standard HMMs, and follow in the second section with an outline of the qualitative theory of order-of-magnitude probabilities. We then present the building blocks of the qualitative HMM and build the qualitative algorithm used to solve one of the canonical problems associated with HMMs. Finally, we shift our attention to using the devised model to cluster time-series gene expression data and provide an analysis of the results.
2 Hidden Markov Models
Hidden Markov Models (HMMs) [16] are probabilistic graphical models used to represent the behavior of a system which is known to possess a number of states. The states of the model are hidden, in the sense that their operations can only be studied through discrete time series of the observed output produced by the states.

Formally, a HMM = {S, V, π, A, B} is defined by the following parameters:

1. A finite set of n unobservable (hidden) states S = {s_1, ..., s_n}.
2. A finite set of m observable outputs, or the alphabet of the model, V = {v_1, ..., v_m}, that may be produced by the states given in S at any time t.
3. The vector π of the initial state probability distribution, i.e. the probability of the system being in state s_i at time 0: P(q_0 = s_i), ∀ s_i ∈ S (1 ≤ i ≤ n).
4. The matrix A = [a_{ij}], 1 ≤ i, j ≤ n, which describes the transition probability distribution among the states. Each entry a_{ij} = P(q_t = s_i | q_{t-1} = s_j) describes the probability of the system being in state s_i at time t given that it was in state s_j at time t − 1. This formulation reflects the Markov property, which dictates that the next state depends only on the current state and is independent of previous states. The property also implies that the transition probabilities out of each state must sum to one:

   $$\sum_{i=1}^{n} P(q_t = s_i \mid q_{t-1} = s_j) = 1, \qquad \forall\, 1 \le j \le n$$
5. The matrix B = {b_j(o_t)}, 1 ≤ j ≤ n, of the emission probabilities of the observable output at a given state, b_j(o_t) = P(o_t = v_i | q_t = s_j), which describes the probability of the system producing output v_i at time t given that it is in state s_j (1 ≤ i ≤ m). This reflects the assumption that the output at a given time depends only on the state that produced it and is independent of previous output. In other words:

   $$\sum_{i=1}^{m} P(o_t = v_i \mid q_t = s_j) = 1, \qquad \forall\, 1 \le j \le n$$
Hence, a HMM can be described by a doubly stochastic structure. The first stochastic process provides a high-level view of the system and is operated by a Markov chain (described by the transition matrix A) governing the transitions among the hidden states. The second stochastic process, on the other hand, is
the one governing the production of observable output independently by each state (described by the emission matrix B). This structure provides HMMs with a high degree of flexibility, which makes them attractive for sequential data analysis. In this paper, we redefine the semantics of HMMs to accept qualitative abstractions of probability values for the emissions and transitions. We do this by using the qualitative probability model described in the next section.
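To make the specification above concrete, the following is a minimal Python sketch (illustrative only; the class name and all numerical values are ours, not from the paper) of a HMM = (S, V, π, A, B), together with checks of the stochastic constraints given in Section 2.

```python
import numpy as np

class HMM:
    """A minimal container for the HMM parameters (pi, A, B)."""

    def __init__(self, pi, A, B):
        self.pi = np.asarray(pi)  # initial state distribution, shape (n,)
        self.A = np.asarray(A)    # A[i, j] = P(q_t = s_i | q_{t-1} = s_j)
        self.B = np.asarray(B)    # B[j, i] = P(o_t = v_i | q_t = s_j)
        # Each conditional distribution must sum to one, as required above.
        assert np.allclose(self.pi.sum(), 1.0)
        assert np.allclose(self.A.sum(axis=0), 1.0)  # sum over destination states i
        assert np.allclose(self.B.sum(axis=1), 1.0)  # sum over output symbols v_i

# A two-state, two-symbol toy model (values are arbitrary).
model = HMM(pi=[0.6, 0.4],
            A=[[0.7, 0.2],
               [0.3, 0.8]],
            B=[[0.9, 0.1],
               [0.4, 0.6]])
```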
3 Order of Magnitude of Probabilities: the Kappa Calculus
The kappa calculus [1, 5] is a system that abstracts probability theory by using the order of magnitude of probabilities as an approximation of probability values. It does so by capturing the degree of disbelief in a proposition ω, or the degree of incremental surprise or abnormality associated with finding ω to be true [5], labeled κ(ω). The value of κ(ω) is assigned so that probabilities having the same order of magnitude belong to the same κ class, and so that κ(ω) grows inversely to the order of magnitude of the probability value P(ω).

The abstraction is achieved via a procedure which begins by representing the probability of a proposition ω, P(ω), by a polynomial function of one unknown, ε, an infinitesimally small positive number (0 < ε < 1). The rank κ of a proposition ω is represented by the power of the most significant ε-term in the polynomial representing P(ω) (the lowest power of ε in the polynomial). Accordingly, the relation between probability and κ values is that P(ω) is of the same order as ε^n, where n = κ(ω) [21], that is:

$$\epsilon^{\kappa(\omega)+1} < P(\omega) \le \epsilon^{\kappa(\omega)}$$
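As an illustration, the following Python sketch computes κ ranks from probability values under this order-of-magnitude reading; the function name kappa and the choice ε = 0.1 are ours, not prescribed by [5].

```python
import math

def kappa(p, eps=0.1):
    """Return the kappa rank n of a proposition with probability p,
    i.e. the integer n with eps**(n+1) < p <= eps**n."""
    if p == 0.0:
        return math.inf  # impossible propositions get infinite surprise
    # The largest integer n with p <= eps**n is floor(log_eps(p)).
    return math.floor(math.log(p) / math.log(eps))

print(kappa(0.9))    # 0 -- probabilities of order 1 are unsurprising
print(kappa(0.05))   # 1 -- one order of magnitude down
print(kappa(0.003))  # 2 -- two orders of magnitude down
```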
corresponding to a time-series microarray data set, construct a HMM to model the stochastic behavior of the matrix M as follows:

– Construct the set of states S = {s_1, ..., s_n}, where ∀ s_i ∈ S: s_i represents the hidden behavior of gene i (1 ≤ i ≤ n), i.e. the behavior governing the time-series for gene i.
– Construct the set of observation variables O = {o_1, ..., o_m}, where ∀ o_t ∈ O: o_t represents the expression level of some gene at time t (1 ≤ t ≤ m). Hence, the matrix B = {b_j(o_t), 1 ≤ j ≤ n} represents the observed expression level of gene j at time t.
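A hedged sketch of this construction in Python (the helper name build_model_skeleton is ours; per the text above, B here stores the observed expression levels of each gene over time rather than probabilities):

```python
import numpy as np

def build_model_skeleton(M):
    """Given an n-by-m expression matrix M (n genes, m time points),
    create one hidden state per gene and one observation per time step."""
    n_genes, n_times = M.shape
    S = [f"s{i}" for i in range(1, n_genes + 1)]  # hidden behavior of gene i
    O = [f"o{t}" for t in range(1, n_times + 1)]  # expression level at time t
    B = M.copy()  # B[j, t]: observed expression level of gene j at time t
    return S, O, B

# 550 genes over a 5-step time series, as in the simulated data set below
# (random values stand in for real expression measurements).
M = np.random.rand(550, 5)
S, O, B = build_model_skeleton(M)
```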
6.4 Data Set
For the initial examination of the performance of the HMMκ presented in this paper, we use two data sets. The first is a set of simulated data describing the expression levels of 550 genes over a 5-step time series, for which the correct clustering of the gene time-series is known. The second is the E. coli data set, for which we evaluate our algorithm by comparing our results with the literature.
6.5 Obtaining HMMκ
Ideally, we would like the HMM to be trained with kappa values instead of numerical probabilities. This, however, requires a qualitative version of the learning algorithms, which is currently under development. Therefore, the HMM was trained with the well-known Baum-Welch algorithm [16], which iteratively searches for the HMM parameters that maximize the likelihood of the observations given the model, P(O|λ). We use Baum-Welch to obtain a HMM = (S, V, π, A, B) that uses regular probabilities. The κ values of the corresponding HMMκ are then obtained from the probability values of the π vector and the A and B matrices by mapping each probability value p to its rank using the notion introduced in Section 3, i.e. the κ for which

$$\epsilon^{\kappa+1} < p \le \epsilon^{\kappa}$$
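A minimal sketch of this mapping step, reusing the kappa rank function from the Section 3 sketch with the same illustrative choice ε = 0.1 (the trained probabilities below are toy values, not results from the paper):

```python
import math
import numpy as np

def kappa(p, eps=0.1):
    """Kappa rank of probability p, as in the earlier sketch."""
    return math.inf if p == 0.0 else math.floor(math.log(p) / math.log(eps))

def to_kappa_hmm(pi, A, B, eps=0.1):
    """Replace every probability in pi, A and B by its kappa rank."""
    rank = np.vectorize(lambda p: kappa(p, eps))
    return rank(pi), rank(A), rank(B)

pi_k, A_k, B_k = to_kappa_hmm(
    np.array([0.6, 0.4]),
    np.array([[0.95, 0.9], [0.05, 0.1]]),  # columns are conditional distributions
    np.array([[0.99, 0.01], [0.7, 0.3]]))
print(A_k)  # [[0 0] [1 1]] -- rarer transitions get higher (more surprising) ranks
```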