
JOURNAL OF MULTIMEDIA, VOL. 2, NO. 3, JUNE 2007

Bayesian Inference of Genetic Regulatory Networks from Time Series Microarray Data Using Dynamic Bayesian Networks

Yufei Huang, Jianyin Wang, Jianqiu Zhang, Maribel Sanchez, and Yufeng Wang

Abstract— Reverse engineering of genetic regulatory networks from time series microarray data is investigated. We propose a dynamic Bayesian network (DBN) model and a full Bayesian learning scheme. The proposed DBN directly models the continuous expression levels and is associated with parameters that indicate the degree as well as the type of regulation. To learn the network from data, we propose a reversible jump Markov chain Monte Carlo (RJMCMC) algorithm. The RJMCMC algorithm provides not only more accurate inference results than deterministic alternative algorithms but also an estimate of the a posteriori probabilities (APPs) of the network topology. The estimated APPs provide useful information on the confidence of the inferred results and can also be used for efficient Bayesian data integration. The proposed approach is tested on yeast cell cycle microarray data and the results are compared with the KEGG pathway map.

I. INTRODUCTION

In the cell of a living organism, there are thousands of genes interacting with each other at any given time to accomplish complicated biological tasks. Genetic regulatory networks (GRNs) are collections of gene-gene regulatory relations in a genome and are models that display causal relationships between gene activities. The system-level view of gene functions provided by GRNs is of tremendous importance in understanding the underlying biological processes of living organisms, providing new ideas for treating complicated diseases, and designing new drugs. Inevitably, uncovering GRNs has become a trend in recent biomedical research [1], [2].

In this paper, we study signal processing solutions to the inference of GRNs based on microarray data. Microarray, a technology allowing measurement of the mRNA expression levels of thousands of genes, provides first-hand information on genome-wide molecular interactions, and thus it is logical to deduce that these data can be used to infer GRNs. Inference of GRNs based on microarray data is referred to as 'reverse engineering' [3], as the microarray expression levels are the outcome of gene regulation. Mathematically, reverse engineering is a traditional inverse problem. The solution to the problem is, however, not trivial, as it is complicated by the enormously large number of unknowns in a rather small sample size. In addition, inherent experimental defects, noisy readings, and many other factors play a role. These complexities call for heavy involvement of statistical signal processing, which, we foresee, will play an increasingly important role in this research.

Microarray data can be classified as coming from static or from time series experiments. In static experiments, snapshots of the expression of genes under different conditions are measured. In time series experiments, temporal molecular processes are measured. In particular, these time series data reflect the dynamics of gene activities in cell cycles. They are very important for understanding cellular aging (senescence) and programmed cell death (apoptosis), processes involved in the development of cancers and other diseases associated with the aging process [4]. While building GRNs based on static microarray data is still of great interest, and solutions based on probabilistic Boolean networks [5], [6], Bayesian networks [7], [8], [9], and many others [10] have been proposed, the study of using time series data has drawn increasing attention [11], [12]. Unlike the case of static experiments, extra attention is needed in modeling the time series experiments to account for temporal correlation. Such time series models can in turn complicate the inference, making the task of reverse engineering even more challenging than it already is. In this paper, we apply dynamic Bayesian networks (DBNs) to model the time series microarray experiment and develop a full Bayesian solution for learning the networks. The use of DBNs is not foreign to the reverse engineering of GRNs; the framework of such usage was first proposed in [13].

Corresponding Author: Yufei Huang. Y. Huang and J. Wang are with the Department of Electrical and Computer Engineering, University of Texas at San Antonio (UTSA), San Antonio, TX 78249-0669. E-mail: [email protected]. Phone: (210)4586270. Fax: (210)4585947. J. Zhang is with the Department of ECE, University of New Hampshire, Durham, NH 03824. E-mail: [email protected]. M. Sanchez and Y. Wang are with the Department of Biology, UTSA. E-mail: [email protected]. This work was supported in part by an NSF Grant CCF-0546345 to Y. Huang and NIH 1R21AI067543-01A1, San Antonio Area Foundation Biomedical Research funds, and a UTSA Faculty Research Award to Y. Wang. Y. Wang is also supported by NIH RCMI grant 2G12 RR013646-06A1.

© 2007 ACADEMY PUBLISHER
Details of modeling and learning with DBNs were investigated first in [14] and then in [15], and the proposed frameworks were tested on yeast cell cycle data. However, these DBNs took only discretized expression levels, so quantization of the expression levels had to be performed, which resulted in a loss of information. Also, only the connectivity of genes was modeled, and no estimate was provided of the degree or the type of regulation. In [16] and [17], state-space-model-based DBNs were proposed, where hidden variables were allowed to account for factors not captured by the microarray experiments. Despite the elegance of such modeling and the proposed expectation-maximization and variational Bayes solutions, the learning requires an unrealistically large amount of data, which greatly limits their application.

The DBN used in this paper is close to that in [18], which models the continuous expression level and the degree of regulation. However, unlike [18], we target cases where only microarray data are available for network inference. Consequently, instead of assuming a nonlinear model based on B-splines as in [18], a more conservative linear regulatory model is adopted here since, with very limited data, more complex models would greatly reduce the credibility of the inferred results. On the other hand, we are particularly interested in full Bayesian solutions for learning the networks, which can provide estimates of the a posteriori probabilities (APPs) of the inferred network topology. This type of solution is termed 'probabilistic' or 'soft' in signal processing and digital communications. This requirement separates the proposed solution from most existing approaches, such as step-wise search and simulated-annealing-based algorithms, all of which produce only point estimates of the networks and are considered 'hard' solutions. The advantage of soft solutions has been demonstrated in digital communications [19]. In the context of GRNs, the APPs from soft solutions provide valuable measures of confidence in the inference, which is difficult with hard solutions. Moreover, they are necessary for Bayesian data integration. Here, we propose a soft solution based on reversible jump Markov chain Monte Carlo (RJMCMC) sampling. To combat the distortion due to the small sample size, we impose an upper limit on the number of parents and carefully design the topology priors.

The rest of the paper is organized as follows. In Section II, the issues of modeling the time series data with DBNs are discussed, and the detailed model for gene regulation is provided.
In Section III, tasks related to learning the networks are discussed and the Bayesian solution is derived. In Section IV, the test results of the proposed approach on simulated networks and yeast cell cycle data are provided. The paper concludes in Section V with remarks on future work.

II. Modeling with Dynamic Bayesian Networks

Like all graphical models, a DBN is a marriage of graph theory and probability theory. In particular, DBNs are a class of directed acyclic graphs (DAGs) that model the probability distributions of stochastic dynamic processes. DBNs enable easy factorization of the joint distributions of dynamic processes into products of simpler conditional distributions according to the inherent Markov properties, and thus greatly facilitate the task of inference. DBNs have been shown to be a generalization of a wide range of popular models, including hidden Markov models (HMMs) and Kalman filtering (state-space) models. They have been successfully applied in computer vision, speech processing, target tracking, and wireless communications. Refer to [20] for a comprehensive discussion of DBNs.

A DBN consists of nodes and directed edges. Each node represents a variable in the problem, while a directed edge indicates the direct association between the two connected nodes. In a DBN, the direction of an edge can carry temporal information. To model gene regulation in the cell cycle using DBNs, we assume a microarray that measures the expression levels of $G$ genes at $N+1$ evenly sampled consecutive time instances. We then define a random variable matrix $\mathbf{Y} \in \mathbb{R}^{G\times(N+1)}$ whose $(i,n)$th element $y_i(n-1)$ denotes the expression level of gene $i$ measured at time $n-1$ (see Figure 1). We further assume that gene regulation follows a first-order, time-homogeneous Markov process. As a result, we need only consider regulatory relationships between two consecutive time instances, and this relationship remains unchanged over the course of the microarray experiment. This assumption may be restrictive, but it facilitates the modeling and inference. We call the regulating genes the "parent genes", or "parents" for short. Based on these definitions and assumptions, the structure of the proposed DBN for modeling cell cycle regulation is illustrated in Figure 1. In this DBN, each node denotes a random variable in $\mathbf{Y}$, and all the nodes are arranged in the same way as the corresponding variables in the matrix $\mathbf{Y}$. An edge between two nodes denotes the regulatory relationship between the two associated genes, and the arrow indicates the direction of regulation. For example, we see from Figure 1 that genes 1, 3, and $G$ regulate gene $i$. Even though, like all Bayesian networks, DBNs do not allow cycles in the graph, they are nevertheless capable of modeling circular regulatory relationships, an important property not possessed by regular Bayesian networks. As an example, a circular regulation can be seen in Figure 1 between genes 1 and 2, even though no cycles appear in the graph. To complete the modeling with DBNs, we need to define the conditional distribution of each child node over the graph.
Then the desired joint distribution can be represented as a product of these conditional distributions. To define the conditional distributions, we let $\mathbf{pa}_i(n)$ denote a column vector of the expression levels of all the parent genes that regulate gene $i$, measured at time $n$. For the example in Figure 1, $\mathbf{pa}_i(n) = [y_1(n), y_3(n), y_G(n)]^\top$. Then, the conditional distribution of each child node over the DBN can be expressed as $p(y_i(n)\,|\,\mathbf{pa}_i(n-1))$ $\forall i$. To determine the form of the distributions, we assume a linear regulatory relationship, i.e., the expression level of gene $i$ is the result of a linear combination of the expression levels of the regulating genes at the previous sample time. Mathematically, we have

$$ y_i(n) = \mathbf{w}_i^\top \mathbf{pa}_i(n-1) + e_i(n), \qquad n = 1, 2, \ldots, N \tag{1} $$

where $\mathbf{w}_i$ is the weight vector, independent of time $n$, and $e_i(n)$ is assumed to be white Gaussian noise with variance $\sigma_i^2$. The assumption of white Gaussian noise may not be realistic for the system error of microarray experiments [21]; however, it simplifies the learning of the networks. The weight vector is indicative of the degree and the type of regulation [16]: a gene is up-regulated if the weight is positive and down-regulated otherwise.

Fig. 1. A dynamic Bayesian network modeling of time series expression data.

The magnitude (absolute value) of the weight indicates the degree of regulation. The noise variable is introduced to account for modeling and experimental errors. From (1), we obtain that the conditional distribution is Gaussian, i.e.,

$$ p(y_i(n)\,|\,\mathbf{pa}_i(n-1)) = \mathcal{N}\big(\mathbf{w}_i^\top \mathbf{pa}_i(n-1),\; \sigma_i^2\big). \tag{2} $$

In (1), the weight vector $\mathbf{w}_i$ and the noise variance $\sigma_i^2$ are the unknown parameters to be determined.

III. Learning the DBN

Given a set of microarray measurements of the expression levels in cell cycles, the task of learning the above DBN consists of two parts: structure learning and parameter learning. The objective of structure learning is to determine the topology of the network, i.e., the parents of each gene. This is essentially a problem of model or variable selection. Under a given structure, parameter learning involves the estimation of the unknown model coefficients of each gene: the weight vector $\mathbf{w}_i$ and the noise variance $\sigma_i^2$ for all $i$. Since gene expression levels at any given time are independent and the network is fully observed, we can learn the parents and the associated model parameters of each gene separately. In the following we therefore discuss only the learning process for gene $i$.

A. Bayesian criterion for structural learning

Let $\mathcal{M}_i = \{M_i^{(1)}, M_i^{(2)}, \ldots, M_i^{(K)}\}$ denote the set of all possible network topologies for gene $i$, where each element represents a topology derived from a possible combination of the parents of gene $i$. The problem of structure learning is to select the topology from $\mathcal{M}_i$ that is best supported by the microarray data.

For a particular topology $M_i^{(k)}$, we use $\mathbf{w}_i^{(k)}$, $\mathbf{pa}_i^{(k)}$, $e_i^{(k)}$, and $\sigma_{ik}^2$ to denote the associated model variables. We can then express (1) for $M_i^{(k)}$ in the more compact matrix-vector form

$$ \mathbf{y}_i = \mathbf{Pa}_i^{(k)} \mathbf{w}_i^{(k)} + \mathbf{e}_i^{(k)} \tag{3} $$

where $\mathbf{y}_i = [y_i(1), \ldots, y_i(N)]^\top$, $\mathbf{Pa}_i^{(k)} = [\mathbf{pa}_i^{(k)}(0), \mathbf{pa}_i^{(k)}(1), \ldots, \mathbf{pa}_i^{(k)}(N-1)]^\top$, $\mathbf{e}_i^{(k)} = [e_i^{(k)}(1), e_i^{(k)}(2), \ldots, e_i^{(k)}(N)]^\top$, and $\mathbf{w}_i^{(k)}$ is the corresponding weight vector.

Under the Bayesian paradigm, we select the most probable topology $\bar{M}_i$ according to the maximum a posteriori (MAP) criterion [22], i.e.,

$$ \bar{M}_i = \arg\max_{M_i^{(k)} \in \mathcal{M}_i} p(M_i^{(k)}|\mathbf{Y}) = \arg\max_{M_i^{(k)} \in \mathcal{M}_i} p(\mathbf{y}_i, \mathbf{Pa}_i^{(k)}|M_i^{(k)})\, p(M_i^{(k)}) = \arg\max_{M_i^{(k)} \in \mathcal{M}_i} p(\mathbf{y}_i|\mathbf{Pa}_i^{(k)})\, p(M_i^{(k)}) \tag{4} $$

where the second equality follows from Bayes' theorem and the fact that, under $M_i^{(k)}$, it is sufficient to have $\mathbf{Pa}_i^{(k)}$ and $\mathbf{y}_i$ instead of $\mathbf{Y}$ for modeling. Note that there is a slight abuse of notation in (4): $\mathbf{Y}$ in $p(M_i^{(k)}|\mathbf{Y})$ denotes a realization of expression levels measured from a microarray experiment. Apart from the MAP solution, we are also interested in obtaining estimates of the APPs of the topology, $p(M_i^{(k)}|\mathbf{Y})$, whose advantages have been discussed in Section I. To this end, expressions for the marginal likelihood $p(\mathbf{y}_i|\mathbf{Pa}_i^{(k)})$ and the model prior $p(M_i^{(k)})$ need to be derived; we discuss them next.
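To make the data model concrete before deriving the marginal likelihood, the following minimal sketch simulates one gene's trajectory under the linear Gaussian model (1). The function name and array layout are illustrative choices of ours, not the paper's.

```python
import numpy as np

def simulate_gene(parent_levels, w, sigma2, rng):
    """Draw y_i(1..N) from model (1): y_i(n) = w^T pa_i(n-1) + e_i(n).

    parent_levels: (N+1) x P array of the parents' expression levels at
                   times 0..N (row n plays the role of pa_i(n)).
    w:             length-P weight vector; its sign encodes up/down regulation.
    sigma2:        variance of the white Gaussian noise e_i(n).
    """
    N = parent_levels.shape[0] - 1
    noise = rng.normal(0.0, np.sqrt(sigma2), size=N)
    # Row n-1 of parent_levels regulates y_i(n), as in the stacked form (3).
    return parent_levels[:-1] @ w + noise
```

Stacking the rows $\mathbf{pa}_i(0), \ldots, \mathbf{pa}_i(N-1)$, as done implicitly here, is exactly the matrix $\mathbf{Pa}_i^{(k)}$ of the compact form (3).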

A.1 The Marginal Likelihood $p(\mathbf{y}_i|\mathbf{Pa}_i^{(k)})$

The marginal likelihood is obtained by integrating the unknown parameters out of the full likelihood:

$$ p(\mathbf{y}_i|\mathbf{Pa}_i^{(k)}) = \int p(\mathbf{y}_i|\mathbf{w}_i^{(k)}, \sigma_{ik}^2, \mathbf{Pa}_i^{(k)})\; p(\mathbf{w}_i^{(k)}, \sigma_{ik}^2|\mathbf{Pa}_i^{(k)})\; d\mathbf{w}_i^{(k)}\, d\sigma_{ik}^2 \tag{5} $$

where $p(\mathbf{w}_i^{(k)}, \sigma_{ik}^2|\mathbf{Pa}_i^{(k)})$ is the parameter prior, for which we choose the standard conjugate Gaussian-Inverse-Gamma prior [23]

$$ p(\mathbf{w}_i^{(k)}, \sigma_{ik}^2|\mathbf{Pa}_i^{(k)}) = \mathcal{N}_{\mathbf{w}_i^{(k)}}\big(\mathbf{0}, \sigma_{ik}^2 \mathbf{R}\big)\, \mathcal{IG}_{\sigma_{ik}^2}(\nu_0, \gamma_0) \tag{6} $$

where $\mathbf{R}^{-1} = \mathbf{Pa}_i^{(k)\top} \mathbf{Pa}_i^{(k)}$ and, to be noninformative, $\gamma_0$ and $\nu_0$ take small positive real values. Based on these conjugate priors, we show in the Appendix that the marginal likelihood has the form

$$ p(\mathbf{y}_i|\mathbf{Pa}_i^{(k)}) \propto |\mathbf{P}_\perp|^{\frac{1}{2}} \big(\gamma_0 + \mathbf{y}_i^\top \mathbf{P}_\perp \mathbf{y}_i\big)^{-\frac{N+\nu_0}{2}} \tag{7} $$

where $\mathbf{P}_\perp = \mathbf{I}_N - \mathbf{Pa}_i^{(k)} \big(\mathbf{Pa}_i^{(k)\top} \mathbf{Pa}_i^{(k)} + \mathbf{R}^{-1}\big)^{-1} \mathbf{Pa}_i^{(k)\top}$ and $\mathbf{I}_N$ is the $N \times N$ identity matrix.
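For a candidate parent matrix, the marginal likelihood (7) can be evaluated in closed form. A minimal sketch (our own helper, not the authors' code; the default $\gamma_0$ and $\nu_0$ are the values used later in Section IV) might be:

```python
import numpy as np

def log_marginal_likelihood(y, Pa, gamma0=0.36, nu0=1.2):
    """Log of (7), up to an additive constant, for one topology.

    y:  length-N vector of the child gene's levels, y_i(1..N).
    Pa: N x P_k matrix of parent levels, Pa_i^(k) in (3).
    """
    N = Pa.shape[0]
    R_inv = Pa.T @ Pa  # the prior choice R^{-1} = Pa^T Pa from (6)
    P_perp = np.eye(N) - Pa @ np.linalg.inv(Pa.T @ Pa + R_inv) @ Pa.T
    sign, logdet = np.linalg.slogdet(P_perp)  # |P_perp| > 0 here
    return 0.5 * logdet - 0.5 * (N + nu0) * np.log(gamma0 + y @ P_perp @ y)
```

Note that with $\mathbf{R}^{-1} = \mathbf{Pa}^\top\mathbf{Pa}$, the matrix $\mathbf{P}_\perp$ reduces to $\mathbf{I} - \tfrac{1}{2}\mathbf{Pa}(\mathbf{Pa}^\top\mathbf{Pa})^{-1}\mathbf{Pa}^\top$, so its determinant is simply $2^{-P_k}$ when $\mathbf{Pa}$ has full column rank.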

B. The topology prior $p(M_i^{(k)})$

There have been discussions in the literature on choosing the topology prior, most of which, however, are designed for large data samples. For cases of small sample size, as in most GRN problems, the choice of the topology prior is a subtle issue and can sometimes affect the inference result to a large degree. One interesting choice is the prior proposed in [24], which uses the description length principle and can be written as

$$ p(M_i^{(k)}) = \binom{G}{P_k}^{-1} \Big/\, G \tag{8} $$

where $P_k$ denotes the total number of parents under $M_i^{(k)}$. Apparently, this prior favors topologies with either a small or a large number of parents. In particular, the ratio between the largest ($P_k = G$) and the smallest ($P_k = G/2$) prior probabilities is

$$ r_m = \binom{G}{G/2} = \frac{G!}{\big((G/2)!\big)^2} \tag{9} $$

which can be very large for large $G$. For cases of small sample size, this prior can be too 'informative', overwhelming the information carried by the likelihood and resulting in a topology with either a very large or a very small number of parents. Notice that this description length prior also implies a uniform distribution on the number of parents $Q$, i.e.,

$$ p(Q = P_k) = \binom{G}{P_k} \binom{G}{P_k}^{-1} \Big/\, G = 1/G. \tag{10} $$

Instead, we assume that each gene has the same a priori probability, say $q$, of being a parent gene. This assumption implies a geometric form for the topology prior, expressed as

$$ p(M_i^{(k)}) = q^{P_k} (1-q)^{G-P_k}. \tag{11} $$

As a result, the number of parents $Q$ follows a binomial distribution

$$ p(Q = P_k) = \binom{G}{P_k} q^{P_k} (1-q)^{G-P_k}. \tag{12} $$

Since the mean number of parents is $\bar{Q} = Gq$, the probability $q$ can be calculated from the mean as

$$ q = \bar{Q}/G. \tag{13} $$

Therefore, the choice of $q$ reflects our prior knowledge about the average number of parents. As a special case, when $q = 0.5$ the prior becomes the popular uniform prior. Notice that this uniform prior implies a prior average number of parents of $G/2$, an unrealistic scenario for large $G$; thereby, the choice of the uniform prior is inappropriate as well. Having derived the marginal likelihood and specified the prior on the topology, we now look at how the optimization in (4)


can be performed and, at the same time, how the calculation of the APPs can be carried out. The difficulties of the task are twofold. First, the sample size $N$ is normally much smaller than the total number of genes $G$ under test; a direct result is that the problem becomes ill-conditioned, so additional constraints must be imposed. Secondly, the optimization and the calculation of the APPs are themselves NP-hard, and exact solutions are infeasible for large $G$. For instance, when $G = 58$, the size $K$ of $\mathcal{M}_i$ is about $2.88 \times 10^{17}$, and an exhaustive search over a space of this size is already prohibitive, not to mention that $G$ can be in the thousands in practice. As a result, we need to resort to numerical methods.

C. The proposed solutions

To address the first difficulty, we impose an upper limit $Q_{max}$ on the number of parents and restrict $Q_{max} < N$. This restriction is realistic in many genetic systems due to the restricted size of the regulatory region in genes. The constraint essentially forces us to search only among topologies whose regulatory models are over-determined. It also serves to reduce the size of the search space and thus helps alleviate the second difficulty. Nevertheless, the size of the search space can still be enormous even with an upper limit $Q_{max}$. We therefore propose to use reversible jump Markov chain Monte Carlo (RJMCMC) to approximate the MAP solution and the APPs. RJMCMC, proposed by Green in [25], is an MCMC algorithm for sampling from a joint topology-parameter space. In our case, since the parameters have been analytically marginalized out, the objective of the RJMCMC is to generate random samples from the APPs $p(M_i^{(k)}|\mathbf{Y})$. The MAP solution can then be approximated by the most frequently occurring sample. Moreover, these samples can also be used to produce an approximation to the desired APPs, which is difficult with deterministic schemes. The proposed RJMCMC algorithm is summarized in the following box.

Algorithm: RJMCMC

Provide an initial topology and assign it to M(0). Iterate T times; at the t-th iteration, perform the following steps.
1. Candidate selection: Suppose M(t−1) = M_i^(k). If P_k = 1, randomly select a gene from the nonparent genes; if P_k = Q_max, randomly select a gene from the parent genes; otherwise, randomly select a gene from all G genes.
2. If the selected gene is a parent in M(t−1):
   Death move: remove the node associated with the selected gene from M_i^(k) to obtain topology M_i^(j). Set M(t) = M_i^(j) with probability λ = min{BF(j,k), α(j,k)}/α(j,k); otherwise set M(t) = M_i^(k).
   Else:
   Birth move: add the node associated with the selected gene to M_i^(k) to obtain topology M_i^(l). Set M(t) = M_i^(l) with probability λ = min{BF(l,k), α(l,k)}/α(l,k); otherwise set M(t) = M_i^(k).

In this algorithm, $BF(j,k)$ is the Bayes factor between $M_i^{(j)}$ and $M_i^{(k)}$, defined as

$$ BF(j,k) = \frac{p(\mathbf{y}_i|\mathbf{Pa}_i^{(j)})}{p(\mathbf{y}_i|\mathbf{Pa}_i^{(k)})} \tag{14} $$

In addition, $\alpha(j,k)$ is calculated as the product of the topology prior ratio $r_t$ and the probability ratio of the moves $r_m$, i.e.,

$$ \alpha(j,k) = r_t(j,k)\, r_m(j,k) \tag{15} $$

where

$$ r_t(j,k) = \frac{p(M_i^{(j)})}{p(M_i^{(k)})} = \begin{cases} (1-q)/q & \text{for a death move} \\ q/(1-q) & \text{for a birth move} \end{cases} \tag{16} $$

and

$$ r_m(j,k) = \begin{cases} Q_{max}/G & \text{if } P_k = Q_{max} \\ (G-1)/G & \text{if } P_k = 1 \\ 1 & \text{otherwise.} \end{cases} \tag{17} $$
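Putting (14)-(17) together, a single accept/reject decision of the boxed algorithm can be sketched as below. The helper is our own illustration rather than the authors' implementation; `log_bf` stands for the log of the Bayes factor of the proposed topology against the current one.

```python
import math
import random

def accept_move(log_bf, Pk, move, q, G, Qmax):
    """Decide whether to accept a proposed birth/death move.

    log_bf: log BF(proposed, current), e.g. from marginal likelihoods (14).
    Pk:     number of parents in the current topology.
    move:   "birth" or "death"; q, G, Qmax as in the text.
    """
    r_t = (1 - q) / q if move == "death" else q / (1 - q)  # prior ratio (16)
    if Pk == Qmax:
        r_m = Qmax / G           # move-probability ratio (17)
    elif Pk == 1:
        r_m = (G - 1) / G
    else:
        r_m = 1.0
    alpha = r_t * r_m            # threshold (15)
    lam = min(math.exp(log_bf), alpha) / alpha  # acceptance probability
    return random.random() < lam
```

A move with an overwhelming Bayes factor is always accepted (λ = 1), while a poor one is accepted only with probability BF/α, matching the stochastic behavior discussed in the text.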

α can be considered as a threshold on the Bayes factor BF. However, unlike the thresholds used in various deterministic Bayesian search algorithms, α produces random moves: when BF > α, the proposed move is accepted with probability 1, and otherwise it is accepted with probability BF/α. This stochastic move can avoid becoming trapped in local high-density regions and thus can possibly produce a global solution. Also notice that, unlike in most deterministic search schemes where the threshold is set by experience or heuristics, α is calculated from the topology priors and the probabilities of the moves, both of which have clear meanings.

The proposed RJMCMC algorithm is very similar to a random-sweep Gibbs sampler [26], [27] in the topology space. The similarity lies in the fact that, in each iteration of the algorithm, a candidate gene is randomly picked for a sample update while the samples of the other genes are kept unchanged. In fact, when P_k, the number of parents, is between 1 and Q_max, this RJMCMC algorithm is exactly a random-sweep Gibbs sampler. However, due to the imposed upper limit Q_max and the assumption that there must be at least one parent, the use of the Gibbs sampler becomes nontrivial. The difficulty arises when P_k = 1 or P_k = Q_max. For example, when P_k = Q_max, the candidate gene can only be chosen from the existing Q_max parents, since otherwise P_k could exceed Q_max. In this case, the dimension of the variable space changes from G = 58 to Q_max, and a standard random-sweep Gibbs sampler cannot handle the problem. Of course, the fundamental theories of MCMC for designing proper transition distributions of the underlying Markov chains and proposing an extension


to the standard random-sweep Gibbs sampler can be relied upon; this can be done by carefully defining the transition distributions of the underlying Markov chain. Such an effort would eventually lead to an equivalent form of the proposed RJMCMC. RJMCMC, on the other hand, is specifically designed for problems with dimension changes, and there is a standard procedure to follow when deriving the algorithm for a particular case. The process is therefore much more routine, and mistakes associated with designing the transition distributions for an extended random-sweep Gibbs sampler can be avoided. Additionally, the proposed RJMCMC algorithm is readily extended to handle nonlinear and/or non-Gaussian regulatory models. This RJMCMC framework is thus more general.

When the algorithm finishes, there will be T samples of $M_i^{(k)}$; as is common practice, we discard the first portion of the samples (the burn-in) to account for the convergence of the Markov chain. Supposing there are $T'$ samples left, the APPs can then be approximated by

$$ p(M_i^{(k)}|\mathbf{Y}) \approx \frac{1}{T'} \sum_{t=1}^{T'} \delta\big(M_i^{(k)} - M(t)\big) \tag{18} $$

where $\delta(\cdot)$ is the Kronecker delta function and $M(t)$ denotes the $t$-th sample in the final collection.
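In practice, the estimator (18) is plain frequency counting over the retained samples. A sketch, with each sample represented as a tuple of parent indices (a representation we chose for illustration) and a burn-in default matching the experimental section:

```python
from collections import Counter

def estimate_apps(samples, burn_in=1000):
    """Approximate p(M_i^(k)|Y) as in (18) by empirical frequencies.

    samples: list of parent-index tuples, one per RJMCMC iteration.
    Returns a dict mapping each distinct parent set to its estimated APP.
    """
    kept = [frozenset(s) for s in samples[burn_in:]]  # discard burn-in
    T_prime = len(kept)
    return {topo: count / T_prime for topo, count in Counter(kept).items()}
```

The MAP topology is then simply the key with the largest estimated probability.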

D. Parameter learning

Once the topology of the network has been determined, the model parameters $\mathbf{w}_i$ and $\sigma_i^2$ can be estimated according to the minimum mean squared error (MMSE) criterion. Given the linear Gaussian model (1), these estimates can be obtained analytically as

$$ \mathbf{w}_{i,MMSE} = \boldsymbol{\mu}_i^{(k)} \tag{19} $$

and

$$ \sigma_{i,MMSE}^2 = \frac{(\mathbf{y}_i^\top \mathbf{P}_\perp \mathbf{y}_i + \gamma_0)/2}{(N+\nu_0)/2 - 1} \tag{20} $$

where we assume the selected topology is $M_i^{(k)}$ and $\boldsymbol{\mu}_i^{(k)}$ is defined by equation (24) in Appendix I. The covariance matrix and the variance of these estimates are calculated as

$$ \mathbf{C}_w = \mathbf{B}^{-1} \tag{21} $$

and

$$ v_{\sigma^2} = \frac{\big((\mathbf{y}_i^\top \mathbf{P}_\perp \mathbf{y}_i + \gamma_0)/2\big)^2}{\big((N+\nu_0)/2 - 1\big)^2 \big((N+\nu_0)/2 - 2\big)} \tag{22} $$

where B is defined through equation (25). These variances indicate the reliability of the MMSE estimates.

IV. Test Results

A. Description of data set and algorithm settings

We tested the proposed DBN and the RJMCMC learning algorithm on the cDNA microarray data of 58 genes in the yeast cell cycle, reported in [28] and [29]. The data set from [28] contains 18 samples evenly measured over a


period of 119 minutes, where a synchronization treatment based on the α mating factor was used. The data set from [29], on the other hand, contains 17 samples evenly measured over 160 minutes, where a temperature-sensitive cdc15 mutant was used for synchronization. For each gene, the data are represented as log2{(expression at time t)/(expression in a mixture of control cells)}. Missing values exist in both data sets, indicating that there was not a sufficiently strong signal in the spot; in such cases, simple spline interpolation was used to fill in the missing data. As to the RJMCMC algorithm, in all of the experiments we used γ0 = 0.36 and ν0 = 1.2; we found that, as long as they are kept small, the results are insensitive to their specific values. When implementing the RJMCMC algorithm, we set T = 10,000 and ran the algorithm 10 times independently, discarding the first 1,000 samples of each run. This resulted in a total of 90,000 samples. By having independent runs, we reduce the chance of the Markov chains being trapped in local high-density regions, thus lowering the bias of the samples.

B. Test on a simulated network

We first tested the RJMCMC algorithm on a simulated network and compared its performance with that of the well-known K2 algorithm [30]. Since the algorithm is applied to each gene separately, we tested its performance on a single, randomly selected gene. To realistically simulate the network of the selected gene, we first ran the RJMCMC algorithm on the real data set from the α-factor synchronization to estimate the parents, the associated weights, and the noise variance. The estimated parents and weights were then used as the true model parameters when simulating the expression level of the selected gene for time samples 2 to 18, whereas the expression levels of the parent genes were still taken from the real data set. The resulting data set was thus almost identical to the real data set, except that the data of the selected gene were replaced by the simulated data.

In Figure 2, we plot the probability of error (POE) vs. the noise variance σ² for the RJMCMC and K2 algorithms. For both algorithms, the POE at a given σ² was calculated from 100 Monte Carlo trials. For the RJMCMC, we chose Qmax = 5 and q = 2/58. For the K2 algorithm, since no ordering was available, we performed an exhaustive search to determine the first possible parent of the selected gene; the geometric prior on the topology was also included in the K2 algorithm. Figure 2 clearly demonstrates the better performance of the RJMCMC algorithm, especially for small σ². Notice that the POE of the RJMCMC decreases drastically with σ², whereas the POE of the K2 algorithm remains almost flat across different σ². This suggests that the K2 was trapped in local solutions. The figure also suggests that when σ² increases to the point where the noise becomes much stronger than the information from the data, neither algorithm performs well. However, this case is of little interest, and more data should be included instead. The estimated variance from the real data set is 0.52. Given the correctness of the model, we would therefore expect better performance from the RJMCMC than from the K2 when both are applied to the real data set. In summary, through this test on the simulated network, we are assured that the RJMCMC indeed works and has the potential to provide much better results.

Fig. 2. Plot of the probability of error vs. the noise variance for the RJMCMC and the K2 algorithms.

C. Tests on the real data sets

Fig. 3. The inferred gene network for Qmax = 10 and q = 6/58.

In this section, we provide the test results of RJMCMC on the two real data sets from the yeast cell cycle. In the first experiment, we set the upper limit on the number of parents to Qmax = 10 and assumed that, on average, there were 6 parents for each gene, which implies q = 6/58. The inferred gene network is depicted in Figure 3. In this network, the nodes are labeled with gene names and, as in DBNs, if gene i is a parent of gene j, an arrow from i to j is placed. The thickness of the arrow is determined by the magnitude of the corresponding weight, which denotes the

Fig. 4. The estimated posterior distribution of the topology for gene CDC28 in experiment 1. The x-axis is the decimal representation of M_i^(k).

Fig. 5. The estimated posterior distribution of the topology for gene CDC14 in experiment 1. The x-axis is the decimal representation of M_i^(k).

Fig. 6. The gene network corresponding to the second-largest APP of the topology for Qmax = 10 and q = 6/58.

degree of regulation. In addition, if the weight is positive, up regulation would be implied and a solid edge is used for the arrow. Otherwise, a dashed line is used, which represents down regulation. We compared the network with the KEGG pathway map (http://www.genome.jp/kegg/) and marked the unconfirmed regulations by blue edges. A confirmed regulation is likely to suggest a true positive in our inference results. The brown-shaded nodes are the genes that were not included in the KEGG map. We observed, on one hand, some general interaction networks supported by previous experimental and computational studies. For instance, CDC5, a serine/threonine-protein kinase is a central mediator of a series of inductive or repressive reactions. On the other hand, many interactions appeared inconsistent with the current biological views presented in the KEGG pathway map. These could be very well due to the insufficient amount of data - a set with only 18 time points were used. As a unique feature of the proposed RJMCMC algorithm, we calculated the posterior distribution of the topology for each gene. At least two aspects of the pos-

© 2007 ACADEMY PUBLISHER

terior distribution can be indicative of confidence of the MAP results. First, the larger the value of the maximum a posterior probability is, the more confidence we would have about the overall results. Secondly, the larger the difference between the maximum and the second largest a posterior probabilities is, the more confidence we could also have. As an example, we plot the APPs of the topology of gene CDC28 in Figure 4. The largest and the second largest probabilities are 0.0012 and 0.0006. Even though small, the largest probability is rather pronounced. We thus have confidence in this MAP solution. In another example, shown in Figure 5 is the posterior distribution of the topology of gene CDC14. This time, the largest two probabilities are very close, and thus, we do not have high confidence in the resulting network since the topology corresponding to the second largest probability can be equally good. Next, an average over the 58 probabilities or 0.0011 is provided. The probability is again rather small. The average of the second largest a posteriori probability is calculated equaling 0.0008. We see that the difference between the largest and the second largest probability is small, which implies, on average, a low confidence on the inferred networks. The gene networks corresponding to the second largest APP is shown in Figure 6. There are fewer links confirmed by KEGG. In the second experiment, we set Qmax = 5 and q = 2/58. This setting implies a smaller search space and would lead to results with higher confidence. The inferred network is shown in Figure 7. A similar annotation system as in Figure 3 is used. In Figures 8 and 9 the estimated posterior distributions on the topologies for gene CDC28 and CDC14 are plotted. In both cases, the MAP solutions are the same as those in Experiment 1. However, the probabilities are overall larger than those in Experiment 1. 
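The two confidence indicators used throughout this section, the size of the largest APP and its ratio to the second largest, are straightforward to compute from an estimated posterior over topologies. A small illustrative sketch, with made-up probabilities rather than the paper's estimates:

```python
def confidence_diagnostics(apps):
    """Given estimated a posteriori probabilities of candidate
    topologies, return the MAP probability, the runner-up
    probability, and their ratio (a higher ratio means more
    confidence in the MAP topology)."""
    top, second = sorted(apps, reverse=True)[:2]
    return top, second, top / second

# Hypothetical posterior over 8 candidate topologies, indexed by the
# decimal representation k of the parent-set indicator M_i^(k).
apps = [0.05, 0.40, 0.10, 0.20, 0.05, 0.05, 0.10, 0.05]
top, second, ratio = confidence_diagnostics(apps)
# A pronounced MAP peak (ratio 2.0 here) supports the MAP topology;
# a ratio near 1 means a competing topology is almost equally good.
```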
For CDC28, the relationship between the largest and the second largest probabilities is about the same as in Experiment 1, whereas for CDC14 the difference between the two is increased, which suggests increased confidence in the results. Calculating the respective averages of the largest and the second largest a posteriori probabilities over all the genes gives 0.0257 and 0.0203. This is an approximately 20-fold increase in the largest posterior probability over that in the first experiment and indicates increased confidence in the inferred networks, which is consistent with our original expectation. However, the difference between the two probabilities is still slim. This suggests that, in addition to the inferred network, there are competing topologies that are almost equally likely to be a solution. The gene network corresponding to the second largest APP is shown in Figure 10. Again, we see that there are fewer links confirmed by KEGG.

Fig. 7. The inferred gene network for Qmax = 5 and q = 2/58. (Figure omitted: network diagram with the same annotation system as Figure 3.)

Fig. 8. The estimated posterior distribution of the topology for gene CDC28 in experiment 2. The x-axis is the decimal representation k of M_i^(k).

Fig. 9. The estimated posterior distribution of the topology for gene CDC14 in experiment 2. The x-axis is the decimal representation k of M_i^(k).

Fig. 10. The gene network corresponding to the second largest APP of topology for Qmax = 5 and q = 2/58. (Figure omitted: the legend distinguishes up and down regulation, weight ranges 0–0.4, 0.4–0.8, and 0.8–1.2, and regulations unconfirmed by the KEGG pathway.)

In the third experiment, we tested the algorithm on the second data set, from the CDC28 mutant. As in experiment 2, we set Qmax = 5 and q = 2/58. The inferred network is shown in Figure 11; it has a similar number of links confirmed by the KEGG map as Figure 7 from Experiment 2. Again, we provide the plots of the APPs of topology for genes CDC28 and CDC14 in Figure 12 and Figure 13, respectively. First of all, the values of the largest APPs for both genes are similar to those in experiment 2. Therefore, we surmise that the two data sets provide a similar degree of information concerning the network. Secondly, we observed that the largest APP is more pronounced for CDC28, whereas for CDC14 there are many peaks of height similar to the largest APP. In particular, the ratios between the two largest APPs are 1.04 and 2.09 for CDC14 and CDC28, respectively. As a result, there is more confidence in the inference for CDC28 than in that for CDC14. Another interesting observation is that the two plots look very similar to the two obtained in Experiment 2. This confirms from a probabilistic viewpoint that the two data sets provide information on the same network. (Otherwise, chances are that the APPs would not look the same, having been produced from different networks.) It is thus reasonable to integrate the data sets for improved inference.

Fig. 11. The inferred gene network for Qmax = 5 and q = 2/58. (Figure omitted: data set 2; same annotation system as Figure 3.)

Fig. 12. The estimated posterior distribution of the topology from data set 2 for gene CDC28 in experiment 3. The x-axis is the decimal representation k of M_i^(k).

Fig. 13. The estimated posterior distribution of the topology from data set 2 for gene CDC14 in experiment 3. The x-axis is the decimal representation k of M_i^(k).

V. Conclusions and future work

We proposed a dynamic Bayesian network model of time series microarray data in which a linear regulatory model is adopted. To learn the DBN from the data, we developed a full Bayesian solution and an RJMCMC algorithm for determining the network topology. The full Bayesian solution provides the APPs of topology, which can be used as an indication of the confidence in the inferred results. We tested the proposed method on yeast cell cycle microarray data. The estimated APPs indicated generally low confidence in the results, even though the confidence increases with more stringent constraints and assumptions. This is mainly due to the small data size and possibly to inaccuracy in the assumed linear regulatory models. The focus of subsequent study will be on improving the confidence of the inference results. This calls for approaches that incorporate additional data of similar types from different experiments, as well as data of disparate types such as protein-protein interactions. The "soft" information, i.e., the APPs provided by the RJMCMC algorithm, is more advantageous for developing efficient Bayesian data integration than existing "hard" solutions. In addition, gene regulation is naturally a nonlinear process, and the system error of microarray experiments is more likely to be non-Gaussian. Using more accurate nonlinear and non-Gaussian regulatory models of GRNs will be worth investigating in the future.

Appendix
I. Derivation of the marginal likelihood p(y_i | Pa_i^(k))

Given the conjugate Gaussian-Inverse-Gamma prior on the parameters, the marginal likelihood can be obtained as

gent constraint and assumptions. This is mainly due to the small data size and possibly inaccuracy in the assumed linear regulatory models. The focus of the subsequent study will be on improving the confidence of inference results. This calls for approaches for incorporating additional data of similar types from different experiments and data of disparate types such as protein-protein interaction. The “soft” information or the APPs provided by the RJMCMC are advantageous for developing efficient Bayesian data integration than other existing “hard” solutions. In addition, gene regulation is naturally a nonlinear process and the system error of microarray experiment is more likely to be nonGaussian. Using more accurate nonlinear and nonGaussian regulatory models in GRNs will be worth investigating further in the future. Appendices I. Derivation of the marginal likelihood (k) p(yi |Pai ) Given on the conjugate Gaussian-Inverse-Gamma prior on the parameters, the marginal likelihood can be obtained as   (k) (k) (k) 2 p(yi |wi , σik , Pai ) p(yi |Pai ) = (k)

(k)

2 p(wi , σik |Pai )dwi dσi2   (k) (k) − 12 |yi −Pai wi |2 2 −N/2 ∝ (σik ) e 2σik −

e

1 2σ 2 ik

(k) 

wi

(k)

R−1 wi

2 −Pk /2 (σik ) |R|−1/2 2

(k)

2 −(ν0 /2+1) −γ0 /2/σik 2 (σik ) e dwi dσik   (k) (k) 2 −N/2 ∝ Nw(k) (µi , B−1 )dwi (σik ) i



|R|−1/2 |B|−1/2 e

1 σ2 ik

yi P⊥ yi 2

(k)

2 −(ν0 /2+1) −γ0 /2/σik 2 (σik ) e dwi dσik

JOURNAL OF MULTIMEDIA, VOL. 2, NO. 3, JUNE 2007

= |R|−1/2 |B|−1/2 −

1 2



55

2 −((N +ν0 )/2+1) (σik )

(yi P⊥ yi +γ0 )

2 e 2σik dσik ∝ |P⊥ |−1/2 (yi P⊥ yi + γ0 )−(N +ν0 )/2  N + ν0 yi P⊥ yi + γ0 2 , )dσik IG( 2 2

= |P⊥ |−1/2 (yi P⊥ yi + γ0 )−(N +ν0 )/2 (23) where (k)

µi

(k) 

= B−1 Pai

B =

yi

(k)  (k) Pai Pai

and (k)

(24) −1

+R

(k) 

P⊥ = IN − Pai B−1 Pai

(25)

.
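The closed form in (23) is straightforward to evaluate numerically as a log-score for a candidate parent set. The sketch below is our own illustration, not the paper's code; R = I and gamma0 = nu0 = 1 are illustrative choices of the hyperparameters:

```python
import numpy as np

def log_marginal_likelihood(y, pa, nu0=1.0, gamma0=1.0):
    """Log of the marginal likelihood in (23), up to an additive
    constant: -0.5*log|P_perp| - ((N+nu0)/2)*log(y' P_perp y + gamma0),
    with B = Pa'Pa + R^{-1} and P_perp = I - Pa B^{-1} Pa' (R = I)."""
    n, p = pa.shape
    B = pa.T @ pa + np.eye(p)                      # R = I assumed
    p_perp = np.eye(n) - pa @ np.linalg.solve(B, pa.T)
    _, logdet = np.linalg.slogdet(p_perp)
    quad = float(y @ p_perp @ y)
    return -0.5 * logdet - 0.5 * (n + nu0) * np.log(quad + gamma0)

# Toy check: data generated exactly from candidate regressor x should
# score higher than an unrelated candidate z.
x = np.arange(1.0, 9.0)               # N = 8 samples of one regulator
z = np.array([1.0, -1.0] * 4)         # an unrelated candidate
y = 2.0 * x                           # y depends on x only
s_true = log_marginal_likelihood(y, x[:, None])
s_alt = log_marginal_likelihood(y, z[:, None])
# s_true > s_alt: the correct parent set receives the larger score.
```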

References

[1] P. Brazhnik, A. de la Fuente, and P. Mendes, "Gene networks: how to put the function in genomics," Trends in Biotechnology, vol. 20, no. 11, pp. 467–472, Nov. 2002.
[2] N. Friedman, "Inferring cellular networks using probabilistic graphical models," Science, vol. 303, pp. 799–805, Feb. 2004.
[3] P. D'haeseleer, P. Liang, S. Fuhrman, and R. Somogyi, "Genetic network inference: from co-expression clustering to reverse engineering," Bioinformatics, vol. 16, no. 8, pp. 707–726, 2000.
[4] P. Ross-Macdonald, T. Roemer, P. Coelho, S. Agarwal, A. Kumar, S. A. des Etages, K.-H. Cheung, A. Sheehan, D. Symoniatis, R. Jansen, L. Umansky, K. Nelson, H. Iwasaki, D. Kanada, R. Logo, K. Hager, M. Gerstein, P. Miller, G. S. Roeder, and M. Snyder, "Large-scale analysis of the yeast genome by transposon tagging and gene disruption," Nature, vol. 402, pp. 413–418, 1999.
[5] I. Shmulevich, E. R. Dougherty, S. Kim, and W. Zhang, "Probabilistic Boolean networks: a rule-based uncertainty model for gene regulatory networks," Bioinformatics, vol. 18, no. 1, 2002.
[6] X. Zhou, X. Wang, and E. R. Dougherty, "Construction of genomic networks using mutual-information clustering and reversible-jump Markov-chain-Monte-Carlo predictor design," Signal Processing, vol. 83, pp. 745–761, 2003.
[7] A. J. Hartemink, D. K. Gifford, T. S. Jaakkola, and R. A. Young, "Using graphical models and genomic expression data to statistically validate models of genetic regulatory networks," Pacific Symposium on Biocomputing, vol. 6, pp. 23–32, 2001.
[8] E. J. Moler, D. C. Radisky, and I. S. Mian, "Integrating naive Bayes models and external knowledge to examine copper and iron homeostasis in S. cerevisiae," Physiol. Genomics, vol. 4, pp. 127–135, 2000.
[9] E. Segal, Rich Probabilistic Models for Genomic Data, Ph.D. thesis, Stanford University, 2004.
[10] N. Simonis, S. J. Wodak, G. N. Cohen, and J. van Helden, "Combining pattern discovery and discriminant analysis to predict gene co-regulation," Bioinformatics, vol. 20, no. 15, pp. 2370–2379, 2004.
[11] Z. Bar-Joseph, "Analyzing time series gene expression data," Bioinformatics, vol. 20, no. 16, pp. 2493–2503, 2004.
[12] H. de Jong, "Modeling and simulation of genetic regulatory systems: a literature review," Journal of Computational Biology, vol. 9, no. 1, pp. 67–103, 2002.
[13] K. Murphy and S. Mian, "Modelling gene expression data using dynamic Bayesian networks," Tech. Rep., Computer Science Division, University of California, Berkeley, 1999.
[14] N. Friedman, M. Linial, I. Nachman, and D. Pe'er, "Using Bayesian networks to analyze expression data," Journal of Computational Biology, vol. 7, no. 3-4, pp. 601–620, 2000.
[15] R. J. P. van Berlo, E. P. van Someren, and M. J. T. Reinders, "Studying the conditions for learning dynamic Bayesian networks to discover genetic regulatory networks," Simulation, vol. 79, no. 12, 2003.
[16] M. J. Beal, F. Falciani, Z. Ghahramani, C. Rangel, and D. L. Wild, "A Bayesian approach to reconstructing genetic regulatory networks with hidden factors," Bioinformatics, vol. 20, pp. 1361–1372, Sept. 2004.
[17] B. Perrin, L. Ralaivola, A. E. Mazurie, S. Bottani, J. Mallet, and F. d'Alché-Buc, "Gene networks inference using dynamic Bayesian networks," Bioinformatics, vol. 19, Suppl. 2, pp. ii138–ii148, 2003.
[18] S. Y. Kim, S. Imoto, and S. Miyano, "Inferring gene networks from time series microarray data using dynamic Bayesian networks," Briefings in Bioinformatics, vol. 4, no. 3, pp. 228–235, 2003.
[19] X. Wang and H. V. Poor, Wireless Communication Systems: Advanced Techniques for Signal Reception, Prentice Hall PTR, 2004.
[20] K. P. Murphy, Dynamic Bayesian Networks: Representation, Inference and Learning, Ph.D. thesis, University of California, Berkeley, 2004.
[21] P. Sebastiani, E. Gussoni, I. S. Kohane, and M. Ramoni, "Statistical challenges in functional genomics (with discussion)," Statistical Science, vol. 18, no. 1, pp. 33–60, 2003.
[22] S. M. Kay, Fundamentals of Statistical Signal Processing: Estimation Theory, Prentice Hall, 1997.
[23] J. M. Bernardo and A. F. Smith, Eds., Bayesian Theory, John Wiley and Sons Ltd, 2000.
[24] N. Friedman and M. Goldszmidt, "Learning BNs with local structure," in Learning in Graphical Models, chapter VI, pp. 421–459, Kluwer Academic/MIT Press, first edition, 1998.
[25] P. Green, "Reversible jump Markov chain Monte Carlo computation and Bayesian model determination," Biometrika, vol. 82, pp. 711–732, 1995.
[26] J. S. Liu, Monte Carlo Strategies in Scientific Computing, Springer-Verlag, New York, 2001.
[27] C. P. Robert and G. Casella, Monte Carlo Statistical Methods, Springer, 2nd edition, 2004.
[28] P. T. Spellman, G. Sherlock, M. Q. Zhang, V. R. Iyer, K. Anders, M. B. Eisen, P. O. Brown, D. Botstein, and B. Futcher, "Comprehensive identification of cell cycle-regulated genes of the yeast Saccharomyces cerevisiae by microarray hybridization," Molecular Biology of the Cell, vol. 9, pp. 3273–3297, 1998.
[29] R. Cho, M. Campbell, E. Winzeler, L. Steinmetz, A. Conway, L. Wodicka, T. Wolfsberg, A. Gabrielian, D. Landsman, D. Lockhart, and R. Davis, "A genome-wide transcriptional analysis of the mitotic cell cycle," Mol. Cell, no. 2, pp. 65–73, 1998.
[30] G. F. Cooper and E. Herskovits, "A Bayesian method for the induction of probabilistic networks from data," Machine Learning, no. 9, pp. 309–347, 1992.

Yufei Huang received his Ph.D. degree in electrical engineering from the State University of New York at Stony Brook in 2001. He is now Associate Professor in the Department of Electrical and Computer Engineering at the University of Texas at San Antonio. Dr. Huang's expertise is in the area of genomic signal processing, statistical modeling, and Bayesian methods. His current research focuses on developing signal processing solutions for gene network modeling and discovery, data integration, and proteomics. He was a recipient of a National Science Foundation (NSF) CAREER award in 2005. He has been an organizer of the IEEE Workshop on Genomic Signal Processing and Statistics, 2006 and 2007. He is an associate editor of the EURASIP Journal on Bioinformatics and Computational Biology.

Jianqiu Zhang received her Ph.D. degree in electrical engineering from the State University of New York at Stony Brook in 2002. She is now Assistant Professor in the Department of Electrical and Computer Engineering at the University of New Hampshire. Dr. Zhang's expertise is in information theory, statistical signal processing, and computational genomics. She is a member of the IEEE.


Maribel Sanchez received dual Bachelor of Science degrees in Biology and Computer Science from the University of Texas at San Antonio (UTSA) in 2004. From 2000 to 2004 she was a research scientist associate at UTSA. She was a recipient of the National Institutes of Health Minority Biomedical Research Support - Research Initiative in Science Enhancement (MBRS-RISE) and Minority Access to Research Careers - Undergraduate Student Training for Academic Research (MARC-U*STAR) fellowships. Currently, she is a Systems Analyst II in UTSA's Department of Biology. Her current research focuses on comparative genomics with an emphasis on infectious diseases and cell cycle regulation.

Yufeng Wang received her B.S. degree in Genetics from Fudan University, Shanghai, China, in 1993, her M.S. degrees in Statistics and Genetics in 1998, and her Ph.D. degree in Bioinformatics and Computational Biology in 2001 from Iowa State University, Ames, IA. From 2001 to 2003, she was a research scientist at the American Type Culture Collection (ATCC) and an affiliate research assistant professor at George Mason University, Manassas, VA. Since 2003, she has been with the University of Texas at San Antonio, where she is an assistant professor in the Department of Biology. She is also an assistant professor at the South Texas Center for Emerging Infectious Diseases in San Antonio, Texas. Her current research interests include comparative genomics, molecular evolution, and population genetics, with a special emphasis on the evolutionary mechanisms and systems biology of infectious diseases.
