Optimal Dynamic Bayesian Network Structure Learning in Polynomial ..... the current best score, i.e., sMIT (P) ⥠gMIT
T ECHNICAL R EPORT
A Polynomial Time Algorithm for Learning Globally Optimal Dynamic Bayesian Network and
its Applications in Genetic Network Reconstruction FIT-GSIT TR.1101 V INH N GUYEN 1 , M ADHU C HETTY 1 , R OSS C OPPEL 2 AND P RAMOD P. WANGIKAR 3 1
Faculty of Information Technology, Monash University Faculty of Medicine, Nursing and Health Sciences, Monash University Chemical Engineering Department, Indian Institute of Technology, Bombay 2
3
{Vinh.Nguyen,Madhu.Chetty,Ross.Coppel}@Monash.edu
[email protected]
Contents Contents
ii
1 GlobalMIT—A Polynomial Time Algorithm for Learning Globally Optimal Dynamic Bayesian Network 1.1 1.2 1.3
1.4
1.5
Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . MIT Score for Dynamic Bayesian Network Structure Learning . . . . . . Optimal Dynamic Bayesian Network Structure Learning in Polynomial Time with MIT . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1.3.1 Complexity bound 7, 1.3.2 Efficient Implementation for globalMIT 9 Experimental Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1.4.1 Results on Probabilistic Network Synthetic Data 10, 1.4.2 Results on Linear Dynamical System Synthetic Data 10, 1.4.3 Results on Non-Linear Dynamical System Synthetic Data 11 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
2 Network Modeling of Cyanobacterial Biological Processes via Clustering and Dynamic Bayesian Network 2.1 2.2 2.3 2.4 2.5 2.6
Introduction . . . . . . . . . . . . . . . . . . . . . . Filtering and clustering of genes . . . . . . . . . . Assessment of clustering results . . . . . . . . . . . Building a network of interacting clusters . . . . . Experiments on Cyanothece sp. strain ATCC 51142 Discussion and Conclusion . . . . . . . . . . . . .
Bibliography
. . . . . .
. . . . . .
. . . . . .
. . . . . .
. . . . . .
. . . . . .
. . . . . .
. . . . . .
. . . . . .
. . . . . .
. . . . . .
. . . . . .
. . . . . .
1 1 3 4 9
12 15 15 16 17 18 18 21 23
ii
One GlobalMIT—A Polynomial Time Algorithm for Learning Globally Optimal Dynamic Bayesian Network The work in this chapter is concerned with the problem of learning the globally optimal structure of a dynamic Bayesian network (DBN) from data. We propose using a recently introduced information theoretic criterion named MIT (Mutual Information Test) for evaluating the goodness-of-fit of a DBN structure. MIT has been previously shown to be effective for learning static Bayesian network, yielding results competitive to other popular scoring metrics, such as BIC/MDL, K2 and BD, and the well-known constraint-based approach PC algorithm. This work adapts MIT to the case of DBN. Using a modified variant of MIT, we show that learning the globally optimal DBN structure can be efficiently achieved in polynomial time. 1.1
I NTRODUCTION
Bayesian network (BN) is a central topic in machine learning, and has found applications in various fields. A (static) BN is defined by a graphical structure and a family of probabilistic distribution, which together allow efficient and accurate representation of the joint probability distribution of a set of random variables (RV) of interest [KF09]. The graphical part of a static BN is a directed acyclic graph (DAG), with nodes representing RVs and edges representing their conditional (in)dependence relations. Two important disadvantages when applying static BN to certain domain problems, such as gene regulatory network reconstruction in bioinformatics, are that: (i) BN does not have a mechanism for exploiting the temporal aspect of time-series data (such as that gathered from time-series microarray experiments) abundant in this field; and (ii) BN does not allow the modeling of cyclic phenomena, such as feed back loops, which are prevalent in biological systems [YSW+ 04]. These drawbacks have motivated the development of the so-called dynamic Bayesian network (DBN). The simplest model of this type is the first-order Markov stationary DBN, in which both the structure of the network and the parameters characterizing it are assumed to remain unchanged over time, such as the one exemplified in Figure 1.1a. In this model, the value of a RV at time t + 1 is assumed to depend only on the value of its parents at time t. DBN accounts for the temporal aspect of time-series data, in that an edge must always direct forward in time (i.e., cause must precede consequence), and allows feedback loops (Fig. 1.1b). Since its inception, DBN has received particular interest, especially from the bioinformatics community [MM99, PRM+ 03, KIM03, Hus03, YSW+ 04, ZC05, WD09]. Recent works in the machine learning 1
2
CHAPTER 1. GLOBALMIT—A POLYNOMIAL TIME ALGORITHM FOR LEARNING GLOBALLY OPTIMAL DYNAMIC BAYESIAN NETWORK
community have progressed to allow more flexible DBN models, such as one with, either parameters [GH09], or both structure and parameters [RH09, RH10, DLH10] changing over time. It is worth noting that more flexible models generally require more data to be learned accurately. In situations where training data are scarce, such as in microarray experiments where the data size can be as small as a couple of dozen samples, a simpler model such as the first-order Markov stationary DBN might be a more suitable choice.
A
A
A
B
B
A
B
B
C
C
t
t+1
B
A
C
C
t
t+1
(a)
C
(b)
Figure 1.1: Dynamic Bayesian Network: (a) a 1st order Markov stationary DBN; (b) its equivalent folded network In this work, we focus on the problem of learning the globally optimal structure for the first-order Markov stationary DBN. Henceforth, DBN shall refer to this particular class of stationary DBN, and learning shall refer to structure learning. The most popular approaches for learning DBN have been the ones adapted from the static BN literature, namely the search+score paradigm [YSW+ 04, WD09], and Markov Chain Monte Carlo (MCMC) simulation [Hus03, DLH10, RH10]. Herein we are interested in the search+score approach, in which one has to specify a scoring function to assess the goodness-of-fit of a DBN given the data, and a search procedure to find the optimal network based on this scoring metric. Several popular scores for static BN, such as the Bayesian scores (K2, Bayesian-Dirichlet (BD), BDe and BDeu), and the information theoretic scores (Bayesian Information Criterion (BIC)/minimal description length (MDL) and Akaike Information Criterion—AIC), can be adapted straightforwardly for DBN. Another recently introduced scoring metric that catches our interest is the so-called MIT (Mutual Information Test) score [dC06], which, as the name suggests, belongs to the family of scores based on information theory. Through extensive experimental validation, the author suggested that MIT can compete favorably with Bayesian scores, outperforms BIC/MDL and should be the score of reference within those based on information theory. As opposed to the other popular scoring metrics, MIT has not been considered for DBN learning to our knowledge. As for the search part, due to several non-encouraging complexity results in learning static BN, most authors have resorted to heuristic search algorithms when it comes to learning DBN. For example, Chickering [Chi96] showed that under the BDe metric, the problem of identifying a Bayesian network, with each node having at most K parents, that has a relative posterior probability greater than a given constant is NP-complete, even with K = 2. For this reason, greedy hill-climbing and meta-optimization frameworks, such as Tabu search, simulated annealing and genetic algorithms are often employed. In a recent publication, Dojer [Doj06] has shown otherwise that learning DBN structure, as opposed to static BN, does not necessarily have to be NP-hard. In particular, this author showed that, under some mild assumptions, there are algorithms for finding the globally optimal network with a polynomial worst-case time complexity, when the MDL and BDe scores
1.2. MIT SCORE FOR DYNAMIC BAYESIAN NETWORK STRUCTURE LEARNING
3
are used. In the same line of these findings, in this work, we shall show that there exists a polynomial worst-case time complexity algorithm for learning the globally optimal DBN under the newly introduced MIT scoring metric, which we shall name the globalMIT algorithm. Our experimental results show that, in terms of the recovered DBN quality, MIT performs competitively with BIC/MDL and BDe. In terms of theoretical complexity analysis, globalMIT admits a comparable worst-case complexity to the BIC/MDL-based global algorithm, and is much faster than the BDe-based algorithm. The chapter is organized as follows: in Section 1.2 we review the MIT score for DBN learning. Section 1.3 presents our algorithm for finding the globally optimal network, followed by experimental results in section 1.4, and finally some discussion and conclusion. 1.2
MIT S CORE FOR D YNAMIC B AYESIAN N ETWORK S TRUCTURE L EARNING
In this section, we review the MIT score for learning BN, then adapts it to the DBN case. Briefly speaking, under MIT the goodness-of-fit of a network is measured by the total mutual information shared between each node and its parents, penalized by a term which quantifies the degree of statistical significance of this shared information. Let X = {X1 , . . . , Xn } denote the set of n variables with corresponding {r1 , . . . , rn } discrete states, D denote our data set of N observations, G be a DAG, and Pai = {Xi1 , . . . , Xisi } be the set of parents of Xi in G with corresponding {ri1 , . . . , risi } discrete states, si = |Pai |, then the MIT score is defined as: SS MIT (G : D) =
n X i=1 Pai 6=∅
{2N.I(Xi , Pai ) −
si X
χα,li σi (j) }
j=1
where I(Xi , Pai ) is the mutual information between Xi and its parents as estimated from D. χα,lij is the value such that p(χ2 (lij ) ≤ χα,lij ) = α (the Chi-square distribution at significance level 1 − α), and the term li σi (j) is defined as: ( Qj−1 (ri − 1)(riσi (j) − 1) k=1 riσi (k) , j = 2 . . . , si liσi (j) = (ri − 1)(riσi (j) − 1), j=1 where σi = {σi (1), . . . , σi (si )} is any permutation of the index set {1 . . . si } of Pai , with the first variable having the greatest number of states, the second variable having the second largest number of states, and so on. To make P sense of this criterion, let us first point out that maximizing the first term n in the score, i=1 2N.I(Xi , Pai ), can be shown to be equivalent to maximizing the logPai 6=∅
likelihood criterion. Learning BN by using the maximum likelihood principle suffers from overfitting however, as the fully-connected network will always have the maximum likelihood. Likewise, for the MIT criterion, since the mutual information can always be increased by including additional variables to the parent set, i.e., I(Xi , Pai ∪Xj ) ≥ I(Xi , Pai ), the complete network will have the maximum total mutual information. Thus, there is a need to penalize the complexity of the learned network. Penalizing the gives us the BIC/MDL criteria, while −C(G) log-likelihood criterion with − 21 C(G) log(N )P Qsi n gives us the AIC criterion (where C(G) = (r − 1) r measures the network i ij i=1 j=1 complexity). As for the MIT criterion, while the mutual information always increases
4
CHAPTER 1. GLOBALMIT—A POLYNOMIAL TIME ALGORITHM FOR LEARNING GLOBALLY OPTIMAL DYNAMIC BAYESIAN NETWORK
when including additional variables to the parent set, the degree of statistical significance of this increment might become negligible as more and more variables are added. This significance degree can be quantified based on a classical result in information theory by Kullback [Kul68], which, in this context, can be stated as follows: under the hypothesis that Xi and Xj are conditionally independent given Pai is true, the statistics 2N.I(Xi , Xj |Pai ) approximates to a χ2 (l) distribution, with l = (ri − 1)(rj − 1)qi degree ofQfreedom, and si rik . Now qi = 1 if Pai = ∅, otherwise qi is total the number of state of Pai , i.e., qi = k=1 we can see that the second term in the MIT score penalizes the addition of more variables to the parent set. Roughly speaking, only variables that have the conditional mutual information shared with Xi given all the other variables in Pai that is higher than 100α percent of the MI values under the null hypothesis of independence can increase the score. For detailed motivations and derivation of this scoring metric as well as an extensive comparison with BIC/MDL and BD, we refer readers to [dC06]. Adapting MIT for DBN learning is rather straightforward. One just has to pay attention to the fact that the mutual information is now calculated between a parent set and its child, which should be 1-unit shifted in time, as required by the first-order Markov assumption, → − denoted by Xi1 = {Xi2 , Xi3 , . . . , XiN }. As such, the number of “effective" observations, denoted by Ne , for DBN is now only N − 1. This is demonstrated in Figure 1.2. The MIT score for DBN should be calculated as: 0 SMIT (G : D) =
n X
→ −
{2Ne .I(Xi1 , Pai ) −
i=1 Pai 6=∅
si X
χα,li σi (j) }
j=1
Similarly, when the data is composed of Nt separate time-series, the number of effective observations is only Ne = N − Nt . 1 Xj Xi
N-1 N
2 ...
...
Figure 1.2: Data alignment for dynamic Bayesian network with an edge Xj → Xi . The “effective" number of observations is now only N − 1.
1.3
O PTIMAL D YNAMIC B AYESIAN N ETWORK S TRUCTURE L EARNING IN P OLYNOMIAL T IME WITH MIT
In this section, we show that learning the globally optimal DBN with MIT can be achieved in polynomial time. Our development is based on a recent result presented in [Doj06], which states that under several mild assumptions, there exists a polynomial worst-case time complexity algorithm for learning the optimal DBN with the MDL and BDe scoring metrics. Specifically, the 4 assumptions that Dojer considered are: Assumption 1. (acyclicity) There is no need to examine the acyclicity of the graph. Pn Assumption 2. (additivity) S(G : D) = i=1 s(Xi , Pai : D|Xi ∪Pai ) where D|Xi ∪Pai denotes the restriction of D to the values of the members of Xi ∪ Pai .
1.3. OPTIMAL DYNAMIC BAYESIAN NETWORK STRUCTURE LEARNING IN POLYNOMIAL TIME WITH MIT
5
To simplify notation, we write s(Pai ) for s(Xi , Pai : D|Xi ∪Pai ). Assumption 3. (splitting) s(Pai ) = g(Pai ) + d(Pai ) for some non-negative functions g,d satisfying Pai ⊆ Pa0i ⇒ g(Pai ) ≤ g(Pa0i ) Assumption 4. (uniformity) |Pai | = |Pa0i | ⇒ g(Pai ) = g(Pa0i ) Assumption 1 is valid for DBN in general. For the first-order Markov DBN that we are considering in this work, since the graph is bipartite, with edges directing only forward in time (Fig. 1.1a), acyclicity is automatically satisfied. Assumption 2 simply states that the scoring function decomposes over the variables, which is satisfied by most scoring metrics such as BIC/MDL, BD and also clearly by MIT. Together with assumption 1, this assumption allows us to compute the parents set of each variable independently. Assumption 3 requires the scoring function to decompose into two components: d evaluating the accuracy of representing the distribution underlying the data by the network, and g measuring its complexity. Furthermore, g is required to be a monotonically non-decreasing function in the cardinality of Pai (assumption 4), i.e., the network gets more complex as more variables are added to the parent sets. We note that unlike MIT in its original form that we have considered above, where better networks have higher scores, for the score considered by Dojer, lower scored networks are better. And thus the corresponding optimization must be cast as a score minimization problem. We now consider a variant of MIT as follows: SMIT (G : D) =
n X
→ −
0 2Ne .I(Xi1 , X) − SMIT (G : D)
(1.1)
i=1
which admits the following decomposition over each variable (with the convention of I(Xi , ∅) = 0): sMIT (Pai )
=
dMIT (Pai ) + gMIT (Pai )
dMIT (Pai )
=
gMIT (Pai )
=
2Ne .I(Xi1 , X) − 2Ne .I(Xi1 , Pai ) si X χα,li σi (j)
→ −
→ −
j=1
Roughly speaking, dMIT measures the “error” of representing the joint distribution underlying D by G, while gMIT measures the complexity of this representation. We state the following results: 0 Proposition 1. The problem of SMIT maximization is equivalent to the problem of SMIT minimization.
Proof. Obvious, since
Pn
i=1
→ −
2Ne .I(Xi1 , X) = const.
Proposition 2. dMIT , gMIT satisfy assumption 3 Proof. dMIT ≥ 0 since of all parent sets Pai , X has the maximum mutual information with → − Xi1 . And since the support of the Chi-square distribution is R+ , i.e., χα,· ≥ 0, therefore Pai ⊆ Pa0i ⇒ 0 ≤ gMIT (Pai ) ≤ gMIT (Pa0i ).
6
CHAPTER 1. GLOBALMIT—A POLYNOMIAL TIME ALGORITHM FOR LEARNING GLOBALLY OPTIMAL DYNAMIC BAYESIAN NETWORK
Unfortunately, gMIT does not satisfy assumption 4. However, for many applications, if all the variables have the same number of states then it can be shown that gMIT satisfies assumption 4. Assumption 5. (variable uniformity) All variables in X have the same number of discrete states k. Proposition 3. Under the assumption of variable uniformity, gMIT satisfies assumption 4. 0 0 Proof. Psi It can be seen that if |Pai | = |Pai | = si , then gMIT (Pai ) = gMIT (Pai ) = j=1 χα,(k−1)2 kj−1 .
Since gMIT (Pai ) is the same for all parent sets of the same cardinality, we can write gMIT (|Pai |) in place of gMIT (Pai ). With assumptions 1-5 satisfied, we can employ the following Algorithm 1, named globalMIT, to find the globally optimal DBN with MIT, i.e., the one with the minimal SMIT score. Algorithm 1 globalMIT : Optimal DBN with MIT Pai := ∅ for p = 1 to n do If gMIT (p) ≥ sMIT (Pai ) then return Pai ; Stop. P = arg min{Y⊆X:|Y|=p} sMIT (Y) If sMIT (P) < sMIT (Pai ) then Pai := P. end for
Theorem 1. Under assumptions 1-5, globalMIT applied to each variable in X finds a globally optimal DBN under the MIT scoring metric. Proof. The key insight here is that once a parent set grows to a certain extent, its complexity alone surpasses the total score of a previously found sub-optimal parent set. In fact, all the remaining potential parent sets P omitted by the algorithm have a total score higher than the current best score, i.e., sMIT (P) ≥ gMIT (|P|) ≥ sMIT (Pai ), where Pai is the last suboptimal parent set found. → −
We note that the terms 2Ne .I(Xi1 , X) in the SMIT score in (1.1) do not play any essential role, since they are all constant and would not affect the outcome of our optimization problem. Knowing their exact value is however, necessary for the stopping criterion in Algorithm 1, and also for constructing its complexity bound, as we shall do shortly. Un→ − fortunately, calculating I(Xi1 , X) is by itself a hard problem, requiring O(k n+1 ) space and time in general. However, for our purpose, since the only requirement for dMIT is that it → − must be non-negative, it is sufficient to use an upper bound of I(Xi1 , X). A fundamental property of the mutual information states that I(X, Y) ≤ min{H(X), H(Y)}, i.e., mutual information is bounded by the corresponding entropies. We therefore have: → −
→ −
2Ne .I(Xi1 , X) ≤ 2Ne .H(Xi1 ),
1.3. OPTIMAL DYNAMIC BAYESIAN NETWORK STRUCTURE LEARNING IN POLYNOMIAL TIME WITH MIT
7
→ −
where H(Xi1 ) can be estimated straightforwardly from the data. Or else, we can use an a → − priory fixed upper bound for all H(Xi1 ), that is log k, then: → −
2Ne .I(Xi1 , X) ≤ 2Ne . log k. Using these bounds, we obtain the following more practical versions of dMIT : → −
→ −
d0MIT (Pai )
=
2Ne .H(Xi1 ) − 2Ne .I(Xi1 , Pai )
d00MIT (Pai )
=
2Ne . log k − 2Ne .I(Xi1 , Pai )
→ −
It is straightforward to show that Algorithm 1 and Theorem 1 are still valid when d0MIT or d00MIT are used in place of dMIT . 1.3.1
Complexity bound
Theorem 2. globalMIT admits a polynomial worst-case time complexity in the number of variables. Proof. Our aim is to find a number p∗ satisfying gMIT (p∗ ) ≥ sMIT (∅). Clearly, there is no need to examine any parent set of cardinality p∗ and over. In the worse case, our algorithm will have to examine all the possible parent sets of cardinality from 1 to p∗ − 1. We have: gMIT (p∗ ) ≥ sMIT (∅) ∗
⇔
p X
→ −
≥ dMIT (∅) = 2Ne .I(Xi1 , X).
χα,li σi (j)
j=1
As discussed above, since calculating dMIT is not convenient, we use d0MIT and d00MIT instead. With d0MIT p∗ can be found as: p∗ = P p
arg min
j=1
while with d00MIT :
p∗ = P p j=1
∗
→ −
(1.2)
p,
χα,li σi (j) ≥2Ne .H(Xi1 )
arg min
(1.3)
p.
χα,li σi (j) ≥2Ne . log k ∗
It can be seen that p depends only on α, k and Ne . Since there are O(np ) subsets with at most p∗ parents, and each set of parents can be scored in polynomial time, globalMIT admits an overall polynomial worst-case time complexity in the number of variable n. We now give some examples to demonstrate the practicability of Theorem 2. Example 1: Consider a gene regulatory network reconstruction problem, where each gene has been discretized to k = 3 states, corresponding to up, down and regular gene expression. With the level of significance α set to 0.999 as recommended in [dC06], we have:
8
CHAPTER 1. GLOBALMIT—A POLYNOMIAL TIME ALGORITHM FOR LEARNING GLOBALLY OPTIMAL DYNAMIC BAYESIAN NETWORK
gMIT (1) =
χ0.999,4
= 18.47
gMIT (2) =
gMIT (1) + χ0.999,12
= 51.37
gMIT (3) =
gMIT (2) + χ0.999,36
= 119.35
gMIT (4) =
gMIT (3) + χ0.999,108
= 278.51
gMIT (5) =
gMIT (4) + χ0.999,324
= 686.92
gMIT (6) =
gMIT (5) + χ0.999,972
= 1800.9
gMIT (7) =
gMIT (6) + χ0.999,2916
= 4958.6
gMIT (8) =
gMIT (7) + χ0.999,8748
= 14121
gMIT (9) = gMIT (8) + χ0.999,26244
= 41079
gMIT (10) = gMIT (9) + χ0.999,78732
= 121042
... Consider a data set of N = 12 observations, which is the popular length of microarray time-series experiments (in fact N often ranges within 4 − 15), then d00MIT (∅) = 2(N − 1) log k = 24.16. Observing that gMIT (2) > d00MIT (∅), then p∗ = 2 and we do not have to consider any parent sets of 2 variables or more. Let us compare this bound with those of the algorithms for learning the globally optimal DBN under the BIC/MDL and BDe scoring metrics. For BIC/MDL, p∗M DL isPgiven by dlogk N e, while for BDe, p∗BDe = dN logγ −1 ke, where the distribution P (G) ∝ γ |Pai | , with a penalty parameter 0 < γ < 1, is used as a prior over the network structures [Doj06]. In this case, p∗M DL = 3. If we choose log γ −1 = 1 then p∗BDe = dN log ke = 14. In general, p∗BDe scales linearly with the number of data items N , making its value less of practical interest, even for small data sets. Example 2: Since the number of observations in a single microarray time-series experiment is often limited, it is a popular practice to concatenate several time-series to obtain a larger data set for analysis. Let us merge Nt = 10 data sets, each with 12 observations, then Ne = N − Nt = 120 − 10 = 110. For this combined data set, gMIT (4) > d00MIT (∅) = 2Ne log k = 241.69 ⇒ p∗ = 4, thus there is no need to consider any parent set of more than 3 variables. For comparison, we have p∗M DL = 5, and p∗BDe = 132 with log γ −1 = 1. Of course, this analysis only gives us the worst-case time complexity. In practice, the execution of Algorithm 1 can often be much shorter, since sMIT (Pai ) is often much greater than sMIT (∅). This observation also applies for the global algorithms based on the BIC/MDL and BDe scoring metrics. Even though Algorithm 1 admits a polynomial time complexity, exhaustive search for the optimal parent sets over all the subsets of X with at most p∗ − 1 elements may still be extremely time consuming, especially when the number of variable n is large. In such cases, MIT may be instead used in conjunction with local or meta-optimization algorithms such as hill climbing, simulated annealing or genetic algorithm. In these cases, our analysis gives a theoretical guidance and justification for setting the so-called “max-fan-in” parameter, which dictates the maximum number of parents allowed for each node, as found popular in many softwares for DBN learning. Setting an appropriate max-fan-in number greatly reduces the search space, while still ensuring that the region containing the global
1.4. EXPERIMENTAL EVALUATION
9
optimum is covered. There seems to be no systematic rules for setting this parameter in the BN literature in our observation. Example 3: Let’s consider some large scale data sets, with k = 3, α = 0.999 and a set of N = 10000 observations, then p∗ = 9. The max-fan-in parameter can then be set to 8. For comparison, we have p∗M DL = 9 and p∗BDE = 10987 with log γ −1 = 1. 1.3.2
Efficient Implementation for globalMIT
The search procedure involves examining all potential parent sets of increasing cardinality. The following decomposition property of the mutual information is handy when it comes to design an efficient implementation for globalMIT: I(Xi , Pai ∪ Xj ) = I(Xi , Pai ) + I(Xi , Xj |Pai ) This implies that the mutual information can be computed incrementally, and suggests that, for efficiency, the computed mutual information values should be cached to avoid redundant computations, subject to memory availability. 1.4
E XPERIMENTAL E VALUATION
In this section, we describe our experiments to evaluate our global approach for learning DBN with MIT, and compare it with the other most popular scores, namely BIC/MDL and BD. Our method, implemented in Matlab, was used, along with BNFinder [WD09], a Python-based software for inferring the globally optimal DBN with the MDL and BDe scores as proposed by Dojer [Doj06]. In addition, we also employed the Java-based Banjo software [Har], which can perform greedy search and simulated annealing over the DBN space using the BDeu metric. BNFinder and Banjo are freely available online. Data: The specific problem domain that we shall work with in this experiment is the problem of gene regulatory network reconstruction from microarray data, with the variables being genes, and edges being regulatory relationship between genes. The experiments are carried out on several synthetic data sets with known ground-truth network. In order to eliminate the bias toward a certain data generation method, we shall employ several different data generation schemes that have been used in some previous studies, namely, probabilistic method [Hus03], linear dynamical system based method [YSW+ 04], and non-linear dynamical system based method [SI04]. As a realistic number of samples for microarray data, we generated data sets of between 30 and 300 samples. Data preprocessing: As per requirement of assumption 5, and as per popular practice in the field, all the gene expression profiles were discretized to the same number of levels (normally 3). As mentioned in section 1.2, data that consist of multiple separate time series need to be preprocessed to have the proper data alignment and the number of effective samples. BNFinder has an in-built capability to process multiple time series, while Banjo doesn’t seem to support this functionality. However, when the number of samples is large, this preprocessing will probably only marginally affect the results. Self-regulated links handling: During the experiments, we have noted that, the selfregulated links, i.e., a link from a node to itself, very frequently appear in the optimal networks. This is to be expected, since biological time series data are often smooth and have a high degree of auto-correlation and auto-mutual information at short lag. While these links might or might not be present in the ground-truth network, they are generally
10
CHAPTER 1. GLOBALMIT—A POLYNOMIAL TIME ALGORITHM FOR LEARNING GLOBALLY OPTIMAL DYNAMIC BAYESIAN NETWORK
not very informative, as for most natural biological processes, the current state often dictates the state in the near future (unless the process evolves in a random-walk manner). Different softwares have different policies with these self-regulated links: BNFinder suppresses these links by default, i.e., the search is performed only on the parent sets without self-parents, while Banjo on the other hand, has a parameter to enforce this type of link, but then simply ignores them all in its output report. For globalMIT, we also implemented an option to either enable or disable these links. Self-links are, however, not considered when calculating the performance metrics below. Performance metrics: With the ground-truth network available, we count the number of true positive (TP), false positive (FP), true negative (TN) and false negative (FN) edges, and report two network quality metrics, namely sensitivity= TP/(TP+FN), and imprecision=FP/(FP+TP). Parameters setting: globalMIT has one parameter, namely the significance level α, to control the trade-off between goodness-of-fit and network complexity. Adjusting α will generally affect the sensitivity and imprecision of the discovered network, very much like its affect on the Type-I and Type-II error of the mutual information test of independence. de Campos [dC06] suggested using very high levels of significance, namely 0.999 and 0.9999. We note that, the data sets used in [dC06] are of sizes from 1000 to 10000 samples. For microarray data sets of merely 30 − 300 samples, it is necessary to use a lower level of significance α to avoid overly penalizing network complexity. We have experimentally observed that using α ∈ [0.95, 0.999] on these small data sets yielded reasonable results, with balanced sensitivity and imprecision. BNFinder+MDL required no parameter tuning, while for BNFinder+BDe, the pseudocounts for the BDe score was set to the default value of 1, and the penalty parameter was set to the default value of log γ −1 = 1. For Banjo, we employed simulated annealing as the search engine, and left the equivalent sample size to the default value of 1 for the BDeu score, while the max-fan-in was set to 3. The runtime for Banjo was set to the average runtime of globalMIT, with a minimum value of 10 minutes, in case where globalMIT terminates earlier. Since some experiments were time consuming, all our experiments were performed in parallel on a 16-core Xeon X5550 workstation. 1.4.1
Results on Probabilistic Network Synthetic Data
We employed a subnetwork of the yeast cell cycle, consisting of 12 genes and 11 interactions, as depicted in Fig. 1.3(a). Two different conditional probabilities were associated with these interactions, namely noisy regulation according to a binomial distribution, and noisy XOR-style co-regulation (see [Hus03] for the parameter details, and this author website for Matlab code to generate this data). In addition, 8 unconnected nodes were added as confounders, for a total of 20 nodes. For each number of samples N = 30, 70 and 100, we generated 10 data sets. From the average statistics in Table 2.1, it can be seen that this is a relatively easy case for all methods. Except Banjo which committed a lower sensitivity and yet a higher imprecision, all other methods nearly recovered the correct network. Note that due to the excessive runtime of BNFinder+BDe, for N = 100, only 1 of ten data sets was analyzed. 1.4.2
Results on Linear Dynamical System Synthetic Data
We employed the two synthetic networks as described in [YSW+ 04], each consisting of 20 genes, with 10 and 11 genes having regulatory interaction, while the remainder moving
1.4. EXPERIMENTAL EVALUATION
11
(a) The yeast cell cycle subnetwork
(b) Yu’s net No. 1
(c) Yu’s net No. 2
Figure 1.3: Synthetic Dynamic Bayesian Networks
in a random walk, as depicted in Fig. 1.3(b,c). The data are generated by a simple linear dynamical process as follows: Xt+1 − Xt = A(Xt − T ) +
(1.4)
with X denotes the expression profiles, A describes the strength of gene-gene regulations, T is the constitutive expression values, and simulates a uniform biological noise. The detailed parameters for each network can be found in [YSW+ 04]. Using the GeneSim software provided by these authors, we generated, for each number of sample N = 100, 200 and 300, 10 data sets for each network. From the average statistics in Table 2.1, it can be seen that Banjo performed well on both data sets. BNFinder achieved a slightly lower sensitivity, but with very high imprecision rate. It is probable that the self-link suppression default option in BNFinder has led the method to include more incorrect edges to the network for a reasonable goodness-of-fit. GlobalMIT performed worse at N = 100, but is better at higher number of samples. Again, due to time limit, we were only able to run BNFinder+BDe on one out of ten data sets for each network at N = 200 and 300. 1.4.3
Results on Non-Linear Dynamical System Synthetic Data
We employed a five-gene network as in [SI04], of which dynamics is modeled by a system of coupled differential equations adhering to the power-law formalism, called the
12
CHAPTER 1. GLOBALMIT—A POLYNOMIAL TIME ALGORITHM FOR LEARNING GLOBALLY OPTIMAL DYNAMIC BAYESIAN NETWORK
S-system. The concrete form of an S-system is given as follows: n n Y Y dXi g h = αi Xj ij − βi Xj ij , dt j=1 j=1
i = 1 . . . n,
(1.5)
where the rates αi , βi and kinetic orders gij and hij are parameters dictating the influence of gene Xj on the rate of change in the expression level of gene Xi . Using the same system parameters as in [SI04], we integrated the system using the Runge-Kutta method with 10 different initial conditions to obtain 10 time series, each of length 50. We then randomly chose 3 time series of length 33, 3 of length 50 and 6 of length 50, to make data sets of length N = 99, 150 and 300 respectively, with 10 data sets for each N value. Although this data had been previously analyzed with good accuracy by using reconstruction methods based on differential equation models, it proved to be the most challenging case for DBN based methods. Even with a fairly large number of samples, compared to a small number of variables and interactions, all the methods performed poorly, with low sensitivity and high imprecision, rendering the results hardly useful. GlobalMIT nevertheless showed a slight advantage, with a reasonable sensitivity and imprecision at N = 300. 1.5
C ONCLUSION
This chapter has investigated the problem of learning the globally optimal DBN structure with the MIT scoring metric. We have showed that this task can be achieved using a polynomial time algorithm. Compared with the other well-known scoring metrics, namely BIC/MDL and BDe, both in terms of the worst-case complexity bound and practical evaluation, the BIC/MDL-based algorithm for learning the globally optimal DBN is fastest, followed by MIT, whereas the extensive runtime required by the BDe-based algorithm renders it a very expensive option. GlobalMIT, which is based on a sound information theoretic criterion, represents a very competitive alternative, both in terms of the network quality and runtime required. In the next chapter of this report, we present the application of GlobalMIT on reconstructing the network of biological processes of cyanobacteria.
GlobalMIT Sen Imp Time 95 ± 9 29 ± 13 13 ± 3 100 ± 0 1±3 67 ± 4 100 ± 0 0±0 499 ± 56
Probabilistic Network Synthetic Data Banjo BNFinder+MDL N Sen Imp Time Sen Imp Time 30 84 ± 6 70 ± 4 600 86 ± 10 10 ± 9