2004 8th International Conference on Control, Automation, Robotics and Vision Kunming, China, 6-9th December 2004
Reconstructing Boolean Networks from Noisy Gene Expression Data

Zheng Yun and Kwoh Chee Keong
School of Computer Engineering, Nanyang Technological University
Nanyang Avenue, Singapore 639798
Email: [email protected], [email protected]

Abstract

In recent years, much interest has been devoted to simulating gene regulatory networks (GRNs), especially their architectures. Boolean networks (BLNs) are a good choice for obtaining the architectures of GRNs when the accessible data sets are limited. Various algorithms have been introduced to reconstruct Boolean networks from gene expression profiles, which are invariably noisy. However, few dedicated efforts have addressed the noise problem in learning BLNs. In this paper, we introduce a novel way of sifting noise from gene expression data. The noise causes unspecified states in the learned BLNs, but the correct BLNs can still be obtained with incompletely specified Karnaugh maps. Experiments on both synthetic and yeast gene expression data show that the method can detect noise and reconstruct the original models in some cases.

Keywords: gene regulatory networks, Boolean networks, reverse engineering, Karnaugh maps

1

Introduction

With the availability of genome-wide gene expression data [1, 2], much interest has been devoted to modelling GRNs [3-9], which are assumed to be the underlying mechanisms that regulate different gene expression patterns. Real gene expression levels are usually modelled as sigmoidal functions, but simulating such functions has proved computationally expensive. A practical alternative is to idealize the sigmoid curve with a step function, since a step function can model the non-linear part of the sigmoid curve. The resulting models are BLNs. When using BLNs to model GRNs, genes are represented by binary variables with two values, ON (1) and OFF (0), meaning that a gene is turned on or turned off respectively. The regulatory relationships between the genes are expressed by a Boolean function attached to each variable. Formally, a BLN G(V, F) consists of a set V = {X1, . . . , Xn} of nodes representing genes and a set F = {f1, . . . , fn} of Boolean functions, where a Boolean function fi(Xi1, . . . , Xik) with inputs from specified nodes Xi1, . . ., Xik at time step t is assigned to the node Xi at time step t + 1, as shown in the following equation:

Xi(t + 1) = fi(Xi1(t), . . . , Xik(t)), 1 ≤ i ≤ n    (1)

The state of a BLN is expressed by the state vector of its nodes. We use v(t) = {x1, . . . , xn} to represent the state of a BLN at time t, and v(t + 1) = {x1', . . . , xn'} to represent the state of the BLN at time t + 1, where {x1', . . . , xn'} is calculated from {x1, . . . , xn} with Equation 1. A state transition pair is v(t) → v(t + 1).

In reference [10], we introduced a new algorithm, the DFL (Discrete Function Learning) algorithm, to learn qualitative models of GRNs from discretized microarray gene expression data. In our method, the expression data are assumed to be the products of these functions; a reverse-engineering method based on information theory is then used to find the functions from the gene expression data. Gene expression data are always noisy. In this paper, we introduce a method called the ε function to deal with noise in the data sets. We show that after BLNs are learned from noisy data, some noise in the Boolean functions can be detected and removed with incompletely specified Karnaugh maps.

The rest of this paper is organized as follows. In the next section, we introduce the theoretical foundation of learning functional relations from data. We briefly introduce the DFL algorithm with an example in section 3. Then, we propose a new concept called the ε function to deal with noise in the data sets in section 4. Finally, we summarize the work of this paper in the last section.
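For concreteness, the update rule of Equation 1 and the resulting state transition pairs can be sketched in code for the four-gene example network used later in Table 3 (an illustrative sketch, not part of the paper):

```python
# Sketch of a Boolean network step for the four-gene example
# (genes A, B, C, D with the rules of Table 3); illustrative only.
from itertools import product

def step(state):
    """Compute v(t+1) from v(t) = (A, B, C, D)."""
    a, b, c, d = state
    a_next = b                                # A' = B
    b_next = a | c                            # B' = A + C
    c_next = (b & c) | (c & d) | (b & d)      # C' = BC + CD + BD
    d_next = (a & b) | (c & d)                # D' = AB + CD
    return (a_next, b_next, c_next, d_next)

# Build the full transition table: 16 state transition pairs v(t) -> v(t+1).
transitions = {v: step(v) for v in product((0, 1), repeat=4)}
print(len(transitions))            # 16 input states
print(transitions[(1, 1, 1, 1)])   # (1, 1, 1, 1): every rule evaluates to 1
```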
2
Foundation of Information Theory
Our approach is based on information theory. First of all, we introduce the following theorem, which is the theoretical foundation of our algorithm.

Theorem 2.1 If the mutual information between X and Y is equal to the entropy of Y, i.e., I(X; Y) = H(Y), then Y is a function of X.
Definition 3.2 Let X be a subset of V = {X1, . . . , Xn}. The ∆i(X) of X are the supersets of X such that X ⊂ ∆i(X) and |∆i(X)| = |X| + i, where |X| denotes the cardinality of X.
Yeung [11] gave a proof of the following theorem.

Theorem 2.2 H(Y|X) = 0 if and only if Y is a function of X.

Since I(X; Y) = H(Y) − H(Y|X), Theorem 2.1 follows directly from Theorem 2.2.
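Theorem 2.1 can be checked empirically on a small sample by estimating the entropies from counts (an illustrative sketch; the XOR example is ours, not the paper's):

```python
# Empirical check of Theorem 2.1: if Y is determined by X, then I(X; Y) = H(Y).
from collections import Counter
from math import log2

def entropy(values):
    n = len(values)
    return -sum((c / n) * log2(c / n) for c in Counter(values).values())

def mutual_information(xs, ys):
    # I(X; Y) = H(X) + H(Y) - H(X, Y)
    return entropy(xs) + entropy(ys) - entropy(list(zip(xs, ys)))

# Y is the XOR of two inputs, so Y is a function of X and I(X; Y) equals H(Y).
xs = [(0, 0), (0, 1), (1, 0), (1, 1)]
ys = [a ^ b for a, b in xs]
print(abs(mutual_information(xs, ys) - entropy(ys)) < 1e-12)  # True
```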
3
Methods
In this section, we begin with a formal definition of the problem of reconstructing qualitative GRN models from state transition pairs. Then, we briefly introduce the DFL algorithm to solve this problem. For detailed analysis of the DFL algorithm, refer to reference [10].
3.1
Problem definition
The problem of inferring the BLN model of the GRN from input-output transition pairs (time series of gene expression) is defined as follows.
Table 1: The DFL algorithm.
Algorithm: DFL(V, k, T)
Input: V with n genes, indegree k, T = {v(t) → v(t + 1)}, t = 1, · · · , N.
Output: F = {f1, f2, · · · , fn}
Begin:
1    L ← all single-element subsets of V;
2    ∆Tree.FirstNode ← L;
3    for every gene Y ∈ V {
4        calculate H(Y'); //from T
5        D ← 1; //initial depth
6∗       F.add(Sub(Y, ∆Tree, H(Y'), D, k));
     }
7    return F;
End
∗ Sub() is a subroutine, listed in Table 2.
To clarify the heuristic underlying the DFL algorithm, let us consider a BLN consisting of four genes, as shown in Figure 1. In this example, the function of each gene is listed in Table 3. The set of all genes is V = {A, B, C, D}, and we use X to denote subsets of V.
Definition 3.1 Let V = {X1, . . . , Xn}. Given a transition table T = {v(t) → v(t + 1)}, where v(t) is the state vector of the GRN model at time t, find a set of discrete functions F = {f1, f2, · · · , fn} so that Xi(t + 1) (Xi' hereafter) is calculated from fi as follows:

Xi(t + 1) = fi(Xi1(t), . . . , Xik(t)),

where t goes from 1 to a limited constant.
If the functions in F are Boolean, the GRN model is a BLN; otherwise the GRN model is a GLF or PLDE model. From Theorem 2.1, solving the problem in Definition 3.1 amounts to finding a group of genes X(t) = {Xi1(t), . . ., Xik(t)} such that the mutual information between X(t) and Xi' is equal to the entropy of Xi'. The problem therefore reduces to a search over all combinations of V = {X1, . . . , Xn}. There are 2^n such combinations, which makes the problem NP-complete. Fortunately, for GRNs, each gene is estimated to interact with four to eight other genes on average [12]. It is therefore sufficient to consider only combinations whose cardinalities are bounded by a small integer.
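To make the savings concrete, compare the full search space with the bounded one for illustrative values n = 20 and k = 8 (numbers assumed for the example, not taken from the paper):

```python
# Size of the search space: all 2^n subsets versus subsets of size at most k.
# Supports the argument that bounding the indegree k makes the search feasible.
from math import comb

n = 20                     # number of genes (illustrative)
k = 8                      # assumed maximum indegree
bounded = sum(comb(n, j) for j in range(1, k + 1))
print(2 ** n)              # 1048576 candidate subsets without the bound
print(bounded)             # 263949 subsets of cardinality 1..8
```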
3.2

Search method

Figure 1: The wiring diagram of a BLN model, where n = 4 and kmax = 4. X' denotes the state of X in the next time step.

One of the commonly used algorithms to infer BLNs from data is the REVEAL algorithm [13]. As shown in Figure 2, the REVEAL algorithm uses an exhaustive search: it first examines subsets with one gene, then subsets with two genes, and so on. If the gene under consideration is Y, the REVEAL algorithm calculates the mutual information between X and Y'. If I(X; Y') = H(Y'), it extracts function rules from the original transition table T and stops the calculation for Y, then moves on to find the Boolean function of the next gene. Compared with the REVEAL algorithm, the DFL algorithm uses a better heuristic when finding the target combination. An example is given in Figure 2, which shows the search procedure for finding the Boolean function of D' in the example of Figure 1 and Table 3.
The main steps of the DFL algorithm are listed in Table 1. The DFL algorithm uses the ∆ supersets introduced in Definition 3.2.
Table 2: The subroutine of the DFL algorithm.
Algorithm: Sub(Y, ∆Tree, H, D, k)
Input: Y, ∆Tree, entropy H(Y), current depth D, indegree k
Output: function Y = f(X)
Begin:
1    L ← ∆Tree.DthNode;
2    for every element X ∈ L {
3        calculate I(X; Y);
4        if (I(X; Y) == H) {
5∗           extract Y = f(X) from T;
6            return Y = f(X);
         }
     }
7    sort L according to I;
8    for every element X ∈ L {
9        if (D < k) {
10           D ← D + 1;
11           ∆Tree.DthNode ← ∆1(X);
12           return Sub(Y, ∆Tree, H, D, k);
         }
     }
13   return "Fail(Y)";
End
Figure 2: Search procedures of the DFL algorithm and the REVEAL algorithm when finding the Boolean function of D' in Figure 1. The solid line is for the DFL algorithm; the dashed line is for the REVEAL algorithm. The combinations marked with a black dot are the subsets that share the largest mutual information with D' on their layers. The REVEAL algorithm first searches the first layer (the subsets with one gene), then the second layer, and so on; finally, it finds the target subset {A, B, C, D} at the fourth layer. The DFL algorithm uses a different heuristic. First, it searches the first layer and finds that {A}, marked with a black dot, shares the largest mutual information with D' among the subsets on the first layer. Then, it continues to search ∆1(A) on the second layer. These calculations continue until the target combination {A, B, C, D} is found on the fourth layer.
∗ By deleting unrelated variables and duplicate rows in T.
Table 3: Boolean functions of the example, where "+" is the logical OR operation and "·" is the logical AND operation.
Gene    Rule
A       A' = B
B       B' = A + C
C       C' = (B · C) + (C · D) + (B · D)
D       D' = (A · B) + (C · D)
4
Experiments and Results
Due to the noise in gene expression data, the requirement of Theorem 2.1 may not be satisfied strictly, so some regulatory relations cannot be identified successfully. In this section, we introduce the concept of the ε function to overcome the problems incurred by noise in the data sets. We further show that some noise in the data sets of BLNs can be detected and removed with incompletely specified Karnaugh maps.
Firstly, the DFL algorithm searches the first layer, then sorts all subsets on that layer. It finds that {A} shares the largest mutual information with D' among the subsets on the first layer. The DFL algorithm then searches through ∆1(A), . . ., ∆k−1(A), always deciding the search order of ∆i+1(A) based on the calculation results for ∆i(A). If it still has not found a target subset satisfying the requirement of Theorem 2.1 by the kth layer, the DFL algorithm returns to the first layer. At this point, the first node on the first layer and all its ∆1, . . . , ∆k−1 supersets have already been checked, so it continues with the second node on the first layer (and all its ∆1, . . . , ∆k−1 supersets), then the third, and so on, until it reaches the end of the first layer.
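The search strategy can be sketched as a simplified greedy variant of the DFL heuristic (it omits the ∆Tree bookkeeping and the backtracking over the first layer described above; a sketch under those simplifications, not the authors' implementation):

```python
# Simplified sketch of the DFL search heuristic: repeatedly extend the
# current best subset by one gene until I(X; Y') = H(Y') or |X| = k.
from collections import Counter
from itertools import product
from math import log2

def entropy(vals):
    n = len(vals)
    return -sum((c / n) * log2(c / n) for c in Counter(vals).values())

def mi(cols, y, rows):
    xs = [tuple(r[i] for i in cols) for r in rows]
    return entropy(xs) + entropy(y) - entropy(list(zip(xs, y)))

def dfl_greedy(rows, y, n, k):
    h, chosen = entropy(y), ()
    for _ in range(k):
        # extend the current subset by the gene that maximizes I(X; Y')
        cands = [chosen + (g,) for g in range(n) if g not in chosen]
        chosen = max(cands, key=lambda c: mi(c, y, rows))
        if abs(mi(chosen, y, rows) - h) < 1e-12:
            return tuple(sorted(chosen))
    return None  # no subset of size <= k determines Y'

# Recover the inputs of D' = AB + CD from the full 16-row transition table.
rows = list(product((0, 1), repeat=4))
y = [(a & b) | (c & d) for a, b, c, d in rows]
print(dfl_greedy(rows, y, n=4, k=4))  # (0, 1, 2, 3): all four genes are needed
```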
4.1
The definition of the ε function
In Theorem 2.1, the exact functional relation results from the strict equality between the entropy H(Y) of Y and the mutual information I(X; Y) between X and Y. However, this equality is often ruined by noisy data, such as microarray gene expression data. In these cases, we can relax the requirement to obtain a compromise result. As shown in Figure 3, given a significance factor ε, if the difference between I(X; Y) and H(Y) is less than ε × H(Y), then we say Y is an ε function of X. Formally, we define the ε function as follows.

Definition 4.1 If H(Y) − I(X; Y) ≤ ε × H(Y), then Y = fε(X), where ε is a significance factor.
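Definition 4.1 translates directly into code; the toy AND example and the threshold values below are our own assumptions, not figures from the paper:

```python
# Sketch of the epsilon-function test of Definition 4.1: accept X as the
# input set of Y when H(Y) - I(X; Y) <= eps * H(Y). Illustrative only.
from collections import Counter
from math import log2

def entropy(vals):
    n = len(vals)
    return -sum((c / n) * log2(c / n) for c in Counter(vals).values())

def is_eps_function(xs, ys, eps):
    h = entropy(ys)
    i = entropy(xs) + entropy(ys) - entropy(list(zip(xs, ys)))
    return h - i <= eps * h

# Y = AND of the two inputs, with one corrupted observation appended.
xs = [(0, 0), (0, 1), (1, 0), (1, 1), (1, 1)]
ys = [0, 0, 0, 1, 0]          # the last pair (1,1) -> 0 is noise
print(is_eps_function(xs, ys, eps=0.0))   # False: strict equality fails
print(is_eps_function(xs, ys, eps=0.6))   # True: the noise is tolerated
```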
Figure 4: The incompletely specified Karnaugh map of the learned C'. The "-" in the figure is an unspecified state incurred by the noise.
Figure 3: The Venn diagram of H(X), H(Y) and I(X; Y) when Y = f(X). (a) The noiseless case, where the mutual information between X and Y equals the entropy of Y. (b) The noisy case, where the entropy of Y is not strictly equal to the mutual information between X and Y. The shaded region results from the noise. The ε function means that if the area of the shaded region is smaller than or equal to ε × H(Y), then Y = fε(X).
The Boolean functions for A', B' and D' can be correctly obtained in the same way.
The significance factor ε is adjustable for different noise levels in the data sets.

4.2

Results of synthetic data

Again, we use the example in Figure 1 and Table 3 to show that some noise in BLNs can be detected and removed with incompletely specified Karnaugh maps. We consider the learning of C' here. In our experiment, we add one wrong transition pair "(1111) → (0000)" to the transition table. There are 16 rows in the original transition table of the example, so the noise rate here is 1/17. The DFL algorithm successfully finds the correct network architecture of Figure 1 when ε = 0.27. The learned Boolean function table of C' is listed in Table 4. In Table 4, there are two rows with the same input 111 of BCD. Since C' is a deterministic function of BCD, one of these two rows results from noise.

Table 4: The learned function table of C'.
BCD     C'        BCD     C'
000     0         100     0
001     0         101     1
010     0         110     1
011     1         111     1
111∗    0
∗ is the noise

4.3

Results of real data

In this section, we use the gene expression data of the yeast Saccharomyces cerevisiae cell cycle from Cho et al. [14], which cover approximately two full cell cycles. In [15], Lee et al. reported a GRN related to the yeast cell cycle. The GRN consists of 11 well-known yeast cell cycle regulators: Mbp1, Swi4, Swi6, Mcm1, Fkh1, Fkh2, Ndd1, Swi5, Ace2, Skn7 and Stb1. We discretize the data set in [14] to two levels, then rearrange the expression values into state-transition pairs such that the expression values at the current time step are the product of the expression values at the prior time step. Finally, we apply the DFL algorithm to the obtained transition table. The learned models are shown in Figure 5.
Figure 5: The learned GRN models. (a) The number of discrete levels for the gene expression data is 2, the indegree is set to 5 and ε is 0. (b) The same, with ε = 0.2. The regulators are represented by ovals. A directed edge from gene A to gene B means that gene A is a regulator of gene B. The solid edges represent regulatory relations that have been verified by other approaches; the dashed edges represent regulatory relations that have not been verified.
Then, we draw a Karnaugh map for the function table, as in Figure 4. In Figure 4, we see that the noise entry in Table 4 produces an unspecified state in the Karnaugh map. However, the noise is correctly detected and removed after the merging rules of the incompletely specified Karnaugh map are applied. As shown in Figure 4, the final Boolean function of C' is C' = BC + BD + CD, which is the original function in Table 3.
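The two steps above, spotting the conflicting rows in Table 4 and checking that the merged function agrees with every specified state, can be sketched as follows (an illustrative consistency check, not the Karnaugh-map minimization procedure itself):

```python
# Sketch of noise detection and of checking the merged function against the
# incompletely specified table; this verifies consistency rather than
# performing the Karnaugh-map minimization itself.
from collections import defaultdict

def conflicting_inputs(table):
    """Inputs mapping to more than one output must contain noise."""
    seen = defaultdict(set)
    for x, y in table:
        seen[x].add(y)
    return {x for x, ys in seen.items() if len(ys) > 1}

# Learned table of C' (Table 4), including the injected noise row 111 -> 0.
table = [((0,0,0), 0), ((0,0,1), 0), ((0,1,0), 0), ((0,1,1), 1),
         ((1,0,0), 0), ((1,0,1), 1), ((1,1,0), 1), ((1,1,1), 1),
         ((1,1,1), 0)]
noisy = conflicting_inputs(table)
print(noisy)  # {(1, 1, 1)}: this input becomes an unspecified state

# Treat the conflicting input as a don't-care and check that the merged
# function C' = BC + BD + CD matches every remaining specified state.
merged = lambda b, c, d: (b & c) | (b & d) | (c & d)
specified = {x: y for x, y in table if x not in noisy}
print(all(merged(*x) == y for x, y in specified.items()))  # True
```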
All regulatory relations represented by solid edges in Figure 5 are verified in references [15, 16]. For instance, Swi4 transcription is regulated in late G1 by MBF (a complex of Mbp1 and Swi6) [16].
References
In Figure 5, this regulatory relation is identified in both (a) and (b). In Figure 5 (a), the DFL algorithm cannot find the regulators of Fkh2 due to the noise in the data. However, two regulators of Fkh2, Mbp1 and Ndd1, are successfully found in Figure 5 (b). The learned Boolean function of Fkh2 is listed in Table 5, where N, M and F represent the expression levels of Ndd1, Mbp1 and Fkh2 respectively.
[1] J. DeRisi, V. Iyer, and P. Brown, “Exploring the Metabolic and Genetic Control of Gene Expression on a Genomic Scale,” Science, vol. 278, no. 5338, pp. 680–686, 1997. [2] P. Spellman, G. Sherlock, M. Zhang, V. Iyer, K. Anders, M. Eisen, P. Brown, D. Botstein, and B. Futcher, “Comprehensive identification of cell cycle-regulated genes of the yeast Saccharomyces cerevisiae by microarray hybridization,” Molecular Biology of the Cell, vol. 9, pp. 3273–3297, 1998.
Table 5: The learned function table of F'.
NM      F'        NM      F'
10      1         00      0
11      1         01∗     0
01∗     1
∗ One of these two rows is the noise
[3] P. D’haeseleer, S. Liang, and R. Somogyi, “Genetic networks inference: from co-expression clustering to reverse engineering,” Bioinformatics, vol. 16, no. 8, pp. 707–726, 2000. [4] P. Smolen, D. Baxter, and J. Byrne, “Modeling transcriptional control in gene network: Methods, recent results, and future directions,” Bulletin of Mathematical Biology, vol. 62, pp. 274– 292, 2000.
After applying the merging rules of the incompletely specified Karnaugh map, we obtain F' = N. However, Fkh2 is controlled by many other regulators, such as Mbp1, Skn7, Swi4, Ace2, Fkh1, Mcm1 and Fkh2 itself [15]. Thus, we include Mbp1 in the Boolean function of Fkh2, giving F' = N + M. The conjectured reason why other regulators do not appear in the Boolean function of F' is the limited size of the data set, which contains the results of only 17 experiments.
5
[5] D. Endy and R. Brent, “Modelling cellular behavior,” Nature, vol. 409, no. 6818, pp. 391–395, 2001. [6] J. Hasty, D. McMillen, F. Isaacs, and J. Collins, “Computational studies of gene regulatory networks: in numero molecular biology,” Nature Review Genetics, vol. 2, no. 4, pp. 268–279, 2001.
Discussions
[7] H. Bolouri and E. Davidson, “Modeling transcriptional regulatory networks,” BioEssays, vol. 24, pp. 1119–1129, 2002.
We have introduced a new concept called the ε function to deal with noise when learning BLNs from gene expression data sets. Just like the P value in statistical methods, the ε function method can approximate the original function with a known and adjustable precision. We have also shown that some noise in the ε functions of BLNs can be detected and removed with incompletely specified Karnaugh maps. In reference [10], we show that the ε function method can also be applied to functions other than Boolean functions. However, some kinds of noise cannot be removed by this method. For example, if noise causes BCD = (101) to become an unspecified state, the final Boolean function of C' would be C' = BC + CD, which is incorrect. In such cases, the ε functions still offer the correct architectures of the BLNs although the precise Boolean functions are incorrect. The network architectures of GRN models are important because they can guide future research for biologists. As shown by Yuh et al. [17, 18], the prior models of the Endo16 gene of the sea urchin embryo were used to guide further research; in return, the original GRN model of Endo16 was refined by further experiments.
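The failure case described above can be checked the same way: with BCD = 101 unspecified instead of 111, the incorrect cover C' = BC + CD is also consistent with every specified state, so the merging rules can no longer single out the original function. A small check under that assumption:

```python
# If noise makes BCD = 101 (rather than 111) the unspecified state, the
# incorrect function C' = BC + CD also fits all specified states, so the
# Karnaugh-map merging cannot recover the original C' = BC + BD + CD.
from itertools import product

truth = lambda b, c, d: (b & c) | (b & d) | (c & d)   # original function
wrong = lambda b, c, d: (b & c) | (c & d)             # incorrect cover

specified = [x for x in product((0, 1), repeat=3) if x != (1, 0, 1)]
print(all(wrong(*x) == truth(*x) for x in specified))  # True
print(wrong(1, 0, 1) == truth(1, 0, 1))   # False: they differ only at 101
```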
[8] J. Hasty, D. McMillen, and J. Collins, "Engineered gene circuits," Nature, vol. 420, pp. 224–230, 2002.

[9] H. de Jong, "Modeling and simulation of genetic regulatory systems: A literature review," Journal of Computational Biology, vol. 9, no. 1, pp. 67–103, 2002.

[10] Y. Zheng and C. K. Kwoh, "Dynamic algorithm for inferring qualitative models of gene regulatory networks," in Proceedings of 3rd Computer Society Bioinformatics Conference, CSB 2004, 2004.

[11] R. W. Yeung, A First Course in Information Theory. New York, NY: Kluwer Academic/Plenum Publishers, 2002.

[12] M. Arnone and E. Davidson, "The hardwiring of development: organization and function of genomic regulatory systems," Development, vol. 124, pp. 1851–1864, 1997.
[13] S. Liang, S. Fuhrman, and R. Somogyi, "Reveal, a general reverse engineering algorithm for genetic network architectures," in Proceedings of Pacific Symposium on Biocomputing '98, vol. 3, pp. 18–29, 1998.

[14] R. J. Cho, M. J. Campbell, E. A. Winzeler, L. Steinmetz, A. Conway, L. Wodicka, T. G. Wolfsberg, A. E. Gabrielian, D. Landsman, D. J. Lockhart, and R. W. Davis, "A genome-wide transcriptional analysis of the mitotic cell cycle," Molecular Cell, vol. 2, pp. 65–73, 1998.

[15] T. I. Lee, N. J. Rinaldi, F. Robert, D. T. Odom, Z. Bar-Joseph, G. K. Gerber, N. M. Hannett, C. T. Harbison, C. M. Thompson, I. Simon, J. Zeitlinger, E. G. Jennings, H. L. Murray, D. B. Gordon, B. Ren, J. J. Wyrick, J.-B. Tagne, T. L. Volkert, E. Fraenkel, D. K. Gifford, and R. A. Young, "Transcriptional Regulatory Networks in Saccharomyces cerevisiae," Science, vol. 298, no. 5594, pp. 799–804, 2002.

[16] I. Simon, J. Barnett, N. Hannett, C. Harbison, N. J. Rinaldi, T. Volkert, J. Wyrick, J. Zeitlinger, D. Gifford, T. Jaakkola, and R. Young, "Serial regulation of transcriptional regulators in the yeast cell cycle," Cell, vol. 106, pp. 697–708, 2001.

[17] C.-H. Yuh, H. Bolouri, and E. Davidson, "Genomic Cis-Regulatory Logic: Experimental and Computational Analysis of a Sea Urchin Gene," Science, vol. 279, no. 5358, pp. 1896–1902, 1998.

[18] C.-H. Yuh, H. Bolouri, J. Bower, and E. Davidson, Computational Modeling of Genetic and Biochemical Networks, ch. A logical model of cis-regulatory control in eukaryotic system, pp. 73–100. Cambridge, MA: MIT Press, 2001.