Computational Regulomics: Information Theoretic Approaches Towards Regulatory Network Inference

Vijender Chaitankar (M.S.), Preetam Ghosh (M.S., PhD)
Department of Computer Science, Virginia Commonwealth University, Richmond, Virginia 23284-3019

ABSTRACT
Molecular biology has uncovered the complex nature and the various components of a cell. Building on this knowledge, systems biology today aims at deciphering how these components come together to create living systems [1]. The regulatory network is one such component, and inference or reverse engineering of regulatory networks is the process of discovering the interactions in these networks by reasoning backwards from observations of their behavioral patterns obtained from microarray experiments [1]. Uncovering these underlying regulatory interactions is of great importance, as it helps in understanding disease dynamics and aids in drug development. To this end, a number of mathematical and computational models have been developed for reverse engineering of regulatory networks, with a limiting trade-off between the accuracy and the size of the inferred networks. Information theoretic approaches are widely used today for network inference as they are able to infer genome-scale regulatory networks. This chapter introduces reverse engineering approaches towards regulatory network inference using expression data and discusses some novel information theory based approaches in detail.

KEY WORDS: Reverse Engineering, Inference, Information Theory, Regulatory Networks, Time Lags.

CONTENTS
1 Introduction
1.1 Microarray Experiments
1.2 Information Theoretic Metrics
1.2.1 Entropy (H)
1.2.2 Mutual Information (MI)
1.2.3 Conditional Mutual Information (CMI)
2 Methods/Algorithms
2.1 Relevance Networks Based Algorithms
2.1.1 Relevance Network Algorithm
2.1.2 Algorithm for Reconstruction of Genetic Networks (ARACNE)
2.1.3 Context Likelihood of Relatedness (CLR)
2.1.4 Direct Connectivity Algorithm
2.2 Minimum Description Length Principle Based Algorithms
2.2.1 Minimum Description Length Principle
2.2.2 Genetic Network Formulation
2.2.3 Description Length Computation
2.2.4 Data Length Computation
2.2.5 Network MDL Algorithm
2.2.6 Predictive Minimum Description Length Principle (PMDL)
2.2.7 PMDL Algorithm
2.2.7.1 Time and Space Complexities of the PMDL Algorithm
2.3 Time Lagged Information Theoretic Approaches
2.3.1 Why do information theory based approaches saturate?
2.3.2 Time Lags
2.3.3 Time Lagged Mutual Information (TLMI) and Time Lagged Conditional Mutual Information (TLCMI)
2.3.4 Time Lagged Information Theory Based Approaches
2.4 REVEAL Based Algorithms
2.4.1 sREVEAL-1
2.4.2 sREVEAL-2
2.4.3 rREVEAL
2.4.3.1 Complexity Analysis of rREVEAL
2.5 Inference Using Knock-out Data
2.5.1 Knock-out Data
2.5.2 Knock-out Network Generation
2.5.3 Number of Parents in the E. coli Genome
2.5.4 Network Generation Using Parent Restriction
2.5.5 Knock-out Data Incorporation in rREVEAL
3 Applications
4 World Wide Web Resources
REFERENCES
1 Introduction
A genetic regulatory network, at an abstract level, can be conceptualized as a network of interconnected genes/transcription factors which respond to internal and external conditions by altering the relevant connections within the network; here a connection represents the regulation of a gene by another gene/protein. Babu et al. [2] organized the various approaches towards inferring gene regulatory networks into three fundamental categories:
1. Template based methods
2. Inferring networks by predicting cis-regulatory elements
3. Reverse engineering using gene expression data
While the first two approaches are more biologically oriented, the third relies primarily on mathematical and computational methods. Reverse engineering approaches using gene expression data scan for patterns in time series microarray data sets, generally organized as a gene expression matrix [2]. The gene expression matrix is created after a series of steps performed over the microarray chip, which include image processing, transformation and normalization. In a gene expression matrix, a row generally represents a gene and a column represents an external condition or a specific time point [3]. The advantage of the reverse engineering approach using gene expression data is that it does not require any prior knowledge to infer the regulatory networks, while incorporating such prior knowledge would yield even better inference accuracy. For example, since it is known that transcription factors regulate genes, giving a list of transcription factors as prior knowledge to these algorithms greatly improves the results. Understanding the global properties of regulatory networks has also led to improved inference results. One such global property, based on empirical studies of the yeast regulatory network, is that a gene is generally regulated by a small number of transcription factors [4]. This property greatly reduces the computational complexity of popular methods such as dynamic Bayesian networks (DBN) [5-7] and probabilistic Boolean networks (PBN) [8-9], to name a few. Figure 1 outlines the inference/reverse engineering process. The main input that a reverse engineering algorithm requires is the gene expression matrix, while other inputs such as prior knowledge (e.g., known interactions between genes) and other kinds of experimental data (e.g., gene knock-out data) can also be utilized to obtain better results.
1.1 Microarray Experiments
Microarrays help biologists record the expression levels of genes in an organism at a genome scale. A microarray is a glass plate with thousands of spots; each spot contains identical DNA molecules that uniquely identify a gene. In a microarray experiment, the RNA from the cells is first extracted; the extracted RNAs are then reverse transcribed into cDNA. The cDNAs are labeled with colored dyes and allowed to hybridize on the microarray glass plate. At this stage the labeled cDNAs hybridize to the complementary sequences on the spots. If the concentration of a particular gene is high, the corresponding spot on the microarray plate shows the dye color. There are a number of ways in which a microarray experiment can record expression levels. The most popular one is to compare the expression levels of a cell exposed to a particular external condition to those of a reference cell in normal condition. In this approach, cells under the two different conditions are labeled with two different dye colors. If a particular gene is expressed in only one condition, the corresponding spot on the microarray plate shows the color for that condition. If it is not expressed in either condition, the actual color of the spot (usually black) shows up. It is also possible that a gene is expressed in both conditions; in such cases the spot shows a variant color which is the combination of the two chosen colors. After hybridization, the colors on the microarray plate are scanned and recorded by a machine. The microarray data is then analyzed, which includes image processing, transformation and normalization; for more details on these steps, refer to [3]. Different microarray normalization and quantization methods might yield different networks, and hence need to be carefully selected. The data after this analysis is used as input by the reverse engineering algorithms that we discuss in this chapter. We next present some mathematical foundations on information theory.

1.2 Information Theoretic Metrics
This section introduces some of the basic metrics of information theory that are used in the algorithms for reverse engineering regulatory networks.

1.2.1 Entropy (H)
Entropy is a measure of the average uncertainty in a random variable. The entropy of a random variable X with probability mass function p(x) is defined [10] by

H(X) = -\sum_{x \in X} p(x) \log p(x)    (1)

1.2.2 Mutual Information (MI)
Mutual information measures the amount of information that can be obtained about one random variable by observing another one. MI is defined [10] as

I(X;Y) = \sum_{x,y} p(x,y) \log \frac{p(x,y)}{p(x)\,p(y)}    (2)

MI can also be defined in terms of entropies [10] as

I(X;Y) = H(X) + H(Y) - H(X,Y)    (3)

1.2.3 Conditional Mutual Information (CMI)
Conditional mutual information is the reduction in the uncertainty of X due to knowledge of Y when Z is given [11]. The CMI of random variables X and Y given Z is defined [10-11] as

I(X;Y|Z) = \sum_{x,y,z} p(x,y,z) \log \frac{p(x,y|z)}{p(x|z)\,p(y|z)}    (4)

CMI can also be expressed in terms of entropies as [11]:

I(X;Y|Z) = H(X,Z) + H(Y,Z) - H(Z) - H(X,Y,Z)    (5)
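To make these definitions concrete, the short sketch below estimates the three quantities from quantized expression vectors using empirical (plug-in) probabilities and the entropy identities of equations (3) and (5). The gene profiles and the two-level quantization are illustrative assumptions, not data taken from this chapter.

import numpy as np
from collections import Counter

def entropy(*vars_):
    """Empirical (plug-in) joint entropy of one or more discrete vectors, in bits."""
    counts = Counter(zip(*vars_))
    n = sum(counts.values())
    p = np.array([c / n for c in counts.values()])
    return float(-np.sum(p * np.log2(p)))

def mutual_information(x, y):
    """I(X;Y) = H(X) + H(Y) - H(X,Y), Eqn (3)."""
    return entropy(x) + entropy(y) - entropy(x, y)

def conditional_mutual_information(x, y, z):
    """I(X;Y|Z) = H(X,Z) + H(Y,Z) - H(Z) - H(X,Y,Z), Eqn (5)."""
    return entropy(x, z) + entropy(y, z) - entropy(z) - entropy(x, y, z)

# Hypothetical two-level quantized expression profiles of three genes.
gene_x = [0, 1, 1, 0, 1, 1, 0, 0]
gene_y = [0, 1, 1, 0, 1, 0, 0, 1]
gene_z = [1, 1, 0, 0, 1, 1, 0, 0]

print(entropy(gene_x))                                          # H(X)
print(mutual_information(gene_x, gene_y))                       # I(X;Y)
print(conditional_mutual_information(gene_x, gene_y, gene_z))   # I(X;Y|Z)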
2 Methods/Algorithms
This chapter focuses on information theory based inference of regulatory networks using expression data. While approaches such as dynamic Bayesian networks (DBNs) [5-7] do infer quality networks, their capability is limited to inferring small regulatory networks (~300 nodes, i.e., 300 genes and transcription factors). As microarray experiments provide data for thousands of genes, these approaches cannot utilize the full potential of the available data; usually a subset of the complete data set, such as the differentially expressed genes, is filtered out and given as input to these approaches. Information theoretic approaches, in contrast, are capable of inferring networks at genome scale. In the following subsections we discuss some of the popular information theory based algorithms. The algorithms discussed in this chapter are classified into the following four categories:
1. Relevance networks based algorithms
2. Minimum description length based algorithms
3. Time-lag based algorithms
4. REVEAL based algorithms
2.1 Relevance Networks Based Algorithms
This section discusses the relevance network algorithm [12-13] and the algorithms that were built on top of relevance networks. This family of algorithms is computationally very efficient and is capable of inferring very large networks.

2.1.1 Relevance Network Algorithm
Relevance networks [12-13] is the simplest and computationally most efficient information theory based approach. It computes the pair-wise MI between every pair of genes in the given expression data. An edge between a pair of genes exists if the corresponding MI is greater than a user-selected threshold. The quality of the inferred networks therefore depends heavily on the chosen threshold: a small threshold infers a network with a high number of true edges and false edges, whereas a high threshold infers a network with a low number of true and false edges. (If an inferred regulatory interaction is correct, the edge representing it is a true edge; if the inferred interaction is wrong, the edge is a false edge.) Selection of a proper threshold is thus an issue in this approach.

2.1.2 Algorithm for Reconstruction of Genetic Networks (ARACNE)
The relevance networks concept assumed that if the MI is low (below the user-defined threshold) then the genes are not connected, while they are connected if the MI is high. Based on the study of chemical kinetics, it has been found that the second assumption does not always hold [11]. Consider a network with three genes X, Y and Z. If gene X regulates gene Y, and gene Y regulates gene Z, then the MI between genes X and Z can be high, which leads to a false edge. ARACNE [14] was the first inference algorithm to implement a method to identify and prune such false edges. ARACNE states that if the MI between two genes X and Y is less than or equal to the smaller of the MI between X and Z and that between Y and Z, i.e., I(X,Y) ≤ min{I(X,Z), I(Y,Z)}, then there is no direct connectivity between X and Y.

2.1.3 Context Likelihood of Relatedness (CLR)
Various improvements over the relevance networks approach were achieved over the last decade, with CLR [15] being the most popular amongst them today. CLR applies an adaptive background correction step to eliminate false connections and indirect influences. In this approach all pair-wise MI values are computed, as in the relevance network approach, and stored in a matrix: if inference is performed over a data set with expression data for N genes, the pair-wise MI values are stored in a matrix M with N rows and N columns. For every entry in M, Z-scores are computed over its row (z_1) and its column (z_2). After the Z-score computation, the score for each pair-wise edge is computed as \sqrt{z_1^2 + z_2^2}. The score for every edge is thus computed and stored in a matrix (say Z), which also has N rows and N columns. The network is finally obtained by applying a user-defined threshold over the Z matrix. For more information on computing Z-scores refer to [15].
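As an illustration of the CLR scoring step, the sketch below converts a pairwise MI matrix into edge scores. The clipping of negative Z-scores at zero and the example MI values are assumptions borrowed from common CLR implementations, not details taken from [15].

import numpy as np

def clr_scores(mi, clip_negative=True):
    """Turn a symmetric pairwise-MI matrix into CLR-style edge scores.

    For each pair (i, j), z1 is the Z-score of MI[i, j] against row i and
    z2 against row j; the edge score is sqrt(z1^2 + z2^2).  Clipping
    negative Z-scores at zero is an assumption of this sketch."""
    mean = mi.mean(axis=1, keepdims=True)
    std = mi.std(axis=1, keepdims=True) + 1e-12   # avoid division by zero
    z = (mi - mean) / std
    if clip_negative:
        z = np.clip(z, 0.0, None)
    return np.sqrt(z ** 2 + z.T ** 2)

# Hypothetical 4-gene MI matrix (symmetric, zero diagonal).
mi = np.array([[0.0, 0.9, 0.2, 0.1],
               [0.9, 0.0, 0.3, 0.2],
               [0.2, 0.3, 0.0, 0.8],
               [0.1, 0.2, 0.8, 0.0]])

scores = clr_scores(mi)
edges = np.argwhere(np.triu(scores > 1.5, k=1))   # user-selected threshold
print(edges)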
2.1.4 Direct Connectivity Algorithm
Like ARACNE and CLR, the direct connectivity algorithm [11] is also built on top of relevance networks. Here the false-edge pruning is performed using the conditional mutual information metric. The algorithm initially builds a relevance network; then, for every inferred edge, it computes the conditional mutual information between the two nodes of that edge and every other node in the network. If the conditional mutual information falls below a user-specified threshold, the edge is deleted. This algorithm, like the previously discussed ones, also suffers from the threshold selection problem.

2.2 Minimum Description Length Principle Based Algorithms
This section explains the minimum description length (MDL) principle [16-18] and the algorithms based on it.

2.2.1 Minimum Description Length Principle
The MDL principle states that if multiple theories exist, the theory with the smallest description length is the best. This principle, together with the entropy metric, can be used to exploit the robustness property of a regulatory network. It also solves the MI threshold selection problem that exists in the relevance networks based approaches. The description length for network inference was first formulated in [19]. In order to understand the description length, let us first define the network formulation.

2.2.2 Genetic Network Formulation
The network formulation is similar to the one used in [19]. A graph G(V, E) represents a network, where V denotes a set of genes and E denotes a set of regulatory relationships between genes. If gene x shares a regulatory relationship with gene y, then there exists an edge between x and y (x → y). Genes can have more than one regulator. The notation P(x) is used to represent the set of genes that share regulatory relationships with gene x; for example, if gene x shares regulatory relationships with y and z, then P(x) = {y, z}. Every gene is also associated with a function f_x^{P(x)}, which denotes how the expression value of gene x is determined by the values of the genes in P(x).
Gene expression is affected by many environmental factors. Since it is not possible to incorporate all such factors, the regulatory functions are assumed to be probabilistic. The gene expression values are also assumed to be discrete, and the probabilistic regulation functions are represented as look-up tables. If the expression levels are quantized to q levels and a gene x has n predecessors, then the look-up table has q^n rows and q columns, and every entry in the table corresponds to a conditional probability. Say we have genes x, y and z and the data is quantized to 2 levels; an example look-up table is given in Table 1. In this example the entry 0.6 can be read as follows: if genes y and z are lowly expressed, then the probability that x is also lowly expressed is 0.6.
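A minimal sketch of how such a look-up table can be represented in code, using the Table 1 entries for a gene x with parents P(x) = {y, z} and two-level quantization:

# Keys are parent states (y, z); values are the conditional distribution
# over x's expression levels, exactly as in Table 1.
lookup_x = {
    (0, 0): {0: 0.6, 1: 0.4},
    (0, 1): {0: 0.3, 1: 0.7},
    (1, 0): {0: 0.5, 1: 0.5},
    (1, 1): {0: 0.8, 1: 0.2},
}

# P(x = 0 | y = 0, z = 0) -- the 0.6 entry discussed above.
print(lookup_x[(0, 0)][0])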
2.2.3 Description Length Computation
The description length is the sum of the model length and the data length. The model length in [19] was defined as the amount of memory consumed by the algorithm. The complex part of the description length computation is the data length; the following section gives a detailed explanation of the data length computation.

2.2.4 Data Length Computation
Due to the probabilistic nature of the network, a gene can take any value when transitioning from one time point to another. The network is associated with a Markov chain which is used to model the state transitions. The states are represented as n-gene expression vectors X_t = (x_{1,t}, \ldots, x_{n,t})^T and the transition probability p(X_{t+1}|X_t) can be derived as follows:

p(X_{t+1} \mid X_t) = \prod_{i=1}^{n} p\big(x_{i,t+1} \mid \mathbb{P}_t(x_i)\big)    (6)

The probability p(x_{i,t+1} | \mathbb{P}_t(x_i)) can be obtained from the look-up table associated with the vertex x_i and is assumed to be time invariant. It is estimated as follows:

p\big(x_{i,t+1} = j \mid \mathbb{P}_t(x_i)\big) = \frac{1}{m-1} \sum_{t=1}^{m-1} 1_{\{j\}}\big(x_{i,t+1} \mid \mathbb{P}_t(x_i)\big)    (7)

Each state transition brings some new information, which is measured by the conditional entropy:

H(X_{t+1} \mid X_t) = -\log\big(p(X_{t+1} \mid X_t)\big)    (8)

The total entropy for the given m time-series sample points (X_1, \ldots, X_m) is as follows:

L_D = H(X_1) + \sum_{j=1}^{m-1} H(X_{j+1} \mid X_j)    (9)

As H(X_1) is the same for all models it is removed, and thus the description length is given by

L_D = \sum_{j=1}^{m-1} H(X_{j+1} \mid X_j)    (10)
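The following sketch puts equations (7)-(10) together: it estimates the conditional probabilities empirically from a quantized time series for a given parent assignment and sums the per-transition code lengths. The toy data, the gene names and the normalization of counts within each parent state are illustrative assumptions of this sketch, not the published implementation of [19].

import math
from collections import defaultdict

def description_length(data, parents):
    """Data length L_D of Eqn (10) for quantized time-series `data`.

    `data` maps gene -> list of m quantized values; `parents` maps
    gene -> list of its regulators.  Conditional probabilities are
    empirical estimates in the spirit of Eqn (7), normalized within
    each observed parent state."""
    genes = list(data.keys())
    m = len(next(iter(data.values())))

    # Count transitions (parent state at t) -> (gene value at t+1).
    counts = {g: defaultdict(lambda: defaultdict(int)) for g in genes}
    for g in genes:
        for t in range(m - 1):
            state = tuple(data[p][t] for p in parents.get(g, []))
            counts[g][state][data[g][t + 1]] += 1

    # Eqns (6), (8)-(10): sum the conditional entropies of the observed transitions.
    L_D = 0.0
    for t in range(m - 1):
        for g in genes:
            state = tuple(data[p][t] for p in parents.get(g, []))
            total = sum(counts[g][state].values())
            prob = counts[g][state][data[g][t + 1]] / total
            L_D += -math.log2(prob)
    return L_D

# Hypothetical quantized series and a candidate parent assignment.
data = {"x": [0, 1, 1, 0, 1, 0], "y": [1, 1, 0, 0, 1, 1], "z": [0, 0, 1, 1, 0, 0]}
print(description_length(data, {"x": ["y", "z"], "y": ["z"], "z": []}))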
2.2.5 Network MDL Algorithm
In the network MDL approach [19], every pair-wise MI value greater than zero is used as a threshold to generate an initial network. In order to select the best network, the algorithm computes the MDL score for every network and chooses the one with the minimum score. The model length computation here is not standardized, and since a fine-tuning parameter is used to control the effect of the model length, the method can be somewhat arbitrary. Another major disadvantage of this approach is that it is not scalable, as it is computationally very complex. The complexity can be reduced by restricting the number of regulators inferred for each gene in the initial network to between three and six. The method is also biased towards very high MI threshold values: an initial network generated using a very high MI threshold will be sparse and is likely to have a very low data length. Such a network is not desired, and in this regime the algorithm might not produce very good results. However, under certain settings, such as a restricted number of regulators and a mutual information threshold that is neither very high nor very low, this algorithm can infer quality networks.

2.2.6 Predictive Minimum Description Length Principle (PMDL)
The description length of the two-part MDL principle involves calculation of the model length and the data length. As the model length can vary across models, the method is in danger of being biased by the length of the model [16]. The normalized maximum likelihood model has been implemented in [20] to overcome this issue. Another such model, based on the universal code length, is the PMDL principle [18]. The description length for a model in PMDL [17] is given as:

L_D = -\sum_{t=0}^{m-1} \log\big(p(X_{t+1} \mid X_t)\big)    (11)
where p(X_{t+1}|X_t) is the conditional probability or density function. In the PMDL algorithm this description length is equivalent to the data length given in [19]. The next section gives a detailed explanation of the PMDL algorithm.

2.2.7 PMDL Algorithm
Given the time series data, the data is first pre-processed, which involves filling missing values and quantizing the data. Then the MI matrix M_{n×n} is evaluated to infer a connectivity matrix C_{n×n} which has two kinds of entries, 0 and 1: an entry of 0 indicates that no regulatory relationship exists between the corresponding genes, while an entry of 1 at C_{i,j} indicates that gene i regulates gene j. The algorithm is given in Figure 2 [21-23]. From lines 5 to 18 (Figure 2), every value of the MI matrix is used as a threshold and a model is obtained; the conditional probabilities and the description lengths for each of these models are evaluated using equations (7) and (10) respectively. Then, at line 19, the MI value that produced the model with the shortest description length is used as the MI threshold to obtain the initial connectivity matrix. From lines 20 to 31, for every valid regulatory connection in the connectivity matrix, the CMI of the two connected genes given every other gene is evaluated, and if the value falls below the user-specified threshold the connection is deleted. As the MDL principle selects the model with the smallest description length as the best one, the best model here is the one with the lowest entropy, i.e., the lowest uncertainty. This approach thus exploits the robustness property of biological networks.

2.2.7.1 Time and Space Complexities of the PMDL Algorithm
The performance of the PMDL algorithm depends on three factors: the number of genes, the number of time points in the expression matrix and, most importantly, the number of parents inferred for each gene. This section gives the time and space complexities of the algorithm. Step 4 of the algorithm iterates n²m times, where n is the number of genes and m is the number of time points; from line 5 to line 18 the algorithm iterates n⁴ times; lines 15 and 16 iterate n³m times. Finally, from lines 20 to 31 the algorithm iterates n³ times. Thus, the time complexity of the algorithm is Θ(n⁴ + n³m). For the space complexity, the conditional probability tables play the major role: with two-level quantization, if a gene has n parents then its conditional probability table takes 2ⁿ units of space. The amount of memory needed by the algorithm therefore depends on the number of parents inferred for the network. As the space complexity grows exponentially with the number of parents, the algorithm may run out of memory for a data set with as few as 50 genes, yet finish in as little as 5 minutes for a data set with several hundred genes. This memory limitation can be addressed by restricting the number of regulators each gene can have.
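A compact sketch of the Figure 2 loop is given below: every MI value is tried as a threshold, the candidate network with the smallest description length fixes the threshold, and CMI pruning then removes indirect edges. The helpers description_length() and cond_mutual_info() are assumed to follow equations (10) and (4), for example the sketches given earlier in this chapter; this is an outline of the procedure under those assumptions, not the authors' released code.

import numpy as np

def pmdl_network(mi, data, cmi_threshold, description_length, cond_mutual_info):
    """Sketch of the PMDL procedure.  `mi[i, j]` is the MI between the i-th
    and j-th genes of `data` (insertion order); C[i, j] = 1 means gene i
    regulates gene j, so the parents of gene j are the rows k with C[k, j] = 1."""
    genes = list(data.keys())
    best_len, best_C = np.inf, None
    for delta in np.unique(mi):                       # lines 5-18 of Figure 2
        C = (mi >= delta).astype(int)
        np.fill_diagonal(C, 0)
        parents = {g: [genes[k] for k in range(len(genes)) if C[k, j]]
                   for j, g in enumerate(genes)}
        L = description_length(data, parents)
        if L < best_len:
            best_len, best_C = L, C
    C = best_C.copy()                                 # line 19: chosen threshold
    for i, gi in enumerate(genes):                    # lines 20-31: CMI pruning
        for j, gj in enumerate(genes):
            if C[i, j]:
                for k, gk in enumerate(genes):
                    if k not in (i, j) and cond_mutual_info(data[gi], data[gj], data[gk]) < cmi_threshold:
                        C[i, j] = 0
                        break
    return C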
2.3 Time Lagged Information Theoretic Approaches
It was observed that the accuracy of information theory based approaches saturates beyond specific data sizes [24, 29] (in terms of the number of time points, or columns, in the expression matrix). Further analysis of why these approaches saturate was the motivation for conceiving the time lagged information theory approaches, which we discuss next.

2.3.1 Why do information theory based approaches saturate?
A visual analysis of how the values of the basic metrics change with the data size was performed in [24, 29]. For this analysis, synthetic biological data was created using the GeneNetWeaver tool [25-26]: a five gene network and expression data with 100 time points were generated, the synthetic data was quantized to two levels, and the information theoretic quantities were then calculated. The entropy of every gene in the network was computed across the 100 time points, while the conditional entropy and MI were computed for each pair of genes. It was observed that with more data (and correspondingly more time points) both the entropies and the conditional entropies in the network increased (tending to unity), while the MI decreased (tending to zero). This indicates that the saturation in the inference accuracy of information theoretic approaches is due to the saturation of the mutual information quantity, which approaches zero even as the entropy in the network increases. Conceptually, this means that there is room to improve the inference accuracy (due to the high entropy), yet the mutual information metric cannot point us in the right direction. Other information theoretic algorithms, like REVEAL [27], instead use the ratio of MI and entropy to infer the network, which supposedly gives good performance. However, as both the entropy and mutual information metrics saturate, the ratio of mutual information and entropy will also saturate; hence this ratio might also not be the right metric for achieving better accuracy from more data.
The Directed Mutual Information metric [28] might be a better choice than the conventional MI used in these algorithms. The saturation of MI with an increasing number of time points also suggests that the MI should not be computed over the entire range of time points of the microarray data available from the experiments. GRNs are inherently time varying, and hence the pair-wise MI between any two genes needs to be computed over the time range in which the first gene has a substantial regulatory effect on the other. This range can be best approximated by estimating the regulatory time lag between each gene pair and subsequently computing the MI between them over this particular time range. The time lag computation concept was initially proposed in [7] to predict potential regulators in a DBN based scheme.

2.3.2 Time Lags
The concept of time lags was first introduced by Zou et al. [7], who proposed that the time difference between the initial expression change of a potential regulator (parent) and that of its target gene represents a biologically relevant time period. Here the potential regulators are the genes whose initial expression change happens before that of the target gene, and an initial expression change is the up- or down-regulation (ON or OFF) of a gene. With Zou et al.'s method of calculating time lags, for every pair of genes whose initial expression changes do not occur at the same time point, one of the two time lags turns out to be negative. Figure 3 illustrates this problem. In the figure, Ia and Ib indicate the initial changes in expression of gene A and gene B at time points 2 and 3 respectively. As per Zou et al. [7], gene A is a parent of gene B, and the time lag from B to A is not considered. The time lag from A to B is Ib − Ia = 3 − 2 = 1, while the time lag from B to A is Ia − Ib = 2 − 3 = −1; this negative time lag implies that a directed edge cannot exist from B to A. However, it is important to consider time lags in both the forward and backward directions (i.e., from A to B and vice versa), as this can model loops between two genes (i.e., A→B and B→A connections). Zou et al.'s time lag computation cannot handle such cases. One can argue that a gene can regulate another gene only when it is up-regulated (ON). Based on this argument, a new time lag computation approach was proposed in [29], wherein the time lag is defined as the difference between the initial up-regulation of the first gene and the initial expression change of the second gene after the up-regulation of the first gene. Figure 3 also illustrates this time lag computation approach. In the figure, Ua and Ub indicate the initial up-regulation of genes A and B at time points two and three respectively, while Ca and Cb indicate that time points six and three are the time points where the expression values of genes A and B changed after the initial up-regulation of genes B and A respectively. The time lag from A to B is calculated as τ1 = Cb − Ua and the time lag from B to A as τ2 = Ca − Ub. In this example the time lag from A to B is one and the time lag from B to A is three.
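A small sketch of the [29] time lag computation on binary (two-level) profiles is given below; the treatment of profiles that are ON at the very first time point and the example vectors (chosen only to reproduce the Figure 3 lags of 1 and 3) are simplifying assumptions.

def initial_upregulation(expr):
    """First time point at which the quantized expression switches to 1 (ON)."""
    for t in range(1, len(expr)):
        if expr[t] == 1 and expr[t - 1] == 0:
            return t
    return None   # no up-regulation observed (sketch-level simplification)

def time_lag(a, b):
    """Time lag from gene a to gene b following [29]: the gap between a's
    initial up-regulation and b's first expression change after that point.
    Returns None when no lag can be defined."""
    ua = initial_upregulation(a)
    if ua is None:
        return None
    for t in range(ua + 1, len(b)):
        if b[t] != b[t - 1]:
            return t - ua
    return None

# Hypothetical binary profiles loosely following Figure 3 (genes A and B).
gene_a = [0, 1, 1, 1, 1, 0]
gene_b = [0, 0, 1, 1, 1, 1]
print(time_lag(gene_a, gene_b), time_lag(gene_b, gene_a))   # 1 and 3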
2.3.3 Time Lagged Mutual Information (TLMI) and Time Lagged Conditional Mutual Information (TLCMI)
Once time lags are used, the information theoretic metrics no longer need to be computed over the complete expression data. Say the expression matrix has E columns of expression data. If a time lag τ is computed between two genes A and B, we remove the last τ columns from the row of the expression matrix representing gene A and the first τ columns of the row representing gene B; computing the MI over this reduced expression data gives the TLMI. Note that TLMI is not a symmetric quantity like MI, i.e., TLMI(A, B) ≠ TLMI(B, A). Considering a time lag τ between A and B, we compute TLCMI(A;B|C) by deleting the last τ columns of the rows representing genes A and C and the first τ columns of the row representing gene B; computing the CMI over this reduced expression data gives the TLCMI. Examples of time lagged information theoretic quantities are given in [29].
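The column shifting described above can be written directly as below. Here mi_fn and cmi_fn stand for any MI/CMI estimators (for example the sketches from Section 1.2); returning zero for undefined lags is an assumption made for brevity, not part of the definition in [29].

def tlmi(a, b, lag, mi_fn):
    """Time-lagged MI: drop the last `lag` points of the regulator profile
    and the first `lag` points of the target profile, then apply an MI
    estimator.  Note the asymmetry: tlmi(a, b, ...) != tlmi(b, a, ...)."""
    if lag is None or lag <= 0 or lag >= len(a):
        return 0.0
    return mi_fn(a[:-lag], b[lag:])

def tlcmi(a, b, c, lag, cmi_fn):
    """Time-lagged CMI: genes a and c are trimmed at the end, gene b at the start."""
    if lag is None or lag <= 0 or lag >= len(a):
        return 0.0
    return cmi_fn(a[:-lag], b[lag:], c[:-lag])

# Example (reusing the hypothetical profiles and helpers from the earlier sketches):
# print(tlmi(gene_a, gene_b, time_lag(gene_a, gene_b), mutual_information))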
2.3.4 Time Lagged Information Theory Based Approaches
Virtually any information theory based approach can be converted to its time lagged equivalent [29-30] by replacing the information theoretic metrics with the corresponding time lagged quantities. A drawback of time lagged approaches is that they can only be applied to time series data sets.

2.4 REVEAL Based Algorithms
The REVEAL algorithm [27] starts with pair-wise combinations. If the r-score (the ratio of the mutual information between the regulating gene and the target gene to the entropy of the gene being regulated) equals one, REVEAL states that a regulatory interaction exists between these two genes and proceeds to generate the corresponding wiring rules. If not all genes are explained by a single regulator, REVEAL next checks whether a combination of two genes regulates each of the remaining genes. If there are still genes for which the MI to entropy ratio did not equal one, REVEAL then considers combinations of three genes, and so on. The algorithm thus iteratively considers more complex genetic interactions, and thereby has exponential time and space complexities; even for a small network of 50 genes it runs out of memory on a modern 32-bit computer. The next subsections discuss some algorithms which are improvements/extensions of REVEAL.

2.4.1 sREVEAL-1
By considering a set of known transcription factors, the number of iterations REVEAL has to execute over different input combinations is reduced by a huge margin, reducing both space and time complexities. A transcription factor prediction approach [31] can be used to identify transcription factors with high accuracy. As past empirical studies have shown that a gene is regulated by a small set of transcription factors, one can also restrict the number of regulators to between three and six, which further reduces the complexity of the algorithm. Unlike REVEAL, which keeps solving for higher numbers of parents as it proceeds, this approach solves one gene at a time: if the number of parents is restricted to, say, three, the mutual information between the gene and all combinations of one, two and three transcription factors is computed, and the combination of transcription factors that produces the maximum MI to entropy ratio with the gene is chosen as the combination that regulates it. Figure 4 gives a pictorial presentation of the REVEAL and sREVEAL [32] algorithms. In REVEAL, first all possible pair-wise combinations are checked to find the regulators of the genes; the genes for which the ratio equals 1 have been properly characterized and are taken out of future consideration. For the remaining genes, all combinations of two genes that can act as their parents are checked similarly, and if the ratio equals 1 they are taken out of consideration; the algorithm next looks into combinations of three genes as parents of the remaining genes, and so on, until all genes are characterized. The right hand side of Figure 4 explains how sREVEAL-1 works: a maximum of two regulators per gene is shown in this case, so for every gene all possible combinations of one and two transcription factors are checked and the combination that gives the maximum MI to entropy ratio is chosen as the potential parent set.
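A sketch of the per-gene search performed by sREVEAL-1 is shown below: all transcription factor combinations of size one up to the parent limit are scored by their MI-to-entropy ratio and the best combination is kept. The entropy_fn argument stands for any discrete joint-entropy estimator (such as the earlier entropy() sketch); the function and argument names are hypothetical, not taken from [32].

from itertools import combinations

def r_score(tf_vectors, gene, entropy_fn):
    """r-score = I(TF combination; gene) / H(gene), with the joint MI
    expanded in terms of entropies as in Eqn (3).  Assumes H(gene) > 0."""
    joint_mi = entropy_fn(*tf_vectors) + entropy_fn(gene) - entropy_fn(*tf_vectors, gene)
    return joint_mi / entropy_fn(gene)

def sreveal1_parents(gene, tfs, entropy_fn, max_parents=3):
    """For one target gene, return the TF combination (size 1..max_parents)
    with the highest r-score.  `tfs` maps TF name -> quantized profile."""
    best = (-1.0, ())
    for k in range(1, max_parents + 1):
        for combo in combinations(tfs, k):
            score = r_score([tfs[t] for t in combo], gene, entropy_fn)
            best = max(best, (score, combo))
    return best[1], best[0]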
2.4.2 sREVEAL-2
Though the above approach reduces the complexity and can infer networks of several hundred genes, it can still run out of memory when the transcription factor list itself is large (>50), as in the case of the complete E. coli transcriptional regulatory network. This issue can be solved by further reducing the potential set of transcription factors for each gene using time lags. A transcription factor can regulate a gene only if the time lag between the transcription factor and the gene is greater than zero. In [32] the authors suggest that very large time lags should be set to zero, so that such transcription factors are not considered as potential regulators, and that higher priority must be given to smaller time lags. Even after filtering the potential regulators using time lags, the potential regulator list can still be large; further filtering can be performed by considering only small time lags between transcription factors and genes. This approach can be used to infer fairly large regulatory networks.

2.4.3 rREVEAL
rREVEAL removes the exit strategy implemented by REVEAL (discarding cases for which the r-score evaluated to one) and uses a novel ranking scheme on the regulators to assess their likelihood of serving as the parents of any particular gene. rREVEAL starts by computing the r-scores between every single regulator and the gene.
It then computes the r-score between every combination of two regulators and the gene, then moves on to combinations of three regulators, and so on. The combinations that produce the maximum score at each combination size are stored in different bins, and the regulators in each bin are given a normalized score. If the number of combinations that achieve the maximum score in a bin is C and the maximum number of regulators allowed in that bin is P, then the total number of regulator slots in the bin is C × P. If a regulator R occurs O times in these combinations, then the normalized score for regulator R in bin P is given by:

N_R = \frac{O}{C \times P}    (12)
Normalizing the scores for each regulator separately in every bin also ensures that the inference is not biased by the number of parents a gene can have. For example, as shown in Figure 5, for each gene, bin one only holds combinations of one parent, bin two holds combinations of two parents, and so on. Assume that two cases in each bin achieve the maximum r-score: A and B in bin 1, and (C,D) and (C,A) in bin 2. Here the probability of occurrence of A and B in bin 1 is 0.5 each, while that of C, D and A in bin 2 is 0.5, 0.25 and 0.25 respectively. If, instead of computing the normalized score in this fashion, we had simply counted the number of occurrences of the regulators in each bin, the results would be biased towards the higher bins, as a regulator is more likely to occur multiple times in larger parent-list combinations (regulator C in this example has a higher count). After this step, a final score is assigned to each regulator by summing its normalized scores across the individual bins. If P is the desired maximum number of regulators in the inferred network, the final score N_S for each regulator is given by:

N_S = \sum_{i,j=1}^{R,P} N_{i,j}    (13)
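The sketch below reproduces this scoring on the Figure 5 example. Note that, as stated in Figure 5, the final score averages N_R over the bins; the bin contents listed here are the max-scoring combinations read off from that figure.

from collections import Counter

def rreveal_scores(bins):
    """Normalized regulator scores in the spirit of Eqns (12)-(13).

    `bins[p]` holds the size-(p+1) regulator combinations achieving the
    maximum r-score in bin p+1.  N_R for a regulator in a bin is its
    occurrence count O divided by C*P; the final score N_S averages N_R
    over all bins, as in Figure 5."""
    totals = Counter()
    for combos in bins:
        C, P = len(combos), len(combos[0])
        occurrences = Counter(r for combo in combos for r in combo)
        for regulator, O in occurrences.items():
            totals[regulator] += O / (C * P)
    return {r: s / len(bins) for r, s in totals.items()}

# Figure 5 example: four TFs A-D, max-scoring combinations per bin.
bins = [[("B",), ("C",)],
        [("A", "C"), ("B", "C"), ("B", "D")],
        [("B", "C", "D")]]
print(rreveal_scores(bins))   # ≈ {'B': 0.389, 'C': 0.389, 'A': 0.056, 'D': 0.167}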
Based on these scores, the regulator list for each gene is ranked, and depending on the desired number of regulators, the final regulator list for each gene is short-listed from the ranked list. Figure 5 gives a pictorial representation of rREVEAL: the network in the figure has four transcription factors, the maximum number of desired parents is three, and thus scores are computed for all possible combinations of one, two and three regulators for every node in the network. While rREVEAL is more robust than REVEAL, there is a possibility that the ordering of the data might affect the resulting networks: if the maximum number of regulators allowed is n, and the nth and (n+1)th ranked regulators have the same final score, then different orderings of the data yield different networks. Such a scenario is unlikely, and simple steps can be taken to handle it; one strategy is to include all regulators beyond the (n+1)th rank that have the same score as the nth ranked regulator. The normalization of the scores is also important to give each transcription factor a fair chance in the ranking process.

2.4.3.1 Complexity Analysis of rREVEAL
Entropy estimation and the enumeration of regulator combinations are the dominant contributors to the time complexity of the rREVEAL algorithm. Entropy estimation for n variables with q-level quantization has a time complexity of O(q^n); as q is set to two and n is restricted to five, the entropy estimation cost is a constant in the rREVEAL implementation. Considering the regulator combinations in a network with t transcription factors, with k being the maximum number of parents each gene can have, the total number of combinations to be considered is C(t, k), resulting in a worst-case run-time complexity of O(t^k); as k is again restricted to between three and five, the worst-case complexity of this step reduces to O(t^5). For a network with s genes, the worst-case complexity of rREVEAL is hence O(s·t^5).

2.5 Inference Using Knock-out Data
While the inference algorithms primarily work on microarray data, other data sets can be incorporated to further improve the quality of the inferred networks. One such data set is knock-out data. In this section we look at how the knowledge from knock-out data can be used to further improve the quality of the inferred networks.
2.5.1 Knock-out Data
Knock-out data provides the expression levels of genes when a particular gene (or set of genes) is down-regulated. Such data is effective in determining direct regulatory interactions [33]. These experiments are useful when certain genes of interest are targeted: external conditions that down-regulate those genes are used to carry out the experiments. As the experiments are expensive, they cannot be performed on a large scale, but the knowledge obtained from them can be used during network inference to improve the inference accuracy.

2.5.2 Knock-out Network Generation
To generate the knock-out-data-based network, a simple model detecting the fold change in gene expression with respect to the wild type data (i.e., expression data in the normal condition) was used. If the fold change is greater than 1.2 or less than 0.7, the gene is taken to have a direct regulatory relationship with the knocked out gene.

2.5.3 Number of Parents in the E. coli Genome
Studies on the yeast network [4] showed that a gene is usually regulated by a small number of transcription factors. This property can be used to greatly reduce the complexity of algorithms such as network MDL, PMDL, DBN, etc. Similar studies were performed on the genomic network of E. coli in [32]. The gene regulatory network of E. coli was obtained from RegulonDB release 6.2 [34]; the network has 1502 genes and 3587 regulatory interactions. It was observed that, as in the yeast network, genes usually have a small number of parents. This observation indicates that a choice between three and six is a reasonable number of parents for GRN inference.

2.5.4 Network Generation Using Parent Restriction
The parent restriction approach [35] takes, for each gene, the knock-out data network, the network obtained from PMDL, the individual MI values and the time lags as inputs, and computes the preference order in which parent nodes are chosen when a gene has more parents than the parent restriction constraint allows. Figure 6 illustrates the parent selection process. In Figure 6, three networks 1, 2 and 3 are shown: network 1 is the network obtained using the MI and CMI thresholds, with the edges labeled by MI values and the time lags given above the nodes; network 2 is the knock-out network; and network 3 is the final network obtained after the parent selection process is completed. The parent selection process is based on the following priorities:
• Knock-out edges
• MI values
• Time lags
Irrespective of the MI values or time lags, the knock-out edges are added first; then the parents from the mutual information network are chosen based on their MI values, and if there is a tie in MI values, the parent selection is based on the time lags. If the knock-out edges alone exceed the desired number of parents, the best among them are selected based on MI and time lags as discussed above. In Figure 6, the maximum number of parents that node A can have is three. Genes E and F are chosen first from the knock-out network, leaving room for one more parent. Genes C and D both have an MI value of 0.8, but as gene C has the shorter time lag, gene C is chosen as the third parent.
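A sketch of this priority-based selection is given below. The dictionary layout and the handling of an over-full knock-out list (simple truncation rather than re-ranking) are assumptions made for brevity, not details of [35].

def select_parents(candidates, knockout_parents, max_parents=3):
    """Knock-out edges are added first; the remaining slots are filled by
    ranking candidates on MI (higher first), breaking ties with the smaller
    time lag.  `candidates` maps parent -> (mi, time_lag)."""
    chosen = list(knockout_parents)[:max_parents]
    ranked = sorted((p for p in candidates if p not in chosen),
                    key=lambda p: (-candidates[p][0], candidates[p][1]))
    chosen.extend(ranked[:max_parents - len(chosen)])
    return chosen

# Figure 6 example for gene A: knock-out parents E and F, plus the MI-network
# candidates B, C, D with (MI, time lag) as labeled in the figure.
print(select_parents({"B": (0.7, 1), "C": (0.8, 1), "D": (0.8, 2)},
                     knockout_parents=["E", "F"]))        # -> ['E', 'F', 'C']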
2.5.5 Knock-out Data Incorporation in rREVEAL
The knowledge from knock-out data sets can be used to reduce the number of potential regulators of each gene, thereby reducing the run time of the algorithm. As knock-out data has been shown to reveal strong regulatory interactions [33], pruning potential parents based on knock-out data improves not only the run time but also the inference accuracy. A simple model is used to obtain the potential regulators from knock-out data: if the fold change of the expression value of gene B when transcription factor A is knocked out, relative to the steady state value of gene B, is greater than or equal to a specific user-selected threshold, then transcription factor A is considered a potential regulator of gene B. In this fashion the complete transcription factor list is tested for every gene. In the worst case every transcription factor is a potential parent of every gene, but such a scenario is highly unlikely.
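A minimal sketch of this filtering step is shown below, assuming knock-out and wild-type (steady state) expression values are available per transcription factor and gene; the default threshold of 1.2 merely echoes the Section 2.5.2 rule and stands in for the user-selected cut-off.

def knockout_regulators(knockout_expr, steady_state, threshold=1.2):
    """Reduce the candidate-regulator list of each gene using knock-out data.

    `knockout_expr[tf][gene]` is the expression of `gene` when `tf` is knocked
    out; `steady_state[gene]` is its wild-type level.  A TF is kept as a
    potential regulator of a gene when the fold change meets the threshold."""
    regulators = {}
    for tf, profile in knockout_expr.items():
        for gene, value in profile.items():
            if value / steady_state[gene] >= threshold:
                regulators.setdefault(gene, []).append(tf)
    return regulators

# Hypothetical toy data: knocking out tf1 roughly doubles geneA's expression.
print(knockout_regulators({"tf1": {"geneA": 2.1, "geneB": 1.0}},
                          {"geneA": 1.0, "geneB": 1.0}))   # {'geneA': ['tf1']}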
3 Applications
Disease diagnostics and drug development research teams can use these reverse engineering algorithms to generate a starting regulatory network for their analysis. For example, during the drug development process, after a cell is exposed to a drug, the expression levels of its genes can be monitored using microarray experiments; using these observations across multiple time points, one can run a reverse engineering algorithm to build a network and base the subsequent analysis on it. These algorithms can also be used to generate starting networks for analysis by toxicologists. For example, the E.R.D.C. of the U.S. Army is interested in knowing how chemicals used in weapons affect the ecosystem: the E.R.D.C. selects certain species and exposes their cells to a chemical; as in the previous example, the expression levels are recorded across multiple time points and a reverse engineering algorithm is used to build a starting network for analysis.
4 World Wide Web Resources
Many of the above mentioned algorithms are available online. For those interested in developing new algorithms, microarray data sets are also available online, and many in silico synthetic data generation tools can be used to generate data sets for testing the performance of new algorithms. We present URLs to some of the important resources available online.
1. Yeast cell cycle data set available from Dr. Spellman's lab: http://genome-www.stanford.edu/cellcycle/data/rawdata/
2. Yeast cell cycle data set available from Dr. Chou's lab: http://genomics.stanford.edu/yeast_cell_cycle/full_data.html
3. GeneNetWeaver tool for in silico data generation: The GeneNetWeaver tool can be of immense help to anyone interested in developing newer reverse engineering algorithms. The tool generates data based on known biological functions and also provides a way to generate sub-networks of very large networks; in particular, it includes the known yeast and E. coli regulatory networks. Using this tool one can generate data sets of a desired network size with a desired number of microarray experiments. It is freely available at: http://gnw.sourceforge.net/
4. ARACNE and CLR: The popular ARACNE and CLR algorithms can be downloaded at the following link: http://gardnerlab.bu.edu/clr.html
5. rREVEAL implementation is available upon request. Please e-mail
[email protected] to obtain a MATLAB implementation of the rREVEAL algorithm.

REFERENCES
[1] Hartemink A.J.: Reverse engineering gene regulatory networks. Nature Biotechnology, 2005, 23:554-555.
[2] Babu M.M., Lang B., Aravind L.: Methods to reconstruct and compare transcriptional regulatory networks. Methods Mol Biol, 2009, 541:163-180.
[3] Babu M.M.: Introduction to microarray data analysis. In Computational Genomics: Theory and Application (Grant, R.P., ed.), 2004, pp. 225-249, UK, Horizon Press.
[4] Lee T.I., Rinaldi N.J., Robert F., Odom D.T., Bar-Joseph Z., Gerber G.K., Hannett N.M., Harbison C.T., Thompson C.M., Simon I., et al.: Transcriptional regulatory networks in Saccharomyces cerevisiae. Science, 2002, 298:799-804.
[5] Imoto S., Goto T., Miyano S.: Estimation of genetic networks and functional structures between genes by using Bayesian networks and nonparametric regression. Pacific Symposium on Biocomputing, 2002, 7:175-186.
[6] Murphy K., Mian S.: Modelling gene expression data using dynamic Bayesian networks. Technical report, 1999, Computer Science Division, University of California, Berkeley, CA.
[7] Zou M., Conzen S.D.: A new dynamic Bayesian network (DBN) approach for identifying gene regulatory networks from time course microarray data. Bioinformatics, 2005, 21(1):71-79.
[8] Shmulevich I., Dougherty E.R., Kim S., Zhang W.: Probabilistic Boolean networks: A rule-based uncertainty model for gene regulatory networks. Bioinformatics, 2002, 18(2):261-274.
[9] Shmulevich I., Dougherty E.R., Zhang W.: From Boolean to probabilistic Boolean networks as models of genetic regulatory networks. Proceedings of the IEEE, 2002, 90(11):1778-1792.
[10] Cover T.M., Thomas J.A.: Elements of Information Theory. Wiley-Interscience, New York, 1991.
[11] Zhao W., Serpedin E., Dougherty E.R.: Inferring connectivity of genetic regulatory networks using information-theoretic criteria. IEEE/ACM Transactions on Computational Biology and Bioinformatics, 2008, 5(2):262-274.
[12] Butte A.J., Kohane I.S.: Mutual information relevance networks: Functional genomic clustering using pairwise entropy measurements. Pacific Symposium on Biocomputing, 2000, pp. 418-429.
[13] Eisen M.B., Spellman P.T., Brown P.O., Botstein D.: Cluster analysis and display of genome-wide expression patterns. Proc Natl Acad Sci USA, 1998, 95:14863-14868.
[14] Margolin A.A., Nemenman I., Basso K., Wiggins C., Stolovitzky G., Dalla Favera R., Califano A.: ARACNE: An algorithm for reconstruction of genetic networks in a mammalian cellular context. BMC Bioinformatics, 2006, 7(Suppl 1):S7.
[15] Faith J.J., Hayete B., Thaden J.T., Mogno I., Wierzbowski J., Cottarel G., Kasif S., Collins J.J., Gardner T.S.: Large-scale mapping and validation of Escherichia coli transcriptional regulation from a compendium of expression profiles. PLoS Biology, 2007, 5.
[16] Grünwald P.D., Myung I.J., Pitt M.A.: Advances in Minimum Description Length: Theory and Applications. The MIT Press, 2005.
[17] Rissanen J.: Modeling by shortest data description. Automatica, 1978, 14:465-471.
[18] Rissanen J.: Universal coding, information, prediction, and estimation. IEEE Transactions on Information Theory, 1984, 30(4):629-636.
[19] Zhao W., Serpedin E., Dougherty E.R.: Inferring gene regulatory networks from time series data using the minimum description length principle. Bioinformatics, 2006, 22(17):2129-2135.
[20] Dougherty J., Tabus I., Astola J.: Inference of gene regulatory networks based on a universal minimum description length. EURASIP Journal on Bioinformatics and Systems Biology, 2008, Article ID 482090, 11 pages.
[21] Chaitankar V., Zhang C., Ghosh P., Perkins E., Gong P., Deng P.: Gene regulatory network inference using predictive minimum description length principle and conditional mutual information. IJCBS 2009, pp. 487-490.
[22] Chaitankar V., Zhang C., Ghosh P., Perkins E., Gong P., Deng Y.: Predictive minimum description length principle approach to infer gene regulatory networks. Computational Biology and Bioinformatics, Springer, 2010.
[23] Chaitankar V., Ghosh P., Perkins E.J., Gong P., Deng Y., Zhang C.: A novel gene network inference algorithm using predictive minimum description length approach. BMC Systems Biology, 2010, 4(Suppl 1):S7.
[24] Chaitankar V., Zhang C., Ghosh P., Perkins E., Gong P.: Effects of cDNA microarray time-series data size on gene regulatory network inference accuracy. ACM International Conference on Bioinformatics and Computational Biology, 2010.
[25] Marbach D., Schaffter T., Mattiussi C., Floreano D.: Generating realistic in silico gene networks for performance assessment of reverse engineering methods. Journal of Computational Biology, 2009, 16(2):229-239.
[26] Prill R.J., Marbach D., Saez-Rodriguez J., Sorger P.K., Alexopoulos L.G., Xue X., Clarke N.D., Altan-Bonnet G., Stolovitzky G.: Towards a rigorous assessment of systems biology models: the DREAM3 challenges. PLoS ONE, 2010, 5(2):e9202.
[27] Liang S.: REVEAL, a general reverse engineering algorithm for inference of genetic network architectures. Pacific Symposium on Biocomputing, 1998, 3:18-29.
[28] Rao A., Hero A.O. III, States D.J., Engel J.D.: Inference of biologically relevant gene influence networks using the directed information criterion. ICASSP Proceedings, 2006.
[29] Chaitankar V., Ghosh P., Perkins E., Gong P., Zhang C.: Time lagged information theoretic approaches to the reverse engineering of gene regulatory networks. BMC Bioinformatics, 2010.
[30] Chaitankar V., Ghosh V., Elasri M., Perkins E.: Poster: Gene regulatory network inference using time lagged context likelihood of relatedness. ICCABS 2011, p. 236.
[31] Kummerfeld S.K., Teichmann S.A.: DBD: a transcription factor prediction database. Nucleic Acids Research, 2006, 34:D74-D78.
[32] Chaitankar V., Ghosh P., Elasri M., Perkins E.: sREVEAL: Scalable extensions of REVEAL towards regulatory network inference. Workshop on Computational Biology, 11th International Conference on Intelligent Systems Design and Applications, 2011.
[33] Yip K., Alexander R., Yan K., Gerstein M.: Improved reconstruction of in silico gene regulatory networks by integrating knockout and perturbation data. PLoS ONE, 2010, 5(1):e8121.
[34] Gama-Castro S., et al.: RegulonDB (version 6.0): Gene regulation model of Escherichia coli K-12 beyond transcription, active (experimental) annotated promoters and Textpresso navigation. Nucleic Acids Research, 2008, 36:D120-D124.
[35] Chaitankar V., Ghosh P., Elasri M., Perkins E.: A scalable gene regulatory network reconstruction algorithm combining gene knock-out data. BICoB 2011.
FIGURE CAPTIONS
FIGURE – 1 Inference/reverse engineering outline.
FIGURE – 2 Pseudo-code for the PMDL algorithm.
FIGURE – 3 Time lag computation approaches.
FIGURE – 4 REVEAL and sREVEAL approaches.
FIGURE – 5 rREVEAL example.
FIGURE – 6 Knock-out based network inference approach.
TABLE CAPTIONS
TABLE – 1 Conditional probability table

yz \ x    0     1
00        0.6   0.4
01        0.3   0.7
10        0.5   0.5
11        0.8   0.2
[Figure 1: A gene expression matrix (rows = genes, columns = time points/conditions), for example
A: 1 0 1 1 1
B: 0 0 0 1 1
C: 0 1 1 0 0
D: 1 0 1 0 1
E: 1 1 0 1 0
together with optional prior knowledge and optional additional input data (e.g., knock-out data), is fed to the reverse engineering algorithm, which outputs the inferred network over genes A-E.]
Figure 2 (PMDL algorithm pseudo-code):
1.  Input time series data
2.  Preprocess data
3.  Initialize M_{n×n}, C_{n×n} and P(x_j) ⇐ φ
4.  Calculate the cross-time mutual information between genes and fill M_{n×n}
5.  for i = 1 to n
6.    for j = 1 to n
7.      δ ⇐ M_{i,j}
8.      for k = 1 to n
9.        for l = 1 to n
10.         if M_{k,l} ≥ δ then
11.           C_{k,l} = 1, P(x_l) ⇐ P(x_l) ∪ x_k
12.         end if
13.       end for
14.     end for
15.     Compute probabilities using Eqn (7)
16.     Compute L_D using Eqn (10)
17.   end for
18. end for
19. Select the MI of the model having the least L_D as the MI threshold δ
20. for i = 1 to n
21.   for j = 1 to n
22.     if C_{i,j} == 1 then
23.       for k = 1 to n and k ≠ i, j
24.         if CMI_{i,j,k} < Th then
25.           C_{i,j} ⇐ 0
26.           break
27.         end if
28.       end for
29.     end if
30.   end for
31. end for
[Figure 3: Quantized expression profiles of genes A and B over six time points. Ia and Ib mark the initial expression changes of A and B (time points 2 and 3), Ua and Ub their initial up-regulations (time points 2 and 3), and Ca and Cb the first expression changes after the other gene's up-regulation (time points 6 and 3). Zou's time lag from A to B is 1 unit; the time lags proposed in [29] are 1 unit from A to B and 3 units from B to A.]
[Figure 4: Left, REVEAL on a three-gene example: MI([C,A])/H(A) = 1, so A is solved and is not considered in future computations; MI([A,C],B)/H(B) = 1, so B is solved; the process continues until all genes are characterized, giving the inferred network. Right, sREVEAL-1 with only A and B as transcription factors: for each gene the combination with the maximum MI-to-entropy ratio is chosen (e.g., MI(B,A) is maximal for A, MI(A,B) for B, and MI([A,B],C) for C), giving the inferred network.]
[Figure 5: rREVEAL example with four transcription factors A-D and a maximum of three parents. The r-score of a combination is the MI between the regulators and the gene divided by the entropy of the gene. N_R is the probability of participating in a regulator combination that achieves the maximum r-score in a bin, and N_S is the average of N_R across all bins.

Bin 1 (single regulators): r-scores A 0.8, B 1, C 1, D 0.8; normalized scores (N_R): A 0, B 0.5, C 0.5, D 0.
Bin 2 (pairs): r-scores [A,B] 0.8, [A,C] 1, [A,D] 0, [B,C] 1, [B,D] 1, [C,D] 0.8; normalized scores (N_R): A 0.167, B 0.333, C 0.333, D 0.167.
Bin 3 (triples): r-scores [A,B,C] 0.8, [A,B,D] 0.7, [A,C,D] 0.5, [B,C,D] 1; normalized scores (N_R): A 0, B 0.333, C 0.333, D 0.333.
Final scores (N_S): A 0.055, B 0.388, C 0.388, D 0.167.]

To get the final regulator list for each gene, the r-scores, the normalized scores (N_R) and the final scores (N_S) are computed as shown in the example above. The regulators are then sorted by their final scores and the desired number of regulators is inferred for the gene under consideration; in this fashion the desired number of regulators is inferred for every gene. In the example above there are four transcription factors as potential regulators, and the desired number of regulators for each gene is three, so the final regulator list is [B, C, D].
[Figure 6: Parent selection for gene A with a limit of three parents. Network 1 (mutual information network): candidate parents B, C and D with MI values 0.7, 0.8 and 0.8 and time lags 1, 1 and 2 respectively. Network 2 (knock-out network): E and F regulate A. Network 3 (final network): E, F and C are selected as the parents of A.]