Gene Regulatory Network Inference Using Predictive Minimum ...

1 downloads 0 Views 256KB Size Report
Gene Regulatory Network Inference Using Predictive Minimum Description Length. Principle and Conditional Mutual Information. Vijender Chaitankar,.
2009 International Joint Conference on Bioinformatics, Systems Biology and Intelligent Computing

Gene Regulatory Network Inference Using Predictive Minimum Description Length Principle and Conditional Mutual Information Vijender Chaitankar, Chaoyang Zhang, Preetam Ghosh School of Computing The University of Southern Miss Hattiesburg, MS, USA-39401 [email protected]

Edward J. Perkins Environmental Laboratory U.S. Army E.R.D.C Vicksburg, MS, USA-39180

then there is no connectivity between X and Y. The ARACANE method loses validity in other cases like ZĺX, Y [1]. To deal with such cases, Wentao et.al [1] exploited the concept of CMI. Wentao et al. [1] were the first to implement both MI and CMI to infer gene regulatory networks from microarray data, however, selecting the MI and CMI thresholds are the major drawbacks in their approach. The MDL principle has been implemented in [3] and [2] to estimate the MI threshold. Various implementations of the MDL principle have been studied extensively in [7].The algorithm proposed in [3] often yields good results, but it does so with an ad hoc coding scheme that requires a userspecified tuning parameter. Dougherty et al. [2] implemented the normalized maximum likelihood model to overcome this issue. In the proposed algorithm, we implement the predictive minimum description length principle which is much suited to time series data and combine it with the CMI metric. In particular, our scheme requires only one threshold parameter as against two threshold values that need to be specified in the scheme proposed in [1], besides performing better in terms of accuracy. A number of information theoretic approaches to infer gene regulatory networks exist, that rely on threshold values and/or fine tuning parameters. One other approach [Reference] does not involve any fine tuning parameters or threshold values but it does not utilize conditional mutual information which is useful in reducing false edges in a network. Our goal is to develop an algorithm which eliminates the need of any fine tuning parameter and implements conditional mutual information to improve the accuracy of the resulting network. The paper is organized as follows. Section 2 covers the Network Model, the information theoretic quantities MI and CMI, the PMDL principle and the proposed algorithm. Section 3 demonstrates the performance of the proposed algorithm over synthetic networks and finally we conclude in Section 4.

Abstract— Inferring gene regulatory networks using information theory models have received much attention due to their simplicity and low computational costs. One of the major problems with information theory models is to determine the threshold which defines the regulatory relationships between genes. The minimum description length (MDL) has been used to overcome this problem. We propose an inference algorithm which incorporates mutual information (MI), conditional mutual information (CMI) and predictive minimum description length (PMDL) principles to infer gene regulatory networks from microarray data. The information theoretic quantities MI and CMI determine the regulatory relationships between genes and the PMDL principle determines the MI threshold. The performance of the proposed algorithm is demonstrated on random synthetic networks, and the results show that the PMDL principle is a good choice to determine the MI threshold.

I. INTRODUCTION Reverse engineering of gene regulatory networks remain a major issue and area of interest in the field of bioinformatics and systems biology. A survey paper [9] discusses a number of models related to this area viz. Bayesian Networks, Dynamic Bayesian Networks, Boolean Networks, Probabilistic Boolean Networks, Differential Equation Models and Information Theory Models. Simplicity and low computational costs are the major advantages of these information theory based models. Because of their low data requirements, they are suitable to infer even large-scale networks. Thus, they can be used to study global properties of large-scale regulatory systems [9]. Mutual information measures the amount of information that can be obtained about one random variable by observing another one. Compared with the correlation coefficient based metric, the mutual information is suitable for nonlinear relations and represents a good metric for evaluating the dependency between two random variables [5]. In one of the more recent works in this area [3], it was assumed that if the MI value is less, then genes are not connected in the network otherwise they are connected. Based on the study of chemical kinetics, it was found that that the second assumption was not true. If there were two genes which were being regulated by a third gene then the mutual information between the two genes could be high resulting in a false edge in the network. ARACNE [4] was the first inference algorithm to implement a method to identify such false edges. ARACNE stated that if the mutual information between two genes X and Y is less than or equal to the mutual information between genes X and Z and mutual information between Y and Z i.e. I(X, Y) ” [I(X, Z), I(Y, Z)], 978-0-7695-3739-9/09 $25.00 © 2009 IEEE DOI 10.1109/IJCBS.2009.133

Ping Gong, Youping Deng SpecPro Inc. Vicksburg, MS, USA-39180

II. SYSTEM AND METHODS A. Genetic Network The network formulation is similar to the one used in [3]. A graph G(V, E) represents a network where V denotes a set of genes and ‫ ܧ‬denotes a set of regulatory relationships between genes. If gene ‫ ݔ‬shares a regulatory relationship with gene ‫ݕ‬, then there exists an edge between ‫ ݔ‬and ‫ݕ‬ (‫ ݔ‬՜ ‫) ݕ‬. Genes can have more than one regulator. The notation Զ(‫ )ݔ‬is used to represent a set of genes that share 521 511 493 487

regulatory relationship with gene ‫ݔ‬. For example, if gene ‫ݔ‬ shares a regulatory relationship with ‫ ݕ‬and ‫ ݖ‬then Զ(‫= )ݔ‬ {‫ݕ‬, ‫}ݖ‬. Also every gene is associated with a function ݂‫ ݔ‬Զ(‫)ݔ‬ which denotes the expression value of gene ‫ ݔ‬determined by the values of genes in Զ(‫)ݔ‬. The gene expression is affected by many environmental factors. Since it is feasible to incorporate all factors all regulatory functions are assumed to be probabilistic. Also, the gene expression values are assumed discrete-valued and the probabilistic regulation functions are represented as lookup tables. If the expression levels are quantized to ‫ ݍ‬levels and a gene ‫ ݔ‬has ݊ predecessors then the look up table has ‫݊ ݍ‬ rows and ‫ ݍ‬columns and every entry in the table corresponds to a conditional probability.

1

‫݉ = )ݒ = ݔ(݌‬

݇=1

Where, 1{.} (. ) is the indicator function, defined as 1, ݂݅ ‫ܣ א ݏ‬ 1{‫ = )ݏ( }ܣ‬൜ 0, ݂݅ ‫ܣ ב ݏ‬ C. Predictive Minimum Description Length Principle The description length of the two-part MDL principle involves calculation of model length and the data length. As the length can vary for various models, the method is in danger of being biased towards the length of the model. Various implementations of the MDL principle exist, and have been studied extensively in [7]. The Normalized Maximum Likelihood Model has been implemented in [2]. We chose to implement the PMDL principle as it suits time series data. PMDL principle models the data points only, thus calculating code length involves only the data. We calculate code length as data length given in [3]. A gene can take any value when transformed from one time point to another due to the probabilistic nature of the network, thus the network is associated with Markov chain which is used to model state transitions. These states are represented as n-gene expression vectors ܺ‫ݔ( = ݐ‬1,‫ ݐ‬, … , ‫݊ݔ‬,‫ ܶ) ݐ‬and the transition probability ‫ݐܺ(݌‬+1 |ܺ‫) ݐ‬ can be derived as follows:

B. Information Theoretic Quantities 1) Mutual Information: Mutual information measures the amount of information that can be obtained about one random variable by observing another one. Mutual information by itself does not contain directional information and hence, by introducing time lag in one of the variables, a sense of direction is given in [3]. The gene system is assumed to be event driven, i.e. all the regulations are performed step by step and in each step all regulations happen only once. Therefore, the latency parameter is set by default to a unit step. ‫ ݐݔ(݌‬, ‫ݐݕ‬+1 ) ൨ (1) ‫ݐܻ ; ݐܺ(ܫ‬+1 ) = ෍ ൤‫ ݐݔ(݌‬, ‫ݐݕ‬+1 )݈‫݃݋‬ ‫) ݐݔ(݌‬. ‫ݐݕ(݌‬+1 )

‫ݐܺ(݌‬+1 |ܺ‫ = ) ݐ‬ς݊݅=1 ‫݌‬൫‫݅ݔ‬,‫ݐ‬+1 ห Զ‫)) ݅ݔ( ݐ‬ The probability ‫݅ݔ(݌‬,‫ݐ‬+1 |Զ‫ ) ݅ݔ( ݐ‬can be obtained from the look-up table associated with the vertex ‫ ݅ݔ‬and is assumed to be time invariant. It is estimated as follows:

‫ ݐ ݔ‬,‫ݐݕ‬

where ‫ ݐݔ(݌‬, ‫ݐݕ‬+1 ) is the cross-time joint probability and ‫) ݐݔ(݌‬, ‫ݐݕ(݌‬+1 ) are the marginal probabilities. Mutual information can also be defined in terms of entropies [9]: (2) ‫ݐܻ ; ݐܺ(ܫ‬+1 ) = ‫ ) ݐܺ(ܪ‬+ ‫ݐܻ(ܪ‬+1 ) െ ‫ ݐܺ(ܪ‬, ܻ‫ݐ‬+1 )

݉ െ1

‫݅ݔ(݌‬,‫ݐ‬+1 = ݆|Զ‫= ) ݅ݔ( ݐ‬

1 ෍ 1{݆ } (‫݅ݔ‬,‫ݐ‬+1 |Զ‫)) ݅ݔ( ݐ‬ ݉െ1 ‫=ݐ‬1

Each state transition brings new information which is measured by the conditional entropy: (‫ݐܺ(݌‬+1 |ܺ‫)) ݐ‬ ‫ݐܺ(ܪ‬+1 |ܺ‫ = ) ݐ‬െlog༌ The total entropy for given m time-series sample points, (ܺ1 , … , ܺ݉ ) is given by:

2) Conditional Mutual Information: High mutual information indicates that there may be a direct or indirect relationship between the genes. To overcome this issue, we implement the concept of conditional mutual information. Conditional mutual information is the reduction in the uncertainty of ܺ due to knowledge of ܻ when ܼ is given. Time lag between variables is considered to give a sense of direction. The conditional mutual information of random variables ܺ and ܻ given ܼ is defined by ‫ݐܻ ; ݐܺ(ܫ‬+1 |ܼ‫= ) ݐ‬ σ‫ݐ ݔ‬, ‫ݐݕ‬+1 ,‫݌ ݐݖ‬൫‫ݐݔ‬, ‫ݐݕ‬+1 , ‫ ݐݖ‬൯. ݈‫݃݋‬

݉

෍ 1{‫) ݇ܵ( }ݒ‬

݉ െ1

‫ܺ(ܪ = ܦܮ‬1 ) + ෍ ‫ܪ‬൫݆ܺ +1 |݆ܺ ൯ ݆ =1

As ‫ܺ(ܪ‬1 ) is same for all models it is removed thus the model length is: ݉ െ1

‫ ݐ ݔ(݌‬,‫ݐݕ‬+1 |‫) ݐݖ‬ ‫) ݐݖ| ݐ ݔ(݌‬.‫ݐݕ(݌‬+1 |‫) ݐݖ‬

‫ = ܦܮ‬෍ ‫ܪ‬൫݆ܺ +1 |݆ܺ ൯

(3)

݆ =1

D. Inference Algorithm

Conditional mutual information can also be expressed in terms of entropies as ‫ݐܻ ; ݐܺ(ܫ‬+1 |ܼ‫ ݐܺ(ܪ = ) ݐ‬, ܼ‫ ) ݐ‬+ ‫ݐܻ(ܪ‬+1 , ܼ‫ ) ݐ‬െ ‫) ݐܼ(ܪ‬ (4) െ ‫ ݐܺ(ܪ‬, ܻ‫ݐ‬+1 , ܼ‫) ݐ‬ 3) Entropy Calculations: The proposed algorithm deals with quantized data. In general, it is assumed that the q-level quantization admits the alphabet ‫{ = ݍܣ‬0,1, … , ‫ ݍ‬െ 1} then, the probability mass function from ݉ samples ‫ݏ‬1 , … , ‫ ݉ݏ‬is estimated as

488 494 512 522

(5)

1. Input Time Series Data 2. Preprocess Data 3. Initialize M nun , C nun and P ( x j )  I

conditional mutual information of the genes with every other gene is evaluated using (3) and if the value is below the user specified threshold (݄ܶ ) the connection is deleted.

4. Calculate the cross - time mutual info between genes and fill M nun

III. Results A. Simulation on Random Synthetic Networks The proposed algorithm is compared with the one proposed in [3] on synthetic random networks. Benchmark measures like recall R and precision P are used to evaluate the performance of inference algorithms. While different definitions for recall and precision exist [22], in this paper, R is defined as Ce/(Ce + Me) and P is defined as Ce/(Ce + Fe), where Ce denotes the edges that exist in the true network and in the inferred network, Me are the edges that exist in the true network but not in the inferred network and Fe are edges that do not exist in the true network but do exist in the inferred network. For a specific size of the network, both the algorithms are run for different threshold values 30 times each and the average of precision and recall are calculated. The algorithms are run for 20, 30, 40 and 50 numbers of genes. The precision vs. recall curves for each of these networks with different threshold values are given in Table 1. Wentao et al. reported 0.2 to 0.4 as suitable values for the tuning parameter, and hence we use the values 0.2, 0.3 and 0.4 to build the networks. For our proposed algorithm, based on simulations, we found that the threshold for conditional mutual information worked best for values in the range 0.1 to 0.2. Thus 0.1, 0.15 and 0.2 threshold values were used to build the networks. It is observed that the precision of the proposed algorithm is higher and recall is lower in most of the cases as compared to the algorithm proposed in [3] referred as MDL in Table 1. The number of false negatives is less in the proposed algorithm and as most biologists are interested in true positives, our results look promising.

5. for i 1 to n 6. for j 1 to n G  M iu j 7. 8. for k 1 to n for l 1 to n 9. if M k ul t G then 10. C k ul

11. 12. 13.

1

end if

15.

end for end for

16.

Calculate the model length L D using (5)

17. end for 18.end for 19.Select the MI of model having least L D as the MI threshold G . 20. for i 1 to n 21. for j 1 to n if C iu j 22. 1 then 23. 24. 25. 26. 27.

1 to n and k z i, j if CMI i, j ,k  Th then

for k

C iu j  0 break ; end if

TABLE I.

28. end for 29. end if 30. end for 31.end for

Network Size/Method

20

Given the time series data, the data is first preprocessed, which involves filling missing values and quantizing the data. Then the cross time mutual information matrix ‫ ݊×݊ܯ‬is evaluated. A connectivity matrix ‫ ݊×݊ܥ‬is maintained which has two entries: a 0 and a 1. An entry of 0 indicates that no regulatory relationship exists between genes, but an entry of 1 at ‫ ݆×݅ܯ‬indicates that gene ݅ regulates ݆. From lines 5 to 18 every value of the mutual information matrix is used as threshold and a model is obtained and the model lengths for each of these models are evaluated. The mutual information which was used to obtain the model with shortest model length is then used as the mutual information threshold į  WR REWDLQ the initial connectivity matrix. From lines 20 to 31, for every valid regulatory connection in the connectivity matrix, the

30

40

50

489 495 513 523

PRECISION AND RECALL VALUES. PMDL

Precision 0.7900 0.9464 0.9944 0.7224 0.9044 0.9808 0.6906 0.8746 0.9471 0.6335 0.8590 0.9696

MDL Recall 0.4756 0.3733 0.2556 0.4785 0.3489 0.2474 0.4806 0.3417 0.2528 0.4716 0.3467 0.2591

Precision 0.6939 0.8214 0.8834 0.6741 0.7897 0.8770 0.6608 0.7658 0.8432 0.6394 0.7401 0.8266

Recall 0.5133 0.4451 0.4 0.5015 0.4385 0.4163 0.4567 0.3978 0.3867 0.4764 0.4103 0.3920

determining the conditional mutual information threshold. We have tested the sensitivity of the threshold on the performance of our scheme and identified the value of 0.1 as optimal for most synthetic networks. The proposed algorithm produced less false edges as compared to [3], however, it resulted in a larger number of missing edges. We plan to improve this aspect of our algorithm in the future. Our major contribution has been the implementation of predictive MDL principle which eliminates the need of a fine tuning parameter. Also, our work combines the MDL principle with conditional mutual information for the first time to achieve better performance. Conditional mutual information was used in the past but our scheme adds direction using an ad hoc time delay. Also we reported the threshold sensitivity of gene regulatory network inference schemes for the first time as it gives the users an estimate on what range of thresholds should be used.

(a)

REFERENCES [1]

Wentao Zhao, Serpedin E and Dougherty E R, "Inferring connectivity of genetic regulatory networks using information-theoretic criteria", 2008, IEEE, VOL.5 No.2.

[2]

Dougherty John, Tabus I and Astola J, “Inference of gene regulatory networks based on a universal minimum description length”, 2008, EURASIP Journal on Bioinformatics and Systems Biology.

[3]

Wentao Zhao, Serpedin E, Dougherty E R, "Inferring gene regulatory networks from time series data using the minimum description length principle”, 2006, Bioinformatics, Vol. 22 No.17.

[4]

Margolin AA, Nemenman I, Basso K, Wiggins C, Stolovitzky G, Dalla Favera R, Califano A, "ARACNE: An algorithm for reconstruction of genetic networks in a mammalian cellular context", 2006, BMC

(b)

Bioinformatics.

Figure 1. (a) Precision Sensitivity (b) Recall Sensitivity

B. Threshold Sensitivity Here we report the performance of our scheme based on different values of the user specified threshold. For threshold values of 0.15 and 0.2, high precision (over 90% in most cases) was observed but the recall for these thresholds were low (from 25% to 30%) as compared to a threshold value of 0.1 which had a fair recall (over 47%) and good precision (63% to 79%) performance. It is observed from the graphs in Figure 1 that as the threshold value is increased, precision increases and at the same time the recall decreases. The simulation experiments show that 0.1 is the optimal threshold value.

[5]

Cover T M, Thomas J A, “Elements of information theory”, 1991,

[6]

Rissanen J, “Modeling by shortest data description”, 1978, Automatica.

[7]

Grünwald P D, Myung I J, Pitt M A, “Advances in minimum

Wiley-Interscience, New York.

description length (Theory and Applications)”, 2005, The MIT Press. [8]

Hecker M, Lambeck S, Toepfer S, Eugene van Someren, Reinhard Guthke, "Gene regulatory network inference: Data integration in dynamic models-A review", 2008, Bio Systems.

[9]

Rissanen J, “An introduction to the MDL principle”, 2006, Helsinki Institute

for

Information

Technology,

Tampere

and

Helsinki

Universities of Technology, Finland, and University of London, England.

IV. Conclusion We have proposed a new gene regulatory inference algorithm which implements mutual information, conditional mutual information and the Predictive MDL principles. Our scheme requires only one user specified threshold parameter as against a similar technique that needs two such threshold estimates. The simulation results show that our predictive MDL principle is fair in determining the mutual information threshold. A problem with the proposed algorithm is in

(http://www.mdlresearch.org/jorma.rissanen/pub/Intro.pdf),

unpublished. [10] Zhang X, Baral C, Kim S, “An algorithm to learn causal relations between genes from steady state data: simulation and its application to melanoma dataset”, 2005,

Proceedings of 10th Conference on

Artificial Intelligence in Medicine (AIME 05), Aberdeen, Scotland, 524-534.

490 496 514 524