Inference of regulatory networks from microarray data with R and the Bioconductor package qpgraph Robert Castelo Research Program on Biomedical Informatics, Department of Experimental and Health Sciences, Universitat Pompeu Fabra, and Institut Municipal d'Investigació Mèdica, Barcelona, Spain. Email:
[email protected] (corresponding author) Alberto Roverato Department of Statistical Science. Università di Bologna, Bologna, Italy Email:
[email protected] (This is a preprint from a chapter of the following book currently in press: Next Generation Microarray Bioinformatics, J. Wang, A.C. Tan and T. Tian, Eds., Methods in Molecular Biology Series, Humana Press, USA, ISBN 978-1-61779-399-8.) Abstract Regulatory networks inferred from microarray data sets provide an estimated blueprint of the functional interactions taking place under the assayed experimental conditions. In each of these experiments, the gene expression pathway exerts a finely tuned control simultaneously over all genes relevant to the cellular state. This renders most pairs of those genes significantly correlated, and therefore, the challenge faced by every method that aims at inferring a molecular regulatory network from microarray data, lies in distinguishing direct from indirect interactions. A straightforward solution to this problem would be to move directly from bivariate to multivariate statistical approaches. However, the daunting dimension of typical microarray data sets, with a number of genes p several orders of magnitude larger than the number of samples n, precludes the application of standard multivariate techniques and confronts the biologist with sophisticated procedures that address this situation. We have introduced a new way to approach this problem in an intuitive manner, based on limited-order partial correlations, and in this chapter we illustrate this method through the R package qpgraph, which forms part of the Bioconductor project and is available at its website (1). Key words: molecular regulatory network, microarray data, reverse engineering, network inference, non-rejection rate, qpgraph
1
1. Introduction The genome-wide assay of gene expression by microarray instruments provides a high-throughput readout of the relative RNA concentration for a very large number of genes p across a typically much smaller number of experimental conditions n. This enables a fast systematic comparison of all expression profiles on a gene-by-gene basis by analysis techniques such as differential expression. However, the simultaneous assay of all genes embeds in the microarray data a pattern of correlations projected from the regulatory interactions forming part of the cellular state of the samples, and therefore, estimating this pattern from the data can aid in building a network model of the transcriptional regulatory interactions. Many published solutions to this problem rely on pairwise measures of association based on bivariate statistics, such as Pearson correlation or mutual information (2). However, marginal pairwise associations cannot distinguish direct from indirect (that is, spurious) relationships and specific enhancements to this pairwise approach have been made to address this problem (see, for instance, (3) and (4)). A sensible approach is to try to apply multivariate statistical methods such as undirected Gaussian graphical modeling (5) and compute partial correlations which are a measure of association between two variables while controlling for the remaining ones. However, these methods require inverting the sample covariance matrix of the gene expression profiles and this is only possible when n > p (6). This has led to the development of specific inferential procedures, which try to overcome the small n and large p problem by exploiting specific biological background knowledge on the structure of the network to be inferred. From this viewpoint, the most relevant feature of regulatory networks is that they are sparse, that is the direct regulatory interactions between genes represent a small proportion of the edges
2
present in a fully connected network (see, for instance, (7)). Statistical procedures for inference on sparse networks include, among others, a Bayesian approach with sparsity inducing prior (8), the lasso estimate of the inverse covariance matrix (see, among others, (9) and (10)), the shrinkage estimate of the covariance matrix (11) and procedures based on limited-order partial correlations (see, for instance, (12) and (13)). In (14) a procedure is proposed for the statistical learning of sparse networks based on a quantity called the non-rejection rate. The computation of the non-rejection rate requires carrying out a large number of hypothesis tests involving limited-order partial correlations, nonetheless that procedure is not affected by the multiple testing problem. Furthermore, in (15) it is shown that averaging non-rejection rates obtained through different orders of the partial correlations is an effective strategy to release the user from making an educated guess on the most suitable order. In the same article, a method based on the concept of functional coherence is introduced, for the comparison of the functional relevance of different inferred networks and their regulatory modules. In the rest of this chapter we show how to apply this entire methodology by using the statistical software R and the Bioconductor package qpgraph. 2. Materials 2.1. The non-rejection rate We represent the molecular regulatory network we want to infer by means of a mathematical object called a graph. A graph is a pair G=(V, E), where V={1,2, ... , p} is a finite set of vertices and E is a subset of pairs of vertices, called the edges of G. In this context, vertices are genes and edges are direct regulatory interactions (see Note 1). Nevertheless, the graphs we consider here have no multiple edges and no loops;
3
furthermore, they are undirected so that both (i,j) ∈ E and (j, i) ∈ E are an equivalent way to write that the vertices i and j are linked by an edge. A basic feature of graphs is that they are visual objects. In the graphical representation, vertices may be depicted with circles while undirected edges are lines joining pairs of vertices. For example, the graph G=(V, E) with V={1, 2, 3} and E={(1, 2), (2, 3)} can be represented as ──────. A path in G from i to j is a sequence of vertices such that i and j are the first and last vertex of the sequence, respectively, and every vertex in the sequence is linked to the next vertex by an edge. The subset Q ⊆ V is said to separate i from j if all paths from i to j have at least one vertex in Q. For instance, in the graph of the example above the sequence (1, 2, 3) is a path between 1 and 3 whereas the sequence (1, 3, 2) is not a path. Furthermore, the set Q={2} separates 1 from 3. The random vector of gene expression profiles is indexed by the set V and denoted by XV=(X1, X2, ... , Xp)T and, furthermore, we denote by ρij.V\{i,j} the full-order partial correlation between the genes i and j, that is the correlation coefficient between the two genes adjusted for all the remaining genes V\{i, j}. We assume that XV belongs to a Gaussian graphical model with graph G=(V, E) and refer to (5) for a full account on these models. Here, we recall that in a Gaussian graphical model XV is assumed to be multivariate normal and that the vertices i and j are not linked by an edge if and only if ρij.V\{i,j}=0. It follows that the sample version of full-order partial correlations plays a key role in statistical procedures for inferring the network structure from data. However, these quantities can be computed only if n is larger than p and this has precluded the application of standard techniques in the context of regulatory network inference from microarray data. On the other hand, if the edge between the genes i and j is missing from the graph then possibly a large number of limited-order partial correlations are equal to zero. More specifically, for a subset Q
⊂ V\{i,j} we denote by ρij.Q the limited-order partial correlation, that is the correlation
4
coefficient between i and j adjusted for the genes in Q. It can be shown that if Q separates i and j in G, then ρij.Q is equal to zero. This is a useful result because the sample version of ρij.Q can be computed whenever n > q+2 and, if the distribution of XV is faithful to G (see (14) and references therein), then ρij.Q=0 also implies that the vertices i and j are not linked by an edge in G. In sparse graphs one should expect a high degree of separation between vertices and therefore limited-order partial correlations are useful tools for inferring sparse molecular regulatory networks from data. There are, however, several difficulties related to the use of limited-order partial correlations because for every pair of genes i and j there are a huge number of potential subsets Q, and this leads to computational problems as well as to multiple testing problems. In (14) the authors propose to use a quantity based on partial correlations of order q that they call the non-rejection rate. The non-rejection rate for vertices i and j is denoted by NRR(i,j|q) and it is the probability of not rejecting, on the basis of a suitable statistical test, the hypothesis that ρij.Q=0 where Q is a subset of q genes randomly selected from V\{i,j}. Hence, the non-rejection rate is a probability associated to every pair of vertices, genes in the context of this chapter, and takes values between zero and one, with larger values providing stronger evidence that an edge is not present in G. The procedure introduced in (15) amounts to estimating the non-rejection rate for every pair of vertices, ranking all the possible edges of the graph according to these values and then removing those edges whose non-rejection rate values are above a given threshold. Different methods for the choice of the threshold are discussed in the forthcoming sections where the graph inferred with this method will be called the qp-graph; we refer to (14) and (15) for technical details. Here we recall that the computation of the non-rejection rate requires the specification of a value q corresponding to the dimension of the potential separator, with q ranging from the
5
value 1 to the value n-3. Obviously, a key question when using the non-rejection rate with microarray data is what value of q should be employed. We know that a larger value of q increases the probability that a randomly chosen subset Q separates i and j, but this could compromise the statistical power of the tests which depends on n-q. In (15) a simple and effective solution to this question was introduced and consists of averaging (taking the arithmetic mean), for each pair of genes, the estimates of the non-rejection rates for different values of q spanning its entire range from 1 to somewhere close to n-3. These authors also showed that the average non-rejection rate is more stable than the non-rejection rate, avoids having to specify a particular value of q and it behaves similarly to the non-rejection rate for connected pairs of vertices in the true underlying graph G (i.e., for directly interacting genes in the underlying molecular regulatory network). They also pointed out that the drawback of averaging is that a disconnected pair of vertices (i,j) in a graph G whose indirect relationship is mediated by a large number of other vertices, will be easier to identify with the non-rejection rate using a sufficiently large value of q than with the average non-rejection rate. However, in networks showing high degrees of modularity and sparseness the number of genes mediating indirect interactions should not be very large, and therefore, the average non-rejection rate should be working well, just as they observed in the empirical results reported in (15). 2.2. Functional coherence A critical question when estimating a molecular regulatory network from data is to know the extent to which the inferred regulatory relationships reflect the functional organization of the system under the experimental conditions employed to generate the microarray data. The authors in (15) addressed this question using the Gene Ontology (GO) database (16) which provides structured functional annotations on genes for a large number of organisms including Escherichia coli (E. coli). The
6
approach followed consists of assessing the functional coherence of every regulatory module within a given network. Assume a regulatory module is defined as a transcription factor and its set of regulated genes. The functional coherence of a regulatory module is estimated by relying on the observation that, for many transcription factor genes, their biological function, beyond regulating transcription, is related to the genes they regulate. Note that different regulatory modules can form part of a common pathway and thus share some more general functional annotations, which can lead to some degree of functional coherence between target genes and transcription factors of different modules. However, in (15) it is shown that for the case of E. coli data, the degree of functional coherence within a regulatory module is higher than between highly correlated but distinct modules. This observation allowed them to conclude that functional coherence constitutes an appealing measure for assessing the discriminative power between direct and indirect interactions and therefore can be employed as an independent measure of accuracy. The way in which the authors in (15) estimated functional coherence is as follows. Using GO annotations, concretely those that refer to the biological process (BP) ontology, two GO graphs are built such that vertices are GO terms and (directed) links are GO relationships. One GO graph is induced (i.e., grown toward vertices representing more generic GO terms) from GO terms annotated on the transcription factor gene discarding those terms related to transcriptional regulation. The other GO graph is induced from GO terms overrepresented among the regulated genes in the estimated regulatory module which, to try to avoid spuriously enriched GO terms, we take it only into consideration if it contains at least 5 genes. These overrepresented GO terms can be found, for instance, by using the conditional hypergeometric test implemented in the Bioconductor package GOstats (17) on the
7
E. coli GO annotations from the org.EcK12.eg.db Bioconductor package. Finally, the level of functional coherence of the regulatory module is estimated as the degree of similarity between the two GO graphs, which in this case amounts to a comparison of the two corresponding subsets of vertices. The level of functional coherence of the entire network is determined by the distribution of the functional coherence values of all the regulatory modules for which this measure was calculated (see Note 2). 2.3. Escherichia coli microarray data In this chapter we describe our procedure through the analysis of an E. coli microarray data set from (18) and deposited at the NCBI Gene Expression Omnibus (GEO) with accession GDS680. It contains 43 microarray hybridizations that monitor the response from E. coli during an oxygen shift targeting the a priori most relevant part of the network by using six strains with knockouts of key transcriptional regulators in the oxygen response (∆arcA, ∆appY, ∆fnr, ∆oxyR, ∆soxS and the double knockout ∆arcA∆fnr). We will infer a network starting from the full gene set of E. coli with p=4,205 genes (see the following subsection for details on filtering steps). 2.4. Escherichia coli functional and microarray data processing We downloaded the Release 6.1 from RegulonDB (19) formed by an initial set of 3,472 transcriptional regulatory relationships. We translated the Blattner IDs into Entrez IDs, discarded those interactions for which an Entrez ID was missing in any of the two genes and did the rest of the filtering using Entrez IDs. We filtered out those interactions corresponding to self-regulation and among those conforming to feedback-loop interactions we discarded arbitrarily one of the two interactions. Some interactions were duplicated due to a multiple mapping of some Blattner IDs to Entrez IDs, in that case we removed the duplicated interactions arbitrarily. We
8
finally discarded interactions that did not map to genes in the array and were left with 3,283 interactions involving a total of 1,428 genes. We have obtained RMA expression values for the data in (18) using the rma() function from the affy package in Bioconductor. We filtered out those genes, for which there was no Entrez ID and when two or more probesets were annotated under the same Entrez ID we kept the probeset with highest median expression level. These filtering steps left a total number of p=4,205 probesets mapped one-toone with E. coli Entrez genes. 3. Methods 3.1. Running the Bioconductor package qpgraph The methodology briefly described in this chapter is implemented in the software called qpgraph, which is an add-on package for the statistical software R (20). However, unlike most other available software packages for R, which are deposited at the Comprehensive R Archive Network -CRAN- (21), the package qpgraph forms part of the Bioconductor project (see (22) and (1)) and it is deposited in the Bioconductor website instead. The version of the software employed to illustrate this chapter runs over R 2.12 and thus forms part of Bioconductor package bundle version 2.7 (see Note 3). Among the packages that get installed by default with R and Bioconductor, qpgraph will automatically load some of them when calling certain functions but one of these, Biobase, should be explicitly loaded to manipulate microarray expression data through the ExpressionSet class of objects. Therefore, the initial sequence of commands to successfully start working with qpgraph through the example illustrated in this chapter is as folows: > library(Biobase) > library(qpgraph)
9
Additionally, we may consider the fact that most modern desktop computers come with four or more core processors and that it is relatively common to have access to a cluster facility with dozens, hundreds or perhaps thousands of processors scattered through an interconnected network of computer nodes. The qpgraph package can take advantage of such a multiprocessor hardware by performing some of the calculations in parallel. In order to enable this feature it is necessary to install the R packages snow and rlecuyer from the CRAN repository and load them prior to using the qpgraph package. The specific type of cluster configuration that will be employed will depend on whether additional packages providing such a specific support are installed. For example if the package Rmpi is installed, then the cluster configuration will be that of an MPI cluster (see (23) and Note 4 for details on this subject). Thus, if we want to take advantage of an available multiprocessor infrastructure we should additionally write the following commands: > library(snow) > library(rlecuyer)
Once these packages have been successfully loaded, to perform calculations in parallel it is necessary to provide an argument, called clusterSize, to the corresponding function indicating the number of processors that we wish to use. In this chapter we assume we can use 8 processors, which should allow the longest calculation illustrated in this chapter to finish in less than 15 minutes. During long calculations it is convenient to monitor their progress and this is possible in most of the functions from the qpgraph package if we set the argument verbose=TRUE, which by default is set to FALSE.
10
3.2. A quick tour through the qpgraph package In this section we illustrate the minimal function calls in the qpgraph package that allow one to infer a molecular regulatory network from microarray data. We need first to load the data described in the previous section and which is included as an example data set in the qpgraph package. > data(EcoliOxygen)
The previous command will load on our current R default environment two objects, one of them called gds680.eset, which is an object of the class ExpressionSet and contains the E. coli microarray data described in the previous section. We can see these objects in the workspace with the function ls() and figure out the dimension of this particular microarray data set with dim(), as follows: > ls() [1] "filtered.regulon6.1" "gds680.eset" > dim(gds680.eset) Features Samples 4205 43
When we have a microarray data set, either as an ExpressionSet object or simply as a matrix of numeric values, we can immediately proceed to estimate non-rejection rates with a q-order of, for instance, q=3 with the function qpNrr(): > nrr qpGraphDensity(nrr, title="", breaks=10) > g g A graphNEL graph with undirected edges Number of Nodes = 4205 Number of Edges = 644036
By default, the qpGraph() function returns an adjacency matrix but, by setting return.type="graphNEL" we obtain a graphNEL-class object as a result, which, as we shall see later, is amenable for processing with functions from the Bioconductor packages graph and Rgraphviz. We can conclude this quick tour through the main cycle of the task of inferring a network from microarray data by showing how we can extract a ranking of the strongest edges in the network with the function qpTopPairs(): > qpTopPairs(nrr) i j x 1 947758 947761 0 2 948517 948512 0 3 944834 944794 0 4 948517 944797 0 5 948512 944797 0 6 945112 945108 0
where the first two columns, called i and j, correspond to the identifiers of the pair of variables and the third column x corresponds, in this case, to non-rejection rate values. An immediate question is whether the value of q=3 was appropriate for this data set and while we may try to find an answer by exploring the estimated nonrejection rate values in a number of ways described in (14), an easy solution introduced in (15) consists of estimating the so-called average non-rejection rates whose corresponding function, qpAvgNrr(), is called in an analogous way to qpNrr() but without the need to specify a value for q.
12
In (15) a comparison of this procedure with other widely used techniques is carried out. Here, we restrict the comparison to a simple procedure based on sample Pearson correlation coefficients and, furthermore, to the worst performing strategy which consists of setting association values uniformly at random to every pair of genes (which we shall informally call the random association method) leading to a completely random ranking of the edges of the graph. All these quantities can be computed using two functions available also through the qpgraph package: > allGenes pcc rndcor regulonTFgenes set.seed(123) > avgnrr regulonGenes avgnrr.pr plot(100*avgnrr.pl[, 1:2], type="b", pch=19, xlim=c(0, 6), xlab="Recall (% RegulonDB interactions)", ylab="Precision (%)")
In Figure 1b this plot is shown jointly with the other calculated curves, where the comparison of the average non-rejection rate (labeled qp-graph) with the other methods yields up to 40% improvement in precision with respect to using absolute Pearson correlation coefficients and observe that for precision levels between 50% and 80% the qp-graph method doubles the recall. We shall see later that this has an important impact when targeting a network of a reasonable nominal precision in such a data set with p=4,205 and n=43. [FIGURE 1 NEAR HERE] 3.5. Inference of molecular regulatory networks of specific size Given a measure of association for every pair of genes of interest, the most straightforward way to infer a network is to select a number of top-scoring interactions that conform a resulting network of a specific size that we choose. We showed before such a strategy by looking at the graph density as a function of threshold, however, we can also extract a network of specific size by using the argument topPairs in the call to the qpGraph() and qpAnyGraph() functions where the call for the random association values would be analogous to the one of Pearson correlations. > qpg1000sze pcc1000sze avgnrr.50p.thr avgnrr.3r.thr qpg50pre pcc50pre qpg3rec pcc3rec library(GOstats) > library(org.EcK12.eg.db) > qpg50preFC boxplot(qpg50preFC$functionalCoherenceValues, col=grey(0.50), ylim=c(0,1), ylab="Functional coherence")
In Figure 2 we see the boxplots for the functional coherence values of all networks obtained from each method and selection strategy. Through the three different strategies, the networks obtained with the qp-graph method provide distributions of functional coherence with mean and median values larger than those obtained from networks built with Pearson correlation or simply at random. [FIGURE 2 NEAR HERE]
18
3.8. The 50%-precision qp-graph regulatory network We are going to examine in detail the 50%-precision qp-graph transcriptional regulatory network. A quick glance at the pairs with strongest average non-rejection rates including the functional coherence values of their regulatory modules within this 50%-precision network can be obtained with the function qpTopPairs() as follows: > qpTopPairs(measurementsMatrix=avgnrr, refGraph=qpg50pre, pairup.i=regulonTFgenes, pairup.j=allGenes, annotation="org.EcK12.eg.db", fcOutput=qpg50preFC, fcOutput.na.rm=TRUE) i j iSymbol jSymbol x funCoherence 1 948797 945585 appY appC 0.01 0.33 2 947466 945938 glcC mhpR 0.01 0.14 3 945938 946255 mhpR astC 0.01 0.71 4 947466 948336 glcC fadB 0.02 0.14 5 948797 947547 appY appB 0.02 0.33 6 948797 946206 appY appA 0.02 0.33
The previous function call admits also a file argument that would allow us to store these information as a tab-separated column text file, thus more amenable for automatic processing when combined with the argument n=Inf since by default this is set to a limited number (n=6) of pairs being reported. For many other types of analysis, it is useful to store the network as an object of the graphNEL class, which is defined in the graph package. This is obtained by calling the qpGraph() function setting properly the argument return.type as follows: > g g A graphNEL graph with undirected edges Number of Nodes = 147 Number of Edges = 120
As we see from the object's description, the 50%-precision qp-graph network consists of 120 transcriptional regulatory relationships involving 147 different genes. A GO enrichment analysis on this subset of genes can give us insights into the main
19
molecular processes related to the assayed conditions. Such an analysis can be performed by means of a conditional hypergeometric test using the Bioconductor package GOstats as follows: > goHypGparams goHypGcond table(sapply(connComp(g), length)) 2 24
3 6
4 4
6 2
8 1
9 17 19 1 1 1
and observe that two of them, formed by 17 and 19 genes, are distinctively larger than the rest, thus corresponding to the more complex part of the network. In order
20
to examine in more detail these two subnetworks we can plot them using the Bioconductor package Rgraphviz (see Note 5) and calling the function qpPlotNetwork() which will output Figure 3a: > library(Rgraphviz) > qpPlotNetwork(g, minimumSizeConnComp=17, pairup.i=regulonTFgenes, pairup.j=allGenes, annotation="org.EcK12.eg.db")
Often the visualization of many interacting genes is dificult to interpret as, for instance in this case, the module regulated by mhpR. We can also visualize the part of the network connected to mhpR by using the arguments vertexSubset and boundary as follows and obtain the result shown in Figure 3b: > g_mhpR qpTopPairs(measurementsMatrix=avgnrr[nodes(g_mhpR), nodes(g_mhpR)], refGraph=g_mhpR, pairup.i=regulonTFgenes, pairup.j=allGenes, annotation="org.EcK12.eg.db", fcOutput=qpg50preFC) i j iSymbol jSymbol x funCoherence 1 945938 947466 mhpR glcC 0.01 0.71 2 945938 946255 mhpR astC 0.01 0.71 3 947466 948336 glcC fadB 0.02 0.14 4 945938 944954 mhpR mhpC 0.02 0.71 5 947466 946255 glcC astC 0.02 0.14 6 945938 948823 mhpR fadI 0.02 0.71
This last step allows us to see that the two strongest associations occur within the mhpR regulatory module, which also has a very high value of functional coherence, thus constituting two promising candidates for a follow up study. [FIGURE 3 NEAR HERE]
21
Notes 1. The underlying method assumes that it is estimating an undirected Gaussian graphical model, which is a well-defined statistical model. However, our biological interpretation of this model as a transcriptional regulatory network will lead us to discard interactions between genes where none of them is a transcription factor, and to put directions in the resulting graph from transcription factor genes to their putative targets. This provides us with a network model of the underlying transcriptional regulation, which does not have a statistical interpretation anymore in terms of, for instance, conditional independence, but which allows one to formulate educated guesses on plausible biological hypotheses. 2. The limited availability of GO functional annotations for genes outside wellstudied model organisms can compromise a reliable estimation of functional coherence values. 3. Bioconductor release versions are synchronized with R software release versions and thus updated twice a year. It is always recommended to work with the latest versions. For a detailed explanation on how to install and update the R and Bioconductor software please visit the website (27). 4. The installation of the package Rmpi requires a prior installation and configuration of an MPI library. For further details on this issue please visit the website (28). 5. The installation of the package Rgraphviz requires a prior installation of the software graphviz available at the website (29).
22
Acknowledgments This work is supported by the Spanish Ministerio de Ciencia e Innovación (MICINN) [TIN2008-00556 / TIN] and the ISCIII COMBIOMED Network [RD07/0067/0001]. R.C. is a research fellow of the "Ramon y Cajal" program from the Spanish MICINN [RYC-2006-000932]. A.R. acknowledges support from the Ministero dell'Università e della Ricerca [PRIN-2007AYHZWC]. References 1.
http://www.bioconductor.org
2.
Butte, A. J., Tamayo, P., Slonim, D., et al. (2000) Discovering functional relationships between RNA expression and chemotherapeutic susceptibility using relevance networks., Proc Natl Acad Sci U S A 97, 12182-12186.
3.
Basso, K., Margolin, A. A., Stolovitzky, G., et al. (2005) Reverse engineering of regulatory networks in human B cells., Nat Genet 37, 382-390.
4.
Faith, J. J., Hayete, B., Thaden, J. T., et al. (2007) Large-scale mapping and validation of Escherichia coli transcriptional regulation from a compendium of expression profiles., PLoS Biol 5, e8.
5.
Edwards, D. (2000) Introduction to graphical modelling, Springer New York.
6.
Dykstra, R. L. (1970) Establishing Positive Definiteness of Sample Covariance Matrix, Ann Math Statist 41, 2153-2154.
7.
Barabasi, A.-L., and Oltvai, Z. N. (2004) Network biology: understanding the cell's functional organization., Nat Rev Genet 5, 101-113.
8.
Dobra, A., Hans, C., Jones, B., et al. (2004) Sparse graphical models for exploring gene expression data, J. Multivariate. Anal. 90, 196-212.
9.
Friedman, J., Hastie, T., and Tibshirani, R. (2008) Sparse inverse covariance estimation with the graphical lasso, Biostatistics 9, 432-441.
23
10.
Yuan, M., and Lin, Y. (2007) Model selection and estimation in the Gaussian graphical model, Biometrika 94, 19-35.
11.
Schäfer, J., and Strimmer, K. (2005) A shrinkage approach to large-scale covariance matrix estimation and implications for functional genomics, Stat. Appl. Genet. Mol. Biol. 4, 1-32.
12.
de la Fuente, A., Bing, N., Hoeschele, I., et al. (2004) Discovery of meaningful associations in genomic data using partial correlation coefficients Bioinformatics 20, 3565-3574.
13.
Wille, A., and Bühlmann, P. (2006) Low-order conditional independence graphs for inferring genetic networks, Stat. Appl. Genet. Mol. Biol. 5, 1.
14.
Castelo, R., and Roverato, A. (2006) A robust procedure for Gaussian graphical model search from microarray data with p larger than n, J Mach Learn Res 7, 2621-2650.
15.
Castelo, R., and Roverato, A. (2009) Reverse engineering molecular regulatory networks from microarray data with qp-graphs, J Comput Biol 16, 213-227.
16.
http://www.geneontology.org
17.
Falcon, S., and Gentleman, R. (2007) Using GOstats to test gene lists for GO term association., Bioinformatics 23, 257-258.
18.
Covert, M. W., Knight, E. M., Reed, J. L., et al. (2004) Integrating highthroughput and computational data elucidates bacterial networks., Nature 429, 92-96.
19.
Gama-Castro, S., Jimenez-Jacinto, V., Peralta-Gil, M., et al. (2008) RegulonDB (version 6.0): gene regulation model of Escherichia coli K-12 beyond transcription, active (experimental) annotated promoters and Textpresso navigation., Nucleic Acids Res 36, D120-124.
20.
http://www.r-project.org
24
21.
http://cran.r-project.org
22.
Gentleman, R. C., Carey, V. J., Bates, D. M., et al. (2004) Bioconductor: open software development for computational biology and bioinformatics., Genome Biol 5, R80.
23.
Schmidberger, M., Morgan, M., Eddelbuettel, D., et al. (2009) State-of-the-art in Parallel Computing with R, Journal of Statistical Software 31.
24.
Wasserman, W. W., and Sandelin, A. (2004) Applied bioinformatics for the identification of regulatory elements, Nat Rev Genet 5, 276-287.
25.
Fawcett, T. (2006) An introduction to ROC analysis, Pattern Recogn Lett 27, 861-874.
26.
Cho, B.-K., Knight, E. M., and Palsson, B. O. (2006) Transcriptional regulation of the fad regulon genes of Escherichia coli by arcA., Microbiology 152, 22072219.
27.
http://www.bioconductor.org/install
28.
http://www.stats.uwo.ca/faculty/yu/Rmpi
29.
http://www.graphviz.org
25
Tables Table 1. Gene Ontology biological process terms enriched (P-value ≤ 0.05) among the 147 genes forming the 50%-precision qp-graph network inferred from the oxygen deprivation data in (18). GO Term Identifier GO:0006350 GO:0009059
P-value
Exp. Count
Count
Size
< 0.0001 0.0004
Odds Ratio 4.76 2.14
13.79 27.81
39 43
292 589
GO:0019395 GO:0030258 GO:0044260
0.0022 0.0022 0.0035
5.34 5.34 1.84
1.42 1.42 38.15
6 6 51
30 30 808
GO:0044238 GO:0006542 GO:0006578 GO:0009268 GO:0006807
0.0073 0.0096 0.0124 0.0124 0.0398
2.08 8.92 20.62 20.62 1.50
66.10 0.47 0.19 0.19 43.44
76 3 2 2 52
1400 10 4 4 920
GO:0042594
0.0428
4.44
0.80
3
17
26
Term transcription macromolecule biosynthetic process fatty acid oxidation lipid modification cellular macromolecule metabolic process primary metabolic process glutamine biosynthetic process betaine biosynthetic process response to pH nitrogen compound metabolic process response to starvation
Figure legends Figure 1. Performance comparison on the oxygen deprivation E. coli data with respect to RegulonDB. (A) Graph density as function of the non-rejection rate estimated with q=3. (B) Precision-recall curves comparing a random ranking of the putative interactions, a ranking made by absolute Pearson correlation (Pairwise PCC) and a ranking derived from the average non-rejection rate (qp-graph). Figure 2. Functional coherence estimated from networks derived with different strategies and methods. (A) A nominal RegulonDB-precision of 50%, (B) a nominal RegulonDB-recall of 3%, and (C) using the top ranked 1 000 interactions. On the xaxis and between square brackets, under each method, are indicated, respectively, the total number of regulatory modules of the network, the number of them with at least 5 genes and the number of them with at least 5 genes with GO-BP annotations. Among this latter number of modules, the number of them where the transcription factor had GO annotations beyond transcription regulation is noted above between parentheses by n and corresponds to the number of modules on which functional coherence could be calculated. Figure 3. The 50%-precision qp-graph transcriptional network. (A) The two largest connected components. (B) The mhpR regulatory module in detail.
27
Figure 1 100%
16
33
49
66
82
98
115 131
148 164
18%
1%
0.0
60 30 20
28%
0
7%
13%
23%
10
20
41% 34%
50
60 40
51%
40
67%
graph density
0
Precision (%)
70
80
80
181 197
qp−graph Pairwise PCC Random
90
graph density
0
100
B
100
A
Recall (# RegulonDB interactions)
0.2
0.4
0.6
0.8
threshold
1.0
0
1
2
3
4
5
6
Recall (% RegulonDB interactions)
Figure 2 B
PwsePCC (n=1) [13,3,1]
Method
Random (n=79) [160,160,119]
1.0 0.8 0.7 0.6 0.5 0.4
Functional coherence
0.8 0.7 0.6 0.5 0.4 0.3
Functional coherence
0.2
0.2
0.1
0.1
0.0
0.0
qp−graph (n=3) [46,7,3]
Top 1000 interactions
0.9
0.9
0.9 0.8 0.7 0.6 0.5 0.4 0.3 0.2 0.1 0.0
Functional coherence
C
3% recall 1.0
1.0
50% precision
0.3
A
qp−graph (n=13) [104,36,19]
PwsePCC (n=26) [97,56,41]
Method
Random (n=104) [160,160,160]
qp−graph (n=24) [145,69,38]
PwsePCC (n=15) [58,31,22]
Method
Random (n=19) [160,128,26]
Figure 3 A
hdeB
B
hdeA
hdeD
fadB
gadB
fadA
astC fadI
fadD
slp gadE gadA
ybaS gadW
gadC
aidB
fadA astC gadX
fadB
astB fadJ
ybaT acs
yjcH yejG
mhpR
arcA
dctR mdtE
mdtF
acs
glcC
astD yjcH
aldA mglB
yhiD
fadJ
fadI
glcC
fadD
mhpR
glcD
astB astD
mhpC
astA
astA
mhpC
yejG