Learning probabilistic networks of condition-specific response: Digging deep in yeast stationary phase

Sushmita Roy*, Terran Lane*, and Margaret Werner-Washburne+
* Department of Computer Science, University of New Mexico
+ Department of Biology, University of New Mexico
Abstract

Condition-specific networks are functional networks of genes describing molecular behavior under different conditions such as environmental stresses, cell types, or tissues. These networks frequently comprise parts that are unique to each condition, and parts that are shared among related conditions. Existing approaches for learning condition-specific networks typically identify either only differences or similarities across conditions. Most of these approaches first learn networks per condition independently, and then identify similarities and differences in a post-learning step. Such approaches have not exploited the shared information across conditions during network learning. We describe an approach for learning condition-specific networks that simultaneously identifies the shared and unique subgraphs during network learning, rather than as a post-processing step. Our approach learns networks across condition sets, shares data from conditions, and leads to high quality networks capturing biologically meaningful information. On simulated data from two conditions, our approach outperformed an existing approach of learning networks per condition independently, especially on small training datasets. We further applied our approach to microarray data from two yeast stationary-phase cell populations, quiescent and non-quiescent. Our approach identified several functional interactions that suggest respiration-related processes are shared across the two conditions. We also identified interactions specific to each population, including regulation of epigenetic expression in the quiescent population, consistent with known characteristics of these cells. Finally, we found several high confidence cases of combinatorial interaction among single gene deletions that can be experimentally tested using double gene knock-outs, and contribute to our understanding of differentiated cell populations in yeast stationary phase.
1 Introduction
Although the DNA of an organism is relatively constant, every organism on earth has the potential to respond to different environmental stimuli or to differentiate into distinct cell types or tissues. Different environmental conditions, cell types or tissues can be considered as different instantiations of a global variable, the condition variable, which induces condition-specific responses. These condition-specific responses typically require global changes at the transcript, protein and metabolic levels and are of interest because they provide insight into how organisms function at a systems level. Condition-specific networks describe functional interactions among genes and other macromolecules under different conditions, providing a systemic view of condition-specific behavior in organisms. Analysis of condition-specific responses has been one of the principal goals of molecular biology, and several approaches have been developed to capture condition-specific responses at different levels of granularity. The most common approach is the identification of differentially expressed genes in a condition of interest using genome-wide measurements of gene and, often, protein expression [20]. More recent approaches are based on bi-clustering, which clusters genes and conditions simultaneously [5,7,9,29] and identifies sets of genes that are co-regulated in sets of conditions. However, these approaches do not provide the fine-grained interaction structure that explains the condition-specific response of genes. More advanced approaches additionally identify transcription modules (sets of transcription factors regulating sets of target genes) that are co-expressed in a condition-specific manner [11,13,26,31], but these too do not provide detailed interaction information among genes for each condition.
In this paper, we describe a novel approach, Network Inference with Pooling Data (NIPD), for condition-specific response analysis that emphasizes the fine-grained interaction patterns among genes under different conditions. The main conceptual contribution of our approach is to learn networks for any subset of conditions. This subsumes existing approaches, which find either only patterns that are specific to each condition, or only patterns that are shared across conditions. To make this clear, consider a simple example of two environmental starvation conditions: Carbon and Nitrogen starvation. Using our approach we can simultaneously find patterns that are
specific only to Carbon starvation, only to Nitrogen starvation, and those that are shared across these two conditions. From a methodological standpoint our work is similar to Bayesian multinets [10], which we extend by allowing data to be pooled across conditions and by learning networks for any subset of conditions. NIPD is based on the framework of probabilistic graphical models (PGMs), where edges represent pairwise and higher-order statistical dependencies among genes. Similar to existing PGM learning algorithms, NIPD infers networks by iteratively scoring candidate networks and selecting the network with the highest score [12]. However, NIPD uses a novel score that evaluates candidate networks with respect to data from any subset of conditions, pooling data for subsets with more than one condition. This subset score and search strategy incorporates and exploits the shared information across conditions during structure learning, rather than as a post-processing step. As a result, we are able to identify sub-networks specific not only to one condition but to multiple conditions simultaneously, which allows us to build a more holistic picture of condition-specific response. The data pooling aspect of NIPD makes more data available for estimating parameters for higher-order interactions, i.e., interactions among more than two genes. This enables NIPD to robustly estimate higher-order interactions, which are otherwise difficult to estimate due to the large number of parameters relative to pairwise dependencies.
By formulating NIPD in the framework of PGMs we gain additional benefits: (a) PGMs are generative models of the data, providing a system-wide description of condition-specific behavior as a probabilistic network; (b) the probabilistic component naturally handles noise in the data; (c) the graph structure captures condition-specific behavior at the level of gene-gene interactions, rather than coarse clusters of genes; (d) the PGM framework can be easily extended to more complex situations where the condition variable itself is a random variable that must be inferred during network learning. We implement NIPD with undirected probabilistic graphical models [14]; however, the NIPD framework is applicable to directed graphs as well. We are not the first to propose networks for capturing condition-specific behavior [24,34]. Several network-based approaches have been developed for capturing condition-specific behavior
such as disease-specific subgraphs in cancer [8], stress response networks in yeast [21], or networks across different species [4,28]. However, these approaches are not probabilistic in nature, often rely on the network being known, and are restricted to pairwise co-expression relationships rather than general statistical dependencies. Other approaches, such as differential dependency networks [34] and mixtures of subgraphs [24], construct probabilistic models but focus on differences rather than on both differences and similarities. The majority of these approaches infer a network for each condition separately, and then compare the networks from different conditions to identify the edges capturing condition-specific behavior. We compared NIPD against an existing approach that learns networks from the conditions independently. We refer to this approach as INDEP; it represents a general class of existing algorithms that learn networks per condition independently. On simulated data from networks with known ground truth, NIPD inferred networks of higher quality than did INDEP, especially on small training datasets. We also applied our approach to microarray data from two yeast (Saccharomyces cerevisiae) cell types, quiescent and non-quiescent, isolated from glucose-starved, stationary-phase cultures [2]. Networks learned by NIPD were associated with many more Gene Ontology biological processes [3], or were enriched in targets of known transcription factors (TFs) [17], than networks learned by INDEP. Many of the TFs were involved in stress response, which is consistent with the fact that the populations are under starvation stress. NIPD also identified many more shared edges representing biologically meaningful dependencies than did INDEP. This suggests that by pooling data from multiple conditions, we are able not only to capture shared structures better, but also to infer networks of higher overall quality.
2 Results
The goal of our experiments was threefold: (a) to examine the quality of condition-specific networks inferred by our approach, which combines data from different conditions (NIPD), versus an independent learner (INDEP); (b) to evaluate algorithmic performance (measured by network structure quality) as a function of training data size; (c) to analyze how two different cell populations behave, at the network level, in response to the same starvation stress. We address (a) and (b)
on simulated data from networks with known topology, giving us ground truth to directly validate the inferred networks. We address (c) on microarray data from two yeast cell populations isolated from glucose-starved stationary phase cultures [2].
2.1 NIPD had superior performance on networks with known ground truth
We simulated data from two sets of networks, each set with two networks, one network per condition. In the first, HIGHSIM, the networks for the two conditions shared a larger portion (60%) of their edges; in the second, LOWSIM, the networks shared a smaller portion (20%) of their edges. We compared the networks inferred by NIPD to those inferred by INDEP by assessing the match between true and inferred node neighborhoods (see Supplementary Methods). Briefly, the data were split into q partitions, where q ∈ {2, 4, 6, 8, 10}, and networks were learned for each partition. The size of the training data decreased with increasing q. We first evaluated overall network structure quality by counting the nodes on which one approach was significantly better (t-test p-value < 0.05) at capturing the node's neighborhood, as a function of q. On LOWSIM, NIPD was significantly better for smaller amounts of training data. On HIGHSIM, NIPD performed significantly better than INDEP for all training data sizes (Fig 1). Next, we evaluated how well the shared edges were captured as a function of decreasing amounts of training data (Supplementary Fig 1). NIPD captured shared edges better than INDEP on LOWSIM as the amount of training data decreased. NIPD was better than INDEP on HIGHSIM regardless of the size of the training data. Our results show that when the underlying networks corresponding to the different conditions share a lot of structure, NIPD has a significantly greater advantage over INDEP, which does not do any pooling. Furthermore, as training data size decreases, NIPD is better than INDEP at learning both overall and shared structures, independent of the extent of sharing in the true networks.
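The exact neighborhood-match measure is defined in the Supplementary Methods. As a rough illustration of the kind of per-node statistic such a comparison relies on, one can score each node by the F-score between its true and inferred neighbor sets. This is our own illustrative choice of formula and function name, not necessarily the supplementary definition:

```python
def neighborhood_f1(true_nbrs, inferred_nbrs):
    """F-score between a node's true and inferred neighbor sets.

    Both arguments are sets of node identifiers. Returns 0.0 when there
    is no overlap (including when either set is empty).
    """
    tp = len(true_nbrs & inferred_nbrs)  # correctly recovered neighbors
    if tp == 0:
        return 0.0
    precision = tp / len(inferred_nbrs)
    recall = tp / len(true_nbrs)
    return 2 * precision * recall / (precision + recall)
```

Per-node scores from the q data partitions for NIPD and INDEP could then be compared with a paired t-test at p < 0.05, as in the evaluation above.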
2.2 Application to yeast quiescence
We applied NIPD to microarray data from two yeast cell populations, quiescent (QUIESCENT) and non-quiescent (NON-QUIESCENT), isolated from glucose starvation-induced stationary phase
cultures [2]. The two cell populations are in the same media but have differentiated physiologically and morphologically, suggesting that each population is responding differently. We learned networks using NIPD and INDEP, treating each cell population as a condition. Because each array in the dataset was obtained from a single gene deletion mutant, the networks were constrained such that genes with deletion mutants were connected to the remaining genes¹. The inferred networks from both methods were evaluated using information from Gene Ontology (GO) process, GO Slim [3] and transcriptional regulatory networks [17]. Gene Ontology is a hierarchically structured ontology of terms used to annotate genes. GO Slim is a collapsed, single-level view of the complete set of GO terms, providing high-level information about the processes, functions and cellular locations involving a set of genes. Finally, we analyzed combinations of genes with deletions that were in the neighborhood of other non-deletion genes.
2.2.1 NIPD identified more biologically meaningful dependencies
To determine whether one network was more biologically meaningful than the other, we examined the networks based on Gene Ontology (GO) Slim categories (process, function and location), transcription factor binding data and GO process, referred to as GOSLIM, TFNET and GOPROC, respectively (Fig 2). Network quality was determined by the number of GOSLIM (or TFNET or GOPROC) categories with better coverage than random networks (see Methods). Both approaches were equivalent for GOSLIM, with INDEP outperforming NIPD in QUIESCENT and NIPD outperforming INDEP in NON-QUIESCENT. On TFNET categories from NON-QUIESCENT, NIPD outperformed INDEP by a larger margin than it was itself outperformed. NIPD was consistently better than INDEP on GOPROC categories. The networks learned by NIPD had many more edges than the networks learned by INDEP (Supplementary Table 1). To estimate the proportion of edges capturing biologically meaningful relationships, we computed the semantic similarity of the genes connected by each edge [16]. Although both INDEP and NIPD had significantly better semantic similarity than random networks, INDEP degraded in p-value for QUIESCENT at the highest value of semantic similarity (Fig 3). NIPD-inferred networks had many more edges with high semantic similarity than INDEP, while keeping the proportion of edges satisfying a particular semantic similarity threshold close to that of INDEP. This suggests that NIPD identifies more dependencies that are biologically relevant than INDEP does, without suffering in precision.

¹This is not a bipartite graph, because the genes with deletion mutants are allowed to connect to each other.
2.2.2 NIPD identified more shared edges representing common starvation response
We performed a more fine-grained analysis of the inferred networks by considering each gene and its immediate neighborhood and testing whether these gene neighborhoods were enriched in GO biological processes, or in the target sets of transcription factors (TFs) (see Methods). Using a false discovery rate (FDR) cutoff of 0.05, we identified many more subgraphs in the networks inferred by NIPD than by INDEP as enriched in a GO process or in targets of TFs (Figs 4, 5). NIPD identified more processes and larger subgraphs in both populations (oxidative phosphorylation, protein folding, fatty acid metabolism, ammonium transport) than did INDEP. NIPD-identified subgraphs involved in aerobic respiration and oxidative phosphorylation were enriched in targets of HAP4, a global activator of respiration genes. The presence of HAP4 targets in both cell populations makes sense because both populations are experiencing glucose starvation and must switch to respiration to derive energy. We also found the TFs MSN2, MSN4, and HSF1 regulating subgraphs involved in protein folding. These TFs activate stress responses and are known to activate genes involved in heat, oxidative and starvation stress. We also found targets of SIP4 in both populations. SIP4 is a transcriptional activator of gluconeogenesis [32], expressed highly in glucose-repressed cells [15], and therefore would be expected to be present in both quiescent and non-quiescent cells. In contrast, the only shared regulatory connection found by INDEP was HAP4. We conclude that the NIPD approach identified more networks that were biologically relevant and informative about the glucose starvation response than did INDEP.
2.2.3 Wiring differences in NIPD-inferred networks exhibit population-specific starvation response
NIPD identified several processes associated exclusively with quiescent cells. These included regulatory processes (regulation of epigenetic gene expression, and regulation of nucleobase, nucleoside and nucleic acid metabolism) and metabolic processes (pentose phosphate shunt). These are novel predictions that highlight differences between these cells based on network wiring. INDEP identified only one population-specific GO process (response to reactive oxygen species in NON-QUIESCENT). An INDEP-identified subgraph specific to QUIESCENT (protein de-ubiquitination) was actually a subset of the NIPD-identified subgraph involved in regulation of epigenetic gene expression, indicating that NIPD subsumed most of the information captured by INDEP. NIPD QUIESCENT networks contained subgraphs enriched exclusively in targets of SKO1 and AZF1. Both are zinc finger TFs: AZF1 protein is expressed highly under non-fermentable carbon sources [27], and SKO1 regulates low-affinity glucose transporters [30]; both are consistent with the condition experienced by these cells. Unlike NIPD, which identified SIP4 as associated with both populations, INDEP identified SIP4 only in QUIESCENT. However, as described in the previous section, SIP4 is more likely involved in both the QUIESCENT and NON-QUIESCENT populations. INDEP also found the TFs YAP7 and AFT2 exclusively in QUIESCENT and NON-QUIESCENT, respectively. YAP7 is involved in general stress response and would be expected to have targets in both QUIESCENT and NON-QUIESCENT. AFT2 is required under oxidative stress, consistent with the over-abundance of reactive oxygen species in the NON-QUIESCENT population [1]. NIPD also identified wiring differences in the subgraphs involved in shared processes. For example, in addition to HAP4, NIPD identified HAP2 as an important TF in QUIESCENT.
The presence of both HAP2 and HAP4 makes biological sense because they are both part of the HAP2/HAP3/HAP4/HAP5 complex required for activation of respiratory genes. The presence of both HAP2 and HAP4 in QUIESCENT, but not NON-QUIESCENT, suggests that the QUIESCENT population may be better equipped for respiration and long-term survival in stationary phase.
Overall, the NIPD-inferred networks captured key differences and similarities in metabolic and regulatory processes, which are consistent with existing information about these cell populations [1,2], and also include novel findings that can provide new insight into the starvation response in yeast.
2.2.4 NIPD identified several knock-out combinations
The microarrays used in this study measured the expression profiles of single gene deletion mutants previously identified as highly expressed at the mRNA level in stationary phase. We constrained the inferred networks so that gene neighborhoods comprised only the genes with deletion mutants, allowing us to identify combinations of such deletion mutants and their targets. Such combinations can be validated in the laboratory to verify cross-talk between pathways. We found that NIPD-inferred networks contained significantly more deletion combinations than random networks for both the quiescent and non-quiescent populations (p-value < 3E-10, Supplementary Tables 3, 4, 5), which was not the case for the INDEP-inferred networks (Supplementary Tables 6, 7). A more stringent analysis of the knock-out combinations using GO process semantic similarity identified several double knock-out and target gene candidates (Supplementary Table 2). We also found more deletion combinations in NON-QUIESCENT than in QUIESCENT. This is consistent with the identification of many more mutants affecting non-quiescent than quiescent cells [2]. In QUIESCENT, we found three genes that were all likely downstream targets of a COX7-QCR8 double knock-out, all involved in the cytochrome-c oxidase complex of the mitochondrial inner membrane. Other deletion mutant combinations were involved in mitochondrial ATP synthesis and ion transport. Many of these genes have been shown to be required for quiescent and non-quiescent cell function, viability and survival [2,18]. In NON-QUIESCENT, we found several knock-out combinations involved in oxidative phosphorylation, aerobic respiration, etc., including a novel combination, YMR31 and QCR8, connected to TPS2. All three genes are found in the mitochondria, which play a critical and complex role in starved cells, but the exact mechanisms are not well understood.
Experimental analysis of this triplet can provide new insights into the role of mitochondria in glucose-starved cells. In summary, these results demonstrated another benefit
of data pooling in NIPD: learning more complex, combinatorial relationships among genes.
3 Discussion
Inference and analysis of cellular networks has been one of the cornerstones of systems biology. We have developed a network learning approach, Network Inference with Pooling Data (NIPD), to capture a systemic view of condition-specific response. NIPD is based on probabilistic graphical models and infers the functional wiring among genes involved in condition-specific response. The crux of our approach is to learn networks for any subset of conditions, capturing fine-grained gene interaction patterns not only in individual conditions but in any combination of conditions. This allows NIPD to robustly identify both shared and unique components of condition-specific cellular networks. In comparison to an approach that learns networks independently (INDEP), NIPD (a) pools data across different conditions, enabling better exploitation of the shared information between conditions, (b) learns better overall network structures in the face of decreasing amounts of training data, and (c) learns structures with many more biologically meaningful dependencies. Small training datasets, which are especially common for biological data, present significant challenges for any network learning approach. In particular, approaches such as INDEP may learn drastically different networks due to small data perturbations, leading to differences that are not biologically meaningful. NIPD is more resilient to small perturbations because, by pooling data from different conditions during network learning, it effectively has more data for estimating parameters for the shared parts of the network. Another challenge in the analysis of condition-specific networks is to extract patterns that are shared across conditions. Approaches such as INDEP that learn networks for each condition independently, and then compare the networks, are more likely to learn different networks, making it difficult to identify the similarities across conditions.
Application of both the NIPD and INDEP approaches to microarray data from two yeast populations showed that many of the subgraphs that would be considered specific to each population by INDEP were actually shared biological processes that must be activated in both populations irrespective of their morphological and physiological differences.
One of the strengths of NIPD in comparison with INDEP was its ability to identify pairs of gene deletions and downstream targets using data from individual gene deletions. Remarkably, several of these gene deletions are already known to have a phenotypic effect on stationary-phase cultures, often on quiescent or non-quiescent cells specifically (Supplementary Table 2) [2,18]. These predictions are therefore good candidates for future experiments using double deletion mutants, and they drastically reduce the space of possible combinations of the sixty-nine single gene deletions. Identification of population-specific malfunctions in signaling pathways via experimental analysis of these multiple deletions can provide new insight into aging and cancer studies that use yeast stationary phase as a model system. The NIPD approach establishes groundwork for important future enhancements, including the ability to efficiently learn networks from many conditions. The probabilistic framework of NIPD can be easily extended to automatically infer the condition variable, making NIPD widely applicable to datasets with uncertainty about the conditions. The NIPD approach can also integrate novel types of high-throughput data, including RNA-Seq [33] and ChIP-Seq [25]. These extensions will allow us to systematically identify the parts, and the wiring among them, that determine stage-specific, tissue-specific and disease-specific behavior in whole organisms.
4 Methods
4.1 Independent learning of condition-specific networks: INDEP
Existing approaches for learning condition-specific networks [4,21,28] can be considered special cases of a general independent learning approach, INDEP, in which networks for each condition are learned independently and then compared to identify network parts unique to or shared across conditions. Let {D_1, · · · , D_k} denote k datasets from k conditions. In the INDEP approach, each network G_c, 1 ≤ c ≤ k, is learned independently using data from D_c only. Our implementation of the INDEP framework considers each G_c as an undirected probabilistic graphical model, or Markov random field (MRF) [14], which, like a Bayesian network, can capture higher-order dependencies,
but additionally captures cyclic dependencies. We use a pseudo-likelihood framework with an MDL penalty to learn the structure of the MRF [6]. The pseudo-likelihood score for a network G_c describing data D_c is PLL(G_c) = Σ_{i=1}^{N} PLLV(X_i, M_{ci}, c), where X_1, · · · , X_N are the random variables (one per gene) encoding the expression values of the genes. PLLV is X_i's contribution to the overall pseudo-likelihood and is defined, including a minimum description length (MDL) penalty, as PLLV(X_i, M_{ci}, c) = Σ_{d=1}^{|D_c|} log P(X_i = x_{di} | M_{ci} = m_{cdi}) − (|θ_{ci}| log|D_c|)/2. Here M_{ci} is the Markov blanket (MB) of X_i in condition c, and x_{di} and m_{cdi} are the assignments to X_i and M_{ci}, respectively, from the dth data point. θ_{ci} are the parameters of the conditional distribution P(X_i | M_{ci}). We assume the conditional distributions are conditional Gaussians. The structure learning algorithm for each graph is described in [22].
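The penalized pseudo-likelihood above can be sketched concretely. This is our own minimal illustration, assuming the conditional Gaussians are fit by least squares; the names pllv and pll mirror the text, but the authors' actual implementation follows [6] and [22]:

```python
import numpy as np

def pllv(x_i, mb_data):
    """Pseudo-log-likelihood contribution of one variable X_i with an MDL
    penalty.  x_i: (D,) expression values; mb_data: (D, m) values of X_i's
    Markov blanket.  The conditional Gaussian is fit by least squares."""
    D = len(x_i)
    if mb_data.shape[1] == 0:
        resid = x_i - x_i.mean()
        k = 2                                  # mean + variance
    else:
        X = np.column_stack([np.ones(D), mb_data])
        beta, *_ = np.linalg.lstsq(X, x_i, rcond=None)
        resid = x_i - X @ beta
        k = len(beta) + 1                      # regression weights + variance
    var = max(resid.var(), 1e-8)
    loglik = -0.5 * D * np.log(2 * np.pi * var) - 0.5 * (resid @ resid) / var
    return loglik - 0.5 * k * np.log(D)        # MDL penalty

def pll(data, markov_blankets):
    """Penalized pseudo-log-likelihood PLL(G_c) of one network.  data is
    (D, N); markov_blankets[i] lists the column indices of X_i's neighbors."""
    return sum(pllv(data[:, i], data[:, markov_blankets[i]])
               for i in range(data.shape[1]))
```

A richer Markov blanket raises the fit term but also the MDL penalty, so edges are only worthwhile when they genuinely improve the conditional fit.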
4.2 Network Inference with Pooling Data: NIPD
The NIPD approach extends the INDEP approach by incorporating shared information across conditions during structure learning. In this framework, we do not learn networks for each condition c separately. Instead, we devise a score for each edge addition that considers networks for any subset of the conditions. Let C denote the set of k conditions. For a non-singleton set E ⊆ C, we pool the data from all conditions e ∈ E and evaluate the overall score improvement of adding an edge to the networks for all e ∈ E. To learn {G_1, · · · , G_k} for the k conditions simultaneously, we maximize the following MDL-based score:
S(G_1, · · · , G_k) = P(D_1, · · · , D_k | θ_1, · · · , θ_k) P(θ_1, · · · , θ_k | G_1, · · · , G_k) + MDL penalty    (1)
Here θ_1, · · · , θ_k are the maximum likelihood parameters for the k graphs. We assume P(D_c | θ_1, · · · , θ_k) = P(D_c | θ_c); that is, if we know the parameters θ_c, the likelihood of the data from condition c, D_c, can be estimated independently. Thus, P(D_1, · · · , D_k | θ_1, · · · , θ_k) = Π_{c=1}^{k} P(D_c | θ_c). Because our networks are MRFs, we use the pseudo-likelihood PLL(D_c). We expand the complete condition-specific parameter set θ_c into {θ_{c1}, · · · , θ_{cN}}, the set of parameters of each variable X_i, 1 ≤ i ≤ N,
in condition c. Using the parameter modularity assumption for each variable, we have:

P(θ_1, · · · , θ_k | G_1, · · · , G_k) = Π_{i=1}^{N} P(θ_{1i}, · · · , θ_{ki} | M_{1i}, · · · , M_{ki})    (2)
Note that the parameters of the conditional probabilities of individual random variables are independent, but the parameters per variable are not independent across conditions. To enforce dependency among the θ_{ci}, we make M_{ci} depend on all the neighbors of X_i in condition c and in all sets of conditions that include c. To convey the intuition behind this idea, consider the two-condition case C = {A, B}. A variable X_j can be in X_i's MB in condition A either if it is connected to X_i only in condition A, or if it is connected to X_i in both conditions A and B. Let M*_{Ai} be the set of variables that are connected to X_i only in condition A but not in both A and B. Similarly, let M*_{{A,B}i} denote the set of variables that are connected to X_i in both conditions A and B. Hence, M_{Ai} = M*_{Ai} ∪ M*_{{A,B}i}. More generally, for any c ∈ C, M_{ci} = ∪_{E ∈ powerset(C) : c ∈ E} M*_{Ei}, where M*_{Ei} denotes the neighbors of X_i only in the condition set E. To incorporate this dependency in the structure score, we need to define P(X_i | M_{ci}) such that it takes into account all subsets E with c ∈ E. We assume that the MBs M*_{Ei} independently influence X_i. This allows us to write P(X_i | M_{ci}) as a product: P(X_i | M_{ci}) ∝ Π_{E ∈ powerset(C) : c ∈ E} P(X_i | M*_{Ei}). To learn the k graphs, we exhaustively enumerate over condition sets E, and estimate parameters θ_{Ei} by pooling the data for all non-singleton E. Our structure learning algorithm maintains a conditional distribution for every variable X_i and every set E ∈ powerset(C). We consider the addition of an edge {X_i, X_j} in every set E. This addition affects the conditionals of X_i and X_j in all conditions e ∈ E. Because the MBs per condition set independently influence the conditional, the pseudo-likelihood PLLV(X_i, M_{ei}, e) decomposes as Σ_{E : e ∈ E} PLLV(X_i, M*_{Ei}, e) (Supplementary information). The net score improvement of adding an edge {X_i, X_j} to a condition set E is given by:
∆Score_{{X_i,X_j},E} = Σ_{e ∈ E} Σ_{d=1}^{|D_e|} [ PLLV(X_i, M_{ei} ∪ {X_j}, e) − PLLV(X_i, M_{ei}, e) + PLLV(X_j, M_{ej} ∪ {X_i}, e) − PLLV(X_j, M_{ej}, e) ]    (3)
Because of the decomposability of PLLV(X_i, M_{ei}, e), all terms other than those involving the Markov blanket variables in condition set E remain unchanged, producing the score improvement: ∆Score_{{X_i,X_j},E} = PLLV(X_i, M*_{Ei} ∪ {X_j}) − PLLV(X_i, M*_{Ei})
This score decomposability allows us to efficiently learn networks over condition sets. Our structure learning algorithm is described in more detail in Supplementary material.
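The decomposable score suggests a greedy search of the following shape. This is a hypothetical sketch, not the authors' algorithm (which is described in the Supplementary material): learn_networks, delta_score and mb_star are our own names, with mb_star holding the subset-specific neighborhoods M*_Ei keyed by (condition set, gene):

```python
from itertools import combinations

def condition_sets(conditions):
    """All non-empty subsets E of the condition set C."""
    return [frozenset(s) for r in range(1, len(conditions) + 1)
            for s in combinations(conditions, r)]

def markov_blanket(mb_star, conditions, c, i):
    """M_ci = union of the subset-specific neighborhoods M*_Ei over all
    condition sets E that contain condition c."""
    return set().union(*(mb_star.get((E, i), set())
                         for E in condition_sets(conditions) if c in E))

def learn_networks(conditions, genes, delta_score, min_gain=0.0):
    """Greedy edge search over condition sets.  delta_score(edge, E, mb_star)
    must return the pooled score gain of adding `edge` in every e in E
    (e.g. the ∆Score of Eq. 3 computed from the data)."""
    mb_star, all_E = {}, condition_sets(conditions)
    candidates = [frozenset(e) for e in combinations(genes, 2)]
    while candidates:
        gain, edge, E = max(((delta_score(e, E, mb_star), e, E)
                             for e in candidates for E in all_E),
                            key=lambda t: t[0])
        if gain <= min_gain:          # no edge improves the score: stop
            break
        i, j = tuple(edge)
        mb_star.setdefault((E, i), set()).add(j)
        mb_star.setdefault((E, j), set()).add(i)
        candidates.remove(edge)
    return mb_star
```

The per-condition Markov blanket M_ci of the text is then recovered by unioning mb_star over all condition sets containing c, as markov_blanket does.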
4.3 Simulated data description and analysis
We generated simulated datasets using two sets of networks of known structure, HIGHSIM and LOWSIM. All networks had the same number of nodes, n = 68, and were derived from the E. coli regulatory network [23]. We used the INDEP model to generate the eight simulated datasets. The parameters of the INDEP model were initialized using random partitions of an initial dataset generated from a differential-equation-based regulatory network simulator [19].
4.4 Microarray data description
Each microarray measures the expression of all yeast genes in response to a genetic deletion, from the quiescent (85 deletions) and non-quiescent (93 deletions) populations [2], with 69 deletions common to both populations. The arrays had biological replicates, producing 170 and 186 measurements per gene in the quiescent and non-quiescent populations, respectively. We filtered the microarray data to exclude genes with > 80% missing values, resulting in 3,012 genes. We constrained the network structures such that a gene could connect only to the 69 genes with deletion mutants, and no gene had more than 8 neighbors.
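The two structural constraints can be expressed as a simple filter on candidate edges. This is an illustrative sketch only; the function name and dictionary-based degree bookkeeping are our own:

```python
def candidate_edges(genes, deletion_genes, degree, max_degree=8):
    """Candidate edges allowed under the structural constraints: every edge
    must touch at least one deletion-mutant gene, and neither endpoint may
    already have max_degree neighbors."""
    allowed = []
    for i, g in enumerate(genes):
        for h in genes[i + 1:]:
            if ((g in deletion_genes or h in deletion_genes)
                    and degree.get(g, 0) < max_degree
                    and degree.get(h, 0) < max_degree):
                allowed.append((g, h))
    return allowed
```

A structure learner would call such a filter at each step, updating `degree` as edges are added, so no inferred network violates the constraints.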
4.5 Validation of network edges using coverage of annotation categories
The coverage of an annotation category A is defined as the harmonic mean of a precision and a recall. Let L denote the complete list of genes used for network learning, and let L_A ⊆ L denote the genes annotated with A. Let l_A denote the number of edges in our learned network between two genes g_i and g_j such that g_i ∈ L_A and g_j ∈ L_A. Let t_A be the total number of edges connected to genes in L_A (note t_A ≥ l_A). Let s_A denote the total number of edges that could exist among the genes in L_A, which is |L_A|(|L_A| − 1)/2 if |L_A| < 8 and |L_A| · 8 otherwise (no gene may have more than 8 neighbors). Precision for category A is defined as p_A = l_A / t_A and recall as r_A = l_A / s_A. These are used to define the coverage of category A, cov_A = 2 p_A r_A / (p_A + r_A). We compute this coverage score for all categories using each inferred network, and compare the score against the expected coverage from random networks with the same degree distribution. To compare NIPD against INDEP, suppose we are comparing the inferred quiescent networks. Let A_INDEP and A_NIPD denote the categories with better-than-random coverage in the INDEP and NIPD quiescent networks, respectively. To determine how much better INDEP is than NIPD, we count the categories in A_INDEP ∪ A_NIPD on which INDEP has better coverage than NIPD. We similarly assess how much better NIPD is than INDEP. We repeat this procedure for the non-quiescent networks. We also compared the semantic similarity of edges in inferred and random networks [16] (Supplementary material).
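The coverage computation follows directly from these definitions. A minimal sketch of our own (variable names follow the text):

```python
def coverage(edges, annotated, max_degree=8):
    """Coverage of an annotation category: harmonic mean of precision and
    recall.  edges: iterable of (gene, gene) pairs from the learned network;
    annotated: the set of genes L_A carrying the annotation."""
    n = len(annotated)
    l_a = sum(1 for u, v in edges if u in annotated and v in annotated)
    t_a = sum(1 for u, v in edges if u in annotated or v in annotated)
    # edges that could exist among annotated genes, under the degree cap
    s_a = n * (n - 1) // 2 if n < max_degree else n * max_degree
    if l_a == 0 or t_a == 0 or s_a == 0:
        return 0.0
    p, r = l_a / t_a, l_a / s_a
    return 2 * p * r / (p + r)
```

The same function would be applied to degree-matched random networks to obtain the expected coverage baseline.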
4.6 Evaluation of gene deletion combinations
We identified combinations of genes with deletion mutants from Markov blankets comprising more than one of these deletion genes. We evaluated each algorithm's ability to capture gene deletion combinations by comparing against the number of such combinations in random networks with the same number of edges. This random network model provided a rough significance assessment of the number of inferred knock-out combinations (Supplementary Table 3). We then performed a more stringent analysis based on semantic similarity, using the sub-network spanning only the genes with deletion combinations. We generated random networks with the same degree distributions as this sub-network and computed the semantic similarity of each gene with the set of deletion genes connected to it, in the inferred and random networks. We then selected genes with significantly higher semantic similarity than in random networks (z-test, p-value