IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS, VOL. 9, NO. 2, MARCH/APRIL 2012

GSGS: A Computational Approach to Reconstruct Signaling Pathway Structures from Gene Sets

Lipi R. Acharya, Thair Judeh, Zhansheng Duan, Michael G. Rabbat, and Dongxiao Zhu

Abstract—Reconstruction of signaling pathway structures is essential to decipher complex regulatory relationships in living cells. The existing computational approaches often rely on unrealistic biological assumptions and do not explicitly consider signal transduction mechanisms. Signal transduction events refer to linear cascades of reactions from the cell surface to the nucleus and characterize a signaling pathway. In this paper, we propose a novel approach, Gene Set Gibbs Sampling (GSGS), to reverse engineer signaling pathway structures from gene sets related to the pathways. We hypothesize that signaling pathways are structurally an ensemble of overlapping linear signal transduction events which we encode as Information Flows (IFs). We infer signaling pathway structures from gene sets, referred to as Information Flow Gene Sets (IFGSs), corresponding to these events. Thus, an IFGS only reflects which genes appear in the underlying IF but not their ordering. GSGS offers a Gibbs sampling like procedure to reconstruct the underlying signaling pathway structure by sequentially inferring IFs from the overlapping IFGSs related to the pathway. In the proof-of-concept studies, our approach is shown to outperform the existing state-of-the-art network inference approaches using both continuous and discrete data generated from benchmark networks in the DREAM initiative. We perform a comprehensive sensitivity analysis to assess the robustness of our approach. Finally, we implement GSGS to reconstruct signaling mechanisms in breast cancer cells.

Index Terms—Gene sets, Gibbs sampling, signaling pathways, signal transduction.

L.R. Acharya and Z. Duan are with the Department of Computer Science, University of New Orleans, 2000 Lakeshore Drive, New Orleans, LA 70148. E-mail: {lacharya, zduan}@uno.edu.
T. Judeh and D. Zhu are with the Department of Computer Science, Wayne State University, 5057 Woodward, Detroit, MI 48202. E-mail: {tjudeh, dzhu}@wayne.edu.
M.G. Rabbat is with the Department of Electrical and Computer Engineering, McGill University, McConnell Engineering, Room 639, 3480 University Street, Montréal, Québec H3A 2A7, Canada. E-mail: [email protected].

Manuscript received 29 Mar. 2011; revised 13 Sept. 2011; accepted 11 Oct. 2011; published online 17 Oct. 2011. For information on obtaining reprints of this article, please send e-mail to: [email protected], and reference IEEECS Log Number TCBB-2011-03-0079. Digital Object Identifier no. 10.1109/TCBB.2011.143. 1545-5963/12/$31.00 © 2012 IEEE. Published by the IEEE CS, CI, and EMB Societies & the ACM.

1 INTRODUCTION

A central goal of computational systems biology is to decipher signaling pathway structures in living cells. Characterization of complicated interaction patterns in signaling pathways can provide insights into biomolecular interaction and regulation mechanisms. Consequently, there has been a large body of computational efforts for reconstructing signaling pathway structures using Probabilistic Boolean Networks (PBNs) [38], [39], Bayesian Networks (BNs) [7], [36], Relevance Networks (RNs) [2], [3], Graphical Gaussian Models (GGMs) [5], [13], [34], and other approaches [1], [8], [22], [43], [46]. Although the existing approaches are useful, they often represent a phenomenological graph of the observed data. For example, a parent set of each gene in BNs indicates statistically causal relationships. In addition, the accuracy of a learned BN is determined by the choice of the number of parents for each node, the metric used to score a structure, and other parameters set to alleviate the nontrivial computational burdens associated with BN inference. RNs, GGMs, and PBNs are computationally tractable even for large signaling pathways; however, the coexpression criteria used in RNs and GGMs only model a possible functional relevancy, whereas the use of Boolean functions in PBNs may lead to an oversimplification of the underlying gene regulatory mechanisms. Moreover, the aforementioned approaches do not explicitly consider signal transduction events characterizing a signaling pathway. Signal transduction events refer to linear cascades of reactions from the cell surface to the nucleus and form the basic building blocks of a signaling pathway. It is necessary to design computational approaches for the structural inference of signaling pathways by incorporating signal transduction mechanisms.

On the other hand, gene set-based analysis has received much attention in recent years [28], [29], [33], [41]. A gene set usually refers to a gene signature, which is a set of genes with a combined pattern of expression downstream of transcription factors and is often linked to a given biological state of interest. In the present scenario, however, we define a gene set as a set of molecules (usually proteins) in a signaling pathway upstream of transcription factors which participate in a signal transduction event in the pathway. Since activation of signaling pathways affects gene expression via transcription factors, it is necessary to understand signaling mechanisms upstream of transcription factors. It is also important to note that gene sets related to a signaling pathway indicate the existence of an underlying structure, whereas a gene signature may only correspond to a set of functionally relevant genes without suggesting the presence of a structure.

While the identification of signaling pathway components is a relatively well-addressed problem [45], challenges remain in inferring signal transduction mechanisms underlying these pathways. In the present work, specifically, we focus on the problem of inferring the structure underlying a given signaling pathway component. To achieve our goal, we utilize a compendium of gene sets related to the given pathway. A gene set can be interpreted as a discrete set of genes expressed in an experiment, whereas a gene set compendium comprises many overlapping gene sets corresponding to different experiments. Overlapping, which arises from simultaneous participation of genes in many signal transduction events, reflects the interconnectedness among gene sets. We aim to exploit the overlapping among gene sets in order to uncover the underlying signaling mechanisms.

There are several further motivations for considering a gene set-based approach for the structural inference of signaling pathways. For instance, a gene set-based approach can more naturally incorporate higher order signaling mechanisms as opposed to pairwise interactions. In comparison to continuous molecular profiling data, gene sets are more robust to noise and facilitate data integration from multiple data acquisition platforms. The advantages of pathway-based approaches in bioinformatics analyses have been adequately demonstrated [42], [44]. Such approaches have also been used to dissect drug mechanism of action and to find transcriptional connections among genes, drugs, and diseases [11], [15]. However, the structural inference of signaling pathways by sufficiently exploiting gene sets, a promising area of bioinformatics research, remains underdeveloped.

With few exceptions in the field of communication networks, the existing network inference approaches do not explicitly accommodate signal transduction events.
The frequency method in [31] assumes a tree structure in the paths between pairs of nodes (genes). However, the method is liable to fail in the presence of multiple paths between the same pair of nodes. The cGraph algorithm presented in [14] adds weighted edges between each pair of nodes that appear in some gene set, so the networks inferred by this approach might contain a large number of false positives. The EM approach [32], [47] treats permutations of genes in a gene set as missing data and infers a network by assuming a linear arrangement of genes along with prior knowledge of the two end nodes. It is also difficult to incorporate prior knowledge about regulator-target pairs in the approaches mentioned above. Therefore, it is necessary to develop a systems biology approach for a gene set-based inference of signaling pathway structures.

A central aspect of developing such network reconstruction approaches is to understand the structure of signaling pathways. We hypothesize a signaling pathway structure as an ensemble of several overlapping signal transduction events with a linear arrangement of genes in each event. We denote these events as Information Flows (IFs). An Information Flow Gene Set (IFGS) contains the genes of the given IF. IFs form the building blocks of a signaling pathway and uniquely determine its structure. The true signaling pathway structure can be reconstructed by inferring the order of genes in each IFGS and combining the inferred IFs into a single unit. We reemphasize that gene sets, such as gene signatures, considered in previous studies may not indicate the presence of an underlying structure. Each IFGS, on the other hand, comprises genes participating in a directed chain of signal transduction.

As there exist $L!$ different permutations of gene ordering for an IFGS with $L$ component genes, the number of signaling pathway structures consistent with a compendium of $m$ IFGSs is of the order of $(L!)^m$. On the one hand, not all network structures are equally likely; on the other hand, it might be computationally infeasible to find the most likely structure by exhaustive enumeration, even when the number and lengths of IFGSs are tightly controlled. In other words, if we treat the ordering of each IFGS as a random variable, which has a sampling space of size $L!$, it might not be practical to sample directly from the joint distribution of IFGSs with a sampling space of size $(L!)^m$. As a result, our goal of signaling pathway structure inference can be translated into drawing samples of signaling pathway structures sequentially from the joint distribution of IFGSs and summarizing the most likely structure from the sampled structures.

Our approach is similar to the Nested Effects Models (NEMs) [20] in that the two approaches utilize discrete measurements for inferring a directed structure by constructing it from smaller building blocks. However, a major difference between them lies in the fact that NEMs treat binary effect reporters as random variables, whereas the proposed approach considers the ordering of genes in IFGSs as random variables. In NEMs, submodels are built by independently scoring all pairs or triplets of genes. An edge in a submodel is defined in terms of a subset relation between phenotypic profiles of two genes.
The proposed approach, on the other hand, infers gene ordering in a gene set by utilizing the overlapping among all of the remaining gene sets, which more naturally captures higher order interaction mechanisms. Our approach further benefits from allowing building blocks of larger size and from explicitly accommodating the linear signal transduction mechanisms that characterize a signaling pathway structure.

We develop a stochastic algorithm, Gene Set Gibbs Sampler (GSGS), to reconstruct signaling pathway structures from IFGSs. GSGS treats the ordering of genes in each IFGS as a random variable, and sequentially samples signaling pathway structures from the joint distribution of IFGSs (Fig. 1). The novelty of GSGS lies in hypothesizing IFGSs as the basic building blocks of signaling pathways, in defining gene orderings as random variables to naturally accommodate higher order interactions, and in probabilistic network inference. We comprehensively examine the performance of GSGS by using both continuous and discrete data generated from gold standard networks in the Dialogue for Reverse Engineering Assessments and Methods (DREAM) initiative and compare it with the existing network inference approaches, with a primary emphasis on the Bayesian network approaches K2 [4], [25] and MCMC [24], [25]. We also perform sensitivity analysis to assess the robustness of GSGS to the undersampling and oversampling of gene sets. Finally, we use GSGS to reconstruct a signaling pathway structure in breast cancer cells.

Fig. 1. Flowchart of GSGS for inferring signaling pathway structures from gene sets. (Left Panel) Sketch of the equivalence between inputs accommodated by existing network inference approaches and GSGS. Overlapping gene sets are exploited by GSGS to reconstruct the underlying directed network. Another way to obtain gene sets is to discretize molecular profiling data using binary labels (present/absent). Genes expressed or present in each experiment correspond to a gene set. Similarly, a gene set can be interpreted as a set of genes expressed in an experiment. Thus, a gene set compendium is essentially a matrix of binary discrete values used by Bayesian network and Mutual Information (MI) network inference methods. (Right Panel) Using an IFGS compendium with 125 IFGSs, GSGS successively draws 1,000 sample signaling pathway structures of the true signaling pathway structure from the joint distribution of IFGSs. At a given stage t, 125 IFs are combined to represent a sampled signaling pathway structure.

2 METHODS

2.1 Notation and Terminology

An Information Flow is a directed linear path from one node (gene) to another node in a signaling pathway structure which does not allow self-transition or transition to a previously visited node. In other words, an IF represents a linear chain of reactions between two nodes via some intermediate nodes. An Information Flow Gene Set is the set of all genes present in an IF. Thus, an IFGS only reflects which genes participate in the underlying IF but not their ordering in the chain. The length of an IFGS is the number of genes present in it. Therefore, there are $L!$ putative IFs that are compatible with an IFGS of length $L$. We assume throughout that $L \ge 3$. An IF of length two simply represents an edge in a signaling pathway structure, which we use to serve as prior knowledge.

Given a compendium of $m$ overlapping IFGSs $X_1, X_2, \ldots, X_m$, we aim to reconstruct the underlying directed network topology. Our idea is to infer the IF corresponding to each $X_i$ and then combine the inferred IFs into a single unit. Assuming that the length of $X_i$ is $L_i$, we define a random variable $\Theta_i$ to represent the ordering of genes in $X_i$. Clearly, the sampling space of $\Theta_i$ is the set of $L_i!$ gene ordering permutations. We write $(X_i, \Theta_i)$ to associate an ordering with the IFGS $X_i$. The notation $X$ is used for a given IFGS compendium, and we write all IFGSs and their associated orderings together as $(X, \Theta)$, where $X = (X_1, \ldots, X_m)$ and $\Theta = (\Theta_1, \ldots, \Theta_m)$. The notations are suffixed with $-i$ to consider all but the $i$th component, e.g., $X_{-i}$, $(X, \Theta)_{-i}$, etc., for $i \in \{1, \ldots, m\}$. In the following sections, we will utilize $(X, \Theta)$ to construct vectors of size $n \times 1$ and matrices of size $n \times n$, where $n$ is the number of distinct genes among the $m$ IFGSs. Suffixing such vectors or matrices with $-i$ means that they have been constructed without involving the $i$th IFGS.

As the sampling space of $\Theta_i$ is of size $L_i!$, it follows that the sampling space of the joint distribution $P(X, \Theta)$ is the set of $\prod_{i=1}^{m} L_i!$ permutations. A sampling space of size $\prod_{i=1}^{m} L_i!$ can be computationally intractable even for moderate values of $L_i$ and $m$. As a result, our goal of signaling pathway structure inference can be translated into drawing samples of signaling pathway structures sequentially from the joint distribution $P(X, \Theta)$ of IFGSs and summarizing the most likely structure from the sampled pathway structures. Indeed, we develop a Gibbs sampling-like algorithm to sequentially sample an ordering for each IFGS by conditioning on the remainder of the network structure, with a much reduced sampling space of size $L_i!$. We refer to Table 1 for a comprehensive list of mathematical notations.
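To make the size argument concrete, the following minimal Python sketch compares the size of the joint sampling space with the number of orderings that a single sequential sweep would need to evaluate. The function names and the example lengths are illustrative assumptions and do not come from the GSGS implementation.

```python
from math import factorial, prod


def joint_space_size(lengths):
    """Number of joint orderings: the product of L_i! over all IFGSs."""
    return prod(factorial(L) for L in lengths)


def per_sweep_orderings(lengths):
    """Orderings evaluated when each IFGS is handled in turn: the sum of L_i!."""
    return sum(factorial(L) for L in lengths)


# Hypothetical compendium of three IFGSs with lengths 3, 4, and 5.
lengths = [3, 4, 5]
print(joint_space_size(lengths))     # 3! * 4! * 5! = 17280
print(per_sweep_orderings(lengths))  # 3! + 4! + 5! = 150
```

The product grows multiplicatively with every added IFGS, whereas the per-sweep count grows only additively, which is the reduction that sequential conditioning exploits.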

2.2 Joint Distribution and Conditional Distribution of Gene Sets

Although the sampling space for the multivariate distribution $P(X, \Theta)$, of size $\prod_{i=1}^{m} L_i!$, might be computationally intractable, it is possible to theoretically describe this distribution under certain assumptions. We consider IFGSs as random samples from a first order Markov chain model, where the state of a node is only dependent on the state of its previous node. From a given set of $m$ IFs (ordered paths), the two model parameters, initial probability vector $\pi$ and transition probability matrix $\Pi$, are estimated by treating each IF as a Markov chain. If there are $n$ distinct genes across $m$ IFs, we define

$$\pi = \left(\frac{c_1}{m}, \ldots, \frac{c_n}{m}\right), \tag{1}$$

where $c_l$ is the total number of times the $l$th gene appears as the first node among $m$ IFs, for each $l = 1, \ldots, n$. If $c_{rs}$ is the total number of times the $r$th gene transits to the $s$th gene (i.e., there is an edge from $r$ to $s$) among $m$ IFs, then

$$\Pi = [p_{rs}]_{n \times n}, \tag{2}$$

where $p_{rs} = c_{rs} / \sum_{s=1}^{n} c_{rs}$, $r, s = 1, \ldots, n$. Thus, $\Pi$ captures the overlapping signaling mechanisms among IFs.

The parameters $\pi$ and $\Pi$ can be estimated individually for each of the $\prod_{i=1}^{m} L_i!$ collections of IFs. Each collection is an instantiation of all possible collections and represents a candidate signaling pathway structure. The parameters $\pi$ and $\Pi$ estimated for a collection can be used to calculate its likelihood. The likelihood of a collection of IFs is the product of the likelihoods of the $m$ individual IFs in it. The likelihood of each IF can be computed by treating it as a first order Markov chain and using the parameters $\pi$ and $\Pi$. For example, we compute the likelihood of the IF $z \to y \to x$ as

$$P(z \to y \to x) = P(z) \times P(y \mid z) \times P(x \mid y). \tag{3}$$

The likelihood values calculated for all $\prod_{i=1}^{m} L_i!$ collections of IFs can be normalized to denote the joint distribution of IFGSs. However, exhaustive computation of $\prod_{i=1}^{m} L_i!$ likelihood values to choose the most likely structure might be computationally infeasible, which serves as motivation for the proposed GSGS approach.

TABLE 1. List of Mathematical Notations (Left Column) and Their Descriptions (Right Column)

The computational tractability of GSGS lies in sequentially sampling an order for each IFGS $X_i$ by conditioning on the orders of the remaining IFGSs, with a much reduced sample space of size $L_i!$ as compared to $\prod_{i=1}^{m} L_i!$. In GSGS, we begin by assigning randomly selected orders to each IFGS. We update the orderings by sampling an order for each IFGS conditioned on the known orders of the remaining $m - 1$ IFGSs. To sample an order for $X_i$ from the conditional distribution, we leave $X_i$ out. From the remaining $m - 1$ IFs, we then compute the initial probability vector $\pi_{-i}$ and transition probability matrix $\Pi_{-i}$ by following the procedure described in (1) and (2). Next, we calculate the likelihoods of all possible orders $\theta_i^j$, $j = 1, \ldots, L_i!$, for $X_i$ by conditioning on the orders of the remaining $m - 1$ IFGSs. The normalized conditional likelihood for the $j$th order for $X_i$ is given by

$$L_i^j = \begin{cases} \dfrac{P_i^j}{\sum_{k=1}^{L_i!} P_i^k}, & \text{if } \sum_{k=1}^{L_i!} P_i^k \neq 0, \\[2ex] \dfrac{1}{L_i!}, & \text{otherwise,} \end{cases} \tag{4}$$

where

$$P_i^j = P\big((X_i, \Theta_i = \theta_i^j) \mid (X, \Theta)_{-i}\big). \tag{5}$$

In the above equation, $P_i^j$ represents the conditional likelihood of the $j$th order and is computed by decomposing it into the product of conditional probability terms. For example, we compute the conditional likelihood of $z \to y \to x$ corresponding to the IFGS $X_i = \{x, y, z\}$ as

$$P\big((X_i, \Theta_i = z \to y \to x) \mid (X, \Theta)_{-i}\big) = P(z) \times P(y \mid z) \times P(x \mid y), \tag{6}$$

where each term on the right of (6) is conditioned on $(X, \Theta)_{-i}$ and is available from $\pi_{-i}$ and $\Pi_{-i}$. The $L_i^j$ values, for $j = 1, \ldots, L_i!$, can now be used to sample an order for $X_i$ from the conditional distribution using the inverse Cumulative Distribution Function (CDF) [9]. The CDF of the conditional distribution $P\big((X_i, \Theta_i) \mid (X, \Theta)_{-i}\big)$ is defined as

$$F\big((X_i, \Theta_i = \theta_i^j) \mid (X, \Theta)_{-i}\big) = \sum_{k=1}^{j} L_i^k, \tag{7}$$

for each $j = 1, \ldots, L_i!$. By sampling a number $u \sim U(0, 1)$ and letting $F^{-1}(u) = v$, we get a randomly drawn order $v$ for $X_i$ from the conditional distribution (7).
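The conditional sampling step for a single IFGS, Eqs. (1) through (7), can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, genes are represented as strings, and the parameters are stored sparsely in dictionaries so that unseen starts and transitions have probability zero.

```python
import random
from collections import Counter
from itertools import permutations


def estimate_params(flows):
    """Estimate the initial vector (Eq. 1) and transition matrix (Eq. 2) from ordered IFs."""
    m = len(flows)
    first = Counter(flow[0] for flow in flows)
    trans = Counter()
    for flow in flows:
        trans.update(zip(flow, flow[1:]))
    out = Counter()  # row sums: total transitions leaving each gene
    for (r, _s), c in trans.items():
        out[r] += c
    pi = {g: c / m for g, c in first.items()}
    Pi = {(r, s): c / out[r] for (r, s), c in trans.items()}
    return pi, Pi


def flow_likelihood(flow, pi, Pi):
    """First order Markov chain likelihood of one ordered IF (Eqs. 3 and 6)."""
    p = pi.get(flow[0], 0.0)
    for r, s in zip(flow, flow[1:]):
        p *= Pi.get((r, s), 0.0)
    return p


def sample_order(gene_set, other_flows, rng):
    """Draw one order for gene_set conditioned on the remaining IFs (Eqs. 4, 5, 7)."""
    pi, Pi = estimate_params(other_flows)
    orders = list(permutations(sorted(gene_set)))
    raw = [flow_likelihood(o, pi, Pi) for o in orders]
    total = sum(raw)
    if total > 0:
        probs = [p / total for p in raw]
    else:  # every order has zero likelihood: fall back to the uniform case of Eq. (4)
        probs = [1.0 / len(orders)] * len(orders)
    u = rng.random()
    cum, chosen = 0.0, orders[-1]
    for order, p in zip(orders, probs):
        cum += p
        if p > 0:
            chosen = order  # last order with positive mass, used if rounding leaves cum < u
        if u < cum:
            return order, probs
    return chosen, probs


others = [("a", "b", "d"), ("b", "c", "e"), ("a", "b", "c", "e")]
order, probs = sample_order({"a", "b", "c"}, others, random.Random(0))
print(order)  # ('a', 'b', 'c'): the only order with nonzero conditional likelihood
```

In this toy input only the order a, b, c is supported by the remaining flows, so it receives all of the normalized mass. With larger compendia several orders typically share the mass and the draw is genuinely random.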

442

IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS,

2.3 Gene Set Gibbs Sampler

In Algorithm 1, we present the Gene Set Gibbs Sampling (GSGS) approach, which leads to the reconstruction of signaling pathway structures from IFGSs. If prior knowledge of some edges (IFs of length 2) is available, we add them to the IFGSs as directed pairs and keep the direction of genes in each of them fixed during the execution of GSGS. Algorithm 1 outputs a list of the most frequently occurring IFs among the sampled IFs (Step 14 in Algorithm 1). To reconstruct a signaling pathway structure, we start with an empty network of the distinct genes present in the input list and construct the most likely signaling pathway structure by joining the IFs present in the output of Algorithm 1.

Algorithm 1. Gene Set Gibbs Sampler
1: Input: $X = (X_1, \ldots, X_m)$, where the $X_i$, $i = 1, \ldots, m$, represent IFGSs; $E = (E_1, \ldots, E_u)$, where the $E_k$, $k = 1, \ldots, u$, represent prior known directed edges (optional); burn-in state $B$; and number of samples $N$ to be collected after the burn-in state
2: Output: $m$ information flows $(X_i, \hat{\theta}_i)$, $i = 1, \ldots, m$
3: At $t = 0$, randomly choose an order $\theta_i^{(0)}$ for $X_i$ from $L_i!$ permutations, $i = 1, \ldots, m$
4: for $t = 1, \ldots, B + N$ do
5:    $\Theta = (\theta_1^{(t-1)}, \ldots, \theta_m^{(t-1)})$
6:    for $i = 1, \ldots, m$ do
7:       Leave $X_i$ out
8:       Use the remaining IFs, including those present in $E$, to estimate the two Markov chain parameters
9:       Calculate the conditional likelihoods $L_i^j$ (4) of the $L_i!$ permutations by treating $X_i$ as a first order Markov chain
10:      Sample an order $\theta_i^{(t)}$ for $X_i$ from the inverse cumulative distribution $F\big((X_i, \Theta_i) \mid E, (X, \Theta)_{-i}\big)$ (7)
11:      Update the order information for $X_i$
12:   end for
13: end for
14: Return $\hat{\theta}_i = \mathrm{mode}(\theta_i^{(B+1)}, \ldots, \theta_i^{(B+N)})$, $i = 1, \ldots, m$.
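The final summarization, taking the mode in Step 14 and then joining the resulting IFs into one network, can be sketched as follows. The data layout (a list of post-burn-in samples, each holding one ordered tuple per IFGS) and the function names are assumptions made for illustration only.

```python
from collections import Counter


def summarize_flows(samples):
    """Return, for each IFGS, the order sampled most often after burn-in (Step 14).

    samples[t][i] is the ordered tuple of genes drawn for IFGS i at iteration t.
    """
    m = len(samples[0])
    modes = []
    for i in range(m):
        counts = Counter(sample[i] for sample in samples)
        modes.append(counts.most_common(1)[0][0])
    return modes


def join_flows(flows, prior_edges=()):
    """Join ordered IFs into a directed edge set, starting from optional prior edges."""
    edges = set(prior_edges)
    for flow in flows:
        edges.update(zip(flow, flow[1:]))  # consecutive genes form directed edges
    return edges


# Three hypothetical post-burn-in samples for a compendium of two IFGSs.
samples = [
    [("a", "b", "c"), ("b", "c", "d")],
    [("a", "b", "c"), ("c", "b", "d")],
    [("b", "a", "c"), ("b", "c", "d")],
]
flows = summarize_flows(samples)
print(sorted(join_flows(flows)))  # [('a', 'b'), ('b', 'c'), ('c', 'd')]
```

Overlapping IFs contribute shared edges only once, which is how the overlap among IFGSs yields a single connected structure rather than a collection of separate chains.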

2.4 Burn-In State

A burn-in state in Algorithm 1 refers to a stage after which we start collecting samples of signaling pathway structures. Samples collected after the burn-in state are assumed to be drawn from the joint distribution of IFGSs. To determine an appropriate burn-in state, we adapted the approach presented in [9] and [10] to our framework and computed the ratio

$$R = \frac{\frac{N-1}{N} W_v + \frac{1}{N} B_v}{W_v}, \tag{8}$$

for each of the three quantities of interest: Sensitivity, Specificity, and PPV. Here, $N$ is the total number of structures sampled after the burn-in state, $W_v$ is the averaged within-chain variance (within a single run of GSGS), and $B_v$ is the between-chain variance (between multiple runs of GSGS). Sensitivity = TP/(TP + FN), Specificity = TN/(TN + FP), and PPV = TP/(TP + FP), where TP = number of true positives, TN = number of true negatives, FP = number of false positives, and FN = number of false negatives. If we fix the burn-in state as $B$ in a total of $J (\ge 2)$ independent runs of GSGS, then for a parameter of interest $X$,

$$W_v = \frac{1}{J} \sum_{j=1}^{J} s_j^2, \quad \text{and} \quad B_v = \frac{N}{J-1} \sum_{j=1}^{J} (\bar{x}_j - \bar{x})^2, \tag{9}$$

with

$$\bar{x} = \frac{1}{J} \sum_{j=1}^{J} \bar{x}_j, \quad \bar{x}_j = \frac{1}{N} \sum_{t=B+1}^{B+N} x_j^{(t)}, \quad s_j^2 = \frac{1}{N-1} \sum_{t=B+1}^{B+N} \big(x_j^{(t)} - \bar{x}_j\big)^2, \quad j = 1, \ldots, J. \tag{10}$$

If all chains are stationary, then the numerator and denominator in (8) both estimate the variance of $X$. Clearly, $\sqrt{R} \to 1$ as $N \to \infty$. In practice, the choice of $B$ and $N$ is acceptable if $\sqrt{R} < 1.2$. Otherwise, either $B$ or $N$ or both should be increased (see [9], [10] for more details). We treated sensitivity, specificity, and PPV as three parameters to determine a burn-in state (see Section 3.1 and the Appendix, which can be found on the Computer Society Digital Library at http://doi.ieeecomputersociety.org/10.1109/TCBB.2011.143).
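The convergence check in Eqs. (8) through (10) can be sketched in Python as follows, assuming that each chain is given as a list of the $N$ post-burn-in values of one monitored quantity (for example, the sensitivity of each sampled structure). The function name and the toy chains are illustrative assumptions.

```python
from math import sqrt
from statistics import mean, variance


def sqrt_r(chains):
    """Return sqrt(R) from Eq. (8) for J >= 2 chains of equal length N."""
    J = len(chains)
    N = len(chains[0])
    chain_means = [mean(c) for c in chains]          # x-bar_j in Eq. (10)
    grand_mean = mean(chain_means)                   # x-bar in Eq. (10)
    W = mean([variance(c) for c in chains])          # within-chain variance, Eq. (9)
    B = N / (J - 1) * sum((xm - grand_mean) ** 2 for xm in chain_means)  # Eq. (9)
    R = ((N - 1) / N * W + B / N) / W                # Eq. (8); undefined if W == 0
    return sqrt(R)


# Two toy chains whose means differ clearly, so the check should fail.
print(sqrt_r([[1, 2, 3], [3, 4, 5]]) < 1.2)  # False
```

Here `statistics.variance` is the sample variance with the $N - 1$ denominator, matching $s_j^2$ in (10). A value below 1.2 would indicate that the chosen $B$ and $N$ are acceptable under the criterion stated above.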

2.5 Computational Complexity

The worst case time complexity of GSGS is of the order of $Nm(m + n + ML)$, where $N$ is the number of sampled structures, $m$ is the number of IFGSs, $n$ is the number of distinct genes, $L$ is the length of the longest gene set in the input, and $M = L!$. As longer gene sets $(L \ge 10)$ are less likely to correspond to linear information flows, the complexity arising from $ML$ can be managed by appropriately selecting the length of gene sets in each experiment. It is worth mentioning here that GSGS benefits from a much reduced computational load, both in terms of speed and memory requirements, in comparison to Bayesian network approaches, e.g., BN inference using the sampling-based Metropolis-Hastings approach. Indeed, the complexity of GSGS is driven by the number of possible orderings for IFGSs, which is much smaller than the number of neighbors of a network generated at each stage of the Metropolis-Hastings approach. The complexity of the Metropolis-Hastings approach is often unmanageable, even for a network of small size, due to the large number of neighboring networks of a sampled network.

3 DATA ANALYSIS

3.1 Data

We conducted three case studies to evaluate the performance of GSGS. For the first study, we obtained two gold standard directed networks: the In silico network [21], [40] from the DREAM2 and the E. coli network [17], [18], [30] from the DREAM3 network challenges in the DREAM initiative. The E. coli and In silico networks each comprise 50 nodes, with 62 and 37 true edges, respectively. From the E. coli and In silico networks, two collections of IFGSs were derived by a direct application of Algorithm 2. Algorithm 2 finds IFGSs from a directed network by first finding all IFs (linear paths) in the network and then randomly permuting the order of genes in each IF. There were a total of 125 and 57 IFGSs of length $\ge 3$ for the E. coli and In silico networks, respectively, which served as input for GSGS. A given percentage of true edges was used to serve as prior knowledge. This corresponds to a proof-of-principle study necessary to validate our underlying assumption. Through this study, we evaluate the performance of GSGS and other existing approaches when IFGSs are sampled from the true signaling pathway structure.

Algorithm 2. Network2GeneSets
1: Input: A directed acyclic graph with $n$ nodes
2: Output: All IFGSs
3: for $i = 1, \ldots, n$ do
4:    if node $i$ has no children then
5:       continue
6:    else
7:       add to Queue Q and the Linked List L all the directed pairs consisting of $i$ and a child of $i$
8:       while Q is not empty do
9:          Pop an information flow P from Q
10:         if the last node in P, say $k$, has no children then
11:            continue
12:         end if
13:         add to Q and L all information flows obtained by appending each child of $k$ to P
14:      end while
15:   end if
16: end for
17: Prune information flows in L of length 2 (prior knowledge)
18: Randomly permute orders of information flows in L and order of genes in each information flow
19: Return all IFGSs of length $\ge 3$.

Note that a gene set compendium can be written as a binary discrete dataset and vice versa (Fig. 1). A gene set represents a set of genes expressed in an experiment, and so it naturally corresponds to a vector (sample) of binary values obtained by considering the presence (1) or absence (0) of genes in the set. Similarly, the genes expressed in a sample of experimental measurements discretized into binary levels correspond to a gene set. Thus, a gene set and a binary discrete sample represent the same underlying data in two different forms.
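The path enumeration of Algorithm 2 and the gene set to binary vector correspondence described above can be sketched in Python as follows. The adjacency-dictionary input format, the function names, and the toy graph are assumptions made for illustration; the input is assumed to be acyclic, as required by Algorithm 2.

```python
import random
from collections import deque


def network_to_gene_sets(children, min_len=3, seed=0):
    """Enumerate all linear paths of a DAG and return their gene sets in shuffled order.

    children maps each node to a list of its children (nodes without children may be omitted).
    """
    rng = random.Random(seed)
    flows = []
    for node, kids in children.items():
        queue = deque((node, kid) for kid in kids)    # directed pairs starting at node
        while queue:
            path = queue.popleft()
            flows.append(path)
            for kid in children.get(path[-1], ()):    # extend by each child of the last node
                queue.append(path + (kid,))
    flows = [f for f in flows if len(f) >= min_len]   # paths of length 2 are prior edges
    rng.shuffle(flows)
    gene_sets = []
    for flow in flows:
        genes = list(flow)
        rng.shuffle(genes)                            # discard the ordering information
        gene_sets.append(genes)
    return gene_sets


def to_binary_matrix(gene_sets, genes):
    """One row per gene set, one column per gene: 1 if present, 0 if absent."""
    return [[1 if g in gs else 0 for g in genes] for gs in gene_sets]


dag = {"a": ["b", "c"], "b": ["d"], "c": ["d"]}
sets = network_to_gene_sets(dag)
print(sorted(sorted(s) for s in sets))  # [['a', 'b', 'd'], ['a', 'c', 'd']]
```

The binary rows produced by `to_binary_matrix` are the form in which the same compendium would be passed to methods that expect discrete samples rather than gene sets.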
Keeping this in mind, our approach can be compared with the existing network inference approaches accommodating discrete measurements, e.g., inference of directed topologies using Bayesian networks [4], [7], [24] and Mutual Information (MI)-based inference of undirected gene regulatory networks using ARACNE [19], CLR [6], MRNET [22], and RNs [2], [3].

In our second case study, we evaluated the performance of GSGS using four benchmark E. coli data sets available from the DREAM3 network challenges in the DREAM initiative [17], [18], [30]. Each of these data sets contains the steady state levels for the wild type and the heterozygous knockdown strains for each gene. The first two data sets comprise 50 genes and 51 experiments, whereas the remaining two data sets contain 100 genes and 101 experiments. The corresponding gold standard networks comprise 62, 82, 125, and 119 edges, respectively. We derived four IFGS compendiums from the E. coli data sets by declaring the top 10 percent of the measurements in each data set as 1 and the remaining measurements as 0. This discretization resulted in a diverse range of IFGS lengths across different samples. In each compendium, we considered IFGSs with lengths in the range 3-9. The resulting compendiums comprised 47, 45, 45, and 49 IFGSs, respectively.

In the final case study, we analyzed the performance of GSGS by reconstructing a breast cancer signaling pathway from genes present in the ERBB signaling pathway in the KEGG database [12]. However, no prior knowledge about the structure of the ERBB signaling pathway available in KEGG was assumed. The ERBB signaling pathway is a directed network of 87 genes and plays an important role in breast cancer signaling [27]. For example, dysregulation/mutation in EGFR and ERBB2 promotes angiogenesis and metastasis in breast cancer [16], [26]. We collected 299 samples of breast cancer patients from the Affymetrix HG-U133 plus 2.0 platform. We mapped all 87 genes participating in the ERBB signaling pathway to the annotation table for the Affymetrix HG-U133 plus 2.0 platform and, for each gene, considered the gene expression levels corresponding to exactly one probe set, namely the one with the highest average measurement among the 299 samples. This resulted in a data set with 87 rows (genes) and 299 columns (samples). IFGSs were derived by discretizing the measurements into binary levels using the equal-width method implemented in the R package infotheo. In a majority of samples (66%), the number of expressed genes was found to be in the range 3-7 with this discretization. To compromise between the time to reach an appropriate burn-in state and the overlapping among gene sets, we considered such samples to form a compendium of 197 IFGSs.
This study is particularly useful as signaling pathway structures in the databases may not represent a complete picture of the signaling mechanisms among genes present in the pathway. Also, the pathway structures in databases are often generic, whereas scientists are more interested in learning a context-specific network of genes. In this study, we inferred a breast cancer specific network of genes present in the ERBB signaling pathway.
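The thresholding used in the second case study (top 10 percent of measurements set to 1, followed by a length filter) can be sketched in Python as follows. This is an illustration under assumptions: the threshold is taken globally over the whole matrix, ties at the threshold are all kept, and the function name and toy data are hypothetical.

```python
def discretize_top_fraction(data, genes, fraction=0.10, min_len=3, max_len=9):
    """Turn an expression matrix into gene sets, one candidate set per experiment.

    data[g][e] is the measurement of gene g in experiment e (rows are genes).
    """
    values = sorted((v for row in data for v in row), reverse=True)
    k = max(1, int(len(values) * fraction))   # number of measurements declared as 1
    threshold = values[k - 1]                 # ties at the threshold are also declared as 1
    gene_sets = []
    for e in range(len(data[0])):
        present = {genes[g] for g in range(len(genes)) if data[g][e] >= threshold}
        if min_len <= len(present) <= max_len:  # keep only sets in the allowed length range
            gene_sets.append(present)
    return gene_sets


genes = ["g1", "g2", "g3", "g4"]
data = [
    [9, 1, 2, 1, 1],
    [8, 1, 7, 1, 1],
    [6, 1, 1, 1, 1],
    [1, 1, 5, 1, 1],
]
# With fraction=0.25 the five largest of the twenty values are declared present.
print(discretize_top_fraction(data, genes, fraction=0.25))
```

In the toy matrix the first experiment yields a set of three genes, which is kept, while the third experiment yields only two genes and is dropped by the length filter.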

3.2 Comparison with Other Approaches In our first study, we compared the performance of GSGS with a number of popular MI-based network inference approaches [2], [6], [19], [22] with a primary emphasis on two Bayesian network approaches, K2 [4] and MCMC (Metropolis-Hastings or MH) [24]. The main reasons are the following: 1) from methodology point of view our method infers the most probable linear structure(s) using likelihood scores calculated from the products of conditional probabilities. It is essentially in the same sprit as Bayesian network approaches, while fundamentally different from other approaches which are based on calculating pairwise similarity. 2) Both GSGS and Bayesian network approaches take discrete data and infer a directed network. The equivalence between gene sets and binary discrete data makes the comparison between GSGS and Bayesian network approaches very fair. 3) Most of the other network inference algorithms, e.g., ARACNE, CLR, MRNET, and RNs also discretize continuous data to estimate pairwise


similarities; however, they are suitable for inferring undirected networks.

Both K2 and MH have been implemented in the Bayes Net Tool Box (BNT) [25]. In principle, the K2 approach [4] begins by specifying an ordering of the nodes involved in the underlying network. Thus, initially each node has no parent. The algorithm incrementally assigns a parent to a node whose addition increases the score of the resulting structure the most. For the $i$th node, parents are chosen from the set of nodes with index $1, \ldots, i-1$. On the other hand, the MH algorithm [24] starts with an initial directed acyclic network $G_0$ and selects a network $G_1$ uniformly from the neighborhood of $G_0$. The neighborhood of a network $G$ is the collection of all directed acyclic networks which differ from $G$ by the addition, deletion, or reversal of a single edge. The algorithm accepts or rejects the move from $G_0$ to $G_1$ by computing an acceptance ratio defined in terms of the marginal likelihood ratio $P(D \mid G_1)/P(D \mid G_0)$, where $D$ represents the given data. This procedure is iterated starting from the most recent network. A specified number of networks are collected after the burn-in state. For scoring a structure, BNT provides the Bayesian Information Criterion (BIC) [35] and the Bayesian score function [4], where the Bayesian score function is defined for discrete measurements.

To compare the performance of GSGS with existing approaches in our first study, inputs were generated as follows from the same underlying network, e.g., the E. coli network, as the sole input (Fig. 1): 1) We generate IFGSs by a direct application of Algorithm 2. The IFGSs serve as input for GSGS, whereas the equivalent binary discrete data is used as input for K2, MH, and MI-based approaches. 2) As BN and MI-based approaches also accommodate continuous measurements, we generate continuous data inputs for these approaches using BNT. The performances were evaluated in terms of the total number of predicted edges and the F-score ($F$), defined as $F = 2pr/(p + r)$.
Here, r and p stand for Sensitivity and PPV, respectively. An increased number of predicted edges indicates the presence of many false positives, whereas a small number of predicted edges corresponds to decreased sensitivity. The total number of predicted edges, together with the F-score, reveals an algorithm’s performance in predicting true and false positives in a detailed manner. In the second study, we used the IFGS compendiums obtained by discretizing E. coli data sets (see Section 3.1) as inputs for GSGS. We tested the performance of Bayesian network and MI-based methods using E. coli data in both continuous and binary equivalent form.
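As a concrete illustration of the evaluation measures above, the following minimal sketch computes Sensitivity (r), PPV (p), and the F-score for a predicted directed edge list against a ground-truth edge list. The function name `edge_scores` and the toy edges are hypothetical and chosen only for illustration.

```python
def edge_scores(predicted, truth):
    """Return (sensitivity r, PPV p, F-score) for directed edge sets.

    Edges are (parent, child) tuples; an edge predicted in the wrong
    direction counts as a false positive.
    """
    predicted, truth = set(predicted), set(truth)
    tp = len(predicted & truth)                      # true positives
    p = tp / len(predicted) if predicted else 0.0    # PPV
    r = tp / len(truth) if truth else 0.0            # Sensitivity
    f = 2 * p * r / (p + r) if (p + r) > 0 else 0.0  # F = 2pr / (p + r)
    return r, p, f


# Toy example (hypothetical edges): two of four predictions are correct,
# and two of three true edges are recovered.
truth = {("A", "B"), ("B", "C"), ("C", "D")}
predicted = {("A", "B"), ("B", "C"), ("D", "C"), ("A", "D")}
r, p, f = edge_scores(predicted, truth)
```

Reporting the number of predicted edges next to F, as done in this section, helps separate a low F-score caused by many false positives from one caused by low sensitivity.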

3.3 Performance Evaluation Using E. coli and In silico Networks

We now analyze the performance of GSGS using E. coli and In silico networks. Using GSGS, we collected a total of 500 networks after a burn-in state, which we fixed at 500. All results were averaged over 100 independent runs of GSGS. With the chosen set of parameters, $\sqrt{R}$ in (8) was found to be approximately equal to one for each of the three quantities Sensitivity, Specificity, and PPV. A detailed list of settings is presented in the Appendix, available in the online supplementary material. It is worth mentioning here that P(Xi; i) may not always be unimodal. A reason which might lead to such


Fig. 2. Sensitivity analysis for the GSGS approach with increasing percentage of prior knowledge. Network: E. coli (Upper Panel) and In silico (Lower Panel). In blocks (a)-(f), x-axis represents the percentage of gene sets present in the input and y-axis plots the total number of edges predicted by GSGS (Solid Line). The dashed line plots correspond to the ground truth. We have considered only those genes which were present among IFGSs after pruning all gene pairs.

a situation is very poor overlap between Xi and other IFGSs in the compendium. As the discovery of IFGSs depends on the quality of molecular profiling data, it is necessary to test the robustness of GSGS by accommodating real-world undersampling and oversampling scenarios. Therefore, we first performed a sensitivity analysis by varying the amount of overlap among IFGSs. The multimodality problem is further addressed by incorporating an increasing percentage of prior knowledge and testing whether the algorithm approaches the unique true structure. Fig. 2 demonstrates the effect of removing and adding IFGSs to the input of Algorithm 1. In Fig. 2, the x-axis represents the percentage of gene sets present in the input, where 20 percent means that 80 percent of the gene sets were randomly removed from the original list of IFGSs, and 120 percent means that 20 percent of randomly sampled gene sets were added to the list. The figure presents the performance of our approach in terms of the total number of predicted edges. In blocks (a)-(f), the number of edges identified by GSGS (solid line) remains close to the ground truth (dashed line). We also observe the positive effect of incorporating prior knowledge. As the percentage of prior knowledge increases (block (a) to block (f)), the difference between the ground truth and the prediction decreases. In particular, our approach does not produce a large number of false positives in the presence of redundant gene sets. To further validate this statement, in Table 2, we present the


TABLE 2 F-Scores Calculated for the GSGS Approach with Increasing Percentage of Gene Sets in the Input (Row) and Prior Knowledge (Column)

Networks: E. coli (Left Panel) and In silico (Right Panel). We observe a clear increasing trend in the F-scores within each row, indicating the positive impact of incorporating prior knowledge, while a clear trend of similarity is observed within each column, indicating a marked robustness of the performance of GSGS to the oversampling and undersampling of gene sets.

F-scores for the GSGS approach with increasing percentage of gene sets (rows) and prior knowledge (columns). We observe that the F-scores increase with an increase in the percentage of prior knowledge (values in a row), and these scores remain close on removal or addition of gene sets (values in a column), demonstrating an impressive robustness to undersampling and oversampling. This observation strongly supports the applicability of GSGS in real-world scenarios, where we often do not observe all gene sets or the observed gene sets are redundant. In Figs. 3 and 4, we plot the results from a comparative study in terms of the total number of predicted edges using both discrete (left) and continuous (right) data. In the figures, the dashed line represents the ground truth. It is clear that the number of edges predicted by GSGS remains closer to the ground truth as compared to K2 and MH. In most of the cases, the number of edges predicted by K2 and MH is much higher than the ground truth, indicating an increased number of false positives in the inferred networks.
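The undersampling and oversampling of gene sets used in this sensitivity analysis can be mimicked with a small resampling routine. The sketch below is an assumption-laden illustration rather than the procedure used in the paper: the name `resample_gene_sets` is hypothetical, and for percentages above 100 it assumes that the added gene sets are drawn with replacement from the original list, which produces redundant copies.

```python
import random

def resample_gene_sets(gene_sets, percent, seed=0):
    """Return a list holding `percent` percent of the original number of gene sets.

    percent < 100: gene sets are randomly removed (sampling without replacement).
    percent > 100: randomly chosen gene sets are appended again (assumption:
    sampling with replacement from the original list).
    """
    rng = random.Random(seed)
    gene_sets = list(gene_sets)
    n = len(gene_sets)
    target = round(n * percent / 100)
    if target <= n:
        # Undersampling: keep a random subset of the requested size.
        return rng.sample(gene_sets, target)
    # Oversampling: keep everything and add redundant copies.
    extra = [rng.choice(gene_sets) for _ in range(target - n)]
    return gene_sets + extra
```

With a list of 10 gene sets, `resample_gene_sets(sets, 20)` keeps 2 of them and `resample_gene_sets(sets, 120)` returns all 10 plus 2 duplicates, matching the 20 percent and 120 percent settings on the x-axis of Fig. 2.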

Figs. 5 and 6 plot the F-scores from different approaches with increasing percentage of prior knowledge. In both figures, the x- and y-axes represent the percentage of prior knowledge and the F-scores, respectively. We observe that the F-scores for GSGS are significantly higher than those for K2 and MH using both discrete (left) and continuous (right) data. Further, the impact of incorporating prior knowledge on the F-score is more prominent for GSGS than for K2 and MH, especially with continuous data, where the F-scores for K2 and MH remain much lower than those for GSGS even in the presence of a large amount of prior knowledge. We also compared GSGS with four other MI-based approaches, ARACNE, CLR, MRNET, and RN, without using prior knowledge. The four approaches have been implemented in the R package MINET [23]. As MI networks are undirected, we treated the true underlying networks as well as the networks inferred by GSGS as undirected in the comparison. The F-scores calculated using both discrete and continuous data are presented in Table 3. We observed a

Fig. 3. Network: E. coli. Comparison of the GSGS approach with K2 and MH in terms of total number of predicted edges with increasing percentage of prior knowledge. Left panel corresponds to using discrete measurements, where both Bayesian and BIC score function were used. On the right panel “Method-N” stands for a Bayesian network method applied to continuous data of sample size N using Bayesian Information Criterion. The dashed line represents the ground truth.


Fig. 4. Network: In silico. Comparison of the GSGS approach with K2 and MH in terms of total number of predicted edges with increasing percentage of prior knowledge. Left panel corresponds to using discrete measurements, where both Bayesian and BIC score function were used. On the right panel “Method-N” stands for a Bayesian network method applied to continuous data of sample size N using Bayesian Information Criterion. The dashed line represents the ground truth.

Fig. 5. Network: E. coli. Comparison of the GSGS approach with K2 and MH in terms of F-scores. Here, x-axis represents the percentage of prior knowledge and y-axis plots F-scores from three approaches. Left panel corresponds to using discrete measurements, where both Bayesian and BIC score function were applied. On the right panel “Method-N” stands for a Bayesian network method applied to continuous data of sample size N using Bayesian Information Criterion.

Fig. 6. Network: In silico. Comparison of the GSGS approach with K2 and MH in terms of F-scores. Here, x-axis represents the percentage of prior knowledge and y-axis plots F-scores from three approaches. Left panel corresponds to using discrete measurements, where both Bayesian and BIC score function were applied. On the right panel “Method-N” stands for a Bayesian network method applied to continuous data of sample size N using Bayesian Information Criterion.

significantly better performance of GSGS in comparison to MI network inference methods. In Fig. 7, we provide more detailed evidence of the superior performance of GSGS using both In silico and E. coli networks. In Fig. 7, the two left panels represent the true topologies of the two networks, and the two right panels represent the topologies reconstructed using GSGS. In each reconstructed network, blue edges represent true positives


TABLE 3 Performance Comparison of GSGS with Four Other Pair-Wise Similarity-Based Network Reconstruction Approaches in Terms of F-Scores

Left and right panels correspond to using discrete and continuous data, respectively. For continuous data sample size is 50.

Fig. 7. A proof of principle study. Left panels show two gold standard networks, E. coli (Upper) and In silico (Lower). Right panels show the corresponding predicted networks by GSGS, E. coli (Upper) and In silico (Lower). For a fair comparison, all stand-alone linear paths of length 2 are removed from both networks. On the right panels, the blue edges correspond to true positives and gray edges represent false positives. Figures were generated using Cytoscape [37].

and gray edges represent false positives. A high level of accuracy is observed in both the reconstructed networks.

3.4 Performance Evaluation Using E. coli Datasets

We applied GSGS to infer signaling mechanisms using the IFGS compendiums derived from E. coli data sets. Our parameter setting was the same as used in the first study. We collected 500 samples after a burn-in state fixed at 500. We tested the performance of Bayesian network and MI-based methods using the given continuous data sets and binary equivalent data. In each case, we observed a very low sensitivity value by using Bayesian network methods. In addition, we could not discover any structure in several cases. Therefore, we compared the performance of GSGS with MI-based approaches. We inferred MI networks using continuous data, as we could not discover a structure in some cases by using discrete data. In Fig. 8, we plot the performance of GSGS and MI-based network inference methods in terms of the F-score ratio, which is the ratio of the F-score from GSGS to the one from MI-based methods. A ratio greater than 1 indicates a better performance by GSGS. As shown in Fig. 8, we observed a higher F-score using GSGS, compared with MI-based network inference methods.

3.5 Pathway Reconstruction in Breast Cancer Cells

In our final case study, we used GSGS to infer a signaling pathway structure from the IFGS compendium derived using breast cancer molecular profiling data. The IFGS compendium comprised genes participating in the ERBB signaling pathway in KEGG [12]. No prior knowledge about the structure of the ERBB signaling pathway available from KEGG

Fig. 8. Comparison of GSGS with the contemporary MI-based network inference methods using four benchmark E. coli data sets available from the DREAM initiative.


TABLE 4 Genes Arranged in Different Layers in the Hierarchical Representation of the ERBB Signaling Pathway Available from the KEGG Database

was assumed. Using GSGS, we sampled 4,500 networks after a burn-in state fixed at 500. The computational complexity of GSGS was easily manageable for the derived IFGS compendium. To validate the performance of GSGS, we utilized the structure of the ERBB signaling pathway available from KEGG. As the direction of an information flow is from an upper layer (lower index layer) to a lower one (higher index layer) in the hierarchical representation of a pathway, we collected genes lying in different layers of the ERBB signaling pathway in KEGG, which are presented in Table 4. Considering the noise, i.e., which genes were recognized in each IFGS by data discretization and undersampling among IFGSs, at the very minimum we expect a larger number of edges from a gene in an upper layer to a gene in a lower layer. Indeed, it was found that 60% of the inferred edges follow this hierarchy, i.e., no parent came from a lower layer. In 20% of the edges, the parent and child nodes came from the same layer. It is likely that genes lying in the same layer are expressed together in many IFGSs, as they often share a common regulator. Overall, the performance of GSGS depends on the purity of the input data, like any other inference method. In the upper panel of Fig. 9, we present a few reconstructed signaling events. It can be easily verified that each IF in the figure follows the hierarchy presented in Table 4. For example, corresponding to the IFGS {ARAF, ELK1, KRAS}, GSGS predicted an IF KRAS → ARAF → ELK1, where KRAS came from Layer 6, ARAF from Layer 7, and ELK1 from Layer 10. We further analyzed the inferred structure to identify linear signaling events reported in KEGG. In the lower panel of Fig. 9, we present a partial view of the reconstructed structure formed in the neighborhood of genes ERBB2 and ERBB3. Each edge in the figure follows the hierarchy presented in Table 4.
Additionally, a red edge means that a linear signaling event between parent and child node has already been recognized in the ERBB signaling pathway structure in KEGG. For example, there exists a linear signal transduction from KRAS to ELK1 via ARAF, and from ERBB3 to ELK1 via KRAS and BRAF in the structure available from KEGG. Green edges correspond to a pair of nodes coming from the same layer in Table 4. Black

Fig. 9. Upper Panel: example of information flows inferred by GSGS. Genes in each information flow follow the hierarchy presented in Table 4. Lower Panel: a partial view of the network formed by genes in the neighborhood of ERBB2 and ERBB3. Each information flow follows the hierarchy presented in Table 4. In particular, a red edge means that a linear signaling event between parent and child node has been recognized in the ERBB signaling pathway structure in the KEGG database. Green edges correspond to a pair of nodes coming from the same layer in Table 4. Black edges represent a pair of nodes, where parent and child nodes come from an upper and a lower layer in Table 4, respectively. These edges can be viewed as predictions.

edges represent a pair of nodes, where the parent and child nodes come from an upper and a lower layer, respectively; however, a linear signaling event between them has not been reported in the pathway structure available from KEGG. Such edges can be viewed as predictions. Overall, the interaction mechanisms presented in Fig. 9 support the use of GSGS for inferring signaling pathway structures.
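The layer-based check described in this section can be expressed as a short routine that labels each inferred edge by the layers of its parent and child. This is a sketch under stated assumptions: the function name `classify_edges` is hypothetical, only the three layer indices quoted in the text (KRAS in Layer 6, ARAF in Layer 7, ELK1 in Layer 10) are taken from the paper, and the third edge in the toy input is invented purely to show an edge that violates the hierarchy.

```python
from collections import Counter

def classify_edges(edges, layer):
    """Return the fraction of edges that go downward, stay in a layer, or go upward.

    `edges` holds (parent, child) pairs; `layer` maps a gene to its layer index,
    where a smaller index means an upper layer.
    """
    counts = Counter()
    for parent, child in edges:
        lp, lc = layer[parent], layer[child]
        if lp < lc:
            counts["downward"] += 1    # parent in an upper layer: follows the hierarchy
        elif lp == lc:
            counts["same_layer"] += 1  # parent and child share a layer
        else:
            counts["upward"] += 1      # parent in a lower layer: violates the hierarchy
    total = sum(counts.values())
    if total == 0:
        return {}
    return {k: counts[k] / total for k in ("downward", "same_layer", "upward")}


# Layer indices as quoted in the text for these three genes.
layer = {"KRAS": 6, "ARAF": 7, "ELK1": 10}
# The first two edges form the IF KRAS -> ARAF -> ELK1; the third is hypothetical.
edges = [("KRAS", "ARAF"), ("ARAF", "ELK1"), ("ELK1", "KRAS")]
fractions = classify_edges(edges, layer)
```

Applied to a full inferred network together with the complete layer assignment of Table 4, the three fractions correspond to the kinds of edges distinguished in the discussion above.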

4 CONCLUSION

In this paper, we proposed a novel computational approach, GSGS, to infer the most likely signaling pathway structure from a probability distribution of sampled signaling pathway structures using overlapping gene sets related to a given pathway. We first assessed the performance of GSGS by deriving gene sets from two gold standard networks, E. coli and In silico, available from the DREAM initiative. Our approach was shown to have significantly better performance in terms of both F-score and total number of predicted edges than the Bayesian network approaches K2 and MCMC, and the mutual information approaches ARACNE, RN, CLR, and MRNET. Robustness of GSGS against undersampling or oversampling of gene sets was demonstrated by performing a sensitivity analysis. Our conclusions were further validated by testing


the performance of the aforementioned approaches on four E. coli data sets available from DREAM. Finally, we applied GSGS to reconstruct a network in breast cancer cells, and verified it using database knowledge available from KEGG. Overall, our analyses favor the use of GSGS approach in the inference of complicated signaling pathway structures. As far as we know, GSGS is original in the following aspects:


1. It offers a unique gene set-based approach for the reconstruction of directed signaling pathway structures.
2. The ordering of genes in each gene set is treated as a random variable to capture the higher order interactions among genes participating in signal transduction events. In most of the existing approaches, individual genes are treated as variables.
3. The problem of signaling pathway structure inference is cast into the framework of parameter estimation for a multivariate distribution.
4. The true signaling pathway structures are modeled as a probability distribution of sample signaling pathway structures.

GSGS will substantially benefit by extending it to also incorporate the identification of novel pathway components from large-scale molecular profiling data. Since a signaling pathway structure represents a subnetwork of PPI network, another related extension could be the identification of PPI subnetworks relevant to signaling pathways. In this case, our study can be useful for both 1) identifying directed signaling mechanisms, as PPI subnetworks represent undirected networks and 2) identifying novel signaling mechanisms among proteins in the subnetwork. We expect our gene set-based GSGS approach to open a new avenue in methodology research of signal transduction.


ACKNOWLEDGMENTS


This work was supported by NIH grant R21LM010137 to D. Zhu. All correspondence should be addressed to D. Zhu.


REFERENCES

[1] G. Altay and F. Emmert-Streib, “Differences in Gene Network Inference Algorithms on the Network-Level by Ensemble Methods,” Bioinformatics, vol. 26, no. 14, pp. 1738-1744, 2010.
[2] A.J. Butte, P. Tamayo, D. Slonim, T. Golub, and I.S. Kohane, “Discovering Functional Relationships between RNA Expression and Chemotherapeutic Susceptibility Using Relevance Networks,” Proc. Nat’l Academy of Sciences USA, vol. 97, no. 22, pp. 12182-12186, 2000.
[3] A.J. Butte and I.S. Kohane, “Relevance Networks: A First Step toward Finding Genetic Regulatory Networks within Microarray Data,” Analysis of Gene Expression Data, G. Parmigiani, E.S. Garrett, R.A. Irizarry, and S.L. Zeger, eds., pp. 428-446, Springer, 2003.
[4] G.F. Cooper and E. Herskovits, “A Bayesian Method for the Induction of Probabilistic Networks from Data,” Machine Learning, vol. 9, no. 4, pp. 309-347, 1992.
[5] A. Dobra, C. Hans, B. Jones, J.R. Nevins, and M. West, “Sparse Graphical Models for Exploring Gene Expression Data,” J. Multivariate Analysis, vol. 90, pp. 196-212, 2004.
[6] J.J. Faith, B. Hayete, J.T. Thaden, I. Mogno, J. Wierzbowski, G. Cottarel, S. Kasif, J.J. Collins, and T.S. Gardner, “Large-Scale Mapping and Validation of Escherichia Coli Transcriptional Regulation from a Compendium of Expression Profiles,” PLoS Biology, vol. 5, no. 1, p. e8, 2007.


[7] N. Friedman, M. Linial, I. Nachman, and D. Peer, “Using Bayesian Networks to Analyze Expression Data,” J. Computational Biology, vol. 7, pp. 601-620, 2000.
[8] T.S. Gardner, D. di Bernardo, D. Lorenz, and J.J. Collins, “Inferring Genetic Networks and Identifying Compound Mode of Action via Expression Profiling,” Science, vol. 301, no. 5629, pp. 102-105, 2003.
[9] A. Gelman, J.B. Carlin, H.S. Stern, and D.B. Rubin, Bayesian Data Analysis, second ed. Chapman and Hall, 2003.
[10] G.H. Givens and J.A. Hoeting, Computational Statistics, John Wiley and Sons, 2005.
[11] F. Iorio, R. Bosotti, E. Scacheri, V. Belcastro, P. Mithbaokar, R. Ferriero, L. Murino, R. Tagliaferri, N. Brunetti-Pierri, A. Isacchi, and D. di Bernardo, “Discovery of Drug Mode of Action and Drug Repositioning from Transcriptional Responses,” Proc. Nat’l Academy of Sciences USA, vol. 107, no. 33, pp. 14621-14626, 2010.
[12] M. Kanehisa and S. Goto, “KEGG: Kyoto Encyclopedia of Genes and Genomes,” Nucleic Acids Research, vol. 28, pp. 27-30, 2000.
[13] H. Kishino and P.J. Waddell, “Correspondence Analysis of Genes and Tissue Types and Finding Genetic Links from Microarray Data,” Genome Informatics, vol. 11, pp. 83-95, 2000.
[14] J. Kubica, A. Moore, D. Cohn, and J. Schneider, “cGraph: A Fast Graph-Based Method for Link Analysis and Queries,” Proc. IJCAI Text-Mining and Link-Analysis Workshop, 2003.
[15] J. Lamb, E.D. Crawford, D. Peck, J.W. Modell, I.C. Blat, M.J. Wrobel, J. Lerner, J.P. Brunet, A. Subramanian, K.N. Ross, M. Reich, H. Hieronymus, G. Wei, S.A. Armstrong, S.J. Haggarty, P.A. Clemons, R. Wei, S.A. Carr, E.S. Lander, and T.R. Golub, “The Connectivity Map: Using Gene-Expression Signatures to Connect Small Molecules, Genes, and Disease,” Science, vol. 313, no. 5795, pp. 1929-1935, 2006.
[16] G. Lurje and H.J. Lenz, “EGFR Signaling and Drug Discovery,” Oncology, vol. 77, no. 6, pp. 400-410, 2009.
[17] D. Marbach, T. Schaffter, C. Mattiussi, and D. Floreano, “Generating Realistic in Silico Gene Networks for Performance Assessment of Reverse Engineering Methods,” J. Computational Biology, vol. 16, no. 2, pp. 229-239, 2009.
[18] D. Marbach, R.J. Prill, T. Schaffter, C. Mattiussi, D. Floreano, and G. Stolovitzky, “Revealing Strengths and Weaknesses of Methods for Gene Network Inference,” Proc. Nat’l Academy of Sciences USA, vol. 107, no. 14, pp. 6286-6291, 2010.
[19] A. Margolin, I. Nemenman, K. Basso, C. Wiggins, G. Stolovitzky, R. Favera, and A. Califano, “ARACNE: An Algorithm for the Reconstruction of Gene Regulatory Networks in a Mammalian Cellular Context,” BMC Bioinformatics, vol. 7, article S7, 2006.
[20] F. Markowetz, D. Kostka, O.G. Troyanskaya, and R. Spang, “Nested Effects Models for High-Dimensional Phenotyping Screens,” Bioinformatics, vol. 23, no. 13, pp. i305-i312, 2007.
[21] P. Mendes, “Framework for Comparative Assessment of Parameter Estimation and Inference Methods in Systems Biology,” Learning and Inference in Computational Systems Biology, N.D. Lawrence, M. Girolami, M. Rattray, G. Sanguinetti, eds., pp. 33-58, MIT Press, 2009.
[22] P.E. Meyer, K. Kontos, and G. Bontempi, “Information-Theoretic Inference of Large Transcriptional Regulatory Networks,” EURASIP J. Bioinformatics and Systems Biology, vol. 2007, p. 79879, 2007.
[23] P.E. Meyer, F. Lafitte, and G. Bontempi, “MINET: An Open Source R/Bioconductor Package for Mutual Information Based Network Inference,” BMC Bioinformatics, vol. 9, article 461, 2008.
[24] K. Murphy, “Active Learning of Causal Bayes Net Structure,” technical report, UC Berkeley, 2001.
[25] K. Murphy, “The Bayes Net Toolbox for Matlab,” Computing Science and Statistics, vol. 33, pp. 331-350, 2001.
[26] P.M. Navolanic, L.S. Steelman, and J.A. McCubrey, “EGFR Family Signaling and Its Association with Breast Cancer Development and Resistance to Chemotherapy (Review),” Int’l J. Oncology, vol. 22, no. 2, pp. 237-252, 2003.
[27] M.A. Olayioye, “Update on HER-2 as a Target for Cancer Therapy: Intracellular Signaling Pathways of ErbB2/HER-2 and Family Members,” Breast Cancer Research, vol. 3, no. 6, pp. 385-389, 2001.
[28] H. Pang, A. Lin, M. Holford, B.E. Enerson, B. Lu, M.P. Lawton, E. Floyd, and H. Zhao, “Pathway Analysis Using Random Forests Classification and Regression,” Bioinformatics, vol. 22, pp. 2028-2036, 2006.


[29] H. Pang and H. Zhao, “Building Pathway Clusters from Random Forests Classification Using Class Votes,” BMC Bioinformatics, vol. 9, no. 1, article 87, 2008.
[30] R.J. Prill, D. Marbach, J. Saez-Rodriguez, P.K. Sorger, L.G. Alexopoulos, X. Xue, N.D. Clarke, G. Altan-Bonnet, and G. Stolovitzky, “Towards a Rigorous Assessment of Systems Biology Models: The DREAM3 Challenges,” PLoS ONE, vol. 5, no. 2, p. e9202, 2010.
[31] M.G. Rabbat, J.R. Treichler, S.L. Wood, and M.G. Larimore, “Understanding the Topology of a Telephone Network via Internally Sensed Network Tomography,” Proc. IEEE Int’l Conf. Acoustics, Speech, and Signal Processing, vol. 3, pp. 977-980, 2005.
[32] M.G. Rabbat, M.A.T. Figueiredo, and R.D. Nowak, “Network Inference from Co-Occurrences,” IEEE Trans. Information Theory, vol. 54, no. 9, pp. 4053-4068, Sept. 2008.
[33] A.J. Richards, B. Muller, M. Shotwell, L.A. Cowart, R. Baerbel, and X. Lu, “Assessing the Functional Coherence of Gene Sets with Metrics Based on the Gene Ontology Graph,” Bioinformatics, vol. 26, no. 12, pp. i79-i87, 2010.
[34] J. Schäfer and K. Strimmer, “An Empirical Bayes Approach to Inferring Large-Scale Gene Association Networks,” Bioinformatics, vol. 21, pp. 754-764, 2005.
[35] G. Schwarz, “Estimating the Dimension of a Model,” The Annals of Statistics, vol. 6, no. 2, pp. 461-464, 1978.
[36] E. Segal, M. Shapira, A. Regev, D. Peer, D. Botstein, D. Koller, and N. Friedman, “Module Networks: Identifying Regulatory Modules and Their Condition-Specific Regulators from Gene Expression Data,” Nature Genetics, vol. 34, pp. 166-176, 2003.
[37] P. Shannon, A. Markiel, O. Ozier, N.S. Baliga, J.T. Wang, D. Ramage, N. Amin, B. Schwikowski, and T. Ideker, “Cytoscape: A Software Environment for Integrated Models of Biomolecular Interaction Networks,” Genome Research, vol. 13, no. 11, pp. 2498-2504, 2003.
[38] I. Shmulevich, E.R. Dougherty, S. Kim, and W. Zhang, “Probabilistic Boolean Networks: A Rule-Based Uncertainty Model for Gene Regulatory Networks,” Bioinformatics, vol. 18, no. 2, pp. 261-274, 2002.
[39] I. Shmulevich, I. Gluhovsky, R. Hashimoto, E.R. Dougherty, and W. Zhang, “Probabilistic Boolean Networks: A Rule-Based Uncertainty Model for Gene Regulatory Networks,” Comparative and Functional Genomics, vol. 4, no. 6, pp. 601-608, 2003.
[40] G. Stolovitzky, R.J. Prill, and A. Califano, “Lessons from the DREAM2 Challenges,” Annals of the New York Academy of Sciences, vol. 1158, pp. 159-195, 2009.
[41] A. Subramanian, P. Tamayo, V.K. Mootha, S. Mukherjee, B.L. Ebert, M.A. Gillette, A. Paulovich, S.L. Pomeroy, T.R. Golub, E.S. Lander, and J.P. Mesirov, “Gene Set Enrichment Analysis: A Knowledge-Based Approach for Interpreting Genome-Wide Expression Profiles,” Proc. Nat’l Academy of Sciences USA, vol. 102, pp. 15545-15550, 2005.
[42] A.L. Tarca, S. Draghici, P. Khatri, S.S. Hassan, P. Mittal, J.S. Kim, C.J. Kim, J.P. Kusanovic, and R. Romero, “A Novel Signaling Pathway Impact Analysis,” Bioinformatics, vol. 25, no. 1, pp. 75-82, 2009.
[43] J. Tegner, M.K.S. Yeung, J. Hasty, and J.J. Collins, “Reverse Engineering Gene Networks: Integrating Genetic Perturbations with Dynamical Modeling,” Proc. Nat’l Academy of Sciences USA, vol. 100, no. 10, pp. 5944-5949, 2003.
[44] C.J. Vaske, S.C. Benz, J.Z. Sanborn, D. Earl, C. Szeto, J. Zhu, D. Haussler, and J.M. Stuart, “Inference of Patient-Specific Pathway Activities from Multi-Dimensional Cancer Genomics Data Using Paradigm,” Bioinformatics, vol. 26, no. 12, pp. i237-i245, 2010.
[45] T.R. Xu, V. Vyshemirsky, A. Gormand, A. von Kriegsheim, M. Girolami, G.S. Baillie, D. Ketley, A.J. Dunlop, G. Milligan, M.D. Houslay, and W. Kolch, “Inferring Signaling Pathway Topologies from Multiple Perturbation Measurements of Specific Biochemical Species,” Science Signaling, vol. 3, no. 134, p. ra20, 2010.
[46] D. Zhu, A.O. Hero, Z.S. Qin, and A. Swaroop, “High Throughput Screening of Co-Expressed Gene Pairs with Controlled False Discovery Rate (FDR) and Minimum Acceptable Strength (MAS),” J. Computational Biology, vol. 12, no. 7, pp. 1029-1045, 2005.
[47] D. Zhu, M.G. Rabbat, A.O. Hero, R. Nowak, and M.A.T. Figueiredo, “De Novo Reconstructing Signaling Pathways from Multiple Data Sources,” New Research in Signaling Transduction, Nova Publisher, 2006.


Lipi Rani Acharya received the MSc and PhD degrees in mathematics from Indian Institute of Technology, Madras, and Indian Institute of Technology, Kanpur, respectively. Since 2008, she has been working toward the doctoral degree in the Department of Computer Science, University of New Orleans. Her current research interests include computational and statistical methods for analyzing molecular profiling data and reverse engineering of gene regulatory networks.

Thair Judeh received the bachelor’s degree in mathematics and computer science from Loyola University in New Orleans, Louisiana, and the master’s degree in computer science from the University of New Orleans. Currently, he is working toward the PhD degree in computer science at Wayne State University. His current interests include the reverse engineering and decomposition of gene regulatory networks.

Zhansheng Duan (S’07-M’10) received the BS and PhD degrees both in electrical engineering from Xi’an Jiaotong University, China, in 1999 and 2005, respectively, and the PhD degree in electrical engineering from the University of New Orleans, Louisiana, in 2010. From January 2010 to April 2010, he was a research assistant professor with Dr. Dongxiao Zhu in the Department of Computer Science, University of New Orleans. In July 2010, he joined the Center for Information Engineering Science Research, Xi’an Jiaotong University, where he is currently working as an associate professor. His research interests include gene regulatory network inference, estimation and detection theory, target tracking, information fusion, nonlinear filtering, and performance evaluation. He has authored or coauthored one book, Multisource Information Fusion (Tsinghua University Publishing House, 2006), and 46 journal and conference proceedings papers. He is also a member of the International Society of Information Fusion (ISIF), the Honor Society of Eta Kappa Nu, and the IEEE.

Michael G. Rabbat (S’02-M’07) received the BSc degree from the University of Illinois, Urbana-Champaign, the MSc degree from Rice University, Houston, and the PhD degree from the University of Wisconsin, Madison, all in electrical engineering, in 2001, 2003, and 2006, respectively. He was a visiting researcher at Applied Signal Technology, Inc., during the summer of 2003. Currently, he is working as an assistant professor at McGill University, Montreal, Quebec, Canada, and an associate editor for the ACM Transactions on Sensor Networks. His research interests include distributed information processing, network monitoring, and network inference.

Dongxiao Zhu received the PhD degree in bioinformatics in 2006. He is currently an assistant professor in the Department of Computer Science, Wayne State University. He was with the University of New Orleans as an assistant professor and the Stowers Institute for Medical Research as a biostatistician. His current research interests lie primarily in developing pattern recognition and machine learning models, algorithms, and tools to advance computational systems biology research.
