IEEE TRANSACTIONS ON INFORMATION TECHNOLOGY IN BIOMEDICINE, VOL. 13, NO. 2, MARCH 2009
Mining, Modeling, and Evaluation of Subnetworks From Large Biomolecular Networks and Its Comparison Study Xiaohua Hu, Member, IEEE, Michael Ng, Fang-Xiang Wu, Member, IEEE, and Bahrad A. Sokhansanj, Member, IEEE
Abstract—In this paper, we present a novel method to mine, model, and evaluate a regulatory system executing cellular functions that can be represented as a biomolecular network. Our method consists of two steps. First, a novel scale-free network clustering approach is applied to such a biomolecular network to obtain various subnetworks. Second, computational models are generated for the subnetworks and simulated to predict their behavior in the cellular context. We discuss and evaluate some of the advanced computational modeling approaches, in particular, state-space modeling, probabilistic Boolean network modeling, and fuzzy logic modeling. The modeling and simulation results represent hypotheses that are tested against high-throughput biological datasets (microarrays and/or genetic screens) under normal and perturbation conditions. Experimental results on time-series gene expression data for the human cell cycle indicate that our approach is promising for subnetwork mining and simulation from large biomolecular networks. Index Terms—Biomolecular network analysis, fuzzy modeling, probabilistic Boolean network (PBN) model, state-space model subnetwork mining.
I. INTRODUCTION As biomolecular networks grow in size and complexity, the model of a biomolecular network must become more rigorous to keep track of all the components and their interactions. In general, this presents the need for computer simulation to manipulate and understand the biomolecular network model. However, a major challenge of modeling the dynamics of
Manuscript received March 12, 2008; revised June 12, 2008 and September 8, 2008. Current version published March 3, 2009. The work of X. Hu was supported in part by the National Science Foundation (NSF) under Career Grant NSF IIS 0448023 and Grant NSF CCF 0514679 and by the Research Grant from Pennsylvania (PA) Department of Health. The work of M. Ng was supported in part by the Hong Kong Research Grants Council (RGC) under Grant 201508 and by the Hong Kong Baptist University Faculty Research Grant (HKBU FRGs). The work of F.-X. Wu was supported by the Natural Science and Engineering Research Council of Canada (NSERC). X. Hu is with the College of Information Science and Technology, Drexel University, Philadelphia, PA 19104 USA. He is also a Yellow-River Scholar of Henan University, Henan, China (e-mail:
[email protected]). M. Ng is with the Department of Mathematics, Hong Kong Baptist University, Kowloon, Hong Kong (e-mail:
[email protected]). F.-X. Wu is with the Department of Mechanical Engineering, Division of Biomedical Engineering, University of Saskatchewan, Saskatoon, SK S7N 5A9, Canada (e-mail:
[email protected]). B. A. Sokhansanj is with the School of Biomedical Engineering, Science and Health Systems, Drexel University, Philadelphia, PA 19104 USA (e-mail:
[email protected]). This paper has supplementary downloadable multimedia material, provided by the author, available at http://ieeexplore.ieee.org. Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org. Digital Object Identifier 10.1109/TITB.2008.2007649
a biomolecular network is that conventional methods based on physical and chemical principles (such as systems of differential equations) require data that are difficult to measure accurately and consistently using either conventional or high-throughput technologies, which characteristically yield noisy, semiquantitative, and often relative data. For example, microarray gene expression ratios are ultimately obtained from pixel counts of relatively messy images [1]. Boolean networks (e.g., [2]) are computationally simple, and thus, they are potentially suitable for handling both the complexity of biological networks and qualitative text-based data. However, Boolean models have been proven to lack the resolution needed to accurately model biomolecular interactions [3]. In contrast, various differential-equation-based models (e.g., [4]) are computationally expensive and sensitive to imprecisely measured parameters (and virtually useless given purely qualitative data, i.e., from text mining). Therefore, there is a need to develop computational models that incorporate the uncertainty of experimental results within alternative frameworks, such as a probabilistic Boolean network (PBN) model, fuzzy logic model, and/or state-space model. In this paper, we present a hybrid approach that combines data mining and advanced computational modeling to build and analyze the large biomolecular network of a cell process. Our method consists of two steps. First, a novel scale-free network clustering approach is applied to the biomolecular network to obtain various subnetworks. The clustering algorithm considers the characteristics of scale-free network graphs and is based on the local density of a vertex and its neighborhood functions, which can be used to find more meaningful clusters with different density levels. Second, a computational model is generated for the subnetworks and simulated to predict their behavior in the cellular context.
Our method integrates the process of obtaining network structure directly with simulation of the advanced computation models that are robust to qualitative (molecular biology) and noisy quantitative (biochemical) data. Thus, the method provides a way to iteratively test and refine hypothetical biomolecular networks. The rest of the paper is organized as follows. In Section II, we review some of the related work in community/subnetwork identification and computational modeling for biomolecular networks. We present the data flow of our methods in Section III. We describe a novel algorithm SNBuilder in Section IV for community structure analysis. The various computational modeling approaches such as state-space model, probabilistic Boolean network (PBN) model, and fuzzy logic
1089-7771/$25.00 © 2009 IEEE Authorized licensed use limited to: Drexel University. Downloaded on February 25,2010 at 13:22:47 EST from IEEE Xplore. Restrictions apply.
HU et al.: MINING, MODELING, AND EVALUATION OF SUBNETWORKS
model are discussed in Section V along with their experimental results. Section VI concludes with our main finding and some future research directions. II. RELATED WORKS A. Community Structure Analysis Studying the community structure of biological networks is of particular interest due to high data volume and the complexity of interactions. In the context of biological networks, communities might represent structural or functional groupings. They can be synonymous with molecular modules, biochemical pathways, gene clusters, or protein complexes. Hashimoto et al. [5] have developed an approach to growing genetic regulatory networks from seed genes. Their work is based on PBNs, and subnetworks are constructed in the context of a directed graph using both the coefficient of determination (COD) and the Boolean function influence among genes. Related works also include those that predict interacting protein complexes. Jansen et al. [6] used a procedure integrating different data sources to predict the membership of protein complexes for individual genes based on two assumptions: first, the function of any protein complex depends on the functions of its subunits, and second, all subunits of a protein complex share certain common properties. Bader and Hogue [7] report a molecular complex detection (MCODE) clustering algorithm to identify molecular complexes in a large protein interaction network. MCODE is based on local network density—a modified measure of the clustering coefficient. Bu et al. [8] used a spectral analysis method to identify the topological structures such as quasi-cliques and quasi-bipartite in a protein–protein interaction network. These topological structures are found to be biologically relevant functional groups. In our previous work, we developed a spectral-based clustering method using local density and vertex neighborhood to analyze the chromatin network [9]. 
Two recent works along this line of research are based on the concept of network modularity introduced by Hartwell et al. [10]. B. Biomolecular Networking Modeling A variety of approaches have been implemented for modeling gene and protein networks, including hidden Markov models (e.g., [11] and [12]), Bayesian networks [13], linear neural networks [14], finite-state models [15], and other mathematical models [16]. These methods are based on either treating biological variables at the crudest resolution (ON or OFF in Boolean networks, a few more levels possible for finite-state models but with rapidly growing complexity) or as absolute physical quantities. Somogyi and Sniegoski [17] have shown that Boolean networks have features similar to those in biological systems, such as global complex behavior, self-organization, stability, redundancy, and periodicity. Recently, Akutsu et al. [2] have devised a much simpler algorithm for the same problem, and proved that if the in-degree of each node (i.e., the number of input nodes to each node) is bounded by a constant, only $O(\log n)$ state transition pairs (from the $2^n$ possible pairs) are necessary and sufficient to identify the original Boolean network of n nodes (genes) correctly with high probability. However, the Boolean network
Fig. 1. Outline of mining, modeling, and evaluation of biomolecular network.
models depend on simplified assumptions about biological systems. Boolean models have been proven to lack the resolution needed to accurately model biomolecular interactions [3]. In addition to Boolean network models, differential/difference equation models have also been applied to inferring gene expression. Chen et al. [18] have proposed a differential equation model of gene expression. Due to the lack of gene expression data, these models will usually be underdetermined. Under the additional condition that the gene regulatory network should be sparse, they have shown that the model can be constructed in $O(n^{h+1})$ time, where n is the number of genes and/or proteins in the model and h is the maximum number of nonzero coefficients (connectivity degree of genes in a regulatory network) allowed for each differential equation in the model. In order that the parameters of the models be identifiable, both Chen [19] and Akutsu et al. [2] assume that all genes have a fixed maximum connectivity degree h (often small). These assumptions are debatable. In biological reality, some genes are known to have many regulatory inputs, while others are not known to have more than a few. Moreover, differential-equation-based models [4] are computationally expensive and sensitive to imprecisely measured parameters, as well as unable to handle purely qualitative data, i.e., from text mining. III. DATA FLOW OF OUR APPROACH The data flow of our method is illustrated in Fig. 1. A novel scale-free network clustering approach is applied to the biomolecular network to obtain various subnetworks. Then a hypothetical model is computationally generated for the subnetwork and simulated to predict its dynamic biological behavior within relevant experimental contexts. Thus, modeling results can be verified against high-throughput data (microarrays and/or genetic screens) for both normal and perturbation conditions.
If computational results do not match experimental or previously published results, then a new hypothesis is generated and fed back to the data mining and analysis step to refine the biomolecular network for the next iteration. As the procedure continues, better convergence between modeling and experiments evolves.
Notably, the dynamic modeling component of this method depends on the automated network structure generation of the first component and the subnetwork clustering, which are both essential to make the solution tractable. The details of the steps are described in the subsequent sections. IV. MINING THE BIOMOLECULAR NETWORK TO IDENTIFY SUBNETWORKS Here, the goal of our study is to address the key question: what is the particular community within the whole biomolecular network to which a given set of proteins belongs? We are motivated by two main factors. First, due to the complexity and modularity of biological networks, it is computationally more feasible to study a community containing a small number of proteins of interest. Second, sometimes the whole community structure of the network may not be our primary concern. Rather, we may be more interested in finding the community that contains a protein (or proteins) of interest. Our aim is to discover relatively small subnetworks such that proteins inside the subnetwork interact significantly and, meanwhile, are not strongly influenced by proteins outside the subnetwork. Subnetworks are constructed starting with a seed consisting of one or more proteins believed to participate in a viable subnetwork. Given this seed, we iteratively adjoin new proteins following an adapted definition of a community in a network. In this section, we describe our procedure and its results for a particular biological test case.
A. Algorithm SNBuilder We model the protein–protein interaction network as an undirected graph, where vertices represent proteins and edges represent interactions between pairs of proteins. An undirected graph $G = (V, E)$ is composed of two sets, vertices $V$ and edges $E$. An edge is defined as a pair of vertices $(u, v)$ denoting the direct connection between vertices $u$ and $v$. The graphs we use in this paper are undirected, unweighted, and simple, meaning no parallel edges. For a subgraph $G' \subset G$ and a vertex $i$ belonging to $G'$, we define the in-community degree for vertex $i$, $k_i^{\mathrm{in}}(G')$, to be the number of edges connecting vertex $i$ to other vertices belonging to $G'$ and the out-community degree $k_i^{\mathrm{out}}(G')$ to be the number of edges connecting vertex $i$ to other vertices that are in $G$ but do not belong to $G'$. In our algorithm, we adopt the quantitative definitions of community defined by Radicchi et al. [20], i.e., the subgraph $G'$ is a community in a strong sense if $k_i^{\mathrm{in}}(G') > k_i^{\mathrm{out}}(G')$ for each vertex $i$ in $G'$, and in a weak sense if the sum of all degrees within $G'$ is greater than the sum of all degrees from $G'$ to the rest of the graph. The algorithm, called SNBuilder, accepts the seed protein s, gets the neighbors of s, finds the core of the community to build, and expands the core to find the eventual community. The two major components of SNBuilder are FindCore and ExpandCore. FindCore (line 5 to line 8) performs a naïve search for a maximum clique in the neighborhood of the seed protein by recursively removing vertices with the lowest in-community degree until either: 1) all vertices in the core set have the same in-community degree ($K_{\min} = K_{\max}$, i.e., the resulting subgraph is a clique); or 2) all vertices except the seed have the same in-community degree (a star-like structure). The algorithm performs a breadth-first expansion in the core expanding step. It first builds a candidate set containing the core and all vertices adjacent to each vertex in the core (line 14). A candidate vertex will then be added to the core if it meets one of the following conditions (line 19): 1) its in-community degree is greater than its out-community degree, i.e., the quantitative definition of community in a strong sense ($k_t^{\mathrm{in}} > k_t^{\mathrm{out}}$); or 2) its affinity coefficient is greater than or equal to the affinity threshold $f$. We define the affinity coefficient of a vertex to a network as the fraction of its in-community degree over the size of the network, excluding the vertex itself ($k_i^{\mathrm{in}}(D)/(|D| - 1)$). We introduce the affinity coefficient and the affinity threshold $f$ to provide a degree of relaxation when expanding the core, because requiring every expanding vertex to be a strong-sense community member is too strict. Even though a candidate vertex may not have an in-community degree larger than its out-community degree, it may connect to all (or even most of) other members of the network, indicating a strong tie between the candidate vertex and the network. We use an affinity threshold $f$ of 1 in our implementation, meaning that in order to be eligible to be added to the core set, the candidate vertex has to connect to all other vertices in the core set. However, $f$ may be relaxed to be less than 1, if necessary or so desired.
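The FindCore and ExpandCore steps described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names (`build_graph`, `find_core`, `expand_core`), the alphabetical tie-breaking rule, and the choice to repeat the expansion until no vertex is added are all hypothetical, and the toy graph is invented for the example.

```python
from typing import Dict, Iterable, Set, Tuple

Graph = Dict[str, Set[str]]  # undirected simple graph stored as adjacency sets


def build_graph(edges: Iterable[Tuple[str, str]]) -> Graph:
    """Build adjacency sets from an undirected edge list."""
    g: Graph = {}
    for u, v in edges:
        g.setdefault(u, set()).add(v)
        g.setdefault(v, set()).add(u)
    return g


def find_core(g: Graph, seed: str) -> Set[str]:
    """Prune the seed's neighborhood until all non-seed vertices share one in-degree."""
    core = {seed} | g[seed]
    while len(core) > 1:
        others = core - {seed}
        # in-community degree: edges from v to other members of the current core
        degs = {v: len(g[v] & core) for v in others}
        if min(degs.values()) == max(degs.values()):
            break  # clique or star-like structure reached
        # assumption: ties among lowest-degree vertices are broken alphabetically
        core.remove(min(others, key=lambda v: (degs[v], v)))
    return core


def expand_core(g: Graph, core: Set[str], f: float = 1.0) -> Set[str]:
    """Add candidates that satisfy the strong-sense rule or the affinity threshold f."""
    community = set(core)
    changed = True
    while changed:  # assumption: repeat until no candidate qualifies
        changed = False
        candidates = set().union(*(g[v] for v in community)) - community
        for t in sorted(candidates):
            k_in = len(g[t] & community)   # edges into the community
            k_out = len(g[t] - community)  # edges to the rest of the graph
            affinity = k_in / len(community)  # k_in / (|D| - 1) with D = community + t
            if k_in > k_out or affinity >= f:
                community.add(t)
                changed = True
    return community


# Toy network (hypothetical): triangle s-a-b, a loosely attached c, and d tied to a and b.
edges = [("s", "a"), ("s", "b"), ("a", "b"), ("s", "c"),
         ("c", "x"), ("c", "y"), ("a", "d"), ("b", "d")]
g = build_graph(edges)
core = find_core(g, "s")
community = expand_core(g, core)
print(sorted(core), sorted(community))
```

In this toy case the vertex c is pruned from the core because it has the lowest in-community degree, and d is later admitted because its in-community degree exceeds its out-community degree.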
TABLE I FOUR SEED PROTEINS AND THEIR COMMUNITIES
B. Evaluation of SNBuilder To test our algorithm, we downloaded a dataset of interactions for Saccharomyces cerevisiae from the general repository for interaction datasets (BioGRID) (http://www.thebiogrid.org/). The BioGRID database contains all published large-scale interaction datasets as well as available curated interactions such as those deposited in Biomolecular Interaction Network Database (BIND) [7] and Munich Information Center for Protein Sequences (MIPS) (http://mips.gsf.de/genre/proj/mpact). The yeast dataset that we downloaded has 4907 proteins and 17 598 interactions. We applied our algorithm to the network built from the downloaded dataset. The average running time for finding a community of 50 members is about 20 ms. Because there is no previously published alternative approach to our method for our specific application, we elected to compare the performance of our algorithm to the work on predicting protein complex membership by Asthana et al. [21]. Asthana and colleagues reported the results of queries with four complexes using probabilistic network reliability (we refer to their work as the PNR method in the following discussion). Four communities are identified by SNBuilder using one protein as seed from each of the query complexes used by the PNR method. The seed protein is selected randomly from the “core” protein set. As a comparison, we use Complexpander, an implementation of the PNR method [21], available at http://llama.med.harvard.edu/Software.html, to predict co-complex membership using the core protein set that contains the same seed protein used by SNBuilder. The first community is identified using TAF6 as seed. TAF6 is a component of the SAGA complex that is listed in MIPS complex catalogue as a known cellular complex consisting of 16 proteins. The community identified by our algorithm contains 39 members, including 14 of the 16 SAGA complex proteins listed in MIPS (indicated by an asterisk in the Alias column in Table I).
The community also contains 14 of 21 proteins listed in MIPS as Kornberg’s mediator (SRB) complex. The second community is discovered using NOT3 as seed. NOT3 is a known component protein of the CCR4–NOT complex that is a global regulator of gene expression and involved in functions such as transcription regulation and DNA damage responses. MIPS complex catalogue lists five proteins for NOT complex and 13 proteins (including the five NOT complex proteins) for CCR4 complex. The NOT community identified is composed of 40 members. All the five NOT complex proteins listed in MIPS and 11 of the 13 CCR4 complex proteins are members of the community. POL1, POL2, PRI1, and PRI2 are members of the DNA polymerase alpha (I)–primase complex, as listed in MIPS. RVB1, PIL1, UBR1, and STI1 have been grouped together with CCR4, CDC39, CDC36, and POP2 by systematic analysis [22]. The community also contains 20 out of 26 proteins of a complex that is probably involved in transcription and DNA/chromatin structure maintenance [23]. The third community is identified by using replication factor C2 (RFC2) as the seed. RFC2 is a component of the RFC complex. The community identified by our algorithm has 17 members. All five proteins of RFC complex listed in MIPS complex catalogue database are members of this community. We use ARP3 as seed to identify the last community ARP2/ARP3 complex. The identified community contains all seven proteins of the ARP2/ARP3 complex listed in MIPS, and there are 14 members belonging to the same functional category of budding, cell polarity, and filament formation according to MIPS.
Fig. 2. Sample of subnetwork.
V. COMPUTATIONAL MODELS FOR SUBNETWORKS AND THEIR EXPERIMENTAL RESULTS A. Test Dataset To evaluate the accuracy and feasibility of computational approaches for biomolecular network modeling, we considered a gene network corresponding to a subnetwork found using SNBuilder proposed in Section IV. The subnetwork as shown in Fig. 2 involves human genes related to p53, apoptosis, DNA damage response, and cell cycle. Edges in Fig. 2 are taken to represent potential connections between genes, defining the
TABLE II BEST FITTING RULES FOR FIVE INPUT GENES
structure of the gene network. Table II shows the genes that encode the proteins in Fig. 2. This results in some differences in terminology, for example, EP300 encodes the protein p300. Also, where aliases for gene names exist, the more common usage is given in Table II. In this paper, we employ the human cell cycle gene expression data [24] to construct the computational model of the subnetwork shown in Fig. 2. There are independent data in [24] for five methods of cell cycle synchronization, two of which are complete for the genes in the subnetwork we studied. One dataset (“Thy-Thy 3”) is used as the “training set” to identify the parameters in the following three models. A different dataset (“Thy-Noc”) is used as the “testing set.” B. State-Space Model In this paper, the following state-space model is proposed to describe a gene regulatory network:
$$\begin{aligned} z(t+1) &= A \cdot z(t) + B u(t) + n_1(t) \\ x(t) &= C \cdot z(t) + n_2(t). \end{aligned} \tag{1}$$
The meaning of the variables is as follows. In terms of linear system theory [19], equations (1) are called the state-space model of a dynamic system. The vector $x(t) = [x_1(t) \cdots x_n(t)]^T$ consists of the observation variables of the system and $x_i(t)$ ($i = 1, \ldots, n$) represents the expression level of gene i at time point t, where n is the number of genes in the network. The vector $z(t) = [z_1(t) \cdots z_p(t)]^T$ consists of the internal state variables of the system and $z_i(t)$ ($i = 1, \ldots, p$) represents the expression value of internal element (variable) i at time point t, which directly regulates gene expression, where p is the number of the internal state variables. The vector $u(t) = [u_1(t) \cdots u_r(t)]^T$ represents the external input (control variable) of the internal state governing equation. The matrix $A = [a_{ij}]_{p \times p}$ is the time translation matrix of the internal state variables or the state transition matrix. It provides key information on the influences of the internal variables on each other. The matrix $B = [b_{ik}]_{p \times r}$ is the control matrix. The entries of the matrix reflect the strength of a control variable to an internal variable. The matrix $C = [c_{ik}]_{n \times p}$ is the observation matrix, which transfers the information from the internal state variables to the observation variables. The entries of the matrix encode information on the influences of the internal regulatory elements on the genes. Finally, the vectors $n_1(t)$ and $n_2(t)$ stand for system noise and observation noise. In model (1), the upper equation is called the internal state governing equation while the lower one is called the observation equation. 1) Parameter Estimation: Let X be the gene expression data matrix with n rows and m columns, where n and m are the numbers of the genes and the measuring time points, respectively. The construction of model (1) using microarray gene expression data X may be divided into three phases. Phase 1 identifies the internal state variables and their expression matrix, and estimates the elements of observation matrix C. Phase 2 determines the control matrix B based on the observation matrix C and the structure of the network. Phase 3 estimates the elements of matrix A. a) Internal variables and estimation of observation matrix: The internal states are latent variables in gene regulatory networks. They can be any unobserved molecules in a cell that participate in the process of gene regulation.
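To make the roles of A, B, and C in model (1) concrete, the following sketch iterates the internal state governing equation and the observation equation forward in time. It is only an illustration: the function name `simulate_state_space`, the Gaussian noise model, and the small example matrices are assumptions chosen for the demonstration and are not taken from the paper.

```python
import numpy as np


def simulate_state_space(A, B, C, z0, U, noise_std=0.0, seed=0):
    """Iterate z(t+1) = A z(t) + B u(t) + n1(t) and x(t) = C z(t) + n2(t).

    U holds one control vector u(t) per column; noise is i.i.d. Gaussian
    with standard deviation noise_std (an assumption made for this sketch).
    """
    rng = np.random.default_rng(seed)
    p, n = A.shape[0], C.shape[0]
    T = U.shape[1]
    Z = np.zeros((p, T + 1))  # internal state trajectory
    X = np.zeros((n, T + 1))  # observed expression trajectory
    Z[:, 0] = z0
    X[:, 0] = C @ z0 + noise_std * rng.standard_normal(n)
    for t in range(T):
        Z[:, t + 1] = A @ Z[:, t] + B @ U[:, t] + noise_std * rng.standard_normal(p)
        X[:, t + 1] = C @ Z[:, t + 1] + noise_std * rng.standard_normal(n)
    return Z, X


# Hypothetical example: p = 2 internal variables, r = 1 control input, n = 3 genes.
A = np.array([[0.9, -0.2], [0.2, 0.9]])
B = np.array([[1.0], [0.0]])
C = np.array([[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]])
z0 = np.array([1.0, 0.0])
U = np.zeros((1, 5))
Z, X = simulate_state_space(A, B, C, z0, U)
print(Z.shape, X.shape)
```

With the noise switched off, every observed column is exactly C times the corresponding internal state, which is a convenient sanity check before adding noise.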
In this study, the maximum-likelihood algorithm for probabilistic principal component analysis (PPCA) [25] is employed to extract the internal variables from the observation data (time-course gene expression data). Using the PPCA model, it follows that
$$X = C \cdot Z + N \tag{2}$$
where X is the n × m observation data matrix, each column of which is viewed as an observation sample, C is the n × p transformation matrix, Z represents the expression profiles of the internal states, and N is the n × m noise matrix consisting of m n-dimensional observation noise vectors. We assume that the sample mean is shifted to zero. The log-likelihood of the PPCA model is expressed by
$$L = -\frac{m}{2}\left\{ n \ln(2\pi) + \log|D| + \mathrm{tr}(D^{-1}S) \right\} \tag{3}$$
where $D = CC^T + \sigma^2 I$, $\sigma^2$ is the variance of the observation noise, and $S = XX^T/m$. For the given number of internal variables p, the global maximum log-likelihood of the PPCA model is calculated by
$$L_p = -\frac{m}{2}\left\{ \sum_{j=1}^{p} \log(\lambda_j) + (n-p)\log\left(\frac{1}{n-p}\sum_{j=p+1}^{n}\lambda_j\right) + n(\log(2\pi) + 1) \right\} \tag{4}$$
when
$$C = U_p \tag{5}$$
where $\lambda_j$ ($j = 1, \ldots, p$) are the first p largest eigenvalues of the sample variance matrix S and $U_p$ is an n × p matrix, each column of which is a corresponding eigenvector of S. From (4), the values of the maximum log-likelihood for the PPCA model increase with the increased numbers of internal state variables p. The redundant internal state variables may result in a complicated model. In this paper, the Akaike information criterion (AIC) is adopted to avoid the redundant variables. For each model, the AIC can be calculated as
$$\mathrm{AIC}(p) = -m\left\{ \sum_{j=1}^{p} \log(\lambda_j) + (n-p)\log\left(\frac{1}{n-p}\sum_{j=p+1}^{n}\lambda_j\right) \right\} - 2(np + 1). \tag{6}$$
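The PPCA-based choice of the number of internal variables can be sketched numerically as below. This is a rough illustration that assumes the AIC takes the form of (6) with the residual sum running over the n − p smallest eigenvalues, that the largest AIC is preferred, and that the internal profiles are recovered with a pseudoinverse. The function name `ppca_select` and the synthetic data are hypothetical.

```python
import numpy as np


def ppca_select(X):
    """Pick the number of internal variables p by AIC and return (p, C, Z, aic)."""
    n, m = X.shape
    Xc = X - X.mean(axis=1, keepdims=True)   # shift the sample mean to zero
    S = Xc @ Xc.T / m                        # n x n sample variance matrix
    lam, U = np.linalg.eigh(S)               # eigenvalues in ascending order
    lam, U = lam[::-1], U[:, ::-1]           # reorder to descending
    lam = np.clip(lam, 1e-12, None)          # guard against log(0)
    aic = {}
    for p in range(1, n):
        tail_mean = lam[p:].mean()           # average of the n - p smallest eigenvalues
        fit = np.log(lam[:p]).sum() + (n - p) * np.log(tail_mean)
        aic[p] = -m * fit - 2 * (n * p + 1)
    p_best = max(aic, key=aic.get)           # largest AIC under this sign convention
    C = U[:, :p_best]                        # transformation matrix C = U_p
    Z = np.linalg.pinv(C) @ Xc               # internal profiles via the pseudoinverse of C
    return p_best, C, Z, aic


# Synthetic data (hypothetical): 6 genes, 200 time points, 2 latent factors plus noise.
rng = np.random.default_rng(0)
n, m, p_true = 6, 200, 2
X = rng.standard_normal((n, p_true)) @ rng.standard_normal((p_true, m))
X = X + 0.1 * rng.standard_normal((n, m))
p_best, C_hat, Z_hat, aic = ppca_select(X)
print(p_best, C_hat.shape, Z_hat.shape)
```

Because the columns of C are orthonormal eigenvectors, the pseudoinverse here reduces to the transpose, so the internal profiles are simply projections of the centered data onto the leading eigenvectors.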
By the definition [26], the model with the largest AIC is chosen. After the transformation matrix C is determined, the expression profiles of internal variables accumulated in matrix Z can be calculated by the formula $Z = C^{+}X$. b) Control variables, network structure, and control matrix: In state-space model (1), the control variables together with current internal states determine the next internal states. From the viewpoint of biology, the overall expression level of all genes in the network affects the internal (hidden) variables [27]. In this study, we take u(t) = x(t) as the input of the internal state equation. Therefore, from the model (1), it follows that
$$x(t+1) = CAz(t) + CBx(t). \tag{7}$$
This equation quantitatively describes the regulatory relationships among genes through the matrix CB. On the other hand, using the algorithm SNBuilder, the subnetwork can be presented by a graph as shown in Fig. 2. In this paper, the adjacency matrix of such a graph is called the structure matrix of the network, denoted by R, as it qualitatively describes the regulatory relationships among genes. Therefore, the structure of matrix CB should be the same as that of matrix R, i.e., the (i, j)th element of CB is nonzero (or zero) if the (i, j)th element of R is nonzero (or zero). It is nontrivial to find a control matrix B such that the structure of matrix CB is the same as that of matrix R. In reality, the weak connections among genes may be ignored in the structure of the network, and we reformulate the problem as follows: find a matrix B such that the squared sum over the elements of CB corresponding to nonzeros in R is much larger than that over other elements. The problem can be solved by optimizing a Rayleigh quotient. c) Estimates of state transition matrix: With the calculated control matrix B and the profiles of internal variables Z, one can estimate the parameters of the state transition matrix in the internal state governing equation
$$z(t+1) = A \cdot z(t) + B u(t) + n_1(t) \tag{8}$$
by minimizing the system noise $n_1(t)$. This is equivalent to minimizing the cost function
$$CF = \sum_{j=1}^{m} \left\| z(t_j) - v(t_j) \right\|^2 \tag{9}$$
where the time-variant vector v(t) has the same dimensions as the internal state vector z(t + 1) and is calculated by the following difference equation:
$$v(t+1) = A \cdot v(t) + B u(t) \tag{10}$$
with the initial state value $v(0) = z(t_0)$, and control values u(0), . . . , u(t). 2) Computational Results and Validation: In this study, the inferred gene regulatory networks will be evaluated in the following aspects: prediction power, stability, robustness, and periodicity [28]. As the human cell cycle gene expression data are very noisy, some data preprocessing techniques are applied to the log ratio gene expression data. First, a filter is applied to each gene expression profile, one at a time: at a given time point, the new expression value is the average of the three raw values at the previous, current, and next time points. As the mean values and magnitudes for genes and microarrays mainly reflect the experimental procedure [29], the expression profile of each gene is normalized to have a mean of zero and a standard deviation of one, and then the expression values on each microarray are normalized to have a median of zero and a deviation of one. Such normalizations also make the PPCA simple [25]. For each model with a different number of internal variables, the AIC is calculated by formula (6), and the number of internal variables is determined to be 9. The transformation matrix C is calculated by (5), and further the expression profiles of the internal state variables in matrix Z and control matrix B are determined by the methods mentioned in the previous paragraphs. As the human cell cycle gene expression data are collected at equally spaced time points, the least squares method for the linear regression problem is applied to determine the elements of matrix A in model (1). To investigate stability, robustness, and periodicity of the inferred gene networks, the eigenvalues of the state transition matrix A are calculated. The eigenvalues of matrix A are as follows: −0.0715, 0.2479, 0.9018, 0.6749 ± 0.3959i, 0.8125 ± 0.2924i, 1.0396 ± 0.1536i.
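The least-squares step and the eigenvalue check can be illustrated with a short sketch. Two assumptions are made here and should be read as such: the regression is set up as a one-step-ahead fit of z(t+1) − Bu(t) on z(t), which may differ from the exact procedure used for the reported results, and the oscillation period is estimated as 2π divided by the argument of the dominant eigenvalue, in units of sampling intervals. The function names and the small synthetic system are hypothetical; the eigenvalues are the nine values listed above.

```python
import numpy as np


def estimate_A(Z, U, B):
    """One-step least-squares fit of A in z(t+1) = A z(t) + B u(t)."""
    target = Z[:, 1:] - B @ U[:, :-1]            # remove the control contribution
    # Solve Z_prev^T A^T = target^T in the least-squares sense.
    A_T, *_ = np.linalg.lstsq(Z[:, :-1].T, target.T, rcond=None)
    return A_T.T


def stability_report(eigs):
    """Return eigenvalue moduli, the count inside the unit circle, and the period."""
    moduli = np.abs(eigs)
    dominant = eigs[np.argmax(moduli)]
    angle = abs(np.angle(dominant))
    period = 2 * np.pi / angle if angle > 0 else np.inf
    return moduli, int((moduli < 1.0).sum()), period


# Recover a known (hypothetical) transition matrix from noise-free synthetic data.
rng = np.random.default_rng(1)
A_true = np.array([[0.8, -0.3], [0.3, 0.8]])
B = np.array([[0.5], [0.1]])
U = rng.standard_normal((1, 21))
Z = np.zeros((2, 21))
Z[:, 0] = [1.0, -1.0]
for t in range(20):
    Z[:, t + 1] = A_true @ Z[:, t] + B @ U[:, t]
A_hat = estimate_A(Z, U, B)

# Eigenvalues of A as listed in the text.
reported = np.array([-0.0715, 0.2479, 0.9018,
                     0.6749 + 0.3959j, 0.6749 - 0.3959j,
                     0.8125 + 0.2924j, 0.8125 - 0.2924j,
                     1.0396 + 0.1536j, 1.0396 - 0.1536j])
moduli, n_inside, period_steps = stability_report(reported)
print(np.round(moduli, 4), n_inside, round(period_steps, 1))
```

For the listed values, seven eigenvalues have modulus below one and the remaining conjugate pair has modulus of about 1.05, which matches the statement that only the last pair lies slightly outside the unit circle.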
All eigenvalues of matrix A except for the last pair lie inside the unit circle in the complex plane, and the last pair is very close to the boundary of the unit circle. This means that the inferred network is almost stable and robust. Furthermore, the dominant eigenvalues of the inferred network are a pair of complex conjugates: 1.0396 ± 0.1536i. Accordingly, this implies that the network behaves periodically [30]. This stems from the fact that the networks are inferred from cell-cycle-regulated gene expression data. Fig. 3 shows a comparison of four experimental gene expression profiles and the predicted profiles from the constructed model on the training dataset “Thy-Thy 3.” The prediction error, calculated by [28, eq. (2.11)], is 0.2525, which is consistent with the qualitatively close fit of predicted curves to the experimental time series shown in Fig. 3. C. Fuzzy Logic Model Fuzzy rule sets are generated for genes in the subnetwork in Fig. 2. We generate all possible rule combinations for the inputs
Fig. 3. Comparison of experimental (solid lines) and predicted (dotted lines) gene expression profiles. (a) DMFT. (b) F2. (c) RRM2. (d) TYR.
Fig. 4. Best fit rule on training set “Thy-Thy 3” (as shown in Table II) predicting gene expression on the test dataset (solid line) compared to actual data from the test set “Thy-Noc” (dashed line). (a) CDK2. (b) BRCA1. (c) EP300. (d) CDK4.
on each gene (including the “null” rule, equivalent to excluding a potential rule), assuming that if an expression level of an input gene is MEDIUM, it must also result in an output expression level of MEDIUM. Three fuzzy sets are used to retain tractability of the rule search method, which examines all potential hypotheses consistent with the data. Even so, this represents a significant advance in resolution over Boolean logic because of the nonbinary membership in the LOW and HIGH fuzzy sets. We apply the resulting fuzzy rules to microarray gene expression time series data for the human cell cycle in [24]. Such cell cycle microarray data have been criticized for being very noisy, and the cell-cycle-regulated genes identified by previous methods may be artifacts. However, these problems are endemic to genomic and proteomic measurement methods. Thus, the human cell cycle dataset is a reasonable test for the practical application of fuzzy logic modeling on data for which conventional methods can fail. We evaluated each fuzzy rule set from the exhaustive search at each time point in the dataset and compared the predicted and experimental data for each gene using the error metric E in (11), which emphasizes the correlation in qualitative expression changes between predicted and experimental data
$$E = \sum_{j=1}^{M} \frac{(x_j - \tilde{x}_j)^2}{(x_j - \bar{x})^2} \tag{11}$$
where $\{x_j\}$ are the $M$ experimental data points (with mean $\bar{x}$) and $\{\tilde{x}_j\}$ are the defuzzified predicted numerical values for the output gene. Table II shows the best fitting rules for five genes. Each entry in the table is the rule for the input gene (rows) acting on the output gene (columns). We adopt a naming convention in which “H” denotes high, “M” medium, and “L” low. For example, “HML” denotes the rule: “If input is LOW, then output is HIGH; if input is MED, then output is MED; if input is HIGH, then output is LOW.” “MMH” denotes the rule: “If input is LOW, then output is MED; if input is MED, then output is MED; if input is HIGH, then output is HIGH.” A dash in Table II means that the gene is not an input in the biomolecular network of Fig. 2, and a “0” means that the best fit rule excludes one of the potential inputs in the network.
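The three-letter rule codes described above can be illustrated with a minimal Python sketch. It decodes a code such as “HML,” enumerates the candidate rules for one input under the MED-to-MED constraint (plus the null rule “0”), and applies a rule to a single input value. The triangular membership functions on [0, 1], the set centers used for defuzzification, and all function names are assumptions made for illustration; the text does not specify the membership shapes or the defuzzification method.

```python
# Centers of the LOW, MED, and HIGH output sets (assumed for illustration).
CENTERS = {"L": 0.0, "M": 0.5, "H": 1.0}


def memberships(x):
    """Assumed triangular memberships of x in LOW, MED, HIGH on [0, 1]."""
    x = min(max(x, 0.0), 1.0)  # clamp to the assumed expression range
    return {
        "L": max(0.0, 1.0 - 2.0 * x),
        "M": 1.0 - abs(2.0 * x - 1.0),
        "H": max(0.0, 2.0 * x - 1.0),
    }


def decode_rule(code):
    """Map a code like 'HML' to {input set: output set} for LOW, MED, HIGH."""
    if len(code) != 3 or any(ch not in "LMH" for ch in code):
        raise ValueError(f"not a three-letter L/M/H rule code: {code!r}")
    return dict(zip("LMH", code))


def candidate_rules():
    """All rules with MED -> MED for one input, plus the null rule '0'."""
    return ["0"] + [low + "M" + high for low in "LMH" for high in "LMH"]


def apply_rule(code, x):
    """Defuzzified output of one rule: membership-weighted set centers."""
    rule = decode_rule(code)
    mu = memberships(x)
    return sum(mu[s] * CENTERS[rule[s]] for s in "LMH")


if __name__ == "__main__":
    print(decode_rule("HML"))
    print(apply_rule("HML", 0.25), apply_rule("MMH", 1.0))
    print(candidate_rules())
```

Under these assumptions each input contributes ten candidate rules, so a gene with k potential inputs has 10^k rule combinations, which is why a small number of fuzzy sets keeps the exhaustive search tractable.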
Fig. 5. Error on test data against error on training data for each rule combination exhaustively generated for CDK4 (top) and BRCA1 (bottom).
The dataset “Thy-Thy 3” was used as the “training set,” and for each gene, the exhaustively generated rule sets were ranked by the error (E) of each rule set on these data (the error of the best fit rules is given in the row of Table II labeled “Train E”). The rules were then simulated at each time point in the “test set” (“Thy-Noc”); the resulting errors are given in the last row of Table II. As shown in Fig. 4, agreement between the predictions of the best fit rule on the training set and the data in the test set is excellent in some cases, in particular for CDK2 and CDK4, which are known to be cell-cycle-regulated genes (and thus are expected to have a regular pattern of behavior in this dataset). In others (e.g., BRCA1), there is little agreement. These patterns are also reflected in the overall results of the exhaustive rule search. Fig. 5 plots the error on the test set against the error on the training set for every possible rule for two genes, CDK4 and BRCA1. For CDK4, a linear trend indicates that rules with a low fit error on the training set also tend to have a relatively low fit error on the test set, while the absence of an apparent linear trend for BRCA1 indicates relatively noisy results.
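One way to quantify the linear trend discussed for Fig. 5 is the Pearson correlation between training and test errors across all candidate rules. The sketch below is only an illustration of that idea: the paper does not state that such a coefficient was computed, the function name `pearson` is hypothetical, and the error values in the usage lines are made-up numbers, not results from the paper.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)


if __name__ == "__main__":
    # Hypothetical per-rule errors, for illustration only.
    train_errors = [1.0, 2.0, 3.0, 4.0]
    test_errors_trending = [2.0, 4.0, 6.0, 8.0]   # follows the training error
    test_errors_noisy = [1.0, -1.0, -1.0, 1.0]    # unrelated to training error
    print(pearson(train_errors, test_errors_trending))
    print(pearson(train_errors, test_errors_noisy))
```

A coefficient near 1 would correspond to the CDK4-like case, where low training error predicts low test error, while a coefficient near 0 would correspond to the BRCA1-like case.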
D. Probabilistic Boolean Networks
PBNs have been recently developed and studied in the literature. A PBN is a generalization of a BN. A BN consists of a set of nodes $V$ and Boolean functions $F$, denoted by $G(V, F)$, where
$$V = \{x_1, x_2, \ldots, x_n\} \quad \text{and} \quad F = \{f_1, f_2, \ldots, f_n\}.$$
Let $x_i(t)$ represent the state of $x_i$ at time $t$, where $x_i = 0$ represents that the gene is underexpressed and $x_i = 1$ means that it is overexpressed. The overall expression levels of all the genes in the network at time step $t$ are given by the following column vector:
$$x(t) = [x_1(t), x_2(t), \ldots, x_n(t)]^T.$$
This vector is referred to as the gene activity profile (GAP) of the network at time $t$. For $x(t)$ ranging from $[0, 0, \ldots, 0]^T$ (all entries are 0) to $[1, 1, \ldots, 1]^T$ (all entries are 1), it takes on all the $2^n$ possible states of the $n$ genes. The list of Boolean functions represents the rules of the regulatory interactions among the nodes (genes)
$$x_i(t+1) = f^{(i)}(x(t)), \quad i = 1, 2, \ldots, n.$$
Here, each gene updates its state according to the states of the other genes in the previous step and its corresponding Boolean function. Thus, a BN is a deterministic dynamic system. In a PBN, each target gene has, instead of a single Boolean function, a number of Boolean functions with equivalent prediction abilities. These Boolean functions are selected randomly with some probabilities. We assume that for the $i$th gene, there are $l(i)$ possible Boolean functions
$$F^i = \{f_j^{(i)} : j = 1, \ldots, l(i)\}$$
and the probability of choosing function $f_j^{(i)}$ is $c_j^{(i)}$, where $f_j^{(i)}$ is a function with respect to the activity levels of the $n$ genes. A PBN is said to be independent if the elements from different $F^i$ are independent [18]. If the joint probability distribution of $F^1, F^2, \ldots, F^n$ cannot be factorized as the product of the distributions of the individual $F^i$, then it is a dependent PBN. For an independent PBN of $n$ genes, there are at most $N = \prod_{i=1}^{n} l(i)$ different possible BNs. This means that there are in total $N$ possible realizations of the genetic network. The probability $P_k$ of choosing the $k$th BN is given by
$$P_k = \prod_{i=1}^{n} c_{k_i}^{(i)}, \quad k = 1, 2, \ldots, N$$
where $k_i$ is the index of the Boolean function selected for gene $i$ in the $k$th BN.
The probability $c_j^{(i)}$ of choosing the $j$th predictor $f_j^{(i)}$ can be estimated by COD [31]. Let $\varepsilon_j^{(i)}$ be the optimal error achieved by $f_j^{(i)}$ and $\varepsilon_i$ be the error of the best estimate of the $i$th gene in the absence of any conditional variable; then we have
$$\theta_j^{(i)} = \frac{\varepsilon_i - \varepsilon_j^{(i)}}{\varepsilon_i}.$$
For all positive $\theta_j^{(i)}$, we can obtain $c_j^{(i)}$ by
$$c_j^{(i)} = \begin{cases} \dfrac{\theta_j^{(i)}}{\sum_{\kappa=1}^{l(i)} \{\theta_\kappa^{(i)} : \theta_\kappa^{(i)} > 0\}}, & \text{if } \theta_j^{(i)} \ge 0 \\ 0, & \text{otherwise.} \end{cases}$$
We note that $c_j^{(i)}$ satisfies
$$\sum_{j=1}^{l(i)} c_j^{(i)} = 1, \quad \text{for } i = 1, \ldots, n.$$
The level of influence of gene $i_2$ on gene $i_1$ is given by
$$LF_{i_2} = \sum_{j=1}^{l(i_1)} lf_{i_2}\big(f_j^{(i_1)}\big)\, c_j^{(i_1)}$$
where $lf_{i_2}\big(f_j^{(i_1)}\big)$ is the probability of the event that
$$f_j^{(i_1)}\big([x_1(t), \ldots, x_{i_2-1}(t), 0, x_{i_2+1}(t), \ldots, x_n(t)]^T\big)$$
is not equal to
$$f_j^{(i_1)}\big([x_1(t), \ldots, x_{i_2-1}(t), 1, x_{i_2+1}(t), \ldots, x_n(t)]^T\big).$$
Here, we are interested in the influence of gene $i_2$ in the predictor $f_j^{(i_1)}$.
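The COD-based selection probabilities, the probability of one BN realization, and the influence measure defined above can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the function names are hypothetical, Boolean predictors are represented as Python callables over a tuple of 0/1 states, and the influence probability is computed under a uniform distribution over the states of the remaining genes, which the text does not specify.

```python
from itertools import product
from math import prod


def selection_probabilities(eps_pred, eps_base):
    """c_j for one gene from predictor errors eps_pred and baseline error eps_base."""
    thetas = [(eps_base - e) / eps_base for e in eps_pred]
    total = sum(t for t in thetas if t > 0)
    if total == 0:
        raise ValueError("no predictor improves on the baseline error")
    return [t / total if t > 0 else 0.0 for t in thetas]


def network_probability(choice, probs_per_gene):
    """P_k for an independent PBN: product of c_{k_i} over the genes."""
    return prod(probs[k_i] for k_i, probs in zip(choice, probs_per_gene))


def influence(predictors, probs, i2, n):
    """Influence of gene i2 on the target gene whose predictors are given.

    Assumes the other n - 1 genes are uniformly distributed over {0, 1}.
    """
    total = 0.0
    for f, c in zip(predictors, probs):
        flips = 0
        for rest in product((0, 1), repeat=n - 1):
            x0 = rest[:i2] + (0,) + rest[i2:]
            x1 = rest[:i2] + (1,) + rest[i2:]
            flips += f(x0) != f(x1)  # does toggling gene i2 change the output?
        total += c * flips / 2 ** (n - 1)
    return total


if __name__ == "__main__":
    # Hypothetical two-gene example: two predictors for the target gene.
    predictors = [lambda x: x[0] and x[1], lambda x: x[1]]
    c = selection_probabilities([0.2, 0.1], eps_base=0.4)
    print(c)
    print(influence(predictors, c, i2=1, n=2))
```

In this made-up example the second predictor has the lower error and therefore the larger selection probability, and gene 1 has the larger influence because both predictors depend on it.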
We remark that if the number of input genes in $f_j^{(i)}$ is $n$, then the number of possible functions is equal to $2^{2^n}$. To reduce the complexity of a PBN, the number of input genes is set to a very small number (two or three) in practice; see, for instance, [32]. A PBN is a Markov chain capturing transition probabilities among different gene expression states. In order to reduce the complexity of a PBN, Ching et al. [33] formulated a multivariate Markov model that can capture both the intra- and intertransition probabilities among gene expression states. In the multivariate Markov model, the expression state probability vector $X_i(t)$ of the $i$th gene depends on a weighted average of the genes' expression state probability vectors at the previous time step. In matrix form, we write
$$X(t+1) \equiv \begin{pmatrix} X_1(t+1) \\ X_2(t+1) \\ \vdots \\ X_n(t+1) \end{pmatrix} = \begin{pmatrix} \lambda_{11}P^{(11)} & \lambda_{12}P^{(12)} & \cdots & \lambda_{1n}P^{(1n)} \\ \lambda_{21}P^{(21)} & \lambda_{22}P^{(22)} & \cdots & \lambda_{2n}P^{(2n)} \\ \vdots & \vdots & \ddots & \vdots \\ \lambda_{n1}P^{(n1)} & \lambda_{n2}P^{(n2)} & \cdots & \lambda_{nn}P^{(nn)} \end{pmatrix} \begin{pmatrix} X_1(t) \\ X_2(t) \\ \vdots \\ X_n(t) \end{pmatrix} \equiv QX(t)$$
where $\lambda_{jk} \ge 0$ for $1 \le j, k \le n$, $\sum_{k=1}^{n} \lambda_{jk} = 1$ for $1 \le j \le n$, and $P^{(jk)}$ is a transition probability matrix from the expression state of gene $k$ to the expression state of gene $j$. Here, $\lambda_{jk}$ can be considered as a measure of how the expression of gene $j$ is affected by gene $k$. When $\lambda_{jk}$ is close to 0, gene $k$ has no influence on gene $j$. We remark that all the transition probability matrices $P^{(jk)}$ can be estimated from gene expression data. The parameters $\lambda_{jk}$ can be estimated efficiently by solving linear programming problems based on the
TABLE III PREDICTION ACCURACY OF EACH GIVEN GENE BASED ON THE GIVEN GENETIC NETWORK USING TWO STATES MICROARRAY DATA
TABLE IV PREDICTION ACCURACY OF EACH GIVEN GENE BASED ON THE GIVEN GENETIC NETWORK USING THREE STATES MICROARRAY DATA
TABLE V PREDICTION ACCURACY BASED ON THE INPUT GENES ESTIMATED FROM THE MULTIVARIATE MARKOV CHAIN MODEL
first-order moment matching. Using the multivariate Markov chain model, the number of genes in a network to be tested can be large compared with PBNs; for details, see [33]. We construct PBNs for the given microarray dataset “Thy-Thy 3” and use the dataset “Thy-Noc” to test the constructed PBNs. Since the gene expression dataset is quantitative, we first convert the continuous values to binary states (1: ON and 0: OFF). We assign 1 (or 0) if the expression value is above (or below) the mean computed over the sampled time points. Besides binary data, we can consider three possible states (1: ON; 0: OFF; and ∗: undetermined) of gene expression. The undetermined state appears when the gene expression value is close to the mean, so that it is difficult to decide whether the corresponding gene is ON or OFF. The prediction results are given in Tables III and IV. In Tables III and IV, we observe that the prediction accuracy for the gene “HE” is very low. In addition, there are some other genes with poor prediction accuracies. Therefore, we use the developed multivariate Markov chain to model the microarray dataset. The parameters $\lambda_{jk}$ in the resulting Markov chain model provide gene–gene interaction information. We observe that in the output of the multivariate Markov chain model, most $\lambda_{jk}$ are equal to zero. This implies that the number of input genes for a target gene is small, and that such input genes should be closely related to the target gene. Accordingly, we modify the input genes of some target genes of the PBNs. Table V lists the changes of input genes and the results obtained by using the new input genes to construct the PBNs of the genes HE, myc, and cdk2. We see that the new sets of input genes can improve the prediction accuracy.
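The conversion of a continuous expression profile to binary or three-state data described above can be sketched as follows. The function name `discretize`, the tolerance parameter `tol` that defines “close to the mean,” and the treatment of a value exactly equal to the mean in the binary case are assumptions for illustration; the text does not give a threshold for the undetermined state.

```python
from statistics import mean


def discretize(profile, tol=0.0):
    """Convert one gene's expression profile to states 1 (ON), 0 (OFF), '*'.

    Values above the profile mean map to 1 and the rest to 0. With tol > 0,
    values within tol of the mean are marked '*' (undetermined).
    """
    m = mean(profile)
    states = []
    for value in profile:
        if tol > 0 and abs(value - m) <= tol:
            states.append("*")   # too close to the mean to call ON or OFF
        elif value > m:
            states.append(1)
        else:
            states.append(0)     # assumption: a value equal to the mean is OFF
    return states


if __name__ == "__main__":
    profile = [0.0, 1.0, 2.0, 5.0]           # made-up values; the mean is 2.0
    print(discretize(profile))               # two-state data
    print(discretize(profile, tol=0.5))      # three-state data
```

Calling the function with `tol=0` corresponds to the two-state setting, and a positive `tol` to the three-state setting.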
VI. CONCLUSION AND DISCUSSION
In this paper, we have presented an efficient approach to growing a community from a given seed protein. It uses the topological properties of the community structure of a network and takes advantage of local optimization in searching for the community comprising the seed protein. Due to the complexity and modularity of biological networks, it is more desirable and computationally feasible to model and simulate a network of smaller size. Our approach builds a community of manageable size and scales well to large networks. Its usefulness is demonstrated by the experimental results: all four communities identified reveal strong structural and functional relationships among member proteins. It provides a fast and accurate way to find a community comprising a protein or proteins with known functions or of interest. For those community members that are not known to be part of a protein complex or a functional category, their relationship to other community members may deserve further investigation, which, in turn, may provide new insights.
In general, the state-space method, PBNs, and fuzzy logic modeling frameworks allow inconsistencies and potentially noisy data to be identified and used to generate alternative computational hypotheses for biomolecular networks. The modeling methods are tractable and scalable because novel clustering methods are applied to adaptively extract biologically significant subnetworks for simulation and hypothesis testing. Thus, simulations of hypothetical biomolecular network models based on the state-space method, PBNs, and fuzzy logic can be compared with experimental data to select and refine plausible hypotheses. We combine the simulation results with the computationally derived meta-model to identify key genes whose perturbation would generate the dataset that could best differentiate between the alternative biomolecular network hypotheses. For example, in the state-space model, we identify some regulatory factors. In the fuzzy logic model and PBNs, we identify putative connections between genes based on fuzzy rules and Boolean functions, respectively. Consequently, by uniting the system identification and simulation components of the modeling procedure into an integrated method, we can develop a cyclical flow from modeling through experiments through updates to the global biological knowledge base. Such a flow is designed specifically to respond to the challenges of designing and interpreting high-throughput experiments, which can, in the future, evolve in concert with modeling and information management. For instance, in Section V, we modify the gene network in the construction of PBNs, and the prediction accuracy of gene–gene interactions is improved.
In summary, we have developed a new method for adaptive modeling of biomolecular networks. The method iteratively mines and organizes quantitative and qualitative data to generate scalable hypothetical biomolecular network structures. The dynamics of these computational hypotheses are tested and refined through cycles of model-based simulation and laboratory experiments. Although only microarray data are presented in the example here, the modeling framework of representing biomolecular expression states can readily be extended to protein and metabolite levels. This is a key point because gene networks are
an abstraction representing only one aspect of biomolecular networks. They must be integrated with protein–protein interaction networks and metabolite profiling to develop a comprehensive portrait of cellular function. We believe that our method provides a timely tool for this purpose.
REFERENCES
[1] J. P. Fitch and B. Sokhansanj, “Genomic engineering: Moving beyond DNA sequence to function,” Proc. IEEE, vol. 88, no. 12, pp. 1949–1971, Dec. 2000.
[2] T. Akutsu, S. Miyano, and S. Kuhara, “Identification of gene networks from a small number of gene expression patterns under the Boolean network model,” Pac. Symp. Biocomput., vol. 4, pp. 17–28, 1999.
[3] L. Glass and S. A. Kauffman, “The logical analysis of continuous nonlinear biochemical control networks,” J. Theor. Biol., vol. 39, no. 1, pp. 103–129, 1973.
[4] J. Tegnér, M. K. S. Yeung, J. Hasty, and J. J. Collins, “Reverse engineering gene networks: Integrating genetic perturbations with dynamical modeling,” Proc. Natl. Acad. Sci. USA, vol. 100, no. 10, pp. 5944–5949, 2003.
[5] R. F. Hashimoto, S. Kim, I. Shmulevich, W. Zhang, M. L. Bittner, and E. R. Dougherty, “Growing genetic regulatory networks from seed genes,” Bioinformatics, vol. 20, pp. 1241–1247, 2004.
[6] R. Jansen, N. Lan, J. Qian, and M. Gerstein, “Integration of genomic datasets to predict protein complexes in yeast,” J. Struct. Funct. Genomics, vol. 2, pp. 71–81, 2002.
[7] G. D. Bader and C. W. Hogue, “An automated method for finding molecular complexes in large protein interaction networks,” BMC Bioinf., vol. 4, no. 2, pp. 20–29, 2003.
[8] D. Bu, Y. Zhao, L. Cai, H. Xue, X. Zhu, H. Lu, J. Zhang, S. Sun, L. Ling, N. Zhang, G. Li, and R. Chen, “Topological structure analysis of the protein–protein interaction network in budding yeast,” Nucl. Acids Res., vol. 31, pp. 2443–2450, 2003.
[9] X. Hu, “Mining and analyzing scale-free protein–protein interaction network,” Int. J. Bioinf. Res. Appl., vol. 1, pp. 81–101, 2005.
[10] L. H. Hartwell, J. J. Hopfield, S. Leibler, and A. W. Murray, “From molecular to modular cell biology,” Nature, vol. 402, pp. C47–C52, 1999.
[11] R. A. Rosales, M. Fill, and A. L. Escobar, “Calcium regulation of single ryanodine receptor channel gating analyzed using HMM/MCMC statistical methods,” J. Gen. Physiol., vol. 121, pp. 533–553, 2004.
[12] A. Schliep, A. Schonhuth, and C. Steinhoff, “Using hidden Markov models to analyze gene expression time course data,” Bioinformatics, vol. 19, pp. i255–i263, 2003.
[13] C. Rangel, J. Angus, Z. Ghahramani, M. Lioumi, E. Sotheran, A. Gaiba, D. L. Wild, and F. Falciani, “Modeling T-cell activation using gene expression profiling and state-space models,” Bioinformatics, vol. 20, pp. 1361–1372, 2004.
[14] P. D’Haeseleer, X. Wen, S. Fuhrman, and R. Somogyi, “Linear modeling of mRNA expression levels during CNS development and injury,” in Proc. Pac. Symp. Biocomput., 1999, vol. 4, pp. 41–52.
[15] R. Laubenbacher and B. Stigler, “A computational algebra approach to the reverse engineering of gene regulatory networks,” J. Theor. Biol., vol. 229, pp. 523–537, 2004.
[16] H. D. Jong, “Modeling and simulation of genetic regulatory systems: A literature review,” J. Comput. Biol., vol. 9, pp. 67–103, 2002.
[17] R. Somogyi and C. A. Sniegoski, “Modeling the complexity of genetic networks: Understanding multigenic and pleiotropic regulation,” Complexity, vol. 1, pp. 45–63, 1996.
[18] T. Chen, H. L. He, and G. M. Church, “Modeling gene expression with differential equations,” in Proc. Pac. Symp. Biocomput., 1999, vol. 4, pp. 29–40.
[19] C. T. Chen, Linear System Theory and Design, 3rd ed. New York: Oxford Univ. Press, 1999.
[20] F. Radicchi, C. Castellano, F. Cecconi, V. Loreto, and D. Parisi, “Defining and identifying communities in networks,” Proc. Natl. Acad. Sci. USA, vol. 101, pp. 2658–2663, 2004.
[21] S. Asthana, O. D. King, F. D. Gibbons, and F. P. Roth, “Predicting protein complex membership using probabilistic network reliability,” Genome Res., vol. 14, pp. 1170–1175, 2004.
[22] L. M. Machesky and K. L. Gould, “The Arp2/3 complex: A multifunctional actin organizer,” Curr. Opin. Cell Biol., vol. 11, pp. 117–121, 1999.
[23] M. E. J. Newman, “The structure and function of complex networks,” SIAM Rev., vol. 45, pp. 167–256, 2003.
[24] M. L. Whitfield, G. Sherlock, A. J. Saldanha, J. I. Murray, C. A. Ball, K. E. Alexander, J. C. Matese, C. M. Perou, M. M. Hurt, P. O. Brown, and D. Botstein, “Identification of genes periodically expressed in the human cell cycle and their expression in tumors,” Mol. Biol. Cell., vol. 13, pp. 1977–2000, 2002.
[25] M. E. Tipping and C. M. Bishop, “Probabilistic principal component analysis,” J. R. Stat. Soc. B, vol. 61, pp. 611–622, 1999.
[26] K. P. Burnham and D. R. Anderson, Model Selection and Inference: A Practical Information-Theoretic Approach. New York: Springer-Verlag, 1998.
[27] B. Alberts, A. Johnson, J. Lewis, M. Raff, D. Bray, K. Hopkin, K. Roberts, and P. Walter, Essential Cell Biology. New York: Garland, 1998.
[28] F. X. Wu, W. J. Zhang, and A. J. Kusalik, “State-space model with time delays for gene regulatory networks,” J. Biol. Syst., vol. 12, pp. 483–499, 2004.
[29] M. B. Eisen, P. T. Spellman, P. O. Brown, and D. Botstein, “Cluster analysis and display of genome-wide expression patterns,” Proc. Natl. Acad. Sci. USA, vol. 95, pp. 14863–14868, 1998.
[30] J. Durbin and S. J. Koopman, Time-Series Analysis by State Space Model. New York: Oxford Univ. Press, 2001.
[31] E. R. Dougherty, S. Kim, and Y. Chen, “Coefficient of determination in nonlinear signal processing,” Signal Process., vol. 80, pp. 2219–2235, 2000.
[32] S. Kim, E. R. Dougherty, Y. Chen, K. Sivakumar, P. Meltzer, J. M. Trent, and M. Bittner, “Multivariate measurement of gene expression relationships,” Genomics, vol. 67, pp. 201–209, 2000.
[33] W. Ching, E. Fung, M. Ng, and T. Akutsu, “On construction of stochastic genetic networks based on gene expression sequences,” Int. J. Neural Syst., vol. 15, pp. 297–310, 2005.
Xiaohua Hu (M’00) received the B.Sc. (Software) degree from Wuhan University, Wuhan, China, in 1985, the M.Eng. degree in computer engineering from the Institute of Computing Technology, Chinese Academy of Science, Beijing, China, in 1988, the M.Sc. degree in computer science from Simon Fraser University, Burnaby, BC, Canada, in 1992, and the Ph.D. degree in computer science from the University of Regina, Regina, SK, Canada, in 1995. He is a scientist, a teacher, and an entrepreneur. In 2002, he joined Drexel University, Philadelphia, PA, where he is currently an Associate Professor and the founding Director of the Data Mining and Bioinformatics Laboratory, College of Information Science and Technology. He was a Research Scientist in the world-leading R&D centers such as Nortel Research Center, GTE Laboratories, and HP Laboratories. In 2001, he founded the DMW Software, Silicon Valley, CA. He also founded the International Journal of Data Mining and Bioinformatics in 2006 and the International Journal of Granular Computing, Rough Sets and Intelligent Systems in 2008. His research ideas have been integrated into many commercial products and applications. His current research interests include biomedical literature data mining, bioinformatics, text mining, semantic web mining and reasoning, rough set theory and application, information extraction, and information retrieval. He has authored or coauthored more than 160 peer-reviewed research papers published in various journals, conferences, and books, including various IEEE/Association for Computing Machinery (ACM) Transactions, and has coedited nine books/proceedings. His research projects are funded by the National Science Foundation (NSF), US Department of Education, and the Pennsylvania Department of Health. Dr. Hu is currently the IEEE Computer Society Bioinformatics and Biomedicine Steering Committee Chair and the IEEE Computational Intelligence Society Granular Computing Technical Committee Chair (2007–2009). 
He has received a few prestigious awards including the 2005 National Science Foundation (NSF) Career Award (the most prestigious award from the NSF to young faculty in USA), the Best Paper Award at the 2007 International Conference on Artificial Intelligence, the Best Paper Award at the 2004 IEEE Symposium on Computational Intelligence in Bioinformatics and Computational Biology, the 2007 IEEE Bioinformatics and Bioengineering Outstanding Contribution Award, the 2006 IEEE Granular Computing Outstanding Service Award, and the 2001 IEEE Data Mining Outstanding Service Award.
Michael Ng received the B.Sc. and M.Phil. degrees from the University of Hong Kong, Pokfulam, Hong Kong, in 1990 and 1992, respectively, and the Ph.D. degree from the Chinese University of Hong Kong, Shatin, NT, Hong Kong, in 1995. He is a Professor in the Department of Mathematics, Hong Kong Baptist University, Kowloon, Hong Kong. His current research interests include bioinformatics, data mining, operations research, and scientific computing. He has authored or coauthored more than 160 journal papers and has edited five books. He has reviewed papers for more than 40 international journals. He currently serves on the editorial boards of several international journals.
Bahrad A. Sokhansanj (S’00–M’03) received the B.S. degree in engineering physics from the University of Saskatchewan, Saskatoon, SK, Canada, and the M.S. and Ph.D. degrees in applied science from the University of California-Davis, Livermore. He was a Postdoctoral Fellow at Lawrence Livermore National Laboratory. He is currently an Assistant Professor in the School of Biomedical Engineering, Science and Health Systems, Drexel University, Philadelphia, PA, where he leads the Molecular Health Engineering Laboratory, which develops quantitative experimental biology methods in conjunction with novel analysis and modeling methods.
Fang-Xiang Wu (M’06) received the B.Sc. and M.Sc. degrees in applied mathematics from Dalian University of Technology, Dalian, China, in 1990 and 1993, respectively, the first Ph.D. degree in control theory and its applications from Northwestern Polytechnical University, Xi’an, China, in 1998, and the second Ph.D. degree in biomedical engineering from the University of Saskatchewan, Saskatoon, SK, Canada, in 2004. He was a Postdoctoral Fellow at Laval University Medical Research Center (CHUL), Quebec City, QC, Canada. Since 2005, he has been an Assistant Professor of bioengineering in the Department of Mechanical Engineering, University of Saskatchewan. His current research interests include systems biology, genomic and proteomic data analysis, biological system identification and parameter estimation, and applications of control theory to biological systems.