This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/TCBB.2015.2465954, IEEE/ACM Transactions on Computational Biology and Bioinformatics


Practical Guidelines for Incorporating Knowledge-Based and Data-Driven Strategies into the Inference of Gene Regulatory Networks

Yu-Ting Hsiao1,2, Wei-Po Lee1*, Wei Yang3, Stefan Müller4, Christoph Flamm5, Ivo Hofacker5,7, Philipp Kügler4,6

Abstract—Modeling gene regulatory networks (GRNs) is essential for conceptualizing how genes are expressed and how they influence each other. Typically, a reverse engineering approach is employed; this strategy is effective in reproducing possible fitting models of GRNs. To use this strategy, however, two daunting tasks must be undertaken: one task is to optimize the accuracy of inferred network behaviors; and the other task is to designate valid biological topologies for target networks. Although existing studies have addressed these two tasks for years, few of the studies can satisfy both of the requirements simultaneously. To address these difficulties, we propose an integrative modeling framework that combines knowledge-based and data-driven input sources to construct biological topologies with their corresponding network behaviors. To validate the proposed approach, a real dataset collected from the cell cycle of the yeast S. cerevisiae is used. The results show that the proposed framework can successfully infer solutions that meet the requirements of both the network behaviors and biological structures. Therefore, the outcomes are exploitable for future in vivo experimental design.

Index Terms—Prior knowledge-mapping, S-system, gene regulatory network modeling, evolutionary algorithms

1 Department of Information Management, National Sun Yat-sen University, Kaohsiung, Taiwan. 2 Genomics Research Center, Academia Sinica, Taipei, Taiwan. E-mail: [email protected] 3 Shanghai Center for Mathematical Sciences, Fudan University, Shanghai, China. E-mail: [email protected] 4 Johann Radon Institute for Computational and Applied Mathematics, Austrian Academy of Sciences, Linz, Austria. 5 Institute for Theoretical Chemistry, University of Vienna, Wien, Austria. E-mail: [email protected]; [email protected] 6 Institute for Applied Mathematics and Statistics, University of Hohenheim, Stuttgart, Germany. E-mail: [email protected] 7 Research group BCB, Faculty of Computer Science, University of Vienna. * Correspondence: [email protected]

I. INTRODUCTION

In the modeling of gene regulatory networks (GRNs), the purpose of developing a computational framework is to elucidate cellular processes [1, 2], including (1) understanding the causality of gene expression, (2) observing interactions among components, and (3) generating new possible pathways

(hypotheses) to assist in the design of in vivo experiments. To achieve these objectives, biologists use modern high-throughput experiments as data input for discovering a variety of cellular processes, and record the expression levels of thousands of genes in a time-series format to capture gene regulatory interactions. These experimental data can primarily be categorized into two types of applications [3]: (1) observing genes in a single environment and evaluating the distinctions between two types of tissues, usually normal versus cancerous tissues (e.g., [4-6]), and (2) monitoring genes multiple times under preferred conditions (e.g., [7-8]). The former applications have proven successful in distinguishing tissue types (e.g., recognizing tumor tissues [9-11]). The latter applications are helpful for identifying functionally related genes among expression patterns (e.g., [7-8, 12-13]). The obtained patterns can assist biologists in modeling gene regulatory relationships and constructing GRN models [13]. Both types of applications, which use temporal measurements as the input to a mathematical function, belong to the data-driven modeling method [14].

Drawing on the central dogma of molecular biology, gene expression can be regarded as consisting of modules that describe how genetic processes operate from DNA segments through mRNA to protein products in a cell. Hence, any effective data-driven inference should be able to apply the gene expression data to model GRNs and ideally characterize both the magnitude of the genetic interactions and the scaffolding of the network [15]. To infer networks from expression data, two crucial steps are to select a network model and then to find the most suitable structural parameters for the network. Many models have been proposed to address different levels of biological detail, for example, Boolean networks, Bayesian networks, and ordinary differential equations (ODEs).
Once the network model is determined, different computational approaches, such as the information-theoretic approach, Bayesian analysis, and evolutionary algorithms, can be applied to the model to derive the relevant network parameters. Additional details on computational models and methods often used for gene network inference can be found in some review articles

1545-5963 (c) 2015 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.


[15-17]. In practice, however, the parameter values of a computational model cannot provide detailed guidance regarding a biological system, because the information on genetic processes that is recorded by the time series is mainly implicit [18]. Therefore, identifying the regulatory logic or the underlying mechanism of a system from the data measurements alone could yield parameter values that are inapplicable [19]. Because the information in the inferred parameter values is insufficient to capture the complexity of GRNs, scientists advocate inferring the computational models by integrating knowledge-based data (i.e., prior knowledge datasets, or PKDs) into the original modeling methods [1, 20-21].

To construct the correct pathways, over the past decade, biological scientists have collected prior knowledge datasets as precisely as possible, regarding the gene functions, the causal links, and the partial topology (due to limited experimental coverage) of the biological systems. Examples of such datasets include KEGG [22], YeastNet [23], WikiPathways [24], and CellNetOptimizer [25]. Although PKDs still contain different degrees of data inconsistency on account of the various experimental settings and purposes, these databases serve as useful resources that provide structural relationships as a practical guideline for inferring genetic networks, e.g., [26-27]. Therefore, applying qualitative behavior obtained from PKDs to network modeling has been considered a complementary strategy for constructing the genetic dynamics in a biologically meaningful way [28-29].

In general, the modeling process integrated with prior knowledge includes five major steps [14, 27]: (1) PKD source selection and construction, (2) experimental data and network topology pre-processing and integration, (3) computational method selection and design, (4) parameter estimation and refinement, and (5) model analysis, validation, and suggestion. Figure 1 illustrates this process.

Fig. 1. The conceptual process of GRN modeling with prior knowledge. Of note, PKD refers to some prior knowledge datasets, e.g., YeastNet, WikiPathways for gene-gene interaction information, S. cerevisiae Genome Database for acquiring gene functions, etc.

The first step is to find suitable and unambiguous PKDs as a reference. This step is crucial for knowledge-based modeling because if the information taken from PKDs contains uncertainties, then the pre-defined network topology and linkages among the genes provide non-informative content for
the whole process. Moreover, the content will lead to inapplicable inferred results. Therefore, finding the most relevant validated information that corresponds to a target biological system is the cornerstone of knowledge-based model construction [30]. Once the exploitable structure and the regulation information of the system have been built up, the second step is dedicated to integrating the attributes of PKDs (from step 1) with the gene expression data and to presenting them in a structured form. Here, some assumptions or simplifications about the biological system should be made and encoded. This procedure depends on the goal that we would like to achieve through model construction (i.e., a coarse-grained or fine-grained oriented inference [31]). For example, we must determine how to present the intensity of a gene regulation (by a Boolean or real-number representation) and how these intensities map onto the expression profiles (such as encoding the expression data into a Boolean network inference [32]). In the third step, a suitable computational model is determined to depict the relationships between the symbolic equations and the genetic regulations. Among various model choices, many studies of biological systems have applied canonical models (e.g., ODE models) that are based on biochemical systems theory (BST) [33], which is the power-law representation of a genetic process, to characterize the dynamics of a regulatory network, for example, [34-38]. In canonical models, each genetic regulation is assigned to a unique parameter in the symbolic equations. Drawing upon the integrated information that was mentioned previously (step 2), all the parameters in the equations are now constrained by means of encoded prior knowledge. The constrained model thus can be used to infer appropriate (i.e. with biological implications) parameter values in the subsequent steps. Thus, the task of the fourth step is to derive the network regulatory relationships. 
After the model is trained and evaluated in a cyclic process by running the optimization algorithm with an objective function or estimating the model evaluation formula ([27]), several sets of candidate solutions are acquired. With the results, the final step is attempting to determine plausible parameter sets; these sets should be consistent with suggestions in the literature as well as with the experimental data. This step can be achieved by giving certain criteria to validate the inferred results. Upon completing the parameter validations, feedback to a biological system design can be made. The suggestions of the gene interactions in a GRN computational model can provide biologists with some promising insights into a new experimental design [30]. Drawing on the modeling process described above, in this paper we propose an implementable framework that combines the merits of both knowledge-based and data-driven inference strategies to elucidate a biologically plausible computational model of gene regulatory networks. The goal of this procedure is not only to resolve the deficiencies of prior research on knowledge-based GRN modeling but also to have a better understanding of how to exploit the existing literature and bio-software, to help us to build a comprehensive framework. Though there exist works that used knowledge and data to


construct networks (e.g., [14, 39-40]), our work differs from theirs in the network model and the corresponding inference method: the focus is on the ODE-based model and on setting parameters as structural priorities on the ODEs. To validate the proposed approach, we applied a real dataset taken from the cell cycle of the yeast S. cerevisiae (reported in [7]) to provide a walkthrough example and to demonstrate how our approach operates in practice.

II. MATERIALS AND METHODS

Referring to the main steps of the modeling process with prior knowledge described in the above section, we developed an integrated pipeline, as follows: (1) selection and construction of the PKD source, (2) data pre-processing and integration, (3) computational model selection (S-system), (4) mapping prior knowledge onto the S-system, and (5) generation of a PKD-based modeling algorithm using a reverse engineering approach with a novel evaluation function. Finally, in terms of the pipeline, the main steps of our approach are given below.

A. Selection and construction of the PKD source

In recent years, many studies have used the yeast S. cerevisiae as an experimental organism to investigate the mechanisms of DNA damage and repair in eukaryotic organisms [41]. Understanding the complete process of how DNA repairs itself could yield significant applications for human disease and aging [42]. Nevertheless, few studies have attempted to construct the network regulatory relationships of DNA repair genes as a whole in systems biology [43]. Therefore, in this work, we used a cDNA microarray dataset that captured the mRNA levels during the whole cell cycle of the budding yeast S. cerevisiae to infer the regulatory network among DNA repair genes. This dataset was first reported in [7] and can be downloaded from http://arep.med.harvard.edu/ExpressDB. This dataset has been duly updated and well studied, and domain knowledge resources have been established for verifying new methods. Because of this continuing development, it is still often used in research on network reconstruction [13, 44-45]. We thus chose to use this dataset here. In the article by Cho et al., the authors started by obtaining a synchronous yeast culture that was at the late G1 checkpoint [7]. The cells were observed for nearly two full cell cycles and were collected at 17 time points (10 minutes per interval). As mentioned in their report, they reset the cell cycle at 110 minutes (i.e., the twelfth time point); hence, we adopted the first complete cell cycle as the time-series dataset.

B. Data pre-processing and integration

After the target regulatory network and the dataset were determined, the source of the PKDs was then selected from the existing literature. We exploited the software YeastNet version 2.0 [46] as our main source for the genetic structures of yeast DNA repair genes. YeastNet is a powerful theoretical
framework that integrates several heterogeneous databases including protein-encoding open reading frames (ORFs) of the yeast genome from the S. cerevisiae Genome Database (SGD) [47]. This software calculated confidences in pair-wise genetic interactions from the collected databases, and accordingly, it suggested 102,803 linkages among 5,483 yeast genes. Several studies have used YeastNet as a data source for reference and then developed various methods for genetic network reconstruction, evaluation, or prediction based on the output of YeastNet, for example [48-49].

To form the regulatory network of DNA repair genes and to construct usable prior knowledge, we referenced two benchmark papers, [7] and [12]. We incorporated the genes labeled ‘repair’ into a network according to the biological function listed in these articles. Furthermore, based on this network, we used YeastNet to build the connection relationships (i.e., the adjacency matrix) among the chosen genes (Table 1). Originally, there were 23 genes classified as a group that has repair characteristics; however, three isolated genes were discarded, namely ALK1 (YGL021W), KIM4 (YHR038W), and IXR1 (YKL032C). Once the adjacency matrix for the genes was built, the procedure of constructing the desired prior knowledge, which included 71 paired connections in the network, was then completed.

TABLE I
THE ADJACENCY MATRIX OF THE DNA REPAIR REGULATORY NETWORK*

*Label ‘1’ means that there is a connection between two genes; there are 71 pairwise connections (non-directional) in this table.
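The construction of the adjacency matrix described above (symmetric, non-directional links, with isolated genes discarded) can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the function names `build_adjacency` and `count_pairs` are hypothetical, and the toy link list is invented for the example and is not taken from YeastNet.

```python
def build_adjacency(genes, links):
    """Build a symmetric 0/1 adjacency matrix and drop isolated genes.

    genes: list of gene names.
    links: iterable of (gene_a, gene_b) pairs, treated as non-directional.
    Returns (kept_genes, matrix) where matrix is a list of lists.
    """
    index = {g: k for k, g in enumerate(genes)}
    n = len(genes)
    adj = [[0] * n for _ in range(n)]
    for a, b in links:
        if a in index and b in index and a != b:
            i, j = index[a], index[b]
            adj[i][j] = adj[j][i] = 1  # label '1' marks a connection
    # A gene is isolated when its row contains no connection at all.
    keep = [k for k in range(n) if sum(adj[k]) > 0]
    kept_genes = [genes[k] for k in keep]
    reduced = [[adj[i][j] for j in keep] for i in keep]
    return kept_genes, reduced


def count_pairs(adj):
    """Count pairwise (non-directional) connections in the upper triangle."""
    n = len(adj)
    return sum(adj[i][j] for i in range(n) for j in range(i + 1, n))


# Toy input: hypothetical links for illustration only.
genes = ["UNG1", "OGG1", "MSH2", "MSH6", "IXR1"]
links = [("UNG1", "OGG1"), ("OGG1", "MSH2"), ("OGG1", "MSH6"), ("MSH2", "MSH6")]
kept, adj = build_adjacency(genes, links)
n_pairs = count_pairs(adj)
```

In this toy input, IXR1 has no link and is therefore dropped, mirroring how the three isolated genes were discarded from the original group of 23.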

C. Computational model In the GRN modeling, equations are often used to represent the network structure and to describe, to some extent, a relatively simple approach to the chemical dynamics. These equations mainly approximate the manifold ways in which the component reactions affect some other components to which they connect. Among various GRN models, ODEs have been widely applied to capture the regulatory interactions in a genetic network. As mentioned above, ODEs are regarded as canonical models [50] in which the pre-defined homogeneous structure of the equations makes the network model versatile and scalable.


A general biochemical ODE model formulates the regulatory influence (i.e., activation, inhibition, or no effect) among the genes (mainly represented by the corresponding mRNAs and proteins). This model can be described as

$$\frac{dx}{dt} = f(v(x), p, u) \tag{1}$$

The function f can be either linear or non-linear, depending on the level of complexity of the system dynamics that we would like to derive. The vector v(x) is the set of all influences of genes on each other; p is the parameter set of the system; and u is an external perturbation to the system. To date, the S-system, one of the most prominent and widely studied ODE models in the context of GRNs, has been considered suitable for characterizing gene regulation and system dynamics [36, 38]. This system consists of a set of tightly coupled ODEs, in which the synthesis and degradation processes are approximated by power-law functions. The corresponding S-system model can be described as follows:

$$\frac{dX_i}{dt} = \underbrace{\alpha_i \prod_{j=1}^{N} X_j^{g_{i,j}}}_{\text{synthesis}} - \underbrace{\beta_i \prod_{j=1}^{N} X_j^{h_{i,j}}}_{\text{degradation}}, \quad \forall i \tag{2}$$
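The S-system of equation (2) can be evaluated numerically with a few lines of Python. The sketch below is illustrative only: the function names `s_system_rhs` and `euler_simulate`, the explicit Euler integrator, the small positive floor on expression levels, and the two-gene parameter values are all assumptions made for the example and are not part of the original method.

```python
import math


def s_system_rhs(x, alpha, beta, g, h):
    """Right-hand side of equation (2): synthesis minus degradation per gene."""
    n = len(x)
    dxdt = []
    for i in range(n):
        synthesis = alpha[i] * math.prod(x[j] ** g[i][j] for j in range(n))
        degradation = beta[i] * math.prod(x[j] ** h[i][j] for j in range(n))
        dxdt.append(synthesis - degradation)
    return dxdt


def euler_simulate(x0, alpha, beta, g, h, dt, steps):
    """Integrate the S-system with explicit Euler steps (an assumed integrator)."""
    traj = [list(x0)]
    for _ in range(steps):
        x = traj[-1]
        dx = s_system_rhs(x, alpha, beta, g, h)
        # Keep levels strictly positive so negative kinetic orders stay defined.
        traj.append([max(xi + dt * di, 1e-8) for xi, di in zip(x, dx)])
    return traj


# Hypothetical two-gene system: gene 0 is activated by gene 1 (g[0][1] = 0.5),
# and each gene degrades in proportion to its own level (h[i][i] = 1).
alpha = [2.0, 1.0]
beta = [1.0, 1.0]
g = [[0.0, 0.5], [0.0, 0.0]]
h = [[1.0, 0.0], [0.0, 1.0]]
rates = s_system_rhs([1.0, 2.0], alpha, beta, g, h)
trajectory = euler_simulate([1.0, 2.0], alpha, beta, g, h, dt=0.01, steps=100)
```

A kinetic order of zero makes the corresponding factor equal to one, which is how an absent connection drops out of the product.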

In the above equation, Xi is the expression level of gene i, and N is the total number of genes in the genetic network. The parameters αi and βi ∈ [0, 10] are rate constants (e.g., some constant input can be represented herein); gi,j and hi,j ∈ [-3, 3] are kinetic orders that reflect the interactions from gene j to gene i in the synthesis and degradation processes, respectively. Inferring a tightly coupled S-system, however, is a large-scale parameter optimization problem that is very time-consuming. By examining the structural characteristics of the gene networks, Maki et al. [51] proposed an efficient strategy to decouple this inference problem with N separated sub-problems, each of which refers to one gene. In other words, in a decoupled S-system, a tightly coupled system of non-linear differential equations is decomposed into N differential equations [38, 52]. In addition to the computational efficiency, the main benefit of adopting this strategy is that it allows us to assign prior knowledge to corresponding genes and observe genetic interactions toward the target gene independently. By describing the genetic relationships in the form of numerical constraints, one can quantitatively encode prior knowledge of the regulatory relationships among genes. In a S-system model, the parameters gi,j and hi,j represent the regulatory relationships for a gene network. The most advantageous feature of the S-system for parameter estimations is that the range of kinetic orders can be set either by a default or with an indicative search range that represents the intensity as well as the regulatory relationship. In the first case, if the structural information remains unknown, then we can set gi,j and hi,j within the default range: usually [-3, +3] (e.g., [36, 38, 52]). In contrast, if the knowledge indicates some structural interactions, then the kinetics can be identified to be within a specific range or at an exact value. 
For example, if the regulation from gene j to gene i is known to be positive, then the range of its kinetic order is set to (0, +3]; if it is known to be negative, the range is [-3, 0); and if gene j has no connection to gene i, then the kinetic order is fixed at zero. Building on this feature of the S-system, the follow-up work was to integrate the connection relationships with the computational model, in other words, to map the connections and the constraints onto the kinetic orders of the S-system. Note that although the rate constants (i.e., αi and βi) can also be controlled by numerical constraints, this study retained the default search ranges for αi and βi. The main reason is that the prior knowledge we collected did not include information on the constant input, so αi and βi did not take part in the mapping process described below.

D. Mapping prior knowledge onto the S-system

Before going through the process of knowledge mapping, we first demonstrate how the components of the S-system can be used to represent a network topology. Figure 2a shows the visualized topology transformed from the adjacency matrix (i.e., Table 1). In the graph, we take gene UNG1, which has a link toward OGG1, as an example to illustrate the network topology. The corresponding pathway diagram is shown in Figure 2b. As shown, the input magnitude (flux-in) of UNG1 can be affected by OGG1 (i.e., $g_{ung1,ogg1}$), and this relationship is called the synthesis process of UNG1 (equation 3). At the same time, the output magnitude (flux-out) of UNG1 depends on the concentration level of UNG1 (i.e., $h_{ung1,ung1}$) and can be affected by OGG1 (i.e., $h_{ung1,ogg1}$) as well. These relationships are depicted by the degradation process in equation 4. The concentration of UNG1 at the next time step is determined by the magnitude of synthesis minus that of degradation (equation 5). Similarly, a larger example from the perspective of OGG1 in Figure 2c follows the same computational process through equations 6 to 8.
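The mapping of prior knowledge onto search ranges for the kinetic orders can be sketched as follows. This is a minimal sketch under stated assumptions: the relation labels (`"positive"`, `"negative"`, `"none"`, `"unknown"`), the function names, and the tolerance `EPS` that stands in for the open interval end at zero are all hypothetical choices for illustration.

```python
KINETIC_MIN, KINETIC_MAX = -3.0, 3.0  # default range for kinetic orders
EPS = 1e-6  # assumed tolerance approximating the open end at zero


def kinetic_bounds(relation):
    """Return (lower, upper) search bounds for one kinetic order."""
    if relation == "positive":
        return (EPS, KINETIC_MAX)          # approximates (0, +3]
    if relation == "negative":
        return (KINETIC_MIN, -EPS)         # approximates [-3, 0)
    if relation == "none":
        return (0.0, 0.0)                  # no connection: fixed at zero
    if relation == "unknown":
        return (KINETIC_MIN, KINETIC_MAX)  # no structural information
    raise ValueError(f"unrecognized relation: {relation!r}")


def bounds_for_gene(relations_row):
    """Map the prior-knowledge row of one target gene to a list of bounds."""
    return [kinetic_bounds(r) for r in relations_row]


# Hypothetical prior knowledge for one target gene and four regulators.
row = ["positive", "none", "unknown", "negative"]
bounds = bounds_for_gene(row)
```

Because the decoupled S-system treats each gene separately, a row of bounds like this one can be assigned to a target gene independently of the others.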

[Figure 2, panels (a) to (c). Edge labels recovered from panels (b) and (c): g_ung1,ogg1, h_ung1,ogg1, g_ogg1,ung1, g_ogg1,msh2, g_ogg1,msh6, h_ogg1,ung1, h_ogg1,msh2, h_ogg1,msh6.]

Fig. 2. Network topology for the S-system representation. (a) The topology depicted from Table 1. The small and large dashed circles mark the structural information for (b), i.e., UNG1 and OGG1, and (c), i.e., UNG1, MSH2, MSH6, and OGG1, respectively. (b) and (c) are examples that illustrate how the network topology can be applied to an S-system model.

$$S = \alpha_{ung1}\, \mathrm{OGG1}^{\,g_{ung1,ogg1}} \tag{3}$$

$$D = \beta_{ung1}\, \mathrm{UNG1}^{\,h_{ung1,ung1}}\, \mathrm{OGG1}^{\,h_{ung1,ogg1}} \tag{4}$$

$$\frac{d\,\mathrm{UNG1}}{dt} = S - D \tag{5}$$

$$S = \alpha_{ogg1}\, \mathrm{UNG1}^{\,g_{ogg1,ung1}}\, \mathrm{MSH2}^{\,g_{ogg1,msh2}}\, \mathrm{MSH6}^{\,g_{ogg1,msh6}} \tag{6}$$

$$D = \beta_{ogg1}\, \mathrm{OGG1}^{\,h_{ogg1,ogg1}}\, \mathrm{UNG1}^{\,h_{ogg1,ung1}}\, \mathrm{MSH2}^{\,h_{ogg1,msh2}}\, \mathrm{MSH6}^{\,h_{ogg1,msh6}} \tag{7}$$

$$\frac{d\,\mathrm{OGG1}}{dt} = S - D \tag{8}$$

E. PKD modeling algorithm using the reverse-engineering approach

To infer numerical values for the parameters of the S-system (i.e., αi, βi, gi,j, and hi,j in equation (2)), an efficient reverse-engineering approach was adopted. This approach is called GA-PSO and is described in our previous work [53]. It is an improved algorithm that combines two population-based optimization procedures, the genetic algorithm (GA) and particle swarm optimization (PSO), to exploit their respective advantages. In brief, the GA-PSO procedure initially generates a random population (containing n individuals) and evaluates the individuals; then the n individuals are ranked by their fitness values and separated into two parts, p% and (1-p)%, where p% is the best part of the individuals. The GA and PSO processes are performed as follows. First, the p% individuals are preserved and enhanced by the PSO procedure and are also used to generate a candidate list for replacing the worst (1-p)% of the individuals. Second, in the meantime, the parent pool receives some fresh individuals (i.e., r%) generated from a new random population. Third, this pool is used to create new individuals through the GA procedure, and the newly created individuals fill in for the discarded part of the population. Finally, when the whole population is formed, the individuals are again evaluated and ranked by their fitness values for the next iteration. This procedure is repeated until the termination criterion is met. In the experimental settings, p had a fixed value of 0.1 (estimated from a preliminary test), while r was a variable whose value changed linearly from 0.1 to 0.5 during the run to control the population diversity. Additional details can be found in [53]. Our approach to modeling GRNs has been shown to outperform other algorithms on a series of benchmarks and on empirical datasets [53]. Therefore, in this study, we adopted this method as the main scaffold for developing the PKD modeling approach to train the decoupled S-system (i.e., the model to be built) with a proposed objective function, as follows:

$$f_{obj}(i) = MSE(i) + StrPri(i), \quad \text{for } i = 1, 2, 3, \ldots, N \tag{9}$$

This evaluation function is defined by two terms: (1) the curve-fitting term, MSE(i), which minimizes the accumulated discrepancy between the gene expression data (actual values) and the simulations; and (2) the structural priority term, StrPri(i), which minimizes the inconsistency between the structure suggested by prior knowledge and that of the inferred model. In addition, N is the number of genes in the network. As mentioned in the previous section, we adopted the decoupled strategy in which each sub-problem corresponds to the ith MSE and the ith structural priority function. In the above equation, the first part of the function (the curve-fitting term) is the mean squared error (MSE) of gene i over the time period t. It can be described as

$$MSE(i) = \sum_{t=1}^{T} \left( \frac{X_i^{s}(t) - X_i^{o}(t)}{X_i^{o}(t)} \right)^{2} \tag{10}$$

where $X_i^{o}(t)$ is the actual expression level of gene i at time t, $X_i^{s}(t)$ is the simulated value obtained from the inferred model, and T is the number of data points measured for a gene.

The second part of the objective function (the structural priority term) is to prioritize a preferred network topology for an inferred GRN model in which the estimated skeletal structure can dovetail with the structure given by prior knowledge. As indicated in previous studies (for example, [37, 43, 52]), one major difficulty in the S-system modeling process lies in how to select the skeletal structure that can represent the observed topology of a target genetic network and reconstruct the corresponding network behaviors. With the structural priority defined in the proposed objective function as the guidance for the skeletal structure, the computational algorithm can find a preferred network structure that is in accordance with the suggested structure (prior knowledge). In this way, a target network structure can be derived during the process of evaluating candidate structures. According to the notion of prioritizing the network topology, there are two sub-terms in the structural priority below:

$$StrPri_1(i) = w_1 \times \frac{nzp_i}{zPK_i} \tag{11}$$

$$StrPri_2(i) = w_2 \times \frac{zp_i}{nzPK_i} \tag{12}$$

where structure priority 1 (StrPri1 ∈ [0, 1]) is the ratio of parameters that violate the suggestions from PKDs, i.e., there is no plausible connection between gene j and i (refer to equation 2). In detail, the numerator nzpi ∈ N0 (i.e., a non-negative integer) records how many inferred kinetic orders of gene i are non-zero (i.e., meaning the connections) but should be zero according to the PKDs. The denominator zPKi represents the total number of kinetic orders of gene i that are zero (i.e., there is no link to gene i that is given by the PKDs). In contrast,


structure priority 2 (StrPri2 ∈ [0, 1]) is also a ratio of parameters, but it indicates how many kinetic orders of gene i fail to follow the suggestion that a plausible connection should exist between gene j and gene i. Hence, the numerator zpi ∈ N0 counts the inferred kinetic orders that are zero but should be non-zero according to the PKDs, and the denominator nzPKi represents the total number of kinetic orders of gene i that are non-zero according to the PKDs. The parameters w1 and w2 are structure weighting factors for balancing the magnitude between the sub-terms in equation 9 (w1 and w2 were both 0.01 in this study). In terms of classification measures, the ratio in StrPri1 is the false positive rate (i.e., 1 minus the true negative rate, or specificity), because it counts inferred connections that the PKDs mark as absent. Likewise, the ratio in StrPri2 is the false negative rate (i.e., 1 minus the true positive rate, or sensitivity), because it counts suggested connections between gene i and other genes j that the inferred model omits. Minimizing the two sub-terms therefore favors models with high specificity and high sensitivity with respect to the PKDs.

After introducing the reverse-engineering approach with the corresponding objective function, we summarize the main steps of the PKD modeling algorithm, as described below (see Figure 3):

Step (A): Data. Prior knowledge is used to construct the skeletal structure of the target gene network, and the algorithm applies the information on the topology constraints for the optimization process.

Step (B): Optimization. This step adopts the GA-PSO approach to train the kinetic orders (i.e., inferring the GRN models). The procedure includes two phases, as follows. First, it initializes the GA-PSO population with topology constraints and starts the evolutionary process with the proposed objective function. It is notable that after the first generation, the values of the kinetic orders are no longer limited by the constraints, but they are affected by the structural priority term of the objective function.
The structural priority guides only the values of the kinetic orders toward the ideal topology. Second, when the evolutionary process continues for a predefined number of generations (i.e., 300 here), the procedure will examine the globally best solution set and fix the boundaries of a parameter (i.e., the kinetic order) to [0, 0] if a parameter still has the value of zero. Thus, in the following iterations, a parameter is bounded to zero until the termination of the GA-PSO approach. It is also notable that if the absolute value of a parameter is less than a threshold δ1 (i.e., 0.05 in this study), then the parameter is regarded as ‘no connection’ and thus is set to zero during the evolution. Step (C): Inferred model analysis. This step estimates the violation numbers that are calculated by the sum of the numerators in structure priorities 1 and 2 (i.e., nzpi and zpi). The algorithm ranks the number of structural violations of gene i among thirty runs and then acquires the lowest violation number with an acceptable MSE value for each gene to provide a reference for step (D). The reason why there are some violations that exist in the inferred models is that although the proposed objective function attempted to describe both the skeletal structure and the corresponding network behavior, it was sometimes inevitable that the preferred topology could not be derived properly. The unfitted results thus caused violations. Two reasons explain the unfitted structures in an inferred model:

first, it makes it easier for the inferred models to produce the network dynamics; and second, it implies that some possible links, rather than the suggested structures given by the PKDs, might exist. The details and the proposed solutions for these two aspects are given in the experiment section. Step (D): Model confirmation. A GRN is finally formed at this step. The criteria of constructing a plausible GRN are, first, to choose the smallest structural violation number of gene i and, then, to estimate its MSE value. If the MSE value is unacceptable (i.e., MSE is larger than a threshold δ2, which is 0.15 determined experimentally), then the next lowest structural violation value will be chosen, and the MSE value will be evaluated again. Following this selection process with the criteria mentioned above, the algorithm chooses models iteratively, in which the ideal structural links of the GRN without losing the system behavior (i.e., to fit the expression data) can be selected. In the end, our approach confirms a plausible regulatory network and suggests possible new connections for the target network.
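The δ1 pruning in step (B) and the violation counting in step (C) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names and the three-gene example are hypothetical, and the denominator used for StrPri1 (the number of kinetic orders suggested to be zero) is an assumption that mirrors the StrPri2 ratio zpi / nzPKi described above.

```python
DELTA_1 = 0.05  # pruning threshold reported in the text


def prune(orders, delta1=DELTA_1):
    """Set kinetic orders with |value| < delta1 to zero ('no connection')."""
    return [0.0 if abs(v) < delta1 else v for v in orders]


def count_violations(orders, suggested):
    """Return (nzp, zp) for one gene.

    orders    -- pruned kinetic orders of the gene (g and h values concatenated)
    suggested -- True where the prior knowledge suggests a connection
    nzp       -- non-zero values where no connection is suggested
    zp        -- zero values where a connection is suggested
    """
    nzp = sum(1 for v, s in zip(orders, suggested) if not s and v != 0.0)
    zp = sum(1 for v, s in zip(orders, suggested) if s and v == 0.0)
    return nzp, zp


def structure_priorities(orders, suggested):
    """Return (StrPri1, StrPri2) as violation ratios in [0, 1].

    Assumption: StrPri1 = nzp / (number of orders suggested to be zero),
    mirroring StrPri2 = zp / nzPK as described in the text.
    """
    nzp, zp = count_violations(orders, suggested)
    nz_pk = sum(1 for s in suggested if s)
    z_pk = len(suggested) - nz_pk
    str_pri1 = nzp / z_pk if z_pk else 0.0
    str_pri2 = zp / nz_pk if nz_pk else 0.0
    return str_pri1, str_pri2


# Hypothetical gene in a three-gene network: [g_i1, g_i2, g_i3, h_i1, h_i2, h_i3]
suggested = [True, False, False, True, False, False]
inferred = [1.2, 0.3, 0.02, 0.0, 0.0, 0.0]
pruned = prune(inferred)
print(count_violations(pruned, suggested))
print(structure_priorities(pruned, suggested))
```

In this hypothetical example, the value 0.02 is pruned to zero, g_i2 remains non-zero without a suggested connection, and h_i1 is zero although a connection is suggested, so the gene has one violation of each kind.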

Fig. 3. The proposed PKD modeling framework: it embeds prior knowledge (step A), runs the optimization process (step B), estimates plausible models (step C), and confirms a plausible GRN and suggests some possible new connections (step D).

III. EXPERIMENTAL RESULTS

To validate the proposed approach, we performed two sets of experiments and present a walkthrough example that explains how to exploit the PKD modeling algorithm to investigate the genetic relationships in a regulatory network. The first set of experiments evaluated the performance of the proposed framework on a popular benchmark model, a five-gene artificial network [54]. For the second set of experiments, a real dataset was used; it was collected from research on gene expression in the cell cycle of the yeast S. cerevisiae [7] and was used to investigate how the proposed approach can be applied to the study of real gene networks. The prior knowledge, the network models, and the plausible connections were derived accordingly.

A. The performance of the proposed framework on an artificial dataset

In the first set of experiments, our method was employed to model a five-node artificial network [68] from a time-series data profile of thirty simulation time steps. The genes in the network had the following relationships:

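\begin{equation}
\begin{aligned}
\dot{X}_1 &= 15.0\,X_3 X_5^{-0.1} - 10.0\,X_1^{2.0} \\
\dot{X}_2 &= 10.0\,X_1^{2.0} - 10.0\,X_2^{2.0} \\
\dot{X}_3 &= 10.0\,X_2^{-0.1} - 10.0\,X_2^{-0.1} X_3^{2.0} \\
\dot{X}_4 &= 8.0\,X_1^{2.0} X_5^{-1.0} - 10.0\,X_4^{2.0} \\
\dot{X}_5 &= 10.0\,X_4^{2.0} - 10.0\,X_5^{2.0}
\end{aligned}
\tag{10}
\end{equation}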
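To make the benchmark concrete, the sketch below evaluates the right-hand side of the five-gene S-system and integrates it with a classical fourth-order Runge-Kutta step. The rate constants and kinetic orders are taken from the "expected" rows of Table 4; the initial state, the step size, and all function names are hypothetical choices for illustration and are not given in the text.

```python
# Rate constants and kinetic orders: "expected" values from Table 4.
ALPHA = [15.0, 10.0, 10.0, 8.0, 10.0]
BETA = [10.0, 10.0, 10.0, 10.0, 10.0]
G = [  # synthesis kinetic orders g[i][j]
    [0.0, 0.0, 1.0, 0.0, -0.1],
    [2.0, 0.0, 0.0, 0.0, 0.0],
    [0.0, -0.1, 0.0, 0.0, 0.0],
    [2.0, 0.0, 0.0, 0.0, -1.0],
    [0.0, 0.0, 0.0, 2.0, 0.0],
]
H = [  # degradation kinetic orders h[i][j]
    [2.0, 0.0, 0.0, 0.0, 0.0],
    [0.0, 2.0, 0.0, 0.0, 0.0],
    [0.0, -0.1, 2.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 2.0, 0.0],
    [0.0, 0.0, 0.0, 0.0, 2.0],
]


def power_product(x, orders):
    """Product of x_j ** order_j over all genes j with a non-zero order."""
    result = 1.0
    for xj, order in zip(x, orders):
        if order != 0.0:
            result *= xj ** order
    return result


def s_system_rhs(x):
    """dX_i/dt = alpha_i * prod_j X_j^g_ij - beta_i * prod_j X_j^h_ij."""
    return [
        ALPHA[i] * power_product(x, G[i]) - BETA[i] * power_product(x, H[i])
        for i in range(len(x))
    ]


def rk4_step(x, dt):
    """One classical Runge-Kutta step of size dt."""
    k1 = s_system_rhs(x)
    k2 = s_system_rhs([xi + 0.5 * dt * k for xi, k in zip(x, k1)])
    k3 = s_system_rhs([xi + 0.5 * dt * k for xi, k in zip(x, k2)])
    k4 = s_system_rhs([xi + dt * k for xi, k in zip(x, k3)])
    return [
        xi + dt / 6.0 * (a + 2.0 * b + 2.0 * c + d)
        for xi, a, b, c, d in zip(x, k1, k2, k3, k4)
    ]


def simulate(x0, dt, steps):
    """Return the trajectory [x(0), x(dt), ..., x(steps * dt)]."""
    trajectory = [list(x0)]
    for _ in range(steps):
        trajectory.append(rk4_step(trajectory[-1], dt))
    return trajectory


# Hypothetical initial state and step size (not specified in the text).
profile = simulate([0.7, 0.12, 0.14, 0.16, 0.18], dt=0.01, steps=30)
print(len(profile), "time points of", len(profile[0]), "genes")
```

Because the S-system is decoupled gene by gene in its parameters, the same `s_system_rhs` form can be evaluated with inferred values in place of the expected ones to compare the two behaviors.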

Following the procedure of the PKDs modeling algorithm, and without loss of generality, we assumed that the genetic relationships given above were available as prior knowledge and could serve as source information. This knowledge was used to build the adjacency matrix (i.e., the connection relationships) of the network (see Table 2). Then, the algorithm mapped the connection relationships and constraints onto the kinetic orders of the S-system, as shown in Table 3. This procedure assigned the numbers of kinetic orders in structure priorities 1 and 2 for genes 1 to 5 as (7, 3), (8, 2), (7, 3), (7, 3), and (8, 2), respectively. Once the mapping was completed, the reverse-engineering approach (i.e., GA-PSO) was run to infer the gene expression profiles and the numerical values of the kinetic orders according to the proposed objective function (i.e., equation 9). Thirty independent runs were conducted, and each run continued for 1,500 iterations with a population size of 800.

TABLE 2
THE ADJACENCY MATRIX OF THE FIVE-GENE ARTIFICIAL NETWORK.

gene i   gene 1   gene 2   gene 3   gene 4   gene 5
1        0        0        1        0        1
2        1        0        0        0        0
3        0        1        0        0        0
4        1        0        0        0        1
5        0        0        0        1        0

TABLE 3
THE RESULTS OF MAPPING PRIOR KNOWLEDGE ONTO THE S-SYSTEM OF THE FIVE-GENE ARTIFICIAL NETWORK.

gene i   gi,j ∈ [-3,3] (otherwise, 0)   hi,j ∈ [-3,3] (otherwise, 0)   StrPri1   StrPri2
1        g1,3, g1,5                     h1,1                           7         3
2        g2,1                           h2,2                           8         2
3        g3,2                           h3,2, h3,3                     7         3
4        g4,1, g4,5                     h4,4                           7         3
5        g5,4                           h5,5                           8         2
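The mapping from prior knowledge to parameter bounds can be sketched as below. This is an illustrative reading of Tables 2 and 3, not the authors' code: the column positions of the adjacency matrix are inferred from the kinetic orders of equation (10), and the data layout and function names are hypothetical.

```python
# Adjacency matrix of the five-gene network (row = regulated gene i,
# column = regulating gene j); positions inferred from equation (10).
ADJACENCY = [
    [0, 0, 1, 0, 1],
    [1, 0, 0, 0, 0],
    [0, 1, 0, 0, 0],
    [1, 0, 0, 0, 1],
    [0, 0, 0, 1, 0],
]
# Degradation-side kinetic orders left free in Table 3, as 1-based (i, j).
H_FREE = {(1, 1), (2, 2), (3, 2), (3, 3), (4, 4), (5, 5)}


def map_prior_knowledge(adjacency, h_free, bound=3.0):
    """Return search bounds for every kinetic order g_ij and h_ij.

    Suggested connections may vary in [-bound, bound]; all other kinetic
    orders are constrained to [0, 0].
    """
    n = len(adjacency)
    bounds = {}
    for i in range(1, n + 1):
        for j in range(1, n + 1):
            g_is_free = adjacency[i - 1][j - 1] == 1
            h_is_free = (i, j) in h_free
            bounds[("g", i, j)] = (-bound, bound) if g_is_free else (0.0, 0.0)
            bounds[("h", i, j)] = (-bound, bound) if h_is_free else (0.0, 0.0)
    return bounds


def priority_counts(bounds, n):
    """Per gene: (orders expected to be zero, orders expected to be non-zero)."""
    counts = []
    for i in range(1, n + 1):
        free = sum(
            1
            for (_, gene, _), limits in bounds.items()
            if gene == i and limits != (0.0, 0.0)
        )
        counts.append((2 * n - free, free))
    return counts


bounds = map_prior_knowledge(ADJACENCY, H_FREE)
print(priority_counts(bounds, len(ADJACENCY)))
```

The resulting pairs can be compared with the StrPri1 and StrPri2 columns of Table 3.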

To determine a final inferred model with the desired kinetic orders, there are two model selection strategies, both based on the summary shown in Table A1. First, if some runs produced models in which all parameters fitted the suggested connection relationships (i.e., without violations of structure priorities 1 and 2), then the model with the lowest MSE value among them was chosen. Second, if no single run produced a model that fitted the structural suggestion for all of the genes (as often occurs with a real dataset), then, drawing on the decoupled nature of the S-system, the algorithm took for each gene the best structurally fitting model with an acceptable MSE (i.e., smaller than the threshold δ2) from among the different runs. For example, gene 1 could be chosen from run j, while gene 2 could be selected from another run. In this set of experiments, the first strategy was adopted, whereas the second strategy was used in the real dataset experiment, i.e., with the yeast S. cerevisiae data. With the selection strategy described above, the algorithm chose the model inferred from run 27 (see Table A1) as the final result because it met the selection criteria: it had no violations and the lowest MSE value (i.e., 1.65E-05). Based on this result, Figure 4 compares the inferred and target gene expression profiles (i.e., the network behaviors); the behavior of the inferred model (left) is almost identical to that of the target network (right). In addition, the values of the chosen and expected kinetic orders are compared in Table 4. As can be seen, the proposed approach obtained very similar values.
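The first selection strategy can be sketched as follows. The `RunSummary` record and the function name are hypothetical, and the run summaries in the example are invented for illustration, except for the MSE of run 27, which is the value reported above.

```python
from dataclasses import dataclass


@dataclass
class RunSummary:
    run: int          # run index
    violations: int   # total structure priority 1 and 2 violations, all genes
    mse: float        # mean squared error of the inferred expression profiles


def select_whole_model(runs):
    """Strategy 1: among violation-free runs, return the one with the lowest MSE.

    Returns None when every run has at least one violation, in which case the
    per-gene strategy (strategy 2) would be needed instead.
    """
    clean = [r for r in runs if r.violations == 0]
    if not clean:
        return None
    return min(clean, key=lambda r: r.mse)


# Hypothetical summaries; only the MSE of run 27 comes from the text.
runs = [
    RunSummary(run=3, violations=0, mse=4.10e-05),
    RunSummary(run=11, violations=2, mse=9.00e-06),
    RunSummary(run=27, violations=0, mse=1.65e-05),
]
chosen = select_whole_model(runs)
print(chosen)
```

Note that run 11 in this made-up example has the lowest MSE overall but is excluded because it violates the structural suggestions.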

Fig. 4. Overview of the network behaviors of the five-gene artificial network. The left part is the inferred behavior, and the right part, the target behavior.

Fig. 5. Overview of the pathway diagram of the five-gene artificial dataset.

The above results show that the most important contribution of our framework is the discovery of plausible pathway diagram(s) (see Figure 5), which can be drawn from the inferred kinetic orders (Table 4). Taking this five-gene network as an example, we find that although the final inferred GRN model agreed with the expected structure perfectly, and the plausible pathways of the model were therefore consistent with those of the expected model, there were still some gene regulations (i.e., gi,j or hi,j) that were not among the preferred connection relationships. In detail, the parameters g3,1 and g3,4 went against the suggestion in 2 and 9 runs out of 30 runs, respectively. In such a situation, when considering the plausible pathways of a network, we must keep in mind that there are two possible scenarios behind the diagram(s), as described below.

First, even when the inferred model fits all of the genetic relationships, some kinetic orders may still deviate from the preferred topology given by the prior knowledge. The reason is that it is easier for the inferred models to meet the targeted network dynamics by behaving in this way. In this case, it is possible to examine the unfitted kinetic orders by looking into the biological meanings of the corresponding genes. These unfitted topologies may be essential for further in vivo experiments.

Second, if the final inferred model does not match all of the genetic relationships (meaning that some of the genes do not fit the suggested topology properly), the existence of some links cannot be certain. These links contravene the expected structures. One possible solution in such a situation is to provide a threshold (e.g., the top 3 unfitted genes) for examining the most unfitted genes and for considering whether to design new experiments in terms of the biological context.

Obviously, the first set of experiments falls into the first scenario (and the second set into the second scenario). Because an artificial dataset was used, we considered the suggested topology to be equal to the 'true' topology. Hence, g3,1 and g3,4 were regarded as parameters that attempted to capture the target network behaviors, and we did not analyze these two parameters further. Based on the experimental results, our framework demonstrated that if we derive connection relationships from the existing biological knowledge and directly encode the information in the form of topology constraints to prioritize the search solutions, then meaningful solutions that correspond to similar network behaviors can be obtained. In this way, the inferred kinetic orders can suggest where a new experimental design is worth attempting.

TABLE 4
A COMPARISON OF KINETIC ORDERS BETWEEN THE INFERRED AND EXPECTED NUMERICAL VALUES.

gene i              gi,1     gi,2     gi,3     gi,4     gi,5     hi,1     hi,2     hi,3     hi,4     hi,5     αi       βi
1      (expected)   0        0        1        0        -0.1     2        0        0        0        0        15       10
1      (inferred)   0        0        0.9926   0        -0.1221  2.0013   0        0        0        0        14.0093  9.2967
2      (expected)   2        0        0        0        0        0        2        0        0        0        10       10
2      (inferred)   2.0000   0        0        0        0        0        2.0000   0        0        0        10.0000  10.0001
3      (expected)   0        -0.1     0        0        0        0        -0.1     2        0        0        10       10
3      (inferred)   0        -0.0934  0        0        0        0        -0.0951  2.0064   0        0        10.1508  10.1553
4      (expected)   2        0        0        0        -1       0        0        0        2        0        8        10
4      (inferred)   1.9963   0        0        0        -0.9903  0        0        0        1.9632   0        8.1886   10.2462
5      (expected)   0        0        0        2        0        0        0        0        0        2        10       10
5      (inferred)   0        0        0        2.0000   0        0        0        0        0        2.0000   10.0000  10.0000
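The closeness of the inferred and expected kinetic orders in Table 4 can be quantified with a short calculation. The sketch below uses the non-zero entries reported in Table 4; the dictionary layout and function names are hypothetical, and the absolute error is only one possible measure of agreement.

```python
# Non-zero kinetic orders from Table 4: name -> (expected, inferred).
KINETIC_ORDERS = {
    "g1,3": (1.0, 0.9926), "g1,5": (-0.1, -0.1221), "h1,1": (2.0, 2.0013),
    "g2,1": (2.0, 2.0000), "h2,2": (2.0, 2.0000),
    "g3,2": (-0.1, -0.0934), "h3,2": (-0.1, -0.0951), "h3,3": (2.0, 2.0064),
    "g4,1": (2.0, 1.9963), "g4,5": (-1.0, -0.9903), "h4,4": (2.0, 1.9632),
    "g5,4": (2.0, 2.0000), "h5,5": (2.0, 2.0000),
}


def absolute_errors(table):
    """Absolute difference between inferred and expected value per parameter."""
    return {
        name: abs(inferred - expected)
        for name, (expected, inferred) in table.items()
    }


def worst_parameter(table):
    """Return (name, error) of the parameter with the largest absolute error."""
    errors = absolute_errors(table)
    name = max(errors, key=errors.get)
    return name, errors[name]


name, error = worst_parameter(KINETIC_ORDERS)
print(f"largest deviation: {name} ({error:.4f})")
```

Under this measure, the largest deviation among the kinetic orders in Table 4 is that of h4,4 (2 expected, 1.9632 inferred).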

B. Evaluation of the PKD modeling algorithm on yeast S. cerevisiae

In the second set of experiments, our approach was employed to model a subset of the yeast S. cerevisiae regulatory network. As described in Table 1, an adjacency matrix of this real dataset was built by referring to the results obtained from the PKD software, [-3, 3]. Then, the DNA repair regulatory network defined by the 20 genes and 71 non-directional paired links was mapped onto the kinetic orders of the S-system for the topology constraint settings. The predefined topology was used as guidance to distribute the 800 parameters of the S-system between structure priorities 1 and 2 (see Table 5). The GA-PSO approach was then performed to infer both the network behaviors and structures using the proposed evaluation function. Thirty runs were conducted, with 6,000 iterations and a population size of 2,000 in each run.

TABLE 5
THE RESULTS OF 800 PARAMETERS (KINETIC ORDERS) DISTRIBUTED IN STRUCTURE PRIORITIES 1 AND 2.

gene    StrPri1   StrPri2      gene    StrPri1   StrPri2
RDH54   23        17           RHC18   35        5
DUN1    23        17           UNG1    37        3
MSH6    21        19           OGG1    33        7
RAD51   23        17           PIF1    31        9
RAD54   15        25           SGS1    13        27
HPR5    23        17           PMS1    25        15
RAD27   13        27           MSH2    19        21
RAD5    31        9            DHS1    27        13
TOP3    25        15           RAD53   27        13
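As a quick consistency check on the parameter count, and assuming that the 800 parameters are the kinetic orders only (the rate constants αi and βi are not counted), a network of N = 20 genes gives:

```latex
\[
\underbrace{N^2}_{g_{i,j}} + \underbrace{N^2}_{h_{i,j}} = 2N^2 = 2 \times 20^2 = 800,
\qquad
\mathrm{StrPri1}_i + \mathrm{StrPri2}_i = 2N = 40 \quad \text{for each gene } i .
\]
```

Here StrPri1 and StrPri2 denote the per-gene counts listed in Table 5. For example, RDH54 has 23 + 17 = 40 kinetic orders and UNG1 has 37 + 3 = 40.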

After the optimization process was performed, the algorithm recorded the number of structural violations of the inferred kinetic orders, as shown in Table A2. As can be observed, no single run fitted all of the structural suggestions for all 20 genes simultaneously. Therefore, the proposed approach took, for each gene, the model with the best structural fit from among the different runs. This strategy searches for the model with the smallest number of structural violations for each gene i and evaluates its MSE. As mentioned previously, if the MSE value is lower than the threshold (i.e., δ2), then the corresponding kinetic orders are taken. For example, because the RDH54 gene model had the lowest number of violations in run 5 (i.e., no violation) and its MSE was smaller than δ2 (a value of 0.15 was used), the kinetic orders from run 5 were taken (see Table A2).

Fig. 6. An overview of the network behaviors of the yeast S. cerevisiae DNA repair genes: the inferred (top) and target (bottom) network behaviors of the repair system. The x-axis represents the time steps with the corresponding cell cycle phases (G1, S, G2, M, and M/G1), and the y-axis shows the concentrations of the gene components.

In contrast, if the MSE of the model with the lowest number of structural violations is higher than δ2, then the model with the next-lowest number of structural violations is chosen, and its MSE is evaluated. The procedure continues until a model is found that meets the criteria. For example, in the experiment, the RAD5 gene had its lowest number of structural violations (i.e., 2 violations) in run 16, but its MSE failed to satisfy the δ2 requirement; thus, the models with the second lowest number of structural violations (i.e., 3 violations) were considered (see Table A2). In this case, three runs (runs 1, 7, and 22) met this criterion, among which the model inferred from run 1 had the lowest MSE. Therefore, the algorithm chose the corresponding kinetic orders from this run. With the above selection strategy, the algorithm determines the final solution, as given in Table A3. The network behaviors of the inferred yeast S. cerevisiae DNA repair genes are shown in Figure 6. This figure compares the inferred and target gene expression profiles. As shown, all of the profiles from the inferred model (top) behave very similarly to those of the target network (bottom), except for the RDH54 gene. After further examination, we found that this discrepancy is mainly due to the measurement noise contained in the real dataset. According to the experimental discussion in [7] and [12], the RDH54 gene peaked only once, in the G1 phase (but not in the S or G2 phases). In other words, the inferred expression profile of RDH54 has in fact fitted all of the structural suggestions from the PKDs and captured the actual network behavior as well. In addition to the network behaviors, we also calculated the structural accuracy (by the structure priority equations) averaged over the thirty runs for this network: the sensitivity (the true positive rate) was 88% (with a standard deviation of 0.01) and the specificity was 81% (with a standard deviation of 0.02). The results show that the strength of our approach lies in its ability to derive the network behaviors and meaningful network structures concurrently.
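The per-gene selection procedure described above can be sketched as follows. The function name and tuple layout are hypothetical. The run numbers and violation counts for RAD5 follow the text, but the MSE values are invented for illustration because the text does not report them.

```python
DELTA_2 = 0.15  # MSE acceptance threshold reported in the text


def select_gene_model(candidates, delta2=DELTA_2):
    """Pick the model for one gene from per-run (run, violations, mse) tuples.

    Among the candidates whose MSE is below delta2, return the one with the
    fewest structural violations, breaking ties by the lowest MSE. This gives
    the same result as stepping through the violation counts in increasing
    order until an acceptable MSE is found. Returns None if no run qualifies.
    """
    acceptable = [c for c in candidates if c[2] < delta2]
    if not acceptable:
        return None
    return min(acceptable, key=lambda c: (c[1], c[2]))


# RAD5: runs and violation counts from the text; MSE values are hypothetical.
rad5_candidates = [
    (16, 2, 0.21),  # fewest violations, but MSE above delta2
    (1, 3, 0.08),
    (7, 3, 0.11),
    (22, 3, 0.13),
    (9, 5, 0.02),   # hypothetical run with a low MSE but more violations
]
selected = select_gene_model(rad5_candidates)
print(selected)
```

With these illustrative numbers, run 16 is rejected on its MSE and run 1 is chosen among the three runs with 3 violations, which matches the outcome described for RAD5.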

IV. DISCUSSIONS

To investigate the unfitted structure in the inferred yeast S. cerevisiae DNA repair network, further examinations and analyses were made as follows. As indicated in the first set of experiments, if some inferred parameters do not fit all of the genetic relationships given by the PKDs, the existence of certain links cannot be confirmed. The second set of experiments falls into this case. The results for each gene in the inferred model were thus distilled into the following four scenarios: (a) satisfying all of the connections given by the PKDs, (b) against structure priority 1, (c) against structure priority 2, and (d) against both priorities. Table 6 lists all of the inferred genes classified by scenario. It is notable that the genes within a scenario do not form a group. For example, in scenario (a), RDH54, MEC3, PIF1, and RAD54 are independent: each of these genes, together with its forty related connections, satisfied all of the connections under the constraints of the structure priorities. With these results, the proposed approach has shown its strength in taking advantage of prior knowledge to derive a parameter set with biological meaning and to reconstruct the gene expression profiles.

TABLE 6
GENES OF THE YEAST S. CEREVISIAE DNA REPAIR SYSTEM GROUPED BY FOUR MODEL SUGGESTION SCENARIOS

scenario   gene
(a)        RDH54, MEC3, PIF1, and RAD54
(b)        DUN1, DHS1, HPR5, OGG1, RAD51, RAD5, RHC18, UNG1
(c)        MSH2, RAD27, and TOP3
(d)        MSH6, PMS1, RAD53, SGS1, and SPK1
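The four scenarios follow directly from the two violation counts of each gene, which can be sketched as below. The function name is hypothetical, and the violation counts in the example are invented for illustration (the actual counts are reported in Table A2); only the resulting scenario of each gene agrees with the grouping above.

```python
def classify_scenario(nzp, zp):
    """Map the violation counts of one gene to scenario (a), (b), (c), or (d).

    nzp -- structure priority 1 violations (connections that appeared although
           none was suggested)
    zp  -- structure priority 2 violations (suggested connections that were
           removed)
    """
    if nzp == 0 and zp == 0:
        return "a"  # all suggestions satisfied
    if zp == 0:
        return "b"  # only added connections
    if nzp == 0:
        return "c"  # only removed connections
    return "d"      # both added and removed connections


# Hypothetical (nzp, zp) counts for one gene per scenario.
examples = {"RDH54": (0, 0), "DUN1": (1, 0), "MSH2": (0, 2), "MSH6": (1, 2)}
for gene, (nzp, zp) in examples.items():
    print(gene, classify_scenario(nzp, zp))
```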

Four walkthrough examples of the gene examination process, one for each of scenarios (a) to (d), are given below to explain how to interpret the derived results, especially for the connections that did not match the PKD suggestions.

Figure 7 illustrates how the pathway diagrams appear in the four scenarios. First, four genes belong to scenario (a). Taking RDH54 as an example (Figure 7a), we can see that all parameters related to this gene followed the connection suggestions. Therefore, we can conclude that the suggested links from the PKDs helped the inference approach to derive a parameter set that had biological meaning and to reconstruct the gene expression profiles successfully.

Fig. 7. The inferred pathway diagrams. (a) all kinetic orders fit prior knowledge; (b) the kinetic orders against structure priority 1; (c) the kinetic orders against structure priority 2; (d) the kinetic orders against both structure priorities 1 and 2.

In the second scenario, some connections appeared that went against the suggestions from the PKDs. For example, in the case of DUN1 (Figure 7b), one gene connection (i.e., from the SGS1 gene) appears in the synthesis process of DUN1 in 29 out of 30 runs. This result gave us a strong clue that this gene regulation exists. Because the inferred model was determined computationally, we cannot be sure that the connection indeed exists, and a careful literature review is thus needed. Following up on this assumption, we searched the literature for information on the relationship between SGS1 and DUN1. Pan et al. reported causal relationships between DNA replication genes (e.g., SGS1) and the regulation of DNA replication checkpoint genes (e.g., DUN1) through stalled replication forks [55]. This result demonstrated the interplay between these two genes. Based on the findings reported in the literature, our approach can lead to the discovery of missing links through the modeling results for the inferred parameters (i.e., by implying possible connections).

In contrast to the above scenario, the third scenario covers the cases in which several connections were discarded by the inferred GRN model. For example, with regard to MSH2 (see Figure 7c), DHS1 and HPR5 were disconnected from MSH2 in the synthesis and degradation processes, respectively. In fact, on examining the functions of these genes, we found that although the PKDs considered them to belong to the same group in the yeast S. cerevisiae DNA repair system, these genes belong to different repair pathways (sub-groups) in the repair system [55-56]. Hence, the suggested links were removed by the inferred model after the training process. Notably, the remaining connections (DHS1 → MSH2 and HPR5 → MSH2) are considered to be weak connections among these sub-groups of the repair system. These links might provide useful information for future laboratory experiments.

The last scenario is a combination of scenarios (b) and (c) and contains newly added and removed connections simultaneously. In Figure 7d, as suggested by the inferred model, the DNA damage and meiotic checkpoint gene, MEC3, has a connection in the synthesis process of the MSH6 gene. Both MEC3 and MSH6 are classified under the "cellular response to DNA damage stimulus" term in gene ontology (GO), according to the Saccharomyces Genome Database (SGD) [57]. Although, to our knowledge, there is no explicit relationship between them in the literature, this link (i.e., MEC3 → MSH6) could be considered a weak connection because the two genes correspond to the same GO term. In addition, two partial linkages, SGS1 → MSH6 and PMS1 → MSH6, in the synthesis and degradation processes, respectively, were suggested by the final GRN model. The first gene, SGS1, encodes a DNA helicase that plays a key role in DNA repair, namely repairing double-strand breaks during homologous recombination [58]. The second gene, PMS1, is involved in DNA mismatch repair, although its role in the mismatch repair function has not been proven [57]. Therefore, further laboratory experiments on the uncertain regulatory relationships among SGS1, MSH6, and PMS1 in the cellular response to DNA damage stimulus are suggested.

Following the gene examination process in these four scenarios, it is possible to look into the connections involving a specific gene in the synthesis and degradation processes based on the magnitudes of its kinetic orders. In other words, we can examine the connections that appeared in the synthesis or degradation processes (i.e., scenario (b)) of genes DHS1, HPR5, OGG1, RAD51, RAD5, RHC18, and UNG1, and scrutinize the connections removed from the regulatory processes (i.e., scenario (c)) of genes RAD27 and TOP3. Furthermore, we can reason, where possible, about the interaction relationships with added or removed connections of genes PMS1, RAD53, SGS1, and SPK1 (i.e., scenario (d)). These examinations help us understand more about both the curve fitting between the estimated and target expression levels and the fit to the structural priorities. The four scenarios demonstrate the advantages of our proposed approach, which has successfully integrated PKDs into our network modeling algorithm with a novel objective function that weights both the curve fitting and the structural priorities. This approach has proven its strength in providing scientists with new suggestions for gene regulation and for the design of in vivo experiments.

V. CONCLUSIONS

The need for integrating knowledge-based and data-driven inference strategies has been emphasized and addressed to construct network models that exhibit correct network structures and behaviors. This study proposes a systematic framework with a traceable pipeline that combines the strengths of both knowledge and data to construct biologically plausible connections based on the ODE computational model. This integrated framework allows us to investigate the inferred models, which can suggest plausible genetic regulatory relationships that are testable in in vivo research. A series of experiments has been conducted to validate the proposed approach. In the analyses of the yeast S. cerevisiae DNA repair network, the results show that the knowledge-driven modeling approach can derive the parameter set implied by the available biological data and can reconstruct the corresponding gene expression profiles. This study identifies four scenarios that are suggested by the final inferred GRN model and gives, for each scenario, an example of how to address newly added or removed genetic regulatory interactions in the synthesis or degradation processes of the target genes. These results confirm that the proposed approach can successfully utilize PKDs to indicate the network regulatory relationships involved in real biological systems.

ACKNOWLEDGEMENTS

This work was supported in part by the National Science Council of Taiwan, under contract NSC-100-2221-E-110-086.

REFERENCES

[1] E. Davidson, and M. Levine, “Gene regulatory networks,” Proceedings of the National Academy of Sciences of the United States of America, vol. 102, no. 14, pp. 4935-4935, 2005.
[2] M. Bansal, V. Belcastro, A. Ambesi-Impiombato, and D. Bernardo, “How to infer gene networks from expression profiles,” Molecular Systems Biology, vol. 3, no. 1, pp. 1-10, 2007.
[3] C.-H. Zheng, D.-S. Huang, and L. Shang, “Feature selection in independent component subspace for microarray data classification,” Neurocomputing, vol. 69, no. 16, pp. 2407-2410, 2006.
[4] J. Kononen, L. Bubendorf, A. Kallioniemi, M. Bärlund, P. Schraml et al., “Tissue microarrays for high-throughput molecular profiling of tumor specimens,” Nature Medicine, vol. 4, no. 7, pp. 844-847, 1998.
[5] U. Alon, N. Barkai, D. A. Notterman, K. Gish, S. Ybarra et al., “Broad patterns of gene expression revealed by clustering analysis of tumor and normal colon tissues probed by oligonucleotide arrays,” Proceedings of the National Academy of Sciences of the United States of America, vol. 96, no. 12, pp. 6745-6750, 1999.
[6] D. B. Ulanet, D. L. Ludwig, C. R. Kahn, and D. Hanahan, “Insulin receptor functionally enhances multistage tumor progression and conveys intrinsic resistance to IGF-1R targeted therapy,” Proceedings of the National Academy of Sciences of the United States of America, vol. 107, no. 24, pp. 10791-10798, 2010.
[7] R. J. Cho, M. J. Campbell, E. A. Winzeler, L. Steinmetz, A. Conway et al., “A genome-wide transcriptional analysis of the mitotic cell cycle,” Molecular Cell, vol. 2, no. 1, pp. 65-73, 1998.
[8] C. J. Roberts, B. Nelson, M. J. Marton, R. Stoughton et al., “Signaling and circuitry of multiple MAPK pathways revealed by a matrix of global gene expression profiles,” Science, vol. 287, no. 5454, pp. 873-880, 2000.
[9] C.-H. Zheng, L. Zhang, T.-Y. Ng, C. K. Shiu, and D.-S. Huang, “Metasample-based sparse representation for tumor classification,” IEEE/ACM Transactions on Computational Biology and Bioinformatics, vol. 8, no. 5, pp. 1273-1282, 2011.
[10] S.-L. Wang, Y.-H. Zhu, W. Jia, and D.-S. Huang, “Robust classification method of tumor subtype by using correlation filters,” IEEE/ACM Transactions on Computational Biology and Bioinformatics, vol. 9, no. 2, pp. 580-591, 2012.
[11] D.-S. Huang, and C.-H. Zheng, “Independent component analysis-based penalized discriminant method for tumor classification using gene expression data,” Bioinformatics, vol. 22, no. 15, pp. 1855-1862, 2006.
[12] P. T. Spellman, G. Sherlock, M. Q. Zhang, V. R. Iyer, K. Anders et al., “Comprehensive identification of cell cycle-regulated genes of the yeast Saccharomyces cerevisiae by microarray hybridization,” Molecular Biology of the Cell, vol. 9, no. 12, pp. 3273-3297, 1998.

[13] R. Chavez-Alvarez, A. Chavoya, and A. Mendez-Vazquez, “Discovery of possible gene relationships through the application of self-organizing maps to DNA microarray databases,” PLoS One, vol. 9, no. 4, e93233, 2014.
[14] F. Eduati, J. De Las Rivas, B. Di Camillo, G. Toffolo, and J. Saez-Rodriguez, “Integrating literature-constrained and data-driven inference of signalling networks,” Bioinformatics, vol. 28, no. 18, pp. 2311-2317, 2012.
[15] E. O. Voit, “Biochemical systems theory: a review,” ISRN Biomathematics, vol. 2013, no. 897658, pp. 1-53, 2013.
[16] L. E. Chai, S. K. Loh, S. T. Low, M. S. Mohamad, S. Deris et al., “A review on the computational approaches for gene regulatory network construction,” Computers in Biology and Medicine, vol. 48, pp. 55-65, 2014.
[17] W.-P. Lee, and W.-S. Tzou, “Computational methods for discovering gene networks from expression data,” Briefings in Bioinformatics, vol. 10, pp. 408-423, 2009.
[18] H. W. Engl, C. Flamm, P. Kügler, J. Lu, S. Müller et al., “Inverse problems in systems biology,” Inverse Problems, vol. 25, no. 12, e123014, 2009.
[19] Y. Fomekong-Nanfack, M. Postma, and J. Kaandorp, “Inferring Drosophila gap gene regulatory network: a parameter sensitivity and perturbation analysis,” BMC Systems Biology, vol. 3, no. 1, pp. 94, 2009.
[20] G. Alterovitz, and M. F. Ramoni, Knowledge-Based Bioinformatics: from Analysis to Interpretation, Chichester, UK: Wiley, 2010.
[21] G. Karlebach, and R. Shamir, “Modelling and analysis of gene regulatory networks,” Nature Reviews Molecular Cell Biology, vol. 9, no. 10, pp. 770-780, 2008.
[22] H. Ogata, S. Goto, K. Sato, W. Fujibuchi, H. Bono et al., “KEGG: Kyoto encyclopedia of genes and genomes,” Nucleic Acids Research, vol. 27, no. 1, pp. 29-34, 1999.
[23] A. H. Y. Tong, G. Lesage, G. D. Bader, H. Ding, H. Xu et al., “Global mapping of the yeast genetic interaction network,” Science Signaling, vol. 303, no. 5659, pp. 808, 2004.
[24] A. R. Pico, T. Kelder, M. P. van Iersel, K. Hanspers, B. R. Conklin et al., “WikiPathways: pathway editing for the people,” PLoS Biology, vol. 6, no. 7, pp. e184, 2008.
[25] C. Terfve, T. Cokelaer, D. Henriques, A. MacNamara, E. Goncalves et al., “CellNOptR: a flexible toolkit to train protein signaling networks to data using multiple logic formalisms,” BMC Systems Biology, vol. 6, pp. 133, 2012.
[26] F. Emmert-Streib, G. V. Glazko, G. Altay, and R. de Matos Simoes, “Statistical inference and reverse engineering of gene regulatory networks from observational expression data,” Frontiers in Genetics, vol. 3, no. 8, 2012.
[27] T. Saithong, S. Bumee, C. Liamwirat, and A. Meechai, “Analysis and practical guideline of constraint-based Boolean method in genetic network inference,” PLoS One, vol. 7, no. 1, pp. e30232, 2012.
[28] K. Jaqaman, and G. Danuser, “Linking data to models: data regression,” Nature Reviews Molecular Cell Biology, vol. 7, no. 11, pp. 813-819, 2006.
[29] T. M. Przytycka, and Y.-A. Kim, “Network integration meets network dynamics,” BMC Biology, vol. 8, no. 1, pp. 48, 2010.
[30] I. C. Chou, and E. O. Voit, “Recent developments in parameter estimation and structure identification of biochemical and genomic systems,” Mathematical Biosciences, vol. 219, no. 2, pp. 57-83, 2009.
[31] S. Bornholdt, “Systems biology: less is more in modeling large genetic networks,” Science Signaling, vol. 310, no. 5747, pp. 449, 2005.
[32] N. Xuan, M. Chetty, R. Coppel, and P. P. Wangikar, “Gene regulatory network modeling via global optimization of high-order dynamic Bayesian network,” BMC Bioinformatics, vol. 13, no. 1, pp. 131, 2012.
[33] M. A. Savageau, and R. Rosen, Biochemical Systems Analysis: a Study of Function and Design in Molecular Biology, Mass, USA: Addison-Wesley, Reading, 1976.
[34] M. Vilela, I. C. Chou, S. Vinga, A. Vasconcelos, E. O. Voit et al., “Parameter optimization in S-system models,” BMC Systems Biology, vol. 2, no. 1, pp. 35, 2008.
[35] Y. Lee, P.-W. Chen, and E. O. Voit, “Analysis of operating principles with S-system models,” Mathematical Biosciences, vol. 231, no. 1, pp. 49-60, 2011.
[36] S. Kimura, K. Ide, A. Kashihara, M. Kano, M. Hatakeyama et al., “Inference of S-system models of genetic networks using a cooperative coevolutionary algorithm,” Bioinformatics, vol. 21, no. 7, pp. 1154-1163, 2005.
[37] Y.-T. Hsiao, and W.-P. Lee, “A sensitivity-based incremental evolution approach for the inference of gene networks,” BMC Bioinformatics, vol. 13, no. S2, S8, 2012.
[38] S. Kikuchi, D. Tominaga, M. Arita, K. Takahashi, and M. Tomita, “Dynamic modeling of genetic networks using genetic algorithm and S-system,” Bioinformatics, vol. 19, no. 5, pp. 643-650, 2003.


[39] S. Q. Wang, and H. X. Li, “Quantitative modeling of transcriptional regulatory networks by integrating multiple source of knowledge,” Bioprocess and Biosystems Engineering, pp. 1-11, 2012.
[40] A. Greenfield, C. Hafemeister, and R. Bonneau, “Robust data-driven incorporation of prior knowledge into the inference of dynamic regulatory networks,” Bioinformatics, vol. 29, no. 8, pp. 1060-1067, 2013.
[41] W. H. Mager, and J. Winderickx, “Yeast as a model for medical and medicinal research,” Trends in Pharmacological Sciences, vol. 26, no. 5, pp. 265-273, 2005.
[42] D. Branzei, and M. Foiani, “Regulation of DNA repair throughout the cell cycle,” Nature Reviews Molecular Cell Biology, vol. 9, no. 4, pp. 297-308, 2008.
[43] Y. Fu, L. Pastushok, and W. Xiao, “DNA damage-induced gene expression in Saccharomyces cerevisiae,” FEMS Microbiology Reviews, vol. 32, no. 6, pp. 908-926, 2008.
[44] C. Bertoli, J. M. Skotheim, and R. A. de Bruin, “Control of cell cycle transcription during G1 and S phases,” Nature Reviews Molecular Cell Biology, vol. 14, no. 8, pp. 518-528, 2013.
[45] A. Z. Ostrow, T. Nellimoottil, S. R. Knott, C. A. Fox, S. Tavaré et al., “Fkh1 and Fkh2 bind multiple chromosomal elements in the S. cerevisiae Genome with distinct specificities and cell cycle dynamics,” PLoS One, vol. 9, no. 2, e87647, 2014.
[46] I. Lee, Z. Li, and E. M. Marcotte, “An improved, bias-reduced probabilistic functional gene network of baker's yeast, Saccharomyces cerevisiae,” PLoS One, vol. 2, no. 10, e988, 2007.
[47] J. M. Cherry, C. Adler, C. Ball, S. A. Chervitz et al., “SGD: Saccharomyces genome database,” Nucleic Acids Research, vol. 26, no. 1, pp. 73-79, 1998.
[48] N. Huang, I. Lee, E. M. Marcotte, and M. E. Hurles, “Characterising and predicting haploinsufficiency in the human genome,” PLoS Genetics, vol. 6, no. 10, e1001154, 2010.
[49] D. G. MacArthur, S. Balasubramanian, A. Frankish, N. Huang, J. Morris et al., “A systematic survey of loss-of-function variants in human protein-coding genes,” Science, vol. 335, no. 6070, pp. 823-828, 2012.
[50] E. O. Voit, “Canonical modeling: review of concepts with emphasis on environmental health,” Environmental Health Perspectives, vol. 108, no. 5, pp. 895-909, 2000.
[51] Y. Maki, T. Ueda, M. Okamoto, N. Uematsu, K. Inamura et al., “Inference of genetic network using the expression profile time course data of mouse P19 cells,” Genome Informatics, pp. 382-383, 2002.
[52] N. Noman, and H. Iba, “Inferring gene regulatory networks using differential evolution with local search heuristics,” IEEE/ACM Transactions on Computational Biology and Bioinformatics, vol. 4, no. 4, pp. 634-647, 2007.
[53] W.-P. Lee, and Y.-T. Hsiao, “An adaptive GA-PSO approach with gene clustering to infer S-system models of gene regulatory network,” The Computer Journal, vol. 54, no. 9, pp. 1449-1464, 2011.
[54] H. Cao, L. Kang, Y. Chen, and J. Yu, “Evolutionary modeling of systems of ordinary differential equations with genetic programming,” Genetic Programming and Evolvable Machines, vol. 1, no. 4, pp. 309-337, 2000.
[55] X. Pan, P. Ye, D. S. Yuan, X. Wang, J. S. Bader et al., “A DNA integrity network in the yeast Saccharomyces cerevisiae,” Cell, vol. 124, no. 5, pp. 1069-1081, 2006.
[56] N. Sugawara, T. Goldfarb, B. Studamire, E. Alani, and J. E. Haber, “Heteroduplex rejection during single-strand annealing requires Sgs1 helicase and mismatch repair proteins Msh2 and Msh6 but not Pms1,” Proceedings of the National Academy of Sciences of the United States of America, vol. 101, no. 25, pp. 9315-9320, 2004.
[57] M. C. Costanzo, S. R. Engel, E. D. Wong, P. Lloyd, K. Karra et al., “Saccharomyces genome database provides new regulation data,” Nucleic Acids Research, vol. 42, no. D1, pp. D717-D725, 2014.
[58] E. P. Mimitou, and L. S. Symington, “Sae2, Exo1 and Sgs1 collaborate in DNA double-strand break processing,” Nature, vol. 455, no. 7214, pp. 770-774, 2008.

Dr. Stefan Müller is a research scientist at the Radon Institute for Computational and Applied Mathematics (RICAM) of the Austrian Academy of Sciences. His areas of interest include ODE models of biological and biotechnological systems, in particular, chemical reaction networks, metabolic networks, gene regulatory networks, and bioreactors.

Dr. Yu-Ting Hsiao is a postdoctoral fellow at the Genomics Research Center, Academia Sinica, Taiwan. His research focuses on nonlinear dynamical systems and evolutionary algorithms in computational biology.

Dr. Philipp Kügler holds the professorship for modelling of complex biological systems at the University of Hohenheim and is a co-leader of the research group on mathematical methods for molecular and systems biology at the Radon Institute of the Austrian Academy of Sciences. His current research focus is on the link between pharmacodynamics and inverse bifurcation analysis.

Dr. Wei-Po Lee is a professor at the Department of Information Management, National Sun Yat-sen University, Taiwan. He received his PhD in artificial intelligence from the University of Edinburgh, United Kingdom. His research interests include systems biology, network biology, autonomous systems and artificial life.

Dr. Wei Yang is an associate research fellow at the Shanghai Center for Mathematical Sciences, Fudan University, specializing in nonlinear dynamical systems and inverse problems in systems biology.

Dr. Christoph Flamm, originally trained as an organic chemist, has been an Associate Professor at the Institute for Theoretical Chemistry, University of Vienna, since 2006. His research focuses on the mathematical modeling of the structure and dynamics of (bio)chemical reaction networks and RNA computational biology. As PI in several research projects over many years, he has gained rich experience in the challenges of interdisciplinary research.

Dr. Ivo Hofacker is a professor in the faculties of chemistry and of computer science at the University of Vienna, where he heads the Institute for Theoretical Chemistry and the Research Group Bioinformatics and Computational Biology, respectively. His research interests focus on the computational biology of RNA and structural bioinformatics.
