BIOINFORMATIC Investigating the dynamic behavior of biochemical

0 downloads 0 Views 299KB Size Report
Dec 16, 2004 - biochemical networks using model families. Marc Daniel Haunschilda, Bernd ..... The following list briefly outlines the special features of. MMT2:.
Bioinformatics Advance Access published December 16, 2004

BIOINFORMATIC Investigating the dynamic behavior of biochemical networks using model families Marc Daniel Haunschild a , Bernd Freisleben b , Ralf Takors c , Wolfgang Wiechert,a∗ a

Department of Simulation, University of Siegen, Paul-Bonatz-Str. 9-11, D-57068 Siegen, Germany, b Dept. of Math. and Comp. Science, University of Marburg, Hans-Meerwein-Str., D-35032 Marburg, Germany, c Institute of Biotechnology, Research Center Juelich, D-52425 Juelich, Germany

ABSTRACT Motivation: Supporting the evolutionary modeling process of dynamic biochemical networks based on sampled in vivo data requires more than just simulation. In the course of the modeling process, the modeler is typically not only concerned with a single model but with sequences, alternatives and structural variants of models. Powerful automatic methods are then required to assist the modeler in the organization and evaluation of alternative models. Moreover, the structure and peculiarities of the data require dedicated tool support. Summary: To support all stages of an evolutionary modeling process, a new general formalism for the combinatorial specification of large model families is introduced. It allows to automatically navigate in the space of models and to exclude biologically meaningless models on the basis of elementary flux mode analysis. An incremental usage of the measured data is supported by using splined data instead of state variables. With MMT2, a versatile tool has been developed as a computational engine, intended to be built into a tool chain. By the use of automatic code generation, automatic differentiation for sensitivity analysis, and grid computing technology, a high performance computing environment is achieved. MMT2 supplies XML model specification and several software interfaces. The performance of MMT2 is illustrated by several examples from ongoing research projects. Availability: http://www.simtec.mb.uni-siegen.de/ Contact: [email protected]

1

INTRODUCTION

Modeling and simulation of biochemical networks is an important issue in computational biology. With the advent of the new disciplines systems biology (Kitano, 2000) and metabolic engineering (Stephanopoulos et al., 1998), which are aiming at a holistic understanding of cellular systems from a theoretical or applied viewpoint, the quantitative ∗ to

whom correspondence should be addressed

Bioinformatics © Oxford University Press 2004; all rights reserved.

understanding of regulatory networks has become a main focus of interest. The investigation of the dynamic behavior of biochemical networks gives complementary information to the structural knowledge about cellular systems obtained from genome, transcriptome and proteome data. In particular, modeling and simulation help to unravel the hierarchically organized cellular regulation networks (Lengeler, 2000) and the complex interactions between cellular components. However, the design of simulation tools strongly depends on the research aims and on the available data. Consequently, a still growing number of different modeling and simulation tools for biochemical networks are currently under development. Many existing tools (Wiechert, 2002) are starting at a stage where the regulatory structure, reaction kinetic terms and kinetic parameters are already given. One group of these tools aims at a holistic understanding of cellular regulation and thus puts emphasis on conceptional modeling and hierarchical decomposition of networks (Takahashi et al., 2003; Tr¨ankle et al., 2000). Another group is more concerned with the analysis of biochemical networks, e.g. by applying the concepts developed in metabolic control theory (Mendes, 1997; Schilling et al., 1999; Sauro, 1993). Some of these tools are not only capable of computing and analyzing the stationary state of a system but also of simulating the dynamical behavior. However, in practice the regulatory structure of biochemical networks and in particular the reaction kinetic details are still only partially known. In fact, even after several decades of enzyme investigation, the regulatory properties of the major biocatalysts are still incomplete (Wiechert, 2002). The main reason is that all these investigations took place under in vitro conditions and may not be representative for the in vivo situation. As an example, the regulatory importance of the phosphotransferase glucose uptake system in E. coli, which cannot investigated in vitro, has been recognized only recently.

Haunschild et al

Consequently, in vivo experiments with high throughput measurement techniques are required to obtain more information about cellular regulation in vivo. Surprisingly, only a few tools are available to support a bottom up model development process based on experimental in vivo data (Rizzi et al., 1997; Westerhoff, 2001). Such data driven tools are not only important for the evaluation of rapid sampling experiments in metabolic engineering, but also for transient observations in gene regulatory processes. In this paper, first it will be discussed how data driven modeling is typically carried out (sections 2-4). Without a supporting software, this is a tedious procedure of setting up models, trying out alternative modeling approaches and evaluating parameter fitted models. To support the modeler in this work, the formalism of model families will be presented (sections 4-9). Two other important methods for data driven modeling are to decrementally use splines instead of state variables and to perform sensitivity analysis for statistical purposes (section 10). In sections 11-12 it will be explained how these methods are implemented in the proposed MMT2 tool. In sections 13 and 14, benchmarking results for sensitivity computations and model variants will be presented. Section 15 concludes the paper, and section 16 gives an outlook on model selection.

2

DYNAMIC SAMPLING DATA

Dynamic sampling is the most important method to obtain information about intracellular biochemical networks. As an example from metabolic engineering, rapid sampling devices for pulse experiments have been developed and applied to strains of S. cerevisae, E. coli and C. glutamicum cells (Koning and Dam, 1992; Rizzi et al., 1997; Schaefer et al., 1999). They serve to measure the dynamic behavior of the cell’s metabolism at a very high sampling rate. To this end, cells are grown in a bioreactor and stimulated by a substrate pulse (Figure 1). A sampling device takes out cells at a frequency of 5/second which are immediately cooled down with −50 o C methanol containing quenching fluid that stops the metabolic activity. Each of these samples stores a snapshot of the cell’s metabolism activity during the experiment. At the moment, over 30 chemical substances in the cell’s main metabolism can be quantified, leading to over 5000 data points for a single experiment. Only recently such a large amount of highly informative in vivo data on biochemical systems has become available. By using rapid sampling devices, new challenging biological questions can be tackled in the field of metabolome research. The resulting data can be very useful when trying to unravel the structure of the metabolic regulation network in vivo.

3

DATA DRIVEN MODELING

An important aspect of modeling - particularly of biological systems - is that model development is an iterative process

2

Fig. 1. By using a rapid sampling device, a very high sampling rate can be achieved when measuring the metabolite pool concentrations in a pulse experiment (Schaefer et al., 1999).

that takes place in a sequence of experiments, simulation runs, model based data evaluation and experimental design (Wiechert and Takors, 2004). In this experimental cycle, an initial guess of a system model is successively improved until finally a satisfactory reproduction of the measured data is achieved. Moreover, even if the data can be reproduced well by a model at the end of this process, this still does not mean that the model has some predictive power. For this purpose, a simulator must supply an accompanying set of tools for model analysis and statistical methods to test the validity of the model. It is a common experience that the final outcome of this process is not the condensated knowledge gathered in the experimental cycle. On the contrary, most insights into the biological system have been obtained from the decisions made in the modeling process. For example, the decision to leave out a certain effect from the model because it has only little influence on the system behavior is very important but nevertheless it is not documented in the final model structure. On the other hand, for those effects that are recognized in the final model it does not become clear why they have been incorporated. To summarize, there is much more knowledge gathered about the biological system in the modeling process than is actually documented in its final result. In particular, an adequate software tool must address the following requirements of an evolutionary modeling process: 1. In the course of the modeling process, the modeler is typically not only concerned with one single model but with sequences, alternatives and structural variants of models. In many situations, several model alternatives have to be tried out simultaneously which in practice can become a tedious procedure. Consequently, model variants should be explicitly supported and managed by the modeling tool. However, this is typically not a linear

Investigating the dynamic behavior of biochemical networks using model families

A v1 v2

B

Michaelis Menten + 1 Inhibitor

-

-

-

-

v3

C

D

v4

v5

(a) Network No. 1, initial guess

(b) Network No. 2, inhibitor effect assumed from pool D

(c) Network No. 3

(d) Network No. 4

(e) Network No. 5

Fig. 2. Sequence of models during a modeling process. Networks 1-4 have the same topology, only the kinetic formulas vary. Network 5 shows a change in the topology introducing a new reaction step. The way from Networks 1-4 to Network 5 is not an elementary step, as shown in Figure 4.

evolution but rather a process with many dead ends, new model setups, trials of model variants and model fusions. 2. The experimental context given by the available concentration data of several system pools must be taken into account. In particular, it is absolutely necessary for any statistical evaluation to incorporate the knowledge about measurement uncertainties and special signal features like steps in the feed concentration. 3. Not only the simulation of a given model, but also tools for parameter fitting, extensive statistical evaluation, experimental design and model discrimination must be offered. These tools are absolutely necessary to support the modeling decisions made in the process. 4. An evolutionary modeling process typically starts with a small model. In this situation, it makes sense to model only a few regulatory loops while other influences are not mechanistically described by the model. If known effectors from unmodeled parts of the system are measured, they can be incorporated into a simulation as an external data source. To this end, it is necessary to replace noisy measurement data by smoothing functions. In the case of pulse experiments, discontinuous signals must also be allowed.

¿From these requirements it becomes clear that simulation is only a very basic task in the modeling process while other algorithms are much more important and time consuming. Since the computational efforts dramatically grow with the size of the models, the usage of high performance computing is mandatory (see section 11).

4

THE MODELING PROCESS

Biochemical networks are usually represented by bipartite graphs with reaction and pool nodes (Wiechert, 2002). The graph can be augmented by additional arcs for effector relations (activation, inhibition). From this general network structure, a commonly used mathematical model structure in eq. 1 is immediately derived. ~x˙ = N · ~v(~ α, ~x),

~x(0) = ~x0

(1)

Here, the vector ~x contains the metabolites, N is the stoichiometric matrix describing the interconnecting fluxes between the metabolites. The fluxes’ quantitative description is enclosed within the vector of functions ~v depending on a vector of parameters α ~ and, of course, the current metabolites concentrations ~x. To support the evolutionary process of model building for biochemical networks, a formalization of the relations between different models is necessary. From an abstract viewpoint, the set of all models defined in a modeling process is given by a family of biochemical network models which are given, in general, as ~x˙ i = Ni · ~vi (~ αi , S, ~xi ),

i∈I

(2)

Here, I is an index set which enumerates all models in the sequential modeling process. However, a linear sequence of models does not reflect the true interdependencies between various models, because there might be branches, variants, dead ends or model fusions. Thus, the index set I must be extended by an additional relation ≺ that indicates the interdependencies between the models. Basically, the relation

3

Haunschild et al

i ≺ j is used if model j is obtained by an elementary network modification from model i. Here, ”elementary” modification means a single change of some reaction kinetics or the network topology, as defined more precisely below. A directed graph is induced by the elementary relation ≺ representing the relations between all models (i.e. a ”network of networks” as shown in Figure 4).

5

KINETIC VARIANTS

The simple example shown in Figure 2 is used to illustrate these concepts in more detail. Assume that the first model (Figure 2a) in the modeling process is a simple branched network with three reaction steps each with a Michaelis Menten type kinetics: v1MM = v1,max

A · A + k1,S

(3)

A comparison of the model with measured data might now indicate that an inhibitory effect of metabolite D on the first reaction step v1 is present. To include this effect in the model, the reaction kinetic term of the first reaction step has to be changed, thus producing the second model (Figure 2b) in the sequence. The second model is called kinetic variant of the first model because some kinetics is changed whereas the network topology is unaltered. For simplicity, the inhibition is described by a multiplicative term, although arbitrary complex kinetics are possible in MMT2. v1Inh = v1,max ·

A 1 · A + k1,S 1 + k1,I

(4)

Obviously, there is just a single parameter added to the system (an inhibition constant kI ), and the Michaelis Menten parameters v1,max , k1,m basically have the same biological meaning in the second model as in the first model. This also becomes clear by the fact that if the limit lim is taken, the kI →0

Michaelis Menten model turns out as a special case of model 1. This elementary extension relationship between the first two models will be formalized by the relation 1 ≺ 2. After trying out the inhibition model, the result might still not be satisfying and the question arises whether the remaining metabolite C might be the inhibitor. But there is also

0 Null Kinetics

1 Michaelis Menten

3

MM + Substrate C Inhibition

4

MM + Substrate C + D Inhibition

Fig. 3. Relationship of kinetic variants represented by a graph. Here, variant 1 is a special case of variants 2 and 3 having the inhibition parameter kI set to zero.

4

6

NETWORK VARIANTS

Up to now, the running example only illustrates kinetic variants of a network which do not change the network topology. Network variants are the second type of modifications in a modeling process. In an elementary network modification, a new reaction step can be added to the network or an existing step can be removed. Assume that additional measurement data becomes available so that the lumped pool A can be subdivided into two parts A and E. Consequently, reaction step v1 (Figure 2d) is subdivided into two steps v6 and v7 (Figure 2e). However, the change from network 4 to network 5 is not an elementary extension or specialization step because three reactions (v1 , v6 and v7 ) are involved. Therefore, auxiliary models have to be introduced to explain the relation between models 4 and 5 by a sequence of elementary modifications (see Figure 4). It will be shown below that steps I-IV do not have to be explicitly formulated by the modeler. The road from 4-5 leads through a sequence of models 4-I-II-5 or alternatively 4-II-IV-5. It turns out that models I, III and IV are meaningless in practice because they have isolated or dead end nodes. Thus, no biologically meaningful stationary state except for zero fluxes can occur which does not make sense for a metabolic network. Model II in turn represents the maximum network which generalizes both 4 and 5. In general, any metabolic network should satisfy the condition that a stationary solution different from zero is possible (see section 8).

7

2 MM + Substrate D Inhibition

another alternative that both C and D have an inhibiting effect on the first reaction step. Consequently, two more reaction kinetic terms have to be tried out giving rise to models 3 and 4 (Figure 2c and Figure 2d) in the sequence. Looking at the interdependencies of the four kinetic variants (Figure 3) of reaction step v1 , it becomes clear that the numbering of the models does not say anything about their logical interdependency. More precisely, model 3 is an elementary generalization of model 1 in the same way as model 2 extends model 1, whereas model 4 is an elementary generalization of both models 2 and 3. Model 4 generalizes model 1 only indirectly via model 2 or 3.

COMBINATORIAL MODEL GENERATION

Up to this point, the modeling process has been described from a retrospective viewpoint in order to show the general structure of the modeling process. When small biochemical networks with less than five reaction steps are considered, it is usually sufficient to carry out the modeling process step by step with elementary alteration operations. The observed discrepancy between model prediction and the measured data in this situation usually gives the right hints to make a certain change in the network structure or the reaction kinetics.

Investigating the dynamic behavior of biochemical networks using model families

a)

III

b)

I

4

V

Network 4 (fig 3d)

IV

II

5

A v1 v2

VI

v6 v7

E

B v3

C

D

v4

v5

Maximal Network

Network 5 (fig 3e)

Fig. 4. a) Some variants of the example network and their neighborhood relationships. In Figure 2, the step from Network 4 to Network 5 was not elementary. As shown here, there are intermediate networks. b) Two other network variants

However, when large biochemical systems are modeled in an evolutionary process, it becomes more and more difficult to get the right ideas of how to change a given network. Usually, there will be several ideas of how to improve the current model and also a combination of these ideas might make sense. In the case of the running example, assume that the current stage of the modeling process is represented by network 5 in Figure 2. All the single steps that led from network 1 to network 5 are now combined to produce a model search space in the following combinatorial way:

1. All reaction steps considered in the different models are united to the maximum network II shown in Figure 4. In this network, the reaction steps v1 , v6 and v7 can be switched on and off. If a metabolite pool becomes isolated in this way, it is automatically removed from the network. For example, when the reaction steps v6 and v7 are both switched off, pool E is removed. The resulting network is network 4. 2. For some reaction steps, kinetic alternatives can be specified by the modeler. The alternatives must be explicitly given by kinetic formulas and also the elementary generalization relations (see figure 4) must be specified. Parameters with the same biological meaning in all kinetic variants must be given the same name in all models. In the example, there is only one reaction step v1 + v6 , for which different reaction alternatives are considered.

The four possible variants for reaction step v1 with their relations are shown in Figure 3. The total number of models which can be obtained in this way is 5 · 5 · 2 = 50. In this situation, it becomes a more and more tedious procedure to fit one model variant after the other to a given data set. Even a small number of possible network operations together with all their combinations can quickly produce a combinatorial explosion of model variants until the number of possible models becomes prohibitive for a manual modeling process. Having the perspective of grid computing (Foster and Kesselman, 2003) in mind, this is a typical task that can be automated to run in parallel on a computer network. Using grid technology, it is possible to explore even a search space of several hundred metabolic models and to fit them to a given data set. (see section 9) Nevertheless, the number of tested models must be reduced. It should be clear that a large number of combinations is not really wanted or biologically meaningful. Through elementary flux modes and user defined constraints, the space of meaningless models is minimized, as described in section 8.

8

EXCLUDING MEANINGLESS MODELS

To increase the efficiency of combinatorial model testing, the number of biologically meaningless models is minimized. It should be clear that fitting the parameters of such a model is useless since the modeler will reject it anyways. In

5

Haunschild et al

this section, the two presented exclusion mechanisms enable the modeler to define a priori what models to be rejected. However, an automatic navigation algorithm might still use the meaningless models as auxiliary intermediate steps to produce a model variant from a given model (see Figure 4). 1. Each model in the search space must be biologically meaningful. Especially those models having inoperable parts are to be rejected. This means that a stationary solution apart from zero fluxes must be possible. Whether such a condition is possible can be automatically determined by computing the elementary flux modes, as described below. Using this mechanism, networks I, III, IV are excluded. 2. Further constraints might be imposed by the user by specifying Boolean expressions that describe which combinations should be excluded. For example, the following constraint enforces that exactly one of the reaction steps v1 and v6 must be the null kinetics. isnull(v1 )XOR isnull(v6 )

If both 1 and 2 are considered, the exclusion mechanisms reduce the model space down to 8 models. Taking into account the combinatorial explosion and the proposed reduction, it should become clear now that the exploration of rather complex model search spaces is well within the range of modern grid computing facilities. Exclusion mechanism 1, namely the method of elementary flux modes (Schuster et al., 2000) breaks down the complete network to elementary sets of fluxes. These so called elementary flux modes can be interpreted as subnetworks that can autonomously operate at steady state. Table 1 and figure 5a show the four modes of the maximum example network II. An easy way to find the ”failure modes” is described in (Klamt and Gilles, 2004). By computing the elementary flux modes of the maximum network, it is possible to forecast what reactions will be inoperable when switching off a particular one. For example, if reaction #6 is switched off, all elementary modes that contain this reaction (Modes 3 and 4) are not operable anymore. In the remaining modes, reaction #7 is always switched off. This indicates that this reaction is cut off.

AUTOMATIC NAVIGATION IN MODEL SPACE

In the previous sections, we have described how model families are built up and how auxiliary models are used to fill the gaps between similar models in the model space to achieve elementary changes between the models. As a consequence,

6

6

1 7

2

3

4

7

2

5

(a) Mode no. 1

3

4

(b) Mode no. 2

5

(c) Mode no. 3

(d) Mode no. 4

Fig. 5. The elementary flux modes of the running example. Figure 4, Network II shows the maximum network, (a)-(d) the four elementary sets of reactions that operate on their own. Note that metabolite A is external in this example.

(5)

It excludes all the introduced intermediate networks I, II, III, V and VI, which reduces the search space to 16 models.

9

6 1

Reaction No.

#1

#2

#3

#4

#5

#6

#7

Mode no. 1 Mode no. 2 Mode no. 3 Mode no. 4

1 1 0 0

1 0 1 0

0 1 0 1

1 0 1 0

0 1 0 1

0 0 1 1

0 0 1 1

Table 1: Tabular notation of the elementary flux modes.

a neighborhood relationship that allows a navigation in the model space comes into existence. The knowledge of the elementary modification relations is very important when the model space is automatically explored by a systems analysis, parameter fitting or optimization algorithm. The relation ≺ enables intelligent evolutionary search algorithms to be implemented that can navigate in the corresponding graph. When a promising candidate for explaining the experimental data has been found, it makes sense to automatically test its neighbors, too. Moreover, the initial parameters taken for the neighbors of a certain model can be partly taken from the parameter fits already obtained for the present model. This example also shows that an algorithm for generating network variants by applying elementary alteration operations to a given network might have to generate biologically meaningless networks on the way. These networks are automatically excluded from further considerations. The user might specify additional constraints to reduce the complexity of the search space.

10

SENSITIVITY ANALYSIS

During the process of modeling, several methods assisting the analyses of the model are needed. Such methods are e.g. parameter fitting, statistical analysis, system analysis and experimental design. A basic operation for all these analysis

Investigating the dynamic behavior of biochemical networks using model families

Fig. 6. One cycle in the modeling process beginning with an initial model that has to be formulated in XML for further processing (step 1).

methods is sensitivity analysis. The typically used forms of sensitivities are: ∂~x ∂ ~x˙ ∂~x (t), (t) (6) (t), ∂~ α ∂ x~0 ∂~x Using the approach of sensitivity equations, one has to differentiate the implicit model equations by the rules of calculus. More precisely, by exploiting the underlying model structure, the sensitivity ODEs in eq. 7 are derived in (Hurlebaus et al., 2002). •

(Dα~ ~x) = N · •

(Dx~0 ~x) = N·

! ∂~v ∂~v + · (Dα~ ~x) ∂~ α ∂~x ∂~v · (Dx~0 ~x) ∂~x

(7)

where D is the total differential.

These equations show that the differentiation only needs to be carried out for the formulas ~v(~ α, ~x), whereas all other calculations can be done by matrix computations. It will be shown later that the required computational effort is in the same order as that for finite differences, which are known to have several numerical deficits. Note that the left hand side is now matrix valued and for every model parameter there are dim(~v) ODEs describing how it effects the respective metabolic pool. This means that dim(~ α) × dim(~v) ODEs have to be generated and to be solved, which can be a very large number in practice.

11

the information gathered on the way and to validate successful solutions. MMT2 is based on the experiences made with its predecessor MMT1 (Hurlebaus et al., 2002). MMT2 was originally developed for the evaluation of rapid sampling experiments in metabolic engineering but can also be applied to any biochemical network described by chemical reactions and reaction kinetics. Its architecture is a complete redesign based on state of the art simulation technology and extending MMT1’s functionality. MMT2 has been designed as a computational engine that supports all computationally expensive steps described above. Thus, a major focus in the development of MMT2 is high performance computing. On the other hand, MMT2 is intended to be used in a tool chain for biochemical network modeling that might also include preprocessing tools like database access and graphical network representations, powerful optimization frameworks or post processing tools like data visualization (Qeli et al., 2003) or statistical (Qeli et al., 2004a,b) analysis. Although some of these tools have already been implemented for MMT2, they are not the subject of this contribution. Likewise, methods for model discrimination and experimental design, that are already supported by MMT2 are described in (Wiechert, 2002). As already outlined, MMT2 is mainly targeted at model based evaluation of experimental data. It is designed for large models with more than 10 ODEs and 100 parameters and large experimental data sets (having several thousands of measured values). In this situation, high performance computing is absolutely crucial. MMT2 and its predecessor MMT1 have been extensively tested with data from rapid sampling experiments.

MMT2: A COMPUTATIONAL ENGINE

The metabolic modeling tool MMT2 has been specifically designed to support the process of modeling, to conserve

12

FEATURES OF MMT2

The following list briefly outlines the special features of MMT2: XML Support. To store metabolic models, MMT2 uses its own XML-format M3L (Metabolic Model Markup Language). It is a dialect of the widely used SBML-format (Hucka and et. al, 2003) and differs only in those parts where extended features of MMT2 (e.g. splines and model variants) come into action. Splines. As stated in section 3, in the beginning of a modeling process, the modeler is not certain at all how the model has to be setup. On the other hand, the modeler already has a large amount of experimental data. Instead of using this experimental entirely for checking and validating the modeling approach it can be used to reduce the model complexity. For instance, complex mechanisms such as the ATP-ADP flow can be substituted by data and the modeler can first focus on simpler mechanisms before tackling parts of the surrounding network. The measured metabolite time course is represented

7

Haunschild et al

dim(~ α) #1 #2 #3 #4 #5

3 22 19 92 122

dim(~x) 4 8 5 9 10

dim(~v) 3 10 13 16 22

r.s. 0.015s 0.016s 0.025s 0.13s 0.32s

f.d. 0.04s 0.24s 0.25s 9.5s 14.0s

10−2 0.057s 0.42s 0.28s 6.07s 4.00s

s.e. 10−4 10−6 0.076s 0.112s 0.75s 0.34s 0.34s 0.39s 0.55s 1.01s 0.38s 1.48s

10−8 0.192s 0.67s 0.63s 2.04s 3.60s

Table 2: r.s. = reference simulation, f.d. = method of finite differences, s.e. = method of sensitivity equations (using several levels of error tolerances). See text for a description of the columns. by a smoothing spline, which are computed by an independent tool and stored as a piecewise polynomial in the M3L format. Code generation. All the tools mentioned in the introduction use an interpreter to perform the simulation, having the advantage of easy usability and simple program installation. MMT2 takes the different way of automatic program generation, having the advantage of very fast execution of a single simulation. Automatic Differentiation. The method of sensitivity equations described in section 10 has been implemented in MMT2. Using ADOL-C (Griewank et al., 1996), an algorithm was written that implements eq. 7. ADOL-C takes care v v and ∂~ in eq 7. The of computing the derivatives such as ∂∂~α ~ ˜ ∂~ x ˜ big advantage is that the undifferentiated code does not have to be touched and the unhandy sensitivity equations do not explicitly occur in the generated code. Variant specification. MMT2 fully supports the formalism of model families presented in the previous sections. MMT2 does not distinguish kinetic variants from network variants. Instead, they are brought together into one description by introducing a null kinetics. In the given example, there is a combinatorially generated family of models. Figure 4 displays the structure of this model variant space. Here, the models related by one elementary generalization or specialization step are linked by an arrow. In the rows of Table 3 these model variants are listed. Fortunately, the model space of all combinatorially possible models contains only few models being biologically meaningful. By using two exclusion mechanisms, the number of possible models is reduced significantly. Grid computing. The MMT2 grid service was realized using the Java based Globus Toolkit 3.0 (GT3) (Foster and Kesselman, 2003) as a service-oriented grid middleware. The grid specific parts of the service are implemented as a Java wrapper that uses the computational methods of the MMT2 simulator, accessed through the JNI since they are implemented in C. All descriptions and communication artifacts that are needed to access the simulator as a grid service are generated from the Java wrapper using a number of tools included in the GT3 distribution. The service is currently

8

used by a basic scheduler that offers the distributed execution of the MMT2 simulator service through a web portal that is accessible using a standard web browser.

13

BENCHMARK RESULTS FOR SENSITIVITY COMPUTATION

Several benchmark examples were used to investigate the performance of MMT2. The models #1, #2 and #3 in table 2 are published as SBML files on the homepage of www.sbml.org and were converted to M3L. Model #1 implements the famous Lorenz equations as a set of reactions. Model #2 is concerned with the mitogen-activated protein kinase pathways. Model #3 is a modified Goldbeter minimal mitotic oscillator. Models #4 and #5 were built in a running research project investigating rapid sampling experiments (Degenring and Takors, 2004) with the aid of MMT2. All of the benchmark examples presented in table 2 were computed on an Athlon 1900+ processor with 0.5 GByte main memory, running Linux, Kernel 2.6. In table 2, the first three columns show the model’s size. Column dim(~ α) gives the number of parameters, dim(~x) gives the number of ODEs and dim(~v) gives the number of fluxes. Column r.s. displays the time of a single simulation run of this model which serves as an indicator of the model’s complexity and stiffness.

4 I II 5 IV III V VI

v1 4 4 4 0 0 0 4 0

v2 1 1 1 1 1 1 1 1

v3 1 1 1 1 1 1 1 1

v4 1 1 1 1 1 1 1 1

v5 1 1 1 1 1 1 1 1

v6 0 4 4 4 4 0 0 0

v7 0 0 1 1 0 0 1 1

# of variants 4 16 16 4 4 1 4 P1 50

Table 3: Tabular representation of the model space by explicitly listing each network variant. The digit 0 denotes usage of the null kinetics, otherwise it denotes the number of possible kinetics.

Investigating the dynamic behavior of biochemical networks using model families

The remaining columns give the results of the benchmark runs. Column f.d. gives the time the algorithm of finite differences needed to calculate the parameter’s sensitivities. In theory, this value should be (dim(~ α) + 1)·r.s., but the reference simulation includes the initialization which is only done once when computing the finite differences. Column s.e. is composed of 4 sub columns each showing the needed time with a specific error tolerance. In general, a smaller tolerance causes a higher precision of the results, but also more simulation steps. The interesting effect with models #4 and #5 is worth mentioning. Here, the computation with an error tolerance of 10−2 takes longer than with 10−4 . This is primarily caused by the behavior of the integrator’s step size control which is irritated because of the low accuracy. This is a common experience with numerical ODE solvers. However, above a certain threshold a higher accuracy always means a longer computation time.

14

BENCHMARK RESULTS FOR MODEL VARIANTS

To test the model selection process and to demonstrate the computing performance, a simple exhaustive search strategy was implemented which will be substituted by a more intelligent search algorithm in the future project work. The strategy simply visits any feasible variant of the model and does a repeated parameter fit with multiple starting points by using the subplex optimization algorithm. The biological model originates from another research project which is concerned with the model based evaluation of a rapid sampling experiment investigating the metabolism of E. coli. In the current stage it is focused on the shikimic acid pathway. The model consists of 8 metabolites and another 7 taken from data as smoothed splines. Its 10 reactions sum up 58 parameters which are fitted to 499 measured values from 6 metabolites. For a model discrimination study, the approach of model families was used. The model family had 196 variants from which 158 were excluded. For each of the remaining 38 models, 15 jobs were created, and every job was run for a maximum of L=8 hours. In total, 570 jobs were created which were distributed on a cluster of 100 Opteron 2 GHz processors having 2 GB main memory. In total, the single simulation run was executed over 5.8 · 108 times and was finished after 51 hours and 41 minutes which means 3117 simulation runs per second on the average. After all jobs were finished, the results are ready to be examinatied by the modeler. Each of the variants was fitted 15 times which usually results in 15 different local minima. These minima give a hint whether or not a variant can be fitted to the measurement data. If the sum of squares is below a certain threshold, the modeler has to evaluate whether the time course fits according to his or her intentions.

15

CONCLUSIONS

A major part of biochemical network modeling consists of modifying a model. To disburden the modeler, the new concept of model families was proposed in this paper. With this method, a multitude of similar models can be formulated in a single description by using network and kinetic variants to specify sets of similar networks. Since such a description leads to a combinatorial explosion which can include a vast number of biologically meaningless models, the two exclusion mechanisms of constraints and elementary flux modes were provided. Based on this description, the software tool MMT2 presented in this paper can automatically test all given models and thus relieve the modeler from this tedious work. At the end of this automatic procedure, a large set of data has been produced which then must be analyzed by the modeler. Future work will include developing adequate data analysis tools for preprocessing (drawing the metabolic network and exporting it to M3L), post-processing and visualization of experimental data and the produced simulation results. ¿From the computational viewpoint, MMT2 will be enhanced by alternative optimization algorithms for parameter fitting.

16

OUTLOOK ON MODEL SELECTION

For good reasons, concrete algorithms for model search have not been discussed in detail in this contribution. Various concepts for model selection can be found in the literature (Linhart and Zucchini, 1986; Burnham and Anderson, 1998; Wiechert and Takors, 2004). However, it should be noticed that classical model selection procedures were developed for comparably small model families, such as polynomials of different degree. When they are applied to large families with many parameters, they generally do not produce clear results. As an example, take the classical Akaike model discrimination criterion (Wiechert and Takors, 2004; Burnham and Anderson, 1998): AIC = SSQ + 2mn/(m − n − 1)

(8)

with SSQ the (weighted) sum of squares, n the number of parameters and m the number of data. To have some realistic numbers, assume that n = 100 and m = 1000 for a biological network with rapid sampling data. Then, for a very good fit the SSQ will be in the order of the degrees of freedom m − n, say SSQ = 1000. On the other hand, the quotient on the right hand side is 2mn/(m − n − 1) ≈ 222. In other words: The number of parameters of the models does (almost) not play any role when an improvement of the SSQ can be achieved by a more complex model. This observation is not significantly different for other model selection criteria. This demonstrates that strategies for the selection of large models should not solely rely on abstract selection criteria because well fitting models of similar complexity will be only judged by their SSQ. Instead, the modeler should guide the

9

Haunschild et al

model selection process by making sensible decisions. Generally, the search must be aimed at simple models that can describe the data sufficiently well. Two different strategies are possible: 1. A global search algorithm is implemented that has a good coverage of the model space. Once a model is found that describes the data sufficiently well, the navigation paths can be used to find out if there are simpler models in the neighborhood which perform equally well. 2. The search algorithm is only used to produce a catalogue of well fitting models with an arbitrary number of parameters. Then, other methods based on sensitivity analysis (Degenring and Takors, 2004) are applied to reduce the model complexity. Whatever strategies will be developed in the future, the MMT2 framework supplies the necessary computational primitives.

ACKNOWLEDGEMENT This project is funded by the German Research Foundation (Deutsche Forschungsgemeinschaft, DFG) within the research program SPP 1063.

REFERENCES Burnham, K. P. and Anderson, D. R. (1998). Model Selection and Interference. Springer. ISBN: 0–387–98504–2. Degenring, D. and Takors, R. (2004). Sensitivity analysis for the reduction of complex metabolism models. J. Process Contr., 14:729–745. Foster, I. and Kesselman, C. (2003). The Grid 2: Blueprint for a New Computing Infrastructure. Morgan Kaufmann Publishers, Inc. Griewank, A., Juedes, D., Mitev, H., Utke, J., Vogel, O., and Walther, A. (1996). Automatic differentiation by overloading in C++. ACM TOMS, 22(2):131–167. Algor. 755. Hucka, M. and et. al (2003). The Systems Biology Markup Language (SBML): A medium for representation and exchange of biochemical network models. Bioinformatics, 19(4):524–531. Hurlebaus, J., Buchholz, A., Takors, R., Alt, W., and Wiechert, W. (2002). MMT - A pathway modeling tool applied to data from rapid sampling experiments. In Silico Biology, 2:1–18. Kitano, H. (2000). Perspectives on systems biology. New Generation Computing, pages 199–216. Klamt, S. and Gilles, E. (2004). Minimal cut sets in biochemical reaction networks. Bioinformatics, 20:226–234. Koning, W. d. and Dam, K. v. (1992). A method for the determination of changes of glycolytic metabolites in yeast on a subsecond time scale using extraction at neutral pH. Anal. Biochem., 204:118–123.

10

Lengeler, J. W. (2000). Metabolic networks: a signal-oriented approach to cellular models. Biol. Chem., 381(9-10):911–20. Linhart, H. and Zucchini, W. (1986). Model Selection. Wiley. ISBN: 0–471–83722–9. Mendes, P. (1997). Biochemistry by numbers: simulation of biochemical pathways with Gepasi 3. Trends Biochem. Sci., 22:361–363. Qeli, E., Wiechert, W., and Freisleben, B. (2003). Metvis: A tool for designing and animating metabolic networks. In The 2003 European Simulation And Modelling Conference (ESMc’2003), pages 333–338. EUROSIS. Qeli, E., Wiechert, W., and Freisleben, B. (2004a). Visualization of sensitivity matrices generated during simulations of metabolic network models. In Proceedings of the 2004 IASTED International Conference on Applied Simulation and Modelling (ASM 2004), pages 583–589. ACTA Press. Qeli, E., Wiechert, W., and Freisleben, B. (2004b). Visualizing time-varying matrices using multidimensional scaling and reorderable matrices. In Proceedings of the 2004 International Conference on Information Visualization (IV2004), pages 561–567. IEEE Press. Rizzi, M., Baltes, M., Theobald, U., and Reuß, M. (1997). In vivo analysis of metabolic dynamics in Saccharomyces cerevisiae: II. Mathematical model. Biotechnol. Bioeng., 55(4):592–608. Sauro, H. (1993). SCAMP: a general-purpose simulator and metabolic control analysis program. Comput. Applic. Biosci., 9. Schaefer, U., Boos, W., Takors, R., and Weuster-Botz, D. (1999). Automated sampling device for monitoring intracellular metabolite dynamics. Anal. Biochem., 270:88–96. Schilling, C. H., Schuster, S., Palsson, B., and Heinrich, R. (1999). Metabolic pathway analysis: basic concepts and scientific applications in the post-genomic era. Biotechnol. Prog., 15:296–303. Schuster, S., Dendekar, T., and Fell, D. (2000). Description of the algorithm for computing elementary flux modes. http://bmsmudshark.brookes.ac.uk/algorithm.pdf. Stephanopoulos, G., Aristidou, A. A., and Nielsen, J. (1998). Metabolic engineering-principles and methodologies. Academic Press, San Diego, CA. Takahashi, K., Ishikawa, N., Sadamoto, Y., Sasamoto, H., Ohta, S., Shiozawa, A., Miyoshi, F., Naito, Y., Nakayama, Y., and Tomita, M. (2003). E-Cell 2: Multi-platform E-Cell simulation system. Bioinformatics, 19:1727–1729. Tr¨ankle, F., Zeitz, M., Ginkel, M., and Gilles, E. (2000). ProMoT: a modeling tool for chemical processes. Math. Comp. Model. Dyn., 6:283–307. Westerhoff, H. (2001). The silicon cell, not dead but live! Metab Eng., 3:207–210. Wiechert, W. (2002). Modelling and simulation: Tools for metabolic engineering. J. Biotechnol., 94(1):37–63. Wiechert, W. and Takors, R. (2004). Validation of metabolic models: concepts, tools, and problems. In Kholodenko, B. N. and Westerhoff, H. V., editors, Metabolic Engineering in the Post Genomic Era. Horizon Bioscience, Wymondham, England.

Suggest Documents