Bioinformatics Advance Access published July 16, 2008

BioBayes: A Software Package for Bayesian Inference in Systems Biology

Vladislav Vyshemirsky 1,∗ and Mark Girolami 1

1 Department of Computing Science, University of Glasgow, G12 8QQ, UK

Associate Editor: Dr. Olga Troyanskaya

∗ To whom correspondence should be addressed.

© The Author (2008). Published by Oxford University Press. All rights reserved. For Permissions, please email: [email protected]

Downloaded from http://bioinformatics.oxfordjournals.org/ by guest on June 2, 2013

ABSTRACT
Motivation: There are several levels of uncertainty involved in the mathematical modelling of biochemical systems. There often may be a degree of uncertainty about the values of kinetic parameters, about the general structure of the model, and about the behaviour of biochemical species which cannot be observed directly. The methods of Bayesian inference provide a consistent framework for modelling and predicting in these uncertain conditions. We present a software package for applying the Bayesian inferential methodology to problems in Systems Biology.
Results: Described herein is a software package, BioBayes, which provides a framework for Bayesian parameter estimation and evidential model ranking over models of biochemical systems defined using ordinary differential equations. The package is extensible, allowing additional modules to be included by developers. There are no other such packages available which provide this functionality.
Availability: http://www.dcs.gla.ac.uk/BioBayes/
Contact: [email protected]

1 INTRODUCTION
Inferring the structure of biochemical systems from experimental observations is one of the important challenges in Systems Biology (Burbeck and Jordan, 2006). Such structures are usually defined with mathematical models. One of the advantages of using formal mathematical models is the possibility to make predictions of system behaviour alongside explaining the observed processes. Systems of ordinary differential equations (ODE) are a widely used formalism for modelling biochemical systems (see, for example, de Jong, 2003; Voit, 2000). Inferring parameters of ODE models of biochemical systems can be achieved using methods of Bayesian inference (Golightly and Wilkinson, 2005; Rogers et al., 2006), and evidence-based ranking of alternative models is possible using Bayes factors (Vyshemirsky and Girolami, 2008).

The main benefit of adopting the Bayesian approach to model inference is the consistent propagation of uncertainty through all the stages of analysis and the formal way in which prior knowledge can be included in the modelling process. This approach allows one to consider noisy observations as a source of data for learning full distributions of beliefs rather than restricting oneself to the most plausible explanation of some phenomenon. So, instead of making future predictions based on one's best guess, the Bayesian approach considers all probable outcomes.

Implementing the methods of Bayesian inference for probabilistic analysis of biochemical models, however, requires addressing many technical problems such as solving initial value problems for stiff systems of differential equations (Press et al., 2002), or estimating effective proposal distributions for satisfactory convergence of Markov Chain Monte Carlo (MCMC) algorithms (Gelman et al., 1995). In recent years the scientific community has also formulated a number of standards for a unified description and exchange of data and models, for example, the SBML standard (Hucka et al., 2003) for models of biochemical systems. At the same time, working with ad hoc implementations of inference algorithms usually requires fine-tuning for each particular problem.

We herein present an extensible software package, BioBayes, which supports standard definitions of mathematical models, and provides a framework for applying methods of Bayesian inference to ODE models of biochemical systems. In addition to implementations of general inference and model comparison methods, BioBayes provides an infrastructure for plugging in user-specific methods using standard interfaces, thus enabling fine tailoring of the tool to the user's specific requirements if needed. Bayesian inference over ODE models can also be performed using some other tools, for example WinBUGS (Lunn et al., 2004) with the WBDiff extension; BioBayes, however, supports importing biological models in the SBML standard. The support for model exchange standards saves significant modelling effort when using BioBayes.

2 METHODS
BioBayes is built using the Java virtual machine for its user interface, while using platform-specific libraries for efficient computations. Its modular architecture is based on the Eclipse Rich Client Platform, which allows straightforward integration with many existing extensions (e.g. version control of documents). This modular structure also allows users to build their own extensions to BioBayes, ranging from additional algorithms for more effective inference in specific classes of problems to new types of models and editors.

The software package allows users to organise their files in projects, import SBML models, create datasets in such projects, and define tasks for parameter inference or model comparison. The release version of BioBayes published at the official web site http://www.dcs.gla.ac.uk/BioBayes/ includes two MCMC methods for model parameter inference and one method for model comparison. We estimated the computational speed of these algorithms to be approximately the same as that of ad hoc Matlab samplers.

Metropolis-Hastings Sampler
The implementation of the Metropolis-Hastings sampler (Hastings, 1970) utilises Markov Chain Monte Carlo methods and enables model parameter inference from experimental data for simpler models of biochemical systems. Users can define the desired prior distributions for model parameters, run this sampler to infer parameter posteriors using one or more experimental datasets, and monitor progress via the results pane. We optimise the proposal distribution of the Metropolis-Hastings sampler for more effective convergence of Markov Chains by scaling the proposal variance proportionally to the local acceptance ratio and also by adjusting the proposal covariance matrix to a local approximation of the posterior distribution as described by Gelman et al. (1995).

This implementation of the Metropolis-Hastings sampler allows users to run several chains at the same time to monitor the convergence of the sampler to the true posterior distribution by comparing within-chain variance of the sample to between-chain variance as proposed by Gelman et al. (1995). The $\hat{R}$ statistic is computed for that purpose for each of the model parameters. Current values of $\hat{R}$ are displayed in the results pane of the programme. The $\hat{R}$ values approach one as the chains mix. The software allows users to define an acceptable threshold for the $\hat{R}$ values, and the programme assesses that the chains have converged to the true posterior after all the values fall below that threshold. Users, of course, can override that convergence criterion and use, for example, a simple limit on the number of steps during the initialisation of the chains.

When it is judged, using the $\hat{R}$ statistic, that the chains have converged to the posterior distribution, the programme produces the final posterior sample, performing sample thinning if required by the user. The marginalised projections of this sample are then displayed, and the sample itself can be exported for further analysis using external tools.

Population-based MCMC
There is also a population-based MCMC sampler (Jasra et al., 2007) available that can be applied to more complex problems when straightforward Metropolis-Hastings fails to converge, e.g. when using nonlinear oscillator models (see tutorial package and examples on the official web site). This sampler runs several Markov chains in parallel using a tempered sequence of distributions as their targets. Moves between different chains in such a sequence of distributions help the sampler to overcome energy barriers and therefore sample more efficiently from multi-modal posterior distributions. The number of steps in such a sequence can be adjusted by the user. The convergence of this sampler to the true posterior distribution is again judged by using the $\hat{R}$ statistic over several population-based MCMC samplers run simultaneously.

Annealing-Melting Integration
Annealing-melting integration can be used to compute marginal likelihoods, the quantity used for evidence-based ranking of alternative models (see Vyshemirsky and Girolami, 2008). This algorithm is based on the population-based MCMC sampler described above. The samples from the tempered sequence of target distributions are used to estimate the marginal likelihoods with thermodynamic integrals. Several population-based MCMC samplers are run simultaneously to evaluate their convergence to the true posterior distribution, and at the same time the standard deviation of the final estimate is computed using this set of simultaneous samplers.

3 APPLICATIONS
Consider an example of performing Bayesian inference over the parameters of a model of exponential protein decay. The concentration of protein $S$ undergoing the decay process may be defined using the following differential equation: $\frac{dS}{dt} = -k_1 \cdot S$, where $k_1$ is the decay rate parameter. The overall statistical model has two more parameters: the initial concentration of the decaying protein $S|_{t=0}$, and the observation noise variance $\sigma^2$. Arbitrarily selecting values for these parameters, $S|_{t=0} = 1$, $k_1 = 0.1$ and $\sigma^2 = 0.1$, we generated an example dataset $D$.

Priors were then assigned to the model parameters. A singular prior $S|_{t=0} = 1$ was used for parameter $S|_{t=0}$, while parameters $k_1$ and $\sigma^2$ were assigned a Gamma prior $\Gamma(1, 2)$. The Metropolis-Hastings sampler from the BioBayes package was then applied to infer the parameter posterior. The inferred posterior sample has mean of $\bar{k}_1 = 0.132$, $\bar{\sigma}^2 = 0.126$ and standard deviation of $s^2_{k_1} = 0.02$, $s^2_{\sigma} = 0.04$, which agrees well with the original values used for generating dataset $D$. A detailed tutorial which includes this example and examples of parameter inference and model ranking for nonlinear circadian controllers is available from the official web site.

4 SUMMARY
BioBayes can import SBML descriptions of biochemical models together with experimental data to perform consistent Bayesian learning of model parameter values. It can also estimate marginal likelihoods of alternative models for evidence-based model comparison and ranking. The software is built using a modular architecture which enables straightforward extension by users.

ACKNOWLEDGEMENTS
This research is funded by Microsoft Research within the “Modelling and Predicting in Biology and Earth Sciences 2006” programme. Mark Girolami is an EPSRC Advanced Research Fellow, EP/E052029/1.

REFERENCES
Burbeck, S. and Jordan, K. E. (2006). An assessment of the role of computing in systems biology. IBM J. RES & DEV., 90(6), 529–543.
de Jong, H. (2003). Modeling and simulation of genetic regulatory systems: a literature review. Lecture Notes in Computing Science, 2602, 149–162.
Gelman, A., Carlin, J. B., Stern, H. S., and Rubin, D. B. (1995). Bayesian Data Analysis. Chapman & Hall.
Golightly, A. and Wilkinson, D. J. (2005). Bayesian inference for stochastic kinetic models using a diffusion approximation. Biometrics, 61(3), 781–788.
Hastings, W. K. (1970). Monte Carlo sampling methods using Markov chains and their applications. Biometrika, 57, 97–109.
Hucka, M. et al. (2003). The systems biology markup language (SBML): A medium for representation and exchange of biochemical network models. Bioinformatics, 19(4), 524–531.
Jasra, A., Stephens, D. A., and Holmes, C. C. (2007). On population-based simulation for static inference. Stat Comput, 17, 263–279.
Lunn, D. J., Thomas, A., Best, N., and Spiegelhalter, D. (2004). WinBUGS - A Bayesian modelling framework: Concepts, structure, and extensibility. Statistics and Computing, 10(4), 325–337.
Press, W. H., Teukolsky, S. A., Vetterling, W. T., and Flannery, B. P. (2002). Numerical Recipes in C++: The Art of Scientific Computing. Cambridge University Press.
Rogers, S., Khanin, R., and Girolami, M. (2006). Bayesian model-based inference of transcription factor activity. BMC Bioinformatics, 8(2).
Voit, E. O. (2000). Computational Analysis of Biochemical Systems. Cambridge University Press.
Vyshemirsky, V. and Girolami, M. A. (2008). Bayesian ranking of biochemical system models. Bioinformatics, doi:10.1093/bioinformatics/btm607.
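The exponential protein decay example from the Applications section can be approximated in a few lines of Python. The following is only a minimal sketch under stated assumptions and is not the BioBayes implementation: it uses the closed-form solution of the decay equation instead of a numerical ODE solver, a fixed-width Gaussian random-walk proposal instead of the adaptive proposal described for BioBayes, hypothetical observation times (the source does not state them), and it reads the Gamma prior Γ(1, 2) as shape 1 and scale 2. All function names are illustrative. Several chains are run from dispersed starting points and compared with the potential scale reduction statistic of Gelman et al. (1995).

```python
import math
import random


def simulate(k1, times, s0=1.0):
    """Closed-form solution S(t) = s0 * exp(-k1 * t) of dS/dt = -k1 * S."""
    return [s0 * math.exp(-k1 * t) for t in times]


def log_posterior(k1, var, times, data):
    """Unnormalised log posterior: Gaussian noise, Gamma(shape=1, scale=2) priors (assumed)."""
    if k1 <= 0.0 or var <= 0.0:
        return -math.inf  # outside the support of the Gamma priors
    sq_err = sum((y - m) ** 2 for y, m in zip(data, simulate(k1, times)))
    log_lik = -0.5 * len(data) * math.log(2.0 * math.pi * var) - 0.5 * sq_err / var
    log_prior = -(k1 + var) / 2.0  # Gamma(1, scale 2) density is proportional to exp(-x / 2)
    return log_lik + log_prior


def metropolis_hastings(times, data, n_steps, step, rng, start):
    """Random-walk Metropolis-Hastings over (k1, var) with a fixed-width proposal."""
    k1, var = start
    current = log_posterior(k1, var, times, data)
    chain = []
    for _ in range(n_steps):
        k1_new = k1 + rng.gauss(0.0, step)
        var_new = var + rng.gauss(0.0, step)
        proposed = log_posterior(k1_new, var_new, times, data)
        # Symmetric proposal, so accept with probability min(1, posterior ratio).
        if rng.random() < math.exp(min(0.0, proposed - current)):
            k1, var, current = k1_new, var_new, proposed
        chain.append((k1, var))
    return chain


def r_hat(chains):
    """Potential scale reduction factor for one parameter from equal-length chains."""
    m, n = len(chains), len(chains[0])
    means = [sum(c) / n for c in chains]
    grand = sum(means) / m
    between = n * sum((mu - grand) ** 2 for mu in means) / (m - 1)
    within = sum(sum((x - mu) ** 2 for x in c) / (n - 1)
                 for c, mu in zip(chains, means)) / m
    pooled = (n - 1) / n * within + between / n
    return math.sqrt(pooled / within)


rng = random.Random(0)
times = [float(t) for t in range(0, 31, 2)]  # hypothetical observation times
true_k1, true_var = 0.1, 0.1                 # generating values from the text
data = [s + rng.gauss(0.0, math.sqrt(true_var)) for s in simulate(true_k1, times)]

starts = [(0.05, 0.05), (0.3, 0.3), (1.0, 1.0)]  # dispersed starting points
# Discard the first 1000 steps of each chain as burn-in.
chains = [metropolis_hastings(times, data, 6000, 0.05, rng, s)[1000:] for s in starts]

k1_samples = [k for c in chains for k, _ in c]
var_samples = [v for c in chains for _, v in c]
print("posterior mean k1:", sum(k1_samples) / len(k1_samples))
print("posterior mean var:", sum(var_samples) / len(var_samples))
print("R-hat k1:", r_hat([[k for k, _ in c] for c in chains]))
print("R-hat var:", r_hat([[v for _, v in c] for c in chains]))
```

With so few noisy observations the posterior is fairly broad, so the printed means will not reproduce the figures reported in the text exactly. R-hat values close to one indicate that the chains agree with each other, which is the convergence criterion the paper describes.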