OpenKnowledge FP6-027253
Summative Report on Bioinformatics Case Studies

Adrian Perreau de Pinninck1, Carles Sierra1, Chris Walton2, David de la Cruz1, David Robertson2, Dietlind Gerloff3, Enric Jaen1, Qian Li4, Joanna Sharman5, Joaquin Abian6, Marco Schorlemmer1, Paolo Besana2, Siu-wai Leung2, and Xueping Quan2

1. Artificial Intelligence Research Institute, IIIA-CSIC, Spain
2. School of Informatics, University of Edinburgh, UK
3. Dept of Biomolecular Engineering, University of California Santa Cruz, USA
4. Gulfstream-software Ltd
5. Institute of Structural and Molecular Biology, University of Edinburgh, UK
6. Institute for Biomedical Research of Barcelona, IIBB-CSIC, Spain.
Report Version: first
Report Preparation Date: 20.11.2008
Classification: deliverable D6.4
Contract Start Date: 1.1.2006
Duration: 36 months
Project Co-ordinator: University of Edinburgh (David Robertson)
Partners: IIIA (CSIC), Barcelona; Vrije Universiteit Amsterdam; University of Edinburgh; KMI, Open University; University of Southampton; University of Trento
1. Introduction

Modern biological experimentation requires computational techniques of different kinds to enable large-scale and high-throughput studies. For example, structural genomics efforts aim to understand the function of proteins from their 3-D structures, which are either determined by experimental methods (e.g., X-ray crystallography and NMR spectroscopy) or predicted by computational methods (e.g., comparative modelling, fold recognition, and ab initio prediction). Proteomics efforts (in the most inclusive sense of the term), as another example, aim to understand the functional consequences of the collection of proteins present in a cell, or tissue, at a given time, particularly where differences are observed between healthy and disease states. In both examples the data, and the analytical methodology applied to them, are obviously central to accomplishing the aims of these scientific domains. In addition, however, a framework is required that allows researchers to access the data, interpret the data, and exchange knowledge with one another.

Most of the infrastructures currently in use enable straightforward access to centrally stored experimental data via large databases, and to tools and services via web servers or repositories. Their existence and availability have undoubtedly been seminal in establishing the important position held by applied bioinformatics research in many biological experimental laboratories today, and they will continue to play a role in the future. However, their versatility (e.g., with respect to heterogeneous data types) and expandability (e.g., with respect to the ease with which data are shared amongst different research groups) are limited. In our view there is still ample room for improvement, even re-invention, of the framework underpinning biological and bioinformatics research interactions. The value of integrating resources and researchers more effectively has been recognised by many others in the field, and recently a number of new infrastructures have emerged that address some of the shortcomings and foster interactions. Many of them are focused exclusively on facilitating bioinformatics research within specific domains, and range from workbench-style environments (e.g. Schlueter 2006) to peer-to-peer based systems for small specialised groups (e.g. Overbeek 2005).

In our own research we were interested in realising a more general framework. In this context, any experimental protocol that is followed when one, or several, researchers undertake a bioinformatics experiment can be viewed as a series of interactions between the researcher(s), the databases from which the data are obtained, and the tools that are applied to derive secondary information from these data. Many bioinformatics protocols can be represented as consecutive interactions, or steps in a workflow. Moreover, an improved or novel framework should build on existing network connections (such as the Internet, or the Grid). Accordingly, the developments of the myGrid project group, such as the Computer-Aided Software Engineering (CASE) tool Taverna (Oinn 2004), currently play the most prominent role in the sector of automated experimentation enactment in bioinformatics. However, one weakness of this design is that, while it facilitates reproducible research, it cannot be extended to support sharing of resources (tools, data, knowledge, etc.) as effectively as is conceivable across a peer-to-peer network.
Below we describe how the OpenKnowledge P2P infrastructure was used to enact bioinformatics analyses: consistency checking amongst comparable data produced by different bioinformatics programs (ranked lists of short amino acid sequences that could have yielded a given tandem mass spectrum; section 4.1) and held in different databases (atomic coordinates of modelled 3-D structures of yeast proteins; section 5), peer ranking to measure the relative popularity of protein identifications by tandem mass spectrometry (MS/MS) (section 4.2), and data sharing for protein identification in proteomics (the OK-omics scenario; section 4.3).
2. Bioinformatics Problems

2.1. Protein Identification by MS/MS in Proteomics

Tandem mass spectrometry (MS/MS) involves multiple steps of mass selection or analysis and has been widely used to identify peptides and analyse complex mixtures of proteins. In recent decades, many specific techniques for identifying protein sequences from MS/MS spectra have been implemented and made available either as downloadable local applications or in the form of web-enabled data processing services. The two most frequently used computational approaches to recognising sequences from mass spectra are:

(a) the peptide fragment fingerprinting approach, in which spectrum analysis is performed specifically for candidate proteins extracted from a database, by building theoretical model spectra (from theoretical proteins) and comparing the experimental spectra with these theoretical model spectra. This takes advantage of public genomic-translated databases (GTDBs) that can be accessed through data-mining software (search engines) which directly relates mass spectra to database sequences. Most of the search engines (MASCOT (Perkins et al., 1999), OMSSA (Geer et al., 2004), SEQUEST (Eng et al., 1994)) are available both as standalone programs querying a local copy of a GTDB and as web services connected to online GTDBs. The main drawback of this approach is that it can only be used where the genome has been sequenced and all predicted proteins for the genome are known; it is not suitable for proteins with post-translational modifications (PTMs) missing from the database, or for proteins from unsequenced genomes.

(b) the de novo sequencing approach, in which the inference of peptide sequences or partial sequences is independent of information extracted from pre-existing protein or DNA databases. Sequence similarity search algorithms have been specially developed to compare the inferred complete or partial sequences with theoretical sequences. Once a protein has been sequenced by de novo methods, one can look for related proteins in a GTDB using a matching algorithm such as MS-Blast (Shevchenko et al., 2001).

Factors that commonly complicate the interpretation of MS/MS spectra include:
• the number of admitted PTMs can multiply the volume of results to be analysed;
• poor quality and noise in mass spectra increase the uncertainty of interpretation; and
• errors in database sequence annotations can lead to misinterpretation.
These obstacles indirectly produce a huge amount of uninterpreted data (for instance, non-matching mass spectra or low-scoring de novo interpreted sequences), which are likely to be discarded. The unmatched data could be due to peptides derived from novel proteins, allelic or species-derived variants of known proteins, or PTMs. At present these uninterpreted data are seldom accessible to other groups involved in the identification of the same or homologous proteins.

2.2. Protein Structure Prediction in Structural Genomics

Protein structure prediction is one of the best-known goals pursued by bioinformaticians. A protein’s three-dimensional (3-D) structure contributes crucially to understanding its function, to targeting it in drug discovery and enzyme design, and so on. However, there is a continually widening gap between the number of protein amino acid sequences deduced rapidly through ongoing genomics efforts and the number of proteins for which atomic coordinates of their 3-D structures have been deposited in the Protein Data Bank (PDB) (Berman et al., 2000), i.e. those determined by structural biology techniques. To bridge this gap, many computational biology research groups have been focussing on developing ever-improving methodology to produce structural models for proteins based on their amino acid sequences. Still, the resulting methods are far from perfect, and no single method consistently produces an accurate model. However, particularly in comparative modelling cases (where a protein with known structure can be used as a template for a protein of interest, based on similarity between their sequences), high-quality modelled structures can be useful resources for biological research (see Krieger 2003 for a review). Consistency checking and consensus building are commonly used strategies in the field for selecting high-quality models from the pool of available models produced by different methods.
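Such a consistency check can itself be cast as an interaction between peers, which is the view taken throughout this report. The following sketch is written in the LCC notation introduced in Section 3.2 and is purely illustrative: the role, message and constraint names (model_requester, predictor, request, model, consistent, predict) are invented for this example and are not those of the case-study interaction models described later. The '<-' operator attaches a constraint that must be satisfied by the local component of the peer playing the role.

  a(model_requester(Seq, P1, P2), R) ::
    request(Seq) => a(predictor, P1) then
    model(Seq, M1) <= a(predictor, P1) then
    request(Seq) => a(predictor, P2) then
    model(Seq, M2) <= a(predictor, P2) then
    null <- consistent(M1, M2)

  a(predictor, P) ::
    request(Seq) <= a(model_requester(_, _, _), R) then
    model(Seq, M) => a(model_requester(_, _, _), R) <- predict(Seq, M)

Here a requester peer asks two predictor peers for models of the same sequence and completes the interaction successfully only if its consistent(M1, M2) constraint holds, i.e. only if the two independently produced models agree.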
3. OpenKnowledge for Bioinformatics Experiments

3.1. OpenKnowledge Infrastructure

The OpenKnowledge system (Siebes et al. 2007) is a fully distributed system that uses peer-to-peer technologies to share interaction models of protocols, and service components (so-called OpenKnowledge Components, or OKCs), across a network (normally the Internet). The OpenKnowledge kernel (downloadable from http://www.openk.org) is a small Java program that can easily be downloaded and run (in a manner similar to downloading and running systems like Skype). Its job is to allow the computer on which it has been installed to discover interaction models available in the peer community; to put together a confederation of peers to take part in any chosen interaction; and to run that interaction to completion if possible. Primarily this involves one peer acting as coordinator of the interaction, and fellow peers managing OKCs that play the changeable roles specified in the interaction model (Figure 1). The OpenKnowledge system consists of three main services, which can be executed by any computer once the kernel is installed:

• the discovery service is a distributed hash table (DHT) in which the shared interaction models and OKCs are stored so that users can search for them and download them;
• the coordination service manages the interactions between OKCs; and
• the execution service consists of the kernel executing the service on the local machine.

The workflow for implementing a new application is as follows. First, the interaction model (or, to a biologist or bioinformatician, the experimental protocol) must be expressed in the specification language LCC (Lightweight Coordination Calculus) (Robertson 2004); LCC is described briefly in Section 3.2. This interaction model is published to the discovery service so that other users can find it and subscribe to play a role in it. A developer, not necessarily the author of the interaction model, then implements OKCs that play the roles defined in the interaction model. Some of these OKCs may be shared across the network by publishing them to the discovery service, i.e. other users may download them onto their local computers. At this point the distributed application is implemented and can be executed using the OpenKnowledge system.

Peers can proactively search for interaction models, or be notified about newly published interaction models they are interested in. In both cases, peers receive a list of interaction models and subscribe to perform one role (or more) in those that best match their capabilities and interests. Any peer in the network can opt to assume the coordinator role and manage the interaction. The coordinating peer, chosen at random by the discovery service from among those that have opted in, is handed the interaction model together with the list of matching peers for the roles. Once a confederation of mutually compatible peers has been assembled, the peers are asked by the coordinator to commit to the services they have subscribed to provide. If the peers commit, the interaction model is executed.
Figure 1: Inside the OpenKnowledge kernel. A peer can store and search interaction models and OKCs in the discovery service. Interaction models specified in LCC define the roles to be played by OKCs. A peer can act as a regular peer managing one, or several, OKCs or it can act as a coordinator for the interactions. An interpreter interprets an interaction model as it is being executed.
3.2. LCC as Experiment Description Language

An interaction model is specified in LCC as a set of clauses, each of which defines how one role in the interaction must be performed by a peer assuming that role. A role is described by the type of the role and an identifier for the peer taking it on; formally we write this as a(Role, Identifier), e.g. a(experimenter, E). The definition of a role is constructed using combinations of the sequence operator (‘then’) and the choice operator (‘or’) to connect messages and changes of role. Messages are either outgoing to another peer in a given role (‘=>’) or incoming from another peer in a given role (‘<=’).
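As a purely illustrative example, returning to the protein identification problem of Section 2.1 (the role and message names used here, experimenter, search_engine, identify, peptides, failure and search, are invented for this sketch and are not taken from the project's actual interaction models), a pair of role clauses might be written as:

  a(experimenter(Spectrum), E) ::
    identify(Spectrum) => a(search_engine, S) then
    (   peptides(Spectrum, RankedList) <= a(search_engine, S)
     or
        ( failure(Spectrum) <= a(search_engine, S) then
          a(experimenter(Spectrum), E) ) )

  a(search_engine, S) ::
    identify(Spectrum) <= a(experimenter(_), E) then
    (   peptides(Spectrum, RankedList) => a(experimenter(_), E) <- search(Spectrum, RankedList)
     or
        failure(Spectrum) => a(experimenter(_), E) )

The experimenter submits a spectrum and then either receives a ranked peptide list or receives a failure message, in which case it re-takes the experimenter role to retry; the second clause is the matching behaviour of the search-engine peer. The ‘<-’ attaches a constraint to a message: here, search(Spectrum, RankedList) would be satisfied by the OKC of the peer playing the search_engine role, which is where an identification program such as those listed in Section 2.1 would actually be invoked.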