A Comparison of Computational Methods for the Maximum Contact Map Overlap of Protein Pairs

N. Krasnogor¹ • G. Lancia² • A. Zemla³ • W.E. Hart⁴ • R.D. Carr⁴ • J.D. Hirst⁵ • E.K. Burke¹

¹ Automatic Scheduling, Optimisation and Planning Research Group, School of Computer Science and IT, University of Nottingham, NG8 1BB, Nottingham, United Kingdom

² Dipartimento di Matematica e Informatica, University of Udine, 33100, Udine, Italy

³ Biology and Biotechnology Research Program, Lawrence Livermore National Laboratory, Livermore, CA 94550, USA

⁴ Sandia National Laboratories, PO Box 5800, MS 1110, Albuquerque, NM 87185, USA

⁵ Computational Biophysics and Chemistry Group, School of Chemistry, University of Nottingham, NG7 2RD, Nottingham, UK


The maximum contact map overlap (MAX-CMO) between a pair of protein structures can be used as a measure of protein similarity. It is a purely topological measure and does not depend on the sequences of the proteins involved in the comparison. More importantly, MAX-CMO presents a very favorable mathematical structure which allows the formulation of integer, linear and Lagrangian models that can be used to obtain guarantees of optimality. It is not the intention of this paper to discuss the mathematical properties of MAX-CMO in detail, as this has been dealt with elsewhere [13],[23],[1]. In this paper we compare three algorithms that can be used to obtain maximum contact map overlaps between protein structures, and we point out the weaknesses and strengths of each one. It is our hope that this paper will encourage researchers to develop new and improved methods for protein comparison based on MAX-CMO.

(Protein Structure Comparison; Local-Global Alignment; Genetic Algorithms; Memetic Algorithms; Lagrangian Relaxation)

1 Introduction

The comparison of protein structures lies at the intersection of various computational genomics problems. At the time of writing, more than 30 genomes are fully sequenced and available through the web¹, and a total of 21838 structures are stored in the PDB [4]. Structural genomic endeavors are concerned with comparing and evaluating sequences, structures and, perhaps more importantly, biological functionalities among proteins. Because experimental methods are limited in their ability to infer the likely function of a gene or set of genes, and because to a large extent the biological function of a protein is determined by its three dimensional structure, biologists usually resort to comparing a target protein structure with the structures of proteins whose function has already been determined. That is, identifying structural similarities is an essential first step in the correct assessment of the relation between structure and function in proteins. Moreover, without the ability to perform reliable and efficient structural matching it will not be possible to carry out rational drug design. Additionally, we have noted [38]:

“With the significant expansion of activity in the structure prediction field, processing and subsequent analysis of predictions has become increasingly complex.... One of the lessons learned from CASP is that analyzing the effectiveness of prediction methods is not a trivial matter... To evaluate predictions, first we need an analytical approach to identify what in a prediction worked and what failed. Second we need a comparative approach, using both general and specialized techniques, to identify which methods work best, and which address a specific aspect of prediction most successfully.”

What this means is that there is yet another important role for structural matching algorithms, which is the evaluation of ab-initio, threading or homology modeling structure predictions. The quality of a prediction is assessed by comparing a target structure with the predicted structure (the model) in order to determine which regions of the latter closely resemble the former. Hence, any advancement in structural matching will also have an impact on structure prediction and its evaluation methodologies, an example of which is the CAFASP initiative [10].

¹ Visit the NCBI site ftp://ncbi.nlm.nih.gov/genbank/genomes

Clustering of proteins in families (see e.g. [15],[29]) is also an important aid in structural genomics in particular and biomedical sciences in general. The clustering problem can be roughly decomposed into two sub-problems. On the one hand, a proper similarity measure needs to be found and, on the other, a good clustering technique must be employed. A good algorithmic solution to the MAX-CMO problem constitutes a solution to the first of these sub-problems.

A variety of structure comparison methods have been proposed and used in classification servers such as SCOP [28], DALI [14], LGA [36],[38], etc., but the problem of comparing protein structures is still considered open. Methods for this problem range from dynamic programming [33], comparisons of distance matrices [14] and maximal common sub-graph detection [2] to geometrical matching [34], to name but a few. There is no consensus among scientists on which of these methods is the best one, as there are various difficulties associated with each of them. Some difficulties are related to the fact that these methods implicitly assume that a suitable scoring function can be defined whose optimum values correspond to the best possible structural match between two structures. Methods that are based on root-mean-square distances, e.g. [25],[9], and on differences of distance matrices, e.g. [15], present numerical instability problems; other algorithms cannot produce a proper ranking, due either to an ambiguous definition of the similarity measure or to the fact that they neglect alternative (different) solutions that present equivalent values of similarity.
More recent approaches that have attempted to address these problems can be found in [22], [38] and [16]. An excellent survey of various (37 in total) similarity measures can be found in [26]. There is yet another, often overlooked, problem associated with some of the most established comparison methods: similarity can be measured (at least, but not only) both by the minimum root mean square deviation (RMSD) attainable between two structures and by the number of equivalent residues (for a suitable definition of equivalence) between the two structures. However, these two measures are not independent of each other, and the optimisation of one does not follow from the optimisation of the other. Although the structural comparison problem should be treated as a truly multi-objective optimisation problem, comparison servers such as ProSup [35] optimise the number of equivalent residues while using the RMSD as an additional constraint (but not as another search dimension). DALI [14], on the other hand, combines various derived measures into one value, effectively transforming a multi-objective problem into a (weighted) single-objective one.

All the previously mentioned methods can be roughly divided into three main approaches to structural comparison. The first is represented by those methods that consider one of the protein structures as fixed and rotate the second structure as a rigid body, trying to minimize the root mean square deviation between the two rigid bodies [17]. The second, related but not identical to the one we investigate here, relies on a similarity measure based on distance matrices; this is best exemplified by [14]. Finally, the contact map overlap based similarity originally introduced in [11] is the only one of the three approaches that does not require a pre-calculated set of residue equivalences, as it is precisely that equivalence by which the overlap is defined.

Goldman et al. [13] suggest that, among various desirable features, a similarity measure should not penalize insertions and deletions too heavily and should be reasonably robust, in that small perturbations of the definition should not make too much difference to the measure itself. Also, a good similarity measure should be easy to compute (or approximate), and should capture local as well as global alignments. Moreover, they argue that it is important that the measure is accepted in the field by protein scientists. The aim of this paper is to show that the maximum overlap of contact maps can be considered a useful measure of protein similarity. We compare the results from the LGA server [36],[3], used to obtain contact map overlaps, with the results from two other methodologies, namely a memetic evolutionary algorithm and a Lagrangian relaxation based algorithm.

2 The Maximum Contact Map Overlap Problem

In its simplest form, a contact map [24],[7],[27] is a matrix derived from all pairwise distances within a protein. It is a minimalist representation of a protein's native three dimensional structure and it takes the following form:

\[
S_{i,j} =
\begin{cases}
1 & \text{if residues } i \text{ and } j \text{ are in contact} \\
0 & \text{otherwise}
\end{cases}
\tag{1}
\]

Residues i and j are said to be in contact if they are less than R Angstroms away from each other; R is often called the “threshold” of the contact map, and its usual values are between 2 and 9 Angstroms. In this version of the contact map, distances are not explicitly represented; rather, a Boolean value is assigned to each matrix cell specifying whether two residues are considered neighbors or not². Contact maps may be calculated from the distance between the Cα atoms of the residues under consideration, or from the minimum distance between any two atoms belonging to those residues. In some cases contact maps are also computed from the distances between the centers of mass of the residues’ side chains. As mentioned before, the contact map captures the three dimensional structure of proteins; for example, if the matrix S_{i,j} is graphically represented as a black-and-white dot matrix, α-helices will appear as wide bands on its main diagonal, while β-sheets will manifest themselves as bands parallel or perpendicular to the diagonal. Several software packages, some of them in the public domain, permit the user to compute and display contact maps³ (see [32],[31] and references therein).

Given contact maps for a pair of proteins, protein similarity can be computed by aligning the two contact maps. An alignment of two proteins is a pairing of amino acids between them. For example, in figure 3 we show a candidate alignment (equivalent residues are identified by blue alignments) between the contact maps of the dissimilar proteins 1aa9 (fig. 1) and 1hnf (fig. 2) (shown also in table 7). The value of an alignment is determined by the size of the common subgraph identified by the alignment, that is, the number of edges that are the same in each graph given the alignment of amino acids. As another example, consider the contact map overlap (figure 6) for the more structurally related proteins 1ash (fig. 4) and 1hlm (fig. 5) (shown also in table 1).
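As a minimal sketch of the thresholding rule in equation (1), the following Python fragment builds a 0-1 contact map from one representative coordinate per residue (for instance the Cα position). The function name `contact_map`, the toy coordinates and the 6 Angstrom threshold are illustrative assumptions, not values taken from the experiments in this paper; leaving the diagonal at 0 is likewise a convention chosen here.

```python
import math


def contact_map(coords, threshold):
    """Return a 0-1 matrix S with S[i][j] = 1 when residues i and j
    are closer than `threshold` (same units as the coordinates).

    `coords` holds one (x, y, z) point per residue. The diagonal is
    left at 0 (an assumption of this sketch).
    """
    n = len(coords)
    return [
        [
            1 if i != j and math.dist(coords[i], coords[j]) < threshold else 0
            for j in range(n)
        ]
        for i in range(n)
    ]


# Hypothetical toy structure with four residues (coordinates in Angstroms).
toy_coords = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0), (3.8, 3.8, 0.0)]
cmap = contact_map(toy_coords, threshold=6.0)
for row in cmap:
    print(row)
```

With real-valued entries instead of the Boolean test, the same loop would yield the distance-matrix variant mentioned in the text.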
It can be seen from the comparison of figures 3 and 6 that in the case of two similar proteins the detected alignment is strong, while for different proteins the alignment is rather random.

² In more specific formulations, the entries in S_{i,j} will be positive real numbers specifying the inter-residue distances.

³ A Java based program for contact maps can be downloaded from www.cs.nott.ac.uk/~nxk/USM/protocol.html

Figure 1: Three dimensional native structure (a) for protein 1aa9 as taken from the PDB [4] and its contact map (b).

Figure 2: Three dimensional native structure (a) for protein 1hnf as taken from the PDB [4] and its contact map (b).

The first rigorous approach to contact map overlap was introduced by Lancia et al. in [23] and later improved (by different means) in [6] and [1]. This approach is based on an effective integer programming (IP) formulation of protein structure contact map overlaps and the development of a branch and cut strategy that uses lower bounding heuristics at the branch nodes. Also, a linear programming (LP) formulation exists that provides upper bounds on the value of the optimal alignments. These upper bounds allow us to compare the results produced by various algorithms (lower bounds) with those of the LP formulation. Having both upper and lower bounds for the value of the structural overlap of two proteins is a strong guarantee of quality for the alignment and an indication of how similar the two protein structures really are. Although the problem of aligning contact maps might seem somewhat easier than aligning matrices with real values, the problem remains NP-complete [12],[13],[18].
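The value of a given alignment can be evaluated directly from the two contact lists. The sketch below counts the contacts of the first map whose two end-points are aligned to residues that are in contact in the second map. The function name `overlap`, the dictionary representation of the alignment and the toy data are assumptions made for illustration only.

```python
def overlap(contacts1, contacts2, alignment):
    """Count the contacts shared by two contact maps under an alignment.

    contacts1, contacts2: iterables of residue index pairs (i, j).
    alignment: dict mapping residues of protein 1 to residues of
               protein 2 (assumed one-to-one in this sketch).
    """
    # Contacts are undirected, so store them as unordered pairs.
    contacts2_set = {frozenset(c) for c in contacts2}
    shared = 0
    for i, j in contacts1:
        # Both end-points must be aligned for the contact to count.
        if i in alignment and j in alignment:
            if frozenset((alignment[i], alignment[j])) in contacts2_set:
                shared += 1
    return shared


# Hypothetical toy instance: residue 1 of protein 1 is left unaligned.
toy_contacts1 = [(0, 2), (1, 3), (2, 4)]
toy_contacts2 = [(0, 2), (1, 4), (2, 3)]
toy_alignment = {0: 0, 2: 2, 4: 3}
print(overlap(toy_contacts1, toy_contacts2, toy_alignment))  # prints 2
```

Any heuristic that proposes alignments can use such a function as its objective; the values it returns are lower bounds on the maximum overlap.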


Figure 3: A candidate alignment of value 42 for the contact maps of proteins 1aa9 and 1hnf. This alignment was generated with the Memetic evolutionary algorithm. The overlap value obtained is superior to the one found by LGA (20) but still inferior to the final value reported by the MA (66) or the Lagrangian relaxation method (110). See table 7 for other alignments involving these proteins.

2.1 A Branch and Cut Approach for the Maximum Contact Map Overlap Problem

In this section we briefly describe the first successful and rigorous approach to contact map overlaps, as originally introduced in [23]. The IP formulation introduced here was later improved in [1] and is at the core of the Lagrangian method we use in this paper. Formally, a contact map is represented as an undirected graph that gives a concise representation of a protein’s 3D fold. In this graph, each residue is a node and there is an edge between two nodes if the corresponding residues are in contact as described before. An alignment between two contact maps is an assignment of residues in the first contact map to residues in the second contact map. Residues that are thus aligned are considered equivalent. The value of an alignment between two contact maps is the number of contacts in the first map whose end-points are aligned with residues in the second map that, in turn, are in contact (i.e. the number of undirected cycles of size 4 formed by the two contact maps and the alignment edges). This number is called the overlap of the contact maps, and the goal is to maximize this value. The Max CMO problem was introduced in [11] and proved NP-complete in [13] and later in [18].

Figure 4: Three dimensional native structure (a) for protein 1ash as taken from the PDB [4] and its contact map (b).

Figure 5: Three dimensional native structure (a) for protein 1hlm as taken from the PDB [4] and its contact map (b).

Figures 1(a), 2(a), 4(a) and 5(a) present the native structures of the proteins, while figures 1(b), 2(b), 4(b) and 5(b) show their contact maps. Please note the long range interactions of residues that are far apart in the sequence but close in the three dimensional structure adopted by the native state. Short arcs connecting nearby residues are graphical representations of α-helices. The proteins in 4(a), 5(a) and 1(a) have helical content, while the one in 2(a) has sheets. Candidate alignments for the two pairs of protein contact maps are shown in figures 6 and 3 (equivalent residues are identified by blue alignments).

The IP approach introduced in [23] builds upon a polynomial reduction from Max CMO to Maximum Independent Set (MIS). The size of the converted instances is the product of the number of contacts of the two maps (around 10000 nodes for the instances studied here).
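To illustrate why the converted instances have as many nodes as the product of the contact counts, the sketch below builds one possible conflict graph: each node pairs a contact of the first map with a contact of the second, and two nodes are joined when the residue pairings they imply cannot coexist. It assumes order-preserving (non-crossing) one-to-one alignments, and it is not claimed to be the exact construction of [23]; the function names and the toy instance are hypothetical. A brute-force search, feasible only for tiny graphs, then returns the size of a maximum independent set.

```python
from itertools import combinations


def compatible(line_a, line_b):
    """True if residue pairings (i, u) and (j, v) can share one
    order-preserving, one-to-one alignment (assumption of this sketch)."""
    if line_a == line_b:
        return True
    (i, u), (j, v) = line_a, line_b
    return (i < j and u < v) or (i > j and u > v)


def conflict_graph(contacts1, contacts2):
    """Nodes are pairs (e, f) with e in contacts1 and f in contacts2.
    Contacts are given as (i, j) with i < j, so node ((i, j), (u, v))
    implies the pairings i -> u and j -> v."""
    nodes = [(e, f) for e in contacts1 for f in contacts2]

    def lines(node):
        (i, j), (u, v) = node
        return [(i, u), (j, v)]

    edges = set()
    for a, b in combinations(range(len(nodes)), 2):
        if not all(compatible(p, q) for p in lines(nodes[a]) for q in lines(nodes[b])):
            edges.add((a, b))  # a < b by construction
    return nodes, edges


def max_independent_set_size(n_nodes, edges):
    """Exhaustive search, exponential in n_nodes: toy instances only."""
    for size in range(n_nodes, 0, -1):
        for subset in combinations(range(n_nodes), size):
            if all((a, b) not in edges for a, b in combinations(subset, 2)):
                return size
    return 0


# Hypothetical toy instance: two identical maps with two contacts each.
toy_nodes, toy_edges = conflict_graph([(0, 2), (1, 3)], [(0, 2), (1, 3)])
print(len(toy_nodes))                                       # prints 4 (= 2 * 2)
print(max_independent_set_size(len(toy_nodes), toy_edges))  # prints 2
```

On this toy instance the maximum independent set has size 2, which matches the overlap obtained by aligning each residue with itself.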

Figure 6: A candidate alignment of value 162 for the contact maps of proteins 1ash and 1hlm. This alignment was generated with the Memetic evolutionary algorithm. The overlap value obtained is inferior to the one found by LGA (247) and the Lagrangian relaxation method (279). See table 1 for other alignments involving these proteins.

To solve MIS instances of this size, the authors exploit some characteristics of the instances thus generated. Let G1 = (E1, V1) and G2 = (E2, V2) be the two graphs that correspond to two 0-1 contact maps, where Ei are the edges (i.e. residue contacts) in these graphs and Vi the vertices (protein residues). The IP formulation proposed by Lancia et al. is:

\[
\max \sum_{e \in E_1,\, f \in E_2} y_{e,f} \tag{2}
\]

subject to the constraints

\[
x_{i,u} + x_{j,v} \le 1
\]

\[
\sum_{\ldots} y_{\ldots} \le x_{\max(i,j),u} \qquad \ldots
\]
