Published online 27 November 2006
Nucleic Acids Research, 2006, Vol. 34, No. 22 e152 doi:10.1093/nar/gkl788
Rapid detection of similarity in protein structure and function through contact metric distances
Andreas Martin Lisewski and Olivier Lichtarge*
Department of Molecular and Human Genetics, Baylor College of Medicine, One Baylor Plaza, Houston, TX 77030, USA
Received September 4, 2006; Revised September 28, 2006; Accepted September 29, 2006
ABSTRACT
The characterization of biological function among newly determined protein structures is a central challenge in structural genomics. One class of computational solutions to this problem is based on the similarity of protein structure. Here, we implement a simple yet efficient measure of protein structure similarity, the contact metric. Even though its computation avoids structural alignments and is therefore nearly instantaneous, we find that small values correlate with geometrical root mean square deviations obtained from structural alignments. To test whether the contact metric detects functional similarity, as defined by Gene Ontology (GO) terms, we compared it in large-scale computational experiments with four other measures of structural similarity, including alignment algorithms as well as alignment-independent approaches. The contact metric was the fastest method, and its sensitivity, at any given specificity level, was a close second only to that of the Fast Alignment and Search Tool, a structural alignment method that is slower by three orders of magnitude. Critically, nearly 40% of correct functional inferences by the contact metric were not identified by any other approach, which shows that the contact metric is complementary and computationally efficient in detecting functional relationships between proteins. A public ‘Contact Metric Internet Server’ is provided.
INTRODUCTION
There are now over 39 000 entries in the Protein Data Bank (PDB) (1), with about 100 more added weekly. A growing fraction are from the protein structure initiative (2), 30–50% of which are listed without a known function (3). Thus the functional annotation gap in the structural proteome may eventually come to mirror that encountered in genomics where, for example, up to 40% of the genes sequenced at NCBI’s RefSeq databank lack annotation of biological
function (4). In this light, there is an important need for novel methods of protein function prediction that exploit the available structural knowledge in order to go beyond the limitations of sequence analysis (5).
Methods to infer functional similarity from structure currently fall into two broad classes: those that rely on global similarity, scoring the likeness of whole structures, and those that estimate local similarity among structural ‘motifs’ that embody key functional properties. Eventually, one may expect that both approaches will be complementary (6–8), since similar folds may mediate different biological functions (5,9,10), while conversely different folds may support identical functions (11) based on common local structural motifs (12). This study, however, focuses on the first class, motivated by examples where structural similarity points to functional similarity long after any trace of a common evolutionary origin has been erased from sequence comparison (13–15). Many such methods compare whole protein structures, or domains, and annotate function (16–33). Typically, protein similarity is obtained by computing structural alignments (34). These alignments are computationally hard (35,36), however, and thus often require heuristics and approximations. For example, the DALI algorithm (37) looks for common local patterns in residue–residue distance matrices, and then maximizes their size by combining smaller patterns into larger ones using a Monte Carlo method; the CE algorithm (21) searches for the maximum alignment by a combinatorial extension of a path of aligned fragment pairs that satisfy certain similarity criteria; and VAST (29,30) uses a graph-theoretical approach to align secondary structure elements based on their type, relative orientation and connectivity.
Alternatively, other algorithms compute similarity much faster by avoiding structural alignment. The method of Gauss integrals is applied to the topological curve in space defined by the polypeptide chain’s Cα backbone, resulting in the so-called Scaled Gauss Metric, SGM (38). PRIDE and PRIDE2 (25,39) describe proteins as distributions of backbone Cα(i)–Cα(i+n) geometrical distances, where i is the residue number in the protein chain and n is an integer in the range [3–30], so that structural similarity is evaluated on 28 distance distributions and expressed as a single similarity score.
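The distance distributions used by PRIDE-type methods, as described above, can be illustrated with a short Python sketch. It only collects the Cα(i)–Cα(i+n) distances for n = 3, ..., 30; how the 28 distributions are then compared and condensed into one score is not described here and is therefore left out. The function name `ca_distance_distributions` and the input format (a list of (x, y, z) tuples in Å) are assumptions made for this illustration.

```python
import math


def ca_distance_distributions(ca_coords, n_min=3, n_max=30):
    """Map each sequence separation n to the list of Ca(i)-Ca(i+n) distances.

    ca_coords is a list of (x, y, z) tuples, one per residue (assumed format).
    With the default range this yields 28 lists, one per value of n.
    """
    return {
        n: [
            math.dist(ca_coords[i], ca_coords[i + n])
            for i in range(len(ca_coords) - n)
        ]
        for n in range(n_min, n_max + 1)
    }


# Toy input: 40 points on a straight line, 3.8 units apart (not a real protein).
toy_chain = [(3.8 * i, 0.0, 0.0) for i in range(40)]
distributions = ca_distance_distributions(toy_chain)
print(len(distributions))  # 28
```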
*To whom correspondence should be addressed. Tel: +1 713 798 5646; Fax: +1 713 798 7773; Email: [email protected]
© 2006 The Author(s). This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/2.0/uk/) which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.
Recently, some of these methods have also been evaluated against structural similarity benchmarks (17,40), although their performance for function prediction has not been tested comparatively. These studies focused on the ability to recognize CATH (41) fold classifications of several distantly related proteins, and to detect ‘difficult cases’ of structural similarity proposed previously (42). However, in contrast to structural similarities based on manually maintained or automatically generated fold classifications, predicted functional relationships can ultimately be tested in experiments, and hence it is desirable to benchmark protein structure similarity measures against existing functional annotations.
Here, we introduce a new vector representation of polypeptide structure that is remarkably fast to compute, quantitative, and, as we show, lends itself to global structure comparison and function prediction. The components of the vector are the frequencies of Cα backbone contacts at a given sequence separation, i.e. at a given contact length. This so-called ‘contact vector’ embodies the histogram of a structure’s contact lengths and thus quantifies the topology of the protein fold by taking into account residue–residue contacts from local secondary as well as from non-local tertiary structure. To compare structures, we use a distance metric between vectors, the ‘contact metric’.
We aim to confirm the following hypotheses: (i) that this measure is computationally very efficient since, as a consequence of the contact vector representation, it does not require a structural alignment; (ii) that it carries enough information that the similarities it detects match, or are correlated with, the more familiar root mean square deviation (RMSD) between structural alignments, despite the simplified representation through contact vectors; (iii) that it is useful for functional prediction, namely that it detects functional similarity among remote homologs either more accurately than, or in a way that is complementary to, comparable methods. As a result, the contact metric would then efficiently complement current structural similarity tools used for automated functional annotation of proteins.
The following results confirm these hypotheses. First, a similarity search against 34 000 protein chains in the current PDB runs in a few seconds of single CPU time. Thus across large molecular databases, the contact metric computes similarity nearly instantaneously. Second, for small contact metric distances between randomly chosen pairs of protein structures, the distances correlate positively (coefficients between 0.48 and 0.64) with the maximum alignment RMSD. Finally, receiver operating characteristics (ROC) over 1.38 million pairs of remotely homologous PDB chains distributed among the Gene Ontology (GO) classes ‘molecular function’, ‘biological process’ and ‘cellular component’ (43) show that the contact metric sensitivity is higher, across all specificity levels tested, than those obtained with the Basic Local Alignment Search Tool (BLAST) (44), SGM and PRIDE2, and nearly as high as that of the Fast Alignment and Search Tool (FAST) (33), a detailed 3D alignment method shown to be computationally more efficient and accurate for structural recognition than standard alignment algorithms, such as DALI and CE.
Critically, up to 44% of functional relationships in GO detected by the contact metric could not be found by any of the other methods tested, including FAST, which itself missed nearly 60% of the contact metric hits. The contact metric therefore
provides complementary information to more established approaches in structure-based functional annotation. A public ‘Contact Metric Internet Server’ for similarity searches against the PDB is maintained and available at http://mammoth.bcm.tmc.edu/cm/.
RESULTS
The contact metric as a similarity measure for protein folds
The contact metric uses a histogram representation of protein structures. These histograms record the number of residue–residue contacts in a structure as a function of their separation along the sequence. These values are then ordered into a so-called ‘contact vector’, which can be compared directly between proteins by the contact metric. Figure 1A illustrates this representation for human ubiquitin (PDB 1ubq). In this 76-residue protein, every pair of residues with Cα backbone atoms closer than 9 Å is recorded in a contact matrix (45). Summing, along the diagonals of the contact matrix, all the contacts between residues i and j such that |i − j| = k − 1 with k ≥ 3 leads to a histogram that enumerates all structural contacts among residues that are k − 1 positions apart in the sequence. Typically many contact lengths k are short (k = 3, 4 and 5), consistent with the local secondary structure constraints of α-helices and turns. But other contact lengths are longer, some even comparable to the length of the chain, and these carry information about the entire fold. Thus the contact vector representation of a tertiary structure is (q3, ..., qk, ..., q400), where qk is the absolute number of contacts with contact length k, 3 ≤ k ≤ 400. The cut-off at k = 400 reflects the small contribution of contacts above this characteristic limit (Supplementary Figure S1). The contact metric between any two protein chains (X, Y) is the absolute distance between their two contact vectors, i.e.
$$d(X, Y) = \sum_{k=3}^{400} \left| q_k(X) - q_k(Y) \right| .$$
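The construction of the contact vector and the contact metric can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation used in this work: the function names `contact_vector` and `contact_metric`, the input format (a list of Cα coordinates as (x, y, z) tuples in Å) and the toy chains are all hypothetical. Only the 9 Å threshold and the range 3 ≤ k ≤ 400 are taken from the text.

```python
import math

CONTACT_CUTOFF = 9.0   # Ca-Ca distance threshold in Angstrom (from the text)
K_MIN, K_MAX = 3, 400  # range of contact lengths k kept in the contact vector


def contact_vector(ca_coords):
    """Return [q_3, ..., q_400] for one chain given its Ca coordinates."""
    q = [0] * (K_MAX - K_MIN + 1)
    n = len(ca_coords)
    for i in range(n):
        # Sequence separation j - i equals k - 1, so k >= 3 means j >= i + 2.
        for j in range(i + 2, n):
            k = j - i + 1
            if k > K_MAX:
                break
            if math.dist(ca_coords[i], ca_coords[j]) < CONTACT_CUTOFF:
                q[k - K_MIN] += 1
    return q


def contact_metric(qx, qy):
    """Sum of absolute differences between two contact vectors."""
    return sum(abs(a - b) for a, b in zip(qx, qy))


# Toy chains on a straight line (invented for illustration, not real proteins).
extended = [(3.8 * i, 0.0, 0.0) for i in range(10)]
compressed = [(2.5 * i, 0.0, 0.0) for i in range(10)]

print(contact_metric(contact_vector(extended), contact_vector(compressed)))  # 7
```

In the toy example the extended chain only has contacts with k = 3 (eight of them), while the compressed chain has eight contacts with k = 3 and seven with k = 4, so the two vectors differ by 7. Because each chain is reduced to a fixed-length vector once, comparing a query against a database only requires one vector subtraction per entry.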
Even though short contact lengths, say with k < 10, dominate in contact vectors, summation over long-range contacts contributes almost equally, and the contact metric reflects both secondary and tertiary structure. For example, in PDB 1ubq the sum of contacts with k < 10 is 176, while all remaining contacts give 140, a comparable number (Figure 1A). This balance between short- and long-range contributions also holds on a large scale. The Pearson correlation between the contact metric d and d10, i.e. the long-range contact metric that counts only contacts with k larger than 10, is 0.79 (Supplementary Figure S2B). This shows that contact metric values take local structure into account to some extent, but that long-range contacts contribute most. Hence, the contact metric cannot be attributed to local features alone, such as secondary structures. Differences in protein chain length bias (raise) the contact metric distances of longer chains. This follows because in native structures chain lengths are proportional to the number
Figure 1. (A) Contact vector representation of human ubiquitin (PDB 1ubq). Contact lengths are given as integer values k, and their frequencies are counted by qk. Red lines indicate contacts between Cα-atoms (spheres) with sequence separation k − 1 = 3. A contact matrix C is derived from the spatial coordinates of Cα-atoms documented in the PDB. For a single-chain protein of length L, one defines a contact by using a Euclidean distance threshold of 9 Å between Cα-atoms. In a protein sequence consecutive residues are in contact, thus C(i, j) = 1 for all i, j in {1, ..., L} with |i − j| = 1. Then the frequency qk is the number of contact pairs (i, j) for any given sequence separation k − 1 = j − i with 2 ≤ k − 1 ≤ L − 1 and with j > i. (B) Scatter plot of the length corrected contact metric (LCM) and the maximum geometrical alignment RMSD calculated with FAST for 32 525 pairs of PDB structures. (C) Distribution of the alignment fraction favg for the same data as in (B); pairs of structures that can be aligned nearly perfectly (favg > 0.8) cluster at small contact metric and RMSD values. (D) LCM-RMSD scatter plot for 3074 pairs of PDB chains, with 1838 pairs having sequence identity below 25%. The average alignment fraction for all pairs is 0.82 (SD 0.19). Solid lines give the average RMSD in steps of 0.01 in dl. The Pearson correlation coefficient between RMSD and dl is 0.48, between RMSD and sequence identity it is 0.55, and between LCM and sequence identity it is 0.67.
of contacts with k ≥ 3 (Supplementary Figure S3), thus we have dl(X, Y) ≤ c (LX + LY), with the proportionality constant c ≈ 5.8 and protein chain lengths LX, LY. This can be corrected by normalizing the contact metric with the factor 1/[c(LX + LY)] to yield the length corrected contact metric (LCM), used henceforth. We note that although the contact metric is a true metric mathematically, LCM is not: it is still positive, non-degenerate and symmetric, but does not necessarily satisfy the triangle inequality. The length corrected contact metric can then be written by the simple formula
$$d_l(X, Y) = \frac{\sum_{k=3}^{400} \left| q_k(X) - q_k(Y) \right|}{\sum_{k=3}^{400} \left[ q_k(X) + q_k(Y) \right]} .$$
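The normalized formula, and the remark that it need not satisfy the triangle inequality, can be checked with a small Python sketch. The function name `lcm_distance` and the two-component toy vectors are assumptions made for illustration (real contact vectors have 398 components); the guard against an all-zero denominator is likewise an addition for degenerate toy input and is not part of the definition given here.

```python
def lcm_distance(qx, qy):
    """Length corrected contact metric: sum|qx - qy| / sum(qx + qy)."""
    numerator = sum(abs(a - b) for a, b in zip(qx, qy))
    denominator = sum(a + b for a, b in zip(qx, qy))
    if denominator == 0:
        # Guard for degenerate toy input with no contacts at all (assumption).
        raise ValueError("both contact vectors are empty")
    return numerator / denominator


# Toy two-component vectors (invented for illustration, not real proteins).
x, y, z = [1, 0], [1, 1], [0, 1]

d_xz = lcm_distance(x, z)  # 2 / 2
d_xy = lcm_distance(x, y)  # 1 / 3
d_yz = lcm_distance(y, z)  # 1 / 3

# The direct distance exceeds the detour through y: triangle inequality fails.
print(d_xz > d_xy + d_yz)  # True
```

Since the numerator can never exceed the denominator for non-negative counts, the value always lies between 0 (identical vectors) and 1 (no contact length in common).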
With this expression all dl-values are bounded between 0 and 1, where 0 signals maximum similarity and 1 the minimum. LCM distances are always well defined for any two polypeptide structures, regardless of their geometrical similarity. Because they can be computed rapidly, it was possible to estimate the statistical significance of any number dl(X, Y) by assigning a P-value from the total distribution of contact metric distances. This distribution was randomly sampled by choosing Ns = 2.5 × 10^6 single-chain protein pairs (X, Y) from the PDB (Supplementary Figure S4), and it suggested a level of statistical significance at values below dlc = 0.15, because for larger distances the distribution is characterized by a rapid ‘blow-up’ in relative frequency, indicating the onset of a random regime. Hence we considered structures to be significantly similar if their LCM distance was below dlc = 0.15, which corresponded to a 99.7% significance level (P = 0.003).
To demonstrate that LCM is not wholly distinct from the intuitive, standard geometric similarity measure between molecules, we compared it to the RMSD of a maximum geometrical alignment of two chains. We used the same set of 32 525 PDB pairs, and for each pair (X, Y) calculated a maximum alignment RMSD with FAST, as shown in the scatter plot of Figure 1B. Over the entire LCM domain, there is no correlation between the length corrected contact metric and the alignment RMSD (the Pearson correlation coefficient is 0.11). But for statistically significant values of LCM (dl < 0.15), the RMSD between aligned structures falls toward 1 Å, as shown in Figure 1D, and these values are positively correlated with the maximum alignment RMSD: for the 79 pairs in Figure 1B the correlation coefficient is 0.64, and for the 3074 randomly chosen PDB pairs in Figure 1D we measured a correlation of 0.48, again in the range 0 < dl < 0.15.
Higher LCM values do not maintain this correlation because alignment RMSD can be restricted to very different alignment fractions (or alignment coverage). In contrast, the contact metric always relates entire protein chains, and a loss of correlation is expected as structures become more dissimilar, which was confirmed by the following analysis. Given any two protein chains X and Y with lengths LX and LY, their maximum number of geometrically aligned residue pairs is Lmax = (LX + LY)/2, which is reached if both chains have the same number of residues and if geometrical
alignment is perfect (i.e. every residue has exactly one aligned partner residue). We then defined the alignment fraction f as the ratio of the number of actually aligned residue pairs, L ≤ Lmax, to the maximum number Lmax. Figure 1C gives a 100 × 100 array of alignment fractions favg derived from the data in Figure 1B. (Each pixel in Figure 1C covers an area of 0.1 Å × 0.01 in RMSD/contact metric space, and coloring represents the average value favg of all points within a single pixel.) High alignment fractions (favg > 0.8, red pixels) cluster at small contact metric distances (dl < dlc) and RMSD values (RMSD < 8 Å), while for dl > dlc well aligned pairs with favg > 0.6 were rarely observed, and the correlation coefficient was 0.10. Only statistically significant LCM values signaled both small RMSD and high alignment coverage. This convergence between the contact metric and the geometrical alignment RMSD suggests that significant contact metric distances predict that two polypeptide structures can be aligned with a small geometrical error; thus they correspond to our intuitive definition of geometrical similarity.
Large-scale annotation with Gene Ontology terms
To demonstrate that LCM distances carry functional information, we performed large-scale functional recognition experiments and compared the results to several other methods for structural comparison. We mimicked realistic conditions and selected a test set of 1662 non-redundant protein structures with