
Towards a fully automated protein structure classification: How to get CATH classification from FSSP Z-scores

Gad Getz¹, Michele Vendruscolo² and Eytan Domany¹

¹ Department of Physics of Complex Systems, Weizmann Institute of Science, Rehovot 76100, Israel
² Oxford Centre for Molecular Sciences, New Chemistry Laboratory, Oxford OX1 3QT, UK

Corresponding author:

Dr. Michele Vendruscolo
Oxford Centre for Molecular Sciences
New Chemistry Laboratory
Oxford OX1 3QT, UK
Phone: +44-1865-275919
Fax: +44-1865-275921
Email: [email protected]

Running title: Towards automatic structure classification

For Journal of Molecular Biology, article


Abstract

Currently, about 50 new protein structures are made available in public databases each week. Attention is therefore focused on developing automatic methods of classification. The work of organization is being done by several groups, to a large extent independently. To our knowledge, the consistency of different classifications has never been examined on a protein-by-protein basis. Moreover, the potential advantages of considering different classification schemes simultaneously have not been much scrutinized. In this work we compare the classifications given by two widely used databases, FSSP and CATH. The pairwise structural similarities provided by the FSSP are found to be largely consistent with the CATH classification, and we discuss five proteins for which the two databases are not consistent. We present a fully automated scheme to predict the CATH architecture of proteins on the basis of their FSSP scores. Using this scheme, we predict the architecture classifications of 165 single-domain proteins that had not yet been processed by CATH when this work was done. For any protein to be classified we can estimate the expected success rate of our architecture predictions, which reaches 94% for 50% of the proteins and 73% for an additional 35%. We found that correlating the information in the different protein structure databases can help automate classifications that are performed manually. We showed that automatic algorithms can be used to enlarge the CATH database using information available in the FSSP database. The identification of inconsistencies between databases can uncover possible errors of classification and may reveal shortcomings of the present methods.

Keywords: Protein Structure, Protein Databases, CATH, FSSP, Classification, Clustering, Structure prediction


INTRODUCTION

Proteins can be classified according to similarities in their sequences, in their structures and in their functions (Chothia, 1992; Orengo et al., 1994; Brenner et al., 1997). The relationship between these levels of description makes them complementary. A reliable classification scheme is useful for several reasons: it allows for the identification of distant evolutionary relationships (Gerstein and Levitt, 1997) and provides a library of representative structures for protein structure prediction (Bowie et al., 1991; Jones et al., 1992; Fisher et al., 1996). Given a particular protein, it provides a tool to identify other proteins of similar structure and function (Murzin, 1996). At a more abstract level, the physical principles dictating the structural stability of proteins are reflected in their folded state. Therefore, most of the recently proposed methods for deriving energy functions to perform protein fold prediction rely in different ways on structural data (Jernigan and Bahar, 1996; Finkelstein, 1997). In the long term, the most exciting prospect is the possibility of routinely assigning gene function from genome databases (Sali, 1998). The most comprehensive repository of three-dimensional structures of proteins is the Brookhaven Protein Data Bank (PDB) (Bernstein et al., 1977; Abola et al., 1987), available on the web at http://www.pdb.bnl.gov/. The number of released structures is rapidly increasing, and more than 9000 complete sets of coordinates were available at the end of 1998. Many research groups maintain web-accessible databases of protein folds which divide the conformational space into structural hierarchies (nested groups), e.g. FSSP (Holm and Sander, 1994), CATH (Orengo et al., 1997), SCOP (Murzin et al., 1995), MMDB (Gibrat et al., 1996) and 3Dee (Siddiqui and Barton, 1995). Each group has its own way to compare and classify proteins.
Here we consider two such databases, the FSSP database created by Holm and Sander (1996a; 1996b) and the CATH database created by Orengo et al. (1997).


The FSSP database

The FSSP (Fold classification based on Structure-Structure alignment of Proteins) uses a fully automated structure comparison algorithm, DALI (Distances ALIgnment algorithm) (Holm and Sander, 1994), to calculate a pairwise structural similarity measure (the S-score) between protein chains. The algorithm searches for the amino acid alignment between two protein chains which yields the most similar pair of Cα distance maps. In general, the more geometrically similar two chain structures are, the higher their S-score. The mean and standard deviation of the S-scores obtained for all pairs of proteins are evaluated. Shifting the S-scores by their mean and rescaling by the standard deviation yields the statistically meaningful Z-scores. For the classification of structures the FSSP uses the Z-scores for all pairs in a representative subset of the PDB. A fold tree is generated by applying an average-linkage hierarchical clustering algorithm (Jain and Dubes, 1988) to the all-against-all Z-score matrix.
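The shift-and-rescale step just described can be sketched as follows. This is a minimal illustration only: the actual FSSP/DALI statistics are computed more carefully, and the function name is ours.

```python
import numpy as np

def z_scores(s):
    """Convert a symmetric matrix of pairwise S-scores into Z-scores by
    shifting by the mean and rescaling by the standard deviation.
    Illustrative sketch only; FSSP estimates these statistics more
    elaborately than a single global mean and deviation."""
    s = np.asarray(s, dtype=float)
    # Use only off-diagonal entries: self-similarities are not
    # comparable to cross-similarities.
    off = s[~np.eye(s.shape[0], dtype=bool)]
    return (s - off.mean()) / off.std()
```

By construction, the off-diagonal entries of the result have zero mean and unit standard deviation, which is what makes Z-scores from different protein pairs comparable.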

The CATH database

Thornton and coworkers use a combination of automatic and manual procedures to create a hierarchical classification of domains (CATH) (Orengo et al., 1997). They arrange domains in a four-level hierarchy of families according to the protein class (C), architecture (A), topology (T) and homologous superfamily (H). The class level describes the secondary structures found in the domain and is created automatically. There are four class types: mainly-α, mainly-β, α-β, and proteins with few secondary structures (FSS). The architecture level, on the other hand, is assigned manually (using human judgement) and describes the shape created by the relative orientation of the secondary structure units. The shape families are chosen according to commonly used structural descriptions, such as barrel, sandwich, roll, etc. The topology level groups together all structures with similar sequential connectivity between their secondary structure elements. Structures with high structural and functional similarity are put in the same fourth-level family, called the homologous superfamily. Both the topology and homologous superfamily levels are assigned by thresholding a calculated structural similarity measure (SSAP) at two different cutoffs, respectively (Taylor and Orengo, 1989; Orengo et al., 1992).

CATH classification from FSSP data

In the present work we present a method to obtain the CATH classification from the FSSP scores. The first step is to test whether the two databases are consistent. Having established their consistency, the second step is to apply suitable classification algorithms to the FSSP Z-scores in order to predict, without human intervention, the architecture classification of CATH (which presently is performed manually). Such a procedure can save human effort and help enlarge the CATH database. Moreover, our analysis identifies inconsistencies between the CATH classification and the FSSP scores and can thus focus attention on possible misclassifications in the CATH database or on problems with the Z-score. In order to develop and test the classification scheme, we retrieved the CATH classifications and the FSSP Z (and S) scores for a set of 479 proteins, denoted by PR479, which appear in both databases. Since the CATH database handles domains, whereas FSSP deals with chains, we restricted ourselves to chains which form a single domain and, therefore, appear as a single entry in both databases.* An additional set of 165 single-domain proteins (PR165) that appear in FSSP but were not yet processed by CATH was identified. The architectures of these 165 proteins are predicted using the Z-scores between all proteins and the known classifications of the set PR479. We illustrate the problem in Figure 1, which shows the Z-score matrix between all proteins

* In order to identify the single-domain chains, the proteins were checked against the 3Dee database (available at http://circinus.ebi.ac.uk:8080/3Dee/).


in PR644 (the combined set PR479+PR165). A pair of similar proteins, with a Z-score larger than 2.0, is represented by a black dot. The first 479 proteins constitute PR479, for which the CATH classification is known. Proteins numbered from 480 to 644 constitute the set PR165, for which the CATH architecture is yet unknown. The order of the proteins within PR479 and PR165 is arbitrary. Our goal is to predict the architectures of the PR165 proteins, given the architectures of PR479 and the similarity score matrix for the complete PR644 set. Predicting the architectures is a classification problem which is treated with pattern recognition tools. We tested several architecture prediction algorithms, using cross-validation (Stone, 1974) to estimate their performance. Every one of the algorithms that were tested can be viewed as a two-stage process. In the first stage a new similarity measure is produced from the original Z-scores. This is done either by a direct rescaling of the original Z-scores, or by using the results of various hierarchical clustering methods to produce new similarity measures. The second stage consists of using these similarities as the input to some classification method, yielding predictions for the classes and architectures. Our final predictions for the set PR165 are listed in Table V. These predictions are not the outcome of a single preferred method. Rather, they are based on a majority vote among the several architecture prediction algorithms that performed best. The success rate of the majority vote method, as tested on PR479, reaches 94% for 50% of the proteins and 73% for an additional 35%.

RESULTS AND DISCUSSION

The main results of this work are:
(i) We established that the FSSP and CATH are to a large extent consistent; we analyzed and interpreted these correlations.
(ii) We identified several proteins whose CATH classification is inconsistent with the FSSP Z-scores; we investigated the origin of these discrepancies.
(iii) Using the FSSP Z-scores we predicted the CATH architecture for 165 proteins that had not yet been processed by CATH.
Further results are more technical and concern the suggested classification schemes, including the methods and similarity measures which gave the best performance in predicting architectures.

FSSP and CATH correlation

A simple and visually appealing way to study the correlation between the FSSP similarity scores and the CATH classifications is presented in Figure 2. This figure is produced by reordering the rows and columns of the Z-score matrix within the set PR479 (the upper left square in Figure 1). The reorganization is performed according to the CATH classification. For each of the proteins in this set we have the CATH classifications at all levels. First, we order the proteins by their class; within the class, by the architecture; within it, by the topology; and so on. This reordering generates a permutation of the columns and rows of the Z-matrix. The full-line grid in Figure 2 separates the proteins according to their CATH class, and a dotted-line grid is placed at the boundaries between architectures. Comparison with Figure 1 reveals the extent to which the FSSP Z-scores reflect the CATH classification. Figure 2 shows the underlying order behind the apparent randomness of Figure 1. Several interesting observations can be made. First, consider the class level of CATH. As can be seen in Figure 2, there are no matrix elements with Z > 2.0 in region A, which connects proteins of the mainly-α class to the mainly-β class. At variance with this, some proteins from both of these classes have large Z-scores with proteins from the α-β class (region B). This is reasonable: because of the way similarity is defined by FSSP, a mainly-α protein can have a high Z-score with an α-β protein, due to high similarity with the α part. Second, turning to the architecture level, we observe that there are architecture families which are highly connected within themselves, e.g. α-β barrels (283-308: region C), whereas for others the intra-family connections are very sparse. The similarities within the mainly-β sandwich family (186-238: region D) show two relatively distinct subgroups, which suggests an inner structure corresponding to the lower levels of the CATH hierarchy. Checking the topology level (the third CATH level) for this architecture, one indeed finds two large topology sub-families, the immunoglobulin-like proteins (189-209: upper left part of region D) and the jelly rolls (214-235: lower right part of region D), which correspond precisely to the two strongly connected sub-groups that appear in Figure 2. We found that the CATH classification at the level of topology is reflected in the Z-matrix. This is to be expected, since the Z-score measures the structural similarity of two aligned proteins while preserving their connectivity. Overall, this analysis shows that the Z-matrix is correlated with the CATH architecture and, therefore, can help predict the architectures of yet unclassified proteins.
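The reordering that produces Figure 2 amounts to a lexicographic sort of the proteins by their (C, A, T, H) labels, followed by a simultaneous permutation of the rows and columns of the Z-matrix. A sketch (the function name and the tuple encoding of the labels are ours):

```python
import numpy as np

def reorder_by_cath(z, cath_labels):
    """Permute rows and columns of the Z-score matrix so that proteins are
    sorted by class, then architecture, then topology, then homologous
    superfamily (lexicographic order of the CATH label tuples).
    Returns the permuted matrix and the permutation itself."""
    order = sorted(range(len(cath_labels)), key=lambda i: cath_labels[i])
    return z[np.ix_(order, order)], order
```

With proteins of the same class (and, within it, the same architecture) made adjacent, intra-family similarities collect into blocks along the diagonal, which is exactly the structure visible in Figure 2.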

Bounds on the prediction success rate

Armed with the result that the CATH classification is strongly correlated with the FSSP Z-scores, we developed various methods to use the latter to predict the CATH class and architecture of yet unclassified proteins. In this section we establish an upper bound on the prediction success rate relevant to a family of prediction algorithms. The Z-matrix can be reinterpreted as a weighted graph: each vertex in the graph represents a protein, and the weight on the edge connecting two vertices is the corresponding Z-score. Edges with Z < 2.0 are absent from the graph. Following this representation, we define two proteins as neighbors if they are connected by an edge. We can then analyze the connectivity properties of the set PR479 and make inferences about our predictive power. One can characterize the FSSP-based neighborhood of a protein according to the CATH classification of itself and its neighbors. Every protein must belong to one of four categories:
(A) "Island" - the protein has no neighbors;

(B) "Colony" - it has no neighbors of its own (CATH) kind;
(C) "Border" - it has neighbors of its own kind as well as of other kinds;
(D) "Interior" - the protein has only neighbors of its own kind.
Using these definitions we can arrange the proteins of PR479 in groups according to their neighborhood category at the class and architecture levels. A little thought shows that there are 7 possible groups: 1(Aa), 2(Bb), 3(Cb), 4(Cc), 5(Db), 6(Dc), 7(Dd). The first (upper-case) letter refers to the neighborhood category at the class level and the second (lower-case) letter specifies the architecture surroundings. Group 5, for example, includes those proteins which "agree" with all their neighbors regarding the CATH class assignment (category D) but disagree with all of them with respect to the architecture (category b). The distribution of the proteins among these groups can be used to calculate an upper bound for a leave-one-out success rate estimate of a nearest-neighbor (1NN) algorithm. When calculating a leave-one-out estimate, one tries to predict the class of every protein by using the classes of all the others. In a 1NN algorithm one assigns the protein that is being classified to the class of its nearest neighbor (of known class). Therefore, proteins which do not have neighbors of their own type can never be classified correctly by a 1NN algorithm. This is the case for proteins with neighborhoods of types A and B. Thus, from the list given above, only those proteins that belong to groups 4, 6 and 7 can potentially attain their correct class and architecture assignments. The upper bound holds for any classification algorithm that yields, as its prediction, the architecture of any one of the neighbors as defined by the Z-score. Using the distribution of PR479 proteins among the 7 groups listed above, one can calculate bounds for predicting the class and the architecture.
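The four neighborhood categories can be read directly off the thresholded Z-graph. A minimal sketch, assuming `neighbors[i]` lists the Z > 2.0 neighbors of protein `i` and `label` holds its CATH class (or architecture) assignment:

```python
def neighborhood_category(i, neighbors, label):
    """Classify protein i as Island/Colony/Border/Interior according to the
    CATH labels of its neighbors in the thresholded Z-graph (see text)."""
    same = [j for j in neighbors[i] if label[j] == label[i]]
    other = [j for j in neighbors[i] if label[j] != label[i]]
    if not neighbors[i]:
        return "Island"      # no neighbors at all
    if not same:
        return "Colony"      # neighbors, but none of its own kind
    if other:
        return "Border"      # its own kind and other kinds
    return "Interior"        # only its own kind
```

Running this once with class labels and once with architecture labels, and pairing the two results, yields the seven groups 1(Aa) through 7(Dd) described above.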
Table I lists the number of proteins in each group and the number of correct predictions made by a 1NN algorithm that uses the Z-score as similarity measure. The class prediction success rate is bounded by 91.4% (438/479), which is the percentage of proteins with at least one neighbor from their own class (groups 3, 4, 5, 6 and 7 together). The architecture prediction rate is bounded by 82.9% (397/479), which is the percentage of proteins in groups 4, 6 and 7. The leave-one-out success rate estimate of 1NN, using the Z-score as a similarity measure, can be read off Table I. For class prediction it is 86.7% (415/479) and for architecture prediction it is 74.3% (356/479). One can see that the margin between the estimate for the 1NN architecture prediction rate and the upper bound is only 8.6%.
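The leave-one-out 1NN estimate can be reproduced as follows: each protein receives the label of its highest-scoring neighbor above the threshold. This is a sketch; ties and the rejection option used later in the paper are ignored, and the function name is ours.

```python
import numpy as np

def loo_1nn_success(z, labels, threshold=2.0):
    """Leave-one-out nearest-neighbor success rate: each protein is
    assigned the label of its highest-Z neighbor above the threshold.
    Proteins with no neighbor (Island/Colony cases) count as failures."""
    z = np.asarray(z, dtype=float)
    n = len(labels)
    correct = 0
    for i in range(n):
        row = z[i].copy()
        row[i] = -np.inf                     # exclude self-similarity
        j = int(np.argmax(row))              # nearest neighbor
        if row[j] > threshold and labels[j] == labels[i]:
            correct += 1
    return correct / n
```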

Inconsistencies between CATH and FSSP

Using the neighborhood characterization of the previous section, one can look for proteins whose CATH classification is inconsistent with that of all their neighbors. For example, proteins that belong to group 2 do have neighbors, but none of these are of their own kind, not even at the class level. This means that the FSSP scores imply that these proteins are similar only to proteins of different classes and architectures. Identifying these proteins can focus attention on possible misclassifications in CATH or drawbacks of the Z-score. For example, one of the 17 proteins in group 2 is the PDB entry 1rboC, which is classified as an α-β two-layer sandwich. It has 7 neighbors (in PR479), all of which are classified as mainly-β sandwiches. In Table II we list a subset of the proteins from groups 2 and 3. A full list of the inconsistent proteins is available on the web (at http://www.weizmann.ac.il/gaddyg/mscwork.html). Table II lists only those proteins that have more than 5 neighbors, out of which at least 70% have a common architecture (which is, however, different from that of the proteins listed in Table II). We believe that these proteins have either been misclassified by CATH, or there is a problem concerning their FSSP similarities with their neighbors (or both). The analysis of these 5 proteins reveals several problems that can cause such inconsistencies. These problems originate both from the CATH definition of architectures and from the Z-score values. One problem is that there are some large Z-scores between proteins of different architectures. Such large Z-scores arise when a protein of one particular architecture has a structure similar to a part of a protein of a different architecture. Swindells et al. call this phenomenon of structures within structures the "Russian doll" effect (Swindells et al., 1998). Such cases are common between architectures of long proteins that contain substructures corresponding to architectures of shorter proteins; e.g. there are many two-layer sandwich proteins that resemble a part of three-layer sandwich proteins. Such relationships can occur at the class level, e.g. α-β proteins that contain mainly-α or mainly-β proteins (1rboC, 1hgeA). They can also occur at the architecture level within the same class, e.g. the α-β complex architecture contains the α-β two-layer sandwich (1regX). Other inconsistencies occur when proteins fit two architecture definitions. An example of such a case is described below.

Architecture prediction methods

Our aim is to predict the class and the architecture of proteins whose CATH classification is not yet available (specifically, the proteins in the set PR165). In order to perform this task we are given the classes of the proteins of PR479 and the FSSP Z-scores for both sets. At this point we need to make two main decisions: to choose (1) a similarity measure and (2) a classification method which uses the similarity measure to predict the architecture. These two choices specify every architecture prediction method. For example, the simplest combination is to use the original Z-scores as our similarity measure, to which we apply the 1NN classification algorithm defined above. We tested several similarity measures, which can be divided into two broad classes, direct and indirect measures. Direct measures include the original S and Z scores from the FSSP and their normalized counterparts, denoted by CS and CZ. The normalization corrects for the variations in the "self similarities" of the FSSP scores (see Eq. 5 on page 20). Indirect measures are created by using different clustering algorithms. These measures are nonlocal, e.g. they infer similarities between proteins according to whether they belong to the same cluster. Each of these similarity measures is denoted by an acronym that identifies the clustering algorithm that was used to create it (such as KMNV, SPC and AVL; see the Materials and Methods section). We also considered various classification methods (see Materials and Methods), such as 1NN, max field (MF), a new heuristic (HR) and a direct (D) classification method. We tried many pairwise combinations of similarity measures and classification methods. Each combination is denoted by the acronym that identifies the classification method, followed by that of the similarity measure in parentheses. For example, 1NN(SPC-CZ) represents a 1NN classification method which is applied to a similarity measure that was obtained by applying the SPC clustering algorithm to the CZ direct measure. All these combinations were evaluated on the set PR479. First the set was randomly "diluted"; that is, we randomly chose a certain fraction of these proteins and placed them in a test set, pretending that we do not know their classification. The FSSP scores of the entire set were then used to classify the test set. For each protein from the test set we either return a predicted classification or reject the protein (i.e. we declare that we are unable to classify it; see below). The quality of any classification algorithm is measured by its success rate (the fraction of correctly classified proteins, out of the test set) and by its purity (the success rate out of the non-rejected proteins). Both quality measures were evaluated at different dilutions. Detailed descriptions of the classification methods, similarity measures and our testing procedure are given in the Materials and Methods section. Since the dataset PR165 that we wish to classify constitutes 25% of the total set PR644, we prefer methods that perform well at this range of dilutions. On the basis of the performance of our prediction algorithms, we decided to base our final prediction on a majority vote of the following 5 algorithms: 1NN(CZ), HR(K20-CZ), 1NN(AVL-CZ), 1NN(SPC-K10-CZ) and D(SPC-K10-CZ). The performance of each of these algorithms is summarized in Table III.
The highest purity, 90.4 ± 0.5%, is obtained by the D(SPC-K10-CZ) algorithm, which is based on the SPC clustering algorithm. The highest success rate is obtained by HR(K20-CZ), which performs similarly to 1NN(CZ) at low dilutions but outperforms it at higher dilutions. As a final estimate of the reliability and quality of our predictions, we tested the majority vote of the five selected algorithms at several dilutions. We divided the proteins of the test sets into groups M5, M4, etc., according to the number of methods (out of the five used) that made up the majority on which the classification was based (all 5 in agreement in M5, etc.). The success rates of all the groups are summarized in Table IV. The reported rates are averages over two dilutions, 20% and 50%. This table allows us to attach a confidence level to each prediction; if for a given protein 4 methods agree, we trust that the prediction is correct with probability 80%.
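The majority vote with rejection and M-group bookkeeping can be sketched as follows (a minimal version; the function name and the use of `None` for a rejection are ours):

```python
from collections import Counter

def majority_vote(predictions):
    """Combine the predictions of several algorithms for one protein.
    `predictions` is a list of architecture labels, with None standing
    for a rejection.  Returns (winning_label_or_None, m), where m is the
    size of the majority, used to place the protein in group M5..M0."""
    votes = Counter(p for p in predictions if p is not None)
    if not votes:
        return None, 0                   # group M0: rejected by all
    label, m = votes.most_common(1)[0]   # largest agreeing subset
    return label, m
```

For example, five predictions of which three agree on one architecture would place the protein in group M3 with that architecture as the final prediction.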

Classification Prediction

Table V lists classes and architectures and the proteins that were assigned to them. The proteins in each architecture category are divided into groups according to the number of methods that agreed. Group M5 includes proteins for which all methods predicted the same class. Group M4 includes those for which 4 methods agreed, and so on down to group M0, which is formed by the proteins rejected by all the methods. Table V does not include proteins that belong to groups M0 and M1, since we do not trust their classification. Therefore the total number of proteins that are listed is 147 (out of 165). Each protein is specified by its PDB entry and chain identifier. The full list of proteins with the prediction of each of the 5 methods is available on the web (http://www.weizmann.ac.il/gaddyg/mscwork.html). The different typeface forms in the list are explained in the next subsection.

Recent CATH postings

Since this work was completed (Getz, 1998) the CATH database has been enlarged and, by December 1998, 94 out of the 165 proteins in PR165 had been classified. Twelve of these 94 proteins have more than one domain according to CATH and therefore appear as multiple entries (we made predictions only for single-domain chains, which were identified using the 3Dee database, http://circinus.ebi.ac.uk:8080/3Dee/). Three of the remaining 82 proteins are classified to architectures that do not appear in the list of 29 to which the proteins from PR479 belong. Therefore, the architectures of 79 newly classified proteins can be compared to their predictions. The comparison can be read from Table V, where each one of the 147 protein entries (PDB code and chain identifier) appears in one of 5 typeface forms:
(a) Normal - the protein was still not processed by CATH (as of December 1998).
(b) Bold - the predicted classification is the same as the CATH classification.
(c) Underline - wrong prediction; the CATH classification appears in parentheses.
(d) Italic - the protein has more than one domain and cannot be compared.
(e) Italic and underline - the protein's architecture is not one of the 29 that appear in PR479.
We analyzed the misclassifications found in groups M4 and M5, for which the classifications are the most reliable. There are 15 misclassifications in groups M4 and M5 (out of 53 recently classified proteins in these groups). One can identify 4 types of misclassification:
(1) The first type of misclassification is caused by an ambiguous definition of the architecture. Our algorithm classified 7 proteins (1occI, 1occJ, 1occK, 1occL, 1occM, 1tiiC and 1mof) as the mainly-α few secondary structures architecture, while in CATH they were classified to the fourth class, few secondary structures. Manual inspection of these proteins revealed that they have one or two α-helices; thus they can fit both classes and architectures. Figure 3 shows the FSSP alignment of 1tiiC with its most similar protein in PR479, 1junA, for which the Z-score is 4.1. Although the proteins overlap very well, the CATH classification of 1tiiC is FSS whereas 1junA is classified as mainly-α few secondary structures.
(2) The second type of misclassification is due to high Z-scores between two proteins when the architecture of one coincides with that of a sub-part of a different architecture (the Russian doll effect mentioned above). We found 4 proteins of this type (2fua, 1gdoA, 1occE and 1hqi).
For example, 2fua, which is an α-β three-layer sandwich, has its highest Z-score in PR479 (3.1) with 1dcpA, which is an α-β two-layer sandwich. The FSSP alignment finds that two of the three layers have similar structure. Other cases of architectures that "include" smaller ones occur between the complex architecture families, in both the mainly-β and α-β classes, and other, shorter architectures of their own class, such as roll, sandwich, etc.
(3) The third type of misclassification occurs when a protein has few neighbors, with very low similarities, each of which belongs to a different architecture. This type includes 1iba and 1whi.
(4) The last type comprises probable misclassifications in CATH. The protein 1mil, whose cartoon plot is shown in Figure 4, is predicted to be an α-β three-layer sandwich but is classified in CATH as an α-β two-layer sandwich. The SCOP database (Murzin et al., 1995) also classifies the protein as α+β with three layers, α-β-α. The misclassification of 1far is due to an old misclassification in CATH (which was used for our prediction) of its most similar neighbor 1ptq (Z-score of 5.3). In the December 1998 version of CATH the classification of 1ptq was fixed, and using it would have produced the correct classification for 1far. Overall, after discarding misclassifications of the first and fourth types as trivial, we estimate the success rate for proteins in group M5 as 93% (25/27) and for proteins in group M4 as 78% (21/26). These figures are in agreement with the expected success rates estimated from PR479 (see Table IV).

Biological relevance

The rapidly increasing number of experimentally derived protein structures forces a continuous updating of the existing structure classification databases (CATH, FSSP etc.). Each group adopts different classification criteria at the levels of sequence, structure and function similarity. A comparison between different classification schemes can help us understand the optimal interplay between these levels, can reveal possible misclassifications, and can ultimately offer a fully automated updating procedure. An ever-increasing share of the manual steps can be automated by using the tools made available by other databases. We showed that it is possible to automatically predict the CATH architecture from the FSSP Z-scores, using several pattern recognition tools. We also proposed improved similarity measures, based on the original scores.

The advent of genome projects is boosting efforts in the field of protein classification. In the past the aim was to help find the structure of the particular protein which was of interest at a given time. Now the hope is to find a large representative set of structures which encompasses most of the possible folds, if not all of them. In such a large-scale project, human intervention, which is precious in setting the principles of classification, should be gradually replaced by automated procedures.

MATERIALS AND METHODS

The CATH architectures are predicted using widely studied statistical pattern recognition tools (Fukunaga, 1990; Duda and Hart, 1973). We implemented the following general scheme:
(i) We considered two sets of proteins. The first set included proteins for which the CATH classification was known (PR479) and the second set included proteins that were not yet classified by CATH (PR165). Both sets appear in FSSP, and the corresponding similarity score matrices were retrieved.
(ii) We defined a methodology to evaluate architecture prediction algorithms. Each prediction algorithm was tested blindly on a subset of the proteins with known classification. Success and rejection rates were estimated by cross-validation.
(iii) Various combinations of similarity measures and classification methods were tested. The similarity measures include the original S and Z scores, a renormalized version of these, as well as cophenetic similarity measures (see below) based on the outcome of two clustering algorithms. The classification methods that were tested included nearest-neighbors, a heuristic algorithm and a direct utilization of the clustering results.
(iv) The five best combinations of classification methods and similarity measures were selected; a majority vote among these was used to predict the architecture of the yet unclassified proteins.
A detailed description of each of these steps is presented elsewhere (Getz, 1998).
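Step (ii), the cross-validation estimate of the success and rejection probabilities, can be sketched as follows. Here `predict` stands for any of the architecture prediction algorithms and is a placeholder; the function name and calling convention are ours.

```python
import random

def cross_validate(proteins, labels, predict, trials=100, test_frac=0.2):
    """Estimate P_success and P_reject by averaging over `trials` random
    train/test splits.  `predict(train, test)` must return a dict mapping
    each test protein to a predicted label, or None for a rejection."""
    succ = rej = 0.0
    for _ in range(trials):
        test = set(random.sample(proteins, int(test_frac * len(proteins))))
        train = [p for p in proteins if p not in test]
        pred = predict(train, test)
        succ += sum(pred[p] == labels[p] for p in test) / len(test)
        rej += sum(pred[p] is None for p in test) / len(test)
    return succ / trials, rej / trials
```

The fraction held out (`test_frac`) plays the role of the "dilution" discussed in the text.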


Protein sets

The S and Z score matrices of the FSSP version of December 25, 1997 were obtained from Liisa Holm (one of the authors of FSSP). A Z-score matrix containing only values with Z > 2.0 can be retrieved directly from the FSSP web site (http://www2.ebi.ac.uk/dali/fssp/). This version of the FSSP database included 1188 protein chains, which represent 9153 PDB structures. The list of 1188 FSSP chain names was checked against the CATH database (Jan 5, 1998 version) (http://www.biochem.ucl.ac.uk/bsm/cath/). For simplicity, out of the 686 proteins that appear in CATH, we used only the 479 that are single-domain and have a single classification (referred to as PR479). We note that not all the CATH families had representatives in PR479: some small families at the topology and homologous superfamily levels were missing. At the first (class) level, there were representatives of all four class types (numbered 1 to 4). At the second, architecture level, the 479 chains covered 29 different architecture types. Of the 1188 FSSP entries, 503 chains were not processed by CATH; these were checked against the 3Dee database (http://circinus.ebi.ac.uk:8080/3Dee/), which includes an indication of whether a chain is single- or multi-domain. In this way 165 single-domain chains not yet processed by CATH were identified. These chains are candidates as new CATH entries, and we predicted their class and architecture levels. The dataset containing these yet unclassified proteins is called PR165. The dataset of the combined list of protein entries is referred to as PR644. A list of the selected proteins and architecture types can be found on the web (at http://www.weizmann.ac.il/gaddyg/mscwork.html).

Evaluating an architecture prediction algorithm

In order to choose the best architecture prediction algorithm, we need to evaluate an algorithm's quality. Since an algorithm can output either a predicted architecture or a "rejection" if it does not have any prediction, one has to estimate two probabilities: P_success and P_reject. Robust estimation of these parameters is produced by cross-validation (Stone, 1974), a procedure which consists in averaging over many (T) randomly sampled test trials. In each trial, the set PR479 is divided into two subsets; one is used for training the algorithm and the other set, of N_test proteins, is used to test the algorithm by comparing its predictions to the true classification. The probability estimates are given by

    P̂_success = (1/T) Σ_{t=1}^{T} N_success / N_test ,                      (1)

    P̂_non-reject = 1 − P̂_reject                                            (2)
                 = (1/T) Σ_{t=1}^{T} (N_test − N_reject) / N_test .          (3)

Another figure of merit, the purity P_pure, is the probability of correctly classifying non-rejected proteins. It is estimated by

    P̂_pure = P̂_success / (1 − P̂_reject) .                                  (4)

The success and rejection rates were evaluated at 11 different dilutions: we randomly selected test sets of sizes N_test/479 = 1, 10, 20, 30, 40, 50, 60, 70, 80, 90 and 95 percent. Overall, 50 tests were performed for each dilution and every architecture prediction algorithm. The success and rejection rates, plotted as a function of the dilution, are presented in Figure 5 for predictions made by a nearest-neighbor classification method using several similarity measures. The success rate and the purity were used to select the algorithms that were used for prediction (see below).
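The cross-validation estimates of Eqs. (1)-(4) can be sketched in a few lines. This is a minimal illustration, not the authors' code; `predict` stands for any of the classification algorithms discussed below, returning one label per test protein or `None` for a rejection.

```python
import random

def cross_validate(labels, predict, T=50, test_frac=0.2, seed=0):
    """Estimate P_success, P_reject and the purity (Eqs. 1-4) by
    averaging over T randomly sampled test trials."""
    rng = random.Random(seed)
    n = len(labels)
    n_test = max(1, int(round(test_frac * n)))
    succ_sum = rej_sum = 0.0
    for _ in range(T):
        test = rng.sample(range(n), n_test)
        train = sorted(set(range(n)) - set(test))
        preds = predict(train, test)
        n_rej = sum(1 for p in preds if p is None)
        n_succ = sum(1 for p, i in zip(preds, test) if p == labels[i])
        succ_sum += n_succ / n_test    # one term of Eq. 1
        rej_sum += n_rej / n_test      # one term of Eqs. 2-3
    p_success, p_reject = succ_sum / T, rej_sum / T
    p_pure = p_success / (1.0 - p_reject) if p_reject < 1.0 else 0.0  # Eq. 4
    return p_success, p_reject, p_pure
```

Here `test_frac` plays the role of the dilution: the fraction of PR479 whose labels are hidden during a trial.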

Architecture prediction algorithms

As described earlier, the prediction is performed by applying a classification method to a similarity matrix between the proteins. We denote each classification method by an acronym: 1NN for first nearest-neighbor; MF for max field, which is a weighted nearest-neighbors method; D for a direct classification method which uses the hierarchies produced by clustering; and HR for a new, heuristic classification method. All these methods are described in detail elsewhere (Getz, 1998). The similarity measures are denoted by a list of acronyms according to the methods used to create them. Direct similarities include S and Z, which represent the original FSSP scores, and CS and CZ, which stand for normalized versions, the cosine of S and of Z. Other preprocessing techniques apply various clustering methods to construct indirect similarities which further improve the prediction. One type of preprocessing is performed using the KMNV (K-mutual-neighborhood-value) method (Blatt et al., 1996a). Another type produces cophenetic similarities (Jain and Dubes, 1988) from dendrograms created by hierarchical clustering techniques. We tested two such techniques: SPC (super-paramagnetic clustering; Blatt et al., 1996a; 1996b) and AVL (average linkage; Jain and Dubes, 1988). Using this notation, the entire architecture prediction procedure is named by the combination of the classification method and the similarity measure; e.g., 1NN(SPC-K10-CZ) refers to the 1NN algorithm applied to the combined similarities SPC-K10-CZ. These are produced by first applying the KMNV method (with K = 10) to the CZ direct similarities and then using the SPC algorithm to generate a dendrogram from which new similarities are produced (which are then used as inputs of the 1NN algorithm).

Similarity measures

S and Z: the original FSSP scores

S and Z are the original scores calculated by Holm and Sander in order to produce the FSSP database. The S score is a pairwise structural similarity measure between protein chains obtained by a heuristic optimization algorithm, DALI (Holm and Sander, 1993). DALI finds the alignment that gives rise to the most similar Cα distance maps of the aligned residues. The S score between two proteins depends on the length of the aligned parts, which depends in turn on the compared chains. Consequently, the S scores cannot be used directly to compare the pairwise similarity of chains A and B to the pairwise similarity of chains C and D. In order to put all scores on a common scale and to assign them a statistical meaning, Holm and Sander calibrated them against pairwise all-on-all comparisons in the database, as a function of chain size. The calibrated score is expressed in terms of normalized Z scores, that is, the number of standard deviations above the mean. The Z score can attain positive and negative values, but in the FSSP database only values greater than 2.0 are listed.

CS, CZ: normalized cosine similarities

We are led to the introduction of CS and CZ by our interpretation of the S and Z scores as dot products between two unnormalized vectors. The self-similarity of a protein is then the squared norm of its vector. Following this picture, we define as new similarity measures the cosine of the angle between the vectors,

    CS_ij = S_ij / sqrt(S_ii S_jj) ,    CZ_ij = Z_ij / sqrt(Z_ii Z_jj) .    (5)

Remarkably, the S and Z score values in PR644 are consistent with this interpretation: all the CZ_ij with i ≠ j turned out to be less than 1, as expected from a cosine measure. Note that they are all positive, since the S and the available Z scores are positive. The advantage of using cosine scores is that they take into account the self-similarity of the proteins. We tested the 1NN classification method with CS and CZ and compared its performance to that obtained with the original S and Z scores. Figure 5 shows the success and non-rejection rates of 1NN classification using the four similarity measures as a function of the dilution percentage. The cosine measures improve the classification significantly (when compared to the upper bound). Additionally, the cosine transformation improves the S score to approximately the same level as it improves the Z score. Note that the non-rejection rates of all methods are identical.
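Equation (5) is a one-line transformation in NumPy. This is a sketch; in practice the sparse FSSP Z matrix (entries listed only for Z > 2.0) would need its missing values handled, which is omitted here.

```python
import numpy as np

def cosine_similarity(S):
    """CS_ij = S_ij / sqrt(S_ii * S_jj): interpret each raw score as a
    dot product between unnormalized vectors and return the cosine of
    the angle between them (Eq. 5). Applied to Z, this gives CZ."""
    d = np.sqrt(np.diag(S).astype(float))
    return np.asarray(S, dtype=float) / np.outer(d, d)
```

The diagonal of the result is 1 by construction, and off-diagonal entries lie below 1 whenever the dot-product interpretation holds.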


Indirect similarity measures

Non-local properties of the original direct similarities can be used to calculate new similarities, which are therefore qualified as indirect. We proposed and evaluated several indirect similarity measures, all based on pre-clustering the data. Clustering techniques (Jain and Dubes, 1988; Duda and Hart, 1973; Blatt, 1997) use only measurements regarding the objects to be clustered and do not use any external category labels. This is the distinction between clustering and classification techniques. In other words, clustering algorithms make use of the entire similarity matrix (such as that depicted in Figure 1), between all the proteins, classified and unclassified, but they do not use the known labels of the classified proteins. Similarity measures derived from clustering techniques are expected to improve classification performance since they reduce the influence of isolated classified points that "stray" into regions that belong to a different class. This improvement becomes more pronounced as the data get more diluted (Getz et al., 1998).

KMNV: pruning connections by K-mutual-neighborhood-value

The task of the K-mutual-neighborhood-value method (KMNV) is to remove (prune) "inconsistent" edges from a similarity graph or, in other words, to set some similarity values to zero. The algorithm keeps a similarity S~_ij only if proteins i and j are each at most the K-th neighbor of the other, where each protein orders its neighbors in decreasing order of similarity; otherwise we set S~_ij = 0. The performance of the KMNV method is controlled by the integer parameter K. For large K values (K > n, the number of points), no connections are removed; small K values leave only a few connections. We expect that there will be an optimal K value for each classification method. According to the average number of neighbors in PR479, we chose to test K values from 5 to 30. The similarity measures produced by KMNV, applied to Z and CZ, were tested using the 1NN classification method. This pruning improved the performance of the Z scores but not the CZ scores.

Indirect similarities from hierarchical clustering methods

Here we describe how to use hierarchical clustering methods to produce cophenetic similarity measures. Hierarchical clustering methods produce a sequence of nested* partitions to describe the data (Jain and Dubes, 1988). Such a sequence can be described by a tree which is called a dendrogram (see Figure 6). The dendrogram provides a picture of the clustering that can be easily interpreted. Cutting the dendrogram at any level defines a particular clustering, or partition, in the sequence. The level itself, which is the index in the sequence of partitions, has in general no direct correspondence to the original similarities that were used to create the clustering. The leaves of the dendrogram are the individual data points; at the lowest level each point constitutes its own cluster. Using the dendrogram one can define a new similarity measure between the data points, called the cophenetic similarity (Jain and Dubes, 1988). The cophenetic similarity, S^C_ij, between two given points is defined as the lowest level in the dendrogram at which the two points are still in one cluster. For example, according to the dendrogram in Figure 6 the cophenetic similarity between points C and D is 1, since they split off into different branches at level 1. The levels are numbered in increasing order from zero at the root towards the leaves. The cophenetic proximity is a similarity measure, since two points which stay together on the same branch of the dendrogram, i.e. remain members of the same cluster up to a higher level, get a higher cophenetic proximity than points that split into different clusters close to the root.

*Partition X is nested in partition Y if each set in X is a proper subset of a set in Y.
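The KMNV pruning introduced above and the cophenetic construction can both be sketched in NumPy. These are minimal illustrations, not the authors' implementations: `kmnv_prune` and `avl_cophenetic` are hypothetical helpers, and the latter uses plain average linkage in place of an SPC or AVL dendrogram, recording the merge similarity rather than the level index (for average linkage the merge similarity decreases monotonically with the level, so both induce the same ordering of pairs).

```python
import numpy as np

def kmnv_prune(S, K):
    """Keep S_ij only when i and j are each among the other's K most
    similar neighbors; otherwise set it to zero."""
    n = S.shape[0]
    work = np.array(S, dtype=float)
    np.fill_diagonal(work, -np.inf)            # a point is not its own neighbor
    order = np.argsort(-work, axis=1)          # neighbors, most similar first
    rank = np.empty_like(order)
    rank[np.arange(n)[:, None], order] = np.arange(n)[None, :]
    mutual = (rank < K) & (rank.T < K)
    pruned = np.where(mutual, S, 0.0)
    np.fill_diagonal(pruned, np.diag(S))       # keep the self-similarities
    return pruned

def avl_cophenetic(S):
    """Average-linkage (UPGMA) clustering of a similarity matrix,
    returning the similarity at which each pair is first merged."""
    clusters = {i: [i] for i in range(S.shape[0])}
    coph = np.array(S, dtype=float)
    while len(clusters) > 1:
        keys = list(clusters)
        best, pair = -np.inf, None
        for a in range(len(keys)):             # most similar pair of clusters
            for b in range(a + 1, len(keys)):
                avg = np.mean([S[i, j] for i in clusters[keys[a]]
                               for j in clusters[keys[b]]])
                if avg > best:
                    best, pair = avg, (keys[a], keys[b])
        ka, kb = pair
        for i in clusters[ka]:                 # merge level = cophenetic similarity
            for j in clusters[kb]:
                coph[i, j] = coph[j, i] = best
        clusters[ka] = clusters[ka] + clusters.pop(kb)
    return coph
```

Chaining the two, `avl_cophenetic(kmnv_prune(CZ, 10))`, mirrors the naming convention of the combined measures (e.g. AVL-K10-CZ).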


We suggest using the cophenetic similarity to perform the classification. The dendrogram created, and therefore the induced cophenetic similarities, depend on the clustering technique used.

Clustering algorithms

The first type of hierarchical clustering method we used is the Super-Paramagnetic Clustering algorithm (SPC), which was introduced by Blatt et al. (1996a; 1996b; 1997). We slightly modified the algorithm in order to use it as a hierarchical clustering method (Getz, 1998). We tested the 1NN classification method with many cophenetic similarities induced from SPC dendrograms. The dendrograms were created by applying SPC to the Z and CZ direct similarities, and to similarities created from them by the KMNV algorithm with 5 ≤ K ≤ 30. The highest success rate was obtained by applying SPC to the combination of KMNV with K = 10 and the CZ direct similarity measure, i.e. 1NN(SPC-K10-CZ). We compared the 1NN(SPC-K10-CZ) results to those of a classification method that uses the dendrogram directly, D(SPC-K10-CZ), which achieves a lower success rate but a higher purity than the 1NN method (see Table III). The second type of hierarchical clustering algorithm we used is the average-linkage method (AVL) (Jain and Dubes, 1988), also called UPGMA. This method is an agglomerative clustering algorithm: it starts from n distinct clusters, one for each point, and forms the hierarchy by successively merging two clusters at a time. The AVL algorithm merges the most similar pair of clusters, where the similarity between two clusters is defined as the average of the similarities between all pairs of points that belong to them. The AVL algorithm, applied to the Z score, was used by Holm and Sander to create the dendrogram reported in the FSSP database (Holm and Sander, 1994). We tested the 1NN and the D classification methods with the dendrograms created by the AVL algorithm. The dendrograms were created using the direct similarities Z and CZ and the outcome of the KMNV algorithm with K values from 5 to 30. The best classification results are obtained using the CZ direct similarity measure, 1NN(AVL-CZ). In terms of success rate the AVL method performs better than the SPC algorithm; the purity, however, is higher for the SPC algorithm. One conclusion from the average-linkage analysis is that the CZ similarity measure produces a better dendrogram (for architecture classification) than the Z similarity measure. We therefore suggest that using CZ instead of Z can improve the dendrogram reported in the FSSP database.

Classification methods

Nearest-neighbors classification methods

Nearest-neighbors (NN) algorithms are widely known and studied (Duda and Hart, 1973; Fukunaga, 1990). An NN algorithm classifies a protein according to the classes of its nearest neighbors, where the distance is defined using some similarity measure. The simplest NN algorithm is the first nearest-neighbor, 1NN, in which the predicted class of a data point is the class of its closest, or most similar, classified neighbor, i.e. the one with the highest similarity to it. Other, more complicated NN methods use the k nearest neighbors in order to predict the class (Fukunaga, 1990). We tested a method referred to as max field (MF), which classifies a point i by taking a weighted vote among its k nearest classified neighbors. The weight of each neighbor is its similarity with the point i. Hence the first nearest neighbor (the one with the highest similarity) has the largest contribution to the class decision, but in cases where the nearest neighbor's class does not agree with several next-nearest neighbors, the class prediction may differ from that of the first nearest neighbor.
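Both rules fit in one hypothetical helper, sketched here under the assumption that unclassified proteins carry the label `None`; this is an illustration, not the authors' code.

```python
import numpy as np

def nn_classify(S, labels, i, k=1):
    """Classify protein i from the classes of its k most similar
    classified neighbors, each vote weighted by its similarity to i.
    k=1 is the 1NN rule; k>1 sketches the max-field (MF) idea."""
    neighbors = [j for j in range(len(labels))
                 if j != i and labels[j] is not None]
    neighbors.sort(key=lambda j: -S[i, j])     # most similar first
    votes = {}
    for j in neighbors[:k]:
        votes[labels[j]] = votes.get(labels[j], 0.0) + S[i, j]
    if not votes:
        return None                            # no classified neighbor: reject
    return max(votes, key=votes.get)
```

With k=1 the weighted vote degenerates to picking the single most similar classified neighbor, so the two rules only disagree when the next-nearest neighbors outvote the first.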


Classification methods used with dendrograms

The classification using a dendrogram was performed by two methods: one by applying the 1NN method to the cophenetic similarities, and the other by using the dendrogram in a more direct, top-down approach (denoted by D). In order to classify a yet unclassified point x by the 1NN method, one starts from the leaf of the dendrogram that corresponds to x and moves up towards the root. At each step, which corresponds to a lower cophenetic similarity, more points join the cluster to which the unclassified point belongs. The first time one (or more) classified points join the cluster, we perform a majority vote between their classes to obtain the predicted class of the unclassified point, demanding that at least 70% of the votes are for a certain class; otherwise a rejection is returned. The direct classification method (D) works as follows: first, we identify large and pure clusters by going down the dendrogram and stopping whenever a cluster is larger than 5 proteins and its purity exceeds 80%, meaning that more than 80% of its classified members are of the same architecture. Then, all unclassified members of these clusters are assigned to the class of the majority. The remaining unclassified proteins, which belong to small or non-pure clusters, are rejected.

HR: heuristic approach to classification

We introduced a new classification scheme that is based on the cost function defined by the SPC algorithm. Each protein is assigned an integer, c_i, describing its architecture (1 to 29, in our case). The proteins with known classification have a fixed value according to their classification, whereas the integers assigned to the yet unclassified proteins are random variables that can attain any value from 1 to 29. To each architecture configuration C = {c_i} a cost is assigned that penalizes assigning different architectures to any pair of proteins. The value of this penalty is chosen to be the similarity measure between proteins i and j, S~_ij, which can be any of the similarity measures defined previously. The cost function is defined as the sum of the penalties over all protein pairs <i,j>,

    E(C) = Σ_<i,j> S~_ij [1 − δ(c_i, c_j)] ,                                (6)

where δ is the Kronecker delta. The classification problem is stated as finding the minimal-cost configuration of the unclassified proteins, while keeping the architectures (i.e. the c_i values) of the classified proteins fixed. This problem corresponds to finding the ground state of a random-field Potts ferromagnet. We propose a heuristic approach, described in detail elsewhere (Getz, 1998), to find a low-energy configuration. The heuristic is an iterative greedy algorithm. The algorithm can identify at which iteration, if any, it performed a heuristic decision. At low dilutions it is fairly common to find that the algorithm reaches the global minimum of the cost function. We tested the HR algorithm using the direct similarities Z and CZ, and also after elimination of some edges by means of the KMNV algorithm. The best performance was obtained for K = 20; at high dilutions the classification using HR was better than 1NN(CZ) and even exceeded the upper bound for direct similarities. This is possible since the heuristic algorithm uses indirect considerations in the classification. Another important property of the HR algorithm is that it yields a rejection only for proteins that have no neighbors (classified or not), regardless of the dilution. This causes the purity of the algorithm to decrease as the dilution rate is increased.
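The cost of Eq. (6) and a bare-bones greedy descent in its spirit can be sketched as follows. The actual HR heuristic is more elaborate (Getz, 1998); this sketch merely sweeps over the free proteins, moving each to the locally best architecture while the classified proteins stay fixed.

```python
import numpy as np

def potts_cost(S, c):
    """E(C) of Eq. (6): sum of S_ij over all pairs with c_i != c_j."""
    n = len(c)
    return sum(S[i, j] for i in range(n) for j in range(i + 1, n)
               if c[i] != c[j])

def greedy_assign(S, labels, n_arch, sweeps=20):
    """Greedy minimization of Eq. (6); labels[i] is None for the free
    (unclassified) proteins, otherwise a fixed architecture index."""
    c = [l if l is not None else 0 for l in labels]
    free = [i for i, l in enumerate(labels) if l is None]
    for _ in range(sweeps):
        changed = False
        for i in free:
            # maximizing the similarity to same-labelled proteins is
            # equivalent to minimizing Eq. (6) with respect to c_i
            gain = [sum(S[i, j] for j in range(len(c))
                        if j != i and c[j] == a) for a in range(n_arch)]
            best = int(np.argmax(gain))
            if best != c[i]:
                c[i], changed = best, True
        if not changed:
            break
    return c
```

Unlike the authors' heuristic, this sketch cannot tell whether it reached the global minimum; it simply stops when a full sweep changes nothing.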

Selection of algorithms for prediction

As described above, each combination of classification method and similarity measure had its success rate and purity estimated (on PR479) as a function of the dilution. We selected 5 algorithms to perform the architecture prediction, and took a majority vote among them to reach the final prediction. The guidelines for selecting the algorithms were the following. (i) Since we predict the PR165 proteins out of the PR644 proteins, we are generally interested in methods that perform well at a dilution of around 25% (165/644). The dilution, however, is a global parameter, that is, the average of the local dilutions surrounding the yet unclassified proteins; therefore, we sought algorithms that perform well in the range of 20% to 50% dilution. (ii) Since we use a majority vote, the best overall performance is reached if one takes uncorrelated classification methods (Fukunaga, 1990). The purity and success rates of the different algorithms showed considerable variation. We selected the following 5 algorithms for the prediction: 1NN(CZ), HR(K20-CZ), 1NN(AVL-CZ), 1NN(SPC-K10-CZ) and D(SPC-K10-CZ). The selected K values of these yielded the optimal success-rate/purity combinations listed in Table III.
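The final vote can be sketched with a hypothetical helper; ties are resolved arbitrarily here, whereas a real pipeline might treat them as rejections.

```python
from collections import Counter

def combine_predictions(per_method):
    """Majority vote among the selected algorithms for one protein.
    `per_method` lists one architecture prediction (or None for a
    rejection) per method; returns the winning architecture and the
    number of agreeing methods, i.e. the 'M_I' group of Table IV."""
    votes = Counter(p for p in per_method if p is not None)
    if not votes:
        return None, 0          # group M0: every method rejected
    arch, count = votes.most_common(1)[0]
    return arch, count
```

The returned count is what sorts PR165 into the majority groups M5 (all five methods agree) down to M0 (all reject).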

ACKNOWLEDGEMENTS

We thank Liisa Holm for making the raw FSSP data available and for useful discussions. This work is based on a thesis for the M.Sc. degree submitted by G.G. to Tel-Aviv University. We also thank Noam Shental for discussions. This research was supported by grants from the Minerva Foundation, the Germany-Israel Science Foundation (GIF) and the United States-Israel Binational Science Foundation (BSF).


REFERENCES

[1] Abola, E.E., Bernstein, F.C., Bryant, S.H., Koetzle, T.F. & Weng, J. (1987). Protein Data Bank. In Crystallographic Databases - Information Content, Software Systems, Scientific Applications (Allen, F.H., Bergerhoff, G. & Sievers, R., eds), 107-132.
[2] Bernstein, F.C., et al. & Tasumi, M. (1977). The Protein Data Bank: a computer-based archival file for macromolecular structures. J. Mol. Biol. 112, 535-542.
[3] Blatt, M., Wiseman, S. & Domany, E. (1996). Super-paramagnetic clustering of data. Phys. Rev. Lett. 76, 3251-3255.
[4] Blatt, M., Wiseman, S. & Domany, E. (1996). Clustering data through an analogy to the Potts model. Advances in Neural Information Processing Systems 8, 416-422, Touretzky, Mozer & Hasselmo, eds., MIT Press.
[5] Blatt, M. (1997). Clustering of Data, Computational Capabilities and Applications of Neural Networks. Ph.D. Thesis.
[6] Bowie, J.U., Luthy, R. & Eisenberg, D. (1991). A method to identify protein sequences that fold into a known three-dimensional structure. Science 253, 164-170.
[7] Chothia, C. (1992). One thousand families for the molecular biologist. Nature 357, 543-544.
[8] Duda, R.O. & Hart, P.E. (1973). Pattern Classification and Scene Analysis. Wiley-Interscience, New York.
[9] Finkelstein, A. (1997). Protein structure: what is it possible to predict now? Curr. Op. Struct. Biol. 7, 60-71.
[10] Fischer, D., Rice, D., Bowie, J.U. & Eisenberg, D. (1996). Assigning amino acid sequences to 3-dimensional protein folds. FASEB J. 10, 126-136.
[11] Fukunaga, K. (1990). Introduction to Statistical Pattern Recognition. Academic Press, San Diego.
[12] Getz, G. (1998). Classifying and Clustering Protein Structures. M.Sc. Thesis.
[13] Gerstein, M. & Levitt, M. (1997). A structural census of the current population of protein sequences. Proc. Natl. Acad. Sci. USA 94, 11911-11916.
[14] Gibrat, J.F., Madej, T. & Bryant, S.H. (1996). Surprising similarities in structure comparison. Curr. Op. Struct. Biol. 6, 377-385.
[15] Holm, L. & Sander, C. (1993). Protein structure comparison by alignment of distance matrices. J. Mol. Biol. 233, 123-138.
[16] Holm, L. & Sander, C. (1994). The FSSP database of structurally aligned protein fold families. Nucl. Acids Res. 22, 3600-3609.
[17] Holm, L. & Sander, C. (1996). Mapping the protein universe. Science 273, 595-602.
[18] Holm, L. & Sander, C. (1996). The FSSP database: fold classification based on structure-structure alignment of proteins. Nucl. Acids Res. 24, 206-209.
[19] Jain, A.K. & Dubes, R.C. (1988). Algorithms for Clustering Data. Prentice-Hall, Englewood Cliffs.
[20] Jernigan, R.L. & Bahar, I. (1996). Structure-derived potentials and protein folding. Curr. Op. Struct. Biol. 6, 195-209.
[21] Jones, D.T., Taylor, W.R. & Thornton, J.M. (1992). A new approach to protein fold recognition. Nature 358, 86-89.
[22] Murzin, A.G., Brenner, S.E., Hubbard, T. & Chothia, C. (1995). SCOP: a structural classification of proteins database for the investigation of sequences and structures. J. Mol. Biol. 247, 536-540.
[23] Murzin, A.G. (1996). Structural classification of proteins: new superfamilies. Curr. Op. Struct. Biol. 6, 386-394.
[24] Orengo, C.A., Brown, N.P. & Taylor, W.R. (1992). Fast structure alignment for protein databank searching. Proteins: Structure, Function and Genetics 14, 139-167.
[25] Orengo, C.A., Jones, D.T. & Thornton, J.M. (1994). Protein superfamilies and domain superfolds. Nature 372, 631-634.
[26] Orengo, C.A., Michie, A.D., Jones, S., Jones, D.T., Swindells, M.B. & Thornton, J.M. (1997). CATH - a hierarchic classification of protein domain structures. Structure 5, 1093-1108.
[27] Sali, A. (1998). 100,000 protein structures for the biologist. Nature Struct. Biol. 5, 1029-1032.
[28] Siddiqui, A.S. & Barton, G.J. (1995). Continuous and discontinuous domains: an algorithm for the automatic generation of reliable protein domain definitions. Protein Sci. 4, 872-884.
[29] Stone, M. (1974). Cross-validatory choice and assessment of statistical predictions. Journal of the Royal Statistical Society Series B 36, 111-113.
[30] Swindells, M.B., Orengo, C.A., Jones, T.J., Hutchinson, E.G. & Thornton, J.M. (1998). Contemporary approaches to protein structure classification. BioEssays 20, 884-891.
[31] Taylor, W.R. & Orengo, C.A. (1989). Protein structure alignment. J. Mol. Biol. 208, 1-22.


Table captions

Table 1. Number of correct predictions using the 1NN algorithm with Z-score similarity.

Table 2. List of proteins for which FSSP does not agree with CATH.

Table 3. Success rates, non-rejection rates and purity of the selected algorithms at dilutions of 20% and 50%.

Table 4. Prediction success rates for proteins according to "majority groups". Group M_I contains those proteins for which I methods determined the majority decision. The last column presents the fraction of proteins in each group. The success rates are averaged over dilutions of 20% and 50%.

Table 5. Class and architecture prediction of PR165. The 147 proteins of groups M5 to M2 are listed.


TABLES

TABLE I. (Getz et al., Table 1.)

Group     Number       Class level     Architecture level
          of points    1NN (Bound)     1NN (Bound)
1 (Aa)        24          0 (0)             0 (0)
2 (Bb)        17          0 (0)             0 (0)
3 (Cb)        32         23 (32)            0 (0)
4 (Cc)       250        236 (250)         216 (250)
5 (Db)         9          9 (9)             0 (0)
6 (Dc)       102        102 (102)          95 (102)
7 (Dd)        45         45 (45)           45 (45)
Total        479        415 (438)         356 (397)

TABLE II. (Getz et al., Table 2.)

PDB      CATH                      # of FSSP    Most frequent            # of neighbors
entry    architecture              neighbors    neighbor architecture    from frequent arch.
1rboC    α-β Two-layer sandwich        7        Mainly-β sandwich              7
2cas     Mainly-β complex             22        Mainly-β sandwich             20
1hgeA    α-β complex                  21        Mainly-β sandwich             18
1regX    α-β complex                   9        Distorted sandwich             7
1celA    α-β Two-layer sandwich       18        Mainly-β sandwich             13

TABLE III. (Getz et al., Table 3.)

method              dilution   success rate    non-rejection rate   purity
1NN(CZ)               20%      74.5 ± 0.5%       93.9 ± 0.4%        79.3 ± 0.5%
                      50%      67.5 ± 0.4%       90.2 ± 0.2%        74.8 ± 0.4%
HR(K20-CZ)            20%      74.3 ± 0.5%       94.1 ± 0.3%        79.0 ± 0.6%
                      50%      69.1 ± 0.4%       93.6 ± 0.1%        73.8 ± 0.4%
1NN(AVL-CZ)           20%      72.4 ± 0.5%       87.1 ± 0.5%        83.2 ± 0.5%
                      50%      66.4 ± 0.4%       83.5 ± 0.4%        79.6 ± 0.4%
1NN(SPC-K10-CZ)       20%      69.4 ± 0.6%       82.1 ± 0.5%        84.6 ± 0.5%
                      50%      63.0 ± 0.4%       76.0 ± 0.4%        83.0 ± 0.3%
D(SPC-K10-CZ)         20%      49.9 ± 0.5%       55.2 ± 0.5%        90.4 ± 0.5%
                      50%      49.7 ± 0.4%       55.8 ± 0.5%        89.2 ± 0.4%

TABLE IV. (Getz et al., Table 4.)

Group    success rate    fraction
M5         94 ± 1%         50%
M4         81 ± 2%         24%
M3         63 ± 4%         11%
M2         33 ± 4%          8%
M1          5 ± 3%          2%
M0          0%              5%

TABLE V. (Getz et al., Table 5.)

Class - Architecture                     Group  PDB entry list
Mainly-α - Non-bundle (1.10)             M5     1c53, 1lefA, 1fow
                                         M4     1ron, 1eciA, 1tafA, 1tafB, 1xsm, 1pueE, 1iba(3.30), 1occE, 1thjA, 1tfr
                                         M3     1ctdA, 1ponA, 1ponB, 1ppbL, 1tfe, 1cpo, 1cei, 1nox(3.40)
                                         M2     1fct, 1hph, 1dmc, 1zwb, 1zwc, 1zwd, 1zwe, 1higB, 1lbd, 1occH, 1pbwA, 1jvr
Mainly-α - Bundle (1.20)                 M5     2brd, 1rmi, 1hmcB, 1occA, 1occC
                                         M4     1lre, 1dkzA
                                         M3     1ery
                                         M2     1vnc
Mainly-α - Few SS (1.30)                 M4     1i??, 1gcmA, 1fosE, 1fosF, 1lyp, 1psm, 1spf, 1wfbA, 1peh, 2ifm, 4ifm, 1tiiC(4.10), 1occK(4.10), 1occL(4.10)
                                         M3     1bmfG(1.20), 1mof(4.10), 1occM(4.10), 1occI(4.10), 1occJ(4.10)
                                         M2     1ppt, 1occD(1.20)
Mainly-β - Ribbon (2.10)                 M5     1edmB, 1emn, 1pfxL
                                         M4     2cbh, 7apiB
                                         M3     1hleB
                                         M2     1frvB(4.10), 1hdj(1.10)
Mainly-β - Single Sheet (2.20)           M4     1bnb, 1pft
Mainly-β - Roll (2.30)                   M4     1lepA, 1mai, 1whi
Mainly-β - Barrel (2.40)                 M4     1fmb
                                         M3     1tiiD, 1ghj, 1pfsA, 1iyu, 1ema
Mainly-β - Sandwich (2.60)               M5     1cwpA, 1ahsA, 1wkt, 1wit, 1lcl, 1stmA, 1occB(2.40), 1lrv
                                         M4     1mspA
                                         M3     1tul(2.70)
                                         M2     1occF
Mainly-β - Distorted sandwich (2.70)     M4     1dec, 1dutA
Mainly-β - Trefoil (2.80)                M5     2ila, 1wba
Mainly-β - Four-propellor (2.110)        M3     1pmc
Mainly-β - Eight-propellor (2.140)       M2     1npoA(2.60)
α-β - Roll (3.10)                        M4     1lit
                                         M3     1alo
α-β - Barrel (3.20)                      M5     1nsj, 1cb2A, 1eceA, 1dhpA, 1onrA
α-β - Two-layer sandwich (3.30)          M5     1mli, 1vhiA, 1pytA, 2fua
                                         M4     1tys, 1far
                                         M3     1qbeA, 1pnh, 1sis, 1sco, 2crd, 1zfd(3.40), 1hqi(3.90)
                                         M2     1pmaP(2.20), 1apyB(3.50)
α-β - Three-layer (αβα) sandwich (3.40)  M5     1efm, 1din, 1hgxA, 1qrdA, 1kte, 1cydA, 1enp, 1frvA, 1rnl, 1broA, 1rvv1, 1fds, 1kinA, 1cfr, 1xvaA, 1xel
                                         M4     1kuh, 1gdoA(3.30), 1mil(3.30), 1jud(3.60), 1rie
                                         M2     1srsA, 1pmaA, 1apyA
Few SS - Irregular (4.10)                M4     1gur, 1eit

Figure captions

Figure 1. Z-score matrix between all pairs of PR644 proteins. A black dot represents Z > 2.0. Larger Z corresponds to more similar proteins. For the first 479 proteins the CATH classification is known. For the last 165 proteins we predict the classification.

Figure 2. The PR479 Z-matrix ordered according to the CATH classification. Each black dot represents Z > 2.0. The full-line grid separates the class level; the dotted-line grid, the architectures. Marked regions: A - no Z > 2.0 connections between mainly-α and mainly-β; B - some connections between (mainly-α, mainly-β) and α-β; C - α-β barrels; D - mainly-β sandwiches.

Figure 3. The FSSP alignment of 1tiiC with 1junA (Z score of 4.1).

Figure 4. A cartoon plot of 1mil, which is predicted to be an α-β three-layer sandwich and is classified as an α-β two-layer sandwich.

Figure 5. Success and non-rejection rates of the S, CS, Z and CZ scores.

Figure 6. A dendrogram which describes the sequence of partitions listed on the right.

FIGURES

FIG. 1. (Getz et al., Fig. 1. The axes run over proteins 1-644; indices 1-479 are PR479 and 480-644 are PR165.)

FIG. 2. (Getz et al., Fig. 2. The axes are grouped into the Mainly-α, Mainly-β and α-β blocks, with the marked regions A-D.)

FIG. 3. (Getz et al., Fig. 3.)

FIG. 4. (Getz et al., Fig. 4.)

FIG. 5. (Getz et al., Fig. 5. Success and non-rejection rates, in percent, versus the dilution fraction for 1NN(Z), 1NN(CZ), 1NN(S), 1NN(CS) and the success-rate upper bound.)

FIG. 6. (Getz et al., Fig. 6.) Levels in the dendrogram and the corresponding partitions:

Level 0: {{A,B,C,D,E,F}}
Level 1: {{A,B},{C,D,E,F}}
Level 2: {{A,B},{C},{D,E,F}}
Level 3: {{A},{B},{C},{D,E,F}}
Level 4: {{A},{B},{C},{D},{E},{F}}