The POPPs: clustering and searching using peptide probability profiles

Vol. 18 Suppl. 1 2002 Pages S38–S45

BIOINFORMATICS

Michael J. Wise
Department of Genetics, Cambridge, CB2 3EH, England

ABSTRACT
The POPPs is a suite of inter-related software tools which allow the user to discover what is statistically ‘unusual’ in the composition of an unknown protein, or to automatically cluster proteins into families based on peptide composition. Finally, the user can search for related proteins based on peptide composition. Statistically based peptide composition provides a view of proteins that is, to some extent, orthogonal to that provided by sequence. In a test study, the POPP suite is able to regroup into their families sets of approximately 100 randomised Pfam protein domains. The POPPs suite is used to explore the diverse set of late embryogenesis abundant (LEA) proteins.
Availability: Contact the author.
Contact: [email protected]
Keywords: protein clustering; clustering algorithms; protein search; peptide composition; Pfam; late embryogenesis abundant proteins.

INTRODUCTION
The method of choice for most biologists faced with protein sequence data is to compare their sequences against those in a protein database such as SwissProt using the Smith–Waterman algorithm, e.g. Scanps (Barton, 1993), or approximations to the Smith–Waterman algorithm, such as BLAST (Altschul et al., 1990). If several closely related sequences are known, more sensitive searches can be undertaken by first constructing a multiple-sequence alignment, e.g. using Clustal-W (Thompson et al., 1994). An alternative approach is to create a profile hidden Markov model (HMM) (Durbin et al., 1998) for the group of known proteins (or more accurately, a group of closely related protein domains) and then test unknown sequences against the HMM. For example, Pfam (Bateman et al., 2002) is a library of profile HMMs covering a large number of protein-domain families. While these search tools have become very powerful, for each method there are caveats that users need to be aware of in order to make best use of the most appropriate tool for the data they have and the scientific question they wish to address. For example, searching with a single sequence means that you are limited by the information provided by the sequence of amino acids in

Table 1. Ratios of Counts for Most Common Peptides (1aa-3aa) versus Least Common Peptides

Peptide length   Most Common Peptide   Least Common Peptide   Ratio
1aa              L                     W                      7.03
2aa              LL                    WC                     44.04
3aa              LLL                   WCM                    331.07

the sequence. On the other hand, searching with the consensus of a multiple sequence alignment first implies that a group of related proteins exists and that these are not too divergent (or else the resulting multiple alignment will have a large number of gaps and will require considerable editing). The latter problem is also an issue for HMMs because the more divergent the sequences, the greater the number of sequences that are required to train the HMM. Attwood and Parry-Smith (1999) provide a useful discussion of some of these caveats.

A very different approach, embodied in the sequence comparison tool PROPSEARCH (Hobohm et al., 1994), compares vectors consisting of protein properties: the amino acid percentage composition and the sequence length. A more recent search tool, AAPAIR.TAB (Campion et al., 2001), performs a similar analysis, but in this case based on dipeptide composition normalized with respect to a fixed value of 400. In other projects, amino acid composition has been used to directly predict protein fold type (Nakashima et al., 1986) and, in combination with other properties such as sequence length and isoelectric point, broad functional category (desJardins et al., 1997).

A factor which limits such peptide counting techniques is the implicit assumption that all amino acids, or dipeptides, or tripeptides, etc., are equally probable. However, the base distribution of peptides is not uniform when viewed across a large number of protein sequences, such as SwissProt plus SPTREMBL (together also known as Swall, with 665 472 protein records as of Jan 14, 2002); the ratios of the most common peptides versus the least common are presented in Table 1.

This paper describes a new suite of software tools, collectively called The POPPs, which together considerably extend the basic peptide-counting idea. They enable the user to discover statistically ‘unusual’ (and hopefully characteristic) peptides of an unknown protein, to automatically cluster proteins into families and finally to search for related proteins. There are currently three software tools that make up the suite: popp create.py, popp cmp.py and popp search.py.

© Oxford University Press 2002

POPP CREATE.PY
The first tool is called popp create.py. Given one or more sequences or files of sequences, which can be protein or nucleotide (this discussion will focus on proteins), popp create.py compares the distributions of peptides of length 1aa–3aa (typically), found in individual sequences or across files of sequences, versus their distributions across a suitably large database (currently Swall). Only non-overlapping repeats of the same peptide are counted. A single-sided binomial distribution statistic is used to produce a list of those peptides that are either significantly over-represented in the samples versus the database or significantly under-represented, both with respect to a user-settable threshold probability value. The user is also able to specify whether zero counts (i.e. complete absence) of particular peptides are to be included in the list, assuming that any are statistically significant with respect to the nominated threshold probability. Peptides whose absolute probability is greater than the threshold are not reported. This list, called a Protein or Oligonucleotide Probability Profile, or POPP, can provide useful information about the sorts of peptides that are characteristic of the sequence or group of sequences. In summary, rather than comparing vectors containing raw counts or percentage compositions of a fixed peptide length, popp create.py scans a range of peptide lengths and only reports those peptides which are statistically ‘unusual’.

Using popp create.py
Enzymes are an extremely diverse range of proteins which catalyse a wide variety of reactions in cells. The ‘lock-and-key’ principle under which enzymes are believed to operate necessarily constrains their shapes compared to, for instance, structural proteins. The question one might wish to ask, therefore, is whether there is anything special about the range of peptides found in subsets of enzymes versus the universe of proteins.
Put another way, individual enzymes are constrained by the substrates they are to manipulate and the mechanisms they employ. How is this evident in the peptides that make up the enzymes? popp create.py is able to provide some insight into such questions. A set of protein sequences was extracted via SRS from the SwissProt database (as at Jan 14, 2002) based on the requirement that the description lines must contain

an Enzyme Commission (EC) number (http://www.chem.qmw.ac.uk/iubmb/enzyme). This results in a set of 33 657 sequences. Two different subsets were drawn from this database: a set of enzymes from the class peptidases (EC 3.4) and a set of enzymes across a range of EC numbers which all refer to RNA or DNA (i.e. differing catalytic activities but common substrate). However, as is often the case, the distribution of the different sorts of enzymes in the database is very skewed, with many being represented by a single sequence while a small number have a large number of instances in the database from a wide variety of species. To reduce the imbalance somewhat, the SwissProt identifiers were sorted. Where there is only one sequence in the set for a given protein type, that sequence was retained, but where there is more than one, one was chosen at random. The resulting databases contained, respectively, 892 and 552 sequences. However, there may still be considerable similarities in the actual sequences, so nrdb90.pl (Holm and Sander, 1998) was used to return sets of sequences which have no more than 90% similarity. The final peptidase and DNA/RNA binding enzyme databases contained 850 and 537 sequences, respectively.

When popp create.py was run against the set of peptidases with a maximum peptide length of 4aa, the most significantly over-represented peptides and their respective Pr values were D (1.5e−205), AAHC (1.1e−147), DSGG (1.6e−124), Y (1.6e−124), GDSG (4.5e−115), TAAH (2.8e−113) and N (4.7e−111), though more specifically, NG (3.5e−82). The appearance of peptides including S, H and N is significant due to the participation of these in catalytic triads. C (2.6e−51) is also significantly over-represented in this group of enzymes, which is likely due to the added stability afforded by disulphide bonds to peptides working under denaturing conditions (e.g. digestive enzymes).
At the other extreme, L (−1.1e−123), R (−1.4e−106), M (−5.3e−73) and E (−2e−31) are significantly under-represented. Note that a negative probability is used to indicate under-representation.

Running popp create.py against the set of enzyme sequences that relate to DNA or RNA produced a distinctly different set of statistically significant peptides. The most significantly over-represented peptides now are K (2.5e−206), E (5.1e−68), D (6.2e−48), KK (2.1e−39), EK (1.2e−34) and KF (2.2e−27), while the most significantly under-represented peptides are G (−1.3e−88), GG (−1.2e−44), A (−5.7e−33), P (−3.3e−26) and C (−1.1e−14). It is interesting to note that while P is very significantly under-represented, KP is significantly over-represented (5.4e−05).

By way of comparison, the tail of the H1 histone is believed to be involved in chromatin condensation (Woo et al., 1995) and is related to the AT hook DNA binding motif. Running popp create.py on the tail of the H1 CHICK sequences reveals significant over-representation of K, KK and KP (though the Pr values are lower because of the smaller number of peptides being counted), and significant under-representation of G. On the other hand, A, AK and AAK are over-represented in the H1 tail, although they are significantly under-represented in the DNA/RNA related enzymes, and D is absent altogether.
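The counting and testing steps described for popp create.py can be illustrated with a short sketch. This is not the author's implementation: the function names are hypothetical, the number of trials `n` is supplied by the caller (the paper does not say how it is derived), and the exact tail sum is only practical for small `n`. The background probability `p` stands in for a peptide's relative frequency in a large database such as Swall.

```python
from math import comb

def count_nonoverlapping(seq: str, peptide: str) -> int:
    # str.count returns the number of non-overlapping occurrences.
    return seq.count(peptide)

def binom_pmf(i: int, n: int, p: float) -> float:
    # Probability of exactly i successes in n trials.
    return comb(n, i) * p**i * (1 - p) ** (n - i)

def signed_tail(k: int, n: int, p: float, threshold: float):
    """Return a signed one-sided tail probability, or None if not significant.

    Positive values mark over-representation and negative values mark
    under-representation, following the sign convention used in the text.
    """
    upper = sum(binom_pmf(i, n, p) for i in range(k, n + 1))  # P(X >= k)
    lower = sum(binom_pmf(i, n, p) for i in range(0, k + 1))  # P(X <= k)
    if k > n * p and upper <= threshold:
        return upper
    if k < n * p and lower <= threshold:
        return -lower
    return None  # absolute probability above the threshold: not reported
```

A call such as `signed_tail(0, 50, 0.1, 0.05)` corresponds to the zero-count case mentioned above: the peptide is absent, and the negative result flags it as significantly under-represented at that threshold.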

POPP CMP.PY
An alternative output format for popp create.py is to create a POPP vector for each input sequence (or file). POPP vectors contain the same information as the profiles but in a compressed form; the profiles are formatted for inspection by users while the vectors are used by the second and third components of the POPPs suite, popp cmp.py and popp search.py. popp cmp.py applies a clustering algorithm to the POPP vectors so that related proteins are formed into groups around a consensus POPP (i.e. a POPP composed of those peptides that are significantly over-represented in all the component POPPs or significantly under-represented in all the component POPPs). The clustering algorithm is a considerable development of the k-means algorithm as described in Herwig et al. (1999). While there are several differences between the algorithm in Herwig et al. (1999) and the current implementation, from the user’s point of view the major difference is that POPP vectors are not forced to belong to a single cluster but may appear in any cluster where this is appropriate. The sequence of actions undertaken by popp cmp.py is:

Initialisation
One or more POPP vector files are read in and the POPPs are compared all-against-all. The scoring algorithm (based on the digital signal processing correlation function) is: where the two POPPs share a common peptide (i.e. it is significant in both), if the signs are the same (i.e. both over-representations or both under-representations) the length of the peptide (so 1–3 typically) is added to the score. If the signs differ, the score is reduced by the length. The k-means algorithm requires that thresholds be nominated for the score above which a new POPP can be added to a cluster (here called the add-score) and the score above which two clusters may be merged (the merge-score).
This is done by the user nominating specific minimum add and merge scores, or by the user specifying percentile values (typically 90 and 95%) of the scores obtained from the all-against-all comparisons. An initial set of clusters is obtained from those pairs of POPPs whose scores from the all-against-all comparisons are above the add-score threshold. This differs from the algorithm in Herwig et al. (1999), where a random set of sufficiently different POPPs would be chosen as the starting clusters. The current strategy was chosen because it increases the likelihood of high scoring clusters being formed.
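The initialisation stage can be sketched in a few lines. Several things here are assumptions made for illustration rather than details from the paper: a POPP vector is modelled as a dict mapping each significant peptide to +1 (over-represented) or -1 (under-represented), the percentile uses a simple nearest-rank rule, a score equal to the add-score is treated as passing, and all function names are hypothetical.

```python
from itertools import combinations

def popp_score(a: dict, b: dict) -> int:
    """Add the peptide length when signs agree, subtract it when they differ."""
    score = 0
    for peptide in a.keys() & b.keys():  # peptides significant in both POPPs
        length = len(peptide)
        score += length if a[peptide] == b[peptide] else -length
    return score

def all_against_all(popps: list) -> dict:
    """Score every unordered pair of POPPs, keyed by their list indices."""
    return {
        (i, j): popp_score(popps[i], popps[j])
        for i, j in combinations(range(len(popps)), 2)
    }

def percentile_threshold(scores, pct: float) -> int:
    """Nearest-rank percentile of the observed scores (an assumed definition)."""
    ordered = sorted(scores)
    index = min(len(ordered) - 1, int(len(ordered) * pct / 100))
    return ordered[index]

def initial_clusters(pair_scores: dict, add_score: int) -> list:
    """Seed one two-member cluster per pair that reaches the add-score."""
    return [{i, j} for (i, j), s in pair_scores.items() if s >= add_score]
```

Because each seed is simply a high-scoring pair, the same POPP can appear in several seed clusters, which matches the statement that POPPs are not forced into a single cluster.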

Clustering
The clustering algorithm can be summarized in the following pseudo-code.

    repeat until nothing more can be done:
        for each POPP in the input list:
            test the POPP versus the centroid for each cluster found so far
            if score ≥ add-score:
                add POPP to cluster
                recalculate the centroid
        if any POPPs have been added to clusters:
            for all pairs of clusters:
                test the two centroids
                if score ≥ merge-score:
                    merge the two clusters into a new cluster with new centroid
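A runnable reading of the pseudo-code is sketched below. It assumes, for illustration only, that POPPs are dicts mapping peptides to +1 or -1, that the centroid of a cluster is its consensus POPP (peptides with the same sign in every member), and that seed clusters are supplied by the caller. The names `consensus` and `cluster` are hypothetical, and details such as the order in which cluster pairs are tested are not taken from the paper.

```python
from itertools import combinations

def popp_score(a: dict, b: dict) -> int:
    # Peptide length is added on sign agreement and subtracted on disagreement.
    return sum(len(p) if a[p] == b[p] else -len(p) for p in a.keys() & b.keys())

def consensus(members: set, popps: list) -> dict:
    """Peptides that carry the same sign in every member of the cluster."""
    first, *rest = [popps[i] for i in members]
    return {p: s for p, s in first.items() if all(r.get(p) == s for r in rest)}

def cluster(popps: list, seeds, add_score: int, merge_score: int) -> list:
    clusters = [set(s) for s in seeds]
    changed = True
    while changed:  # repeat until nothing more can be done
        changed = False
        for i, popp in enumerate(popps):
            for members in clusters:
                if i in members:
                    continue
                # The centroid is recomputed on demand, so it always
                # reflects the current membership.
                if popp_score(popp, consensus(members, popps)) >= add_score:
                    members.add(i)
                    changed = True
        if changed:
            merged = True
            while merged:  # keep merging until no pair qualifies
                merged = False
                for x, y in combinations(range(len(clusters)), 2):
                    cx = consensus(clusters[x], popps)
                    cy = consensus(clusters[y], popps)
                    if popp_score(cx, cy) >= merge_score:
                        clusters[x] |= clusters[y]
                        del clusters[y]
                        merged = True
                        break
    return clusters
```

The loop terminates because additions only grow clusters and merges only reduce their number; a POPP may still end up in more than one cluster.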

Post-processing
After implementing the clustering stage (which essentially follows Herwig et al. (1999)), it was noticed that as POPPs are added to clusters the centroids may no longer reflect the component POPPs because new POPPs have dragged the centroids away from the earlier POPPs. A three-step post-processing stage has therefore been added to refine the clusters.

• For each cluster and using the scores retained from the all-against-all comparisons, check the scores of each component POPP versus the others. Each POPP where more than 50% of the scores versus other POPPs are below add-score is removed from the cluster.

• It is still possible that within a given cluster, one or more POPPs have a small number of below par scores versus other POPPs in the cluster. The strategy here is to split the cluster so that all the POPPs in a given sub-cluster have pairwise scores greater than or equal to add-score with all of the other POPPs in the cluster, i.e. pairs of POPPs which have below par scores are placed in different sub-clusters. A greedy graph-colouring algorithm is used to achieve this end. POPPs with below add-score scores are linked into a set of graphs with edges joining nodes/POPPs with poor pairwise scores. For each graph, and starting from a random node/POPP, nodes are assigned colours such that adjoining nodes are not given the same colour. Knowing that the first post-processing step has removed what would be the most highly connected nodes, only a small number of colours are required (generally just two).

• Finally, because a POPP can be part of several clusters it is possible that identical clusters will be created. For example, starting with initial clusters containing POPPs A and B and B and C, C may be subsequently added to A,B while A is added to B,C. Because these situations can arise, an additional post-processing stage deletes clusters that duplicate other clusters, or clusters whose member POPPs are wholly contained within another cluster.
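The three post-processing steps above can be sketched as follows. This is an illustration under assumptions, not the tool's code: pairwise scores come from a caller-supplied `score(a, b)` function, nodes are coloured in list order rather than from a random start, POPPs with no poor scores are placed in every sub-cluster, and the names `prune`, `split_cluster` and `drop_contained` are hypothetical.

```python
from itertools import combinations, count

def prune(members: list, score, add_score: int) -> list:
    """Step 1: drop POPPs whose scores are below add-score for over 50% of peers."""
    keep = []
    for m in members:
        others = [o for o in members if o != m]
        low = sum(score(m, o) < add_score for o in others)
        if not others or low <= len(others) / 2:
            keep.append(m)
    return keep

def split_cluster(members: list, score, add_score: int) -> list:
    """Step 2: greedy colouring so poorly scoring pairs land in different sub-clusters."""
    conflicts = {m: set() for m in members}
    for a, b in combinations(members, 2):
        if score(a, b) < add_score:  # edge between poorly matching POPPs
            conflicts[a].add(b)
            conflicts[b].add(a)
    colour = {}
    for node in members:
        if conflicts[node]:
            used = {colour[n] for n in conflicts[node] if n in colour}
            colour[node] = next(c for c in count() if c not in used)
    if not colour:
        return [set(members)]  # no poor pairs: nothing to split
    free = {m for m in members if not conflicts[m]}
    n_colours = max(colour.values()) + 1
    return [free | {m for m, c in colour.items() if c == k} for k in range(n_colours)]

def drop_contained(clusters) -> list:
    """Step 3: remove duplicates and clusters wholly contained in another cluster."""
    kept = []
    for c in sorted({frozenset(c) for c in clusters}, key=len, reverse=True):
        if not any(c <= k for k in kept):
            kept.append(c)
    return kept
```

Greedy colouring guarantees only that adjacent nodes receive different colours, which is exactly the property needed here: no sub-cluster contains a pair with a below-par score.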

Meta-clustering Finally, the same clustering algorithms are also used to perform meta-clustering. That is, the consensus POPPs found in the first pass are themselves clustered into families. The need for meta-clustering can come about in a number of ways. If multi-domain proteins are present, these will give rise to clusters for each of the component domains so long as there are other proteins in the sample sharing those domains. Alternatively, there may be different flavours of the same domain, which is the case, for example, with the avian and mammalian forms of the prion octapeptide repeat. Meta-clustering brings together into families clusters reflecting similar domains or domain flavours, and for each family a consensus POPP is created of the peptides that are identically over- or underrepresented in all the component clusters. Furthermore, if the various families are sufficiently similar, groups of families are brought together into superfamilies, which are distinguished by the fact that each family in a superfamily shares at least one cluster with at least one of the other families. The most highly connected family is selected as the anchor of its superfamily, and the remaining families in the superfamily are then listed in order of increasing distance from the anchor family, as measured by the smallest number of intermediates between the family and the anchor. The anchor family is said to be at distance 0; families which contain a cluster found in the anchor family form a band at distance 1, and so on. In addition, within each band corresponding to a given distance from the anchor family, families are listed from the most connected to the least, where the number of connections is the number of clusters shared with other families. Similarly, within each family clusters are ordered from the most connected to the least. 
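The ordering of families within a superfamily, as described above, amounts to a breadth-first search from the anchor family. The sketch below assumes each family is given as a set of cluster identifiers and that all families belong to one superfamily; the function name and data layout are hypothetical. Ties for the most connected family are broken by input order, which is an arbitrary choice.

```python
from collections import deque

def order_superfamily(families: dict) -> list:
    """Return (family, distance) pairs ordered by band, then by connectedness."""
    # Number of connections: clusters of a family that also occur in another family.
    connections = {
        f: sum(any(c in families[g] for g in families if g != f) for c in families[f])
        for f in families
    }
    # Two families are linked when they share at least one cluster.
    links = {
        f: [g for g in families if g != f and families[f] & families[g]]
        for f in families
    }
    anchor = max(families, key=lambda f: connections[f])
    distance = {anchor: 0}
    queue = deque([anchor])
    while queue:  # breadth-first search gives the fewest intermediates
        f = queue.popleft()
        for g in links[f]:
            if g not in distance:
                distance[g] = distance[f] + 1
                queue.append(g)
    ranked = sorted(distance, key=lambda f: (distance[f], -connections[f]))
    return [(f, distance[f]) for f in ranked]
```

For example, with families `{"F1": {1, 2, 3}, "F2": {3, 4}, "F3": {4, 5}, "F4": {2, 6}}`, F1 becomes the anchor, F2 and F4 form the band at distance 1, and F3 sits at distance 2.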
Testing popp cmp.py
The performance of popp cmp.py is now measured against a gold-standard data set: the set of 1242 protein-domain families found in the Pfam seed database, where each Pfam family is the result of a hand-corrected multiple alignment involving a set of related protein domains (Bateman et al., 2002). A small application was created which randomly selects families from the Pfam seed database, and from each family randomly selects a

number of the proteins which supplied the domains used in the multiple alignment for the family. The number of sequences extracted from each family is variable, ranging between a minimum of 4 and a maximum of 10. The domains corresponding to these sequences are then extracted and each is provided with a new identifier in a FASTA-format record. There is, however, a requirement that the domains are at least 50aa in length. Groups of these new records are progressively added to a file until a threshold number of domain sequences has been accumulated, currently 100. Because the order of the records in the new database is also randomized a ‘cheat-sheet’ is also created which provides a mapping between the source families/sequences and the new identifiers used in the database. The aim of the test is to gauge the extent to which popp cmp.py is able to reassemble the randomized domains into their families. Six sets containing approximately 100 randomized Pfam domains were created. For each set, POPP vectors were created for each domain by comparison with counts from Swall as outlined above, based on a threshold value of 0.5 and scaling to a length of 300aa. When creating vectors for later comparison, it is useful to set the threshold probability fairly low, typically 0.5; otherwise the vectors for small proteins will contain too few peptides for subsequent comparison. Clustering and search applications will typically opt to use more stringent thresholds. Secondly, when clustering POPPs representing proteins (or domains) with a wide range of lengths it is useful to nominate scaling to a suitable median length. What this means is that probabilities are, in effect, treated as conditional probabilities; each probability is divided by a factor which is the actual length divided by the target length. 
Turning to the clustering itself, because popp cmp.py is able to create multiple clusters for a given set of input POPPs, the clustering of a POPP was scored a hit if, counting the POPPs with which it is clustered, more come from its Pfam family than from other Pfam families. A miss is recorded if the POPP is clustered with more POPPs from other Pfam families, and both a hit and a miss if the numbers are equal. Furthermore, as outlined above, the clustering algorithm requires that an ‘add-score’ be specified, i.e. the threshold score for a POPP to be added to a cluster. It is therefore possible that a given POPP may fail to attract that minimum score against any of the other POPPs, in which case an ‘unknown’ verdict is recorded. The probability threshold used was 0.1, while the add-score and merge-score thresholds were set at the 95th and 98th percentile values generated by the respective all-against-all comparisons. The results are summarized in Table 2. What the figures show is that popp cmp.py is largely able to put the domains back into their families.

Table 2. Agreements and Disagreements between POPP clusters and Pfam family assignments

Set Number   Hits   Misses   Unknown   Records
1            77     22       14        107
2            68     22       24        108
3            65     14       32        106
4            48     22       41        104
5            63     32       16        102
6            70     22       20        108

Furthermore, the misses (i.e. clusterings with more out-of-family POPPs) can often reveal interesting relationships between the Pfam domains. For example, in Set #4 a number of the POPPs corresponding to domains from family PF00909 (Ammonium Transporter) cluster with POPPs from domains due to family PF01940 (Integral membrane protein DUF92) and with domains due to family PF02990 (Endomembrane protein 70). In a second example from Set #4, there are a number of links between members of family PF00603, Influenza RNA-dependent RNA polymerase subunit PA, and family PF02498, BRO family, N-terminus. These links may be explained by a statement in the Pfam documentation for PF02498 that other classes of BRO proteins are DNA binding although the function of this class is currently unknown. Finally, because the number of valid out-of-family links may be sufficient to outweigh the within-family links and thus generate apparent misses, the figures in Table 2 are arguably conservative.
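The hit, miss and unknown verdicts used in this test can be written down directly. The sketch below is only an illustration of the scoring rule as stated; the function name and the way cluster partners are passed in (as a list of Pfam family labels) are assumptions.

```python
def verdict(own_family: str, partner_families: list) -> set:
    """Score one POPP given the Pfam families of the POPPs it clusters with."""
    if not partner_families:
        return {"unknown"}  # the POPP never reached add-score against anything
    same = sum(1 for fam in partner_families if fam == own_family)
    other = len(partner_families) - same
    if same > other:
        return {"hit"}
    if other > same:
        return {"miss"}
    return {"hit", "miss"}  # equal numbers count as both
```

Because a tie is recorded as both a hit and a miss, the three verdict counts for a set can add up to more than its number of records.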

POPP SEARCH.PY
The creation of POPP vectors also provides the infrastructure for searching for POPPs which match a given input POPP. This is implemented in the application popp search.py. At present, the search function is quite simple: users specify a string of peptides, each peptide preceded by ‘+’ or ‘-’. Users also specify a threshold probability, a minimum score and the file of POPP vectors that is to be searched. A linear search is undertaken to find POPPs in the file which achieve, using the scoring function described earlier, a score above the minimum for matches against the input POPP (for the given threshold probability). A list of matching POPPs, ranked in order of descending score, is returned to the user. The user may, optionally, also specify a second string, similar in format to the first, containing a subset of the query peptides all of which must be present in the POPPs that are returned as a result of the search. In other words, these constraint POPPs represent a necessary condition on the search query. It is anticipated that the constraint POPPs will provide an avenue for future optimization of the search application.
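The search described above can be sketched as a linear scan. This is a hedged illustration rather than the popp search.py interface: the query syntax (signed peptides separated by spaces or commas), the in-memory dict standing in for the file of POPP vectors, and the function names are all assumptions. The vectors are assumed to have been filtered by the threshold probability already, and a score equal to the minimum is treated as a match.

```python
def parse_query(text: str) -> dict:
    """Turn a string such as '+A, +K, -L' into {'A': 1, 'K': 1, 'L': -1}."""
    signs = {"+": 1, "-": -1}
    return {tok[1:]: signs[tok[0]] for tok in text.replace(",", " ").split()}

def popp_score(a: dict, b: dict) -> int:
    # Peptide length is added on sign agreement and subtracted on disagreement.
    return sum(len(p) if a[p] == b[p] else -len(p) for p in a.keys() & b.keys())

def search(query: dict, constraints: dict, database: dict, min_score: int) -> list:
    """Return (score, name) pairs, best first, for POPPs meeting the constraints."""
    hits = []
    for name, popp in database.items():
        # Constraint peptides are a necessary condition: each must be
        # present in the candidate with the requested sign.
        if any(popp.get(p) != s for p, s in constraints.items()):
            continue
        s = popp_score(query, popp)
        if s >= min_score:
            hits.append((s, name))
    hits.sort(key=lambda h: (-h[0], h[1]))  # descending score, then name
    return hits
```

Checking the constraints before scoring hints at the optimization the text anticipates: candidates that fail a constraint are rejected without computing a score.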

AN APPLICATION OF THE POPPs
The late embryogenesis abundant (LEA) proteins cover a number of loosely related groups of proteins whose precise function is unknown but for which considerable evidence suggests that they are involved in desiccation resistance (Ingram and Bartels, 1996). LEA proteins have primarily been found in plants, where they are found in seeds, but they have also been found in species as diverse as H. influenzae, E. coli, and C. elegans. Four of the subgroups are also represented by Pfam families. Four LEA genes were recently found using genomic comparison in the extremophile Deinococcus radiodurans (Makarova et al., 2001). On the basis of sequence similarities and predicted structural properties, six subgroups of LEA protein have been identified, unfortunately under two naming schemes (Dure, 1993; Bray, 1993). Using just amino acid composition and the Kyte–Doolittle hydrophobicity metric, a study by Garay-Arroyo et al. (2000) found that LEA proteins are characterized by a preponderance of hydrophilic amino acids (resulting in their characterization as ‘hydrophilin’ proteins) and a high glycine content.

The POPPs suite was used to review these findings and generally to discover more about the LEA proteins. To this end, a set of 24 LEA proteins was retrieved via SRS from SwissProt. The set included: DH11 GOSHI, DHLE RAPSA, EM6 ARATH, EMB5 MAIZE, EMB8 PICGL, L193 HORVU, L194 HORVU, L19A HORVU, L19B HORVU, LE10 HELAN, LE11 HELAN, LE13 GOSHI, LE14 GOSHI, LE19 GOSHI, LE29 GOSHI, LE34 GOSHI, LE5A GOSHI, LE5D GOSHI, LE76 BRANA, LE7 GOSHI, LEA1 HORVU, LEA3 MAIZE, LEA3 WHEAT and YD39 HAEIN. popp create.py was used to create POPP vectors for this set based on a threshold probability of 0.5 and, because the sequences range in length from 92aa to 555aa, scaling was done to a target length of 300aa. The POPP vectors were then clustered using popp cmp.py based on a threshold probability of 0.05.
Because the input set is quite small (and hence the number of all-against-all comparisons), the add-score was based on the 85th percentile with the merge-score set at the 90th percentile. Two superfamilies were created, with three clusters remaining outside any family. In addition, two POPPs, EMB8 PICGL and LE14 GOSHI remained unclustered. A closer examination of literature concerning EMB8 PICGL revealed that evidence supporting its claim to being an LEA protein was insubstantial and the protein, which is characterized by Interpro as: Uncharacterized protein family UPF0017, has since been reannotated. LE14 GOSHI is the sole representative in this set of the group of D95 LEA proteins (Galau et al., 1993), which are covered by Pfam family PF03168. The anchor family for the first superfamily is:

Superfamily 1 Anchor Family
Consensus POPP across family: +A, +K, -L, +KA, +AAK
Set of all peptides in the family: +A, -I, +K, -L, +Q, -R, +T, +AA, +ET, +KA, +KD, +KT, +QA, +QK, +TA, +TK, +TT, +AAK, +AKD, +ETA, +QKA
17: LE76 BRANA, LEA1 HORVU, LEA3 MAIZE, LEA3 WHEAT
    +A, +K, -L, +Q, -R, +T, +AA, +ET, +KA, +KD, +QA, +QK, +TK, +TT, +AAK, +AKD, +QKA
7: LE76 BRANA, LE7 GOSHI, LEA1 HORVU, LEA3 WHEAT
    +A, +K, -L, +T, +AA, +ET, +KA, +KT, +QK, +TA, +AAK, +ETA
14: LE29 GOSHI, LE76 BRANA, LEA3 WHEAT
    +A, -I, +K, -L, +KA, +KD, +KT, +TK, +AAK, +AKD

LE11 HELAN clusters with LEA3 WHEAT within Superfamily 1 in the first family of the band of families directly associated with the anchor family, i.e. Superfamily 1, Band 1, Family 1, or Family 1-1-1. LE34 GOSHI clusters with LEA1 HORVU in 1-1-2, while YD39 HAEIN clusters with LE7 GOSHI in 1-1-3. All bar one of the members of this superfamily fall within Group 3 LEA proteins (the D7 group), which is covered by Pfam family PF02987, the LEA family. The one exception is LE29 GOSHI, which is in fact a Group 5 LEA (D29). Its inclusion in Superfamily 1 (and in Pfam seed family PF02987) is due to it having a very similar structure to the Group 3 proteins, including an 11aa repeat and a similar predicted role: sequestering water ions (Bray, 1993). The anchor family for the second superfamily is:

Superfamily 2 Anchor Family
Consensus POPP across family: +E, +G, +EG, +GG, +RK
Set of all peptides in the family: +E, +G, -I, +K, +Q, +EE, +EG, +ET, +GE, +GG, +GQ, +GT, +KG, +QQ, +RE, +RK, +TR, +ARE, +GET, +GGE, +KGG, +REG, +TRK
2: L193 HORVU, L194 HORVU, LE10 HELAN
    +E, +G, -I, +EE, +EG, +GE, +GG, +KG, +RK, +TR, +KGG, +TRK
8: L193 HORVU, L194 HORVU, LE19 GOSHI
    +E, +G, +EG, +ET, +GE, +GG, +KG, +RK, +TR, +GET, +GGE, +KGG
11: EMB5 MAIZE, L193 HORVU, L19A HORVU, L19B HORVU
    +E, +G, +EG, +GG, +GQ, +RE, +RK, +ARE, +REG
12: EM6 ARATH, LE10 HELAN
    +E, +G, -I, +K, +Q, +EG, +GE, +GG, +GT, +KG, +QQ, +RK

This family is an expansion of Pfam family PF00477, small hydrophilic plant seed proteins. The Pfam seed (i.e. base) family only contains 3 members: EM6 ARATH, LE19 GOSHI and EM1 WHEAT. However, the Pfam full family contains all the members of the Superfamily 2 anchor family. This superfamily encompasses Group 1 LEA proteins (also known as D19).

A number of POPPs are sufficiently similar to have formed clusters but were not able to be brought into families.

19: LE5A GOSHI, LE5D GOSHI
    +A, +R, +S, +AM, +GA, +GY, +RP, +SF, +SS, +YS

Neither LE5A GOSHI nor LE5D GOSHI is covered by any of the Pfam families. However, if the threshold for clustering is reduced to 0.5 and the add-score percentile is reduced to 60% (which still gives a threshold score of 19, albeit on a very low threshold probability), the cluster containing the two proteins ends up in a family linked to other LEA clusters. Unfortunately, at this point all the distinctions between the two superfamilies have disappeared and they have merged to form a single superfamily, so all that can be said is that LE5A GOSHI and LE5D GOSHI appear to be distantly related members of one or other group of LEA proteins.

The proteins DH11 GOSHI and DHLE RAPSA form part of Pfam family PF00257 (dehydrin proteins), which are also Group 2 LEA proteins (or D11).

0: DH11 GOSHI, DHLE RAPSA
    -F, +G, +H, -L, -V, +GN, +HH, +NP, +PT, +SS, +TG, +GNP, +SSS

If a greater number of the dehydrins had been returned by the SRS query, this cluster would have appeared within a larger set of Group 2 (dehydrin) proteins. Failing that, reducing the threshold to 0.5 and the add-score percentile to 75% reveals that the two dehydrins cluster with LEA1 HORVU, LEA3 MAIZE and LEA3 WHEAT, i.e. members of Superfamily 1 (the Group 3 LEA proteins). The final stand-alone cluster:

4: LE11 HELAN, LE13 GOSHI
    -L, -V, +EK, +GT, +MQ, +NA, +TG, +AAS, +GTG

is a member of the Group 4 (D113) LEA proteins, which are not represented by a Pfam family. Finally, if the threshold is reduced to 0.5 and the add-score percentile becomes 75%, the cluster containing LE11 HELAN and LE13 GOSHI (which are not found in any Pfam family) also falls into Superfamily 1. It should be noted that when the threshold is thus reduced and the latter two stand-alone clusters are brought into Superfamily 1 they both end up in Band 2, i.e. some distance from the anchor cluster, further indicating that they are only distantly related to the POPPs making up the anchor cluster.

Finally, turning to the conclusions of Garay-Arroyo et al. (2000), it is clear from the above analysis that while the LEA proteins as a whole are hydrophilic, Gly is only a factor in Group 1 (Superfamily 2) and Group 2 (stand-alone cluster 0) LEA proteins. Gly is not significantly present in the Group 4 proteins (stand-alone cluster 4), except in the specific dipeptides GT and TG, and is not significantly present in any form in Group 3 LEA proteins (represented by Superfamily 1), and arguably, by extension, Group 5.

For the two original superfamilies the corresponding anchor families (listed above) each have an associated consensus POPP and a POPP containing all the peptides found in any of the component clusters (excluding any that are found both over-represented and under-represented). In the final stage of the analysis process the latter POPPs were used as search queries against a database of POPP vectors created from SwissProt, where each POPP vector was scaled to 300aa. The consensus POPPs were used as constraints and a threshold of 0.05 was used.

The query representing the Superfamily 1 anchor family returned a list of 162 POPPs. While some hits require further investigation (e.g. TCOF HUMAN, Treacher Collins syndrome protein, RL29 SPICI, Ribosomal protein and P60 LISIV, invasion associated protein), two themes run through the set of hits, apart from POPPs present in the clusters making up the family. The first is DNA binding, as evidenced by RRP1 DROME, apurinic endonuclease/exonuclease (DNA repair), but in particular as evidenced by a large number of hits on H1 histone and related proteins. The second theme is shock proteins, as evidenced by ASR ECOLI, acid shock protein, and in particular by a number of hits on the 60 kDa chaperonin (GroEL).
The query representing the Superfamily 2 anchor family returned a list containing 33 POPPs, beginning with members of the source family’s clusters and a number of related proteins that had been omitted from the original set, e.g. EM1_WHEAT, EMP1_ORYSA and SEEP_RAPSA (late seed maturation protein). Curiously, this group also includes AANT_HDVNA, delta antigen from hepatitis delta virus, and VP61_BTV10, bluetongue virus core protein VP6. The latter may be explained by the comment in the associated SwissProt documentation that this protein binds single-stranded and double-stranded RNA. This theme is echoed by the hit on PEP_DROME (protein on ecdysone puffs), a zinc finger protein found in ecdysone-induced and some heat-shock-induced puffs in the fruit fly. PEP is a sequence-specific DNA/RNA binding protein with a strong affinity for the HSP70 gene (Hamann and Strätling, 1998). Heat shock also appears in other hits, such as DJA1_HUMAN, the human homologue of bacterial heat shock protein DnaJ (which acts as a co-chaperone of human HSP70), and YRY1_CAEEL, a hypothetical protein that matches the DnaJ domains in both Pfam and Prosite.

Overall, the association of the LEA proteins with chaperonins is unsurprising given their related role in desiccation resistance. However, the strong association with DNA binding is most curious and merits further investigation.
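
The query POPPs used in the searches above are described as containing all the peptides found in any of the component clusters, excluding any that are found both over-represented and under-represented. A minimal Python sketch of that merging step is given below. It is an illustration only, not the POPPs implementation: the representation of a POPP as a mapping from peptide to direction, the function name `union_popp` and the toy clusters are all assumptions made for the example.

```python
from typing import Dict, Iterable

# Assumed encoding: +1 = over-represented, -1 = under-represented.
OVER, UNDER = +1, -1


def union_popp(cluster_popps: Iterable[Dict[str, int]]) -> Dict[str, int]:
    """Merge cluster POPPs, dropping peptides seen in both directions."""
    merged: Dict[str, int] = {}
    conflicted = set()  # peptides found both over- and under-represented
    for popp in cluster_popps:
        for peptide, direction in popp.items():
            if peptide in conflicted:
                continue  # already excluded by an earlier conflict
            previous = merged.get(peptide)
            if previous is None:
                merged[peptide] = direction
            elif previous != direction:
                # Conflicting evidence: exclude the peptide entirely.
                del merged[peptide]
                conflicted.add(peptide)
    return merged


# Toy data (hypothetical, not taken from the LEA analysis).
clusters = [
    {"K": OVER, "EK": OVER, "W": UNDER},
    {"K": OVER, "GT": OVER, "EK": UNDER},
    {"C": UNDER, "W": UNDER},
]
query = union_popp(clusters)
print(sorted(query.items()))
```

In this toy case EK is dropped because one cluster reports it as over-represented and another as under-represented, while the remaining peptides keep the direction on which all their clusters agree.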

FINAL DISCUSSION

While the techniques described in this paper require refinement, e.g. in the handling of sequences of greatly differing lengths, it is clear that the probability-based composition profiling technique employed in the POPPs suite shows considerable promise. Peptide-counting techniques are effective because protein sequences are far from random, particularly in portions of the proteins, such as catalytic domains, that are under strong conservation pressure. Basing the technique on peptide probabilities removes biases due to the inherent likelihood of certain peptides over others, while thresholding and scaling reduce the impact of changes in sequence length, e.g. due to the addition of extra repeats of a given domain. Finally, when related proteins or protein domains are available, clustering and meta-clustering provide enhanced sensitivity through the consensus POPP vectors that are computed for the clusters as a whole. Overall, the utility of the approach embodied in The POPPs is that it is able to provide information that is, to some extent, orthogonal to that provided by sequence-based methods.

ACKNOWLEDGEMENTS

I would like to thank Dr Alan Tunnacliffe, Institute of Biotechnology, Cambridge University, for making me aware of the LEA proteins, and for making extremely useful comments on the results of the investigations that I have undertaken on them. I would also like to acknowledge the generous support for my Fellowship provided by Bristol-Myers Squibb.

REFERENCES

Altschul,S.F., Gish,W., Miller,W., Myers,E.W. and Lipman,D.J. (1990) Basic local alignment search tool. J. Mol. Biol., 215, 403–410.
Attwood,T.K. and Parry-Smith,D.J. (1999) Introduction to Bioinformatics. Longman.
Barton,G.J. (1993) An efficient algorithm to locate all locally optimal alignments between two sequences allowing for gaps. CABIOS, 9, 729–734.
Bateman,A., Birney,E., Cerruti,L., Durbin,R., Etwiller,L., Eddy,S., Griffiths-Jones,S., Howe,K.L., Marshall,M. and Sonnhammer,E.L.L. (2002) The Pfam protein families database. Nucleic Acids Res., 30, 276–280.
Bray,E.A. (1993) Molecular responses to water deficit. Plant Physiol., 103, 1035–1040.

Campion,S.R., Ameen,A.S., Lai,L., King,J.M. and Munzenmaier,T.N. (2001) Dipeptide frequency/bias analysis identifies conserved sites of nonrandomness shared by cysteine-rich motifs. Proteins: Struct. Funct. Genet., 44, 321–328.
desJardins,M., Karp,P.D., Krummenacker,M., Lee,T.J. and Ouzounis,C.A. (1997) Prediction of enzyme classification from protein sequence without the use of sequence similarity. Fifth International Conference on Intelligent Systems for Molecular Biology, June 21–25, Halkidiki, Greece, pp. 92–99.
Durbin,R., Eddy,S.R., Krogh,A. and Mitchison,G. (1998) Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids. Cambridge University Press.
Dure,L.III (1993) Structural motifs in LEA proteins. In Close,T.J. and Bray,E.A. (eds), Plant Responses to Cellular Dehydration During Environmental Stress, Current Topics in Plant Physiology Volume 10, American Society of Plant Physiologists, pp. 91–103.
Galau,G.A., Wang,H.Y.-C. and Hughes,D.W. (1993) Cotton lea4 and lea14 encode atypical late embryogenesis-abundant proteins. Plant Physiol., 101, 696–696.
Garay-Arroyo,A., Colmenero-Flores,J.M., Garciarrubio,A. and Covarrubias,A.A. (2000) Highly hydrophilic proteins in prokaryotes and eukaryotes are common during conditions of water deficit. J. Biol. Chem., 275, 5668–5674.
Hamann,S. and Strätling,W.H. (1998) Specific binding of Drosophila nuclear protein PEP (protein on ecdysone puffs) to hsp70 DNA and RNA. Nucleic Acids Res., 26, 4108–4115.
Herwig,R., Poustka,A.J., Müller,C., Bull,C., Lehrach,H. and O’Brien,J. (1999) Large-scale clustering of cDNA-fingerprinting data. Genome Res., 9, 1093–1105.
Hobohm,U., Houthaeve,T. and Sander,C. (1994) Amino acid analysis and protein database compositional search as a rapid and inexpensive method to identify proteins. Analytical Biochem., 222, 202–209.
Holm,L. and Sander,C. (1998) Removing near-neighbour redundancy from large protein sequence collections. Bioinformatics, 14, 423–429.
Ingram,J. and Bartels,D. (1996) The molecular basis of dehydration tolerance in plants. Annual Review of Plant Physiology and Plant Molecular Biology, 47, 377–403.
Makarova,K.S., Aravind,L., Wolf,Y.I., Tatusov,R.L., Minton,K.W., Koonin,E.V. and Daly,M.J. (2001) Genome of the extremely radiation-resistant bacterium Deinococcus radiodurans viewed from the perspective of comparative genomics. Microbiology and Molecular Biology Reviews, 65, 44–79.
Nakashima,H., Nishikawa,K. and Ooi,T. (1986) The folding type of a protein is relevant to the amino acid composition. J. Biochem., 99, 153–162.
Thompson,J.D., Higgins,D.G. and Gibson,T.J. (1994) CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Res., 22, 4673–4680.
Woo,H.-H., Brigham,L.A. and Hawes,M.C. (1995) Molecular cloning and expression of mRNAs encoding H1 histone and an H1 histone-like sequences in root tips of pea (Pisum sativum L.). Plant Mol. Biol., 28, 1143–1148.
