Automated Improvement of Domain ANnotations using context ...

BIOINFORMATICS APPLICATIONS NOTE

Vol. 23 no. 14 2007, pages 1834–1836 doi:10.1093/bioinformatics/btm240

Sequence analysis

Automated Improvement of Domain ANnotations using context analysis of domain arrangements (AIDAN) Francois Beaussart, January Weiner 3rd and Erich Bornberg-Bauer* Evolutionary Bioinformatics, Institute for Evolution and Biodiversity, Westfa¨lische Wilhelms University, Schlossplatz 4, D48149 Mu¨nster, Germany Received on December 8, 2006; revised on April 18, 2007; accepted on April 27, 2007 Advance Access publication May 5, 2007 Associate Editor: Dmitrij Frishman

ABSTRACT

1

INTRODUCTION

Proteins are composed of domains and these are units of evolution as their integrity at the sequence level remains conserved by selection for functionality. Databases of domain signatures take advantage of this fact and allow a sensitive and selective sequence analysis based on a precompiled set of either HMMs/profiles (Bateman et al., 2002; Bru et al., 2005; Letunic et al., 2004) or regular expression (Attwood et al., 2003; Sigrist et al., 2002) (see also Mulder et al., 2005). However, protein evolution is characterised by a high degree of promiscuity as domains appear in many different combinations (Apic et al., 2003; Bornberg-Bauer et al., 2006; Vogel et al., 2005). Consequently, simple sequence comparisons can fail. For example when two proteins are circularly permutated their domain arrangements could be ABCD and CDAB (where every letter stands for a domain ID) and thus both proteins are

represented as a string of four such characters (Weiner et al., 2005). In a series of recent papers it was suggested that protein domain arrangements arise not so much via recombination and exon-shuffling but primarily via gene fusion and loss of domains, predominantly at the ends (Bjorklund et al., 2005; Pasek et al., 2005; Weiner et al., 2006). The process of domain loss is characterized by accumulation of mutations which become more frequent (i.e. are less selected against) once a mutation renders a domain non-functional without decreasing the fitness of the organism. Consequently, and in particular for automatically annotated domain sets such as ProDom, several IDs may exist for domains which have just one ID in another, manually curated database (e.g. PFAM). The reason may be that they are structurally almost identical and have the same function but have drifted apart into smaller sub-clusters in sequence space (Finn et al., 2006). This means that many database annotations may be improved since they miss some relationships between domains or unannotated sequences. While some tools exist to look for domain combinations (Geer et al., 2002; Krishnadev et al., 2005), none of these makes use of the fact that domain combinations tend to be evolutionary maintained and thus similarities which are overlooked can be tracked using context analysis (Coin et al., 2003; Huynen et al., 2000; Pellegrini et al., 1999). For example, if domains B and X almost always occur in the context ABC and AXC, respectively, and have a high sequence similarity, it is reasonable to assume that they are in fact functionally identical. Indeed, the distribution of E-values obtained from an allagainst-all SSEARCH (Pearson et al., 1988) of consensus sequences of ProDom families shows (i) a significant proportion of similarities below 10 4 and (ii) a strong overrepresentation of similarities below this threshold for all pairs of domains which match the same PFAM domain (Fig. 1). We exploit this redundancy and, in combination with context analysis, identify new annotations to improve the coverage of an automatically obtained database as explained in the following.

2 *To whom correspondence should be addressed.

1834

METHODS AND RESULTS

We first generated protein clusters as follows: all proteins annotated in the ProDom file version 2005.1 are used and proteins with domain

ß The Author 2007. Published by Oxford University Press. All rights reserved. For Permissions, please email: [email protected]

Downloaded from http://bioinformatics.oxfordjournals.org/ by guest on June 8, 2013

Motivation: Since protein domains are the units of evolution, databases of domain signatures such as ProDom or Pfam enable both a sensitive and selective sequence analysis. However, manually curated databases have a low coverage and automatically generated ones often miss relationships which have not yet been discovered between domains or cannot display similarities between domains which have drifted apart. Methods: We present a tool which makes use of the fact that overall domain arrangements are often conserved. AIDAN (Automated Improvement of Domain ANnotations) identifies potential annotation artifacts and domains which have drifted apart. The underlying database supplements ProDom and is interfaced by a graphical tool allowing the localization of single domain deletions or annotations which have been falsely made by the automated procedure. Availability: http://www.uni-muenster.de/Evolution/ebb/Services/ AIDAN Contact: [email protected] Supplementary information: Supplementary data are available at Bioinformatics online.

0.8 Frequency 0.4 0.6 0.2 0.0 1E 20

1E 15

1E 10 E value

1E 5

0

Fig. 1. Three distributions of E-values between sequence pairs representing three different sets of ProDom domains. First set contains all pairs of overlapping ProDom domains, where both domains match the same PfamA domain and therefore are presumably homologs (yellow histogram, left). Around 64.8 % of the pairs of ProDom domains that match the same PfamA domains have an E-value below 10 4. 99.95% of the pairs of ProDom domains that match a different PfamA domain show an E-value above 10 4. The second set contains 20 000 randomly chosen pairs of ProDom domains (red histogram, middle) and contains both, pairs of ProDom domains that match the same PfamA ID and pairs that do not match the same PfamA ID. The third set (dark blue histogram, right) contains 20 000 randomly chosen pairs of ProDom domains where each domain in a pair matches exactly one PfamA domain, different for each of the two domains in the pair. Thus, the third set contains only pairs of ProDom domains that are most likely not homologs. Colour version of this figure is available as Supplementary material online.

All the events where a protein was found to be involved are classified according to the type of event and the concerned domain. An example is given in Figure. 2. For each case of ‘erosion’ or ‘camouflage’ a similarity score between the two sequences is provided by the web-interface and is linked to the dotplot of the corresponding event. The domain-dot plot shows clearly how the different annotation (camouflage) and the missing annotation of the first domain in the alignment (erosion) have indeed a substantial sequence similarity and are therefore likely to be overlooked homologs. It is possible to query AIDAN with Swiss-Prot IDs, Uniprot ACs and a protein sequence for which the closest homologue in AIDAN is found using a BLAST search.

3

DISCUSSION

In conclusion, we have devised a tool which suggests corrected domain IDs using context analysis, when the sequence identity is too weak to correctly identify homologies in an automated procedure such as ProDom. The tool presented here is primarily meant to make suggestions about domains which are potentially identical or have not been annotated automatically. It will not allow a full functional analysis of sequences by itself but will be a valuable add-on to existing tools for sequence analysis. It may also be useful in speeding up the process of transferring information from ProDom to PFAM. Conflict of Interest: none declared.

1835


arrangements containing more than six domains are selected as query domain arrangements for the cluster creation. For each of those queries all the proteins with domain arrangements containing at least 80% of the domain IDs in common with the query are grouped in the same cluster. Multiple occurrences of one domain are considered as separate domains. A multiple alignment of the domain arrangements is then performed for every newly created protein cluster (Feng, 1987) by (i) aligning two domain arrangements, (ii) creating a consensus of these two, (ii) aligning the consensus to a third domain string, etc. The first alignment is done using an adaptation of the Needleman–Wunsch–like global alignment algorithm we used for identification of circular permutations (Weiner et al., 2006). Matches between characters of domain IDs are given a positive score of 5, a mismatch is penalized with 5, a gap with 3 and gap extension with 1. Only clusters with at least 10 proteins were kept providing this way a data set of 3499 clusters containing 141 932 different proteins and 79 529 different ProDom domains. The cluster creation thresholds (at least six domains for the domain arrangement queries and at least 10 proteins per cluster) were chosen in order to provide a first data set to investigate annotation irregularities. A second, larger data set has been created using at least three domains for the domain arrangement queries and at least five proteins per cluster. Almost all (97%) of the domain events included in the first, smaller data set are also included in the second, larger set. The characteristics of those data sets as well as the results of the investigations performed on them are given in Supplementary Material (Table 1). Clusters are automatically inspected to detect irregularities as follows: a camouflage is found when two differently annotated domains occur at the same position (column) in two proteins of a cluster and if their sequences present a high similarity corresponding to an E-value of 10 4 and less. Almost all the domain pairs defined as being camouflage domains have both domains belonging to the same Pfam clan (Supplementary Material Table 2). An erosion is found if a domain and an unannotated region are found at the same position in two proteins of a cluster and if their sequences are similar (according to the E-value threshold 10 4). We chose this rather conservative value because the idea is to suggest where domains may be identical or may have been overlooked without introducing new errors. We estimate that around 64.76% of all domains having a high similarity with other domains are detected (Fig. 1, yellow histogram). On the other hand, 99.95% of domains that correspond to different PfamA domains show an e-value above this threshold (Fig. 1, dark blue histogram). Therefore, we estimate that with this threshold, in 50.05% of all cases new errors are introduced. Thus, based on amino acid similarity of protein fragments, wedistinguish two groups of events. If the amino acid sequences are similar (E 5 10 4), we annotate these events as annotation artifacts (erosions or camouflage, respectively). If no similarity is detected, we annotate the events as real evolutionary events (physical domain deletions and substitutions). The remaining fragments of proteins that are not annotated after this procedure and are not similar to other domains within the cluster are called shadow domains. The above procedure was followed for both (small and large) cluster sets. All domain investigations are annotated and stored in a database and can be downloaded in flat file format from the AIDAN website. This web site allows browsing clusters (Fig. 2), a domain-centric view by querying domains with a potentially flawed original annotation, a sequence-centric perspective by entering a sequence and looking for a possible corresponding cluster as well as browsing a list (IDs with annotations) of each category of annotation error and real evolutionary event. Inferred ‘correct’ annotations are suggested in case of ‘eroded’ or ‘camouflage’ domains. It is possible to download all data from AIDAN in the same format (XDOM) as ProDom but with additional information suggesting corrected IDs or corresponding similarities.

1.0

Automated Improvement of Domain ANnotations

F.Beaussart et al.

REFERENCES Apic,G. et al. (2003) Multi-domain protein families and domain pairs: comparison with known structures and a random model of domain recombination. J. Struct. Funct. Genomics, 4, 67–78. Attwood,T.K. et al. (2003) PRINTS and its automatic supplement, prePRINTS. Nucleic Acids Res., 31, 400–402. Bateman,A. et al. (2002) The Pfam protein families database. Nucleic Acids Res., 30, 276–280. Bjorklund,A.K. et al. (2005) Domain rearrangements in protein evolution. J. Mol. Biol., 353, 911–923. Bornberg-Bauer,E. et al. (2006) The evolution of domain arrangements in proteins and interaction networks. Cell Mol. Life Sci., 62, 435–445. Bru,C. et al. (2005) The ProDom database of protein domain families: more emphasis on 3D. Nucleic Acids Res., 33, D212–5. Coin,L. et al. (2003) Enhanced protein domain discovery by using language modeling techniques from speech recognition. Proc. Natl Acad. Sci. USA, 100, 4516–4520. Feng,D.F. and Doolittle,R.F. (1987) Progressive sequence alignment as a prerequisite to correct phylogenetic trees. J. Mol. Evol., 25, 351–360. Finn,R.D. et al. (2006) Pfam: clans, web tools and services. Nucleic Acids Res., 34 (Database issue), D247–D251. Geer,L.Y. et al. (2002) CDART: protein homology by domain architecture. Genome Res., 10, 1619–1623.

1836

Huynen,M. et al. (2000) Predicting Protein Function by Genomic Context: Quantitative Evaluation and Qualitative Inferences. Genome Res., 8, 1204–1210. Krishnadev,O. et al. (2005) PRODOC: a resource for the comparison of tethered protein domain architectures with in-built information on remotely related domain families. Nucleic Acids Res., 33, W126–W129. Letunic,I. et al. (2004) SMART 4.0: towards genomic data integration. Nucleic Acids Res., 32, D142–D144. Mulder,N.J. et al. (2005) InterPro, progress and status in 2005. Nucleic Acids Res., 33, D201–D205. Pasek,S. et al. (2005) Identification of genomic features using microsyntenies of domains: domain teams. Genome Res., 15, 867–874. Pearson,W.R. and Lipman,D.J. (1988) Improved tools for biological sequence comparison. Proc. Natl Acad. Sci. USA, 85, 2444–2448. Pellegrini,M. et al. (1999) Assigning protein functions by comparative genome analysis: protein phylogenetic profiles. Proc. Natl Acad. Sci. USA, 96, 4285–4288. Sigrist,C.J. et al. (2002) PROSITE: a documented database using patterns and profiles as motif descriptors. Brief. Bioinform., 3, 265–274. Vogel,C. et al. (2005) The relationship between domain duplication and recombination. J. Mol. Biol., 346, 355–365. Weiner,J. 3rd. et al. (2005) Rapid motif-based prediction of circular permutations in multi-domain proteins. Bioinformatics, 21, 932–937. Weiner,J. 3rd. et al. (2006) Domain deletions and substitutions in the modular protein evolution. FEBS J., 273, 2037–47.


Fig. 2. Screenshots of the AIDAN database : cluster made from the human potassium voltage-gated channel subfamily A member 1 protein (SwissProt access number Q09470). Left : List of proteins belonging to a cluster. Top right : alignment of the protein domain arrangements. Colored boxes represent domains (all domains are depicted at equal length), colored bars represent erosion events and black bars shadow domains (unannotated sequences), empty spaces represent physical deletions. Bottom (‘Legend’) : legend of the domain arrangement alignment. Bottom right: dotplots of the sequences involved in domain events. The left dot plot refers to the sequence of the domain J and the sequence of the upper colored bar in the first column of the alignment (erosion). The right dotplot refers to the sequences of the two domains J and K in the first column of the alignment (camouflage). As sequence similarity is displayed by a straight line, the ‘blobs’ on the dotplots indicate repetitions of very short identical sequences. Here, these sequence fragments are streches of short glycine repeats. Colour version of this figure is available as Supplementary material online.

Automated Improvement of Domain ANnotations using context ...

Automated Improvement of Domain ANnotations using context ...

Suggest Documents

Automated Model Selection using Context

Improvement of Automated Right Ventricular Segmentation Using ...

Automated Timetabling Using Stochastic Free-Context ... - CiteSeerX

Improvement of an Automated Neonatal Seizure Detector Using a Post ...

Using Semantic Annotations for Supporting

Semi-automated Placement of Annotations in Videos - CiteSeerX

Automated Evaluation of Crowdsourced Annotations in the Cultural ...

Flexible querying of fuzzy RDF annotations using

Automated Evaluation of Crowdsourced Annotations in the Cultural

Annotations

Domain Specific Semantic Validation of Schema. org Annotations

Dissertation ASSESSMENT AND IMPROVEMENT OF AUTOMATED

Improvement of Automated Detection Method for Clustered

Using GDL to Represent Domain Knowledge for Automated ...

Fast Re-Authentication for Inter-Domain Handover using Context

Improvement of domain-level ortholog clustering by optimizing domain ...

Using context to improve protein domain identification | SpringerLink

Improvement of domain-level ortholog clustering by optimizing domain ...

Using Annotations to Facilitate Power vs Quality

Optimizing Web Search Using Social Annotations - CiteSeerX

In-Context Annotations for Refinding and Sharing - Eelco Herder

Towards Context-Sensitive Domain of Islamic ...

Adding Ownership Domain Annotations to and ... - Computer Science

Genome sequencing and protein domain annotations ... - BMC Genetics