Discovery of Novel Protein Domains: The Parasitic Connection
Ian Lee
Department of Haematology-Oncology, National University Hospital, Singapore
Introduction
Domains are key indicators of protein function. Statistical models built by iterated profile searches (PSI-BLAST) have been instrumental in detecting them, but they face certain limitations. Convergence is hampered either by the inclusion of closely related sequences, which limits the scope of the model, or by a signal too weak to detect distant relationships. Recent large-scale sequencing of parasite genomes provides a means of alleviating this problem. Using sequences from the malaria and trypanosome genomes as “bridging points” in model construction, we have previously identified the RAP and SMP domains (Lee and Hong 2004, 2006). Here we illustrate the approach further by identifying a novel domain with the help of another sequenced parasite genome, that of Entamoeba histolytica. The domain is amplified in this species and in other microbes, and is of potential relevance to human pathology.
Methods
Applications of our procedure have been published previously (Lee and Hong 2004, 2006). The procedure is summarized in Figure 1. In this example, an Entamoeba histolytica protein is used as the seed. The procedure is flexible and can accommodate diverse parasite genomes, various definitions of “known” domains (such as the PANTHER database) and various methods for model building (such as PROBE).
Fig 1. Schematic representation of our procedure. Flowchart boxes: Novel/hypothetical proteins; Compute distance between seed and retrieved proteins; Identify known SMART/PFAM domains; Exclude closely-related (orthologous) sequences; Segment/mask known domains from sequences; Build model with masked sequences as seeds; Realign sequences to arrive at final model. Decision point: Enough iterations? (no/yes).
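The masking and exclusion steps named in the figure can be illustrated with a minimal sketch. It is not the published implementation. The function names, the half-open domain coordinates, the use of percent identity over pre-aligned sequences as the distance measure, and the 90% cutoff are all assumptions made for illustration only.

```python
from typing import Dict, List, Tuple


def mask_domains(seq: str, domains: List[Tuple[int, int]], mask_char: str = "X") -> str:
    """Replace residues inside each known-domain interval with mask_char.

    Intervals are 0-based and half-open, [start, end). This coordinate
    convention is an assumption of the sketch.
    """
    residues = list(seq)
    for start, end in domains:
        for i in range(max(start, 0), min(end, len(residues))):
            residues[i] = mask_char
    return "".join(residues)


def percent_identity(a: str, b: str) -> float:
    """Percent identity between two pre-aligned sequences of equal length.

    Columns where both sequences carry a gap ('-') are ignored.
    """
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    compared = 0
    matches = 0
    for x, y in zip(a, b):
        if x == "-" and y == "-":
            continue
        compared += 1
        if x == y:
            matches += 1
    return 100.0 * matches / compared if compared else 0.0


def exclude_close(seed: str, hits: Dict[str, str], max_identity: float = 90.0) -> Dict[str, str]:
    """Keep only hits whose identity to the seed is at or below max_identity.

    The 90% default is a hypothetical cutoff, not a value from the source.
    """
    return {
        name: seq
        for name, seq in hits.items()
        if percent_identity(seed, seq) <= max_identity
    }


if __name__ == "__main__":
    # Toy sequences, not real proteins.
    print(mask_domains("ABCDEFGHIJ", [(2, 5)]))
    print(exclude_close("ACDE", {"near": "ACDE", "far": "AGHE"}))
```

In this sketch, near-identical hits are dropped before model building so that they do not narrow the model, and known domains are masked so that the search signal comes from the uncharacterized regions.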
Results and Discussion
A novel domain is detected in multiple proteins of the human gut pathogen Entamoeba histolytica and of the bacterium Bdellovibrio bacteriovorus. The domain is also associated with a known DNA-binding protein, Ndt80, found in both higher and lower eukaryotes. Using parasite sequences as seeds for initial model building makes the search more sensitive. Importantly, the resulting models detect domains across a broader spectrum of organisms, including both higher and lower eukaryotes. The domain identified here is likely to be part of the mechanism of infection employed by diverse microbial pathogens.
References