
Discovery of Novel Protein Domains: The Parasitic Connection

Ian Lee
Department of Haematology-Oncology, National University Hospital, Singapore

Introduction

Domains are key indicators of protein function. Statistical models based on iterated profile searches (PSI-BLAST) have been instrumental in detecting them, but they face certain limitations. Convergence is hampered either by the inclusion of closely related sequences, which limits the scope of the model, or by signal strength that is insufficient to detect distant relationships. Recent large-scale sequencing of parasite genomes provides a means of alleviating this problem. Using sequences from the malaria and trypanosome genomes as “bridging points” in model construction, we have previously identified the RAP and SMP domains (Lee and Hong 2004, 2006). We further illustrate the approach by identifying a novel domain with the help of another sequenced parasite genome, that of Entamoeba histolytica. The domain is amplified in this species and in other microbes and is of potential relevance to human pathology.

Methods

Applications of our procedure have previously been published in Lee and Hong (2004, 2006). The steps are summarised in Figure 1. An Entamoeba histolytica protein is used as the seed in this example. The procedure is flexible and can accommodate diverse parasite genomes, various definitions of “known” domains (such as the PANTHER database) and various methods for model building (such as PROBE).

[Figure 1: flowchart. Boxes: Novel/hypothetical proteins; Compute distance between seed and retrieved proteins; Identify known SMART/PFAM domains; Exclude closely-related (orthologous) sequences; Segment/mask known domains from sequences; Build model with masked sequences as seeds; Realign sequences to arrive at final model. Decision: Enough iterations? (no / yes).]

Fig 1. Schematic representation of our procedure.
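Two of the steps named in the schematic, masking known domains and excluding closely related sequences, can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the procedure used in the study: the function names, the use of percent identity as the distance measure, and the 90% cutoff are all hypothetical, and the profile search itself is not shown.

```python
from typing import List, Tuple


def mask_domains(seq: str, domains: List[Tuple[int, int]], mask_char: str = "X") -> str:
    """Replace residues inside known-domain intervals with mask_char.

    Intervals are 1-based and inclusive, as in typical domain annotations.
    """
    residues = list(seq)
    for start, end in domains:
        # Clamp the interval to the sequence so bad coordinates cannot raise.
        for i in range(max(start, 1) - 1, min(end, len(seq))):
            residues[i] = mask_char
    return "".join(residues)


def percent_identity(a: str, b: str) -> float:
    """Percent identity between two pre-aligned, equal-length sequences.

    Columns where both sequences have a gap are ignored.
    """
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    columns = [(x, y) for x, y in zip(a, b) if not (x == "-" and y == "-")]
    if not columns:
        return 0.0
    matches = sum(1 for x, y in columns if x == y and x != "-")
    return 100.0 * matches / len(columns)


def exclude_close(seed: str, hits: List[str], max_identity: float = 90.0) -> List[str]:
    """Keep only hits whose identity to the seed is at most max_identity.

    The cutoff is an assumed value chosen for illustration.
    """
    return [h for h in hits if percent_identity(seed, h) <= max_identity]
```

For example, `mask_domains("ABCDEFGHIJ", [(3, 5)])` masks positions 3 to 5, and `exclude_close` drops any hit that is nearly identical to the seed before the next round of model building.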

Results and Discussion

A novel domain is detected in multiple proteins of the human gut pathogen Entamoeba histolytica and of the bacterial pathogen Bdellovibrio bacteriovorus. The domain is also associated with a known DNA-binding protein, Ndt80, found in both higher and lower eukaryotes. Using parasite sequences as seeds for initial model building makes the search more sensitive. Importantly, the resulting models detect domains in both higher and lower eukaryotes, across a larger spectrum of organisms. The domain identified here is likely to be part of the mechanism of infection employed by diverse microbial pathogens.

Protein      Species and GenBank protein id   Residues
KIAA0954     Hs 20521708                      612-719
C11orf9      Rn 62641874                      579-686
C11orf9      Tn 47210713                      569-666
C11orf9L     Tn 47224519                      654-760
MGC84361     Xl 51703569                      556-653
CG3328-PA    Dm 45550508                      601-712
F59B10.1     Ce 50507495                      477-573
XK800        Ce 32566568                      494-590
CBG17153     Cbg 39585454                     411-506
CBG03089     Cbg 39597438                     468-564
Orfveg132    Dd 1513238                       480-587
127.t00001   Eh 67472891                      171-255
115.t00026   Eh 67473539                      179-264
100.t00028   Eh 67474316                      231-317
96.t00015    Eh 67474538                      257-342
82.m00149    Eh 56470408                      181-265
68.m00219    Eh 56470931                      84-169
68.m00226    Eh 56470938                      108-191
68.m00233    Eh 56470945                      243-329
54.m00226    Eh 56471466                      253-349
65.m00154    Eh 56471036                      195-280
59.m00164    Eh 56471246                      211-298
53.m00194    Eh 56471518                      274-359
46.m00215    Eh 56471754                      214-299
13.m00340    Eh 56473590                      181-252
12.m00274    Eh 56473651                      139-225
1.m00650     Eh 56474833                      218-302
Gp37         Ep6 3676460                      874-970
Gp37         BpA 20140797                     986-1082
Gp12         BGA112141289                     614-675
PblB         SSM110732859                     970-1053
LambdaBA1    Bc 47569628                      1712-1776
C3148        Ec 26248990                      221-320
BT1865       Bt 29347275                      96-194
BQ11070      Bq 49474618                      270-337
Kcp          Kp 50846093                      267-359
Mca2909      Mc 53802955                      388-483
Cac1113      Ca 15894398                      379-468
BH0962       Bh 15613525                      1205-1298
Bll5175      Bj 27380286                      357-527
NMA1824      Nm 15794714                      622-746
Bd0704       Bv 42522290                      1223-1324
Bd0705       Bv 42522291                      1123-1211
Bd0884       Bv 42522455                      1009-1101
YapH         Bv 42522678                      1352-1439
Bd1641       Bv 42523143                      561-654
Bd1712       Bv 42523208                      1233-1317
Bd2088       Bv 42523560                      1128-1213
Bd2548       Bv 42523973                      1523-1620
Bd2565       Bv 42523988                      1030-1123
Bd2582       Bv 42524004                      1274-1360
Bd3182       Bv 42524565                      776-868
Bd3266       Bv 42524640                      1430-1523
Bd3267       Bv 42524641                      1363-1451

Consensus/80% 2Struct/JPRED HHHHHHHHHHHH

MGSLMHPSDLRAKEHV--QEVDTTEQLKRISRMRLVHYRYKPE(8)ATAPETGVIAQEVKEILPE---AVKDTGD(10)NFLVVNKERIFMENVGAVKELCK MGSLMHPSDLRAKEHV--QEVDTTEQLKRISRMRLVHYRYKPE(8)ATAPETGVIAQEVKEILPE---AVKDTGD(10)NFLVVNKERIFMENVGAVKELCK MGAIMQPSDQRAKCNI--QEVDSEQQLKRINQMRIVEFDYKPE(8)-HTHQTGVLAQEVKELLPS---AVTQVGD(10)NFLMVDKVGAAPAL--------MGSLVHPSDIRAKENV--QEVNTTDNLKRISQMRLVHYQYKPE(8)-NTAETGVIAQEVQQILPE---AVKEGGD(10)NLLVVNKERIFMENVGAVKELCK MGSLMHPSDIRAKESV--EEVDTTEQLKRISQMRLVHYHYKPE(8)ENAAETGVIAQEVQEILPE---AVKESGD(10)NFLVVNKERIFMENVGAVKELCK -GHIVQPSDSRAKQEI--GELDTSVQLRNLQKIRIVRYRYMPE(14)EIEDTGVIAQEVREVIPD---AVQEAGS(10)KFLLVNKDRILMENIGAVKELCK -GRIINPSDIRLKEAI--TERETAEAIENLLKLRVVDYRYKPE(10)QRHRTGLIAQELQAVLPD---AVRDIGD-----YLTIDEGRVFYETVMATQQLCR -GRVMYPSDIRLKDNI--TEKGAKDALENLQKLRIVDYFYKPE(10)QRKRTGVIAQELAAVLPD---AVKDLGD-----YLTVNESRVFYETVLATQELCR -GRVMCPSDIRLKDNI--TEKEAKEALENLQKLRIVDYFYKPE(10)QRKRTGVIAQELAEVIPD---AVKDLGD-----YLTVNESRVFYETVLATQELCR -GRVINPSDIRLKEGI--SEKETAEAIENLLKLRVVDYRYKDE(10)QRQRTGLIAQELQAVLPD---AVRDIGD-----YLTIDEGRVFYETVMATQQLCR -EGVYHPSDLRIKYDL--KSIDSKSNLDNVNRMKLYDYKYNPQ(11)DNCDRGVIAQDLQRILPK---TVRTIGN(8)ENLLVIKNEALVMETIGATQELSK -NGFLQRSDARVKEHI--EPLKG--CVDKILNLTGKSFKYIGK----DEKKLGFIAQEVQEVCPE---LVHEDEF----G-LSVDVIGIIPILVEALKEIHK -NGFLQRSDQRYKKDI--KKISN--ALEKVLLLTGRSYKYLND----KQRRFGFIAQELKEVIPE---AVKEDED----GTLSIDPLALLPFIIESLKELNT -NGFLVRSDARSKTDI--EEIHN--SLNGILSLVGVSYSYKND---SDNKKYGFVAQDVQKIYPD---LVKEDDT----GKLTVDYLGIIPLLVEALKKIHN -NGFLQRSDRRSKKDI--KKISD--ALNTILMITGKSYKYLND----DKTRFGFIAQELKEVIPE---AVREDED----GSLSIDPLALLPFIVESLKQLSF -NGFLTRSDQRTKTDI--ANLNN--SLDLIMQLRGVTFKYKGT----EQRKYGFIAQELKQVLPD---LVREDTQ----G-LYIDTQGILPILVESLKQLNQ -NGFVVRSDKRKKHNI--QKIKN--ALNKIVNIFCCTFKYNND----ETIRSGFIAQQLQQVVPE---LVHEEID----GTLSIDSLALIPVIIESLKTLKN -NGYFVRSDERTKCHI--RPLSD--CLESISQLVGKQYRYKNS----PQLRLGFVAQEVKEVLPD---LVHTDEIT---GTLSVDVLGVIPFLVESLKQLNS -NGYLVRSDARSKTDI--QTIEN--ALNSVTSLVGKKYAYKNE---PNKIKYGFIAQEVQEIIPD---LVQKDET----NNLSVDYLGLIPYIIEALKSIHD 
-NGYLVRSDARSKTDI--QTIEN--ALNSVTSLVGKKYAYKNE---PNKIKYGFIAQEVQEVIPD---LVQKDES----GNLSVDYLGVIPYIVEALKTIHD -NGFFQRSDSRTKTKI--APIRN--ALERLLNVTGKMYTYDVA----NAETYGFIAQELKEHFPD---LVHEDES----GYLSIDPISLIPFTVEAVKELDK -EGFLVRSDERNKKEI--EKIDK--ALYGLKHLYGREFKYLRD(2)DKARRYGFIAQEVKEIYPE---LVQIDEEG---G-LTVDYLGIIPIMVEALKEIEK -NGFLQRSDIRVKENI--KPLVD--SLNTVLQLTGTSFNYIGK----KEEKLGFIAQEVKKVCPE---LVIEDDK----GELAVDVIGVIPHLVEALKQIYE -NGFLQRSDKRSKKDI--KKISH--ALDTICKITGKSFKYVNE----DRTRFGFIAQELKEVIPE---AVKEDED----GRLSIDPLSLLPFIVESLKELQM -NGFLQRSDSKLKTSI--EPLTN--SLQKLLKLVGVQYNYKED----NTVKYGFIAQEVNKTCPE---LAPNGTS--------LDVVGILPIIIESLKEINL -AGFFQRSDQRNKNEI--QKITG--ALEQLKNVVGYSFVYKND---ENNQKYGFMAQELQKIYPN---TVKVLPD----GTLSIDTVALLPYIVSSLKELYT -NGFLQRSDARVKEHI--EPLKG--CLDKILKLTGKSFNYTGK----EEKNLGFIAQEVQEVCPE---LVHEDEF----G-LSVDVIGIIPILVEALKEINN FSDVYIRSDSRLNINK--QQLEY-GAVEKVCRLKVYIYDKLKS(5)VIKREVGIIAQDLEKELPE---AVSKVEDG--SDVLTISNSAVNALLIKAIQEMSE VRDVYVRSDIRVKKDL--VKFEN--ASEKLSKINGYTYMQKRG(7)KWEPNAGLIAQEVQAILPE---LVEGDPDG--ERLLRLNYNGVIGLNTAAINEHTA -TGAINTSDERHKTDI--APISD-KVLDAWEKVKFYQYKFKDA(6)EARYHFGVIAQQIVKVF---------------------------------------------SDRRYKSNI--KDSQV-SGLDIIEQLKTYSYRKEYD(2)IEDISCGIMAQDVQKYVPE---AFFENPD----GAYSYRTFELVPYLIKAIQELNQ --AFQPTSSRKIKTNF--ADLPF-SALEKVNSVNIKQYNFIKD(11)VETYYGMIAEDVDQVF---------------------------------------TAFNQHSDRDLKDNI--QVIDN--ATDRIRKMNGYTYTLKEN----GMPYAGVIAQEALEAIPE---VVGSAM(14)ERYYTVDYSGVTGLLVQVARESDD VKNVYNYSDARAKINI--NPLGY--GLNVLSKLNAVSYDFKDK(12)DGKEIGLLAQEVEKVLPN---IVLTDPD----GNKLINYTAIIPIMIQSIKELKA MGGILGLSDLRAKENI--VPIGE------KNGYPIYIFNYKGD----PQLYRGVIAQDVLRLKPD---AVYINAKT---KLLHVDY----------------GAWNSSSDARMKTQV--EKIDN--ALEKLDCISGYTYLKQ------GVTEAGVIAQELEEVLPQ---AVSKTE-(10)DARSININGVVALLIEALKEERQ -SAFTVSSNRRLKRVL--GEVRH--ALERVRALQPIRYRLEAD(2)QGRIELGLIAEDAREVLPE---VVYPVTD(7)A-SLSIDYGRLAVLALAAIRELEA ----FNASDERLKTNI--KPITT-NCSSIINQINLKEFDFIKT---GEHIEVGTIAQQLSKINKK(5)YTTEDGT----NFFAPDYNSILPYIIGAIQELSA 
-NTLVEGSDIKIKSDI--KNSFL--GLAFIKDLRPVDFQLVEN----EQYKTGLIAQEVAAELEH(8)TIVTIND----SVMGVSYTQLIAPTIKAIQELDS VADRAQRADINLKHDV--VLLGR---LDN--GLGYYRFAYNGS----DKAYVGVIAQEVQTVRPD---AVTRGSD----GNLRVYYERL-------------TAVNIRSDGRDKADV--KPLTN--GLDFVMKLKPMTGYYDR(35)EDRLRHWFIAQDIAALEDE---YGRLPMVNKTNDTYTVEYETFIPVLTKAIQEMAA VNGTIQTSDQRLKAEV--QDLSQ--GLDFVMALQPKSYKWKSD(4)AAKTHWGFMAQELEAQVKR---STASAAP(10)DYYGVNYSELIAPVVKALQELYL -GTLTQGSDRRLKRDI--ATINS--ALDSILQINGVTYNWIDPSK-GDQREVGLIAQEVEKVFPE---VVKTDAK----GLKSVAYQNLVSPIIQAIKEFYS -TAWTNASDRRLKDIH--GNYEY--GLNEILKLRTIRYNYKEG(6)SDVPMTGFVAQEVQAVIPD---AVKKRED----GYLELNVDPIHWATVNAVQDLHG -GNLTEASDARLKTDI--QILPD--SLNKVLGLNGYSYYWKNPE--NKEKQIGLIAQEVEKVFPE---AVRTDKD----GSKSVAYQKLVAPLINSIKELYQ -TSWANTSDRRFKRNI--ATIDS--SLEKVLQLRGVTYDWRTD(7)ENGQQVGLIAQEVQSVFPD---VVTKDNE----GFLAVQYANLVAPLIEAMKEFHA -------SDIRLKRDI--ASLSSSEALQRILKIQGVSYNWKNPEY-GKRPQLGFIAQELEKIYPE---LVETDPQ----GMKSVNYSHLVSPLVEAIKALYD --GSYQSSDRRFKTDI--EVIPD--ALNKALQIQGVTYHWKPGVNPDPSQQIGVIAQEVETVFPQ---AVKTDAD----GYKSVTYGNLVAPLFNALKEFYE -NAPDVSSDARLKKNV--KDSDL--GLDFVNSLRPVSWTWKDE(2)GATEHYGVIAQEAELAIAK---AKGEPSD(8)SDSYSVRYTELIAPIIKAVQELYR -AAYVNTSDQRLKKNI--TVIEG--ALEKILRLNGVYFDWRSE(7)EQRHDIGVIAQEVEKVFPE---AVRTDDK----GFKAVAYSKLVPPLIEAAKAVNM -ASYLYTSDARLKKDV--VTLPM--ALENLLKLRGVNFVWKNN----GEKTVGFIAQEVEAVYPE---LVRTDKV---SGFKSVQYGNIVAILVEALKQEHA -SAWTVASDARLKDVH--GDYEF--GLSEILKLHTVRYNYKKD(6)SDVPMVGFIAQEVQQVIPD---AVKTRAD----GYLELNVDPIHWATVNAVKELHG -AAWENLSDARLKSDI--EVIPD--SLKKILSLRGVTFNWRHD(7)IEKKDMGVIAQDVERVFPE---AVDKDEK----GFRAVAYTKLIGPMIEAFKELYK -GTVNGASDIRLKKEI--HVLDG--SLDKILQLKPSSYHWKDPNAD-PRLQMGFIAQELEKVYPN---VVEENKK----GIKAVSYINMIAPITSAVQELYH .s.hh..SD.RhKppl....hps..slp.l.plphhpa.b..........phGhIAQ-lp.lbPc...hVpp..p......b.ls...lh...h.thppb.. HHHHHHHH HHHHHHH EEEE EEEEEEHHHHHHHH EEEEE

Fig 2. Multiple alignment of the domain. Numbers in parentheses represent residues not shown. The predicted secondary structure is shown below it. Sequences are denoted by protein, gene or GenBank locus identifiers followed by species abbreviations, GenBank protein ids and residue limits.

References

1. Lee, I. & Hong, W. (2004). RAP - a putative RNA-binding domain. Trends in Biochemical Sciences, 29(11), 567-570.
2. Lee, I. & Hong, W. (2006). Diverse membrane-associated proteins contain an SMP domain.
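
The caption's conventions can be checked mechanically: the residues shown in a row, plus the counts given in parentheses, should add up to the length implied by the residue limits. The sketch below does this for the first row of the alignment (KIAA0954, Hs 20521708, 612-719). The function names are hypothetical and the parsing assumes the row format seen in the figure, with gaps written as `-` and omitted residues as `(n)`.

```python
import re


def residues_in_row(row: str) -> int:
    """Count residues in one alignment row.

    Letters are residues shown; '(n)' stands for n residues not shown;
    '-' characters are alignment gaps and are not counted.
    """
    hidden = sum(int(n) for n in re.findall(r"\((\d+)\)", row))
    shown = sum(1 for ch in re.sub(r"\(\d+\)", "", row) if ch.isalpha())
    return shown + hidden


def span_length(limits: str) -> int:
    """Number of residues covered by inclusive limits such as '612-719'."""
    start, end = (int(x) for x in limits.split("-"))
    return end - start + 1


# First row of the alignment and its residue limits, copied from the figure.
row = (
    "MGSLMHPSDLRAKEHV--QEVDTTEQLKRISRMRLVHYRYKPE(8)ATAPETGVIAQEVKEILPE"
    "---AVKDTGD(10)NFLVVNKERIFMENVGAVKELCK"
)
print(residues_in_row(row), span_length("612-719"))  # both are 108
```

A mismatch between the two numbers for a given row would point to a transcription or extraction error in that row.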