2010 IEEE International Conference on Data Mining Workshops

A Sequence Data Mining Protocol to Identify Best Representative Sequence for Protein Domain Families

V.S. Gowri #, Khader Shameer #, Chilamakuri Chandra Sekhar Reddy, Prashant Shingate & Ramanathan Sowdhamini*

# Equally contributed to this work

National Centre for Biological Sciences (TIFR), GKVK Campus, Bellary Road, Bangalore, India. E-mail: [email protected]

978-0-7695-4257-7/10 $26.00 © 2010 IEEE DOI 10.1109/ICDMW.2010.153

Abstract—Protein domains are the compact, evolutionarily conserved units of proteins that can be utilized for function association of the large number of gene products realised from whole genome sequencing projects. Homology, inferred by sequence similarity, is usually the basis for transferring function annotation from pre-existing domain families to gene products. Sequence analysis protocols are directed by the reference sequence of families used for homology searches to reduce computational time in such large-scale data mining processes. As protein domain families are diverse in nature, it is an important task to identify a single best representative sequence member from a protein domain family using a well-defined, reproducible bioinformatics protocol. We report a new bioinformatics protocol that can be used to identify the best representative sequence (BRS) from protein domain families. The method is based on a “coverage analysis” score implemented using three different sequence search programs, and the trends obtained in reporting the best representative sequence are assessed. The highest average coverage for BRPs was 66% when searched using Hidden Markov Models. Further, it is crucial to select a BRS specific to the sequence search method when searching in large sequence databases.

Keywords: sequence analysis, sequence data mining, best representative sequence, protein domain, protein family, data mining

I. INTRODUCTION

Sequence data and the size of sequence databases are increasing at a constant rate in this post-genome era. Genomes and metagenomes are sequenced at a rapid rate with efficient sequencing technologies, faster algorithms and rapidly reducing sequencing cost [1, 2]. These efforts are generating a huge data-to-information-to-knowledge inference paradigm in biology due to the extensive efforts required for the functional characterization of proteins encoded in the sequenced genomes using biochemical experiments [3-6]. This trend is visible from the comparison of sequence statistics available in UniProtKB/Swiss-Prot (517802 entries as on July 2010) with UniProtKB/TrEMBL (11109684 entries as on July 2010). TrEMBL has far more entries than Swiss-Prot, but the Swiss-Prot annotations are based on curated data, whereas TrEMBL annotations are based on in-silico annotations [7, 8]. With the advent of a plethora of sequence data, efficient computational approaches and analysis pipelines are being developed to deal with genome sequencing and further downstream analysis. After the sequencing efforts, a primary approach is the annotation of the proteins involving a homology search using the BLAST suite of programs to identify remote homologs [9-12]. Such approaches can be enhanced further with the application of sequence search techniques to connect the new sequence with known sequences, and with further in-depth analysis of specific protein domain families reported in integrated protein domain databases like SMART [13, 14], Pfam [15-18] or InterPro [19].

Protein sequence analysis based on the evolutionarily conserved protein domains offers a distinct advantage in understanding the possible function of proteins [20-22]. Function association can be performed using fast and effective sequence searches. A protein domain based approach can be employed to identify new putative members of a protein domain family among hypothetical proteins and to enhance the annotation of genes with unknown function [23-25]. A typical work-flow of a protein domain family level analysis begins with homology searches, followed by further analysis using the alignment. The sequence used to guide the alignment is considered the ‘reference sequence’. Selection of this reference sequence is done in an informal manner in such large-scale analyses, due to the unavailability of a method that deals with the selection of the best representative member. Previously, we defined an objective method to identify the ‘Best Representative Position Specific Scoring Matrix (BRP)’. The method is used to generate a best representative profile that encapsulates all the important information of a diverse or highly similar protein domain family in one single profile from the reference sequences derived from the seed sequence dataset. In earlier work, the BRP was selected from a protein domain family using one sequence search method and applied to the whole Pfam database, version 22 [26]. A Pfam domain family is divided into two sections, the ‘seed sequence alignment’ and the ‘full sequence alignment’. The method is based on the effective



utilization of ‘seed sequence’ members derived from the seed sequence alignment: for every seed member of a Pfam family, an alignment is created by a PSI-BLAST [27] search against the seed dataset using a relaxed E-value of 10, and a profile is generated from that alignment using PSI-BLAST (BLAST version 2.2.16). Following the generation of PSSMs, the profiles can be fed into a neural network based function association program, “FASSM” [28]. FASSM is a function association tool that examines the conservation of residues and protein family specific signatures or motifs for the function association of a protein sequence. Residues that characterize motifs at different alignment positions were identified using the PSIMOT routine within FASSM. FASSM was shown to be effective in detecting difficult relationships such as discontinuous domains during whole-genome surveys and was demonstrated to perform accurate family associations at sequence identities as low as 15%. To find the BRP, the ‘independent sequence’ dataset of a Pfam family is queried with FASSM against the profiles derived from the seed sequences. The independent sequence dataset refers to the sequence members of a family excluding the seed sequences. The BRP is identified using an effective and generic coverage analysis method, which can be used with any other sequence annotation method to identify the BRP. PSSMs of 8,524 protein families from Pfam 22 were reported in a database of PSSMs, “3PFDB” [26]. In this report, we propose that the reference sequence from which a BRP is derived can be considered the best representative member or ‘Best Representative Sequence (BRS)’ of the protein domain family. We define the BRS as the reference sequence used to generate the profile, or the seed sequence, that is reported as a hit in sequence searches with the maximum coverage score. The BRS can be derived using the coverage analysis formulae explained in the coverage analysis section of materials and methods.
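The BRS definition above (the seed sequence whose profile gives the maximum coverage score) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `select_brs`, the seed labels and the scores are hypothetical, and returning tied seeds together is our choice, since the text does not say how ties are resolved.

```python
from typing import Dict, List, Tuple


def select_brs(coverage_by_seed: Dict[str, float]) -> Tuple[List[str], float]:
    """Return the seed sequence(s) with the highest coverage score, and that score.

    Tied seeds are all returned (an assumption; tie handling is not
    specified in the text).
    """
    if not coverage_by_seed:
        raise ValueError("no seed sequences supplied")
    best = max(coverage_by_seed.values())
    winners = sorted(s for s, c in coverage_by_seed.items() if c == best)
    return winners, best


# Hypothetical coverage scores (percent) for three seed sequences of one family.
scores = {"seed_A": 42.0, "seed_B": 71.5, "seed_C": 71.5}
brs, score = select_brs(scores)
print(brs, score)
```

The same selection step applies unchanged to coverage scores obtained from any of the search programs, which is what makes the approach independent of the underlying algorithm.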
To assess the agreement of the BRS across methods, we used three different and powerful sequence search based function association programs (FASSM [28], RPS-BLAST [29, 30] and HMMER2 [31]) and compared the results. FASSM was shown to be an effective tool for function association of remote homologs. RPS-BLAST is extensively used to establish homology relationships between distant relatives. The Pfam database is populated using the HMMER suite of programs, with HMMs based on the seed sequences. These three programs use different algorithmic approaches and scoring schemes for function association, but each fundamentally requires a reference sequence, alignments derived from the reference sequence and a target database (a database of PSSMs for FASSM and RPS-BLAST, a database of HMMs for hmmpfam searches). For each of the three programs, we used every member of the seed sequence set as a reference sequence, derived the alignment and used the independent dataset to query the respective database of profiles or HMMs. The searches were performed independently: three distinct searches were performed for every sequence family member used in the analysis, and a coverage analysis score was obtained for each individual search.
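Comparing the BRS chosen by each program for one family amounts to grouping the methods by the seed they selected. The sketch below is a hypothetical helper (the name `brs_agreement`, its return strings and the seed labels are ours, not part of the published protocol) showing one way such an agreement check might be expressed.

```python
from typing import Dict, List


def brs_agreement(brs_by_method: Dict[str, str]) -> str:
    """Describe which search methods picked the same BRS for one family."""
    groups: Dict[str, List[str]] = {}
    for method, seed in brs_by_method.items():
        groups.setdefault(seed, []).append(method)
    # Keep only seeds that were chosen by more than one method.
    shared = [sorted(methods) for methods in groups.values() if len(methods) > 1]
    if not shared:
        return "no agreement"
    largest = max(shared, key=len)
    if len(largest) == len(brs_by_method):
        return "all methods agree"
    return "agree: " + " + ".join(largest)


# Hypothetical BRS choices for a single family.
example = {"FASSM": "seed_2", "RPS-BLAST": "seed_2", "HMMER": "seed_7"}
print(brs_agreement(example))
```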

II. MATERIALS & METHODS

The objectives of the current study are to search for a single best representative member by employing multiple search programs and to assess the overall trend in identifying the BRP using the coverage analysis method. The coverage analysis can be performed using any sequence search program irrespective of the algorithm and scoring scheme. To illustrate this hypothesis, we selected three tools (FASSM, RPS-BLAST 2.2.16 and HMMER 2.3.2), performed sequence searches using sequences from the independent dataset as queries and derived coverage analysis scores. 100 multi-member Pfam families from the Pfam database, version 22, were chosen; the families were selected based on the diverse range of coverage analysis scores reported in 3PFDB. The dataset of 100 families includes families with high, medium and low coverage values, together with a few Pfam families not included in the current version of 3PFDB. Seed sequences were used to generate multiple family-specific profiles/PSSMs or HMMs depending on the program (FASSM and PSI-BLAST use profiles, whereas HMMER uses HMMs derived from the alignment for sequence search). These profiles were then searched using independent sequence data from the full dataset of the family to identify the best representative sequence (BRS) for each of these 100 families. Generation of family-specific profiles and identification of the BRS by three different profile based methods (FASSM, RPS-BLAST and HMMER) are discussed below. A flow-chart of the coverage analysis approach is given in Figure 1.

III. PSSM/HMM GENERATION

Protein domain family-specific PSSMs/profiles were generated using every single seed sequence as a reference sequence. PSI-BLAST (BLAST version 2.2.16) was used for profile database generation. An E-value threshold of 10 was used for PSSM/profile generation. These profiles were subsequently searched using the independent sequences of that family. Similarly, multiple HMMs were generated using the HMMER 2.0 package with an E-value threshold of 0.001. Multiple sequence alignments of the seed sequences for HMM generation were generated using ClustalW. These profiles/PSSMs or HMMs were searched using sequences from the full dataset.

IV. COVERAGE ANALYSIS

FASSM searches were performed by feeding seed-based PSI-BLAST profiles to FASSM, and sequence searches were performed using the independent data set to derive the coverage. RPS-BLAST searches employ the PSI-BLAST profiles of the seed sequences as the target database, with independent sequences used as the query sequences. Hmmpfam searches were performed using independent sequences against the database of seed-derived HMMs. Coverage was calculated as the ratio between the number of independent sequences annotated using FASSM or RPS-BLAST or HMMER2 to the total number of sequences in the full set for that family (Figure 1). It is expressed as a percentage.
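As a minimal sketch of the coverage calculation, the function below takes the identifiers of the sequences annotated by one seed-derived profile and the set of independent sequences of the family. It assumes the denominator is the number of independent sequences, following the per-sequence definition of coverage for sequence X in family Y; the name `coverage_percent` and the identifiers are hypothetical.

```python
from typing import Iterable, Set


def coverage_percent(annotated: Iterable[str], independent: Set[str]) -> float:
    """Coverage = 100 * (independent sequences annotated) / (independent sequences).

    Assumption: the denominator is the count of independent sequences
    (family members excluding seed sequences).
    """
    if not independent:
        raise ValueError("family has no independent sequences")
    # Count each sequence once and ignore identifiers outside the family.
    hits = set(annotated) & independent
    return 100.0 * len(hits) / len(independent)


# Hypothetical family with four independent sequences; the profile from one
# seed annotates two of them (one reported twice).
independent_ids = {"s1", "s2", "s3", "s4"}
print(coverage_percent(["s1", "s3", "s3"], independent_ids))
```

Because the function only needs the list of annotated sequences, the same calculation can be applied to the output of any of the three search programs.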

Here, the coverage of sequence X from protein domain family Y is the ratio of the number of independent sequences annotated by sequence X to the total number of independent sequences in family Y.

V. PERCENTAGE PSSM FACTOR CALCULATION

In addition to coverage analysis, the Percentage PSSM Factor (PPF) was calculated for RPS-BLAST hits using the PPF formula described by Gowri and coworkers [32], and the trend of PPF was compared with the coverage analysis score.

In the PPF formula, Ni is the number of profiles hit for a query in protein domain family ‘i’, and N is the total number of PSSMs for protein domain family ‘i’ in the PSSM database.

VI. RESULTS

Following the coverage analysis, the average coverage of the 100 families is provided in Figure 2. The average coverage results show that BRPs from HMMER searches reached an average coverage of 66.23%, followed by FASSM with 57.09% and RPS-BLAST with 23.74%. The following Pfam domain families from the dataset of 100 families were reported with the same BRP with the best coverage score: Ribosomal protein S28e [33] (Pfam ID: PF01200), Rotavirus NS26 [34] (Pfam ID: PF01525), DNA mismatch repair enzyme MutH [35] (Pfam ID: PF02976), DREV methyltransferase [36] (Pfam ID: PF05219) and a domain of unknown function (DUF1281) (Pfam ID: PF06924). The domain family F-box associated region [37] (Pfam ID: PF04300) reported the same BRS using RPS-BLAST and FASSM based searches, but not using HMMER. The same BRS was reported using FASSM and HMMER searches, but not using RPS-BLAST, for the following domains: Protein of unknown function (DUF682) (Pfam ID: PF05081), Tetrahydromethanopterin S-methyltransferase subunit B [38] (Pfam ID: PF05440), Chorismate mutase type I [39] (Pfam ID: PF07736) and M protein trans-acting positive regulator (MGA) PRD domain. Secretin N-terminal domain (Pfam ID: PF07655) was reported with an identical BRS from RPS-BLAST and HMMER searches, but not from the FASSM search. These interesting cases are highlighted in the supplementary material. Coverage was calculated as the ratio of the number of independent sequences annotated using FASSM, RPS-BLAST or HMMER2 to the total number of independent sequences in that family (Figure 2). The coverage analysis score is given as a percentage value. Coverage for the three different methods in comparison to the sequence identity of seed sequences is provided in Figure 3. Interestingly, there is very little correlation between average sequence identity and the coverage obtained by the BRP. We further examined whether the population of protein families affects the coverage. PPF is calculated and compared with the coverage analysis score in Figure 4 to understand the correlation between the approaches. Although sparsely populated families generally retain poor coverage, PPF also has little influence on whether a BRP achieves good coverage. Coverage analysis score and size of protein domain family (total number of sequence members within a protein domain family) are given in Figure 5. The correlation of the coverage of BRPs and their sequence identities (supplementary materials are provided at the URL http://caps.ncbs.res.in/download/brs/) suggests that the starting point profile depends much on the type of algorithm and the logic of search propagation in sequence space. Irrespective of the poor agreement between the different programs in identifying the BRS and subsequently the BRP, the coverage-based analysis can identify the BRS for a family. Sequence analysis based on the BRS may improve downstream analysis compared with analysis using a random sequence. We further employed a pre-aligned set of homologues of seed sequences to jump-start the construction of the profile and observed no particular improvement in coverage. In addition, in a majority of instances, a single profile created with all seed sequences within a family gives rise to lower coverage than observed for the BRS (see the jump-start alignment based results in the supplementary material).

Average coverage results using the PSI-BLAST based alignment for FASSM, the jump-start alignment for FASSM, RPS-BLAST searches and HMMER searches are provided in the Supplementary Material.

VII. DISCUSSION

Sampling data and searching for representatives is a general problem in large-scale data mining experiments. We defined an objective approach for identifying a best representative member of protein sequence domain families. Previous reports designated a profile as the BRP of a protein sequence domain family only if it was able to annotate >=50% of the family's own members. The earlier method identified a BRP for 91.4% of the protein families reported in the Pfam database (version 22). In the current study, we applied the coverage analysis method to identify the BRP for 100 families using three different sequence search programs. The current study shows that the approach is highly generic in nature and can be easily added to sequence analysis pipelines. Various biological sequence data analysis approaches based on HMMs [40], neural networks [41], regular expressions [42] and statistical methods [43-45] are available to deal with specific issues in sequence analysis. To the best of our knowledge, no previous approach or method has been reported for the mining of sequence databases to identify a single BRS.
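The >=50% criterion mentioned above lends itself to a simple summary statistic: the share of families whose best profile reaches the threshold. The sketch below is illustrative only; the function name `percent_families_with_brp`, the family labels and the coverage values are hypothetical and are not data from this study.

```python
from typing import Dict


def percent_families_with_brp(best_coverage: Dict[str, float],
                              threshold: float = 50.0) -> float:
    """Percentage of families whose best coverage score meets the threshold."""
    if not best_coverage:
        raise ValueError("no families supplied")
    qualifying = sum(1 for c in best_coverage.values() if c >= threshold)
    return 100.0 * qualifying / len(best_coverage)


# Hypothetical best coverage score (percent) per family.
best = {"fam1": 80.0, "fam2": 50.0, "fam3": 12.5, "fam4": 49.9}
print(percent_families_with_brp(best))
```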


We demonstrated that the coverage and efficiency of the choice of representatives is highly dependent on the algorithm for search propagation and not on the dispersion of data or the local density of data points. We envisage that sequence analysis driven by a BRS, instead of analysis based on a random ‘reference sequence’, will be an effective approach. As we derive the BRS based on a stringent sequence mining procedure, it is recommended to use the BRS rather than a random sequence as the starting point for performing in-depth protein sequence analysis. Before any extensive sequence analysis based on protein families, we propose that the sequence mining search be performed to obtain the best representative member. Irrespective of the diversity in the methods, the coverage analysis method discussed in this manuscript is shown to be an effective method to identify a single best representative member of sequence families. By applying the method and analyzing a diverse dataset of 100 protein domain families derived from the Pfam database, we have shown that the concept of coverage analysis can be utilized with different programs to derive a BRP. Irrespective of sequence diversity and subfamily level classifications within protein domain families, we have shown that a BRS can be derived using the method discussed in this manuscript.

VIII. CONCLUSION

A new biological sequence data mining protocol is proposed to identify a single BRS from a protein domain family. The approach is applied to a sample dataset of 100 Pfam domain families using three different search programs. The primary observation from the analysis is that the BRS depends upon the type of sequence search algorithm. We propose that performing a BRS search before alignment, HMM generation and profile generation may enhance the performance of the programs. BRS-based analysis could further improve remote homologue detection, as a reference sequence is accepted as a BRS only when it recognizes >=50% of its own family members using a sequence search program. The diversity of BRS derived using different methods shows that sequence search approaches, in general, do not agree on the choice of BRS, suggesting the need to perform BRS searches specific to the program of choice to find the best representative member before in-depth protein sequence analysis using protein domains. The current analysis is implemented using a limited dataset of 100 Pfam families and demonstrates the improvement of sequence coverage subsequent to recognition of the BRS using three independent methods. This one-time, computationally intensive data mining approach could increase efficiency in sequence searches. The work will be extended to all families in Pfam 24.0. Due to the generic nature of the data mining method, we envisage that the BRS identification method can be further extended to nucleotide sequences generated by next-generation sequencing approaches.

IX. ACKNOWLEDGMENT

R.S. was a Senior Research Fellow of the Wellcome Trust, U.K. R.S. thanks the Department of Biotechnology, Government of India for financial support. All authors acknowledge the National Centre for Biological Sciences (TIFR) for infrastructural and financial support.


X. REFERENCES

[1] R. R. Copley, T. Doerks, I. Letunic, and P. Bork, "Protein domain analysis in the era of complete genomes," FEBS Lett, vol. 513, pp. 129-34, Feb 20 2002.
[2] J. C. Wooley, A. Godzik, and I. Friedberg, "A primer on metagenomics," PLoS Comput Biol, vol. 6, p. e1000667, 2010.
[3] "The Human Genome Project: 10 years later," Lancet, vol. 375, p. 2194, Jun 26 2010.
[4] "Human genome at ten: The sequence explosion," Nature, vol. 464, pp. 670-1, Apr 1 2010.
[5] D. Butler, "Human genome at ten: Science after the sequence," Nature, vol. 465, pp. 1000-1, Jun 24 2010.
[6] J. C. Whisstock and A. M. Lesk, "Prediction of protein function from protein sequence and structure," Q Rev Biophys, vol. 36, pp. 307-40, Aug 2003.
[7] "The Universal Protein Resource (UniProt) in 2010," Nucleic Acids Res, vol. 38, pp. D142-8, Jan 2010.
[8] U. Hinz, "From protein sequences to 3D-structures and beyond: the example of the UniProt knowledgebase," Cell Mol Life Sci, vol. 67, pp. 1049-64, Apr 2010.
[9] A. A. Mironov and N. N. Alexandrov, "Statistical method for rapid homology search," Nucleic Acids Res, vol. 16, pp. 5169-73, Jun 10 1988.
[10] M. J. Weber, "New human and mouse microRNA genes found by homology search," FEBS J, vol. 272, pp. 59-73, Jan 2005.
[11] R. Bhadra, S. Sandhya, K. R. Abhinandan, S. Chakrabarti, R. Sowdhamini, and N. Srinivasan, "Cascade PSI-BLAST web server: a remote homology search tool for relating protein domains," Nucleic Acids Res, vol. 34, pp. W143-6, Jul 1 2006.
[12] X. Cui, T. Vinar, B. Brejova, D. Shasha, and M. Li, "Homology search for genes," Bioinformatics, vol. 23, pp. i97-103, Jul 1 2007.
[13] I. Letunic, T. Doerks, and P. Bork, "SMART 6: recent updates and new developments," Nucleic Acids Res, vol. 37, pp. D229-32, Jan 2009.
[14] J. Schultz, F. Milpetz, P. Bork, and C. P. Ponting, "SMART, a simple modular architecture research tool: identification of signaling domains," Proc Natl Acad Sci U S A, vol. 95, pp. 5857-64, May 26 1998.
[15] R. D. Finn, J. Mistry, J. Tate, P. Coggill, A. Heger, J. E. Pollington, O. L. Gavin, P. Gunasekaran, G. Ceric, K. Forslund, L. Holm, E. L. Sonnhammer, S. R. Eddy, and A. Bateman, "The Pfam protein families database," Nucleic Acids Res, vol. 38, pp. D211-22, Jan 2010.
[16] P. Coggill, R. D. Finn, and A. Bateman, "Identifying protein domains with the Pfam database," Curr Protoc Bioinformatics, vol. Chapter 2, p. Unit 2 5, Sep 2008.
[17] S. J. Sammut, R. D. Finn, and A. Bateman, "Pfam 10 years on: 10,000 families and still growing," Brief Bioinform, vol. 9, pp. 210-9, May 2008.
[18] E. L. Sonnhammer, S. R. Eddy, E. Birney, A. Bateman, and R. Durbin, "Pfam: multiple sequence alignments and HMM-profiles of protein domains," Nucleic Acids Res, vol. 26, pp. 320-2, Jan 1 1998.
[19] S. Hunter, R. Apweiler, T. K. Attwood, A. Bairoch, A. Bateman, D. Binns, P. Bork, U. Das, L. Daugherty, L. Duquenne, R. D. Finn, J. Gough, D. Haft, N. Hulo, D. Kahn, E. Kelly, A. Laugraud, I. Letunic, D. Lonsdale, R. Lopez, M. Madera, J. Maslen, C. McAnulla, J. McDowall, J. Mistry, A. Mitchell, N. Mulder, D. Natale, C. Orengo, A. F. Quinn, J. D. Selengut, C. J. Sigrist, M. Thimma, P. D. Thomas, F. Valentin, D. Wilson, C. H. Wu, and C. Yeats, "InterPro: the integrative protein signature database," Nucleic Acids Res, vol. 37, pp. D211-5, Jan 2009.

[20] J. Gouzy, F. Corpet, and D. Kahn, "Whole genome protein domain analysis using a new method for domain clustering," Comput Chem, vol. 23, pp. 333-40, Jun 15 1999.
[21] R. A. George and J. Heringa, "Protein domain identification and improved sequence similarity searching using PSI-BLAST," Proteins, vol. 48, pp. 672-81, Sep 1 2002.
[22] D. Reshef, Z. Itzhaki, and O. Schueler-Furman, "Increased sequence conservation of domain repeats in prokaryotic proteins," Trends Genet, Jul 17 2010.
[23] L. P. Tripathi and R. Sowdhamini, "Genome-wide survey of prokaryotic serine proteases: analysis of distribution and domain architectures of five serine protease families in prokaryotes," BMC Genomics, vol. 9, p. 549, 2008.
[24] A. Bhaduri and R. Sowdhamini, "Genome-wide survey of prokaryotic O-protein phosphatases," J Mol Biol, vol. 352, pp. 736-52, Sep 23 2005.
[25] R. P. Metpally and R. Sowdhamini, "Genome wide survey of G protein-coupled receptors in Tetraodon nigroviridis," BMC Evol Biol, vol. 5, p. 41, 2005.
[26] K. Shameer, P. Nagarajan, K. Gaurav, and R. Sowdhamini, "3PFDB-a database of best representative PSSM profiles (BRPs) of protein families generated using a novel data mining approach," BioData Min, vol. 2, p. 8, 2009.
[27] S. F. Altschul, T. L. Madden, A. A. Schaffer, J. Zhang, Z. Zhang, W. Miller, and D. J. Lipman, "Gapped BLAST and PSI-BLAST: a new generation of protein database search programs," Nucleic Acids Res, vol. 25, pp. 3389-402, Sep 1 1997.
[28] K. Gaurav, N. Gupta, and R. Sowdhamini, "FASSM: enhanced function association in whole genome analysis using sequence and structural motifs," In Silico Biol, vol. 5, pp. 425-38, 2005.
[29] A. A. Schaffer, Y. I. Wolf, C. P. Ponting, E. V. Koonin, L. Aravind, and S. F. Altschul, "IMPALA: matching a protein sequence against a collection of PSI-BLAST-constructed position-specific score matrices," Bioinformatics, vol. 15, pp. 1000-11, Dec 1999.
[30] A. Marchler-Bauer, A. R. Panchenko, B. A. Shoemaker, P. A. Thiessen, L. Y. Geer, and S. H. Bryant, "CDD: a database of conserved domain alignments with links to domain three-dimensional structure," Nucleic Acids Res, vol. 30, pp. 281-3, Jan 1 2002.
[31] S. R. Eddy, "Profile hidden Markov models," Bioinformatics, vol. 14, pp. 755-63, 1998.
[32] V. S. Gowri, K. G. Tina, O. Krishnadev, and N. Srinivasan, "Strategies for the effective identification of remotely related sequences in multiple PSSM search approach," Proteins, vol. 67, pp. 789-94, Jun 1 2007.
[33] M. Yoshihama, T. Uechi, S. Asakawa, K. Kawasaki, S. Kato, S. Higa, N. Maeda, S. Minoshima, T. Tanaka, N. Shimizu, and N. Kenmochi, "The human ribosomal protein genes: sequencing and comparative analysis of 73 genes," Genome Res, vol. 12, pp. 379-90, Mar 2002.
[34] S. K. Welch, S. E. Crawford, and M. K. Estes, "Rotavirus SA11 genome segment 11 protein is a nonstructural phosphoprotein," J Virol, vol. 63, pp. 3974-82, Sep 1989.
[35] C. Ban and W. Yang, "Structural basis for MutH activation in E.coli mismatch repair and relationship of MutH to restriction endonucleases," EMBO J, vol. 17, pp. 1526-34, Mar 2 1998.
[36] E. E. Bates, A. Kissenpfennig, C. Peronne, M. G. Mattei, F. Fossiez, B. Malissen, and S. Lebecque, "The mouse and human IGSF6 (DORA) genes map to the inflammatory bowel disease 1 locus and are embedded in an intron of a gene of unknown function," Immunogenetics, vol. 52, pp. 112-20, Nov 2000.
[37] T. Lienard and G. Gottschalk, "Cloning, sequencing and expression of the genes encoding the sodium translocating N5-methyltetrahydromethanopterin : coenzyme M methyltransferase of the methylotrophic archaeon Methanosarcina mazei Go1," FEBS Lett, vol. 425, pp. 204-8, Mar 27 1998.
[38] J. T. Winston, D. M. Koepp, C. Zhu, S. J. Elledge, and J. W. Harper, "A family of mammalian F-box proteins," Curr Biol, vol. 9, pp. 1180-2, Oct 21 1999.
[39] Y. M. Chook, H. Ke, and W. N. Lipscomb, "Crystal structures of the monofunctional chorismate mutase from Bacillus subtilis and its complex with a transition state analog," Proc Natl Acad Sci U S A, vol. 90, pp. 8600-3, Sep 15 1993.
[40] B. J. Yoon, "Hidden Markov Models and their Applications in Biological Sequence Analysis," Curr Genomics, vol. 10, pp. 402-15, Sep 2009.
[41] J. Hawkins and M. Boden, "The applicability of recurrent neural networks for biological sequence analysis," IEEE/ACM Trans Comput Biol Bioinform, vol. 2, pp. 243-53, Jul-Sep 2005.
[42] R. M. Horton, "Biological sequence analysis using regular expressions," Biotechniques, vol. 27, pp. 76-8, Jul 1999.
[43] A. Y. Mitrophanov and M. Borodovsky, "Statistical significance in biological sequence analysis," Brief Bioinform, vol. 7, pp. 2-24, Mar 2006.
[44] L. Pachter and B. Sturmfels, "Parametric inference for biological sequence analysis," Proc Natl Acad Sci U S A, vol. 101, pp. 16138-43, Nov 16 2004.
[45] L. Allison, L. Stern, T. Edgoose, and T. I. Dix, "Sequence complexity for biological sequence analysis," Comput Chem, vol. 24, pp. 43-55, Jan 2000.


FIGURES

Figure 1: Flow chart of coverage analysis approach employed in the study using FASSM, RPS-BLAST and HMMER

Figure 2: Average of coverage analysis score of 100 families


Figure 3: Comparison of coverage analysis score and average sequence identity of seed sequences

Figure 4: Comparison of PPF and coverage analysis score derived from the analysis of BRS derived using RPS-BLAST


Figure 5: Comparison of coverage analysis score with respect to sequence distribution in protein domain families used in the analysis.
