
SSHSuite: an integrated software package for analysis of large-scale suppression subtractive hybridization data

Stefan Weckx, Peter De Rijk, Christine Van Broeckhoven, and Jurgen Del-Favero
University of Antwerp, Antwerp, Belgium

BioTechniques 36:1043-1045 (June 2004)

Suppression subtractive hybridization (SSH) is a widely used technique for the identification of differentially expressed genes. SSH as well as other types of sequencing projects generate large amounts of anonymous sequences. SSHSuite automates the handling and storage of these sequences and enables identification through similarity searches. SSHSuite also offers analysis tools for the retrieval and comparison of the resulting similarity data. SSHSuite consists of four programs: SSHHandler, SSHOverview, SSHAnalysis, and SSHCompare.

INTRODUCTION

In recent years, the investigation of gene function and expression has become a fundamental part of modern biological research. For many years, global approaches to gene expression analysis were employed using sequence-based methods such as expressed sequence tag (EST) sequencing, serial analysis of gene expression (SAGE), and microarray analysis. Also, various strategies for selective isolation of differentially expressed genes have been developed. The challenge in this type of analysis is to efficiently extract those genes that are differentially expressed by distinguishing them from the large majority of nondifferentially expressed genes. A few of the most frequently employed strategies for selective gene expression analysis are representational difference analysis (RDA) (1), differential display (DD) (2), and suppression subtractive hybridization (SSH) (3). SSH is a robust, selective method for the identification of differentially expressed genes and is based on subtractive hybridization combined with suppression PCR. This approach has already been successfully applied in cancer research (4), in affective disorders (5), in developmental biology (6,7), and in toxicological research (8,9).

The output of an SSH experiment consists of a set of sequence trace files. To identify these anonymous sequences, an integrated bioinformatics approach is needed. This approach must contain sequence handling, alignment searches, and alignment data management and retrieval tools. None of the currently available software addresses all necessary steps in one package. The Staden Package allows sequence handling, but lacks the possibility of automated alignment searches and data management. The Windows®-based DNAStar® package allows only manual sequence handling and alignment searches, resulting in extensive hands-on time, which is not advisable in large-scale projects. Also, this program cannot store alignment results in a database, which hampers data querying. We developed the software package SSHSuite, which automates sequence handling, alignment searches, and data storage, and provides tools for alignment data retrieval and analysis. It can therefore be used to process data from SSH experiments as well as from other sequencing projects.

REQUIREMENTS

SSHSuite runs on a Linux workstation. It is written in the scripting language Tcl (http://www.tcl.tk) and uses several Tcl extensions: the tcllib library (http://tcllib.sourceforge.net/), the tDOM XML parser library (http://www.tdom.org/), and the dbi database interface library (http://rrna.uia.ac.be/dbi/). SSHSuite requires several external software packages: Phred (10,11), Phrap/cross_match (http://www.phrap.org), and RepeatMasker (http://repeatmasker.genome.washington.edu/). To perform alignment searches locally, Basic Local Alignment Search Tool (BLAST) (12) and the corresponding sequence databases need to be installed on the workstation. The most recent database versions can be downloaded from ftp://ftp.ncbi.nlm.nih.gov/blast/db/. The BLAST standalone executable and database-formatting program can be downloaded from ftp://ftp.ncbi.nlm.nih.gov/blast/executables/. If BLAST is run over the Internet, the BLAST client software must be installed; it can be downloaded from ftp://ftp.ncbi.nih.gov/toolbox/. The current database management system (DBMS) is PostgreSQL (http://www.postgresql.org), but the database model can be easily ported to other DBMS. SSHSuite databases are created with SSHMakedb, which comes with SSHSuite.

METHODS AND DISCUSSION

SSHSuite consists of four programs: SSHHandler, SSHOverview, SSHAnalysis, and SSHCompare (Figure 1). Since SSHSuite is intended to process large data sets in batch, the programs are all command line executables. To start an analysis project, SSHHandler is the first program to use. It takes sequence trace files as input together with a FASTA file containing the vector and adaptor sequences. Sequence trace files can be provided in all common formats and are base called using Phred. Pre-base-called trace files can also be used. In this case, SSHHandler requires a text file containing the sequence in FASTA format as well as a text file containing the quality values. The sequence reads are trimmed for vector and adaptor sequences using cross_match and assembled with Phrap.
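As a rough illustration of the pre-base-called input described above, the following Python sketch pairs a FASTA sequence file with its quality file and summarizes each read by the percentage of bases with a Phred score above 20. It is not SSHSuite code (SSHSuite is written in Tcl). The function names read_records, pair_reads, and percent_above are hypothetical, and the quality file layout (FASTA-like headers followed by space-separated integer scores) is an assumption about the input rather than a documented SSHHandler requirement.

```python
from typing import Dict, Iterable, Iterator, List, Tuple


def read_records(lines: Iterable[str]) -> Iterator[Tuple[str, List[str]]]:
    """Yield (name, body_lines) for each '>'-headed record in FASTA-like text."""
    name = None
    body: List[str] = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, body
            fields = line[1:].split()
            # Use the first word of the header as the read name.
            name = fields[0] if fields else ""
            body = []
        elif name is not None:
            body.append(line)
    if name is not None:
        yield name, body


def pair_reads(fasta_text: str, qual_text: str) -> Dict[str, Tuple[str, List[int]]]:
    """Match each sequence with its quality values (assumed layout, see lead-in)."""
    seqs = {n: "".join(b) for n, b in read_records(fasta_text.splitlines())}
    quals = {
        n: [int(q) for q in " ".join(b).split()]
        for n, b in read_records(qual_text.splitlines())
    }
    reads: Dict[str, Tuple[str, List[int]]] = {}
    for name, seq in seqs.items():
        if name not in quals:
            raise ValueError(f"no quality values for read {name!r}")
        if len(seq) != len(quals[name]):
            raise ValueError(f"length mismatch for read {name!r}")
        reads[name] = (seq, quals[name])
    return reads


def percent_above(quals: List[int], threshold: int = 20) -> float:
    """Percentage of bases whose Phred score is strictly above the threshold."""
    if not quals:
        return 0.0
    return 100.0 * sum(q > threshold for q in quals) / len(quals)
```

Checking that every read has exactly one quality value per base before assembly is a cheap way to catch mismatched or truncated input files early in a batch run.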
The ace output file together with the files containing the sequences of contigs and singlets are parsed into the database and are subsequently repeat masked with RepeatMasker. Next, BLAST analysis is performed either remotely on the National Center for Biotechnology Information (NCBI) computer systems using the BLASTcl3 executable or on a local installation of BLAST (12). Setting up a local BLAST installation requires extra work and computing power, but has higher network reliability, higher level of security, and the possibility to include custom databases such as the UniGene cluster sequences (13) or Pfam protein domain sequences (14). Finally, the BLAST alignments in XML format are parsed and inserted into the database supplemented with a time label referring to the date of the BLAST search.

The described strategy is implemented in a modular way and therefore allows the usage of SSHHandler in any project dealing with large-scale sequence data (Figure 1) and is not restricted to data obtained from SSH experiments. In the full mode, the strategy described proceeds without user intervention. The assemble mode is classical sequence assembly and combines base calling and assembly steps (Figure 1). The blast mode automates BLAST searches for sequence homologies, combining base calling, vector clipping, repeat masking, and BLAST steps (Figure 1). The reblast mode allows subsequent automated BLAST searches of a project's contig sequences to public databases in order to check regularly for new alignment hits in updated public databases (Figure 1). With the parse mode, BLAST output files in XML format obtained outside an SSHSuite project can be added to the database (Figure 1).

[Figure 1. Flow chart of SSHSuite. SSHHandler steps: Basecalling, Vector clipping, Assembly, Masking repeats, Database BLAST, Parsing. Downstream programs: SSHOverview, SSHAnalysis, SSHCompare. Modules: F, full; A, assemble; B, blast; R, reblast; P, parse.]

Next, SSHOverview can be used to obtain an overview of the sequence reads and contigs and to backtrack original reads in order to identify low quality or failed sequences (Figure 1). Sequence quality is indicated by the percentage of bases that have a Phred score higher than 20.

The database can be analyzed using SSHAnalysis and SSHOverview. SSHAnalysis is a powerful search tool to retrieve sequence similarity data. BLAST hits can be extracted using user-defined cutoff values for e-value, percentage similarity, and identity, or by keyword searches through accession number and/or sequence name annotation. Both search strategies have a tab-delimited text file as output that can be opened in any spreadsheet program for further interpretation of data. In some cases, an enumeration of a limited set of BLAST hits can obscure informative BLAST hits with lower scores elsewhere on the contig sequence. Therefore, SSHAnalysis enables an additional search in the database, retrieving hits that match to parts of the contig sequence that were not covered by the first set of hits. This output can be either in HTML or tab-delimited format. The HTML output has hyperlinks to additional information on the NCBI web site through GenBank® gi numbers and/or the UniGene cluster IDs.

SSHCompare provides the ability to compare two sets of BLAST hits based on the accession IDs. In case the data are in two different project databases (e.g., when looking for genes that are up- or down-regulated in one tissue compared to another tissue), SSHCompare provides an overview of both unique and common hits in the databases. In combination with the reblast mode of SSHHandler, where all BLAST data are in the same database but with different time labels, SSHCompare presents an overview of new entries in the public databases matching contig sequences.

CONCLUSION

We presented a novel software tool designed for automated processing of large sequence-based data sets. This software adequately processes and manages sequence trace files, sequence identification, data storage, and data retrieval. SSHSuite is in use in several projects containing on average >1000 sequence trace files, resulting in >500 contigs and on average 500,000 BLAST hits to be managed per project. SSHSuite is located at http://www.molgen.ua.ac.be/bioinfo/SSHSuite and is freely available to academic users upon request from the corresponding author.

ACKNOWLEDGMENTS

We thank Robin Lefebvre and Nathalie Verpoorten for their valuable feedback during the development and validation of SSHSuite. This work was funded in part by the Special Research Fund of the University of Antwerp, the Fund for Scientific Research Flanders, and the InterUniversity Attraction Poles program P5/19 of the Federal Science Policy Office, Belgium.

REFERENCES

1. Lisitsyn, N., N. Lisitsyn, and M. Wigler. 1993. Cloning the differences between two complex genomes. Science 259:946-951.
2. Liang, P. and A.B. Pardee. 1992. Differential display of eukaryotic messenger RNA by means of the polymerase chain reaction. Science 257:967-971.
3. Diatchenko, L., Y.F. Lau, A.P. Campbell, A. Chenchik, F. Moqadam, B. Huang, S. Lukyanov, K. Lukyanov, et al. 1996. Suppression subtractive hybridization: a method for generating differentially regulated or tissue-specific cDNA probes and libraries. Proc. Natl. Acad. Sci. USA 93:6025-6030.
4. Difilippantonio, S., Y. Chen, A. Pietas, K. Schluns, M. Pacyna-Gengelbach, N. Deutschmann, H.M. Padilla-Nash, T. Ried, and I. Petersen. 2003. Gene expression profiles in human non-small and small-cell lung cancers. Eur. J. Cancer 39:1936-1947.
5. Yolken, R.H. 2002. Subtraction libraries for the molecular characterization of gene-environmental interactions in bipolar disorder. Bipolar Disord. 4(Suppl 1):77-80.
6. King, M.W., T. Nguyen, J. Calley, M.W. Harty, M.C. Muzinich, A.L. Mescher, C. Chalfant, M. N’Cho, et al. 2003. Identification of genes expressed during Xenopus laevis limb regeneration by using subtractive hybridization. Dev. Dyn. 226:398-409.
7. Hofsaess, U. and J.P. Kapfhammer. 2003. Identification of numerous genes differentially expressed in rat brain during postnatal development by suppression subtractive hybridization and expression analysis of the novel rat gene rMMS2. Brain Res. Mol. Brain Res. 113:13-27.
8. Shin, H.J., K.K. Park, B.H. Lee, C.K. Moon, and M.O. Lee. 2003. Identification of genes that are induced after cadmium exposure by suppression subtractive hybridization. Toxicology 191:121-131.
9. Williams, T.D., K. Gensberg, S.D. Minchin, and J.K. Chipman. 2003. A DNA expression array to detect toxic stress response in European flounder (Platichthys flesus). Aquat. Toxicol. 65:141-157.
10. Ewing, B., L. Hillier, M.C. Wendl, and P. Green. 1998. Base-calling of automated sequencer traces using phred. I. Accuracy assessment. Genome Res. 8:175-185.
11. Ewing, B. and P. Green. 1998. Base-calling of automated sequencer traces using phred. II. Error probabilities. Genome Res. 8:186-194.
12. Altschul, S.F., T.L. Madden, A.A. Schaffer, J. Zhang, Z. Zhang, W. Miller, and D.J. Lipman. 1997. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 25:3389-3402.
13. Wheeler, D.L., D.M. Church, A.E. Lash, D.D. Leipe, T.L. Madden, J.U. Pontius, G.D. Schuler, L.M. Schriml, et al. 2002. Database resources of the National Center for Biotechnology Information: 2002 update. Nucleic Acids Res. 30:13-16.
14. Bateman, A., E. Birney, R. Durbin, S.R. Eddy, R.D. Finn, and E.L. Sonnhammer. 1999. Pfam 3.1: 1313 multiple alignments and profile HMMs match the majority of proteins. Nucleic Acids Res. 27:260-262.

Received 11 February 2004; accepted 30 March 2004. Address correspondence to Jurgen Del-Favero, Department of Molecular Genetics (VIB8), University of Antwerp, Universiteitsplein 1, B-2610 Antwerpen, Belgium. e-mail: [email protected]