BIOINFORMATICS APPLICATIONS NOTE
Vol. 16 no. 2 2000 Pages 180–181
SISEQ: manipulation of multiple sequence and large database files for common platforms
Naoki Sato
Department of Biochemistry and Molecular Biology, Faculty of Science, Saitama University, 255 Shimo-Ohkubo, Urawa, Saitama Prefecture, 338-8570 Japan
Received on June 28, 1999; revised on September 9, 1999; accepted on September 10, 1999
Abstract
Summary: A multiple sequence file converter for common platforms, SISEQ, is described. It extracts DNA sequences that correspond to the CDS or RNA fields of a large database file and performs subsequent multi-sequence conversions for phylogenetic or molecular biological analysis. A command-line interface, a GUI and a script-driven operation mode are provided.
Availability: The program is freely available to academic users in the form of a Macintosh FAT binary, a DOS executable, or UNIX source code at http://www.molbiol.saitama-u.ac.jp/~naoki/Software.html.
Contact:
[email protected]
Introduction
Recent developments in genome sequence databases have made available to molecular biologists a large quantity of genetic sequence data that are potentially useful but increasingly difficult to manipulate. An important obstacle is the large file size (about 100 kb) of each entry in the genome databases and the extremely large size (more than 100 Mb) of the files in database releases. Although GenBank and EMBL database files contain useful information on the location of coding regions, promoter sequences and RNA genes, manually selecting the correct DNA sequence that corresponds to a given protein-coding sequence or RNA gene is usually time-consuming. We sometimes want to extract the DNA sequences of translational initiation or termination regions from a bacterial genome. We also need to extract the DNA sequence that corresponds to a coding sequence divided into many exons. Unfortunately, no commercial software provides tools to solve these rather simple problems, and there is no simple tool for manipulating multiple sequence files such as those used in phylogenetic analysis. Researchers write Perl scripts that meet these needs for their own use, but only a small number of such scripts are made available to the public. SEALS (Walker and Koonin, 1997) is an example of a large
package of Perl commands available only for UNIX. Features (Fristensky, 1993) and SeqIO (Knight, 1996) are also useful on UNIX platforms. In typical molecular biology and phylogeny laboratories, however, most students and researchers are not familiar with UNIX workstations. Such laboratories have needed a small program for common (Macintosh or Windows) platforms that performs simple manipulation of database files to yield input files for a phylogenetic analysis program. This is indeed a new and important need, since the performance of recent personal computers is equivalent to that of small workstations.
Overview of SISEQ
The simple sequence manipulation tools (SISEQ) package provides various tools to extract sequences from GenBank (Benson et al., 1999) and EMBL (Stoesser et al., 1999) database files on common platforms, such as Macintosh and Windows, as well as UNIX. The source code is written in ANSI C with some Macintosh-specific code, and is portable to any of these common platforms with changes only to the definition of the OS in a header file. It has been tested on various operating systems and compilers, as described in the documents attached to the package.
Command tools
All commands are invoked from a single program called ‘siseq’. The major command group specializes in the extraction of sequence information from a large GenBank or EMBL file (Figure 1). The extraction relies on the CDS or RNA fields in the feature description. The command ‘extcds’ extracts amino acid sequences of the CDS field. The command ‘cdsnuc’ extracts DNA sequences that begin at a specified position counted from the initiation or termination codon and end at another specified position counted from the initiation or termination codon. This is useful in extracting either a coding DNA sequence plus flanking sequences, or a flanking sequence around the translational initiation site or the termination codon. Even in the case of exons scattered over many different database entries, ‘cdsnuc’ extracts the exon sequences and combines them into a single coding sequence. The command ‘extrna’ extracts DNA sequences corresponding to RNA genes in a manner similar to ‘cdsnuc’. All these commands generate a multiple FASTA file. These commands can also process multiple (or catenated) database files that are present in the original database distribution or produced as a result of a database query (DBGET or SRS). To help extract sequences of interest, the commands ‘getent’ (get entry) and ‘genlist’ (gene list) are also provided in the SISEQ package.
Fig. 1. Various roles of SISEQ commands in the conversion of multi-sequence files.
© Oxford University Press 2000
The second type of command, ‘tofast’ (to FASTA), is a unified tool for the conversion of a GenBank or EMBL catenated sequence file, a multi-FASTA file, or a CLUSTAL W (Thompson et al., 1994) multiple alignment file to a multi-FASTA file, with optional conversions such as DNA-to-RNA, RNA-to-DNA, to the complementary strand, or DNA/RNA-to-protein. The command ‘getclu’ extracts a part of a multiple sequence alignment from a CLUSTAL W file. Additional commands are various converters for a single sequence, namely ‘txtr’ (text format converter), ‘getseq2’ (partial sequence extractor), and ‘toprot’ (converter to protein sequence).
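The coordinate handling behind an extraction of the kind that ‘cdsnuc’ performs can be illustrated with a minimal Python sketch. This is not SISEQ's code (the package is written in ANSI C), and it does not reproduce the ‘cdsnuc’ command-line options. The function names `parse_location`, `reverse_complement` and `extract_cds`, and the `upstream` and `downstream` parameters, are hypothetical names chosen for illustration. The sketch assumes a GenBank/EMBL-style feature location of the form `a..b` or `join(...)`, optionally wrapped in `complement(...)`, with all segments in the same entry. Exons located in other database entries, fuzzy bounds such as `<` and `>`, and `complement()` nested inside `join()` are not handled.

```python
import re

# Translation table for complementing unambiguous DNA bases.
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def parse_location(location):
    """Parse a simple feature location into (segments, is_complement).

    segments is a list of 1-based inclusive (start, end) pairs in the
    order written. Only an outermost complement(...) is recognised.
    """
    is_complement = location.startswith("complement(")
    segments = [(int(a), int(b))
                for a, b in re.findall(r"(\d+)\.\.(\d+)", location)]
    if not segments:
        raise ValueError("unsupported location: " + location)
    return segments, is_complement


def extract_cds(genome, location, upstream=0, downstream=0):
    """Join the segments of a CDS and add optional flanking bases.

    upstream and downstream are counted on the coding strand, i.e.
    before the initiation codon and after the termination codon.
    """
    segments, is_complement = parse_location(location)
    # Convert 1-based inclusive coordinates to Python slices and join exons.
    coding = "".join(genome[start - 1:end] for start, end in segments)
    first = segments[0][0]
    last = segments[-1][1]
    # On the complementary strand the flanks swap sides on the forward strand.
    left_len, right_len = (downstream, upstream) if is_complement \
        else (upstream, downstream)
    left = genome[max(0, first - 1 - left_len):first - 1]
    right = genome[last:last + right_len]
    full = left + coding + right
    return reverse_complement(full) if is_complement else full


# Usage: a toy "genome" with two exons, ATGAAA and CCCTAA.
genome = "TTATGAAAGGGCCCTAACC"
print(extract_cds(genome, "join(3..8,12..17)"))              # ATGAAACCCTAA
print(extract_cds(genome, "join(3..8,12..17)", upstream=2))  # TTATGAAACCCTAA
```

Joining the segments before taking the reverse complement keeps the exon order correct for a `complement(join(...))` location, since the segments are written in forward-strand order.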
Three different interfaces
The program ‘siseq’ may be invoked from the command line, from a Tcl/Tk-based graphical user interface, or from a siseq script written in the file ‘siseq.cf’. The script mode is convenient for performing an automatic conversion of many files on personal computers.
Acknowledgement
This work was supported in part by a Grant-in-Aid for Scientific Research (no. 09874167) from the Ministry of Education, Science, Sports and Culture, Japan.
References
Benson,D.A., Boguski,M.S., Lipman,D.J., Ostell,J., Ouellette,B.F.F., Rapp,B.A. and Wheeler,D.L. (1999) GenBank. Nucl. Acids Res., 27, 12–17.
Fristensky,B. (1993) Feature expressions: creating and manipulating sequence datasets. Nucl. Acids Res., 21, 5997–6003.
Knight,J. (1996) SeqIO package. Computer program, University of California, Davis. ftp://ftp.cs.ucdavis.edu/pub/strings/seqio.tar.gz.
Stoesser,G., Tuli,M.A., Lopez,R. and Sterk,P. (1999) The EMBL nucleotide sequence database. Nucl. Acids Res., 27, 18–24.
Thompson,J.D., Higgins,D.G. and Gibson,T.J. (1994) CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucl. Acids Res., 22, 4673–4680.
Walker,D.R. and Koonin,E.V. (1997) SEALS: a system for easy analysis of lots of sequences. Intell. Syst. Mol. Biol., 5, 333–339. http://www.ncbi.nlm.nih.gov/Walker/SEALS/index.html.