BIOINFORMATICS

Vol. 16 no. 3 2000 Pages 190–202


Homology-based gene structure prediction: simplified matching algorithm using a translated codon (tron) and improved accuracy by allowing for long gaps

Osamu Gotoh

Saitama Cancer Center Research Institute, 818 Komuro Ina-machi, Saitama 362-0806, Japan

Received on August 23, 1999; accepted on October 21, 1999

Abstract

Motivation: Locating protein-coding exons (CDSs) on a eukaryotic genomic DNA sequence is the initial and an essential step in predicting the functions of the genes embedded in that part of the genome. Accurate prediction of CDSs may be achieved by directly matching the DNA sequence with a known protein sequence or with a profile of one or more homologous family members.

Results: A new convention for encoding a DNA sequence into a series of 23 possible letters (translated codon or tron code) was devised to improve this type of analysis. Using this convention, a dynamic programming algorithm was developed to align a DNA sequence with a protein sequence or profile so that the spliced and translated sequence optimally matches the reference, in the same way as a standard protein sequence alignment that allows for long gaps. The objective function also takes account of frameshift errors, coding potentials, and translational initiation, termination and splicing signals. This method was tested on Caenorhabditis elegans genes of known structures. The accuracy of prediction measured in terms of a correlation coefficient (CC) was about 95% at the nucleotide level for the 288 genes tested, and 97.0% for the 170 genes whose product and closest homologue share more than 30% identical amino acids. We also propose a strategy to improve the accuracy of prediction for a set of paralogous genes by means of iterative gene prediction and reconstruction of the reference profile derived from the predicted sequences.

Availability: The source code for the program ‘aln’, written in ANSI-C, and the test data will be available via anonymous FTP at ftp.genome.ad.jp/pub/genomenet/saitama-cc.

Contact: [email protected]


Introduction

Following the completion of genomic sequencing of the yeast Saccharomyces cerevisiae (Goffeau et al., 1996), the nearly complete structure of the nematode Caenorhabditis elegans genome has recently been reported (The C. elegans Sequencing Consortium, 1998). Sequencing projects for several eukaryotic genomes, including the human genome, are now in progress. Identifying the genes on these genomic sequences and inferring their functions are major themes of current computational genome analyses. One obstacle to gene identification is the fact that typical eukaryotic genes are segmented, and the prediction of precise exonic regions is still a challenging problem (Burge and Karlin, 1998; Claverie, 1997; Murakami and Takagi, 1998). Most gene-identification methods rely on statistical features of coding and non-coding sequences, and on signals around the beginnings and ends of transcription, translation, and splicing. Various algorithms including artificial neural networks (Uberbacher and Mural, 1991), hidden Markov models (Burge and Karlin, 1997), and discriminant analyses (Zhang, 1997) coupled with dynamic programming algorithms (Snyder and Stormo, 1995; Xu et al., 1994) have been used to capture specific signals and derive a final prediction by combining several lines of information. The best-performing programs currently available correctly predict 70–80% of exons (Burge and Karlin, 1998; Claverie, 1997; Murakami and Takagi, 1998). This level of accuracy might be sufficient for classifying a gene into a certain gene family, but is insufficient for some other purposes, such as predicting the structure of the encoded protein and evolutionary studies, since the chance of perfect prediction of the entire coding sequence declines exponentially with the number of exons in the gene. Several gene-finding methods incorporate the results of sequence similarity searches (Altschul et al., 1990)

© Oxford University Press 2000

Gene structure prediction by homology

to significantly improve the overall prediction accuracy (Burset and Guigó, 1996; Cai and Bork, 1998). To achieve still better predictions, however, more direct involvement of homology information appears to be necessary (Mironov et al., 1998). Homology information has been shown to be useful even for better identification of prokaryotic genes (Lolkema and Slotboom, 1998; Pearson et al., 1997). A handful of methods have been proposed to predict eukaryotic gene structures by direct matching of the genomic DNA sequence and a reference protein (Birney and Durbin, 1997; Gelfand et al., 1996; Huang and Zhang, 1996), or cDNA sequence (Florea et al., 1998; Mott, 1997). ‘Genewise’ from Birney and Durbin (1997) appears to be the most general of these methods, since it considers frameshift errors and accepts a protein profile as the reference, while the other methods lack one or the other of these features. The use of a protein profile or a profile hidden Markov model, rather than a single sequence, may improve the alignment accuracy, as has been repeatedly experienced for homology searches (Altschul et al., 1997; Park et al., 1998) and multiple sequence alignment (Gotoh, 1996). However, genewise does not necessarily predict the complete gene structure, and sometimes reports only fragmentary results. Upon investigating the members of some multi-gene families throughout the genome of C.elegans, I became aware that prediction-based CDSs annotated in public sequence databases, such as GenBank, might be wrong for up to nearly half of the genes (Gotoh, 1998). I reached this conclusion because multiple protein sequence alignments derived from such CDSs contained several structurally implausible gaps (insertions and deletions). I was able to show that the reassignment of exons greatly improved the quality of alignments (Gotoh, 1998).
The principle of the prediction algorithm was straightforward: when the genomic sequence is spliced and translated, the conceptual protein sequence should optimally match the reference sequence or a ‘generalized profile’, as in the usual protein–protein or protein–profile alignment with an affine gap-penalty function. Our profile is more general than the usual one (Gribskov et al., 1987) in terms of rigorous treatment of internal gaps of various lengths and positions (Gotoh, 1994). Translational initiation, termination, and splicing signals as well as exonic coding potentials are also taken into account in the objective function to be optimized. In this paper, I will show the details of the algorithm, which involves several novel features. First, a genomic sequence is encoded into translated codon (or tron) codes. The 23-letter codes can compactly encode the potential translated amino acid sequence without losing any information of the original nucleotide sequence. Second, a matching score for a codon interrupted by a phase-1 or phase-2 intron is rigorously evaluated, where phase-1 and phase-2 introns are those located between the first and second, and the second and third nucleotides in a codon, respectively. Third, a special form of gap-penalty function that takes account of frameshift errors and long gaps is employed. Finally, an iterative strategy is developed to coherently improve the prediction accuracy for a set of paralogous genes within a genome. Empirical examinations with 291 C.elegans genes of experimentally identified exon–intron organizations indicated that our method accurately predicts exonic sequences with a correlation coefficient of about 97% at the nucleotide level, when the amino acid identities between the reference sequence and the objective gene product exceed the generally recognized range (twilight zone) of reliable global protein sequence alignment (Doolittle, 1981; Sander and Schneider, 1991).

Methods

Tron code

A remarkable feature of the universal genetic code is that the second nucleotide in a codon greatly affects its specificity (Crick, 1968). In fact, all codons for an amino acid have a unique nucleotide at the second position, except for Ser (TCN and AGY) and termination (TAR and TGA) codons. Thus, only 23 letters are necessary and sufficient to unambiguously encode both the original nucleotide sequence and conceptually translated amino acid sequences in the three frames. We propose to call each translated codon a ‘tron’, an element of the tron alphabet written here as $\Sigma_t$, and to express trons by the standard one-letter amino acid codes (the alphabet $\Sigma_a$), except for ‘J’, ‘O’, and ‘U’, which are used to represent translated AGY, TAR, and TGA codons, respectively (Figure 1). Thus, $\Sigma_t = \Sigma_a \cup \{$‘J’, ‘O’, ‘U’$\}$. In a ‘tron sequence’, $b = b_1 b_2 \ldots b_J$, the tron code substitutes for the second nucleotide of a triplet. The tron codes at the first and last sites in a sequence may be determined based on the arbitrary assumption that a fixed nucleotide, ‘A’, occupies the sites immediately before the first and immediately after the last sites of the original nucleotide sequence. A usual 20 × 20 amino acid exchange matrix, $M(a, b)$ ($a, b \in \Sigma_a$), such as PAM250 (Dayhoff et al., 1978) and JTT250 (Jones et al., 1992), is easily expanded to a 20 × 23 amino acid versus tron similarity matrix, $S(a, b)$ ($a \in \Sigma_a$, $b \in \Sigma_t$), such that $S(a, b) = M(a, b)$ if $b \in \Sigma_a$, $S(a, \text{‘J’}) = M(a, \text{‘S’})$, and $S(a, \text{‘O’}) = S(a, \text{‘U’}) = \min M(a, b)$. The matrix $S(a, b)$ facilitates immediate comparison of an amino acid or a profile vector with a translated codon. On the other hand, a single reference to a 23-element table immediately recovers the original nucleotide at a specific site.

Boundary signals and coding potential

We used the frequency of ‘ditrons’, i.e. neighboring trons every three sites, in three frames to estimate coding

O.Gotoh

Fig. 1. The tron codes. The proposed tron (translated codon) codes are shown in boldface letters together with arbitrarily chosen numeric codes (1–23).

potential. All frequency data were normalized with the corresponding reference frequencies obtained from the 27 Mb of C.elegans genomic sequence that was publicly available by August 1998. More specifically, the coding potential at a site $j$, $\varphi^E_j$, is calculated as:

$$\varphi^E_j = \varphi^0_j + \varphi^{-1}_{j-1} + \varphi^{+1}_{j+1} \qquad (1)$$

where $\varphi^k_j = \log\{ f^k(b_j, b_{j+3}) / [f(b_j)\, f(b_{j+3})] \}$ for $k \in \{-1, 0, +1\}$, $f(b)$ is the relative frequency of tron $b$ in general genomic sequence, $f^0(a, b)$ is the relative frequency of ditron $(a, b)$ in the ‘in-frame’ coding phase, and $f^k(a, b)$ ($k = -1$ and $+1$) are those in the ‘out-of-frame’ phases. To maintain consistency with a PAM matrix, a common logarithm (base 10) was used to calculate the score tables. This method is nearly the same as that which relies on phase-specific diamino usage, but is a little closer to the most popular methods based on hexamer frequencies or fifth-order Markov models (Fickett and Tung, 1992). The training set was obtained from 6298 CDSs described in the INV class of the GenBank database Release 84 (1994). More than three-quarters of the CDSs in the set were derived from species other than C.elegans. Since translated information is less species specific than the nucleotide sequence (Guigó and Fickett, 1995), and since a ditron table is considerably smaller than a hexamer table ($23^2/4^6 = 1/7.74$), our choice of the ditron method may be appropriate for the present analysis. The signal strengths of each site as a potential translational start, stop, or 5′ or 3′ splicing boundary were evaluated by conditional probability matrices with a window size of 20 derived from first-order Markov models at the nucleotide sequence level (Salzberg, 1997; Zhang and Marr, 1993). Experimentally verified splicing donor and acceptor sites (8192 each) of C.elegans genes were taken from the homepage of the Sanger Center (URL: http://www.sanger.ac.uk/Projects/C_elegans). Translational start and stop sites were obtained from the CDSs described above. The conditional probabilities were normalized and logarithmically transformed as mentioned above to yield score tables.
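The tron encoding and the ditron coding potential of Equation (1) can be illustrated with a short Python sketch. This is not the ‘aln’ implementation: the function names are hypothetical, sites are 0-based, and the frequency tables passed to `coding_potential` are assumed to be supplied by the caller (the trained tables are not reproduced here).

```python
import math
from itertools import product

BASES = "TCAG"
# Standard genetic code in TCAG order; '*' marks termination codons.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}


def tron(codon):
    """Translate one codon to a tron letter (J = AGY, O = TAR, U = TGA)."""
    if codon in ("AGT", "AGC"):
        return "J"
    if codon in ("TAA", "TAG"):
        return "O"
    if codon == "TGA":
        return "U"
    return CODON_TABLE[codon]


def encode_trons(dna):
    """The tron at site j translates the triplet centred on nucleotide j.

    A fixed 'A' is assumed before the first and after the last nucleotide.
    """
    padded = "A" + dna.upper() + "A"
    return "".join(tron(padded[j:j + 3]) for j in range(len(dna)))


# Each tron letter determines the middle nucleotide of its codon,
# so a 23-element table recovers the original sequence.
MIDDLE = {tron(c): c[1] for c in CODON_TABLE}


def decode_trons(trons):
    """Recover the nucleotide sequence from a tron sequence."""
    return "".join(MIDDLE[t] for t in trons)


def phi(trons, j, ditron_freq, tron_freq):
    """Log-odds (base 10) of the ditron (b_j, b_{j+3}) against background."""
    a, b = trons[j], trons[j + 3]
    return math.log10(ditron_freq[(a, b)] / (tron_freq[a] * tron_freq[b]))


def coding_potential(trons, j, f_in, f_minus, f_plus, f_bg):
    """Equation (1): in-frame term at j plus out-of-frame terms at j-1 and j+1.

    Requires 1 <= j and j + 4 < len(trons).
    """
    return (phi(trons, j, f_in, f_bg)
            + phi(trons, j - 1, f_minus, f_bg)
            + phi(trons, j + 1, f_plus, f_bg))
```

For example, `encode_trons("ATGAGTTAA")` gives `NMUEJVLOK`; every third letter starting at offset 1 (`MJO`) is the frame-0 translation, with ‘J’ standing for the AGT serine and ‘O’ for the TAA stop, and `decode_trons` returns the original nucleotides.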

Gap-penalty function and matching algorithms

A special form of gap-penalty function was adopted (Figure 2). An insertion or deletion (indel) of k nucleotides was penalized by a restricted affine function (Chao, 1999; Huang and Zhang, 1996) if k was a multiple of 3; otherwise an additional penalty was given to allow, but disfavor, potential frameshifts. Since a constant basal penalty was assigned to an indel longer than a specified length K, our gap-penalty function is a little more general than the most commonly used affine functions. The basic matching algorithm was an extension of the ‘long-gap algorithm’ proposed previously (Gotoh, 1990), but modified so as not to penalize terminal gaps (Sellers, 1979). The algorithm sketched in the Appendix runs in time proportional to the product of the lengths of the sequences under comparison, despite the rather complicated form of the gap-penalty function. (The actual computation time is further reduced by use of the ‘cutting-corners approximation’ (Sankoff and Kruskal, 1983).) It was assumed that an insertion of nucleotides occurs only


[Figure 2: gap penalty and intron penalty (vertical axis, 10 to –70) plotted against gap length in nt (horizontal axis, 0 to 30).]

Fig. 2. Gap penalties as a function of gap length. The basal form is a restricted affine function, whereas an extra penalty is imposed on a gap whose length is not divisible by three. The shaded area indicates the possible range of a penalty value given to a tentative intron.

at a codon boundary. Moreover, a match between an amino acid or a profile vector and an incomplete codon containing a single- or double-nucleotide deletion was scored solely for that deletion. Although a more elaborate scoring scheme might be possible (Hein, 1994; Peltola et al., 1986), our simple scheme is efficient and likely to perform as well as other alternative schemes (Pearson et al., 1997). When the reference is a multiple sequence alignment containing internal gaps, the alignment was converted into a generalized profile as described previously (Gotoh, 1994). We used a very restricted version of a ‘candidate list’ algorithm (Gotoh, 1993), in which the maximal number of candidates retained at each iteration node is limited to three. Although this does not guarantee a rigorous optimal alignment, excessive rigor as to detailed alignment is unnecessary, since our major purpose is to determine the gene organization. An insertion of nucleotides flanked by 5′ and 3′ splicing signals above a given threshold value is regarded as a potential intron, which is weighted by the sum of the flanking splicing signals ($\psi_5$ and $\psi_3$) plus a negative constant (intron penalty, $-\nu_I$) irrespective of the length. At each boundary, the three coding frames were considered independently. Special care must be taken at phase-1 and phase-2 boundaries, at which a codon is interrupted by an intron: the intron is spliced out, the recovered codon is translated, and then a matching score is calculated. A rigorous algorithm would require four variables corresponding to the four kinds of nucleotide at the first position of a phase-1 codon (see Appendix). Likewise, we need at least two variables corresponding to purine and pyrimidine at the third position in a phase-2 codon ($b_j$ in Figure 3(b)). Considering the balance between rigor and efficiency, we adopted a compromise algorithm which uses one, one, and two variables for phase-0, phase-1, and phase-2 boundaries, respectively. The accidental appearance of a premature termination codon, the most harmful consequence of using a primitive method that skips regeneration of an interrupted codon, is thus effectively avoided. A restriction was imposed so that a potential exon must match the reference at least in part. Thus, even if the combined score of the coding potential and exon/intron boundary signals for part of a genomic sequence is positive, that part is not considered a potential exon if the entire region is an ‘insertion’ relative to the reference sequence or profile. This restriction was necessary to avoid excessive false positives, especially outside a real gene. A conventional traceback procedure is still used, where linked lists are recorded and retrieved (Gotoh, 1990). If we are interested only in the gene structure, the number of stored records is roughly 1/20 of the scan space, and this can be easily accommodated in the main memory of most contemporary workstations. We met no trouble other than a single exception in calculating the test data. A preliminary implementation of the linear space traceback


[Figure 3 labels: tron sequence; protein sequence; panels (a) and (b); phase; intron; potential exon 1 (5′); exon 2 (3′).]

Fig. 3. Alignment paths arriving at a node (i, j). (a) At any node, seven paths (1–7) are considered for calculation of partial alignment scores (Appendix). (b) At a potential splicing acceptor site j −δ(δ ∈ {0, 1, 2}), additional paths from potential donor sites must also be considered.

algorithm (Myers and Miller, 1988) took nearly twice as long to execute as our standard method, and the difference was even greater when the scan space was restricted around the main diagonals. We are planning to develop a hybrid method, in which the linear space algorithm is used at initial phases until the expected storage requirement falls below a chosen upper limit.
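Three scoring ingredients described in the Methods so far, namely the 20 × 23 expansion of the substitution matrix, the length-independent intron weight, and the regeneration of a codon interrupted by an intron, can be sketched in Python as follows. The function names and the `to_tron` callable are hypothetical, and taking the minimum over $b$ for each amino acid $a$ is an assumption about how the termination-tron score is defined; the actual dynamic programming recurrences are those of the Appendix and are not reproduced here.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def expand_matrix(M):
    """Extend a 20x20 matrix M[(a, b)] to the amino-acid-versus-tron matrix S."""
    S = dict(M)
    for a in AMINO_ACIDS:
        # Assumption: the termination score is the worst score in row a.
        worst = min(M[(a, b)] for b in AMINO_ACIDS)
        S[(a, "J")] = M[(a, "S")]  # AGY codons still encode serine
        S[(a, "O")] = worst        # TAR termination codons
        S[(a, "U")] = worst        # TGA termination codon
    return S


def intron_weight(psi5, psi3, nu_i=68.0):
    """Sum of the flanking splicing signals minus the intron penalty.

    The weight does not depend on the intron length.
    """
    return psi5 + psi3 - nu_i


def split_codon_score(aa, left, right, S, to_tron):
    """Score a codon interrupted by a phase-1 or phase-2 intron.

    left  : nucleotides of the codon at the end of the upstream exon
            (1 for phase 1, 2 for phase 2)
    right : remaining nucleotides at the start of the downstream exon
    The intron is spliced out, the recovered codon is translated to a
    tron with the caller-supplied to_tron, and the match is scored.
    """
    codon = left + right
    if len(codon) != 3 or len(left) not in (1, 2):
        raise ValueError("expected a codon split 1+2 or 2+1")
    return S[(aa, to_tron(codon))]
```

Scoring the regenerated codon, rather than the two fragments separately, is what prevents a stop codon that exists only across the splice junction from going unnoticed.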

Test data and assessment of performance

Caenorhabditis elegans mRNAs containing complete or nearly complete CDSs were retrieved from GenBank Rel. 110 (December, 1998). After removal of identical sequences and alternative transcripts other than the longest one, the corresponding genes throughout the entire C.elegans genome in six chromosomes were identified, and the exon/intron structures were estimated through a ‘blastn’ search (Altschul et al., 1990) followed by sequence alignment (Gotoh, 1990). We were unable to identify complete structures for 8% (35 of 425) of the genes examined, presumably because the ‘chromosomal’ C.elegans genomic sequences (The C.elegans Sequencing Consortium, 1998) are still incomplete. Some parts of the genomic sequences are probably misassembled, since the corresponding mRNA sequences partially match opposite strands. Although several gene structures could be completed by reference to a daily updated version of the C.elegans genome database (URL: http://www.sanger.ac.uk/Projects/C_elegans), we used only the 390 genes identified in the chromosomal sequences. The gene lengths (from the translational initiation codon to the termination codon inclusive) varied from 279 to 35 175 bp and the number of exons in a gene ranged from 1 to 47. A typical gene consisted of 8.59 ± 4.95 exons of length 219 ± 218 nt and introns of length 399 ± 926 nt (mean ± SD). Protein sequences similar to each C.elegans sequence translated from an mRNA were searched for in the SWISS-PROT database Release 34 (October, 1996) with the ‘blastp’ program (Altschul et al., 1990). We screened homologues by three criteria: (i) the sequence is derived from an organism other than C.elegans, (ii) the blast probability is less than $10^{-3}$, and (iii) the sequence is shorter than 120% and longer than 80% of the length of the C.elegans protein. Of the 390 genes tested, 291 had at least one homologue that satisfied criteria (i)–(iii). When more than one homologue was found, the one with the lowest blast probability was used as the ‘reference protein’. In such a case, a multiple sequence alignment was constructed by the ‘prrp’ program (Gotoh, 1996) from the protein homologues. We finally obtained 260 alignments composed of at least two homologues that satisfied criteria (i)–(iii). Columns in each alignment consisting of more than G% (G = 50 by default) deletion characters were removed, and the trimmed alignment was converted into a ‘generalized profile’ (Gotoh, 1994), and then used as a ‘profile reference’. In principle, it is possible but too demanding to predict the global structure of a gene without any preprocessing. Therefore, to test the performance of our method, we assumed that the location of an objective gene is known within a margin of M nucleotides, i.e.
the region of the genomic sequence from M nucleotides upstream of the first nucleotide of the initiation codon to M nucleotides downstream of the last nucleotide of the termination codon was subjected to analysis. Unless otherwise specified, M = 100 was used throughout the tests. Several measures of the accuracy of prediction, i.e. sensitivity (Sn), specificity (Sp), and correlation coefficient (CC) at the nucleotide level, were calculated as shown in Snyder and Stormo (1995, Table 6). Three test genes with no intron were omitted from the calculations. Since Sn and Sp were generally well balanced, we used a single value, $P_b$, to represent the percentage of correctly predicted exon/intron boundaries, which is the harmonic mean of sensitivity and specificity at the boundary level, and defined as 200 × number of correctly predicted boundaries / (number of real boundaries + number of predicted boundaries). A similar formula was used to calculate the percentage of exactly predicted exons ($P_e$), for which both ends must be correct. Note that these accuracy measures could underestimate the real situation, since the possibility of alternative splicing is totally neglected.
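The accuracy measures can be written out as a short Python sketch. The formula for $P_b$ (and, with exons in place of boundaries, $P_e$) is the one stated above. The nucleotide-level Sn, Sp and CC are assumed here to follow the definitions conventional in gene-prediction assessment (Sn = TP/(TP+FN), Sp = TP/(TP+FP), and CC as the correlation between real and predicted coding labels); the cited Table 6 is not reproduced in this paper, and the function names are illustrative.

```python
import math


def nucleotide_accuracy(real, predicted):
    """Return (Sn, Sp, CC) from per-nucleotide coding labels.

    real, predicted: equal-length sequences of booleans, True where the
    nucleotide is coding. Assumes each class occurs at least once, so
    that no denominator is zero.
    """
    pairs = list(zip(real, predicted))
    tp = sum(1 for r, p in pairs if r and p)
    fn = sum(1 for r, p in pairs if r and not p)
    fp = sum(1 for r, p in pairs if not r and p)
    tn = sum(1 for r, p in pairs if not r and not p)
    sn = tp / (tp + fn)
    sp = tp / (tp + fp)
    cc = (tp * tn - fp * fn) / math.sqrt(
        (tp + fn) * (tn + fp) * (tp + fp) * (tn + fn))
    return sn, sp, cc


def percent_correct(real_items, predicted_items):
    """200 * correct / (real + predicted).

    Pass boundary positions to obtain P_b, or (start, end) exon tuples
    to obtain P_e, where an exon counts only if both ends agree.
    """
    real_set, predicted_set = set(real_items), set(predicted_items)
    correct = len(real_set & predicted_set)
    return 200.0 * correct / (len(real_set) + len(predicted_set))
```

Because `percent_correct` divides by the sum of real and predicted counts, a method that predicts too many boundaries is penalized just as one that predicts too few, which is why a single value can stand in for both sensitivity and specificity when the two are balanced.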

Coherent prediction of a set of paralogous gene structures

An iterative strategy was designed to predict exon/intron organizations of individual paralogous genes in a family even if no homologue in other species is known. The basic idea is similar to that of the ‘doubly nested randomized iterative strategy’ (DNR method) for multiple sequence alignment (Gotoh, 1996). We start with a set of translated sequences that have been previously predicted, say by a statistical gene-finding method, and calculate their multiple sequence alignment by the DNR method. Using this alignment as the seed, structures of individual genes are re-examined in turn. Let $A^0_{n,i}$ ($1 \le n \le N$, $1 \le i \le I$) be the initial multiple alignment, where $N$ is the number of sequences and $I$ is the length of the alignment. To re-examine the gene structure corresponding to the $m$th sequence, we assign a weight of $w_n = C\,pw_{m,n}$ to each sequence $n \ne m$ and $w_m = 0$ to sequence $m$ itself, where $C$ is a normalization factor and $pw_{m,n}$ is the weight for the sequence pair $(m, n)$ in $A^0_{n,i}$ calculated by the three-way method (Gotoh, 1995). Optionally, columns predominantly composed of deletion characters are eliminated as described in the preceding section. After all of the $N$ sequences are re-examined, a new alignment $A^1_{n,i}$ is constructed from the revised conceptual translation products. This process is repeated until no change in predicted gene organizations is observed. Full automation of the above procedure was difficult in practice, since a smooth iterative cycle was readily disrupted by the presence of one or a few irregular gene sequences, and there are many causes of such irregularities. A Perl script was written to perform a single cycle of re-examination, while outer processes were executed manually. While we relied on the conservation of translated sequences, information regarding intron insertion sites in paralogous genes was not explicitly considered throughout the procedures.
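The control flow of this iterative strategy can be sketched in Python. The sketch makes several assumptions that are not taken from the paper: `align` and `predict` are hypothetical callables standing in for the multiple alignment step and for a gene-structure predictor such as aln, the normalization factor $C$ is chosen so that the weights sum to 1, the deletion character is `-`, and a cycle limit is added as a safeguard.

```python
def leave_one_out_weights(pair_weights, m):
    """w_n = C * pw[m][n] for n != m and w_m = 0.

    Assumption: C normalises the weights so that they sum to 1.
    """
    raw = [0.0 if n == m else pair_weights[m][n]
           for n in range(len(pair_weights))]
    total = sum(raw)
    return [w / total for w in raw]


def trim_gappy_columns(alignment, max_gap_percent=50.0, gap="-"):
    """Remove columns with more than G% deletion characters (G = 50)."""
    n_rows = len(alignment)
    keep = [i for i in range(len(alignment[0]))
            if 100.0 * sum(row[i] == gap for row in alignment) / n_rows
            <= max_gap_percent]
    return ["".join(row[i] for i in keep) for row in alignment]


def refine(sequences, genes, align, predict, max_cycles=20):
    """Re-predict every gene against the others until nothing changes.

    align(sequences)  -> (alignment, pair_weights)    [hypothetical]
    predict(gene, alignment, weights) -> translation  [hypothetical]
    """
    for _ in range(max_cycles):
        alignment, pair_weights = align(sequences)
        alignment = trim_gappy_columns(alignment)
        revised = [
            predict(genes[m], alignment,
                    leave_one_out_weights(pair_weights, m))
            for m in range(len(genes))
        ]
        if revised == sequences:   # no gene organization changed
            break
        sequences = revised        # seed the next cycle
    return sequences
```

Setting $w_m = 0$ keeps the current prediction for gene $m$ from reinforcing itself: each gene is re-examined only against a profile built from its paralogues.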
Systems

The program ‘aln’ was developed in ANSI-C and tested on a Sun Ultra-II workstation (300 MHz, 128 Mb main memory) under Solaris 2.5. Aln was originally designed to align a pair of nucleotide sequences or generalized profiles (Gotoh, 1990, 1994), but now accepts any combination of protein and nucleotide sequences or profiles. For gene-structure prediction, the inputs must be a DNA sequence and a protein sequence or multiple alignment, and a few specific options must be set. Several forms of output are selectable, including: (i) DNA versus protein sequence alignment, (ii) gene, predicted cDNA, or translated sequence with or without boundary information, and (iii) a GenBank-like format.

Table 1. Default parameter values

Symbol   Default value   Meaning
u        2.0             Gap extension penalty per codon
v        9.0             Gap opening penalty
K        21 (nt)         Minimum gap length of a constant penalty
x        20.0            Penalty for a frameshift
ν_I      68.0            Intron penalty
f_c      1.0             Relative contribution of a coding potential
f_b      16.0            Relative contribution of a boundary signal
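With the defaults of Table 1, the gap-penalty function described under ‘Gap-penalty function and matching algorithms’ can be sketched in Python as below. The paper does not spell out two details, so both are assumptions of this sketch: the extension penalty is charged per started codon, and the frameshift penalty is applied to every length not divisible by three, including lengths above K. The function name is illustrative.

```python
import math


def gap_penalty(k, u=2.0, v=9.0, K=21, x=20.0):
    """Penalty (as a positive number) for an indel of k nucleotides.

    Restricted affine form: opening penalty v plus u per codon, held
    constant once the gap reaches K nucleotides. An extra penalty x is
    added when k is not a multiple of 3 (a potential frameshift).
    """
    if k <= 0:
        return 0.0
    # Assumption: extension is charged per started codon, capped at K.
    codons = math.ceil(min(k, K) / 3)
    penalty = v + u * codons
    if k % 3 != 0:
        penalty += x
    return penalty
```

Under these assumptions a one-codon gap costs 11, the penalty saturates at 23 from 21 nt onward, and a 4-nt gap costs 33, which is why in-frame gaps are strongly preferred. A candidate intron is scored separately, by the intron penalty and the flanking splicing signals rather than by this function.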

Implementation

Choice of parameter values

Our algorithm uses seven adjustable parameters (Table 1) besides the amino-acid substitution matrix (JTT250 (Jones et al., 1992) by default) and tables for coding potentials and boundary signals. The initial set of these parameter values was chosen according to the following considerations. The gap penalties associated with protein sequence alignment ($u$, $v$, and $K$ in Table 1) were borrowed from those which best reproduced protein structural alignments (Gotoh, 1996). The factors that controlled the contributions of coding potential ($f_c$) and boundary signals ($f_b$) relative to amino-acid substitutions were respectively set to 1 and 16 by default, which roughly correspond to the inverse relative chance of assignment. We used a large value of 20 as the default for the extra penalty for a frameshift error on the assumption of high-quality genomic sequences. The last parameter, ‘intron penalty’ $\nu_I$, was chosen experimentally to achieve the highest accuracy in gene-structure prediction. As shown in Figure 4(a) and (b), a broad optimum was observed around $\nu_I = 60$ to $70$ for all of the measures of accuracy, and $\nu_I = 68$ was used as the default. The default value for $K$ used in protein sequence alignment corresponds to 30 nucleotides. To search for possibly better restricted affine penalty functions, we examined several values for $K$ under fixed $u$ and $v$ values. The results showed that $K = 21$ (7 codons) was optimal. The performance with $K = \infty$, which corresponds to an affine function, was significantly worse than that with the restricted affine function (Figure 4(a), (b)), where the reduction in performance was mainly ascribed to an increase in the number of false negatives. A significant reduction in accuracy was also observed (Figure 4(a), (b)) when the contribution of coding potential was disregarded ($f_c = 0$). On the other hand, there

[Figure 4: panels (a)–(d); vertical axis: correlation coefficient (%); horizontal axes: intron penalty and margin outside CDS (bp).]

Fig. 4. Accuracy of our methods in predicting structures of C.elegans genes. (a) Accuracy of prediction measured in CC at the nucleotide level is plotted as a function of intron penalty $\nu_I$ for a subset of genes for which ID ≥ 30%. The four methods examined used either a restricted affine gap-penalty function (RA) or an affine gap-penalty function (AG) in combination with either value, 0 or 1, for the relative contribution of coding potential $f_c$: RA and $f_c = 1$ (◦), RA and $f_c = 0$ (•), AG and $f_c = 1$ (), and AG and $f_c = 0$ (). (b) Same as (a) but all 288 genes are used for the examinations. (c) Dependence of CC (◦), $P_b$ (♦), $P_e$ (), 1 − Sp (), and 1 − Sn () on the length of marginal regions assessed for a subset of genes for which ID ≥ 30%. (d) Same as (c) but all 288 genes are used for the examinations.

Table 2. Summary of tests of aln and other methods for structural prediction of C.elegans genes

ID      No. of genes   Method   CC (%)   Sp (%)   Sn (%)   Pb (%)   Pe (%)
100     385            AGfc0    99.99    99.99    100.00   99.65    99.29
100     385            RAfc1    99.61    99.41    99.98    99.43    98.98
100     218            GW       74.51    99.67    71.46    76.20    73.11
0–30    118            RAfc1    92.36    94.73    96.80    88.00    80.63
30–93   170            RAfc1    96.97    98.15    98.62    94.06    90.77
0–90    288            RAfc1    95.10    96.76    97.89    91.60    86.65
0–93    278            CESC     90.22    93.53    93.91    87.51    82.93
0–93    278            RAfc1    95.02    96.69    97.85    91.53    86.62
0–93    178            GW       74.35    94.33    77.02    60.86    45.02
0–93    178            RAfc1    96.57    97.63    98.52    93.33    89.55
0–93    257            Prof     95.16    96.57    98.00    91.51    86.58
0–93    257            RAfc1    95.09    96.79    97.79    91.78    86.90

The methods tested are RAfc1: aln with a restricted affine gap-penalty function and $f_c = 1$ (the default parameter set, in boldface); AGfc0: aln with an affine gap-penalty function and $f_c = 0$; GW: genewise (Birney and Durbin, 1997); CESC: taken from the dataset provided by The C. elegans Sequencing Consortium (1998); and Prof: aln with profile references.

was virtually no change in overall performance when the contribution of coding potential was doubled ($f_c = 2$, data not shown). The optimal value for $f_b$ varied in parallel with $\nu_I$, and

[Figure 5: vertical axes: accuracy (%) and number of genes; horizontal axis: amino acid identity (%).]

Fig. 5. Dependence of prediction accuracy on ID. CC (filled bars), $P_b$ (shaded bars) and $P_e$ (open bars) were evaluated for a subset of genes for which IDs are classified within specified ranges. The number of genes in each class is shown by a filled triangle. The same measures, CC (◦), $P_b$ (), and $P_e$ (♦), evaluated for a ‘cumulative’ subset of genes for which the IDs are greater than or within the specified range are also shown together with the number of genes involved (•).

good performance was obtained when these parameters satisfied the relation $\nu_I = 3.4 f_b + 15$ within the range $10 \le f_b \le 20$ (data not shown). Since $f_b = 16$ and $\nu_I = 68$ closely satisfy this relationship, our initial set of parameter values appears to be near-optimal.
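As a quick check of how closely the defaults satisfy this relation, substituting $f_b = 16$, which lies inside the stated range, gives:

```latex
\nu_I = 3.4 f_b + 15 = 3.4 \times 16 + 15 = 69.4
```

This differs from the default $\nu_I = 68$ by 1.4, or about 2%.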

Dependence of performance on sequence similarity

Since our method uses homology information, its performance could significantly depend on the degree of sequence similarity between the objective sequence and the reference. As expected, the measures of performance decline gradually with the percent of amino acid identity (ID) (Figure 5). When ID ≥ 30%, all of the measures (CC, Sp, Sn, $P_b$, and $P_e$ in Table 2) exceed the corresponding values obtained by popular gene-finding methods applied to human genes (Burge and Karlin, 1998; Claverie, 1997; Murakami and Takagi, 1998). Even for ID < 30%, these measures are comparable to those achieved by the best methods such as GENSCAN (Burge and Karlin, 1997) and MZEF (Zhang, 1997), though direct comparison is difficult due to the species difference. Although the above examination was performed with a single default set of parameters, better results were obtained if we used several parameter sets depending on similarity classes. For example, coding exons were almost perfectly identified upon self-comparison (ID = 100) with an affine gap penalty ($K = \infty$), whereas some false positives remained under the default conditions (Table 2). On the other hand, the use of weaker gap penalties gave better performance for distant object-reference pairs. However, further quantitative investigations were not performed to avoid excessive adaptation. Figures 4(c) and (d) show the dependence of prediction accuracy on the uncertainty of gene coverage. Since the true gene boundaries (transcriptional initiation and termination sites) are rarely known, a marginal region considered here may consist of a 5′ or 3′ untranslated region and the flanking sequence. As more marginal regions are involved in the calculation, the overall performance of our method gradually declines. The increase in errors is largely attributed to the overestimation of exons, as seen at the bottom of Figure 4(c) or (d). Since the ends of a protein are generally prone to vary in sequence and length, the precise identification of translational initiation or termination sites based on sequence homology may be more difficult than identification of intron insertion sites. In addition, a marginal region may actually contain CDS regions of the neighboring gene, since intergenic regions in the C.elegans genome are relatively narrow (The C.elegans Sequencing Consortium, 1998). For better recognition of gene structures, sequence signals associated with the start and stop of transcription would have to be considered. This indicates a future direction for improving our method.

O.Gotoh

Performance with profile

Most (260 of 291) of the testable genes had more than one homologous protein sequence that passed the three criteria described in the Methods section. The performance of our ‘profile-version’ program was examined on a subset of genes with multiple homologues. The results with the default parameter set were rather disappointing because coding exons were significantly underestimated compared to those predicted with single-protein sequences. This underestimation was caused by negative contributions of non-conserved regions to the alignment score, and could be circumvented by adding a small value (0.5 or 1) to each element of the amino-acid substitution matrix. After this ad hoc adjustment, the predictive power of the profile method was indistinguishable from that with single reference sequences (Table 2). This situation did not change appreciably when we used three systems for weighting each member in a multiple alignment: (1) evenly, (2) according to prrp output, and (3) in proportion to the negative logarithm of the BLAST probability.

Most exon/intron boundaries that were not correctly predicted by the protein method are located in non-conserved regions. Corresponding regions in the reference multiple alignment are also generally divergent among the members, and often have numerous gaps. Moreover, the members that comprise each reference profile could be only locally related, so that their global multiple alignment might be unreliable. These two points are probably the major reasons why the profile method did not greatly improve the prediction accuracy. More stringent quality control of the reference profile will be necessary to improve the performance, at the expense of a reduced chance of application.

An example of the coherent prediction of paralogous gene structures

It is frequently observed that the closest relative to a gene is another gene in the same genome. Sonnhammer and Durbin (1997) reported more than 70 protein domain families that have at least 10 members encoded in the C.elegans genome. Most of these domains do not comprise whole proteins, whereas we are currently interested in determining the entire structure of a gene. Thus, the applicability of our method is currently limited to certain enzyme and receptor families. As a demonstrative example, G-protein alpha subunit (Gα) genes were examined. The C.elegans genome appears to contain 20 Gα genes (Jansen et al., 1999). The genomic region covering gpa-16 was uncertain, and was omitted from the analysis. The initial amino acid sequences for the remaining 19 genes were retrieved from the dataset published by The C.elegans Sequencing Consortium (1998). We call this dataset ‘CESC’, and the contents of this dataset and Wormpep16 largely overlap.

The iterative procedure described in the Methods section converged rapidly at the third cycle; these three cycles of gene-structure prediction and multiple alignment took about 10 min on our machine. The sum-of-pairs and weighted sum-of-pairs scores for the multiple alignments were improved by 133.6 and 102.6 per pair, respectively, in the course of the iteration. The final multiple alignment is shown in Figure 6 with predicted sites of intron insertions. Six predicted gene structures (indicated by an asterisk at the end of each sequence in Figure 6) were exactly the same as the published structures (GenBank/EMBL/DDBJ Accession Nos: AB003486, M38249, M38250, M38251, U56864, and X53156). While this particular example was performed with an affine gap penalty, the results with a restricted affine gap penalty were nearly the same as those shown in Figure 6, except for minor variations in the translational initiation sites of a few genes.
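The control flow of such an iterative refinement can be sketched as follows. This is only an illustration under stated assumptions: the function and parameter names are hypothetical, each gene is assumed to be re-predicted against a profile built from the other family members, and convergence is assumed to mean that no predicted product changes between cycles. The actual procedure is the one described in the Methods section.

```python
def iterate_predictions(genomic_seqs, initial_proteins, predict_gene,
                        build_profile, max_cycles=10):
    """Alternate gene prediction and profile reconstruction until stable.

    predict_gene(dna, profile) and build_profile(proteins) are caller-supplied
    stand-ins (hypothetical) for the gene-structure predictor and the
    multiple-alignment step. Returns the final products and the cycle number
    at which they stopped changing.
    """
    proteins = list(initial_proteins)
    for cycle in range(1, max_cycles + 1):
        updated = []
        for k, dna in enumerate(genomic_seqs):
            # Assumption: the reference for gene k excludes gene k itself.
            others = proteins[:k] + proteins[k + 1:]
            updated.append(predict_gene(dna, build_profile(others)))
        if updated == proteins:
            return proteins, cycle      # nothing changed: converged
        proteins = updated
    return proteins, max_cycles


# Trivial stand-ins, used only to exercise the loop.
toy_products, toy_cycles = iterate_predictions(
    ["atg", "aaa"], ["M", "K"],
    predict_gene=lambda dna, profile: dna.upper(),
    build_profile=lambda proteins: tuple(proteins),
)
```

With these stand-ins the products change in the first cycle and are reproduced unchanged in the second, so the loop stops at cycle 2.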

Discussion

The accurate prediction of eukaryotic gene structures is by no means trivial, even if homologous protein or cDNA sequences are available. It is particularly difficult when the objective and reference sequences are derived from organisms in different phyla. Mironov et al. (1998) were the first to examine the performance of a gene-prediction method that was primarily dependent on sequence homology. They used a program called ‘Procrustes’ that uses a spliced alignment algorithm (Gelfand et al., 1996). We developed another homology-based gene-identification program, aln, and examined its performance on larger inter-phyla combinations of objects and references. With respect to the overall accuracy in predicting coding exons, the results of the present examination are probably the best among all of the reports that have been published so far.

One reason for this high accuracy might be the nature of the C.elegans genome we examined; most of the previous predictions have been made on vertebrate (mainly human) genes. We chose C.elegans genes for two reasons. First, since the entire genomic sequences are known, many complete genes are easily available. Second, calculations are economical because of the shorter average gene length compared to that of vertebrate genes. Despite their compact sizes, C.elegans genes are not necessarily easier to predict than vertebrate genes. In fact, the average accuracy of CESC entries does not seem to be much better than that attained by the best gene-finding methods for human genes, as exemplified by cytochrome P450 genes (Gotoh, 1998). Even for genes with known corresponding cDNA sequences, the results of the present method were better than CESC (Table 2), although each CESC entry was derived from various sources of information, including cDNA/EST sequences, homology, and statistical properties.

Fig. 6. Multiple sequence alignment of 19 predicted C.elegans Gα proteins. The locations of potential intron insertion sites are indicated by arrows. Downward arrow: phase 0 intron; arrow toward the lower left: phase 1 intron; arrow toward the lower right: phase 2 intron. The structures of the genes marked by asterisks have been verified experimentally.

The aln algorithm is a straightforward extension of a sequence alignment algorithm that allows for long gaps (Gotoh, 1990), and thus has a simpler structure than that of the spliced alignment algorithm (Gelfand et al., 1996) implemented in Procrustes. Although the target functions to be optimized are nearly the same, Procrustes may run faster than aln, especially when the gene possesses long introns, since Procrustes filters out most potential non-coding regions before the major routine. On the other hand, aln carefully treats gaps corresponding to frameshift errors, long insertions/deletions, and ‘internal’ gaps in the reference profile. The observation that the prediction accuracy was significantly improved by restricted affine gap-penalty functions compared to more common affine functions indicates the superiority of the present algorithm. However, practical application of the present algorithm to vertebrate gene-identification problems requires a further reduction in computational time and space, which can be realized by the introduction of some pre-filtration processes.

Two other related programs are ‘nap’ (Huang and Zhang, 1996) and ‘genewise’ (Birney and Durbin, 1997). Both programs were compiled and run alongside aln on the same machine. Nap reports only the alignment of DNA and protein, and so automatic identification of exon-intron boundaries is difficult. Although nap and aln have similar overall architectures, nap does not consider any statistical properties of the genomic sequence or phases at the 5′ and 3′ ends of an intron. Since these are important factors for the correct identification of gene structures, nap is expected to show significantly lower performance than aln.
Genewise (in Wise2 version 2.1.16b, protein model) often reports fragmentary gene structures even when the ‘global’ option is used in combination with the ‘worm.gf’ gene-characterization file. Most prominently, a frameshift error easily induces such gene fragmentation. Multiple ‘genes’ or gene fragments were suggested for about one-third (107 of 291) of the test cases where a single gene is expected. Even if we restrict ourselves to uniquely predicted cases, the performance of genewise appears to be considerably inferior to that of aln, as shown in Table 2. Genewise ran about 67 (45) times slower than aln with (without) the cutting-corners approximation, which also prevented extensive comparison of the performances of genewise and aln. This large difference in execution rates is probably due to the fact that aln (and nap) optimizes a single score while genewise calculates probabilities of various states and state transitions.

The efficiency of aln is partly due to the use of tron codes (Figure 1). A measure of similarity between a tron code and an amino acid can be obtained easily from the usual measures between amino acids or between nucleotides. Since an amino acid sequence is much more conservative than a nucleotide sequence, tron codes may be useful not only for sequence alignment between a DNA and a protein but also between diverse DNAs, either genomic or complementary in any combination. Systematic intergenomic comparisons will be a promising application of such approaches.

We found clear inter-phyla homologues for about three-quarters of the 390 C.elegans full-length cDNAs retrieved from the GenBank database. About two-thirds of these homologues were related to the C.elegans sequences more closely than the generally accepted limit, i.e. the ‘twilight zone’ (Doolittle, 1981; Sander and Schneider, 1991), above which reliable protein sequence alignment is attainable. When applied to this subset of genes, our method performed significantly better than the best available gene-finding programs. Thus, the present approach might be useful for about half of all genes (three-quarters × two-thirds) to improve the reliability of predicted gene organizations. Since the full-length cDNA sequences in current databases may represent a biased subset of all genes, the chance of finding proper reference protein sequence(s) may be generally less than the above estimate. Nevertheless, this chance will increase with progress in ‘functional genomics’, in which a large number of full-length cDNA sequences are to be determined. This information can be used to identify related genes in other organisms and also weakly or temporally expressed paralogues in the same genome. Our approach will facilitate a better understanding of the structure and evolution of various genes on eukaryotic genomes to be sequenced in the near future.

Acknowledgements

This work was supported in part by a Grant-in-Aid for Scientific Research on Priority Areas, Genome Science, from the Ministry of Education, Science, Sports and Culture of Japan.

Appendix

We present here a dynamic programming algorithm for matching a genomic sequence, $b = b_1 b_2 \ldots b_J$, and a protein sequence, $a = a_1 a_2 \ldots a_I$. We assume that $b$ has been converted into tron codes. Let $H_{i,j}$ be the objective function to be optimized for the subsequences $b_1 b_2 \ldots b_j$ ($j \in [1, J]$) and $a_1 a_2 \ldots a_i$ ($i \in [1, I]$). $H^{\alpha}_{i,j}$ are subsidiary variables, where the superscript $\alpha$ ($1 \le \alpha \le 7$) indicates the direction of the alignment path (Figure 3(a)). $H^{0}_{i,j}$ is used as an alias of $H_{i,j}$. We begin the following recursion relations with $H_{i,0} = 0$ for $i \in [0, I]$, $H_{0,j} = \mathrm{MAX}\{\psi^{I}_{j-1} + \varphi^{E}_{j-1},\ H_{0,j-1} + w(1),\ H_{0,j-2} + w(2),\ H_{0,j-3} - u + \varphi^{E}_{j-1}\}$ for $j \in [1, J]$, $H_{0,j} = -\infty$ for $j < 0$, and $H^{\alpha}_{i,0} = H^{\alpha}_{0,j} = -\infty$ ($\alpha > 0$).

$$
\begin{aligned}
H^{1}_{i,j} &= \mathrm{MAX}\{H_{i,j-1} + w(1),\ H_{i,j-2} + w(2),\ H_{i,j-3} + w(3) + \varphi^{E}_{j-1},\ H^{1}_{i,j-3} - u + \varphi^{E}_{j-1}\} \\
H^{2}_{i,j} &= \mathrm{MAX}(H_{i,j-3} + w(K),\ H^{2}_{i,j-3}) + \varphi^{E}_{j-1} \\
H^{3}_{i,j} &= \mathrm{MAX}(H_{i-1,j} + w(3),\ H^{3}_{i-1,j} - u) \\
H^{4}_{i,j} &= \mathrm{MAX}(H_{i-1,j} + w(K),\ H^{4}_{i-1,j}) \\
H^{5}_{i,j} &= H_{i-1,j-1} + w(2) \\
H^{6}_{i,j} &= H_{i-1,j-2} + w(1) \\
H^{7}_{i,j} &= H_{i-1,j-3} + S(a_i, b_{j-1}) + \varphi^{E}_{j-1} \\
H^{0}_{i,j} &= \mathop{\mathrm{MAX}}_{\alpha = 1, \ldots, 7} H^{\alpha}_{i,j}
\end{aligned}
\tag{A.1}
$$

where $S(a, b)$ is the measure of similarity between an amino acid $a$ and a tron $b$, $w(k)$ is the gap-penalty function as depicted in Figure 2, and $\varphi^{E}_{j}$ denotes the coding potential associated with $b_j$. A translational initiation or termination signal, $\psi^{I}_{j}$ or $\psi^{T}_{j}$, is also added to $\varphi^{E}_{j}$ when appropriate, but omitted in equation (A.1). The final results are traced back from the point associated with $\mathrm{MAX}(H_{i,J}, H_{I,j})$ for $i \in [1, I]$ and $j \in [1, J]$, where $H_{I,j}$ is slightly modified to incorporate $\psi^{T}_{j}$ but not to assign a gap-open penalty.

If $j - \delta$ ($\delta$ = 0, 1, or 2) is a potential splicing acceptor site at which the 3′ splicing signal $\psi^{3'}_{j-\delta} > 0$, we need further considerations. Let us define $F^{\alpha,\delta}_{i,j}$ by

$$
F^{\alpha,\delta}_{i,j} = \mathop{\mathrm{MAX}}_{(h-\delta) \in \{5' < j\}} \left( H^{\alpha}_{i,h} + \psi^{5'}_{h-\delta} \right)
\tag{A.2}
$$

and

$$
H^{\alpha}_{i,j} = \mathrm{MAX}\left( H^{\alpha}_{i,j},\ F^{\alpha,\delta}_{i,j} + \psi^{3'}_{j-\delta} - \nu_I \right)
\tag{A.3}
$$

where the set $\{5' < j\}$ consists of potential splicing donor sites $l$ ($\psi^{5'}_{l} > 0$) preceding $j$, and $\alpha \in \{0, 1, 2, 7\}$. $H^{\alpha}_{i,j}$ on the right-hand side of equation (A.3) is that obtained with equation (A.1). If the second operand of MAX is greater than the first, $j - \delta$ is regarded as a likely splicing acceptor site, and $H^{\alpha}_{i,j}$ is renewed. For most combinations of $\alpha$ and $\delta$, equation (A.2) can be simplified as

$$
F^{\alpha,\delta}_{i,j} = \mathrm{MAX}\left( H^{\alpha}_{i,h} + \psi^{5'}_{h-\delta},\ F^{\alpha,\delta}_{i,h} \right)
\tag{A.4}
$$

In a computer program, $F^{\alpha,\delta}_{i,j}$ can be represented by a single variable for each $\delta$, and should be updated only just after $j - \delta \in \{5'\}$, which takes a constant number of operations. In two exceptional cases of $\alpha = 7$ and $\delta = 1$, and $\alpha = 7$ and $\delta = 2$,

$$
F^{7,1}_{i,j} = \mathop{\mathrm{MAX}}_{(h-1) \in \{5' < j\}} \left( H_{i-1,h-3} + S(a_i, [b_{h-2}\, b_{h-1}\, b_j]) + \psi^{5'}_{h-1} \right)
\tag{A.5}
$$

$$
F^{7,2}_{i,j} = \mathop{\mathrm{MAX}}_{(h-2) \in \{5' < j\}} \left( H_{i-1,h-3} + S(a_i, [b_{h-2}\, b_{j-1}\, b_j]) + \psi^{5'}_{h-2} \right)
\tag{A.6}
$$

where the triplet in brackets should be returned to nucleotides and then translated according to the genetic code. To maintain the maximum value for $F^{7,2}_{i,j}$, we must prepare four variables, rather than just one as in equation (A.4), depending on the type of nucleotide at $b_{h-2}$ (Figure 3(b)). Likewise, for the maximal $F^{7,1}_{i,j}$, we need two variables that correspond to the two possibilities of $b_j$ being purine or pyrimidine (Figure 3(b)), according to the specific feature of the universal genetic code (Figure 1).
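The shape of the recursion in equation (A.1), together with the single-variable bookkeeping of equation (A.4), can be illustrated with a deliberately simplified sketch. It is not the aln implementation. The assumptions are: plain nucleotides and the standard genetic code are used in place of tron codes; S is a toy match/mismatch score; the coding potential and all signal terms are set to zero; the first row is initialized to zero instead of using the initiation-signal term; only phase-0 introns (α = 0, δ = 0) with GT...AG boundaries are handled; and every penalty value, like the function name `align_score`, is hypothetical.

```python
NEG = float("-inf")
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
# Standard genetic code, codons enumerated in TCAG order.
CODON = {x + y + z: AMINO[16 * i + 4 * j + k]
         for i, x in enumerate(BASES)
         for j, y in enumerate(BASES)
         for k, z in enumerate(BASES)}


def align_score(prot, dna, match=5, mismatch=-4, w1=-15, w2=-15, w3=-11,
                u=2, w_long=-25, intron=-10):
    """Best score of the simplified recursion (all penalties are toy values)."""
    I, J = len(prot), len(dna)

    def sim(i, j):
        # Toy S(a_i, codon ending at 1-based position j); stands in for S(a, tron).
        return match if CODON.get(dna[j - 3:j]) == prot[i - 1] else mismatch

    # H[i][0] = 0 as in the text; H[0][j] = 0 is a simplification of this sketch.
    H = [[0] * (J + 1) for _ in range(I + 1)]
    H1, H2, H3, H4 = ([[NEG] * (J + 1) for _ in range(I + 1)] for _ in range(4))
    for i in range(1, I + 1):
        F = NEG  # running maximum over donor sites already passed in this row
        for j in range(1, J + 1):
            # Direction 1: 1-3 nt of DNA unmatched, with per-codon extension u.
            H1[i][j] = H[i][j - 1] + w1
            if j >= 2:
                H1[i][j] = max(H1[i][j], H[i][j - 2] + w2)
            if j >= 3:
                H1[i][j] = max(H1[i][j], H[i][j - 3] + w3, H1[i][j - 3] - u)
                # Direction 2: long run of unmatched DNA at a constant cost.
                H2[i][j] = max(H[i][j - 3] + w_long, H2[i][j - 3])
            # Directions 3 and 4: residue a_i unmatched (short or long gap).
            H3[i][j] = max(H[i - 1][j] + w3, H3[i - 1][j] - u)
            H4[i][j] = max(H[i - 1][j] + w_long, H4[i - 1][j])
            best = max(H1[i][j], H2[i][j], H3[i][j], H4[i][j],
                       H[i - 1][j - 1] + w2)                   # direction 5
            if j >= 2:
                best = max(best, H[i - 1][j - 2] + w1)         # direction 6
            if j >= 3:
                best = max(best, H[i - 1][j - 3] + sim(i, j))  # direction 7
            # (A.3) with alpha = 0, delta = 0: close an intron that ends at j.
            if j >= 2 and dna[j - 2:j] == "AG":
                best = max(best, F + intron)
            H[i][j] = best
            # (A.4): refresh F only when an intron could start right after j.
            if dna[j:j + 2] == "GT":
                F = max(F, best)
    # Trace-back would start from the best cell of the last row or column.
    return max(max(H[I][1:]), max(H[i][J] for i in range(1, I + 1)))


# Two exons (ATG AAA | CTT TCA TGG) separated by a 12-nt GT...AG intron.
toy_dna = "ATGAAA" + "GTAAGTTTTCAG" + "CTTTCATGG"
with_intron = align_score("MKLSW", toy_dna)
without_intron = align_score("MKLSW", toy_dna, intron=NEG)
```

In the toy example the five residues match the two exons exactly, so the best path pays only the intron penalty (5 × 5 − 10 = 15). Disabling the intron move forces the 12 intervening nucleotides to be paid for as an ordinary gap, which lowers the score. Because `F` is a single running value per row, each acceptor site is resolved in constant time rather than by rescanning all upstream donor sites.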

References

Altschul,S.F., Gish,W., Miller,W., Myers,E.W. and Lipman,D.J. (1990) Basic local alignment search tool. J. Mol. Biol., 215, 403–410.
Altschul,S.F., Madden,T.L., Schäffer,A.A., Zhang,J., Zhang,Z., Miller,W. and Lipman,D.J. (1997) Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucl. Acids Res., 25, 3389–3402.
Birney,E. and Durbin,R. (1997) Dynamite: a flexible code generating language for dynamic programming methods used in sequence comparison. ISMB, 5, 56–64.
Burge,C. and Karlin,S. (1997) Prediction of complete gene structures in human genomic DNA. J. Mol. Biol., 268, 78–94.
Burge,C.B. and Karlin,S. (1998) Finding the genes in genomic DNA. Curr. Opin. Struct. Biol., 8, 346–354.
Burset,M. and Guigó,R. (1996) Evaluation of gene structure prediction programs. Genomics, 34, 353–367.
Cai,Y. and Bork,P. (1998) Homology-based gene prediction using neural nets. Anal. Biochem., 265, 269–274.
Chao,K.-M. (1999) Calign: aligning sequences with restricted affine gap penalties. Bioinformatics, 15, 298–304.
Claverie,J.-M. (1997) Computational methods for the identification of genes in vertebrate genomic sequences. Hum. Mol. Genet., 6, 1735–1744.
Crick,F.H.C. (1968) The origin of the genetic code. J. Mol. Biol., 38, 367–379.
Dayhoff,M.O., Schwartz,R.M. and Orcutt,B.C. (1978) A model of evolutionary change in proteins. In Dayhoff,M.O. (ed.), Atlas of Protein Sequence and Structure. Vol. 5, Suppl. 3, National Biomedical Research Foundation, Washington, D.C., pp. 345–352.
Doolittle,R.F. (1981) Similar amino acid sequences: chance or common ancestry? Science, 214, 149–159.
Fickett,J.W. and Tung,C.-S. (1992) Assessment of protein coding measures. Nucl. Acids Res., 20, 6441–6450.


Florea,L., Hartzell,G., Zhang,Z., Rubin,G.M. and Miller,W. (1998) A computer program for aligning a cDNA sequence with a genomic DNA sequence. Genome Res., 8, 967–974.
Gelfand,M.S., Mironov,A.A. and Pevzner,P.A. (1996) Gene recognition via spliced sequence alignment. Proc. Natl Acad. Sci. USA, 93, 9061–9066.
Goffeau,A., Barrell,B.G., Bussey,H., Davis,R.W., Dujon,B., Feldmann,H., Galibert,F., Hoheisel,J.D., Jacq,C., Johnston,M., Louis,E.J., Mewes,H.W., Murakami,Y., Philippsen,P., Tettelin,H. and Oliver,S.G. (1996) Life with 6000 genes. Science, 274, 546–567.
Gotoh,O. (1990) Optimal sequence alignment allowing for long gaps. Bull. Math. Biol., 52, 359–373.
Gotoh,O. (1993) Optimal alignment between groups of sequences and its application to multiple sequence alignment. Comput. Appl. Biosci., 9, 361–370.
Gotoh,O. (1994) Further improvement in methods of group-to-group sequence alignment with generalized profile operations. Comput. Appl. Biosci., 10, 379–387.
Gotoh,O. (1995) A weighting system and algorithm for aligning many phylogenetically related sequences. Comput. Appl. Biosci., 11, 543–551.
Gotoh,O. (1996) Significant improvement in accuracy of multiple protein sequence alignments by iterative refinement as assessed by reference to structural alignments. J. Mol. Biol., 264, 823–838.
Gotoh,O. (1998) Divergent structures of Caenorhabditis elegans cytochrome P450 genes suggest the frequent loss and gain of introns during the evolution of nematodes. Mol. Biol. Evol., 15, 1447–1459.
Gribskov,M., McLachlan,A.D. and Eisenberg,D. (1987) Profile analysis: detection of distantly related proteins. Proc. Natl Acad. Sci. USA, 84, 4355–4358.
Guigó,R. and Fickett,J.W. (1995) Distinctive sequence features in protein coding, genic non-coding, and intergenic human DNA. J. Mol. Biol., 253, 51–60.
Hein,J. (1994) An algorithm combining DNA and protein alignment. J. Theor. Biol., 167, 169–174.
Huang,X. and Zhang,J. (1996) Methods for comparing a DNA sequence with a protein sequence. Comput. Appl. Biosci., 12, 497–506.
Jansen,G., Thijssen,K.L., Werner,P., van der Horst,M., Hazendonk,E. and Plasterk,R.H.A. (1999) The complete family of genes encoding G proteins of Caenorhabditis elegans. Nature Genet., 21, 414–419.
Jones,D.T., Taylor,W.R. and Thornton,J.M. (1992) The rapid generation of mutation data matrices from protein sequences. Comput. Appl. Biosci., 8, 275–282.
Lolkema,J.S. and Slotboom,D.-J. (1998) Hydropathy profile alignment: a tool to search for structural homologues of membrane proteins. FEMS Microbiol. Rev., 22, 305–322.


Mironov,A.A., Roytberg,M.A., Pevzner,P.A. and Gelfand,M.S. (1998) Performance-guarantee gene predictions via spliced alignment. Genomics, 51, 332–339.
Mott,R. (1997) EST_GENOME: a program to align spliced DNA sequences to unspliced genomic DNA. Comput. Appl. Biosci., 13, 477–478.
Murakami,K. and Takagi,T. (1998) Gene recognition by combination of several gene-finding programs. Bioinformatics, 14, 665–675.
Myers,E.W. and Miller,W. (1988) Optimal alignments in linear space. Comput. Appl. Biosci., 4, 11–17.
Park,J., Karplus,K., Barrett,C., Hughey,R., Haussler,D., Hubbard,T. and Chothia,C. (1998) Sequence comparison using multiple sequences detect three times as many remote homologues as pairwise methods. J. Mol. Biol., 284, 1201–1210.
Pearson,W.R., Wood,T., Zhang,Z. and Miller,W. (1997) Comparison of DNA sequences with protein sequences. Genomics, 46, 24–36.
Peltola,H., Söderlund,H. and Ukkonen,E. (1986) Algorithms for the search of amino acid patterns in nucleic acid sequences. Nucl. Acids Res., 14, 99–107.
Salzberg,S.L. (1997) A method for identifying splice sites and translational start sites in eukaryotic mRNA. Comput. Appl. Biosci., 13, 365–376.
Sander,C. and Schneider,R. (1991) Database of homology-derived protein structures and the structural meaning of sequence alignment. Proteins, 9, 56–68.
Sankoff,D. and Kruskal,J.B. (1983) Time Warps, String Edits, and Macromolecules: The Theory and Practice of Sequence Comparison. Addison-Wesley, New York.
Sellers,P.H. (1979) Pattern recognition in genetic sequences. Proc. Natl Acad. Sci. USA, 76, 3041.
Snyder,E.E. and Stormo,G.D. (1995) Identification of protein coding regions in genomic DNA. J. Mol. Biol., 248, 1–18.
Sonnhammer,E.L.L. and Durbin,R. (1997) Analysis of protein domain families in Caenorhabditis elegans. Genomics, 46, 200–216.
The C. elegans Sequencing Consortium (1998) Genome sequence of the nematode C. elegans: a platform for investigating biology. Science, 282, 2012–2018.
Uberbacher,E.C. and Mural,R.J. (1991) Locating protein-coding regions in human DNA sequences by a multiple sensor-neural network approach. Proc. Natl Acad. Sci. USA, 88, 11261–11265.
Xu,Y., Mural,R.J. and Uberbacher,E.C. (1994) Constructing gene models from accurately predicted exons: an application of dynamic programming. Comput. Appl. Biosci., 10, 613–623.
Zhang,M.Q. (1997) Identification of protein coding regions in the human genome by quadratic discriminant analysis. Proc. Natl Acad. Sci. USA, 94, 565–568.
Zhang,M.Q. and Marr,T.G. (1993) A weight array method for splicing signal analysis. Comput. Appl. Biosci., 9, 499–509.