Prediction of Gene Structure - CiteSeerX

J. Mol. Biol. (1992) 226, 141-157

Prediction of Gene Structure

Roderic Guigó†‡, Steen Knudsen§, Neil Drake‖ and Temple Smith†

Molecular Biology Computer Research Resource
Dana-Farber Cancer Institute and Harvard School of Public Health
44 Binney St., Boston, MA 02215, U.S.A.

(Received 4 September 1991; accepted 14 January 1992)

We have developed a hierarchical rule-based system for identifying genes in DNA sequences. Atomic sites (such as initiation codons, stop codons, acceptor sites and donor sites) are identified by a number of different methods and evaluated by a set of filters and rules chosen to maximize sensitivity; these are combined into higher-order gene elements (such as exons), evaluated, filtered and combined as equivalence classes into probable genes, which are evaluated and ranked. The system has been tested on an extensive collection of vertebrate genes smaller than 15,000 bases. Results obtained show that, on average, 88% of the predicted coding region for a transcription unit is actually coding, and 80% of the actual coding is correctly predicted. This will, in most applications, be sufficient for a search against protein sequence databases for the identification of probable gene function. In addition, the system provides a general test platform for both gene atomic site identification and the rules for their evaluation and assembly.

Keywords: gene identification; exon structure; intron splicing; coding sequence; artificial intelligence

† Present address: Bio-Molecular Engineering Research Center, Boston University, 36 Cummington St., Boston, MA 02215, U.S.A.
‡ Author to whom all correspondence should be addressed.
§ Present address: CEDB, University of West Florida, 11000 University Parkway, Pensacola, FL 32514-5751, U.S.A.
‖ Present address: Tufts University Medical Center, Boston, MA 03114, U.S.A.

1. Introduction

Current advances in DNA sequencing technology and commitments to a number of large genome sequencing projects have motivated the investigation of reliable gene identification methods. To date, the vast majority of sequenced protein-encoding genes have been delineated through knowledge either of their corresponding cDNAs or of homologous genes. Computational tools to make predictions of gene structure and function from purely syntactical analysis of the sequences would be beneficial in two distinct ways. The first is in providing a direct means for predicting probable genes within long stretches of newly sequenced DNA for which additional information is unavailable; and second, in providing a "test bed" for evaluating our knowledge of the syntactical rules of the genetic language.

Substantial knowledge currently exists for identifying certain gene or atomic elements in eukaryotic organisms. It has been used to develop a number of existing computer programs to predict elements such as splice sites (Staden, 1984; Nakata et al., 1985; Kudo et al., 1987; Lapedes et al., 1988; Brunak et al., 1991), coding regions (Fickett, 1982; Fichant & Gautier, 1987; Konopka, 1990) and gene promoters (Bucher, 1990). The major difficulty with these methods, however, is that none is 100% sensitive and most have poor or unspecified specificity. This arises, in large part, because there is no unequivocal definition for many of the gene-defining atomic elements. The sequence patterns used to describe them are often short and/or degenerate, thus having inherently low specificity. In some cases, there appears to be no simple method for distinguishing actual utilized sites from similar (even identical) non-utilized sites, independent of other sequence context information. Since many known genes are composed of more than two or three exons and contain a dozen or more regulatory signals, the identification of these various gene components with less than 100% sensitivity and specificity leads to serious combinatorial problems.

It is clear therefore that part, but not all, of the solution to the problem of gene identification relies heavily upon the development of increasingly accurate methods of atomic site identification. As noted by many researchers, there must be various context-dependent "rules" determining which potential components, as recognized by their simple short consensus patterns, are actually utilized or have a higher probability of being used. Such rules might express mandatory or preferred ordering and spacing among the various atomic sites.

The first attempts to integrate these different categories of information to predict the structure of genes have recently been published. Fields & Soderlund (1990) have built a rule-based gene modeler, gm, designed from the known gene organization in the organism Caenorhabditis elegans. Allowing parameter adjustments, gm was able to predict most of the exons of five test genes. Gelfand (1990) has assembled exon-intron structures using a combination of splice site prediction and coding region evaluation by Fickett's (1982) method, with some success on nine test mammalian genes.

We present here a generalizable hierarchical rule-based system (GeneId) for identifying probable protein-encoding genes. The strategy may be summarized briefly as follows: given a DNA sequence, identify all potential "atomic sites" or gene elements such as promoter elements, translation initiation and stop codons, splice acceptor and donor sites and poly(A) signals; assign some measure of relative rank or likelihood to these; then identify and rank all potentially allowed exon combinations. Theoretically, but not practically, this exercise is straightforward. Our method first relies on a priori evaluation of the predicted atomic sites; then on a sequential filtering evaluation of the assembled combination gene components.
At each stage of the assembly, a cut-off evaluation threshold or filter is set to maximize sensitivity, while specificity is achieved through the multiplicative effect of sequential filtering as the potential gene is hierarchically assembled. Thus, our method evaluates the atomic signals in an a posteriori context sense as well as through the subsequent filtering of the predicted genes. By introducing the appropriate metric in the set of predicted genes, true genes containing very weak (low scoring) atomic signals can be predicted, even though millions of candidate models may have been examined. To efficiently handle the resulting combinatorial explosion, a concept of "exon equivalence" is introduced.

More specifically, we limit ourselves to those cases in which genes and transcription units are coextensive, and we assume the gene's start and end to be known. Then, given the DNA sequence of a transcription unit, we identify the atomic sites (initiation codons, stop codons, acceptor sites and donor sites) with almost absolute sensitivity by using weighted profiles. From the atomic sites we obtain a set of predicted exons and we compute a list of statistically significant properties for these predicted exons. Since the attached exon properties are numerically measurable, the statistical distribution of each measure can be evaluated in a "representative" sample of true exons and distance functions (distance to the optimal or most frequent values) used to rank the exons. Moreover, threshold values, such that the probability of a true exon having a value for such a measure outside the interval(s) defined by them is less than a very small probability, can be derived from this distribution. Possible exons whose values are outside the interval(s) can be discarded. A substantial increase in specificity without significant loss of sensitivity, that is a substantial reduction in the number of false positives, can be obtained by filtering the set of predicted exons through successive independent measures.

Then from the filtered set of predicted exons we obtain the space of predicted genes. In order to avoid the combinatorial explosion involved in the generation of such a space, the concept of exon equivalence has been introduced. We define two exons as being equivalent if they can occur exactly in the same gene models. This requires that they have splice sites relative to the same translation frame and are within the allowed minimum intron distances of the adjacent exons. We define each gene model as a linear arrangement of exon equivalence classes rather than as linear arrangements of all possible individual exons. Finally, a function of the values attached to each of the component exon classes is assigned to the potential gene. We use the score derived to establish a rank in the space of gene models.

The performance of GeneId has been tested on a large collection of data, and we present here a comprehensive statistical evaluation of the results. In particular, we propose the correlation coefficient as a particularly appropriate measure to evaluate the performance of protein-coding gene identification methods.

This study has a second aim. It attempts to carefully outline the problems involved in the identification of complex genes, including an estimate of the magnitude of the resulting combinatorics and the areas of missing atomic element knowledge. We have studied the characteristics of large collections of data and obtained results that include insight into the nature and the limitations of our mechanistic understanding of splice site recognition.
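The evaluation measures mentioned above (the fraction of predicted coding sequence that is truly coding, the fraction of true coding sequence that is predicted, and a correlation coefficient) can be illustrated with a minimal sketch. It assumes a per-nucleotide comparison of two 0/1 labelings and takes the correlation coefficient to be the Pearson correlation of two binary variables; the function name `coding_accuracy` and the result keys are hypothetical and not taken from GeneId.

```python
import math


def coding_accuracy(actual, predicted):
    """Compare two per-nucleotide labelings (truthy = coding).

    Assumption: the correlation coefficient is computed as the Pearson
    correlation between the two binary labelings (phi coefficient).
    """
    if len(actual) != len(predicted):
        raise ValueError("labelings must have the same length")
    tp = fp = tn = fn = 0
    for a, p in zip(actual, predicted):
        if a and p:
            tp += 1      # coding, predicted coding
        elif p:
            fp += 1      # non-coding, predicted coding
        elif a:
            fn += 1      # coding, missed
        else:
            tn += 1      # non-coding, predicted non-coding
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        # fraction of the predicted coding region that is actually coding
        "predicted_that_is_coding": tp / (tp + fp) if tp + fp else 0.0,
        # fraction of the actual coding region that is predicted
        "coding_that_is_predicted": tp / (tp + fn) if tp + fn else 0.0,
        "cc": (tp * tn - fp * fn) / denom if denom else 0.0,
    }


result = coding_accuracy([1, 1, 1, 1, 0, 0, 0, 0],
                         [1, 1, 1, 0, 1, 0, 0, 0])
```

Unlike either fraction taken alone, the correlation coefficient uses all four cells of the confusion table, so it is not inflated by predicting everything (or nothing) as coding.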

2. Materials and Methods

In this section we describe first the generation of the data sets from which we have estimated the algorithm's parameters and from which we evaluate the performance of GeneId. Then, we describe the algorithm in detail.

We have obtained sets of first, internal and terminal exons from the primate, mammalian, rodent and vertebrate divisions of GenBank 64.0 (Burks et al., 1991). The sets have been used to derive the profiles for the prediction of initiation codons, donor sites and acceptor sites and to calculate cut-off values for the variables used to derive the set of rules through which the predicted exons are "filtered". Only complete exons without alternative splicing were considered. Immunoglobulins and histocompatibility antigens, where DNA undergoes a complex process of reorganization, pseudogenes and mutants were discarded. Typical internal data integrity constraints were enforced. These include checking for: GT as the first dinucleotide after the end of first and internal exons; AG as the dinucleotide immediately preceding internal and terminal exons; an initiation codon as the first, and a stop codon as the last, trinucleotide in the coding sequence; and, when possible, the absence of in-frame stop codons inside coding exons, and a total length of the coding segments that is a multiple of 3. Totals of 351 first exons, 1388 internal and 256 terminal were obtained in this way.

We have assigned weights to the exons in these sets in order to correct for the unequal representation of homologous gene families in the database. The weights have been used in the derivation of profiles (see section (c)(i), below). We have followed the method of Schmelzer & Smith (unpublished modification of Felsenstein's current-circuit analogy (Felsenstein, 1981)) to assign weights to the exons in a given set.
(1) We compute pairwise similarity values for the exon sequences in the set. Typically, we use the Maximal Segment Pair (MSP) score (Altschul et al., 1990) as a measure of sequence similarity.
(2) We cluster the sequences according to the above similarity matrix using a maximal linkage method. Clusters correspond to homologous sequences. Each should be considered an independent sampling event.
(3) We assign weights to the sequences according to the topology of the clusters (M. Schmelzer & T. F. Smith, personal communication). The sum of weights for the sequences in the same cluster is 1.00. The weight for sequences not clustered is 1.00.

Additionally we have obtained a set of complete genes. The set has been used to evaluate the performance of GeneId at the different steps. Only genes for which the mRNA and coding region are both complete and not alternatively defined, containing at least one internal exon and with a length no greater than 15,000 bases were considered. Immunoglobulins, histocompatibility antigens, pseudogenes and mutants were discarded and standard integrity constraints were applied. The size of the resulting set was 169. Note that this set is not independent of previous sets. In particular, the exons from the genes in this set appear in the previous sets of exons (see Discussion).
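The integrity constraints listed above for the exon sets can be written down as a small check. This is only a sketch under stated assumptions: the function name `check_gene`, the problem labels, and the coordinate convention (0-based, end-exclusive coding segments on the forward strand) are hypothetical and are not taken from the GeneId implementation.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def check_gene(seq, exons):
    """Return a list of integrity problems for one annotated gene.

    seq   -- genomic sequence of the transcription unit (forward strand)
    exons -- ordered coding segments as (start, end), 0-based, end exclusive
    """
    seq = seq.upper()
    problems = []
    # GT must be the first dinucleotide after first and internal exons.
    for _, end in exons[:-1]:
        if seq[end:end + 2] != "GT":
            problems.append("donor")
    # AG must immediately precede internal and terminal exons.
    for start, _ in exons[1:]:
        if seq[start - 2:start] != "AG":
            problems.append("acceptor")
    cds = "".join(seq[start:end] for start, end in exons)
    if len(cds) % 3 != 0:
        problems.append("length")
    if cds[:3] != "ATG":
        problems.append("start")
    if cds[-3:] not in STOP_CODONS:
        problems.append("stop")
    # No stop codon in frame before the final codon.
    for i in range(0, len(cds) - 3, 3):
        if cds[i:i + 3] in STOP_CODONS:
            problems.append("inframe_stop")
            break
    return problems


# Two coding segments separated by a minimal GT...AG intron.
gene = "ATGGCC" + "GTAAAG" + "TTTTAA"
print(check_gene(gene, [(0, 6), (12, 18)]))
```

An entry would be kept only when the returned list is empty.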

(b) The algorithm

Fig. 1 exemplifies the algorithm in an idealized case. An evaluation of the algorithm's performance at the different steps for the 169 genes compiled appears in Table 1.

(c) Prediction of atomic sites

Given the DNA sequence of a transcription unit, we first identify the positions of potential initiation codons, donor sites, acceptor sites and stop codons. See Table 1 for the total number of predicted sites and for the number of true predicted sites in the transcription units corresponding to the set of 169 vertebrate genes compiled.
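As a rough illustration of this first step, the sketch below lists every position that matches the minimal pattern of each site type (ATG, GT, AG and the three stop codons). It is a deliberate simplification: GeneId evaluates candidate sites with weighted profiles, whereas this scan uses exact matches only. The function name `candidate_sites` and the coordinate convention (0-based; an acceptor is reported as the first base after AG) are hypothetical.

```python
STOP_CODONS = ("TAA", "TAG", "TGA")


def candidate_sites(seq):
    """Return 0-based positions of minimal-pattern matches, by site type."""
    seq = seq.upper()
    sites = {"initiation": [], "donor": [], "acceptor": [], "stop": []}
    for i in range(len(seq) - 1):
        di = seq[i:i + 2]
        tri = seq[i:i + 3]
        if tri == "ATG":
            sites["initiation"].append(i)
        if tri in STOP_CODONS:
            sites["stop"].append(i)
        if di == "GT":
            sites["donor"].append(i)         # first base of the intron
        if di == "AG":
            sites["acceptor"].append(i + 2)  # first base after the intron
    return sites


print(candidate_sites("ATGGTAAGTAA"))
```

Because these patterns are only two or three bases long, nearly every transcription unit contains many matches, which is why the candidates need to be scored and filtered in the later steps.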

(i) Initiation codons

AUG is the universal initiation site for eukaryotic translation. The AUG located nearest the 5' end of the mRNA is almost always selected, but a relatively weak sequence context also has an effect in that selection (Kozak, 1983). In the DNA primary transcript, however, the first ATG does not necessarily correspond to the first mRNA AUG, since the first spliced exon is not necessarily expressed and the first ATGs can occur inside an intron. Thus, we cannot use the first-ATG rule to identify the initiation codon. Rather, we use the sequence context and the distance to the cap site as criteria to locate potential initiation codons. From the set of first exons that we have compiled, we derived a profile for the initiation codon's sequence context and computed the distribution of the distance from the start of the mRNA to the (true) initiation codon. Only expressed exons were used to derive the profile (see Table 2). From a set $S$ of $n$ aligned DNA sequences of length $L$, $s_1, s_2, \ldots, s_n$ ($s_k = s_{k1}, s_{k2}, \ldots, s_{kL}$), a profile $M$ ($4 \times L$) is usually derived as:

$$M_{ij} = \frac{1}{n} \sum_{k=1}^{n} I_i(s_{kj}), \qquad i = A, C, G, T; \quad j = 1, \ldots, L$$

where $I_i$ [...]

(d) Prediction of exons

(i) Individual exons

From the initial sets of predicted atomic sites (initiation codons, splice sites and stop codons), we derive the sets of predicted first, internal and terminal exons. A (predicted) first exon is a pair $(x_1, x_2)$, where $x_1$ is an initiation codon and $x_2$ is a donor site, such that $x_1 < x_2$ and there is no stop codon between $x_1$ and $x_2$ in the frame defined by $x_1$. A (predicted) internal exon is a pair $(x_1, x_2)$, where $x_1$ is an acceptor site and $x_2$ is a donor site, such that $x_1 + l_i < x_2$ and there is at least one frame in which there are no stop codons between $x_1$ and $x_2$; $l_i$ is the minimum internal exon length. A (predicted) terminal exon is a pair $(x_1, x_2)$, where $x_1$ is an acceptor site and $x_2$ is a stop codon, such that $x_1 < x_2$ and there is no stop codon between $x_1$ and $x_2$ in the frame defined by $x_2$. By using these construction rules, all possible exons compatible with our current knowledge of the structure of the genes would be obtained. In practice, however, as we have discussed (see section (c)(ii), above), we assemble first and internal exons by picking only the 3 donor sites with highest profile score within an open reading frame from the start of the exon, since this situation accounts for 98% of the cases observed in the sets of vertebrate exons compiled.

[...] spliced ones. Note that in this respect, while we impose a minimum length for internal exons that would reflect an actual steric constraint in the formation of the spliceosome complex, we do not impose such a minimum length for first and terminal exons. This is not meant to reflect that such a biological constraint does not exist for first and terminal exons, rather that the initiation codon and the stop codon can occur anywhere in the first and terminal spliced exons. Note finally that we do not impose a maximum length on any category of exons. See Table 1 for the predicted exons for the transcription units corresponding to the set of 169 vertebrate genes compiled.

(ii) Filtering of exons

A series of numerically measurable variables are determined for each predicted exon. In accordance with the values of these variables, the predicted exon may be discarded. The goal is to reduce the size of the set of predicted exons by discarding as many false predicted exons (false positives) as possible, without discarding the true predicted ones (true positives). The quality of the discrimination will depend on the biological significance of the variables considered, as well as on the way in which they are used. We have considered variables known to behave differently in coding than in non-coding regions. Concretely, for each predicted exon we compute:
(1) Variables 1 to 4. Fraction of nucleotides A, C, G and T in the exon sequence.
(2) Variables 5 to 8. Codon position correlations: correlation between the 3rd position of a codon and the 1st position of the next one and between the middle position of 2 consecutive codons have been found characteristic of coding regions (Smith et al., 1983). Chi-squares of deviation from random dinucleotide distribution in these 2 cases have been computed. Since the reading frame is

Table 4. Weighted profile for acceptor sites

Positions: -14 -13 -12 -11 -10 -9 -8 -7 -6 -5 -4 -3 -2 -1 0 1 2
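The profile definition given in section (c)(i), where $M_{ij}$ is the frequency of nucleotide $i$ at aligned position $j$, can be sketched as follows. The optional `weights` argument is an assumption about how a weighted profile such as the one in Table 4 might be obtained from the sequence weights described in Materials and Methods: each sequence contributes its weight divided by the sum of all weights. The function name `derive_profile` is hypothetical, and no scoring rule is shown because the text does not give one.

```python
def derive_profile(seqs, weights=None):
    """Return {base: [frequency at each aligned position]} for A, C, G, T.

    seqs    -- aligned sequences of equal length
    weights -- optional per-sequence weights (assumption: normalised by
               their sum); equal weights reproduce the plain 1/n average
    """
    if weights is None:
        weights = [1.0] * len(seqs)
    length = len(seqs[0])
    total = sum(weights)
    profile = {base: [0.0] * length for base in "ACGT"}
    for seq, w in zip(seqs, weights):
        for j, base in enumerate(seq.upper()):
            if base in profile:          # ignore ambiguity codes
                profile[base][j] += w / total
    return profile


m = derive_profile(["ACGT", "ACGA", "TCGA", "ACTA"])
print(m["A"])
```

With equal weights every column of the profile sums to 1 when the sequences contain only A, C, G and T.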


Table 5. Proportion of last expressed versus non-expressed exons

Last exons: 256 (130.1)
  Non-expressed: 6 (4.5) (~3%)
  Expressed: 250 (125.6) (~97%)
TAA: TAG: TGA: 71 59 109 5

Absolute frequency of the different stop codons in expressed last exons. Sums of weights are in parentheses.

unknown for a given predicted internal exon, chi-squares are computed in this case for the 3 possible frames and that corresponding to the greatest deviation from random distribution is assigned to the predicted exon. Additionally, chi-square values corrected by the exon length have been computed.
(3) Variables 9 to 16. Numeric derivatives of variables 1 to 8 at the beginning of the exon: for each of the above variables we obtain a score related to its slope at the beginning of the exon, by summing the nucleotide frequencies calculated in 300 bp† windows and evaluated at 75 positions at each side around the beginning of the exon, and by dividing the 75 sums on the left side by the 75 sums on the right.
(4) Variables 17 to 24. Numeric derivatives of variables 1 to 8 at the end of the exon.

We have investigated other variables that behave differently in coding than in non-coding regions such as Fickett's (1982) coding function and Konopka's Periodic Asymmetry Index (Konopka, 1990), but found them to be highly correlated to the above measures. In total, we compute 24 different variables for each predicted exon.

We have studied the behavior of these variables in true exons and false predicted ones in the "training" sets we compiled, in order to derive a discriminant schema. Note that if $X_1, \ldots, X_n$ represent the variables considered ($n = 24$ in our case), we can represent each of the sets of predicted exons (first, internal or terminal) for the transcription units in the 169 vertebrate genes compiled in an $n$-dimensional space, each one of the variables corresponding to a different dimension. Predicted exons are points in such a space. The goal is to obtain the surface in this multidimensional space that best separates the population of points corresponding to true exons from the population corresponding to false exons. Ideally, the surface would separate the 2 populations completely. In a simpler approach, we constructed the hyperplanes orthogonal to each variable $X_i$ in the extreme values, $x_{i1}$ and $x_{i2}$, that $X_i$ takes in the compiled set of true exons. Predicted exons outside the subspace enclosed by these hyperplanes are discarded. In other words, predicted exons for which any of the variables has a value outside the interval defined by the extreme values of the variable's distribution for true exons are discarded. Extreme values for the 24 variables considered and the percentage of reduction in false positives obtained for each of them in the sets of predicted exons for the 169 vertebrate genes compiled are given in Table 6.

Additionally, we have derived a discriminant function from the set of exons enclosed in the subspace defined by the orthogonal hyperplanes (that is, from the set of predicted exons not discarded after the filtering) in order

† Abbreviations used: bp, base-pair(s); kb, 10^3 bp.

Figure 3. Frequency distribution of the distance from the stop codon to the end of the transcription unit for vertebrate genes (non-expressed last exons and the corresponding introns may be included). Horizontal axis: distance (bases), 0 to 5000.

to achieve a further reduction in the number of false positives. The goal now is to obtain the projection of the multidimensional space onto a single, most discriminant axis. We obtained the first canonical variable for this space, that is the linear combination of $X_1, \ldots, X_n$ that has the highest possible multiple correlation with the 2 populations (false and true predicted exons), but a linear function obtained by a neural net trained to minimize the intersection between the 2 populations performed better. We computed the extreme values of this function in the compiled set of true exons, and predicted exons for which the value of the function fell outside the interval defined by the extreme values are discarded. Further reduction in the number of false positives is obtained in this way (see Table 6). We will typically refer to the value of the discriminant function for a given exon as the exon score.

Finally, a competition model between overlapping exons has been implemented. Exons that are overlapping or separated by less than a minimum distance (50 bp) are mutually exclusive and ranked within competition classes according to their score. The classes are built by recursively picking the highest scoring exon of the gene and including in that class all exons that are separated from it by less than 50 bp. The classes are ranked for each gene according to the score of their highest scoring exon. These 2 rankings can provide additional filters. The results are shown in Table 6.

(iii) Exon classes

A (predicted) gene for a transcription unit may be defined as a linear arrangement of compatible (predicted) exons, initiated by a first exon, followed by any number (may be zero) of internal exons and ended by a terminal exon.
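The definition of a predicted gene as a first exon, any number of internal exons and a terminal exon can be sketched as a recursive enumeration. This is an illustration only: the names `Exon` and `gene_models` are hypothetical, compatibility is reduced to a minimum intron length passed in as `min_intron`, and the reading-frame condition (no in-frame stop codon in the joined sequence) is deliberately not checked.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Exon:
    start: int   # first base, 0-based
    end: int     # one past the last base
    kind: str    # "first", "internal" or "terminal"


def gene_models(exons, min_intron):
    """Yield every chain first -> internal* -> terminal whose consecutive
    exons are separated by at least `min_intron` bases (frame not checked)."""

    def extend(chain):
        last = chain[-1]
        for exon in exons:
            if exon.kind == "first" or exon.start - last.end < min_intron:
                continue
            if exon.kind == "terminal":
                yield chain + [exon]
            else:
                yield from extend(chain + [exon])

    for exon in exons:
        if exon.kind == "first":
            yield from extend([exon])


candidates = [Exon(0, 100, "first"),
              Exon(200, 300, "internal"),
              Exon(400, 500, "internal"),
              Exon(600, 700, "internal"),
              Exon(800, 900, "terminal")]
print(len(list(gene_models(candidates, min_intron=60))))
```

With three mutually compatible internal exons, each of which may be included or skipped, the enumeration already yields eight gene models for a single first and terminal exon, which illustrates how quickly the space of models can grow.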
By arrangement of compatible exons we mean that (1) every exon in the arrangement should start after a certain distance, the minimum intron length, from the end of the previous one, and (2) joining the different exons in a single sequence, there are no stop codons in the frame defined by the first exon other than the terminal one. Thus, from the set of predicted first, internal and terminal exons it is possible to obtain the set of predicted genes for a given transcription unit. Note, however, that while the number of predicted genes grows only linearly with the number of predicted first and terminal exons, in the worst case, in which all combinations of exons are possible, it may grow exponentially with the number of

Table 6. Extreme values for the filter variables considered and associated percentage of reduction obtained in false positives in the predicted genes for the 169 vertebrate genes compiled

Exons filter        First
                    Lower    Upper

1. A%  2. C%

0.057 0400

0.47

047 OOli o.62 0.053 0.36 239 216 2.18 153 0.04 1.07 @06 057 A-site or ATQ site: -0029 0,023 - 0040 0.031 - 0.04 1 0.046 - 0.024 (bO27 - 9.34 666 -7.14 526 -0.093 0.066 -0.07% 0.053 Dsite or stopcodon: - 0.020 0.035 - 0.044 003 I -CO41 0429 -0030 0.040 - 9.25 7.77 - 7.49 516 -0092 0.07H --0075 0.05 1

3. G%  4. T%  5. 2-2 Corr.  6. 3-1 Corr.  7. 2-2 Corr/length  8. 3-1 Corr/length
Slopes of scores at A-site or ATG site:  9. A%  10. C%  11. G%  12. T%  13. 2-2 Corr.  14. 3-1 Corr.  15. 2-2 Corr/length  16. 3-1 Corr/length
Slopes of scores at D-site or stop codon:  17. A%  18. C%  19. G%  20. T%  21. 2-2 Corr.  22. 3-2 Corr.  23. 2-2 Corr/length  24. 3-2 Corr/length
25. Linear comb. of all filters  26. Competition filter based on 25 Score  27. Total reduction

(‘ut-off valuest 1nternal Lower I’pper ow 0.09

0.01 O-06 417 two 0.09 013

Incremental reduction (%) in‡: False positives (First, Int, Last); True positives (First, Int, Last)

0.48

.>

047 046 0.43 127, 130 1.56 I.56

I 9 6

- 0037 - 0039 - 0.045 - w045 - 7.02 - x.34 -0470 - 0.084

w27 0.040 0044 0029 13.2 14% 0132

-031

0.03 I

0.14i

- 0.052 0.044 -0045 - 0,039 -19.5 -18.5 -0195 - 0.186

obtained in false positir:es

0.034 0.044 7.39 10.7 0.074 0.106

1 4 3 0

II 0 0 0 0 0 (I 0

1 0 0 I 1 I 0 II

0 0 0 II 0 0 0 0

1 0 0 0 7 0 0 0

0 0 0 0 (I 0 0 0

1 0 0 0 0 0 0 0

0 0 0 0 0 0 (I 0

0 1 0 0 .i I 0 0

II 0 0 0 0 0 0 0

I4

17

I4

0

.i 67

5 71

a 68

II II

† Minimum (lower) and maximum (upper) values of distribution of true first, internal and last exons.
‡ All percentages are rounded reductions in those members of the test set that are filtered (only exons larger than 15 bp). The total reduction is calculated with reference to the entire exon population, including those smaller than 15 bp.

predicted internal exons. To prevent the combinatorial explosion involved in the search of the space of predicted genes, we have introduced the concept of exon equivalence. We can represent the set of gene models starting in a given first exon as a rooted tree, the first exon corresponding to the root node and each node being expanded by the nodes corresponding to all compatible internal and terminal exons. Terminal exons would correspond to terminal nodes and each path from the root to a terminal node would correspond to a predicted gene (see Fig. 4). The space of predicted genes can be represented as a set of rooted trees, each one corresponding to a different predicted first exon. We consider that 2 (predicted) exons are equivalent if the nodes corresponding to them have exactly the same parent nodes and they expand to exactly the same subtree (see Fig. 4). In other words, 2 exons are equivalent if they can occur exactly in the same (predicted) genes, if they are completely interchangeable. It is easy to see that the above definition of exon equivalence corresponds to an equivalence relation in the mathematical sense (it is reflexive, symmetrical and transitive). We may then derive the partitions induced by this relation in the sets of predicted first, internal and terminal exons, and assemble predicted genes as linear arrangements of compatible exon classes, instead of as linear arrangements of predicted exons. Since the sets of exon classes are smaller than the sets of predicted exons, the size of the space of predicted genes to search may be substantially smaller.

Strictly, exon equivalence classes satisfying the above definition cannot be derived without constructing the tree(s) corresponding to the space of predicted genes. However, we have used an approximate definition of exon equivalence according to which exon classes for the sets of predicted exons can be derived efficiently. Roughly, we consider that 2 internal exons are equivalent if they overlap exactly with the same predicted first, internal and terminal exons, and if they can be read exactly in the same frames and have the same codon remainder. We define exon equivalence similarly for first and terminal exons. We have obtained exon classes for the transcription units corresponding to the compiled 169 vertebrate genes (see Table 1). The size of the resulting sets of exon classes is substantially smaller (35%, on average) than the size of the sets of predicted exons after filtering. We ranked exon classes within a given transcription unit according to the

HVMC’SFGM -1 HL’MCYPIIE HrMEFl A HVMQOSlSA HUMGOSlOB HI-MGHN HIIMGLTHI Hl’MHMGl4A HlWHSPSOB HT:MILl I3 HIWILll3S H ~’MILP HI-MI L2.4 HVMIL2B HUMIL.5 HYMILBA HUMKEREP HYMLACTA HUMLYLlB HUMMET HPMMETIFl

RNA 4311 1433 1791 1082 3525 4793 4777 2314 4845 9512 4561 1513 1189 1617 3293 4860 7564 2926 8158 6654 7202 815 2168 3629 3651 810 1796 5650 3475 28.51 IO 226 11:245 3883 2065 3995 2603 2573 1337 1867 3337 4653 3320 3594 7810 1465 2699 1466 1656 2376 11,410 3512 1883 1885 1633 845 6804 6784 7007 7008 5013 504 1 5041 2075 2079 4621 2362 3695 905 131%

CDS 954 459

654 426 1491 942 1134 1134 585 1134 543 411 444 444 273 981 1161 747 1305 1167 756 429 429 633 633 429 654 2577 552 429 1257 570 1587 456 897 1191 1191 303 804 306 252 306 954 1590 405 768 498 651 435 1482 1389 279 2x2 654 499 303 2175 810 810 462 462 462 405 405 1419 429 804 186 186

Lu 0. exons 3 3 5 3 10 ii 6 6 8 4 3 3 3 6 7 7 7 10 7 6 3 3 6 6 3 1; 3 3 4 6 11 3 4 3 3 3 3 3 3 3 3 10 3 5 3 : 9 7 3 3 5 3 6 11 6 6 4 4 4 4 4 8 4 3 3 3

NO.

models 1348 78 87 35 24,540 7039 6295 458 77.418 88,810 115 10 29 141 569 4873 95,534 4414 216,810 31,940 7007 %8 434 471 274 16 41 215,277 115 113 115,724 64,976 11,726 38 1191

542 512 58 25 5291 2624 2088 1432 422.334 145 2364 16 374 648 137,184 1186 105 101 183 % 26,202 218,239 39,615 81,297 2313 1058 1058 89 92 13,818 919 691 25 1X

Rank 1 2 3 6 % 3 262 2 7 1 1 25 34 2

2 1268 1 1 12 9

2

2 3 51 59 1 832 li8 235 1 % 31 5 z 4: 103 16 1 10 5 1 231 586 264 264 1 1 3 5 1 1

Predicted exons Total True 9, 3 3 5 % 4 6 7 4 6 6 5 3 3 3 ; 7 6 9 7 3 3 3 4 5 3 3 13 2 % 5 9 7 1 5 4 4 4 3 3 4 3 3 11 4 6 % 5 7

10 % 4 3 4 2 5 12 6 5 4 4 4 4 4 9 3 3 3 3

3 2 3 1 2 4 5 3 5 5 3 1 3 3 3 0 6 4 5 5 0 3 3 2 5 1 1 9 0 1 3 3 5 0 2 2 P P 3 1 2 1 3 10 % 4 1 3 3 6 0 % z 3 1 5 8 5 3 2 1 2 4 4 8 1 0 3 3

First prediction Predicted CDS cc Length sen I

1.00 067 0.60 0.33 0520 0.80 0.83 0.50 083 062 0.75 0.33 1.oo 1.oo 0.50 0.00 086 0.57 050 0.71 0.00 la0 1.oo ti33 083 0.33 @20 069 0.00 0.33 0.75 0.50 0.45 0.00 0.50 067 0.67 0.67 1a0 0.33 0.67 0.33 l-00 l-00 0.67 0.80 0.33 060 0.75 (b67 (PO0 oti7 (I.67 060 o-33 o-83 o-73 o-83 0.50 tit50 (I.50 0.50 1.00 l-00 1.oo O-95 O-00 1.00 I-00

954

510 792 246 1173 1116 1206 1143 528 1017 465 225 444 444 255 54 1077 582 1005 1062 282 429 429 804 516 315 558 2463 606 456 1275 1062 1116 105 1014 894 894 363 804 414 354 174 954 1644 501 1008 318 603 .501 1446 423 348 240 579 396 273 1980 921 330 255 255 255 405 405 1458 237 495 186 186

l+Ml fi89 0.84 0.67 0.42 0.74 0.90 0.66 0.93 0.80 085 0.63 1.oo 1.00 0.81 @21 0.96 0.82 064 0.9% 0.33 1.oo 1.oo 075 0.89 0.65 0.03 0.87 0.78 0.64 a99 @36 060 0.43 0.73 0.79 0.79 @63 1a0 054 081 0.74 1~00 0.98 082 0.82 073 088 OYiO 072 023 0.7 1 091 w79 0.47 0.95 08.5 0.93 0.60 0.71 0.7 1 071 1a0 I .oo 0.98 @69 075 1QO 1QO

1 .oo 097

1a0 0.58 0.56 0.86 095 0.83 0.89 0.78 080

P52 1~00

1.oo 0.80 0.06 093 07ti 0.61 089 022 1a0 1~00 030 0+42 0.69 0.33 031 0.86 OT2 1w 0.56 0.6 1 0% 084 V’i5 0.75 0.7 9 1w 068 0% 057 1.oo 1.oo 097 1a0 064 a-89 0.64 075

021 o+xi 0.85 08% 070 090 o%s

1~00 040 O-54 054 054 1.OO I Ml 1.OO 054 062 1.00 1.OO

Sen2 1.oo 088 W82 1.00 0.71 073 0.90 @82 0.98 0.86 0.94 @95 1.00 1,oo 0.85 1.oo 1.oo 0.97 0.79 0.98 0.60 1+o 1.00 0.7 1 1.oo 0.93 @38 0.95 0.78 067 0.99 0.30 0.87 1.00 0.75 1a0 1Nl 0.66 1.oo 0.50 @69 1.00 la0 0.97 079 0.76 1w 0.96 0.56 0.77 @iO 0.68 1.oo 0.92 @76 1.oo 0.94 0.88 0.98 0.98 0.98 0.98 1.00 1.oo @9i

0-9R 190 1.oo 1w


Table 7 (continued) Length id 103 104 105 106 107 109 110 111 112 113 114 115 116 117 118 119 120 121 122 125 126 127 128 129 130 131 132 133 134 137 138 140 141 142 143 144 145 147 148 149 150 151 153 154 155 156 157 158 159 160 161 162 163 164 165 166 168 169 171 172 173 174 179 180 182 183 184 185 186 187

Locus HUMMIS HUMOTNPI HUMP45C17 HUMPAIA HUMPALC HUMPCNA HUMPGAMMG HUMPIMlA HUMPLA HUMPPPA HUMPRPHl HUMPRPHS HUMPSAA HUMPSAP HUMRPSl7A HUMSAACT HUMSPBAA HUMTFPB HUMTNFA HUMVPNP LEMHBB LEMHBE LEMHBG LEMHBGA MNKHBD MUSBlOHAl MUSACASA MUSAPOAII MUSAPOIVA MUSASP MUSBCASE MUSCTNC MUSCYPl4X MUSCYP345 MUSERP MUSERPA MUSGFJE MUSGPDX MUSGUSB MUSHBA MUSHBBHO MUSHBBHl MUSIL3B MUSLBPA MUSLTA MUSMETI MUSMETII MUSMHDBD MUSMHIEAD MUSMHP36BG MUSNUCL MUSODCC MUSPIMl MUSRPL30 MUSRPL3A MUSRPL7A MUSSAAl B MUSSAAZB MUSTHYGP MUSUPAA ORAHBAOl ORAHBAOS RABHBBlA2 RABHBBlBl RABUG RATABPG RATACCYB RATACSKA RATAGPAlG RATAPOAOl

RNA 2744 897 6659 12,160 6937 4924 2831 5239 1659 1665 3273 3134 5858 4467 3672 2835 9336 12,434 2767 2170 1618 1587 1560 1548 1630 2633 2944 1284 2377 1744 6794 3390 6215 6716 3477 3366 1870 7342 14,009 819 1791 1533 2195 4202 2105 1086 779 3578 5355 5052 8887 6509 5276 2712 3226 3107 2480 2479 4848 6694 834 835 1289 1288 3076 3032 3567 3009 3169 1788

NO.

CDS 1683 375 1527 1209 444 786 762 942 654 288 501 501 786 747 408 1134 1146 888 702 495 444 444 444 444 444 1464 1134 309 1185 780 696 486 1575 1542 579 579 447 1050 1947 429 444 444 501 399 609 186 186 1062 768 816 2124 1386 942 348 408 813 369 369 489 1302 429 429 444 444 276 1212 1128 1134 618 780

exons 5 3 8 8 4 6 3 6 5 3 3 3 5 4 5 6 10 6 4 3 3 3 3 3 3 10 6 3 3 5 7 6 6 6 5 5 3 8 12 3 3 3 5 4 3 3 3 7 4 9 14 10 6 4 3 6 3 3 3 10 3 3 3 3 3 8 5 6 6 3

No. models 296

7 41,774 76,347 74,542 1349 234 88,440 371 568 328 112 71,404 23,002 869 120 182,302 93,308 311 56 57 17 54 20 36 8059 323 21 193 304 1128 1120 26,907 38,459 976 469 339 45,507 188,433 12 75 PO 57 510 44 4 6 5362 35,060 41,022 206,811 231,248 1244 51 110 1324 259 184 1663 40,312 27 20 16 16 183 516 351 369 2851 37

Rank 12

4016 I 25 1 4 6 86 33 29 4 7 2 5 1 I 1 1 1 1 949 I I 12 1 1 185 2491 8 8 5 1

3504 1 3 7 3 3 1 1 347 -. 1 2 1 IN 1 1 I I 3 3 9 23 I 1 4

Predicted exons Total True y0 1 1 4 4 4 4 3 5 3 1 2 2 2 1 4 4 3 6 2 2 3 3 3 3 3 4 5 3 2 3 5 3 4 3 I 1 3 7 6 3 2 2 2 2 I 3 3 5 2 2 10 6 5 2 3 3 3 3 1 5 3 3 I 1 1 2 5 .5 5 2

0.20 633 050 650 l-00 0.67 166 (k83 (b60 033 067 0.67 040 025 0.80 067 0.30 1.oo 0.50 0.67 I .oo I.00 140 Ia0 190 640 0.83 LGO 6.67 0.60 0.7 1 0.50 0.67 050 wo 0.20 I .oo 088 050 I .oo 0.67 667 0.40 w.50 033 I .oo 1u1 w7 1 0.50 0.22 0.71 6.60 0.83 0.50 I .oo 0.50 I ,oo I .oo 0.33 0.70 1.oo I .oo

w33 w33 033 0.25 I 0) 083 083 067

First prediction Predicted CDS Length Sen 1 cc 1521 516 1692 1521 444 588 762 936 606 258 516 426 621 561 504 1017 1293 888 615 483 444 444 444 444 444 1266 1173 309 1272 768 693 498 1.569 1275 312 312 447 1269 996 429 267 267 348 354 693 186 186 777 804 543 1761 1107 918 192 408 576 369 369 657 1023 429 429 468 468 270 435 1128 942 615 912

0.75 0.73 986 082 l-00 0.84 190 081 0.88 030 960 0.68 665 616 0.81 082 0.37 Ia0 032 0.98 190 190 160 190 190 0.73 0.97 1m 0.70 0.99 0.99 0.86 683 071 636 0.36 IQ0 0.89 661 I Qo 0.73 0.72 0.68 0.7 1 0.84 1Go 1GO 0.81 0.73 042 0.82 085 0.98 067 I .oo w79 1.oo 1m 0.83 0.7 1 1.oo 1.00 0.91 0.91 0.16 0.43 1.00 087 I .oo 0.86

085 190 094 0.94 1.00 0.75 160 084 039 0.40 067 0.67 0.61 0.24 093 084 048 1.00 081 098 1.oo 1.00 160 1a0 160 081 190 140 0.89 098 0.99 089 OX7 0.70 0.32 032 190 1w 646 160 0.60 0.60 0452 069 095 I4M) 1w (hi3 0.79 0.40 w7x *7x 0.97 0.52 1 a0 070

160 1~00 0.98 0.67 1.00 I .oo *9ci 0.96 0.24 033 1a0 0.83 I4H) 1~00

Sen2 0.94

673 0.85 07.5 1ao 160 190 685 0.96 944 o-65 0.79 678 0.32 0.75 @94 043 1.oo 093 160 1,oo 1.00 1.00 1.00 1a0 093 @97 I .oo 683 1.oo 0.99 (187 fkXX 0.84 0.60 0.60 1.00 083 ().!I 1 160 l-00 1.oo OMI 07x OH3 I WI 1fU) I a0 975 0.60 095 C98 I 30 0.94 I QO 099 1.00 1m 6.73 @86 Ia) 1.00 0.9 1 0311 624 091 I-00 160 1.00 0.86


Table 7 (continued)

                  Length                          Predicted exons     First prediction, predicted CDS
 id  Locus        RNA    CDS   No.    No.         Total  True  %      Length  CC     Sen1   Sen2
                               exons  models
188  RATAPOAOI    2373   1176   3          82      3     1    0.33    1134    0.97   0.96   1.00
189  RATAPOAIG    2375   1176   3         114      3     1    0.33    1074    0.92   0.91   1.00
190  RATAPOEA     2779    939   3         456      2     1    0.33     948    0.72   0.82   0.81
191  RATCKBR      2878   1146   7        2068      7     5    0.71    1065    0.78   0.83   0.89
192  RATCRYGE     2600    525   3          81      3     3    1.00     525    1.00   1.00   1.00
193  RATCRYGF     2107    525   3         163      4     2    0.67     573    0.91   0.97   0.89
194  RATCYP45C    6046   1575   6      23,367      8     3    0.50    1545    0.88   0.90   0.92
196  RATCYPD45    6929   1542   6      33,906      7     3    0.50    1425    0.94   0.91   0.99
197  RATFABPLG    3788    384   4         681      5     3    0.75     573    0.80   1.00   0.67
198  RATFERL1     1963    552   4          54      4     4    1.00     552    1.00   1.00   1.00
199  RATGH1       2087    651   5        1076      5     5    1.00     651    1.00   1.00   1.00
200  RATGH2       1981    651   5         568      5     4    0.80     537    0.76   0.75   0.91
201  RATGNRHA     4303    279   3      10,726      3     2    0.67     285    0.83   0.85   0.83
204  RATKALA      4083    786   5         966      5     3    0.60     816    0.53   0.63   0.61
205  RATLHB        982    426   3          72      3     0    0.00     207   -0.12   0.15   0.31
206  RATOXTNP      834    378   3           6      3     3    1.00     378    1.00   1.00   1.00
207  RATPKLG      8350   1632  11     186,814     10     6    0.55    1236    0.79   0.72   0.95
208  RATPTRYI     3204    741   5         258      5     4    0.80     723    0.98   0.98   1.00
209  RATRHL1      3488    855   8      16,044      9     3    0.38     990    0.84   0.95   0.82
210  RATRSKGB     4148    780   5         766      4     1    0.20     429    0.70   0.55   0.99
212  SHPMTIB      2019    186   3          41      3     3    1.00     186    1.00   1.00   1.00
213  SHPMTIC      1358    186   3         520      3     3    1.00     186    1.00   1.00   1.00
214  SHPMTII      1019    186   3          26      3     3    1.00     186    1.00   1.00   1.00
215  SNKTROX      7923    768   5     109,762      4     1    0.20     456    0.68   0.54   0.91
216  SRAAFPG      1887    492   5         198      3     1    0.20     126    0.40   0.23   0.91
217  XELBZGG      2012    444   3          83      3     1    0.33     300    0.79   0.68   1.00
218  XELKER1A1    3644   1290   8        2377      4     2    0.25     258    0.37   0.20   1.00
219  XELKERIA     3797   1302   9      18,652      9     9    1.00    1302    1.00   1.00   1.00
220  XELTUBA      6285   1350   4        2591      4     4    1.00    1350    1.00   1.00   1.00
222  XETHBBA      1046    429   3          10      3     3    1.00     429    1.00   1.00   1.00

     Averages     3719    762   5      23,024     48     3    0.63     681    0.79   0.80   0.88

Rank (row alignment not recoverable): 15 29 1 1 2 7 12 5 1 168 1 1 1 64 4; 218

For each gene, the Table shows first an arbitrary identification number (id), the GenBank locus name (release 62.0), the length of the mRNA and of the coding region (in number of nucleotides) and the number of exons. Then it shows the number of models predicted by GeneId and the rank of the true prediction, that is, the rank of the class containing the true gene. Dashes mean that the true gene has not been predicted. Of the 169 genes, 134 (79%) were predicted, 102 (60%) were predicted among the top 20 ranking models and 47 (28%) were predicted at the top. Finally, the Table shows an evaluation of the first prediction (that is, of the top ranking model in the top ranking class): first, the number of predicted exons, the number of true exons predicted and the proportion of true exons predicted. Note that in 162 genes (96%), at least one exon is correctly predicted. Then, the length of the coding region predicted, the proportion of true coding nucleotides predicted as coding (Sen1), the proportion of predicted coding nucleotides that are actually coding (Sen2) and the correlation coefficient (CC) between actually coding and predicted coding. Averages for all these variables have been computed.
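The nucleotide-level measures defined in this legend can be sketched in a few lines. This is only an illustration, not GeneId's own code: it assumes that the true and the predicted annotations are available as equal-length 0/1 vectors (1 = coding), and that CC is the ordinary Pearson correlation between the two vectors, which for binary data reduces to the expression used below. The function name `coding_accuracy` and the convention of returning 0 when a denominator vanishes are assumptions made for the example.

```python
import math


def coding_accuracy(true_coding, predicted_coding):
    """Return (Sen1, Sen2, CC) for two equal-length 0/1 sequences.

    Sen1: proportion of true coding nucleotides predicted as coding.
    Sen2: proportion of predicted coding nucleotides that are truly coding.
    CC:   Pearson correlation between the two binary vectors.
    """
    if len(true_coding) != len(predicted_coding):
        raise ValueError("both annotations must cover the same nucleotides")

    pairs = list(zip(true_coding, predicted_coding))
    tp = sum(1 for t, p in pairs if t and p)          # coding, predicted coding
    fn = sum(1 for t, p in pairs if t and not p)      # coding, missed
    fp = sum(1 for t, p in pairs if not t and p)      # non-coding, predicted coding
    tn = sum(1 for t, p in pairs if not t and not p)  # non-coding, not predicted

    sen1 = tp / (tp + fn) if (tp + fn) else 0.0
    sen2 = tp / (tp + fp) if (tp + fp) else 0.0
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    cc = (tp * tn - fp * fn) / denom if denom else 0.0
    return sen1, sen2, cc


# Toy example: 8 nucleotides, 4 truly coding, 4 predicted coding, 3 shared.
sen1, sen2, cc = coding_accuracy([1, 1, 1, 1, 0, 0, 0, 0],
                                 [1, 1, 1, 0, 1, 0, 0, 0])
print(sen1, sen2, cc)
```

Under these assumptions a prediction that misses the true gene entirely can still have positive Sen1 and Sen2, and CC becomes negative only when the overlap is smaller than expected by chance.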

tides. On average, 80% of the coding region is predicted as coding and 88% of the predicted coding is actually coding. The average of the correlation coefficient is 0.79.

4. Discussion

Results obtained show that the GeneId system is capable of correctly predicting a large proportion of vertebrate genes among a set of ranked candidates for sequences shorter than about 15 kb, given the domain in which the gene is expected. This qualifier is the result of not yet being able to predict the promoter region accurately. For larger sequences, the number of predicted exons becomes so large that the combinatorics become computationally prohibitive. For such large genes we would predict the candidate exon classes and not all possible sequential arrangements. The concept of this hierarchical, rule-based approach directly allows for modification and refinement as more details of the cellular processes emerge. It would appear to provide an excellent "test platform" for investigating new or alternative methods of atomic element identification or filtering and assembly rules. For example, coding regions could be predicted using the fast online GRAIL system (Gene Recognition and Analysis Internet Link, Oak Ridge National Laboratory). The study of gene activation is still in its infancy but, as more details of the necessary and sufficient patterns of gene-specific activation emerge, we believe that these results can be included in our hierarchical system as rules that extend the application of the system to the identification of promoters as well.

In order to obtain some sense of the general accuracy of the approach, we evaluated GeneId in a set of new genes independent of those in the original studied set. We obtained the set of genes in GenBank 66.0 (1991) not present in GenBank 64.0, satisfying the same criteria as the set of genes previously compiled from GenBank 64.0: 28 new genes were obtained in this way. Unfortunately, the characteristics of this data set are different from the


Table 8 Results obtained for the independent

Id 6 16 47 52 53 54 55 .56 .57 5x .59 61 62 63 65 66 77 81 82 95 106 115 116 126 129 130 142 145

Locu,s BOVPNMTU CHKCMYCA HAMRPSI4B HUMCYPSDG HVMEMBPA HUMGAPDHG HUMHBBBAZ HUMHNRNPA HUMIBP3 HI’MLD78A HUMLD78U HUMPDHBET HUMPP14H HUMREGH HUMTRPY 1 H MNKHGBGG$G MUSCD14.4 MUSCYTCB 1 MVSCYTCC MUSMBlAA MUSRESl D RAUl5LOX RABBCAS RATCOXIVA RATCYPSAl RATCYPZA3A RATODCAA KATPPP

SO. x0. PXOIH models

13last PCOI’C

Length RN.4

t ‘IX

81 91 71 187 95 95 604 105 81 183 192 * 79 93 113 478 87 159 193 82 92 9.5 655 93 298 104 563 69

1594 3049 4659 4378 3281 3856 1606 4546 8870 1889 1890 6312 5040 2921 1848 1496 1541 4254 4945 4805 9630 8007 9430 6106 12.836 8076 6485 1301

852 1251 4.56 1.503 669 1008 444 963 876 279 282 1080 543 501 X28 144 1101 1515 1518 663 1209 1992 687 510 1479 1485 1386 “07

3 2 4 9 ?I 8 3 9 4 3 3 10 6 .i 5 3 2 9 9 r, 9 I4 7 4 9 9 10 2

23 40 31,542 22,640 1813 31,826 91 31,641 118.458 105 102 17,388 45,968 467 415 *5x I0 5980 25.676 li.681 61.168 31.390 73.541 ,5X1 68,880 96.192 199.922 20

4809

922

6

31 .i28

data, nPt Predicted exonr Total Trur -

Rank

First prediction Predictrd

(:I%

3 31.507 4 1

I6 I X3 1 x I 1i(i 58

32 41X0 43,452 5302

The similarity between the genes in this set and the genes in the previous study set has been investigated. For each gene we provide the maximum blast score of similarity (Altschul et al., 1990) between the gene and the genes in the previous study set (an asterisk (*) means the value of similarity is less than an internal blast threshold). As can be seen, the independent data set contains some genes homologous to genes in the previous study set. However, the performance of GeneId is not better for these cases. See Table 7 for an explanation of the meaning of the rest of the values.

characteristics of the original studied set: genes are significantly longer, 4809 bases on average versus 3719, and have more exons, six versus five. It comes as no surprise, then, that GeneId performs worse in this independent data set. Results obtained are shown in Table 8. Now, in only 15 cases (54%) is the true gene (correct exons with correct splice boundaries) among the set of predicted genes, and in only four cases (14%) is the true gene the top-ranking predicted gene. However, in a nucleotide-by-nucleotide evaluation, GeneId still performs very well. On average, 69% of the coding region is predicted as coding and 84% of the predicted coding is actually coding, the average of the correlation coefficient being 0.7. This predicted coding sequence will, in most instances, be accurate enough for searching against protein sequence databases for the identification of probable gene function. We are aware that our use of open reading frames for the initial identification of exons is sensitive to frameshift errors. This problem can be overcome by including in our exon definition those regions that have a high coding probability by some measure between acceptable splice sites, whether they are an open reading frame or not.
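The sensitivity of open reading frames to frameshift errors can be made concrete with a small sketch. The code below is an illustration only and is not taken from GeneId: the function name `open_reading_frames`, the 0-based coordinates, the `min_len` cutoff and the toy sequences are all assumptions. It lists stop-free stretches in the three forward frames and then shows how a single inserted base moves a long stretch out of its original frame.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def open_reading_frames(seq, min_len=30):
    """Return (start, end, frame) for stop-free stretches of at least min_len.

    Coordinates are 0-based, end is exclusive, and only the three forward
    frames are scanned. A stretch ends just before a stop codon or at the
    last complete codon of the sequence.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = frame
        pos = frame
        while pos + 3 <= len(seq):
            if seq[pos:pos + 3] in STOP_CODONS:
                if pos - start >= min_len:
                    orfs.append((start, pos, frame))
                start = pos + 3  # next stretch begins after the stop codon
            pos += 3
        if pos - start >= min_len:  # stretch running to the end of the sequence
            orfs.append((start, pos, frame))
    return orfs


# Toy coding region: ATG, six GATAAC repeats, then a TAA stop in frame 0.
original = "ATG" + "GATAAC" * 6 + "TAA"
# Simulated sequencing error: one extra base inserted at position 12.
mutated = original[:12] + "C" + original[12:]

print(open_reading_frames(original))
print(open_reading_frames(mutated))
```

In this toy case the frame-0 stretch of the error-free sequence no longer appears after the insertion, because the shifted reading frame runs into stop codons almost immediately. An exon definition based on coding probability between splice sites, as suggested above, would not depend on such a stretch remaining open.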

As exemplified by the general identify, filter by rule, combine by rule and iterate approach, when prior knowledge of the gene to be modelled exists, the prediction accuracy can be greatly increased. If, for instance, the N-terminal amino acid sequence is known, or a particular splice site is known, that information can be used to pick out those models that adhere to the prior knowledge. A similar approach can be built into the rule-based system by selecting combinations of atomic measures that provide highly probable sites. Such ability has been demonstrated for donor site prediction supported by coding information (Brunak et al., 1991) where, on average, one in five donor sites had a strength that made it highly probable (if not certain).

During development, we experimented with a number of different algorithms for recognition of splice sites: (1) neural networks, including more advanced assemblies of several networks described by Brunak et al. (1991); (2) dynamic programming methods; and (3) thermodynamic models that include stacking energies in the pairing between mRNA and small nuclear ribonucleoproteins. In the context of GeneId, however, where additional information like reading frame and stop codons is utilized, the overall performance using these more sophisticated algorithms was not significantly improved over the simple profile scoring method described in Materials and Methods. Likewise, methods for recognizing coding regions (Fickett, 1982; Fichant & Gautier, 1987; Konopka, 1990) and neural nets combining the output of such algorithms did not seem to be significantly more sensitive than the simple rules described in Materials and Methods. Thus, while the modular design of our hierarchical filtering and rule-based system can incorporate nearly any of the proposed methods, the use of the most explicit and simplest methods provides the cleanest test of the overall concept.

During the training of neural nets to recognize donor sites, two neural nets were compared that differed only in that one of them was given information about the length of the exon (the distance to the upstream acceptor site). Surprisingly, the neural net given the length information performed better than that without length information when tested on the same test set, but again giving length information to only one of the nets. When we investigated in what way the neural net used the length information, we discovered that it had modeled a length function that is very similar to the actual length distribution of our exon set. It is tempting to speculate that the neural net had modeled a spatial constraint on the co-ordinated recognition of acceptor and donor sites, which then allowed it to perform better than the neural net that was not given any knowledge of the spatial arrangement. Of course, it cannot be ruled out that the better performance was caused by utilization of the non-random length distribution of exons per se.

Our approach to the splice site identification has been to treat genes in which no alternative splicing is known as subject to the same splicing machinery. It is, however, becoming increasingly apparent that this may not be the case (Maniatis, 1991). Apart from cell-specific proteins that bind specifically to pre-mRNA recognition sequences to control splicing, the activities and amounts of the general splicing factors thought to participate in regulation of alternative splicing show cell-specific variation (Krainer et al., 1990). The implicit assumption of a common process may have limited our splice site prediction. As more details of this modulation of the splicing machinery emerge, or as more splicing examples from identical cells become available, our splice site prediction may be refined by incorporating this knowledge as rules in our system. There is even a possibility that the identification of cell-specific gene activation sequences may be linked to rules of cell-specific splice site selection.

A further development is to narrow down the taxonomic coverage of our system. By narrowing down the control set from vertebrates to mammals, the performance would probably increase. Division-specific versions for yeast, nematodes, insects, or even bacteria, could be useful as well.


This work was supported by National Library of Medicine grant LM05205 and by a postdoctoral fellowship from the Ministerio de Educación y Ciencia (Spain) to R.G. A beta version of the GeneId system has been made freely available to the research community by an automatic mail server. We thank Kathleen Klose of the BMERC (Boston University) for making this service possible. For instructions, send the following Email: "geneid info" to [email protected]. Questions can be addressed to [email protected] or to [email protected]

References

Altschul, S. F., Gish, W., Miller, W., Myers, E. W. & Lipman, D. J. (1990). Basic local alignment search tool. J. Mol. Biol. 215, 403-410.
Brunak, S., Engelbrecht, J. & Knudsen, S. (1991). Prediction of human mRNA donor and acceptor sites from the DNA sequence. J. Mol. Biol. 220, 49-65.
Bucher, P. (1990). Weight matrix descriptions of four eucaryotic RNA polymerase II promoter elements derived from 502 unrelated promoter sequences. J. Mol. Biol. 212, 563-578.
Burks, C., Cassidy, M., Cinkosky, M., Cumella, K., Gilna, P. et al. (1991). GenBank. Nucl. Acids Res. 19, 2221-2225.
Cramer, H. (1946). Mathematical Methods of Statistics. Princeton University Press, Princeton.
Felsenstein, J. (1981). Evolutionary trees from DNA sequences: a maximum likelihood approach. J. Mol. Evol. 17, 368-376.
Fichant, G. & Gautier, C. (1987). Statistical method for predicting protein coding regions in nucleic acid sequences. CABIOS, 3, 287-295.
Fickett, J. W. (1982). Recognition of protein coding regions in DNA sequences. Nucl. Acids Res. 10, 5303-5318.
Fields, C. & Soderlund, C. A. (1990). gm: a practical tool for automating DNA sequence analysis. CABIOS, 6, 263-270.
Gelfand, M. S. (1990). Computer prediction of the exon-intron structure of mammalian pre-mRNAs. Nucl. Acids Res. 18, 5865-5869.
Hertz, G. Z., Hartzell, G. W. & Stormo, G. D. (1990). Identification of consensus patterns in unaligned DNA sequences known to be functionally related. CABIOS, 6, 81-92.
Konopka, A. K. (1990). Towards mapping functional domains in indiscriminantly sequenced nucleic acids: A computer approach. In Structure & Methods, pp. 113-125, Adenine Press, New York, NY.
Kozak, M. (1983). Comparison of initiation of protein synthesis in procaryotes, eucaryotes, and organelles. Microbiol. Rev. 47, 1-45.
Krainer, A. R., Conway, G. C. & Kozak, D. (1990). The essential pre-mRNA splicing factor SF2 influences 5' splice site selection by activating proximal sites. Cell, 62, 35-42.
Kudo, M., Iida, Y. & Shimbo, M. (1987). Syntactic pattern analysis of 5' splice site sequences of mRNA precursors in higher eucaryote genes. CABIOS, 3, 319-324.
Lapedes, A. S., Barnes, C., Burks, C., Farber, R. M. & Sirotkin, K. M. (1988). Application of neural networks and other machine learning algorithms to DNA sequence analysis. In Computers and DNA (Bell, G. I. & Marr, T. G., eds), pp. 157-182, Addison-Wesley, Redwood City, CA.
Maniatis, T. (1991). Mechanisms of alternative pre-mRNA splicing. Science, 251, 33-34.
Mount, S. M. (1982). A catalogue of splice junction sequences. Nucl. Acids Res. 10, 459-472.
Nakata, K., Kanehisa, M. & DeLisi, C. (1985). Prediction of splice junctions in mRNA sequences. Nucl. Acids

Res. 13, 5327-5340.
Smith, T. F., Waterman, M. S. & Sadler, J. R. (1983). Statistical characterization of nucleic acid sequence functional domains. Nucl. Acids Res. 11, 2205-2220.
Staden, R. (1984). Computer methods to locate signals in nucleic acid sequences. Nucl. Acids Res. 12, 505-519.

Appendix

We present here a more rigorous description of the GeneId algorithm described in Materials and Methods in the main text.

1. Starts, Donors, Acceptors and Stops

Let $L$ be a DNA sequence of length $l$ and $Z$ the set of integers. Let then:

$$Z_L^+ = \{x \in Z : 0 < x \le l\}.$$

In fact, $L$ is a function:

$$L: Z_L^+ \to \{A, C, G, T\}.$$

Let $a$, $d$, $b$ and $s$ be functions of $Z_L^+$ on $(0,1)$ verifying:

if $L(x-1) \ne G$ or $L(x-2) \ne A$, then $a(x) = 0$;

if $L(x+1) \ne G$ or $L(x+2) \ne T$, then $d(x) = 0$;

if $L(x) \ne A$ or $L(x+1) \ne T$ or $L(x+2) \ne G$, then $b(x) = 0$;

$$s(x) = \begin{cases} 1 & \text{if } L(x) = T \text{ and } L(x+1) = A \text{ and } L(x+2) = A, \\ 1 & \text{if } L(x) = T \text{ and } L(x+1) = A \text{ and } L(x+2) = G, \\ 1 & \text{if } L(x) = T \text{ and } L(x+1) = G \text{ and } L(x+2) = A, \\ 0 & \text{otherwise.} \end{cases}$$

Then, given thresholds $\bar{a}, \bar{d}, \bar{b} \in (0,1)$ we can define:

$A_L = \{x \in Z_L^+ : a(x) > \bar{a}\}$, $A_L$ is the set of acceptor sites for $L$;

$D_L = \{x \in Z_L^+ : d(x) > \bar{d}\}$, $D_L$ is the set of donor sites for $L$;

$B_L = \{x \in Z_L^+ : b(x) > \bar{b}\}$, $B_L$ is the set of start codons for $L$;

$S_L = \{x \in Z_L^+ : s(x) = 1\}$, $S_L$ is the set of stop codons for $L$.

If $x \in A_L$, $x$ is an acceptor site, $a(x)$ is the acceptor site score for $x$ and $\bar{a}$ is the threshold acceptor site score. Similarly for $d$ and $\bar{d}$, and $b$ and $\bar{b}$.

2. Exons

Let $l_e \in Z$ be a constant. $l_e$ is the minimum internal exon length.

A first exon is a triple $x = (x_1, x_2, x_3)$ with:

$x_1 \in B_L$,

$x_2 \in D_L$, such that $x_1 < x_2 < z$, where $z \in S_L$ is such that $x_1 \bmod 3 = z \bmod 3$ and, for all other $z' \in S_L$, $z' > x_1$, such that $z' \bmod 3 = x_1 \bmod 3$, $z < z'$,

$x_3 = (x_2 - x_1) \bmod 3$.

We will denote by $F_{0L}$ the set of first exons for $L$. Note that we are defining a first exon as a triple ("start codon", "donor site", "codon remainder"), such that there are no stop codons between them in the frame defined by "start codon".

An internal exon is a tuple $x = (x_1, x_2, x_3, x_4)$ with:

$x_1 \in A_L$,

$x_2 \in D_L$, such that $x_2 > x_1 + l_e$ and $x_2 < z$, where $z \in S_L$, $z > x_1$, is such that there are $z_1, z_2 \in S_L$ such that:

$$z_1, z_2 < z, \quad (1)$$

$$z_1 \bmod 3 \ne z_2 \bmod 3 \ne z \bmod 3, \quad (2)$$

and for all other $z' \in S_L$, $z' > x_1$, verifying (1) and (2), $z < z'$.

$x_3$ is a function $x_3: \{0, 1, 2\} \to \{0, 1\}$ defined:

$$x_3(y) = \begin{cases} 0 & \text{if there is } z \in S_L,\ x_1 < z < x_2, \text{ such that } (z - x_1) \bmod 3 = y, \\ 1 & \text{otherwise.} \end{cases}$$

$x_4 = (x_2 - x_1) \bmod 3$.

We will denote by $I_{0L}$ the set of internal exons for $L$. $x_3$ is a function such that $x_3(y) = 1$ if $x$ can be read in frame $y$ as defined by $x_1$, otherwise $x_3(y) = 0$. $x_4$ is the codon remainder.

A terminal exon is a triple $x = (x_1, x_2, x_3)$ with:

$x_1 \in A_L$,

$x_2 \in S_L$, such that $x_1 < x_2$ and there is not $z \in S_L$, $x_1 < z < x_2$, such that $x_2 \bmod 3 = z \bmod 3$,

$x_3 = (x_2 - x_1) \bmod 3$.

We will denote by $T_{0L}$ the set of terminal exons for $L$.

3. Filtered Exons

Let $f = (f_1, f_2, \ldots, f_{nf})$ be a tuple of real functions, $f_i: F_{0L} \to (0,1)$. Given $x \in F_{0L}$, $f_1(x) = b(x_1)$, $f_2(x) = d(x_2)$. $f_1(x)$ and $f_2(x)$ are the inherited properties of the exon $x$ (in our case, the start codon and donor site scores), while the rest of the values $f_3(x), \ldots, f_{nf}(x)$ are the specific properties. (In our case, $f_3(x)$ may, for instance, be the proportion of nucleotides A in exon $x$.)

Let $f_{11}, f_{12}, f_{21}, f_{22}, \ldots, f_{nf1}, f_{nf2}$ be pairs of real numbers such that, for all $i = 1, \ldots, nf$, $f_{i1} < f_{i2}$. Then we define recursively the sets:

$$F_L(0) = F_{0L}; \quad F_L(i) = \{x \in F_L(i-1) : f_{i1} < f_i(x) < f_{i2}\}, \quad i = 1, \ldots, nf.$$

$F_L = F_L(nf)$ is the set of filtered first exons. We define on $F_L$ the real function $f_s$, built as a linear combination of $f_1, f_2, \ldots, f_{nf}$. Then, given $x \in F_L$, $f_s(x)$ is the score for the first exon $x$.

Similarly, let $i = (i_1, i_2, \ldots, i_{ni})$ be a tuple of real functions, $i_j: I_{0L} \to (0,1)$, and let $i_{j1} < i_{j2}$ be a pair of real numbers for each $j = 1, \ldots, ni$. Then we define recursively the sets:

$$I_L(0) = I_{0L}; \quad I_L(j) = \{x \in I_L(j-1) : i_{j1} < i_j(x) < i_{j2}\}, \quad j = 1, \ldots, ni.$$

$I_L = I_L(ni)$ is the set of filtered internal exons. We define on $I_L$ the real function $i_s$, built as a linear combination of $i_1, i_2, \ldots, i_{ni}$. Then, given $x \in I_L$, $i_s(x)$ is the score for the internal exon $x$.

Let $t = (t_1, t_2, \ldots, t_{nt})$ be a tuple of real functions, $t_i: T_{0L} \to (0,1)$, and let $t_{i1} < t_{i2}$ be a pair of real numbers for each $i = 1, \ldots, nt$. Then we define recursively the sets:

$$T_L(0) = T_{0L}; \quad T_L(i) = \{x \in T_L(i-1) : t_{i1} < t_i(x) < t_{i2}\}, \quad i = 1, \ldots, nt.$$

$T_L = T_L(nt)$ is the set of filtered terminal exons. We define on $T_L$ the real function $t_s$, built as a linear combination of $t_1, t_2, \ldots, t_{nt}$. Then, given $x \in T_L$, $t_s(x)$ is the score for the terminal exon $x$.

4. Exon Classes

Let $l_i \in Z$ be a constant. $l_i$ is the minimum intron length.

Let $R_F$ be the relation on $F_L$ defined: $x R_F y$ ($x, y \in F_L$) if and only if:

there is not $z \in I_L \cup T_L$ such that $x_2 < z_1 - l_i < y_2$ (assuming, without loss of generality, that $x_2 < y_2$), and

$x_3 = y_3$.

Clearly, $R_F$ is an equivalence relation. The partition induced by $R_F$ in $F_L$, $[F_L]$, is the set of first exon classes for $L$. Given $x \in [F_L]$,

$$s(x) = \max_{j \in x} f_s(j)$$

is the score for the first exon class $x$. Note that the score of a given exon class is the highest score for the exons in the class.

Let $R_I$ be the relation on $I_L$ defined: $x R_I y$ ($x, y \in I_L$) if and only if:

there is not $z \in I_L \cup T_L$ such that $x_2 < z_1 - l_i < y_2$ (assuming, without loss of generality, that $x_2 < y_2$);

there is not $z \in F_L \cup I_L$ such that $x_1 < z_2 + l_i < y_1$ (assuming, without loss of generality, that $x_1 < y_1$); and

$x_4 = y_4$, $x_3 = y_3$.

$R_I$ is an equivalence relation. The partition induced by $R_I$ in $I_L$, $[I_L]$, is the set of internal exon classes for $L$. Given $x \in [I_L]$,

$$s(x) = \max_{j \in x} i_s(j)$$

is the score for the internal exon class $x$.

Let $R_T$ be the relation on $T_L$ defined: $x R_T y$ ($x, y \in T_L$) if and only if:

there is not $z \in F_L \cup I_L$ such that $x_1 < z_2 + l_i < y_1$ (assuming, without loss of generality, that $x_1 < y_1$), and

$x_3 = y_3$.

$R_T$ is an equivalence relation. The partition induced by $R_T$ in $T_L$, $[T_L]$, is the set of last exon classes for $L$. Given $x \in [T_L]$,

$$s(x) = \max_{j \in x} t_s(j)$$

is the score for the terminal exon class $x$. Note that the exons in a given class have the same frame and codon remainder. They do not have the same boundaries. If $x$ is an exon, $[x]$ is the exon class to which $x$ belongs.

5. Genes

We first introduce the concept of initial gene segment, which is defined recursively:

(i) If $x = (x_1, x_2, x_3)$ is a first exon, then $[x]$ is an initial gene segment. $x_2$ is the termination and $x_3$ is the remainder of the initial gene segment.

(ii) If $g$ is an initial gene segment with termination $g_t$ and remainder $g_r$, and $x$ is an internal exon $(x_1, x_2, x_3, x_4)$ such that:

$$x_1 \ge g_t + l_i, \qquad x_3((3 - g_r) \bmod 3) = 1,$$

then the sequence $g, [x]$ is an initial gene segment with termination $x_2$ and remainder $(g_r + x_4) \bmod 3$.

A gene is a sequence $g, [x]$, where $g$ is an initial gene segment, with termination $g_t$ and remainder $g_r$, and $x = (x_1, x_2, x_3)$ is a terminal exon such that:

$$x_1 \ge g_t + l_i, \qquad (g_r + x_3) \bmod 3 = 0.$$

We denote by $G_L$ the set of all genes for the sequence $L$.
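The recursive definition of initial gene segments and genes can be sketched as follows. This is a minimal illustration under stated assumptions, not the GeneId implementation: it works on individual exons rather than on exon classes, it represents the function $x_3$ of an internal exon as a tuple of three 0/1 flags indexed by frame, and the names `assemble_genes` and `min_intron` are hypothetical. It also assumes $x_1 < x_2$ for every internal exon and a non-negative minimum intron length, so that the recursion always terminates.

```python
def assemble_genes(first_exons, internal_exons, terminal_exons, min_intron):
    """Enumerate genes as lists of exons following the rules of section 5.

    first_exons:    tuples (x1, x2, x3), x3 being the codon remainder
    internal_exons: tuples (x1, x2, x3, x4), x3 a tuple of three 0/1 flags
                    (x3[y] == 1 if the exon is open in frame y), x4 the remainder
    terminal_exons: tuples (x1, x2, x3), x3 being the codon remainder
    min_intron:     minimum intron length (l_i in the text)
    """
    genes = []

    def extend(segment, g_t, g_r):
        # Close the segment with every compatible terminal exon.
        for exon in terminal_exons:
            x1, _x2, x3 = exon
            if x1 >= g_t + min_intron and (g_r + x3) % 3 == 0:
                genes.append(segment + [exon])
        # Grow the segment with every compatible internal exon.
        for exon in internal_exons:
            x1, x2, x3, x4 = exon
            if x1 >= g_t + min_intron and x3[(3 - g_r) % 3] == 1:
                extend(segment + [exon], x2, (g_r + x4) % 3)

    for exon in first_exons:
        _x1, x2, x3 = exon
        extend([exon], x2, x3)  # rule (i): termination x2, remainder x3
    return genes


# Toy coordinates, chosen only to exercise the rules.
first = (10, 50, 1)
internal = (120, 200, (1, 1, 1), 2)
terminal_a = (300, 360, 0)
terminal_b = (150, 212, 2)

for gene in assemble_genes([first], [internal], [terminal_a, terminal_b], 60):
    print(gene)
```

Because every compatible combination is enumerated, the number of models grows quickly with the number of candidate exons, which is consistent with the large model counts reported in Tables 7 and 8 and with the motivation for grouping exons into classes.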

Edited by F. Cohen