BIOINFORMATICS
Vol. 18 no. 7 2002 Pages 1004–1010
Determining a unique defining DNA sequence for yeast species using hashing techniques

Jan-Jaap Wesselink 1,∗, Beatriz de la Iglesia 1, Stephen A. James 3, Jo L. Dicks 2, Ian N. Roberts 3 and Vic J. Rayward-Smith 1

1 School of Information Systems, University of East Anglia, Norwich NR4 7TJ, UK, 2 Bioinformatics Research Group, John Innes Centre, Norwich Research Park, Colney, Norwich NR4 7UH, UK and 3 National Collection of Yeast Cultures, Institute of Food Research, Norwich Research Park, Colney, Norwich NR4 7UA, UK
Received on October 31, 2001; revised on February 6, 2002; accepted on February 12, 2002
ABSTRACT
Motivation: Yeasts are often still identified with physiological growth tests, which are both time-consuming and unsuitable for detection of a mixture of organisms. Hence, there is a need for molecular methods to identify yeast species.
Results: A hashing technique has been developed to search for unique DNA sequences in 702 26S rRNA genes. A unique DNA sequence has been found for almost every yeast species described to date. The locations of the unique defining sequences are in accordance with the variability map of large subunit ribosomal RNA and provide detail of the evolution of the D1/D2 region. This approach will be applicable to the rapid identification of unique sequences in other DNA sequence sets.
Availability: Freely available upon request from the authors.
Supplementary information: Results are available at http://www.sys.uea.ac.uk/~jjw/project/paper
Contact:
[email protected]
INTRODUCTION
Current methods for yeast identification are often time-consuming (and hence expensive), as they rely on laboratory tests for growth and biochemical properties. These methods require purification of the target organism and are hence unsuitable for the identification of a mixture of organisms. In addition, these tests may not be species-specific, and strains of the same species may differ in key characteristics (Kurtzman and Robnett, 1998; Mannarelli and Kurtzman, 1998). There is a need for new methods that can rapidly and accurately identify yeast species.

Specific oligonucleotides have already been used in a PCR method for the identification of 14 human pathogenic yeast species (Mannarelli and Kurtzman, 1998). This raises the question of whether it is possible to find a short, specific DNA sequence for every yeast species. Such DNA sequences would facilitate the design of species-specific PCR primers for the purpose of identifying yeasts. Additionally, knowledge of distinguishing features of yeast species at the molecular level may provide a deeper understanding of the biology of these organisms and the evolution of the gene being studied. A unique defining sequence to identify a yeast species can thus be regarded as a sequence pattern which is diagnostic for a specific DNA sequence of this species.

Yeast species, with the exception of Saccharomyces cerevisiae and a few other organisms (model organisms or human pathogens), are largely characterized only with respect to their biochemical and physiological properties. Recently, the partial genomes of 13 additional yeast species have become available (Souciet et al., 2000), but this comprises only a small fraction of the total number of yeast species (678 recognized yeast species are described in Barnett et al. (2000)). However, for most species the 26S rRNA gene is either entirely or partially sequenced, as this gene is used to infer phylogenetic relationships between species (Kurtzman and Robnett, 1998; Fell et al., 2000). Where the 26S rDNA is only partially sequenced, the partial sequence consists of the variable D1/D2 domain. Kurtzman and Robnett have shown that most ascomycetous yeast species can be identified from sequence divergence in this domain and that the D1/D2 domain is sufficient to infer phylogenetic relationships between species (Kurtzman and Robnett, 1998). The D1/D2 domain sequences for most basidiomycetous yeast species have recently become available as well (Fell et al., 2000), which means that this sequence is available for almost all currently described yeast species.

∗ To whom correspondence should be addressed.
© Oxford University Press 2002
A unique defining DNA sequence for yeast species
The D1 and D2 domains are located approximately in the first 650 bases of the 26S rRNA gene (Figure 3) and are two rapidly evolving regions that account for most of the sequence divergence amongst 26S rDNA sequences. Therefore, the D1/D2 domains were used to search for deterministic patterns for yeast species. Hashing techniques were employed, since conventional approaches based on sequence alignments were found to be too slow for larger data sets (of more than 30 sequences). Recently, hashing techniques have been applied to protein structural data (Leibowitz et al., 2001; Shatsky et al., 2000) and to DNA sequence data (Vincens et al., 1998; Buhler, 2001; Leung et al., 1991). However, these applications involved alignment problems of long (genomic) sequences and searches for structural motifs. In this paper, we describe the application of hashing techniques to find unique, defining DNA sequences to identify yeast species.
PROBLEM DEFINITION
DNA sequences are represented as strings over the alphabet Σ = {A, C, G, T}, and the words sequence and string are used interchangeably. Let S be the set of sequences available. A sequence in S will be denoted by α, and the length of α by |α|. α is indexed from 0 to |α| − 1. A subsequence of α, indexed from i to j, will be denoted α[i..j], where 0 ≤ i ≤ j < |α|. For every sequence α ∈ S, the task is to find the shortest contiguous subsequence α[i..j] that uniquely distinguishes α in S. The solution to this problem is a set of patterns Π. Let π_α ∈ Π be the pattern found as a solution for a string α ∈ S. For all α ∈ S, |π_α| ≤ |α|, but π_α may vary in length for different α ∈ S.

SYSTEM AND METHODS
The D1/D2 domain sequences as published in Kurtzman and Robnett (1998) and Fell et al. (2000) were downloaded from the EMBL and GenBank databases, using the Sequence Retrieval System (http://srs6.ebi.ac.uk) and Batch-Entrez (http://www4.ncbi.nlm.nih.gov/Entrez/batch.html). Sequences of synonymous species were removed (Barnett et al., 2000). This resulted in a set S of 702 D1/D2 sequences.

For each D1/D2 sequence in S, each subsequence α[i..j] of a certain length k was considered. Every occurrence of this subsequence was scored in a table. Hashing techniques were employed to map every subsequence α[i..j] to an index in the hash table. As a first step, the subsequences α[i..j] are encoded as numbers with an encoding function e:

e : {A, C, G, T} → {2, 3, 4, 5}; A → 2, C → 3, G → 4, T → 5.
An index in the hash table was then computed by applying the hashing function, h_1, to α[i..j]. h_1 is defined as follows:

$$h_1 : \{A, C, G, T\}^k \to \mathbb{Z}^+, \qquad h_1(x) = \prod_{i=0}^{k-1} P_i^{\,e(x[i])} \bmod M$$
where k = |x|, P_i is the ith prime number and M is the size of the hash table. At every position i, the character x[i] of the subsequence x is encoded using function e. The ith prime number, P_i, is then raised to the power of the result of e. This number is then multiplied by the previous result and divided by the size of the hash table M, the remainder being kept at every step. The result of this hashing function depends on every character of α[i..j], which helps to minimize collisions, as is desirable for a good hashing function (Knuth, 1998). M is chosen to be a large prime number, as was suggested as early as 1956 (Dumey, 1956). Division by a prime number has been found to be a preferred hashing method (Lum et al., 1971; Maurer and Lewis, 1975; Knuth, 1998). The use of a prime number further reduces collisions by promoting an even distribution over the hash table. The simplicity of this hashing function also ensures that it is computed relatively quickly.

For a set S of |S| strings α of length n, there are |S|(n − k + 1) substrings of length k. For k = 4, ..., 100, n = 600 and |S| = 702, the number of substrings varies between 419 094 (k = 4) and 351 702 (k = 100). M has been chosen to be the prime number 815 669, so that the hash table is filled to approximately 51% at most. This is close to the 50% load factor that is generally viewed as the maximum for best performance of a hash table.

We use the 'add-the-hash re-hash' form of open addressing for collision resolution. Open addressing was first described by Peterson (1957) and is reviewed by Knuth (1998). In the 'add-the-hash re-hash' scheme, the hash value, h, is set to 2h if the hash table is occupied at position h. If it is occupied at position 2h, it is set to 3h, and so on. The computing time of this algorithm depends on the number of accesses to the hash table. Therefore, when the number of collisions is kept to a minimum, the algorithm described here is O(nm), where in this expression n is the number of DNA sequences and m is the length of the DNA sequences.

This was implemented in C++ and used to search for unique defining sequences for yeast species. The applications were run on a Dell Dimension XPS T700r PC with an Intel Pentium III 700 MHz processor and 256 MB RAM. Windows NT Version 4 was used as the operating system.
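The hashing function h_1 and the 'add-the-hash re-hash' scheme described above can be illustrated with a minimal Python sketch. This is not the authors' C++ implementation. It assumes that the primes are numbered from P_0 = 2, that the probe sequence h, 2h, 3h, ... is taken modulo M, and that a hash value of 0 falls back to a step of 1; none of these details is stated in the text. The names first_primes, substring_count and KmerTable are hypothetical.

```python
M = 815_669  # hash table size used in the paper (a prime number)
ENCODE = {"A": 2, "C": 3, "G": 4, "T": 5}  # encoding function e


def first_primes(count):
    """Return the first `count` primes by trial division (assumes P_0 = 2)."""
    primes = []
    candidate = 2
    while len(primes) < count:
        if all(candidate % p for p in primes if p * p <= candidate):
            primes.append(candidate)
        candidate += 1
    return primes


PRIMES = first_primes(100)  # enough for subsequences of length k <= 100


def h1(x, m=M):
    """Product over i of P_i ** e(x[i]), keeping the remainder mod m at each step."""
    value = 1
    for i, base in enumerate(x):
        value = (value * pow(PRIMES[i], ENCODE[base], m)) % m
    return value


def substring_count(num_seqs, n, k):
    """Number of length-k substrings in num_seqs strings of length n."""
    return num_seqs * (n - k + 1)


class KmerTable:
    """Open-addressing table that counts occurrences of each subsequence."""

    def __init__(self, size=M):
        self.size = size  # should be prime so that every slot can be probed
        self.keys = [None] * size
        self.counts = [0] * size
        self.accesses = 0
        self.collisions = 0

    def add(self, kmer):
        # Assumption: a step of 0 would never advance, so fall back to 1.
        step = h1(kmer, self.size) or 1
        pos = step
        for _ in range(self.size):
            self.accesses += 1
            if self.keys[pos] is None:
                self.keys[pos] = kmer
            if self.keys[pos] == kmer:
                self.counts[pos] += 1
                return pos
            self.collisions += 1
            pos = (pos + step) % self.size  # visits h, 2h, 3h, ... (mod size)
        raise RuntimeError("hash table is full")
```

With these definitions, substring_count(702, 600, 4) gives the 419 094 substrings quoted above, and dividing that figure by M gives the load factor of roughly 51%. The collision ratio reported in the Results corresponds to collisions divided by accesses.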
J.-J.Wesselink et al.
RESULTS
In an ideal world with an efficient hashing algorithm, we would be able to test whether a given sequence has been seen before with a single access to the hash table. This would outperform alternative techniques such as suffix trees (Waterman, 2000). In practice, however, the ideal is not achievable and collisions occur when entries are placed in the table. A good hashing function will keep such collisions to a minimum. When accessing the hash table we may get zero, one or more collisions. The number of collisions dictates the efficiency of the algorithm, so we have undertaken studies to determine how often an access to the table results in a collision and hence generates a second access. In Table 1 our hashing function, h_1, is compared with an alternative hashing function, h_2, found in Waterman (2000):

$$h_2(x) = \sum_{i=0}^{k-1} e'(x[i])\, d^{\,k-1-i} \bmod M,$$

using the encoding function e′ : {A, C, G, T} → {0, 1, 2, 3}. Here d is the base of the encoding; the base-4 encoding described in the Discussion corresponds to d = 4.

Table 1. Typical performance values of hash table and hashing functions

k          Occupation (%)a   h_1 b   h_2 b
6          0.5               0.00    0.00
10         7.8               0.03    0.01
17         13.1              0.07    0.15
50         24.0              0.16    0.94
66         26.3              0.38    0.96
100        28.2              0.19    0.98
Averagec   22.0              0.14    0.79

a Percentage occupation of the hash table. b Collision ratio. c Average for k = 6, ..., 100.

Some typical performance values of the hashing functions and the hash table are listed in Table 1. The performance of the hashing functions is expressed as the collision ratio, which is defined as the number of collisions divided by the number of accesses to the table at each value of k. The performance of the hash table is shown as the percentage of occupation of the table at each value of k. For small k, no collisions occur for either hashing function. Collisions begin to occur at k = 8 for h_1 (data not shown) and at k = 10 for h_2. h_1 begins to outperform h_2 at k = 17 and higher. The average occupation of the hash table is 22%.

The application was set to search for unique sequences of increasing length. Length 4 was chosen as a starting point, although it was not expected that unique sequences would be found at this length. Theoretically, subsequences of length 5 could be sufficient to describe 702 strings over an alphabet of four characters (4^5 = 1024 possible combinations), but this depends of course on the degree of variation between the sequences. The results of searching for unique defining sequences up to length 8 are displayed in Figure 1. With sequences of up to length 8, the majority of the species (83%) could be identified. At length 35, 94% (660 species) could be identified. The names of the species that can be identified with sequences of length 35 can be viewed at the website containing the appendix to this paper (see http://www.sys.uea.ac.uk/~jjw/project/paper). With sequences of length 100, 96% of the yeast species can be identified. The results for unique sequences up to length 100 are shown in Figure 2.

Fig. 1. The number of species that can be identified by a unique subsequence of the 26S rDNA of increasing length.

Fig. 2. The number of species that can be identified by a unique subsequence of the 26S rDNA of increasing length.

It is also interesting to search for short strings that occur in very few (two or three) D1/D2 sequences, rather than in just one single D1/D2 sequence. If we look for sequences of length 35, we have 660 unique defining sequences. However, there are a further 22 species for which we found a sequence of length 8 that occurs in that species and in just one other species in our list. These species and the species with which they group together are listed in Table 2, group A. In eight instances, the other species can be identified with a unique sequence (the species occurs in the table in the appendix). The remaining species form pairs whose two members cannot be distinguished from each other. With a similar approach, we have tried to cluster the remaining unidentified species in groups of three and four. Table 2, group B, contains three groups of three species that can be completely distinguished from one another. The others form a group of three species that share a sequence of length 8. With sequences of length 8 that occur in four D1/D2 sequences, all remaining D1/D2 sequences, except two, could be placed in a group of four (Table 2, group C). These two sequences belonged to the two varieties of Debaryomyces hansenii (i.e. D. hansenii var. hansenii and D. hansenii var. fabryi), and an alignment revealed that they differed in only three positions over their entire length. The species in this group, except for the ones that are listed in the web appendix, cannot be distinguished with a short (≤35 bp) DNA sequence.

Figure 3 shows which regions in the D1/D2 molecule are highly variable and which regions are more conserved. In accordance with Ben Ali et al. (1999), the first 50 positions are highly conserved, as few unique sequences are found in this region. The regions around positions 70, 101 and particularly 173 are very variable. A more conserved region is found again between positions 200 and 350, followed by a more variable region ending at position 577.
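The search over increasing lengths described in this section can be sketched as follows. This is a simplified illustration only: it uses a Python dict as a stand-in for the hash table of the Methods, treats a subsequence as unique when it occurs in exactly one sequence of the set (an assumption about how occurrences are scored), and uses hypothetical names and toy data throughout.

```python
def shortest_unique_substrings(seqs, k_min=4, k_max=100):
    """Map each sequence name to (k, start, substring) for its shortest
    substring that occurs in no other sequence, or to None if none is found."""
    found = {name: None for name in seqs}
    for k in range(k_min, k_max + 1):
        pending = [name for name in seqs if found[name] is None]
        if not pending:
            break
        # owners[kmer] is the set of sequences that contain that k-mer
        owners = {}
        for name, seq in seqs.items():
            for i in range(len(seq) - k + 1):
                owners.setdefault(seq[i:i + k], set()).add(name)
        # report the leftmost k-mer owned by a single sequence
        for name in pending:
            seq = seqs[name]
            for i in range(len(seq) - k + 1):
                if len(owners[seq[i:i + k]]) == 1:
                    found[name] = (k, i, seq[i:i + k])
                    break
    return found


# Toy data: sp1 and sp3 are identical, so neither can be identified.
toy = {"sp1": "ACGTACGT", "sp2": "ACGTTCGT", "sp3": "ACGTACGT"}
result = shortest_unique_substrings(toy, k_min=2, k_max=8)
```

In this toy set, sp2 is identified by a subsequence of length 2, whereas the two identical sequences remain unidentified at every length, which mirrors the situation of near-identical D1/D2 sequences discussed above. Recording the start position of each unique subsequence is also what a position histogram such as Figure 3 would be built from.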
DISCUSSION
Our results show that hashing can be applied to find unique DNA sequences quickly. When comparing the performance of our hashing function, h_1, to h_2, which was found in Waterman (2000), we note that h_2 performs better for low values of k. For short sequences, however, hashing is not really necessary. When one encodes the sequences as a base-4 number, as in h_2, one can simply compute the corresponding decimal number and use this as an index into an array. Only when searching for sequences longer than nine bases is it necessary for the number to be taken modulo the size of the hash table, as 4^10 = 1 048 576 is larger than 815 669. When searching for longer sequences (k ≥ 17), h_1 outperforms h_2 by far, with an average collision ratio of 0.14 compared to 0.79 for h_2. In this study, we have used the simple 'add-the-hash' re-hash. We are currently investigating alternative rehashing functions to further reduce collisions. The average occupation of the hash table of 22% was less than initially expected (50%), because not all possible DNA sequences occur. As a technique, hashing will outperform suffix tree techniques because it takes longer to traverse a tree than to access a position in a hash table which has few collisions.

Fig. 3. Histogram showing the number of unique sequences of length 8 found at every position. The locations of the D1 and D2 domains of S. cerevisiae are indicated between the dashed lines (Hassouna et al., 1984).

When designing primers, one needs to consider melting temperature, GC content, self-annealing, etc., which have not been considered in this study. Our linear time algorithm finds all possible primers of a certain length for the entire set of DNA sequences. This compares favorably with other primer design algorithms, such as the polynomial time algorithms described in Kaempke et al. (2001) and the suffix tree approach mentioned in Setubal and Meidanis (1997).

Most yeast species can be identified by a relatively short (eight bases) oligonucleotide. The fact that this is longer than the theoretical minimum (five bases) is due to the high degree of sequence identity between the D1/D2 sequences. However, it is still remarkable, as it is much shorter than the oligonucleotides that are used, for example, as PCR primers (typically 17–25 bases). Nevertheless, it is still possible to use the sequences found in this exercise for the design of PCR primers. One can simply extend the sequences of length 7 and 8 with their corresponding 26S rDNA sequence to the desired length, retaining the unique sequence at the 3′ end of the PCR primer. Therefore, these results are an important step towards the development of a molecular method for the global identification of all yeast species, including mixed cultures, in a single step.

The last 14 species to be identified included very similar D1/D2 sequences, such as those from the different varieties of Williopsis saturnus and Candida santamariae (Table 2, group C). The D1/D2 sequences of the two varieties of Candida santamariae differ in two positions, whereas the sequences of the five varieties of Williopsis saturnus differ in only one position. The sequences that are found for identification of these species are identical for all the varieties of these species, resulting in groups of up to four species or species varieties that cannot be distinguished using this approach. Similarly, results obtained with data on physiological tests that are commonly used to identify yeast species indicated that certain groups of species cannot be distinguished satisfactorily from each other (De la Iglesia et al., 2001). This suggests that there is a high degree of similarity between these currently accepted yeast species, which needs to be re-examined.

Table 2. Groups of species that can be identified with a sequence of up to length 9.

Species name                                   Accession No.a   Sequence

Group Ac
Bullera armeniaca                              AF189883         actcccg
Cryptococcus hungaricus                        AF075503
Bullera pseudoalba                             AF075504         cgtcgcc
Cryptococcus cellulolyticus                    AF075525
Candida cellulolytica                          U94928           attcctaa
Candida succiphila                             U70189
Candida vini                                   U70247           caatataa
Pichia fluxuum                                 U75719
Cryptococcus kuetzingii                        AF181504         acctccct
Cryptococcus albidus                           AF075474
Cryptococcus liquefaciens                      AF181513         tggcgtca
Cryptococcus vishniacii                        AF075473
Cryptococcus saitoi                            AF181512         acccccaa
Cryptococcus friedmannii                       AF075478
Debaryomyces vanrijiae var. vanrijiae          DVU45842         gtggtgga
Debaryomyces vanrijiae var. yarrowii           DVU45843
Filobasidiella neoformans var. neoformans      AF075484         tagcgcaa
Filobasidiella neoformans var. bacillispora    AF075526
Pichia stipitis                                PSU45741         gggcctcta
Pichia segobiensis                             PSU45742
Rhodotorula hinnulea                           AF190003         tggtaagca
Rhodotorula phylloplana                        AF190004
Rhodotorula pustula                            AF189964         ggatggaa
Rhodotorula buffonii                           AF189924
Sporobolomyces coprosmae                       AF189980         gtgagccg
Sporobolomyces oryzicola                       AF189990
Sporobolomyces phyllomatis                     AF189991         cagcgcga
Rhodotorula armeniaca                          AF189920
Trichosporon asteroides                        AF075513         ttccccgga
Trichosporon faecale                           AF105395
Uniqueb (row assignment not recoverable): + + + + + + + +

Group Bc
Candida shehatae var. lignosa                  CSU45772         atcctaaa
Candida shehatae var. shehatae                 CSU45761
Candida shehatae var. insectosa                CSU45773
Leucosporidium scottii                         AF070419         tcatcttc
Rhodotorula creatinivora                       AF189925
Rhodotorula vanillica                          AF189970
Pichia toletana                                U75720           gatccaaa
Pichia xylosa                                  U75718
Pachysolen tannophilus                         U76346
Rhodotorula graminis                           AF070431         tgacggtt
Rhodosporidium babjevae                        AF070420
Rhodosporidium diobovatum                      AF070421
Saccharomyces bayanus                          U94931           accgttcc
Saccharomyces pastorianus                      U68547
Kluyveromyces africanus                        U68550
Uniqueb (row assignment not recoverable): + + + + + + + +

Group Cc
Candida krissii                                CKU45853         cgttccc
Candida santamariae var. membranifaciens       CSU45785
Candida santamariae var. santamariae           CSU45794
Candida zeylanoides                            CZU45832
Filobasidium elegans                           AF181548         cgtagta
Filobasidium floriforme                        AF075498
Cryptococcus magnus                            AF181851
Cryptococcus oeirensis                         AF181519
Pichia fabianii                                U73573           ctaccgaa
Pichia veronae                                 U73576
Pichia amylophila                              U73577
Pichia mississippiensis                        U74597
Williopsis saturnus var. mrakii                U94929           tacctgga
Williopsis saturnus var. sargentensis          U94936
Williopsis saturnus var. saturnus              U75958
Williopsis saturnus var. suaveolens            U94930
Uniqueb (row assignment not recoverable): + + + -

a GenBank/EMBL accession number. b This species is (+) or is not (-) identified by a unique defining DNA sequence (see web appendix). c Groups of 2, 3 or 4 species are listed in group A, B or C respectively.

The locations of the unique defining sequences (of length 8) found in this exercise correspond to the variability map of the eukaryotic large subunit ribosomal RNA constructed by Ben Ali et al. (1999), in which conserved and variable sites have been projected onto the secondary structure of this molecule. That map was constructed using an alignment of 77 complete large subunit ribosomal RNA sequences taken from a wide variety of eukarya. In this study, 702 yeast D1/D2 domain sequences were used. These yeast species follow the large subunit RNA variability map for the region of the molecule that was studied. Thus hashing techniques, as employed here, offer a rapid alternative to sequence alignment for knowledge discovery in large DNA data sets.
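The primer extension step mentioned in the Discussion (extending a short unique sequence with its flanking 26S rDNA while keeping the unique sequence at the 3′ end) can be sketched as below. This is an illustrative sketch under stated assumptions: the sequence is read 5′ to 3′ as written, only the first occurrence of the unique sequence is used, and melting temperature, GC content and self-annealing are ignored, as in the study. The function name extend_to_primer and the example sequence are hypothetical.

```python
def extend_to_primer(sequence, unique, primer_len=20):
    """Extend `unique` upstream along `sequence` so that the returned primer
    has length `primer_len` and ends (3' end) with `unique`."""
    if primer_len < len(unique):
        raise ValueError("primer_len is shorter than the unique sequence")
    start = sequence.find(unique)  # first occurrence only (assumption)
    if start < 0:
        raise ValueError("unique sequence not found in source sequence")
    end = start + len(unique)
    if end < primer_len:
        raise ValueError("not enough upstream sequence to reach primer_len")
    return sequence[end - primer_len:end]


# Made-up example sequence; the unique 8-mer starts at index 12.
example = "AAAACCCCGGGGTTTTACGTACGTAC"
primer = extend_to_primer(example, "TTTTACGT", primer_len=20)
```

A candidate produced this way would still need to be checked against the usual primer design criteria before use.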
ACKNOWLEDGEMENTS
This work was supported by BBSRC Grant Ref. No. 83/BIO 12037.

REFERENCES
Barnett,J.A., Payne,R.W. and Yarrow,D. (2000) YEASTS: Characteristics and Identification, 3rd edn, Cambridge University Press.
Ben Ali,A., Wuyts,J., De Wachter,R., Meyer,A. and Van de Peer,Y. (1999) Construction of a variability map for eukaryotic large subunit ribosomal RNA. Nucleic Acids Res., 27, 2825–2831.
Buhler,J. (2001) Effective large-scale sequence comparison by locality-sensitive hashing. Bioinformatics, 17, 419–428.
Dumey,A.I. (1956) Indexing for rapid random-access memory systems. Computers and Automation, 5, 6–9.
Fell,J.W., Boekhout,T., Fonseca,A., Scorzetti,G. and Statzell-Tallman,A. (2000) Biodiversity and systematics of basidiomycetous yeasts as determined by large-subunit rDNA D1/D2 domain sequence analysis. International Journal of Systematic and Evolutionary Microbiology, 50, 1351–1371.
Hassouna,N., Michot,B. and Bachellerie,J-P. (1984) The complete nucleotide sequence of mouse 28S rRNA gene. Implications for the process of size increase of the large subunit rRNA in higher eukaryotes. Nucleic Acids Res., 12, 3563–3583.
De la Iglesia,B., Rayward-Smith,V.J. and Wesselink,J.J. (2001) Classification/Identification on Biological Databases. In de Souza,J.P. (ed.), Proc. MIC2001, 4th International Metaheuristics Conference. Porto, Portugal.
Kaempke,T., Kieninger,M. and Mecklenburg,M. (2001) Efficient primer design algorithms. Bioinformatics, 17, 214–225.
Knuth,D.E. (1998) Sorting and Searching, vol. 3 of The Art of Computer Programming, 2nd edn, Addison Wesley.
Kurtzman,C.P. and Robnett,C.J. (1998) Identification and phylogeny of ascomycetous yeasts from analysis of nuclear large subunit (26S) ribosomal DNA partial sequences. Antonie Van Leeuwenhoek, 73, 331–371.
Leibowitz,N., Fligelman,Z.Y., Nussinov,R. and Wolfson,H.J. (2001) Automated multiple structure alignment and detection of a common substructural motif. Proteins, 43, 235–245.
Leung,M.Y., Blaisdell,B.E., Burge,C. and Karlin,S. (1991) An efficient algorithm for identifying matches with errors in multiple long molecular sequences. J. Mol. Biol., 221, 1367–1378.
Lum,V.Y., Yuen,P.S.T. and Dodd,M. (1971) Key-to-address transformation techniques: a fundamental study on large existing formatted files. Communications of the ACM, 14, 228–239.
Mannarelli,B.M. and Kurtzman,C.P. (1998) Rapid identification of Candida albicans and other human pathogenic yeasts by using short oligonucleotides in a PCR. J. Clin. Microbiol., 73, 1634–1641.
Maurer,W.D. and Lewis,T.G. (1975) Hash Table Methods. Computing Surveys, 7, 5–19.
Peterson,W.W. (1957) Addressing for random-access storage. IBM Journal of Research and Development, 1, 130–146.
Setubal,J. and Meidanis,J. (1997) Introduction to Computational Molecular Biology. PWS Publishing Company.
Shatsky,M., Fligelman,Z.Y., Nussinov,R. and Wolfson,H.J. (2000) Alignment of flexible protein structures. Proceedings of the International Conference on Intelligent Systems in Molecular Biology, 8, pp. 329–343.
Souciet,J.L., Aigle,M., Artiguenave,F., Blandin,G., Bolotin-Fukuhara,M., Bon,E., Brottier,P., Casaregola,S., de Montigny,J. et al. (2000) Genomic exploration of the Hemiascomycetous yeasts: 1. a set of yeast species for molecular evolution studies. FEBS Lett., 487, 3–12.
Vincens,P., Buffat,L., Andre,C., Chevrolat,J.P., Boisvieux,J.F. and Hazout,S. (1998) A strategy for finding regions of similarity in complete genome sequences. Bioinformatics, 14, 715–725.
Waterman,M.S. (2000) Introduction to Computational Biology. Chapman and Hall/CRC.