
Parsimony-Spaced Suffix Trees for DNA Sequences

Yun-Ching Chen and Suh-Yin Lee
Department of Computer Science and Information Engineering, National Chiao-Tung University, 1001 Ta-Hsueh Rd, Hsinchu, Taiwan

Abstract

In recent years, bioinformatics has become an important research field because there is more and more genetic data to be analyzed. The suffix tree is a powerful data structure for string analysis and has many applications in bioinformatics. Its linear construction time, linear construction space and short search time make it very attractive. However, its large space consumption is a serious drawback, especially when suffix trees are used to handle large amounts of DNA sequence data. In this paper, we utilize some characteristics of DNA sequences to reduce the space requirement of suffix trees. A new bit layout is proposed for the nodes of a suffix tree, which requires less space than existing layouts. We also use an index table, called the “prefix table”, which reduces the number of internal nodes in suffix trees. In addition, we propose a preprocessing technique based on our data structure to improve the construction time. The experiments show that the proposed method is the most space-parsimonious implementation of suffix trees for DNA sequences and that it also performs well in construction time.

1. Introduction

In recent years, because of the need to handle large amounts of genetic and biochemical data, bioinformatics [1] has become an increasingly important research field. The goal of the post-genome era is to annotate these long genetic codes, and the most basic and important algorithms for these DNA sequences are string processing methods. Some related algorithms can be found in [2]. A suffix tree is a data structure that exposes the internal structure of a string in a deeper way than other fundamental preprocessing methods do. Suffix trees can be used to solve the exact matching problem in linear time (achieving the same worst-case bound as the KMP [3] and Boyer-Moore [4] algorithms), but their real virtue comes from their use in linear-time solutions to many string problems more complex than exact matching. If the input sequence is of length m and the query pattern is of length n, then the O(m) preprocessing and O(n) search result for the substring problem is very surprising and extremely useful. That bound is not achievable by the KMP or Boyer-Moore methods, but it is achievable by suffix trees. Because m may be huge compared to n, especially for DNA sequences, those algorithms would be impractical there.

In recent years many complex repeat-finding problems have been solved with suffix trees [2], and more and more people use suffix trees to deal with huge DNA sequences. For example, MUMmer [5, 6], a genome alignment system, uses suffix trees as its main structure to align two closely related genomes and then to identify important features of them. Compared with other genome alignment systems, MUMmer provides a faster, simpler, and more systematic way to solve this difficult problem, and all of these advantages come from the versatility of suffix trees.

Despite these superior features, suffix trees have not seen widespread use in string processing software. There are two main reasons for this. First, suffix trees have a reputation for being very greedy for space. As a result of this drawback, several people have developed alternative index structures which store less information than suffix trees and are therefore more space efficient, such as the suffix array [7], the level compressed trie [8], the suffix binary search tree [9] and the suffix cactus [10]. However, Kurtz [11] points out that these four index structures have two properties in common: (1) They are not nearly as versatile and efficient as suffix trees. (2) The direct construction methods for these index structures do not run in linear worst-case time. Therefore, suffix trees cannot be replaced easily. The second reason is that traditional string algorithms were sufficient to solve most simple string problems on short input strings. However, DNA sequences are too complex to be handled by those traditional string algorithms, so suffix trees are becoming more and more important in bioinformatics. For these two reasons, reducing the space requirement of suffix trees for DNA sequences is an important problem in the post-genome era.

Since the suffix tree was proposed by Weiner [12] in 1973, there have been many improvements. Manber and Myers [7] state that their implementation occupies between 18.8n and 22.4n bytes of space, where n is the length of the input string, and they also point out that the most space-parsimonious implementation of suffix trees is based on linked lists. McCreight [13] proposed a space-efficient implementation requiring 28n bytes in the worst case, and Kurtz [11] slightly improved this bound to 24n bytes in the worst case and 18.8n bytes in the general case. Kurtz also combined some properties of suffix trees with bit optimization techniques to reduce the space requirement to 10.1n bytes in the general case, but 12.6n bytes for DNA sequences [11]. All of these improvements target the general case. However, the recent importance of suffix trees comes from complex and huge DNA sequences, so reducing the space requirement of suffix trees specifically for DNA sequences is very important.

Proceedings of the IEEE Fifth International Symposium on Multimedia Software Engineering (ISMSE’03) 0-7695-2031-6/03 $17.00 © 2003 IEEE

2. Overview of the Suffix Tree

The suffix tree ST for an m-character string S is a rooted directed tree with the following properties:
• It has exactly m leaves numbered 1 to m.
• Except for the root, each internal node has at least two children.
• Each edge is labeled with a nonempty substring of S.
• No two edges out of a node have edge-labels beginning with the same character.
• For any leaf i, the concatenation of the edge-labels on the path from the root to leaf i exactly spells out the suffix of S that starts at position i.

For convenience, we define some terms used in our description:
• ‘$’ is a sentinel that is the last character of the string and does not occur anywhere before it.
• If the concatenation of the edge-labels on the path from the root to a node v is w, then we say path(v) = w and use w’ to represent this node. If w is a substring of S and w’ exists in ST, then w’ is unique.
• The suffix of string S that starts at position j is suffix[j] or Sj.
• The depth of a node w’, called depth(w’), is equal to the length of w.
• The head position of a node w’, headposition(w’), is the position of the leftmost branching occurrence of w in the suffix tree.

Suffix trees can be constructed in linear time and linear space by several algorithms [11, 12, 13, 14, 15], and some of these use suffix links to achieve the O(n) construction time [11, 12, 13, 14]. A suffix link is a link from an internal node (aw)’ to another internal node w’. The references describe how suffix links are set and how they work. The suffix tree can be represented by two tables Tleaf and Tbranch [12]. For each leaf number j ∈ [1, n + 1], Tleaf[j] stores a reference to the right brother of leaf node Sj’. If there is no such brother, then Tleaf[j] is a nil reference. For each branching node w’, Tbranch[w’] stores a branch record consisting of five components: firstchild, branchbrother, depth, headposition, and suffixlink. The successors of a branching node are therefore found in a list whose elements are linked via the firstchild, branchbrother and Tleaf references. The references firstchild, branchbrother and Tleaf[j] can be implemented as integers in the range [0, n]. An extra bit with each such integer tells whether the reference is to a leaf or to a branching node.
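The linked child list described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the reference encoding (leaf flag in the lowest bit), the NIL value of -1, the helper names and the hand-built tree for the string "aba$" are all assumptions made for illustration, and the headposition and suffixlink fields are left out.

```python
NIL = -1  # assumed encoding of a nil reference (not specified in the text)

def leaf_ref(j):
    """Reference to leaf number j; the extra low bit marks a leaf (assumed layout)."""
    return (j << 1) | 1

def branch_ref(i):
    """Reference to the branching node with index i; low bit 0 marks a branching node."""
    return i << 1

def is_leaf(ref):
    return (ref & 1) == 1

def index_of(ref):
    return ref >> 1

# Hand-built suffix tree for S = "aba$" (suffixes numbered 1..4):
#   root --a--> node 1 --ba$--> leaf 1
#                      --$----> leaf 3
#        --ba$--> leaf 2
#        --$----> leaf 4
T_branch = [
    {"firstchild": branch_ref(1), "branchbrother": NIL, "depth": 0},          # root
    {"firstchild": leaf_ref(1), "branchbrother": leaf_ref(2), "depth": 1},    # node "a"
]
T_leaf = {1: leaf_ref(3), 2: leaf_ref(4), 3: NIL, 4: NIL}

def children(node_index):
    """Walk firstchild, then the branchbrother / Tleaf chain, and list the successors."""
    result = []
    ref = T_branch[node_index]["firstchild"]
    while ref != NIL:
        if is_leaf(ref):
            result.append(("leaf", index_of(ref)))
            ref = T_leaf[index_of(ref)]
        else:
            result.append(("branch", index_of(ref)))
            ref = T_branch[index_of(ref)]["branchbrother"]
    return result

print(children(0))
print(children(1))
```

Each node stores only one child reference and one sibling reference, so the number of references stays linear in the number of nodes regardless of the alphabet size.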

3. Method

In this section, we describe our techniques for reducing the space requirement of suffix trees for DNA sequences. In Section 3.1, we analyze the current best method, and in Section 3.2 we give some improvements to the bit layout. In Section 3.3, we use an index table, called the “Prefix Table”, to eliminate the space used by some nodes close to the root. Finally, we propose a speed-up technique especially for long DNA sequences in Section 3.4 and then show the pseudo code of our method in Section 3.5.

3.1 Analyzing the Previous Bit Layout of Records

Kurtz finds that, when storing the suffix tree information, internal nodes can be divided into large nodes and small nodes based on the relation between their head position values. In the sequence of internal nodes created during the construction process, there are many small-large chains of varying length, each of which is a sequence of small nodes followed by one large node. In a small-large chain, the values of headposition, depth and suffixlink of all small nodes can be derived from the large node at the end of the chain. Therefore, with the bit optimization technique, Kurtz's method (MethodK) spends four integers for one large node, two integers for one small node and one integer for each leaf node [11].

According to Kurtz's finding, the longer the small-large chain is, the more space is saved. Our analysis shows that a small-large chain is formed only if all of the nodes in the chain are new nodes created consecutively while a series of suffixes are added to the suffix tree one by one. However, if a sequence is filled with repetitive patterns, long runs of consecutively created new nodes are unlikely, which means that the small-large chains will be short on average. Unfortunately, a DNA sequence is not only well known for its repetitive structure but also has a small alphabet, which makes repetition highly likely. Therefore, using MethodK on DNA sequences may not benefit from small nodes and instead produces more large nodes. In our experiments, 34% of internal nodes are large nodes and 66% are small nodes when the input string is a general text file. If the input string is a DNA sequence, on average 33% of internal nodes are small nodes and 67% are large nodes. This result matches the expectation described above.

A suffix link exists for each internal node in order to make the construction of the suffix tree run in O(n) time. However, in practice some researchers have found that suffix links easily cause a memory bottleneck when the input sequence is large. Suffix links inevitably cause random memory accesses, which induce cache misses at each level of the memory hierarchy. For small inputs, the effect of cache misses is not significant. For large inputs, however, it causes more disk swapping, which is a serious problem if construction speed matters. These effects limit the size of a suffix tree that can be constructed using suffix-link based algorithms. The experiment in [16] shows an in-memory performance comparison of suffix trees with and without suffix links. The largest suffix-link tree they can build is for 25 Mbp (mega base pairs). Because most DNA sequences are extremely long, we give up the suffix-link based algorithm and use another mechanism to speed up construction.

3.2 Proposed Bit Layout

From Section 3.1 we can conclude two things: (1) If the record that keeps the internal node information can be reduced to just three integers, then we can save some memory space. (2) The suffix-link based algorithms are not suitable when the input data is very large, so discarding the suffix link is a reasonable choice. Thus we propose the following three-integer bit layout for each internal node record:

Internal Node
  Integer (1):  headposition part (3 bits) | firstchild (29 bits)
  Integer (2):  headposition part (2 bits) | A (1 bit) | branchbrother / depth (29 bits)
  Integer (3):  headposition part (22 bits) | depth (10 bits)

Leaf Node
  null (2 bits) | A (1 bit) | branchbrother / depth (29 bits)

Fig 3.4 The bit layout of the large node and the small node

Each field of the internal node is described in the following:
• firstchild: the value is stored in the last 29 bits of integer (1).
• branchbrother: the value is stored in the last 29 bits of integer (2).
• headposition: 27 bits store this value, namely the first 3 bits of integer (1), the first 2 bits of integer (2) and the first 22 bits of integer (3).
• depth: if depth < 1023, then it is stored in the last 10 bits of integer (3). Otherwise the last 10 bits of integer (3) are set to 1023 and the depth value is stored in the branchbrother field of the node's rightmost child.
• bit A: bit A, located just in front of the branchbrother field in integer (2), is set to 1 if and only if the branchbrother field stores the depth value of the node's parent. Otherwise it is set to 0.

A leaf node behaves like integer (2) of an internal node, except that its first 2 bits are unused.
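The field widths above can be checked with a short packing sketch in Python. It is an illustration under stated assumptions, not the authors' code: the function names are hypothetical, and the text does not say which of the three headposition pieces holds the most significant bits, so the sketch assumes the 3-bit piece is highest, the 2-bit piece is next and the 22-bit piece is lowest. The relocation of an overflowing depth into the rightmost child is only signalled here by the marker value 1023 and is not modeled.

```python
MASK29 = (1 << 29) - 1
MASK22 = (1 << 22) - 1
MASK10 = (1 << 10) - 1
DEPTH_OVERFLOW = 1023  # marker meaning "depth is stored elsewhere"

def pack_internal(firstchild, branchbrother, headposition, depth, bit_a=0):
    """Pack one internal node record into three 32-bit integers."""
    assert 0 <= firstchild <= MASK29 and 0 <= branchbrother <= MASK29
    assert 0 <= headposition < (1 << 27) and bit_a in (0, 1)
    hp_high = (headposition >> 24) & 0b111   # assumed: top 3 bits go to integer (1)
    hp_mid = (headposition >> 22) & 0b11     # assumed: next 2 bits go to integer (2)
    hp_low = headposition & MASK22           # assumed: low 22 bits go to integer (3)
    stored_depth = depth if depth < DEPTH_OVERFLOW else DEPTH_OVERFLOW
    word1 = (hp_high << 29) | firstchild
    word2 = (hp_mid << 30) | (bit_a << 29) | branchbrother
    word3 = (hp_low << 10) | stored_depth
    return word1, word2, word3

def unpack_internal(word1, word2, word3):
    """Recover (firstchild, branchbrother, headposition, depth, bit_a)."""
    firstchild = word1 & MASK29
    branchbrother = word2 & MASK29
    bit_a = (word2 >> 29) & 1
    headposition = ((word1 >> 29) << 24) | ((word2 >> 30) << 22) | (word3 >> 10)
    depth = word3 & MASK10
    return firstchild, branchbrother, headposition, depth, bit_a

words = pack_internal(123456, 654321, 100_000_000, 57, bit_a=1)
print(unpack_internal(*words))
```

With 27 bits for headposition and 29 bits for the references, this layout can address sequences of up to 2^27 positions, which is the practical limit implied by the field widths.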

3.3 Reducing the Space Requirement of Internal Nodes

Although the new bit layout saves space during construction, it is still not enough. Overall, reducing the internal node record to three integers saves space, but not significantly. Therefore we need another mechanism to achieve a more significant result. The characteristics of DNA sequences are described in Section 3.3.1 and the details of our new structure are described in Section 3.3.2.

3.3.1 Characteristics of DNA Sequence Influencing Internal Nodes

As discussed above, the more internal nodes the input sequence generates, the more memory space is needed. Unfortunately, the two characteristics of DNA sequences, the repetitive structure and the small alphabet, both facilitate the generation of internal nodes. Therefore, besides reducing the space used by one internal node as in Section 3.2, reducing the total space used by internal nodes is even more important.

3.3.2 Constructing Suffix Trees with the Prefix Table

Although these two characteristics create many internal nodes in the suffix tree, they also suggest an idea for saving space. Because of the repetitive structure and the small alphabet, we can assume that the density of internal nodes close to the root is very high. Thus, in the ideal case, the levels from depth 0 to depth $\log_4(\mathit{seq\_length}) - 1$ are full of internal nodes. Is it necessary to keep information for each of these nodes? Since the tree is full of internal nodes down to some depth, we can partition the whole suffix tree into several subtrees according to this depth bound and let each internal node on the bound be the root of the subtree following it. Thus we construct a table of size

$$4^{\log_4(\mathit{seq\_length}) - 1} = \frac{\mathit{seq\_length}}{4}$$

to store the root information for each subtree. The root does not need to keep the information of headposition, branchbrother and depth, so we can use only one integer to store firstchild, which saves more memory space. We use one table to replace all internal nodes on the bound depth. In the ideal case each bound internal node represents some combination of consecutive characters, and this combination is precisely the path label from the root to that node. Each path label from the root to a leaf is a suffix of the input sequence, so each combination is exactly the prefix of some suffixes of the input sequence. Therefore we call this table the “Prefix Table”. This table must contain every possible prefix. For example, if the bound depth is 2 (the depth of the root is 0), this table must have 16 entries to represent AA, AT, AC, AG, TA, TT, TC, TG, CA, CT, CC, CG, GA, GT, GC and GG, respectively.

3.4 Speed Up by Locality for Large Data

When constructing the suffix tree, most of the time is spent tracing along internal nodes, because adding a suffix to the current suffix tree requires traversing the tree to find the correct position. Physically, tracing along internal nodes means tracing along memory cells, so reducing the memory access time can accelerate the construction process. According to the concept of the memory hierarchy, caches rely on temporal locality and spatial locality. Thus, if we can place the currently related nodes together in memory, there will be more cache hits. In the original construction method we cannot decide which nodes are currently related, because the tree is a single whole and we do not know which part of it will be used. Therefore we cannot place these nodes together in memory. However, our improvement partitions the whole tree into several subtrees according to their prefixes, so adding a new suffix only uses internal nodes in the subtree that belongs to the same prefix table entry as that suffix. To place these nodes together, we must construct the suffix tree in order of the prefix of each suffix, because the order of internal nodes in physical memory is the order in which the nodes are created during construction.

Constructing the suffix tree by prefix groups the related internal nodes together, but if we spend too much time finding the suffixes with identical prefixes, the new method is pointless. Therefore the implementation is an important issue. Here we propose a preprocessing method using singly linked lists. The preprocessing scans the sequence starting from the first suffix. For each suffix, we calculate its index in the prefix table and append the suffix number to that entry. Thus, when the preprocessing is done, each entry in the prefix table is either empty or followed by a linked list of suffix numbers. Then, in the next step, when constructing the suffix tree, we can handle the suffixes with identical prefixes together according to the information in the prefix table, which groups the related internal nodes in physical memory. However, if this preprocessing uses a separate linked list data structure, it takes 2n integers of space, which defeats our original goal of saving memory. Fortunately, the space reserved for internal nodes and leaf nodes is not in use during preprocessing, so we can use these two areas to build the linked lists.
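The prefix table index and the bucketing step of the preprocessing can be sketched as follows. This is a minimal Python illustration under assumptions, not the authors' implementation: the function names are hypothetical, positions are 0-based, the base order A, T, C, G is taken from the entry order AA, AT, AC, AG, ... given for bound depth 2, suffixes shorter than the bound depth are skipped, and the linked lists live in separate Python lists rather than in the space reserved for the nodes.

```python
CODE = {"A": 0, "T": 1, "C": 2, "G": 3}  # order follows AA, AT, AC, AG, TA, ...

def prefix_index(seq, start, k):
    """Index of the length-k prefix of the suffix starting at `start` (base-4 number)."""
    index = 0
    for ch in seq[start:start + k]:
        index = index * 4 + CODE[ch]
    return index

def bucket_suffixes(seq, k):
    """Group suffix start positions by prefix table entry using array-based linked lists."""
    table_size = 4 ** k
    head = [-1] * table_size   # first suffix of each entry, -1 if the entry is empty
    tail = [-1] * table_size   # last suffix of each entry, used for O(1) append
    nxt = [-1] * len(seq)      # nxt[s] is the next suffix in the same entry
    for start in range(len(seq) - k + 1):
        entry = prefix_index(seq, start, k)
        if head[entry] == -1:
            head[entry] = start
        else:
            nxt[tail[entry]] = start
        tail[entry] = start
    return head, nxt

def suffixes_of(entry, head, nxt):
    """Follow the linked list of one prefix table entry."""
    result = []
    s = head[entry]
    while s != -1:
        result.append(s)
        s = nxt[s]
    return result

head, nxt = bucket_suffixes("ATATCG", 2)
print(suffixes_of(prefix_index("AT", 0, 2), head, nxt))
```

Processing the entries one after another then inserts all suffixes with the same prefix consecutively, which is the ordering the locality argument relies on.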


4. Experimental Results and Discussions

In this section, we present the experimental results and discuss them. In Section 4.1 and Section 4.2, we compare the space requirement and the running time, respectively, of our method with the current best algorithm and then discuss the experimental results in detail.

4.1 Space Requirement

We compare the space requirement of the described implementation techniques, MethodCYC, with the space requirements of MethodK and MethodMcC. MethodMcC uses one integer for each field. That means one internal node needs five integers for firstchild, branchbrother, headposition, depth and suffixlink. Each leaf needs only one integer for branchbrother. We emphasize that the given numbers refer to the space required for construction. They do not include the n bytes used to store the input string. We chose 10 DNA sequences downloaded from the NCBI web site. Seven of these 10 sequences were also used by Kurtz [11] as experimental data for DNA sequences.

Table 4.1 The space requirement of three methods

Sequence    Length     MethodMcC   MethodK   MethodCYC
AC008583    122493     16.98       12.62     9.36
AC135393    38480      17.9        12.39     9.47
BC044746    4897       16.87       12.61     9.13
J03071      11427      21.79       12.32     10.5
M13438      2657       16.65       12.5      9.14
M26434      56737      17.38       12.52     9.46
M64239      94647      16.87       12.62     9.23
V00662      16569      16.9        12.69     9.13
X14112      152261     17.12       12.58     9.27
ecoli       4668239    16.87       12.56     9.1
[Average]              17.53       12.54     9.38
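The average row of Table 4.1 can be turned into relative savings with simple arithmetic. The Python sketch below uses only the three average values from the table; the function name is a hypothetical helper.

```python
def relative_saving(baseline, ours):
    """Percentage of space saved relative to a baseline (bytes per character)."""
    return 100.0 * (baseline - ours) / baseline

# Average bytes per character from Table 4.1
avg_mcc, avg_k, avg_cyc = 17.53, 12.54, 9.38

print(round(relative_saving(avg_k, avg_cyc), 1))    # 25.2
print(round(relative_saving(avg_mcc, avg_cyc), 1))  # 46.5
```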

In Table 4.1 we compare the space requirement of MethodCYC with those of MethodK and MethodMcC. The space requirement is defined as the average number of bytes used per input character. The first column gives the names of the DNA sequences and the second column their lengths. The third, fourth and fifth columns give the space requirement (in bytes per character) of MethodMcC, MethodK and MethodCYC, respectively. From this table we can see that our method needs the least memory space to construct the suffix tree for every DNA sequence. We observe the following:
1. On average, our method requires about 25.2% less space than MethodK and 46.5% less space than MethodMcC.
2. There is no clear relationship between the space requirement and the sequence length, but the sequence structure has a great influence on the space requirement; J03071 is an example.

4.2 Running Time

In this section we show the running time of our method compared with MethodK. In Section 4.2.1 we describe our environment setting and in Section 4.2.2 we show the experimental results.

4.2.1 Platform

For our second experiment we implemented two algorithms: MethodK and our method. All programs were written in C++ using Borland C++ Builder 6.0 as the development environment and compiled with the Borland compiler. We disabled the full debug mode and used the Pentium instruction set. Our programs were run on an 800 MHz AMD Athlon with 768 megabytes of RAM under Windows 2000 Advanced Server. Each character of an input string is always represented by one byte.

4.2.2 Result

In this section we show the experimental results and then discuss them. In Table 4.3 we compare our method with MethodK. For each method we show the running time (in seconds) and the throughput (10^6 * time / sequence length). In each row a grey box marks the best throughput.

Table 4.3 The running time and throughput of two methods

                       MethodK            MethodCYC
Sequence    Length     time      tput     time      tput
AC008583    122493     0.811     6.62     0.26      2.12
AC135393    38480      0.22      5.72     0.09      2.34
BC044746    4897       0.03      6.13     0.01      2.04
J03071      11427      0.06      5.25     0.03      2.63
M13438      2657       0.02      7.53     0         0
M26434      56737      0.34      5.99     0.14      2.47
M64239      94647      0.621     6.56     0.18      1.9
V00662      16569      0.09      5.43     0.02      1.21
X14112      152261     1.092     7.17     0.4       2.63
ecoli       4668239    62.365    13.36    15.262    3.27
[Average]              6.5649    6.976    1.6392    2.061
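The throughput column of Table 4.3 and the relative gap between the two average throughputs can be recomputed from the table values. The Python sketch below is only a consistency check on the reported numbers; the function names are hypothetical helpers.

```python
def throughput(seconds, length):
    """10^6 * time / sequence length, i.e. microseconds per input character."""
    return 1e6 * seconds / length

def advantage(slower, faster):
    """Relative improvement of `faster` over `slower`, in percent."""
    return 100.0 * (slower - faster) / slower

# (length, MethodK time, MethodCYC time) for three rows of Table 4.3
rows = {
    "AC008583": (122493, 0.811, 0.26),
    "X14112": (152261, 1.092, 0.4),
    "ecoli": (4668239, 62.365, 15.262),
}
for name, (length, t_k, t_cyc) in rows.items():
    print(name, round(throughput(t_k, length), 2), round(throughput(t_cyc, length), 2))

# Average throughputs from the [Average] row
print(round(advantage(6.976, 2.061), 2))  # 70.46
```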

From the table we can observe the following facts:
1. MethodK is slower than MethodCYC on average, although MethodK has the smaller time complexity. However, we think the speed of a suffix tree construction program is highly dependent on the implementation technique, because the main program is a loop that runs thousands of times. If any basic operation is coded carelessly, the whole program becomes slow.
2. Generally speaking, the larger the DNA sequence is, the more time it takes to deal with each character. When the input sequence is longer, the suffix tree becomes larger and more time is spent traversing the tree to find the insertion position. Besides the cost of traversing a bigger suffix tree, the large number of nodes in memory also causes cache misses in the memory hierarchy.
3. On average, the running time advantage of our method over MethodK is 70.46%. We think there are two reasons for this big difference: (1) MethodCYC uses the prefix table to replace all internal nodes on the bound depth, so it does not need to spend time creating the internal nodes within the bound depth. (2) The data structure of MethodK is somewhat complex, so a C++ based implementation may cause some overhead.

5. Conclusion and Future Work

In this paper, we have proposed new implementation techniques to reduce the space requirement of the suffix tree, especially for DNA sequences. To make suffix trees more suitable for handling DNA sequences, we have given up the small-large chain and the suffix links and proposed a new bit layout for the node record. Then, according to the characteristics of DNA sequences, we have used the prefix table to partition one big suffix tree into several subtrees so that we can save the space used by the internal nodes near the root. Finally, we have presented a preprocessing technique to speed up the construction time for large sequences, and this preprocessing also makes applications that traverse the suffix tree fast. According to the experimental results, the proposed techniques require the least memory space to construct a suffix tree for DNA sequences and perform well in construction time. Some problems are worth further investigation in the future. The prefix table is currently implemented as one large index table for all possible prefix combinations. It may waste much space in some special cases. Thus, if we use a scalable hash table to implement it, we can save more space during construction.

6. References

[1] Hooman H. Rashidi and Lukas K. Buehler, “Bioinformatics Basics,” 2000.
[2] Dan Gusfield, “Algorithms on Strings, Trees, and Sequences,” 1997.
[3] D. E. Knuth, J. H. Morris, and V. R. Pratt, “Fast pattern matching in strings,” SIAM Journal on Computing, Vol. 6, pp. 323-350, 1977.
[4] R. S. Boyer and J. S. Moore, “A fast string searching algorithm,” Communications of the ACM, Vol. 20, pp. 762-772, October 1977.
[5] Arthur L. Delcher, Simon Kasif, Robert D. Fleischmann, Jeremy Peterson, Owen White and Steven L. Salzberg, “Alignment of whole genomes,” Nucleic Acids Research, Vol. 27, No. 11, pp. 2369-2376, April 1999.
[6] Arthur L. Delcher, Adam Phillippy, Jane Carlton and Steven L. Salzberg, “Fast algorithms for large-scale genome alignment and comparison,” Nucleic Acids Research, Vol. 30, No. 11, pp. 2478-2483, 2002.
[7] U. Manber and E. W. Myers, “Suffix Arrays: A New Method for On-line String Searches,” SIAM Journal on Computing, Vol. 22, No. 5, pp. 935-948, 1993.
[8] A. Andersson and S. Nilsson, “Improved Behavior of Tries by Adaptive Branching,” Information Processing Letters, Vol. 46, pp. 295-300, 1993.
[9] R. W. Irving, “Suffix Binary Search Trees,” Research Report, Department of Computer Science, University of Glasgow, 1996.
[10] J. Kärkkäinen, “Suffix Cactus: A Cross between Suffix Tree and Suffix Array,” In Proceedings of the Annual Symposium on Combinatorial Pattern Matching (CPM’95), LNCS 937, pp. 191-204, 1995.
[11] S. Kurtz, “Reducing the space requirement of suffix trees,” Software Pract. Experience, Vol. 29, pp. 1149-1171, 1999.
[12] P. Weiner, “Linear pattern matching algorithms,” Proceedings of the 14th IEEE Symposium on Switching and Automata Theory, pp. 1-11, 1973.
[13] E. M. McCreight, “A space-economical suffix tree construction algorithm,” Journal of the ACM, Vol. 23, pp. 262-272, 1976.
[14] E. Ukkonen, “On-line construction of suffix-trees,” Algorithmica, Vol. 14, No. 3, pp. 249-260, September 1995.
[15] M. Farach, “Optimal Suffix Tree Construction with Large Alphabets,” In Proceedings of the 38th Annual Symposium on the Foundations of Computer Science, FOCS 97, pp. 137-143, New York, 1997.
[16] Ela Hunt, Malcolm P. Atkinson and Robert W. Irving, “A Database Index to Large Biological Sequences,” Proceedings of the 27th VLDB Conference, pp. 139-148, September 2001.