Constructing Suffix Arrays of Large Texts

K. Sadakane and H. Imai
Department of Information Science, University of Tokyo
7-3-1 Hongo, Bunkyo-ku, Tokyo 113-0033, JAPAN
{sada,[email protected]

Abstract

Recently, Sadakane [12] proposed a new fast and memory-efficient algorithm for sorting the suffixes of a text in lexicographic order. Sorting suffixes is important because an array of the indexes of the suffixes, called a suffix array, is a memory-efficient alternative to the suffix tree. Sorting suffixes is also used for the Burrows-Wheeler transformation in the Block Sorting text compression scheme, so fast sorting algorithms are desired. In full-text databases the texts are of course quite long, and this algorithm makes it possible to use the suffix array data structure and the compression scheme for such large texts. In this paper we compare the suffix-array construction algorithms of Bentley-Sedgewick, Andersson-Nilsson, and Karp-Miller-Rosenberg and the suffix-tree construction algorithm of Larsson with respect to speed and required memory, and we compare them with our new algorithm, which is fast and memory-efficient because it combines these techniques.

1 Introduction

1.1 Suffix array

Today large databases such as the full text of newspapers, Web pages, and genome sequences are becoming available, so it is important to store them in memory for quick queries. The suffix array [10] is a compact data structure for on-line string searches. For such purposes the suffix tree is used because it enables us to find the longest substring of a large text that matches the query string in linear time. However, the suffix tree requires a huge amount of memory, and therefore it cannot be used for large texts. Though the suffix tree can be constructed in linear time [11, 14, 9], the time complexity depends on the alphabet size. Searching time also depends on the alphabet size, and it becomes slow if the alphabet is large. On the other hand, the construction and searching times of the suffix array do not depend on the alphabet size. Therefore searching in the suffix array may be comparable with searching in the suffix tree, even though it has superlinear time complexity O(P log n), where P is the length of the query string and n is the length of the text.
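Since this excerpt only states the O(P log n) search bound, a minimal sketch may help. The following is not the paper's code; it is a plain binary search over a naively built suffix array, and the names `build_suffix_array` and `find` are ours.

```python
# Minimal sketch of on-line search in a suffix array (not the paper's code).
# build_suffix_array() is the naive construction; doing this step fast and
# memory-efficiently for large texts is what the paper is about.

def build_suffix_array(text: str) -> list[int]:
    # Sort the suffix start positions by the suffixes they denote.
    return sorted(range(len(text)), key=lambda i: text[i:])

def find(text: str, sa: list[int], pattern: str) -> list[int]:
    # Binary search for the leftmost suffix whose length-P prefix is >= pattern,
    # then collect all suffixes that actually start with pattern.
    # The search costs O(P log n) character comparisons, independent of the
    # alphabet size.
    p = len(pattern)
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + p] < pattern:
            lo = mid + 1
        else:
            hi = mid
    occurrences = []
    while lo < len(sa) and text[sa[lo]:sa[lo] + p] == pattern:
        occurrences.append(sa[lo])
        lo += 1
    return occurrences

text = "abracadabra$"
sa = build_suffix_array(text)
print(find(text, sa, "abra"))  # [7, 0]: both occurrences of "abra"
```

The naive construction above is exactly the step whose cost the rest of the paper is concerned with; the search itself is alphabet-independent.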

1.2 Block Sorting data compression

The Block Sorting is a lossless data compression scheme [4]. Though its compression ratio is comparable with that of variants of the PPM [6], its compression speed is faster than theirs, and its decompression speed is much faster than its compression speed. Moreover, the required memory is smaller than for the PPM; therefore the Block Sorting is a promising scheme. The Burrows-Wheeler transformation used in the Block Sorting is a permutation of the text to be compressed, and it requires sorting all suffixes of the text. Though the Block Sorting is faster than the PPM, it is slower than gzip because sorting of a long text is required. Therefore the Block Sorting generally divides a long text into many blocks and performs the Burrows-Wheeler transformation on each block, which makes the compression ratio worse. Hence speeding up the sorting of suffixes of large texts is important.
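The transformation itself is easy to state once all suffixes are sorted. The sketch below is the textbook suffix-sorting formulation with a '$' sentinel (as in Section 1.4); it is not the paper's implementation and not the block-splitting variant discussed above.

```python
# Sketch of the Burrows-Wheeler transformation via suffix sorting.
# Appending the sentinel '$' (smaller than every other symbol) makes sorting
# suffixes equivalent to sorting cyclic rotations, so the BWT output is the
# character preceding each suffix in sorted order.

def bwt(text: str) -> str:
    text += "$"
    sa = sorted(range(len(text)), key=lambda i: text[i:])  # suffix sorting dominates the cost
    return "".join(text[i - 1] for i in sa)  # text[-1] is '$', the predecessor of suffix 0

print(bwt("abracadabra"))  # 'ard$rcaaaabb'
```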

1.3 Our results

We propose a fast and memory-efficient algorithm for sorting all suffixes of a text. We use the doubling technique of Karp-Miller-Rosenberg [8], the ternary partitioning of Bentley-Sedgewick [3], and the numbering method of Andersson-Nilsson [1] and Manber-Myers [10]. It requires only 9n bytes for sorting a text of length n, by using a clever ordering of the sorting and an integer encoding. Note that we assume integers are 32 bits. We define a measure of the difficulty of sorting suffixes: the average match length (AML). Our algorithm requires much more memory than the Bentley-Sedgewick algorithm, which requires 5n bytes, but our algorithm is faster than other sorting algorithms when the AML is large. It is faster than the suffix tree construction algorithm of Larsson [9] if the alphabet size is large, and it requires less than half of the memory needed for the suffix tree. When a text becomes long, many pairs of words appear repeatedly and the AML becomes large; therefore our algorithm is practical for large texts.
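The 9n-byte figure is stated without an itemization in this excerpt. Under the 32-bit-integer assumption above, one plausible accounting (our assumption, not a breakdown given by the paper) is the text plus two integer arrays of length n:

```python
# Hypothetical 9n-byte accounting for a text of length n with 32-bit integers.
# The exact breakdown is our assumption; the excerpt only states the total.
def sketch_working_memory(n: int) -> int:
    text_bytes = 1 * n       # one byte per symbol of the text
    suffix_indexes = 4 * n   # one 32-bit index per suffix (the suffix array)
    group_numbers = 4 * n    # one 32-bit group/rank number per suffix
    return text_bytes + suffix_indexes + group_numbers  # = 9n bytes

print(sketch_working_memory(10_000_000))  # 90,000,000 bytes for a 10 MB text
```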

1.4 Definitions

Assume that we make a suffix array of a text X = X[0..n-1] = x_0 x_1 ... x_{n-1}. The size of the alphabet Σ is constant. We add a special symbol '$' at the tail of the text X, that is, x_n = $. It is not in the alphabet and is smaller than any other symbol in the alphabet. The suffixes of the text X are represented by S_i = X[i..n]. The suffix array I[0..n] is an array of the indexes of the suffixes S_i. All suffixes in the array are sorted in lexicographic order. A suffix S_i is lexicographically less than a suffix S_j if ∃l ≥ 0 [X[i..i+l-1] = X[j..j+l-1] and x_{i+l} < x_{j+l}]. We use another ordering ≤_k, defined by comparison of prefixes of length k, as follows.

  S_i <_k S_j  ⟺  ∃l (0 ≤ l < k) [X[i..i+l-1] = X[j..j+l-1] and x_{i+l} < x_{j+l}]
  S_i =_k S_j  ⟺  X[i..i+k-1] = X[j..j+k-1]
  S_i >_k S_j  ⟺  ∃l (0 ≤ l < k) [X[i..i+l-1] = X[j..j+l-1] and x_{i+l} > x_{j+l}]

Due to the space limitation we skip describing related algorithms in this version.
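As a small illustration of the ≤_k order just defined (illustrative code, not from the paper), comparing the length-k prefixes of two suffixes decides <_k, =_k, or >_k:

```python
# Compare suffixes S_i and S_j by their prefixes of length k (the <=_k order).
# Returns -1, 0, or +1 for S_i <_k S_j, S_i =_k S_j, S_i >_k S_j respectively.
# Slices are truncated at the end of the text, where the sentinel '$' sits.

def compare_k(text: str, i: int, j: int, k: int) -> int:
    a, b = text[i:i + k], text[j:j + k]
    return (a > b) - (a < b)

text = "abracadabra$"
print(compare_k(text, 0, 7, 4))  # 0: both length-4 prefixes are "abra"
print(compare_k(text, 0, 7, 5))  # +1: "abrac" > "abra$" because 'c' > '$'
```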

groups have different numbers. Next, the suffixes of each group are split into subgroups according to the numbers of the suffixes S_{i+1}. As a result, all suffixes are split according to their first two symbols. This operation continues until all suffixes belong to groups of size one. If the groups are split and ordered in alphabetical order, all suffixes can be sorted in lexicographic order. The key of the algorithm is the so-called doubling technique. Because all suffixes in a group have the same first k symbols X[i..i+k-1] after an iteration, we can split the groups according to the numbers of the suffixes S_{i+k}. Since those numbers were already calculated in the last iteration, we can split the groups according to X[i+k..i+2k-1] in the next iteration. Consequently, all suffixes can be sorted in lexicographic order within ⌈log n⌉ iterations.

2.4 Manber-Myers algorithm

This algorithm [10] uses the doubling technique of the KMR algorithm. First we sort and group all suffixes by their first symbols using bucket sort, and the suffixes are given group numbers as in the KMR algorithm. Next, all groups are traversed in sorted order to sort the suffixes within each group by their second symbols: if I[0] = i, then S_{i-1} is the smallest suffix in its group, so we can move suffixes to their correct positions according to their first two symbols. The number of iterations is at most ⌈log n⌉ and each iteration can be done in O(n) time; therefore this algorithm works in O(n log n) time.
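As an illustration of the doubling technique described above, here is a compact prefix-doubling sketch. It is not the paper's implementation and not the O(n log n) bucket-sort version of Manber-Myers; it re-sorts with a comparison sort in each round for brevity.

```python
# Prefix-doubling sketch: after a round every suffix carries a group number
# determined by its first k symbols; pairing the group numbers of S_i and
# S_{i+k} refines the groups to the first 2k symbols. Python's sort makes
# each round O(n log n); Manber-Myers replaces it with bucket sort.

def suffix_array_doubling(text: str) -> list[int]:
    text += "$"
    n = len(text)
    rank = [ord(c) for c in text]          # group numbers for k = 1
    sa = list(range(n))
    k = 1
    while True:
        key = lambda i: (rank[i], rank[i + k] if i + k < n else -1)
        sa.sort(key=key)
        new_rank = [0] * n
        for t in range(1, n):              # same pair of group numbers -> same group
            new_rank[sa[t]] = new_rank[sa[t - 1]] + (key(sa[t]) != key(sa[t - 1]))
        rank = new_rank
        if rank[sa[-1]] == n - 1:          # all groups have size one: fully sorted
            return sa
        k *= 2

print(suffix_array_doubling("abracadabra"))
# [11, 10, 7, 0, 3, 5, 8, 1, 4, 6, 9, 2]
```

Each round doubles the prefix length that the group numbers account for, which is why at most ⌈log n⌉ rounds suffice.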

2.5 Larsson's suffix tree construction algorithm

This algorithm can maintain a sliding window of a text in linear time, and it can be used for PPM-style data compression [9]. Though the suffix tree requires more memory than the suffix array, it enables us to search for substrings in linear time.
