Succinct Representations of lcp Information and Improvements in the Compressed Suffix Arrays

Kunihiko Sadakane∗

Abstract

We introduce two succinct data structures to solve various string problems. One is for storing the information of lcp, the longest common prefix, between suffixes in the suffix array, and the other is an improvement of the compressed suffix array which supports linear-time counting queries for any pattern. The former occupies only 2n + o(n) bits for a text of length n for computing lcp between adjacent suffixes in lexicographic order in constant time, and 6n + o(n) bits between any two suffixes. No data structure in the literature attained linear size. The latter has size proportional to the text size and is applicable to texts on any alphabet Σ such that |Σ| = log^{O(1)} n. These space-economical data structures are useful in processing huge amounts of text data.

1 Introduction

Backgrounds: As the size of textual data grows, the importance of indexing data structures for pattern matching increases. Many data structures have been proposed for this purpose, for example inverted files, suffix trees and suffix arrays. Among them, suffix trees and suffix arrays are powerful indices because any substring can be found quickly by using them. We call such data structures full-text indices. Full-text indices can be used not only for simple keyword search but also for more complicated queries. We can find any pattern of arbitrary length by using the indices. This means that we can find occurrences of two or more keywords that appear near each other in a text. We can also consider text data mining [16]. Unlike keyword search, we do not know the keywords we want in advance; we want to find association rules among keywords. In these cases full-text indices are useful because of the wealth of information they contain.

Suffix trees are popular data structures not only for pattern matching but also for more complicated queries [7]. A pattern can be found in time proportional to the pattern length by constructing the suffix tree of the text in advance. A problem of the suffix tree is its size. The size of the suffix tree is O(n) words for a text of length n. Since this is too large for practical use, space reduction of suffix trees [10] and space-economical alternatives, such as suffix arrays [11], have been proposed in the literature. However the size is still larger than the text size: if a text occupies n bytes, then the size of its suffix array is 4n bytes because a pointer occupies log2 n ≈ 32 bits. The difference between the sizes of the text and its index grows with the length of the text.

Further reduction in the size of full-text indices can be achieved by compressing the suffix array. Very recently it was proved that suffix arrays can be compressed to size linear in the text size at the cost of increasing access time from constant to O(log^ǫ n) (0 < ǫ ≤ 1) time [6, 5, 15]. The number of occurrences occ of any pattern P can be found in O(|P|) time by using the FM-index [5] and in O(|P| log n) time by using the compressed suffix arrays [6, 15]. Unfortunately, the linear bound on the size of the FM-index can only be obtained for texts on constant-size alphabets Σ. Furthermore, the size of the FM-index becomes smaller than the text size only if |Σ| < log n.

Although the suffix tree has huge size, it is useful for complicated queries. Therefore we want to store a part of the information of the suffix tree. One important piece of information is the length of the lcp (longest common prefix) between two suffixes. We especially store the lengths of lcp between adjacent suffixes in the suffix array. This lcp information represents the depths of nodes in the suffix tree. Therefore we can simulate a bottom-up traversal of the suffix tree, which can be used for text data mining [16]. Because the lengths of lcp vary from 0 to n − 1, log n bits are necessary to store a value. However the values are usually small, and therefore some heuristics based on probabilistic analysis have been used to reduce the space.

∗Graduate School of Information Sciences, Tohoku University. [email protected]
Clark and Munro [3] used a log log log n-bit field to store edge lengths of a suffix tree, although the size was still not linear. Moreover, the method cannot be used to answer the depth of a node quickly.

New results: This paper proposes two succinct data structures for full-text retrieval. We first propose

a simple and compact encoding of the lcp information: the lengths of the longest common prefixes between adjacent suffixes in a suffix array, the list of lexicographically sorted suffixes. It occupies only 2n bits for a text of length n, and each value can be extracted in constant time on the unit-cost RAM model using an auxiliary data structure of size o(n) bits. The key idea for the compact encoding of the lcp information is to use the information of the suffix array. Any type of suffix array can be used: the plain suffix array [11], the compressed suffix array [6], or the FM-index [5]. Among them, the compressed suffix array and the FM-index attain the linear space bound, although they take O(log^ǫ n) time to extract an element. Then we extend the data structure to compute the length of lcp between any two suffixes. It takes constant time or O(log^ǫ n) time, depending on the data structure used for the suffix array, using a data structure of 6n + o(n) bits.

Next we propose an additional index to the compressed suffix array for improving the time complexity of pattern matching. The FM-index can find the lexicographic order of any pattern P in the suffix array in O(|P|) time if |Σ| is constant. In this paper we propose a data structure which can find the pattern P in O(|P|) time if |Σ| = log^{O(1)} n. Though it may be larger than the FM-index for small alphabets, our data structure has size linear in the text size, i.e. O(n log |Σ|) bits, even for large alphabets. On the other hand, the FM-index may become larger than the text size if |Σ| > log n.

The size of our compressed suffix array with lcp information is not only linear in the text size but also smaller than the text size if |Σ| is not too small. The text and the suffix array can be compressed into (1/ǫ)nH1 + O(n) bits, where H1 is the order-1 entropy of the text. The indices for fast queries have size O(n) + o(nH1) bits, and the lcp information has size 2n + o(n) bits or 6n + o(n) bits.
Because H1 ≤ log |Σ| and the other components of the suffix tree occupy O(n) bits, our data structure becomes smaller than the original text if H1 is small enough. Our data structure occupies only (1/ǫ)nH1 + O(n) + o(nH1) bits, supports O(m)-time queries for a pattern of length m, and supports O(n log^ǫ n)-time traversal of all nodes of the suffix tree, where 0 < ǫ ≤ 1 is any fixed constant; we use ǫ = 1/2, 1/4, etc. If (1/ǫ)H1 < log |Σ|, the size will be smaller than n log |Σ| bits, the text size. Compared with the FM-index, the size of our data structure is larger for small alphabets because the FM-index has size O(nHk) bits, where Hk is the order-k entropy of the text and Hk ≤ H1 always holds. However our data structure achieves compression of the text and O(|P|)-time search for alphabets such that |Σ| = log^{O(1)} n. By using the proposed data structures

we can construct the suffix tree for a large collection of texts, which can be used to solve the many problems we face.

2 Preliminaries

In this section we review suffix arrays, suffix trees and their succinct representations. Let T[1..n] = T[1]T[2]···T[n] be a text of length n on an alphabet Σ. We assume that T[n] = $ is a unique terminator which is smaller than any other symbol. The j-th suffix of T is defined as T[j..n] = T[j]T[j+1]...T[n] and denoted by Tj. A substring T[j..l] is called a prefix of Tj. The suffix array SA[1..n] of T is an array of the integers j that represent the suffixes Tj, sorted in lexicographic order of the corresponding suffixes.

2.1 Suffix trees
The suffix tree of a text T[1..n] is a compressed trie built on all suffixes of T. It has n leaves, and leaf i corresponds to the suffix TSA[i]. For details, see Gusfield [7]. Figure 1 shows the suffix tree for the text "ababac$." Leaf nodes are shown by boxes, and the numbers in the boxes represent the elements of the suffix array. Internal nodes are shown by circles.
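To make the definition concrete, the suffix array of the running example can be checked with a naive construction (an illustrative sketch only, with our own naming; practical constructions use O(n log n)-time or O(n)-time algorithms):

```python
# Naive suffix-array construction: sort the 1-based starting positions
# by the lexicographic order of their suffixes.  '$' (ASCII 36) is
# smaller than any letter, as the paper assumes.
def suffix_array(text):
    return sorted(range(1, len(text) + 1), key=lambda j: text[j - 1:])

print(suffix_array("ababac$"))   # → [7, 1, 3, 5, 2, 4, 6]
```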

Figure 1: The suffix tree for "ababac$" and its balanced parentheses representation.

The topology of a suffix tree can be encoded in at most 4n bits [8, 13, 2]. The tree is encoded into at most 2n nested open and close parentheses as follows: during a preorder traversal of the tree, write an open parenthesis when a node is visited, then traverse all subtrees of the node in alphabetic order of

the edges from the node, and write a close parenthesis. Navigational operations on the tree are defined via rank_p, select_p, etc. The function rank_p(P, i) returns the number of occurrences of a pattern p up to position i in a string P, where p is for example '()'. The function select_p(P, i) returns the position of the i-th occurrence of the pattern p. Both functions take constant time using auxiliary data structures of size o(n) bits [12]. We denote the parentheses sequence by P. We also denote by leaf(i) the position of the '()' in P that represents the leaf of the suffix tree corresponding to the suffix TSA[i].

We also have to store the string-depths of the suffix tree in addition to the topology. The string-depth of an internal node is defined as the sum of the lengths of the labels on the path from the root to that node. Because the string-depth of a node is usually O(log n), it can be stored in an O(log log n)-bit integer. However it becomes n − 1 in the worst case, so we need O(log n) bits per value if we use fixed-width integers. Some heuristics have been used to reduce the space for storing string-depths. Clark and Munro [3] used O(log log log n)-bit fields to store not string-depths but edge lengths. The total size for all nodes is O(n log log log n) bits, which is not linear in n. Furthermore, it takes time proportional to the depth to compute the string-depth of a node because it is the sum of the edge lengths from the root to the node.

2.2 Leaf indices
If the leaves of a suffix tree are numbered by a preorder traversal of the tree, the indices to the text stored in the leaves coincide with the suffix array [11] of the text. Since it is an array of n indices, it occupies n log n bits. Any pattern of length m can be found in O(m log n) time by a binary search using the suffix array and the text, or in O(m + log n) time using lcp information of suffixes.
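The semantics of rank_p and select_p can be sketched with naive linear scans (our own illustrative code; the o(n)-bit structures of [12] replace these scans with precomputed directories to reach constant time):

```python
# Naive rank/select over a string P for a short pattern p such as "()".
# Positions are 0-based here for simplicity.
def rank_p(P, p, i):
    # number of occurrences of p starting at positions 0..i
    return sum(P.startswith(p, k) for k in range(i + 1))

def select_p(P, p, i):
    # 0-based start position of the i-th (1-based) occurrence of p
    count = 0
    for k in range(len(P)):
        if P.startswith(p, k):
            count += 1
            if count == i:
                return k
    raise ValueError("fewer than i occurrences of p")
```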
The compressed suffix array of Grossi and Vitter [6] reduces the size of the suffix array for binary alphabets from n log n bits to O(n) bits. In exchange, the access time for an entry changes from constant to O(log^ǫ n) time, where ǫ is any constant with 0 < ǫ ≤ 1. Sadakane [15] showed that the size of the compressed suffix array is O(nH1) bits for arbitrary alphabets, where H1 is the order-1 entropy of the text. The compressed suffix array stores a function Ψ instead of SA. The function Ψ[i] is defined in Definition 2.1 below.

The FM-index [5] reduces the size of a suffix array to O(nHk) bits. More precisely, the size is

  O( nHk + n|Σ|(log log n)/log n + n|Σ|(log |Σ|)/log n + n|Σ|/log n )

bits. Since Hk ≤ H1, the FM-index will be smaller than the compressed suffix array if |Σ| < log n. It stores a function LF, defined as LF[i] = SA^{−1}[SA[i] − 1], instead of SA. The FM-index is a self-indexing data structure, that is, it can be used to find a pattern without using the text itself. This means that we can find a pattern in a compressed text, because the size of the FM-index will be smaller than the text size n log |Σ|. The number of occurrences of a pattern of length m can be found in O(m) time. This is faster than using the suffix array and lcp information. Computing the position of each occurrence in the text takes an additional O(log^ǫ n) time.

Sadakane [15] showed that the compressed suffix array can be modified to form a self-indexing data structure. Its size is expressed by the order-1 entropy of the text; therefore it will be smaller than the original text. A character T[j] in the text can be extracted in constant time if the lexicographic order i of the suffix Tj = T[j..n] is given. The algorithm is as follows. We use a bit-vector D[1..n] such that D[i] = 1 if T[SA[i]] ≠ T[SA[i − 1]] or i = 1. Since the suffixes are lexicographically sorted, their first characters are also sorted alphabetically. Thus the number of 1's in D[1..i] represents the number of distinct characters in the text which are not larger than T[SA[i]], and it is computed in constant time using the rank function. This number can be easily converted to a character code. The bit-vector occupies n + o(n) bits, or about |Σ| log(n/|Σ|) + o(n) bits [14]. Sadakane also showed that the inverse of the suffix array can be supported by a data structure of n + o(n) additional bits over the compressed suffix array, with O(log^ǫ n) access time.
The inverse suffix array is important for traversing a suffix tree quickly without using the original text. Our algorithms use the lexicographic orders of suffixes to represent them. Therefore, to obtain the k-th character on an edge of the suffix tree it is necessary to compute SA^{−1}[SA[i] + k]. If we do not have the inverse function, this takes O(k) time.
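The function Ψ introduced above can be computed directly from an uncompressed suffix array; the following sketch (our own naming, with 1-based arrays as in the paper) only illustrates the definition, whereas the compressed suffix array stores Ψ succinctly and never materializes SA:

```python
# Psi[i] = SA^{-1}[SA[i] + 1], wrapping to the rank of T_1 when SA[i] = n.
def make_psi(SA):
    n = len(SA)
    inv = [0] * (n + 1)              # inv[j] = lexicographic rank of suffix T_j
    for i, j in enumerate(SA, start=1):
        inv[j] = i
    # index 0 is a dummy so that psi[i] is the 1-based Psi[i]
    return [0] + [inv[SA[i - 1] + 1] if SA[i - 1] < n else inv[1]
                  for i in range(1, n + 1)]

psi = make_psi([7, 1, 3, 5, 2, 4, 6])   # suffix array of "ababac$"
```

Note that within the block of suffixes starting with the same character (e.g. ranks 2–4 for 'a'), the computed Ψ values are increasing, which is the piecewise monotonicity exploited in Section 4.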

Definition 2.1.

  Ψ[i] ≡ i′ such that SA[i′] = SA[i] + 1  (if SA[i] < n)
         i′ such that SA[i′] = 1          (if SA[i] = n)

In other words, Ψ[i] = SA^{−1}[SA[i] + 1] unless SA[i] = n. Each value of Ψ can be computed in constant time.

2.3 Simulating suffix tree traversal by suffix array and height array
Kasai et al. [9] showed that a bottom-up traversal of a suffix tree can be simulated by using only the suffix array and an array storing the lengths of the longest common prefixes between suffixes, called the Hgt array. Though the Hgt array stores the lengths of the longest common prefixes only between adjacent suffixes in the suffix array, this is enough for a bottom-up traversal of the suffix tree. Many

problems are solved by a bottom-up traversal of the tree. They used a 16-bit field to store each element of the Hgt array. Though this size is enough for most of the values, it is still large; moreover, some values cannot be stored. Our encoding solves this problem.
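For reference, the Hgt array itself can be computed in O(n) time by the algorithm of Kasai et al. [9]; the following sketch (with our own naming, 1-based text positions) is not part of the paper's contribution, which concerns storing Hgt succinctly:

```python
# Kasai et al.'s linear-time lcp computation.
# Hgt[i] = lcp(T_SA[i], T_SA[i+1]), with Hgt[n] = 0.
def hgt_array(text, SA):
    n = len(text)
    inv = [0] * (n + 1)
    for i, j in enumerate(SA, start=1):
        inv[j] = i
    Hgt = [0] * (n + 1)
    h = 0
    for j in range(1, n + 1):        # process suffixes in text order
        i = inv[j]
        if i < n:
            k = SA[i + 1 - 1]        # start of the next suffix... 1-based SA
            k = SA[i]                # SA[i+1] in 1-based indexing
            while j + h <= n and k + h <= n and text[j + h - 1] == text[k + h - 1]:
                h += 1
            Hgt[i] = h
            if h > 0:
                h -= 1               # Lemma 3.2: the next lcp drops by at most one
        else:
            h = 0
    return Hgt[1:]
```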

2.4 Computing the lowest common ancestors in a tree
The lowest common ancestor (lca) between two nodes u and v in a tree is the furthest node from the root that lies on both paths from the root to u and to v. Bender and Farach-Colton [1] proposed a simple data structure to compute the lca between any two nodes of a tree of n nodes in constant time. The data structure occupies O(n) words, that is, O(n log n) bits.

They reduce the lca computation to the following Range Minimum Query (RMQ) problem:

Problem 1. [1] For indices i and j between 1 and n of an array L, the query RMQ_L(i, j) returns the index of the smallest element in the subarray L[i..j].

An example of the array L is shown in Fig. 1. If we know the positions x and y of the elements in L which correspond to nodes u and v in the tree, the element at position RMQ_L(x, y) in L represents the lca node between u and v. Therefore in the original algorithm, tables to convert a node into/from its position in L are used in addition to L. These tables also occupy O(n log n) bits.

3 New data structures for lcp information

In this section we first propose a new data structure for the Hgt array. Then we show that the array can be used to encode string-depths in a suffix tree.

3.1 Data structures for Hgt array
The length of the longest common prefix of two strings s, t is denoted by lcp(s, t). An array Hgt[1..n] is defined as follows.

Definition 3.1. Hgt[i] = lcp(TSA[i], TSA[i+1])

We define Hgt[n] = 0. Though the values of the Hgt array for a text T are usually small, they may reach n − 1. Therefore, with fixed-width integers, an array whose entries are log n bits wide would be necessary. However we can store the values more efficiently by using the properties of the Hgt array.

Theorem 3.1. Given i and SA[i], the value Hgt[i] can be computed in constant time using a data structure of size 2n + o(n) bits.

To achieve this, we use a space-efficient data structure for storing sorted integers [4] and the select function [12], as used in the compressed suffix array [6].

Lemma 3.1. Given s integers in sorted order, each containing w bits, where s < 2^w, we can store them with at most s(2 + w − ⌊log s⌋) + O(s/log log s) bits, so that retrieving the h-th integer takes constant time.

To encode the Hgt array, the above data structure cannot be used directly because the numbers are not sorted. However we can convert them into sorted ones by using the following lemma.

Lemma 3.2. Hgt[Ψ[i]] ≥ Hgt[i] − 1

Proof. Let p = SA[i], q = SA[i + 1] and l = Hgt[i] = lcp(Tp, Tq). If T[p] ≠ T[q], then Hgt[i] = 0 and the inequality holds because Hgt[Ψ[i]] ≥ 0. If T[p] = T[q], consider the suffixes Tp+1 and Tq+1. From the definition of Ψ, SA[Ψ[i]] = p + 1 and SA[Ψ[i + 1]] = q + 1. The suffix Tq+1 is lexicographically larger than the suffix Tp+1, by the definition of lexicographic order. That is, Ψ[i] < Ψ[i + 1]. Therefore an integer i′ such that Ψ[i] + 1 = i′ ≤ Ψ[i + 1] exists. The suffix TSA[i′] has a prefix of length l − 1 that matches prefixes of both Tp+1 and Tq+1, again by the definition of lexicographic order. This completes the proof.

Note that this lemma is identical to that in Kasai et al. [9]. From this lemma, we have the following relation for p = SA^{−1}[1]:

  Hgt[p] ≤ Hgt[Ψ[p]] + 1 ≤ · · · ≤ Hgt[Ψ^{n−1}[p]] + n − 1 = n − 1

where the equality comes from the fact that SA[Ψ^{n−1}[p]] = n, and the Hgt value for the suffix Tn is 0 because T[n] = $ is a unique terminator. Now we have n sorted numbers Hgt[Ψ^k[p]] + k for k = 0, 1, . . . , n − 1 in the range [0, n − 1], which can be stored using 2n + o(n) bits and accessed in constant time. The remaining task to obtain Hgt[i] is to compute the k such that i = Ψ^k[p]; this is easy. From the definition of Ψ,

  SA[Ψ^k[i]] = SA[i] + k

for any i. Therefore

  SA[i] = SA[Ψ^k[p]] = SA[p] + k = k + 1,

that is, k = SA[i] − 1. Figure 2 shows an example of creating the sorted numbers. The last row shows SA[i] + Hgt[i] for i = 1, 2, . . . , n, which is the sum of the second and third rows. If we sort the numbers in order of SA[i] values, we obtain the sorted sequence "4 4 4 4 5 6 7."
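The whole encoding can be sketched end to end in a few lines (our own illustrative code, following Figure 2's convention of storing SA[i] + Hgt[i]; a real implementation replaces the linear scan over the bit-vector with the o(n)-bit constant-time select structure):

```python
# Encode Hgt (1-based arrays given as 0-based Python lists) into a
# unary-coded bit sequence of length at most 2n, and decode Hgt[i]
# given i and SA[i].
def encode_hgt(SA, Hgt):
    n = len(SA)
    s = [0] * n
    for i in range(1, n + 1):
        s[SA[i - 1] - 1] = SA[i - 1] + Hgt[i - 1]   # sorted, by Lemma 3.2
    bits, prev = [], 0
    for v in s:                       # unary-code the non-negative gaps
        bits.extend([0] * (v - prev) + [1])
        prev = v
    return bits

def decode_hgt(bits, SA, i):
    k = SA[i - 1] - 1                 # we need the k-th stored value (0-based)
    ones = pos = 0
    for pos, b in enumerate(bits):    # stand-in for select_1(bits, k + 1)
        ones += b
        if ones == k + 1:
            break
    v = pos + 1 - (k + 1)             # zeros before the (k+1)-th one = value
    return v - SA[i - 1]              # Hgt[i] = (SA[i] + Hgt[i]) - SA[i]
```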

  i        1  2  3  4  5  6  7
  Hgt      0  3  1  0  2  0  0
  SA       7  1  3  5  2  4  6
  SA+Hgt   7  4  4  5  4  4  6

Figure 2: How to create a sorted sequence from Hgt.

These numbers are then encoded into the 0,1 sequence "00001 1 1 1 01 01 01", whose length is at most 2n bits. The blanks in the sequence are only for explanation. We can uniquely decode the numbers from the sequence. The algorithm to calculate Hgt[i] becomes as follows:

1. Extract the k-th entry v (k ≥ 0) of the sorted numbers, where k = SA[i] − 1.
2. Subtract k from v.

A problem of our encoding of the Hgt array is that the value Hgt[i] is stored in the bit-vector H in the order of SA[i], not of i. Therefore the accesses to H become random if we retrieve the suffix array lexicographically. Another problem is that both i and SA[i] are necessary to compute Hgt[i]. If we use the compressed suffix array, retrieving SA[i] takes O(log^ǫ n) time.

3.2 Computing lcp between arbitrary suffixes
We propose a data structure to compute the length of lcp between arbitrary suffixes. It consists of two components: one to store the lcp values, proposed above, and one to store the lca (lowest common ancestor) between two leaves of a suffix tree. The former occupies 2n + o(n) bits and the latter 4n + o(n) bits.

Because the suffixes are lexicographically sorted in the suffix array, computing lcp between two suffixes TSA[l] and TSA[r] (l < r) is equivalent to computing the minimum of Hgt[l], Hgt[l + 1], . . . , Hgt[r − 1]. Let Hgt[m] be the minimum value among them. If there is more than one minimum value, we can choose an arbitrary one. Then there must exist a unique node in the suffix tree that has string-depth Hgt[m] and that lies on the paths from the root to both leaf(l) and leaf(r). This node is the lca of the two leaves of the suffix tree that correspond to the suffixes. That is, lca(leaf(l), leaf(r)) = lca(leaf(m), leaf(m + 1)). Therefore we first compute the lca node, secondly compute the index m, and then compute Hgt[m] by the above algorithm.

To compute lcp(TSA[l], TSA[r]), we first compute the node v = lca(leaf(l), leaf(r)), then we compute the index m in [l, r − 1] such that Hgt[m] is the minimum value among Hgt[l], Hgt[l + 1], . . . , Hgt[r − 1].

Lemma 3.3. Let x and y be the positions of '()' in P that represent the leaves TSA[l] and TSA[r] respectively. Then the index m of Hgt[m] that attains the minimum value among Hgt[l], Hgt[l + 1], . . . , Hgt[r − 1] can be computed by

m = rank_()(P, RMQ_L(x, y)).

Proof. RMQ_L(x, y) returns the position p of a close parenthesis ')' in the parentheses sequence P corresponding to the node v = lca(leaf(l), leaf(r)). This node is equal to lca(leaf(m), leaf(m + 1)). Because P represents a depth-first traversal of the tree, and in particular the lexicographic order of the leaves, the leaves leaf(i) for i = 1, 2, . . . , m appear to the left of p in P, and the other leaves appear to the right of p. Therefore m is equal to the number of leaves encoded to the left of p in P. That is, m = rank_()(P, p).

To compute lca in constant time, we use the algorithm of Bender and Farach-Colton [1]. Because their algorithm uses O(n) words, or O(n log n) bits, of memory, we propose a new data structure for the algorithm.

The original algorithm: To compute the lca between two nodes of an n-node tree, the original algorithm stores the node-depths of the nodes in an array L of size 2n − 1 words, in the order of a depth-first traversal of the tree. The node-depth of a node is the number of nodes on the path from the root to the node; it is usually different from its string-depth. An example is shown in Figure 1. The lca between two nodes can then be represented as the minimum value in the subarray of L whose boundaries correspond to the nodes. To compute the index of the minimum value in constant time, for every log2 n elements of L (i.e., for each i that is a multiple of ⌊log2 n⌋), the minimum values between L[i] and L[i + 2^k] (k = 0, 1, . . . , log2 n) are stored in a two-dimensional array M[i, k]. Then the index of the minimum value between L[i] and L[j] (i and j multiples of ⌊log2 n⌋) can be computed in constant time as min{M[i, k], M[j − 2^k + 1, k]} where k = ⌊log2(j − i)⌋. If i or j is not a multiple of ⌊log2 n⌋, we use table lookups to find the minimum element in a subarray of length at most log2 n. This data structure occupies O(n) words because O(log n) words are stored for every log2 n elements.

Our algorithm: We store M[i, k] only if i is a multiple of (log2 n)^3. Then the array occupies only O((n/log^3 n) · log n · log n) = O(n/log n) = o(n) bits. To compute the minimum element in a subarray of length at most (log2 n)^3, we store another two-dimensional table M′[i, k], where i is a multiple of log2 n and k = 0, 1, . . . , log2((log2 n)^3). This table occupies O((n/log n) · log(log^3 n) · log(log^3 n)) = o(n) bits. We also use the table for subarrays of length at most log2 n.

The values of L for the suffix tree can be stored in at most 4n + o(n) bits as follows. Because the difference between two adjacent elements of L is 1 or −1, we encode the differences by a sequence P of open and close parentheses (see Figure 1). To compute the elements of L in constant time, we store them explicitly for every (log2 n)^2 elements, and for every log2 n elements we store the difference from the nearest explicitly stored element. In fact, the parentheses sequence P is exactly the same as the encoding of the tree [13]. By using the above data structure, the position of the minimum value among L[x], L[x+1], . . . , L[y] can be computed in constant time. To use the data structure for computing lcp between two suffixes TSA[i] and TSA[j], it is necessary to convert the indices i and j into those in L.
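The underlying reduction, lcp(TSA[l], TSA[r]) as a range minimum over Hgt, can be sketched directly with a standard sparse table (our own simplified stand-in, using O(n log n) bits, whereas the structure above achieves 4n + o(n) bits):

```python
# Sparse-table RMQ over Hgt: M[k][i] = index of the minimum of
# Hgt[i..i+2^k-1] (0-based).  Build in O(n log n), query in O(1).
def build_rmq(Hgt):
    n = len(Hgt)
    M = [list(range(n))]
    for k in range(1, max(1, n.bit_length())):
        half, prev, row = 1 << (k - 1), M[k - 1], []
        for i in range(n - (1 << k) + 1):
            a, b = prev[i], prev[i + half]
            row.append(a if Hgt[a] <= Hgt[b] else b)
        M.append(row)
    return M

def lcp(Hgt, M, l, r):
    """lcp of the suffixes of lexicographic ranks l < r (1-based)."""
    i, j = l - 1, r - 2                    # minimum over Hgt[l..r-1]
    k = (j - i + 1).bit_length() - 1       # two overlapping 2^k windows
    a, b = M[k][i], M[k][j - (1 << k) + 1]
    return min(Hgt[a], Hgt[b])
```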

Lemma 3.4. The position x of the parentheses '()' in P that correspond to leaf(i) can be computed by x = select_()(P, i).

Proof. Because any occurrence of '()' in P corresponds to a leaf of the tree, and the leaves appear in the lexicographic order of the suffixes, the lemma holds.

Now we have the algorithm to compute lcp(TSA[l], TSA[r]):

1. Compute x = select_()(P, l) and y = select_()(P, r).
2. Compute m = rank_()(P, RMQ_L(x, y)).
3. Compute Hgt[m].

Computing Hgt[m] takes constant time using the suffix array, or O(log^ǫ n) time using the compressed suffix array. Thus we have the following theorem:

Theorem 3.2. Given i, j, the length of the longest common prefix between the suffixes TSA[i] and TSA[j] can be computed in constant time using a data structure of size 6n + o(n) bits and the suffix array, or in O(log^ǫ n) time using the data structure of size 6n + o(n) bits and the compressed suffix array.

4 Improved representation of compressed suffix arrays

The FM-index can find any pattern P in a text of length n in O(|P|) time, while the compressed suffix array takes O(|P| log n) time. Unfortunately the FM-index achieves O(|P|)-time queries only if |Σ| is constant. In this section we show that pattern matching using the compressed suffix array can also be done in O(|P|) time if |Σ| = (log n)^{O(1)}, by using additional data structures of size H1 · o(n) bits.

To search for a pattern P[1..m], the FM-index iteratively finds its suffixes from right to left. Given an interval [l_{i+1}, r_{i+1}] in the suffix array corresponding to the suffix P[i+1..m], the interval [l_i, r_i] corresponding to P[i..m] is computed in constant time. We call this search algorithm backward search. The compressed suffix array can also be used for backward search. Let [Lc, Rc] be the interval in the suffix array corresponding to a character c. We store these intervals for all characters in the alphabet.

Lemma 4.1. The interval [l_i, r_i] can be calculated from [l_{i+1}, r_{i+1}] by

  l_i = min { j | L_{P[i]} ≤ j ≤ R_{P[i]}, Ψ[j] ≥ l_{i+1} }
  r_i = max { j | L_{P[i]} ≤ j ≤ R_{P[i]}, Ψ[j] ≤ r_{i+1} }.

Proof. Because the suffixes are lexicographically sorted in the suffix array, the interval [l_i, r_i] lies in [L_{P[i]}, R_{P[i]}], which corresponds to the suffixes with first character P[i]. And because the suffixes corresponding to [L_{P[i]}, R_{P[i]}] begin with the same character P[i], their lexicographic order is determined by that of the suffixes with the first character removed. Since l_{i+1} ≤ Ψ[j] ≤ r_{i+1} means that the suffix TSA[j]+1 has the prefix P[i+1..m], and L_{P[i]} ≤ j ≤ R_{P[i]} means that T[SA[j]] = P[i], the suffix TSA[j] has the prefix P[i..m].

Computing l_i and r_i takes O(log n) time if we perform a binary search on Ψ. However it can be improved to constant time by using the data structure for rank and predecessor queries [14].

Theorem 4.1. [14] For n = s log^{O(1)} s, a static rank dictionary with worst-case constant query time, supporting rank and predecessor queries, can be stored in B + O(s(log log s)^2/log s) bits, where B = ⌈log (n choose s)⌉.

Precisely, let d ≥ 1 be a constant such that n ≤ s log^d s. Then the data structure occupies B + O(d·s(log log s)^2/log s) bits. We can calculate the number of elements Ψ[j] in the interval [l, r] which are smaller than l_{i+1} using this data structure.

The Ψ function is piecewise monotone. If the suffix array stores in [Lc, Rc] the positions of the suffixes which begin with a character c, then Ψ[i] for Lc ≤ i ≤ Rc is monotone increasing. Thus we divide the Ψ function into |Σ| monotone functions Ψc, each corresponding to the suffixes which begin with c, and they can be stored using the above data structure. Note that the compressed suffix

array is a hierarchical data structure, and the above data structure is used only in the lowest level.
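The backward-search step of Lemma 4.1 can be sketched with its O(log n)-time binary-search variant (our own illustrative code; the paper replaces the binary search by constant-time predecessor queries on the monotone pieces Ψc):

```python
import bisect

# psi is a 1-based list (psi[0] is a dummy); C[c] = (L_c, R_c), the
# suffix-array interval of suffixes beginning with character c.
def backward_search(P, psi, C):
    lo, hi = 1, len(psi) - 1          # interval of the empty suffix: all ranks
    for ch in reversed(P):
        if ch not in C:
            return None
        Lc, Rc = C[ch]
        block = psi[Lc:Rc + 1]        # monotone increasing piece Psi_c
        lo = Lc + bisect.bisect_left(block, lo)        # min j with Psi[j] >= lo
        hi = Lc + bisect.bisect_right(block, hi) - 1   # max j with Psi[j] <= hi
        if lo > hi:
            return None
    return lo, hi                     # number of occurrences = hi - lo + 1
```

For "ababac$" (Ψ = 2, 5, 6, 7, 3, 4, 1 and intervals '$':[1,1], 'a':[2,4], 'b':[5,6], 'c':[7,7]), searching "aba" yields the interval [2, 3], i.e. two occurrences.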

Theorem 4.2. The compressed suffix array for a text of length n on an alphabet Σ such that |Σ| = (log n)^{O(1)}, supporting O(|P|)-time existence and counting queries for any pattern P, can be stored in

  ( (1/ǫ) + o(1) ) nH1 + O(n)

bits.

Proof. The size of the data structure storing the function Ψc corresponding to a character c is as follows. Let nc be the number of occurrences of the character c in the text. In this case d = O(log(n/nc)/log log nc). Then the size of the data structure is

  Bc + O( (log(n/nc))/(log log nc) · (nc (log log nc)^2)/(log nc) )

where Bc = ⌈log (n choose nc)⌉ = nc log(n/nc) + O(nc). The summation of the sizes of the data structures for all characters is

  Σc Bc + O( Σc (log(n/nc))/(log log nc) · (nc (log log nc)^2)/(log nc) )
  = nH1 + nH1 · O( (log log n)/(log n) ) + O(n)

where H1 is the order-1 entropy of the text, defined as Σc pc log(1/pc), because Σc nc = n and pc = nc/n is the probability of the occurrence of the character c in the text.

Note that computing the predecessor takes O( log(n/nc)/log log nc ) time, which is not constant if n > nc (log nc)^{O(1)}. To make the condition hold for any c, we add n/(|Σ| log^2 n) dummy elements to each Ψc. Let n′c = nc + n/(|Σ| log^2 n), and assume that |Σ| ≤ log^l n. Then n ≤ n′c |Σ| log^2 n ≤ n′c (log n)^{l+2} < n′c (log n′c)^{l′} for a certain constant l′. Therefore the predecessor query takes constant time, and the size of the data structure increases by only

  Σc ( ⌈log (n choose n′c)⌉ − ⌈log (n choose nc)⌉ ) + o(n)
  ≤ Σc (n′c − nc) log(n/n′c) + o(n)
  = o(n)

bits, because Σc (n′c − nc) = n/log^2 n and log(n/n′c) ≤ log n.

5 Concluding remarks

We have proposed succinct representations of lcp (longest common prefix) information between suffixes. They can be used with the suffix array or the compressed suffix array as space-economical alternatives to suffix trees. We have also proposed a new encoding of the compressed suffix array which achieves O(|P|)-time counting queries for any pattern P in a text of length n on any alphabet Σ such that |Σ| = log^{O(1)} n.

Acknowledgment
The author would like to thank Prof. Takeshi Tokuyama of Tohoku University and Mr. Jesper Jansson of Lund University for their valuable comments. The author also thanks the anonymous referees for their helpful comments. Work of the author was supported in part by the Grant-in-Aid of the Ministry of Education, Science, Sports and Culture of Japan.

References

[1] M. Bender and M. Farach-Colton. The LCA Problem Revisited. In Proceedings of LATIN2000, LNCS 1776, pages 88–94, 2000. [2] D. Benoit, E. D. Demaine, J. I. Munro, and V. Raman. Representing Trees of Higher Degree. In Proceedings of the 6th International Workshop on Algorithms and Data Structures (WADS’99), LNCS 1663, pages 169– 180, 1999. [3] D. R. Clark and J. I. Munro. Efficient Suffix Trees on Secondary Storage. In Proceedings of the 7th Annual ACM-SIAM Symposium on Discrete Algorithms, pages 383–391, 1996. [4] P. Elias. Efficient storage and retrieval by content and address of static files. Journal of the ACM, 21(2):246– 260, 1974. [5] P. Ferragina and G. Manzini. Opportunistic Data Structures with Applications. In 41st IEEE Symp. on Foundations of Computer Science, pages 390–398, 2000. [6] R. Grossi and J. S. Vitter. Compressed Suffix Arrays and Suffix Trees with Applications to Text Indexing and String Matching. In 32nd ACM Symposium on Theory of Computing, pages 397–406, 2000. [7] D. Gusfield. Algorithms on Strings, Trees, and Sequences. Cambridge University Press, 1997. [8] G. Jacobson. Space-efficient Static Trees and Graphs. In 30th IEEE Symp. on Foundations of Computer Science, pages 549–554, 1989. [9] T. Kasai, G. Lee, H. Arimura, S. Arikawa, and K. Park. Linear-time Longest-Common-Prefix Computation in Suffix Arrays and Its Applications. In Proc. the 12th Annual Symposium on Combinatorial Pattern Matching (CPM’01), LNCS 2089, pages 181–192, 2001.

[10] S. Kurtz. Reducing the Space Requirement of Suffix Trees. Software – Practice and Experience, 29(13):1149–1171, 1999. [11] U. Manber and G. Myers. Suffix arrays: A New Method for On-Line String Searches. SIAM Journal on Computing, 22(5):935–948, October 1993. [12] J. I. Munro. Tables. In Proceedings of the 16th Conference on Foundations of Software Technology and Computer Science (FSTTCS ’96), LNCS 1180, pages 37–42, 1996. [13] J. I. Munro and V. Raman. Succinct Representation of Balanced Parentheses, Static Trees and Planar Graphs. In 38th IEEE Symp. on Foundations of Computer Science, pages 118–126, 1997. [14] R. Pagh. Low redundancy in static dictionaries with O(1) worst case lookup time. In Proceedings of ICALP’99, LNCS 1644, pages 595–604, 1999. [15] K. Sadakane. Compressed Text Databases with Efficient Query Algorithms based on the Compressed Suffix Array. In Proceedings of ISAAC’00, number 1969 in LNCS, pages 410–421, 2000. [16] S. Shimozono, H. Arimura, and S. Arikawa. Efficient Discovery of Optimal Word-Association Patterns in Large Text Databases. New Generation Computing, 18:49–60, 2000.
