Indexing Factors with Gaps

M. Sohel Rahman and Costas S. Iliopoulos

Algorithm Design Group, Department of Computer Science, King's College London, Strand, London WC2R 2LS, England
{sohel,csi}@dcs.kcl.ac.uk
http://www.dcs.kcl.ac.uk/adg

Abstract. Indexing of factors, or substrings, is a widely used technique in stringology and serves as a tool for solving diverse text algorithmic problems. A gapped-factor is a concatenation of a factor of length $k$, a gap of length $d$ and another factor of length $k'$. Such a gapped-factor is called a $(k-d-k')$-gapped-factor. The problem of indexing gapped-factors was considered recently in [22]. Given a text $T$ of length $n$ over an alphabet $\Sigma$ and the values of the parameters $k$, $d$ and $k'$, the construction of the corresponding index, the gapped-factor tree (GFT) [22], requires $O(n|\Sigma|)$ time. Once the GFT is constructed, the occurrences of a given $(k-d-k')$-gapped-factor can be reported in $O(k + k' + Occ)$ time, where $Occ$ is the number of occurrences of that factor in $T$. In this paper, we present a new, improved indexing scheme for gapped-factors. The improvements come from two aspects. Firstly, we generalize the indexing data structure in the sense that, unlike the GFT, it is independent of the parameters $k$ and $k'$. Secondly, our data structure can be constructed in $O(n \log^{1+\epsilon} n)$ time and space, where $0 < \epsilon < 1$. The only price we pay is a slight increase in the query time, namely an additional $\log\log n$ term.

1 Introduction

Indexing of words or factors is a widely used technique in stringology. The use of $k$-factors (factors or words of length $k$), also called $q$-grams in the literature, can be seen in solving diverse text algorithmic problems ranging from different string matching tasks [21, 23, 8] to motif finding [11] and alignment problems [19, 7, 4, 5, 15, 10, 16]. To use $k$-factors efficiently we need an efficient data structure to index them. Depending on the nature of the problem, different types of factors, and hence different data structures, may be needed.

Very recently, Peterlongo et al. [22] presented a data structure to index gapped-factors. A gapped-factor, as defined in [22], is a concatenation of a factor of length $k$, a gap of length $d$ and another factor of length $k'$. Such a gapped-factor is called a $(k-d-k')$-gapped-factor. In [22], the authors presented an index called a gapped-factor tree (GFT), obtained by modifying the $k$-factor tree [2] (a tree that indexes all $k$-factors of a text), which is itself an extension of the original suffix tree data structure [24, 18]. Given a text $T$ of length $n$ over an alphabet $\Sigma$ and the values of the parameters $k$, $d$ and $k'$, the construction of the corresponding GFT requires $O(n|\Sigma|)$ time and space. Once the GFT is constructed, the occurrences of a given $(k-d-k')$-gapped-factor can be reported in $O(k + k' + Occ)$ time, where $Occ$ is the number of occurrences of that factor in $T$.

In this paper, we present a new, improved indexing scheme for gapped-factors. The improvements come from two aspects:

1. We generalize the indexing data structure in the sense that it is independent of the parameters $k$ and $k'$. As a result, our data structure can index any factor consisting of two sub-factors of arbitrary lengths separated by a gap of length $d$. Note carefully that a GFT [22] is specific to particular values of $k$, $k'$ and $d$. In our case, on the other hand, only the parameter $d$ is fixed a priori.
2. We also considerably improve the construction cost of the data structure and make it alphabet independent (see footnote 3). Our data structure can be constructed in $O(n \log^{1+\epsilon} n)$ time and space, where $0 < \epsilon < 1$. The only price we pay is a slight increase in the query time, namely an additional $\log\log n$ term.

The rest of the paper is organized as follows. In Section 2, we present the preliminary concepts. Section 3 presents the main result of this paper, i.e. the construction of the data structure GFI to index gapped-factors. In Section 4, modifications to GFI are presented to handle multiple strings. Finally, we conclude in Section 5.

2 Preliminaries

A text, also called a string, is a sequence of zero or more symbols from an alphabet $\Sigma$. A text $T$ of length $n$ is denoted by $T[1..n] = T_1 T_2 \ldots T_n$, where $T_i \in \Sigma$ for $1 \le i \le n$. The length of $T$ is denoted by $|T| = n$. The string $\overleftarrow{T}$ denotes the reverse of the string $T$, i.e. $\overleftarrow{T} = T_n T_{n-1} \ldots T_1$. A string $w$ is a factor of $T$ if $T = uwv$ for $u, v \in \Sigma^*$; in this case, the string $w$ occurs at position $|u| + 1$ in $T$. The factor $w$ is denoted by $T[|u|+1..|u|+|w|]$. A $k$-factor is a factor of length $k$. A prefix (or suffix) of $T$ is a factor $T[x..y]$ such that $x = 1$ ($y = n$), $1 \le y \le n$ ($1 \le x \le n$). We define the $i$th prefix to be the prefix ending at position $i$, i.e. $T[1..i]$, $1 \le i \le n$. Similarly, the $i$th suffix is the suffix starting at position $i$, i.e. $T[i..n]$, $1 \le i \le n$.

Footnote 3. Alphabet independence holds under the assumption that the alphabet is of fixed size. Otherwise, for an alphabet $\Sigma$, a $\log|\Sigma|$ factor remains present in the complexity.

We define a gapped-factor to be a concatenation of two factors separated by a gap or, equivalently, a block of don't care characters, where a don't care character '$*$' can match any character $a \in \Sigma$ and $* \notin \Sigma$. A $d$-gapped-factor is a gapped-factor where the length of the gap is $d$. A $(k-d-k')$-gapped-factor is a $d$-gapped-factor where the lengths of the two sub-factors are, respectively, $k$ and $k'$. If $X$ is a $(k-d-k')$-gapped-factor, then $X = X_f *^d X_\ell$, where $X_f = X[1..k]$, $X_\ell = X[k+d+1..|X|]$ and $*^d$ denotes the concatenation of $d$ don't care characters. A $(k-d-k')$-gapped-factor $X$ is said to occur at position $i$ of a string $Y$ if and only if $Y[i..i+k-1] = X_f$ and $Y[i+k+d..i+k+d+k'-1] = X_\ell$. The position $i$ is said to be an occurrence of $X$ in $Y$. We denote by $Occ^T_X$ the set of occurrences of $X$ in $T$.

Example 1. Suppose we are given a text $T$ and a gapped-factor $X$ as follows:

T = AGGACCGGGTTGACTTCGTTGAAG
X = GAC $*^3$ GTTGA

Note that we have $|X| = 11$, $X_f$ = GAC, $X_\ell$ = GTTGA, and $d = 3$. It is easy to see that $X$ occurs at positions 3 and 12 of $T$ (see Fig. 1).

[Figure: the text T with positions 1 to 24, and the gapped-factor X = GAC $*^3$ GTTGA aligned at its two occurrences, with the spans $k$, $d$ and $k'$ marked.]
Fig. 1. The gapped-factor X and its occurrences in T of Example 1
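The definition of an occurrence can be checked directly with a naive scan. The following is only an illustrative sketch, not part of the indexing scheme; the function name `gapped_occurrences` is an assumption introduced here, and positions are reported 1-based to match the text.

```python
def gapped_occurrences(text, x_f, d, x_l):
    """Return the 1-based positions where X = x_f *^d x_l occurs in text."""
    k, k2 = len(x_f), len(x_l)
    total = k + d + k2  # |X|
    result = []
    for i in range(len(text) - total + 1):  # i is 0-based inside the loop
        first_ok = text[i:i + k] == x_f
        last_ok = text[i + k + d:i + total] == x_l
        if first_ok and last_ok:
            result.append(i + 1)  # convert to the 1-based convention of the paper
    return result


T = "AGGACCGGGTTGACTTCGTTGAAG"
print(gapped_occurrences(T, "GAC", 3, "GTTGA"))  # [3, 12], as in Example 1
```

This scan costs time proportional to $n \cdot |X|$ per query, which is the cost an index is meant to avoid.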

In traditional indexing problems, one of the basic data structures used is the suffix tree. Our indexing scheme also makes use of this data structure. A complete description of a suffix tree is beyond the scope of this paper, and can be found in [18, 24] or in any textbook on stringology (e.g. [6, 9]). However, for the sake of completeness, we define the suffix tree data structure as follows. Given a string $T$ of length $n$ over an alphabet $\Sigma$, the suffix tree $ST_T$ of $T$ is the compacted trie of all suffixes of $T\$$, where $\$ \notin \Sigma$. Each leaf in $ST_T$ represents a suffix $T[i..n]$ of $T$ and is labeled with the index $i$. We refer to the list (in left-to-right order) of indices of the leaves of the subtree rooted at node $v$ as the leaf-list of $v$; it is denoted by $LL(v)$. Each edge in $ST_T$ is labeled with a nonempty substring of $T$ such that the path from the root to the leaf labeled with index $i$ spells the suffix $T[i..n]$. For any node $v$, we let $\ell_v$ denote the string obtained by concatenating the substrings labeling the edges on the path from the root to $v$, in the order they appear. Several algorithms exist that can construct the suffix tree $ST_T$ in $O(n)$ time and space, assuming an alphabet of fixed size [18, 24].

Given the suffix tree $ST_T$ of a text $T$, we define the "locus" $\mu^P$ of a pattern $P$ as the node in $ST_T$ such that $\ell_{\mu^P}$ has the prefix $P$ and $|\ell_{\mu^P}|$ is the smallest among all such nodes. Note that the locus of $P$ does not exist if $P$ is not a substring of $T$. Therefore, given $P$, finding $\mu^P$ suffices to determine whether $P$ occurs in $T$. Given the suffix tree of a text $T$ and a pattern $P$, one can find the locus of $P$, and hence decide whether $T$ has an occurrence of $P$, in optimal $O(|P|)$ time [18, 24]. In addition, all such occurrences can be reported in constant time per occurrence.
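The property that matters later is that the occurrences of a pattern form one contiguous block of the leaf-list. A minimal sketch of that property follows, using a sorted list of suffixes (a naively built suffix array) as a stand-in for the suffix tree; this substitution, the quadratic construction and the names `suffix_array` and `pattern_interval` are assumptions made for illustration only.

```python
def suffix_array(text):
    """1-based start positions of all suffixes of text, in lexicographic order (naive)."""
    return sorted(range(1, len(text) + 1), key=lambda i: text[i - 1:])


def pattern_interval(text, sa, pattern):
    """Return the 0-based inclusive range (lo, hi) of sa whose suffixes start with pattern, or None."""
    m = len(pattern)

    def prefix(idx):
        start = sa[idx] - 1
        return text[start:start + m]

    lo, hi = 0, len(sa)
    while lo < hi:  # first suffix whose m-prefix is >= pattern
        mid = (lo + hi) // 2
        if prefix(mid) < pattern:
            lo = mid + 1
        else:
            hi = mid
    first = lo
    hi = len(sa)
    while lo < hi:  # first suffix whose m-prefix is > pattern
        mid = (lo + hi) // 2
        if prefix(mid) <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return (first, lo - 1) if first < lo else None


T = "AGGACCGGGTTGACTTCGTTGAAG"
sa = suffix_array(T)
lo, hi = pattern_interval(T, sa, "GAC")
print(sorted(sa[lo:hi + 1]))  # [3, 12]
```

The pair `(lo, hi)` plays the role of the leftmost and rightmost leaves below the locus of the pattern.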

3 Gapped-Factor Index

In this section we present the data structure to index gapped-factors. Suppose we are given a text $T$ and a gapped-factor $X = X_f *^d X_\ell$. We first discuss how we can find the occurrences of $X$ in $T$, and then we use the underlying idea to construct the data structure that indexes the gapped-factors. The idea is to first find $Occ^T_{X_f}$ and $Occ^T_{X_\ell}$, and then to find the occurrences common to both in the sense of the definition of $X$. In other words, we need to find $\{i \mid i \in Occ^T_{X_f} \text{ and } (i + |X_f| + d) \in Occ^T_{X_\ell}\}$. Algorithm 1 presents the steps formally.

Algorithm 1 Finding the Occurrences of $X = X_f *^d X_\ell$ in $T$
1: Compute $Occ^T_{X_f}$
2: for $i \in Occ^T_{X_f}$ do
3:     $i = i + |X_f| + d$
4: end for
5: Compute $Occ^T_{X_\ell}$
6: Compute $Occ^T_X = Occ^T_{X_f} \cap Occ^T_{X_\ell}$
7: return $Occ^T_X$
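Algorithm 1 can be sketched in a few lines. In the sketch below the two occurrence sets are computed by a naive scan (an assumption; any pattern matching routine would do), and the result is shifted back by $|X_f| + d$ so that start positions of $X$ are reported, which matches the definition of an occurrence and the return value of Algorithm 3. The function names are hypothetical.

```python
def occurrences(text, pattern):
    """All 1-based start positions of pattern in text (naive scan)."""
    m = len(pattern)
    return {i + 1 for i in range(len(text) - m + 1) if text[i:i + m] == pattern}


def algorithm1(text, x_f, d, x_l):
    occ_f = occurrences(text, x_f)               # Step 1
    shifted = {i + len(x_f) + d for i in occ_f}  # Steps 2-4: align with starts of x_l
    occ_l = occurrences(text, x_l)               # Step 5
    common = shifted & occ_l                     # Step 6: intersection
    # Shift back to report where X itself starts (added here for convenience).
    return sorted(p - len(x_f) - d for p in common)


T = "AGGACCGGGTTGACTTCGTTGAAG"
print(algorithm1(T, "GAC", 3, "GTTGA"))  # [3, 12]
```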

To maintain a data structure that indexes the gapped-factors we use the idea presented in Algorithm 1. We maintain two suffix tree data structures, $ST_T$ and $ST_{\overleftarrow{T}}$. We use $ST_T$ to find the occurrences of $X_\ell$. We could find the occurrences of $X_f$ using $ST_T$ as well, but we need to take a different approach because we have to "align" the occurrences of $X_f$ (Step 2) with the occurrences of $X_\ell$, so that we can find the occurrences of $X$ by intersecting them (Step 6), just as is done in Algorithm 1. However, this is not as straightforward as in Algorithm 1, because our aim is to maintain an index rather than to find a match for one particular pattern.

What we do is as follows. We use the suffix tree of the reverse string of $T$, i.e. $ST_{\overleftarrow{T}}$, to find the occurrences of $\overleftarrow{X_f}$. By doing so we get, in effect, the end positions of the occurrences of $X_f$ in $T$. However, we still have to do a bit more "shifting" because of the gap of length $d$ to complete the alignment. This is handled as follows. According to the definition of the suffix tree, each leaf in $ST_T$ is labeled by the starting location of its suffix. To achieve the desired alignment, however, each leaf in $ST_{\overleftarrow{T}}$ is labeled by $(n+1) - i + d + 1$, where $i$ is the starting position of the leaf's suffix in $\overleftarrow{T}$. It is easy to see that getting the occurrences of $\overleftarrow{X_f}$ in $ST_{\overleftarrow{T}}$ is equivalent to getting the occurrences of $X_f$ in $ST_T$ according to our desired alignment.

So it remains to show how we can perform the intersection (Step 6) efficiently in the context of indexing. To do that we first do some preprocessing on $ST_T$ and $ST_{\overleftarrow{T}}$ as follows. For each of the two suffix trees we maintain a linked list of all leaves in left-to-right order. In other words, we realize the list $LL(R)$ in the form of a linked list, where $R$ is the root of the suffix tree. In addition, for each of the two suffix trees, we set pointers $v.left$ and $v.right$ from each tree node $v$ to its leftmost leaf $v_\ell$ and rightmost leaf $v_r$ (considering the subtree rooted at $v$) in the linked list. With this set of pointers at our disposal, we can indicate the set of occurrences of a pattern $P$ by the two leaves $\mu^P_\ell$ and $\mu^P_r$, because all the leaves between and including $\mu^P_\ell$ and $\mu^P_r$ in $LL(R)$ correspond to the occurrences of $P$ in $T$.
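The effect of the label $(n+1) - i + d + 1$ described above can be checked numerically. The sketch below is only a sanity check under the assumption of a naive scan; the helper names are hypothetical. It confirms that searching $\overleftarrow{X_f}$ in $\overleftarrow{T}$ and relabeling yields exactly the positions $p + |X_f| + d$, where $p$ ranges over the occurrences of $X_f$ in $T$.

```python
def occurrences(text, pattern):
    """All 1-based start positions of pattern in text (naive scan)."""
    m = len(pattern)
    return {i + 1 for i in range(len(text) - m + 1) if text[i:i + m] == pattern}


def aligned_labels(text, x_f, d):
    """Labels (n+1)-i+d+1 for every occurrence i of reversed x_f in reversed text."""
    n = len(text)
    reversed_hits = occurrences(text[::-1], x_f[::-1])
    return {(n + 1) - i + d + 1 for i in reversed_hits}


T = "AGGACCGGGTTGACTTCGTTGAAG"
labels = aligned_labels(T, "GAC", 3)
expected = {p + len("GAC") + 3 for p in occurrences(T, "GAC")}
print(sorted(labels), labels == expected)  # [9, 18] True
```

Positions 9 and 18 are exactly where $X_\ell$ = GTTGA must start for the occurrences of Example 1.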

In what follows we define the terms $\ell_T$ and $r_T$ such that $LL(R)[\ell_T] = \mu^{X_\ell}_\ell$ and $LL(R)[r_T] = \mu^{X_\ell}_r$, where $R$ is the root of $ST_T$. Similarly we define $\ell_{\overleftarrow{T}}$ and $r_{\overleftarrow{T}}$ such that $LL(\overleftarrow{R})[\ell_{\overleftarrow{T}}] = \mu^{\overleftarrow{X_f}}_\ell$ and $LL(\overleftarrow{R})[r_{\overleftarrow{T}}] = \mu^{\overleftarrow{X_f}}_r$, where $\overleftarrow{R}$ is the root of $ST_{\overleftarrow{T}}$. Now we have two lists $LL(R)$ and $LL(\overleftarrow{R})$ and two intervals $[\ell_T..r_T]$ and $[\ell_{\overleftarrow{T}}..r_{\overleftarrow{T}}]$, respectively. Our problem is to find the intersection of the indices (leaf labels) within these two intervals. We call this problem the Range Set Intersection Problem and define it formally below.

Problem "RSI" (Range Set Intersection Problem). Let $V[1..n]$ and $W[1..n]$ be two permutations of $[1..n]$. Preprocess $V$ and $W$ to answer the following form of queries.
Query: Find the intersection of the elements of $V[i..j]$ and $W[k..\ell]$, $1 \le i \le j \le n$, $1 \le k \le \ell \le n$.

To solve the above problem we reduce it to the well-studied Range Search Problem on a Grid.

Problem "RSG" (Range Search Problem on Grid). Let $A[1..n]$ be a set of $n$ points on the grid $[0..U]^2$. Preprocess $A$ to answer the following form of queries.
Query: Given a query rectangle $q \equiv (a, b) \times (c, d)$, find the set of points contained in $q$.

Problem RSI is just a different formulation of Problem RSG. This can be seen as follows. We set $U = n$. Since $V$ and $W$ in Problem RSI are permutations of $[1..n]$, every number in $[1..n]$ appears precisely once in each of them. We define the coordinates of every number $i \in [1..n]$ to be $(x, y)$, where $V[x] = W[y] = i$. Thus we get the $n$ points on the grid $[0..n]^2$, i.e. the array $A$ of Problem RSG. The query rectangle $q$ is deduced from the two intervals $[i..j]$ and $[k..\ell]$ as follows: $q \equiv (i, k) \times (j, \ell)$. It is easy to verify that the above reduction is correct, and hence we can solve Problem RSI using a solution to Problem RSG.

There has been significant research work on Problem RSG. We use the data structure of Alstrup et al. [3]. This data structure can answer the query of Problem RSG in $O(\log\log n + k)$ time, where $k$ is the number of points contained in the query rectangle $q$. The data structure requires $O(n \log^{1+\epsilon} n)$ time and space, for any constant $0 < \epsilon < 1$. Algorithm 2 formally states the steps to build our data structure (GFI) to index the gapped-factors. One final remark is that we can also use the suffix array data structure [17, 12–14] to build GFI, with some standard modifications to Algorithm 2.
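The reduction from Problem RSI to Problem RSG can be sketched as follows. A linear scan over the points stands in for the data structure of Alstrup et al. [3]; that substitution, and the names `build_points` and `range_set_intersection`, are assumptions made for illustration, so the sketch shows the reduction but not the stated query bound.

```python
def build_points(V, W):
    """For each value, the point (x, y) with V[x] = W[y] = value (1-based coordinates)."""
    pos_in_w = {value: y for y, value in enumerate(W, start=1)}
    return [(x, pos_in_w[value], value) for x, value in enumerate(V, start=1)]


def range_set_intersection(points, i, j, k, l):
    """Values common to V[i..j] and W[k..l], found by the rectangle query (i, k) x (j, l)."""
    return sorted(v for x, y, v in points if i <= x <= j and k <= y <= l)


V = [3, 1, 4, 2, 5]
W = [5, 4, 3, 2, 1]
points = build_points(V, W)
# V[2..4] = {1, 4, 2} and W[1..3] = {5, 4, 3}, so the intersection is {4}.
print(range_set_intersection(points, 2, 4, 1, 3))  # [4]
```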

3.1 Analysis

Let us analyze the running time of Algorithm 2. The algorithm can be divided into three main parts. Part 1 deals with the suffix tree of the text $T$ and comprises Steps 1 to 6. Part 2 consists of Steps 7 to 12 and deals with the suffix tree of the reverse text $\overleftarrow{T}$. Part 3 deals with the reduction from Problem RSI to Problem RSG and the subsequent preprocessing step. The computational effort spent for Parts 1 and 2 is identical and is $O(n)$, as follows. Step 1 (Step 7) builds the traditional suffix tree and hence can be done in $O(n)$ time and space. Step 2 (Step 8) can be done easily while building the suffix tree. Steps 3 and 4 (Steps 9 and 10) can be done together in $O(n)$ by traversing $ST_T$ ($ST_{\overleftarrow{T}}$) using a breadth first or in order traversal. So, in total, Parts 1 and 2, i.e. Steps 1 to 12, require $O(n)$ time and space.

In Part 3 we first construct the set $A$ of points in the grid $[0..n]^2$ on which we will apply the range search. This step can also be done in $O(n)$ as follows. Assume that $L$ ($\overleftarrow{L}$) is the linked list realizing $LL(R)$ ($LL(\overleftarrow{R})$). Each element in $L$ ($\overleftarrow{L}$) is the label of the corresponding leaf in $LL(R)$ ($LL(\overleftarrow{R})$). We construct $L^{-1}$ such that $L^{-1}[L[i]] = i$. Similarly we construct $\overleftarrow{L}^{-1}$. With $L^{-1}$ and $\overleftarrow{L}^{-1}$ in hand we can easily construct $A$ in $O(n)$. A detail is that in our case there may exist $i$, $1 \le i \le n$, such that $\overleftarrow{L}[j] \ne i$ for all $1 \le j \le n$. This is because $\overleftarrow{L}$ is a permutation of $[2+d..n+1+d]$ instead of $[1..n]$. The straightforward way to overcome this situation is to assume $U = n + 1 + d$, but this would increase the asymptotic running time of Step 21 unless $d$ is constant. On the other hand, it is easy to observe that any $i \in \overleftarrow{L}$ such that $i > n$ is irrelevant in the context of the occurrence of any gapped-factor, because no sub-factor $X_\ell$ can start beyond position $n$. So we ignore any such $i \in \overleftarrow{L}$ while creating the set $A$. After $A$ is constructed we perform Step 21, which requires $O(n \log^{1+\epsilon} n)$ time and space, for any constant $0 < \epsilon < 1$. So the overall index is built in $O(n \log^{1+\epsilon} n)$ time and space.
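The construction of $A$ from the two label lists can be sketched as below. The lists are passed in as plain Python lists and the inverse of $\overleftarrow{L}$ is held in a dictionary; both choices, and the name `build_point_set`, are assumptions of this sketch. Labels larger than $n$ are skipped, as discussed above.

```python
def build_point_set(L, L_rev):
    """Points (x, y), 1-based, such that L[x] == L_rev[y]; labels above n are ignored."""
    n = len(L)
    # Inverse of L_rev restricted to labels that can also appear in L.
    inv_rev = {label: y for y, label in enumerate(L_rev, start=1) if label <= n}
    return [(x, inv_rev[label]) for x, label in enumerate(L, start=1) if label in inv_rev]


# T = "ABAB", d = 1, n = 4.
# Suffixes of T in sorted order start at 3, 1, 4, 2.
L = [3, 1, 4, 2]
# Suffixes of the reverse "BABA" in sorted order start at 4, 2, 3, 1;
# relabeling each i as (n + 1) - i + d + 1 = 7 - i gives:
L_rev = [3, 5, 4, 6]
print(build_point_set(L, L_rev))  # [(1, 1), (3, 3)]
```

Both the dictionary construction and the final pass are linear in $n$, in line with the $O(n)$ bound stated for this step.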

3.2 Query processing

So far we have concentrated on the construction of the gapped-factor index (GFI), and we have shown that we can build GFI in $O(n \log^{1+\epsilon} n)$ time and space, for any constant $0 < \epsilon < 1$. Here we discuss the query processing. Suppose we are given the GFI of a text $T$ for the gap $d$ and a query for the gapped-factor $X = X_f *^d X_\ell$. We first find the locus $\mu^{X_\ell}$ in $ST_T$. Let $i = \mu^{X_\ell}.left$ and $j = \mu^{X_\ell}.right$. Now we find the locus $\mu^{\overleftarrow{X_f}}$ in $ST_{\overleftarrow{T}}$. Let $k = \mu^{\overleftarrow{X_f}}.left$ and $\ell = \mu^{\overleftarrow{X_f}}.right$. Then we find all the points in $A$ that are inside the rectangle $q \equiv (i, k) \times (j, \ell)$. Let $B$ be the set of those points. Then it is easy to verify that $Occ^T_X = \{(L[x] - d - |X_f|) \mid (x, y) \in B\}$. The steps are formally presented in Algorithm 3.

The running time of the query processing is deduced as follows. Finding the loci $\mu^{X_\ell}$ and $\mu^{\overleftarrow{X_f}}$ requires $O(|X_\ell| + |X_f|)$ time (Steps 1 and 2). The corresponding pointers can be found in constant time (Step 3). The construction of the set $B$ in Step 4 is done by performing the range query and hence requires $O(\log\log n + |B|)$ time. Note that $|B| = |Occ^T_X|$, and hence in total the query time is $O(|X_\ell| + |X_f| + \log\log n + |Occ^T_X|)$.
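Putting construction and query together, the whole scheme can be sketched end to end. The sketch below makes several simplifying assumptions that are not part of the paper's construction: sorted suffix lists built naively replace the two suffix trees, binary search replaces locus finding, and a linear scan over $A$ replaces the range search structure of [3]. It therefore illustrates the logic only, not the stated time bounds. The class name `GappedFactorIndex` is hypothetical.

```python
class GappedFactorIndex:
    """Illustrative GFI for a fixed gap d; independent of the sub-factor lengths."""

    def __init__(self, text, d):
        n = len(text)
        self.text, self.rev, self.d = text, text[::-1], d
        # L: starting positions of the suffixes of T, in lexicographic order.
        self.L = sorted(range(1, n + 1), key=lambda i: text[i - 1:])
        # Same for the reverse text, then relabel each leaf as (n+1)-i+d+1.
        self.sa_rev = sorted(range(1, n + 1), key=lambda i: self.rev[i - 1:])
        L_rev = [(n + 1) - i + d + 1 for i in self.sa_rev]
        inv_rev = {lab: y for y, lab in enumerate(L_rev, start=1) if lab <= n}
        self.A = [(x, inv_rev[lab]) for x, lab in enumerate(self.L, start=1)
                  if lab in inv_rev]

    @staticmethod
    def _interval(text, sa, pattern):
        """1-based inclusive range of sa whose suffixes start with pattern, or None."""
        m = len(pattern)

        def prefix(idx):
            return text[sa[idx] - 1:sa[idx] - 1 + m]

        lo, hi = 0, len(sa)
        while lo < hi:  # first suffix whose m-prefix is >= pattern
            mid = (lo + hi) // 2
            if prefix(mid) < pattern:
                lo = mid + 1
            else:
                hi = mid
        first, hi = lo, len(sa)
        while lo < hi:  # first suffix whose m-prefix is > pattern
            mid = (lo + hi) // 2
            if prefix(mid) <= pattern:
                lo = mid + 1
            else:
                hi = mid
        return (first + 1, lo) if first < lo else None

    def query(self, x_f, x_l):
        """1-based start positions of x_f *^d x_l in the indexed text."""
        iv_l = self._interval(self.text, self.L, x_l)           # Step 1
        iv_f = self._interval(self.rev, self.sa_rev, x_f[::-1])  # Step 2
        if iv_l is None or iv_f is None:
            return []
        (i, j), (k, l) = iv_l, iv_f                              # Step 3
        B = [(x, y) for x, y in self.A if i <= x <= j and k <= y <= l]  # Step 4
        return sorted(self.L[x - 1] - self.d - len(x_f) for x, _ in B)  # Step 5


index = GappedFactorIndex("AGGACCGGGTTGACTTCGTTGAAG", d=3)
print(index.query("GAC", "GTTGA"))  # [3, 12]
print(index.query("AC", "GT"))      # [4, 13]
```

The two queries use sub-factors of different lengths against the same index, which is the sense in which the structure depends only on $d$.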

4 Multiple Strings

So far we have considered indexing only one string. However, our techniques generalize to multiple strings. In this section we present the modifications that generalize the GFI data structure to handle multiple strings. In the generalized case we are given a library $Q$ of $q$ text documents $T^1, \ldots, T^q$, each $T^i$, $1 \le i \le q$, being a string over the alphabet $\Sigma$. We have $\sum_{1 \le i \le q} |T^i| = n$. Our aim is to construct the Generalized GFI (GGFI), given $Q$ and a parameter $d$, so that given a gapped-factor $X = X_f *^d X_\ell$ we can find all the occurrences of $X$ in the library $Q$. To construct GGFI we use the idea of the Generalized Suffix Tree (GST) [9], although we do not use that data structure directly. As is the case in GST, we create a string $T = T^1 \$_1 T^2 \$_2 \ldots \$_{q-1} T^q$, where $\$_i \notin \Sigma$, $1 \le i < q$. We can still build the index using Algorithm 2, but there are two important issues that we need to resolve. Firstly, since we have created one big string, we have to ensure that we do not report occurrences that span the boundaries of the original strings, if any exist. Secondly, we have to report the occurrences with respect to the original strings $T^i$, $1 \le i \le q$, in the library, not with respect to $T$. Each occurrence now has to be reported as a tuple $(i, j)$, which indicates an occurrence starting at $T^i[j]$.

Example 2. Suppose we have a library $Q$ and a gapped-factor $X$ as follows:

Q = {$T^1$ = GAAAGCTGA, $T^2$ = AACTGGACTCCT}
X = GA $*^3$ CT

Now we have $T$ = GAAAGCTGA$\$_1$AACTGGACTCCT. It is easy to see that there is an occurrence of $X$ at location 8 of $T$ which is not valid, because it spans the boundary between $T^1$ and $T^2$. The other two occurrences are at locations 1 and 16 and both are valid. Note that we have to report $Occ^Q_X = \{(1, 1), (2, 6)\}$. See Fig. 2.

[Figure: the concatenated text T with locations 1 to 22, the three alignments of X = GA $*^3$ CT at locations 1, 8 and 16, and below each location its position within, and the number of, its original string.]
Fig. 2. The valid and invalid occurrences of X according to Example 2

We resolve the two issues mentioned above as follows. We construct two lists $O$ and $D$ from the list $L$ such that $O[k] = j$ and $D[k] = i$ if and only if $L[k]$ corresponds to $T^i[j]$ for some $i, j$, $1 \le i \le q$, $1 \le j \le |T^i|$. This can easily be done in $O(n)$ during the preprocessing step. Once we have $O$ and $D$, we can find the actual occurrences in constant time per occurrence, and hence the second issue is resolved.

To resolve the first issue we first need to realize that, in $T$, we cannot have any occurrence of $X_f$ or $X_\ell$ crossing the boundaries of the original strings. This is because of the presence of $\$_i$, $1 \le i < q$, between the original strings in $T$. So the first issue can only arise when we have an occurrence of $X_f$ at location $i$ in $T$ such that $i + |X_f| + d$ lies in a different original string, i.e. when $D[i] \ne D[i + |X_f| + d]$. So we can resolve the first issue by first identifying these occurrences and then excluding them while constructing the set $A$. We do it as follows. Recall that we label each leaf of $ST_{\overleftarrow{T}}$ with $(n+1) - i + d + 1$, where $i$ is the starting location of its suffix. It is easy to verify that if the locations $(n+1) - i + d + 1$ and $(n+1) - i$ in $T$ correspond to two different strings, we have our case. To handle this issue, in the preprocessing phase we construct a list $\overleftarrow{D}$ from the list $\overleftarrow{L}$ such that $\overleftarrow{D}[k] = i$ if and only if $\overleftarrow{L}[k]$ corresponds to $\overleftarrow{T}^i[j]$ for some $i, j$, $1 \le i \le q$, $1 \le j \le |\overleftarrow{T}^i|$. While constructing the set $A$ we exclude those locations $i$ where we have $\overleftarrow{D}[\overleftarrow{L}[i]] \ne \overleftarrow{D}[\overleftarrow{L}[i] - (d+1)]$. As a result the first issue is resolved as well.

In the rest of this section we discuss a different but interesting problem involving multiple strings. In this problem, instead of the occurrences of a pattern, we are interested in identifying the strings in which the pattern occurs. This problem is motivated by practical applications and was introduced and studied by Muthukrishnan in [20] under the name "Document Listing Problem". Let us formally define the problem below.

Problem "DL" (Document Listing Problem). We are given a library $Q$ of $q$ text documents $T^1, \ldots, T^q$, each $T^i$, $1 \le i \le q$, being a string over the alphabet $\Sigma$. We have $\sum_{1 \le i \le q} |T^i| = n$. Preprocess the library $Q$ to answer the following form of queries.
Query: Given a gapped-factor $X = X_f *^d X_\ell$, find the set of all documents in the library in which $X$ is present, i.e. find the set $List^Q_X = \{i \mid T^i[j..j+|X_f|-1] = X_f \text{ and } T^i[j+|X_f|-1+d+1..j+|X_f|-1+d+1+|X_\ell|-1] = X_\ell \text{ for some } j\}$.

To solve this problem, the change that is required is in the range search algorithm. We make use of the following variant of Problem RSG.

Problem "CRSG" (Colored Range Search Problem on Grid). Let $A[1..n]$ be a set of $n$ colored points on the grid $[0..U]^2$. Preprocess $A$ to answer the following form of queries.
Query: Given a query rectangle $q \equiv (a, b) \times (c, d)$, find the set of distinct colors of points contained in $q$.
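The two issues of the multiple-string case, namely rejecting occurrences that span a boundary and reporting $(i, j)$ pairs, can be illustrated with a naive scan over the concatenated string. This is a sketch under stated assumptions: a single separator character `$`, assumed not to occur in any document or pattern, replaces the distinct symbols $\$_i$, and a dictionary from locations to (document, offset) pairs plays the role of the lists $D$ and $O$. The function names are hypothetical.

```python
def concat_library(docs, sep="$"):
    """Concatenate docs with a separator; map each 1-based location to (doc, offset)."""
    where, pos = {}, 0
    for doc_id, doc in enumerate(docs, start=1):
        for offset in range(1, len(doc) + 1):
            where[pos + offset] = (doc_id, offset)
        pos += len(doc) + 1  # the separator location belongs to no document
    return sep.join(docs), where


def library_occurrences(docs, x_f, d, x_l):
    """Occurrences of x_f *^d x_l as (document, offset) pairs, boundary-crossing ones removed."""
    text, where = concat_library(docs)
    k, total = len(x_f), len(x_f) + d + len(x_l)
    found = []
    for i in range(1, len(text) - total + 2):  # 1-based candidate location in T
        if text[i - 1:i - 1 + k] != x_f or text[i - 1 + k + d:i - 1 + total] != x_l:
            continue
        start, tail = where.get(i), where.get(i + k + d)
        if start and tail and start[0] == tail[0]:  # both sub-factors in one document
            found.append(start)
    return found


Q = ["GAAAGCTGA", "AACTGGACTCCT"]
print(library_occurrences(Q, "GA", 3, "CT"))  # [(1, 1), (2, 6)]
```

The match at location 8 of the concatenated string is dropped because its two sub-factors fall in different documents, as in Example 2.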

Agarwal et al. [1] presented a data structure that takes $O(n \log^2 n)$ time and space and answers a colored range search query in $O(\log\log U + k)$ time, where $k$ is the output size. In Problem DL, instead of reporting all the points of $A$ contained in $q$, we need only report the distinct documents those points correspond to. This can be achieved by using the list $D$ as the color of the points and then applying the solution to Problem CRSG. So we can build the GFI to solve Problem DL in $O(n \log^2 n)$ time and space and can answer the queries in $O(|X_\ell| + |X_f| + \log\log n + k)$ time, where $k$ is the number of documents in which the given gapped-factor occurs.
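The difference between Problem RSG and Problem CRSG is only in what is reported. A small sketch follows, in which a linear scan stands in for the structure of Agarwal et al. [1] (an assumption, so the query bound above is not reproduced) and the point set is made up purely for illustration.

```python
def colored_range_query(points, a, b, c, d):
    """Distinct colors of the points (x, y, color) inside the rectangle (a, b) x (c, d)."""
    return sorted({color for x, y, color in points if a <= x <= c and b <= y <= d})


# Hypothetical points; the third component is the document number used as the color.
points = [(1, 1, 1), (2, 3, 2), (3, 2, 2), (4, 4, 1)]
print(colored_range_query(points, 2, 2, 4, 4))  # [1, 2]
print(colored_range_query(points, 2, 2, 3, 3))  # [2]
```

In the second query two points fall inside the rectangle but both belong to document 2, so that document is listed once.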

5 Conclusion

In this paper, we have presented GFI, a new data structure to index gapped-factors. Given a text $T$ of length $n$ and the value of the parameter $d$, GFI construction requires $O(n \log^{1+\epsilon} n)$ time and space, for any constant $0 < \epsilon < 1$, and subsequent queries for a gapped-factor $X = X_f *^d X_\ell$ can be answered in $O(|X_\ell| + |X_f| + \log\log n + |Occ^T_X|)$ time. GFI improves on GFT [22] in two respects. Firstly, GFI is more general than GFT in the sense that it is independent of the parameters $k = |X_f|$ and $k' = |X_\ell|$, whereas GFT is specific to particular values of $k$, $k'$ and $d$. Secondly, the construction cost of GFI is significantly better and is, unlike that of GFT, alphabet independent. However, this improvement is achieved at the cost of a slight increase in the query time, namely an additional $\log\log n$ term. We have also shown how to modify the GFI data structure to handle multiple strings. Finally, we have shown how the document listing problem for gapped-factors can be solved with our data structures. Future research may be directed towards building a data structure that is independent of the parameter $d$, i.e. the length of the gap.

References

1. Pankaj K. Agarwal, Sathish Govindarajan, and S. Muthukrishnan. Range searching in categorical data: Colored range searching on grid. In Rolf H. Möhring and Rajeev Raman, editors, ESA, volume 2461 of Lecture Notes in Computer Science, pages 17–28. Springer, 2002.
2. Julien Allali and Marie-France Sagot. The at most k-deep factor tree. Technical Report 2004-03, 2004.
3. Stephen Alstrup, Gerth Stølting Brodal, and Theis Rauhe. New data structures for orthogonal range searching. In FOCS, pages 198–207, 2000.
4. Michael Brudno, Michael Chapman, Berthold Göttgens, Serafim Batzoglou, and Burkhard Morgenstern. Fast and sensitive multiple alignment of large genomic sequences. BMC Bioinformatics, 4:66, 2003.
5. Michael Brudno, Chuong B. Do, Gregory M. Cooper, Michael F. Kim, Eugene Davydov, Eric D. Green, Arend Sidow, and Serafim Batzoglou. Lagan and multi-lagan: Efficient tools for large-scale multiple alignment of genomic DNA. Genome Research, 13(4):721–731, 2003.
6. M. Crochemore and W. Rytter. Jewels of Stringology. World Scientific, 2002.
7. Robert C. Edgar. Muscle: multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Research, 32(5).
8. Luis Gravano, Panagiotis G. Ipeirotis, H. V. Jagadish, Nick Koudas, S. Muthukrishnan, and Divesh Srivastava. Approximate string joins in a database (almost) for free. In Peter M. G. Apers, Paolo Atzeni, Stefano Ceri, Stefano Paraboschi, Kotagiri Ramamohanarao, and Richard T. Snodgrass, editors, VLDB, pages 491–500. Morgan Kaufmann, 2001.
9. Dan Gusfield. Algorithms on Strings, Trees, and Sequences - Computer Science and Computational Biology. Cambridge University Press, 1997.
10. Michael Höhl, Stefan Kurtz, and Enno Ohlebusch. Efficient multiple genome alignment. In ISMB, pages 312–320, 2002.
11. Costas S. Iliopoulos, James A. M. McHugh, Pierre Peterlongo, Nadia Pisanti, Wojciech Rytter, and Marie-France Sagot. A first approach to finding common motifs with gaps. Int. J. Found. Comput. Sci., 16(6):1145–1154, 2005.
12. Juha Kärkkäinen and Peter Sanders. Simple linear work suffix array construction. In Jos C. M. Baeten, Jan Karel Lenstra, Joachim Parrow, and Gerhard J. Woeginger, editors, ICALP, volume 2719 of Lecture Notes in Computer Science, pages 943–955. Springer, 2003.
13. Dong Kyue Kim, Jeong Seop Sim, Heejin Park, and Kunsoo Park. Constructing suffix arrays in linear time. J. Discrete Algorithms, 3(2-4):126–142, 2005.
14. Pang Ko and Srinivas Aluru. Space efficient linear time construction of suffix arrays. J. Discrete Algorithms, 3(2-4):143–156, 2005.
15. Ming Li, Bin Ma, Derek Kisman, and John Tromp. Patternhunter II: Highly sensitive and fast homology search. Genome Informatics, 14:164–175, 2003.
16. Bin Ma, John Tromp, and Ming Li. Patternhunter: faster and more sensitive homology search. Bioinformatics, 18(3):440–445, 2002.
17. Udi Manber and Eugene W. Myers. Suffix arrays: A new method for on-line string searches. SIAM J. Comput., 22(5):935–948, 1993.
18. Edward M. McCreight. A space-economical suffix tree construction algorithm. J. ACM, 23(2):262–272, 1976.
19. Morris Michael, Christoph Dieterich, and Martin Vingron. Siteblast-rapid and sensitive local alignment of genomic sequences employing motif anchors. Bioinformatics, 21(9):2093–2094, 2005.
20. S. Muthukrishnan. Efficient algorithms for document retrieval problems. In SODA, pages 657–666, 2002.
21. Gonzalo Navarro, Erkki Sutinen, Jani Tanninen, and Jorma Tarhio. Indexing text with approximate q-grams. In Raffaele Giancarlo and David Sankoff, editors, CPM, volume 1848 of Lecture Notes in Computer Science, pages 350–363. Springer, 2000.
22. Pierre Peterlongo, Julien Allali, and Marie-France Sagot. The gapped-factor tree. In The Prague Stringology Conference, page to appear, 2006.
23. Erkki Sutinen and Jorma Tarhio. On using q-gram locations in approximate string matching. In Paul G. Spirakis, editor, ESA, volume 979 of Lecture Notes in Computer Science, pages 327–340. Springer, 1995.
24. Esko Ukkonen. On-line construction of suffix trees. Algorithmica, 14(3):249–260, 1995.

Algorithm 2 Algorithm to build the index (GFI) for the gapped factors
1: Build a suffix tree $ST_T$ of $T$. Let the root of $ST_T$ be $R$.
2: Label each leaf of $ST_T$ by the starting location of its suffix.
3: Construct a linked list $L$ realizing $LL(R)$. Each element in $L$ is the label of the corresponding leaf in $LL(R)$.
4: for each node $v$ in $ST_T$ do
5:     Store $v.left = i$ and $v.right = j$ such that $L[i]$ and $L[j]$ correspond to, respectively, the leftmost leaf $v_\ell$ and the rightmost leaf $v_r$ of $v$.
6: end for
7: Build a suffix tree $ST_{\overleftarrow{T}}$ of $\overleftarrow{T}$. Let the root of $ST_{\overleftarrow{T}}$ be $\overleftarrow{R}$.
8: Label each leaf of $ST_{\overleftarrow{T}}$ by $(n+1) - i + d + 1$, where $i$ is the starting location of its suffix.
9: Construct a linked list $\overleftarrow{L}$ realizing $LL(\overleftarrow{R})$. Each element in $\overleftarrow{L}$ is the label of the corresponding leaf in $LL(\overleftarrow{R})$.
10: for each node $v$ in $ST_{\overleftarrow{T}}$ do
11:     Store $v.left = k$ and $v.right = \ell$ such that $\overleftarrow{L}[k]$ and $\overleftarrow{L}[\ell]$ correspond to, respectively, the leftmost leaf $v_\ell$ and the rightmost leaf $v_r$ of $v$.
12: end for
13: for $i = 1$ to $n$ do
14:     Set $A[i] = \epsilon$
15: end for
16: for $i = 1$ to $n$ do
17:     if there exists $(x, y)$ such that $L[x] = \overleftarrow{L}[y] = i$ then
18:         $A[i] = (x, y)$
19:     end if
20: end for
21: Preprocess $A$ for Range Search on a Grid $[0..n]^2$.

Algorithm 3 Algorithm for Query Processing
1: Find $\mu^{X_\ell}$ in $ST_T$.
2: Find $\mu^{\overleftarrow{X_f}}$ in $ST_{\overleftarrow{T}}$.
3: Set $i = \mu^{X_\ell}.left$, $j = \mu^{X_\ell}.right$, $k = \mu^{\overleftarrow{X_f}}.left$ and $\ell = \mu^{\overleftarrow{X_f}}.right$.
4: Set $B = \{(x, y) \mid (x, y) \in A$ and $(x, y)$ is contained in $q \equiv (i, k) \times (j, \ell)\}$
5: return $Occ^T_X = \{(L[x] - d - |X_f|) \mid (x, y) \in B\}$