A New Model to Solve Swap Matching Problem and Efficient Algorithms for Short Patterns

Costas S. Iliopoulos and M. Sohel Rahman

Algorithm Design Group, Department of Computer Science, King’s College London, Strand, London WC2R 2LS, England
{csi,sohel}@dcs.kcl.ac.uk
http://www.dcs.kcl.ac.uk/adg

Abstract. In this paper, we revisit the much studied problem of Pattern Matching with Swaps (Swap Matching problem, for short). We first present a new graph-theoretic approach to model the problem, which opens a new and so far unexplored avenue to solve the problem. Then, using the model, we devise an efficient algorithm to solve the swap matching problem. The resulting algorithm is an adaptation of the classic shift-or algorithm. For patterns having length similar to the word size of the target machine, the algorithm runs in O((n + m) log m) time, where n and m are the lengths of the text and the pattern, respectively.

1 Introduction

The classical pattern matching problem is to find all the occurrences of a given pattern P of length m in a text T of length n, both being sequences of characters drawn from a finite character set Σ. This problem is interesting as a fundamental computer science problem and is a basic need of many practical applications such as text retrieval, music retrieval, computational biology, data mining and network security, among many others. In this paper, we revisit the Pattern Matching with Swaps problem (the Swap Matching problem, for short), which is a well-studied variant of the classic pattern matching problem. The pattern P is said to match the text T at a given location i if adjacent pattern characters can be swapped, if necessary, so as to make the pattern identical to the substring of the text ending (or equivalently, starting) at location i. All the swaps are constrained to be disjoint, i.e., each character is involved in at most one swap.

Amir et al. [1] obtained the first non-trivial results for this problem. They showed how to solve the problem in time O(n m^{1/3} log m log σ), where σ = min(|Σ|, m). Amir et al. [3] also studied certain special cases for which O(n log^2 m) time can be obtained. However, these cases are rather restrictive. Finally, Amir et al. [2] solved the Swap Matching problem in time O(n log m log σ). We remark that all the above solutions to swap matching depend on the fast Fourier transform (FFT) technique. It may be noted here that approximate swapped matching [4] and swap matching in weighted sequences [7] have also been studied in the literature.


The contribution of this paper is as follows. We first present a new graph-theoretic approach to model the problem, which opens a new and so far unexplored avenue to solve the problem. Then, using the model, we devise an efficient algorithm to solve the swap matching problem. The resulting algorithm is an adaptation of the classic shift-or algorithm and runs in O((n + m) log m) time if the pattern length is similar to the word size of the target machine. This seems to be the first attempt to provide an efficient solution to the swap matching problem without using FFT techniques.

The rest of the paper is organized as follows. In Section 2, we present some preliminary definitions. Section 3 presents our new model to solve the swap matching problem. In Section 4, we present the algorithm to solve the swap matching problem. Finally, we briefly conclude in Section 5.

2 Preliminaries

A string is a sequence of zero or more symbols from an alphabet Σ. A string X of length n is denoted by X[1..n] = X_1 X_2 ... X_n, where X_i ∈ Σ for 1 ≤ i ≤ n. The length of X is denoted by |X| = n. A string w is called a factor of X if X = uwv for u, v ∈ Σ*; in this case, the string w occurs at position |u| + 1 in X. The factor w is denoted by X[|u| + 1..|u| + |w|]. A k-factor is a factor of length k. A prefix (or suffix) of X is a factor X[x..y] such that x = 1 (y = n), 1 ≤ y ≤ n (1 ≤ x ≤ n). We define the i-th prefix to be the prefix ending at position i, i.e. X[1..i], 1 ≤ i ≤ n. On the other hand, the i-th suffix is the suffix starting at position i, i.e. X[i..n], 1 ≤ i ≤ n.

Definition 1. A swap permutation for X is a permutation π : {1, ..., n} → {1, ..., n} such that:
1. if π(i) = j then π(j) = i (characters are swapped).
2. for all i, π(i) ∈ {i − 1, i, i + 1} (only adjacent characters are swapped).
3. if π(i) ≠ i then X_π(i) ≠ X_i (identical characters are not swapped).

For a given string X and a swap permutation π for X, we use π(X) to denote the swapped version of X, where π(X) = X_π(1) X_π(2) ... X_π(n).

Definition 2. Given a text T = T_1 T_2 ... T_n and a pattern P = P_1 P_2 ... P_m, P is said to swap match at location i of T if there exists a swapped version P′ of P that matches T at location i, i.e. P′_j = T_{i−m+j} for j ∈ [1..m].

Problem “SM” (Pattern Matching with Swaps). Given a text T = T_1 T_2 ... T_n and a pattern P = P_1 P_2 ... P_m, we want to find each location i ∈ [1..n] such that P swap matches with T at location i.

Definition 3. A string X is said to be degenerate if it is built over the potential 2^{|Σ|} − 1 non-empty sets of letters belonging to Σ.
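Definitions 1 and 2 can be made concrete with a brute-force sketch: enumerate every swapped version of P and compare each against every window of T. This is only an illustration of the definitions, not the algorithm of this paper; the function names `swapped_versions` and `swap_match_ends` are hypothetical, and the running time is exponential in m.

```python
from typing import Iterator, List


def swapped_versions(p: str) -> Iterator[str]:
    """Yield every swapped version of p (Definition 1):
    disjoint swaps of adjacent, distinct characters."""
    def rec(i: int, prefix: str) -> Iterator[str]:
        if i >= len(p):
            yield prefix
            return
        # Option 1: leave position i where it is.
        yield from rec(i + 1, prefix + p[i])
        # Option 2: swap positions i and i+1, allowed only if they differ.
        if i + 1 < len(p) and p[i] != p[i + 1]:
            yield from rec(i + 2, prefix + p[i + 1] + p[i])

    yield from rec(0, "")


def swap_match_ends(p: str, t: str) -> List[int]:
    """Return the 1-based end locations i at which p swap matches t
    (Definition 2), by comparing every window against every swapped version."""
    versions = set(swapped_versions(p))
    m = len(p)
    return [i for i in range(m, len(t) + 1) if t[i - m:i] in versions]
```

For the pattern accab, the six swapped versions are accab, cacab, acacb, accba, caacb and cacba; the pair at positions 2 and 3 is never swapped because both characters are c.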


Example 1. Suppose we are considering the DNA alphabet, i.e. Σ = Σ_DNA = {A, C, T, G}. Then we have 15 non-empty sets of letters belonging to Σ_DNA. In what follows, the set containing A and T will be denoted by [AT] and the singleton [C] will be simply denoted by C for ease of reading.

Definition 4. Given two degenerate strings X and Y, each of length n, we say X[i] matches Y[j], 1 ≤ i, j ≤ n, if and only if X[i] ∩ Y[j] ≠ ∅.

Example 2. Suppose we have degenerate strings X = AC[CTG]TG[AC]C and Y = TC[AT][AT]TTC. Here X[3] matches Y[3] because X[3] ∩ Y[3] = [CTG] ∩ [AT] = {T} ≠ ∅.
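Definition 4 amounts to a non-empty set intersection. The sketch below checks Example 2 under the assumption that a degenerate string is written with square brackets around multi-letter sets, as in the examples above; `parse_degenerate` and `symbols_match` are illustrative names.

```python
from typing import FrozenSet, List

Symbol = FrozenSet[str]


def parse_degenerate(s: str) -> List[Symbol]:
    """Parse a string such as "AC[CTG]T" into a list of letter sets."""
    out: List[Symbol] = []
    i = 0
    while i < len(s):
        if s[i] == "[":
            j = s.index("]", i)          # closing bracket of this set
            out.append(frozenset(s[i + 1:j]))
            i = j + 1
        else:
            out.append(frozenset(s[i]))  # a plain letter is a singleton set
            i += 1
    return out


def symbols_match(x: Symbol, y: Symbol) -> bool:
    """Definition 4: two degenerate symbols match iff they share a letter."""
    return bool(x & y)
```

With X and Y from Example 2, position 3 (index 2 in Python) matches through the shared letter T, while position 1 does not match because A and T are different singletons.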

3 A Graph-Theoretic Model for Swap Matching

In this section, we present a new model to solve the swap matching problem. In our model, we view the text and the pattern as two separate graphs. We start with the following definitions.

Definition 5. Given a text T = T_1 ... T_n of Problem SM, a T-graph, denoted by T^G = (V^T, E^T), is a directed graph with n vertices and n − 1 edges such that V^T = {1, 2, ..., n} and E^T = {(i, i + 1) | 1 ≤ i < n}. For each i ∈ V^T we define label(i) = T_i and for each edge e ≡ (i, j) ∈ E^T we define label(e) ≡ label((i, j)) ≡ (label(i), label(j)) = (T_i, T_j).

Note that the labels in the above definition may not be unique. Also, we normally use the labels of the vertices and the edges to refer to them.

a→c→a→c→b→a→c→c→b→a→c→a→c→b→a
Fig. 1. The corresponding T-graph of Example 3

Example 3. Suppose T = acacbaccbacacba. Then the corresponding T-graph is shown in Figure 1.

Definition 6. Given a pattern P = P_1 ... P_m of Problem SM, a P-graph, denoted by P^G = (V^P, E^P), is a directed graph with 3m − 2 vertices and at most 5m − 9 edges. The vertex set V^P can be partitioned into three disjoint vertex sets, namely V^P_(+1), V^P_(0) and V^P_(−1), such that |V^P_(+1)| = |V^P_(−1)| = m − 1 and |V^P_(0)| = m. The partition is defined in a 3 × m matrix M[3, m] as follows. For the sake of notational symmetry we use M[−1], M[0] and M[+1] to denote respectively the rows M[1], M[2] and M[3] of the matrix M.

1. V^P_(−1) = {M[−1, 2], M[−1, 3], ..., M[−1, m]}
2. V^P_(0) = {M[0, 1], M[0, 2], ..., M[0, m]}
3. V^P_(+1) = {M[+1, 1], M[+1, 2], ..., M[+1, m − 1]}

The labels of the vertices are derived from P as follows:

1. For each vertex M[−1, i] ∈ V^P_(−1), 1 < i ≤ m:

\[ \mathrm{label}(M[-1, i]) = \begin{cases} P_{i-1} & \text{if } P_{i-1} \neq P_i, \\ X & \text{if } P_{i-1} = P_i, \text{ where } X \notin \Sigma \end{cases} \tag{1} \]

2. For each vertex M[0, i] ∈ V^P_(0), 1 ≤ i ≤ m, label(M[0, i]) = P_i.

3. For each vertex M[+1, i] ∈ V^P_(+1), 1 ≤ i < m:

\[ \mathrm{label}(M[+1, i]) = \begin{cases} P_{i+1} & \text{if } P_i \neq P_{i+1}, \\ X & \text{if } P_i = P_{i+1}, \text{ where } X \notin \Sigma \end{cases} \tag{2} \]

P P P , E(0) and E(+1) as The edge set E P is defined as the union of the sets E(−1) follows: P 1. E(−1) = {(M [−1, i], M [0, i + 1]), (M [−1, i], M [+1, i + 1]) | 2 ≤ i ≤ m − V S 2 label(M [−1, i]) 6= X } {(M [−1, m − 1], M [0, m]) | label(M [−1, m − 1]) 6= X } S P 2. E(0) = {(M [0, i], M [0, i+1]) | 1 ≤ i ≤ m−1} {((M [0, i], M [+1, i+1]) | 1 ≤ V i≤m−2 label(M [+1, i + 1]) 6= X } V P 3. E(+1) = {(M [+1, i], M [−1, i + 1]) | 1 ≤ i ≤ m − 1 label(M [+1, i]) 6= X }1

The labels of the edges are derived from the labels of the vertices in the obvious way.

Example 4. Suppose P = acbab. Then the corresponding P-graph P^G is shown in Figure 2. On the other hand, the corresponding P-graph P′^G for P′ = accab is shown in Figure 3. Note that in P′ we have P′_2 = P′_3 = c. The dotted edges in Figure 3 are non-existent in P′^G and are shown only for the sake of understanding.

Definition 7. Given a P-graph P^G, a path Q = u_1 ⇝ u_ℓ = u_1 u_2 ... u_ℓ is a sequence of consecutive directed edges ⟨(u_1, u_2), (u_2, u_3), ..., (u_{ℓ−1}, u_ℓ)⟩ in P^G starting at node u_1 and ending at node u_ℓ. The length of the path Q, denoted by len(Q), is the number of edges on the path and hence is ℓ − 1 in this case. It is easy to note that the length of a longest path in P^G is m − 1.

Definition 8. Given a P-graph P^G and a T-graph T^G, we say that P^G matches T^G at position i ∈ [1..n] if and only if there exists a path Q = u_1 u_2 ... u_m in P^G having u_1 ∈ {M[0, 1], M[+1, 1]} and u_m ∈ {M[−1, m], M[0, m]} such that for j ∈ [1..m] we have label(u_j) = T_{i−m+j}.

¹ Note that, if label(M[+1, i]) = X then label(M[−1, i + 1]) = X as well.

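Fig. 2. P-graph of the pattern P = acbab. Vertex labels by row: row −1 (columns 2 to 5): a c b a; row 0 (columns 1 to 5): a c b a b; row +1 (columns 1 to 4): c b a b.

Fig. 3. P-graph of the pattern P′ = accab. Vertex labels by row: row −1 (columns 2 to 5): a X c a; row 0 (columns 1 to 5): a c c a b; row +1 (columns 1 to 4): c X a b.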

The above definitions set up our model to solve the swap matching problem. The following lemma presents the idea for the solution.

Lemma 1. Given a pattern P of length m and a text T of length n, suppose P^G and T^G are the P-graph and T-graph of P and T, respectively. Then, P swap matches T at location i ∈ [1..n] of T if and only if P^G matches T^G at position i ∈ [1..n] of T^G.

Proof. The proof follows easily from the definition of the P-graph. At each column of the matrix M, we have all the characters as nodes, considering the possible swaps as explained below. Each node in rows −1 and +1 represents a swapped situation. Now consider column i of M corresponding to P^G. According to the definition, we have label(M[−1, i]) = P_{i−1} and label(M[+1, i − 1]) = P_i. These two nodes represent the swap of P_i and P_{i−1}. Now, if this swap takes place, then in the resulting pattern, P_i must be immediately followed by P_{i−1}. To ensure that, in P^G, the only edge starting at M[+1, i − 1] goes to M[−1, i]. On the other hand, from M[−1, i] we can either go to M[0, i + 1] or to M[+1, i + 1]: the former is when there is no swap for the next pair and the latter is when there is another swap for the next pair. Recall that, according to the definition, the swaps are disjoint. Finally, the nodes in row 0 represent the normal (non-swapped) situation. As a result, from each M[0, i] we have an edge to M[0, i + 1] and an edge to M[+1, i + 1]: the former is when there is no swap for the next pair as well and the latter is when there is a swap for the next pair. So it is easy to see that the paths of length m − 1 in P^G represent all combinations of all possible swaps in P. Hence the result follows. □

It is clear that the number of possible paths of length m − 1 in P^G is exponential in m. So spelling out all the paths and then performing pattern matching against, possibly, an index of T is very time consuming unless m is constant. We, on the other hand, exploit the above model in a different way and apply a modified version of the classic shift-or [5] algorithm to solve the swap matching problem. In the rest of this section, we present a notion of “Forbidden Graph” and in the next section we show how to exploit this notion and modify the shift-or algorithm to solve the swap matching problem.
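Lemma 1 can be sanity-checked on small patterns by spelling every start-to-end path of the P-graph and comparing the resulting strings with the swapped versions obtained directly from Definition 1. The sketch below does exactly that; it is an illustrative check rather than part of the algorithm, it enumerates exponentially many paths, and the helper names (`p_graph`, `path_strings`, `swapped_versions`) and the use of `None` for the label X are assumptions of this sketch.

```python
from typing import Dict, List, Optional, Set, Tuple

Vertex = Tuple[int, int]   # (row, column), column 1-based; None plays the role of X


def p_graph(p: str):
    """Vertex labels and successor lists of the P-graph of p."""
    m, P = len(p), " " + p
    lab: Dict[Vertex, Optional[str]] = {(0, i): P[i] for i in range(1, m + 1)}
    for i in range(2, m + 1):
        lab[(-1, i)] = P[i - 1] if P[i - 1] != P[i] else None
    for i in range(1, m):
        lab[(1, i)] = P[i + 1] if P[i] != P[i + 1] else None
    succ: Dict[Vertex, List[Vertex]] = {v: [] for v in lab}
    for i in range(1, m):
        succ[(0, i)].append((0, i + 1))
        if i <= m - 2 and lab[(1, i + 1)] is not None:
            succ[(0, i)].append((1, i + 1))
        if lab[(1, i)] is not None:
            succ[(1, i)].append((-1, i + 1))
    for i in range(2, m):
        if lab[(-1, i)] is not None:
            succ[(-1, i)].append((0, i + 1))
            if i <= m - 2:
                succ[(-1, i)].append((1, i + 1))
    return lab, succ


def path_strings(p: str) -> Set[str]:
    """Strings spelled by paths from column 1 (rows 0, +1) to column m."""
    m = len(p)
    lab, succ = p_graph(p)
    out: Set[str] = set()

    def walk(v: Vertex, spelled: str) -> None:
        if lab[v] is None:            # an X-labelled vertex spells nothing
            return
        spelled += lab[v]
        if v[1] == m:                 # column m only has rows -1 and 0
            out.add(spelled)
            return
        for w in succ[v]:
            walk(w, spelled)

    for start in ((0, 1), (1, 1)):
        if start in lab:
            walk(start, "")
    return out


def swapped_versions(p: str) -> Set[str]:
    """All swapped versions of p, straight from Definition 1."""
    if len(p) < 2:
        return {p}
    out = {p[0] + rest for rest in swapped_versions(p[1:])}
    if p[0] != p[1]:
        out |= {p[1] + p[0] + rest for rest in swapped_versions(p[2:])}
    return out
```

On the two running examples, acbab and accab, the two sets coincide (8 and 6 strings respectively), which is the content of the lemma restricted to a single window of the text.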

Definition 9. Given a P-graph P^G = (V^P, E^P), the forbidden graph P̄^G = (V̄^P, Ē^P) is such that V̄^P = V^P and Ē^P is defined as follows:

\[ \bar{E}^P = \{ (M[i, j], M[i', j+1]) \mid i, i' \in \{-1, 0, +1\},\ 1 \le j < m,\ (\mathrm{label}(M[i, j]) \neq X \vee \mathrm{label}(M[i', j+1]) \neq X) \wedge (\forall (M[k, j], M[k', j+1]) \in E^P,\ k, k' \in \{-1, 0, +1\}:\ \mathrm{label}((M[k, j], M[k', j+1])) \neq \mathrm{label}((M[i, j], M[i', j+1]))) \} \]

In other words, the forbidden graph P̄^G contains an edge (u, v) from column j to j + 1, where 1 ≤ j < m, if, and only if, there exists no edge from j to j + 1 in the P-graph having the same label.

Example 5. Suppose P = acbab. Then the forbidden graph P̄^G corresponding to the P-graph P^G is shown in Figure 4. The edges of P^G are shown in dashed lines and the edges of P̄^G are shown in solid lines. Note that (M[+1, 3], M[+1, 4]) is nonexistent in P̄^G, because label((M[+1, 3], M[+1, 4])) = (a, b) and we have (M[+1, 3], M[−1, 4]) ∈ E^P with the same label (a, b).
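The forbidden graph is only ever consulted through its edge labels, so a sketch can work with label pairs per column instead of explicit edges. The code below follows the "in other words" reading of Definition 9: a label pair is forbidden at column j + 1 when it can be formed from the labels of columns j and j + 1 but no P-graph edge between those columns carries it. Two simplifications are assumptions of this sketch: pairs involving X are skipped, since no text character can equal X, and the names `p_graph_columns` and `forbidden_labels` are hypothetical.

```python
from typing import Dict, Optional, Set, Tuple

Label = Optional[str]      # None plays the role of X


def p_graph_columns(p: str):
    """Return (cols, allowed): cols[j] maps row -> label for column j, and
    allowed[j] holds the labels of the P-graph edges that end in column j."""
    m, P = len(p), " " + p
    cols: Dict[int, Dict[int, Label]] = {j: {0: P[j]} for j in range(1, m + 1)}
    for j in range(2, m + 1):
        cols[j][-1] = P[j - 1] if P[j - 1] != P[j] else None
    for j in range(1, m):
        cols[j][+1] = P[j + 1] if P[j] != P[j + 1] else None
    allowed: Dict[int, Set[Tuple[str, str]]] = {j: set() for j in range(2, m + 1)}
    for j in range(1, m):
        for row, a in cols[j].items():
            if a is None:
                continue
            # Row +1 must continue to row -1; rows -1 and 0 go to row 0 or +1.
            for row2 in ((-1,) if row == +1 else (0, +1)):
                b = cols[j + 1].get(row2)
                if b is not None:          # edges into X can never be matched
                    allowed[j + 1].add((a, b))
    return cols, allowed


def forbidden_labels(p: str) -> Dict[int, Set[Tuple[str, str]]]:
    """Map each column k in 2..m to the label pairs forbidden at column k."""
    cols, allowed = p_graph_columns(p)
    forb: Dict[int, Set[Tuple[str, str]]] = {}
    for j in range(1, len(p)):
        pairs = {(a, b)
                 for a in cols[j].values() if a is not None
                 for b in cols[j + 1].values() if b is not None}
        forb[j + 1] = pairs - allowed[j + 1]
    return forb
```

For P = acbab the pair (a, b) is not forbidden at column 4, in line with Example 5; the only forbidden pair there is (a, a).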

4 Algorithm for Swap Matching

In this section, we present a new efficient algorithm based on the model presented in Section 3. Our algorithm is a modified version of the classic shift-or algorithm for pattern matching. For the sake of completeness, we first present a brief account of the shift-or algorithm in the following subsection. In Section 4.2 we present the modifications needed to adapt it to solve the swap matching problem.

Fig. 4. Forbidden graph (solid edges) corresponding to the P-graph (dashed edges) of the pattern P = acbab

4.1 Shift-Or Algorithm

The shift-or algorithm uses bitwise techniques and is very efficient if the size of the pattern is no greater than the word size of the target processor. The following description of the shift-or algorithm is taken from [6] after slight adaptation to accommodate our notation.

Let R be a bit array of size m. Vector R_j is the value of the array R after text character T_j has been processed. It contains information about all matches of prefixes of P that end at position j in the text. So, for 1 ≤ i ≤ m we have:

\[ R_j[i] = \begin{cases} 0 & \text{if } P[1..i] = T[j-i+1..j], \\ 1 & \text{otherwise.} \end{cases} \tag{3} \]

The vector R_{j+1} can be computed from R_j as follows. For each R_j[i] = 0:

\[ R_{j+1}[i+1] = \begin{cases} 0 & \text{if } P_{i+1} = T_{j+1}, \\ 1 & \text{otherwise,} \end{cases} \tag{4} \]

and

\[ R_{j+1}[1] = \begin{cases} 0 & \text{if } P_1 = T_{j+1}, \\ 1 & \text{otherwise.} \end{cases} \tag{5} \]

If R_{j+1}[m] = 0 then a complete match can be reported. The transition from R_j to R_{j+1} can be computed very fast as follows. For each c ∈ Σ let D_c be a bit array of size m such that for 1 ≤ i ≤ m, D_c[i] = 0 if and only if P_i = c. The array D_c denotes the positions of the character c in the pattern P. Each D_c, for all c ∈ Σ, can be preprocessed before the pattern search. The computation of R_{j+1} then reduces to two operations, shift and or:

R_{j+1} = SHIFT(R_j) OR D_{T_{j+1}}

4.2 Modifying Shift-Or Algorithm for Swap Matching
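As a baseline for the modifications in this subsection, the unmodified shift-or recurrence of Section 4.1 can be sketched as follows. The sketch assumes that bit i − 1 of a Python integer stores R[i], so that SHIFT is a left shift that brings in a 0 at position 1; Python integers are unbounded, so the word-size restriction does not show up here. The function name `shift_or` is illustrative.

```python
from typing import Dict, List


def shift_or(p: str, t: str) -> List[int]:
    """Return the 1-based end positions of the exact occurrences of p in t."""
    m = len(p)
    full = (1 << m) - 1                     # m ones
    # D[c] has bit i-1 equal to 0 if and only if P_i = c.
    D: Dict[str, int] = {}
    for i, c in enumerate(p):
        D[c] = D.get(c, full) & ~(1 << i)
    R = full                                # R_0: no prefix has matched yet
    ends: List[int] = []
    for j, c in enumerate(t, start=1):
        # SHIFT moves bit i to bit i+1 and inserts a 0 at position 1;
        # OR with D[c] keeps a 0 only where the pattern character agrees.
        R = ((R << 1) & full) | D.get(c, full)
        if (R & (1 << (m - 1))) == 0:       # R_j[m] = 0: complete match
            ends.append(j)
    return ends
```

Characters that do not occur in the pattern get the all-ones mask, which resets every partial match.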

In this section, we modify the shift-or algorithm to solve the swap matching problem. To do that we use the graph model, particularly the forbidden graph, presented in Section 3. The idea is quite simple and is described as follows. First of all, the shift-or algorithm can be extended easily to degenerate patterns [5]. In our swap matching model the pattern can be thought of as having a set of letters at each position as follows: P̃ = [M[0, 1] M[+1, 1]] [M[−1, 2] M[0, 2] M[+1, 2]] ... [M[−1, m − 1] M[0, m − 1] M[+1, m − 1]] [M[−1, m] M[0, m]]. Note that we have used P̃ instead of P above because, in our case, the sets of characters in consecutive positions of the pattern don’t have the same relation as in a usual degenerate pattern. In particular, in our case, a match at position i + 1 of P̃ will depend on the previous match at position i, as the following example shows.

Example 6. Suppose P = acbab and T = bcbaaabcba. The P-graph of P is shown in Figure 2. So, in line with the above discussion, we can say that P̃ = [ac][acb][cba][ba][ab]. Now, as can be easily seen, if we consider degenerate match, then P̃ matches T at positions 2 and 6. However, P swap matches T only at position 6, not at position 2. To elaborate, note that at position 2, the match is due to c. So, according to the graph P^G, the next match has to be an a and hence at position 2 we can’t have a swap match.

In what follows, we present a novel technique to adapt the shift-or algorithm to tackle the above situation. We use the forbidden graph as follows. For the sake of convenience, in the discussion that follows, we refer to both P̃ and the pattern P as though they are equivalent; it will be clear from the context what we really mean. Suppose we have a match up to position i < m of P̃ in T[j − i + 1..j]. Now we have to check whether there is a ‘match’ between T_{j+1} and P_{i+1}. For a simple degenerate match, we only need to check whether T_{j+1} ∈ P_{i+1} or not.
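Example 6 can be reproduced with a short sketch that builds the letter sets of P̃ and compares plain degenerate matching against brute-force swap matching. Positions are reported as 1-based starting positions, as in the example; the function names are illustrative and the swap matcher simply enumerates all swapped versions.

```python
from typing import List, Set


def degenerate_pattern(p: str) -> List[Set[str]]:
    """Letter set at each position of the pattern: P_i itself plus each
    neighbour that differs from it (rows -1, 0 and +1 of the P-graph)."""
    m = len(p)
    sets: List[Set[str]] = []
    for i in range(m):
        s = {p[i]}
        if i > 0 and p[i - 1] != p[i]:
            s.add(p[i - 1])
        if i + 1 < m and p[i + 1] != p[i]:
            s.add(p[i + 1])
        sets.append(s)
    return sets


def degenerate_starts(p: str, t: str) -> List[int]:
    """1-based starts where every text character lies in the matching set."""
    sets, m = degenerate_pattern(p), len(p)
    return [s + 1 for s in range(len(t) - m + 1)
            if all(t[s + k] in sets[k] for k in range(m))]


def swapped_versions(p: str) -> Set[str]:
    """All strings obtained from p by disjoint swaps of distinct neighbours."""
    if len(p) < 2:
        return {p}
    out = {p[0] + rest for rest in swapped_versions(p[1:])}
    if p[0] != p[1]:
        out |= {p[1] + p[0] + rest for rest in swapped_versions(p[2:])}
    return out


def swap_starts(p: str, t: str) -> List[int]:
    """1-based starts of the swap matches of p in t (brute force)."""
    versions, m = swapped_versions(p), len(p)
    return [s + 1 for s in range(len(t) - m + 1) if t[s:s + m] in versions]
```

The window cbaaa at position 2 passes the set-membership test at every position but is not a swapped version of acbab, which is the gap the forbidden graph is meant to close.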
However, as Example 6 shows, for our case we need to do more than that. What we do is as follows. Suppose that T_j = c = M[ℓ, i]. Now, from the forbidden graph we know which of the M[k, i + 1], k ∈ {−1, 0, +1}, can’t follow M[ℓ, i]. So, for example, even if M[q, i + 1] = T[j + 1] we can’t continue if there is an edge from M[ℓ, i] to M[q, i + 1] in the forbidden graph (or equivalently if there is no edge from M[ℓ, i] to M[q, i + 1] in the P-graph). In the rest of this section, we show how we use the forbidden graph to modify the shift-or algorithm to solve the swap matching problem.

Recall that we first process the pattern to compute the masks D_c for every c ∈ Σ. This can be done in O(m/w (m + |Σ|)) time [5] when the pattern is not degenerate. However, in our case, we need to assume that our pattern has a set of letters in each position. In this case, we require O(m/w (m′ + |Σ|)) time, where m′ is the sum of the cardinalities of the sets at each position [5]. In general degenerate strings, m′ can be m|Σ| in the worst case. However, in our case, m′ = |V^P| = O(m), where V^P is the vertex set of the P-graph. So, computation of the D-masks requires O(m/w (m + |Σ|)) time in the worst case.

Then we do further processing on P as follows. We compute the forbidden graph P̄^G = (V̄^P, Ē^P) from the P-graph P^G = (V^P, E^P). Recall that |V^P| = O(m) and |E^P| = O(m) and therefore, by definition, we have |V̄^P| = O(m) and |Ē^P| = O(m). So we can compute the forbidden graph in O(m) time. Two edges (u, v), (x, y) of the forbidden graph (and the P-graph) are said to be ‘same’ if label(u) = label(x) and label(v) = label(y), i.e. if the two edges have the same labels. Also, given an edge (u, v) ≡ (M[i_1, j_1], M[i_2, j_2]) we say that edge (u, v) ‘belongs to’ column j_2, i.e. where the edge ends, and we write col((u, v)) ≡ col((M[i_1, j_1], M[i_2, j_2])) = j_2. Now we traverse all the edges and construct a set of sets S = {S_1 ... S_ℓ} such that each S_i, 1 ≤ i ≤ ℓ, contains the edges that are ‘same’. The set S_i is named by the (same) label of the edges it contains and we may refer to S_i using its name. Now, we construct forbidden masks F_{S_i}, 1 ≤ i ≤ ℓ, such that F_{S_i}[k] = 1 if, and only if, there is an edge (u, v) ∈ S_i having col((u, v)) = k. Note that ℓ = O(m).

The construction of the forbidden masks can be done in O(m/w m log m) time as follows. We first initialize all the entries of the forbidden masks to 0, which requires O(m/w m) time. Then we start traversing the edges. Consider the first edge (u_1, v_1). We know the label of this edge is label((u_1, v_1)) ≡ (label(u_1), label(v_1)).
We include the label of this edge in a name database and assign a set S_i ∈ S to this name and keep pointers for constant time reference later. We also set F_{S_i}[j] = 1, where col((u_1, v_1)) = j. Now, consider another edge (u_k, v_k). This time we first check whether label((u_k, v_k)) already exists in the name database. If yes, then we use the existing name to do the update; otherwise we include the label in the name database and continue as before. It is clear that this check in the database can be done in O(log ℓ) = O(log m) time. Since we have O(m) edges, the complete construction of the forbidden masks requires O(m/w m log m) time.

With the forbidden masks at hand, for our problem, we simply need to compute R_{j+1} as follows:

R_{j+1} = SHIFT(R_j) OR D_{T_{j+1}} OR F_{(T_j, T_{j+1})}

Note that, to locate the appropriate forbidden mask, we again need to perform a lookup in the name database constructed during the construction of the forbidden masks. So, in total, the construction of the R values requires O(n log m) time. One detail is that, if F_{(T_j, T_{j+1})} doesn’t exist, then we assume the mask to have all 0’s. It is easy to see that this works because the forbidden mask allows R_{j+1} to have 0 at position i if, and only if, the edge (T_j, T_{j+1}) is not ‘forbidden’. Example 7 shows a complete execution of our algorithm.
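The preprocessing and the modified recurrence described above can be put together in a compact sketch. Several choices here are assumptions of the sketch rather than statements about the authors' implementation: a Python dictionary stands in for the name database (so lookups are hash-based instead of O(log m) searches), bit i − 1 of an integer stores position i of a mask, the label X is represented by `None` and never receives a mask, and the names `build_masks` and `swap_shift_or` are hypothetical. The sketch has been checked only against the pattern and text of Example 7.

```python
from typing import Dict, List, Optional, Tuple


def build_masks(p: str) -> Tuple[Dict[str, int], Dict[Tuple[str, str], int]]:
    """Compute the D-masks and the forbidden F-masks of the pattern p."""
    m = len(p)
    full = (1 << m) - 1
    P = " " + p
    # cols[j] maps row -> label of M[row, j]; None stands for X.
    cols: List[Dict[int, Optional[str]]] = [{} for _ in range(m + 1)]
    for j in range(1, m + 1):
        cols[j][0] = P[j]
        if j > 1:
            cols[j][-1] = P[j - 1] if P[j - 1] != P[j] else None
        if j < m:
            cols[j][+1] = P[j + 1] if P[j] != P[j + 1] else None

    # D[c] has a 0 at position j iff c labels some vertex of column j.
    D: Dict[str, int] = {}
    for j in range(1, m + 1):
        for lab in cols[j].values():
            if lab is not None:
                D[lab] = D.get(lab, full) & ~(1 << (j - 1))

    # F[(a, b)] has a 1 at position j+1 iff (a, b) is forbidden there.
    F: Dict[Tuple[str, str], int] = {}
    for j in range(1, m):
        allowed = set()
        for row, a in cols[j].items():
            if a is None:
                continue
            for row2 in ((-1,) if row == +1 else (0, +1)):
                b = cols[j + 1].get(row2)
                if b is not None:
                    allowed.add((a, b))
        for a in cols[j].values():
            for b in cols[j + 1].values():
                if a is not None and b is not None and (a, b) not in allowed:
                    F[(a, b)] = F.get((a, b), 0) | (1 << j)
    return D, F


def swap_shift_or(p: str, t: str) -> List[int]:
    """Report the 1-based end positions j with R_j[m] = 0."""
    m = len(p)
    full = (1 << m) - 1
    D, F = build_masks(p)
    R, prev, ends = full, None, []
    for j, c in enumerate(t, start=1):
        # R_j = SHIFT(R_{j-1}) OR D_{T_j} OR F_{(T_{j-1}, T_j)};
        # a missing forbidden mask counts as all zeros.
        R = ((R << 1) & full) | D.get(c, full) | F.get((prev, c), 0)
        if ((R >> (m - 1)) & 1) == 0:
            ends.append(j)
        prev = c
    return ends
```

For P = accab and T = acacbaccbacacba the sketch reports the end positions 5, 6, 10, 14 and 15, which agree with a brute-force enumeration of the swapped versions of P on this input.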

Fig. 5. Forbidden graph for P = accab

i    P̃[i]    D_a   D_b   D_c   D_X (a)
1    [ac]     0     1     0     1
2    [acc]    0     1     0     1
3    [acc]    0     1     0     1
4    [cab]    0     0     0     1
5    [ab]     0     0     1     1

(a) Here X indicates all letters that are not present in P.

Fig. 6. The D-masks for Example 7

Example 7. Suppose P = accab and T = acacbaccbacacba. The P-graph and the corresponding forbidden graph of P are shown in Figures 3 and 5, respectively. The D-masks and F-masks are shown in Figures 6 and 7, respectively. Figure 8 shows the detailed computation of the R bit array up to the first match found. Figure 9 shows the complete computed values of R.

The running times of the different phases of the algorithm are listed in Figure 10. Therefore, in total, the running time of our algorithm is O(m/w (m log m + |Σ| + n log m)). So, when the pattern size is similar to the word size of the target machine, we achieve a very good running time of O(m log m + |Σ| + n log m) = O((m + n) log m). Therefore we have the following theorem.

Theorem 1. The swap matching problem can be solved in O(m/w (m + n) log m) worst case running time.

Corollary 1. The swap matching problem can be solved in O((m + n) log m) worst case running time if the pattern length is similar to the word size of the target machine.

i    F(a,a)   F(a,b)   F(b,b)   F(c,c)   F(c,a)   F(X,X) (a)
1    0        0        0        0        0        0
2    1        0        0        1        0        0
3    0        0        0        0        0        0
4    1        1        0        1        0        0
5    1        0        1        0        1        0

(a) Here (X, X) indicates all edges that are not present in the forbidden graph.

Fig. 7. The F-masks for Example 7

i | init SH | Da F(X,X) OR SH | Dc F(a,c) OR SH | Da F(c,a) OR SH | Dc F(a,c) OR SH | Db F(c,b) OR
1 |  1    0 |  0    0    0  0 |  0    0    0  0 |  0    0    0  0 |  0    0    0  0 |  1    0    1
2 |  1    1 |  0    0    1  0 |  0    0    0  0 |  0    0    0  0 |  0    0    0  0 |  1    0    1
3 |  1    1 |  0    0    1  1 |  0    0    1  0 |  0    0    0  0 |  0    0    0  0 |  1    0    1
4 |  1    1 |  0    0    1  1 |  0    0    1  1 |  0    0    1  0 |  0    0    0  0 |  0    0    0
5 |  1    1 |  0    0    1  1 |  1    0    1  1 |  0    1    1  1 |  1    0    1  0 |  0    0    0 ✓

Fig. 8. Detailed steps up to the first reported match of Example 7. Here SH means the shift operation on the previous column and OR means the or operation on the previous 3 columns.

j    T_j   R_j[1]  R_j[2]  R_j[3]  R_j[4]  R_j[5]   match
           (a)     (c)     (c)     (a)     (b)
1    a     0       1       1       1       1
2    c     0       0       1       1       1
3    a     0       0       0       1       1
4    c     0       0       0       0       1
5    b     1       1       1       0       0        ✓
6    a     0       1       1       1       0        ✓
7    c     0       0       1       1       1
8    c     0       1       0       1       1
9    b     1       1       1       0       1
10   a     0       1       1       1       0        ✓
11   c     0       0       1       1       1
12   a     0       0       0       1       1
13   c     0       0       0       0       1
14   b     1       1       1       0       0        ✓
15   a     0       1       1       1       0        ✓

Fig. 9. The complete computed values of R in Example 7. The occurrences of swap match are shown using tick marks. Note that the end locations of the matches are identified here.

Phase                      Running Time
Computation of D-masks     O(m/w (m + |Σ|))
Computation of F-masks     O(m/w m log m)
Computation of R-values    O(m/w n log m)

Fig. 10. Running times of the different phases

5 Conclusion

In this paper, we have revisited the Pattern Matching with Swaps problem, a well-studied variant of the classic pattern matching problem. We have presented a new graph-theoretic approach to model the problem, which opens a new and so far unexplored avenue to solve the problem. Then, using the model, we have devised an efficient algorithm to solve the swap matching problem. The resulting algorithm is an adaptation of the classic shift-or algorithm and runs in O((n + m) log m) time if the pattern length is similar to the word size of the target machine. Notably, the best known algorithm for swap matching runs in O(n log m log σ) time and uses the FFT technique, which has large hidden constants inside its good theoretical bound. This seems to be the first attempt to provide an efficient solution to the swap matching problem without using FFT techniques. Moreover, the techniques used in our algorithm are quite simple and easy to implement. We believe that the new graph-theoretic model could be used to devise more efficient algorithms and that a similar approach can be taken to model other similar variants of the classic pattern matching problem. Furthermore, it would be interesting to ‘swap’ the definitions of the T-graph and the P-graph and investigate whether efficient pattern matching techniques for directed acyclic graphs can be employed to devise efficient off-line and on-line algorithms for swap matching.

References

1. A. Amir, Y. Aumann, G. M. Landau, M. Lewenstein, and N. Lewenstein. Pattern matching with swaps. J. Algorithms, 37(2):247–266, 2000.
2. A. Amir, R. Cole, R. Hariharan, M. Lewenstein, and E. Porat. Overlap matching. Inf. Comput., 181(1):57–74, 2003.
3. A. Amir, G. M. Landau, M. Lewenstein, and N. Lewenstein. Efficient special cases of pattern matching with swaps. Inf. Process. Lett., 68(3):125–132, 1998.
4. A. Amir, M. Lewenstein, and E. Porat. Approximate swapped matching. Inf. Process. Lett., 83(1):33–39, 2002.
5. R. Baeza-Yates and G. Gonnet. A new approach to text searching. Communications of the ACM, 35:74–82, 1992.
6. C. Charras and T. Lecroq. Handbook of Exact String Matching Algorithms. Texts in Algorithmics. King’s College, London, 2004.
7. H. Zhang, Q. Guo, and C. S. Iliopoulos. String matching with swaps in a weighted sequence. In J. Zhang, J.-H. He, and Y. Fu, editors, CIS, volume 3314 of Lecture Notes in Computer Science, pages 698–704. Springer, 2004.