Algorithmica (1994) 12: 247-267

Algorithmica © 1994 Springer-Verlag New York Inc.

Speeding Up Two String-Matching Algorithms¹

M. Crochemore,² A. Czumaj,³ L. Gasieniec,³ S. Jarominek,³ T. Lecroq,² W. Plandowski,³ and W. Rytter³

Abstract. We show how to speed up two string-matching algorithms: the Boyer-Moore algorithm (BM algorithm), and its version called here the reverse factor algorithm (RF algorithm). The RF algorithm is based on factor graphs for the reverse of the pattern. The main feature of both algorithms is that they scan the text right-to-left from the supposed right position of the pattern. The BM algorithm goes on as long as the scanned segment (factor) is a suffix of the pattern. The RF algorithm scans while the segment is a factor of the pattern. Both algorithms make a shift of the pattern, forget the history, and start again. The RF algorithm usually makes bigger shifts than BM, but it is quadratic in the worst case. We show that it is enough to remember the last matched segment (represented by two pointers to the text) to speed up the RF algorithm considerably (to make a linear number of inspections of text symbols, with a small coefficient), and to speed up the BM algorithm (to make at most 2·n comparisons). Only constant additional memory is needed for the search phase. We give alternative versions of the accelerated RF algorithm: the first one is based on combinatorial properties of primitive words, and the other two use the power of suffix trees extensively. The paper demonstrates the techniques used to transform the algorithms, and also shows interesting new applications of data structures representing all subwords of the pattern in compact form.

Key Words. Analysis of algorithms, Pattern matching, String matching, Suffix tree, Suffix automaton, Combinatorial problems, Periods, Text processing, Data retrieval.

1. Introduction. The Boyer-Moore algorithm [BM] is one of the string-matching algorithms that are very fast on the average. However, it is successful mainly in the case of large alphabets. For small alphabets, its average complexity is Ω(n) (see [BR]) for the Boyer-Moore-Horspool version [H]. The reader can refer to [HS] for a discussion of practical fast string-searching algorithms. We discuss here a version of this algorithm, called the RF algorithm, which is much faster on the average, not only on large alphabets but also on small ones. If the alphabet is of size at least 2, then the average complexity of the new algorithm is O(n·log(m)/m), which reaches the lower bound given in [Y]. The main feature of both algorithms is that they scan the text right-to-left from a supposed right position of the pattern.

¹ The work by M. Crochemore and T. Lecroq was partially supported by PRC "Mathématiques-Informatique"; M. Crochemore was also partially supported by NATO Grant CRG 900293; the work by A. Czumaj, L. Gasieniec, S. Jarominek, W. Plandowski, and W. Rytter was supported by KBN of the Polish Ministry of Education.
² LITP, Institut Blaise Pascal, Université Paris 7, 2 Place Jussieu, 75251 Paris Cedex 05, France.
³ Institute of Informatics, Warsaw University, ul. Banacha 2, 00-913 Warsaw 59, Poland.

Received August 15, 1991; revised August 17, 1992, and November 11, 1992. Communicated by Alberto Apostolico.


The BM algorithm goes on as long as the scanned segment (also called a factor) is a suffix of the pattern, while the RF algorithm matches the text against any factor of the pattern, traversing the factor graph or the suffix tree of the reverse pattern. Afterward, both algorithms make a shift of the pattern to the right, forget the history, and start again. We show that it is enough to remember the last matched segment to speed up the algorithms: an additional constant memory is sufficient.

We derive a version of the BM algorithm, called the Turbo_BM algorithm. One of the advantages of this algorithm with respect to the original BM algorithm is the simplicity of its complexity analysis. At the same time, the Turbo_BM algorithm looks like a superficial modification of the BM algorithm: only a few additional lines are inserted inside the search phase of the original algorithm, and two registers (constant memory to keep information about the last match) are added. The preprocessing phase is left unchanged. Recall that an algorithm remembering a linear number of previous matches has been given before by Apostolico and Giancarlo [AG] as a version of the BM algorithm. The Turbo_BM algorithm given here seems to be an efficient compromise between recording a linear-size history, as in the Apostolico-Giancarlo algorithm, and recording no history of previous matches at all, as in the original BM algorithm.

Our method to speed up the BM and RF algorithms is an example of a general technique called dynamic simulation in [BKR]: for a given algorithm A, construct an algorithm A' which works in the same way as A but remembers part of the information that A wastes; this information is then used to save part of the computation carried out by the original algorithm A. In our case, the additional information is the constant-size information about the last match. The transformation of the Boyer-Moore algorithm gives an algorithm of the same simplicity as the original Boyer-Moore algorithm, but with an upper bound of 2·n on the number of comparisons, which improves slightly on the bound 3·n of the original algorithm. The derivation of this bound is also much simpler than that of the 3·n bound in [Co]. The previous bounds, established when the pattern does not occur in the text, are 7·n in [KMP] and 4·n in [GO]. It should be noted that a simple transformation of the BM algorithm to search for all occurrences of the pattern has quadratic-time complexity. Galil [G] has shown how to make it linear in this case.

Several transformations of the RF algorithm show the applicability of data structures representing succinctly the set of all subwords of a pattern p of length m. We denote this set by FACT(p). The set of all suffixes of p is denoted by SUF(p). For simplicity of presentation, we assume that the size of the alphabet is constant. The general structure of the BM and RF algorithms is shown in Figure 1.
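As a tiny concrete illustration of these two sets before turning to the algorithms (the pattern "abaab" below is made up for the example and does not come from the paper), the following Python snippet enumerates FACT(p) and SUF(p) by brute force:

# Brute-force illustration of FACT(p) and SUF(p); the pattern is an arbitrary example.
p = "abaab"
FACT = {p[i:j] for i in range(len(p)) for j in range(i + 1, len(p) + 1)}   # all nonempty factors
SUF = {p[i:] for i in range(len(p))}                                       # all nonempty suffixes
print(sorted(SUF))    # ['aab', 'ab', 'abaab', 'b', 'baab']
print(len(FACT))      # 11 distinct nonempty factors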

Fig. 1. One iteration of Algorithm 1. The algorithm scans right-to-left a segment (factor) x of the text. (The figure shows the window on the text at positions i + 1 .. i + m and the shift of the window.)

Algorithm 1
/* common scheme for the BM and RF algorithms */
i := 0;
while i ≤ n - m do
{ align the pattern with positions t[i + 1 .. i + m] of the text;
  scan the text right-to-left from position i + m;
  let x be the scanned part of the text;
  if x = p then report a match at position i;
  compute the shift;
  i := i + shift;
}
end.
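The scheme translates almost line for line into code. Below is a minimal Python sketch of Algorithm 1 (not from the paper), parameterized by the scan-and-shift strategy; the trivial strategy shown (stop at the first mismatch, always shift by one) is only there to make the skeleton runnable, and the BM and RF instantiations replace it.

def generic_search(p, t, scan_and_shift):
    """Common scheme: slide a window of size m over t, scan it right-to-left,
    report matches, and advance by the shift returned by the strategy."""
    m, n = len(p), len(t)
    i = 0
    occurrences = []
    while i <= n - m:
        matched, shift = scan_and_shift(p, t, i)
        if matched:
            occurrences.append(i)
        i += shift
    return occurrences

def naive_scan_and_shift(p, t, i):
    """Trivial strategy: compare right-to-left, always shift by one."""
    m = len(p)
    j = m
    while j >= 1 and t[i + j - 1] == p[j - 1]:
        j -= 1
    return j == 0, 1

# Example use (pattern and text are arbitrary):
print(generic_search("aba", "abababa", naive_scan_and_shift))   # [0, 2, 4]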

In the algorithms BM and RF we use the name x as a synonym for the last-scanned segment t[i + j .. i + m] of the text t; this shortens the presentation. In one algorithm it is checked whether x is a suffix of p and, in the other, whether it is a factor of p. Shifts use a precomputed function of x. In fact, in the BM algorithm x is identified with a position j in the pattern, while in the RF algorithm x is identified with a node corresponding to x^R in a data structure representing FACT(p^R). We use the reverse pattern because we scan right-to-left, while most data structures for the set of factors are oriented toward left-to-right scanning of the pattern; the two orders are equivalent after reversing the pattern. In both cases a constant-size memory is sufficient to identify x. Both the BM and RF algorithms can be viewed as instances of Algorithm 1.

For a suffix x at position k, denote here by BM_shift[x] the match-shift d2[k] defined in [KMP] for the BM algorithm (see also [Ah] or [R]). The value of d2[k] is, roughly speaking, the minimal (nontrivial) shift of the pattern over itself such that the symbols aligned with the suffix x, except the first letter of x, agree; the symbol aligned with the first letter of x (at the position marked * in Figure 2) must be different, if any symbol is aligned there at all. The BM algorithm also uses a heuristic on the alphabet: a second shift function serves to align the mismatched text symbol with an occurrence of it in the pattern. We mainly consider the BM algorithm without this heuristic; however, the feature is integrated in the final version of the Turbo_BM algorithm.
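To make the definition concrete, here is a small Python sketch (not from the paper) that computes such a match-shift table directly from the definition: for a mismatch at pattern index i (0-based), it finds the smallest shift s ≥ 1 that realigns the matched suffix p[i+1 .. m-1] with equal symbols (or pushes it past the left end of the pattern) while placing a different symbol, if any, under position i. The preprocessing below is quadratic for transparency; linear-time constructions exist but are not shown.

def match_shift(p):
    """d2-style (strong good-suffix) shift table, computed naively from the definition."""
    m = len(p)
    shift = [0] * m
    for i in range(m):                       # mismatch at index i; p[i+1 .. m-1] has matched
        s = 1
        while s < m:
            suffix_ok = all(k - s < 0 or p[k - s] == p[k] for k in range(i + 1, m))
            char_differs = (i - s < 0) or (p[i - s] != p[i])
            if suffix_ok and char_differs:
                break
            s += 1
        shift[i] = s                         # s = m if no smaller shift qualifies
    return shift

print(match_shift("GCAGAGAG"))               # [7, 7, 7, 2, 7, 4, 7, 1]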

Algorithm BM
/* reversed-suffix string matching */
i := 0;
/* denote t[i + j .. i + m] by x; it is the last-scanned part of the text */
while i ≤ n - m do
{ j := m;
  while j > 1 and x ∈ SUF(p) do j := j - 1;
  if x = p then report a match at position i;
  shift := BM_shift[x];
  i := i + shift;
}
end.
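The listing above corresponds to the following Python sketch (an illustration, not the paper's code). The scan compares the window right-to-left and, on a mismatch at pattern index j, shifts by bm_shift[j]; after a full match it shifts by bm_shift[0], which equals the period of the pattern. The table is assumed to come from the match_shift sketch given earlier.

def bm_search(p, t, bm_shift):
    """Reversed-suffix matching with the match shift only (no alphabet heuristic)."""
    m, n = len(p), len(t)
    occurrences = []
    i = 0
    while i <= n - m:
        j = m - 1
        while j >= 0 and p[j] == t[i + j]:   # scan the window right-to-left
            j -= 1
        if j < 0:
            occurrences.append(i)            # x = p: report a match at position i
            i += bm_shift[0]                 # shift by the period of p
        else:
            i += bm_shift[j]                 # match shift for a mismatch at index j
    return occurrences

# Example use, assuming match_shift from the earlier sketch:
# print(bm_search("GCAGAGAG", "GCATCGCAGAGAGTATACAGTACG", match_shift("GCAGAGAG")))   # [5]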

Fig. 2. One iteration in the BM algorithm. (The figure shows the matched suffix at positions i + j .. i + m, the mismatching position marked *, and the next alignment of the pattern after the shift match_shift[j].)

Algorithm RF
/* reverse-factor string matching */
i := 0;
/* denote t[i + j .. i + m] by x; it is the last-scanned part of the text */
while i ≤ n - m do
{ j := m;
  while j > 1 and x ∈ FACT(p) do j := j - 1;
  /* in fact, we check the equivalent condition x^R ∈ FACT(p^R) */
  if x = p then report a match at position i;
  shift := RF_shift[x];
  i := i + shift;
}
end.

The work which Algorithm 1 spends at one iteration is denoted here by cost, and the length of the shift is denoted by shift. In the BM algorithm the cost is usually small, but it gives a small shift. The strategy of the RF algorithm is more "optimal": the smaller the cost, the bigger the shift. In practice, the match (and the cost) at a given iteration is usually small on the average; hence an algorithm whose shifts are inversely proportional to local matches is close to optimal. The straightforward application of this strategy gives an RF algorithm that is very successful on the average. It is, however, quadratic in the worst case.

Algorithm RF makes essential use of a data structure representing the set FACT(p). See [BBE+] for the definition of directed acyclic word graphs (dawg's), see [Cr] for the definition of suffix automata, and see [Ap] for details on suffix trees. The graph G = dawg(p^R) represents all subwords of p^R as labeled paths starting from the root of G. A factor z corresponds, in a many-to-one fashion, to a node vert(z) such that the path from the root to that node "spells" z. Additionally, we add to each node information indicating whether the words corresponding to that node are suffixes of the reversed pattern p^R (i.e., prefixes of p). We traverse this graph when scanning the text right-to-left in the RF algorithm. Let x' be the longest word found in a given iteration that is a factor of p. When x = p, then x' = x; otherwise x' is obtained by cutting off the first letter of x (the mismatch symbol). The time spent scanning x is proportional to |x|. The multiplicative factor is constant if a matrix representation is used for transitions in the data structure; otherwise it is O(log |A|) (where A can be restricted to the pattern alphabet), which applies for arbitrary alphabets.
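As an illustration (not the paper's code), the following Python sketch builds a suffix automaton for p^R, a dawg-like structure whose states recognize exactly the factors of p^R, and marks the states whose words are suffixes of p^R, i.e., prefixes of p. This is the information the RF scan needs; a search loop that uses it is sketched after Figure 3.

def suffix_automaton(s):
    """Suffix automaton of s: states recognize FACT(s); 'terminal' marks the states
    whose words are suffixes of s. Standard online construction, O(|s|) states."""
    sa = [{"len": 0, "link": -1, "next": {}}]      # state 0 is the root
    last = 0
    for c in s:
        cur = len(sa)
        sa.append({"len": sa[last]["len"] + 1, "link": -1, "next": {}})
        p = last
        while p != -1 and c not in sa[p]["next"]:
            sa[p]["next"][c] = cur
            p = sa[p]["link"]
        if p == -1:
            sa[cur]["link"] = 0
        else:
            q = sa[p]["next"][c]
            if sa[p]["len"] + 1 == sa[q]["len"]:
                sa[cur]["link"] = q
            else:                                   # split: clone state q
                clone = len(sa)
                sa.append({"len": sa[p]["len"] + 1, "link": sa[q]["link"],
                           "next": dict(sa[q]["next"])})
                while p != -1 and sa[p]["next"].get(c) == q:
                    sa[p]["next"][c] = clone
                    p = sa[p]["link"]
                sa[q]["link"] = clone
                sa[cur]["link"] = clone
        last = cur
    terminal = [False] * len(sa)                    # states spelling suffixes of s
    p = last
    while p != -1:
        terminal[p] = True
        p = sa[p]["link"]
    return sa, terminal

# For the RF algorithm the automaton is built for the reversed pattern:
# sa, terminal = suffix_automaton(pattern[::-1])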

We now define the shift RF_shift and describe how to compute it easily. Let u be the longest suffix of x' which is a proper prefix of the pattern p. We can assume that we always know the actual value of u: it is associated with the last node on the scanned path in G that corresponds to a suffix of p^R. Then RF_shift[x] = m - |u| (see Figure 3).

Fig. 3. One iteration of algorithm RF. Word u is the longest prefix of the pattern that is a suffix of x'. (The figure shows the factor of the pattern at positions i + j .. i + m, the prefix u, the shift = m - |u|, and the next alignment of the pattern.)
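Putting the pieces together, here is a Python sketch of the plain (non-turbo) RF search, again an illustration rather than the paper's code. It scans the window right-to-left through the automaton of p^R built by the suffix_automaton sketch above; every time a terminal state is reached, the scanned segment is a prefix of p, and the longest proper such prefix gives u, so the shift is m - |u|. As noted above, this version is quadratic in the worst case.

def rf_search(p, t):
    """Reverse-factor matching via a dawg-like automaton of the reversed pattern."""
    m, n = len(p), len(t)
    sa, terminal = suffix_automaton(p[::-1])   # construction sketched after Algorithm RF
    occurrences = []
    i = 0
    while i <= n - m:
        state = 0
        j = m                      # number of window characters still unread
        u_len = 0                  # |u|: longest proper prefix of p seen as a suffix of the scan
        while j > 0 and t[i + j - 1] in sa[state]["next"]:
            state = sa[state]["next"][t[i + j - 1]]
            j -= 1
            if terminal[state] and j > 0:
                u_len = m - j      # the scanned segment is a proper prefix of p
        if j == 0:
            occurrences.append(i)  # the whole window equals the pattern
        i += m - u_len             # RF_shift = m - |u|
    return occurrences

print(rf_search("aba", "abaababa"))   # [0, 3, 5]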

The use of information about the previous match at a given iteration is the key to the improvement. This idea can be realized in several ways: we discuss three alternative transformations of RF. They lead to three versions of the RF algorithm, Turbo_RF, Turbo_RF', and Turbo_RF'', presented in Sections 2 and 3. The algorithms Turbo_BM, Turbo_RF, Turbo_RF', and Turbo_RF'' can all be viewed as instances of Algorithm 2 presented below.

Algorithm 2
/* general scheme for algorithms Turbo_RF, Turbo_RF', Turbo_RF'', and Turbo_BM:
   a version of Algorithm 1 with an additional memory */
i := 0; memory := nil;
while i ≤ n - m do
{ align the pattern with positions t[i + 1 .. i + m] of the text;
  scan the text right-to-left from position i + m, using memory to reduce the number of inspections;
  let x be the scanned part of the text;
  if x = p then report a match at position i;
  compute the shift according to x and memory;
  i := i + shift;
  update memory using x;
}
end.

2. Speeding Up the Reverse Factor Algorithm. To speed up the RF algorithm we memorize the prefix u of the pattern, of size m - shift (see Figure 6). The scan of the part of the text aligned with the part v of the pattern is done from right to left. When we arrive at the boundary between u and v in a successful scan (all comparisons positive), we are at a decision point. Now, instead of scanning all of u until a mismatch is found, we can scan (again) only a part of u, owing to the combinatorial properties of primitive words. A word is primitive iff it is not a proper power of a shorter word. We denote by per(u) the length of the smallest period of u. Primitive words have the following useful properties:

(a) The prefix of u of size per(u) is primitive.


Fig. 4. If z is primitive, then such an overlap is impossible.

(b) A cyclic shift of a primitive word is also primitive; hence the suffix z of u of size per(u) is primitive.
(c) If z is primitive, then the situation presented in Figure 4 is impossible.

If y ∈ FACT(p), we denote by displ(y) the least integer d such that y = p[m - d - |y| + 1 .. m - d] (see Figure 5). The crucial point is that if we successfully scan v and then the suffix of u of size per(u), we know the shift without further calculations: many comparisons are saved, and this is the moment at which the RF algorithm gains speed. In terms of the next lemma, we save |x| - |zv| comparisons when |x| ≥ |zv|.

Algorithm Turbo_RF
/* denote t[i + j .. i + m] by x; it is the last-scanned part of the text;
   we memorize the last prefix u of the pattern; initially u is the empty word */
i := 0; u := empty;
while i ≤ n - m do
    ...

We assume that the alphabet is of size r > 1 and that log m < 3·m/8 (all logarithms are to base r). We consider the situation where the text is random: the probability of occurrence of a specified letter at the ith position is 1/r and does not depend on the letters at other positions.

THEOREM 5. The expected time of the RF algorithm is O(n·log(m)/m).

PROOF. Let L_i be the length of the shift in the ith iteration of the algorithm and let S_i be the length of the substring of the pattern that is found in this iteration. Examine the first iteration of the algorithm. There are no more than m subwords of the pattern of length 2·log m, and there are r^(2·log m) = m² possible words of that length that we can read from the text, all equally probable. Thus, with probability at least 1 - 1/m, S_1 < 2·log m, so L_1 > m - 2·log m > m/4. Call the ith shift long iff L_i ≥ m/4 and short otherwise. Divide the computation into phases, each phase ending on the first long shift; hence there is exactly one long shift in each phase. It is obvious (by the definition of a long shift) that there are O(n/m) phases in the algorithm. We now prove that the expected cost of each phase is O(log m).

CLAIM 1. Assume that shifts i and i + 1 are both short. Then, with probability more than 1/2, the (i + 2)th shift is long.

PROOF. If the ith and (i + 1)th shifts are both short, then the pattern is of the form v(wv)^k sz, where k ≥ 3, w, z ≠ ε, |wv| = L_{i+1}, and |sz| = L_i (s may be empty when L_i < L_{i+1}). Without loss of generality we can assume that wv is the minimal period of v(wv)^k, in the sense that if there exists a word w'v' such that v(wv)^k = v'(w'v')^{k'} and |w'v'| ...

For a stage k of type (ii), cost_k = suf_k + 1 ≤ ... (the case shift_k = 1 can be treated directly). If stage k + 1 is of type (i), then cost_{k+1} = 1, and then cost_k + cost_{k+1} < 2·shift_k + shift_{k+1}, an even better bound than expected. If at stage k + 1 we have suf_{k+1} < shift_{k+1}, then we get what was expected: cost_k + cost_{k+1} ≤ ... shift_{k+1}. This means, as previously mentioned, that a BM-shift is applied at stage k + 1. Thus the above analysis also applies at stage k + 1, and, since only Case (a) can occur there, we get cost_{k+1} ≤ shift_{k+1} + shift_{k+2}. We finally get cost_k + cost_{k+1} ≤ ... If suf_k > shift_k, ..., suf_{k+j} > shift_{k+j}, then cost_k + ... + cost_{k+j} ...