Fast practical string matching (Extended Abstract)

Christian Charras, Thierry Lecroq, Joseph Daniel Pehoushek

Abstract
We are interested in the exact string matching problem, which consists of searching for all the occurrences of a pattern of length m in a text of length n. Both the pattern and the text are built over an alphabet of size σ. We present three versions of an exact string matching algorithm. They use a new shifting technique. The first version is straightforward and easy to implement. The second version is linear in the worst case, an improvement over the first. The main result is the third algorithm. It is very fast in practice for small alphabets and long patterns. Asymptotically, it performs O(log_σ m (m + n/(m - log_σ m))) inspections of text symbols in the average case. This compares favorably with many other string searching algorithms.
1 Introduction

Pattern matching is a basic problem in computer science. The performance of many programs is determined by the work required to match patterns, most notably in the areas of text processing, speech recognition, information retrieval, and computational biology. The kind of pattern matching discussed in this paper is exact string matching. String matching is a special case of pattern matching where the pattern is described by a finite sequence of symbols. It consists of finding one or more occurrences of a pattern x = x_0 x_1 ... x_{m-1} of length m in a text y = y_0 y_1 ... y_{n-1} of length n. Both x and y are built over the same alphabet Σ. Numerous solutions to the string matching problem have been designed (see [2] and [6]). The two most famous are the Knuth-Morris-Pratt algorithm [3] and the Boyer-Moore algorithm [1].
We use a new shifting technique to construct a basic and straightforward algorithm. This algorithm has a quadratic worst-case time complexity, though it performs well in practice. Using the shift tables of the Knuth-Morris-Pratt algorithm [3] and the Morris-Pratt algorithm [5], we make this first algorithm linear in the worst case. Then, using a trie, we present a simple and very fast algorithm for small alphabets and long patterns. We review the components of string matching algorithms in Section 2 and present the basic algorithm in Section 3. Section 4 is devoted to the linear variation of the basic algorithm. Section 5 introduces the third algorithm, which is very fast in practice.
2 String matching algorithms

Theoretically, string matching algorithms use a "window" that shifts through the text, comparing the contents of the window with the pattern; after each comparison, the window is shifted some distance over the text. Specifically, the window has the same length as the pattern x. It is first aligned with the left end of the text y, then the string matching algorithm tries to match the symbols of the pattern with the symbols in the window (this specific work is called an attempt). After each attempt, the window shifts to the right over the text, until passing the end of the text. A string matching algorithm is then a succession of attempts and shifts. The aim of a good algorithm is to minimize the work done during each attempt and to maximize the length of the shifts. After positioning the window, the algorithm tries to quickly determine if the pattern occurs in the window. To decide this, most string matching algorithms have a preprocessing phase, during which a data structure, z, is constructed. The size of z is usually proportional to the length of the pattern, and its details vary in different algorithms. The structure of z defines many characteristics of the search phase. Practically, the process is more complete (see [?]): two components are added to the above scheme of a string matching algorithm, a skip loop and guard checks. The skip loop consists in finding very efficiently one character of the word x in the text y. The guard checks consist in checking one or several other characters of x in y before the full match loop. Figure 1 summarizes the different steps of a string matching algorithm. We will now give some details on the different components. The terminology used is that of [?].
Generic-String-Matching(x, y)
    preprocess x to construct z
    while not end of y
        do skip loop
           if guard checks
              then match loop
                   if x is found
                      then report
           shift

Figure 1: A generic string matching algorithm.

2.1 The skip loop

Some possible strategies for the skip loop are:

none: most string matching algorithms do not incorporate a skip loop;
sfc: search for the first (leftmost) character of x;
fast: search for the last (rightmost) character of x;
slfc: search for the least frequent character of x, which requires some knowledge of the underlying alphabet;
ufast: unrolled variant of fast;
lc: least cost, frequency dependent variant of ufast.
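To make the skip loop concrete, here is a minimal Python sketch of the fast strategy (repeatedly locating the rightmost character of x in y); it is an illustration only, not the authors' implementation.

    def fast_skip_positions(x, y):
        # Sketch of the "fast" skip loop: repeatedly locate the last
        # (rightmost) character of x in y and yield the window start
        # position that occurrence induces.
        m = len(x)
        j = m - 1
        while True:
            j = y.find(x[m - 1], j)   # skip loop: next occurrence of x's last character
            if j == -1:
                return
            yield j - (m - 1)         # align the window so its last symbol falls at j
            j += 1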
2.2 The guard checks
The guard checks are not completely independent of the skip loop. It would be useless, and even penalizing, to check the first character with the sfc skip loop. Some possibilities for the guard checks are:

none: most string matching algorithms do not incorporate a guard check;
first: check the first (leftmost) letter of x;
last: check the last (rightmost) letter of x;
middle: check the middle letter of x.

Any combination of two or more guard checks is of course possible.
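As an illustration, a guard combining the first and middle checks could be written as follows in Python (a sketch with a hypothetical helper name, not taken from the paper):

    def guard_ok(x, y, pos):
        # Sketch of combined "first" + "middle" guard checks: two cheap tests
        # on pattern characters before entering the full match loop.
        # The caller is assumed to guarantee that the window fits in y.
        m = len(x)
        return y[pos] == x[0] and y[pos + m // 2] == x[m // 2]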
2.3 The match loop
Here are some known methods for the match loop:

fwd: forward linear scan (left to right);
rev: reverse linear scan (right to left);
auto: scan with an automaton;
autorev: scan with the suffix automaton of the reverse of x.
Table 1: Classification of some string matching algorithms.

                skip    guard    match    shift
    bf          none    none     *        inc
    bf+         last    first    *        inc
    kmp         none    none     fwd      kmp
    bm          none    none     rev      d12
    bm fast     fast    none     rev      d12
    qs          none    none     fwd      sd1
    Tuned bm    last    first    fwd      d1
2.4 The shift
The shift strategy is not independent of the match loop strategy; for instance, it is not possible to use the Knuth-Morris-Pratt shift function with a rev match loop. Some possible shift strategies are:

inc: shift by one position;
kmp: shift according to Knuth-Morris-Pratt's shift function;
d1: shift according to Boyer-Moore's occurrence shift;
d2: shift according to Boyer-Moore's matching shift;
d12: shift according to max(d1, d2);
sd1: shift according to Sunday's occurrence shift;
sd2: shift according to Sunday's matching shift;
sd12: shift according to max(sd1, sd2).
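As a concrete example, d1 is the classical Boyer-Moore bad-character (occurrence) table; a Python sketch of its construction, assuming byte-valued symbols, is given below.

    def boyer_moore_d1(x, alphabet_size=256):
        # Boyer-Moore occurrence shift (bad-character table): for each symbol,
        # the shift applied when that symbol of the text is mismatched against
        # the last position of x.
        m = len(x)
        d1 = [m] * alphabet_size              # symbols absent from x shift by m
        for i in range(m - 1):                # the last position is excluded
            d1[ord(x[i])] = m - 1 - i
        return d1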
2.5 Classification of some string matching algorithms
It is possible to classify some string matching algorithms according to their strategy for the four components (see Table 1). The Brute Force algorithm (bf) is characterized by the fact that it has neither a skip loop nor a guard check and that it shifts the window by exactly one position to the right after each attempt. Incorporating a last skip loop and a first guard check leads to an algorithm bf+ which is three times faster than bf. Like the bf algorithm, the Knuth-Morris-Pratt (kmp) and the Boyer-Moore (bm) algorithms have neither a skip loop nor a guard check. In [1] a version bm fast is given with a skip loop. Two of the fastest practical string matching algorithms are the Quick Search (qs) by Sunday and the Tuned Boyer-Moore (Tuned bm).
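For illustration, a bf+-style search (last skip loop, first guard check, forward match loop, shift by one) might be sketched in Python as follows; names and structure are ours, not the authors'.

    def bf_plus_search(x, y):
        # Illustrative bf+-style search: skip loop on the last character of x,
        # guard check on the first character, forward match loop, shift by one.
        m, n = len(x), len(y)
        occurrences = []
        j = m - 1
        while j < n:
            j = y.find(x[m - 1], j)               # skip loop: last character of x
            if j == -1:
                break
            start = j - (m - 1)
            if y[start] == x[0]:                  # guard check: first character
                if y[start:start + m] == x:       # forward match loop
                    occurrences.append(start)
            j += 1                                # shift by one position
        return occurrences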
SkipSearch(x, m, y, n)
    ▷ Initialization
    for all symbols s in Σ
        do z[s] ← ∅
    ▷ Preprocessing phase
    for i ← 0 to m - 1
        do z[x_i] ← z[x_i] ∪ {i}
    ▷ Searching phase
    j ← m - 1
    while j < n
        do for all i in z[y_j]
               do if y_{j-i} ... y_{j-i+m-1} = x
                      then Report(j - i)
           j ← j + m

Figure 2: The basic algorithm.
3 The basic algorithm

3.1 Description
We want to solve the string matching problem, which consists in finding all the occurrences of a pattern x of length m in a text y of length n. Both x and y are built over the same finite alphabet Σ of size σ. We are interested in the problem where the pattern x is given before the text y. In this case, x can be preprocessed to construct the data structure z. The idea of our first algorithm is straightforward. For each symbol of the alphabet, a bucket collects all of that symbol's positions in x. When a symbol occurs k times in the pattern, there are k corresponding positions in the symbol's bucket. When the pattern is much shorter than the alphabet, many buckets are empty. The main loop of the search phase consists of examining every m-th text symbol, y_j (so there will be n/m main iterations). For y_j, use each position in the bucket z[y_j] to obtain a possible starting point p of x in y. Perform a comparison of x to y beginning at position p, symbol by symbol, until there is a mismatch, or until all symbols match. The entire algorithm is given in Figure 2.
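A direct Python transcription of the pseudocode of Figure 2 could look as follows (a sketch; the bucket representation and naming are ours):

    from collections import defaultdict

    def skip_search(x, y):
        # Basic algorithm of Figure 2: one bucket of pattern positions per symbol,
        # then inspect every m-th text symbol and try the induced alignments.
        m, n = len(x), len(y)
        z = defaultdict(list)                 # z[s] = positions of symbol s in x
        for i, s in enumerate(x):
            z[s].append(i)
        occurrences = []
        j = m - 1
        while j < n:
            for i in z[y[j]]:                 # each position of y[j] in the pattern
                start = j - i
                if start <= n - m and y[start:start + m] == x:
                    occurrences.append(start)
            j += m
        return occurrences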
3.2 Complexity
The space and time complexity of this preprocessing phase is O(m + σ). The worst-case time complexity of this algorithm is O(mn). This bound is tight and is reached when searching for a^{m-1}b in a^n. The expected time complexity when the pattern is small, m < σ, is O(n/m). As m increases with respect to σ, the expected time rises to O(n).
Figure 3: General situation during the searching phase of the linear algorithm.
4 A linear algorithm

4.1 Description
It is possible to make the basic algorithm linear using the two shift tables of Morris-Pratt [5] and Knuth-Morris-Pratt [3]. For 1 ≤ i ≤ m, mpNext[i] is the length of the longest border of x_0 x_1 ... x_{i-1}, and mpNext[0] = -1. For 1 ≤ i < m, kmpNext[i] is the length of the longest border of x_0 x_1 ... x_{i-1} followed by a character different from x_i; kmpNext[0] = -1 and kmpNext[m] = m - period(x). The lists in the buckets are explicitly stored in a table (see algorithm Pre-KmpSkipSearch in Figure 4). A general situation for an attempt during the searching phase is the following (see Figure 3): j is the current text position; i = z[y_j], thus x_i = y_j; start = j - i is the possible starting position of an occurrence of x in y; wall is the rightmost scanned text position; y_start y_{start+1} ... y_{wall-1} = x_0 x_1 ... x_{wall-start-1}; and y_wall ≠ x_{wall-start}. The comparisons will be performed from left to right between y_wall y_{wall+1} ... y_{start+m-1} and x_{wall-start} x_{wall-start+1} ... x_{m-1}. Let k ≥ wall - start be the smallest integer such that x_k ≠ y_{start+k}, or k = m if an occurrence of x starts at position start in y. Then wall takes the value start + k.
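For reference, both tables can be computed with the classical Morris-Pratt and Knuth-Morris-Pratt preprocessing; the Python sketch below follows the definitions above (our transcription, not the paper's code):

    def pre_mp(x):
        # mpNext[i] = length of the longest border of x[0..i-1]; mpNext[0] = -1.
        m = len(x)
        mp_next = [0] * (m + 1)
        mp_next[0] = -1
        i, j = 0, -1
        while i < m:
            while j > -1 and x[i] != x[j]:
                j = mp_next[j]
            i += 1
            j += 1
            mp_next[i] = j
        return mp_next

    def pre_kmp(x):
        # kmpNext[i] = length of the longest border of x[0..i-1] followed by a
        # character different from x[i] (for 1 <= i < m); kmpNext[0] = -1 and
        # kmpNext[m] = m - period(x).
        m = len(x)
        kmp_next = [0] * (m + 1)
        kmp_next[0] = -1
        i, j = 0, -1
        while i < m:
            while j > -1 and x[i] != x[j]:
                j = kmp_next[j]
            i += 1
            j += 1
            if i < m and x[i] == x[j]:
                kmp_next[i] = kmp_next[j]
            else:
                kmp_next[i] = j
        return kmp_next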
Pre-KmpSkipSearch(x, m)
    ▷ Initialization
    for all symbols s in Σ
        do z[s] ← -1
    list[0] ← -1
    z[x_0] ← 0
    for i ← 1 to m - 1
        do list[i] ← z[x_i]
           z[x_i] ← i

Figure 4: Preprocessing of the KmpSkipSearch algorithm.

Attempt(start, wall)
    k ← wall - start
    while k < m and x_k = y_{start+k}
        do k ← k + 1
    return k

Figure 5: Return k ≥ wall - start, the smallest integer such that x_k ≠ y_{start+k}, or k = m if an occurrence of x starts at position start in y, when comparing y_wall y_{wall+1} ... y_{start+m-1} and x_{wall-start} x_{wall-start+1} ... x_{m-1}.

Two shifts (two new starting positions) are then computed: the first one according to the skip algorithm (see algorithm AdvanceSkip in Figure 6 for details), which gives us a starting position skipStart; the second one according to the shift table of Knuth-Morris-Pratt, which gives us another starting position kmpStart. Several cases can arise:

case 1: skipStart < kmpStart: a shift according to the skip algorithm is applied, which gives a new value for skipStart, and we have to again compare skipStart and kmpStart;
case 2: kmpStart < skipStart < wall: a shift according to the shift table of Morris-Pratt is applied, which gives a new value for kmpStart, and we have to again compare skipStart and kmpStart;
case 3: skipStart = kmpStart: another attempt can be performed with start = skipStart;
case 4: kmpStart < wall < skipStart: another attempt can be performed with start = skipStart.

The complete algorithm is given in Figure 7. When an occurrence of x is found in y, it is, of course, possible to shift by the length of the period of x.
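To make the interplay of the two shifts concrete, here is a small Python sketch of the resolution loop implied by cases 1 to 4; advance_skip and mp_shift are hypothetical callbacks standing for the skip-based and Morris-Pratt-based shifts, so this only illustrates the control flow and is not the authors' code.

    def resolve_next_start(skip_start, kmp_start, wall, advance_skip, mp_shift):
        # Repeat cases 1 and 2 until case 3 or 4 holds; the next attempt then
        # starts at skip_start.
        while True:
            if skip_start < kmp_start:                 # case 1
                skip_start = advance_skip()
            elif kmp_start < skip_start < wall:        # case 2
                kmp_start = mp_shift(kmp_start)
            else:                                      # cases 3 and 4
                return skip_start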
AdvanceSkip()
    repeat j ← j + m
    until j ≥ n or z[y_j] ≥ 0
    if j < n
       then i ← z[y_j]

Figure 6: Compute the shift using the buckets.
KmpSkipSearch(x, m, y, n)
    wall ← 0
    i ← -1
    j ← -1
    AdvanceSkip()
    start ← j - i
    while start ≤ n - m
        do wall ← max(wall, start)
           k ← Attempt(start, wall)
           wall ← start + k
           if k = m
              then Report(start)
                   start ← start + period(x)
              else ▷ compute skipStart and kmpStart and apply cases 1 to 4