COMPUTING PRACTICES

By reducing an array matching problem to a string matching problem in a natural way, it is shown that efficient string matching algorithms can be applied to arrays, assuming that a linear preprocessing is made on the text.

Edgar H. Sibley Panel Editor

A Technique for Two-Dimensional Pattern Matching

Rui Feng Zhu and Tadao Takaoka

Three major efficient pattern matching algorithms were proposed in the past decade, one by Knuth, Morris, and Pratt (KMP algorithm) [8], one by Boyer and Moore (BM algorithm) [5], and one by Rabin and Karp (RK algorithm) [6]. A natural generalization of the pattern matching problem is the two-dimensional pattern matching problem, where the pattern is a two-dimensional array of characters PATTERN[1 . . P1, 1 . . P2] and the text is a second array TEXT[1 . . N1, 1 . . N2]. In this case the problem is to determine where, if anywhere, the pattern occurs as a subarray of the text. In this article we first present an efficient pattern matching algorithm for the two-dimensional case, one which is a combination of the KMP and RK algorithms. Assuming, in the worst case, that the RK algorithm runs linearly for one-dimensional array matching, this algorithm takes time proportional to O(N1 * N2 + P1 * P2) in the worst and average cases, which is clearly optimal since both the pattern and the text have to be read and this takes O(N1 * N2 + P1 * P2) time. The algorithm needs O(N1 * N2 + P1 * P2) space, and also works on-line (assuming the text is read row by row). Second, we describe another approach, which takes O(N1 * N2 * log(P2)/P2 + P1 * P2) time on average and also runs faster when the row length of the pattern increases. Computer experiments show that for various pattern sizes the average cost of either of our algorithms is much less than that of the algorithm proposed by Bird (B algorithm) [4].

©1989 ACM 0001-0782/89/0900-1110 $1.50

1110    Communications of the ACM

THE KMP ALGORITHM
The KMP algorithm [8] solves the problem by comparing the characters of the pattern with those of the text from the left end of the pattern. To illustrate this approach, consider an example, searching for pattern = 'ababb' in text = 'babbababacababb'.

pattern: ababb
text:    babbababacababb

Since the text pointer points to the character b, which does not match 'a' in the pattern, the pattern is shifted one place to the right and the next text character is inspected.

pattern:  ababb
text:    babbababacababb

There is a match, so the pattern remains where it is while the next several characters are scanned until we come to a mismatch.

pattern:  ababb
text:    babbababacababb

The first two pattern characters are matched, but there is a mismatch on the third. Thus it is known that the last three characters of the text have been ab?, where '?' is not equal to 'a'. Next the pattern can be shifted three more places to the right, since '?' is not equal to 'a' and a shift of one or two positions could not lead to a match. Thus the situation appears as:

pattern:     ababb
text:    babbababacababb

September 1989  Volume 32  Number 9


The characters are scanned again and another partial match occurs. A mismatch occurs on the fifth pattern character.

pattern:     ababb
text:    babbababacababb

Similarly, we know that the last five text characters were abab?, where ? is not equal to 'b' but might be equal to 'a'. Thus the pattern should be shifted two places to the right.

pattern:       ababb
text:    babbababacababb

Comparing the two characters we get a mismatch, so we shift the pattern a further four places.

pattern:           ababb
text:    babbababacababb

Scanning the following characters, the full pattern is found. The pattern match has succeeded.

pattern:           ababb
text:    babbababacababb

This example shows that by knowing the characters in the pattern and the position where a mismatch occurs, it can be determined where in the pattern to continue the search for a match without moving backward in the text. It also tells us that the process of pattern matching will run much more efficiently if we have an auxiliary table that provides the information of how far to slide the pattern when a mismatch has been detected at a certain position in the pattern. The following algorithm summarizes the behavior of the KMP algorithm.

ALGORITHM 1: KMP Algorithm.
INPUT: The text (pattern) is stored in the array TEXT[1 . . N] (PT[1 . . M]), h is the auxiliary function, and p and q are pointers into the pattern and text, respectively.
OUTPUT: Location at which the pattern is found.
METHOD:
begin
  compute-h;
  p := 1; q := 1;
  while (p ≤ M) and (q ≤ N) do
    begin
      while (p > 0) and (PT[p] ≠ TEXT[q]) do p := h[p];
      p := p + 1; q := q + 1
    end;
  if p ≤ M then writeln('not found')
  else writeln('The left end of the pattern is found at', q - p + 1)
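Algorithm 1 translates almost line for line into runnable code. The following Python sketch is an illustration rather than the authors' program (the names compute_h and kmp are ours); it keeps the 1-based pointers of the pseudocode and returns the 1-based position of the left end of the first match, or 0 when there is none.

```python
def compute_h(pt):
    """Strict failure function h of the paper: h[p] is the pattern
    position at which to resume after a mismatch at position p (1-based)."""
    m = len(pt)
    h = [0] * (m + 1)        # h[1] = 0
    t = 0
    for p in range(2, m + 1):
        while t > 0 and pt[p - 2] != pt[t - 1]:
            t = h[t]
        t += 1
        h[p] = t if pt[p - 1] != pt[t - 1] else h[t]
    return h

def kmp(pattern, text):
    """Return the 1-based left end of the first occurrence, or 0."""
    m, n = len(pattern), len(text)
    h = compute_h(pattern)
    p = q = 1
    while p <= m and q <= n:
        while p > 0 and pattern[p - 1] != text[q - 1]:
            p = h[p]
        p += 1
        q += 1
    return q - p + 1 if p > m else 0

# The worked example above: 'ababb' begins at position 11 of the text.
print(kmp('ababb', 'babbababacababb'))
```

Running the function on the example of this section reproduces the final alignment shown above.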


end.

procedure compute-h;
begin
  t := 0; h[1] := 0;
  for p := 2 to M do
    begin
      while (t > 0) and (PT[p - 1] ≠ PT[t]) do t := h[t];
      t := t + 1;
      if PT[p] ≠ PT[t] then h[p] := t else h[p] := h[t]
    end
end.

Note: The definition of the h function is:

h[p] = max{s | (s = 0) or ((PT[1 . . s - 1] = PT[p - s + 1 . . p - 1]) and (PT[s] ≠ PT[p]) and (1 ≤ s < p))}.

THE BM ALGORITHM
The BM algorithm [5] compares the characters of the pattern with those of the text from the right end of the pattern. Its main loop is:

begin
  q := M;
  repeat
    p := M;
    while (p > 0) and (PATTERN[p] = TEXT[q]) do
      begin p := p - 1; q := q - 1 end;
    if p ≠ 0 then q := q + max(D[TEXT[q]], DD[p])
  until (p = 0) or (q > N)
end.

Note: The definitions of the D shift function and the DD shift function are:

D[ch] = min{s | s = M or (0 < s < M and PATTERN[M - s] = ch)};

DD[p] = min{s + M - p | (s ≥ 1) and (s ≥ p or PATTERN[p - s] ≠ PATTERN[p]) and (s ≥ i or PATTERN[i - s] = PATTERN[i], for p < i ≤ M)}.

Here D and DD are often referred to as shift functions. The DD shift function is based on the idea that when the pattern is moved right, it has to (1) match over all the characters previously matched, and (2) bring a different character over the character of the text that caused the mismatch. Figure 3 illustrates the definition of the DD shift function. The D shift function uses the fact that we must bring over ch = text[q] (the character that caused the mismatch) the first occurrence of ch in the pattern viewed from the right end. The definition of the D shift function is shown in Figure 4. Both shift functions can be obtained from precomputed tables based solely on the pattern and the alphabet used. The DD shift function requires a table of length equal to the pattern, while the D shift function requires a table of size equal to the alphabet size. Given the two values of the two shift functions, the BM algorithm chooses the larger one. For reasons of space the details regarding computation of the D and DD shift functions are omitted. (Details are given in [5] and [8].)

FIGURE 3. The Definition of the DD Shift Function: (a) s = M, (b) s < p, (c) s ≥ p

FIGURE 4. The Definition of the D Shift Function

The Algorithm
Examining the example in Figure 2, we find that the values of PATTERN' and TEXT' vary over a great range. This means that in most cases the comparisons between PATTERN' and TEXT' are mismatches. Based on this observation, we modify our algorithm in the following way. First, using any efficient sorting method, we sort PATTERN' and store the sorted PATTERN' and its index values into PH1[1 . . 2, 1 . . P2]. The PH1 values of the example in Figure 2 are given in Figure 5.

FIGURE 5. An Example of PH1

Second, incorporating the DD shift function of the BM algorithm into our situation, we replace procedure KMP1 of algorithm 3 by procedure SEARCH, in which we begin comparison at the end of PATTERN'. The algorithm is summarized as follows.

ALGORITHM 5.
INPUT: The text and pattern are stored in TEXT[1 . . N1, 1 . . N2] and PATTERN[1 . . P1, 1 . . P2], respectively.
OUTPUT: Location at which the pattern occurs.
METHOD:
 1. begin
 2.   texthash;
 3.   for row := P1 to (N1 - 1) do change(row);
 4.   row := P1;
 5.   dm := 1; for j := 1 to P1 - 1 do dm := (d * dm) mod q;
 6.   read(PATTERN);
 7.   pathash;
 8.   repeat SEARCH(PATTERN', TEXT', found);
 9.     row := row + 1
10.   until (found = true) or (row > N1);
12.   if found = true
13.     then writeln('The beginning of the pattern is found at', row - P1 + 1, column + 1)
14.     else writeln('not found')
15. end.

procedure SEARCH(PATTERN', TEXT', found);
begin
  j := P2;
  repeat
    i := P2;
    while (i > 0) and (TEXT'[row, j] = PATTERN'[i]) do
      begin j := j - 1; i := i - 1 end;
    if i = 0 then
      begin
        directcompare(found, row, j + P2);
        column := j
      end
    else
      begin
        y := binarysearch(TEXT'[row, j]);
        if y = P2 + 1 then
          j := j + P2
        else
          begin
            k := P2 - PH1[2, y];
            if DD[i] > k then j := j + DD[i] else j := j + k
          end
      end
  until (j > N2) or (found = true)
end;

function binarysearch(v: integer): integer;
begin
  l := 1; r := P2;
  repeat
    x := (l + r) div 2;
    if v < PH1[1, x] then r := x - 1 else l := x + 1
  until (v = PH1[1, x]) or (l > r);
  if v = PH1[1, x] then binarysearch := x else binarysearch := P2 + 1
end;

Notice that the mismatch occurs at the right end of PATTERN' on almost every occasion when PATTERN' is positioned over TEXT', and this will make the BM strategy work efficiently. As the elements of TEXT' are large numbers bounded by q, the size of the character alphabet is equal to q; since the definition of the D shift function in the BM algorithm is based on the size of the character alphabet, use of the D shift function is not appropriate in this context. Instead, we use function binarysearch to examine whether the value of TEXT' at position j is the same as any of PH1[1, k], 1 ≤ k ≤ P2, and to give the position in PH1 if the same value exists. As discussed above, in most cases the answer is NO, which means that j is increased by P2, that is, PATTERN' is slid P2 places to the right. On the other hand, occasionally the answer may be YES. In this case, comparing the two amounts of shift made by the DD shift function and binarysearch, we choose the larger one. The amount k of shift made by binarysearch is the distance from the position kk where PATTERN'[kk] = TEXT'[row, j] to P2. As for the DD shift function, refer to [8]. For each binarysearch, log(P2) steps are needed, so the average time of this approach is O(N1 * N2 * log(P2)/P2 + P1 * P2). Obviously, it takes less time as P2 increases, that is, as the row length of the pattern gets longer. Refer to the third column of Table 1. We can find that this approach works best when P1 and P2 are larger than 30. The advantages of this approach are similar to those listed in the previous section dealing with algorithm 3.
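The overall method of this section can be sketched in Python as follows. This is an illustration rather than the authors' program: it hashes each P1-high column (the role of texthash and pathash), scans each row of hashes right to left, and on a mismatch keys the shift on the window's rightmost hash via a binary search over the sorted pattern hashes (the role of PH1 and binarysearch). The DD shift is omitted and the shift is always keyed on the rightmost window position, a Horspool-style simplification; d and q are illustrative radix and modulus choices, and a direct comparison guards against hash collisions, as directcompare does in the paper.

```python
import bisect

def match_2d(text, pattern, d=256, q=16647133):
    """Report the 0-based top-left corners of all occurrences of
    pattern (a list of equal-length strings) in text."""
    n1, n2 = len(text), len(text[0])
    p1, p2 = len(pattern), len(pattern[0])

    def colhash(grid, top, j):
        # Hash of the p1-high column of grid ending at row top + p1 - 1.
        h = 0
        for i in range(p1):
            h = (h * d + ord(grid[top + i][j])) % q
        return h

    pat = [colhash(pattern, 0, j) for j in range(p2)]   # PATTERN'
    ph1 = sorted((h, j) for j, h in enumerate(pat))     # sorted hashes + indices
    keys = [h for h, _ in ph1]

    hits = []
    for top in range(n1 - p1 + 1):
        row = [colhash(text, top, j) for j in range(n2)]  # TEXT' for this band
        s = 0
        while s + p2 <= n2:
            i = p2 - 1
            while i >= 0 and row[s + i] == pat[i]:
                i -= 1
            if i < 0:
                # Hashes agree; verify directly to rule out collisions.
                if all(text[top + a][s + b] == pattern[a][b]
                       for a in range(p1) for b in range(p2)):
                    hits.append((top, s))
                s += 1
            else:
                h = row[s + p2 - 1]
                k = bisect.bisect_right(keys, h) - 1
                if k < 0 or keys[k] != h:
                    s += p2                          # hash absent: full shift
                else:
                    s += max(1, p2 - 1 - ph1[k][1])  # align rightmost occurrence
    return hits
```

Note that the sketch recomputes the column hashes for every band; the paper instead updates TEXT' incrementally in O(1) per entry, which is what gives the stated running time.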

TWO-DIMENSIONAL TEXT MODIFICATION
The main aim of pattern matching is to modify a portion of the TEXT at positions where the pattern is found. In the one-dimensional case, the portion is often lengthened or shortened after modification. But in the two-dimensional case, the modified portion generally does not change in size. This can be defined formally as follows. Assuming the pattern is found at TEXT[k1, k2], that is:

PATTERN[i, j] = TEXT[k1 + i - P1, k2 + j - P2]    (1 ≤ i ≤ P1, 1 ≤ j ≤ P2)

then TEXT[k1 + i - P1, k2 + j - P2] (1 ≤ i ≤ P1, 1 ≤ j ≤ P2) is replaced by PATTERN'[1 . . P1, 1 . . P2]; correspondingly, a portion of TEXT' should be substituted. The range of such substitution in TEXT' is

[k1 - P1 + i, k2 - P2 + j]    (1 ≤ i ≤ 2 * P1 - 1, 1 ≤ j ≤ P2)

and is shown in the shaded part of Figure 6. The algorithm for modifying TEXT and TEXT' is summarized as follows:

procedure MODIFYTEXT';
begin
  for i := 1 to P1 do
    for j := 1 to P2 do
      TEXT[k1 - P1 + i, k2 - P2 + j] := PATTERN'[i, j];
  for i := k1 - P1 + 1 to k1 + P1 - 1 do
    for j := k2 - P2 + 1 to k2 do
      TEXT'[i, j] := (TEXT'[i - 1, j] - TEXT[i - P1, j] * dm) * d + ord(TEXT[i, j])
end;

FIGURE 6. The Range of Substitution in TEXT'

Procedure MODIFYTEXT' takes O(P1 * P2) time. This means that we can use TEXT' many times without spending much time modifying TEXT' when a portion of TEXT is replaced.
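The update formula inside MODIFYTEXT' is the usual rolling-hash identity: subtract the contribution of the character leaving the P1-high window, shift by the radix d, and add the entering character, where dm = d^(P1-1) is the weight of the departing character. A small Python check (our illustration, with arbitrary values of d and P1; the paper additionally reduces modulo a prime q):

```python
def direct_hash(col, i, p1, d):
    """Hash of the window col[i-p1+1 .. i], read as a base-d number."""
    h = 0
    for c in col[i - p1 + 1 : i + 1]:
        h = h * d + ord(c)
    return h

d, p1 = 256, 3
dm = d ** (p1 - 1)          # weight of the character that drops out
col = "abcdef"
h = direct_hash(col, p1 - 1, p1, d)
for i in range(p1, len(col)):
    # One step of the MODIFYTEXT' update: drop col[i-p1], admit col[i].
    h = (h - ord(col[i - p1]) * dm) * d + ord(col[i])
    assert h == direct_hash(col, i, p1, d)
print("rolling update matches direct recomputation")
```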

THE B ALGORITHM
The B algorithm invented by Bird [4] is briefly described as follows. The text text[1 . . N, 1 . . N] is read from the upper left corner to the bottom right corner, in the order text[1, 1], . . . , text[1, N], text[2, 1], . . . , text[2, N], . . . , text[N, 1], . . . , text[N, N]. The pattern pat[1 . . M, 1 . . M] is divided into M rows R[1] = pat[1, 1 . . M], . . . , R[M] = pat[M, 1 . . M]. Note that some of R[1], . . . , R[M] may be equal to each other. Each time text[i, j] is read, it is determined whether the portion text[i, j - M + 1 . . j] = R matches any of R[1], . . . , R[M] (horizontal matching). See Figure 1 and Figure 2.

FIGURE 1. An Example of Pattern

We use a one-dimensional array a[1 . . N] such that a[j] = k means that k - 1 rows R[1], . . . , R[k-1] match the portion text[i - k + 1 . . i - 1, j - M + 1 . . j]. If R = R[k], a[j] is increased by one. If not, we find the maximum s such that

    R = R[s], R[1] = R[k-s+1], . . . , R[s-1] = R[k-1]    (1)

using the KMP method vertically, and set a[j] to s + 1. If there is no such s, a[j] is set to 0. If a[j] = M + 1, we can say that the pattern has been found. This vertical use of the KMP method is done by regarding the sequence of rows R[1], . . . , R[M] as a linear pattern r[1 . . M]. The elements of r[1 . . M] can be up to M different symbols, or integers standing for row patterns. Condition (1) above is illustrated in Figure 3. An example of pat[1 . . 5, 1 . . 5] and r[1 . . 5] is given in Figure 4; we do not distinguish between rows and their corresponding integer values. While R ≠ r[k], we perform k := h[k], where h is the h function used in the KMP method. We repeat this until condition (1) is true. For the above r, we have the following h function:

k      1  2  3  4  5
r[k]   1  2  3  1  3
h[k]   0  1  1  0  2

FIGURE 2. An Example of Text

FIGURE 3. Illustration of Condition (1)

FIGURE 4. An Example of pat[1 . . 5, 1 . . 5] and Its Corresponding r[1 . . 5]

Now we go back to the problem of the horizontal matching. The solution is well explained by using the above example. We construct a finite automaton for multiple pattern matching, invented by Aho and Corasick [1], as shown in Figure 5. In this machine, 0 is the initial state and 5, 8, and 12 are final states, shown by double circles. The final states output integer values for the corresponding rows. Using the output function out, these are specified as:

out(5)  = 1 = r[1] = r[4],
out(8)  = 2 = r[2],
out(12) = 3 = r[3] = r[5].

The goto function g is shown in Figure 5; for example, g(2, b) = 3, g(2, a) = 6, and so on. If there is no definition of g in the machine, we set the value of g to "fail," for example, g(6, a) = fail. If the value of g is fail, the failure function f is consulted, which tells us from which state the processing should resume. For the above example, f becomes as follows:

The B algorithm is given below, where i and j are expressed by row and column.

B Algorithm. (a).
Input: A text array T[1 . . N1, 1 . . N2] and a pattern matching machine constructed from pattern PT[1 . . P1, 1 . . P2]. The machine is composed of goto function g, failure function f, and output function out.
Output: Locations in T at which PT occurs.
Method:
begin
  for row := 1 until N1 do
    begin
      state := 0;
      for column := 1 until N2 do
        begin
          while g(state, T[row, column]) = fail do
            state := f(state);
          state := g(state, T[row, column]);
          if out(state) ≠ 0 then
            begin
              k := a[column];
              while (k > 0) and (PT[k, 1 . . P2] ≠ out(state)) do
                k := h(k);
              a[column] := k + 1;
              if k = P1 then
                print('PT is found at (row - P1, column - P2)')
            end
          else a[column] := 0
        end
    end
end.

FIGURE 5. Multiple Pattern Matching Machine
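For concreteness, here is a compact Python rendering of the whole B algorithm (our illustration, not Bird's or the authors' code). It builds the goto, failure, and output functions of parts (b) and (c) as dictionaries, and for the vertical matching uses the standard KMP prefix function over the row identifiers r[1 . . P1] rather than the strict h of part (d); since all rows have length P2, no output inheritance along failure links is needed. It reports the 0-based top-left corner of each occurrence.

```python
from collections import deque

def bird(text, pattern):
    p1, p2 = len(pattern), len(pattern[0])
    ids = {}
    r = [0] + [ids.setdefault(row, len(ids) + 1) for row in pattern]  # r[1..p1]

    # (b) goto and output functions over the distinct rows.
    g = [{}]; out = [0]
    for row, rid in ids.items():
        s = 0
        for ch in row:
            if ch not in g[s]:
                g.append({}); out.append(0); g[s][ch] = len(g) - 1
            s = g[s][ch]
        out[s] = rid

    # (c) failure function, by breadth-first traversal of the goto graph.
    f = [0] * len(g)
    queue = deque(g[0].values())       # depth-1 states already have f = 0
    while queue:
        u = queue.popleft()
        for ch, v in g[u].items():
            queue.append(v)
            s = f[u]
            while s and ch not in g[s]:
                s = f[s]
            f[v] = g[s].get(ch, 0)

    # Vertical failure function: KMP prefix function over r[1..p1].
    pi = [0] * (p1 + 1)
    t = 0
    for p in range(2, p1 + 1):
        while t and r[p] != r[t + 1]:
            t = pi[t]
        if r[p] == r[t + 1]:
            t += 1
        pi[p] = t

    # (a) main scan: a[j] counts pattern rows matched so far ending at column j.
    a = [0] * len(text[0])
    hits = []
    for i, line in enumerate(text):
        state = 0
        for j, ch in enumerate(line):
            while state and ch not in g[state]:
                state = f[state]
            state = g[state].get(ch, 0)
            rid = out[state]
            if rid:
                k = a[j]
                while k and r[k + 1] != rid:
                    k = pi[k]
                if r[k + 1] == rid:
                    k += 1
                a[j] = k
                if k == p1:
                    hits.append((i - p1 + 1, j - p2 + 1))
                    a[j] = pi[k]
            else:
                a[j] = 0
    return hits
```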


(b). Construction of the goto function.
Input: Pattern PT[1 . . P1, 1 . . P2].
Output: Goto function g and output function out.
Method: We assume out(s) = 0 when state s is first created, and g(s, a) = fail if 'a' is undefined or if g(s, a) has not yet been defined. The procedure enter(PT) inserts into the goto graph a path that spells out PT. The elements of r[1 . . P1] represent each unique row of PT.

begin
  newstate := 0; out(0) := 0;
  for i := 1 until P1 do enter(r[i], PT[i, 1 . . P2]);
  for all 'a' such that g(0, a) = fail do g(0, a) := 0
end.

procedure enter(i1, A1 A2 . . . Am);
begin
  state := 0; j := 1;
  while g(state, Aj) ≠ fail do
    begin state := g(state, Aj); j := j + 1 end;
  for p := j until m do
    begin
      newstate := newstate + 1;
      g(state, Ap) := newstate;
      state := newstate;
      out(state) := 0
    end;
  out(state) := i1
end.

(c). Construction of the failure function.
Input: Goto function g and output function out.
Output: Failure function f.
Other variables: queue is a first-in first-out data structure.
Method:
begin
  queue := empty;
  for each 'a' such that g(0, a) = s ≠ 0 do
    begin
      queue := queue.s;  {add s to the end of queue}
      f(s) := 0
    end;
  while queue ≠ empty do
    begin
      let r be the head of queue, that is, queue = r.tail;
      queue := tail;
      for each 'a' such that g(r, a) = s ≠ fail do
        begin
          queue := queue.s;
          state := f(r);
          while g(state, a) = fail do state := f(state);
          f(s) := g(state, a)
        end
    end
end.

(d). Computation of the h function.
Input: Pattern PT[1 . . P1, 1 . . P2].
Output: Shift function h.
Method:
begin
  t := 0; h[1] := 0;
  for p := 2 to P1 do
    begin
      while (t > 0) and (PT[p - 1, 1 . . P2] ≠ PT[t, 1 . . P2]) do t := h[t];
      t := t + 1;
      if PT[p, 1 . . P2] ≠ PT[t, 1 . . P2] then h[p] := t else h[p] := h[t]
    end
end.
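Part (d) is the compute-h procedure of the KMP section applied to whole rows. Run over the row identifiers, it reproduces the h table of the earlier r = 1 2 3 1 3 example (a sketch; the function name is ours):

```python
def shift_h(r):
    """Strict shift function h over a 1-based sequence r (r[0] unused)."""
    p1 = len(r) - 1
    h = [0] * (p1 + 1)
    t = 0
    for p in range(2, p1 + 1):
        while t > 0 and r[p - 1] != r[t]:
            t = h[t]
        t += 1
        h[p] = t if r[p] != r[t] else h[t]
    return h

# r = 1 2 3 1 3 from the example gives h = 0 1 1 0 2.
print(shift_h([0, 1, 2, 3, 1, 3])[1:])
```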

CR Categories and Subject Descriptors: F.2.2 [Analysis of Algorithms and Problem Complexity]: Nonnumerical Algorithms and Problems - pattern matching; I.5.4 [Pattern Recognition]: Applications - text processing
General Terms: Algorithms, Performance
Additional Key Words and Phrases: Analysis of algorithms, information retrieval, pattern matching, pattern recognition, string searching, text editing

Acknowledgments. The authors would like to thank Mr. Mark F. Villa of the University of Alabama at Birmingham for correcting the English of the first manuscript.

REFERENCES
1. Aho, A.V., and Corasick, M.J. Efficient string matching: An aid to bibliographic search. Commun. ACM 18, 6 (June 1975), 333-340.
2. Apostolico, A., and Giancarlo, R. The Boyer-Moore-Galil string searching strategies revisited. SIAM J. Comput. 15, 1 (Feb. 1986), 98-105.
3. Baker, T.P. A technique for extending rapid exact-match string matching to arrays of more than one dimension. SIAM J. Comput. 7, 4 (Nov. 1978), 533-541.
4. Bird, R.S. Two dimensional pattern matching. Inf. Process. Lett. 6, 5 (Oct. 1977), 168-170.
5. Boyer, R.S., and Moore, J.S. A fast string searching algorithm. Commun. ACM 20, 10 (Oct. 1977), 762-772.
6. Davies, G., and Bowsher, S. Algorithms for pattern matching. Softw. Pract. Exper. 16, 6 (June 1986), 575-601.
7. Galil, Z. On improving the worst case running time of the Boyer-Moore string matching algorithm. Commun. ACM 22, 9 (Sept. 1979), 505-508.
8. Knuth, D.E., Morris, J.H., and Pratt, V.R. Fast pattern matching in strings. SIAM J. Comput. 6, 2 (June 1977), 323-350.
9. Zhu, R.F., and Takaoka, T. On improving the average case of the Boyer-Moore string matching algorithm. J. Inf. Process. 10, 3 (Mar. 1987), 173-177.

ABOUT THE AUTHORS:

RUI FENG ZHU is a graduate student of computer science at the University of Ibaraki, Japan.

TADAO TAKAOKA is a professor of computer science at the University of Ibaraki, Japan. His research interests include analysis of algorithms and program verification, and he is coauthor of Fundamental Algorithms.

Authors' Present Address: Department of Computer Science, University of Ibaraki, Hitachi, Japan 316.

Permission to copy without fee all or part of this material is granted provided that the copies are not made or distributed for direct commercial advantage, the ACM copyright notice and the title of the publication and its date appear, and notice is given that copying is by permission of the Association for Computing Machinery. To copy otherwise, or to republish, requires a fee and/or specific permission.