Approximate Subset Matching with Don't Cares

Amihood Amir†

Bar-Ilan University and Georgia Tech

Moshe Lewenstein‡ Bar-Ilan University

Ely Porat

Bar-Ilan University and Weizmann Institute

Abstract

The Subset Matching problem was recently introduced by Cole and Hariharan. The input of the problem is a text array of n sets totaling s elements and a pattern array of m sets totaling s' elements. There is a match of the pattern at a text location if every pattern set is a subset of the corresponding text set. Subset matching has proven to be a powerful technique and enabled finding an efficient solution to the Tree Matching problem. The subset matching model may prove useful in solving other hard problems, e.g. Swap Matching. In this paper we investigate the complexity of approximate subset matching with "don't care"s. We provide two algorithms for the problem: a randomized algorithm whose complexity is O((s + n + ns'/m)√m log² m) and a deterministic algorithm whose complexity is O((s + n)√(s' log m)).

Key Words: Design and analysis of algorithms, combinatorial algorithms on words, approximate string matching.

1 Introduction

The need for the general task of "pattern matching" is ubiquitous. However, the combinatorial model of the problem varies with the different application domains. This variability is manifested in a number of research directions.

1. The Object Space: Text editing, information retrieval, and computational biology motivate the case where both the text and the pattern being sought in the text are strings (e.g. [15, 9]). Image processing and computer vision motivate the case where the text and pattern are two-dimensional arrays (e.g. [8, 7]). Searching in hypertext motivated the case where the text is a general graph and the pattern is a string [18, 5]. Some models were considered as tools for solving other problems. An example is tree pattern matching (e.g. [16, 12]), which was used for solving problems with grammars. Some models have an intrinsic interest. The Subgraph Isomorphism problem [14] (NP-complete) is the pattern matching problem where both the text and

∗ Department of Mathematics and Computer Science, Bar-Ilan University, 52900 Ramat-Gan, Israel, (972-3)5318407; {amir,moshe,[email protected]
† Partially supported by NSF grant CCR-96-10170, BSF grant 96-00509, and a BIU internal research grant.
‡ Supported by the Ministry of Science, Israel, Eshkol Fellowship 061-1-97. Part of this work was done while the author was visiting the Courant Institute at NYU.


pattern are general graphs. Recently, Cole and Hariharan introduced the Subset Matching problem, where the text and pattern are arrays of sets [10]. The subset matching problem is an interesting generalization of string matching and has proven useful in producing more efficient solutions for tree matching.

2. The Matching Relation: Various applications require different notions of "match". This led to the concept of generalized matching. Examples are exact matching [15], parameterized matching [6], matching with "don't cares" [13], less-than matching [3], and swapped matching [19, 2].

3. Approximation: In approximate matching, one defines a distance function between the objects and seeks all text locations where the pattern matches the text within a pre-specified "small" distance. One of the earliest and most natural metrics is the Hamming distance, where the distance between two strings is the number of mismatching characters. Levenshtein [17] identified three types of errors: mismatches, insertions, and deletions. These operations are traditionally used to define the edit distance between two strings.

Much of the recent research in string matching concerns itself with understanding the inherent "hardness" of the various distance functions on different objects with different match relations, by seeking upper and lower bounds for string matching under these conditions. Such algorithms either provide efficient solutions for applications that need them or advance the general theory of pattern matching by giving new insights or techniques.

The goal of this paper is to investigate the complexity of approximate matching with don't cares in subset matching. Efficient algorithms here both advance our understanding of the subset matching paradigm and could provide tools for more efficient algorithms for other problems, e.g. swap matching. Consider the case of subset matching with don't cares.
If the "don't care"s are all in the pattern, the problem can easily be solved by the methods of Cole and Hariharan: introduce a new symbol for don't care and add it to all text sets; it appears as the single element of every pattern set that is a "don't care". However, "don't care"s in the text are problematic. Conceptually, every "don't care" in the text could be replaced by the universal set, but that may increase the algorithm's complexity prohibitively. A new method is needed for handling "don't care"s in text sets.

An additional new result of this paper, which could not be obtained with current techniques, is counting the number of mismatches wherever the subset relation does not hold. This is, in effect, a generalization of Hamming distance to subset matching.

The paper is organized as follows. We begin with the formal problem definition and preliminaries in Section 2. Section 3 presents a randomized algorithm that solves the approximate subset matching problem with don't cares in time O((s + n + ns'/m)√m log² m), where n is the text length, m is the pattern length, s is the total size of the text sets, and s' is the total size of the pattern sets. In Section 4 we present a deterministic algorithm that solves the problem in time O((s + n)√(s' log m)). Note that for the special case of both text and pattern being composed of singletons (the string matching with mismatches and don't cares problem) our time complexity matches the currently best known algorithm [1].

There are two reasons why we present the randomized algorithm, even though its complexity is actually inferior to that of the deterministic algorithm.

1. From a didactic point of view, it makes the deterministic algorithm easier to understand.

2. The randomized algorithm creates a fairly "random" text. In fact, this method affords the addition of more random elements to further "randomize" the text. This leads to an intriguing concept in pattern matching: for many pattern matching problems, solving the problem over a random text will yield equally efficient (with high probability) solutions for non-random texts. This gives new motivation for the study of pattern matching in random texts.

2 Definitions and Preliminaries

Definition: The Subset Matching Problem is defined as follows.

INPUT: Text T = T_1, T_2, ..., T_n of sets T_i ⊆ Σ, i = 1, ..., n, and pattern P = P_1, P_2, ..., P_m of sets P_i ⊆ Σ, i = 1, ..., m, where Σ is a given alphabet.

OUTPUT: All locations i, 1 ≤ i ≤ n − m + 1, where for all ℓ = 1, ..., m, P_ℓ ⊆ T_{i+ℓ−1}.

For an example of subset matching see Figure 1.

[Figure 1: Exact Subset Matching in text location 2.]

Let s = Σ_{i=1}^{n} |T_i| and s' = Σ_{i=1}^{m} |P_i|. Cole and Hariharan [10] give an O((s + s' + n) log² m log(s + s' + n)) randomized algorithm for solving the subset matching problem. This was subsequently improved in [11] to a deterministic algorithm.

Before defining the approximate version of the subset matching problem with "don't care"s we need the following notation. Let Σ be our alphabet and let ? ∉ Σ. ? will denote our "don't care" symbol. Note that traditionally the symbol used for "don't care" is φ. We do not use this symbol to avoid confusion with the even more common use of φ to denote the null set. Let S_1, S_2 ⊆ Σ, or S_1 = ? or S_2 = ?. Denote

    mis(S_1, S_2) = |S_1 − (S_1 ∩ S_2)|   if S_1, S_2 ≠ ?,
                    0                      otherwise.

In words, if one of S_1, S_2 is ? then mis(S_1, S_2) = 0. Otherwise mis(S_1, S_2) is the number of elements of S_1 that do not appear in S_2.

Definition: The Approximate Subset Matching Problem with don't cares is defined as follows.

INPUT: Text T = T_1, T_2, ..., T_n of sets T_i ⊆ Σ or T_i = ?, i = 1, ..., n, and pattern P = P_1, P_2, ..., P_m of sets P_i ⊆ Σ or P_i = ?, i = 1, ..., m, where Σ is a given alphabet and ? ∉ Σ is a "don't care" symbol that matches any set.


OUTPUT: For every location i, 1 ≤ i ≤ n − m + 1, compute

    Σ_{ℓ=1}^{m} mis(P_ℓ, T_{i+ℓ−1}).
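As a concrete baseline, the table requested above can be computed naively in time proportional to nm times the maximum set size. The Python sketch below (function and variable names are ours, not the paper's; Python's None stands in for the don't-care symbol ?) implements mis and the brute-force scan; the algorithms of Sections 3 and 4 compute the same output faster.

```python
def mis(S1, S2):
    """Number of elements of S1 missing from S2; 0 if either side is a don't care."""
    if S1 is None or S2 is None:      # None plays the role of the don't-care symbol ?
        return 0
    return len(S1 - S2)               # |S1 - (S1 intersect S2)|

def approx_subset_match(T, P):
    """Brute force: for each alignment i, sum mis(P_l, T_{i+l-1}) over the pattern."""
    n, m = len(T), len(P)
    return [sum(mis(P[l], T[i + l]) for l in range(m)) for i in range(n - m + 1)]

T = [{'a', 'b'}, {'a', 'b', 'c'}, None, {'c'}]
P = [{'a'}, {'b', 'd'}]
print(approx_subset_match(T, P))   # [1, 0, 2]
```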

For an example of approximate subset matching see Figure 2.

[Figure 2: Approximate Subset Matching; there are three errors (surrounded by circles) in location 2 of the text.]

The special case where every set in T or P either has a single element or is ? is known as the string matching with mismatches and "don't care"s problem. Abrahamson [1] developed a divide-and-conquer algorithm that solves that problem in time O(n√(m log m)). We will be using that algorithm in Section 3. Variations on the Abrahamson approach were used for the less-than matching problem [3], for the swap matching problem [2] and for the k-mismatches problem [4]. In Section 4 we will develop a new variation of this technique.

3 The Randomized Algorithm

It was pointed out in [10, 11] that without loss of generality one may assume that n = 2m and 3m ≤ s + s' ≤ 6m. This is true since the text may be cut into n/m overlapping segments, each of length 2m (except for the last segment, which may be shorter). For each segment, the sets may be broken down into smaller sets such that the total number of their elements ranges between 3m and 6m. We will, therefore, subsequently assume that n = 2m and 3m ≤ s + s' ≤ 6m, and make the correction for the general parameters at the end of the analysis.

We separate our discussion into two stages. We first consider approximate subset matching without "don't care"s and then show how to handle the case of "don't care"s.

3.1 Intuition

We present the intuition behind both the randomized and the deterministic algorithm. Let S_1 and S_2 be sets. We need to compute mis(S_1, S_2). Consider all pairs of elements ⟨a, b⟩, a ∈ S_1, b ∈ S_2. Charge such a pair with 1 if a ≠ b and with 0 if a = b. Let pairs(S_1, S_2) be the sum of the costs of all pairs.

Lemma 1 Let S_1, S_2 be sets. Then mis(S_1, S_2) = pairs(S_1, S_2) − (|S_1|(|S_2| − 1)).
Proof: Recall that mis(S_1, S_2) is the number of all a ∈ S_1 such that a ∉ S_2. Every such a participates in |S_2| pairs ⟨a, b⟩, a ∈ S_1, b ∈ S_2, where a ≠ b. However, every a ∈ S_1 such that a ∈ S_2 participates in |S_2| − 1 such pairs. Thus pairs(S_1, S_2) = (|S_1|(|S_2| − 1)) + mis(S_1, S_2). We need to consider separately the special case where S_2 is empty. In this case pairs(S_1, S_2) = 0. However, then (|S_1|(|S_2| − 1)) = |S_1|(0 − 1) = −|S_1|. This still gives the desired result because pairs(S_1, S_2) − (|S_1|(|S_2| − 1)) = 0 − (−|S_1|) = |S_1| = mis(S_1, S_2). □

It is easy to compute, for every i, the sum

    Σ_{ℓ=1}^{m} (|P_ℓ|(|T_{i+ℓ−1}| − 1))

in time O(n log m). This can be done by the following polynomial multiplication. Construct T' = |T_1|−1, |T_2|−1, ..., |T_n|−1 and P' = |P_1|, |P_2|, ..., |P_m|. The polynomial multiplication T' · P'^R, where P'^R is the reversal of P', gives the desired result. Such a polynomial multiplication, using the Fast Fourier Transform (FFT), can be done in time O(n log m) on a RAM with word length O(log n). The challenge is an efficient computation, for every i = 1, ..., n − m + 1, of

    Σ_{ℓ=1}^{m} pairs(P_ℓ, T_{i+ℓ−1}).
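The multiplication just described can be sketched as follows; for clarity this sketch uses a naive O(nm) correlation in place of the FFT product T' · P'^R, which computes the same numbers in O(n log m) (names are ours):

```python
def size_term(T, P):
    """For every alignment i, compute sum_l |P_l| * (|T_{i+l-1}| - 1)."""
    t = [len(S) - 1 for S in T]   # T' = |T_1|-1, ..., |T_n|-1
    p = [len(S) for S in P]       # P' = |P_1|, ..., |P_m|
    n, m = len(t), len(p)
    # correlation of t with p: the relevant coefficients of T' * reverse(P')
    return [sum(p[l] * t[i + l] for l in range(m)) for i in range(n - m + 1)]

T = [{'a'}, {'a', 'b'}, {'b', 'c'}, set()]
P = [{'a'}, {'b'}]
print(size_term(T, P))   # [1, 2, 0]
```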

The problem lies in the fact that these sets may each be of size m, and thus the computation may require O(m²) time per location. In Section 3.2 we will show a randomized solution to this problem, and in Section 4 we will show a deterministic solution. In both cases we will separately show how to handle the "don't care"s.
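Lemma 1, on which both solutions rest, can be sanity-checked mechanically. The snippet below (our own illustration, not from the paper) compares pairs(S_1, S_2) − |S_1|(|S_2| − 1) against a direct count of mis(S_1, S_2) on random sets, including the empty-set corner case treated in the proof.

```python
import random

def pairs(S1, S2):
    # charge 1 for every pair (a, b), a in S1, b in S2, with a != b
    return sum(1 for a in S1 for b in S2 if a != b)

random.seed(0)
universe = list(range(6))
for _ in range(1000):
    S1 = set(random.sample(universe, random.randint(0, 6)))
    S2 = set(random.sample(universe, random.randint(0, 6)))
    direct = len(S1 - S2)                                      # direct definition of mis(S1, S2)
    assert direct == pairs(S1, S2) - len(S1) * (len(S2) - 1)   # Lemma 1
print("Lemma 1 verified on 1000 random pairs of sets")
```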

3.2 The Randomized Solution

Following [10] we construct a new text and pattern where the sizes of the sets are not "too large". Recall that 3m ≤ s + s' ≤ 6m. Construct a new text T' of length 14m and a new pattern P' of length 13m in the following manner. For each alphabet symbol a, randomly choose a number n_a between 1 and 12m. Subsequently, for every i such that a ∈ T_i, write a in set i + n_a of T'. Similarly, for every i such that a ∈ P_i, write a in set i + n_a of P'.
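The construction can be sketched as follows (our own illustration; dictionaries of sets stand in for the arrays T' and P', and set j of T' collects every symbol a with a ∈ T_i and i + n_a = j):

```python
import random
from collections import defaultdict

def random_shift(T, P, m):
    """Spread each alphabet symbol by one random offset n_a in 1..12m, shared by text and pattern."""
    shift = {}                # symbol a -> its offset n_a
    Tp = defaultdict(set)     # new text T', conceptually of length 14m
    Pp = defaultdict(set)     # new pattern P', conceptually of length 13m
    for target, source in ((Tp, T), (Pp, P)):
        for i, S in enumerate(source, start=1):
            for a in S:
                na = shift.setdefault(a, random.randint(1, 12 * m))
                target[i + na].add(a)
    return Tp, Pp

m = 2
T = [{'a', 'b'}, {'a'}, {'b', 'c'}, {'c'}]
P = [{'a'}, {'b'}]
Tp, Pp = random_shift(T, P, m)
# every element survives; it is only redistributed to new positions
assert sum(len(S) for S in Tp.values()) == sum(len(S) for S in T)
assert sum(len(S) for S in Pp.values()) == sum(len(S) for S in P)
```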

Lemma 2 For each index j, Pr(T'_j ≠ ∅) ≤ 1/2.

Proof: See [10]. □

Lemma 3 For each index j, Pr(|T'_j| ≥ x + 1 | |T'_j| ≥ x) ≤ 1/2.

Proof: Similar to Lemma 4.2 of [10]. □

The two corollaries below follow directly from the above lemmas.

Corollary 1 For each index j, Pr(|T'_j| > 3 log m) ≤ 1/m³.

Corollary 2 Pr(∃i such that |T'_i| > 3 log m or |P'_i| > 3 log m) ≤ 1/m.

We conclude that in an expected constant number of tries we can compute T' and P' where no set is larger than 3 log m. The advantage of this scheme lies in the fact that for all i = 1, ..., m + 1

    Σ_{ℓ=1}^{m} pairs(P_ℓ, T_{i+ℓ−1}) = Σ_{ℓ=1}^{13m} pairs(P'_ℓ, T'_{i+ℓ−1}).

However, it is easier to compute Σ_{ℓ=1}^{13m} pairs(P'_ℓ, T'_{i+ℓ−1}) because no set has more than 3 log m elements. We can compute all pairs in a brute-force manner and still have no more than O(log² m) computations. The idea is as follows. Assume that every set has exactly 3 log m ordered elements (if there are fewer elements, pad them with "don't care"s). Denote by T'(j) (P'(j)) the string whose i-th location is element j of set T'_i (P'_i). The following algorithm computes Σ_{ℓ=1}^{13m} pairs(P'_ℓ, T'_{i+ℓ−1}).

1. for i = 1 to 3 log m do
2.   for j = 1 to 3 log m do
3.     Compute mismatches of P'(i) in T'(j).
4.   end
5. end

Time: The mismatches can be counted by Abrahamson's algorithm [1] in time O(m√(m log m)), making the total time to compute the pairs O(m√m log^2.5 m). Note that since the text will consist mostly of "don't care"s, it is possible to use the techniques of [2] and reduce the time to O(m√m log² m).

The correction of the sums just calculated by subtracting

    Σ_{ℓ=1}^{13m} (|P'_ℓ|(|T'_{i+ℓ−1}| − 1))

will be done using a single additional polynomial multiplication in the manner previously described, thus adding only an additional O(m log m).

Time for General Text and Pattern: Recall that our entire discussion thus far has assumed text and pattern of sizes 2m and m, respectively, with the sum of set sizes between 3m and 6m. The adjustment to a general text and pattern is standard and mostly similar to the adjustment of [10]. The only difference is that we need to account for the case where there are more elements in the pattern sets than in the text sets. This means that we may actually need to pre-count pattern sets for every length-2m text subsequence, making the total time O((n + s + ns'/m)√m log² m).
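The double loop above can be sketched concretely: pad every set of T' and P' to exactly k = 3 log m ordered slots, build the strings T'(j) and P'(j), and for each alignment count positions carrying two distinct real symbols. The sketch below (our names; padding slots hold None) does each inner count naively where the paper invokes Abrahamson's algorithm [1]:

```python
def count_pair_mismatches(Tp, Pp, k):
    """Sum over l of pairs(P'_l, T'_{i+l-1}) for each alignment i, via k*k string comparisons."""
    pad = lambda S: sorted(S) + [None] * (k - len(S))      # None marks a padding don't care
    T_strs = [[pad(S)[j] for S in Tp] for j in range(k)]   # T'(j): j-th element of every text set
    P_strs = [[pad(S)[j] for S in Pp] for j in range(k)]   # P'(j): j-th element of every pattern set
    n, m = len(Tp), len(Pp)
    out = [0] * (n - m + 1)
    for tj in range(k):
        for pj in range(k):
            t, p = T_strs[tj], P_strs[pj]
            for i in range(n - m + 1):
                out[i] += sum(1 for l in range(m)
                              if p[l] is not None and t[i + l] is not None and p[l] != t[i + l])
    return out

print(count_pair_mismatches([{'a'}, {'a', 'b'}, {'b'}], [{'a'}, {'b'}], k=2))   # [1, 1]
```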

3.3 Adding the "don't care"s

The "don't care"s in the pattern can be easily handled by simply considering every "don't care" as the null set ∅. We therefore only need to consider the case of "don't care"s in the text. Our solution will be to consider the text "don't care"s also as null sets and proceed with the approximate subset matching algorithm described in Section 3.2. We apply that algorithm as a black box. Assume the results we get are stored in array R. These results are wrong in that we count as errors all elements of pattern sets that are matched against "don't care" text sets. However, these amounts can be easily adjusted by a single polynomial multiplication. The multiplication is the following. Construct T'' where location i is 1 if location i in T was ?, and 0 otherwise. Construct P'' where location i is |P_i|. Subtracting T'' · P''^R from the respective locations in array R gives the desired result.

Time: O((n + s + ns'/m)√m log² m).
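This correction is again a single correlation: position i of R is over-counted by exactly the sum of |P_ℓ| over those ℓ with T_{i+ℓ−1} = ?. A sketch, with a naive correlation standing in for the FFT product T'' · P''^R (names are ours; None stands for ?):

```python
def dont_care_correction(T, P):
    """Amount over-counted at each alignment when text don't cares were treated as empty sets."""
    t = [1 if S is None else 0 for S in T]        # T'': indicator of text don't cares
    p = [0 if S is None else len(S) for S in P]   # P'': pattern set sizes (don't cares count 0)
    n, m = len(t), len(p)
    return [sum(p[l] * t[i + l] for l in range(m)) for i in range(n - m + 1)]

T = [{'a'}, None, {'b'}]
P = [{'a', 'b'}, {'b'}]
print(dont_care_correction(T, P))   # [1, 2] -- subtract these from the respective entries of R
```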

4 The Deterministic Algorithm

As mentioned in Section 3.1, our main challenge is summing the pairs when the sets involved may be large. We solve this problem with a new variation of the Abrahamson divide-and-conquer idea [1]. As before, we first describe the approximate subset matching algorithm and then add the part handling "don't care"s.

4.1 Deterministic Approximate Subset Matching

As before, we consider texts of length 2m. However, the further restriction on the sum of the set sizes differs from that of [10]. We restrict ourselves to cases where the total size of all text sets is between 2m and 4m. We then restrict the pattern sets to the alphabet symbols occurring in the text sets. Clearly the total size of the pattern sets may now exceed O(m) (but cannot be greater than s'). Denote the sum of the sizes of the pattern sets by s''.

Our algorithm has two stages. In the polynomial multiplication stage, we count mismatches of block representatives by polynomial multiplication, using the Fast Fourier Transform (FFT). In the correction stage we count mismatches within blocks. We describe the details of the division into blocks, the choice of representatives, and the handling of the correction.

Dividing into Blocks:

We call a symbol that appears in the text at least f = m√(log m / s'') times a frequent symbol. We now divide the text alphabet symbols into blocks in the following fashion.

1. Every frequent symbol defines a block.
2. Designate an occurrence of a non-frequent symbol a ∈ T_ℓ by the pair ⟨a, ℓ⟩ and create a list L of these pairs.
3. Sort L lexicographically by the alphabet symbol. Call the sorted array L'.
4. Divide L' into blocks, each containing no more than 2f elements, in a manner such that no symbol appears in more than one block. [We are assured that such a division is possible because all remaining symbols are non-frequent.]
5. From each block choose one of the alphabet symbols appearing in the block's pairs as representative.

We are guaranteed to have no more than 4m/f = 4√(s''/log m) blocks, i.e. no more than 4√(s''/log m) representatives.
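The division can be sketched as follows (our own greedy packing; f is a parameter here, set to m√(log m / s'') in the analysis, and occurrence counts stand in for the explicit pair list L):

```python
from collections import Counter

def divide_into_blocks(T, f):
    """Map every text symbol to its block representative: frequent symbols form singleton
    blocks; the rest are packed, in sorted order, into blocks of at most 2f occurrences."""
    occ = Counter(a for S in T for a in S)
    rep = {}
    for a, c in occ.items():
        if c >= f:
            rep[a] = a                 # a frequent symbol is its own block
    light = [(a, c) for a, c in sorted(occ.items()) if c < f]
    cur, cur_size = [], 0
    for a, c in light:                 # greedy packing; possible since every c < f <= 2f
        if cur and cur_size + c > 2 * f:
            for b in cur:
                rep[b] = cur[0]        # first symbol of the block serves as representative
            cur, cur_size = [], 0
        cur.append(a)
        cur_size += c
    for b in cur:
        rep[b] = cur[0]
    return rep

T = [{'a', 'b'}, {'a', 'c'}, {'a'}, {'b', 'd'}]
print(divide_into_blocks(T, f=3))   # 'a' is frequent; 'b', 'c', 'd' share one block
```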

Summing Mismatches of Elements in Different Blocks:

We may sum all pairwise differences of elements in different blocks by the following 4√(s''/log m) polynomial multiplications.

1. For each representative a do
   (a) Create string T' where the i-th location of T' is the number of elements in T_i whose representative is a.
   (b) Create string P' where the i-th location of P' is the number of elements in P_i whose representative is not a.
   (c) Compute R_a = T' · P'^R.
2. end
3. Compute R = Σ_a R_a.

We now have the pairwise differences of elements in different blocks. We have not counted the differences between pairs of elements in the same block.
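Steps 1-3 can be sketched directly on top of such a representative map (our names; naive correlations again stand in for the FFT products T' · P'^R):

```python
def cross_block_mismatches(T, P, rep):
    """R[i] = number of pairs (a, b), a in P_l, b in T_{i+l-1}, lying in different blocks."""
    n, m = len(T), len(P)
    R = [0] * (n - m + 1)
    for r in sorted(set(rep.values())):
        t = [sum(1 for a in S if rep[a] == r) for S in T]   # T': elements represented by r
        p = [sum(1 for a in S if rep[a] != r) for S in P]   # P': elements represented by others
        for i in range(n - m + 1):
            R[i] += sum(p[l] * t[i + l] for l in range(m))
    return R

# with every symbol its own block, this counts all mismatching pairs, i.e. the full pairs sum
rep = {'a': 'a', 'b': 'b', 'c': 'c'}
print(cross_block_mismatches([{'a', 'b'}, {'b'}, {'a', 'c'}], [{'a'}, {'b'}], rep))   # [1, 3]
```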

Correction within Blocks:

For every one of the s00 elements in the pattern, compare it with all text elements in its block and use the index [the second element in the pair] to add a mismatch to the appropriate mismatch counters.

Time:

O(m√(log m)) for block construction.

O(4√(s''/log m) · m log m) = O(m√(s'' log m)) for summing mismatches of pairs in different blocks.

O(s''f) = O(s'' · m√(log m / s'')) = O(m√(s'' log m)) for the correction stage.

Total Time: O(m√(s'' log m)).

Total Time for General Text and Pattern: O((s + n)√(s' log m)).

Note that we still need a polynomial multiplication for computing Σ_{ℓ=1}^{m} (|P_ℓ|(|T_{i+ℓ−1}| − 1)) to obtain the subset mismatch value from the pairwise mismatch count.

4.2 The Deterministic "don't care" Case

In the pairwise mismatch computation, always put a 0 in the location of a "don't care". In the correction phase, likewise, always put a 0 in the location of a "don't care".

References

[1] K. Abrahamson. Generalized string matching. SIAM J. Comp., 16(6):1039-1051, 1987.
[2] A. Amir, Y. Aumann, G. Landau, M. Lewenstein, and N. Lewenstein. Pattern matching with swaps. Proc. 38th IEEE FOCS, pages 144-153, 1997.
[3] A. Amir and M. Farach. Efficient 2-dimensional approximate matching of half-rectangular figures. Information and Computation, 118(1):1-11, April 1995.
[4] A. Amir, M. Lewenstein, and E. Porat. Faster algorithms for string matching with k mismatches. Proc. 11th ACM-SIAM Symposium on Discrete Algorithms (SODA), pages 794-803, 2000.
[5] A. Amir, N. Lewenstein, and M. Lewenstein. Pattern matching in hypertext. J. of Algorithms, 35:82-99, 2000.
[6] B. S. Baker. A theory of parameterized pattern matching: algorithms and applications. Proc. 25th Annual ACM Symposium on the Theory of Computing, pages 71-80, 1993.
[7] T. J. Baker. A technique for extending rapid exact-match string matching to arrays of more than one dimension. SIAM J. Comp., 7:533-541, 1978.
[8] R. S. Bird. Two dimensional pattern matching. Information Processing Letters, 6(5):168-170, 1977.
[9] R. S. Boyer and J. S. Moore. A fast string searching algorithm. Comm. ACM, 20:762-772, 1977.
[10] R. Cole and R. Hariharan. Tree pattern matching and subset matching in randomized O(n log³ m) time. Proc. 29th ACM STOC, pages 66-75, 1997.
[11] R. Cole, R. Hariharan, and P. Indyk. Tree pattern matching and subset matching in deterministic O(n log³ n) time. Proc. 10th ACM-SIAM Symposium on Discrete Algorithms (SODA), pages 245-254, 1999.
[12] M. Dubiner, Z. Galil, and E. Magen. Faster tree pattern matching. J. ACM, 41(2):205-213, 1994.
[13] M. J. Fischer and M. S. Paterson. String matching and other products. In Complexity of Computation, R. M. Karp (editor), SIAM-AMS Proceedings, 7:113-125, 1974.
[14] M. R. Garey and D. S. Johnson. Computers and Intractability: A Guide to the Theory of NP-Completeness. W. H. Freeman and Co., New York, 1979.
[15] D. E. Knuth, J. H. Morris, and V. R. Pratt. Fast pattern matching in strings. SIAM J. Comp., 6:323-350, 1977.
[16] S. R. Kosaraju. Efficient tree pattern matching. Proc. 30th IEEE FOCS, pages 178-183, 1989.
[17] V. I. Levenshtein. Binary codes capable of correcting deletions, insertions and reversals. Soviet Phys. Dokl., 10:707-710, 1966.
[18] U. Manber and S. Wu. Approximate string matching with arbitrary cost for text and hypertext. Proc. Int'l Workshop on Structural and Syntactic Pattern Recognition, pages 22-33, 1992.
[19] S. Muthukrishnan and K. Palem. Non-standard stringology: algorithms and complexity. Proc. 26th Annual ACM Symposium on the Theory of Computing, pages 770-779, 1994.
