Boolean Similarity Measures for Resource Discovery Technical Report USC-CS-94-579
Shih-Hao Li and Peter B. Danzig
Computer Science Department University of Southern California Los Angeles, California 90089-0781 fshli,
[email protected]
Abstract
We develop a new method to rank the degree of similarity between Boolean expressions, contrast it with other known methods, and describe its implementation. Our method reduces time and space complexity from exponential to polynomial in the number of Boolean terms.
Index Terms - Boolean query, information retrieval, ranking, resource discovery, similarity measure.
1 Introduction
Most library information systems let users make Boolean queries against their database. Internet resource discovery systems, such as WAIS [1] and our Indie [2], also support Boolean queries. Frequently, users find it convenient if the retrieval system returns the answers to their queries in ranked order. This paper develops an efficient algorithm to rank the similarity between a user's Boolean query and a set of objects, each described by a Boolean expression. Our method produces similarity values between zero and one. If the query and the object with which it is compared contain some identical terms, the similarity is non-zero. If they are identical, the similarity is close or equal to one. If they contain no common terms whatsoever, the similarity is zero.

Directories of services are retrieval systems that greatly benefit from ranking. The objects they contain are descriptions of other retrieval systems. Directories of services enable users to find retrieval systems that they previously did not know existed. As shown in Figure 1, a user sends a query to the directory of services, which determines and ranks the retrieval systems relevant to the user's request. It does this by estimating the "similarity" between the user's query and the description of each retrieval system. The user employs the rankings when selecting the retrieval systems to query directly. Indie, WAIS, and GLOSS [3] are resource discovery tools that support this feature.

Throughout this paper, we assume that retrieval systems (or servers) are described by Boolean expressions that are either computed automatically [4] or assigned by hand. It is conceivable that a particular server description includes the popular keywords occurring in the server's documents, but the mechanism by which these descriptions are computed is beyond the scope of this paper.
Example 1 Consider the following Boolean expression,

((keyword = network) or (keyword = UNIX)) and (author = Smith),

where keyword and author are predefined attribute names, and network, UNIX, and Smith are their corresponding values. In the discussion, we represent this expression as $(t_1 \lor t_2) \land t_3$, where the $t_i$ ($1 \le i \le 3$) are called descriptors, $\lor$ is the or operator, and $\land$ is the and operator. □

Figure 1: Resource discovery process: (1) A user sends a query to the directory of services. (2) The directory of services returns a ranked list of relevant retrieval systems. (3) The user sends his query to one or more of the relevant systems, which (4) return matching documents.

Below, we describe two existing similarity measures for Boolean expressions, introduce our new measure, and experimentally contrast it with the well-known Jaccard's coefficient. We prove that our method requires polynomial time and space, in contrast to the previous method, which exhibits exponential complexity. We describe the implementation of our similarity measure and discuss its application to other information retrieval applications.
2 Similarity Measure
Well-known similarity measures, such as Dice's coefficient, Jaccard's coefficient, the cosine coefficient, and the overlap coefficient, have been used to compute the similarities of documents to other documents and of documents to queries for automatic classification, clustering, and indexing [5]. For these measures, documents and queries are represented as sets of keywords. In a "cluster-based retrieval" system, documents with high similarities are grouped into a cluster. User queries are first compared with cluster representatives, then compared with the documents in the clusters that have high similarities with the queries [5].

Indie's directory of services is similar to cluster-based retrieval, where servers are clusters described by cluster representatives. Because Indie's user queries and cluster representatives are both Boolean expressions, the above similarity measures cannot be applied directly. The degree of similarity between user queries and server descriptions is determined by how "much" these Boolean expressions overlap. Consider the example below.
Example 2 Suppose $R_A$ and $R_B$ are the server descriptions of two retrieval systems (A and B) stored in the directory of services, and $Q_1$ and $Q_2$ are two user queries:

$$R_A = t_1 \land t_3, \qquad R_B = t_2 \land t_4, \qquad Q_1 = t_1 \land t_2 \land t_3, \qquad Q_2 = (t_1 \lor t_2) \land t_3.$$

Both $R_A$ and $R_B$ overlap with $Q_1$, but $R_A$ contains two overlapped terms ($t_1$ and $t_3$) while $R_B$ contains only one ($t_2$). Thus, $R_A$ is more relevant to query $Q_1$ than $R_B$. However, for an and-or-combined query such as $Q_2$, it becomes more complicated to determine which server description is more relevant. □

We need a systematic method to measure the overlap between user queries and server descriptions. Furthermore, this method must perform efficiently even when the number of server descriptions increases. Radecki employed several measures to rank similarity between Boolean expressions [6, 7]. In the following sections, we review Radecki's measures and present our modified measure. We demonstrate our improvements in space and time complexity and compare the two measures on a synthetic benchmark.
2.1 Background
Radecki proposed two similarity measures, $S$ and $S^*$, based on Jaccard's coefficient. He defined the similarity value $S$ between queries $Q_1$ and $Q_2$ as the ratio of the number of common documents to the total number of documents returned in response to both queries. This ratio, commonly known as Jaccard's coefficient, can be described as

$$S(Q_1, Q_2) = \frac{|\psi(Q_1) \cap \psi(Q_2)|}{|\psi(Q_1) \cup \psi(Q_2)|}, \tag{1}$$

where $\cap$ denotes set intersection, $\cup$ denotes set union, and $\psi(Q_1)$ and $\psi(Q_2)$ are the response sets to $Q_1$ and $Q_2$.

To apply $S$ in our environment, we denote $\psi(Q)$ and $\varphi(R)$ as the sets of documents in the response to $Q$ and in the cluster represented by $R$, respectively. The similarity value $S$ between $Q$ and $R$ is then defined as the ratio of the number of common documents to the total number of documents in $\psi(Q)$ and $\varphi(R)$,

$$S(Q, R) = \frac{|\psi(Q) \cap \varphi(R)|}{|\psi(Q) \cup \varphi(R)|}. \tag{2}$$

Because all the documents satisfying query $Q$ belong to cluster $R$ (i.e. $\psi(Q) \subseteq \varphi(R)$), Eqn. (2) can be simplified as

$$S(Q, R) = \frac{|\psi(Q)|}{|\varphi(R)|}. \tag{3}$$
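As a minimal sketch of Eqns. (1) and (3), the following Python fragment computes Jaccard's coefficient on sets of document identifiers. The function name `jaccard`, the document identifiers, and the choice to return zero for two empty sets are assumptions made for illustration only; they are not part of Radecki's definition.

```python
def jaccard(a, b):
    """Eqn. (1): size of the intersection over size of the union."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 0.0  # assumption: two empty sets are treated as dissimilar
    return len(a & b) / len(union)


# Hypothetical cluster contents and query response (response is a subset).
cluster = {"d1", "d2", "d3", "d4"}
response = {"d1", "d2", "d4"}

print(jaccard(response, cluster))    # general form, Eqn. (2)
print(len(response) / len(cluster))  # simplified form, Eqn. (3)
```

When the response set is contained in the cluster, as assumed for Eqn. (3), both lines print the same value.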
Example 3 Using the definitions from Example 2, we assume system A (represented by $R_A$) contains documents $\{a_1, a_2, a_3, a_4\}$ and system B (represented by $R_B$) contains documents $\{b_1, b_2, b_3\}$. Thus,

$$\varphi(R_A) = \{a_1, a_2, a_3, a_4\}, \qquad \varphi(R_B) = \{b_1, b_2, b_3\}.$$

Assume that for query $Q_1$ the system responses are

$$\psi_A(Q_1) = \{a_1, a_2, a_4\}, \qquad \psi_B(Q_1) = \{b_1, b_3\},$$

where $\psi_A(Q_1)$ and $\psi_B(Q_1)$ are the responses to $Q_1$ in systems A and B respectively. The similarity measures of $Q_1$ against $R_A$ and $R_B$ are then

$$S(Q_1, R_A) = 3/4 = 0.750, \qquad S(Q_1, R_B) = 2/3 = 0.667.$$
□

However, in the case of a directory of services, the similarity measure is used to estimate the importance of entire information systems and to decide the order in which users should search them. If the similarity is calculated from the query results of every information system, the searching order is no longer needed, because all the systems have already been searched.

Radecki proposed a similarity measure $S^*$ that is independent of the responses to the queries [7]. In $S^*$, Boolean expression $Q$ is transformed into its reduced disjunctive normal form (RDNF), denoted as $\tilde{Q}$, which is the disjunction of a list of reduced atomic descriptors. If set $T$ is the union of all the descriptors that appear in the to-be-compared Boolean expression pair, then a reduced atomic descriptor is defined as the conjunction of all the elements in $T$ in either their original or negated forms. Let $Q$ and $R$ be two Boolean expressions, and let $T_Q$ and $T_R$ be the sets of the descriptors that appear in $Q$ and $R$ respectively. Suppose $T_Q \cup T_R = \{t_1, t_2, \ldots, t_k\}$, where $k$ is the set size of $T_Q \cup T_R$. Then the RDNFs of $Q$ and $R$ are

$$(\tilde{Q})_{T_Q \cup T_R} = \underbrace{(\tilde{q}_{1,1} \land \tilde{q}_{1,2} \land \cdots \land \tilde{q}_{1,k})}_{\text{reduced atomic descriptor}} \lor (\tilde{q}_{2,1} \land \tilde{q}_{2,2} \land \cdots \land \tilde{q}_{2,k}) \lor \cdots \lor (\tilde{q}_{m,1} \land \tilde{q}_{m,2} \land \cdots \land \tilde{q}_{m,k}),$$

$$(\tilde{R})_{T_Q \cup T_R} = \underbrace{(\tilde{r}_{1,1} \land \tilde{r}_{1,2} \land \cdots \land \tilde{r}_{1,k})}_{\text{reduced atomic descriptor}} \lor (\tilde{r}_{2,1} \land \tilde{r}_{2,2} \land \cdots \land \tilde{r}_{2,k}) \lor \cdots \lor (\tilde{r}_{n,1} \land \tilde{r}_{n,2} \land \cdots \land \tilde{r}_{n,k}),$$

where $m$ and $n$ are the numbers of reduced atomic descriptors in $(\tilde{Q})_{T_Q \cup T_R}$ and $(\tilde{R})_{T_Q \cup T_R}$, respectively. Each reduced atomic descriptor in these two RDNFs consists of the same number of descriptors, $k$, which is the set size of $T_Q \cup T_R$. Each $\tilde{q}_{i,j}$ and $\tilde{r}_{i,j}$ in the RDNFs represents the original or negated form of the corresponding descriptor $t_j$. Specifically,

$$\tilde{q}_{i,j} = \begin{cases} t_j & \text{original,} \\ \neg t_j & \text{negated,} \end{cases} \qquad 1 \le i \le m,\ 1 \le j \le k;$$

$$\tilde{r}_{i,j} = \begin{cases} t_j & \text{original,} \\ \neg t_j & \text{negated,} \end{cases} \qquad 1 \le i \le n,\ 1 \le j \le k,$$

where $\neg$ is the not operator. For example, $\tilde{q}_{2,1}$ denotes the first descriptor in the second reduced atomic descriptor of RDNF $(\tilde{Q})_{T_Q \cup T_R}$, where $\tilde{q}_{2,1}$ is either $t_1$ or $\neg t_1$ depending on how $Q$ is transformed. The following example shows what RDNFs look like after the transformation from Boolean expressions.
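Example 4 From Example 2,

$$Q_2 = (t_1 \lor t_2) \land t_3, \qquad R_A = t_1 \land t_3, \qquad R_B = t_2 \land t_4,$$

$$T_{Q_2} = \{t_1, t_2, t_3\}, \qquad T_{R_A} = \{t_1, t_3\}, \qquad T_{R_B} = \{t_2, t_4\},$$

where set $T_X$ is the union of all the descriptors in Boolean expression $X$ ($X = Q_2$, $R_A$, or $R_B$). To transform $Q_2$ to its RDNF, we can apply the distributive law

$$(t_1 \lor t_2) \land t_3 = (t_1 \land t_3) \lor (t_2 \land t_3), \tag{4}$$

and expand the two conjunctions, $(t_1 \land t_3)$ and $(t_2 \land t_3)$, to their associated reduced atomic descriptors. The expansion process is based on the equation

$$t_a = (t_a \land t_b) \lor (t_a \land \neg t_b), \tag{5}$$

where $t_a$ and $t_b$ are descriptors.

Consider $Q_2$ and $R_A$ first. Since $T_{Q_2} \cup T_{R_A} = \{t_1, t_2, t_3\}$, each reduced atomic descriptor in $(\tilde{Q}_2)_{T_{Q_2} \cup T_{R_A}}$ and $(\tilde{R}_A)_{T_{Q_2} \cup T_{R_A}}$ must contain all the $t_i$'s ($1 \le i \le 3$) or their negated forms. Thus, the conjunctions in $Q_2$ are expanded to

$$t_1 \land t_3 = (t_1 \land t_3 \land t_2) \lor (t_1 \land t_3 \land \neg t_2),$$
$$t_2 \land t_3 = (t_2 \land t_3 \land t_1) \lor (t_2 \land t_3 \land \neg t_1).$$

The RDNFs of $Q_2$ and $R_A$ are

$$(\tilde{Q}_2)_{T_{Q_2} \cup T_{R_A}} = (t_1 \land t_2 \land t_3) \lor (\neg t_1 \land t_2 \land t_3) \lor (t_1 \land \neg t_2 \land t_3),$$
$$(\tilde{R}_A)_{T_{Q_2} \cup T_{R_A}} = (t_1 \land t_2 \land t_3) \lor (t_1 \land \neg t_2 \land t_3).$$

Similarly, because $T_{Q_2} \cup T_{R_B} = \{t_1, t_2, t_3, t_4\}$, the RDNFs of $Q_2$ and $R_B$ are

$$\begin{aligned}
(\tilde{Q}_2)_{T_{Q_2} \cup T_{R_B}} ={} & (t_1 \land t_2 \land t_3 \land t_4) \lor (\neg t_1 \land t_2 \land t_3 \land t_4) \lor (t_1 \land \neg t_2 \land t_3 \land t_4) \\
& \lor (t_1 \land t_2 \land t_3 \land \neg t_4) \lor (\neg t_1 \land t_2 \land t_3 \land \neg t_4) \lor (t_1 \land \neg t_2 \land t_3 \land \neg t_4), \\
(\tilde{R}_B)_{T_{Q_2} \cup T_{R_B}} ={} & (t_1 \land t_2 \land t_3 \land t_4) \lor (\neg t_1 \land t_2 \land t_3 \land t_4) \lor (t_1 \land t_2 \land \neg t_3 \land t_4) \lor (\neg t_1 \land t_2 \land \neg t_3 \land t_4).
\end{aligned}$$

□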
Radecki defines the similarity value $S^*$ between two Boolean expressions ($Q$ and $R$) as the ratio of the number of common reduced atomic descriptors in $\tilde{Q}$ and $\tilde{R}$ to the total number of reduced atomic descriptors in them,

$$S^*(Q, R) = \frac{|(\tilde{Q})_{T_Q \cup T_R} \cap (\tilde{R})_{T_Q \cup T_R}|}{|(\tilde{Q})_{T_Q \cup T_R} \cup (\tilde{R})_{T_Q \cup T_R}|}. \tag{6}$$
Example 5 Continuing with Example 4,

$$S^*(Q_2, R_A) = \frac{|(\tilde{Q}_2)_{T_{Q_2} \cup T_{R_A}} \cap (\tilde{R}_A)_{T_{Q_2} \cup T_{R_A}}|}{|(\tilde{Q}_2)_{T_{Q_2} \cup T_{R_A}} \cup (\tilde{R}_A)_{T_{Q_2} \cup T_{R_A}}|} = \frac{2}{3} = 0.667,$$

$$S^*(Q_2, R_B) = \frac{|(\tilde{Q}_2)_{T_{Q_2} \cup T_{R_B}} \cap (\tilde{R}_B)_{T_{Q_2} \cup T_{R_B}}|}{|(\tilde{Q}_2)_{T_{Q_2} \cup T_{R_B}} \cup (\tilde{R}_B)_{T_{Q_2} \cup T_{R_B}}|} = \frac{2}{8} = 0.250.$$

Therefore, $R_A$ is more relevant to query $Q_2$ than $R_B$. □

From Example 4, we notice that $Q_2$ is transformed to different RDNFs, $(\tilde{Q}_2)_{T_{Q_2} \cup T_{R_A}}$ and $(\tilde{Q}_2)_{T_{Q_2} \cup T_{R_B}}$, when compared with $R_A$ and $R_B$. This means that whenever a new user query is compared against $N$ server descriptions, it needs $2N$ RDNF transformations to calculate the similarities between them. This method suffers when the number of server descriptions is large and users query frequently. The system will spend significant amounts of time recomputing RDNFs, and consequently will perform badly. To solve this problem, we modify Radecki's method so that it need not recompute the RDNFs of server descriptions while still providing statistically equivalent results.
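For reference, the RDNF-based measure of Eqn. (6) can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions, not Radecki's procedure: instead of applying Eqns. (4) and (5) symbolically, it enumerates all truth assignments over the union of descriptors and keeps the satisfying ones, which yields the same set of reduced atomic descriptors. The nested-tuple encoding of expressions and the names `descriptors`, `evaluate`, `rdnf`, and `rdnf_similarity` are hypothetical.

```python
from itertools import product


def descriptors(expr):
    """Collect the descriptor names that appear in an expression tree."""
    if isinstance(expr, str):
        return {expr}
    return set().union(*(descriptors(e) for e in expr[1:]))


def evaluate(expr, env):
    """Evaluate an expression tree under a descriptor -> bool assignment."""
    if isinstance(expr, str):
        return env[expr]
    op, *args = expr
    if op == "not":
        return not evaluate(args[0], env)
    if op == "and":
        return all(evaluate(a, env) for a in args)
    if op == "or":
        return any(evaluate(a, env) for a in args)
    raise ValueError(f"unknown operator: {op}")


def rdnf(expr, terms):
    """Reduced atomic descriptors of expr over terms, as tuples of booleans
    (True = original form, False = negated form), one slot per descriptor."""
    terms = sorted(terms)
    return {
        values
        for values in product([True, False], repeat=len(terms))
        if evaluate(expr, dict(zip(terms, values)))
    }


def rdnf_similarity(q, r):
    """Eqn. (6): common reduced atomic descriptors over all of them."""
    terms = descriptors(q) | descriptors(r)
    a, b = rdnf(q, terms), rdnf(r, terms)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Expressions of Example 2 in the assumed tuple encoding.
Q2 = ("and", ("or", "t1", "t2"), "t3")
RA = ("and", "t1", "t3")
RB = ("and", "t2", "t4")

print(round(rdnf_similarity(Q2, RA), 3))  # 0.667
print(round(rdnf_similarity(Q2, RB), 3))  # 0.25
```

The enumeration visits every assignment over the combined descriptor set, so its cost grows exponentially with the number of descriptors, and the RDNF of the query is rebuilt for every server description it is compared against.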
2.2 New Similarity Measure
We propose a new measure, based on Radecki's similarity measure $S^*$, that is independent of the underlying information systems and requires less computation. We transform Boolean expression $Q$ to its compact disjunctive normal form (CDNF), denoted as $\hat{Q}$, which is the disjunction of a list of compact atomic descriptors. Each compact atomic descriptor is the conjunction of a subset of the descriptors that appear in its own Boolean expression. The CDNF can be obtained by using the distributive law described in the previous section.

Let $Q$ and $R$ be two Boolean expressions, and let $T_Q$ and $T_R$ be the sets of the descriptors that appear in $Q$ and $R$ respectively. Then the CDNFs of $Q$ and $R$ are

$$\hat{Q} = \underbrace{(\hat{q}_{1,1} \land \hat{q}_{1,2} \land \cdots \land \hat{q}_{1,x_1})}_{\text{compact atomic descriptor}} \lor (\hat{q}_{2,1} \land \hat{q}_{2,2} \land \cdots \land \hat{q}_{2,x_2}) \lor \cdots \lor (\hat{q}_{m,1} \land \hat{q}_{m,2} \land \cdots \land \hat{q}_{m,x_m}),$$

$$\hat{R} = \underbrace{(\hat{r}_{1,1} \land \hat{r}_{1,2} \land \cdots \land \hat{r}_{1,y_1})}_{\text{compact atomic descriptor}} \lor (\hat{r}_{2,1} \land \hat{r}_{2,2} \land \cdots \land \hat{r}_{2,y_2}) \lor \cdots \lor (\hat{r}_{n,1} \land \hat{r}_{n,2} \land \cdots \land \hat{r}_{n,y_n}),$$

where $m$ and $n$ are the numbers of compact atomic descriptors in $\hat{Q}$ and $\hat{R}$, $x_i$ is the number of descriptors in the $i$th ($1 \le i \le m$) compact atomic descriptor of $\hat{Q}$, and $y_j$ is the number of descriptors in the $j$th ($1 \le j \le n$) compact atomic descriptor of $\hat{R}$. In contrast, all the reduced atomic descriptors in the RDNFs of $S^*$ have the same number of descriptors. Each $\hat{q}_{i,u}$ and $\hat{r}_{j,v}$ in the CDNFs represents a descriptor in $T_Q$ and $T_R$ respectively. Specifically,

$$\hat{q}_{i,u} \in T_Q, \quad \text{where } (i, u) = (1 \ldots m,\ 1 \ldots x_i);$$
$$\hat{r}_{j,v} \in T_R, \quad \text{where } (j, v) = (1 \ldots n,\ 1 \ldots y_j).$$
Example 6 The CDNFs of $Q_2$, $R_A$, and $R_B$ in Example 2 are

$$\hat{Q}_2 = (t_1 \land t_3) \lor (t_2 \land t_3), \qquad \hat{R}_A = (t_1 \land t_3), \qquad \hat{R}_B = (t_2 \land t_4).$$

□

In Example 6, each compact atomic descriptor consists only of descriptors from its original Boolean expression, and is therefore independent of the descriptors of any other Boolean expression. We denote our similarity measure $S^\oplus$ and define the similarity of two Boolean expressions as the average value of the individual similarity measures ($s^\oplus$) between each pair of compact atomic descriptors. The individual similarity measure $s^\oplus$ is defined as

$$s^\oplus(\hat{Q}^i, \hat{R}^j) = \begin{cases} 0 & \text{if } T_Q^i \cap T_R^j = \emptyset \text{ or } \exists t \in T_Q^i,\ \neg t \in T_R^j; \\[4pt] \dfrac{1}{2^{|T_R^j - T_Q^i|} + 2^{|T_Q^i - T_R^j|} - 1} & \text{otherwise,} \end{cases} \tag{7}$$

where $\hat{Q}^i$ indicates the $i$th compact atomic descriptor of CDNF $\hat{Q}$, and $\hat{R}^j$ indicates the $j$th compact atomic descriptor of CDNF $\hat{R}$. $T_Q^i$ and $T_R^j$ are the sets of descriptors in $\hat{Q}^i$ and $\hat{R}^j$, respectively. $|T_R^j - T_Q^i|$ is the number of descriptors that appear in $T_R^j$ but not in $T_Q^i$; conversely, $|T_Q^i - T_R^j|$ is the number of descriptors that appear in $T_Q^i$ but not in $T_R^j$. Thus,

$$S^\oplus(Q, R) = \frac{\sum_{i=1}^{|\hat{Q}|} \sum_{j=1}^{|\hat{R}|} s^\oplus(\hat{Q}^i, \hat{R}^j)}{|\hat{Q}| \cdot |\hat{R}|}, \tag{8}$$

where $|\hat{Q}|$ and $|\hat{R}|$ are the numbers of compact atomic descriptors in $\hat{Q}$ and $\hat{R}$, respectively. The denominator, $|\hat{Q}| \cdot |\hat{R}|$, is the total number of individual similarity measures in the calculation.
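Eqns. (7) and (8) translate almost directly into code. The sketch below assumes a CDNF is stored as a list of sets of descriptor strings and that a negated descriptor is written with a leading `~`; this encoding and the names `negate`, `individual_similarity`, and `cdnf_similarity` are illustrative choices, not part of the original formulation.

```python
def negate(literal):
    """Return the negated form of a descriptor ('t' <-> '~t')."""
    return literal[1:] if literal.startswith("~") else "~" + literal


def individual_similarity(tq, tr):
    """Eqn. (7) for two compact atomic descriptors given as descriptor sets."""
    tq, tr = frozenset(tq), frozenset(tr)
    no_common = not (tq & tr)
    conflicting = any(negate(t) in tr for t in tq)
    if no_common or conflicting:
        return 0.0
    return 1.0 / (2 ** len(tr - tq) + 2 ** len(tq - tr) - 1)


def cdnf_similarity(q_cdnf, r_cdnf):
    """Eqn. (8): average of the individual measures over all pairs."""
    total = sum(
        individual_similarity(tq, tr) for tq in q_cdnf for tr in r_cdnf
    )
    return total / (len(q_cdnf) * len(r_cdnf))


# CDNFs of Example 6 in the assumed encoding.
Q2 = [{"t1", "t3"}, {"t2", "t3"}]
RA = [{"t1", "t3"}]
RB = [{"t2", "t4"}]

print(round(cdnf_similarity(Q2, RA), 3))  # 0.667
print(round(cdnf_similarity(Q2, RB), 3))  # 0.167
```

Each pair is compared with set differences only, so the stored CDNF of a server description can be reused unchanged for every incoming query.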
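Example 7 Continuing with Example 6,

$$\hat{Q}_2 = \underbrace{(t_1 \land t_3)}_{\hat{Q}_2^1} \lor \underbrace{(t_2 \land t_3)}_{\hat{Q}_2^2}, \qquad T_{Q_2}^1 = \{t_1, t_3\}, \quad T_{Q_2}^2 = \{t_2, t_3\},$$

$$\hat{R}_A = \underbrace{(t_1 \land t_3)}_{\hat{R}_A^1}, \qquad T_{R_A}^1 = \{t_1, t_3\},$$

$$\hat{R}_B = \underbrace{(t_2 \land t_4)}_{\hat{R}_B^1}, \qquad T_{R_B}^1 = \{t_2, t_4\},$$

where $T_{Q_2}^1$, $T_{Q_2}^2$, $T_{R_A}^1$, and $T_{R_B}^1$ are the sets of descriptors in the compact atomic descriptors $\hat{Q}_2^1$, $\hat{Q}_2^2$, $\hat{R}_A^1$, and $\hat{R}_B^1$, respectively. Thus,

$$s^\oplus(\hat{Q}_2^1, \hat{R}_A^1) = \frac{1}{2^{|T_{R_A}^1 - T_{Q_2}^1|} + 2^{|T_{Q_2}^1 - T_{R_A}^1|} - 1} = \frac{1}{2^0 + 2^0 - 1} = 1.000,$$

$$s^\oplus(\hat{Q}_2^2, \hat{R}_A^1) = \frac{1}{2^{|T_{R_A}^1 - T_{Q_2}^2|} + 2^{|T_{Q_2}^2 - T_{R_A}^1|} - 1} = \frac{1}{2^1 + 2^1 - 1} = 0.333.$$

Hence,

$$S^\oplus(Q_2, R_A) = \frac{s^\oplus(\hat{Q}_2^1, \hat{R}_A^1) + s^\oplus(\hat{Q}_2^2, \hat{R}_A^1)}{2} = 0.667,$$

and

$$s^\oplus(\hat{Q}_2^1, \hat{R}_B^1) = 0 \quad (\text{because } T_{Q_2}^1 \cap T_{R_B}^1 = \emptyset),$$

$$s^\oplus(\hat{Q}_2^2, \hat{R}_B^1) = \frac{1}{2^{|T_{R_B}^1 - T_{Q_2}^2|} + 2^{|T_{Q_2}^2 - T_{R_B}^1|} - 1} = \frac{1}{2^1 + 2^1 - 1} = 0.333.$$

Hence,

$$S^\oplus(Q_2, R_B) = \frac{s^\oplus(\hat{Q}_2^1, \hat{R}_B^1) + s^\oplus(\hat{Q}_2^2, \hat{R}_B^1)}{2} = 0.167.$$

Therefore, $R_A$ is more relevant to query $Q_2$ than $R_B$. □

Our individual similarity measure $s^\oplus$ is actually a special case of Radecki's $S^*$. Below, we show that $s^\oplus$ can be derived from $S^*$.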
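Theorem 1 Let $T_Q^i$ and $T_R^j$ be the sets of descriptors in compact atomic descriptors $\hat{Q}^i$ and $\hat{R}^j$. If $T_Q^i \cap T_R^j \ne \emptyset$, then $s^\oplus(\hat{Q}^i, \hat{R}^j) = S^*(\hat{Q}^i, \hat{R}^j)$.

Proof. We first calculate $S^*$ between $\hat{Q}^i$ and $\hat{R}^j$ using Eqn. (6),

$$S^*(\hat{Q}^i, \hat{R}^j) = \frac{|(\tilde{\hat{Q}}^i)_{T_Q^i \cup T_R^j} \cap (\tilde{\hat{R}}^j)_{T_Q^i \cup T_R^j}|}{|(\tilde{\hat{Q}}^i)_{T_Q^i \cup T_R^j} \cup (\tilde{\hat{R}}^j)_{T_Q^i \cup T_R^j}|},$$

where $(\tilde{\hat{Q}}^i)_{T_Q^i \cup T_R^j}$ and $(\tilde{\hat{R}}^j)_{T_Q^i \cup T_R^j}$ are the RDNFs of $\hat{Q}^i$ and $\hat{R}^j$, respectively.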
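The above measure can be simplified further. Suppose there are $k$ common descriptors between $T_Q^i$ and $T_R^j$, i.e. $|T_Q^i \cap T_R^j| = k$, and $m$ and $n$ are the numbers of remaining (unique) descriptors in $T_Q^i$ and $T_R^j$ respectively. Then,

$$\hat{Q}^i = (c_1 \land c_2 \land \cdots \land c_k \land a_1 \land a_2 \land \cdots \land a_m),$$
$$\hat{R}^j = (c_1 \land c_2 \land \cdots \land c_k \land b_1 \land b_2 \land \cdots \land b_n),$$
$$T_Q^i = \{c_1, c_2, \ldots, c_k, a_1, a_2, \ldots, a_m\},$$
$$T_R^j = \{c_1, c_2, \ldots, c_k, b_1, b_2, \ldots, b_n\},$$

where $c_1 \ldots c_k$ are the common descriptors between $\hat{Q}^i$ and $\hat{R}^j$, and $a_1 \ldots a_m$ and $b_1 \ldots b_n$ are their remaining descriptors. Note that a descriptor $t$ and its negation ($\neg t$) are two different descriptors in our notation. A descriptor and its negation cannot both appear in the same compact atomic descriptor. Otherwise, this compact atomic descriptor is "null" and will not exist.

To calculate $S^*(\hat{Q}^i, \hat{R}^j)$, the RDNFs of $\hat{Q}^i$ and $\hat{R}^j$ can be obtained by using Eqn. (5),

$$(\tilde{\hat{Q}}^i)_{T_Q^i \cup T_R^j} = \bigcup_{\beta = (0,0,\ldots,0)}^{(1,1,\ldots,1)} (c_1 \land \cdots \land c_k \land a_1 \land \cdots \land a_m \land b_1^{\beta_1} \land b_2^{\beta_2} \land \cdots \land b_n^{\beta_n}), \tag{9}$$

$$(\tilde{\hat{R}}^j)_{T_Q^i \cup T_R^j} = \bigcup_{\alpha = (0,0,\ldots,0)}^{(1,1,\ldots,1)} (c_1 \land \cdots \land c_k \land b_1 \land \cdots \land b_n \land a_1^{\alpha_1} \land a_2^{\alpha_2} \land \cdots \land a_m^{\alpha_m}), \tag{10}$$

where $\alpha = (\alpha_1, \alpha_2, \ldots, \alpha_m)$, $\beta = (\beta_1, \beta_2, \ldots, \beta_n)$, and

$$a_w^{\alpha_w} = \begin{cases} a_w & \text{if } \alpha_w = 1, \\ \neg a_w & \text{if } \alpha_w = 0, \end{cases} \quad 1 \le w \le m; \qquad b_w^{\beta_w} = \begin{cases} b_w & \text{if } \beta_w = 1, \\ \neg b_w & \text{if } \beta_w = 0, \end{cases} \quad 1 \le w \le n.$$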
Let $T_a$ and $T_b$ be two descriptor sets, where $T_a = \{a_1, a_2, \ldots, a_m\}$ and $T_b = \{b_1, b_2, \ldots, b_n\}$. Although there is no common descriptor between $T_a$ and $T_b$, this does not eliminate the possibility that a descriptor is in one set and its negation is in the other. We discuss the two cases: (1) for every descriptor in $T_a$, its negation does not appear in $T_b$, and (2) there exists a descriptor in $T_a$ whose negation appears in $T_b$.

Case I. For every descriptor in $T_a$, its negation does not appear in $T_b$, i.e. $\forall t \in T_a,\ \neg t \notin T_b$. From Eqns. (9) and (10), we see that the numbers of reduced atomic descriptors in $(\tilde{\hat{Q}}^i)_{T_Q^i \cup T_R^j}$ and $(\tilde{\hat{R}}^j)_{T_Q^i \cup T_R^j}$ are $2^n$ and $2^m$ respectively. The only common reduced atomic descriptor between $(\tilde{\hat{Q}}^i)_{T_Q^i \cup T_R^j}$ and $(\tilde{\hat{R}}^j)_{T_Q^i \cup T_R^j}$ is $(c_1 \land \cdots \land c_k \land a_1 \land \cdots \land a_m \land b_1 \land \cdots \land b_n)$, which occurs when all the elements in $\alpha$ and $\beta$ are 1's. Because $|A| + |B| = |A \cup B| + |A \cap B|$, we have

$$\begin{aligned}
S^*(\hat{Q}^i, \hat{R}^j) &= \frac{|(\tilde{\hat{Q}}^i)_{T_Q^i \cup T_R^j} \cap (\tilde{\hat{R}}^j)_{T_Q^i \cup T_R^j}|}{|(\tilde{\hat{Q}}^i)_{T_Q^i \cup T_R^j} \cup (\tilde{\hat{R}}^j)_{T_Q^i \cup T_R^j}|} \\
&= \frac{|(\tilde{\hat{Q}}^i)_{T_Q^i \cup T_R^j} \cap (\tilde{\hat{R}}^j)_{T_Q^i \cup T_R^j}|}{|(\tilde{\hat{Q}}^i)_{T_Q^i \cup T_R^j}| + |(\tilde{\hat{R}}^j)_{T_Q^i \cup T_R^j}| - |(\tilde{\hat{Q}}^i)_{T_Q^i \cup T_R^j} \cap (\tilde{\hat{R}}^j)_{T_Q^i \cup T_R^j}|} \\
&= \frac{1}{2^n + 2^m - 1} \\
&= \frac{1}{2^{|T_R^j - T_Q^i|} + 2^{|T_Q^i - T_R^j|} - 1},
\end{aligned} \tag{11}$$

where $|T_R^j - T_Q^i|$ is the number of descriptors that appear in $T_R^j$ but not in $T_Q^i$; conversely, $|T_Q^i - T_R^j|$ is the number of descriptors that appear in $T_Q^i$ but not in $T_R^j$. Note that $c_1, c_2, \ldots, c_k$ must be distinct from both $a_1, a_2, \ldots, a_m$ and $b_1, b_2, \ldots, b_n$. Therefore each $\neg c_r$ ($1 \le r \le k$) appears in neither $T_a$ nor $T_b$. This means "$\forall t \in T_a,\ \neg t \notin T_b$" also implies "$\forall t \in T_Q^i,\ \neg t \notin T_R^j$".

Case II. There exists a descriptor in $T_a$ whose negation appears in $T_b$, i.e. $\exists t \in T_a,\ \neg t \in T_b$. Assume $a_g$ and $b_h$ are two descriptors, where $\neg a_g = b_h$, $a_g \in T_a$, and $b_h \in T_b$. Some of the reduced atomic descriptors in $(\tilde{\hat{Q}}^i)_{T_Q^i \cup T_R^j}$ and $(\tilde{\hat{R}}^j)_{T_Q^i \cup T_R^j}$ become null because they contain both $a_g$ and $b_h$ in the same conjunction. The other reduced atomic descriptors will never be identical, because every reduced atomic descriptor in $(\tilde{\hat{Q}}^i)_{T_Q^i \cup T_R^j}$ contains $a_g$ and every reduced atomic descriptor in $(\tilde{\hat{R}}^j)_{T_Q^i \cup T_R^j}$ contains $b_h$ ($= \neg a_g$). Therefore

$$S^*(\hat{Q}^i, \hat{R}^j) = \frac{|(\tilde{\hat{Q}}^i)_{T_Q^i \cup T_R^j} \cap (\tilde{\hat{R}}^j)_{T_Q^i \cup T_R^j}|}{|(\tilde{\hat{Q}}^i)_{T_Q^i \cup T_R^j} \cup (\tilde{\hat{R}}^j)_{T_Q^i \cup T_R^j}|} = \frac{0}{|(\tilde{\hat{Q}}^i)_{T_Q^i \cup T_R^j} \cup (\tilde{\hat{R}}^j)_{T_Q^i \cup T_R^j}|} = 0. \tag{12}$$

Combining Eqns. (11) and (12), we obtain

$$S^*(\hat{Q}^i, \hat{R}^j) = \begin{cases} 0 & \text{if } \exists t \in T_Q^i,\ \neg t \in T_R^j; \\[4pt] \dfrac{1}{2^{|T_R^j - T_Q^i|} + 2^{|T_Q^i - T_R^j|} - 1} & \text{otherwise.} \end{cases} \tag{13}$$

The only difference between Eqn. (13) and Eqn. (7) is that $s^\oplus$ is always zero when $T_Q^i \cap T_R^j = \emptyset$. Therefore, $s^\oplus(\hat{Q}^i, \hat{R}^j) = S^*(\hat{Q}^i, \hat{R}^j)$ if $T_Q^i \cap T_R^j \ne \emptyset$. □

Theorem 1 shows that the individual similarity measure can be obtained by examining the difference between the two CDNFs, without transforming them to their RDNFs. Hence, it avoids the costly computation of RDNFs.
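Theorem 1 can be spot-checked numerically. The sketch below compares the closed form of Eqn. (13) with a brute-force count of reduced atomic descriptors for a few pairs of conjunctions that share at least one descriptor. It is only a sanity check on hand-picked, hypothetical inputs, not a proof; the `~` prefix for negation and the names `minterms`, `brute_force`, and `closed_form` are assumptions of this sketch.

```python
from itertools import product
from math import isclose


def variable(literal):
    """Strip the negation mark from a descriptor."""
    return literal.lstrip("~")


def minterms(conjunction, variables):
    """All full assignments over variables that satisfy every literal."""
    variables = sorted(variables)
    result = set()
    for values in product([True, False], repeat=len(variables)):
        env = dict(zip(variables, values))
        # 't' needs env[t] True, '~t' needs env[t] False.
        if all(env[variable(l)] != l.startswith("~") for l in conjunction):
            result.add(values)
    return result


def brute_force(q, r):
    """Eqn. (6) applied to two single conjunctions."""
    variables = {variable(l) for l in q | r}
    a, b = minterms(q, variables), minterms(r, variables)
    return len(a & b) / len(a | b)


def closed_form(q, r):
    """Eqn. (13): zero on a conflicting pair, otherwise the 2^n + 2^m - 1 form."""
    for l in q:
        opposite = l[1:] if l.startswith("~") else "~" + l
        if opposite in r:
            return 0.0
    return 1.0 / (2 ** len(r - q) + 2 ** len(q - r) - 1)


pairs = [
    ({"c", "a1"}, {"c", "b1", "b2"}),   # Case I
    ({"c", "~a"}, {"c", "b"}),          # Case I with a negated descriptor
    ({"c", "a"}, {"c", "~a"}),          # Case II
    ({"c1", "c2"}, {"c1", "c2"}),       # identical conjunctions
]
for q, r in pairs:
    assert isclose(brute_force(q, r), closed_form(q, r))
print("closed form matches brute force on all sample pairs")
```

The brute-force side is exponential in the number of distinct descriptors, whereas the closed form needs only two set differences.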
3 Experiments
Radecki conducted an experiment [7] to compare the results of using Jaccard's measure $S$ (Eqn. (2)) and $S^*$ (Eqn. (6)). In this section, we extend Radecki's experiment by applying $S^\oplus$ to his Boolean expression samples and comparing it with the results of using $S$ and $S^*$. To further explore the ranking ability of each similarity measure, we conduct a new experiment with more Boolean expression samples on a different database.

3.1 Radecki's Experiment
Radecki obtained $S$ by calculating the responses from a real information system (the INSPEC database at the Technical University of Wroclaw) and used it as the criterion to justify $S^*$. Both the "sign test" [8] and the "t test" [9] were applied to show that the two measures are statistically equivalent. We repeated his experiment on the Homer database at the University of Southern California, which contains about 800,000 records, and compared his results with ours. Figure 2 shows the similarities of 36 Boolean expression pairs obtained by using $S$, $S^*$, and $S^\oplus$.
Figure 2: The results of using different similarity measures (x-axis: sample; y-axis: similarity). Each marker in the figure represents the data obtained by $S$, $S^*$, or $S^\oplus$ on one of the 36 Boolean expression pairs.

We applied the sign test and the t test to the data in Figure 2 and found that both $S^*$ and $S^\oplus$ are statistically equivalent to $S$. In addition, we computed the "confidence intervals" by using the formula [10]
$$\left(\bar{x} \mp z_{1-\alpha/2}\, \frac{s}{\sqrt{n}}\right), \tag{14}$$

where $n$ is the total number of computations (36 in our case), $\bar{x}$ and $s$ are the sample mean and sample standard deviation of the differences between $S$ and $S^*$ or between $S$ and $S^\oplus$, and $z_{1-\alpha/2}$ is the $(1 - \alpha/2)$-quantile of a unit normal variate. For the 95% confidence level, $z_{1-\alpha/2} = 1.960$, and we obtain

95% confidence interval for mean of $S - S^*$ = (-0.012, 0.159),
95% confidence interval for mean of $S - S^\oplus$ = (-0.171, 0.015).

Since both intervals include zero, we can say that $S^*$ and $S^\oplus$ are indistinguishable from $S$.
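The interval of Eqn. (14) is straightforward to compute with the Python standard library. The sketch below is illustrative only: the list `diffs` holds made-up differences, not the experimental data of this report, and the function name `mean_confidence_interval` is hypothetical.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev


def mean_confidence_interval(diffs, level=0.95):
    """Eqn. (14): mean -/+ z * s / sqrt(n), using the unit normal quantile."""
    n = len(diffs)
    z = NormalDist().inv_cdf(1 - (1 - level) / 2)  # about 1.960 for 95%
    half_width = z * stdev(diffs) / sqrt(n)        # stdev = sample std. dev.
    centre = mean(diffs)
    return centre - half_width, centre + half_width


# Made-up differences between two similarity measures, for illustration.
diffs = [0.10, -0.05, 0.02, 0.00, -0.08, 0.07, 0.03, -0.01]
low, high = mean_confidence_interval(diffs)
print(round(low, 3), round(high, 3))
print("interval includes zero:", high >= 0 >= low)
```

If the interval contains zero, the two measures cannot be distinguished at the chosen confidence level, which is the criterion applied in the text.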
3.2 New Experiment
The confidence intervals show that, for a small experiment, both Radecki's $S^*$ and our $S^\oplus$ are statistically equivalent to the measure based on Jaccard's coefficient ($S$). To compare the rankings generated by each similarity measure, we conduct a new experiment with 32 Boolean expression samples on the USC Homer database. The 32 Boolean expression samples, each having 3.6 descriptors on average, were manually created from 24 descriptors picked from diverse fields.

For the given 32 Boolean expression samples (denoted as $Q_i$, $1 \le i \le 32$), we calculate the similarities $S$, $S^*$, and $S^\oplus$ for each Boolean expression pair. Based on these, we compute the degree of association between ($S^*$, $S$) and between ($S^\oplus$, $S$) by applying the Spearman rank-order correlation coefficient ($r_s$) [11]. The value of $r_s$ ranges between $-1$ and $1$. If two rankings are identical, $r_s = 1$. If one ranking is the reverse of the other, $r_s = -1$. The larger the $r_s$, the closer the rankings. The $r_s$ coefficient allows us to compare whether $S^*$ or $S^\oplus$ generates a ranking closer to that of $S$.

For each Boolean expression $Q_i$ ($1 \le i \le 32$), we rank $Q_j$ ($1 \le j \le 32$) according to the similarity values $S(Q_i, Q_j)$. For tied values, each $Q_j$ is assigned the average of the ranks that would have been assigned had no ties occurred. Similarly, we calculate the ranking for each $Q_i$ by using $S^*$ and $S^\oplus$ respectively. Then we compute the $r_s$ coefficients among them.

Let $a_1, \ldots, a_n$ and $b_1, \ldots, b_n$ be two rankings for $Q_i$ generated by different similarity measures, where $n$ is the number of elements in the ranking (32 in our case). The tied ranks in each ranking form a group. Assume there are $g_u$ different groups in $a_1, \ldots, a_n$, and each group has $u_k$ ($1 \le k \le g_u$) tied elements. Similarly, ranking $b_1, \ldots, b_n$ has $g_v$ groups, each with $v_k$ ($1 \le k \le g_v$) tied elements. The $r_s$ coefficient can be obtained by [11]

$$r_s = \frac{\frac{1}{6}(n^3 - n) - \sum_{k=1}^{n} (a_k - b_k)^2 - U' - V'}{\sqrt{\left[\frac{1}{6}(n^3 - n) - 2U'\right]\left[\frac{1}{6}(n^3 - n) - 2V'\right]}}, \tag{15}$$

where

$$U' = \frac{1}{12} \sum_{k=1}^{g_u} (u_k^3 - u_k), \qquad V' = \frac{1}{12} \sum_{k=1}^{g_v} (v_k^3 - v_k).$$

Let $r_s(S^*, S)$ and $r_s(S^\oplus, S)$ denote the $r_s$'s between ($S^*$, $S$) and between ($S^\oplus$, $S$) respectively. Figure 3 shows the results of $r_s(S^*, S)$ and $r_s(S^\oplus, S)$ for $Q_i$ ($1 \le i \le 32$). Among the 32 samples, $r_s(S^\oplus, S)$ is higher than $r_s(S^*, S)$ 22 times and lower 10 times. This indicates that $S^\oplus$ generates a ranking closer to that of $S$ for 22 out of 32 samples, whereas $S^*$ is closer for only 10 out of 32. To measure the confidence that $S^\oplus$ is superior to $S^*$, we calculate the "confidence interval for the proportion", defined as follows [10]:

$$\text{Sample proportion} = p = \frac{n_1}{n}, \tag{16}$$

$$\text{Confidence interval for proportion} = p \mp z_{1-\alpha/2} \sqrt{\frac{p(1-p)}{n}}, \tag{17}$$

where $z_{1-\alpha/2}$ is defined as above, $n$ is the total number of samples, and $n_1$ is the number of times $S^\oplus$ is superior to $S^*$. The result shows

95% confidence interval for proportion = (0.527, 0.848).

The confidence interval does not include 0.5. Therefore, we can say with 95% confidence that $S^\oplus$ is superior to $S^*$.
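Both statistics used in this section can be reproduced with a short Python sketch. The function names `tie_term`, `spearman`, and `proportion_interval` are illustrative, and `spearman` assumes its inputs are already rank values with ties replaced by average ranks, as described above.

```python
from collections import Counter
from math import sqrt
from statistics import NormalDist


def tie_term(ranks):
    """U' or V' of Eqn. (15): (1/12) * sum of (size^3 - size) over tie groups."""
    return sum(c ** 3 - c for c in Counter(ranks).values()) / 12


def spearman(a, b):
    """Tie-corrected Spearman coefficient of Eqn. (15) for two rank lists."""
    n = len(a)
    base = (n ** 3 - n) / 6
    u, v = tie_term(a), tie_term(b)
    d2 = sum((x - y) ** 2 for x, y in zip(a, b))
    return (base - d2 - u - v) / sqrt((base - 2 * u) * (base - 2 * v))


def proportion_interval(n1, n, level=0.95):
    """Eqns. (16) and (17): normal-approximation interval for a proportion."""
    p = n1 / n
    z = NormalDist().inv_cdf(1 - (1 - level) / 2)
    half_width = z * sqrt(p * (1 - p) / n)
    return p - half_width, p + half_width


print(spearman([1, 2, 3, 4], [1, 2, 3, 4]))   # identical rankings
print(spearman([1, 2, 3, 4], [4, 3, 2, 1]))   # reversed rankings

low, high = proportion_interval(22, 32)       # 22 wins out of 32 samples
print(round(low, 3), round(high, 3))          # 0.527 0.848
```

With 22 wins out of 32 the lower bound stays above 0.5, which matches the interval reported in the text.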
Figure 3: The Spearman rank-order correlation coefficients $r_s$ between ($S^*$, $S$) and between ($S^\oplus$, $S$), each plotted with its own marker, for 32 Boolean expression samples (x-axis: sample; y-axis: Spearman coefficient).
4 Analysis and Comparison
In this section we analyze the space and time complexities of computing the similarity measures $S^\oplus$ and $S^*$. As mentioned earlier, to calculate $S^\oplus$, we need to apply the distributive law, such as

$$(t_1 \lor t_2) \land t_3 = (t_1 \land t_3) \lor (t_2 \land t_3),$$

to obtain CDNFs, where $t_1$, $t_2$, and $t_3$ are descriptors. To calculate Radecki's $S^*$, we need to transform Boolean expressions to RDNFs. Two steps are required in the transformation: (1) "distribution", where the distributive law is used to produce the corresponding disjunctive normal form; and (2) "expansion", where we use

$$t_1 = (t_1 \land t_2) \lor (t_1 \land \neg t_2)$$

so that each reduced atomic descriptor contains all the descriptors (original or negated) in the to-be-compared Boolean expressions.

The order of these two steps affects the complexity, but not the result, of transforming a Boolean expression to an RDNF. If the distribution is performed before the expansion, it is equivalent to transforming the Boolean expression to its CDNF and then expanding the CDNF to an RDNF. If the expansion is performed before the distribution, it needs more space and computation because extra negated descriptors ($\neg t_2$) are generated in the expansion step. The following example clarifies this idea.

Case I. We transform Boolean expression $Q$ to CDNF $\hat{Q}$, then expand it to RDNF $\tilde{Q}$:

$$\begin{aligned}
Q &= (t_1 \lor t_2) \land t_3 && (18) \\
&= (t_1 \land t_3) \lor (t_2 \land t_3) \quad (\Rightarrow \hat{Q}) && (19) \\
&= \underbrace{(t_1 \land t_2 \land t_3) \lor (t_1 \land \neg t_2 \land t_3)}_{t_1 \land t_3} \lor \underbrace{(t_1 \land t_2 \land t_3) \lor (\neg t_1 \land t_2 \land t_3)}_{t_2 \land t_3} && (20) \\
&= (t_1 \land t_2 \land t_3) \lor (t_1 \land \neg t_2 \land t_3) \lor (\neg t_1 \land t_2 \land t_3) \quad (\Rightarrow \tilde{Q}) && (21)
\end{aligned}$$

□

Case II. We expand $Q$ first, then distribute it:

$$\begin{aligned}
Q ={} & (t_1 \lor t_2) \land t_3 && (22) \\
={} & \big(\underbrace{(t_1 \land t_2 \land t_3) \lor (t_1 \land \neg t_2 \land t_3) \lor (t_1 \land t_2 \land \neg t_3) \lor (t_1 \land \neg t_2 \land \neg t_3)}_{t_1} \\
& \lor \underbrace{(t_1 \land t_2 \land t_3) \lor (\neg t_1 \land t_2 \land t_3) \lor (t_1 \land t_2 \land \neg t_3) \lor (\neg t_1 \land t_2 \land \neg t_3)}_{t_2}\big) \\
& \land \underbrace{\big((t_1 \land t_2 \land t_3) \lor (\neg t_1 \land t_2 \land t_3) \lor (t_1 \land \neg t_2 \land t_3) \lor (\neg t_1 \land \neg t_2 \land t_3)\big)}_{t_3} && (23) \\
={} & \big((t_1 \land t_2 \land t_3) \lor (t_1 \land \neg t_2 \land t_3) \lor (t_1 \land t_2 \land \neg t_3) \lor (t_1 \land \neg t_2 \land \neg t_3) \lor (\neg t_1 \land t_2 \land t_3) \lor (\neg t_1 \land t_2 \land \neg t_3)\big) \\
& \land \big((t_1 \land t_2 \land t_3) \lor (\neg t_1 \land t_2 \land t_3) \lor (t_1 \land \neg t_2 \land t_3) \lor (\neg t_1 \land \neg t_2 \land t_3)\big) && (24) \\
={} & (t_1 \land t_2 \land t_3) \lor (t_1 \land \neg t_2 \land t_3) \lor (\neg t_1 \land t_2 \land t_3) \quad (\Rightarrow \tilde{Q}) && (25)
\end{aligned}$$

□
In Case I, the expansion is performed after the distribution. Therefore each compact atomic descriptor is expanded (Eqn. (19) to (20)) instead of each descriptor (Eqn. (22) to (23)), as in Case II. A compact atomic descriptor usually contains more than one descriptor after applying the distributive law to its original Boolean expression. In the above example, each of the two compact atomic descriptors in $\hat{Q}$ (Eqn. (19)), i.e. $(t_1 \land t_3)$ and $(t_2 \land t_3)$, contains two descriptors. Eight additional descriptors are added from Eqn. (19) to (20) by the expansion. On the other hand, each individual descriptor in the original Boolean expression is expanded in Case II. Thirty-three additional descriptors are added from Eqn. (22) to Eqn. (23). The second approach needs more space than the first one for storing those intermediate descriptors, and consequently spends more time checking for duplicates before obtaining the final $\tilde{Q}$.

In our example, the original Boolean expression contains only 3 descriptors. It is the simplest transformation case. For more complicated Boolean expressions, the difference between Case I and Case II will be larger. Therefore we assume the first approach (i.e. Boolean expression $\Rightarrow$ CDNF $\Rightarrow$ RDNF) is used in our complexity analysis. Based on this, the time complexities of $S^\oplus$ and $S^*$ are equal to the transformation time from Boolean expression to CDNF or RDNF plus the time to compute the similarity measures. For a single Boolean expression,

$$\text{Time}_{S^\oplus} = \text{Time}_{S^\oplus}(\text{transformation}) + \text{Time}_{S^\oplus}(\text{computation}), \tag{26}$$
$$\text{Time}_{S^*} = \text{Time}_{S^*}(\text{transformation}) + \text{Time}_{S^*}(\text{computation}), \tag{27}$$

where

$$\text{Time}_{S^\oplus}(\text{transformation}) = \text{Time}(\text{Boolean expression} \Rightarrow \text{CDNF}), \tag{28}$$
$$\text{Time}_{S^*}(\text{transformation}) = \text{Time}(\text{Boolean expression} \Rightarrow \text{CDNF}) + \text{Time}(\text{CDNF} \Rightarrow \text{RDNF}). \tag{29}$$

Similarly, the space complexities of $S^\oplus$ and $S^*$ are determined by the storage requirements for the CDNF and RDNF respectively. For a single Boolean expression,

$$\text{Space}_{S^\oplus} = \text{Space}(\text{CDNF}), \tag{30}$$
$$\text{Space}_{S^*} = \text{Space}(\text{CDNF}) + \text{Space}(\text{RDNF}). \tag{31}$$

In the following sections, we discuss the complexities of the individual steps.
4.1 From Boolean Expression To CDNF
To simplify the analysis, we use binary trees [12] to represent the Boolean expressions. Each external node or "leaf" represents a descriptor. All the internal nodes, including the root, are logical operators. The negation not can be stored with the associated descriptor, so we do not denote it separately. The height of a tree is the length of the longest path from any leaf to the root.

The binary trees are transformed to their equivalent CDNF binary trees using the distributive law. The technique is to transform an and-rooted subtree to an equivalent or-rooted subtree, one at a time, in a top-down approach. An example is shown in Figure 4. A, B, and C are the subtrees of the associated nodes.

Figure 4: Compact disjunctive normalization.

We use the distributive law $(A \lor B) \land C = (A \land C) \lor (B \land C)$ on the subtrees A, B, and C. We first change the current root node from and to or, and change its or-rooted child node to be and-rooted. Then we demote the other child (C) by one level, and add one and node at its original position to be its new parent. Finally, we replicate the demoted child (C) and exchange it with one of the children (B) on the other subtree. The same procedure is repeated until reaching the leaves. The complete algorithm will be described in the following section.

The space complexity of transforming the Boolean expression to a CDNF varies from $O(n)$ to $O(n^2)$ depending on how the Boolean expression is constructed, where $n$ is the total number of descriptors and logical operators in the Boolean expression. For example, a linear binary tree (Figure 5(a)) generates an $O(n)$ CDNF, while a complete binary tree (Figure 5(b)) generates an $O(n^2)$ CDNF. Notice that $n$ is equal to the total number of nodes if the Boolean expression is represented as a binary tree. The time complexity is primarily determined by the number of times the distributive law is invoked and the size of the subtree to be duplicated. It is of the same order as the space complexity: $O(n)$ for a linear binary tree and $O(n^2)$ for a complete binary tree. Below, we use two examples to further investigate these complexities. The binary trees are designed to be distributed as much as possible, and each descriptor is represented as one node.
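The effect of repeatedly applying the distributive law can be illustrated with a compact recursive sketch. Note the assumptions: this is not the top-down tree-rewriting procedure described above, it works bottom-up and returns the CDNF as a list of descriptor sets rather than as a binary tree, and it does not remove duplicate or null conjunctions. The nested-tuple encoding and the name `to_cdnf` are hypothetical.

```python
def to_cdnf(node):
    """Return the CDNF of a binary expression tree as a list of frozensets,
    one frozenset of descriptors per compact atomic descriptor."""
    if isinstance(node, str):            # leaf: a single descriptor
        return [frozenset([node])]
    op, left, right = node
    left_terms, right_terms = to_cdnf(left), to_cdnf(right)
    if op == "or":                       # disjunction: concatenate the lists
        return left_terms + right_terms
    if op == "and":                      # conjunction: distribute pairwise
        return [a | b for a in left_terms for b in right_terms]
    raise ValueError(f"unknown operator: {op}")


# (t1 or t2) and t3, the query Q2 of Example 2.
Q2 = ("and", ("or", "t1", "t2"), "t3")
print(to_cdnf(Q2))

# A 7-node linear tree: t1 and (t2 or (t3 or t4)).
linear = ("and", "t1", ("or", "t2", ("or", "t3", "t4")))
print(len(to_cdnf(linear)))  # 3
```

For the 7-node linear tree the result has three compact atomic descriptors of two descriptors each, in line with the linear growth discussed for Figure 5(a).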
4.1.1 Linear Binary Trees

Figure 6 shows an $n$-node linear binary tree. Every time the distributive law is applied to an and-rooted subtree, an additional and node and a copy of its left child are created. Figure 6(a) is transformed to 6(b) by creating an and node and a duplicate $t_1$. Similarly, another $t_1$ is created from Figure 6(b) to 6(c). The time complexity $T(n)$ is composed of the time for creating the additional nodes, 2 units, and the time for processing the remaining subtree, $T(n-2)$. If a binary tree contains three or fewer nodes, there is no need for distribution, so $T(n) = 0$ for $n \le 3$. $T(n)$ can be computed recursively as

$$T(n) = \begin{cases} 0 & \text{if } n \le 3, \\ 2 + T(n-2) & \text{otherwise.} \end{cases}$$

Let $h$ be the height of the original binary tree; then $n = 2h - 1$ ($h \ge 1$). Thus,

$$T(n) = T(2h - 1) = O(n). \tag{32}$$

Figure 5: Various binary trees, where and and or are logical operators and $t_1, \ldots, t_p$ are descriptors. (a) A linear binary tree. (b) A complete binary tree.

Figure 6: The CDNF transformation of an $n$-node linear binary tree. Originally, only the root is the and operator; the other internal nodes are all or operators. $t_1, t_2, t_3$ are descriptors. The number in brackets is the number of nodes in the subtree. (a) The original binary tree. (b) The binary tree after one distribution. (c) The binary tree after two distributions.
Similarly, the space complexity $M(n)$ is composed of the space for the root, its 3-node left subtree, and the to-be-distributed right subtree. Since there is no need for distribution for $n \le 3$, $M(n) = n$ in these cases. $M(n)$ can be computed recursively as

$$M(n) = \begin{cases} n & \text{if } n \le 3, \\ 4 + M(n-2) & \text{otherwise.} \end{cases}$$

The order of $M$ is also $n$:

$$M(n) = M(2h - 1) = O(n). \tag{33}$$
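The two recurrences for linear binary trees can be checked numerically. The sketch below evaluates them directly and compares them with the closed forms $T(n) = n - 3$ and $M(n) = 2n - 3$, which follow from unrolling the recurrences for odd $n \ge 3$; the function names are illustrative only.

```python
def t_linear(n: int) -> int:
    """T(n) = 0 if n <= 3, else 2 + T(n - 2)."""
    return 0 if n <= 3 else 2 + t_linear(n - 2)


def m_linear(n: int) -> int:
    """M(n) = n if n <= 3, else 4 + M(n - 2)."""
    return n if n <= 3 else 4 + m_linear(n - 2)


# n = 2h - 1 is always odd for a linear binary tree of height h
checked = [(n, t_linear(n), m_linear(n)) for n in range(3, 200, 2)]
```

Both quantities grow linearly in $n$, in agreement with Eqn. (32) and (33).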
Notice that a binary tree with $N$ internal nodes has $(N + 1)$ external nodes [12]. Thus, an $n$-node binary tree consists of $\frac{n+1}{2}$ leaves (or descriptors in a Boolean expression) and $\frac{n-1}{2}$ internal nodes (or logical operators in a Boolean expression). Figure 6 can be presented mathematically as

$$
\begin{aligned}
Q &= t_1 \land (t_2 \lor t_3 \lor \cdots \lor t_{\frac{n+1}{2}}) && (34)\\
  &= (t_1 \land t_2) \lor (t_1 \land (t_3 \lor \cdots \lor t_{\frac{n+1}{2}})) && (35)\\
  &= (t_1 \land t_2) \lor (t_1 \land t_3) \lor (t_1 \land (t_4 \lor \cdots \lor t_{\frac{n+1}{2}})) && (36)\\
  &= (t_1 \land t_2) \lor (t_1 \land t_3) \lor \cdots \lor (t_1 \land t_{\frac{n+1}{2}}), && (37)
\end{aligned}
$$

where $t_i$ ($1 \le i \le \frac{n+1}{2}$) are descriptors, and Eqn. (34), (35), and (36) represent Figure 6(a), 6(b), and 6(c), respectively. Eqn. (37) represents the final CDNF $\hat{Q}$, which consists of $\frac{n-1}{2}$ compact atomic descriptors with 2 descriptors in each one of them. Since $\hat{Q}$ is still a binary tree, it preserves all the characteristics of a binary tree. From the derivation of Eqn. (33), $\hat{Q}$ contains $(2n - 3)$ nodes; therefore $(n - 1)$ of them are descriptors and the others are logical operators. To summarize the characteristics of the CDNF of an $n$-node linear binary tree, we have:
• $(2n - 3)$ total nodes,
• $(n - 1)$ descriptors,
• $(n - 2)$ logical operators,
• $\frac{n-1}{2}$ compact atomic descriptors,
• each compact atomic descriptor contains 2 descriptors,
• $\mathrm{Time}_{\mathrm{linear}}(\text{Boolean expression} \Rightarrow \text{CDNF}) = O(n)$,
• $\mathrm{Space}_{\mathrm{linear}}(\text{CDNF}) = O(n)$.
4.1.2 Complete Binary Trees

Figure 7 shows an $n$-node complete binary tree, in which each internal node has two children. Every time the topmost and node is distributed, an additional and node and a copy of one of its subtrees are created. Figure 7(a) is transformed to 7(b) by creating an and node and a duplicate $C$. Similarly, $A$ and $B$ are duplicated from Figure 7(b) to 7(c).

Figure 7: The CDNF transformation of an $n$-node complete binary tree. Originally, only the root is the and operator; the other internal nodes are all or operators. $A$, $B$, $C_1$, and $C_2$ are subtrees. The number in brackets is the number of nodes in the subtree. (a) The original binary tree. (b) The binary tree after distribution on the first-level node (i.e. the root). The subtree $C$ is duplicated. (c) The binary tree after distributions on the second-level nodes. The subtrees $A$ and $B$ are duplicated.

The time complexity $T(n)$ consists of the time for adding additional and nodes and for duplicating subtrees. Since there is no need for distribution for $n \le 3$, $T(n) = 0$ in these cases. Otherwise,

$$T(n) = 1 + \frac{n-1}{2} + 2\left[1 + \frac{n-3}{4} + 2\,T\!\left(\frac{n-1}{2}\right)\right].$$

Therefore,

$$T(n) = \begin{cases} 0 & \text{if } n \le 3, \\ n + 1 + 4\,T\!\left(\frac{n-1}{2}\right) & \text{otherwise.} \end{cases}$$

Let $h$ be the height of the original binary tree; then $n = 2^h - 1$ ($h \ge 1$). We can derive

$$
\begin{aligned}
T(n) &= T(2^h - 1) \\
     &= 2^h + 4\,T(2^{h-1} - 1) \\
     &= 2^h + 4 \cdot 2^{h-1} + 4^2\,T(2^{h-2} - 1) \\
     &= 2^h + 4 \cdot 2^{h-1} + 4^2 \cdot 2^{h-2} + \cdots + 4^{h-3} \cdot 2^3 + 4^{h-2}\,T(2^{h-(h-2)} - 1) \\
     &= 2^h + 2 \cdot 2^h + 2^2 \cdot 2^h + \cdots + 2^{h-3} \cdot 2^h \\
     &= 2^h (2^{h-2} - 1) \\
     &= (n + 1)\left[\frac{n+1}{4} - 1\right] \\
     &= O(n^2). \qquad (38)
\end{aligned}
$$

Similarly, the space complexity $M(n)$ consists of the space to store the root and its two to-be-distributed subtrees. For $n \le 3$, the space is unchanged because there is no need for distribution. Otherwise,

$$M(n) = 1 + 2\left[1 + 2\,M\!\left(\frac{n-1}{2}\right)\right].$$

Therefore,

$$M(n) = \begin{cases} n & \text{if } n \le 3, \\ 3 + 4\,M\!\left(\frac{n-1}{2}\right) & \text{otherwise,} \end{cases}$$

which can be derived as

$$
\begin{aligned}
M(n) &= M(2^h - 1) \\
     &= 3 + 4\,M(2^{h-1} - 1) \\
     &= 3 + 4 \cdot 3 + 4^2\,M(2^{h-2} - 1) \\
     &= 3 + 4 \cdot 3 + 4^2 \cdot 3 + \cdots + 4^{h-3} \cdot 3 + 4^{h-2}\,M(2^{h-(h-2)} - 1) \\
     &= 3(1 + 4 + 4^2 + \cdots + 4^{h-2}) \\
     &= 4^{h-1} - 1 \\
     &= \frac{(n+1)^2}{4} - 1 \\
     &= O(n^2). \qquad (39)
\end{aligned}
$$

Let $t_i$ ($1 \le i \le \frac{n+1}{2}$) denote the $\frac{n+1}{2}$ descriptors (i.e. leaves) in the original $n$-node complete binary tree. We divide the $t_i$ into four groups ($A$, $B$, $C_1$, $C_2$) of equal size, each group having $\frac{n+1}{8}$ descriptors. Let $k = \frac{n+1}{8}$; then

$$
\begin{aligned}
A &= (t_1 \lor \cdots \lor t_k), & B &= (t_{k+1} \lor \cdots \lor t_{2k}), \\
C_1 &= (t_{2k+1} \lor \cdots \lor t_{3k}), & C_2 &= (t_{3k+1} \lor \cdots \lor t_{4k}).
\end{aligned}
$$
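Therefore, Figure 7 can be presented as

$$
\begin{aligned}
Q &= (A \lor B) \land (C_1 \lor C_2) && (40)\\
  &= (A \land (C_1 \lor C_2)) \lor (B \land (C_1 \lor C_2)) && (41)\\
  &= ((A \land C_1) \lor (A \land C_2)) \lor ((B \land C_1) \lor (B \land C_2)) && (42)\\
  &= (\underbrace{(t_1 \lor \cdots \lor t_k)}_{A} \land \underbrace{(t_{2k+1} \lor \cdots \lor t_{3k})}_{C_1}) \lor (\underbrace{(t_1 \lor \cdots \lor t_k)}_{A} \land \underbrace{(t_{3k+1} \lor \cdots \lor t_{4k})}_{C_2}) \lor {} \\
  &\quad\; (\underbrace{(t_{k+1} \lor \cdots \lor t_{2k})}_{B} \land \underbrace{(t_{2k+1} \lor \cdots \lor t_{3k})}_{C_1}) \lor (\underbrace{(t_{k+1} \lor \cdots \lor t_{2k})}_{B} \land \underbrace{(t_{3k+1} \lor \cdots \lor t_{4k})}_{C_2}) \\
  &= \bigvee_{i=1}^{2k} \; \bigvee_{j=2k+1}^{4k} (t_i \land t_j) \\
  &= \bigvee_{i=1}^{\frac{n+1}{4}} \; \bigvee_{j=\frac{n+5}{4}}^{\frac{n+1}{2}} (t_i \land t_j), && (43)
\end{aligned}
$$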
where $A$, $B$, $C_1$, $C_2$ are subtrees, and Eqn. (40), (41), and (42) represent Figure 7(a), 7(b), and 7(c), respectively. Eqn. (43) represents the resulting CDNF $\hat{Q}$, which consists of $\left(\frac{n+1}{4}\right)^2$ compact atomic descriptors with 2 descriptors in each of them. Therefore, the characteristics of the CDNF of an $n$-node complete binary tree are:

• $\left(\frac{(n+1)^2}{4} - 1\right)$ total nodes,
• $\frac{(n+1)^2}{8}$ descriptors,
• $\left(\frac{(n+1)^2}{8} - 1\right)$ logical operators,
• $\left(\frac{n+1}{4}\right)^2$ compact atomic descriptors,
• each compact atomic descriptor contains 2 descriptors,
• $\mathrm{Time}_{\mathrm{complete}}(\text{Boolean expression} \Rightarrow \text{CDNF}) = O(n^2)$,
• $\mathrm{Space}_{\mathrm{complete}}(\text{CDNF}) = O(n^2)$.
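The closed forms in Eqn. (38) and (39) and the counts listed above can be cross-checked with a few lines of Python. This is only a numerical sanity check of the recurrences as stated; the function names are illustrative.

```python
def t_complete(n: int) -> int:
    """T(n) = 0 if n <= 3, else n + 1 + 4 T((n - 1) / 2)."""
    return 0 if n <= 3 else n + 1 + 4 * t_complete((n - 1) // 2)


def m_complete(n: int) -> int:
    """M(n) = n if n <= 3, else 3 + 4 M((n - 1) / 2)."""
    return n if n <= 3 else 3 + 4 * m_complete((n - 1) // 2)


def cdnf_pairs(n: int) -> list:
    """Index pairs (i, j) of the conjunctions t_i and t_j in Eqn. (43); needs n >= 7."""
    k = (n + 1) // 8
    return [(i, j) for i in range(1, 2 * k + 1)
            for j in range(2 * k + 1, 4 * k + 1)]


sizes = [2 ** h - 1 for h in range(3, 12)]     # n = 2^h - 1 with h >= 3
```

For $n = 15$, for instance, `cdnf_pairs` yields 16 conjunctions, i.e. 32 descriptors and 31 logical operators, for a total of 63 nodes, the same value as $M(15)$.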
4.2 From CDNF To RDNF

Assume $Q_1$ and $Q_2$ are two Boolean expressions that have $n_1$ and $n_2$ total nodes and $p_1$ and $p_2$ distinct descriptors, respectively. Let $p$ be the size of the union of these two distinct descriptor sets, $c_1$ and $c_2$ the numbers of compact atomic descriptors of $\hat{Q}_1$ and $\hat{Q}_2$, and $r_1$ and $r_2$ the numbers of reduced atomic descriptors of $\tilde{Q}_1$ and $\tilde{Q}_2$, respectively. The resulting RDNFs of $Q_1$ and $Q_2$ have the following characteristics:

• each reduced atomic descriptor contains $p$ descriptors,
• $\max(p_1, p_2) \le p \le p_1 + p_2$,
• $r_1 = \min(c_1 \cdot 2^{p-2}, 2^p)$,
• $r_2 = \min(c_2 \cdot 2^{p-2}, 2^p)$,
• $\tilde{Q}_1$ has $r_1 \cdot p$ descriptors,
• $\tilde{Q}_2$ has $r_2 \cdot p$ descriptors.

In $\tilde{Q}_1$ and $\tilde{Q}_2$, each compact atomic descriptor containing 2 descriptors is expanded to $2^{p-2}$ reduced atomic descriptors containing $p$ descriptors. However, some of these reduced atomic descriptors are duplicates. Therefore the total number cannot exceed $2^p$, which is the number of all possible reduced atomic descriptors containing $p$ descriptors.

If $Q_1$ is a linear binary tree, then $c_1 = \frac{n_1 - 1}{2}$ and

$$r_1 = \min\left(\frac{n_1 - 1}{2} \cdot 2^{p-2},\; 2^p\right) = \begin{cases} (n_1 - 1)\,2^{p-3} & \text{if } n_1 < 9, \\ 2^p & \text{otherwise,} \end{cases} \qquad p \ge 2.$$

The space complexity of $\tilde{Q}_1$ is

$$\mathrm{Space}_{\mathrm{linear}}(\text{RDNF}) = (r_1 \cdot p) + (r_1 \cdot p - 1) \le 2^{p+1} \cdot p - 1 = O(2^p p), \tag{44}$$

where $(r_1 \cdot p)$ is the number of descriptors in $\tilde{Q}_1$ and $(r_1 \cdot p - 1)$ is the number of logical operators in $\tilde{Q}_1$.

If $Q_2$ is a complete binary tree, then $c_2 = \left(\frac{n_2 + 1}{4}\right)^2$ and

$$r_2 = \min\left(\left(\frac{n_2 + 1}{4}\right)^2 \cdot 2^{p-2},\; 2^p\right) = \begin{cases} (n_2 + 1)^2\,2^{p-6} & \text{if } n_2 < 7, \\ 2^p & \text{otherwise,} \end{cases} \qquad p \ge 2.$$

The space complexity of $\tilde{Q}_2$ is

$$\mathrm{Space}_{\mathrm{complete}}(\text{RDNF}) = (r_2 \cdot p) + (r_2 \cdot p - 1) \le 2^{p+1} \cdot p - 1 = O(2^p p), \tag{45}$$

where $(r_2 \cdot p)$ is the number of descriptors in $\tilde{Q}_2$ and $(r_2 \cdot p - 1)$ is the number of logical operators in $\tilde{Q}_2$. From Eqn. (44) and (45), we notice that $\mathrm{Space}_{\mathrm{linear}}(\text{RDNF}) = \mathrm{Space}_{\mathrm{complete}}(\text{RDNF})$, which means the space complexity of an RDNF is independent of how its Boolean expression is constructed and is bounded by $O(2^p p)$.

The time for transforming a CDNF to an RDNF consists of (1) expanding each compact atomic descriptor, and (2) checking and removing duplicate reduced atomic descriptors. Because (2) can be done as (1) is being executed, it is omitted in our analysis. Thus, the time complexity for an $n$-node linear binary tree is

$$\mathrm{Time}_{\mathrm{linear}}(\text{CDNF} \Rightarrow \text{RDNF}) = \frac{n-1}{2} \cdot 2^{p-2} \cdot p = O(2^p p\, n). \tag{46}$$
For an $n$-node complete binary tree,

$$\mathrm{Time}_{\mathrm{complete}}(\text{CDNF} \Rightarrow \text{RDNF}) = \left(\frac{n+1}{4}\right)^2 \cdot 2^{p-2} \cdot p = O(2^p p\, n^2). \tag{47}$$
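The expansion counted in this section can be sketched in Python. The sketch assumes that a reduced atomic descriptor assigns every one of the $p$ descriptors either a positive or a negated occurrence, and that the compact atomic descriptors contain only positive descriptors; both are modeling assumptions made for illustration, and `expand_to_rdnf` is a hypothetical name.

```python
from itertools import product


def expand_to_rdnf(cdnf, universe):
    """Expand each conjunction over all descriptors in `universe`.

    A reduced atomic descriptor is modeled as a frozenset of
    (descriptor, is_positive) pairs covering every descriptor in `universe`.
    Collecting them in a set removes duplicates while expanding.
    """
    ordered = sorted(universe)
    reduced = set()
    for conj in cdnf:
        free = [d for d in ordered if d not in conj]
        for signs in product((True, False), repeat=len(free)):
            literals = {(d, True) for d in conj} | set(zip(free, signs))
            reduced.add(frozenset(literals))
    return reduced


cdnf = [frozenset({"t1", "t2"}), frozenset({"t1", "t3"})]
universe = {"t1", "t2", "t3", "t4"}            # p = 4
rdnf = expand_to_rdnf(cdnf, universe)
```

Here each 2-descriptor conjunction expands to $2^{p-2} = 4$ reduced atomic descriptors, two of which coincide, so the result has 6 elements, below the bound $\min(c \cdot 2^{p-2}, 2^p) = 8$.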
4.3 Computation

To calculate our CDNF-based similarity measure $S^{\oplus}$ between two CDNFs, we need to compare their compact atomic descriptors against each other. Using the notation given above, we further define that the $i$th ($1 \le i \le c_1$) compact atomic descriptor of $\hat{Q}_1$ contains $k_i$ descriptors and the $j$th ($1 \le j \le c_2$) compact atomic descriptor of $\hat{Q}_2$ contains $h_j$ descriptors, where $k_i \le p_1$, $h_j \le p_2$, and $\max(p_1, p_2) \le p \le p_1 + p_2$.

To speed up the computation, all the descriptors within the compact atomic descriptors or reduced atomic descriptors are sorted before calculating their similarities. It therefore takes $\sum_{i=1}^{c_1} k_i \log k_i$ time to sort $\hat{Q}_1$ and $\sum_{j=1}^{c_2} h_j \log h_j$ time to sort $\hat{Q}_2$. Comparing these two CDNFs term by term takes $\sum_{i=1}^{c_1} \sum_{j=1}^{c_2} \max(k_i, h_j)$ time. Hence,

$$
\begin{aligned}
\mathrm{Time}_{S^{\oplus}}(\text{computation}) &= \sum_{i=1}^{c_1} k_i \log k_i + \sum_{j=1}^{c_2} h_j \log h_j + \sum_{i=1}^{c_1} \sum_{j=1}^{c_2} \max(k_i, h_j) \\
&\le \sum_{i=1}^{c_1} p_1 \log p_1 + \sum_{j=1}^{c_2} p_2 \log p_2 + c_1 c_2 \max(p_1, p_2) \\
&\le (c_1 + c_2)\, p \log p + c_1 c_2\, p. \qquad (48)
\end{aligned}
$$
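Similarly, for Radecki's RDNF-based measure $S^{*}$, we transform the same two Boolean expressions to RDNFs $\tilde{Q}_1$ and $\tilde{Q}_2$. Each of their reduced atomic descriptors contains exactly $p$ descriptors. Using the same optimal sorting method, it takes $\sum_{i=1}^{r_1} p \log p + \sum_{j=1}^{r_2} p \log p$ time to sort $\tilde{Q}_1$ and $\tilde{Q}_2$, and $\sum_{i=1}^{r_1} \sum_{j=1}^{r_2} p$ time to compare them. Thus,

$$
\begin{aligned}
\mathrm{Time}_{S^{*}}(\text{computation}) &= \sum_{i=1}^{r_1} p \log p + \sum_{j=1}^{r_2} p \log p + \sum_{i=1}^{r_1} \sum_{j=1}^{r_2} p \\
&= (r_1 + r_2)\, p \log p + r_1 r_2\, p. \qquad (49)
\end{aligned}
$$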
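Because an RDNF is obtained by expanding its CDNF, we are certain that $c_1 \le r_1$ and $c_2 \le r_2$. Therefore, the computation time of the CDNF-based measure, $\mathrm{Time}_{S^{\oplus}}(\text{computation})$, is always less than or equal to that of the RDNF-based measure, $\mathrm{Time}_{S^{*}}(\text{computation})$. If $Q_1$ and $Q_2$ are both $n$-node binary trees as described above, then $k_i = 2$ ($1 \le i \le c_1$) and $h_j = 2$ ($1 \le j \le c_2$). Thus,

$$
\begin{aligned}
\mathrm{Time}_{S^{\oplus}}(\text{computation}) &= \begin{cases} \frac{n-1}{2} \cdot 2 \log 2 + \frac{n-1}{2} \cdot 2 \log 2 + \left(\frac{n-1}{2}\right)^2 \cdot 2 & \text{linear binary trees,} \\[4pt] \left(\frac{n+1}{4}\right)^2 \cdot 2 \log 2 + \left(\frac{n+1}{4}\right)^2 \cdot 2 \log 2 + \left(\frac{n+1}{4}\right)^4 \cdot 2 & \text{complete binary trees,} \end{cases} \\
&= \begin{cases} O(n^2) & \text{linear binary trees,} \\ O(n^4) & \text{complete binary trees,} \end{cases} \\
\mathrm{Time}_{S^{*}}(\text{computation}) &= 2^p\, p \log p + 2^p\, p \log p + 2^p \cdot 2^p\, p = O(2^{2p} p).
\end{aligned}
$$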
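To get a feel for the gap between the two computation costs, the sketch below evaluates the first line of Eqn. (48) and Eqn. (49) for two 9-node linear binary trees. The logarithm base is not stated in the text, so base 2 is assumed here, and $p = 5$ assumes that both trees use the same five descriptors; the function names are illustrative.

```python
import math


def time_cdnf(k, h):
    """First line of Eqn. (48): sort both CDNFs, then compare every pair.

    k[i] and h[j] are the numbers of descriptors in the compact atomic
    descriptors of the two CDNFs.
    """
    sort_cost = (sum(x * math.log2(x) for x in k)
                 + sum(y * math.log2(y) for y in h))
    compare_cost = sum(max(x, y) for x in k for y in h)
    return sort_cost + compare_cost


def time_rdnf(r1, r2, p):
    """Eqn. (49): every reduced atomic descriptor has exactly p descriptors."""
    return (r1 + r2) * p * math.log2(p) + r1 * r2 * p


n, p = 9, 5
c = (n - 1) // 2                       # compact atomic descriptors per linear tree
r = min(c * 2 ** (p - 2), 2 ** p)      # reduced atomic descriptors per tree
cdnf_cost = time_cdnf([2] * c, [2] * c)
rdnf_cost = time_rdnf(r, r, p)
```

Under these assumptions the CDNF-based cost is 48 units, while the RDNF-based cost is already in the thousands.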
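4.4 Remarks

Below, we summarize the previous time and space analysis for our CDNF-based measure $S^{\oplus}$ and Radecki's RDNF-based measure $S^{*}$.

$$
\begin{aligned}
\mathrm{Time}_{S^{\oplus}} &= \mathrm{Time}_{S^{\oplus}}(\text{transformation}) + \mathrm{Time}_{S^{\oplus}}(\text{computation}) \\
&= \mathrm{Time}(\text{Boolean expression} \Rightarrow \text{CDNF}) + \mathrm{Time}_{S^{\oplus}}(\text{computation}) \\
&= \begin{cases} O(n) + O(n^2) & \text{linear binary trees,} \\ O(n^2) + O(n^4) & \text{complete binary trees,} \end{cases} \\[6pt]
\mathrm{Time}_{S^{*}} &= \mathrm{Time}_{S^{*}}(\text{transformation}) + \mathrm{Time}_{S^{*}}(\text{computation}) \\
&= \mathrm{Time}(\text{Boolean expression} \Rightarrow \text{CDNF}) + \mathrm{Time}(\text{CDNF} \Rightarrow \text{RDNF}) + \mathrm{Time}_{S^{*}}(\text{computation}) \\
&= \begin{cases} O(n) + O(2^p p\, n) + O(2^{2p} p) & \text{linear binary trees,} \\ O(n^2) + O(2^p p\, n^2) + O(2^{2p} p) & \text{complete binary trees,} \end{cases} \\[6pt]
\mathrm{Space}_{S^{\oplus}} &= \mathrm{Space}(\text{CDNF}) = \begin{cases} O(n) & \text{linear binary trees,} \\ O(n^2) & \text{complete binary trees,} \end{cases} \\[6pt]
\mathrm{Space}_{S^{*}} &= \mathrm{Space}(\text{CDNF}) + \mathrm{Space}(\text{RDNF}) = \begin{cases} O(n) + O(2^p p) & \text{linear binary trees,} \\ O(n^2) + O(2^p p) & \text{complete binary trees.} \end{cases}
\end{aligned}
$$

The above comparisons are based on a single pair of Boolean expressions. For $(N + 1)$ Boolean expressions, consisting of one incoming query and $N$ server descriptions, the time and space complexities for $S^{\oplus}$ and $S^{*}$ are

$$
\begin{aligned}
\mathrm{Time}_{S^{\oplus}}(\text{one query, } N \text{ server descriptions}) &= \mathrm{Time}_{S^{\oplus}}(\text{transformation}) + N \cdot \mathrm{Time}_{S^{\oplus}}(\text{computation}) && (50)\\
&= \mathrm{Time}(\text{Boolean expression} \Rightarrow \text{CDNF}) + N \cdot \mathrm{Time}_{S^{\oplus}}(\text{computation}) && (51)\\
&= \begin{cases} O(n) + N \cdot O(n^2) & \text{linear binary trees,} \\ O(n^2) + N \cdot O(n^4) & \text{complete binary trees,} \end{cases} && (52)\\[6pt]
\mathrm{Time}_{S^{*}}(\text{one query, } N \text{ server descriptions}) &= 2N \cdot \mathrm{Time}_{S^{*}}(\text{transformation}) + N \cdot \mathrm{Time}_{S^{*}}(\text{computation}) && (53)\\
&= \mathrm{Time}(\text{Boolean expression} \Rightarrow \text{CDNF}) + 2N \cdot \mathrm{Time}(\text{CDNF} \Rightarrow \text{RDNF}) + N \cdot \mathrm{Time}_{S^{*}}(\text{computation}) && (54)\\
&= \begin{cases} O(n) + 2N \cdot O(2^p p\, n) + N \cdot O(2^{2p} p) & \text{linear binary trees,} \\ O(n^2) + 2N \cdot O(2^p p\, n^2) + N \cdot O(2^{2p} p) & \text{complete binary trees,} \end{cases} && (55)\\[6pt]
\mathrm{Space}_{S^{\oplus}}(\text{one query, } N \text{ server descriptions}) &= (N + 1) \cdot \mathrm{Space}_{S^{\oplus}} \\
&= \begin{cases} (N + 1) \cdot O(n) & \text{linear binary trees,} \\ (N + 1) \cdot O(n^2) & \text{complete binary trees,} \end{cases} && (56)\\[6pt]
\mathrm{Space}_{S^{*}}(\text{one query, } N \text{ server descriptions}) &= (N + 1) \cdot \mathrm{Space}_{S^{*}} \\
&= \begin{cases} (N + 1) \cdot O(n) + (N + 1) \cdot O(2^p p) & \text{linear binary trees,} \\ (N + 1) \cdot O(n^2) + (N + 1) \cdot O(2^p p) & \text{complete binary trees.} \end{cases} && (57)
\end{aligned}
$$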
In Eqn. (50), all the server descriptions have already been precomputed and stored as CDNFs. Therefore we only need to transform the incoming query to its CDNF and compare it to all $N$ existing CDNFs.
In Eqn. (53), when the query is compared to each server description, both of them need to be transformed to their associated RDNFs based on the union set of their descriptors. Therefore it takes $2N$ transformations to compare the query against $N$ server descriptions. However, the CDNFs do not change during the $2N$ transformations, so we only need to count $\mathrm{Time}(\text{Boolean expression} \Rightarrow \text{CDNF})$ once in Eqn. (54).

As discussed previously, an $n$-node binary tree consists of $\frac{n+1}{2}$ leaves (or descriptors in a Boolean expression). The number of distinct descriptors $p$ in a pair of such trees must therefore be no larger than $n + 1$, i.e. $p \le 2 \cdot \frac{n+1}{2} = O(n)$. Thus, the complexities of the two measures, our CDNF-based $S^{\oplus}$ and Radecki's RDNF-based $S^{*}$, can be simplified as in Table 1.

| Boolean expressions | n-node binary tree | Similarity measure | Time complexity | Space complexity |
|---|---|---|---|---|
| 1 query, 1 server description | linear | $S^{\oplus}$ | $O(n^2)$ | $O(n)$ |
| 1 query, 1 server description | linear | $S^{*}$ | $O(2^{2n} n)$ | $O(2^n n)$ |
| 1 query, 1 server description | complete | $S^{\oplus}$ | $O(n^4)$ | $O(n^2)$ |
| 1 query, 1 server description | complete | $S^{*}$ | $O(2^{2n} n)$ | $O(2^n n)$ |
| 1 query, N server descriptions | linear | $S^{\oplus}$ | $O(N n^2)$ | $O(N n)$ |
| 1 query, N server descriptions | linear | $S^{*}$ | $O(N 2^{2n} n)$ | $O(N 2^n n)$ |
| 1 query, N server descriptions | complete | $S^{\oplus}$ | $O(N n^4)$ | $O(N n^2)$ |
| 1 query, N server descriptions | complete | $S^{*}$ | $O(N 2^{2n} n)$ | $O(N 2^n n)$ |

Table 1: Time and space complexities of $S^{\oplus}$ and $S^{*}$ for one user query against one or $N$ server descriptions. Both the user query and the server descriptions are $n$-node binary trees.

Clearly, $S^{\oplus}$ outperforms $S^{*}$ in both time and space complexity. The above analysis shows that our similarity measure based on CDNFs consumes up to exponentially less time and space than Radecki's method. The following example further illustrates the performance difference between the two measures.
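Example 8 Consider a directory of services containing 100 server descriptions, each consisting of 5 descriptors. The time and space used to calculate the similarities with Radecki's measure $S^{*}$ and with our measure $S^{\oplus}$ for a 5-descriptor user query are:

$$
\begin{aligned}
\mathrm{Time}_{S^{*}}(\text{1 query, 100 server descriptions}) &= 100 \cdot 2^{2 \cdot 5} \cdot 5 = 512000, \\
\mathrm{Time}_{S^{\oplus}}(\text{1 query, 100 server descriptions}) &= 100 \cdot 5^4 = 62500, \\
\mathrm{Space}_{S^{*}}(\text{1 query, 100 server descriptions}) &= 100 \cdot 2^5 \cdot 5 = 16000, \\
\mathrm{Space}_{S^{\oplus}}(\text{1 query, 100 server descriptions}) &= 100 \cdot 5^2 = 2500.
\end{aligned}
$$

When using $S^{\oplus}$, the directory of services is about eight times faster in searching for the relevant servers and takes only about one-sixth of the space required by $S^{*}$. □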
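The arithmetic of Example 8 can be reproduced in a few lines. The sketch simply substitutes $N = 100$ and $n = 5$ into the exponential and polynomial expressions used in the example, with constant factors dropped; the variable names are illustrative.

```python
N, n = 100, 5                          # server descriptions, descriptors each

time_exponential = N * 2 ** (2 * n) * n    # RDNF-based measure
time_polynomial = N * n ** 4               # CDNF-based measure
space_exponential = N * 2 ** n * n
space_polynomial = N * n ** 2

speedup = time_exponential / time_polynomial
space_ratio = space_exponential / space_polynomial
```

The exact ratios are 8.192 for time and 6.4 for space, which the example rounds to "eight times" and "one-sixth".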
5 Implementation on Indie

5.1 Indie

The Distributed Indexing project, or Indie, is an Internet resource discovery tool that provides an efficient way to organize and retrieve information. Each Indie resource is managed by a server called an Indie broker, which maintains a generator that describes the objects stored in its database. The generator, a nested Boolean expression, is used as a filter to collect data from information providers. The logically centralized but replicated server, called the directory of services, is a specialized broker that contains only the generators of every Indie broker in the system.

The search for information in Indie is accomplished in two steps. First, the user sends a query to the directory of services. The query can be a list of nested Boolean expressions composed of attribute-value pairs with and or or operators in between. For example,

$$\underbrace{\underbrace{(\underbrace{(\text{keyword} = \text{network})}_{E_1} \text{ or } \underbrace{(\text{keyword} = \text{UNIX})}_{E_2})}_{E_4} \text{ and } \underbrace{(\text{author} = \text{Smith})}_{E_3}}_{E_5},$$

where $E_i$ is the $i$th ($1 \le i \le 5$) Boolean expression, keyword and author are predefined database attributes, and network, UNIX, and Smith are their corresponding values. The directory of services compares the user query with each generator in its database, finds the similarity between them, and then returns a ranked list of relevant Indie brokers to the user. In the second step, the user sends the original query or a revision of it to brokers in the list. The Indie brokers search for and compute the set of objects that satisfy each $E_i$, and then return the results to the user.
5.2 Methodology

To apply our similarity measure to Indie, we first transform the user query $Q$ and a generator $R$ to the CDNFs $\hat{Q}$ and $\hat{R}$. Assume $\hat{Q}$ contains $m$ compact atomic descriptors $\hat{Q}^i$, each having $x_i$ ($1 \le i \le m$) descriptors, and $\hat{R}$ contains $n$ compact atomic descriptors $\hat{R}^j$, each having $y_j$ ($1 \le j \le n$) descriptors. Thus,

$$\hat{Q} = \underbrace{(\hat{q}_{1,1} \land \hat{q}_{1,2} \land \cdots \land \hat{q}_{1,x_1})}_{\hat{Q}^1} \lor \cdots \lor \underbrace{(\hat{q}_{m,1} \land \hat{q}_{m,2} \land \cdots \land \hat{q}_{m,x_m})}_{\hat{Q}^m},$$

$$\hat{R} = \underbrace{(\hat{r}_{1,1} \land \hat{r}_{1,2} \land \cdots \land \hat{r}_{1,y_1})}_{\hat{R}^1} \lor \cdots \lor \underbrace{(\hat{r}_{n,1} \land \hat{r}_{n,2} \land \cdots \land \hat{r}_{n,y_n})}_{\hat{R}^n},$$

where each descriptor has a pair of subscripts $(u, v)$ denoting the $v$th descriptor within the $u$th compact atomic descriptor. Next, we compare $\hat{Q}$ and $\hat{R}$ term by term using $s^{\oplus}$ (Eqn. (7)). The similarity measure $S^{\oplus}$ between $\hat{Q}$ and $\hat{R}$ is evaluated using Eqn. (8).

Assume that there are $N$ generators $(\hat{R}_1, \hat{R}_2, \ldots, \hat{R}_N)$ stored in the directory of services. We calculate the similarity value of each $(\hat{Q}, \hat{R}_k)$ pair, $1 \le k \le N$. The generators that are relevant to $Q$ are collected in a set $E(Q)$:

$$E(Q) = \{ R_k \mid S^{\oplus}(Q, R_k) > 0,\; 1 \le k \le N \}.$$

We then sort $E(Q)$ in descending order of similarity value. If there are $d$ elements in $E(Q)$, the resulting list $L(Q)$ is

$$L(Q) = \{ R_{w_1}, R_{w_2}, \ldots, R_{w_d} \mid S^{\oplus}(Q, R_{w_1}) \ge S^{\oplus}(Q, R_{w_2}) \ge \cdots \ge S^{\oplus}(Q, R_{w_d}) \}.$$

The directory of services returns this sorted list of relevant generators $L(Q)$ to the user.
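The filtering and sorting step that produces $E(Q)$ and $L(Q)$ is simple enough to sketch directly. The function below takes similarity values that have already been computed; the function name, the generator names, and the scores are made up for illustration.

```python
def rank_generators(scores):
    """Return L(Q): generators with non-zero similarity, most similar first.

    `scores` maps a generator name to its similarity with the query.
    """
    relevant = {name: s for name, s in scores.items() if s > 0}    # E(Q)
    return sorted(relevant, key=relevant.get, reverse=True)        # L(Q)


# Hypothetical similarity values for three generators
ranking = rank_generators({"R1": 0.25, "R2": 0.0, "R3": 0.5})
```

Generators with zero similarity never reach the user, so the returned list can be shorter than the set of stored generators.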
Example 9 Let $Q_1$ be an incoming user query and $R_A$, $R_B$, $R_C$ be three generators stored as CDNFs ($\hat{R}_A$, $\hat{R}_B$, $\hat{R}_C$) in the directory of services:

$Q_1$: ((keyword = network) or (keyword = UNIX)) and (author = Smith);
$\hat{R}_A$: (keyword = network);
$\hat{R}_B$: (keyword = database) or (keyword = computer);
$\hat{R}_C$: ((keyword = UNIX) and (author = Smith)) or ((keyword = database) and (author = McLeod)).

$Q_1$ is normalized as $\hat{Q}_1$ before comparison:

$\hat{Q}_1$ = ((keyword = network) and (author = Smith)) or ((keyword = UNIX) and (author = Smith)).

The similarity values between the user query and the three generators are

$$
\begin{aligned}
S^{\oplus}(Q_1, R_A) &= \frac{\frac{1}{2} + 0}{2} = \frac{1}{4}, \\
S^{\oplus}(Q_1, R_B) &= \frac{0 + 0 + 0 + 0}{4} = 0, \\
S^{\oplus}(Q_1, R_C) &= \frac{\frac{1}{3} + 1 + 0 + 0}{4} = \frac{1}{3}.
\end{aligned}
$$

Thus, $E(Q_1) = \{R_A, R_C\}$ and $L(Q_1) = \{R_C, R_A\}$. $R_A$ and $R_C$ are the two relevant generators for $Q_1$, and $R_C$ is more relevant to $Q_1$ than $R_A$. □
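The values in Example 9 can be reproduced with the sketch below. Eqn. (7) is not restated in this section, so `pair_similarity` is a stand-in, not the report's definition: it returns 0 when two compact atomic descriptors share no descriptor and otherwise $1 / (2^a + 2^b - 1)$, where $a$ and $b$ count the descriptors found in only one of the two. This form was chosen only because it matches the three values in the example; the overall score is taken as the average over all pairs, as the example's arithmetic suggests.

```python
from fractions import Fraction


def pair_similarity(q, r):
    """ASSUMED stand-in for Eqn. (7); see the note above."""
    if not (q & r):
        return Fraction(0)
    return Fraction(1, 2 ** len(r - q) + 2 ** len(q - r) - 1)


def similarity(q_cdnf, r_cdnf):
    """Average pairwise similarity over all compact atomic descriptor pairs."""
    total = sum(pair_similarity(q, r) for q in q_cdnf for r in r_cdnf)
    return total / (len(q_cdnf) * len(r_cdnf))


def d(attribute, value):
    return (attribute, value)          # a descriptor is an attribute-value pair


q1 = [frozenset({d("keyword", "network"), d("author", "Smith")}),
      frozenset({d("keyword", "UNIX"), d("author", "Smith")})]
generators = {
    "RA": [frozenset({d("keyword", "network")})],
    "RB": [frozenset({d("keyword", "database")}),
           frozenset({d("keyword", "computer")})],
    "RC": [frozenset({d("keyword", "UNIX"), d("author", "Smith")}),
           frozenset({d("keyword", "database"), d("author", "McLeod")})],
}
scores = {name: similarity(q1, cdnf) for name, cdnf in generators.items()}
ranked = sorted((n for n, s in scores.items() if s > 0),
                key=scores.get, reverse=True)
```

Exact fractions are used so that the results can be compared with the example without rounding.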
5.3 Implementation

We use the UNIX tools flex and bison to parse the nested Boolean expressions and build the associated binary parse trees. Each attribute-value pair in the user query and generator is represented as a three-element subtree in the binary parse tree. The three-element subtree consists of one parent node and two child nodes. The left and right child nodes, the leaves, are the attribute name and its value. The leaves are joined by the parent node, which is a relational operator (such as "=" or "≠"). These subtrees are merged by the logical operators (and and or) to form the binary parse tree.

The binary parse trees are then transformed to their equivalent CDNF binary trees based on the distributive law. Notice that while replicating a subtree (such as $C$ in Figure 4), we only copy the logical operator nodes in order to save space. For relational operator nodes, only their associated pointers are copied. All the nodes in the binary tree whose parents are or are linked together after the distributive normalization. Figures 8 and 9 show the binary parse tree of the user query $Q_1$ in Example 9 before and after normalization. Figure 10 shows the generator $R_C$ after normalization. The link generated in each normalized binary tree is pointed to by head.

After the normalization process, we compare each component in the links of the two binary trees. Each element in the generator link represents a compact atomic descriptor $\hat{R}_C^j$ in the generator $\hat{R}_C$. Each element in the user query link represents a compact atomic descriptor $\hat{Q}_1^i$ in the user query $\hat{Q}_1$. To calculate $s^{\oplus}(\hat{Q}_1^i, \hat{R}_C^j)$, we compare all the nodes under $\hat{Q}_1^i$ with all the nodes under $\hat{R}_C^j$ and find the number of uncommon nodes between them. Then we compute the average value of all the $s^{\oplus}(\hat{Q}_1^i, \hat{R}_C^j)$ to get $S^{\oplus}(Q_1, R_C)$.
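The space-saving replication described above, in which logical operator nodes are copied while relational subtrees are shared, can be sketched as follows. Python object references stand in for pointers, and the class and function names are hypothetical rather than taken from the Indie source.

```python
from dataclasses import dataclass


@dataclass(eq=False)
class Rel:
    """Three-element subtree: a relational operator joining two leaves."""
    op: str             # relational operator, e.g. "="
    attribute: str      # left leaf
    value: str          # right leaf


@dataclass(eq=False)
class Logic:
    """Logical operator node joining two subtrees."""
    op: str             # "and" or "or"
    left: object
    right: object


def replicate(node):
    """Copy logical operator nodes; share relational subtrees by reference."""
    if isinstance(node, Rel):
        return node                     # only the reference is copied
    return Logic(node.op, replicate(node.left), replicate(node.right))


subtree = Logic("or",
                Rel("=", "keyword", "network"),
                Rel("=", "keyword", "UNIX"))
copy = replicate(subtree)
```

After replication, `copy` is a new or node, but its two children are the very same relational subtrees as in `subtree`, so the attribute and value leaves are stored only once.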
Figure 8: User query $Q_1$ before normalization.

Figure 9: Normalized user query $\hat{Q}_1$. The head links all the nodes whose parents are or. The dashed subtree is a replicated subtree. $\hat{Q}_1^i$ is one of the compact atomic descriptors in $\hat{Q}_1$.

Figure 10: Normalized generator $\hat{R}_C$. The head links all the nodes whose parents are or. $\hat{R}_C^j$ is one of the compact atomic descriptors in $\hat{R}_C$.

6 Conclusions

We have developed a new method using compact disjunctive normal form (CDNF) to rank the similarity between Boolean expressions. We repeated Radecki's experiments on a different information system and used the sign test, the t test, and confidence intervals to show that both Radecki's and our method are statistically equivalent to the measure based on Jaccard's coefficient. We also calculated the Spearman coefficients between them to show that our method yields a ranking order closer to the one generated by Jaccard's coefficient. The theoretical analysis shows that this new measure significantly outperforms the one proposed by Radecki in terms of time and space complexity. These results demonstrate that our similarity measure can greatly improve the searching process in today's overwhelming world of information.

In addition to ranking results, similarity estimates can be used to help identify similar but autonomously managed retrieval systems. For example, the similarity measure can be used to cluster servers with similar descriptions in a single directory entry. When the similarity measure of two servers exceeds a certain value, they can be merged to remove redundancy.
References

[1] B. Kahle and A. Medlar, "An information system for corporate users: Wide area information servers," ConneXions – The Interoperability Report, vol. 5, no. 11, pp. 2–9, 1991.
[2] P. B. Danzig, S.-H. Li, and K. Obraczka, "Distributed indexing of autonomous internet services," Computing Systems, vol. 5, no. 4, pp. 433–459, 1992.
[3] L. Gravano, H. Garcia-Molina, and A. Tomasic, "The efficacy of GLOSS for the text database discovery problem," Technical Report STAN-CS-93-2, Stanford University, 1993.
[4] D. R. Hardy and M. F. Schwartz, "Essence: A resource discovery system based on semantic file indexing," in 1993 Winter USENIX, January 1993.
[5] C. J. van Rijsbergen, Information Retrieval. Butterworth & Co (Publishers) Ltd., second ed., 1979.
[6] T. Radecki, "A model of a document-clustering-based information retrieval system with a boolean search request formulation," in Information Retrieval Research (R. N. Oddy, S. E. Robertson, C. J. van Rijsbergen, and P. W. Williams, eds.), pp. 334–344, Butterworth & Co (Publishers) Ltd., 1981.
[7] T. Radecki, "Similarity measures for boolean search request formulations," Journal of the American Society for Information Science, vol. 33, no. 1, pp. 8–17, 1982.
[8] S. Siegel and N. J. Castellan, Jr., Nonparametric Statistics: For the Behavioral Sciences. McGraw-Hill Book Company, Inc., second ed., 1988.
[9] A. A. Afifi and S. P. Azen, Statistical Analysis – A Computer Oriented Approach. Academic Press, Inc., 1972.
[10] R. Jain, The Art of Computer Systems Performance Analysis. John Wiley & Sons, Inc., 1991.
[11] M. Kendall and J. D. Gibbons, Rank Correlation Methods. Edward Arnold, fifth ed., 1990.
[12] R. Sedgewick, Algorithms. Addison-Wesley Publishing Company, Inc., second ed., 1988.