Scaling SDI Systems via Query Clustering and Aggregation

Xi Zhang, Liang Huai Yang, Mong Li Lee, and Wynne Hsu
School of Computing, National University of Singapore
{zhangxi,yanglh,leeml,whsu}@comp.nus.edu.sg

Abstract. XML-based Selective Dissemination of Information (SDI) systems aim to quickly deliver useful information to users based on their profiles, or subscriptions. These subscriptions are specified in the form of XML queries. This paper investigates how clustering and aggregation of user queries can help scale SDI systems by reducing the number of document-subscription matchings required. We design a new distance function to measure the similarity of query patterns, and develop a filtering technique called YFilter* that is based on YFilter. Experimental results show that the proposed approach achieves high precision and high recall while reducing the runtime requirement.

1 Introduction

Traditional Selective Dissemination of Information (SDI) systems express user profiles as keywords, so that simple string matching can be used to retrieve the relevant documents. In contrast, XML-based SDI systems express user profiles in XML query languages such as XPath [8] and XQuery [9]. A filtering engine matches incoming XML documents against the user profiles and decides which users or groups of users the documents should be directed to. While path information is able to capture the context of the user interests, the matching process is expensive.

Advanced techniques such as XFilter [1], XTrie [4], and YFilter [5] have been developed for efficient matching, to make XML-based SDI systems scalable for Internet applications. XFilter [1] treats each tree pattern as a Finite State Machine (FSM) and uses an inverted list to index the FSMs on the labels of their states, which allows simultaneous processing of XML data for multiple queries. XTrie [4] decomposes tree patterns into collections of substrings and indexes them using a trie. [3] develops a tree aggregation technique that combines multiple queries into a generalized pattern to reduce storage and to speed up document-subscription matching. YFilter [5] combines multiple tree patterns into a single Nondeterministic Finite Automaton (NFA). The construction of the NFA allows different tree patterns to share common paths. While YFilter is able to efficiently handle XPath queries with no predicates, and queries with simple value-based predicates, it requires an expensive post-processing step for queries which have predicates with path expressions, also known as nested paths, e.g. /a/b[c/d]/e.

This work examines a different approach to improving the efficiency of XML-based SDI systems. Similar queries are clustered and aggregated into a representative query. Matching of documents is performed on these generalized representative queries. The tradeoff is between the loss of precision and the time saved by the reduced number of matchings. Experimental results show that the loss in precision is around 20% to 30%, while the average response time improves by an order of magnitude.

The contributions of this paper include:
1. Define a new distance function called aggregation similarity for clustering queries.
2. Design two approaches for clustering and aggregating queries. The first method, C → A, first performs clustering of query patterns followed by aggregation. The second method, C + A, clusters and aggregates queries at the same time.
3. Develop an efficient filtering method called YFilter* for matching nested paths.

Y. Lee et al. (Eds.): DASFAA 2004, LNCS 2973, pp. 208–219, 2004. © Springer-Verlag Berlin Heidelberg 2004

2 Distance Function for Query Patterns

XML documents and queries are modelled as trees. [10] defines a query pattern tree as a rooted tree QPT = ⟨V, E⟩, where V is the vertex set and E is the edge set. Each vertex has a label whose value is in {“*”, “//”} ∪ tagSet, where tagSet is the set of all the element and attribute names in the underlying DTD. A tree pattern is similar to a QPT except that a tree pattern has a special root node labelled “/.”. A QPT can be converted to a tree pattern by adding this special root node.

[3] develops a method to determine the least upper bound aggregate tree pattern for two tree patterns. They consider position-preserving and off-position generalizations. The former captures the class of common sub-patterns that occur in the same position with respect to the root nodes of the input tree patterns, while the latter captures the class of common sub-patterns that are located in different positions with respect to the root nodes of the input tree patterns. [3] also proves that, given a set of tree patterns, the aggregation result is the same regardless of the sequence of aggregation.

We advocate that clustering and aggregating queries can reduce the number of document-subscription matchings needed in an XML-based SDI system. This requires a distance function to determine how similar two tree patterns are. Standard tree edit distance algorithms [7] cannot handle the off-position generalizations that are introduced by relative paths in XML queries, because a “//” node can be matched to zero, one or more nodes. This motivates us to design an aggregation similarity function that handles “//” and finds tree patterns that minimize information loss after aggregation. Let Subtree(u, p) denote the subtree rooted at node u of a tree pattern p. Let label(u) and Child(u, p) denote the label of node u and the set of child nodes of u, respectively.
The algorithm AggrSim in Figure 1 calls the subroutine MaxMatch, which recursively computes the maximal number of matching nodes between Subtree(u, p) and Subtree(v, q) in a bottom-up manner. This maximal matching number between each pair of subtrees of p and q is computed only once and stored in a matrix M, so that successive references to this value can be retrieved from M directly. We use a list called BMP to store the Best Matched Pairs of subtrees rooted at the child nodes of u and v. Best matched pairs are pairs with the maximal number of matching nodes. MaxMatch is called to determine the number of matching nodes between two subtrees in the BMP calculation. The length of the BMP list is min(|Child(u, p)|, |Child(v, q)|). We have

\[ MaxMatch(u, v) = \sum_{(u_i, v_j) \in BMP} MaxMatch(u_i, v_j) + IsMatch(u, v) \tag{1} \]

Algorithm AggrSim(p, q)
Input: Tree patterns p and q
Output: Aggregation similarity of p and q
  for each ui ∈ Nodes(p) and vj ∈ Nodes(q) do
    M[ui, vj] = null;
  M[uroot, vroot] = MaxMatch(uroot, vroot);
  sim = M[uroot, vroot] / √(Size(p) × Size(q));
  return sim;

Algorithm MaxMatch(u, v)
Input: u and v are nodes of p, q respectively
Output: maximal number of matching nodes in Subtree(u, p) and Subtree(v, q)
  if (M[u, v] ≠ null) then
    return M[u, v];
  else
    compute BMP for u and v;
    if ((label(u) = “//” and label(v) = “//”) or (label(u) ≠ “//” and label(v) ≠ “//”)) then
      M[u, v] = Σ(ui,vj)∈BMP M[ui, vj] + IsMatch(u, v);
    else if (label(u) ≠ “//” and label(v) = “//”) then
      Ncand1 = max {MaxMatch(u, vi) | ∀vi ∈ Child(v, q)};
      if (Σ(ui,vj)∈BMP M[ui, vj] = 0) then Ncand2 = 0;
      else Ncand2 = Σ(ui,vj)∈BMP M[ui, vj] + 1;
      Ncand3 = max {MaxMatch(uj, v) | ∀uj ∈ Child(u, p)};
      M[u, v] = max(Ncand1, Ncand2, Ncand3);
    else (label(u) = “//” and label(v) ≠ “//”)
      Ncand1 = max {MaxMatch(ui, v) | ∀ui ∈ Child(u, p)};
      if (Σ(ui,vj)∈BMP M[ui, vj] = 0) then Ncand2 = 0;
      else Ncand2 = Σ(ui,vj)∈BMP M[ui, vj] + 1;
      Ncand3 = max {MaxMatch(u, vj) | ∀vj ∈ Child(v, q)};
      M[u, v] = max(Ncand1, Ncand2, Ncand3);
    return M[u, v];

Fig. 1. Algorithm to Compute Aggregation Similarity

Fig. 2. Matching Relative Paths: (a) '//' maps to 0 node; (b) '//' maps to exactly one node; (c) '//' maps to one or more nodes

where

\[ IsMatch(u, v) = \begin{cases} 1 & \text{if } label(u) = label(v),\ label(u), label(v) \in \{\text{“*”}, \text{“//”}\} \cup tagSet, \\ & \text{or } label(u) = \text{“*”} \mid \text{“//”}, \\ & \text{or } label(v) = \text{“*”} \mid \text{“//”}, \\ 0 & \text{otherwise} \end{cases} \tag{2} \]

Equation (1) can be used to find the maximal number of matching nodes between Subtree(u, p) and Subtree(v, q) when both u and v are labelled “//”, or when neither u nor v is labelled “//”. However, when only one of the nodes is labelled “//”, we need to consider whether “//” is matched to zero, one or more nodes (see Figure 2). Without loss of generality, let label(u) ≠ “//” and label(v) = “//”.

(a) “//” maps to an empty chain. The subtree Subtree(vi, q) rooted at the child node of “//” that is most similar to Subtree(u, p) is a candidate matching pattern. The number of actual nodes matched is captured in Ncand1.
(b) “//” maps to exactly one node. The matching given in the BMP list is a candidate matching pattern. The total number of nodes matched is given by Ncand2. Note that Ncand2 = 0 if there is no matching between the children of u and v.
(c) “//” maps to one or more nodes. The subtree Subtree(uj, p) rooted at the child node of u that is most similar to Subtree(v, q) is a candidate matching pattern. The variable Ncand3 denotes the number of actual nodes matched. Note that “//” is matched along the path that yields the largest number of matching nodes.

The final number of maximal matching nodes is given by

\[ MaxMatch = \max(N_{cand1}, N_{cand2}, N_{cand3}) \tag{3} \]

Note that MaxMatch computes the maximal number of matching nodes in a “position-preserving” manner when “//” is matched to exactly one node. At the same time, MaxMatch determines the number of matching nodes in an “off-position” manner when “//” is mapped to zero nodes, or to one or more nodes, corresponding to Ncand1 and Ncand3 matching nodes respectively. Finally, MaxMatch selects the maximum of Ncand1, Ncand2 and Ncand3 to determine the matching approach. The final aggregation similarity value between tree patterns p and q is normalized by √(Size(p) × Size(q)). Note that the artificial root node “/.” is excluded from the calculation.
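The aggregation similarity described in this section can be sketched in Python as follows. This is an illustrative reading of Figure 1, not the authors' implementation: the names `Node`, `bmp_total`, `max_match` and `aggr_sim` are hypothetical, the memo dictionary plays the role of the matrix M, and the way the best matched pairs are found (a brute-force search over one-to-one pairings of children) is an assumption, since the text does not say how BMP is computed.

```python
from itertools import permutations
from math import sqrt

SLASH = "//"

class Node:
    """A query pattern tree vertex: a label plus child vertices."""
    def __init__(self, label, *children):
        self.label = label
        self.children = list(children)

def size(n):
    """Number of vertices in the subtree rooted at n."""
    return 1 + sum(size(c) for c in n.children)

def is_match(u, v):
    # Equation (2): equal labels, or either label is "*" or "//".
    wild = ("*", SLASH)
    return int(u.label == v.label or u.label in wild or v.label in wild)

def bmp_total(us, vs, memo):
    # Assumption: the best matched pairs are the one-to-one pairing of
    # children with the largest total MaxMatch (brute force, small trees).
    k = min(len(us), len(vs))
    if k == 0:
        return 0
    if len(us) <= len(vs):
        pairings = (zip(us, perm) for perm in permutations(vs, k))
    else:
        pairings = (zip(perm, vs) for perm in permutations(us, k))
    return max(sum(max_match(a, b, memo) for a, b in pairs)
               for pairs in pairings)

def max_match(u, v, memo):
    key = (id(u), id(v))
    if key in memo:                      # each pair is computed only once
        return memo[key]
    total = bmp_total(u.children, v.children, memo)
    u_slash, v_slash = u.label == SLASH, v.label == SLASH
    if u_slash == v_slash:               # both "//" or neither: Equation (1)
        result = total + is_match(u, v)
    else:
        if v_slash:
            zero = [max_match(u, vc, memo) for vc in v.children]
            many = [max_match(uc, v, memo) for uc in u.children]
        else:
            zero = [max_match(uc, v, memo) for uc in u.children]
            many = [max_match(u, vc, memo) for vc in v.children]
        n1 = max(zero, default=0)        # "//" maps to zero nodes
        n2 = total + 1 if total else 0   # "//" maps to exactly one node
        n3 = max(many, default=0)        # "//" maps to one or more nodes
        result = max(n1, n2, n3)         # Equation (3)
    memo[key] = result
    return result

def aggr_sim(p, q):
    """Aggregation similarity of two patterns given without the "/." root."""
    return max_match(p, q, {}) / sqrt(size(p) * size(q))
```

For example, comparing the pattern a/b with a//b (trees of 2 and 3 vertices) gives two matching nodes, so the similarity under this sketch is 2/√6; two identical patterns score 1.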

3 Combining Clustering and Aggregation

Next, we describe two ways to cluster and aggregate user queries. The first approach, C → A, finds clusters of similar queries first, before aggregating the queries within each cluster. User queries Q1 , Q2 , . . . , Qn are first partitioned into clusters C1 , C2 , . . . , Cm . A representative query Qci is computed for each cluster Ci . The representative queries Qc1 , Qc2 , . . . , Qcm become the input for the filtering in the SDI system.
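A minimal sketch of the C → A pipeline is given below, under several assumptions that are not stated in the text: clusters are grown agglomeratively with complete linkage (the smallest pairwise similarity between two clusters must stay at or above a threshold `s_c`), and `similarity` and `aggregate` are caller-supplied stand-ins for the aggregation similarity and for the tree aggregation of [3]. All names are hypothetical.

```python
from functools import reduce

def cluster_then_aggregate(queries, similarity, aggregate, s_c):
    """Cluster first, then fold each cluster into one representative."""
    clusters = [[q] for q in queries]

    def linkage(c1, c2):
        # Complete linkage: the weakest pairwise similarity (an assumption).
        return min(similarity(a, b) for a in c1 for b in c2)

    while len(clusters) > 1:
        best, i, j = max(
            (linkage(clusters[i], clusters[j]), i, j)
            for i in range(len(clusters))
            for j in range(i + 1, len(clusters))
        )
        if best < s_c:          # no pair of clusters is similar enough
            break
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    # Aggregation happens only after the clusters are final.
    return [reduce(aggregate, c) for c in clusters]

# Toy stand-ins: a query is a set of path strings, similarity is the
# Jaccard coefficient, and aggregation keeps the common paths.
def jaccard(a, b):
    return len(a & b) / len(a | b)

queries = [frozenset({"a/b", "a/c"}), frozenset({"a/b", "a/d"}),
           frozenset({"x/y"})]
representatives = cluster_then_aggregate(
    queries, jaccard, lambda a, b: a & b, s_c=0.3)
```

Here the similarity is always computed between original queries, which is the property that distinguishes C → A from an approach that compares intermediate aggregates.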


The second method, C + A, clusters and aggregates the user queries at the same time. Given a set S of user queries Q1, Q2, . . . , Qn, we choose the most similar pair of query patterns, Qi and Qj, to aggregate. This can be viewed as one step of hierarchical clustering. Qi and Qj are removed from S, and the aggregated query of Qi and Qj is inserted into S. This process is repeated until the stop criterion of the clustering is satisfied. Since query aggregation is carried out at the same time as clustering, we finally obtain the representative queries Qc1, Qc2, . . . , Qcm for the clusters that have been generated implicitly by the C + A approach. Recall that an important property of the tree aggregation technique in [3] is that the final result depends only on the set of tree patterns involved and is independent of the sequence of aggregation. Therefore, the quality of the clustering determines the quality of the aggregation.
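The C + A loop can be sketched as follows. The stop criterion used here (stop when the best pair falls below a threshold `s_c`) is an assumption, and `similarity` and `aggregate` are hypothetical caller-supplied stand-ins for the aggregation similarity and the tree aggregation of [3].

```python
def cluster_and_aggregate(queries, similarity, aggregate, s_c):
    """Repeatedly replace the most similar pair by its aggregate."""
    s = list(queries)
    while len(s) > 1:
        best, i, j = max(
            (similarity(s[i], s[j]), i, j)
            for i in range(len(s))
            for j in range(i + 1, len(s))
        )
        if best < s_c:          # assumed stop criterion
            break
        merged = aggregate(s[i], s[j])
        # Remove Qi and Qj from S and insert their aggregate.
        s = [q for k, q in enumerate(s) if k not in (i, j)] + [merged]
    return s

# Toy stand-ins: queries as sets of path strings, Jaccard similarity,
# aggregation by keeping the common paths.
def jaccard(a, b):
    return len(a & b) / len(a | b)

queries = [frozenset({"a/b", "a/c"}), frozenset({"a/b", "a/d"}),
           frozenset({"x/y"})]
result = cluster_and_aggregate(queries, jaccard, lambda a, b: a & b, s_c=0.3)
```

Unlike the cluster-first approach, later similarity values in this loop are computed against intermediate aggregates rather than against the original queries.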

4 NFA-Based XML Filtering

We develop a technique called YFilter* that is based on YFilter [5] to filter XML documents. YFilter focuses on a subset of XPath queries that have no predicates or only simple value-based predicates. For nested path queries, that is, queries with predicates containing path expressions, YFilter requires an expensive post-processing step. The size of the information maintained by YFilter for the post-processing step is proportional to the number of matching instances in the document. Useless matching instances of paths cannot be discarded, and the actual matching of tree patterns cannot be determined until the end of parsing. This motivated us to design YFilter* to handle nested path queries efficiently. YFilter* uses the same techniques as YFilter to decompose a tree pattern into a set of rooted paths and to construct the NFA by viewing the rooted paths as independent of each other. The main aim of YFilter* is to detect both a successful matching of a tree pattern and an unpromising partial matching of a tree pattern as early as possible.

Example. Consider the nested path query and the XML documents in Figure 3. Elements with the same tag name are numbered in preorder. For Doc 1, when YFilter* is about to finish processing b1, all its descendant nodes have already been parsed. It will only find the matching instance of Q11 at e1, but no matching instance of Q12. This Q11 matching instance can be discarded, since no matching instance of Q12 will share b1 with it and it cannot be developed into a matching instance of Q1 in Doc 1. In contrast, for Doc 2, before YFilter* finishes processing b1, it will find the matching instance of Q11 at e1 and the matching instance of Q12 at d. YFilter* will determine that these two matching instances share the same b and conclude that tree pattern Q1 has been matched by Doc 2. When YFilter* parses Doc 2 further, it will ignore the Q11 matching instance at e2, because the tree pattern Q1, from which Q11 is shredded, has already been matched.

The above example indicates that a branch point node such as b in Q1 is important in the execution of YFilter*. It is also important to locate the materialized branch point while executing YFilter* on a document. Hence, YFilter* has the following features:
1. Maintain information about the branch point and the number of relative path nodes.
2. Associate the matching instance of a shredded path with a runtime stack entry, whose pop indicates that the context of a certain branch point node is no longer valid.
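The decomposition of a tree pattern into rooted paths can be illustrated with a short Python sketch. The nested-tuple representation and the function name `rooted_paths` are assumptions made for illustration, and the order in which the paths are produced is arbitrary (so it need not coincide with the numbering Q11, Q12 used in the example).

```python
def rooted_paths(node, prefix=()):
    """Return every root-to-leaf label path of a tree pattern.

    A node is a (label, children) pair; children is a list of nodes.
    """
    label, children = node
    path = prefix + (label,)
    if not children:              # a leaf closes one rooted path
        return [list(path)]
    paths = []
    for child in children:
        paths.extend(rooted_paths(child, path))
    return paths

# The nested path query /a/b[c/d]/e as a tree: b is the branch point.
q1 = ("a", [("b", [("c", [("d", [])]), ("e", [])])])
paths = rooted_paths(q1)
# paths == [['a', 'b', 'c', 'd'], ['a', 'b', 'e']]
```

Both rooted paths share the prefix a/b, which is why a match of one path only contributes to the tree pattern when the other path is matched under the same b element.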


Fig. 3. Examples of Document Filtering

4.1 NFA Construction

YFilter* decomposes each QPT into a set of rooted paths and uses the same techniques as YFilter to construct the NFA. Figure 4 shows the NFA constructed for a query QPT2. A circle represents a state. A directed edge represents a transition, which is triggered by the element labelling that edge. Two concentric circles denote an accepting state, which is marked with the matched shredded paths it represents. Note that “*” is considered a special label that matches any element, and “//” is always interpreted as a transition marked with a special label “ε” leading to a state with a self-loop. The transition marked “ε” requires no input to trigger.

YFilter* differs from YFilter in that it collects additional information when shredding rooted paths from QPTs. To facilitate evaluation, we impose an arbitrary order on the rooted paths. A QPT Qi is decomposed into an ordered list {Qi1, Qi2, . . . , Qik}. We have prev(Qij) = Qij−1, for j = 2, 3, . . . , k, and succ(Qij) = Qij+1, for j = 1, 2, . . . , k − 1. We define the branch point of any two consecutive paths, Qij and Qij+1, for j = 1, 2, . . . , k − 1, as the common node of these two paths that is closest to their leaf nodes, denoted by BP(Qij, Qij+1). When shredding a query, YFilter* determines a few statistics for each path p with respect to prev(p) and succ(p). This information is used in the NFA execution of YFilter* to associate matching instances with runtime stack entries.
1. Count_//(p, prev(p)): the number of “//” nodes from BP(prev(p), p) to the leaf node of p.
2. Count_non-//(p, prev(p)): the number of non-“//” nodes from BP(prev(p), p) to the first “//” if it exists, or to the leaf node of p if no “//” exists.
3. Count_//(p, succ(p)): the number of “//” nodes from BP(p, succ(p)) to the leaf node of p.
4. Count_non-//(p, succ(p)): the number of non-“//” nodes from BP(p, succ(p)) to the first “//” if it exists, or to the leaf node of p if no “//” exists.
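The per-path statistics can be sketched in Python as follows. This is a hedged illustration, not the authors' code: rooted paths are assumed to be lists of labels in which “//” appears as its own node, equal prefixes are assumed to denote shared nodes, and the counts are assumed to include the branch point itself (the reading suggested by the worked example in Section 4.2). The function names are hypothetical.

```python
SLASH = "//"

def branch_point_index(p, other):
    """Index in p of the deepest node that p shares with `other`.

    Returns -1 when only the artificial root "/." is shared.
    """
    i = 0
    while i < min(len(p), len(other)) and p[i] == other[i]:
        i += 1
    return i - 1

def path_statistics(p, other):
    """Return the two counts for path p relative to a neighbouring path."""
    bp = branch_point_index(p, other)
    tail = p[max(bp, 0):]          # from the branch point down to the leaf
    count_slash = tail.count(SLASH)
    count_non_slash = 0
    for label in tail:             # stop at the first "//", if any
        if label == SLASH:
            break
        count_non_slash += 1
    return count_slash, count_non_slash

# Two "//" nodes below the branch point a, and one non-"//" node (a itself)
# before the first "//".
stats_1 = path_statistics(["a", "//", "b", "//", "c"], ["a", "d"])
# No "//" nodes, and two non-"//" nodes (b and e) from the branch point b.
stats_2 = path_statistics(["a", "b", "e"], ["a", "b", "d"])
```

Calling `path_statistics(p, prev_p)` and `path_statistics(p, succ_p)` would yield the four values listed above for a path p.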


4.2 NFA Execution

The execution of YFilter* is the same as that of YFilter, except that YFilter* associates the matching instances it finds with runtime stack entries during the execution. When an accepting state of the NFA is encountered, YFilter* does the following:

(1) Find all paths p matched at this accepting state.
(2) Backtrack the runtime stack to find the actual matching instances of p.
(3) Count the number of “//” nodes materialized in backtracking.
(4) Associate the matching instance of p with a runtime stack entry.
    (a) When Count_//(p, succ(p)) “//” nodes have been counted, determine the stack entry r1, which was created when BP(p, succ(p)) was encountered, by taking Count_non-//(p, succ(p)) into account. Associate the current matching instance with r1.
    (b) When Count_//(p, prev(p)) “//” nodes have been counted, determine the stack entry r2, which was created when BP(p, prev(p)) was encountered, by taking Count_non-//(p, prev(p)) into account. Associate the current matching instance with r2.
(5) Check whether this matching instance of p can be used to update the matching status of its corresponding QPT.
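The part of the execution that YFilter* shares with YFilter, a stack of active NFA state sets driven by start and end element events, can be sketched as below. This is a simplified illustration under stated assumptions: paths are label lists with “//” as a separate step, the class and function names are hypothetical, and the YFilter*-specific steps (backtracking and associating instances with stack entries) are deliberately left out.

```python
class PathNFA:
    """A shared-prefix NFA over rooted paths (illustrative only)."""
    def __init__(self):
        self.trans = [{}]      # state -> {label: next state}
        self.eps = [None]      # state -> target of its ε-transition
        self.loop = [False]    # True for the self-loop state behind "//"
        self.accept = [set()]  # state -> names of paths accepted there

    def _new_state(self):
        self.trans.append({})
        self.eps.append(None)
        self.loop.append(False)
        self.accept.append(set())
        return len(self.trans) - 1

    def add_path(self, name, steps):
        s = 0
        for step in steps:
            if step == "//":   # ε-transition to a state with a self-loop
                if self.eps[s] is None:
                    d = self._new_state()
                    self.loop[d] = True
                    self.eps[s] = d
                s = self.eps[s]
            else:              # common prefixes reuse existing states
                if step not in self.trans[s]:
                    self.trans[s][step] = self._new_state()
                s = self.trans[s][step]
        self.accept[s].add(name)

    def closure(self, states):
        todo, seen = list(states), set(states)
        while todo:
            t = self.eps[todo.pop()]
            if t is not None and t not in seen:
                seen.add(t)
                todo.append(t)
        return seen

def run(nfa, events):
    """Feed ("start", tag) / ("end", tag) events; return matched paths."""
    matched = set()
    stack = [nfa.closure({0})]
    for kind, tag in events:
        if kind == "start":
            nxt = set()
            for s in stack[-1]:
                for label in (tag, "*"):
                    if label in nfa.trans[s]:
                        nxt.add(nfa.trans[s][label])
                if nfa.loop[s]:
                    nxt.add(s)
            nxt = nfa.closure(nxt)
            for s in nxt:
                matched |= nfa.accept[s]
            stack.append(nxt)      # one stack entry per open element
        else:
            stack.pop()            # leaving the element restores the context
    return matched

nfa = PathNFA()
nfa.add_path("P1", ["a", "//", "c"])
nfa.add_path("P2", ["a", "b", "d"])
doc = [("start", "a"), ("start", "b"), ("start", "c"),
       ("end", "c"), ("end", "b"), ("end", "a")]
hits = run(nfa, doc)
```

Because each stack entry corresponds to one open element, popping an entry is the natural moment at which anything attached to that element's context can be discarded.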

A QPT is said to be matched when its matching status is updated by its last shredded path. Example. Suppose the NFA in Figure 4 is executed on Doc 2 in Figure 3. Figure 5 shows the various snapshots of the runtime stack where each stack entry contains some states of the NFA. The underlined state indicates a state with a self-loop.

Fig. 4. NFA of a Query

The snapshot in Figure 5(a) is captured when the element c is read. State 6 in the topmost stack entry is an accepting state of Q21. When backtracking from state 6 in the runtime stack, we can either take the path 6-4-3-3-2-1-0 or the path 6-5-4-3-2-1-0, corresponding to two matching instances of Q21, namely /a1/a2/∗/∗/b4/c and /a1/a2/∗/b3/∗/c, respectively.


Fig. 5. Execution Episode for NFA in Figure 4

Consider the path 6-4-3-3-2-1-0. When backtracking from 6 to 4, a “//” node is materialized, and from 3 to 2, the second “//” node is materialized. Since Count_//(Q21, succ(Q21)) = 2, there are only two “//” nodes from BP(Q21, succ(Q21)) to Q21’s leaf. In addition, we have Count_non-//(Q21, succ(Q21)) = 1, which indicates that there is only one non-“//” node from the BP node to the first “//” node on the path towards Q21’s leaf. Therefore, when the second “//” node is materialized at stack entry r, we step one entry back to locate entry r1. YFilter* associates 6-4-3-3-2-1-0 with r1. Before r1 is popped, any matching instance of Q22 is guaranteed to branch at a1, and can therefore be used to advance the matching status of QPT2. After r1 is popped, no matching instance of Q22 will branch at a1. Hence the matching instance of Q21 should be discarded.

The snapshot in Figure 5(b) is captured when element e3 is encountered. A matching instance of Q23 is found, but no matching instance of Q22 has been found yet. This matching instance of Q23 should be stored for future use. Consider the information related to BP(prev(Q23), Q23), that is, Count_//(Q23, prev(Q23)) = 0 and Count_non-//(Q23, prev(Q23)) = 2. There are no “//” nodes and only two non-“//” nodes from the BP node to Q23’s leaf. Thus stack entry r2 is the second entry from the top, and YFilter* associates this matching instance with it. Later, when d2 is encountered, as shown in Figure 5(c), a matching instance of Q22 is found. The stored matching instance of Q23 can now be used to update the matching status of QPT2. QPT2 is matched because its matching status is updated by its last shredded path.

5 Performance Study

We carry out experiments on a 1.6 GHz Pentium IV machine with 256 MB RAM running JVM 1.4.0 on Windows 2000 to show the effectiveness and scalability of the proposed methods in filtering information. We implement the two methods C → A and C + A in Java, together with a baseline method called A that randomly chooses QPTs to aggregate. We also compare YFilter* and YFilter in the handling of nested path queries. We use the IBM XML Generator tool [6] to generate XML documents based on the auction DTD from the XMark benchmark [2]. The auction DTD contains a recursive structure that can be nested to produce XML documents with an arbitrary number of levels.

The objective of clustering and aggregating user subscriptions is to capture the common interest shared by a group of users. QPTs represent the preferences of users, and tend to be biased towards one aspect of the DTD for users with similar interests. By removing the root node of the auction DTD, we obtain subtrees, each of which represents a different user group interest. QPTs are generated based on these subtrees.

Table 1 summarizes the parameters used in the experiments, together with their default values and the ranges of values tested. C indicates the number of user group interests, that is, the number of subtrees used in the QPT generation. QPTs from the same subtree have a relatively higher similarity than QPTs from different subtrees. Sq denotes the minimal similarity between QPTs from the same subtree. The QPT dataset follows a Zipf distribution. Sc specifies the clustering granularity: it denotes the minimal similarity within a cluster. In order to measure the quality of the query results, we compare them with the set of actual query results, which can be obtained by executing the original QPTs individually on the XML dataset.

Table 1. Parameters

Parameter  Description                                    Default  Range
Nd         Number of XML documents                        100      100 ∼ 1000
Nq         Number of QPTs                                 1000     1000 ∼ 4000
C          Number of subtrees                             5        1 ∼ 8
Z          Zipf distribution of QPTs                      0.8      0.0 ∼ 1.0
Sq         Minimal similarity of QPTs from same subtree   0.4      -
Sc         Minimal similarity of result cluster           0.8      0.0 ∼ 1.0

5.1 Scalability

This experiment shows how the additional step of clustering and aggregating user subscriptions is able to capture the common interest shared by a group of users, thus allowing the SDI system to deliver the relevant XML documents to this group of users quickly. Figure 6 shows the response time of the system for 1000 QPTs. When no clustering and/or aggregation is used (no C/A), the system has to filter all the XML documents against all the QPTs. Hence, the response time increases rapidly with the increase in the number of XML documents. Although aggregation alone (A) is able to reduce the number of QPTs against which the documents are filtered, its filtering quality is poor, as we will see later. Both C + A and C → A scale well when the number of XML documents increases since the documents are filtered against a small number of representative QPTs obtained from the clustering and aggregation process. C → A outperforms C + A because it utilizes the hierarchical clustering method. It is clear that the additional time incurred by clustering and/or aggregation is compensated very early.

5.2 Sensitivity Experiments

Next, we examine how the performance of the system is affected by the clustering granularity, the diversity of user preferences and the distribution of QPTs. The performance metric used is precision, which is the ratio of the number of documents retrieved by the original set of queries to the number of documents retrieved by the set of representative queries.
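As a small worked illustration of this metric, the sketch below computes precision and recall from two sets of document identifiers. The function names and the sample data are hypothetical; the sketch assumes the metric is taken over whole result sets, and it counts only documents that both query sets retrieve, which coincides with the ratio defined above whenever the representative queries retrieve everything the original queries do.

```python
def precision(original_hits, representative_hits):
    """Share of the delivered documents that the original queries want."""
    if not representative_hits:
        return 1.0
    return len(original_hits & representative_hits) / len(representative_hits)

def recall(original_hits, representative_hits):
    """Share of the wanted documents that are actually delivered."""
    if not original_hits:
        return 1.0
    return len(original_hits & representative_hits) / len(original_hits)

# Hypothetical example: the generalized queries deliver 10 documents,
# 7 of which the original queries would also have retrieved.
wanted = {1, 2, 3, 4, 5, 6, 7}
delivered = set(range(1, 11))
p = precision(wanted, delivered)   # 7 / 10
r = recall(wanted, delivered)      # 7 / 7
```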

Fig. 6. Scalability (response time in seconds vs. Nd; curves: C → A, C + A, A, No C/A)

Fig. 7. Precision vs. Cluster Granularity (precision in % vs. Sc; curves: C → A, C + A, A)

Clustering Granularity. The clustering granularity determines the number of representative QPTs obtained. It indicates the minimal similarity of each cluster. Figures 7 and 8 show the precision and the response time when Sc is varied. For C → A, the QPTs in each result cluster have higher similarity when Sc increases. Aggregation of more similar QPTs is likely to yield more informative aggregation results, which in turn leads to better filtering results. Similarly for C + A, each iteration of C + A reduces the total number of clusters by 1. As the Sc threshold for merging clusters increases, C + A is likely to terminate while many clusters still remain. With more clusters, and fewer QPTs of higher similarity within each cluster, we obtain more informative aggregation results.

Recall that the minimal similarity between QPTs from the same subtree is 0.4. When Sc is below 0.4, the precision of all three methods is low, since QPTs from different subtrees are likely to be clustered together. When the value of Sc is between 0.4 and 0.8, C → A outperforms both C + A and A, since it is based on the QPTs’ original similarity. C + A is based on the similarity between the temporary aggregation results, that is, the aggregations of QPTs that are already clustered, and the QPTs which have not yet been clustered. A temporary aggregation result is only an approximation of all the QPTs already in that cluster. Hence, the similarity computation is less accurate than that in C → A. However, C + A still performs better than A, which has no clustering at all.

When Sc is high (0.9 ∼ 1.0), both C → A and C + A achieve very high precision. Further, the precision of C → A increases gradually with the increase of Sc, while the precision of C + A or A increases suddenly when Sc reaches 0.9. It turns out that when Sc increases to 0.9, the number of result clusters, in addition to the quality of aggregation, starts to dominate the precision. In fact, in C + A, the number of result clusters increases sixfold when Sc increases from 0.8 to 0.9, while the number of result clusters only doubles in C → A. This also shows that C → A has a more stable performance than C + A. Clearly, there is a trade-off between the clustering granularity and system performance. In the following experiments, we set Sc in the range [0.4, 0.8] so that the quality of aggregation dominates the performance of the system and the number of representative QPTs generated is reasonable.

Diversity of User Preference. The number of subtrees used to generate the QPTs determines how diverse the user preferences are. We vary the number of subtrees (C) to study the influence of user preference on precision. In order to have a stable filtering time, we fix the number of clusters at 50.

Fig. 8. Time vs. Cluster Granularity (time in seconds vs. Sc; curves: C → A, C + A, A)

Fig. 9. Query Diversity (precision in % vs. C; curves: C → A, C + A, A)

Figure 9 shows that the precision of all methods decreases when C increases. This is because when the user preference becomes more diverse, the number of QPTs within each cluster is reduced, and QPTs from different subtrees may be aggregated, leading to a more general representative query, and hence lower precision.

Distribution of QPT. Next, we examine how the distribution of QPTs affects the performance. We use both uniform and Zipf distributions to generate QPTs. The Zipf parameter Z determines the skewness of the query distribution. In order to show the improvement in the performance of the methods involving clustering, we compute the precision gain of C → A and C + A over A:

\[ \text{Precision Gain} = \frac{\text{Precision of } (C \to A) \text{ or } (C + A)}{\text{Precision of } A} \times 100\% \tag{4} \]

We observe from Figure 10 that the precision gain increases when the distribution of QPTs becomes more skewed. This is because there are fewer distinct QPTs, leading to more informative aggregation results. There is no difference in precision gain when Sc is very low, because the effect of skewness in the QPTs is overwhelmed by the generality of the aggregation.

Fig. 10. QPT Distribution (precision gain in % vs. Sc; curves: uniform, Z=0.65, Z=0.75, Z=0.85, Z=0.95)

Fig. 11. YFilter* vs. YFilter (time in seconds vs. Nd; curves: YFilter*, YFilter)

5.3 YFilter* vs. YFilter

Finally, we compare the performance of YFilter* and YFilter in the handling of nested path queries. We implement the post-processing step for YFilter as described in [5] to handle nested path queries. Figure 11 shows that YFilter* outperforms YFilter by a factor of 2 on average.

6 Conclusion

In this work, we have studied how clustering and aggregating user queries can help increase the scalability of SDI systems by reducing the number of document-subscription matchings needed. We have designed an aggregation similarity function for clustering tree patterns involving wildcards and relative paths. We have developed YFilter*, which enhances YFilter’s ability to handle nested path queries. Experimental results indicate that the proposed techniques are able to reduce the response time of SDI systems while achieving 100% recall with a 20% to 30% loss in precision.

References

1. M. Altinel and M. J. Franklin. Efficient filtering of XML documents for selective dissemination of information. In 26th Int. Conference on Very Large Data Bases, 2000.
2. R. Busse and M. Carey. Benchmark DTD for XMark, an XML Benchmark project. http://monetdb.cwi.nl/xml/downloads.html, 2002.
3. C.-Y. Chan, W. Fan, P. Felber, M. Garofalakis and R. Rastogi. Tree pattern aggregation for scalable XML data dissemination. In 28th Int. Conference on Very Large Data Bases, 2002.
4. C.-Y. Chan, P. Felber, M. Garofalakis and R. Rastogi. Efficient filtering of XML documents with XPath expressions. In 18th IEEE Int. Conf. on Data Engineering, 2002.
5. Y. Diao and M. J. Franklin. High-Performance XML Filtering: An Overview of YFilter. In IEEE Data Engineering Bulletin, 26(1):41–48, 2003.
6. A. L. Diaz and D. Lovell. XML Generator. http://www.alphaworks.ibm.com/tech/xmlgenerator, 1999.
7. D. Shasha and K. Zhang. Pattern Matching in Strings, Trees and Arrays. Oxford University Press, 1995.
8. W3C. XML Path Language (XPath) Version 1.0. http://www.w3.org/TR/xpath, November 1999.
9. W3C. XQuery 1.0: An XML Query Language. http://www.w3.org/TR/xquery, May 2003.
10. L. H. Yang, M. L. Lee, W. Hsu and S. Acharya. Mining frequent query patterns from XML queries. In 8th Int. Symposium on Database Systems for Advanced Applications, 2003.
