Eager Evaluation of Partial Tree-Pattern Queries on XML Streams
Dimitri Theodoratos and Xiaoying Wu
Department of Computer Science, New Jersey Institute of Technology, USA
{dth,xw43}@njit.edu
Abstract. Current streaming applications have stringent requirements on query response time and memory consumption because of the large (possibly unbounded) size of the data they handle. Known query evaluation algorithms on streaming XML documents focus almost exclusively on tree-pattern queries (TPQs). Recently, however, requirements for flexible querying of XML data have motivated the introduction of query languages that are more general and flexible than TPQs. These languages are not supported by known algorithms. In this paper, we consider a language which generalizes and strictly contains TPQs. Queries in this language can be represented as dags enhanced with constraints. We exploit this representation to design an original polynomial-time streaming algorithm for these queries. Our algorithm avoids storing and processing matches of the query dag that do not contribute to new solutions (redundant matches). Its key feature is that it applies an eager evaluation strategy to quickly determine when node matches should be returned as solutions to the user and also to proactively detect redundant matches. We experimentally test its time and space performance. The results show the superiority of the eager algorithm over the only known algorithm for this class of queries, which is a lazy algorithm.
1 Introduction
Current streaming applications require efficient algorithms for processing complex and ad hoc queries over large volumes of XML streams. Unfortunately, existing algorithms on XML streams focus almost exclusively on tree-pattern queries (TPQs). A distinguishing restrictive characteristic of TPQs is that they impose a total order on the nodes in every path of the query pattern. We consider a query language for XML, called the partial tree-pattern query (PTPQ) language. The PTPQ language generalizes and strictly contains TPQs. PTPQs are not restricted by a total order on the nodes in a path of the query pattern, since they can constrain a number of (possibly unrelated) nodes to lie on the same path (same-path constraint). They are flexible enough to allow, at one extreme, keyword-style queries with no structure and, at the other, fully specified TPQs. In this paper, we address the problem of designing an efficient streaming algorithm for PTPQs. Our focus is on demanding streaming applications that require low response time and memory consumption.

X. Zhou et al. (Eds.): DASFAA 2009, LNCS 5463, pp. 241–246, 2009. © Springer-Verlag Berlin Heidelberg 2009
2 Query Language
A partial tree-pattern query (PTPQ) specifies a pattern which partially determines a tree. PTPQs comprise nodes and child and descendant relationships among them. Their nodes are grouped into disjoint sets called partial paths (PPs). PTPQs are embedded to XML trees. The nodes of a partial path are embedded to nodes on the same XML tree path. However, unlike paths in TPQs, the child and descendant relationships in partial paths do not necessarily form a total order. This is the reason for qualifying these paths as partial. PTPQs also comprise node sharing expressions. A node sharing expression indicates that two nodes from different partial paths are to be embedded to the same XML tree node. That is, the image of these two nodes is the same – shared – node in the XML tree.
We represent PTPQs as node and edge labeled directed graphs. The graph of a PTPQ can be constructed by collapsing every pair of nodes that participate in a node sharing expression into a single node. The nodes in the graph are annotated by the (possibly many) PPs they belong to. The output node of a PTPQ is shown as a black circle. Figure 1 shows the query graph of PTPQ Q1. For simplicity of presentation, the annotations of some nodes might be omitted, and it is assumed that a node inherits all the annotating PPs of its descendant nodes. For example, in the graph of Figure 1, node C is assumed to be annotated by the PPs p2 and p3 inherited from its descendant nodes D and F.
The answer of a query on an XML tree is a set of results, where each result is the image of the output node in a match of the query on the XML tree. The formal definitions of a PTPQ and its embedding to an XML tree, as well as the expressiveness results on PTPQs, can be found in the full version of the paper [1].
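The inheritance rule for PP annotations can be illustrated with a small sketch. The class `QueryNode`, the helpers `add_edge` and `inherited_paths`, and the three-node fragment below are hypothetical names and data introduced only for illustration. Figure 1 is not reproduced here, so the edge types between C, D and F are an assumption.

```python
from dataclasses import dataclass, field


@dataclass
class QueryNode:
    label: str
    partial_paths: set = field(default_factory=set)  # explicit PP annotations
    children: list = field(default_factory=list)     # (axis, QueryNode) pairs


def add_edge(parent, child, axis):
    """Add a child ('/') or descendant ('//') edge to the query graph."""
    assert axis in ("/", "//")
    parent.children.append((axis, child))


def inherited_paths(node):
    """Explicit PPs of `node` plus those of all its descendants in the dag."""
    result = set(node.partial_paths)
    for _axis, child in node.children:
        result |= inherited_paths(child)
    return result


# Hypothetical fragment: C has descendants D (on p2) and F (on p3).
# The '//' axes are assumed; the paper's figure may use different edges.
c = QueryNode("C")
d = QueryNode("D", {"p2"})
f = QueryNode("F", {"p3"})
add_edge(c, d, "//")
add_edge(c, f, "//")
print(sorted(inherited_paths(c)))  # ['p2', 'p3']
```

Node C carries no explicit annotation in this fragment, yet it ends up on both p2 and p3, which mirrors the convention stated for Figure 1.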
3 Eager Evaluation Algorithm
In this section, we describe our streaming evaluation algorithm for PTPQs. The algorithm is called Eager Partial TPQ Streaming evaluation on XML (EagerPSX). Let Q be the input query to be evaluated on a stream of events for an XML tree T. Algorithm EagerPSX is event-driven: as events arrive, event handlers (the procedures startEval and endEval) are called on a sequence of query nodes whose labels match the label of the tree node under consideration. Algorithm EagerPSX is stack-based. With every query node X in Q, it associates a stack S_X.
3.1 Open Event Handler
Procedure startEval is invoked every time an open event for a tree node x arrives. Let X be a query node whose label matches that of x. Procedure startEval checks whether x qualifies for being pushed on stack S_X. Node x qualifies if either X is the root of Q, or the stacks of all the parent nodes of X in Q are nonempty and, for each parent Y of X, the top entry of stack S_Y and x satisfy the structural relationship ('//' or '/') between X and Y in Q. If x qualifies, it is called an ancestor match of X.
Avoiding redundant matches. Because the answer of a query comprises only the embeddings of the output node of the query, we might not need to identify all the matches
of the query pattern when computing the answer of the query. Whenever a match of a query node is found, other matches of the same node that do not contribute to a possible new match for the output node are redundant and can be ignored. A match of a query node X can be redundant in two cases.
In the first case, X is a predicate node (a non-ancestor node of the output node of Q). Consider, for instance, evaluating the query Q1 of Figure 1(b) on the XML tree T1 of Figure 1(a). The nodes a1, b1, e1 and f1, which are matches for the predicate nodes A, B, E and F, respectively, contribute to the match d1 of the output node D. The nodes a2, ..., an, b2, ..., bn, e2, ..., en, f2, ..., fn, which are also matches of the predicate nodes, can be ignored since they all contribute to the same match d1 of the output node. Note that these nodes correspond to O(n^4) embeddings of the query with the same match for the output node. Avoiding their computation saves substantial time and space. By avoiding redundant predicate matches, our evaluation algorithm exploits the existential semantics of queries during evaluation.
In the second case, X is a backbone node (an ancestor node of the output node of Q). Consider again the example of Figure 1. Backbone node C has n matches {c1, ..., cn}. Note that before nodes c2, ..., cn are read, both r and c1 have already satisfied their predicates. Therefore, any match of the output node D that is a descendant of c1 (e.g., d1) can be identified as a solution and thus should be returned to the user right away. The nodes {c2, ..., cn} need not be stored. Storing these nodes unnecessarily delays the output of query solutions and wastes time and memory space. It is important to note that redundant backbone node matches contribute to a number of pattern matches which in the worst case can be exponential in the size of the query.

Fig. 1. (a) XML Tree, (b) Query Q1
Fig. 2. (a) XML Tree, (b) Query Q2, (c) Snapshots of stacks
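The qualification test that startEval applies on an open event (Section 3.1) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `TreeNode`, `is_ancestor_match`, the `level` field and the dictionary-based `parents` and `stacks` are hypothetical, and the sketch assumes that each stack holds only currently open tree nodes, so that depth alone decides the '/' and '//' relationships.

```python
from dataclasses import dataclass


@dataclass
class TreeNode:
    label: str
    level: int  # depth in the XML tree; the root has level 0


def is_ancestor_match(x, query_node, parents, stacks, root):
    """Return True if tree node x may be pushed on the stack of query_node.

    parents maps a query node to a list of (parent, axis) pairs, with
    axis '/' (child) or '//' (descendant). stacks maps a query node to a
    list of currently open tree nodes, the last element being the top.
    """
    if query_node == root:
        return True
    for parent, axis in parents[query_node]:
        stack = stacks[parent]
        if not stack:
            return False                  # a parent has no open match
        top = stack[-1]
        if axis == "/" and top.level != x.level - 1:
            return False                  # top entry is not the parent of x
        if axis == "//" and top.level >= x.level:
            return False                  # top entry is not a proper ancestor
    return True


# Hypothetical query with edges Z/V and Z//W, and one open match of Z.
parents = {"V": [("Z", "/")], "W": [("Z", "//")]}
stacks = {"Z": [TreeNode("z", 1)], "V": [], "W": []}
print(is_ancestor_match(TreeNode("v", 2), "V", parents, stacks, root="Z"))  # True
print(is_ancestor_match(TreeNode("v", 3), "V", parents, stacks, root="Z"))  # False
print(is_ancestor_match(TreeNode("w", 3), "W", parents, stacks, root="Z"))  # True
```

Only the top entry of each parent stack is inspected, as in the description above. The deepest open match of the parent query node is the only one that can be the tree parent of x.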
Our algorithm exploits the previous observations using the concept of redundant match of a query node. Redundant matches are neither stored nor processed by our algorithm. Note that most previous streaming algorithms either do not handle redundant matches [2,3] or avoid redundant matches for TPQs only [4].
Traversing the query dags. When node X is a sink node of query Q, Procedure startEval examines whether there are matches that become solutions by traversing two sub-dags G_P and G_B, in that order. Dags G_P and G_B are sub-dags of Q rooted at an ancestor node of X that consist of predicate nodes and backbone nodes, respectively. Procedure startEval traverses dag G_P bottom-up and evaluates, for each query node, the predicate matches encoded in the stacks. Afterwards, it traverses dag G_B top-down. It examines for each node its matches encoded in the stack, checking whether any candidate outputs (associated with the matches) become solutions and can
be returned to the user. Redundant matches are detected and pruned during each traversal. A traversal terminates when either there are no more matches to examine, or some match that has been examined before is encountered.
Example. Consider evaluating query Q2 of Figure 2(b) on the XML tree of Figure 2(a). When the start event of v4 is read, procedure startEval traverses the sub-dag (a path in this case) Z/V starting with V. Subsequently, startEval goes up to Z. Then, it evaluates the predicates for the entry z3 in stack S_Z. Procedure startEval ends its traversal at z1. Figure 2(c) shows the snapshot of the query stacks at this time. Procedure startEval proceeds to traverse the sub-dag (a path in this case) Z//W starting with Z. It terminates its traversal on W since stack S_W is empty. Later, when u6 is read, startEval traverses the sub-dags Y//U and Y//W, in that order. As a result, node w5 is found to be a solution and is returned to the user. Note that when w7 is read, node w7 is returned as a solution right away.
3.2 Close Event Handler
When a close event for a tree node x arrives, for each query node X whose label matches that of x, Procedure endEval is invoked to pop the entry of x from S_X and to check whether x is a candidate match of X. Node x is called a candidate match of X if x is the image of X under an embedding of the sub-dag rooted at X to the subtree rooted at x in T. If this is the case and X is a backbone node, each candidate output (a candidate match of the output node) stored in the entry of x is propagated to an ancestor of x in a stack if X is not the root of Q, and is returned to the user otherwise. If x is not a candidate match of X, the list of candidate outputs stored in the entry of x is either propagated to an ancestor of x in a stack or discarded, depending on whether there exists a path in the stacks which consists of nodes that are candidate matches of the query nodes.
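The close-event handling for backbone nodes can be sketched in a deliberately simplified form. The sketch assumes a single backbone path in which every query node has at most one parent, and it always discards the candidate outputs of a non-candidate match, whereas the procedure described above may instead propagate them when another path of candidate matches exists in the stacks. The names `Entry`, `end_eval`, `parent_of` and `emit` are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Entry:
    node: str                     # identifier of the matched tree node
    is_candidate: bool = False    # set once the sub-dag below has matched
    outputs: list = field(default_factory=list)  # candidate outputs held here


def end_eval(query_node, stacks, parent_of, emit):
    """Handle the close event of the tree node on top of stacks[query_node]."""
    entry = stacks[query_node].pop()
    if not entry.is_candidate:
        return                    # simplification: drop its candidate outputs
    parent = parent_of.get(query_node)
    if parent is None:            # query root: candidate outputs are solutions
        for out in entry.outputs:
            emit(out)
    else:                         # hand the outputs to the enclosing match
        stacks[parent][-1].outputs.extend(entry.outputs)


# Hypothetical backbone path R//C with candidate output d1 held by c1.
stacks = {"R": [Entry("r", True)], "C": [Entry("c1", True, ["d1"])]}
parent_of = {"C": "R"}
solutions = []
end_eval("C", stacks, parent_of, solutions.append)  # close event of c1
end_eval("R", stacks, parent_of, solutions.append)  # close event of r
print(solutions)  # ['d1']
```

In the eager algorithm many solutions have already been returned by startEval before the close events arrive, so this path handles only the candidate outputs that could not be resolved earlier.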
3.3 Analysis
We can show that Algorithm EagerPSX correctly evaluates a query Q on a streaming XML document T. Under the reasonable assumption that the size of Q is bounded by a constant, Algorithm EagerPSX uses O(|T|) space and O(|T| × H) time, where |T| and H denote the size and the height of T, respectively. The detailed analysis can be found in [1].
4 Experimental Evaluation
We have implemented Algorithm EagerPSX in order to experimentally study its execution time, response time, and memory usage. We compare EagerPSX with PSX [5] and Xaos [2]. Algorithm PSX is the only known streaming algorithm for PTPQs and it is a lazy algorithm. Algorithm Xaos supports TPQs extended with reverse axes. All three algorithms were implemented in Java. The experiments were conducted on an Intel Core 2 CPU 2.13 GHz processor with 2GB memory running JVM 1.6.0 in Windows XP Professional 2002. We used three datasets: a benchmark dataset, a real
dataset, and a synthetic dataset. The synthetic dataset is generated by IBM's XML Generator (http://www.alphaworks.ibm.com/tech/xmlgenerator) with NumberLevels = 8 and MaxRepeats = 4. It has a size of 20.3MB and includes highly recursive structures. In the interest of space, we report below only the results on the synthetic dataset. Our experiments on the other two datasets confirm the results presented here. Additional experimental results can be found in [1].
On each of the three datasets, we tested 5 PTPQs ranging from simple TPQs to complex dags. Both EagerPSX and PSX support all five queries, but Xaos supports only the first three.
The execution time consists of the data and query parsing time and the query evaluation time. Figure 3(a) shows the results. As we can see, PSX has the best time performance, and in most cases it outperforms Xaos by at least one order of magnitude. EagerPSX uses slightly more time than PSX, due to the overhead incurred by eagerly traversing the query dags.
The query response time is the time between issuing the query and receiving the first solution. Figure 3(b) shows the results. As we can see, EagerPSX gives the best query response time for both simple and complex queries. Compared to PSX and Xaos, EagerPSX shortens the response time by orders of magnitude. EagerPSX starts to deliver query solutions almost immediately after a query is posed.

[Figure 3(a): bar chart of execution time (seconds) for queries SQ1 to SQ5; series: EagerPSX, PSX, XAOS]

(b) Time to 1st solution (seconds)
Query   EagerPSX   PSX    XAOS
SQ1     0.02       3.83   493.8
SQ2     0.01       4.6    279.5
SQ3     0.012      4.64   *
SQ4     0.09       6.43   N/A
SQ5     0.015      4.19   N/A

'*' denotes an execution that didn't finish within 7 hours; 'N/A' denotes incapacity of the algorithm to support the query.
Fig. 3. Query execution time and response time on synthetic dataset

[Figure 4(a): bar chart of runtime memory usage (MB) for queries SQ1 to SQ5; series: EagerPSX, PSX, XAOS]

(b) Max number of stored candidate outputs
Query   EagerPSX   PSX      XAOS
SQ1     592        63864    94696
SQ2     0          198458   198458
SQ3     2050       90068    *
SQ4     4783       113696   N/A
SQ5     519        7271     N/A

Fig. 4. Memory usage on synthetic dataset
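The claim that EagerPSX shortens the response time by orders of magnitude can be checked directly against the numbers of Fig. 3(b). The short script below copies those values and computes the ratio of each baseline's time to first solution over that of EagerPSX. The dictionary layout and the helper `speedup` are illustrative choices, and entries marked '*' or 'N/A' in the figure are represented as `None`.

```python
import math

# Time to first solution in seconds, copied from Fig. 3(b).
# None marks entries without a figure: '*' (timeout) or 'N/A' (unsupported).
first_solution = {
    "SQ1": {"EagerPSX": 0.02,  "PSX": 3.83, "Xaos": 493.8},
    "SQ2": {"EagerPSX": 0.01,  "PSX": 4.6,  "Xaos": 279.5},
    "SQ3": {"EagerPSX": 0.012, "PSX": 4.64, "Xaos": None},
    "SQ4": {"EagerPSX": 0.09,  "PSX": 6.43, "Xaos": None},
    "SQ5": {"EagerPSX": 0.015, "PSX": 4.19, "Xaos": None},
}


def speedup(query, baseline):
    """Ratio of the baseline's response time to that of EagerPSX."""
    t = first_solution[query][baseline]
    if t is None:
        return None
    return t / first_solution[query]["EagerPSX"]


for query in first_solution:
    for baseline in ("PSX", "Xaos"):
        s = speedup(query, baseline)
        if s is not None:
            magnitude = math.floor(math.log10(s))
            print(f"{query} vs {baseline}: {s:.1f}x (order of magnitude {magnitude})")
```

On these figures every available ratio exceeds 70, so the gap is between one and two orders of magnitude against PSX and larger against Xaos on the two queries it completes.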
Figure 4 shows the memory consumption of the three algorithms. As we can see, the memory usage of EagerPSX is stable for both simple and complex queries. Figure 4(b) shows the maximal number of stored candidate outputs of the three algorithms. Among the three algorithms, EagerPSX always stores the lowest number of candidate outputs. This is expected, since EagerPSX uses an eager evaluation strategy that allows query solutions to be returned as soon as possible.
References
1. http://web.njit.edu/~xw43/paper/eagerTech.pdf
2. Barton, C., Charles, P., Goyal, D., Raghavachari, M., Fontoura, M., Josifovski, V.: Streaming XPath processing with forward and backward axes. In: ICDE (2003)
3. Chen, Y., Davidson, S.B., Zheng, Y.: An efficient XPath query processor for XML streams. In: ICDE (2006)
4. Gou, G., Chirkova, R.: Efficient algorithms for evaluating XPath over streams. In: SIGMOD (2007)
5. Wu, X., Theodoratos, D.: Evaluating partial tree-pattern queries on XML streams. In: CIKM (2008)