Efficient Holistic Twig Joins in Leaf-to-Root Combining with Root-to-Leaf Way
Guoliang Li, Jianhua Feng, Yong Zhang, Lizhu Zhou
Department of Computer Science and Technology, Tsinghua University, Beijing 100084, China
{liguoliang;fengjh;dcszlz}@tsinghua.edu.cn
Abstract. Finding all the occurrences of a twig pattern on multiple elements in an XML document is a core operation for efficient evaluation of XML queries. The holistic twig join algorithms TwigStack and TSGeneric have been recognized as optimal when the twig pattern involves only A-D (ancestor-descendant) relationships, while iTwigJoin can be optimal for certain twig patterns that contain only A-D or only P-C (parent-child) relationships. However, existing algorithms perform unnecessary computations, and their CPU cost can be further reduced; this paper mainly addresses that problem. We first propose three optimization rules that avoid those unnecessary computations, and then present two algorithms that incorporate these rules to answer twig patterns efficiently in a leaf-to-root way combined with a root-to-leaf way. Experimental results on various datasets indicate that our algorithms perform significantly better than existing proposals.
1 Introduction
XML is emerging as a de facto standard for information exchange over the Internet. Although XML documents can have rather complex internal structures, they can be modeled as rooted, ordered, labeled trees. Queries in XML query languages (e.g., XPath [BBC+02], XQuery [BCF+02]) typically specify patterns of selection predicates on multiple elements that have some specified structural relationships. For example, a query may retrieve all paragraphs satisfying the XPath expression //section[//title//keyword]//paragraph[//figure]. Such a query can be represented as a node-labeled twig pattern (or a small tree) with elements and string values as node labels [BKS02].
Finding all occurrences of a twig pattern is a core operation in XML query processing [FK99, STZ+99, TVB+02]. A typical approach is to first decompose the pattern into a set of binary structural relationships (P-C or A-D) between pairs of nodes, then match each of the binary structural relationships against the XML database, and finally stitch together the results of those basic matches [ZND+01, LM01, AJK+02, CVZ+02, JLW03, MHH06]. The main disadvantage of such a decomposition-based approach is that intermediate result sizes can become very large, even if the input and the final result sizes are much more manageable.
To address this problem, many holistic twig join algorithms have been proposed, such as TwigStack [BKS02], TSGeneric [JWL+03], TJFast [LLC+05] and iTwigJoin [CLL05]. They answer the twig query holistically and avoid huge intermediate results. However, they have to recursively call a subroutine, getNext, many times; getNext is a core function of the holistic algorithms. It always returns a node q and ensures that: (i) the current element in stream q has a descendant element in each stream qi, for qi∈children(q), and (ii) each current element in stream qi recursively satisfies the first property.
If node q satisfies these two properties, q is said to have a solution extension, which is introduced in detail in section 4. Existing algorithms call getNext many times, yet most of these calls are unnecessary and could be avoided. They therefore involve many unnecessary computations, and the potential savings in CPU cost have not been fully explored by existing proposals. For example, when evaluating Q1 on the XML document in Fig.1(a), existing algorithms call getNext(s), getNext(p), getNext(f), getNext(t) and getNext(k) many times each, but only a few of these calls are useful and pivotal; the others can be pruned. Table 1 lists the main flow of TwigStack on the example in Fig.1. The flow is similar to those of other algorithms, and the merging of partial solutions is omitted. There are three circumstances in which some computations can be avoided:
1) Self-nested suboptimal. Given two nodes u1, u2 with the same label, if u1 is an ancestor of u2, we say that they are self-nested. For example, in Fig.1, s1 is an ancestor of s2, which in turn is an ancestor of s3, so s1, s2 and s3 are self-nested. Since s1 has already been shown to have a solution extension by the getNext calls in steps 1-5, checking whether s2 has a solution extension only requires checking whether s2 is a common ancestor of p1 and t1, the current elements of its child nodes p and t. It is unnecessary to recursively check whether p, t and their descendants f and k have solution extensions: the streams of p, t and their descendants have not changed, so getNext would return the same results as in steps 1-5. Accordingly, existing algorithms are suboptimal for self-nested elements, and some unnecessary computations (e.g. steps 6-15) can be pruned.
2) Order suboptimal. If a node has more than one child, the order in which its children are checked for solution extensions matters to twig joins, but existing algorithms do not consider this problem. For example, in Table 1, if node t (rather than node p) is checked first, getNext returns node k directly without checking p and f; steps 22, 23, 27 and 28 are then unnecessary and can be pruned.
3) Stream null suboptimal. If stream q is empty, it is unnecessary to scan elements in the streams of q’s ancestors and descendants, because the elements in those streams cannot contribute to final solutions. For example, in Fig.1, when the stream of node t is empty, it is unnecessary to check the elements of node s (the parent of t) and node k (a descendant of t), so steps 34 and 35 in Table 1 are unnecessary. Even if streams s and k contain other elements whose start values are larger than that of t1, those elements can be skipped directly. Accordingly, if some streams are empty, elements in the streams of their ancestors and descendants can be skipped.
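The shortcut described for the self-nested case can be illustrated with a minimal Python sketch. It assumes the (start, end, level) region encoding commonly used by these algorithms, under which u is an ancestor of v when u.start < v.start and v.end < u.end. The region codes below are invented for illustration and are not taken from Fig. 1, and the names Element, is_ancestor and is_common_ancestor are hypothetical rather than part of any of the cited algorithms.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class Element:
    """An XML element identified by a (start, end, level) region code."""
    label: str
    start: int
    end: int
    level: int


def is_ancestor(u: Element, v: Element) -> bool:
    """Region containment test: u strictly encloses v."""
    return u.start < v.start and v.end < u.end


def is_common_ancestor(u: Element, heads: Iterable[Element]) -> bool:
    """Shortcut for a self-nested element: compare u only with the current
    (head) elements of its child streams, without recursing into sub-twigs."""
    return all(is_ancestor(u, h) for h in heads)


# Hypothetical region codes (assumption): s1 encloses s2, which encloses s3.
s1 = Element("s", 1, 20, 1)
s2 = Element("s", 2, 19, 2)
s3 = Element("s", 3, 12, 3)
p1 = Element("p", 4, 7, 4)
t1 = Element("t", 8, 11, 4)

# s1 is assumed to have passed a full recursive check already; for the
# nested s2 and s3 only the heads of child streams p and t are compared.
candidates = {"s2": s2, "s3": s3}
shortcut = {name: is_common_ancestor(e, [p1, t1]) for name, e in candidates.items()}
print(shortcut)
```

The point of the sketch is that the check for s2 and s3 touches two elements each, whereas a full recursive check would revisit the sub-twigs rooted at p and t even though their streams have not advanced.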
Fig. 1. An XML document and a query Q1: (a) an XML document; (b) the element set of (a); (c) a query Q1.
Table 1. The main flow of how TwigStack (or TSGeneric) works
getNext(s)     | getNext(p)     | getNext(f)     | getNext(t)     | getNext(k)
step  result   | step  result   | step  result   | step  result   | step  result
1     s(s1)    | 2     p        | 3     f        | 4     t        | 5     k
6     s(s2)    | 7     p        | 8     f        | 9     t        | 10    k
11    s(s3)    | 12    p        | 13    f        | 14    t        | 15    k
16    t(t1)    | 17    p        | 18    f        | 19    t        | 20    k
21    k(k1)    | 22    p        | 23    f        | 24    k        | 25    k
26    k(k2)    | 27    p        | 28    f        | 29    k        | 30    k
31    p(p1)    | 32    p        | 33    f        | 34    t        | 35    k
36    f(f1)    | 37    f        | 38    f        | -     -        | -     -
39    f(f2)    | 40    f        | 41    f        | -     -        | -     -
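The call pattern in Table 1 can be mimicked with a minimal Python sketch. It assumes Q1 has the shape implied by the text (s with children p and t, p with child f, t with child k) and models only the order in which a naive recursive getNext is invoked over the query nodes, not the stream handling of TwigStack. TwigNode and visit_order are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class TwigNode:
    """A query node of a twig pattern with its child query nodes."""
    label: str
    children: List["TwigNode"] = field(default_factory=list)


def visit_order(q: TwigNode) -> List[str]:
    """Labels in the order a naive recursive getNext is invoked from q:
    the node itself first, then each child sub-twig in turn."""
    order = [q.label]
    for child in q.children:
        order.extend(visit_order(child))
    return order


# Q1 as implied by the text (assumption): s -> (p -> f, t -> k).
q1 = TwigNode("s", [
    TwigNode("p", [TwigNode("f")]),
    TwigNode("t", [TwigNode("k")]),
])

one_call = visit_order(q1)
print(one_call)           # ['s', 'p', 'f', 't', 'k']
print(3 * len(one_call))  # 15
```

Under this simplification, one root-level invocation touches all five query nodes, which corresponds to steps 1-5 in Table 1; repeating it for s1, s2 and s3 accounts for the first 15 steps, of which the text identifies steps 6-15 as prunable.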
To discover a partial solution, existing algorithms do not take the self-nested, order and stream null suboptimal circumstances into account, so they check whether a sub-twig pattern has a solution extension many times and incur additional computations. In this paper, we demonstrate how to avoid those unnecessary computations under these three circumstances by discovering the partial solutions in a leaf-to-root way combined with a root-to-leaf way. We first propose three effective optimization rules that remedy the three kinds of suboptimal behavior and further exploit the potential savings in CPU cost, and then present an algorithm, TJEssential, which avoids repeatedly checking whether a sub-twig pattern has a solution extension. Our contributions can be summarized as follows:
• We propose three optimization rules to exploit the potential savings in CPU cost; they avoid the self-nested, order and stream null suboptimal of existing studies.
• We present an efficient holistic twig join algorithm, TJEssential, which discovers all the solutions in a leaf-to-root way combined with a root-to-leaf way. More importantly, we incorporate the three optimization rules into our algorithm to optimize twig joins.
• We implemented our proposed method and conducted an extensive performance study using both real and synthetic datasets of various characteristics. The results show that our algorithm achieves high efficiency and outperforms existing proposals.
The rest of the paper proceeds as follows. Section 2 reviews related work. Section 3 introduces some notation. Section 4 presents the TJEssential algorithm in detail. Section 5 reports experimental results, and section 6 concludes.
2 Related Work
In the context of semi-structured and XML databases, structural joins are essential to XML query processing because XML queries usually impose certain structural relationships. For binary structural joins, Zhang et al. [ZND+01] proposed the multi-predicate merge join (MPMGJN) algorithm based on (start, end, level) labeling of XML elements. Li et al. [LM01] proposed EE/EA-Join. Stack-Tree-Desc/Anc was proposed in [AJK+02], and [CVZ+02], [G02], [JLWO03] are index-based approaches. Later work by Wu et al. [WPJ03] studied the problem of binary join order selection for complex queries using a cost model that takes into consideration factors such as selectivity and intermediate result size.
Bruno et al. [BKS02] proposed a holistic twig join algorithm, TwigStack, to avoid producing large intermediate results. Using a chain of linked stacks to compactly represent the partial results of individual root-to-leaf query paths, TwigStack merges the sorted lists of participating element sets together without creating large intermediate results. TwigStack has been proved to be optimal in terms of input and output sizes for twigs with only A-D edges. Further, Jiang et al. [JWL+03] studied the problem of holistic twig joins on fully or partly indexed XML documents. Their algorithms use indices to efficiently skip elements that do not contribute to final answers, but they cannot reduce the size of intermediate results. Choi et al. [CMW03] proved that optimal evaluation of twig patterns with arbitrarily mixed ancestor-descendant and parent-child edges is not feasible. Lu et al. [LCL04] proposed TwigStackList, which produces smaller intermediate results than previous work when matching XML twig patterns with both P-C and A-D edges. Chen et al. [CLL05] proposed iTwigJoin, which is still based on region encoding but works with different data partition strategies (e.g. Tag+Level streaming and Prefix Path Streaming (PPS)). Tag+Level streaming can be optimal for both A-D only and P-C only twig patterns, whereas PPS can be optimal for A-D only, P-C only and one-branch-node-only twig patterns, assuming there is no repetitive tag in the twig patterns. TJFast [LLC+05] is an algorithm based on extended Dewey labeling that uses only the leaf nodes’ streams and thus saves I/O cost. More recently, Mathis et al. [MHH06] proposed a set of new locking-aware operators for twig pattern query evaluation to ensure data consistency, and Chen et al. [CLT+06] presented the Twig2Stack algorithm to avoid huge intermediate results. However, Twig2Stack reduces the intermediate results at the expense of a huge memory requirement, and it is restricted by the fan-out of the XML document.
3 Background
XML data model and numbering scheme. XML data is commonly modeled as a tree structure, where nodes represent elements, attributes and text, and edges represent element-subelement, element-attribute and element-text pairs. Most existing XML query processing algorithms use a region code (start, end, level) to represent the position of a tree node in the data tree. start and end are calculated by performing a pre-order traversal of the document tree; level is the level of the element in the data tree. The region encodings support efficient evaluation of structural relationships. Formally, element u is an ancestor of element v if and only if u.start