XSelMark: A Micro-Benchmark for Selectivity Estimation Approaches of XML Queries

Sherif Sakr
National ICT Australia (NICTA), Sydney, Australia
Abstract. Estimating the sizes of query results and intermediate results is a crucial part of any effective query optimization process. For several reasons, the selectivity estimation problem in the XML domain is more complicated than in the relational domain. Several research efforts have proposed selectivity estimation approaches for the XML domain. The lack of a suitable benchmark has been one of the main reasons preventing a real assessment and comparison of these approaches. In this paper we propose XSelMark, a selectivity estimation benchmark for XML queries. It consists of a set of 25 queries organized into seven groups and covers the main aspects of selectivity estimation of XML queries. These queries have been designed with respect to an XML document instance of XMark, a popular benchmark for XML data management. In addition, we suggest some criteria for assessing the capability and quality of selectivity estimation approaches for XML queries. Finally, we use the proposed benchmark to assess the capabilities of the state-of-the-art selectivity estimation approaches.
1 Introduction
Modern query processors rely heavily on sophisticated optimizer components to make good optimization decisions, such as the choice of access paths, join orders and materialization strategies. Estimating the sizes of query results and intermediate results is a crucial part of any effective query optimization process. The selectivity estimation problem in the XML domain is more complicated than in the relational domain for several reasons: 1) the absence of a strict schema notion in XML data; 2) the dualism between structural and value-based querying; 3) the high expressiveness of the XML query languages [5]; 4) the non-uniform distribution of tags and data; 5) the correlation and dependencies between the occurrences of elements. In the recent past, several research efforts have proposed different selectivity estimation approaches in the XML domain [6, 15, 16]. However, these approaches have never been comprehensively assessed, evaluated and compared. One of the main reasons for this situation is the lack of a suitable benchmark for conducting such assessments and comparisons.
Although the XML research community has proposed several benchmarks [3, 7, 13, 14, 17, 20] which are very useful for their intended targets and perspectives, none of them is suited to assessing and evaluating the different selectivity estimation approaches for XML queries. The author of this paper faced this problem during his work in [16, 18]. In [13], Michiels et al. motivated the crucial need for different micro-benchmarks in order to get a good understanding of the different aspects of implementing efficient query processors in the XML domain. Therefore, the goal of this paper is to contribute an XML micro-benchmark, XSelMark, which focuses on exercising the selectivity estimation aspects of XML queries. The proposed benchmark aims to be a guide for researchers and implementors in benchmarking and improving their research efforts in this domain. XSelMark consists of 25 queries organized into seven groups, where each group is intended to address the challenges posed by a different aspect of XML query result size estimation. The remainder of this paper is organized as follows. Section 2 gives a brief overview of the related benchmarks in the XML domain. Section 3 describes the main aspects of the selectivity estimation problem in the XML domain. Section 4 presents the set of queries of the XSelMark benchmark. Section 5 presents a brief overview and an assessment of the supported features of the state-of-the-art selectivity estimation approaches for XML queries, before we conclude in Section 6.
2 Related Work
In general, XML benchmarks can be classified into two main categories: 1) Application (macro) benchmarks [3, 14, 17, 20], which are used to evaluate the overall performance of an XML management system. Hence, this kind of benchmark is not very useful for conducting a detailed assessment of specific aspects of an implementation that need improvement. 2) Micro-benchmarks [7, 13], which are designed to assess the performance of specific features of a system. In this section we give a brief overview of the state-of-the-art XML benchmarks.
XMach-1 [3] is a scalable multi-user benchmark. It is based on a web application and considers text documents and catalog data. It only defines a small number of XML queries that cover multiple functions and update operations for which system performance is determined. The main goal of XMach-1 is to test how many queries per second the query engine can execute. XBench [20] is designed to cover a large number of XML database applications. These applications are characterized by whether they are data-centric or text-centric and whether they consist of a single document or multiple documents. XMark [17] is a single-user benchmark. The database model is based on an internet auction site and consists of one big, regularly structured XML document with text and non-text data. The TPOX benchmark [14] is based on a financial application scenario. It focuses on exercising all aspects of XML database management systems, such as storage, indexing, logging, transaction processing and concurrency control. The workload of TPOX consists of insert, update and delete operations as well as query operations. XPathMark [7] is an XPath 1.0 micro-benchmark for XMark. It presents a set of XPath queries which covers the major aspects of the XPath language, including different axes, node tests, Boolean operators, references, and functions. The target of XPathMark is to assess the functional completeness, correctness, efficiency and data scalability of XPath implementations. MemBeR [13] is another micro-benchmark, whose main focus is to benchmark XQuery engines with respect to the efficiency of their implementation of four important XQuery constructs: XPath navigation, XPath predicates, XQuery FLWORs and XQuery node construction.
3 Main Aspects of Selectivity Estimation in the XML Domain
When looking for an efficient, capable and accurate selectivity estimation approach for XML queries, there are several issues that need to be addressed. From the experience of our work in [16, 18], the major issues of this problem include:
– It should support structural and data value queries. In principle, all XML query languages can involve structural conditions in addition to value-based conditions. Therefore, any complete selectivity estimation system for XML queries requires maintaining statistical summary information about both the structure and the data values of the underlying XML documents. A recommended way of doing this is to separate the structural summaries of the XML document from the data summaries and then group the related data values according to their path and data types into homogeneous sets [11].
– It must be practical. The performance characteristics of the selectivity estimation process are a crucial aspect of any approach. The selectivity estimation process for any query or sub-query must be much faster than the real evaluation process, and the summary structure(s) required for this estimation process must be efficient in terms of memory space consumption.
– It should be strongly capable. The standard query languages for XML, namely XPath and XQuery, are very rich languages. They provide a wide set of functions and features such as structure and content-based search, path expressions, element construction, join, sort, duplicate elimination and aggregation operations. Thus, a good selectivity estimation approach should be able to provide accurate estimates for a wide range of these features.
– It should be composable. The XML query languages, especially XQuery, are compositional in nature, as sub-expressions are combined with each other to form the final query. Hence, a good selectivity estimation approach should be able to estimate the selectivity of the final expression as well as of each sub-expression. This feature is crucial for any cost-based query optimizer to enable a proper selection of cheap execution plans.
– It must be accurate. On the one hand, providing the query optimizer with an accurate estimate can effectively accelerate the evaluation process of any query. On the other hand, providing the query optimizer with incorrect selectivity information will lead it to incorrect decisions and consequently to inefficient execution plans.
– It should be independent. The selectivity estimation process should be independent of the actual evaluation process and should be applicable to different query engines which apply different evaluation mechanisms.
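To make the first requirement above more concrete, the following Python sketch keeps a structural summary (element counts per root-to-node label path) separate from data summaries that group leaf values by path and data type. It is only an illustration under stated assumptions and is not part of XSelMark or of the approach in [11]: the function name `summarize`, the toy document and the coarse "number"/"string" typing are all hypothetical choices made here.

```python
import xml.etree.ElementTree as ET
from collections import Counter, defaultdict

# Toy fragment shaped like part of an XMark document (made up for illustration).
DOC = """
<site>
  <closed_auctions>
    <closed_auction><price>40</price><type>Regular</type></closed_auction>
    <closed_auction><price>120.5</price><type>Featured</type></closed_auction>
  </closed_auctions>
</site>
"""


def summarize(root):
    """Return (structure, values) for the tree rooted at `root`.

    structure: label path -> number of elements reachable by that path
    values:    (label path, data type) -> list of leaf values of that type
    """
    structure = Counter()
    values = defaultdict(list)

    def visit(node, prefix):
        path = f"{prefix}/{node.tag}"
        structure[path] += 1
        text = (node.text or "").strip()
        if text and len(node) == 0:  # leaf element carrying a data value
            try:
                values[(path, "number")].append(float(text))
            except ValueError:
                values[(path, "string")].append(text)
        for child in node:
            visit(child, path)

    visit(root, "")
    return structure, values


structure, values = summarize(ET.fromstring(DOC.strip()))
print(structure["/site/closed_auctions/closed_auction"])
print(values[("/site/closed_auctions/closed_auction/price", "number")])
```

A real estimator would compress each value group into a histogram or a similar synopsis rather than store the raw values; the raw lists are kept here only to make the grouping visible.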
4 XSelMark Benchmark Queries
XMark [17] is a well-known benchmark for XML data management. The XMark database models an internet auction web site. XMark comes with an XML generator that produces XML documents according to a numeric scaling factor proportional to the document size. We base the queries of our proposed benchmark on the structure of the XMark document "auction.xml", which is described in detail in [17]. The set of queries of our proposed benchmark, XSelMark, represents a mix of XML queries which covers a wide set of the major selectivity estimation aspects in the domain of XML queries. They are designed to allow a realistic assessment of the advantages and shortcomings of the proposed XML selectivity estimation approaches and to identify their respective impact. The queries are expressed using two standard XML query languages, XPath and XQuery. Due to lack of space, we do not present the source code of some queries. The source code of all queries can be downloaded from the benchmark Web site at [1]. The queries are grouped under subsection headings which indicate the feature to be tested.

4.1 Group 1: Path Expressions
Q1) Path expression with non-recursive axes: Find the names of all persons.
    /site/people/person/name/text()
Non-recursive XPath axes are child, parent, attribute, following-sibling and preceding-sibling.
Q2) Path expression with recursive axes: Find all description nodes that are descendants of item nodes.
    /site//item//description
Recursive XPath axes are descendant, descendant-or-self, ancestor and ancestor-or-self.
Q3) Path expression with wild cards: Return the item subtrees of all regions.
    /site/regions/*//item/*
Q4) Path expression with order-based axes: Return the description nodes which follow the elements named closed_auction.
    /site//closed_auction/following::description
Order-based axes are following, following-sibling, preceding and preceding-sibling. Supporting this type of query requires capturing specific statistical information about the order of the elements in the XML documents.
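For linear path expressions in the style of Q1 to Q3, one possible baseline is a path summary that counts elements per root-to-node label path and sums the counts of all label paths matched by the expression. The Python sketch below is illustrative only: the helper names (`path_counts`, `to_regex`, `estimate`) and the toy document are hypothetical, only child steps, descendant steps (`//`) and the `*` wildcard are handled, and `text()` and predicates are ignored. Order-based axes such as the one in Q4 cannot be answered from this summary, because it records no order information.

```python
import re
import xml.etree.ElementTree as ET
from collections import Counter

# Toy XMark-like fragment (made up for illustration).
DOC = """
<site>
  <regions>
    <africa>
      <item><name>a</name><description><text>x</text></description></item>
    </africa>
    <asia>
      <item><name>b</name>
        <description><parlist><listitem><text>y</text></listitem></parlist></description>
      </item>
    </asia>
  </regions>
  <people>
    <person><name>p1</name></person>
    <person><name>p2</name></person>
  </people>
</site>
"""


def path_counts(root):
    """Count elements per root-to-node label path, e.g. '/site/people/person'."""
    counts = Counter()

    def walk(node, prefix):
        path = prefix + "/" + node.tag
        counts[path] += 1
        for child in node:
            walk(child, path)

    walk(root, "")
    return counts


def to_regex(xpath):
    """Translate a linear path with '/', '//' and '*' into a regex over label paths."""
    pattern = ""
    for sep, step in re.findall(r"(//|/)([^/]+)", xpath):
        name = "[^/]+" if step == "*" else re.escape(step)
        # '//' allows any number of intermediate labels before the step.
        pattern += ("(?:/[^/]+)*/" if sep == "//" else "/") + name
    return re.compile(pattern)


def estimate(counts, xpath):
    """Sum the counts of all label paths that the expression matches."""
    rx = to_regex(xpath)
    return sum(n for path, n in counts.items() if rx.fullmatch(path))


counts = path_counts(ET.fromstring(DOC.strip()))
for query in ("/site/people/person/name",
              "/site//item//description",
              "/site/regions/*//item/*"):
    print(query, estimate(counts, query))
```

Because every element has exactly one root-to-node label path, this summary yields exact result sizes for such predicate-free expressions; the estimation problem becomes harder once the summary is compressed or once predicates and order-based axes are involved.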
Q5) Branching XPath expressions: Return the names of all persons who have age information in their profiles.
    /site//person[profile/age]/name

4.2 Group 2: Twig Expressions
Q6) Simple twig expression: Return the names and descriptions of all items.
    for $b in //item
    return ($b/name,$b/description)
Q7) Twig expression with element construction: Return the restructured results of the names and descriptions of all items.
    for $b in //item
    return {$b/name} {$b/description}

4.3 Group 3: Predicates
The estimation of predicate selectivity is a well-known problem in database theory and practice. The most common solutions rely on histograms for capturing the distribution of data values, and on the assumption of a uniform distribution when nothing is known about the data involved in the predicate. In the context of XML, predicate selectivity estimation poses new challenges: 1) Predicates can be structure-based as well as value-based. 2) Positional predicates represent a special form of predicate over the order information of the elements in the XML document. 3) XML elements are usually distributed in a non-uniform way; hence, assuming a simple uniform distribution of the element structure may lead to many estimation errors, especially when the operated sequence of nodes is constructed by merging nodes from different groups of data elements.
Q8) Positional Predicates: Return the third bidder of each open auction.
Q9) Equality Predicates: Return the closed auctions with price equal to 40.
Q10) Range Predicates: Return the closed auctions with price less than 40.
Q11) Conjunctive/Disjunctive Predicates: Return the closed auctions with price greater than 40 and less than 100.
Q12) Predicates with merged nodes from different paths: Return the African and Asian items with id value greater than 'item100'.
    for $b in (/site//africa/item, /site//asia/item)
    where data($b/@id) > 'item100'
    return $b
An accurate estimate for such a query should consider the different distributions of the data values of the nodes resulting from each path expression, as well as the share of each path in constructing the nodes of the operated sequence.
Q13) Predicates with merged nodes from different paths and hybrid nature: Return the price nodes and quantity nodes with value greater than 100.
    for $b in (/site//price,/site//quantity)
    where data($b) > 1 and data($b) > 100
    return $b
This query is more challenging than the previous one because the resulting nodes of the operated sequence represent completely different data items (price, quantity), which may have totally different distributions for their data values.
Q14) String Predicates: Return all persons with id value greater than "person200".

4.4 Group 4: Value-Based Joins (Theta Joins)
This group of queries assesses the ability of selectivity estimation approaches to deal effectively and accurately with value-based join operations between the data values of XML nodes.
Q15) Value-based join instances where the values of each operand are constructed by a path expression: Return all pairs of increase value and price value where the increase value is greater than the price value.
Q16) Value-based join instances where the values of one operand are constructed by a path expression and the values of the other operand are constructed by a path expression manipulated with an arithmetic expression: Return all pairs of increase value and price value where the increase value is greater than the price value multiplied by 2.
    for $x in /site//increase, $y in /site//price
    where data($x) > data($y) * 2
    return {$x,$y}
Q17) Equi-Joins of data values: Return all pairs of increase value and price value where the increase value is equal to the price value.

4.5 Group 5: Arithmetic and Comparison operations over Data Value Statistics
This group of queries assesses the ability of selectivity estimation approaches not only to capture summary information about the data values of the XML elements, but also to apply arithmetic and comparison operations over this summary information in a consistent and accurate way that does not hurt the quality of the selectivity estimation results.
Q18) Arithmetic over Data Value Statistics 1: Return all pairs of increase value and price value where the sum of the increase value and the price value is greater than 100.
    for $x in /site//increase, $y in /site//price
    where data($x) + data($y) > 100
    return {$x,$y}
Q19) Arithmetic over Data Value Statistics 2: Return all pairs of increase value and price value where the sum of the increase value and the price value is equal to 100.
Q20) Arithmetic and Comparison operations over Data Value Statistics 3: Return all triples of increase value, price value and income where the sum of the increase value and the income value is greater than the sum of the price value and the income value.

4.6 Group 6: Nested Expressions
Like many other XML query languages such as SQL/XML [4], XQuery is a freely nesting language in which nested queries can be used for many purposes, such as reshaping elements or computing aggregate values. Since the result of a nested query may be the input for navigational or filtering operations in the outer query, predicting the size of nested query results requires building on-the-fly statistics about these intermediate results.
Q21) Let - Aggregates: Return the names of persons and the number of items that they bought.
    for $p in /site/people/person
    let $a := for $t in /site//closed_auction
              where $t/buyer/@person = $p/@id
              return $t
    return {$p/name/text()} {count($a)}
Q22) Predicates with values constructed by aggregate function: Return the open auctions whose sum of bidder increases is greater than 1000.
    for $b in /site/open_auctions/open_auction
    where sum(data($b/bidder/increase)) > 1000
    return {$b}

4.7 Group 7: Data Dependent Estimations
This group of queries requires capturing additional specific forms of summary information about the data values of the underlying XML documents.
Q23) Full Text Search: Return the names of all items whose description contains the word "gold".
Q24) Distinct Operator: Return the distinct price values.
Q25) Existential Document Order: Return the open auctions where a certain person issued a bid before another person.
    for $b in /site/open_auctions/open_auction
    where some $pr1 in $b/bidder/personref[@person = "person20"],
               $pr2 in $b/bidder/personref[@person = "person51"]
          satisfies $pr1