A Direct Approach to Holistic Boolean-Twig Pattern Evaluation

Dabin Ding, Dunren Che, and Wen-Chi Hou
Department of Computer Science, Southern Illinois University, Carbondale, IL 62901, USA
{dding,dche,hou}@cs.siu.edu

Abstract. XML has emerged as a popular language for formatting and exchanging nearly all kinds of data, including scientific data. Efficient query processing in XML databases is of great importance for numerous applications. Trees or twigs are the core structural elements in XML data and queries. Recently, a holistic computing approach has been proposed for extended XML twig patterns, i.e., B-Twigs (Boolean Twigs), which allow the presence of AND, OR, and NOT logical predicates. This holistic approach, however, resorts to pre-normalization of input B-Twig queries, and therefore incurs extra processing time and possible expansion of the input queries. In this paper, we propose a direct, holistic approach to B-Twig query evaluation without using any preprocessing or normalization, and present our algorithm and experimental results.

Keywords: Query processing, XML Query, Boolean Twig, Twig join, Tree Query, Twig Query, Tree Pattern, Logical Predicate.

1 Introduction

XML (Extensible Markup Language), as a de facto standard for data exchange and integration, is ubiquitous over the Internet. Many scientific datasets are represented in XML, such as the Protein Sequence Database, which is an integrated collection of functionally annotated protein sequences [1], and the scientific datasets at the NASA Goddard Astronomical Data Center [2]. XML is frequently adopted for representing metadata for scientific and other computing tasks. In addition, numerous domain-specific XML markup languages have been defined, such as the Chemical Markup Language (CML), the Mathematical Markup Language (MathML), and the Geography Markup Language (GML). Efficiently querying XML data is a fundamental requirement for these scientific applications. In addition to examining contents and values, an XML query requires matching implied twig patterns against XML datasets. Twig pattern matching is a core operation in XML query processing. In the past few years, many algorithms have been proposed to solve the XML twig pattern matching problem. Holistic twig pattern matching has been demonstrated to be the most efficient approach to XML twig pattern computation so far. Well-known holistic twig join/matching algorithms include [5], [10], [6], [11], [12], [15], [8], [14], [7].

S.W. Liddle et al. (Eds.): DEXA 2012, Part I, LNCS 7446, pp. 342–356, 2012. © Springer-Verlag Berlin Heidelberg 2012


These algorithms, however, can only deal with twig queries that contain limited types of predicates, whereas queries in practical applications may contain all three Boolean predicates (AND, OR, NOT). Such twigs are called Boolean-Twigs, or B-Twigs for short. Some example B-Twig XML queries are given below (in XPath-like format):

Q1: /department/employee [age > 30 OR city = “NYC” ]/name
Q2: /vehicle/car/[[made = “ford” AND year < 2005] OR [NOT[type = “coupe”] AND color = “white”]]

Q1 selects the employees who are either older than 30 or live in NYC. Q2 (involving all three types of logical predicates) selects the cars that are either “made by Ford before 2005” or “white but not a coupe”. OR and NOT predicates are important not only in theory but also in practical applications. It is hence a natural requirement to support these two types of predicates, in addition to ANDs, in twig pattern matching algorithms.

Several efforts have been made toward holistically computing B-Twig pattern matches. Jiang et al. [6] proposed a solution for AND/OR-twigs; Yu et al. [15] proposed an algorithm for holistically evaluating AND/NOT-twigs; most recently, Che et al. (in our own group) [7] proposed the first algorithm, called BTwigMerge, for holistic computing of full B-Twigs (i.e., AND/OR/NOT-twigs), but it requires significant preprocessing (normalization) of the input B-Twigs first. Normalization helps control the complexity of holistic B-Twig pattern matching, but inevitably introduces an extra processing step and may cause query expansion, especially when NOTs are pushed down to lower levels in the B-Twigs. In this paper, we propose the first direct, holistic B-Twig pattern matching algorithm, called DBTwigMerge, which does not rely on any preprocessing or normalization of the input B-Twig queries.
Compared with our prior algorithm, BTwigMerge [7], our new algorithm, DBTwigMerge, relies on a new mechanism, called status, which is associated with every query node and unifies the processing needed for different types of query nodes (including AND, OR, and NOT nodes) into a coherent holistic processing framework. Our new algorithm can thus process any input B-Twig without the need to normalize it first [7], and yet it remains runtime optimal, i.e., linear in the total size of input and output. We believe that what we report in this paper represents a major advance in the area of XML query processing, as our algorithm is the first of its kind: a holistic twig join algorithm designed to be applied directly to arbitrary input B-Twig queries.

The remainder of the paper is organized as follows. Section 2 discusses related work. Section 3 provides preliminary knowledge, including our data model and the tree representation used in this paper. Section 4 elaborates on the novel DBTwigMerge algorithm, which applies directly to arbitrary B-Twig queries without relying on any preprocessing or normalization of the input queries. Section 5 presents the results of our performance study. Finally, Section 6 concludes this paper.


2 Related Work

Twig pattern matching is a core operation in XML query processing. Naive navigation (or pointer-chasing), structural joins, and holistic twig joins have all been studied for twig pattern matching. In the following, we review representative works on structural joins and particularly on holistic twig joins.

The first structural join (or containment join) algorithm was proposed by Zhang et al. [16], which extends the traditional merge join to the multi-predicate merge join (MPMGJN). Al-Khalifa et al. [4] later proposed two families of structural join algorithms, i.e., tree-merge and stack-based structural joins, as primitives for XML twig query processing. In 2002, Bruno et al. [5] first proposed the so-called holistic twig join approach to XML twig queries, whose main goal was to overcome the drawback of structural joins, which usually generate large sets of unused intermediate results. Bruno et al. designed the holistic twig join algorithm TwigStack, which is optimal for twigs with only AD edges (but not with PC edges). The work of Lu et al. [11] aimed to remedy this flaw and presented a new holistic twig join algorithm called TwigStackList, in which a list structure is used to cache a limited number of elements in order to identify a larger optimal query class. Chen et al. [8] studied the relationship between different data partition strategies and the optimal query classes for holistic twig joins. Lu et al. [12] proposed a new labeling scheme, called extended Dewey, and an interesting algorithm, named TJFast, for efficient processing of XML twig patterns. Unlike all previous algorithms based on region encoding, TJFast only needs to access the labels of the leaf query nodes to answer a twig query. The results of Lu et al. [12] include enhanced functionality (limited wildcards can be processed), reduced disk access, and increased overall query performance.
The same group [13] also studied efficient processing of ordered XML twig patterns using their proposed encoding scheme. S.K. Izadi et al. [17] proposed a series of algorithms called S3. Twig queries are first evaluated against a structural summary of the documents to avoid document access as far as possible. Later on, they published a newer version of their algorithms to support B-Twig queries [18]. NOT and OR operators are processed by special techniques in their algorithms. Overall, the S3 algorithms are based on structural joins and do not produce matching results holistically. Additional techniques are introduced in [18] to reduce intermediate results and redundant I/O.

In an ordinary twig, the multiple sibling nodes under a common parent node automatically signify an AND relationship among them, and all holistic twig join algorithms discussed above already support this implied AND logic in their implementation schemes. Users take all three commonly used logical predicates, AND, OR, and NOT, for granted when formulating their XML queries and thus expect full support from a query engine for unlimited use of these predicates. Jiang et al. [6] made the first effort toward incorporating support for OR predicates into the holistic twig join approach pioneered by Bruno et al. [5]. They presented an interesting framework for holistic processing of AND/OR-twigs based on the concept of OR-blocks. Using the mechanism of OR-blocks, an AND/OR-twig is transformed into an AND-only twig carrying OR-blocks. Yu et al. [15] worked on supporting NOT predicates in XML twig queries.

The recent publication of Xu et al. [14] proposed another interesting algorithm that claims to compute the answers to XML queries efficiently without holistically computing the twig patterns; the answers obtained contain individual elements corresponding to designated output query nodes. This work therefore does not belong to the category of holistic twig join algorithms. What is interesting in their work [14] is the proposed path-partitioned element encoding scheme, which has efficiency potential and may be adopted in future efforts to improve the performance of holistic B-Twig pattern matching.

Most recently, to support queries that contain AND/NOT/OR predicates, we [7] proposed a normalization procedure to regulate arbitrary combinations of AND, NOT, and OR predicates in a B-Twig. Normalization transforms the original B-Twig into an equivalent one that resembles the DNF form of Boolean logic, which is then evaluated by an efficient computing scheme. However, normalization comes at a cost: extra processing steps and possible query expansion. Our goal in this paper is to design a more powerful, holistic computing scheme that applies directly to arbitrary input B-Twigs without normalization. Our approach and new algorithm are detailed in the subsequent sections.
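To make the expansion cost concrete, the following is a minimal sketch of pushing NOTs down a Boolean predicate tree with De Morgan's laws. It is not the normalization procedure of [7]; the classes `Leaf`, `Not`, `And`, `Or` and the helpers `push_not_down` and `size` are hypothetical names introduced only for this illustration.

```python
from dataclasses import dataclass
from typing import Tuple, Union


@dataclass(frozen=True)
class Leaf:
    name: str


@dataclass(frozen=True)
class Not:
    child: "Expr"


@dataclass(frozen=True)
class And:
    children: Tuple["Expr", ...]


@dataclass(frozen=True)
class Or:
    children: Tuple["Expr", ...]


Expr = Union[Leaf, Not, And, Or]


def push_not_down(e: Expr, negate: bool = False) -> Expr:
    """Rewrite e so that NOT only appears directly above leaves."""
    if isinstance(e, Leaf):
        return Not(e) if negate else e
    if isinstance(e, Not):
        # A NOT flips the pending negation instead of staying in the tree.
        return push_not_down(e.child, not negate)
    kids = tuple(push_not_down(c, negate) for c in e.children)
    if isinstance(e, And):
        return Or(kids) if negate else And(kids)   # NOT(a AND b) = NOT a OR NOT b
    return And(kids) if negate else Or(kids)       # NOT(a OR b) = NOT a AND NOT b


def size(e: Expr) -> int:
    """Number of nodes in the predicate tree."""
    if isinstance(e, Leaf):
        return 1
    if isinstance(e, Not):
        return 1 + size(e.child)
    return 1 + sum(size(c) for c in e.children)


query = Not(And((Leaf("a"), Or((Leaf("b"), Leaf("c"))))))
pushed = push_not_down(query)
print(size(query), size(pushed))
```

In this toy case a single NOT at the top becomes one NOT per leaf, so the tree grows even though the two expressions are logically equivalent.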

3 Preliminaries

In this section, we address the data model and tree representation of XML data and XML queries. We adopt the general perspective [5] that an XML database is a forest of rooted, ordered, and labeled trees, in which each node corresponds to a data element or a value, and each edge represents an element-subelement or element-value relation. The order among sibling nodes implicitly defines a total order on the tree nodes. Node labels are important for efficient processing of twig pattern queries, as properly designed node labels may remove the need to access node contents during query evaluation. This is especially true of twig pattern matching, which is at the core of XML query processing. Node labels typically encode the region information of data elements and reflect the relative positional relationships among the elements in the source data file.

We assume a simple encoding scheme, a triplet region code (start, end, level), which is assigned to each data element in an XML document. When multiple documents are present, the document-id is added to the labels to differentiate the documents. Region codes can be conveniently obtained through a preorder traversal of the document tree. For example, region codes for each XML data element are given in Fig. 1(a). Region codes preserve the tree structure among nodes. For twig pattern evaluation, our algorithm only needs to store and access the region code of each data element. Each element type node in the input twig pattern is associated with a stream of data elements (represented by their region codes) of that type. In Fig. 1(a), for instance, the stream of element type Author is (3,6,3)(7,10,3)(18,21,3)(22,25,3).

Each XML query implies a twig pattern, small or large. The smallest twig may contain just a single node, but a typical twig usually comprises a number of nodes. In this paper, we use upper case letters to denote the query nodes in a twig query, and the same lower case letters to denote the current data elements in their associated streams. The target of our investigation is B-Twigs that allow arbitrary combinations of ANDs, ORs, and NOTs. In general, a B-Twig may consist of two general categories of nodes: ordinary query nodes standing for element types (or tags), and special connective nodes denoting logical predicates (ANDs, ORs, and NOTs). More specifically, we introduce the following types of nodes that a B-Twig may contain:

– QNode: An ordinary query node, which is associated with an element type (or tag name) in an XML database and is further associated with an input stream (whose data elements are represented by their region codes).
– LgNode: A logical node, which can be an AND node, an OR node, or a NOT node. An LgNode does not have any stream associated with it. We further introduce the following specific types of LgNodes:
  • ANode: An AND logical node, which always takes the text 'AND' as its content. It connects two or more child subtrees through the AND logic.
  • ONode: An OR logical node, which always takes the text 'OR' as its content. It connects two or more child subtrees through the OR logic.
  • NNode: A NOT logical node, which always takes the text 'NOT' as its content. It negates the node below it.
– DAQ: A Direct Ancestor Query node. Every node (either QNode or LgNode) has a DAQ node. We define the DAQ of an LgNode to be the first QNode met when traversing up the query tree, and the DAQ of a QNode to be itself.

For convenience of presentation, we use “a query node” to refer generally to any node (QNode or LgNode) that appears in the B-Twig query under discussion. In contrast, an ordinary node refers to a QNode in the B-Twig.
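The node types and the DAQ definition above can be sketched in a few lines of Python. This is only an illustration: the `Node` class, its `kind` strings, and the `daq` helper are hypothetical names, and the small query built at the end is an assumed shape (a Refinfo root over an OR of Author and NOT Year), not a structure prescribed by the paper.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(eq=False)  # identity comparison; avoids recursing through parent links
class Node:
    kind: str                                   # "Q", "AND", "OR", or "NOT"
    tag: Optional[str] = None                   # element type, QNodes only
    parent: Optional["Node"] = field(default=None, repr=False)
    children: List["Node"] = field(default_factory=list)

    def add(self, child: "Node") -> "Node":
        """Attach child below this node and return the child."""
        child.parent = self
        self.children.append(child)
        return child

    @property
    def is_qnode(self) -> bool:
        return self.kind == "Q"


def daq(node: Node) -> Node:
    """DAQ of a QNode is itself; DAQ of an LgNode is the first QNode above it."""
    cur: Optional[Node] = node
    while cur is not None and not cur.is_qnode:
        cur = cur.parent
    if cur is None:
        raise ValueError("logical node has no QNode ancestor")
    return cur


# Assumed example query: Refinfo -> OR -> (Author, NOT -> Year)
refinfo = Node("Q", "Refinfo")
or_node = refinfo.add(Node("OR"))
author = or_node.add(Node("Q", "Author"))
not_node = or_node.add(Node("NOT"))
year = not_node.add(Node("Q", "Year"))

print(daq(not_node).tag)
```

Both logical nodes resolve to Refinfo here, because no QNode lies between them and the root.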
There are two kinds of relationship between two connected query nodes: parent-child (PC) and ancestor-descendant (AD). Every B-Twig XML query can be represented as a tree (as shown in Fig. 1(b)) or as an XPath-like expression. The answer to a B-Twig query is a set of qualified twig instances (i.e., the embeddings of the twig pattern into the XML database). We assume the following output model for B-Twigs: each output twig instance of a B-Twig query comprises elements from only those QNodes that are not inside any predicate. The sub-twig that results from the original input B-Twig after pruning all predicate branches (which are subtrees rooted at a predicate node) is called the output twig of the B-Twig query. Each remaining QNode on the output twig is called an output node, and each leaf on the output twig is called an output leaf.
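The region encoding and the PC/AD relationships described in this section can be illustrated with a short sketch. The numbering convention here is an assumption (one counter position per element open and close, text values ignored), so the codes will not coincide with those in Fig. 1(a). The level comparisons in `is_ancestor` and `is_parent` are likewise assumed conventions, and all function names are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

Region = Tuple[int, int, int]  # (start, end, level)


@dataclass
class Elem:
    tag: str
    children: List["Elem"] = field(default_factory=list)


def assign_regions(root: Elem) -> Dict[str, List[Region]]:
    """Label every element and group the labels into per-tag streams."""
    streams: Dict[str, List[Region]] = {}
    counter = 0

    def visit(node: Elem, level: int) -> None:
        nonlocal counter
        start = counter              # position taken when the element opens
        counter += 1
        for child in node.children:
            visit(child, level + 1)
        end = counter                # position taken when the element closes
        counter += 1
        streams.setdefault(node.tag, []).append((start, end, level))

    visit(root, 0)
    for stream in streams.values():
        stream.sort()                # keep each stream in document order
    return streams


def is_ancestor(q: Region, s: Region) -> bool:
    # AD test: q's region covers s's region (level comparison is assumed).
    return q[0] <= s[0] and q[1] >= s[1] and q[2] < s[2]


def is_parent(q: Region, s: Region) -> bool:
    # PC test: an ancestor exactly one level above (assumed convention).
    return is_ancestor(q, s) and s[2] == q[2] + 1


doc = Elem("Refinfo", [Elem("Authors", [Elem("Author"), Elem("Author")]),
                       Elem("Year")])
streams = assign_regions(doc)
print(streams["Author"])
```

Because the tests use only the three integers of each label, no element content has to be read to decide whether two stream heads are structurally related.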

4 Direct B-Twig Evaluation

As pointed out earlier, previous efforts have been made toward holistic B-Twig pattern matching: GTwigMerge for AND/OR-twigs, TwigStackList¬ for AND/NOT-twigs, and BTwigMerge for normalized B-Twigs, but none for general B-Twigs. Developing a holistic approach for general B-Twigs involves great challenges and requires new, creative supporting mechanisms.

4.1 Status Mechanism

A general B-Twig query may contain arbitrary combinations of logical nodes. By arbitrary combinations of logical predicates, we mean that there is no limitation on the usage of logical predicates as long as they are meaningful and bear no redundancy. All meaningless queries (e.g., NOT as a leaf node) and redundant queries (e.g., a NOT/NOT branch) can be easily eliminated by a simple preprocessing step. For arbitrary B-Twigs, what is most troublesome is the involvement of multiple NOT nodes; in that case, the repeated negations implied in the B-Twig can be very hard to interpret and to deal with programmatically. The real challenge in B-Twig pattern matching is centered around the LgNodes in the B-Twig and their evaluation. It is possible for a B-Twig query to have more than four levels of LgNodes between any two QNodes. To deal with complex combinations of logical predicates, we need to take the LgNodes into consideration during path matching and subtree matching. Each node in a B-Twig needs to bear a status reflecting its subtree matching result, and each LgNode needs to inherit a fake region code from its DAQ node. To facilitate the evaluation process, we introduce an explicit status flag (true or false) for each LgNode and each QNode. Our algorithm continuously detects the status of the current node. More formally, we give the following definition for the status mechanism of every node in a B-Twig:

Definition 1 (Status of Node). Every node in a B-Twig is associated with a Boolean flag, called the status of the node; the status of a node takes the value of either true or false, indicating whether a match for the sub-tree rooted at this node is found.

To motivate the new status mechanism, which is fundamental to our algorithm, we look at the twig query in Fig. 1(b) and the sample XML data in Fig. 1(a). To save space, we use I# to denote a data element, e.g., I1 denotes Refinfo(1,15,1) and I3 denotes Author(3,6,3).
Initially, I12/'1962' is read in, and the status of QNode Year is set to true since it is a match. LgNode NOT is set to false since the status of its child is true, which must be negated. When I3 is processed, the status of QNode Author is false; later on, the stream of Author is advanced, I7 is processed, and the status of Author is updated to true. After both child branches of LgNode OR are updated, the status of OR can be determined as true, since one of its child branches, Author, is true. Finally, we find that the status of QNode Refinfo is true for I1, and I1/I12/'1962' is a matching instance and is output. Similarly, I16/I27/'1988' can also be found as a matching instance.

To summarize, we introduced the status for each query node and each logical node in a B-Twig; based on this mechanism, we are able to deal uniformly with all the nodes (logical and query nodes) in the B-Twig during the matching process. We can now envision our holistic B-Twig matching process as follows. We keep one eye on the input B-Twig and one eye on the input streams. We start from the root of the input B-Twig and probe into the associated streams to see if we can find a match. Often, whether an element is a match is decided not only by the element itself, but also by whether the lower level nodes in the B-Twig can find a match from their input streams. Stream heads need to be promptly advanced during this process when the head element is found to be disqualified by probing the range covering relationship between the elements. The whole process consists of repeatedly deciding the status of all query nodes in the input B-Twig, outputting successful matches (resulting in a 'true' status value on the root query node), advancing stream heads, and updating the statuses of query nodes (including logical nodes). Our direct, holistic B-Twig join algorithm is designed based on the above process.
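The status rules used in the walkthrough above can be sketched as a small recursive function. This is a simplified, non-streaming illustration rather than the DBTwigMerge algorithm: it assumes the running query has the shape Refinfo over OR(Author = 'Smith, E.L.', NOT(Year = '1962')), and it summarizes each Refinfo candidate as a map from tag to the text values found beneath it. `QueryNode`, `Candidate`, and `status` are hypothetical names.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Set


@dataclass
class QueryNode:
    kind: str                        # "Q", "AND", "OR", or "NOT"
    tag: Optional[str] = None        # element type, QNodes only
    value: Optional[str] = None      # required text value, if any
    children: List["QueryNode"] = field(default_factory=list)


Candidate = Dict[str, Set[str]]      # tag -> text values under one root element


def status(node: QueryNode, cand: Candidate) -> bool:
    """True iff a match for the sub-tree rooted at node is found in cand."""
    if node.kind == "NOT":
        return not status(node.children[0], cand)
    if node.kind == "AND":
        return all(status(c, cand) for c in node.children)
    if node.kind == "OR":
        return any(status(c, cand) for c in node.children)
    # QNode: its own element must match, and so must every child branch.
    if node.tag not in cand:
        return False
    if node.value is not None and node.value not in cand[node.tag]:
        return False
    return all(status(c, cand) for c in node.children)


year = QueryNode("Q", "Year", "1962")
not_node = QueryNode("NOT", children=[year])
author = QueryNode("Q", "Author", "Smith, E.L.")
or_node = QueryNode("OR", children=[author, not_node])
query = QueryNode("Q", "Refinfo", children=[or_node])

refinfo1: Candidate = {"Refinfo": set(),
                       "Author": {"Matsubara, H.", "Smith, E.L."},
                       "Year": {"1962"}}
refinfo2: Candidate = {"Refinfo": set(),
                       "Author": {"Evans, M.J.", "Scarpulla, R.C."},
                       "Year": {"1988"}}

print(status(query, refinfo1), status(query, refinfo2))
```

For the first candidate the OR is satisfied through Author while NOT is false; for the second it is satisfied through NOT, since no Year equals '1962'.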

[Figure: (a) XML data, a ProteinEntry (0,32,0) tree with two Refinfo elements, (1,15,1) and (16,31,1), each containing Authors, Author, and Year elements labeled with region codes; (b) B-Twig query rooted at Refinfo, with OR and NOT nodes over Author 'Smith, E.L.' and Year '1962'.]

Fig. 1. Illustrative XML data and query

4.2 DBTwigMerge Algorithm Design

Status updating is a primary supporting mechanism in our direct, holistic B-Twig join algorithm. We will discuss how the status of different types of query nodes is updated. First, we need to define some terms and prerequisites. As mentioned earlier, we use upper case letters (e.g., Q) to denote query nodes, and the same lower case letters (e.g., q) to denote the current data instances in the corresponding streams. R(Q,S) denotes the region covering relationship between query nodes Q and S, which is supported by and equal to R(q,s), where q and s refer to the two corresponding stream head instances. A special case is when S is an LgNode, in which case R(Q,S) is always set to true. For an AD relationship of Q//S, R(q,s) is true (i.e., q covers s) iff q.Start≤s.Start, q.End≥s.End, and q.Level
