Region-Based Coding for Queries over Streamed XML Fragments

Xiaoyun Hui, Guoren Wang, Huan Huo, Chuan Xiao, and Rui Zhou

Northeastern University, Shenyang, China

Abstract. The recently proposed Hole-Filler model is promising for transmitting and evaluating streamed XML fragments. However, simply matching filler IDs with hole IDs means that all the correlated fragments must be associated to complete a query path, which results in blocking. Taking advantage of a region-based coding scheme, this paper models the query expression as a query tree and proposes a set of techniques to optimize the query plan. It then proposes XFPR (XML Fragment Processor with Region code), which speeds up query processing by skipping the correlation of adjacent fragments. We illustrate the effectiveness of these techniques with a detailed set of experiments.

1 Introduction

As an emerging standard for tsid="1"> XML 2000 ......

Fragment 2: Origins ......

Fragment 3: ... ...

Fig. 3. XML Document Fragments

Taking (StartPos, EndPos, Level) as the fid, a fragment not only retains the link information between correlated fragments, but also identifies its descendant fragments as those lying within the region from StartPos to EndPos. Given the region codes, the intervals (StartPos, EndPos) of any two fragments are either nested or disjoint, so we can determine the ancestor-descendant relationship between nodes by testing their region codes. Suppose that f1 and f2 are fragments with region codes (S1, E1, L1) and (S2, E2, L2), respectively.

Ancestor-Descendant: f2 is a descendant of f1 iff S1 < S2 and E2 < E1. For example, Fragment 3 with code (20,29,3) is a descendant of Fragment 1 with code (1,70,1) in Figure 3.

Parent-Child: f2 is a child of f1 iff (1) S1 < S2 and E2 < E1; and (2) L1 + 1 = L2. For example, Fragment 2 with code (16,40,2) is a child of Fragment 1 with code (1,70,1) in Figure 3.
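The two tests above can be written down directly. The following is a minimal Python sketch, not the authors' implementation; the names RegionCode, is_descendant and is_child are chosen here for illustration, and the sample codes are the ones quoted for Figure 3.

```python
from typing import NamedTuple


class RegionCode(NamedTuple):
    """Region code of a fragment: (StartPos, EndPos, Level)."""
    start: int
    end: int
    level: int


def is_descendant(f1: RegionCode, f2: RegionCode) -> bool:
    """f2 is a descendant of f1 iff S1 < S2 and E2 < E1."""
    return f1.start < f2.start and f2.end < f1.end


def is_child(f1: RegionCode, f2: RegionCode) -> bool:
    """f2 is a child of f1 iff it is a descendant and L1 + 1 = L2."""
    return is_descendant(f1, f2) and f1.level + 1 == f2.level


# Region codes quoted in the text for Figure 3.
fragment1 = RegionCode(1, 70, 1)
fragment2 = RegionCode(16, 40, 2)
fragment3 = RegionCode(20, 29, 3)

print(is_descendant(fragment1, fragment3))  # Fragment 3 lies inside Fragment 1
print(is_child(fragment1, fragment2))       # Fragment 2 is one level below Fragment 1
print(is_child(fragment1, fragment3))       # two levels apart, so not a child
```

Both tests need only the two region codes, so no intermediate fragment has to be consulted.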

3 XFPR Query Handling

Based on the hole-filler model, an infinite XML stream becomes a sequence of XML fragments, which are the basic processing units of a query. With the simple numbering scheme, establishing an ancestor-descendant relationship requires verifying a chain of parent-child relationships, so waiting for a fragment to arrive and supply the information needed for execution results in blocking. By using a region coding scheme for fragments, we can skip evaluating the structural relationships between intermediate fragments and shorten processing time, especially for nested path expressions.


This section focuses on the techniques based on the region coding scheme. We first introduce the pruning policies on query expressions that eliminate “redundant” path evaluations. Then we present the query plan transformation techniques for efficient query handling with XML fragments.

3.1 Query Plan Generation

Linear Pattern Optimization. Let T be an optimized query tree after dependence pruning and TS a tag structure complying with the DTD. We can then transform queries on XML elements into queries on XML fragments. Since T captures all the tsids that may be involved in the query according to the tag structure, we only need to locate the corresponding fragments, which are represented by element nodes of type “filler” in query expressions. Note that T also captures all the valid paths in the query path. For linear pattern optimization, we only need to handle operators that can output results. For example, the end element of Query 1, “/book/chapter/section/head”, is “head”, and its type is not “filler”. However, the “section” element, which is in the same filler as “head”, has type “filler”. We therefore only need to handle the “section” fragment in the output operator, by matching its tsid 8 and level 3 in the region code, without inquiring about the parent fragments. For a simple query with “/*” or “//”, for example Query 2, “/book//author”, we can capture the relationship between the “book” fillers and the fillers containing the “author” element, since region coding locates the ancestor-descendant relationship quickly and directly without knowing the intermediate fillers. However, this approach cannot be applied directly to queries that include predicates.

Twig Pattern Optimization. Path expressions rarely consist only of connectors such as “/” and “//”. A location step can also include one or more predicates to further refine the selected set of nodes. After determining the key nodes in the query tree, we can simplify a path computation into an XML fragment matching operation to speed up the query. In this way, queries involving one or more predicates or twig patterns can also be optimized.

Definition 3.1. Let TS be a tag structure and T an optimized query tree after dependence pruning. For ti ∈ T, if ti has more than one successor, or the predicate of ti is not null, then ti is defined as a key node in the query tree, in short KN(T) = ti.

Taking advantage of the region codes of the fragments, we can quickly judge the ancestor-descendant relationship between fragments by comparing the (StartPos, EndPos) of the nodes. If a.StartPos < d.StartPos and d.EndPos < a.EndPos, fragment a is an ancestor of fragment d. Since the content of a fragment is guaranteed by the tag structure, we can prune the intermediate nodes that are not key nodes from the query tree, and keep only the key nodes and the output nodes. For nested repetition steps, we can apply the same policy and check the level number against the number of repetition steps.


Consider Query 3: /book/chapter[/head]/figure/title, whose query tree is presented in Figure 4. In (a), all kinds of fragments involved in this query are indicated. In (b), according to Definition 3.1, we keep only the “chapter” and “figure” nodes, since “chapter” has a predicate and “figure” can output the result. In this way, we can compare the ancestor-descendant relationship between “chapter” fillers and “figure” fillers without inquiring about the “book” fillers.

(a) Original Query Tree: book (tsid=1, Filler); chapter (tsid=6, Filler); head (tsid=7); figure (tsid=11, Filler); title (tsid=12)

(b) Optimized Query Tree: chapter (tsid=6); head (tsid=7); figure (tsid=11, Filler); title (tsid=12)

Fig. 4. Query Tree of Query 3
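The pruning rule of Definition 3.1 can be illustrated with a small Python sketch. It is a simplification under stated assumptions: the query tree is modelled at filler granularity only, the predicate is stored as a plain string, and the names QueryNode, is_key_node and kept_nodes are hypothetical. The tsids are those shown in Figure 4.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class QueryNode:
    """A filler-level node of the query tree (hypothetical structure)."""
    name: str
    tsid: int
    predicate: Optional[str] = None   # None means "no predicate"
    is_output: bool = False           # the filler that carries the result element
    children: List["QueryNode"] = field(default_factory=list)


def is_key_node(node: QueryNode) -> bool:
    """Definition 3.1: more than one successor, or a non-null predicate."""
    return len(node.children) > 1 or node.predicate is not None


def kept_nodes(root: QueryNode) -> List[str]:
    """Names of the nodes that survive pruning: key nodes and output nodes."""
    kept = []
    stack = [root]
    while stack:
        node = stack.pop()
        if is_key_node(node) or node.is_output:
            kept.append(node.name)
        # Push children in reverse so they are visited left to right.
        stack.extend(reversed(node.children))
    return kept


# Query 3: /book/chapter[/head]/figure/title, modelled at filler level.
figure = QueryNode("figure", tsid=11, is_output=True)
chapter = QueryNode("chapter", tsid=6, predicate="head", children=[figure])
book = QueryNode("book", tsid=1, children=[chapter])

print(kept_nodes(book))
```

With this model, “book” is dropped because it has a single successor and no predicate, which matches the optimized tree in Figure 4 (b).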

Nested Pattern Optimization. Given an XPath p, we say that s is a simple subexpression of p if s is equal to the path of the tag nodes along a path <v1, v2, ..., vn> in the query tree of p, such that each vi is the parent node of vi+1 (1 ≤ i < n) and the label of each vi (except perhaps v1) is prefixed only by “/”. If all the vi share the same tsid and the same predecessor, we call the path a repetition step. For example, Query 4: /book/chapter/section/section/section/head involves a repetition step; its query tree is shown in Fig. 5 (a).

(a) Original Query Tree: book (tsid 1) / chap (tsid 6) / sect (tsid 8) / sect (tsid 8) / sect (tsid 8) / head (tsid 9)

(b) Optimized Query Tree: book (tsid 1) / chap (tsid 6) / (sect)^3 (tsid 8) / head (tsid 9)

Fig. 5. Query Tree of Query 4

Query 4 involves three types of fragments, with tsid 1, tsid 6 and tsid 8. Since “/section” is a nested path step, the query tree includes three nodes with the same tsid 8. A repetition step in an XPath expression degrades performance significantly, especially when the repeated path is highly nested. Taking advantage of the level number, we can simplify the evaluation of such repetition paths. Since the parent-child operator “/” relates nodes at adjacent levels, we can optimize the query tree by pruning the repeated steps and recording the number of repetitions for further processing. In Query 4, “/section” occurs three times, so we keep only one occurrence in the query tree, enclose it in “( )”, and mark 3 at its right corner. The query plan for Query 4 after pruning is shown in Figure 5 (b). According to the linear pattern optimization, we only need to handle the “section” filler at level 5.

Consider Query 2: /book//head, which involves fragments with tsid 1, tsid 6, and tsid 8. Since “/section” is a nested path step, there can be many repetition steps between “/book” and “//head”. Taking advantage of region coding, we can simplify such repetition path evaluation by checking only the ancestor-descendant relationship between fragments with tsid 1, tsid 6 and tsid 8. In the optimized query plan, we enclose “/section” in “( )” and mark it with *, meaning that “section” fillers at any level are handled.
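The collapsing of repetition steps can be sketched in a few lines of Python. This is an illustration only: it assumes a path made solely of “/” steps, treats equal step names as having the same tsid and predecessor, and uses the hypothetical helper names collapse_steps and level_of_last.

```python
from itertools import groupby
from typing import List, Tuple


def collapse_steps(steps: List[str]) -> List[Tuple[str, int]]:
    """Replace each run of identical consecutive steps by (step, repetition count)."""
    return [(name, len(list(run))) for name, run in groupby(steps)]


def level_of_last(collapsed: List[Tuple[str, int]], name: str) -> int:
    """Level of the deepest occurrence of `name`, counting the root step as level 1."""
    level = 0
    for step, count in collapsed:
        level += count
        if step == name:
            return level
    raise ValueError(f"step {name!r} not found in path")


# Query 4: /book/chapter/section/section/section/head
query4 = ["book", "chapter", "section", "section", "section", "head"]
plan = collapse_steps(query4)

print(plan)
print(level_of_last(plan, "section"))
```

The computed level for “section” is the one the text gives for Query 4, so an arriving “section” filler can be accepted or rejected by comparing its Level field with that number.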

3.2 The XFPR Matching Algorithm

XFPR is based on the optimized query plan obtained after pruning the “redundant” operations from the query tree. In this section, we focus on the main algorithm of query evaluation in the XFPR framework and then analyze its efficiency compared with previous work.

The transformation from query tree to query plan is a mapping from the XPath expression and the tag structure to the XFPR processor. For each node in the query tree, the tsid of an element node with type “Filler” corresponds to an entry of the hash table, which is characterized by a predecessor p, a bucket list b, and the tag structure corresponding to the fragment. The predecessor p resolves fragment relationships and predicate criteria. The bucket list b links together the fragments that carry the tsid of the node, and each item is denoted as an f-tuple (fillerid, {holeid}, value), in which fillerid denotes (StartPos, EndPos, Level) in XFPR, holeid denotes (StartPos, tsid) in XFPR, and value can be set to true, false, undecided (⊥), or a result fragment corresponding to predicates. While the former three values are possible in intermediate steps that do not produce a result, the latter is possible in the terminal step of a query tree branch.

Algorithm 1 describes the processing method, which is based on the SAX event-based interface that reports parsing events. When a fragment is processed by XFPR, it first needs to verify whether the predecessor operator has excluded its parent fragment due to either predicate failure or exclusion of its ancestor. If the ancestor fragment has arrived, the value of the f-tuple copies the status of its ancestor’s value; otherwise the value is tagged with “⊥”. The fragment then has to trigger its descendant fragments and pass the status value on to them, as fragments may be waiting on operators to decide on their ancestor eligibility.


Algorithm 1 startElement()

    if (isFragmentStart() == true) then
        tsid = getTsid();
        fid = getFid();   // including fid.startPos, fid.endPos, fid.level
        if (hashFindOperator(tsid) != null) then
            fillInformation();
            if (isQueryFragment() == true) then
                p = findAncestorOperator(tsid);
                for each f-tuple ft of p do
                    if (ft.fid.startPos < fid.startPos && ft.fid.endPos > fid.endPos && isSatisfied(ft.fid.level, fid.level) && ft.fid.value != ⊥) then
                        currentValue = ft.value;
                    end if
                end for
            else
                q = findDescendantOperator(tsid);
                for each f-tuple ft of q do
                    if (fid.startPos < ft.fid.startPos && fid.endPos > ft.fid.endPos && isSatisfied(ft.fid.level, fid.level) && fid.value != ⊥) then
                        ft.value = currentValue;
                    end if
                end for
            end if
        end if
    end if
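The arrival handling described in the prose (inherit the status from an ancestor that has already arrived, then trigger waiting descendants) can be sketched in Python as follows. This is a simplified illustration, not a transcription of Algorithm 1 or the authors' code: Plan, FTuple, on_fragment, ancestor_of and level_gap are hypothetical names, None stands in for ⊥, hole ids are omitted, and only direct plan descendants are triggered.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class FTuple:
    """Simplified f-tuple: a region code plus a status value (None plays the role of ⊥)."""
    start: int
    end: int
    level: int
    value: Optional[bool] = None


def contains(anc: FTuple, desc: FTuple) -> bool:
    """Region test: anc.start < desc.start and desc.end < anc.end."""
    return anc.start < desc.start and desc.end < anc.end


@dataclass
class Plan:
    ancestor_of: Dict[int, int]            # tsid -> tsid of its ancestor operator in the plan
    level_gap: Dict[int, Optional[int]]    # tsid -> required level difference (None = any)
    table: Dict[int, List[FTuple]] = field(default_factory=dict)

    def level_ok(self, tsid: int, anc: FTuple, frag: FTuple) -> bool:
        gap = self.level_gap.get(tsid)
        return gap is None or frag.level - anc.level == gap

    def on_fragment(self, tsid: int, frag: FTuple) -> None:
        # Record the arrived fragment under its operator.
        self.table.setdefault(tsid, []).append(frag)
        # Step 1: copy the status of an ancestor that has arrived and is decided.
        anc_tsid = self.ancestor_of.get(tsid)
        if anc_tsid is not None:
            for anc in self.table.get(anc_tsid, []):
                if (contains(anc, frag) and self.level_ok(tsid, anc, frag)
                        and anc.value is not None):
                    frag.value = anc.value
        # Step 2: pass a decided status on to descendants that are still waiting.
        if frag.value is not None:
            for d_tsid, a_tsid in self.ancestor_of.items():
                if a_tsid != tsid:
                    continue
                for d in self.table.get(d_tsid, []):
                    if (d.value is None and contains(frag, d)
                            and self.level_ok(d_tsid, frag, d)):
                        d.value = frag.value


# A "section" (tsid 8) arrives before its "chapter" (tsid 6) and has to wait.
plan = Plan(ancestor_of={8: 6}, level_gap={8: 1})
section = FTuple(20, 29, 3)
plan.on_fragment(8, section)
waiting = section.value                      # still undecided
plan.on_fragment(6, FTuple(16, 40, 2, value=True))
print(waiting, section.value)
```

The point of the sketch is that the late-arriving ancestor resolves the waiting descendant with one region comparison and one level comparison, without touching any other operator.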

3.3 Algorithm Analysis

In this section, we illustrate the advantage of the region-coding scheme adopted in XFPR with different types of queries, and compare its query efficiency with that of the simple fid and hid numbering scheme used in previous frameworks [8, 3].

Hierarchical Matching. As an illustrative example, consider Query 5: //chapter/section/section/title, which returns the “title” of each “section” nested in another “section”. In the previous frameworks with the simple fid and hid numbering scheme, when the fragment with fid 2 and hids 4, 13 arrives, it is accepted by the “chapter” operator by hashing to the corresponding entry of the hash table, and the link information is recorded as the f-tuple (2, {4, 13}, undecided). The value identifies the fragment’s state, which is decided by its parent fragment. If its parent fragment has arrived, the value copies the parent fragment’s value; otherwise the fragment’s value is set to “undecided”. Similarly, three “section” fragments, with fid 4, hid 6; fid 6, hid 10; and fid 13, hid 15, arrive successively. After inquiring the predecessor operator and triggering the successive operator, the “section” fragments with fid 4 and 13 are related to the “chapter” fragment with fid 2, and the fragment with fid 6 is the child of the fragment with fid 4. However, not all the “section” fragments can be output as results, since only those “section” fragments matching the second “/section” step in the query path contribute to the results. As the fragment with fid 6 matches the second location step “/section” in the query path, it is output as the result after tagging “true” as the value of the fragment’s f-tuple.

Query 5 shows that the simple fid, hid numbering scheme is not efficient for some queries. In the hash table, each entry recording the information of an arrived fragment needs to execute two steps: inquiring the predecessor, and triggering the successive operators. Under this kind of manipulation, it takes too much time to find the related fragments before query processing. Moreover, after many such manipulations, a filler may still not be output as a result because it does not match the particular location step in the query path. With the region-coding scheme, however, we can easily identify the element level and speed up such hierarchical relationship operations. In our framework, we use nested pattern optimization and linear pattern optimization to generate the query plan for Query 5, so we only need to check, for each arrived “section” fragment, whether its tsid equals “4” and its level equals “4”. When such a “section” fragment with its region code arrives, we directly output the element “title” as the result, which is in the “section” fragment representing the same element type and the fourth level in the XML document tree.

From the analysis of this query example, we can conclude that the simple numbering scheme is not suitable for hierarchical relationship evaluation. In particular, if the path expression asks for an element in a nested hierarchical step, processing becomes more complex and takes more time. The region-based coding scheme is clearly better suited to hierarchical relationship matching.

Skipping Fragments. Consider Query 6: //chapter/section[/figure]/*/figure/title, which is a twig pattern expression involving “*”. The tag structure defines the structure of the data and captures all the valid paths.
We can use it to expand wildcard path selections in queries and so make query execution specific. After we resolve the “*” node, Query 6 is equivalent to “book/chapter/section[/figure1]/section/figure2/title”. To distinguish the “figure” in the branch expression from the “figure” in the main path expression, we denote the former as figure1 and the latter as figure2. The simple numbering scheme for fid and hid can only handle parent-child relationships between fragments. It can therefore slow down query evaluation, because waiting for all the correlated fragments to arrive and complete the query path results in blocking. With twig pattern optimization, however, XFPR can skip the intermediate fragments and output the results as soon as possible. For fragments with tsid 4 and tsid 7 satisfying the condition section.StartPos < figure2.StartPos ∧ section.EndPos > figure2.EndPos ∧ figure2.Level = 5 ∧ section.Level = 3, the algorithm outputs the results immediately if a “figure1” fragment also satisfies the following condition: figure1.StartPos > section.StartPos ∧ figure1.EndPos < section.EndPos ∧ figure1.Level = 4.

Twig pattern optimization proceeds as follows. Assume the “section” fragment with region code (6, 27, 3) has already arrived and its information is recorded in the association table. When the “figure” fragment with region code (15, 26, 5) arrives, it is compared with the “section” fragment by StartPos and EndPos. Since 15 > 6, 26 < 27, figure.Level = 5 and section.Level = 3, the “figure” fragment is one of the descendant fragments of the “section” fragment. However, the “figure” fragment that maps to the predicate in the twig pattern, as a child of the “section”, has not arrived yet, so the “section” fragment cannot determine its value, and we cannot output the “figure” fragment as a result. When the “figure” fragment with region code (9, 14, 4) arrives, its f-tuple value is set to “true” since it satisfies the condition (i.e. Level = 4 ∧ 6 < 9 ∧ 27 > 14). The “section” fragment then triggers its descendant “figure” fragment with region code (15, 26, 5), and the “title” element in the same fragment is output.

From the analysis of this query example, we can conclude that, with the region coding scheme, checking an ancestor-descendant structural relationship is as easy as checking a parent-child structural relationship. Hence we can skip intermediate fragments along the path and produce the query results as soon as possible without waiting for all the correlated fragments to arrive.
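The two conditions for Query 6 can be checked mechanically against the region codes used in the walk-through. The sketch below is illustrative only; Frag, inside and twig_ready are hypothetical names, and the level constants 3, 4 and 5 are the ones stated in the conditions above.

```python
from typing import List, NamedTuple


class Frag(NamedTuple):
    """Region code (StartPos, EndPos, Level) of a fragment."""
    start: int
    end: int
    level: int


def inside(outer: Frag, inner: Frag) -> bool:
    """True when inner's interval lies strictly within outer's interval."""
    return outer.start < inner.start and inner.end < outer.end


def twig_ready(section: Frag, figure1_arrived: List[Frag], figure2: Frag) -> bool:
    """Both Query 6 conditions: main path (figure2) and predicate branch (figure1)."""
    main_ok = inside(section, figure2) and section.level == 3 and figure2.level == 5
    pred_ok = any(inside(section, f) and f.level == 4 for f in figure1_arrived)
    return main_ok and pred_ok


section = Frag(6, 27, 3)
figure2 = Frag(15, 26, 5)   # arrives first; the predicate is still undecided
figure1 = Frag(9, 14, 4)    # arrives later and satisfies the predicate

print(twig_ready(section, [], figure2))         # result must wait
print(twig_ready(section, [figure1], figure2))  # result can be output
```

No fragment for the intermediate “section” at level 4 appears anywhere in the check, which is the skipping effect the example is meant to show.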

4 Performance Evaluation

In this section, we present the results of a performance evaluation of various algorithms over queries of different types and depths and over different document sizes, all on the same platform. All the experiments are run on a PC with a 2.6GHz CPU and 512M of memory. Data sets are generated by the xmlgen program. We fragmented each XML document into fragments to produce a stream, based on the tag structure defining the fragmentation layout. We also implemented a query generator that takes the DTD as input and creates sets of XPath queries of different types and depths. We used 4 queries on the document and compared the results among the following algorithms: (1) XFrag [3], (2) XFPro [8] and (3) XFPR. Figure 6 shows the queries that we used.

NO    Path expression
Q1    book/section/title
Q2    book/section/*/title
Q3    book//section/title
Q4    book/section[/figure/title]/section/title

Fig. 6. Path expression

In Figure 7 (a) and (b), the three processing strategies are tested and compared over various query types. The results show that XFPR outperforms its counterparts for every query type, and that its query performance does not vary much across query types. This is because XFPR only needs to process the fragments that include key nodes or output results. Furthermore, XFPro outperforms XFrag in time, because it removes the dependent operations. But it is not better than XFPR, since it has to expand the query paths when queries include “*” or “//” and it cannot eliminate intermediate fillers. For XFrag, each fragment needs to be passed through the pipeline and evaluated step by step, so the performance of XFrag is affected by the characteristics of the query. As for memory usage, complex queries increase the number of operators participating in query processing, along with the amount of information in the association table, and so consume additional space.

(a), (b): XFrag, XFPro and XFPR over Q1 to Q4; (c): XFrag, XFPro and XFPR over Depth of Query 3, 5 and 8. Axes: Time (ms), Memory cost (KB).

Fig. 7. Time with Different Queries

Figure 7 (c) shows the time for query depths 3, 5 and 8 on the three methods. As the depth increases, the time of XFrag and XFPro increases because of the additional path steps. With region coding, XFPR greatly reduces the evaluation of intermediate path steps, so the time cost of deep queries is almost the same as that of short queries.

(d) to (k): XFrag, XFPro and XFPR over document sizes 5M, 10M, 15M and 20M. Axes: Time (ms), Memory cost (KB).

Fig. 8. Time with Different Documents

In Figure 8 (d), (e), (f) and (g), as the document size increases, XFrag is the most costly, because much time is spent inserting fragments and finding the relationships between them. XFPro, with dependence pruning, omits some of the operators corresponding to the query path. XFPR performs best because a considerable number of intermediate fillers are disregarded and the query path is greatly shortened. Figure 8 (h), (i), (j) and (k) illustrate the memory usage for different document sizes. For XFPR, memory usage is less affected by increasing size, since many intermediate fillers are omitted. For XFPro and XFrag, the opposite holds. However, XFPro performs slightly better than XFrag: XFPro only considers subroot nodes in the tid tree, while XFrag handles all operators in the pipeline. As the document size increases, more operators mean more redundant information, so the space cost grows.

5 Conclusions

This paper adopts the region coding scheme for XML documents and adapts it to the streamed XML fragment model. Taking advantage of the region coding scheme, we model query expressions as query trees and propose a set of techniques that enable further analysis and optimization. Based on the optimized query tree, we map a query tree directly into an XML fragment query processor, named XFPR, which speeds up query processing by skipping the correlation of adjacent fragments. Our experimental results over XPath expressions with different properties demonstrate the benefits of our approach.

Acknowledgement. This work is partially supported by the National Natural Science Foundation of China under grant Nos. 60573089 and 60273074 and by the Specialized Research Fund for the Doctoral Program of Higher Education under grant SRFDP 20040145016.

References

1. Fegaras, L., Levine, D., Bose, S., Chaluvadi, V.: Query processing of streamed XML data. In: Eleventh International Conference on Information and Knowledge Management (CIKM 2002), McLean, Virginia, USA (November 4–9, 2002)
2. Bose, S., Fegaras, L., Levine, D., Chaluvadi, V.: A query algebra for fragmented XML stream data. In: Proceedings of the 9th International Conference on Data Base Programming Languages, Potsdam, Germany (September 6–8, 2003)
3. Bose, S., Fegaras, L.: XFrag: A query processing framework for fragmented XML data. In: Eighth International Workshop on the Web and Databases (WebDB 2005), Baltimore, Maryland (June 16–17, 2005)
4. Liu, Y., Liu, X., Xiao, L., Ni, L., Zhang, X.: Location-aware topology matching in p2p systems. In: IEEE INFOCOM, Hong Kong (2004)
5. Chen, L., Ng, R.: On the marriage of lp-norm and edit distance. In: Proceedings of 30th International Conference on Very Large DataBase, Toronto, Canada (August, 2004)
6. Chen, L., Özsu, M.T., Oria, V.: Robust and fast similarity search for moving object trajectories. In: Proceedings of 24th ACM International Conference on Management of Data (SIGMOD’05), Baltimore, MD (June 2005)
7. Zhang, C., Naughton, J., DeWitt, D., Luo, Q., Lohman, G.: On supporting containment queries in relational database management systems. In: Proceedings of the 2002 ACM SIGMOD International Conference on Management of Data, Santa Barbara, CA (May, 2001)
8. Huo, H., Wang, G., Hui, X., Zhou, R., Ning, B., Xiao, C.: Efficient query processing for streamed XML fragments. In: The 11th International Conference on Database Systems for Advanced Applications, Singapore (April 12–15, 2006)
