
Research Track Poster

2PXMiner - An Efficient Two Pass Mining of Frequent XML Query Patterns

Liang Huai Yang, Mong Li Lee, Wynne Hsu, Xinyu Guo

School of Electronics Engineering and Computer Science, Peking University, China

School of Computing, National University of Singapore

[email protected]

{leeml,whsu,guoxinyu}@comp.nus.edu.sg

ABSTRACT

Caching the results of frequent query patterns can improve the performance of query evaluation. This paper describes a two-pass mining algorithm called 2PXMiner to discover frequent XML query patterns. We design three data structures to expedite the mining process. Experimental results indicate that 2PXMiner is both efficient and scalable.

Categories & Subject Descriptors: H.2.8 [Database Management]: Database Applications - Data Mining
General Terms: Algorithms, Performance
Keywords: XML Query Pattern, Tree Mining

1. INTRODUCTION

The efficient management and delivery of XML data has become a prominent focus of recent research. Tree patterns are a distinguishing feature of XML query languages, and matching tree patterns against XML data is a core operation in XML query processing. This operation can be expensive since it involves navigation through the hierarchical structure of XML documents, which can be deeply nested. One approach to improving the performance of XML management systems is to discover frequent queries and to cache the results of these queries [3, 5].

By modelling XML queries as query pattern trees, we obtain a database of query pattern trees. Each query pattern tree is a transaction that is associated with a transaction ID. Mining frequent query patterns is equivalent to finding the rooted subtrees that occur frequently over the set of pattern trees. This mining process involves tree matching, which is expensive and is complicated by the presence of wildcards “*” and relative paths “//” in the XML query pattern trees. Given that the rooted subtrees are candidate patterns, the search space is exponential in the size of a pattern tree. Hence, an efficient mining algorithm will typically aim to minimize the number of tree matchings needed.

An algorithm called FastXMiner was recently developed to discover frequent XML query patterns [5]. While FastXMiner is able to reduce the number of tree matchings required, there is no guarantee on the number of database scans required. Further, that method applies only to XML queries on a class of XML documents that do not contain sibling repetitions. However, it is important to consider XML queries that involve sibling repetitions because they are widespread in many XML applications, including authors in a publication database and actors in a movie database.

This paper describes a novel algorithm called 2PXMiner to discover frequent XML query patterns involving repeated sibling nodes. This method requires only two scans of the query pattern tree database. This is achieved by utilizing the following data structures:

1. Transaction summary structure called T-GQPT. This is a global query pattern tree that summarizes the query patterns and keeps track of the transaction ID of each pattern.
2. Equivalence Class Tree (ECTree). This is essentially a search tree that provides an infrastructure to generate candidate rooted subtrees.
3. Index structure called RETrie. This provides for the storage and fast tracing of rooted subtrees that contain at least one node of repeated siblings.

Based on the above data structures, we develop three optimization techniques in 2PXMiner:

1. Computing Upper Bound of Potential Frequent Patterns: We describe a scheme to compute the exact frequency for all paths in T-GQPT without accessing query pattern trees in the database. With this scheme, we can efficiently compute the upper bound frequency of the potential frequent query patterns from T-GQPT.
2. Early Pruning: We prove that by exploiting the T-GQPT, we can remove subtrees in T-GQPT whose frequency counts are below the minimum support before the candidate enumeration process. This technique greatly reduces the mining effort.
3. Tracing Repeated Candidates: The index tree RETrie canonizes unordered trees and provides for the tracing of repeated candidates caused by sibling repetitions in a query pattern tree. This effectively removes a large number of unnecessary computations.

Experimental results indicate that the proposed 2PXMiner is very efficient and has good scalability.
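The idea of a transaction summary tree mentioned in the introduction can be illustrated with a minimal Python sketch. It is not the authors' T-GQPT implementation: the names `Node`, `insert_qpt`, `build_summary`, `prune`, `path_counts` and the sample database `example_qdb` are hypothetical. The sketch also assumes that all query pattern trees share one root label, that siblings are merged by label (so sibling repetitions are not distinguished), and that wildcards “*” and relative paths “//” do not occur.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """One node of the summary tree: a label, the IDs of the transactions
    whose query pattern tree contains this root-to-node path, and children."""
    label: str
    tids: set = field(default_factory=set)
    children: dict = field(default_factory=dict)


def insert_qpt(summary, qpt, tid):
    """Merge one query pattern tree, given as (label, [children]), into the summary.

    Children are merged by label, so repeated siblings collapse into one
    node (a simplifying assumption of this sketch).
    """
    label, kids = qpt
    assert label == summary.label, "trees are assumed to share one root label"
    summary.tids.add(tid)
    for kid in kids:
        child = summary.children.setdefault(kid[0], Node(kid[0]))
        insert_qpt(child, kid, tid)


def build_summary(qdb):
    """Build the summary tree in a single pass over the database."""
    root = Node(qdb[0][0])
    for tid, qpt in enumerate(qdb, start=1):
        insert_qpt(root, qpt, tid)
    return root


def path_counts(node, prefix=()):
    """Return {path: number of transactions that contain that path}."""
    path = prefix + (node.label,)
    counts = {path: len(node.tids)}
    for child in node.children.values():
        counts.update(path_counts(child, path))
    return counts


def prune(node, min_count):
    """Remove every subtree whose root path occurs in fewer than min_count transactions."""
    node.children = {
        label: child
        for label, child in node.children.items()
        if len(child.tids) >= min_count
    }
    for child in node.children.values():
        prune(child, min_count)


# Hypothetical database of three query pattern trees.
example_qdb = [
    ("book", [("title", []), ("author", []), ("price", [])]),
    ("book", [("title", []), ("author", [])]),
    ("book", [("section", [("title", [])]), ("price", [])]),
]
```

In this simplified setting, the transaction set stored at a node gives the exact count of its root-to-node path, and the count of any rooted subtree that contains the path cannot exceed it. This is why subtrees below the minimum support can be dropped before any candidate is enumerated.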

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. KDD’04, August 22–25, 2004, Seattle, Washington, USA. Copyright 2004 ACM 1-58113-888-1/04/0008 ...$5.00.

2. PRELIMINARIES

Fig. 1(a) shows the query pattern tree of a query to retrieve the title, author and price of books where “books/section//title” has value “XML Schema”. A query pattern

[Figure 1: A query pattern tree and a rooted subtree. (a) A Query Pattern Tree; (b) A Rooted Subtree.]

[Figure 2: Example of a frequent query pattern tree. Panels: QPT1, QPT2, QPT3, QPT4 and a rooted subtree RST.]

pattern tree qpt of QDB, we denote it as $occ_{qpt}(RST) = 1$, otherwise $occ_{qpt}(RST) = 0$. The total occurrence of an RST in QDB is denoted by $freq(RST) = \sum_{qpt \in QDB} occ_{qpt}(RST)$. Let $|QDB|$ denote the number of QPTs in database QDB; the support level is defined as $supp(RST) = freq(RST) / |QDB|$. We say that RST is $\sigma$-frequent in QDB if $supp(RST)$ exceeds the minimum support $\sigma$ where 0