Discovering Interesting Information in XML Data with ... - CiteSeerX

Discovering Interesting Information in XML Data with Association Rules Daniele Braga∗ , Alessandro Campi∗ , Stefano Ceri∗ , Mika Klemettinen† , PierLuca Lanzi∗ ∗ Politecnico di Milano P.za L. da Vinci 32, I-20133 Milano, Italy † Nokia Research Center FIN-00045 Nokia Group P.O.Box 407, Finland

{braga,campi,ceri,lanzi}@elet.polimi.it [email protected]

Categories and Subject Descriptors H.4 [Information Systems Applications]: Miscellaneous; D.2.8 [Software Engineering]: Metrics—complexity measures, performance measures

ABSTRACT Data mining algorithms are designed to extract interesting information from large amounts of data. They usually assume that source data are in relational (tabular) form. However, the recent success of XML as a standard to represent semi-structured data and the increasing amount of data available in XML pose new challenges to the data mining community. In this paper we introduce association rules for XML data. To accomplish this, we propose a new operator, based on XPath and inspired by the syntax of XQuery, which allows us to express complex mining tasks, compactly and intuitively. The operator can indifferently (and simultaneously) target both the content and the structure of the data, since the distinction in XML is slight.

1.

INTRODUCTION

Knowledge discovery in databases (KDD) deals with the process of extracting interesting knowledge from large amounts of data, usually stored in large databases or data warehouses. Knowledge can be represented in many different ways, such as clusters, decision trees, decision rules, etc.; in particular, association rules [3] [4] have proved to be an effective tool to discover interesting relations in massive amounts of data. Association rules were initially introduced in the application domain of basket data analysis to analyze customer habits; basket data consist of sets of items (corresponding to the products which are bought in the same

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. SAC 2003 Melbourne, Florida, USA Copyright 2002 ACM X-XXXXX-XX-X/XX/XX ...$5.00.

purchase transactions), and association rules discover interesting properties within and between these sets. During the recent years, we have seen the dramatic development of the eXtensible Markup Language (XML) [19] as a major standard for storing and exchanging information. XML enables the self-description of hierarchies of semistructured information, by intermixing data content with semantic tags which describe such content. It is easy to foresee that, in the near future, a large portion of available data will be represented in XML. At the moment, the use of XML within the data mining community is still quite limited; there are some proposals to exploit the XML syntax to express interfaces to knowledge representation artifacts, including association rules, so that they can be easily exchanged among data mining tools; but there are no really significant extensions of data mining research taking full advantage of the intrinsic properties of XML. However, it is easy to foresee that the spreading of XML will cause an increasing interest on this subject, going beyond a mere syntactic adaptation to XML of data mining artifacts and techniques. In this paper, we introduce association rules from native XML documents. This extension is associated to nontrivial problems, related to the hierarchical nature of the XML data model. Consequently, most of the common and well-known abstractions of the relational framework need to be adapted to and redefined in the specific XML context. The proposed extension is formalized by introducing an XML-specific operator for describing association rules, called XMINE RULE, which is based on the use of XML query languages; in particular, it uses XQuery, the standard XML query language proposed by the W3C, and fully exploits XPath expressions, the the most important component of XQuery both in terms of ease of expression and of computability. Paper Outline. The paper is organized as follows. In Section 2 we introduce the notion of association rules for XML data. To assist the specification of mining tasks within XML documents, in Section 3 we introduce the XMINE RULE operator, providing its formal syntax and intuitive semantics. In Section 4 we illustrate, by means of examples, the expressive power of the XMINE RULE operator. Then we introduce, in Section 5, a specific clause of the XMINE RULE operator for remodeling a hierarchical structure by means

A 1 1 1 0

B 0 1 0 1

C 0 1 1 1

rule A⇒B C⇒A AB ⇒ C BC ⇒ A

(a)

supp. 0.25 0.5 0.25 0.25

conf. 0.33 0.66 1.00 0.50

(b)

Figure 1: An example of source data and of association rules generated from the data. of explicit grouping of items. In Section 6 we present an outline of how the prototype of XMINE RULE has been implemented by composing an execution environment for XPath expressions and an algorithm discovering frequent itemsets. We conclude the paper with a discussion of related work and of future research directions.

2.

ASSOCIATION RULES

Association rules [2, 4] are implications of the form X ⇒ Y where the rule body X and the rule head Y are subsets of the set I of items (I = {I1 , . . . , In }) within a set of transactions D. The rule X ⇒ Y states that the transactions T ∈ D which contain the items in X (X ⊂ T ) are likely to contain also the items in Y (Y ⊂ T ). Association rules are characterized by two measures: the support, which measures the percentage of transactions that contain both items X and Y ; the confidence, which measures the percentage of transactions that contain the items Y within the transactions that also contain the items in X. More formally, given the function freq(X, D), denoting the percentage of transactions in D which contains X, we define: support(X ⇒ Y ) confidence(X ⇒ Y )

= freq(X ∪ Y, D) (1) = freq(X ∪ Y, D)/freq(X, D) (2)

The problem of mining association rules consists of finding all the association rules, from a set of transactions D, that have support and confidence greater than the two defined thresholds: the minimum support (minsupp) and the minimum confidence (minconf). Association rules assume a relational representation of data (e.g., [4]), in which the transactions T ∈ D correspond to tuples, while the items I ∈ I correspond to binary attributes. For instance, a rather typical although simple example of source data is depicted in Figure 1(a). The attributes A, B, C represent the set of items I; the tuples represent the transactions, in particular the first tuple h1, 0, 0i represents a transaction with only the item A; the last tuple h0, 1, 1i represents a transaction with both the items B and C. Some association rules extracted from the data in Figure 1(a) are shown in Figure 1(b). For instance, the first rule A ⇒ B states that in 25% of the tuples, A and B appear together and that the 33% of tuples which contain A contain also B. To define association rules for XML documents, we have to map the basic concepts of association rules in the XML context. In particular, we must define what is a transaction T , and consequently the set D; what is an item and therefore the set I; and finally we must define how the function freq is computed so to define support and confidence within an XML document.

... ... ... ... ...
...

...
... ... XML...XQuery This award..

Figure 2: http://www.cs.atlantis.edu/research.xml, a sample document with various information about the research activities of a university department. First, as a working example, we introduce the XML document depicted in Figure 2 which represents various information about a department. In particular, it stores information about the available Ph.D. courses (identified by the tag ) and about the people in the department (). These can be either students () or professors (). For each of them, some personal information are stored () as well as the list of works published () such as books (), journal papers, or conference papers (
). A formal description of the document structure is included in the appendix through a DTD. We consider the problem of mining frequent associations among people that appear as coauthors in the publications appearing in the document in Figure 2. In practice, we are interested in finding associations of the form: “Wilson ⇒ Holmes” which states that, in the department, the papers which are authored by Wilson are also likely to have Holmes as author. Since any part of an XML document is a tree, usually called XML fragment, both D and I are collections of trees.

In particular, the transactions T ∈ D are the XML fragments that define the context in which the items I ∈ I must be counted. The items I ∈ I are the fragments that must be counted. If we consider the problem of mining association rules among authors appearing in the same papers, we have that D are the set of trees (i.e., fragments) of all the publications authored by people within the department. The items I ∈ I are the set of fragments of all the authors who appear in the works published by people within the department. This situation is depicted in Figure 3 where an example of the research.xml document is represented. The overall document is represented by the grey triangle. The white triangles represent the fragments corresponding to the various publications authored by people of the department. The black triangles represent the fragments corresponding to the authors who appear in various publications. Note that, the value of the tag is depicted at the bottom of the black triangle, thus a black triangle labeled Wilson is equivalent to the XML fragment Wilson. While the value of a white triangle, corresponding to a tag or to a
tags, represents an entire XML fragment which corresponds to a book or to an article, e.g.:
Wilson

In the example in Figure 3, D contains the four white triangles and I contains the the three black triangles corresponding to the authors Holmes, Stolzmann, and Wilson. While an itemset X ⊂ I is a set containing black triangles. It is now possible to compute freq(X, D) as the percentage of XML fragments in D which contains the fragments in X. In particular, freq(X, D) is computed as the percentage of white triangles which contain all the black triangles in X. Thus, in the example depicted in Figure 3, freq({Wilson,Holmes},D) is 0.5 since the fragments Wilson and Holmes appear together in two white triangles out of four, i.e., in 50% of the cases. Likewise, freq(Wilson},D) is 0.75 since the fragment Wilson appears in three white triangles out of four, i.e., in 75% of the cases. Therefore we can now compute the confidence of the (XML) association rule: Wilson

⇒ Holmes

as: freq({Wilson, Holmes}, D) freq(Wilson}, D) which returns 0.66, i.e., in 66% of papers published by Wilson appears also Holmes or, in the XML context, in 66% of the fragments in D in which appears the fragment Wilson appears also the fragment Holmes. While the support of the association rule is computed as: freq({Wilson,Holmes},D) which returns 0.5 since the fragments Wilson and Holmes appear

Figure 3: A graphical representation of the document research.xml described in Figure 2. together in two white triangles out of four, i.e., in 50% of the cases.

3. AN OPERATOR FOR XML MINING To specify complex association rule mining tasks in XML documents we introduce the XMINE RULE operator. XMINE RULE allows the specification of complex mining task compactly and intuitively, and puts in evidence the most relevant semantic features of the problem. The operator is based on XPath [20], inspired to the syntax of XQuery, and to the work of [15] in the context of relational databases. Following the approach of [15], we define XMINE RULE through a sequence of examples of increasing difficulty. As a first example, we consider the problem formerly discussed in Section 2 which consists of mining frequent associations among people who appear as coauthors. We can formulate this mining task with the following statement. XMINE RULE IN document("www.cs.atlantis.edu/research.xml") FOR ROOT IN //People/*/Publications/* LET BODY := ROOT/Author, HEAD := ROOT/Author EXTRACTING RULES WITH SUPPORT = 0.1 AND CONFIDENCE = 0.2 RETURN { FOR $item IN BODY RETURN { $Item } } { FOR $item IN HEAD RETURN { $Item } }

The XMINE RULE statement begins with the IN clause which specifies the data source, i.e., the document http://www.cs.atlantis.com/research.xml. In the FOR section, the special variable ROOT specifies the fragments which represent the transactions T ∈ D by means of one (or more) XPath expressions. In the example in Figure 3, ROOT identifies the white triangles. The next two special variables, BODY and HEAD, identify the fragments in the source data which should appear respec-

tively in the rule body and in the rule head, i.e., the set of items I ∈ I. Note that both BODY and HEAD are defined with respect to the context (i.e., to the ROOT variable) since confidence and support values have meaning only with respect to the context defined by the ROOT (i.e., the set of transactions D). More specifically, BODY and HEAD are defined as Author elements appearing in some Publications. Since the Author element appears both in articles and in books, Author is specified as a subelement of Publications/*. The EXTRACTING RULES WITH clause specifies the minimum support and the minimum confidence values for the output association rules with the clauses, “SUPPORT=0.1” and “CONFIDENCE=0.2”. The final RETURN clause specifies the structure of the association rules produced. An association rule is defined by a RULE tag, with two attributes, support and confidence which specify the support and confidence values; and two subelements, and , which specify the items in the rule body and in the rule head. Some of the rules produced by the above statement on the data depicted in Figure 2 are the following: Holmes Stolzmann ... Stolzmann Wilson Holmes Where, the latter RULE fragment represents the rule with Stolzmann and Wilson as body, and Holmes as head; the rule has support 0.25 and confidence 0.5. In the the remaining of the paper, we will omit the RETURN clause from the examples, since the one discussed above can be applied to all of them.

3.1 Syntax The XMINE RULE statement consists of six clauses, expressed with the following syntax, inspired by that of XQuery. XMINE RULE IN Doc-Name (, Doc-Name)∗ FOR ROOT IN XPathExpr (, FOR Variable IN XPathExpr)∗ LET BODY := (XPathExpr,)+ HEAD := XPathExpr (,XPathExpr)∗ (, Variable := XPathExpr)∗ [WHERE XQuery-Where-Clause] [GROUP Variable IN Target BY Criteria ] EXTRACTING RULES WITH SUPPORT = AND CONFIDENCE = [AND XQuery-Where-Clause] [RETURN XQuery-Return-Clause]

The IN clause specifies the source documents. The FOR/LET clauses define the mandatory ROOT, HEAD and BODY variables. Note that additional Variable definitions might be added depending on the user needs. All variables in this section are bound to fragment sets defined in terms of lists of commaseparated XPath expressions, denoted by the terminal symbol XPathExpr. The optional WHERE clause expresses arbitrary complex conditions to filter the fragments identified by the ROOT variable. All the conditions to filter the source data must be specified here; the HEAD and BODY keywords can be used to reference the corresponding fragment sets; the terminal symbol XQuery-Where-Clause is defined in [21]. The optional GROUP clause allows the restructuring of the source data so to define the mining task independently from the actual structure of the source documents (see Section 5 for further details). The EXTRACTING RULES WITH clause specifies a set of constraints on the association rules generated. In particular the mandatory SUPPORT and CONFIDENCE clauses specify the minimum support and the minimum confidence threshold as two real numbers (); additional constraints can be expressed with an XQuery-Where-Clause. The HEAD and BODY keywords can be still used here, in order to reference the right and left parts of the rules. Finally, the optional RETURN clause defines the structure of the output rules. It is expressed according to the syntax of the node constructors adopted in the RETURN clause of XQuery. Whenever RETURN is missing, the output association rules will be produced according to the PMML 2.0 specification [9].

3.2 Semantics We now present an intuitive semantics of the XMINE RULE operator described in Section 3. Let XR be the set of XPath expressions specified in the ROOT section; XB be the set of XPath expressions specified in the BODY section; XH be the set of XPath expressions specified in the HEAD section; XI be XB ∪ XH . In particular, we consider the simple case in which the BODY and the HEAD sections have identical sets of XPath expressions, XI = XB = XH , as in the example in Section 3. Let value be a function that, given an XPath expression p and an XML document d returns the set of XML fragments in d identified by p. Let satisfy be a function that, given an XML fragment f and an XQuery where clause w, returns one if f satisfies w, zero otherwise. Let contains be a function that, given an XML fragment x and an XML fragment y, returns one if x contains y, zero otherwise. First, we compute the set FD as the collection of all the XML fragments defined by the XPath expressions in XR , more formally: [ FD = value(p, d) p∈XR

Likewise, we compute the set FI as the collection of XML fragments that are defined by the XPath expressions in XI . [ FI = value(p, d) p∈XI

We now define FD′ and FI′ by filtering the sets FD and FI through the XQuery where clause w specified in the WHERE section. FD′ and FI′ are defined as follows:

FD′ FI′

= =

{f | f ∈ FD ∧ satisfy(f, w)} {f | f ∈ FI ∧ satisfy(f, w)}

a)

From the definitions of FD′ and FI′ it is now possible to compute the frequency of an itemset X, X ∈ FI′ , as follows: P Q ′ c∈FD f ∈X contains(c, f ) freq(X, FD′ ) = |FD′ | Given the function freq it is then possible to extract frequent (XML) itemsets from an XML document. Therefore it is possible to extract the association rules, as specified in the XMINE RULE statement. Note that all the three functions used to define the freq are available or easy to implement. In particular, the function value can be directly implemented by any of the available XPath interpreter (e.g., Xalan [18]). The function satisfy can be implemented by an XPath interpreter, if the conditions in the WHERE clause are simple enough; by an XQuery interpreter when this will become available. The function contains can be again easily implemented on top of an XPath interpreter (see Section 6).

4.

EXPRESSIVE POWER THROUGH EXAMPLES

b)

Figure 4: Different context definitions

In this section, we progressively demonstrate the various features of the XMINE RULE operator by means of examples.

of the previous one, and both support and confidence take different values.

4.1 Collaborations in the department

4.2 Related research topics

We begin with a new version of the example introduced in section 3, imposing the condition that at least one of the authors in the body of the generated rules is a member of the department. This is done by adding an XQuery predicate to the EXTRACTING... clause, that requires that at least one of the authors’ names (denoted by the $a variable) be contained in the PersonalInfo element, denoting the members of the department.

The next example shows two rules which differ for the choice of context; the two examples enable a discussion of the meaning of rules. In either case, the mining task aims at discovering frequent associations between research interests, represented by means of keywords. In the first example, the rule searches for associations among keywords related to people of the departments, and therefore extract research fields which are typically addressed by the same people. The head and the body of the generated rules is the Keyword element in publications, and the context is the set of all the department members (PhDStudent and FullProfessor elements), denoted as direct subelements of the People element (by means of the * XPath step).

XMINE RULE IN document("www.cs.atlantis.edu/research.xml") FOR ROOT IN //People/*/Publications/* LET BODY := ROOT/Author, HEAD := ROOT/Author EXTRACTING RULES WITH SUPPORT = 0.1 AND CONFIDENCE = 0.2 AND SOME $a IN BODY SATISFIES not(empty(//People/*/PersonalInfo/Name[=$a])) We can further develop this example in order to show a filtering condition, restricting the publications to be considered to those published in 2001. This can be done adding an appropriate WHERE clause. XMINE RULE IN document("www.cs.atlantis.edu/research.xml") FOR ROOT IN //People/*/Publications/* LET BODY := ROOT/Author, HEAD := ROOT/Author WHERE ROOT//@year = 2001 EXTRACTING RULES WITH SUPPORT = 0.1 AND CONFIDENCE = 0.2 AND SOME $a IN BODY SATISFIES not(empty(//People/*/PersonalInfo/Name[=$a])) This clause has obviously a side effect on the generated rules, since all the publications that do not satisfy the condition are pruned. Therefore, the resulting context is a subset

XMINE RULE IN document("www.cs.atlantis.edu/research.xml") FOR ROOT IN //People/* LET BODY := ROOT/Publications//Keyword, HEAD := ROOT/Publications//Keyword EXTRACTING RULES WITH SUPPORT = 0.1 AND CONFIDENCE = 0.2 The rationale of this rule is graphically represented in Figure 4a, where the white fragments denote the (ROOT elements (matching the (//People/* path expression), and the black fragments denote the HEAD and BODY fragments (Keyword). In the second example, the ROOT context is defined as the set of all publications, and keywords are searched within each publication, in a search space of much smaller granularity; of course, there are many more roots, as shown in Figure 4b. This rule mines the intrinsic “closeness” of research topics — as determined because the topics appear in the same research paper — and not the related interests of the researchers — as determined because two topics are of

interest to the same researcher. The second rule is expressed substituting in the previous example the following clauses: FOR ROOT IN /People/*/Publications/* LET BODY := ROOT/Keyword, HEAD := ROOT/Keyword Figure 4b shows the ROOT fragments of this rule (matching /People/*/Publications/*), still depicted in white; dashed lines correspond to the roots of the previous query.

DEPARTMENT

A typical association

PhDStudent FullProfessor

XMINE RULE IN document("www.cs.atlantis.edu/research.xml") FOR ROOT IN /People/*/Publications/* LET BODY := ROOT/Author, HEAD := ROOT/Keyword EXTRACTING RULES WITH SUPPORT = 0.1 AND CONFIDENCE = 0.2 AND count(BODY) >= 2 Since the intrinsic meaning of the task is to find collaborations, an obvious restriction to the generated rules is that they should contain at least two authors in the body, and this is expressed by the generic XQuery predicate count(BODY) >= 2, in addition to predicates defining the minimum support and confidence. An example of a possible rule is: Hennessy Patterson Computer Architecture

4.4 How to win an award The next example shows the use of several path expressions. We want to discover the associations between publications and awards, looking for which type of publication is mostly related to which kind of award. Since the publishers are represented as attributes in Journal elements and as PCDATA content in Books, while the conference types are stored as name attributes, the BODY fragment set is the set union of the nodes which three different path expressions evaluate to. XMINE RULE IN document("www.cs.atlantis.edu/research.xml") FOR ROOT IN /People/* LET BODY := ROOT/Publications/Book/Publisher, ROOT/Publications/Article/Journal/@publisher, ROOT/Publications/Article/Conference/@name, HEAD := ROOT/Award/@society, EXTRACTING RULES WITH SUPPORT = 0.1 AND CONFIDENCE = 0.2

PhDStudent award

award name

name

4.3 Collaborations on specific topics So far we presented rules in which the head and body fragment sets are defined by the same path expressions, and thus pertain to the same domain. The next example shows instead rules mining collaborations extracting authors when they publish papers on the same topic. This means that the rule HEAD will be extracted from the Author elements, and the rule BODY from the Keyword elements. A proper association between authors and keywords imposes that the ROOT be set as /People/*/Publications/* (all publications).

FullProfessor

name

award publisher

name

award publisher

publisher

award

Figure 5: A sample mining task In figure 5 we have represented as black triangles all the fragments that form the HEAD and in white the fragments that form the BODY, despite the fact that they are of different data types.

4.5 What helps a career We now show how we can intermix data and structural information in the same mining task. We want to discover which kinds of awards may influence the career, i.e., best correlates with the specific position of the members of the department. XMINE RULE IN document("www.cs.atlantis.edu/research.xml") FOR ROOT IN /People/* LET BODY := ROOT/Award/@society, HEAD := ROOT/name() EXTRACTING RULES WITH SUPPORT = 0.1 AND CONFIDENCE = 0.2 Note that the the positions “covered” by the members are expressed as different tagnames, and not as regular data. In order to compactly denote all career of a Department’s member, the HEAD definition exploits the XPath name() step, that extract the tagname of the items to which it is applied. This feature is particularly relevant when the mining activity is addressed arbitrary tags in arbitrary positions, e.g. when documents describe the same application domain with different DTDs.

5. GROUPING CLAUSE It may happen that the containment relationship between the various nodes in the source documents does not suite the specification of a particular mining task—for instance, because that the desired ROOT fragments are descendants of the desired HEAD or BODY fragments, and not vice versa. This situation typically occurs when the XML hierarchy is produced as the mapping of tables with an N to M relationships—for instance, the Course/Student relationship in the running example, represented with Student elements aggregated by Course elements—and then the mining activity is addressed to the opposite aggregation—e.g., to courses grouped by students. In order to support such mining task, the document should first be restructured; we have defined a restruc-

{ $ItemSet } }

Target

We show a use of the group by clause on a mining task that requires a restructuring, discovering courses which are frequently included in PhD students’ tracks.

Criteria

GroupedElements

GroupedSet

GroupedSet

GroupedSet

Figure 6: Grouping trees turing option that covers this very specific—but frequent—situation and allows the user to reference the items using path expressions that refer to the original containment relationship. This specific restructuring is declaratively expressed by the GROUP clause, which we have not explained so far. Its syntax is: GROUP Variable IN Target BY Criteria where Target and Criteria are two path expressions; Criteria should start with Variable and the two path expressions should only use descendant or immediate descendant operators. Figure 6 informally illustrates the process of restructuring of the XML tree. The figure above contains 9 different XML fragments that match the Target path expression; they are given the same color when they contain an equal sub-fragment in the position indicated by Criteria—note therefore that there are 3 different colors, corresponding to three different sub-fragments. These nine fragments are grouped, in the figure below, based on the identity of their sub-fragments; this result is achieved by generating, for the purpose of the data mining activity, a new version of the document, with a collection of three newly generated fragment sets, such that each new fragment set has a new node as root, immediately followed by the original XML fragments matching the Target path expression, and by the common sub-fragments which qualifies the given fragment set. The semantics of this transformation is expressed formally by the XQuery expression that builds up the new version of the document. Such XQuery expression can be generated from the grouping clause by referencing its parametric components and by generating some new tags, as follows: { FOR $Crit IN Criteria LET $ItemSet := ( FOR $Item IN Target WHERE $Item/Criteria = $Crit RETURN $Item ) RETURN { $Crit }

XMINE RULE IN document("www.cs.atlantis.edu/research.xml") FOR ROOT IN //PhDCourses/Course/Student LET BODY := //PhDCourses//Course, HEAD := //PhDCourses//Course GROUPING $target IN //PhDCourses//Course BY $target//Student EXTRACTING RULES WITH SUPPORT = 0.1 AND CONFIDENCE = 0.2 Note that in this case the ROOT variable is not used as a prefix in the definition of HEAD and BODY, since they are not contained in the ROOT according to the native hierarchy. Also note that the context is defined in terms of the different students that attend at least one course (i.e. the cardinality of the distinct nodes that match the Criteria path expression). By instantiating the parametric transformation, we would build the following XQuery statement: { FOR $Crit IN //PhDCourses/Course/Student LET $ItemSet :=(FOR $Item IN //PhDCourses/Course WHERE $Item/Student/@ref = $Crit RETURN $Item) RETURN... }

6. IMPLEMENTATION NOTES In Section 3.2, we sketched an intuitive semantics for the XMINE RULE operator. In this section we use those definitions to illustrate how we have implemented the first prototype of the operator. More information about the implementation are available from [12, 7]. We implemented a prototype of the XMINE RULE operator as a three-step process. In the initial step, preprocessing, the XMINE RULE statement is processed in order to generate a representation of the mining problem as a binary relation R. Then in the following step, mining, binary association rules are extracted from R; during this phase the constraints specified in the EXTRACTING RULES clause are exploited. In the final step, post-processing, the binary association rules extracted from R are finally mapped into the XML representation. Preprocessing. First the sets FD and FI (see Section 3.2) are generated. These sets contain the XML fragments addressed by the XPath expressions specified in the clauses ROOT (FD ), HEAD, and BODY (FI ). In general, this step could be implemented by any of the available XPath interpreters; in our case, we use the Java implementation of Xalan [18]. Then the sets FD′ and FI′ containing the XML fragments in FD and FI which satisfy the WHERE clause are generated. This step is implemented on the top of the Xalan interpreter of Xpath. Note that since there is no XQuery interpreter currently available, our initial implementation supports only basic WHERE conditions mainly based on XPath expressions such as: ROOT//@year=2001. With the sets FD′ and FI′ the relational table R is generated as follows. • R has as many attributes (i.e., columns) as is the number of XML fragments in FI′ , i.e., R ⊂ {0, 1}|FI | .

• R has as many tuples (i.e., rows) as the number of fragments in FD′ , i.e., R = {r1 , . . . , r|FD′ | }; • for every XML fragment ci ∈ FD′ and every XML fragment fj ∈ FI′ , a tuple ri ∈ R, ri = hri,1 . . . ri,|FI | i, is defined as ri,j = contains(ci , fj ). From Section 3.2 we recall that the function contains(ci , fj ) returns one if the fragment ci contains fj . Note that the current implementations of XPath do not provide any method to check whether a fragment appear in another fragment. Accordingly, in our prototype the function contains is implemented on the top of the XPath interpreter as follows. Given two fragments ci and fj , their textual representation is generated exploiting the functionalities of the XPath interpreter. Then it is checked whether the textual representation of fj actually appear in the textual representation of ci ; if it does, one is returned, otherwise zero. This solution is indeed computationally expensive, but on the other hand, the comparison of XML fragments is one of the major open issues in the XML community. Thus, it might be possible in the future to reduce the complexity of this phase through some advanced technique introduced in the next versions of XPath and XQuery. Mining. With the table R and the constraints specified in the EXTRACTING RULE clause, binary association rules are extracted by using the Apriori algorithm. Note that among the many constraints which might be specified, our implementation currently considers only constraints concerning the composition of the rules (e.g., the number of items in the rule head or body), the minimum support, and the minimum confidence. In fact, complex constraints (e.g., those described in Section 4) are currently not possible with a tabular representation — they would require mining the XML association rules directly from the DOM (Document Object Model) representation. DOM offers a platform- and language-neutral API (Application Programming Interface) allowing programs to dynamically access and update the content, structure and style of the documents. Postprocessing. In the final step, the boolean association rules are remapped into the XML representation. Here the information which have been memorized during the preprocessing, such as the correspondence between the XML fragments in FD′ and FI′ and the boolean attributes in R, is exploited in order to print an XML version of the association rules generated through the Apriori algorithm. Efficiency has not been of our main concern in preparing the initial implementation; however, the first experiments of use are good thanks to the excellent performance of the Xalan XPath interpreter. In addition, our initial experiments suggests that the preprocessing and postprocessing steps are reasonably fast with respect to the central mining phase. These considerations suggest that an efficient execution of XMINE RULE statements is feasible with current technology as long as the statement includes only XPath expressions. Our three-step process, which exploits an intermediate relational representation, allowed us to implement a basic prototype smoothly. On the other hand, the lack of a robust XQuery execution environment does not allow us to compute the complex predicates supported by the XMINE RULE syntax on a native XML representation. The initial implementation described above (see also [7]) leaves many open issues, which must be addressed to have a fully functional and efficient implementation of the XMINE

RULE operator. We believe that the most important open issue regards the support evaluation, which should be used to extract association rules from XML documents. There are currently no XQuery interpreters to be used in filtering XML fragments and although XPath processors are quite advanced, they do not yet provide all the required functionalities either. Many of these functionalities should be included in the new XPath 2.0 version, while some operations are still not defined but are still requirements of the W3C consortium. Therefore, many required computation must be coded directly on the raw DOM representation of the documents (e.g., the filtering of WHERE clauses) which causes some computational overhead.

7. RELATED WORK Since the introduction of association rules in [3], the area has attracted the research community. First, the Apriori algorithm [5, 13] for extracting association rules was developed. Then, a number of algorithms for the extraction of association rules from multivariate data have been proposed (see [10] for a recent survey). Association rules and episode rules [14] have been applied to the mining of plain textual data (e.g., [6, 16]), or to semi-structured documents [17]. Additionally, a number of query languages such as MINE RULE (e.g., [15]), MSQL [11], and DMSQL [10] have been proposed to assist association rule mining. Because a large part of the data available in the very next future will be represented in XML, a number of proposals to exploit XML within Data Mining have been proposed. These proposals regard both the use of XML as a format to represent the knowledge extracted from the data and the development of techniques and tools for mining data expressed in XML documents. However, as far as we know, a well-formulated framework for discovering association rules within the context of native XML documents has not been introduced so far. There are of course, tools which deal with the mining of XML data. For instance, XML Miner [8] searches for relationships within XML code to predict values using fuzzy-logic rules. XML Rule [8] takes the output generated by XML Miner or designed by hand, and applies them to web sites. The Predictive Model Markup Language (PMML) [9] is an XML standard which can be used to represent most of the types of patterns which can be extracted from the data, e.g., association rules, decision trees, decision rules, etc. The PMML initiative is supported by a consortium of companies (such as IBM, Oracle, and SPSS) which promote it as a protocol for exchanging information among different data mining applications.

8. CONCLUSIONS In this paper, we defined association rules from native XML data by means of examples using a new XMINE RULE operator based on and inspired by MINE RULE [15] and XQuery. We also discussed the syntax and semantics behind the XMINE RULE operator, and presented concrete steps for implementation. Although this research provides a good starting point, there are still many open issues for future research — even within the work presented in this paper. Additionally, association rules from XML data could be extended to the case of episode rules [14]. A more general question is how to

consider and connect all available metadata — not only the XML DTD or schema information, but also RDF metadata and DAML+OIL format ontologies — for a specific mining task in a specific domain. This additional metadata provides enhanced possibilities to constrain the mining queries both automatically and manually, but it also increases the complexity and thus poses lots of new requirements for the data mining query language.

Acknowledgements In this work, authors have been supported by the consortium on discovering knowledge with Inductive Queries (cInQ) [1] of the EU-IST Programme (Contract no. IST-2000-26469); Daniele Braga and Alessandro Campi are partially supported by the “Virtual Campus” Microsoft Research Grant.

9.

REFERENCES

[1] consortium on discovering knowledge with Inductive Queries (cInQ). http://www.cinq-project.org. [2] Rakesh Agrawal, Tomasz Imielinski, and Arun Swami. Mining association rules between sets of items in large databases. In Peter Buneman and Sushil Jajodia, editors, Proceedings of ACM SIGMOD Conference on Management of Data (SIGMOD’93), pages 207 – 216. ACM, May 1993. [3] Rakesh Agrawal, Tomasz Imielinski, and Arun N. Swami. Mining association rules between sets of items in large databases. In Peter Buneman and Sushil Jajodia, editors, Proceedings of ACM SIGMOD Conference on Management of Data (SIGMOD’93), pages 207 – 216. ACM, 1993. [4] Rakesh Agrawal, Heikki Mannila, Ramakrishnan Srikant, Hannu Toivonen, and A. Inkeri Verkamo. Fast discovery of association rules. In Usama M. Fayyad, Gregory Piatetsky-Shapiro, Padhraic Smyth, and Ramasamy Uthurusamy, editors, Advances in Knowledge Discovery and Data Mining, pages 307 – 328. AAAI Press, 1996. [5] Rakesh Agrawal and Ramakrishnan Srikant. Mining sequential patterns. In Philip S. Yu and Arbee L. P. Chen, editors, Proceedings of the Eleventh International Conference on Data Engineering (ICDE’95), pages 3–14. IEEE, 1995. [6] Helena Ahonen, Oskari Heinonen, Mika Klemettinen, and A. Inkeri Verkamo. Mining in the phrasal frontier. In Jan Komorowski and Jan M. Zytkow, editors, Proceedings of the First European Symposium on Principles of Data Mining and Knowledge Discovery (PKDD’97), volume 1263 of Lecture Notes in Computer Science, pages 343 – 350. Springer-Verlag, 1997. [7] Roberto Carnevale. Algoritmi per stemming ed estrazione di regole di associazione su documenti xml. Master’s thesis, April 2002. Master thesis supervised by Marco Colombetti and Pier Luca Lanzi. (available in Italian). [8] Scientio Data Mining Company. XML Miner and XML Rule. http://www.metadatamining.com, 2001. [9] Data Mining Group. PMML 2.0 Predictive Model Markup Language. http://www.dmg.org, August 2001.

[10] Jiawei Han and Micheline Kamber. Data Mining Concepts and Techniques. Morgan Kaufmann, 2001. [11] Tomasz Imielinski and Aashu Virmani. MSQL: A query language for database mining. Data Mining and Knowledge Discovery, 3(4):373 – 408, 1999. [12] PierLuca Lanzi. XMINE RULE a tool to extract xml association rules. submitted, May 2002. [13] Heikki Mannila, Hannu Toivonen, and A. Inkeri Verkamo. Efficient algorithms for discovering association rules. In Usama M. Fayyad and Ramasamy Uthurusamy, editors, Knowledge Discovery in Databases, Papers from the 1994 AAAI Workshop (KDD’94), pages 181 – 192. AAAI Press, 1994. [14] Heikki Mannila, Hannu Toivonen, and A. Inkeri Verkamo. Discovering frequent episodes in sequences. In Usama M. Fayyad and Ramasamy Uthurusamy, editors, Proceedings of the First International Conference on Knowledge Discovery and Data Mining (KDD’95), pages 210 – 215. AAAI Press, 1995. [15] Rosa Meo, Giuseppe Psaila, and Stefano Ceri. An extension to SQL for mining association rules. Data Mining and Knowledge Discovery, 2(2):195 – 224, 1998. [16] Martin Rajman and Romaric Besan¸con. Text mining: Natural language techniques and text mining applications. In Proceedings of the seventh IFIP 2.6 Working Conference on Database Semantics (DS-7), Chapam & Hall IFIP Proceedings series, 1997. [17] Lisa Singh, Peter Scheuermann, and Bin Chen. Generating association rules from semi-structured documents using an extended concept hierarchy. In Forouzan Golshani and Kia Makki, editors, Proceedings of the Sixth International Conference on Information and Knowledge Management (CIKM’97), pages 193 – 200. ACM, 1997. [18] The Apache Software Foundation. The Apache XML Project. http://xml.apache.org/xalan-j/. [19] World Wide Web Consortium. Extensible Markup Language (XML) Version 1.0 (W3C Recommendation). http://www.w3c.org/xml/, February 1998. [20] World Wide Web Consortium. XML Path Language (XPath) Version 1.0 (W3C Recommendation). http://www.w3c.org/tr/xpath/, November 1999. [21] World Wide Web Consortium. XQuery 1.0: An XML Query Language (W3C Working Draft). http://www.w3.org/TR/2001/WD-xquery-20010607, June 2001.

APPENDIX THE DOCUMENT DTD

advisor IDREF #REQUIRED> PersonalInfo (Name)> PersonalInfo email CDATA #REQUIRED> Name (#PCDATA)> Subscription EMPTY> Subscription year CDATA #REQUIRED> FullProfessor (PersonalInfo, Publications, Award*)> FullProfessor id ID #REQUIRED> Publications (Article*, Book*)> Article (Author+, (Conference|Journal), Keyword*)> Article title CDATA #REQUIRED> Conference EMPTY> Conference name CDATA #REQUIRED year CDATA #REQUIRED> Journal EMPTY> Journal year CDATA #REQUIRED month CDATA #IMPLIED volume CDATA #IMPLIED name CDATA #REQUIRED publisher CDATA #REQUIRED> Book (Author+, Publisher, Keyword*)> Book title CDATA #REQUIRED year CDATA #REQUIRED> Author (#PCDATA)> Publisher (#PCDATA)> Keyword (#PCDATA)> Award (#PCDATA)> Award year CDATA #REQUIRED society CDATA #REQUIRED>

Discovering Interesting Information in XML Data with ... - CiteSeerX

Discovering Interesting Information in XML Data with ... - CiteSeerX

Suggest Documents

Discovering Interesting Holes in Data - CiteSeerX

Discovering Interesting Information with Advances in ... - Pierre Senellart

Discovering Interesting Association Rules by Clustering - CiteSeerX

XML-Based Information Mediation with VAMP - CiteSeerX

XML data integration with identification - CiteSeerX

XML and Information Visualization - CiteSeerX

Structured Information Retrieval in XML documents - CiteSeerX

Evaluation in (XML) Information Retrieval - CiteSeerX

Exchanging Intensional XML Data - CiteSeerX

Exchanging Intensional XML Data - CiteSeerX

Exchanging Intensional XML Data - CiteSeerX

MatML: XML for Information Exchange with Materials Property Data

Discovering Geographic Knowledge in Data-Rich ... - CiteSeerX

A Clustering-based Approach for Discovering Interesting ... - CiteSeerX

Discovering Dynamic Dipoles in Climate Data - CiteSeerX

Querying XML Data - Semantic Information Research Lab

Discovering Interesting Patterns Through User's ...

Discovering Interesting Subsets Using Statistical ... - Semantic Scholar

Datamining: Discovering Information From Bio-Data

Datamining: Discovering Information From Bio-Data

Interesting Information (Jokes, Quotes…)

Discovering Patterns in Microsatellite Flanks with ... - CiteSeerX

Data Integration with XML and Semantic Web Technologies - CiteSeerX

Information flow security for XML transformations - CiteSeerX