
XML MINING: FROM TREES TO STRINGS

S. Miniaoui, M. Wentland Forte
INFORGE, Ecole des HEC de Lausanne, Switzerland
email: [email protected] and [email protected]

Abstract

Over the last few years, XML has become the standard for data exchange on the Web and a new data description language. Consequently, in a Data Mining context, optimizing the storage of XML documents and the access time to them is becoming a new challenge. To mine XML documents, we have to parse them in order to obtain a tree data structure in RAM. This tree structure is more flexible than the textual format and offers better access time and navigation. Moreover, the tree representation of XML documents is semantically richer than the textual one. Thus, for an efficient XML mining task, the XML mapping stage should preserve the semantics of the data (the hierarchy of concepts within the XML document) and generate a compact data structure. In this paper we present an overview of XML mapping propositions and discuss their advantages and drawbacks for efficient mining. We then note the relevant concepts to consider in an XML mapping intended for efficient mining.

Keywords: XML mapping, Data mining

1. Introduction

XML has emerged as a new standard for data exchange on the Web. Moreover, the semi-structured data model [Abietboul 97] adopted for XML documents brings more structure, flexibility and semantic richness. Indeed, the OEM data model [AQM+96] represents an XML document as an ordered labelled tree. This tree view of an XML document is more flexible than the textual one.

In a KDD process, the data preparation stage (selection, cleaning, transformation) is necessary [fayyad and al 96], and mastering this stage can improve the results of data mining algorithms. For XML, data preparation consists in mapping the XML document to a specific data model, with the aim of obtaining a compact data structure for an efficient Data Mining stage. Much of the research done in this perspective considers the DOM tree to be the best data structure for handling XML documents. For the best performance in a data mining context, the basic idea is:
• To store the DOM tree in a compact tree structure (Trie, PATRICIA Trie)
• To map the DOM tree to another data model (Relational, Path-Based format, Strings)

This work aims to demonstrate the utility of mapping XML documents to another data model for better performance. Furthermore, in the Frequent Structure Mining context, we will show that the string data structure is more advantageous than the tree data structure, which is counter-intuitive. We discuss in this paper the importance of XML mapping for efficient mining and we introduce the two principal approaches: the Trie-based mapping and the String-based mapping.

This paper is structured as follows. In Section 2, we introduce related work. In Sections 3 and 4, we present XML mapping propositions. In Section 5, we introduce key concepts for an efficient XML mapping. Section 6 is a discussion, and Section 7 presents the conclusion and future work.

2. Related Works

The tree is widely regarded as the most advantageous data structure in terms of navigation and access time. Consequently, parsers based on an API such as DOM transform XML documents into trees in RAM (Fig 1). The resulting trees are then processed by Data Mining algorithms, for example for clustering or for frequent sub-tree detection. Many works on mining XML documents propose mapping strategies that store XML trees in compact data structures (Table, Trie, String) for better performance.

The first mapping proposed for XML documents targeted the relational data model. This data model is well established for data representation and management, and it comes with powerful analysis tools. In this mapping, each XML tree is encoded into a table where element and attribute names are the column names and tag contents are the transactions (rows). This mapping raises problems in the choice between element and attribute mapping [Kappel and al 00], and it loses the tree representation of the data (empty tags are not mapped, which is a semantic loss). In the following, we present an overview of notable XML mapping propositions and discuss their advantages and drawbacks.
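The semantic loss of the relational mapping can be illustrated with a minimal Python sketch. It is not the mapping of any cited system: the sample document, the added empty `<Note/>` element and the helper `to_row` are assumptions made for illustration, and the flattening rule (one column per element that carries a text value) is deliberately naive.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document modelled on the Book example of Fig 1.
# The empty <Note/> element is added here only to show what gets lost.
BOOK_XML = """<Book>
  <Author><Name>Albert</Name><Sname>Camus</Sname></Author>
  <Title>L'étranger</Title>
  <Edition>1980</Edition>
  <Note/>
</Book>"""


def to_row(root):
    """Flatten one XML tree into a {column: value} row (naive relational mapping)."""
    row = {}
    for elem in root.iter():              # pre-order walk over all elements
        text = (elem.text or "").strip()
        if text:                          # only elements with a value become columns
            row[elem.tag] = text
    return row


root = ET.fromstring(BOOK_XML)
row = to_row(root)
print(row)
```

In this sketch the row keeps the four values but drops the empty `Note` tag, and nothing in the row records that `Name` and `Sname` were nested under `Author`.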

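[Fig 1: DOM tree of the Book example. The root Book has three children: Author (with children Name = Albert and Sname = Camus), Title (L’étranger) and Edition (1980). Nodes are numbered 1 to 7 in the figure.]

3. Trie-based mapping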

The trie is a data structure that stores a string in a tree with one edge for each of its characters [Fredkin 60]. The PATRICIA trie (Practical Algorithm To Retrieve Information Coded In Alphanumeric) is a compact trie, which makes it an advantageous data structure for Data Mining algorithms. However, a first preparation stage is necessary before an XML document can be stored in a trie. [Wang and al 03] propose to extract all the sequences (Fig 2) from the XML tree. This transforms the XML tree into sequences (ancestor, element), where ancestor is the path from the root to the current element (E denotes the empty ancestor path of the root):

(E, books), (books, book), (books book, Author), (books book Author, Name), (books book Author Name, Albert), (books book Author, Sname), (books book Author Sname, Camus), (books book, Title), (books book Title, L’étranger), (books book, Edition), (books book Edition, 1980).

Fig 2: Sequences of the XML document

The second stage of the mapping stores these sequences in a compact data structure: the PATRICIA trie (Fig 3). As a further optimization, [Wang and al 03] propose a tag coding stage in order to reduce memory usage. This mapping of XML documents to compact tries (PATRICIA tries) preserves the tree structure of XML documents and optimizes the storage space of the trees, which consequently improves performance in a data mining context [Cooper and al 01]. This two-step mapping is interesting but, in frequent structure mining for example, a string representation of the tree can be more advantageous. In addition, this mapping does not exploit the OID of each node, which can be relevant for node localization. Indeed, [Zaki 02] proposes encoding the XML tree into a string.
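The two stages described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the implementation of [Wang and al 03]: the function names `sequences` and `build_trie` are hypothetical, text values are treated as leaf items, and the second stage uses a plain nested-dictionary trie with one tag per edge rather than a true PATRICIA trie (which would additionally merge chains of single-child nodes).

```python
import xml.etree.ElementTree as ET

# Sample document modelled on the books example behind Fig 2.
BOOKS_XML = """<books><book>
  <Author><Name>Albert</Name><Sname>Camus</Sname></Author>
  <Title>L'étranger</Title>
  <Edition>1980</Edition>
</book></books>"""


def sequences(elem, ancestors=()):
    """Yield (ancestor path, item) pairs in pre-order."""
    yield (ancestors, elem.tag)
    path = ancestors + (elem.tag,)
    text = (elem.text or "").strip()
    if text:
        yield (path, text)                # a text value is treated as a leaf item
    for child in elem:
        yield from sequences(child, path)


def build_trie(pairs):
    """Store every full path (ancestors + item) in a nested-dict trie.

    Plain trie with one label per edge; NOT a PATRICIA trie.
    """
    trie = {}
    for ancestors, item in pairs:
        node = trie
        for label in ancestors + (item,):
            node = node.setdefault(label, {})
    return trie


pairs = list(sequences(ET.fromstring(BOOKS_XML)))
trie = build_trie(pairs)
for ancestors, item in pairs:
    # "E" marks the empty ancestor path of the root, as in Fig 2.
    print(" ".join(ancestors) or "E", "->", item)
```

Because every sequence shares its prefix with the sequences of its ancestors, the trie ends up with the same shape as the original tree, which is why this mapping is said to preserve the tree structure.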

Tag coding: Book: 1, Author: 2, Name: 3, Sname: 4, Title: 5, Edition: 6

[Fig 3: PATRICIA trie-based mapping. Keys shown in the figure: 123V1, 124V2, 15V3, 16V4.]

4. String-Based Mapping

The first stage of this mapping consists in encoding all the edges (tag names) of the XML DOM tree in a numeric format (Fig 4.A). A pre-order traversal of the tree then encodes the tree into a string (Fig 4.B). [Cooper and al 01] propose to store the resulting string in a PATRICIA trie. [Zaki 02], however, proposes a mapping algorithm that inserts a special symbol (-1) in the string to indicate a move back up the tree, which is reminiscent of the head movements of a Turing machine (Fig 4.C). This string-based mapping preserves all paths within an XML document and transforms the XML mining problem into a string mining problem (similar to the detection of common sequences in DNA).

Books: 0, Book: 1, Author: 2, Name: 3, Sname: 4, Title: 5, Edition: 6
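The encoding steps can be sketched in Python as below. The helper names `tag_codes` and `encode` are hypothetical, and the sketch makes two assumptions: codes are assigned in order of first appearance in a pre-order walk, and one -1 marker is emitted for every move from a child back to its parent, so consecutive upward moves give consecutive markers. The resulting strings may therefore differ from the condensed strings printed in Fig 4.

```python
import xml.etree.ElementTree as ET

BOOKS_XML = """<Books><Book>
  <Author><Name>Albert</Name><Sname>Camus</Sname></Author>
  <Title>L'étranger</Title>
  <Edition>1980</Edition>
</Book></Books>"""


def tag_codes(root):
    """Assign an integer code to each distinct tag, in order of first appearance."""
    codes = {}
    for elem in root.iter():              # iter() walks the tree in pre-order
        codes.setdefault(elem.tag, len(codes))
    return codes


def encode(elem, codes, out=None):
    """Pre-order encoding with a -1 marker for each move back up to a parent."""
    if out is None:
        out = []
    out.append(codes[elem.tag])
    for child in elem:
        encode(child, codes, out)
        out.append(-1)                    # backtrack from child to elem
    return out


root = ET.fromstring(BOOKS_XML)
codes = tag_codes(root)
encoded = encode(root, codes)
preorder = [c for c in encoded if c != -1]   # plain pre-order string, no markers

print(codes)
print(" ".join(map(str, preorder)))
print(" ".join(map(str, encoded)))
```

With the markers in place, the tree can be rebuilt from the string alone, which is what allows tree mining to be carried out on strings.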

[Fig 4: String-based mapping. A: tag coding (the code list above). B: corresponding XML string, 01234516. C: corresponding XML string (Zaki), 0123-14-15-16.]

5. Mapping for efficient mining

In this section we present some concepts that a mapping could take into account for an efficient Data Mining stage.

5.1 The Object Identifier

In the document object model (DOM), each node is identified by an object identifier (OID). The OID can be likened to a primary key in a relational table. However, few mapping algorithms exploit this concept, although it could help to optimize retrieval of and access to XML tree nodes, as well as path evaluation. Data Mining algorithms need this concept for efficient tree searching [Gardarin and al 99] [Zaki 02].

5.2 Layers

The XML tree is a collection of data structured in layers. Knowing the layer of a node could therefore facilitate its localization, reduce its access time and provide a new parameter (the layer) that is useful in a Data Mining context. [Cooper and al 01] propose a multi-layer index that optimizes access to XML nodes and that can be relevant in the Data Mining stage.

5.3 Explicit representation of Paths

An XML tree is a collection of paths going from the root element to the leaves, and navigation within an XML document is based on path evaluation. Consequently, a good mapping must represent the paths within the XML document explicitly. In the Data Mining stage, paths can be represented as a string [Zaki 02], a collection of sequences [Wang and al 03], tuples [Gardarin and al 99] or a trie branch [Cooper and al 01].

5.4 The Node Scope

The scope of a node is an OID interval whose first bound is the OID of the current node and whose second bound is the OID of the last node, in pre-order, of the subtree rooted at the current node (its rightmost descendant) [Zaki 02] (Fig 5). This concept is interesting for optimizing branch exploration and node retrieval. Furthermore, it can be exploited, for instance, in frequent structure mining [Zaki 02] or for layer deduction [Cooper and al 01].
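The four concepts can be computed together in a single pre-order traversal, as in the following Python sketch. The names `annotate` and `is_descendant` and the record layout are hypothetical. The sketch also assumes that only elements receive an OID (text values are not numbered as separate nodes), so the numbers differ from those of Fig 5.

```python
import xml.etree.ElementTree as ET

# Sample document modelled on the Book example; only elements are numbered
# here, which is an assumption of this sketch.
BOOK_XML = """<Book>
  <Author><Name>Albert</Name><Sname>Camus</Sname></Author>
  <Title>L'étranger</Title>
  <Edition>1980</Edition>
</Book>"""


def annotate(root):
    """Return one record per element with its OID, layer, path and scope."""
    records = []

    def visit(elem, layer, parent_path):
        oid = len(records) + 1            # OIDs are pre-order ranks, starting at 1
        rec = {"oid": oid, "tag": elem.tag, "layer": layer,
               "path": parent_path + "/" + elem.tag}
        records.append(rec)
        for child in elem:
            visit(child, layer + 1, rec["path"])
        # Once the subtree is done, the last OID handed out closes the scope.
        rec["scope"] = (oid, len(records))

    visit(root, 1, "")
    return records


def is_descendant(ancestor, node):
    """Scope test: node lies strictly inside the OID interval of ancestor."""
    low, high = ancestor["scope"]
    return low < node["oid"] <= high


records = annotate(ET.fromstring(BOOK_XML))
by_tag = {r["tag"]: r for r in records}
for r in records:
    print(r["oid"], r["layer"], r["scope"], r["path"])
```

The point of the scope is visible in `is_descendant`: an ancestor test reduces to two integer comparisons, with no tree traversal.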

[Fig 5: DOM tree with the scope of each node, arranged in Layers 1 to 4. The root Book has scope S = [1,7]; the other scopes shown are S = [2,6], S = [3,5], S = [4,4], S = [5,5], S = [6,6] and S = [7,7].]

6. Discussion

The mapping of XML documents is relevant for handling this data format well. Indeed, the semi-structured data model adopted for XML documents brings more structure, flexibility and semantic richness to Web documents. This data model is exploited by some MDBS to optimize the storage, access and querying of XML documents. However, only some MDBS are native XML repositories [McHugh and al 97] [Aguilera and al 00] that rely on the semi-structured data model for managing XML documents, both for querying and for storage. Moreover, the DOM API, which is an application of the semi-structured data model, generates a tree data structure that is advantageous for the Data Mining stage. The tree is widely regarded as the optimal data structure for navigation. Consequently, many XML mapping propositions exploit the DOM tree to generate a compact tree structure (trie, PATRICIA trie) that is advantageous in a data mining context. In the frequent structure mining context, for example, and counter-intuitively, the string data structure is more advantageous than the tree data structure. Furthermore, the concepts mentioned above (OID, Scope, Explicit Paths, Layers) could be used in mapping algorithms to improve the semantic richness of the generated data structure.

7. Conclusion and Future Works

In this paper we have presented some mapping propositions for XML documents. The basic idea of these propositions is to store the XML document in a compact data structure in order to improve the performance of retrieval and of the Data Mining task. We have also presented some concepts to integrate in a mapping algorithm in order to optimize access to and retrieval of XML nodes. Indeed, Paths, OID, Layer and Scope reveal the semi-structured (tree) nature of XML documents. Furthermore, these concepts are powerful parameters for handling XML documents well in a Data Mining context. We share the point of view of [Zhang and al 04] about XML algebras for Data Mining. In a specific context (frequent sequence mining), the string data structure can be more advantageous than the tree data structure, which is counter-intuitive. We are working on a mapping algorithm that involves the concepts mentioned above. This mapping could bring the additional semantic richness sought in a Data Mining context.

References

[Abietboul 97] S. Abiteboul, Querying semi-structured data, In Proc. of the International Conference on Database Theory (ICDT'97), p 1-18, Delphi, Greece, 1997.
[Aguilera and al 00] V. Aguilera, S. Cluet, P. Veltri, D. Vodislav, F. Wattez, Querying XML Documents in Xyleme, In Proceedings of the ACM-SIGIR 2000 Workshop, 2000.
[Comer 79] D. Comer, The ubiquitous B-Tree, Computing Surveys, p 121-137, 1979.
[Cooper and al 01] B. F. Cooper, N. Sample, M. J. Franklin, G. Hjaltason, M. Shadmon, Fast index for semi structured data, VLDB 2001, p 341-350, 2001.
[Fredkin 60] E. Fredkin, Trie Memory, CACM, p 490-499, 1960.
[fayyad and al 96] U. Fayyad, G. Piatetsky-Shapiro, P. Smyth, From Data Mining to Knowledge Discovery in Databases, AI Magazine (17), p 37-54, 1996.
[Gardarin and al 99] G. Gardarin, F. Sha, T. D. Ngoc, XML-based components for federating multiple heterogeneous data sources, LNCS 1728, p 506-519, 1999.
[Kappel and al 00] G. Kappel, E. Kapsammer, S. Rausch-Schott, W. Retschitzegger, X-Ray - Towards Integrating XML and Relational Database Systems, Proceedings of the 19th International Conference on Conceptual Modelling, LNCS 1920, p 339-353, 2000.
[McHugh and al 97] J. McHugh, S. Abiteboul, R. Goldman, D. Quass, J. Widom, Lore: A Database Management System for Semistructured Data, SIGMOD Record, 26(3), p 54-66, 1997.
[Punin and al 00] J. Punin, M. Krishnamoorthy, Log Markup Language LOGML specification, http://www.cs.rpi.edu/~puninj/LOGML/draft-logml.html, 2000.
[Spitizner and al 00] J.Spitizner, E.Blyden, D.Dai, D.Gordon, C.Guo, S.Kraut, E.Rentschler, S.Roggenkamp, R.Rumpf, J.Spitzner, J.Spitzner, http://www.bsml.org/i3c/docs/BSML3_1_Reference_Manual.pdf, 2002.
[Wang and al 03] H. Wang, S. Park, W. Fan, P. S. Yu, ViST: a dynamic index method for querying XML data by tree structures, SIGMOD 2003.
[Zaki 02] M. J. Zaki, Efficiently Mining Frequent Trees in a Forest, ACM, 2002.
[Zhang and al 04] M. Zhang, J. T. Yao, XML Algebra for Data Mining, Proceedings of SPIE Vol. #5433, Data Mining and Knowledge Discovery: Theory, Tools, and Technology VI, p 209-217, 2004.