IMB3-Miner: Mining Induced/Embedded Subtrees by Constraining the Level of Embedding

Henry Tan¹, Tharam S. Dillon¹, Fedja Hadzic¹, Elizabeth Chang², and Ling Feng³

¹ University of Technology Sydney, Faculty of Information Technology, Sydney, Australia
  {henryws, tharam, fhadzic}@it.uts.edu.au
² Curtin University of Technology, School of Information System, Perth, Australia
  [email protected]
³ University of Twente, Department of Computer Science, Enschede, Netherlands
  [email protected]

Abstract. Tree mining has recently attracted a lot of interest in areas such as Bioinformatics, XML mining, Web mining, etc. We are mainly concerned with mining frequent induced and embedded subtrees. While more interesting patterns can be obtained when mining embedded subtrees, mining such embedding relationships can unfortunately be very costly. In this paper, we propose an efficient approach to tackle the complexity of mining embedded subtrees by utilizing a novel Embedding List representation and Tree Model Guided enumeration, and by introducing the Level of Embedding constraint. Thus, when it is too costly to mine all frequent embedded subtrees, one can gradually decrease the level of embedding constraint down to 1, at which point all the obtained frequent subtrees are induced subtrees. Our experiments with both synthetic and real datasets against two known algorithms for mining induced and embedded subtrees, FREQT and TreeMiner, demonstrate the effectiveness and the efficiency of the technique.

1 Introduction

Research in both the theory and applications of data mining is expanding, driven by the need to consider more complex structures, relationships and semantics expressed in the data [2,3,4,6,8,9,12,15,17]. As the complexity of the structures to be discovered increases, more informative patterns can be extracted [15]. A tree is a special type of graph that has attracted a considerable amount of interest [3,8,9,11,12,17]. Tree mining has gained interest in areas such as Bioinformatics, XML mining, Web mining, etc. In general, most of the formally represented information in these domains has a tree-structured form and XML is commonly used. Tan et al. [8] suggested that XML association rule mining can be recast as mining frequent subtrees in a database of XML documents. Wang and Liu [13] developed an algorithm to mine frequently occurring induced subtrees in XML documents. Feng et al. [4] extended the notion of associated items to XML fragments to present associations among trees. The two known types of subtrees are induced and embedded [3,8,9,17]. An induced subtree preserves the parent-child relationships of each node in the original tree, whereas an embedded subtree preserves not only the parent-child relationships but
also the ancestor-descendant relationships over several levels. Induced subtrees are a subset of embedded subtrees, and the complexity of mining embedded subtrees is higher than that of mining induced subtrees [3,9,17]. In this study, we are mainly concerned with mining frequent embedded subtrees from a database of rooted ordered labeled trees. Our primary objectives are as follows: (1) to develop an efficient and scalable technique; (2) to provide a method to control and limit the inherent complexity present in mining frequent embedded subtrees. To achieve the first objective, we utilize a novel tree representation called the Embedding List (EL) and employ an optimal enumeration strategy called Tree Model Guided (TMG). The second objective is attained by restricting the maximum level of embedding that can occur in each embedded subtree. The level of embedding is defined as the length of a path between two nodes that form an ancestor-descendant relationship. Intuitively, when the level of embedding inherent in the database of trees is high, a very large number of embedded subtrees exists. Thus, when it is too costly to mine all frequent embedded subtrees, one can gradually restrict the level of embedding down to 1, at which point all the obtained frequent subtrees are induced subtrees.

The two known enumeration strategies are enumeration by extension and by join [3]. Recently, Zaki [17] adapted the join enumeration strategy for mining frequent embedded rooted ordered subtrees. The idea of utilizing a tree model for efficient enumeration appeared in [14]. That approach uses the XML schema to guide candidate generation so that all generated candidates are valid because they conform to the schema. The concept of schema guided candidate generation is generalized into tree model guided (TMG) candidate generation for mining embedded rooted ordered labeled subtrees [8,10]. TMG can be applied to any data with clearly defined semantics that has a tree-like structure. It ensures that only valid candidates which conform to the actual tree structure of the data are generated. The enumeration strategy used by TMG is a specialization of the rightmost path extension approach [2,8,9,10]. It differs from the one proposed in FREQT [2] in that TMG enumerates embedded subtrees whereas FREQT enumerates only induced subtrees. The rightmost path extension method is reported to be complete, and all valid candidates are enumerated at most once (non-redundant) [2,8,9]. This is in contrast to the incomplete method TreeFinder [11], which uses an Inductive Logic Programming approach to mine unordered, embedded subtrees. The extension approach utilized in TMG generates fewer candidates than the join approach [8,9].

In section 2 the problem definitions are given. Section 3 describes the details of the algorithm. We empirically evaluate the performance of the algorithms and study their properties in section 4, and the paper is concluded in section 5.

2 Problem Definitions

A tree can be denoted as T(r,V,L,E), where (1) r ∈ V is the root node; (2) V is the set of vertices or nodes; (3) L is the set of labels of vertices, where for any vertex v ∈ V, L(v) is the label of v; and (4) E is the set of edges in the tree. The parent of node v, parent(v), is defined as the predecessor of node v; each v in the tree has exactly one parent. A node v can have one or more children, children(v), which are defined as its successors. If a path exists from node p to node q, then p is an ancestor of q and q is a
descendant of p. The number of children of a node is commonly termed the fan-out or degree of the node, degree(v). A node without any children is a leaf node; otherwise, it is an internal node. If, for each internal node, all the children are ordered, then the tree is an ordered tree. The height of a node is the length of the path from the node to its furthest leaf. The rightmost path of T is defined as the path connecting the rightmost leaf with the root node. The size of a tree is determined by the number of nodes in the tree. A uniform tree T(n,r) is a tree of height n in which every internal node has degree r. All trees considered in this paper are rooted, ordered and labeled.
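For concreteness, the following is a minimal sketch (illustrative, not taken from the paper) of a rooted ordered labeled tree together with helpers for the notions just defined; the class and function names are assumptions made for this example.

```python
# A minimal sketch of a rooted ordered labeled tree and the definitions above.

class Node:
    def __init__(self, label, children=None):
        self.label = label
        self.children = children or []   # left-to-right order of children matters

def degree(node):
    """Fan-out: the number of children of the node."""
    return len(node.children)

def is_leaf(node):
    return degree(node) == 0

def height(node):
    """Length of the path from the node to its furthest leaf."""
    return 0 if is_leaf(node) else 1 + max(height(c) for c in node.children)

def size(node):
    """Number of nodes in the (sub)tree rooted at `node`."""
    return 1 + sum(size(c) for c in node.children)

def rightmost_path(root):
    """Nodes on the path connecting the root with the rightmost leaf."""
    path = [root]
    while not is_leaf(path[-1]):
        path.append(path[-1].children[-1])
    return path

# e.g. a uniform tree T(1,2): height 1, every internal node of degree 2
t = Node('b', [Node('c'), Node('e')])
assert height(t) == 1 and degree(t) == 2 and size(t) == 3
```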

Fig. 1. Example of induced subtrees (T1, T2, T4, T6) and embedded subtrees (T3, T5) of tree T

Induced Subtree. A tree T'(r', V', L', E') is an ordered induced subtree of a tree T(r, V, L, E) iff (1) V' ⊆ V; (2) E' ⊆ E; (3) L' ⊆ L and L'(v) = L(v); (4) for every v' ∈ V' that is not the root node, the parent of v' in T' is also its parent in T; and (5) the left-to-right ordering among the siblings in T' is preserved. An induced subtree T' of T can be obtained by repeatedly removing leaf nodes, or the root node if its removal does not create a forest in T.

Embedded Subtree. A tree T'(r', V', L', E') is an ordered embedded subtree of a tree T(r, V, L, E) if and only if it satisfies properties (1), (2), (3) and (5) of an induced subtree, and it generalizes property (4) such that, for every v' ∈ V' that is not the root node, the parent of v' in T' is an ancestor of v' in T.

Level of Embedding (Φ). If T'(r', V', L', E') is an embedded subtree of T, the level of embedding (Φ) is defined as the length of the path between two nodes p ∈ V' and q ∈ V' that form an ancestor-descendant relationship from p to q. We can thus define an induced subtree as an embedded subtree in which the maximum Φ that occurs equals 1, since the level of embedding of two nodes that form a parent-child relationship equals 1. For instance, in fig. 2 the level of embedding Φ between the node at position 0 and the node at position 5 in tree T is 3, whereas between node 0 and nodes 2, 3 and 4 it equals 2. According to the definitions of induced and embedded subtrees above, S1 is an example of an induced subtree and S2, S3 and S4 are examples of embedded subtrees.
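To make Φ concrete, the sketch below assumes a hypothetical representation in which nodes are identified by their pre-order positions and parent[i] gives the position of node i's parent in the original tree (-1 for the root); an occurrence is then induced exactly when every subtree parent-child pair has Φ = 1. This is an illustration only, not the authors' code.

```python
# Sketch: level of embedding Phi over an assumed parent-pointer representation.

def level_of_embedding(parent, p, q):
    """Phi between ancestor p and descendant q: length of the path from p down
    to q in the original tree; returns None if p is not an ancestor of q."""
    steps, node = 0, q
    while node != -1:
        if node == p:
            return steps
        node = parent[node]
        steps += 1
    return None

def is_induced_occurrence(parent, subtree_edges):
    """`subtree_edges` holds (parent-in-subtree, child-in-subtree) pairs of
    original-tree positions; the occurrence is induced iff every pair has
    Phi == 1, and embedded otherwise (Phi may exceed 1)."""
    return all(level_of_embedding(parent, p, q) == 1 for p, q in subtree_edges)
```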

Fig. 2. Illustration of restricting the level of embedding when generating S1-4 subtrees from subtree ‘a b’ with OC 0:[0,1] of tree T

Transaction based vs. occurrence match support. We say that an embedded subtree t is supported by transaction k ∈ K in a database of trees Tdb, denoted t ≺ k. If there are L occurrences of t in k, the function g(t,k) = L denotes the number of occurrences of t in transaction k. For transaction based support, t ≺ k = 1 when there exists at least one occurrence of t in k, i.e. g(t,k) ≥ 1; in other words, it only checks for the existence of an item in a transaction. For occurrence match support, t ≺ k corresponds to the number of all occurrences of t in k, i.e. t ≺ k = g(t,k). Suppose that there are N transactions of trees, k1 to kN, in Tdb; the support of an embedded subtree t in Tdb is then defined as:

    support(t) = Σ_{i=1}^{N} (t ≺ k_i)    (1)
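As a small illustration of equation (1) under the two support definitions, the sketch below assumes the per-transaction occurrence counts g(t, k_i) have already been computed; the names are illustrative.

```python
# Sketch: the two support definitions, given occurrence_counts[i] = g(t, k_i).

def transaction_based_support(occurrence_counts):
    """Each transaction contributes at most 1: t is counted once per
    transaction in which g(t, k) >= 1."""
    return sum(1 for g in occurrence_counts if g >= 1)

def occurrence_match_support(occurrence_counts):
    """Each transaction contributes all of its occurrences: g(t, k)."""
    return sum(occurrence_counts)

# e.g. t occurs 3, 0, 1 and 2 times in four transactions:
counts = [3, 0, 1, 2]
assert transaction_based_support(counts) == 3   # transactions containing t
assert occurrence_match_support(counts) == 6    # total occurrences of t
```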

Transaction based support has been used in [3,12,17]. However, occurrence match support has been less utilized and discussed. In this study we are particularly interested in exploring the application and the challenges of using occurrence match support. Occurrence match support takes the repetition of items in a transaction into account, whilst transaction based support only checks for the existence of items in a transaction. There is no general consensus on which support definition should be used for which application. However, it is intuitive to say that whenever repetition of items in each transaction is to be accounted for and order is important, occurrence match support is more applicable. Generally, transaction based support is well suited to relational data.

String encoding (φ). We utilize the pre-order string encoding (φ) as used in [8,9,17]. We denote the encoding of a subtree T as φ(T). For each node in T (fig. 1), its label is shown as a single-quoted symbol inside the circle, whereas its pre-order position is shown as an index to the left/right of the circle. From fig. 1, φ(T1):‘b c / b e / /’; φ(T3):‘b e / c /’, etc. Backtrack symbols after the last node may be omitted, i.e. φ(T1):‘b c / b e’. We refer to a group of subtrees with the same encoding L as a candidate subtree CL. A subtree with k nodes is denoted as a k-subtree. Throughout the paper, the ‘+’ operator is used to conceptualize an operation of appending two or more tree encodings. This operator should, however, be contrasted with the conventional string append operator, as in a tree string encoding the backtrack symbols need to be computed accordingly.
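As an illustration of the encoding, the following sketch (not the authors' code) generates φ with '/' backtrack symbols from a simple (label, children) tuple representation, and optionally trims the trailing backtracks as described above.

```python
# Sketch: pre-order string encoding phi with '/' backtrack symbols.
# A tree is assumed to be a (label, [children]) pair; names are illustrative.

def encode(tree):
    label, children = tree
    parts = [label]
    for child in children:
        parts.append(encode(child))
        parts.append('/')            # backtrack after returning from a child
    return ' '.join(parts)

def encode_trimmed(tree):
    """Backtrack symbols after the last node may be omitted, e.g. 'b c / b e'."""
    tokens = encode(tree).split()
    while tokens and tokens[-1] == '/':
        tokens.pop()
    return ' '.join(tokens)

# phi(T1) for T1 = b(c, b(e)):
t1 = ('b', [('c', []), ('b', [('e', [])])])
assert encode(t1) == 'b c / b e / /'
assert encode_trimmed(t1) == 'b c / b e'
```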


Mining (induced|embedded) frequent subtrees. Let Tdb be a tree database consisting of N transactions of trees, k1,…,kN. The task of frequent (induced|embedded) subtree mining from Tdb with a given minimum support (σ) is to find all candidate (induced|embedded) subtrees that occur at least σ times in Tdb. Based on the downward-closure lemma [1], every sub-pattern of a frequent pattern is also frequent. In relational data, given a frequent itemset, all its subsets are also frequent. A question arises, however, of whether the same principle applies to tree-structured data when the occurrence match support definition is used. To show that the same principle does not apply, we need to find a counter-example.

Lemma 1. Given a tree database Tdb, if there exist candidate subtrees CL and CL’, where CL ⊆ CL’, such that CL’ is frequent and CL is infrequent, we say that CL’ is a pseudo-frequent candidate subtree. In the light of the downward-closure lemma, such candidate subtrees are infrequent because one or more of their subtrees are infrequent.

Lemma 2. The antimonotone property of frequent patterns states that the frequency of a super-pattern is less than or equal to the frequency of a sub-pattern. If pseudo-frequent candidate subtrees exist, then the antimonotone property does not hold for frequent subtree mining.

From fig. 1, suppose that the minimum support σ is set to 2. Consider a candidate subtree CL where L:’b c / b’. When embedded subtrees are considered, there are 3 occurrences of CL, at positions {(0, 4, 7), (0, 5, 7), (0, 6, 7)}. On the other hand, when induced subtrees are considered, there are only 2 occurrences of CL, at positions {(0, 5, 7), (0, 6, 7)}. With σ equal to 2, CL is frequent for both the induced and embedded types. By extending CL with node 8 we obtain CL’ where L’:L+’e’ = ’b c / b e’. In the light of lemma 1, CL’ is a pseudo-frequent candidate subtree because we can find a subtree of CL’ whose encoding ‘b b e’, at position (0, 7, 8), is infrequent. This holds for both induced and embedded subtrees. In other words, lemma 1 holds whenever occurrence match support is used. Subsequently, since pseudo-frequent candidate subtrees exist, according to lemma 2 the antimonotone property does not hold for frequent subtree mining when occurrence match support is used. Hence, when mining induced and embedded subtrees, there can be frequent subtrees one or more of whose subsets are infrequent. This is different from flat relational data, where there are only 1-to-1 relationships between items in each transaction. Tree-structured data has a hierarchical structure in which 1-to-many relationships can occur. This multiplication between one node and its many children/descendants is what breaks the antimonotone property for tree-structured data. Consequently, full (k-1) pruning should be performed at each iteration when generating k-subtrees from a (k-1)-subtree whenever occurrence match support is used, in order to avoid generating pseudo-frequent subtrees.

3 IMB3-Miner Algorithms

Database scanning. The process of frequent subtree mining is initiated by scanning a tree database, Tdb, and generating a global pre-order sequence D in memory (the dictionary). The dictionary contains each node in Tdb following pre-order traversal indexing. For each node, its position, label, rightmost-leaf position (scope), and parent position are stored. An item in the dictionary D at position i is referred to as D[i]. The notion of the position of an item refers to its index position in the dictionary. While generating the dictionary, we compute all the frequent 1-subtrees, F1. After the dictionary is constructed, our approach requires no further database scanning.
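The dictionary construction can be sketched as a single pre-order scan; the (label, children) tuple representation and all names below are assumptions made for illustration, not the authors' implementation.

```python
# Sketch: build the dictionary D in one pre-order scan. Each entry stores
# position, label, scope (rightmost descendant's pre-order position) and
# parent position; label frequencies yield the frequent 1-subtrees F1.

from collections import Counter

def build_dictionary(tree, parent_pos=-1, dictionary=None, label_counts=None):
    if dictionary is None:
        dictionary, label_counts = [], Counter()
    label, children = tree                    # (label, [children]) representation
    pos = len(dictionary)                     # pre-order position of this node
    dictionary.append({'pos': pos, 'label': label, 'scope': pos, 'parent': parent_pos})
    label_counts[label] += 1
    for child in children:
        build_dictionary(child, pos, dictionary, label_counts)
    dictionary[pos]['scope'] = len(dictionary) - 1   # last position in this subtree
    return dictionary, label_counts

def frequent_1subtrees(label_counts, sigma):
    """F1 under occurrence match support: labels occurring at least sigma times."""
    return {label for label, count in label_counts.items() if count >= sigma}
```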


All 2-subtree candidates generated from T (rooted at positions 0, 2 and 7), and the corresponding embedding lists:
0: 1 2 3 4 5 6 7 8
2: 3 4
7: 8
Fig. 3. The EL representation of T in fig. 1

Constructing the Embedding List (EL). For each frequent internal node in F1, a list is generated which stores its descendant nodes’ hyperlinks [12] in pre-order traversal ordering, such that the embedding relationships between nodes are preserved. The notion of hyperlinks of nodes refers here to the positions of nodes in the dictionary. For a given internal node at position i, this ordering reflects the enumeration sequence for generating 2-subtree candidates rooted at i (fig. 3). Hereafter, we call this list the embedding list (EL). We use the notation i-EL to refer to the embedding list of the node at position i. The position of an item in an EL is referred to as a slot. Thus, i-EL[n] refers to the item in the list at slot n, whereas |i-EL| refers to the size of the embedding list of the node at position i. Fig. 3 illustrates the EL representation of tree T (fig. 1). In fig. 3, 0-EL for example refers to the list 0:[1,2,3,4,5,6,7,8], where 0-EL[0]=1 and 0-EL[6]=7.

Occurrence Coordinate (OC). When generating k-subtree candidates from a (k-1)-subtree, we consider only frequent (k-1)-subtrees for extension. Each occurrence of a k-subtree in Tdb is encoded as an occurrence coordinate r:[e1,…,ek-1]; r refers to the k-subtree root position and e1,…,ek-1 refer to slots in r-EL. Each ei corresponds to node (i+1) in the k-subtree and e1
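Given a dictionary of the form sketched under "Database scanning" above (position, label, scope, parent), the embedding lists follow directly, since the descendants of the node at position i occupy positions i+1 through scope(i) in pre-order. The sketch below is illustrative only and reuses the assumed dictionary layout, not the authors' implementation.

```python
# Sketch: construct the embedding lists from the dictionary built earlier.
# i-EL lists, in pre-order, the dictionary positions of all descendants of the
# node at position i; only frequent internal nodes receive a list.

def build_embedding_lists(dictionary, frequent_labels):
    el = {}
    for entry in dictionary:
        pos, scope = entry['pos'], entry['scope']
        if scope > pos and entry['label'] in frequent_labels:
            # descendants occupy positions pos+1 .. scope in pre-order
            el[pos] = list(range(pos + 1, scope + 1))
    return el

# For tree T of fig. 1 this gives, e.g., 0-EL = [1, 2, ..., 8],
# so 0-EL[0] == 1 and 0-EL[6] == 7, matching the description above.
```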