Dynamic Restructuring of Classification Hierarchy towards Data Mining

Subhamoy Maitra & Aditya Bagchi
Computer & Statistical Service Centre
Indian Statistical Institute
203, B.T. Road, Calcutta 700 035, INDIA
e-mail: fsubho, [email protected]

Abstract
The system proposed in this paper offers a classification hierarchy over a backend database to generate and update association rules for the purpose of data mining. It provides algorithms for continuous restructuring of a class hierarchy that represents the relationships among different attributes (items) of the database. When a new instance (transaction) arrives in the database, the system tries to place it in the existing hierarchy. If it fails to classify the instance exactly, it adds the instance as an exception to the class found to be closest. The proposed system initiates restructuring only when the number of exceptions to a class exceeds a predefined threshold value. The threshold value is related to the support and confidence of an association rule in the context of data mining. Three methods, add, fuse and break, are used for restructuring: add attaches a new branch to the structure, fuse merges two or more classes into one, and break decomposes a class into a number of classes. The dynamic hierarchy is always approximate in nature. In the application domain, each attribute value is considered either present (value 1) or absent (value 0).
Keywords: Classification Hierarchy, Dynamic Restructuring, Data Mining, Association Rule, Vertex Connectivity, Database.
1 Introduction

The system proposed in this paper offers a data mining environment in which a dynamic dataset is used for the generation and updating of association rules. In addition, the system provides a classification facility. If a problem domain provides an initial class hierarchy, the proposed system can refine it by adding new classes or by breaking an existing class into more specialised sub-classes. This facility becomes even more useful in a dynamic environment where the database is updated very frequently. A set of algorithms is proposed for the generation and modification of the class hierarchy and association rules. The system starts with an initial class hierarchy and class descriptions obtained from the problem domain and keeps modifying them as new instances are considered. These new instances are either taken from the underlying database or inserted afresh. Thus the restructuring process behaves like a learning system.

One such conceptual clustering algorithm for approximate schema design has already been proposed in [1]. That proposal is based on Explanation Based Learning and restructures the classification hierarchy after each entry of an exceptional instance. However, in an application domain where the number of attributes as well as the number of instances is sufficiently large, and the motivation is to generate important association rules, restructuring after every exception may add significant overhead. We therefore restructure the hierarchy only when the number of exceptional instances, i.e. the instances that could not be classified exactly in the existing hierarchy, crosses certain predefined threshold limits. The ideas of concept hierarchy and multiple-level association rules have also been discussed in [2, 4].

The salient features of our proposed system are as follows.

1. Initial structure of the system:
(a) A class hierarchy (a tree of classes).
(b) Universal set of attributes.
(c) Set of attributes initially associated with each class.
(d) A count associated with each class, keeping the number of transactions matching the class (the match count).
(e) A count associated with each class, keeping the number of transactions generating an exception to the class (the exception count).
(f) Two threshold counts related to the overall system.

2. The evolutionary structure:
(a) A dynamic class hierarchy that classifies the input instances and flags an exception, if needed.
(b) Each class is associated with an updated set of attributes and updated match and exception counts.

The terms described here are formalised and defined in the coming sections. The organization of the document is as follows. Section 2 reviews some existing results and concepts. Section 3 covers the primary classification techniques using different matching procedures. The next three sections describe the add (Section 4), fuse (Section 5) and break (Section 6) procedures. Section 7 summarizes the overall tree restructuring concept and the application of this method in data mining. We draw conclusions in Section 8.
2 Related Concepts

This section discusses the background material used in the proposed system. It covers the basic ideas of data mining and summarizes the graph-theoretic concepts that are exploited later.
2.1 Data Mining

Data mining covers the methods for finding interesting trends or patterns in large datasets. These discovered patterns help and guide the appropriate authority in taking future decisions. Generally, data mining tools are expected to identify interesting patterns in the data with minimal user intervention. Since data mining efforts usually assume a very large volume of data, efficiency and scalability are two very important criteria for data mining algorithms. The patterns discovered by data mining should properly portray the contents of the dataset and the nature of the application under consideration. Any imperfection should be expressed by approximate rules and should also be quantifiable. Discovering or mining associations among different features present in an application domain has recently attracted a lot of attention (see [7, 9]).

Definition 2.1. An association rule is of the form $LHS \Rightarrow RHS$, where both $LHS$ and $RHS$ are sets of items.

This states that if every item of $LHS$ is present in a transaction, then it is likely that the items in $RHS$ will also be present. There exist two important measures for an association rule: support and confidence.

Definition 2.2. The support for a set of items is the percentage of transactions that contain all of these items.

Remark 2.1. The support for a rule $LHS \Rightarrow RHS$ is the support for the set of items $LHS \cup RHS$.

Low support may imply that a rule has arisen purely by chance, whereas a high support value may identify some relational pattern among the items.

Definition 2.3. The confidence for a rule $LHS \Rightarrow RHS$ is

$$\mathrm{confidence}(LHS \Rightarrow RHS) = \frac{\mathrm{support}(LHS \Rightarrow RHS)}{\mathrm{support}(LHS)}.$$

Remark 2.2. Out of the transactions that have $LHS$, the percentage of transactions that have $RHS$ as well is the measure of confidence of the rule $LHS \Rightarrow RHS$.

It indicates the degree of correlation between the presence of these sets of items. Algorithms for generating valid association rules are explained in [9]. In the present discussion we are only interested in the presence or absence of an item in a transaction, not in its exact quantity.
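As an illustration of Definitions 2.2 and 2.3, the following minimal sketch computes support and confidence over a small list of transactions. The function names and the toy data are hypothetical and are not part of the proposed system; support is returned here as a fraction rather than a percentage.

```python
def support(transactions, items):
    """Fraction of transactions that contain every item in `items`."""
    items = set(items)
    hits = sum(1 for t in transactions if items <= t)
    return hits / len(transactions)


def confidence(transactions, lhs, rhs):
    """support(LHS u RHS) / support(LHS), following Definition 2.3."""
    both = set(lhs) | set(rhs)
    return support(transactions, both) / support(transactions, lhs)


# Hypothetical toy data: five transactions over the items a, b, c.
transactions = [
    {"a", "b"},
    {"a", "b", "c"},
    {"a"},
    {"b", "c"},
    {"a", "b"},
]

# {a, b} occurs in 3 of the 5 transactions.
print(support(transactions, {"a", "b"}))
# Of the 4 transactions containing a, 3 also contain b.
print(confidence(transactions, {"a"}, {"b"}))
```

Only presence or absence of an item matters here, which matches the 0-1 view of attributes used in the rest of the paper.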
2.2 Vertex Connectivity - A Concept from Graph Theory

This section describes one important graph-theoretic result from [6], which will be used for the break algorithm in Section 6.

Definition 2.4. Let $G = (V, E)$ be an undirected graph and let $a, b \in V$ be such that $(a, b) \notin E$. A set $S \subseteq V - \{a, b\}$ is an $(a, b)$ vertex separator if every path from $a$ to $b$ passes through a vertex of $S$. In other words, $a$ and $b$ belong to different connected components of $G - S$. The minimum cardinality of any $(a, b)$ vertex separator is denoted by $N(a, b)$.

Lemma 2.1. Let $G = (V, E)$ be an undirected graph and let $a, b \in V$ be such that $(a, b) \notin E$. Then $N(a, b)$ can be computed in time $O(n^{1/2} e)$, where $e = |E|$ and $n = |V|$.

The vertex connectivity $c$ of an undirected graph $G(V, E)$ is

$$c = \begin{cases} n - 1 & \text{if } G \text{ is complete,} \\ \min\{N(a, b) : (a, b) \notin E\} & \text{otherwise.} \end{cases}$$

Theorem 2.1. Let $G(V, E)$ be an undirected graph and let $c$ be its vertex connectivity. Then $c$ can be computed in time $O(c\, n^{3/2} e) = O(n^{1/2} e^2)$, where $e = |E|$ and $n = |V|$.
3 Different Matching Procedures and Classification

This section discusses how an instance can be classified in a given hierarchical structure at the time of insertion. First we define some concepts which will be used throughout this document.

Definition 3.1. The universal attribute set $U$, $|U| = n$, is the set of all the attributes that will be considered for the application domain.

Here we are interested in finding relevant relationships among the items in $U$. Once the universal attribute set is identified, the attributes are considered in a specific order $a_0, a_1, \ldots, a_{n-1}$ for convenience of representation. Since $a_i \in \{0, 1\}$ for all $i$, the $i$th bit of an $n$-bit string represents the presence or absence of an attribute.

Definition 3.2. An instance (transaction) $I$ is considered as a string of length $n$ containing 0-1 values ($I \in \{0, 1\}^n$), implying the presence (1) or absence (0) of the items in the transaction.

Definition 3.3. A class $C$ consists of a set of attributes $C^A \subseteq U$. A class hierarchy consists of a set of classes, with a parent-child relationship among them.

Remark 3.1. A class may be identified by a 0-1 bit string of length $n$, where the $i$th bit is 0 if $a_i \notin C^A$, and 1 otherwise. $C^A$ represents the set of attributes belonging to class $C$ only and not the attributes inherited from the parent classes. The root class has no parent. The intermediate classes have one parent and one or more children. The leaf classes have one parent and no child.
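The bit-string view of instances and classes can be sketched as follows. This is only an illustration under assumed names: the four-attribute universe `U` and the helper functions `to_bits` and `from_bits` are hypothetical.

```python
# Hypothetical universal attribute set, kept in a fixed order.
U = ["a0", "a1", "a2", "a3"]


def to_bits(attrs, universe=U):
    """Encode a set of attributes as an n-bit 0-1 string over `universe`."""
    present = set(attrs)
    return "".join("1" if a in present else "0" for a in universe)


def from_bits(bits, universe=U):
    """Decode an n-bit 0-1 string back into the set of present attributes."""
    return {a for a, bit in zip(universe, bits) if bit == "1"}


# An instance containing a0, a2 and a3 (Definition 3.2).
instance_bits = to_bits({"a0", "a2", "a3"})
# A class whose own attribute set is {a1, a3} (Remark 3.1).
class_bits = to_bits({"a1", "a3"})

print(instance_bits, class_bits)
print(from_bits(instance_bits))
```

Because the attribute order is fixed, two attribute sets are equal exactly when their bit strings are equal, which is what the matching procedures below rely on.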
Here we list the important count variables. The need for these variables will become clear as the discussion proceeds.

Match count of $C$: the number of instances matching the class $C$. The idea of this count is also clarified in Remarks 3.4 and 3.5.

Exception count of $C$: the number of exceptions to the class $C$. The idea of this count is also clarified in Remarks 3.5 and 4.3.

Exception threshold: a system-wide threshold count on the number of exceptions in a class. The system may initiate a reorganization if the exception count of a class $C$ exceeds this threshold (see also Remark 4.1).

Pattern threshold: a system-wide threshold count on the number of exceptions of the same type in a class. If a specific exception pattern corresponding to a class exceeds this threshold, the hierarchy needs to be reorganized. This is equivalent to the support value. A new child class ($C_{ch}$) will be generated only when the number of transactions matching $C_{ch}$ exceeds the pattern threshold for the system (see also Remark 4.2).

Remark 3.2. The reorganization procedure will be initiated if the exception count of a class $C$ exceeds the exception threshold. However, actual reorganization will be done only if the count of a particular type of exception exceeds the pattern threshold.
3.1 Exact match

When the system considers a new instance, it is compared against the leaf level classes for an exact match. If this fails, a top-down search is done from the root for approximate classification. In this case matching may be possible only up to some class at an intermediate level. Since further matching down the hierarchy fails, the new instance is flagged as an exception to the class corresponding to the intermediate node. In this method a lot of exceptions may accumulate in each class. If the count of exceptions for a class exceeds the application-specific threshold values, the class hierarchy needs to be restructured. The different matching procedures for inserting an instance into the system are discussed below.

Definition 3.4. The total attribute set $C^{tA}$ attached to a class $C$ is the union of all the attribute sets of the classes lying on the path from the root to that class. Corresponding to the total attribute set, an $n$-bit binary number is formed, with 1 at the $i$th position if the attribute $a_i \in C^{tA}$ and 0 otherwise. This is called the path identification number of the class.

So, a maximum of $(2^n - 1)$ leaf classes are possible in the application domain. However, in a real application the number will be much smaller. The algorithms considered here do not give rise to multiple inheritance (see Remark 5.1) and hence there exists only a single path from the root to a leaf node. Since the total attribute sets of all the leaf classes are different and the attributes are ordered, the difference is reflected in the path identification numbers as well. A list of the leaf level classes, sorted in descending order of their path identification numbers, is stored and is updated whenever a new leaf class appears in the tree.

Definition 3.5. An instance $I$ is an exact match with a leaf level class $C$ if the binary number corresponding to $I$ is equal to the path identification number of $C$.

Remark 3.3. Here equality means that the bit patterns of the two $n$-bit strings, one the instance $I$ and the other the path identification number of the class $C$, are identical.
Thus, given an instance which contains only 0 and 1 values, binary search can be used over the sorted list of leaf classes to check whether it matches any of the path identification numbers. If it matches, the membership of the instance can be decided directly. If there are $L$ leaf classes, this operation needs $O(\log_2 L)$ comparisons.

Remark 3.4. Once an instance becomes an exact match with one leaf class $C$, the match count increases for all the classes on the path from the leaf class $C$ to the root.

Here we describe the algorithm for inserting an instance using exact match.

Input : An instance I.
Output: The leaf class with which an exact match is found, if it exists.
Begin Algorithm
    Check whether I matches exactly with one of the classes (CL) in the list of leaf classes
    If any such CL exists
        increment the match count of each class C from the leaf class CL up to the root class
    Else, go for approximate classification using perfect match (Subsection 3.2)
End Algorithm

That is, if an instance does not match any of the leaf classes, the system has to start checking from the root for approximate classification as described in Subsection 3.2.

    C0 {a0}
        C1 {a1, a2}
            C2 {a3, a4}
            C3 {a5, a6}
        C4 {b0}
            C5 {b1, b2}
            C6 {b3, b4}
            C7 {b5, b6}

Figure 1: The Initial Tree Structure
Example 3.1. Let the universal attribute set be $A = \{a_0, a_1, a_2, a_3, a_4, a_5, a_6, b_0, b_1, b_2, b_3, b_4, b_5, b_6\}$ with 14 attributes. Initially there are eight classes in the hierarchy, $C_0$, $C_1$, $C_2$, $C_3$, $C_4$, $C_5$, $C_6$ and $C_7$, as shown in Figure 1. The path identification number of each leaf level class is a 14-bit binary number; e.g. the path identification number of $C_3$ is 11100110000000, where the bits corresponding to $a_0$, $a_1$, $a_2$, $a_5$, $a_6$ contain the value 1 and all others the value 0.

In the example, initially there are five leaf classes $C_2$, $C_3$, $C_5$, $C_6$ and $C_7$, sorted in descending order according to their path identification numbers. If an instance arrives with the pattern 11100110000000, it is checked against the sorted list using binary search and an exact match with $C_3$ is found.

Suppose that the initial 500 transactions have been processed, all with attribute $a_0$ present. The motivation is to analyse the transaction pattern over all those transactions in which the item $a_0$ is present. Let there be 5 different types of transactions (100 each) with the itemsets $\{a_0, a_1, a_2, a_3, a_4\}$, $\{a_0, a_1, a_2, a_5, a_6\}$, $\{a_0, b_0, b_1, b_2\}$, $\{a_0, b_0, b_3, b_4\}$ and $\{a_0, b_0, b_5, b_6\}$. This implies that each leaf class has 100 exact matches among the initial 500 instances in the application domain. So, corresponding to Figure 1, the match counts are 500 for $C_0$, 200 for $C_1$, 100 for $C_2$, 100 for $C_3$, 300 for $C_4$, 100 for $C_5$, 100 for $C_6$ and 100 for $C_7$.

According to the definitions of support and confidence in Subsection 2.1, $\mathrm{support}(C_2^{tA}) = \mathrm{support}(\{a_0, a_1, a_2, a_3, a_4\}) = \frac{100}{500} = 20\%$, whereas the confidence of the association rule $C_1^A \Rightarrow C_2^A$, i.e. $\{a_1, a_2\} \Rightarrow \{a_3, a_4\}$, is the match count of $C_2$ divided by the match count of $C_1$, that is $\frac{100}{200} = 50\%$. This gives the initial hierarchical condition. The exception count of each class is initialized to 0 at this point, as each of the 500 instances is an exact match to one of the five leaf level classes.
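The exact-match lookup of this subsection can be sketched on the hierarchy of Figure 1. The dictionary layout `TREE` and the helper names are assumptions made for illustration. The sketch also sorts the path identification numbers in ascending rather than descending order, only so that Python's `bisect` module can be used directly.

```python
from bisect import bisect_left

# Attribute order a0..a6, b0..b6 as in Example 3.1.
U = [f"a{i}" for i in range(7)] + [f"b{i}" for i in range(7)]

# Figure 1 hierarchy: class -> (parent, own attribute set). Layout is hypothetical.
TREE = {
    "C0": (None, {"a0"}),
    "C1": ("C0", {"a1", "a2"}),
    "C2": ("C1", {"a3", "a4"}),
    "C3": ("C1", {"a5", "a6"}),
    "C4": ("C0", {"b0"}),
    "C5": ("C4", {"b1", "b2"}),
    "C6": ("C4", {"b3", "b4"}),
    "C7": ("C4", {"b5", "b6"}),
}


def path_id(cls):
    """Integer value of the n-bit path identification number of `cls`."""
    total = set()
    while cls is not None:           # walk up to the root, collecting attributes
        parent, attrs = TREE[cls]
        total |= attrs
        cls = parent
    bits = "".join("1" if a in total else "0" for a in U)
    return int(bits, 2)


# A leaf is a class that is nobody's parent.
leaves = [c for c in TREE if all(p != c for p, _ in TREE.values())]
index = sorted((path_id(c), c) for c in leaves)   # ascending order (assumption)
keys = [k for k, _ in index]


def exact_match(bits):
    """Return the leaf class whose path identification number equals `bits`."""
    key = int(bits, 2)
    pos = bisect_left(keys, key)     # binary search, O(log L) comparisons
    if pos < len(keys) and keys[pos] == key:
        return index[pos][1]
    return None


print(exact_match("11100110000000"))   # the pattern discussed in Example 3.1
print(exact_match("10011001111100"))   # no leaf has this pattern
```

When `exact_match` returns `None`, the instance would be handed to the approximate classification of Subsection 3.2.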
3.2 Perfect match

In this section the system takes care of those instances which fail to have an exact match with any of the leaf classes. These instances are called exceptional instances, or exceptions for short. The definition of perfect match is given by induction. The system assumes that all the instances coming into the schema are in the application domain, i.e. each instance is a perfect match for the root class at depth 0. If an instance is not a perfect match with the root class, it does not belong to the application domain. As shown in Figure 1, $a_0$ is the only attribute in the root class, so the transactions to be considered must contain $a_0$. For the sake of the inductive definition, we set an initial condition.

An instance $I$ is a perfect match with the root class if all the attributes of the root class are present in that instance.

Let $Fr(C) = \bigcup_i C_i^A$, where the $C_i$ are the peer classes of $C$, excluding $C$.

Definition 3.6. An instance $I$ is a perfect match with a class $C$ at depth $i \geq 1$ if
1. It is a perfect match to a class $C_p$ at depth $i - 1$.
2. $C_p$ is the parent of $C$.
3. All the attributes of $C$ are present in the instance $I$, i.e. $C^A \subseteq I$.
4. Let $A' = Fr(C) - C^A$. For all the attributes of $A'$, the instance $I$ contains the value 0 (absence).
The algorithm for approximate classification is given below.

Input : An instance I which is not exactly classified.
Output: The class up to which a perfect match is found.
Begin Algorithm
    Cp = the root class
    i = 0
    A = I
    if I is not a perfect match for Cp
        report that I does not belong to the application domain
        exit
    increment the match count of Cp    // The instance is relevant to the application domain.
    while (true)
        i = i + 1
        if there exists a perfect match with a class C among the children of Cp at the ith level then
            Cp = C
            increment the match count of Cp
            A = A - Cp^A    // Remove the matched attributes from the instance.
        end if
        else
            report the classification up to Cp
            record the instance as an exception of Cp
            increment the exception count of Cp
            exit
        end else
    end while
End Algorithm
Remark 3.5. Hence, in the case of perfect match, the system can classify an instance up to some depth in the tree. For every class from the root down to the class $C$ at which the perfect match stops, the match count is incremented by 1. This is an under-classification for the class $C$: the incoming transaction cannot be classified any further under $C$, so the exception count of $C$ is also incremented by 1.
Example 3.2. Let us consider the initial tree structure shown in Figure 1. Let there be an instance (transaction) $I_1$ with the attribute (item) set $\{a_0, a_3, a_4, b_0, b_1, b_2, b_3, b_4\}$, the pattern being 10011001111100. This instance does not offer an exact match for any of the leaf classes, so we try for a perfect match up to the level possible. It is initially a perfect match with the root class $C_0$ in Figure 1. Now, among the children of $C_0$ at level 1, i.e. $C_1$ and $C_4$, consider $C_4$:
1. $I_1$ is a perfect match to $C_0$ at depth 0.
2. $C_0$ is the parent of $C_4$.
3. All the attributes (here $b_0$) of $C_4$ are present in $I_1$.
4. $A' = \{a_1, a_2\}$, and $I_1$ contains the value 0 for those attributes.

Hence, $I_1$ is a perfect match to $C_4$ at depth 1. But at level 2, condition 4 of Definition 3.6 fails for $C_5$ and $C_6$, and conditions 3 and 4 fail for $C_7$. Hence a perfect match is found up to depth 1, with the class $C_4$, for the instance $I_1$. The system marks the class $C_4$ with an under-classification tag and identifies $I_1$ as an exception to it.
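The walk of Example 3.2 can be reproduced with a short sketch of the approximate classification. The data layout `TREE`, the use of `Counter` for the two counts, and all function names are assumptions made for illustration. If several children were perfect matches at once, the sketch would simply take the first one.

```python
from collections import Counter

# Figure 1 hierarchy: class -> (parent, own attribute set). Layout is hypothetical.
TREE = {
    "C0": (None, {"a0"}),
    "C1": ("C0", {"a1", "a2"}),
    "C2": ("C1", {"a3", "a4"}),
    "C3": ("C1", {"a5", "a6"}),
    "C4": ("C0", {"b0"}),
    "C5": ("C4", {"b1", "b2"}),
    "C6": ("C4", {"b3", "b4"}),
    "C7": ("C4", {"b5", "b6"}),
}


def children(cls):
    return [c for c, (p, _) in TREE.items() if p == cls]


def perfect_child(instance, cls):
    """Conditions 3 and 4 of Definition 3.6 for a child `cls` of a matched parent."""
    parent, attrs = TREE[cls]
    peers = [c for c in children(parent) if c != cls]
    frontier = set().union(*[TREE[c][1] for c in peers]) - attrs   # A' = Fr(C) - C^A
    return attrs <= instance and not (frontier & instance)


def classify(instance, match_count, exception_count, root="C0"):
    """Return the path of perfectly matched classes, or None if out of domain."""
    if not TREE[root][1] <= instance:
        return None
    path = [root]
    match_count[root] += 1
    while True:
        nxt = next((c for c in children(path[-1]) if perfect_child(instance, c)), None)
        if nxt is None:
            exception_count[path[-1]] += 1    # instance becomes an exception here
            return path
        match_count[nxt] += 1
        path.append(nxt)


match_count, exception_count = Counter(), Counter()
I1 = {"a0", "a3", "a4", "b0", "b1", "b2", "b3", "b4"}
path = classify(I1, match_count, exception_count)
print(path, dict(exception_count))
```

For $I_1$ the walk stops at $C_4$, in agreement with the example: $C_5$ and $C_6$ fail condition 4 and $C_7$ fails conditions 3 and 4.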
Suppose that the system encounters many such instances, so that the exception count of $C_4$ exceeds the exception threshold. The accumulation of a large number of exceptions in a class triggers reorganization of the class hierarchy.
4 Add

This section discusses the addition of a new class to the hierarchy.

Remark 4.1. If a class $C$ accumulates a very large number of exceptions, so that the exception count of $C$ crosses the exception threshold, then the restructuring procedure is initiated. New classes may need to be generated.
    C0 {a0}
        C1 {a1, a2}
            C2 {a3, a4}
            C3 {a5, a6}
        C4 {b0}
            C5 {b1, b2}
            C6 {b3, b4}
            C7 {b5, b6}
            C8 {b1, b2, b3, b4}
Figure 2: A New Class C8 added under C4

Example 4.1. In Figure 1 there are three children classes $C_5$, $C_6$ and $C_7$ under the parent $C_4$. $C_5$ has the attributes $\{b_1, b_2\}$, $C_6$ has $\{b_3, b_4\}$ and $C_7$ has $\{b_5, b_6\}$. Instances of the form $I_1$ (as given in Subsection 3.2) become exceptions to class $C_4$. If we add a new class $C_8$ with $C_8^A = \{b_1, b_2, b_3, b_4\}$ as a child of $C_4$, the exceptions corresponding to $I_1$ will no longer be generated. They will instead be perfect matches to $C_8$. So, if the number of exceptions exceeds the threshold for the class $C_4$, we may have to create at least one new child class of $C_4$ so that the number of exceptions for $C_4$ goes below the threshold. It is then better to first generate the single class that decreases the number of exceptions the most. Among the exceptions, the pattern of attributes with the maximum number of occurrences is chosen. So the exceptions of $C_4$ that match $C_8$ are pushed to $C_8$, as in Figure 2.

Remark 4.2. It should be noted that the match count of the newly generated class (e.g. $C_8$) may also be very small, which may not be interesting from the aspect of generating important association rules. So another constraint may be imposed: the match count of the new class (e.g. $C_8$) to be added should be greater than a second threshold value, the pattern threshold. This is equivalent to finding a number of exceptions of the same kind, greater than the pattern threshold, for the class (e.g. $C_4$) under which the new class (e.g. $C_8$) is going to be added.

Now we formalize the algorithm for add.

Input : A class C in the hierarchy whose exception count exceeds the exception threshold.
Output: A new class which will be added under C.
Begin Algorithm
    Identify the children classes C1, C2, ..., Ck of C
    Let A be the union of the attribute sets Ci^A for i = 1, ..., k
    Find the set of attributes S, a subset of A, such that the maximum number of exceptions at class C
        have value 1 for each of the attributes in S and value 0 for each of the attributes of A - S
    Initiate construction of a new class Ck+1 with attribute set S    // Ck+1^A = S
    If the match count of Ck+1 exceeds the pattern threshold    // (See Remark 3.2.)
        Ck+1 is added as a child of C along with C1, C2, ..., Ck
End Algorithm

The class $C_{k+1}$ is called a partially classified leaf. Let $A$ contain $m$ attributes. In order to identify the maximally occurring exceptional pattern, the attributes of $A$ are taken in a fixed order. Each exceptional pattern then corresponds to an $m$-bit binary vector, and one scan through the exceptions gives the count for each vector type. The add algorithm may need to be run more than once for the same class, generating more than one new class, if there exist different kinds of exceptions with large counts.
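The single scan over the exceptions described above can be sketched as follows. The function name `propose_child`, its parameters and the sample exception data are hypothetical. The sketch uses a `frozenset` in place of the $m$-bit vector and skips exceptions that contain none of the children's attributes, which is an added assumption.

```python
from collections import Counter


def propose_child(exceptions, child_attr_sets, pattern_threshold):
    """Return (S, count) for the most frequent exception pattern over A,
    or None if no pattern occurs more than `pattern_threshold` times."""
    A = set().union(*child_attr_sets)            # attributes of all children of C
    counts = Counter()
    for e in exceptions:                         # one scan through the exceptions
        pattern = frozenset(e & A)               # restriction of the exception to A
        if pattern:                              # ignore empty patterns (assumption)
            counts[pattern] += 1
    if not counts:
        return None
    pattern, count = counts.most_common(1)[0]
    return (set(pattern), count) if count > pattern_threshold else None


# Hypothetical exceptions recorded at C4: six of type I1 and two of another type.
I1 = {"a0", "a3", "a4", "b0", "b1", "b2", "b3", "b4"}
other = {"a0", "b0", "b1", "b2", "b5", "b6"}
exceptions_at_C4 = [I1] * 6 + [other] * 2
children_of_C4 = [{"b1", "b2"}, {"b3", "b4"}, {"b5", "b6"}]

result = propose_child(exceptions_at_C4, children_of_C4, pattern_threshold=5)
print(result)
```

With these assumed counts the proposed attribute set is $\{b_1, b_2, b_3, b_4\}$, which corresponds to the class $C_8$ of Figure 2. Running the function again on the remaining exceptions would model repeated application of add.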
4.1 Partially Classified Leaf

The new class $C_{k+1}$ created by add does not have any child class, so $C_{k+1}$ is a leaf. But the class is not properly classified, in the sense that the instances coming to this class from its parent class may contain many attributes other than those which define $C_{k+1}$.

Remark 4.3. The total attribute set (see Definition 3.4) of the class $C_{k+1}$ being $C_{k+1}^{tA}$, there may exist a number of instances in $C_{k+1}$ which contain some extra attributes along with $C_{k+1}^{tA}$. These instances become the exceptions for the class $C_{k+1}$.

Let us consider the instances of type $I_1$ (see Example 3.2) which go to $C_8$ (Figure 2). We have $I_1 = \{a_0, a_3, a_4, b_0, b_1, b_2, b_3, b_4\}$ and $C_8^{tA} = \{a_0, b_0, b_1, b_2, b_3, b_4\}$. Thus $a_3$, $a_4$ are the extra attributes. Hence all the instances of type $I_1$ will be exceptions to the class $C_8$. If the number of these exceptions crosses the relevant threshold parameters, we may need to restructure the hierarchy again. Thus, considering the instances in $C_{k+1}$ with extra attributes, it may be possible to find a pattern for breaking the class, which is discussed in Section 6.

The partially classified class is not added to the list of leaf classes for exact matching. The break operation will provide classes one level below $C_{k+1}$. These classes will again be considered partially classified and may become candidates for further breaking. $C_{k+1}$ will then become an intermediate class. The recursive breaking will continue depending on the match and exception counts of the newly generated classes, along with the threshold parameters. At the termination of this process, the partially classified leaves are declared leaf classes and inserted in the leaf class list so that they can be considered for exact matching. A partially classified leaf does, however, participate in perfect matching of instances.
5 Fuse

This section considers the fusing of two or more classes which have a common parent. It changes the parent-child relationship of some classes. Figure 3 explains the fuse operation. Now we formalize the algorithm for fuse.
    (a) Before fuse:
    K
        K1 {p1, p2}
            K11 ... K1q
        K2 {p1, p2, p3, p4}
            K21 ... K2m

    (b) After fuse:
    K
        K1 {p1, p2}
            K11 ... K1q
            K2 {p3, p4}
                K21 ... K2m

Figure 3: Fuse K1 and K2

Input : A class K in the hierarchy whose children K1 and K2 can be fused.
Output: A new class hierarchy.
Begin Algorithm
    let K1^A be a proper subset of K2^A
    let the children of K1 be K11, ..., K1q and those of K2 be K21, ..., K2m
    remove the parent-child relationship between K and K2
    K2^A = K2^A - K1^A    // remove the attributes p1, p2 from K2.
    place K2 as a child of K1 along with K11, ..., K1q
    match count of K1 = match count of K1 + match count of K2
End Algorithm

Interestingly, the path identification numbers of all the leaf classes remain undisturbed by the fuse operation, which keeps the syntactic structure of each path the same. For the fuse operation, the relationships among the children classes of the same parent have to be found. In the worst case this takes $O(n^2)$ set operations using a naive algorithm.

In Figure 2 of the previous section, $C_8^A = \{b_1, b_2, b_3, b_4\}$. So $C_5^A \subset C_8^A$ and also $C_6^A \subset C_8^A$. In this situation the fuse operation is not allowed. If the hierarchy were restructured with $C_8$ becoming the child of $C_5$, then all the instances that were perfect matches up to $C_8$ would become exceptions at $C_4$, the parent class of $C_5$. The instances which contain $b_1$, $b_2$, $b_3$ and $b_4$ would not be able to come down under $C_4$, due to the presence of $C_5$ (attributes $b_1$, $b_2$) and $C_6$ (attributes $b_3$, $b_4$). This would defeat the purpose of the add method, which generated the class $C_8$ to reduce the number of exceptions.

Remark 5.1. The fuse of two peer classes $K_1$ and $K_2$ ($K_1^A \subset K_2^A$) is not allowed if there exists any other peer class $K_3$ with $K_3^A \subset K_2^A$.
It is apparent from the add and fuse algorithms that C8 would never be placed as the child of both C5 and C6 . Thus the possibility of multiple inheritance does not exist.
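The fuse operation and the restriction of Remark 5.1 can be sketched on a simple parent map. The dictionaries `parent`, `attrs` and `match_count`, the function names and the sample counts are assumptions for illustration. Updating the match count of $K_1$ follows the reading that the instances of $K_2$ now pass through $K_1$.

```python
def can_fuse(attrs, k1, k2, peers):
    """K1^A must be a proper subset of K2^A, and no other peer K3 may
    satisfy K3^A proper subset of K2^A (Remark 5.1)."""
    if not attrs[k1] < attrs[k2]:
        return False
    return not any(attrs[k3] < attrs[k2] for k3 in peers if k3 not in (k1, k2))


def fuse(parent, attrs, match_count, k1, k2):
    """Place K2 under K1 in place; return True if the fuse was carried out."""
    if parent[k1] != parent[k2]:
        return False
    peers = [c for c, p in parent.items() if p == parent[k1]]
    if not can_fuse(attrs, k1, k2, peers):
        return False
    parent[k2] = k1                       # K2 becomes a child of K1
    attrs[k2] = attrs[k2] - attrs[k1]     # remove the inherited attributes from K2
    match_count[k1] += match_count[k2]    # instances of K2 now pass through K1
    return True


# Figure 3 situation, with hypothetical match counts.
parent = {"K": None, "K1": "K", "K2": "K"}
attrs = {"K": set(), "K1": {"p1", "p2"}, "K2": {"p1", "p2", "p3", "p4"}}
match_count = {"K": 20, "K1": 10, "K2": 4}
done = fuse(parent, attrs, match_count, "K1", "K2")
print(done, parent["K2"], attrs["K2"], match_count["K1"])

# Figure 2 situation: C5 and C8 must not be fused because C6^A is also inside C8^A.
parent2 = {"C4": None, "C5": "C4", "C6": "C4", "C7": "C4", "C8": "C4"}
attrs2 = {"C4": {"b0"}, "C5": {"b1", "b2"}, "C6": {"b3", "b4"},
          "C7": {"b5", "b6"}, "C8": {"b1", "b2", "b3", "b4"}}
count2 = {c: 0 for c in parent2}
blocked = fuse(parent2, attrs2, count2, "C5", "C8")
print(blocked)
```

Since the total attribute set of every descendant of $K_2$ is unchanged by the move, the path identification numbers of the leaves stay the same, as noted in the text.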
6 Break

The purpose of this section is to present a partitioning mechanism for a partially classified leaf. The hierarchy considered here is dynamic because the structure of the classes and the interrelationships among them keep changing as new instances are inserted. One class may be broken into several classes to equip the system for a finer classification of instances. The problem is mapped to the graph-theoretic domain and a deterministic polynomial time algorithm is proposed for this purpose.
6.1 Restricted vertex connectivity

First we give the definition of a restricted vertex separator.

Definition 6.1. Let $G(V_1, V_2, E)$ be an undirected bipartite graph and let $a, b \in V_1$ be such that there does not exist any vertex $c \in V_2$ with both $(a, c), (b, c) \in E$ simultaneously. The set $S_R \subseteq V_1 - \{a, b\}$ is defined as a restricted vertex separator if every path from $a$ to $b$ passes through at least one vertex of $S_R$, i.e. $a$ and $b$ belong to different connected components of $G - S_R$. The minimum cardinality of any restricted $(a, b)$ vertex separator is denoted by $N_R(a, b)$.

Lemma 6.1. Let $G' = (V_1, E')$ be constructed from $G$, where $E' = \{(a, b) : a, b \in V_1, \exists\, c \in V_2 \text{ with } (a, c), (b, c) \in E\}$. If $S_R$ is a minimum cardinality restricted vertex separator set of $G(V_1, V_2, E)$, it is also a minimum cardinality vertex separator set of $G'(V_1, E')$ for $a, b \in V_1$, and vice versa.
Proof: The lemma is proved by contradiction. Let the vertices $a, b$ in $G'(V_1, E')$ be not separated by the removal of $S_R$. Then there exists at least one path $p' = \{a = a_0, a_1, \ldots, a_k = b\}$ in $G'$. Now, for each edge $(a_i, a_{i+1}) \in E'$, there exists some $v_i \in V_2$ such that $(a_i, v_i), (v_i, a_{i+1}) \in E$. Hence there exists a path $p = \{a = a_0, v_0, a_1, v_1, \ldots, v_{k-1}, a_k = b\}$ in $G$, which is a contradiction.

To prove the other direction, let $S$ be a minimum cardinality vertex separator set of $G'(V_1, E')$ for $a, b \in V_1$. Suppose the removal of $S$ in $G(V_1, V_2, E)$ does not separate $a$ and $b$. Then after the removal of $S$ there exists at least one path $p = \{a = a_0, v_0, a_1, v_1, \ldots, v_{k-1}, a_k = b\}$ in $G$. By construction, $p' = \{a = a_0, a_1, \ldots, a_k = b\}$ is a path in $G'$ even after the removal, which contradicts the definition of $S$.
Remark 6.1. So, from the discussion in Section 2, the restricted vertex separator between two vertices of a bipartite undirected graph can be calculated in time $O(n^{1/2} e)$, where $e$ is the number of edges in the transformed general undirected graph, i.e. $e = |E'|$, and $n = |V_1|$.

Definition 6.2. The restricted vertex connectivity of a bipartite graph $G(V_1, V_2, E)$, where the vertices are removed from $V_1$, is

$$c_R = \begin{cases} |V_1| - 1 & \text{if } G \text{ is complete bipartite,} \\ \min\{N_R(a, b) : a, b \in V_1 \text{ and no } c \in V_2 \text{ has } (a, c), (b, c) \in E \text{ simultaneously}\} & \text{otherwise.} \end{cases}$$

Remark 6.2. Calculation of $c_R$ also takes polynomial time $O(n^{1/2} e^2)$, as it is of the same order as finding the vertex connectivity of the transformed general undirected graph.

If the vertices of the restricted vertex separator set are removed, the graph is divided into two or more components. If the graph itself already contains more than one component (i.e. it is disconnected), then more than one component can be found without removing any of the vertices.
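The transformation of Lemma 6.1 can be sketched on a small bipartite graph. The function names and the sample graph are hypothetical. The separator search below is an exponential brute force used only to illustrate the definitions; it is not the $O(n^{1/2} e)$ algorithm referred to in Remark 6.1.

```python
from itertools import combinations


def transform(v1, edges):
    """Build E' on V1: a and b are joined when they share a neighbour in V2.
    `edges` is a list of (vertex in V1, vertex in V2) pairs."""
    nbrs = {a: {c for (x, c) in edges if x == a} for a in v1}
    return {frozenset((a, b)) for a, b in combinations(v1, 2) if nbrs[a] & nbrs[b]}


def connected(vertices, eprime, a, b):
    """Depth-first search in G' restricted to `vertices`."""
    seen, stack = {a}, [a]
    while stack:
        u = stack.pop()
        for v in vertices:
            if v not in seen and frozenset((u, v)) in eprime:
                seen.add(v)
                stack.append(v)
    return b in seen


def min_separator(v1, eprime, a, b):
    """Smallest subset of V1 - {a, b} whose removal separates a from b in G'."""
    others = [v for v in v1 if v not in (a, b)]
    for size in range(len(others) + 1):
        for s in combinations(others, size):
            remaining = [v for v in v1 if v not in s]
            if not connected(remaining, eprime, a, b):
                return set(s)
    return None   # a and b are adjacent in G', so no separator exists


# Hypothetical bipartite graph: V1 = {x, y, z}, V2 = {p, q}.
v1 = ["x", "y", "z"]
edges = [("x", "p"), ("y", "p"), ("y", "q"), ("z", "q")]
eprime = transform(v1, edges)
print(sorted(tuple(sorted(e)) for e in eprime))
print(min_separator(v1, eprime, "x", "z"))
```

In this example $x$ and $z$ share no neighbour in $V_2$, so Definition 6.1 applies, and removing $y$ separates them both in $G'$ and in the original bipartite graph.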
6.2 Problem Description

A partially classified leaf $K$ consists of a set of attributes $A$. As an example, let $K^A = \{p_1, p_2, p_3, p_4, p_5\}$ in the given order, with the instances specified below (see Figure 4).

i1 = 11100, i2 = 11100, i3 = 10011, i4 = 10011, i5 = 11111, i6 = 10011.

To clarify the notation, attributes $p_1, p_2, p_3$ are present but attributes $p_4, p_5$ are absent in i1, whereas all the attributes are present in i5. If the instance i5 is not considered, the situation is as follows.

i1 = 11100, i2 = 11100, i3 = 10011, i4 = 10011, i6 = 10011.

So three classes may be formed: $K'$ with attribute $p_1$, $K_1$ with attributes $p_2, p_3$, and $K_2$ with attributes $p_4, p_5$, where all the instances are members of class $K'$, i1 and i2 are members of $K_1$, and i3, i4 and i6 are members of $K_2$. This gives an idea for breaking a class $K$ into three classes which offer a better classification of the instances, where $K_1$ and $K_2$ are the children of class $K'$.

A very natural question is whether all the instances will offer such well structured forms, so that a partially classified leaf can be divided perfectly. The answer is obviously no. If instance i5 is considered, the class $K$ can no longer be broken cleanly. If the class $K$ is broken into $K'$, $K_1$, $K_2$, then the instance i5 will be an exception to class $K'$: after being a perfect match up to class $K'$, i5 will not proceed downwards. So, when breaking a class in this way, it is necessary to minimize the number of exceptions generated.

The problem is therefore to decompose a partially classified leaf into two or more new partially classified leaves (children), along with an intermediate class (the parent class, which was previously a partially classified leaf), by creating the minimum number of exceptional instances. First the attributes which are present in all the instances of the partially classified leaf (attribute $p_1$ in Figure 4) are identified. A parent intermediate class is constructed with these attributes, and all the instances of the original class become members of this class.

Definition 6.3. A new hypothetical class with the remaining attributes and with all the instances of the original class is called the residual class.

This residual class is the candidate for breaking. The attributes of the residual class are to be divided among more than one class while generating the minimum number of exceptional instances. Let $K_p$ be the parent of $K$. After breaking, the generated classes are, say, $K'$, $K_1$, $K_2$. The generated parent class $K'$ may be given the same name $K$ (with the same parent $K_p$) as the previous partially classified leaf, and $K_1$, $K_2$ will be the new partially classified leaves, which may undergo further decomposition. The graph-theoretic modeling enables us to decide how the attributes and the corresponding instances can be distributed among the children classes.
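The decomposition problem described above can be illustrated on the six instances of this subsection. The sketch below first extracts the attributes common to all instances and then searches, by brute force, for a smallest set of instances whose removal lets the residual class fall apart into two or more groups that share no attribute. All names are hypothetical, and the exhaustive search is only an illustration, not the polynomial time method developed in this section.

```python
from itertools import combinations

ATTRS = ["p1", "p2", "p3", "p4", "p5"]
BITS = {"i1": "11100", "i2": "11100", "i3": "10011",
        "i4": "10011", "i5": "11111", "i6": "10011"}
instances = {i: {a for a, b in zip(ATTRS, s) if b == "1"} for i, s in BITS.items()}

# Attributes present in every instance go to the parent class K'.
common = set.intersection(*instances.values())
# The residual class keeps all instances, restricted to the remaining attributes.
residual = {i: attrs - common for i, attrs in instances.items()}


def components(insts):
    """Group instances that are linked, directly or indirectly, by shared attributes.
    Returns a list of (instance set, attribute set) pairs."""
    groups = []
    for i, attrs in insts.items():
        merged_i, merged_a = {i}, set(attrs)
        rest = []
        for gi, ga in groups:
            if ga & merged_a:          # shares an attribute: merge into one group
                merged_i |= gi
                merged_a |= ga
            else:
                rest.append((gi, ga))
        groups = rest + [(merged_i, merged_a)]
    return groups


def break_class(insts):
    """Smallest set of instances to mark as exceptions so that the rest splits."""
    names = list(insts)
    for size in range(len(names)):
        for removed in combinations(names, size):
            kept = {i: a for i, a in insts.items() if i not in removed}
            parts = components(kept)
            if len(parts) >= 2:
                return set(removed), parts
    return None


removed, parts = break_class(residual)
print(common, removed)
for members, attrs in parts:
    print(sorted(members), sorted(attrs))
```

On this data the only exception is i5, the parent class receives $p_1$, and the two groups correspond to $K_1$ (attributes $p_2, p_3$, instances i1, i2) and $K_2$ (attributes $p_4, p_5$, instances i3, i4, i6).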
Figure 4: Break K. (a) Bipartite graph of the residual class, with attributes p2, p3, p4, p5 on one side and the instances on the other. (b) The corresponding transformed graph, with the exception marked. (c) Class K (p1 p2 p3 p4 p5) broken into the parent K' (p1) with children K1 (p2 p3) and K2 (p4 p5).
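The first step of the break, separating the attributes shared by every instance from the residual class of Definition 6.3, can be sketched in Python. This is only an illustrative sketch, not the authors' implementation: the function name `split_parent_and_residual` and the bit-string encoding of instances are assumptions made here, and the data is the six-instance example of this subsection.

```python
def split_parent_and_residual(attributes, instances):
    """Split a partially classified leaf into parent attributes and a residual class.

    attributes: ordered list of attribute names (hypothetical encoding).
    instances:  dict mapping instance name -> bit string over `attributes`.
    """
    # Attributes whose bit is 1 in every instance go to the parent class.
    parent = [a for k, a in enumerate(attributes)
              if all(bits[k] == "1" for bits in instances.values())]
    # The residual class keeps the remaining attributes and all the instances.
    keep = [k for k, a in enumerate(attributes) if a not in parent]
    residual_attributes = [attributes[k] for k in keep]
    residual = {name: "".join(bits[k] for k in keep)
                for name, bits in instances.items()}
    return parent, residual_attributes, residual


attributes = ["p1", "p2", "p3", "p4", "p5"]
instances = {"i1": "11100", "i2": "11100", "i3": "10011",
             "i4": "10011", "i5": "11111", "i6": "10011"}
parent, residual_attributes, residual = split_parent_and_residual(attributes, instances)
print(parent)               # ['p1']
print(residual_attributes)  # ['p2', 'p3', 'p4', 'p5']
print(residual["i5"])       # 1111
```

Only $p_1$ is common to all six instances, so it alone moves to the parent class, and the residual class retains every instance restricted to $p_2, \ldots, p_5$.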
6.3 Mapping
The residual class is mapped to the following undirected bipartite graph: $G = (J \cup A, E)$, where $J \cup A$ is the set of vertices and $E$ is the set of edges. Here $A$ is the attribute set of the residual class, $J$ is the instance set of the residual class, and $E = \{(p, i) \mid p \in A,\ i \in J,$ and instance $i$ contains attribute $p\}$. Each $(p, i)$ is an unordered pair and hence the graph $G$ is undirected. Moreover, $G$ is bipartite. A situation with 4 attributes and 6 instances is shown in Figure 4(a).

The problem of decomposing one class into several classes while creating the minimum number of exceptional instances thus reduces to the following graph algorithmic problem. Given an undirected graph $G = (J \cup A, E)$, with $A$ and $J$ being the vertex bipartition, find the minimum cardinality set $S_R$ such that $S_R \subseteq J$ and removal of the vertices of $S_R$ makes the graph disconnected (see Subsection 6.2). This is the problem of restricted vertex connectivity in an undirected bipartite graph, and it can be solved deterministically in polynomial time. The instances of $S_R$ will be the exceptions of the generated parent class, and the two disconnected components of the graph will form two children classes. Hence, if the disconnected components are $G_1 = (J_1 \cup A_1, E_1)$ and $G_2 = (J_2 \cup A_2, E_2)$, then one of the children classes will contain the attribute set $A_1$ and instances $J_1$, and the other $A_2$ and $J_2$.

For the example of Subsection 6.2, $J = \{i_1, i_2, i_3, i_4, i_5, i_6\}$ and $A = \{p_2, p_3, p_4, p_5\}$. The attribute $p_1$ is not in $A$ (the attribute set of the residual class), as $p_1$ belongs to all the instances and becomes a member of the parent class. Figure 4(a) gives the resulting bipartite graph and Figure 4(b) gives the corresponding transformed one. After the break algorithm, the parent class $K$ will have instance set $\{i_1, i_2, i_3, i_4, i_5, i_6\}$ and attribute set $\{p_1\}$. For the two children classes, we get
$J_1 = \{i_1, i_2\}$, $A_1 = \{p_2, p_3\}$ and $J_2 = \{i_3, i_4, i_6\}$, $A_2 = \{p_4, p_5\}$, as shown in Figure 4(c). For the parent class $K$, $i_5$ will be an exception, as it is not a perfect match with either of its children classes $K_1$, $K_2$.

Once the children classes $K_1$, $K_2$ are created, we scan through the instances of each child and check whether all the attributes of the class are present in each of the instances. Consider a new instance, $i_7 = 11000$, which after the break falls in class $K_1$. The attribute $p_3$ is absent in it. Hence it should be pushed up to $K'$ and kept, as before, as an exception of $K'$. We may use the same name $K$ for $K'$. More than two children classes may also be generated through this algorithm; that is, each of the disconnected components of the graph will represent one child class. Next we summarize the algorithm for break.

Input : A class $K$ in the hierarchy which is a candidate for break.
Output: Two or more children classes.
Begin Algorithm
    run the graph theoretic algorithm on $K$ to generate $K'$ and the children classes $K_1, K_2, \ldots, K_r$ corresponding to each of the disconnected components of the representative graph
    for each class $K_i$ ($1 \le i \le r$)    // $K_i^A$ is the attribute set of $K_i$
        push up to $K'$ the instances of $K_i$ in which not all the attributes of $K_i^A$ are present
        instance count of $K_i$ = the number of instances remaining in $K_i$
        exception count of $K_i$ = the number of instances which contain extra attributes in addition to the total attribute set $K_i^{tA}$ (Definition 3.4) of $K_i$
    End for
    rename $K'$ as $K$
    exception count of $K$ = instance count of $K$ $- \sum_{i=1}^{r}$ (instance count of $K_i$)    // the instance count of $K$ remains unchanged
End Algorithm

Using this idea, the break algorithm is applied to the hierarchy described in Section 4 (Figure 2). We construct an example with different types of instances to explain the effect of the break mechanism on a dynamic hierarchy.

Example 6.1. Refer to Example 3.1 in Subsection 3.1, where the instance count of $C_4$ is 300. Let us consider 245
instances with five different patterns:
I1 (10011001111100, with $a_0, a_3, a_4, b_0, b_1, b_2, b_3, b_4$ present) (100 occurrences),
I2 (10000111111100, with $a_0, a_5, a_6, b_0, b_1, b_2, b_3, b_4$ present) (100 occurrences),
I3 (10010101111100, with $a_0, a_3, a_5, b_0, b_1, b_2, b_3, b_4$ present) (10 occurrences),
I4 (10010101110010, with $a_0, a_3, a_5, b_0, b_1, b_2, b_5$ present) (25 occurrences),
I5 (10010001111100, with $a_0, a_3, b_0, b_1, b_2, b_3, b_4$ present) (10 occurrences).

All of these are a perfect match up to $C_4$ at depth 1, so there are 245 exceptions to the class $C_4$. After carrying forward the 500 instances from Example 3.1 and their distribution in the relevant classes, the instance count of $C_0$ is $500 + 245 = 745$ and the instance count of $C_4$ is $300 + 245 = 545$. The exception count of $C_4$ is 245. Suppose this exception count exceeds the threshold, so that the add procedure is initiated. There are $100 + 100 + 10 + 10 = 220$ exceptions (corresponding to instance types I1, I2, I3, I5) which contain $\{b_1, b_2, b_3, b_4\}$. We form a class $C_8$ under $C_4$. Hence the instance count of $C_8$ is 220, and the exception count of $C_4$ becomes $245 - 220 = 25$; these 25 exceptions are due to the instances of type I4 only. However, in the class $C_8$ all the instances are exceptions (see Remark 4.3), i.e. the exception count of $C_8$ is 220.

We run the break method to provide a finer classification for the hierarchy. For the partially classified leaf $C_8$, two children classes $C_9$ and $C_{10}$ are generated by the break method (Figure 5), with $C_9^A = \{a_3, a_4\}$ (corresponding to instances of type I1) and $C_{10}^A = \{a_5, a_6\}$ (corresponding to I2). Thus the 10 occurrences each of I3 and I5 (20 in total) stay in the class $C_8$ as exceptions. So the exception count of $C_8$ is 20, while $C_9$ and $C_{10}$ hold 100 instances each.

Figure 5: C8 breaks to C9 and C10. The hierarchy rooted at C0 (a0), with the new class C8 (b1 b2 b3 b4) under C4 and its children C9 (a3 a4) and C10 (a5 a6).
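The break procedure of this section can be sketched end to end in Python. The sketch is illustrative only and rests on assumptions that are not part of the paper: the function names `components` and `break_class` are hypothetical; an exhaustive search over removal sets of increasing size stands in for the polynomial-time restricted vertex connectivity algorithm, so it is usable only on tiny inputs; and a split is accepted only when at least two components contain both attributes and instances. The data is the example of Figure 4 extended with $i_7 = 11000$.

```python
from itertools import combinations


def components(attrs, inst):
    """Connected components of the bipartite attribute-instance graph.

    attrs: set of attribute names; inst: dict instance name -> set of attributes.
    Returns a list of (attribute_set, instance_set) pairs.
    """
    unseen_attrs, unseen_inst = set(attrs), set(inst)
    result = []
    while unseen_attrs or unseen_inst:
        start = ("a", unseen_attrs.pop()) if unseen_attrs else ("i", unseen_inst.pop())
        comp_attrs, comp_inst, stack = set(), set(), [start]
        while stack:
            kind, v = stack.pop()
            if kind == "a":
                comp_attrs.add(v)
                for i in [i for i in unseen_inst if v in inst[i]]:
                    unseen_inst.discard(i)
                    stack.append(("i", i))
            else:
                comp_inst.add(v)
                for a in [a for a in unseen_attrs if a in inst[v]]:
                    unseen_attrs.discard(a)
                    stack.append(("a", a))
        result.append((comp_attrs, comp_inst))
    return result


def break_class(instances):
    """Break a partially classified leaf; instances maps name -> set of attributes."""
    names = sorted(instances)
    parent_attrs = set.intersection(*instances.values())
    residual_attrs = set().union(*instances.values()) - parent_attrs
    # Try removal sets S_R of increasing size, so the first hit is a minimum one.
    for size in range(len(names)):
        for removed in combinations(names, size):
            kept = {n: instances[n] & residual_attrs for n in names if n not in removed}
            parts = [c for c in components(residual_attrs, kept) if c[0] and c[1]]
            if len(parts) < 2:
                continue
            children = []
            for comp_attrs, comp_inst in parts:
                # Instances lacking some attribute of the child are pushed up.
                members = {n for n in comp_inst if comp_attrs <= instances[n]}
                children.append((comp_attrs, members))
            children.sort(key=lambda child: sorted(child[0]))
            classified = set().union(*(m for _, m in children))
            return parent_attrs, children, set(names) - classified
    return None  # the residual graph cannot be split


attributes = ["p1", "p2", "p3", "p4", "p5"]
bits = {"i1": "11100", "i2": "11100", "i3": "10011", "i4": "10011",
        "i5": "11111", "i6": "10011", "i7": "11000"}
instances = {n: {a for a, b in zip(attributes, s) if b == "1"} for n, s in bits.items()}
parent_attrs, children, exceptions = break_class(instances)
print(sorted(parent_attrs))  # ['p1']
print(sorted(exceptions))    # ['i5', 'i7']
```

Removing $i_5$ alone disconnects the graph, and $i_7$ is pushed up because it lacks $p_3$, so the parent keeps two exceptions. This agrees with the counting rule of the algorithm: 7 instances minus the 5 that remain in the children.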
7 Towards Data Mining
Here the complete process is summarized. After the application of the algorithm for approximate classification, if the number of exceptions exceeds the relevant thresholds for a class $C$ which is under classified, the following steps are performed.
1. To add a new child class $C_N$ under $C$.
2. To push to $C_N$ the part of the exceptions of $C$ which will not remain its exceptions after the creation of class $C_N$.
3. To explore the fuse possibilities of $C_N$ with any of its peer classes and to restructure the hierarchy accordingly.
4. To break $C_N$ recursively, if possible.
5. To transform some partially classified leaves to leaf level classes with the consent of the domain expert, if the application domain so demands.
The initial tree structure may be constructed either from domain knowledge or from analysis of some data using standard data mining techniques [9]. The relevant thresholds may be initialized at different values depending on the nature of the analysis. Higher threshold values reduce the possibility of restructuring.
Remark 7.1. Whenever a new class is generated, it explores a new association rule from the database. The generation of a new class $C$ is possible only when the number of instances of $C$ is greater than a threshold value. This guarantees that the association rule generated has proper support. The confidence of the association rule can also be calculated by checking only the counts of the classes on the path from the root to $C$. Let $C_x$ be a superclass of $C_y$ on the same path in the hierarchy. The confidence of an association rule $C_x^A \Rightarrow C_y^A$ can be calculated as the ratio of the instance count of $C_y$ to the instance count of $C_x$.
For Example 6.1 (Figure 5), the instance count of $C_0$ is 745, the total number of instances in the application. For the association rule $C_8^A \Rightarrow C_9^A$ (i.e. $\{b_1, b_2, b_3, b_4\} \Rightarrow \{a_3, a_4\}$), the support is the ratio of the instance counts of $C_9$ and $C_0$, i.e. $\frac{100}{745}$, whereas the confidence is the ratio of the instance counts of $C_9$ and $C_8$, i.e. $\frac{100}{220}$, since $C_8$ holds 220 instances in Example 6.1. In a dynamic environment the database is updated quite often. Since the size of the database is not fixed here, the threshold counts need to be changed from time to time so that they correspond to the predetermined values (in percentage) of support and confidence for the association rules.

The standard data mining algorithms [9] run on a static database to mine nontrivial information. However, many transaction processing environments provide dynamic databases. Running the standard static data mining algorithms repeatedly over a dynamic database may generate considerable overhead. The concept discussed here is free from that kind of overhead and works as a front end data mining layer.
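The support and confidence computations of this section can be sketched in Python from the class counts alone. This is a minimal sketch under stated assumptions: the dictionary `counts` and the functions `support`, `confidence` and `threshold_count` are hypothetical names introduced here, the counts are those of Example 6.1 (745 instances at $C_0$, 220 at $C_8$, 100 each at $C_9$ and $C_{10}$), and the 10 percent minimum support in the last call is an arbitrary illustrative value.

```python
from fractions import Fraction
import math

# Instance counts taken from Example 6.1 after the break of C8.
counts = {"C0": 745, "C8": 220, "C9": 100, "C10": 100}


def support(counts, cls, root="C0"):
    """Fraction of all instances (counted at the root) that reach class `cls`."""
    return Fraction(counts[cls], counts[root])


def confidence(counts, superclass, subclass):
    """Confidence of the rule superclass^A => subclass^A on one root-to-leaf path."""
    return Fraction(counts[subclass], counts[superclass])


def threshold_count(min_support_percent, total_instances):
    """Smallest instance count that meets a support given in percent."""
    return math.ceil(min_support_percent * total_instances / 100)


print(support(counts, "C9"))           # 20/149, i.e. 100/745
print(confidence(counts, "C8", "C9"))  # 5/11, i.e. 100/220
print(threshold_count(10, 745))        # 75
```

Because the threshold is recomputed from a percentage and the current total, it can follow the database as it grows, without rescanning the transactions.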
8 Conclusion
The system developed has been tested against data collected during a customer satisfaction survey made by a leading audio cassette manufacturing company. We could generate association rules which matched those obtained using the well known Apriori algorithm [9]. In addition, we could also study the change in the hierarchy over the sampled environment. The present paper aims at presenting the techniques formally, and hence the details of the experimental results have been omitted. The experimental results are discussed extensively in [3]. The following future possibilities may be explored towards the improvement of the scheme described in this paper.
1. Extending the work towards general quantitative and qualitative value domains, instead of 0-1 values, may be an interesting field of study [8].
2. For classification of instances with some of the attribute values unknown [5], the system may consider the values of the other attributes and try to classify them maximally.
3. Some coding measures like Hamming distance may be used for approximate classification.
4. Assignment of a weight (importance) to each attribute, and dependency among attributes, may be considered.
As claimed earlier, this article shows that the dynamic hierarchy can be used for data mining to generate important association rules. Effort has been taken to state the problem and its solution in a mostly application independent way, so that any application can be developed over this skeleton. The system may be used as an intelligent front end engine to any database.
References
[1] Howard W. Beck, Tarek Anwar, and Shamkant B. Navathe. A conceptual clustering algorithm for database schema design. IEEE Transactions on Knowledge and Data Engineering, 6(3):396-411, June 1994.
[2] Ming-Syan Chen, Jiawei Han, and Philip S. Yu. Data mining: An overview from database perspective. IEEE Transactions on Knowledge and Data Engineering, December 1996.
[3] Tanmoy K. Das. Object oriented schema design with dynamic restructuring. M. C. A. Project (CS17), Indira Gandhi National Open University, June 1998.
[4] J. Han and Y. Fu. Discovery of multiple-level association rules from large databases. In Proceedings of the 21st International Conference on Very Large Data Bases, pages 420-431, September 1995.
[5] Subhamoy Maitra. Approximate schema design by conceptual clustering. M. Tech (Computer Science) Dissertation, Indian Statistical Institute, Calcutta, July 1996.
[6] Kurt Mehlhorn. Graph Algorithms and NP-Completeness. Monographs on Theoretical Computer Science. Springer-Verlag, 1984.
[7] Raghu Ramakrishnan. Database Management Systems. McGraw-Hill, 1998.
[8] R. Srikant and R. Agrawal. Mining quantitative association rules in large relational tables. In Proceedings of ACM SIGMOD, June 1996.
[9] Ramakrishnan Srikant. Fast Algorithms for Mining Association Rules and Sequential Patterns. PhD thesis, University of Wisconsin - Madison, 1996.