Data Mining with Concept Generalization Digraphs

Wanlin Pang, Robert J. Hilderman, Howard J. Hamilton, and Scott D. Goodwin
Department of Computer Science, University of Regina, Regina, Saskatchewan, Canada, S4S 0A2
{pang,hilder,hamilton,[email protected]

Abstract

We present the Path-Based Generalization and Bias-Based Generalization algorithms for attribute-oriented induction using concept generalization digraphs. Concept generalization digraphs are constructed from multiple concept hierarchies associated with an attribute, or set of attributes, in a database. The algorithms efficiently manage the generalization process by avoiding unnecessary re-generalization and by determining which intermediate generalized relations to store for possible future use.

1 Introduction

Attribute-oriented induction has been shown to be an effective method for knowledge discovery from databases [6, 7, 5, 3, 4, 2, 9, 10, 8, 1]. A concept hierarchy (CH) associated with an attribute in a database is represented as a tree where leaf nodes correspond to actual data values in the database, intermediate nodes correspond to more general representations of the data values, and the root node corresponds to the most general representation of the data values. For example, a CH for tracking sales by a computer services company could be represented by the tree in Figure 1. This company sells four different types of hardware and software, where one type of hardware and software is sold from each city. The levels of this CH correspond to the more general representation in the concept graph (CG) of Figure 3(a). Higher level concepts can be learned through generalization of the sales data as follows:
1. Retrieve relevant data from the database and remove attributes for which there are no higher level concepts in the CH.
2. Generalize the data using the knowledge provided in the CH:

[Figure 1: Concept hierarchy. Root ANY; divisions WEST (Vancouver, L.A.) and EAST (Toronto, New York); one hardware and one software product (Hardware1/Software1 through Hardware4/Software4) under the cities.]

(a) Compile statistics about the number of distinct values encountered for the attribute being generalized.
(b) Replace data values with higher level concepts and remove duplicate tuples from the resulting generalized relation.
(c) Generalize further if the number of tuples in the generalized relation is greater than the specified attribute threshold.
3. Transform a tuple in the generalized relation to conjunctive normal form, and transform multiple tuples to disjunctive normal form.

When knowledge about an attribute can be expressed in different ways (e.g., by multiple CHs), the generalization technique outlined above may not produce a satisfactory result from the selected CH. For example, it does not provide any information regarding sales by country. To accomplish this, another CH for tracking sales could be represented by the tree in Figure 2. The levels of this CH correspond to the more general representation in the CG of Figure 3(b). A negative aspect of the CHs in Figures 1 and 2 is that they do not identify any relationships that may exist between them. However, the CGs in Figures 3(a) and 3(b) clearly show that these CHs are similar. In this paper, we introduce concept generalization digraphs (CGDs) as a method for guiding the learning process when an attribute may have multiple CHs. Figure 3(c) shows a CGD which has been constructed from the CGs in Figures 3(a) and 3(b).
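A minimal sketch of step 2 above, assuming a toy sales relation and the city-to-division level of Figure 1's CH (the relation contents, the `generalize` helper, and the use of tuple counts to stand in for the compiled statistics are all illustrative):

```python
from collections import Counter

# Illustrative one-level mapping from Figure 1's CH: city -> division.
CITY_TO_DIVISION = {"Vancouver": "WEST", "L.A.": "WEST",
                    "Toronto": "EAST", "New York": "EAST"}

def generalize(relation, mapping, attr):
    """Replace attr's values with higher level concepts, merge duplicate
    tuples, and track how many base tuples each generalized tuple covers."""
    merged = Counter()
    for tup, count in relation.items():
        lifted = tuple(mapping.get(v, v) if i == attr else v
                       for i, v in enumerate(tup))
        merged[lifted] += count
    return dict(merged)

# Toy relation: (city, product) tuples, each initially covering one row.
sales = {("Vancouver", "Hardware1"): 1, ("L.A.", "Software2"): 1,
         ("Toronto", "Hardware3"): 1, ("L.A.", "Hardware2"): 1}

generalized = generalize(sales, CITY_TO_DIVISION, attr=0)
# Step 2(c): generalize further only if len(generalized) exceeds the threshold.
```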

[Figure 2: Alternative concept hierarchy. Root ANY; countries Canada (Vancouver: Hardware1, Software1; Toronto: Hardware3, Software3) and U.S. (L.A.: Hardware2, Software2; New York: Hardware4, Software4).]

[Figure 3: three concept graphs, each with Product and City at the bottom and ANY at the top: (a) via Division, (b) via Country, (c) the CGD, via both Division and Country.]

Figure 3: Similar CGs and a CGD

The remainder of this paper is organized as follows. In the following section, we provide a formal definition of CGDs and describe how one is constructed. In Section 3, we introduce two algorithms for generalization using CGDs, and walk through a detailed example for one of them. We conclude in Section 4 with a summary.

2 Concept Generalization Digraphs

2.1 Definitions

Given a set S = {s1, s2, ..., sn} (which could be the domain of an attribute or the Cartesian product of the domains of a set of attributes), S can be partitioned in many different ways, e.g., P = {p1, p2, ..., pn} = {{s1}, {s2}, ..., {sn}} and Q = {q1, q2} = {{s1}, {s2, ..., sn}}. Let D be the set of partitions of set S, and let ⪯ be a binary relation defined on D such that D1 ⪯ D2 if for every d ∈ D1 there exists d′ ∈ D2 such that d ⊆ d′. The binary relation ⪯ is a partial order relation, and ⟨D, ⪯⟩ defines a partial order set, from which we can construct a partial order set diagram ⟨D, E⟩ as follows:
1. The vertices of the diagram are the elements of D.
2. There is a directed arc from Dj to Di (denoted by E(Dj, Di)) iff Dj ≠ Di, Dj ⪯ Di, and there is no Dk ∈ D such that Dj ⪯ Dk and Dk ⪯ Di.
Let Dg = {S} and Dl = {{s1}, {s2}, ..., {sn}}; then for any Di ∈ D we have Dl ⪯ Di and Di ⪯ Dg. Dl is called the least element and Dg the greatest element.

For example, given S = {0, 1, 2, 3, 4, 5}, let D = {Dg, D1, D2, D3, Dl}, where Dg = {S} = {{0, 1, 2, 3, 4, 5}}, D1 = {{0, 1, 2}, {3, 4, 5}}, D2 = {{0, 1}, {2, 3}, {4, 5}}, D3 = {{0, 1}, {2}, {3}, {4, 5}}, and Dl = {{0}, {1}, {2}, {3}, {4}, {5}}. Then we have a partial order set diagram ⟨D, E⟩ as shown in Figure 4. We call this partial order set diagram a concept generalization digraph for the given set and call the nodes (elements of D) concepts. The least element is the lowest level concept and the greatest element is the highest level concept. Two concepts, Di and Dj, are comparable if either Di ⪯ Dj or Dj ⪯ Di; that is, in the CGD there exists a path containing both Di and Dj. For each node Di in a CGD ⟨D, E⟩, we define parents(Di) to be all nodes Dj such that Di ⪯ Dj, children(Di) to be all nodes Dk such that Dk ⪯ Di, in_degree(Di) = |children(Di)|, and out_degree(Di) = |parents(Di)|.
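The relation ⪯ can be checked directly from the definition: one partition is below another iff every block of the first lies inside some block of the second. A small sketch using the running example (the `refines` and `blocks` helper names are illustrative):

```python
def refines(p, q):
    """True iff partition p ⪯ q: every block of p is a subset of a block of q."""
    return all(any(a <= b for b in q) for a in p)

def blocks(*groups):
    """Build a partition as a list of frozenset blocks."""
    return [frozenset(g) for g in groups]

# The five partitions of S = {0,...,5} from the example.
Dg = blocks(range(6))
D1 = blocks({0, 1, 2}, {3, 4, 5})
D2 = blocks({0, 1}, {2, 3}, {4, 5})
D3 = blocks({0, 1}, {2}, {3}, {4, 5})
Dl = blocks({0}, {1}, {2}, {3}, {4}, {5})
```

Note that D1 and D2 are incomparable: the block {2, 3} of D2 straddles both blocks of D1, and {0, 1, 2} of D1 straddles two blocks of D2.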

2.2 Constructing a CGD

Given a CG for a domain, we call each level (including the root and leaves) a concept and the nodes at each level the instances of the concept. Every concept is a partition of the instance set of lower level concepts, as well as a partition of the domain. If one or more CGs are given for a domain, all the concepts defined by these CGs comprise a set D of partitions of the domain. Based on the definition in the previous section, the pair ⟨D, ⪯⟩ defines a partial order set, and from this partial order set we can construct a CGD ⟨D, E⟩ associated with the domain. Each CG corresponds to a path from Dl to Dg in ⟨D, E⟩. Note that there may be some paths from Dl to Dg in ⟨D, E⟩ for which there is no corresponding CG. In this case, the CGD provides more information from which we can create additional CGs.
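Under these definitions, a CGD's arcs are the covering pairs of ⪯: Dj points to Di when Dj is strictly below Di with no concept in between. A sketch of the construction on the Figure 4 example (the `refines` helper and the partition names are illustrative):

```python
def refines(p, q):
    """True iff partition p ⪯ q: every block of p is a subset of a block of q."""
    return all(any(a <= b for b in q) for a in p)

def cgd_edges(concepts):
    """Arcs (Dj, Di) of the CGD: Dj strictly below Di, nothing strictly between."""
    def below(a, b):  # strict refinement between named concepts
        return refines(concepts[a], concepts[b]) and set(concepts[a]) != set(concepts[b])
    return {(j, i)
            for j in concepts for i in concepts
            if below(j, i) and not any(below(j, k) and below(k, i)
                                       for k in concepts if k not in (i, j))}

concepts = {
    "Dg": [frozenset(range(6))],
    "D1": [frozenset({0, 1, 2}), frozenset({3, 4, 5})],
    "D2": [frozenset({0, 1}), frozenset({2, 3}), frozenset({4, 5})],
    "D3": [frozenset({0, 1}), frozenset({2}), frozenset({3}), frozenset({4, 5})],
    "Dl": [frozenset({i}) for i in range(6)],
}
```

Running `cgd_edges(concepts)` reproduces the diagram of Figure 4: Dl feeds D3, which branches to the incomparable D1 and D2, both of which feed Dg.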

3 Generalization Using CGDs

3.1 Basic Idea

CGs define the mapping from the instance set of lower level concepts to that of higher level concepts. If we map the instances of one concept to the instances of another concept (i.e., replace specific values with generalized values), and the result is not satisfactory, we can either map the generalized values to more general values, or we can use another CG to perform a similar generalization. In the latter case, we need to manage the generalization process to avoid unnecessary re-generalization, and to determine which intermediate generalized relations to store for possible future use. CGDs can be used to achieve both of these goals.

[Figure 4: Partial order set diagram, with least element Dl, intermediate concepts D1, D2, D3, and greatest element Dg.]

For example, given the CGD shown in Figure 5, if we have generalized from concept J to I, and we need further generalization from I to either H or F, then we do not need to keep the relation at J. However, if we have generalized to F and further generalization is needed, we need to keep the intermediate relation at I because we may need to generalize from I to H later on. If we have generalized from F to D and the results are still unsatisfactory, we can return to I and generalize from I to H. From the CGD we can see that it is unnecessary to generalize from H to G, because eventually this path will lead to D, which has already been found to be unsatisfactory.
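The pruning just described can be stated simply: once generalizing to some concept has failed to reach the threshold, any remaining candidate path that still passes through that concept can be skipped. A sketch, using the node names of Figure 5 (the path encoding and helper name are illustrative):

```python
def prune_failed(path_set, failed):
    """Drop every candidate path that still passes through a concept
    already found unsatisfactory (too many tuples remain there)."""
    return [p for p in path_set if failed not in p]

# Remaining paths of the Figure 5 CGD after generalizing up to D.
paths = ["IHGDA", "IHECA", "IHEBA"]

# D was unsatisfactory, so the path through G (which leads back to D)
# is skipped rather than re-generalized.
remaining = prune_failed(paths, failed="D")
```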

3.2 Algorithms

Two methods for generalization using CGDs are the path-based generalization (PBG) and the bias-based generalization (BBG) algorithms. Assume that several CGs are given on the domain of an attribute which is to be generalized, and that an attribute threshold is given to control termination of the generalization process. From the given CGs, a CGD is constructed as shown in Figure 5. Using PBG, we select a CG (i.e., a path from Dl to Dg) from the CGD and generalize the relations at lower level concepts to relations at higher level concepts until the attribute threshold is reached (i.e., generalization is complete) or until the most general concept is the last remaining concept to be generalized. In the latter case, another CG (i.e., a different path from Dl to Dg) is selected and the above steps are repeated. Using BBG, all concepts in the CGD are ranked a priori to determine the order in which the edges are traversed during generalization. We select an edge (i.e., an edge from Di to Dj) from the CGD and generalize the relation at the lower level concept to a relation at the higher level concept. If the attribute threshold has not been reached, the next edge is selected and the above steps are repeated. This process continues until the attribute threshold is reached or until the most general concept is the last remaining concept to be generalized. Note that the rank of a concept must be consistent with the partial ordering.

That is, if r(Di) denotes the rank of Di, consistency means that if E(Di, Dj), then r(Di) < r(Dj).

[Figure 5: Concept generalization digraph, with least node J, greatest node A, and paths JIFDA, JIHGDA, JIHECA, and JIHEBA.]
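One way to obtain ranks consistent with the partial ordering is a topological numbering of the CGD: give each concept a rank strictly greater than the ranks of everything below it. A sketch over the Figure 5 edges (the edge list is read off the four paths given in Section 3.2.1; the helper name is illustrative):

```python
from functools import lru_cache

# Edges E(lower, higher) of the Figure 5 CGD, read off its four paths
# JIFDA, JIHGDA, JIHECA, and JIHEBA.
EDGES = [("J","I"), ("I","F"), ("I","H"), ("F","D"), ("H","G"), ("G","D"),
         ("H","E"), ("E","C"), ("E","B"), ("D","A"), ("C","A"), ("B","A")]

def consistent_ranks(edges):
    """rank(n) = 1 + longest chain below n, so E(i, j) implies rank(i) < rank(j)."""
    children = {}
    for lo, hi in edges:
        children.setdefault(hi, set()).add(lo)

    @lru_cache(maxsize=None)
    def rank(node):
        below = children.get(node, set())
        return 0 if not below else 1 + max(rank(c) for c in below)

    return {n: rank(n) for e in edges for n in e}

ranks = consistent_ranks(EDGES)
assert all(ranks[lo] < ranks[hi] for lo, hi in EDGES)
```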

3.2.1 The PBG Algorithm

Procedures pbg and pbg_reconcile of the PBG algorithm are shown in Figures 6 and 7, respectively. Given a CGD ⟨D, E⟩, function nodes(⟨D, E⟩) generates the set of all nodes contained in ⟨D, E⟩. Function all_paths(⟨D, E⟩) generates the set of all paths from Dl to Dg contained in ⟨D, E⟩. Function cardinality(PathSet) returns the number of elements in PathSet. Function tuple_count(Ch) determines the number of distinct tuples in a relation. Function node_count(CurrentPath) determines the number of nodes in the current path remaining to be generalized. Function generalize(Cl, Ch) replaces the values at Cl with the values at Ch; it calls tuple_count(Ch) and returns true if the attribute threshold has been reached, and false otherwise. Any of the generalization algorithms presented in [3, 5, 4] may be used to implement this function.

begin
  NodeSet ← nodes(⟨D, E⟩);
  PathSet ← all_paths(⟨D, E⟩);
  while cardinality(PathSet) > 0 do
    CurrentPath ← first path in PathSet;
    remove CurrentPath from PathSet;
    Ch ← first node in CurrentPath;
    remove Ch from CurrentPath;
    if tuple_count(Ch) ≤ threshold then
      return success;
    end if
    while node_count(CurrentPath) > 1 do
      Cl ← Ch;
      Ch ← first node in CurrentPath;
      remove Ch from CurrentPath;
      if generalize(Cl, Ch) then
        return success;
      end if
      pbg_reconcile(Cl, Ch);
    end while
    Cl ← Ch;
    Ch ← first node in CurrentPath;
    pbg_reconcile(Cl, Ch);
  end while
  return failure;
end

Figure 6: Procedure pbg

We now walk through a detailed example of PBG. Consider the CGD, with least node J and greatest node A, shown in Figure 5. From this CGD, we first generate the set of all paths contained in ⟨D, E⟩, so that PathSet = {⟨JIFDA⟩, ⟨JIHGDA⟩, ⟨JIHECA⟩, ⟨JIHEBA⟩}. The concepts in each path of PathSet will be generalized until the attribute threshold is reached (i.e., success) or until there are no more paths to be generalized (i.e., failure). We set CurrentPath to ⟨JIFDA⟩ (i.e., the first path in PathSet) and remove ⟨JIFDA⟩ from PathSet. We set Ch to J (i.e., the first node in CurrentPath) and remove J from CurrentPath. We call tuple_count(J) to determine whether the attribute threshold has been reached. If so, we return success and are done (i.e., the attribute is already generalized). Otherwise, the concepts of each node in CurrentPath will be generalized until the attribute threshold is reached (i.e., success) or until there is only one node remaining in CurrentPath (this will be the most general concept). We set Cl to Ch, set Ch to I (i.e., the new first node in CurrentPath), remove I from CurrentPath, and generalize from J to I. If generalize(J, I) returns success, then we are done. Otherwise, we call pbg_reconcile(J, I). We remove J from all paths in PathSet containing an edge from J to I and remove any paths in PathSet which now contain only one node (again, this will be the most general concept). We now have CurrentPath = ⟨FDA⟩ and PathSet = {⟨IHGDA⟩, ⟨IHECA⟩, ⟨IHEBA⟩}. We now generalize from I to F. If this fails, no removals are required from PathSet because there are no paths with an edge from I to F. We now have CurrentPath = ⟨DA⟩ and PathSet remains unchanged. We now generalize from F to D. If this fails, again no removals are required from PathSet. We now have CurrentPath = ⟨A⟩ and PathSet remains unchanged. Since CurrentPath now contains only one node, we are finished with this path. We set Cl to Ch, set Ch to A, and call pbg_reconcile(D, A). pbg_reconcile removes D and all its descendants from the paths in PathSet. We now have CurrentPath = ⟨A⟩ and PathSet = {⟨A⟩, ⟨IHECA⟩, ⟨IHEBA⟩}. Since ⟨A⟩ contains only one node, it is removed from PathSet. We now have PathSet = {⟨IHECA⟩, ⟨IHEBA⟩}, which corresponds to a new CGD

begin
  RemoveSet ← ∅;
  for each Path ∈ PathSet do
    if ⟨Cl′ Ch′⟩ ∈ Path then
      RemoveSet ← RemoveSet ∪ children(Ch′) from Path;
      remove children(Ch′) from Path;
      if node_count(Path) = 1 then
        remove Path from PathSet;
      end if
    end if
  end for
  while cardinality(RemoveSet) > 0 do
    Cr ← first node in RemoveSet;
    remove Cr from RemoveSet;
    RemoveNode ← true;
    for each Path ∈ PathSet do
      if Cr ∈ Path then
        RemoveNode ← false;
      end if
    end for
    if RemoveNode then
      remove Cr from NodeSet;
    end if
  end while
  return;
end

Figure 7: Procedure pbg_reconcile(Cl′, Ch′)

with least node I and greatest node A, as shown in Figure 8. We set CurrentPath to ⟨IHECA⟩ and repeat the above procedure. A node is removed from NodeSet when the value of the corresponding concept is no longer needed. For example, if the original data in the database is no longer needed, node J can be removed. Similarly, when an intermediate generalized relation is no longer needed, the node for the corresponding concept can be removed.

[Figure 8: New concept generalization digraph, with least node I, greatest node A, and paths IHECA and IHEBA.]
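The reconciliation step of the walkthrough can be sketched compactly under simplifying assumptions: paths are strings of single-letter node names, an edge Cl→Ch appears in a path as the substring ClCh, and children(Di) follows the paper's Section 2.1 definition (every concept below Di, not just immediate predecessors). All helper names are illustrative:

```python
def below(node, all_paths):
    """children(node) in the paper's sense: every concept Dk with Dk ⪯ node,
    read off the CGD's paths (all nodes preceding node on some path)."""
    return {p[i] for p in all_paths if node in p for i in range(p.index(node))}

def pbg_reconcile(cl, ch, path_set, all_paths):
    """After generalizing cl -> ch: from every remaining path that uses the
    edge cl->ch, drop ch's lower concepts; discard paths cut to one node."""
    kids = below(ch, all_paths)
    pruned = []
    for p in path_set:
        if cl + ch in p:                 # path contains the edge cl->ch
            p = "".join(n for n in p if n not in kids)
        if len(p) > 1:                   # single-node paths are removed
            pruned.append(p)
    return pruned

# The four paths of the Figure 5 CGD, least node J to greatest node A.
ALL_PATHS = ["JIFDA", "JIHGDA", "JIHECA", "JIHEBA"]

# Walkthrough step 1: after generalizing J -> I, J is dropped everywhere.
rest = pbg_reconcile("J", "I", ["JIHGDA", "JIHECA", "JIHEBA"], ALL_PATHS)
# Walkthrough step 2: reconciling on the final edge D -> A collapses the
# path IHGDA (which leads through D) and discards it.
rest = pbg_reconcile("D", "A", rest, ALL_PATHS)
```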

3.2.2 The BBG Algorithm

Procedures bbg and bbg_reconcile of the BBG algorithm are shown in Figures 9 and 10, respectively. Given a CGD ⟨D, E⟩, function all_edges(⟨D, E⟩) generates the ordered set of all edges contained in ⟨D, E⟩ as follows:
1. Edges are ordered by the rank of the higher level concept (from lowest to highest).
2. Where multiple edges have the same higher level concept, these edges are ordered by the rank of the lower level concept.

begin
  NodeSet ← nodes(⟨D, E⟩);
  Ch ← lowest ranked node in NodeSet;
  if tuple_count(Ch) ≤ threshold then
    return success;
  end if
  EdgeSet ← all_edges(⟨D, E⟩);
  while cardinality(EdgeSet) > 0 do
    CurrentEdge ← first edge in EdgeSet;
    remove CurrentEdge from EdgeSet;
    Cl ← first node in CurrentEdge;
    Ch ← second node in CurrentEdge;
    if generalize(Cl, Ch) then
      return success;
    end if
    bbg_reconcile(Cl, Ch);
  end while
  return failure;
end

Figure 9: Procedure bbg
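The all_edges ordering can be sketched with any ranks consistent with the partial order: sort edges by the rank of the higher concept, breaking ties by the rank of the lower concept. The ranks below are illustrative (one consistent assignment for Figure 5), as is the helper name:

```python
# Edges (lower, higher) of the Figure 5 CGD and one consistent ranking.
EDGES = [("J","I"), ("I","F"), ("I","H"), ("F","D"), ("H","G"), ("G","D"),
         ("H","E"), ("E","C"), ("E","B"), ("D","A"), ("C","A"), ("B","A")]
RANK = {"J": 0, "I": 1, "F": 2, "H": 2, "G": 3, "E": 3,
        "D": 4, "C": 4, "B": 4, "A": 5}

def all_edges(edges, rank):
    """Order edges by rank of the higher concept, then by rank of the
    lower concept, as required by BBG's a priori edge ordering."""
    return sorted(edges, key=lambda e: (rank[e[1]], rank[e[0]]))

ordered = all_edges(EDGES, RANK)
# bbg then pops edges from the front of this list until the threshold is met.
```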

begin
  RemoveSet ← Cl′;
  for each Edge ∈ EdgeSet do
    Cs ← second node in Edge;
    if Ch′ = Cs then
      Cf ← first node in Edge;
      RemoveSet ← RemoveSet ∪ Cf;
      remove Edge from EdgeSet;
    end if
  end for
  while cardinality(RemoveSet) > 0 do
    Cr ← first node in RemoveSet;
    remove Cr from RemoveSet;
    RemoveNode ← true;
    for each Edge ∈ EdgeSet do
      Cf ← first node in Edge;
      if Cr = Cf then
        RemoveNode ← false;
      end if
    end for
    if RemoveNode then
      remove Cr from NodeSet;
    end if
  end while
  return;
end

Figure 10: Procedure bbg_reconcile(Cl′, Ch′)

4 Conclusion

We presented the path-based generalization and bias-based generalization algorithms for attribute-oriented induction using CGDs. Future research will focus on developing heuristics for improving the efficiency of the algorithms. Heuristics are required for selecting and ordering the paths during path-based generalization, and for determining the rank of a node during bias-based generalization.

References

[1] R. Agrawal, T. Imielinski, and A. Swami. Mining association rules between sets of items in large databases. In Proceedings of the 1st International Conference on Knowledge Discovery and Data Mining (KDD-95), pages 207-216, Montreal, August 1995.
[2] C. L. Carter and H. J. Hamilton. Fast, incremental generalization and regeneralization for knowledge discovery from databases. In J. Stewman, editor, Proceedings of the 8th Florida Artificial Intelligence Symposium, pages 319-323, Melbourne, Florida, April 1995.
[3] C. L. Carter and H. J. Hamilton. A fast, on-line generalization algorithm for knowledge discovery. Applied Mathematics Letters, 8(2):5-11, 1995.
[4] C. L. Carter and H. J. Hamilton. Performance evaluation of attribute-oriented algorithms for knowledge discovery from databases. (Accepted by the 7th IEEE International Conference on Tools with Artificial Intelligence), 1995.
[5] J. Han. Towards efficient induction mechanisms in database systems. Theoretical Computer Science (Special Issue on Formal Methods in Databases and Software Engineering), October 1994.
[6] J. Han, Y. Cai, and N. Cercone. Knowledge discovery in databases: an attribute-oriented approach. In Proceedings of the 18th International Conference on Very Large Data Bases, pages 547-559, Vancouver, August 1992.
[7] J. Han, Y. Cai, and N. Cercone. Data-driven discovery of quantitative rules in relational databases. IEEE Transactions on Knowledge and Data Engineering, 5(1):29-40, February 1993.
[8] H.-Y. Hwang and W.-C. Fu. Efficient algorithms for attribute-oriented induction. In Proceedings of the 1st International Conference on Knowledge Discovery and Data Mining (KDD-95), pages 168-173, Montreal, August 1995.
[9] M. Madjimichael and H. J. Hamilton. Extracting concept hierarchies from relational databases. In J. Stewman, editor, Proceedings of the 8th Florida Artificial Intelligence Symposium, pages 314-318, Melbourne, Florida, April 1995.
[10] S. Nishio, H. Kawano, and J. Han. Knowledge discovery in object-oriented databases: the first step. In G. Piatetsky-Shapiro, editor, Proceedings of the AAAI-93 Workshop on Knowledge Discovery in Databases, pages 299-313, Washington, DC, July 1993.