
© Kluwer Academic Publishers, Boston. Manufactured in The Netherlands.

Data Mining in Large Databases Using Domain Generalization Graphs

ROBERT J. HILDERMAN [email protected]
HOWARD J. HAMILTON [email protected]
Department of Computer Science, University of Regina, Regina, Saskatchewan, Canada S4S 0A2

NICK CERCONE [email protected]
Department of Computer Science, Faculty of Mathematics, University of Waterloo, Waterloo, Ontario, Canada N2L 3G1

Abstract. Attribute-oriented generalization summarizes the information in a relational database by repeatedly replacing specific attribute values with more general concepts according to user-defined concept hierarchies. We introduce domain generalization graphs for controlling the generalization of a set of attributes and show how they are constructed. We then present serial and parallel versions of the Multi-Attribute Generalization algorithm for traversing the generalization state space described by joining the domain generalization graphs for multiple attributes. Based upon a generate-and-test approach, the algorithm generates all possible summaries consistent with the domain generalization graphs. Our experimental results show that significant speedups are possible by partitioning path combinations from the DGGs across multiple processors. We also rank the interestingness of the resulting summaries using measures based upon variance and relative entropy. Our experimental results also show that these measures provide an effective basis for analyzing summary data generated from relational databases. Variance appears more useful because it tends to rank the less complex summaries (i.e., those with few attributes and/or tuples) as more interesting.

Keywords: Data mining, knowledge discovery, machine learning, knowledge representation, attribute-oriented generalization, domain generalization graphs

1. Introduction Knowledge discovery from databases (KDD) algorithms can be broadly classified into two general areas: summarization and anomaly detection. Summarization algorithms find concise descriptions of input data. For example, classificatory algorithms partition input data into disjoint groups. The results of such classification might be represented as a high-level summary, a decision tree, or a set of characteristic rules, as with C4.5 (Quinlan, 1993), DBLearn (Han et al., 1995), and KID3 (Piatetsky-Shapiro, 1991). Anomaly-detection algorithms identify unusual features of data, such as combinations that occur with greater or lesser frequency than might be expected. For example, association algorithms find, from transaction records, sets of items that appear with each other in sufficient frequency to merit attention (Agrawal et al., 1993, Brin et al., 1997a, Brin et al., 1997b, Park et al., 1995, Toivonen, 1996). Similarly, sequencing algorithms find relationships among items or events across time, such as events A and B usually precede event C (Agrawal et al., 1995, Agrawal and Srikant, 1995, Srikant and Agrawal, 1996). Hybrid approaches that generate high-level association rules from input data and concept hierarchies have also been investigated (Han and Fu, 1995, Hilderman et al., 1998a, Srikant and Agrawal, 1995).

Attribute-oriented generalization (AOG) (Han, 1994, Han et al., 1992, Han et al., 1993) is a summarization algorithm that has been effective for KDD. AOG summarizes the information in a relational database by repeatedly replacing specific attribute values with more general concepts according to user-defined concept hierarchies associated with relevant attributes. A concept hierarchy (CH) associated with an attribute in a database is represented as a tree, where leaf nodes correspond to the actual data values in the database, intermediate nodes correspond to a more general representation of the data values, and the root node corresponds to the most general representation of the data values. For example, a CH for the Office attribute in a sales database is shown in Figure 1. Knowledge about a higher level concept (i.e., a non-leaf node) can be discovered by generating the summary corresponding to the collection of nodes at each level in the CH. As the result of recent research, AOG methods are considered among the most efficient of KDD methods for knowledge discovery from databases (Carter and Hamilton, 1995a, Carter and Hamilton, 1995b, Carter and Hamilton, 1995c, Han, 1994, Hwang and Fu, 1995).

[Figure 1. A CH for the Office attribute: ANY covers West and East; West covers Vancouver (Office1, Office2) and L.A. (Office3, Office4); East covers New York (Office5, Office6, Office7).]

The complexity of the CHs is a primary factor determining the interestingness of the results (Barber and Hamilton, 1997, Hamilton and Fudger, 1995). If several CHs are available for the same attribute, which means knowledge about the attribute can be expressed in different ways, current AOG methods require the user to select one. For example, the CH in Figure 1 does not provide any information regarding sales by country. Thus, a fundamental problem with AOG methods is that they present only one possible group of summaries to the user without evaluating the relative merits of other groups of summaries consistent with other CHs. To facilitate the generation of other groups of summaries, new data structures are needed to enable the data in a relational database to be represented in different ways. Here we introduce a data structure called a domain generalization graph (DGG) which augments CHs and expands the scope of AOG methods (Hamilton et al., 1996, Hilderman et al., 1997a, Pang et al., 1995). Informally, a DGG associated with a CH defines a partial order that represents a set of generalization relations for the attribute. A DGG always includes a single source (i.e., the node at the lowest level, corresponding to the domain of the attribute) and a single sink (i.e., the node at the highest level, corresponding to the most general representation of the domain, which contains the value ANY). The node at each depth in a DGG is a general description of the nodes at the same depth in the corresponding CH. For example, the nodes at each depth of the CH in Figure 1 correspond to the nodes at the same depth in the more general representation of the DGG in Figure 2. That is, West and East correspond to Division; Vancouver, L.A., and New York correspond to City; and Office1 to Office7 correspond to Office. Note that a CH always corresponds to a DGG that contains a single path.

[Figure 2. A DGG for the Office attribute: a single path Office → City → Division → ANY.]

When there are multiple CHs associated with a single attribute, meaning knowledge about an attribute can be expressed in many different ways, a multi-path DGG can be constructed from the single-path DGGs. Figure 3 shows a multi-path DGG on the right that has been constructed from the two single-path DGGs on the left. Here we assume that a common name used in the single-path DGGs associated with the attribute represents the same partition of the domain in the associated CHs. For example, in Figure 3, the ANY, City, and Office nodes in the single-path DGGs on the left represent the same partition of the domain in the associated CHs. Consequently, the like-named nodes can be combined in the multi-path DGG.

[Figure 3. A multi-path DGG for the Office attribute, constructed from the two single-path DGGs Office → City → Division → ANY and Office → City → Country → ANY.]


In this paper, we present serial and parallel versions of the All_Gen algorithm for generalization of a set of attributes using multi-path DGGs. We show that a set of attributes can be considered a single attribute whose domain is the cross product of the individual attribute domains. These algorithms generate summaries corresponding to all combinations of nodes from the DGGs associated with a set of attributes and perform the minimum number of generalizations in a space-efficient manner. When the number of attributes to be generalized is large (however, we typically restrict the number of attributes to no more than four or five) or the DGGs associated with a set of attributes are complex, many summaries may be generated, requiring the user to evaluate each one to determine whether it contains an interesting result. To overcome this problem, we use variance and relative entropy measures to rank the interestingness of the summaries generated. Then, after the summaries have been ranked, we use five pruning heuristics to reduce the number of summaries under consideration.

This work was motivated by the need to automate the discovery process for cases where many ways of generalization may be appropriate. For example, given a database with a time-related attribute, summaries can be done according to the concepts hour of day, part of day, day, day of week, day of month, week, week in month, week in quarter, month, year, and many others. Our system not only creates all of these summaries, it also ranks them to help identify any anomalies, such as a disproportionate percentage of activity in the first week of a month. Furthermore, all other attributes can have similarly complex summaries, and our system considers all resulting combinations. These features enable a database analyst or domain expert to analyze the database from many different perspectives.

The remainder of this paper is organized as follows. In the following section, we state the formal definition of a DGG. In Section 3, we provide a high-level description of the DGG-Discover system, a data mining tool which uses DGGs for guiding knowledge discovery. In Section 4, we introduce and discuss serial and parallel multi-attribute generalization algorithms. In Section 5, we consider two interestingness measures used to rank the summaries generated by our algorithms. In Section 6, we present an extended example demonstrating the operation of the serial implementation of the multi-attribute generalization algorithm. In Section 7, we compare and contrast the two interestingness measures. In Section 8, we discuss pruning methods for reducing the number of summaries generated. In Section 9, we present experimental results. Finally, in Section 10, we summarize our work and suggest areas for future research.

2. Definitions Given a set S = {s1, s2, ..., sn} representing the domain of some attribute, we can partition S in many different ways. For example, D1 = {{s1}, {s2}, ..., {sn}}, D2 = {{s1}, {s2, ..., sn}}, and D3 = {{s1, s2}, {s3, ..., sn}} represent three possible partitions of S. Let D = {D1, D2, ..., Dm} be the set of partitions of set S. We define a nonempty binary relation ⪯ (called a generalization relation) on D, where we say Di ⪯ Dj if for every di ∈ Di there exists dj ∈ Dj such that di ⊆ dj. The generalization relation ⪯ is a partial order relation, and ⟨D, ⪯⟩ defines a partially ordered set from which we can construct a graph ⟨D, E⟩, called a domain generalization graph, or DGG, as follows. First, the nodes of the graph are elements of D. And second, there is a directed arc from Di to Dj (denoted by E(Di, Dj)) iff Di ≠ Dj, Di ⪯ Dj, and there is no Dk ∈ D such that Di ⪯ Dk and Dk ⪯ Dj. Let Dg = {S} and Dd = {{s1}, {s2}, ..., {sn}}. For any Di ∈ D we have Dd ⪯ Di and Di ⪯ Dg, where Dd and Dg are called the least and greatest elements of D, respectively. We call the nodes (elements of D) domains, where the least element is the most specific level of generality and the greatest element is the most general level. There is a DGG where the least element is mapped directly to the greatest element (i.e., Dd is mapped to Dg). For each node Di in ⟨D, E⟩, we define descendants(Di) to be all nodes Dj such that Di ⪯ Dj, and ancestors(Di) to be all nodes Dk such that Dk ⪯ Di.

For example, given S = {Vancouver, Toronto, Montreal, Los Angeles, New York, St. Louis}, let D = {Office, City, Division, Country, ANY}, where Dg = {S} = {{Vancouver, Toronto, Montreal, Los Angeles, New York, St. Louis}}, D3 = {{Vancouver, Toronto, Montreal}, {Los Angeles, New York, St. Louis}}, D2 = {{Vancouver, Toronto}, {Montreal, Los Angeles}, {New York, St. Louis}}, D1 = {{Vancouver, Toronto}, {Montreal}, {Los Angeles}, {New York, St. Louis}}, and Dd = {{Vancouver}, {Toronto}, {Montreal}, {Los Angeles}, {New York}, {St. Louis}}. These partitions correspond to the nodes in the multi-path DGG shown in Figure 3, where Dd, D1, D2, D3, and Dg correspond to the Office, City, Division, Country, and ANY nodes, respectively.
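To make the generalization relation concrete, the following Python sketch (our illustration, not code from the paper; the helper name precedes is an assumption) checks whether one partition precedes another for some of the partitions in the Office example above.

def precedes(di, dj):
    # Di precedes Dj when every block of Di is contained in some block of Dj.
    return all(any(block <= other for other in dj) for block in di)

D_d = [{"Vancouver"}, {"Toronto"}, {"Montreal"},
       {"Los Angeles"}, {"New York"}, {"St. Louis"}]
D_3 = [{"Vancouver", "Toronto", "Montreal"},
       {"Los Angeles", "New York", "St. Louis"}]
D_g = [{"Vancouver", "Toronto", "Montreal",
        "Los Angeles", "New York", "St. Louis"}]

print(precedes(D_d, D_3))   # True: every single-city block fits inside a country block
print(precedes(D_3, D_d))   # False: a country block spans several cities
print(precedes(D_3, D_g))   # True: the most general partition covers everything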

3. Attribute-Oriented Generalization The DGG-Discover system is a research software tool for KDD based upon an enhanced version of the well-known data mining technique AOG (Cai et al., 1991). In this section, we describe, in general terms, how DGG-Discover works to generate summaries from data in a database. We informally describe our AOG technique by way of a simple example showing how a CH can be ascended to produce a summary. Transforming a specific data description into a more general one is called generalization. Generalization techniques include the dropping condition and climbing tree methods (Michalski, 1983). The climbing tree method transforms the data in a database by repeatedly replacing specific attribute values with more general concepts according to user-defined CHs. The dropping condition method transforms the data in a database by removing a condition from a conjunction of conditions, so that the remaining conjunction of conditions is more general. For example, assume the conjunction of conditions (shape = round ∧ size = large ∧ colour = white) describes the concept ball. Removing the condition colour = white, which is equivalent to generalizing the Colour attribute to ANY, yields the conjunction of conditions (shape = round ∧ size = large). The concept ball is now more general because it encompasses large round objects of any colour.


AOG (Han, 1994, Han et al., 1992, Han et al., 1993) is a summarization algorithm that integrates the climbing tree and dropping condition methods for generalizing data in a database. An efficient variant of the AOG algorithm has been incorporated into the DGG-Discover system. By associating a CH with an attribute, DGG-Discover can generalize and summarize the domain values for the attribute in many different ways. An attractive feature of DGG-Discover is that it is possible to obtain many different summaries for an attribute by simply changing the structure of the associated CH, and these new summaries can be obtained without modifying the underlying data. For example, consider the database shown in Table 1. Using the CH from Figure 1 to guide the generalization, one of the many possible summaries which can be generated is shown in Table 2, where the Office, Quantity, and Amount attributes have been selected for generalization, and the actual values for the Office attribute in each tuple have been generalized to the level of West and East. The values in the Quantity and Amount attributes have been aggregated accordingly, and a new attribute, called Count, shows the number of tuples that have been aggregated in each generalized tuple from the unconditioned data in the original sales database.

Table 1. The sales database

Office   Shape    Size      Colour   Quantity   Amount
2        round    small     white    2          $50.00
5        square   medium    black    3          $75.00
3        round    large     white    1          $25.00
7        round    large     black    4          $100.00
1        square   x-large   white    3          $75.00
6        round    small     white    4          $100.00
4        square   small     black    2          $50.00

Table 2. An example sales summary

Office   Quantity   Amount    Count
West     8          $200.00   4
East     11         $275.00   3
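To illustrate a single climbing-tree step, the following Python sketch (ours, not DGG-Discover code; the names division, sales, and summary are introduced only for the example) maps each Office value of Table 1 to its Division concept from the CH of Figure 1 and aggregates Quantity, Amount, and Count, reproducing Table 2.

from collections import defaultdict

division = {  # leaf -> ancestor at the Division level of the CH in Figure 1
    "Office1": "West", "Office2": "West", "Office3": "West", "Office4": "West",
    "Office5": "East", "Office6": "East", "Office7": "East",
}

sales = [  # (Office, Quantity, Amount) columns of Table 1
    ("Office2", 2, 50.00), ("Office5", 3, 75.00), ("Office3", 1, 25.00),
    ("Office7", 4, 100.00), ("Office1", 3, 75.00), ("Office6", 4, 100.00),
    ("Office4", 2, 50.00),
]

summary = defaultdict(lambda: [0, 0.0, 0])   # quantity, amount, count per division
for office, quantity, amount in sales:
    row = summary[division[office]]
    row[0] += quantity
    row[1] += amount
    row[2] += 1

for div, (quantity, amount, count) in summary.items():
    print(div, quantity, amount, count)      # West 8 200.0 4 and East 11 275.0 3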

4. Multi-Attribute Generalization In this section, we present algorithms for multi-attribute generalization. These algorithms have been incorporated into DGG-Discover, our most recent research tool for KDD. DGG-Discover provides both serial and parallel processing techniques for performing AOG when multiple CHs can be associated with an attribute. In Section 4.1, we introduce the basic ideas behind multi-attribute generalization with an example. In Section 4.2, we provide a general discussion regarding the set of all possible summaries that can be generated by the multi-attribute generalization algorithm. In Sections 4.3 and 4.4, we describe efficient serial and parallel implementations of All_Gen, the multi-attribute generalization algorithm.

4.1. The Basic Idea

A typical discovery task usually involves generalization of a small set of attributes (i.e., no more than four or five), where each attribute is associated with a multi-path DGG. When generalizing a set of attributes, the set can be considered a single attribute whose domain is the cross product of the individual attribute domains. For example, given the domains for the attributes Shape, Size, and Colour shown in Table 3, the domain (i.e., the cross product of the domains for the three attributes) for the compound attribute Shape-Size-Colour is as shown in Table 4.

Table 3. Domains for the Shape, Size, and Colour attributes

Shape    Size      Colour
round    small     black
square   medium    white
         large
         x-large

Table 4. Domain for the compound attribute Shape-Size-Colour

round-small-black     round-small-white     round-medium-black     round-medium-white
round-large-black     round-large-white     round-x-large-black    round-x-large-white
square-small-black    square-small-white    square-medium-black    square-medium-white
square-large-black    square-large-white    square-x-large-black   square-x-large-white
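The compound domain of Table 4 is simply the cross product of the three domains in Table 3, as the following minimal Python sketch (ours) shows.

import itertools

shape = ["round", "square"]
size = ["small", "medium", "large", "x-large"]
colour = ["black", "white"]

compound = ["-".join(combo) for combo in itertools.product(shape, size, colour)]
print(len(compound))    # 16 values, from 'round-small-black' to 'square-x-large-white'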

A generalization from the domain of the Shape-Size-Colour attribute is a combination of nodes from the DGGs associated with the individual attributes, taking one node from the DGG for each attribute. For example, consider again the database shown in Table 1. The Shape, Size, Colour, and Quantity attributes have been selected for generalization. Using the CHs from Figure 4 for the Shape, Size, and Colour attributes, where two CHs are shown for the Size attribute, and the associated DGGs shown in Figure 5, the summary which can be obtained by generalizing to the DGG node combination ANY-Package-Colour is shown in Table 5. We use the climbing tree method to obtain the generalized values for the Shape and Size attributes. Since the value for the Shape attribute now covers all possible values, we could use the dropping condition method to remove the Shape column from our summary without losing any meaningful information.

[Figure 4. CHs for the Shape, Size, and Colour attributes: the Shape CH places Round and Square under ANY; the Colour CH places Black and White under ANY; two CHs are shown for the Size attribute, one grouping Small and Medium under Bag and Large and X-Large under Box, the other grouping the sizes under Light and Heavy.]

[Figure 5. DGGs for the Shape, Size, and Colour attributes: Shape → ANY; Size → Package → ANY and Size → Weight → ANY; Colour → ANY.]

4.2. The Generalization State Space

The generalization state space consists of all possible summaries which can be generated from the DGGs associated with a set of attributes. This space is more general than a version space (Mitchell, 1978) because we allow more complex generalizations than value-to-ANY, and our summaries contain a relation rather than a single conjunctive description. On the other hand, it is more constricted than a Galois lattice (Godin et al., 1995) because not every possible subset is included in the lattice. The generalization state space is obtained by determining all possible combinations of nodes from the DGGs, and then generating the summary which corresponds to each node combination. For example, the generalization state space for the compound attribute Shape-Size-Colour in Table 4 consists of 2 × 4 × 2 = 16 summaries, where 2, 4, and 2 are the number of nodes in the DGGs associated with the Shape, Size, and Colour attributes, respectively. The two summaries corresponding to the node combinations Shape-Size-Colour and ANY-ANY-ANY do not provide any new information. In the case of the Shape-Size-Colour summary, all of the attributes are ungeneralized, which corresponds to the unconditioned data present in the original database. In the case of the ANY-ANY-ANY summary, all of the attribute values are generalized to the value ANY, which corresponds to a summary obtainable from any database using the aggregate SQL function COUNT(*) (this function counts all rows in the database, without duplicate elimination, and generates a summary consisting of one row and one column containing a single scalar value).

Table 5. Summary for the DGG node combination ANY-Package-Colour

Shape   Size   Colour   Count
ANY     bag    white    2
ANY     bag    black    2
ANY     box    white    2
ANY     box    black    1

The size of the generalization state space depends only on the number of nodes in the associated DGGs; it is not dependent upon the number of tuples in the input relation or the number of attributes selected for generalization. For example, in the experimental results presented in Section 9, the four DGGs used in the four-attribute discovery task contain 29, 12, 16, and eight nodes, respectively, so 29 × 12 × 16 × 8 = 44,544 summaries are generated. The largest relation in the database contains approximately 3,000,000 tuples, but using any of the algorithms presented in (Carter and Hamilton, 1995b, Carter and Hamilton, 1998), we only need to read the original input relation once. The algorithms presented in (Carter and Hamilton, 1995b, Carter and Hamilton, 1998) run in O(n) time, where n is the number of tuples in the input relation, and require O(p) space, where p is the number of tuples in the summaries (typically p ≪ n). In (Carter and Hamilton, 1998), it is also proven that an AOG algorithm which runs in O(n) time is optimal for generalizing a relation. In general, for a discovery task containing m attributes, a database containing n tuples, and an O(n) generalization algorithm, creating all possible summaries requires O(n ∏_{i=1}^{m} |Di|) time, where |Di| is the number of nodes in the DGG for attribute i. We have developed and implemented practical serial and parallel algorithms for traversing the generalization state space where m is small (≤ 5) and n is large (> 1,000,000) (Hamilton et al., 1996, Hilderman et al., 1997a).

We assume that a user developing CHs and DGGs is a domain expert, knowledgeable in all aspects of the data for which the CHs and DGGs are being constructed. Our experience has shown that CHs and DGGs can be arbitrarily large, thus increasing their complexity and making them more difficult to understand. For example, the 29-node DGG from the previous paragraph has seven unique paths through it, each associated with a unique CH. The largest CH is seven levels deep, containing approximately 9,200 nodes, over 2,000 of which are leaf nodes. To assist in the task of constructing CHs, methods for automatically generating CHs from databases using cluster identification and concept aggregation are proposed in (Chu et al., 1996, Hu and Cercone, 1994, Stumme et al., 1998, Wille, 1992). However, none of these methods were used in this work because we wanted to construct the CHs and DGGs utilizing concepts and terms that were known to, and of interest to, the domain experts provided by our commercial partner.

4.3. The Serial Algorithm

Given a relation R, a set of m DGGs, and a set of m attributes, where one DGG is associated with each attribute, the All_Gen algorithm, shown in Figure 6, generates all possible summaries consistent with the DGGs for the set of attributes. In All_Gen, the function Node_Count (line 4) determines the number of nodes in DGG Di. The function Generalize (line 9) returns a summary where attribute i in the input relation has been generalized to the level of node Dik, where Dik is the k-th node of DGG Di. Any of the generalization algorithms presented in (Carter and Hamilton, 1998, Carter and Hamilton, 1995a, Carter and Hamilton, 1995b, Carter and Hamilton, 1995c, Han, 1994) may be used to implement the Generalize function. The procedure Interest (line 10) determines the interestingness of the summary. The procedure Output (line 11) saves the summary and the combination of nodes from which the summary was generated.

4.3.1. General Overview The initial call to All_Gen is All_Gen(R, 1, m, D, Dnodes), where R is the input relation for the discovery task, 1 is an identifier corresponding to the first attribute, m is an identifier corresponding to the last attribute, D is the set of m DGGs associated with the m attributes, and Dnodes is a vector in which the i-th element is initialized to Di1 (we assume the first node in each Di corresponds to the domain of Di). Dnodes is used to store the combination of nodes from which each summary is generated. The algorithm is described as follows. In the i-th iteration of All_Gen (corresponding to the i-th attribute), one pass is made through the for loop (lines 4 to 12) for each non-domain node in Di (i.e., the DGG associated with attribute i). If the i-th iteration of All_Gen is not also the m-th iteration (that is, corresponding to the last attribute) (line 5), then the i+1-th iteration of All_Gen is made (line 6).


 1.  procedure All_Gen(relation, i, m, D, Dnodes)
 2.  begin
 3.      work_relation ← relation
 4.      for k = 1 to Node_Count(Di) − 1 do begin
 5.          if i < m then
 6.              All_Gen(work_relation, i + 1, m, D, Dnodes)
 7.          end if
 8.          Dnodes[i] ← Di(k+1)
 9.          work_relation ← Generalize(relation, i, Di(k+1))
10.          interestingness ← Interest(work_relation)
11.          Output(work_relation, Dnodes, interestingness)
12.      end
13.  end

Figure 6. Serial multi-attribute generalization algorithm

The i+1-th iteration of All_Gen is All_Gen(work_relation, i+1, m, D, Dnodes), where the values of m, D, and Dnodes do not change from the i-th iteration. The first parameter, work_relation, was previously set to the value of relation prior to entering the for loop (line 3). The second parameter, i, is incremented by one (corresponding to the i+1-th attribute). In the first pass through the for loop (i.e., k = 1) for the i-th iteration, the value of work_relation is R (i.e., the original input relation). In the m-th iteration of All_Gen, or when the i+1-th iteration returns control to the i-th iteration (line 6), the i-th iteration determines the next level of generalization for attribute i (i.e., Di(k+1)) and saves it in the i-th element of the vector Dnodes (line 8). The relation used as input to the i-th iteration of All_Gen is generalized to the level of node Di(k+1) (line 9), the interestingness of the resulting summary is determined (line 10), and the summary is saved along with the interestingness and the combination of nodes from which the summary was generated (line 11). In all passes through the for loop other than the first (i.e., k > 1), the value of work_relation passed by the i-th iteration to the i+1-th iteration is the value of relation generalized to the level of the node selected in the previous pass (line 9).
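The following Python sketch is our own reading of the serial algorithm, not the DGG-Discover implementation. Each DGG is assumed to be a list of (node name, concept map) pairs, where the first entry is the attribute's own domain and every later entry maps domain values to generalized concepts (values absent from a map are treated as ANY); the names all_gen, generalize, and interest mirror Figure 6, and interest uses the variance measure of Section 5.1.

from collections import Counter

def generalize(relation, attr_index, concept_map):
    # Replace attribute attr_index by its concept and merge duplicate tuples.
    counts = Counter()
    for row, count in relation:
        row = list(row)
        row[attr_index] = concept_map.get(row[attr_index], "ANY")
        counts[tuple(row)] += count
    return list(counts.items())

def interest(summary):
    # Variance of the tuple distribution from a uniform distribution (Section 5.1).
    total = sum(c for _, c in summary)
    n = len(summary)
    if n < 2:
        return 0.0
    return sum((c / total - 1 / n) ** 2 for _, c in summary) / (n - 1)

def all_gen(relation, i, dggs, dnodes, output):
    # dggs[i][0] is the domain node of attribute i; dnodes is copied on recursion
    # so that assignments made in deeper iterations do not leak back to the caller,
    # matching the walkthrough of Section 4.3.2.
    work_relation = relation
    for k in range(1, len(dggs[i])):
        if i < len(dggs) - 1:
            all_gen(work_relation, i + 1, dggs, list(dnodes), output)
        name, concept_map = dggs[i][k]
        dnodes[i] = name
        work_relation = generalize(relation, i, concept_map)
        output.append((tuple(dnodes), interest(work_relation), work_relation))

# Example: the Size and Colour columns of Table 1 with (part of) the DGGs of
# Figure 5; the Weight node is omitted for brevity.
package = {"small": "bag", "medium": "bag", "large": "box", "x-large": "box"}
size_dgg = [("Size", None), ("Package", package), ("ANY", {})]
colour_dgg = [("Colour", None), ("ANY", {})]
rows = [(("small", "white"), 1), (("medium", "black"), 1), (("large", "white"), 1),
        (("large", "black"), 1), (("x-large", "white"), 1), (("small", "white"), 1),
        (("small", "black"), 1)]
summaries = []
all_gen(rows, 0, [size_dgg, colour_dgg], ["Size", "Colour"], summaries)
for dnodes, score, summary in summaries:
    print(dnodes, round(score, 5), summary)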

4.3.2. Detailed Walkthrough We now present a detailed walkthrough of the serial algorithm. Consider again the sales database shown in Table 1, and the associated DGGs for the Shape, Size, and Colour attributes shown in Figure 5. The All_Gen procedure is initially called with parameters relation = the contents of the sales database from Table 1, i = 1, m = 3, D = the DGGs from Figure 5, and Dnodes = {⟨Shape⟩, ⟨Size⟩, ⟨Colour⟩}. In this walkthrough, we assume that D11 = ⟨Shape⟩, D12 = ⟨ANY⟩, D21 = ⟨Size⟩, D22 = ⟨Package⟩, D23 = ⟨Weight⟩, D24 = ⟨ANY⟩, D31 = ⟨Colour⟩, and D32 = ⟨ANY⟩.

Iteration 1 - Invocation 1 - Loop 1. We set work_relation to relation (line 3) to prevent changing the original value of relation in this iteration. Since k = 1 ≤ Node_Count(D1) − 1 = 1 (line 4), we continue with the first loop of this iteration. Since i = 1 < m = 3 (line 5), we call the second iteration of All_Gen (line 6) with parameters work_relation (unchanged from the current iteration), i = 2, m = 3, D (D never changes from one iteration to the next), and Dnodes = {⟨Shape⟩, ⟨Size⟩, ⟨Colour⟩}.

Iteration 2 - Invocation 1 - Loop 1. We set work_relation to relation (line 3). Since k = 1 ≤ Node_Count(D2) − 1 = 3 (line 4), we continue with the first loop of this iteration. Since i = 2 < m = 3 (line 5), we call the third iteration of All_Gen (line 6) with parameters work_relation (unchanged from the current iteration), i = 3, m = 3, D, and Dnodes = {⟨Shape⟩, ⟨Size⟩, ⟨Colour⟩}.

Iteration 3 - Invocation 1 - Loop 1. We set work_relation to relation (line 3). Since k = 1 ≤ Node_Count(D3) − 1 = 1 (line 4), we continue with the first loop of this iteration. Since i = 3 ≮ m = 3 (line 5), we do not call a fourth iteration of All_Gen (line 6). Instead, we set Dnodes[3] = D32 (i.e., ⟨ANY⟩) (line 8), so Dnodes = {⟨Shape⟩, ⟨Size⟩, ⟨ANY⟩}. We set work_relation to the result returned from a call to Generalize (line 9) with parameters relation, i = 3, and D32. The value of work_relation, shown in Table 6, is the value of Table 1, having selected only the Shape, Size, and Colour attributes, with the Colour attribute generalized to the level of node D32. We call Interest (line 10) with parameter work_relation. We call Output (line 11) with parameters work_relation and Dnodes = {⟨Shape⟩, ⟨Size⟩, ⟨ANY⟩}. Since k = Node_Count(D3) − 1 = 1 (line 4), the first invocation of the third iteration is complete.

Table 6. Summary for the DGG node combination Shape-Size-ANY

Shape    Size      Colour   Count
round    small     ANY      2
square   medium    ANY      1
round    large     ANY      2
square   x-large   ANY      1
square   small     ANY      1

Iteration 2 - Invocation 1 - Loop 1 (continued). The call to the third iteration of All_Gen (line 6) is complete. We set Dnodes[2] = D22 (i.e., ⟨Package⟩) (line 8), so Dnodes = {⟨Shape⟩, ⟨Package⟩, ⟨Colour⟩}. We set work_relation to the result returned from a call to Generalize (line 9) with parameters relation, i = 2, and D22. The value of work_relation, shown in Table 7, is the value of Table 1, having selected only the Shape, Size, and Colour attributes, with the Size attribute generalized to the level of node D22. We call Interest (line 10) with parameter work_relation. We call Output (line 11) with parameters work_relation and Dnodes = {⟨Shape⟩, ⟨Package⟩, ⟨Colour⟩}. The first loop of the first invocation of the second iteration is complete.

Table 7. Summary for the DGG node combination Shape-Package-Colour

Shape    Size   Colour   Count
round    bag    white    2
square   bag    black    2
round    box    white    1
round    box    black    1
square   box    white    1

Iteration 2 - Invocation 1 - Loop 2. Since k = 2 ≤ Node_Count(D2) − 1 = 3 (line 4), we continue with the second loop of this iteration. Since i = 2 < m = 3 (line 5), we call the third iteration of All_Gen (line 6) with parameters

work_relation (unchanged from the current iteration), i = 3, m = 3, D, and Dnodes = {⟨Shape⟩, ⟨Package⟩, ⟨Colour⟩}.

Iteration 3 - Invocation 2 - Loop 1. We set work_relation to relation (line 3). Since k = 1 ≤ Node_Count(D3) − 1 = 1 (line 4), we continue with the first loop of this iteration. Since i = 3 ≮ m = 3 (line 5), we do not call a fourth iteration of All_Gen (line 6). Instead, we set Dnodes[3] = D32 (i.e., ⟨ANY⟩) (line 8), so Dnodes = {⟨Shape⟩, ⟨Package⟩, ⟨ANY⟩}. We set work_relation to the result returned from a call to Generalize (line 9) with parameters relation, i = 3, and D32. The value of work_relation, shown in Table 8, is the value of Table 1, having selected only the Shape, Size, and Colour attributes, with the Colour attribute generalized to the level of node D32. We call Interest (line 10) with parameter work_relation. We call Output (line 11) with parameters work_relation and Dnodes = {⟨Shape⟩, ⟨Package⟩, ⟨ANY⟩}. Since k = Node_Count(D3) − 1 = 1 (line 4), the second invocation of the third iteration is complete.

Table 8. Summary for the DGG node combination Shape-Package-ANY

Shape    Size   Colour   Count
round    bag    ANY      2
square   bag    ANY      2
round    box    ANY      2
square   box    ANY      1

The important aspects of the serial algorithm have now been clearly demonstrated. The remainder of this example is left as an exercise for the reader. 4.4. The Parallel Algorithm

As previously mentioned in Section 4.2, the size of the generalization state space depends only on the number of nodes in the DGGs; it is not dependent upon the number of tuples in the input relation or the number of attributes selected for generalization. When the number of attributes to be generalized is large, or the DGGs associated with a set of attributes are complex, we can improve the performance of the serial algorithm through parallel generalization. Our parallel algorithm does not simply assign one node in the generalization state space to each processor, because the startup cost for each processor was found to be too great in comparison to the actual work performed. Through experimentation, we adopted a more coarse-grained approach, where a unique combination of paths, consisting of one path through the DGG for each attribute, is assigned to each processor. For example, given attribute A with three possible paths through its DGG, attribute B with 4, and attribute C with 2, our approach creates 3 × 4 × 2 = 24 processes. The Par_All_Gen algorithm, shown in Figure 7, creates parallel All_Gen child processes on multiple processors (line 8). In Par_All_Gen, the function Path_Count (line 3) determines the number of paths in DGG Di.

 1.  procedure Par_All_Gen(relation, i, m, D, Dpaths, Dnodes)
 2.  begin
 3.      for k = 1 to Path_Count(Di) do begin
 4.          Dpaths[i] ← Dik
 5.          if i < m then
 6.              Par_All_Gen(relation, i + 1, m, D, Dpaths, Dnodes)
 7.          else
 8.              fork All_Gen(relation, 1, m, Dpaths, Dnodes)
 9.          end if
10.      end
11.  end

Figure 7. Parallel multi-attribute generalization algorithm

4.4.1. General Overview The initial call to Par_All_Gen is Par_All_Gen(R, 1, m, D, ∅, Dnodes), where R, 1, m, D, and Dnodes have the same meaning as in the serial algorithm, and ∅ initializes Dpaths. Dpaths is a vector in which the i-th element is assigned a unique path from Di. The algorithm is described as follows. In the i-th iteration of Par_All_Gen, one pass is made through the for loop (lines 3 to 10) for each distinct path in Di. The current path, Dik, is determined and saved in the i-th element of Dpaths (line 4), where Dik is the k-th path in Di. If the i-th iteration of Par_All_Gen is not also the m-th iteration (line 5), then the i+1-th iteration of Par_All_Gen is called (line 6). The i+1-th iteration of Par_All_Gen is Par_All_Gen(relation, i+1, m, D, Dpaths, Dnodes), where the values of relation, m, D, and Dnodes do not change from the i-th iteration. The second parameter is incremented by one. The fifth parameter, Dpaths, had its i-th element set to Dik (line 4). When the i+1-th iteration returns control to the i-th iteration (line 6), the next pass through the for loop begins (line 4). In the m-th iteration of Par_All_Gen, an All_Gen child process is created (line 8).


The call to All_Gen is All_Gen(relation, 1, m, Dpaths, Dnodes), where relation, m, and Dnodes are unchanged from the values passed as parameters to the m-th iteration of Par_All_Gen. The second parameter, 1, is an identifier corresponding to the first attribute. The fourth parameter, Dpaths, is a unique vector containing m paths (i.e., one from each Di associated with the set of attributes). The All_Gen child process then follows the serial algorithm described in the previous section.

The parallel algorithm may generalize the same combination of nodes in Dnodes on multiple processors. This can occur when a node in a DGG resides on more than one path (i.e., the node is at an intersection where two or more paths cross). This could be prevented through prior analysis of the generalization state space or by introducing some form of communication and synchronization between processors, but both approaches introduce additional overhead. For these experiments, we consider this redundant generalization to be tolerable because it only occurs in a small percentage of the total number of states in the generalization state space.
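A sketch of the coarse-grained parallel strategy follows, again our own illustration rather than the authors' code: each combination of DGG paths (one path per attribute) is handed to a worker process that runs the serial all_gen from the sketch in Section 4.3 on that combination. The helper names are assumptions, and, as noted above, node combinations shared by several paths may be generated more than once.

import itertools
from multiprocessing import Pool

def run_combination(args):
    relation, paths = args
    dnodes = [path[0][0] for path in paths]   # start every attribute at its domain node
    results = []
    all_gen(relation, 0, list(paths), dnodes, results)
    return results

def par_all_gen(relation, dgg_paths, workers=4):
    # dgg_paths[i] is the list of single paths through the DGG of attribute i.
    combos = [(relation, combo) for combo in itertools.product(*dgg_paths)]
    with Pool(workers) as pool:   # call this under an "if __name__ == '__main__':" guard
        parts = pool.map(run_combination, combos)
    return [summary for part in parts for summary in part]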

4.4.2. Detailed Walkthrough We now present a detailed walkthrough of the parallel algorithm. Consider again the sales database shown in Table 1, and the associated DGGs for the Shape, Size, and Colour attributes shown in Figure 5. The Par_All_Gen procedure is initially called with parameters relation = the contents of the sales database from Table 1, i = 1, m = 3, D = the DGGs from Figure 5, Dpaths = ∅, and Dnodes = {⟨Shape⟩, ⟨Size⟩, ⟨Colour⟩}. We assume that the Di have the same values as described in Section 4.3.2. In this walkthrough, we also assume that D11 = the path ⟨Shape, ANY⟩ in the DGG for the Shape attribute, D21 = the path ⟨Size, Package, ANY⟩ and D22 = the path ⟨Size, Weight, ANY⟩ in the DGG for the Size attribute, and D31 = the path ⟨Colour, ANY⟩ in the DGG for the Colour attribute.

Iteration 1 - Invocation 1 - Loop 1. Since k = 1 ≤ Path_Count(D1) = 1 (line 3), we continue with the first loop of this iteration. We set Dpaths[1] = D11 (line 4), so Dpaths = {⟨D11⟩}. Since i = 1 < m = 3 (line 5), we call the second iteration of Par_All_Gen (line 6) with parameters relation (relation never changes from one iteration to the next), i = 2, m = 3, D (D never changes from one iteration to the next), Dpaths = {⟨D11⟩}, and Dnodes = {⟨Shape⟩, ⟨Size⟩, ⟨Colour⟩}.

Iteration 2 - Invocation 1 - Loop 1. Since k = 1 ≤ Path_Count(D2) = 2 (line 3), we continue with the first loop of this iteration. We set Dpaths[2] = D21 (line 4), so Dpaths = {⟨D11⟩, ⟨D21⟩}. Since i = 2 < m = 3 (line 5), we call the third iteration of Par_All_Gen (line 6) with parameters relation, i = 3, m = 3, D, Dpaths = {⟨D11⟩, ⟨D21⟩}, and Dnodes = {⟨Shape⟩, ⟨Size⟩, ⟨Colour⟩}.

Iteration 3 - Invocation 1 - Loop 1. Since k = 1 ≤ Path_Count(D3) = 1 (line 3), we continue with the first loop of this iteration. We set Dpaths[3] = D31 (line 4), so Dpaths = {⟨D11⟩, ⟨D21⟩, ⟨D31⟩}. Since i = 3 ≮ m = 3 (line 5), we do not call a fourth iteration of Par_All_Gen (line 6). Instead, we fork a call to All_Gen (line 8) with parameters relation, i = 1, m = 3, D = Dpaths, and Dnodes = {⟨Shape⟩, ⟨Size⟩, ⟨Colour⟩}. All_Gen will execute the serial algorithm, as described in Section 4.3.2, to generate all summaries described by the nodes on the paths in D. Since k = Path_Count(D3) = 1 (line 3), the first invocation of the third iteration is complete.


Iteration 2 - Invocation 1 - Loop 1 (continued). The call to the third iteration of Par_All_Gen (line 6) is complete. The first loop of the first invocation of the second iteration is complete.

Iteration 2 - Invocation 1 - Loop 2. Since k = 2 ≤ Path_Count(D2) = 2 (line 3), we continue with the second loop of this iteration. We set Dpaths[2] = D22 (line 4), so Dpaths = {⟨D11⟩, ⟨D22⟩}. Since i = 2 < m = 3 (line 5), we call the third iteration of Par_All_Gen (line 6) with parameters relation, i = 3, m = 3, D, Dpaths = {⟨D11⟩, ⟨D22⟩}, and Dnodes = {⟨Shape⟩, ⟨Size⟩, ⟨Colour⟩}.

Iteration 3 - Invocation 2 - Loop 1. Since k = 1 ≤ Path_Count(D3) = 1 (line 3), we continue with the first loop of this iteration. We set Dpaths[3] = D31 (line 4), so Dpaths = {⟨D11⟩, ⟨D22⟩, ⟨D31⟩}. Since i = 3 ≮ m = 3 (line 5), we do not call a fourth iteration of Par_All_Gen (line 6). Instead, we fork a call to All_Gen (line 8) with parameters relation, i = 1, m = 3, D = Dpaths, and Dnodes = {⟨Shape⟩, ⟨Size⟩, ⟨Colour⟩}. Since k = Path_Count(D3) = 1 (line 3), the second invocation of the third iteration is complete.

Iteration 2 - Invocation 1 - Loop 2 (continued). The call to the third iteration of Par_All_Gen (line 6) is complete. Since k = Path_Count(D2) = 2 (line 3), the first invocation of the second iteration is complete.

The important aspects of the parallel algorithm have now been clearly demonstrated. The remainder of this example is left as an exercise for the reader.

4.5. A Comparison with Commercial OLAP Systems

Currently available OLAP tools, such as BrioQuery from Brio Technologies, Crystal Reports from Seagate Software, and PowerPlay from Cognos, have the capability to perform generalization at consolidation time. Here we compare our multi-attribute generalization technique to that of PowerPlay (we believe PowerPlay to be representative of commercial OLAP tools). The generalization technique utilized in PowerPlay is similar to that described in (Cai et al., 1991, Carter and Hamilton, 1995c, Carter and Hamilton, 1998, Han et al., 1993). In PowerPlay, a dimension is analogous to a single-path DGG. A collection of dimensions is referred to as a dimension map and describes the universe (i.e., generalization state space) for a particular set of attributes. A sample dimension map is shown in Table 9, where the Dim 1, Dim 2, and Dim 3 columns correspond to the ⟨Shape, ANY⟩, ⟨Size, Package, ANY⟩, and ⟨Colour, ANY⟩ paths of the DGGs shown in Figure 5. One dimension is associated with each attribute. A discovery task is referred to as an application, and is limited to an exploration of the universe defined by the dimension map. Multiple dimensions can be associated with an attribute (similar to a multi-path DGG associated with an attribute), but each dimension associated with the attribute requires that an additional column be added to the application. Alternatively, separate applications can be defined, each containing an alternative dimension for the attribute. Both methods require the user to manage the relationships between dimensions when defining the applications and when analyzing the results. Summaries at particular levels of aggregation/generalization must be preselected by the user, and interesting summaries must be located by manually traversing the universe. If an application is not supplying satisfactory results, the user can switch to another application and re-query the database. For any but small applications, the actual size of the universe, the number of possible summaries, and what each may contain is usually not known. Rules of thumb suggest limiting the number of dimensions to no more than five and the number of levels in each dimension to no more than seven. Finally, there are no parallel processing capabilities.

Table 9. A sample dimension map for the Shape, Size, and Colour attributes

Level   Dim 1            Dim 2           Dim 3
1       All Shapes (1)   All Sizes (1)   All Colours (1)
2       Shape (2)        Package (2)     Colour (2)
3       -                Size (4)        -

In our approach, we allow multiple DGGs (i.e., dimensions) to be associated with an attribute. We then explicitly generate all possible summaries in the generalization state space, and rank the interestingness of each summary before presenting the results to the user. Interesting summaries can then be used as a starting point for further exploration and analysis. If a group of related summaries (i.e., associated with a particular combination of DGG paths) is not supplying satisfactory results, the user can jump to another group of related summaries (i.e., associated with a different combination of DGG paths) without re-querying the database. A description of the level of generalization/aggregation for each attribute is provided with each summary. We also suggest a rule of thumb where the number of attributes is limited to no more than five, but we do not suggest any limits on the number of levels in the associated DGGs because the size of the generalization state space depends only on the number of nodes in the associated DGGs (although both techniques allow more than five attributes/dimensions, interpretation of results becomes more difficult). For large problems, where many attributes have been selected for generalization, or the DGGs associated with the attributes have many nodes, we provide a parallel technique which partitions the problem across multiple processors.

5. Interestingness Measures To identify those summaries from which we may learn the most, two measures are used to rank their interestingness. The first measure, variance, is the most common measure of variability used in statistics. We use variance to compare the distribution defined by the tuples in a summary to that of a uniform distribution of the tuples. The second measure, based upon the relative entropy measure Kullback-Leibler (KL) distance, has been suggested as an appropriate measure for comparing data distributions in unstructured textual databases (Feldman and Dagan, 1995). Here we use the KL-distance to compare the distribution defined by the structured tuples in a summary to that of a uniform, historical, or unknown distribution of the tuples. In Sections 5.1 and 5.2, we introduce the variance and KL-distance measures, respectively.

5.1. Variance

Given a summary R = {⟨t1, c1⟩, ⟨t2, c2⟩, ..., ⟨tn, cn⟩}, where each ⟨ti, ci⟩ is a tuple, n is the number of tuples in the summary, each ti is some unique combination of generalized attributes, and each ci represents the number of tuples aggregated from the unconditioned data in the original database, we can measure how much the distribution of the ti's in R varies from that of a uniform distribution. The probability of each ti occurring in R is given by:

p(t_i) = c_i / (c_1 + c_2 + ... + c_n)

And the probability of each ti occurring in a summary where the tuples are uniformly distributed is given by:

q(t_i) = ((c_1 + c_2 + ... + c_n)/n) / (c_1 + c_2 + ... + c_n) = 1/n

The variance of the tuples in R from a uniform distribution is defined as:

s^2 = Σ_{i=1}^{n} (p(t_i) − q(t_i))^2 / (n − 1)

The larger the variance, the less similar are the distributions of p and q. Note that we use n − 1 in our calculation for variance because we assume the database does not contain all possible combinations of attributes, meaning we are observing only a sample of the possible tuples.

Example: Consider the summary R = {⟨t1, 20⟩, ⟨t2, 10⟩, ⟨t3, 40⟩, ⟨t4, 30⟩, ⟨t5, 25⟩}. From the actual distribution of the ti's in R, we have p(t1) = 0.16, p(t2) = 0.08, p(t3) = 0.32, p(t4) = 0.24, p(t5) = 0.20, and from the uniform distribution we have q(ti) = 0.20, for all i. The variance for the tuples in R from a uniform distribution is:

s^2 = ((0.16 − 0.20)^2 + (0.08 − 0.20)^2 + (0.32 − 0.20)^2 + (0.24 − 0.20)^2 + (0.20 − 0.20)^2) / 4
    = (0.0016 + 0.0144 + 0.0144 + 0.0016 + 0.0) / 4
    = 0.032 / 4
    = 0.008
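A one-function Python sketch (ours; the name variance_interest is assumed) reproduces this worked example.

def variance_interest(counts):
    total = sum(counts)
    n = len(counts)
    p = [c / total for c in counts]
    q = 1 / n                                   # uniform probability
    return sum((pi - q) ** 2 for pi in p) / (n - 1)

print(variance_interest([20, 10, 40, 30, 25]))  # 0.008, as in the example above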


5.2. Kullback-Leibler Distance

Given R and p as described in Section 5.1, we can measure the distance from the distribution of the ti's in R to that of some distribution q. The distance from p(ti) to q(ti) measures the amount of information we lose by modelling p by q. This distance, called the relative entropy, or KL-distance, between two probability distributions p and q, is defined as (Kullback and Leibler, 1951):

D(p ‖ q) = Σ_{i=1}^{n} p(t_i) log_2 (p(t_i) / q(t_i))

The larger the KL-distance, the less similar are the distributions of p and q.

Example: KL-distance may be used to measure the distance of the actual distribution from that of a uniform distribution, similar to the variance. Consider again the summary R, the actual distribution p, and the uniform distribution q given in the example of Section 5.1. The KL-distance of the actual distribution p from the uniform distribution q is given by:

D(p ‖ q) = 0.16 log_2 (0.16/0.20) + 0.08 log_2 (0.08/0.20) + 0.32 log_2 (0.32/0.20) + 0.24 log_2 (0.24/0.20) + 0.20 log_2 (0.20/0.20)
         = (−0.052) + (−0.106) + 0.217 + 0.063 + 0.0
         = 0.122 bits

Since log_2 is used in the calculation, a binary system is assumed and the unit of measure for KL-distance is expressed in bits.

Example: KL-distance may also be used to measure the distance of the actual distribution from that of some historical distribution. Consider again the summary R and the actual distribution p given in the example of Section 5.1. Also, consider the historical summary S = {⟨t1, 15⟩, ⟨t2, 15⟩, ⟨t3, 45⟩, ⟨t4, 20⟩, ⟨t5, 30⟩}. From the historical distribution of the ti's in S, we have q(t1) = 0.12, q(t2) = 0.12, q(t3) = 0.36, q(t4) = 0.16, and q(t5) = 0.24. The KL-distance of the actual distribution p from the historical distribution q is given by:

D(p ‖ q) = 0.16 log_2 (0.16/0.12) + 0.08 log_2 (0.08/0.12) + 0.32 log_2 (0.32/0.36) + 0.24 log_2 (0.24/0.16) + 0.20 log_2 (0.20/0.24)
         = 0.066 + (−0.047) + (−0.054) + 0.140 + (−0.053)
         = 0.052 bits
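The same two worked examples can be checked with a short Python sketch (ours; the name kl_distance is assumed). The printed values differ slightly from the text only because the text sums individually rounded terms.

import math

def kl_distance(p, q):
    # Relative entropy D(p || q) in bits.
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.16, 0.08, 0.32, 0.24, 0.20]
print(kl_distance(p, [0.20] * 5))                      # ~0.123 bits (uniform q)
print(kl_distance(p, [0.12, 0.12, 0.36, 0.16, 0.24]))  # ~0.053 bits (historical q)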


Example: KL-distance may also be used to measure the distance of the actual distribution from that of an unknown distribution. Since we have no information regarding the unknown distribution, we assume it will have the distribution q given for summary S in the previous example of this section. We then use a Bayesian approach to modify our beliefs regarding the unknown distribution q, known as the prior distribution, before measuring the KL-distance. We modify our beliefs based upon accumulated evidence, such as the actual distribution p given for the summary R in the example of Section 5.1. On the basis of this summary, we modify the prior distribution to obtain a new one called the posterior distribution. Bayes' Theorem tells us how to obtain the posterior distribution, and is given by:

P(A_i | X) = P(A_i | H) P(X | A_i) / Σ_{j=1}^{n} P(A_j | H) P(X | A_j)

where the Ai's are a set of mutually exclusive and exhaustive alternatives, P(Ai | H) is the prior probability of Ai, X is the actual summary with distribution p, and P(X | Ai) is the probability associated with Ai in the actual summary X. P(Ai | X) is the posterior probability of Ai. Now the posterior probabilities for S are given by:

P(t1 | S) = ((0.12)(0.16)) / ((0.12)(0.16) + (0.12)(0.08) + (0.36)(0.32) + (0.16)(0.24) + (0.24)(0.20)) = 0.0192/0.2304 = 0.0833
P(t2 | S) = 0.0096/0.2304 = 0.0417
P(t3 | S) = 0.1152/0.2304 = 0.5
P(t4 | S) = 0.0384/0.2304 = 0.1667
P(t5 | S) = 0.048/0.2304 = 0.2083

So, from the posterior probabilities of the ti's in S, we have q(t1) = 0.0833, q(t2) = 0.0417, q(t3) = 0.5, q(t4) = 0.1667, and q(t5) = 0.2083. If we now let p be the prior distribution, we can determine the KL-distance of the prior distribution p from the posterior distribution q in the same way as that shown in the previous example of this section.
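The Bayesian update described above is easy to sketch in Python (ours; the name posterior is assumed): the prior beliefs about the unknown distribution are re-weighted by the observed distribution p and renormalized, giving the posterior that is then used as q in the KL-distance.

def posterior(prior, likelihood):
    joint = [pr * li for pr, li in zip(prior, likelihood)]
    total = sum(joint)
    return [j / total for j in joint]

prior = [0.12, 0.12, 0.36, 0.16, 0.24]   # distribution of the historical summary S
p = [0.16, 0.08, 0.32, 0.24, 0.20]       # actual distribution of summary R
print([round(x, 4) for x in posterior(prior, p)])
# [0.0833, 0.0417, 0.5, 0.1667, 0.2083], matching the values above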

6. An Extended Example In this section, we present an extended example based upon the serial version of the multi-attribute generalization algorithm. Consider a cable television operator in a large metropolitan area with a division that offers a pay-per-view movie service. At any given time, there are typically eight to ten movies available for viewing. The available movies and previews of these movies run continuously on specific channels, 24 hours per day, seven days per week. If a subscriber wants to view a movie, it is available on a pay-per-view basis only. That is, to view a movie, the subscriber can either select the channel on which the desired movie is showing, and press a special "buy" button on the remote control in the home (in which case the movie is available immediately), or a special telephone number can be called that enables the desired movie to be selected via an interactive voice response system (in which case the movie is available at the selected showing).

The pay-per-view movie division would like to develop a more detailed understanding of the viewing patterns of their subscribers. For example, although we all have an intuitive feeling that most people usually rent movies in the evening, not all people do. In the large metropolitan area served by the pay-per-view movie division, there are typically thousands of subscribers watching pay-per-view movies at any hour throughout the day and night. By understanding the viewing patterns of the subscribers, the pay-per-view movie division will attempt to increase sales by making available appropriate movies, with appropriate ratings, at appropriate times during the day. This understanding will also contribute to the planning efforts required to determine when current movies should be dropped and new ones added. The objective is to develop a profile of the viewing patterns which maximizes profits for the division.

We now demonstrate the serial multi-attribute generalization algorithm and DGGs using the 50-tuple table, shown in Table A.1 of the Appendix, which represents a subset of the pay-per-view movie rental transactions for the month of August, 1995. Tuples consist of four attributes: Movie, Date, Day, and Time, with 4, 31, 7, and 15 possible distinct values, respectively. The multi-path DGGs associated with the attributes, shown in Figure 8, contain 3, 6, 5, and 8 nodes, respectively, and consist of 1, 3, 2, and 3 single-path DGGs, respectively.

[Figure 8. DGGs for the Movie, Date, Day, and Time attributes: the Movie DGG contains the nodes Ad, A1, and Ag; the Date DGG contains Bd, B1, B2, B3, B4, and Bg; the Day DGG contains Cd, C1, C2, C3, and Cg; and the Time DGG contains Dd, D1 through D6, and Dg.]

The domain values for each node in Figure 8 are mapped conceptually as shown in Table A.2 of the Appendix. For example, the domain values for Cd can be mapped to either C1 or C2. When mapped to C1, Monday and Tuesday are mapped to early weekday; Wednesday and Thursday are mapped to late weekday; and Friday, Saturday, and Sunday are unchanged. When mapped to C2, Monday to Friday are mapped to workday; and Saturday and Sunday are mapped to weekend. Similarly, the domain values for C2 and C3 can be mapped to Cg . Both workday and weekend in the domain of C2 , and both weekday and weekend in the domain of C3, are mapped to ANY.
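Written out as concept maps in the style of the earlier sketches, the two alternative generalizations of the Day domain described above (nodes C1 and C2) look as follows; this is our illustration, not Table A.2 itself.

c1 = {"Monday": "early weekday", "Tuesday": "early weekday",
      "Wednesday": "late weekday", "Thursday": "late weekday",
      "Friday": "Friday", "Saturday": "Saturday", "Sunday": "Sunday"}

c2 = {"Monday": "workday", "Tuesday": "workday", "Wednesday": "workday",
      "Thursday": "workday", "Friday": "workday",
      "Saturday": "weekend", "Sunday": "weekend"}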


The generalization state space generated by traversing these DGGs consists of 720 nodes (i.e., 3 × 6 × 5 × 8). Of these, we considered only those where every attribute was generalized at least once (i.e., k > 2 in All_Gen), yielding a generalization state space containing 280 nodes (i.e., 2 × 5 × 4 × 7). Ignoring the most general case where all attributes are generalized to ANY, 279 possible summaries were generated and ranked for interestingness by each discovery task. For simplicity, we emphasize those summaries containing one and two attributes only, since these are more compact and less complex, yet representative of the operation of the algorithm.

The subset of transactions in the pay-per-view database were selected to facilitate human verification of the rankings for the summaries generated, and to validate the serial and parallel algorithms. Tuples were selected with the planned biases shown in Table 10, where Bias ID is the unique identifier assigned to each planned bias, Attribute is the attribute upon which the planned bias is based, Domain is the pool of possible values for each attribute, No. of Rentals is the number of tuples containing the corresponding domain value, and Analysis describes the planned bias.

Table 10. Biases in movie rental database

Bias ID   Attribute   Domain        No. of Rentals   Analysis
1, 2      Day         weekend       35               Most rentals on weekends
                      weekday       15
3         Day         Sunday        6                Most rentals on Fridays
                      Monday        1
                      Tuesday       3
                      Wednesday     5
                      Thursday      6
                      Friday        18
                      Saturday      11
4         Date        early month   25               Most rentals near start of run
                      mid month     19
                      late month    6
5         Movie       adult         24               No classification dominates
                      general       26
6         Time        A.M.          11               Most rentals after noon
                      P.M.          39
7         Movie       A             11               No title dominates rentals
                      B             13
                      C             13
                      D             13
8         Time        day time      29               Most rentals not in prime time
                      prime time    21

Biases 1 to 5 were consciously planned, while biases 6 to 8 were discovered afterward and considered, in retrospect, to be subconsciously planned biases. A good discovery technique should be able to find obvious relationships such as the planned biases described here.


During a discovery task, a set of 279 summaries is generated from the database. The number of summaries containing from one to four non-ANY attributes is shown in Table 11.

Table 11. Number of summaries classified by number of attributes

No. of Attributes   No. of Summaries
1                   14
2                   67
3                   126
4                   72
Total               279

The 14 unique single-attribute summaries are shown in Table 12, where the Rank column describes the relative degree of interest of the corresponding summary, the Attribute column describes the name of the remaining non-ANY attribute, the Generality column describes the level of generalization, the Interest column describes the calculated interest based upon the variance measure, and the Bias ID column describes the planned bias, if applicable, of the corresponding summary.

Table 12. Single attribute summaries

Rank   Attribute   Generality              Interest   No. of Tuples   Bias ID
1      Time        ⟨ANY, ANY, ANY, D4⟩     0.07840    2               B6
2      Day         ⟨ANY, ANY, C3, ANY⟩     0.04000    2               B2
3      Date        ⟨ANY, B2, ANY, ANY⟩     0.02560    2
3      Day         ⟨ANY, ANY, C2, ANY⟩     0.02560    2
4      Date        ⟨ANY, B3, ANY, ANY⟩     0.02516    3               B4
5      Date        ⟨ANY, B4, ANY, ANY⟩     0.01392    5
6      Time        ⟨ANY, ANY, ANY, D2⟩     0.01310    4
7      Day         ⟨ANY, ANY, C1, ANY⟩     0.00944    5
8      Time        ⟨ANY, ANY, ANY, D1⟩     0.00702    3
9      Time        ⟨ANY, ANY, ANY, D3⟩     0.00649    6
10     Time        ⟨ANY, ANY, ANY, D5⟩     0.00640    2               B8
10     Time        ⟨ANY, ANY, ANY, D6⟩     0.00640    2               B8
11     Date        ⟨ANY, B1, ANY, ANY⟩     0.00525    9
12     Movie       ⟨A1, ANY, ANY, ANY⟩     0.00040    2               B5

All planned biases were identified by the discovery task and are shown in Tables 13 through 18. The domains of planned biases 3 and 7 are not generalized, so they are not identified in Table 12, but they do correspond to the ungeneralized summaries at nodes Cd and Ad, respectively. In these summaries, the first column describes the level of generalization, the Count column describes the number of tuples which have been aggregated from the unconditioned data in the original input relation, and the Count (%) column describes the Count as a percentage of all tuples generalized.


Table 13. Rentals by weekend and weekday (planned bias 2)

C4       Count  Count(%)
weekend  35     70
weekday  15     30
Total    50     100

Table 14. Rentals by time of the month (planned bias 4)

B3           Count  Count(%)
early month  25     50
mid month    19     38
late month   6      12
Total        50     100

The least interesting single-attribute summary (i.e., the summary with rank 12), shown in Table 15, corresponds to planned bias 5, where the Date, Day, and Time attributes are generalized to ANY, and the Movie attribute is generalized to node A1. This summary, by the classes adult and general, is the least interesting because the distribution of its tuples is nearly uniform, with 52% general viewing and 48% adult viewing.

Table 15. Rentals by classification (planned bias 5)

A1       Count  Count(%)
general  26     52
adult    24     48
Total    50     100

The most interesting single-attribute summary, shown in Table 16, corresponds to planned bias 6, where the Movie, Date, and Day attributes are generalized to ANY (i.e., Ag, Bg, and Cg, respectively) and the Time attribute is generalized to node D4. This summary, by the classes am and pm, is the most interesting because the distribution of the tuples is furthest from a uniform distribution, with 78% of the rentals occurring in pm and 22% in am.

Table 16. Rentals by am and pm (planned bias 6)

D4     Count  Count(%)
pm     39     78
am     11     22
Total  50     100

There are two single-attribute summaries with rank 10, shown in Tables 17 and 18, which correspond to planned bias 8, where the Movie, Date, and Day attributes are generalized to ANY, and the Time attribute is generalized to either node D5 or node D6. Node D5 is a summary by the classes day time and prime time, and D6 is a summary by the classes traditional and non-traditional. This example shows that although nodes D5 and D6 occur along different paths of the same DGG, and summarize according to different CHs, they generate identical summaries. So, even though the two summaries are based upon different views of the input relation, they have the same degree of interest and arrive at the same conclusion.

Table 17. Rentals by viewing period (planned bias 8)

D5          Count  Count(%)
day time    29     58
prime time  21     42
Total       50     100

Table 18. Rentals by alternate viewing period (planned bias 8)

D6               Count  Count(%)
traditional      29     58
non-traditional  21     42
Total            50     100

There are two identical single-attribute summaries with rank 3, shown in Table 19. In the first, the Movie, Day, and Time attributes are generalized to ANY, and the Date attribute is generalized to node B2. In the other, the Movie, Date, and Time attributes are generalized to ANY, and the Day attribute is generalized to node C3. Although the domain of the Date attribute (i.e., [1..31]) is different from the domain of the Day attribute (i.e., [Sunday..Saturday]), and B2 and C3 occur in different DGGs for different attributes and summarize according to different CHs, the respective DGGs generate identical summaries. Again, even though the two summaries are based upon different views of the input relation, they have the same degree of interest and arrive at the same conclusion.

Table 19. Summaries ⟨ANY, B2, ANY, ANY⟩ and ⟨ANY, ANY, C3, ANY⟩

B2 / C3  Count  Count(%)
workday  33     66
weekend  17     34
Total    50     100

7. A Comparison of KL-Distance and Variance

We now compare and contrast the interestingness measures based upon variance and KL-distance. Again, due to the large number of summaries generated, even for a small database containing 50 four-attribute tuples, we only discuss a subset of the results. From the 279 summaries generated by each of the two discovery tasks, 67 are two-attribute summaries. From this group of 67, we have selected the 17 that we believe to be the most illustrative for further discussion. Descriptions of these summaries are shown in Table 20, where the Row column gives the row number of each summary (for reference purposes), the Generality column describes the level of generalization, and the KL-Distance Rank and Variance Rank columns describe the rank assigned to the corresponding summary when interestingness is calculated using the KL-distance and variance measures, respectively. In this table, the lower the rank, the greater the degree of interestingness (i.e., summaries with rank 1 and 67 correspond to the most interesting and least interesting, respectively).

Table 20. Ranks assigned to summaries by KL-distance and variance measures

Row  Generality            KL-Distance Rank  Variance Rank
1    ⟨ANY, B1, ANY, D1⟩    1                 67
2    ⟨ANY, B3, ANY, D1⟩    13                57
3    ⟨A1, ANY, ANY, D1⟩    16                43
4    ⟨ANY, ANY, C1, D4⟩    18                41
5    ⟨ANY, B3, ANY, D4⟩    19                15
6    ⟨ANY, ANY, C3, D2⟩    23                20
7    ⟨ANY, B3, ANY, D3⟩    28                38
8    ⟨ANY, ANY, C1, D3⟩    29                54
9    ⟨A1, ANY, ANY, D2⟩    35                35
10   ⟨ANY, ANY, C3, D3⟩    36                21
11   ⟨ANY, B3, ANY, D6⟩    42                8
12   ⟨ANY, B3, C1, ANY⟩    43                32
13   ⟨ANY, B2, C3, ANY⟩    45                62
14   ⟨ANY, ANY, C3, D6⟩    49                2
15   ⟨ANY, B3, C3, ANY⟩    51                14
16   ⟨A1, ANY, ANY, D6⟩    53                5
17   ⟨A1, ANY, C3, ANY⟩    61                13
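A corresponding sketch for the KL-distance (relative entropy) measure is given below. Treating the uniform distribution over a summary's tuples as the reference distribution is our assumption; the exact reference distribution used to produce Table 20 is not restated in this section, so the absolute values here are illustrative only.

```python
import math

def kl_distance(counts):
    """KL-distance (relative entropy) of the observed tuple proportions
    from the uniform distribution over the summary's n tuples."""
    n = len(counts)
    total = sum(counts)
    uniform = 1.0 / n
    return sum((c / total) * math.log((c / total) / uniform)
               for c in counts if c > 0)

# The 29/21 day time vs. prime time split (Tables 17 and 18):
print(kl_distance([29, 21]))   # ~0.0129 nats
```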

The variance measure tends to rank summaries with a small number of tuples, but with a large variance from the uniform distribution, as most interesting. For example, ⟨ANY, ANY, C3, D6⟩ (in row 14), ⟨A1, ANY, ANY, D6⟩ (in row 16), and ⟨ANY, B3, ANY, D6⟩ (in row 11) of Table 20, shown in Tables 21, 22, and 23, respectively, were ranked as the second, fifth, and eighth most interesting summaries, respectively. These summaries are similar in the number of tuples they contain and in the distribution of the counts for each tuple. Other summaries with a similar ranking (not shown) also have a similar number of tuples and distribution of counts. In contrast, the KL-distance measure ranked these summaries 49th, 53rd, and 42nd, respectively. The order in which the KL-distance measure ranked these summaries does not resemble the variance ranking, and other summaries with a similar KL-distance ranking (not shown) did not have a similar number of tuples and distribution of counts. For example, ⟨ANY, B3, C1, ANY⟩, containing 14 tuples and shown in Table 24, was ranked 43rd. Clearly, ⟨ANY, B3, ANY, D6⟩ is not similar to this summary in terms of the number of tuples and distribution of counts.

Table 21. Summary ⟨ANY, ANY, C3, D6⟩

D6     C3       Count  Count(%)
pm     weekend  28     56
pm     weekday  11     22
am     weekend  7      14
am     weekday  4      8
Total           50     100

Table 22. Summary ⟨A1, ANY, ANY, D6⟩

A1       D6     Count  Count(%)
adult    pm     22     44
general  pm     17     34
general  am     9      18
adult    am     2      4
Total           50     100

Table 23. Summary ⟨ANY, B3, ANY, D6⟩

D6     B3           Count  Count(%)
pm     early month  21     42
pm     mid month    14     28
am     mid month    5      10
am     early month  4      8
pm     late month   4      8
am     late month   2      4
Total               50     100


Table 24. Summary ⟨ANY, B3, C1, ANY⟩

C1             B3           Count  Count(%)
Friday         mid month    9      18
late weekday   early month  7      14
Friday         early month  7      14
Saturday       early month  5      10
Saturday       mid month    5      10
Sunday         early month  3      6
early weekday  early month  3      6
Sunday         mid month    2      4
Friday         late month   2      4
late weekday   mid month    2      4
late weekday   late month   2      4
early weekday  mid month    1      2
Saturday       late month   1      2
Sunday         late month   1      2
Total                       50     100

⟨ANY, ANY, C3, D6⟩, ⟨A1, ANY, ANY, D6⟩, and ⟨ANY, B3, ANY, D6⟩ also emphasize how differently the two interestingness measures rank the summaries. For example, these summaries were considered interesting by the variance measure (indicated by the low ranking) and not interesting by the KL-distance measure (indicated by the high ranking). Close examination of Table 20 shows that there is almost an inverse relationship between the two measures, the most obvious example being ⟨ANY, B1, ANY, D1⟩ (in row 1), which is ranked as least interesting by the variance measure and as most interesting by the KL-distance measure. In general, the KL-distance measure tended to rank summaries with a large number of tuples, regardless of their distribution, as more interesting. The average number of tuples for the ten most interesting and ten least interesting summaries, for both interestingness measures, are shown in Tables 25 and 26, respectively. Table 25 shows that the most interesting single-attribute summaries generated using the variance measure have approximately 30% fewer tuples than those generated using the KL-distance measure. For four-attribute summaries, the variance measure generates approximately 70% fewer tuples. Similarly, Table 26 shows that the least interesting single-attribute summaries generated using the variance measure have approximately 40% more tuples than those generated using the KL-distance measure. For four-attribute summaries, the variance measure generates approximately twice as many tuples. This comparison clearly illustrates that the variance measure quantifies the simpler and less complex summaries (i.e., those with few attributes and/or tuples) as more interesting.


Table 25. Average number of tuples in most interesting summaries

No. of Attributes  KL-Distance  Variance
1                  4.1          2.9
2                  16.7         4.2
3                  31.5         6.5
4                  39.9         12.1
Average            23.0         6.4

Table 26. Average number of tuples in least interesting summaries

No. of Attributes  KL-Distance  Variance
1                  2.9          4.1
2                  5.0          19.0
3                  9.7          35.9
4                  20.9         42.4
Average            9.6          25.4

8. Pruning Methods

To reduce the number of summaries generated during a discovery task, it is possible to prune the generalization state space in some circumstances. In this section, we present five pruning heuristics.

If a summary is a direct descendant of some other summary, but has higher interest, then the ancestor can be eliminated from further consideration. For example, consider the relationship between the two-attribute summaries ⟨ANY, ANY, C1, D2⟩ and ⟨ANY, ANY, C1, D5⟩, shown in Tables 27 and 28, respectively. These summaries are generated by following a common path through the multi-path DGG for the Time attribute. That is, the values for the Time attribute in ⟨ANY, ANY, C1, D5⟩ are a generalization of the values for the Time attribute in ⟨ANY, ANY, C1, D2⟩. Now ⟨ANY, ANY, C1, D2⟩ and ⟨ANY, ANY, C1, D5⟩ have interest of 0.00877 and 0.01170, respectively. Since ⟨ANY, ANY, C1, D5⟩ has higher interest than ⟨ANY, ANY, C1, D2⟩, and is a direct descendant, ⟨ANY, ANY, C1, D2⟩ can be deleted. More generally, for this two-attribute case, ⟨ANY, ANY, C1, D2⟩ can be deleted if any of ⟨ANY, ANY, C1, D5⟩, ⟨ANY, ANY, C3, D2⟩, or ⟨ANY, ANY, C3, D5⟩ has higher interest.

Table 27. Summary ⟨ANY, ANY, C1, D2⟩

D2               C1       Count  Count(%)
evening          weekend  15     30
late afternoon   weekend  12     24
morning          weekend  7      14
evening          weekday  6      12
early afternoon  weekday  4      8
morning          weekday  4      8
early afternoon  weekend  1      2
late afternoon   weekday  1      2

Table 28. Summary ⟨ANY, ANY, C1, D5⟩

D5               C1       Count  Count(%)
traditional      weekend  20     40
non-traditional  weekend  15     30
traditional      weekday  9      18
non-traditional  weekday  6      12

Similar pruning can also be done in the situation where a summary is a direct descendant of a summary that has higher interest. For example, consider the relationship between the two-attribute summaries ⟨ANY, B1, C1, ANY⟩ and ⟨ANY, B1, C3, ANY⟩, shown in Tables 29 and 30, respectively. These summaries are generated by following a common path through the DGG for the Day attribute. That is, the values for the Day attribute in summary ⟨ANY, B1, C3, ANY⟩ are a generalization of the values for the Day attribute in summary ⟨ANY, B1, C1, ANY⟩. Now summaries ⟨ANY, B1, C1, ANY⟩ and ⟨ANY, B1, C3, ANY⟩ have interest of 0.00944 and 0.00062, respectively. Since summary ⟨ANY, B1, C3, ANY⟩ has lower interest than summary ⟨ANY, B1, C1, ANY⟩, and is a direct descendant, ⟨ANY, B1, C3, ANY⟩ can be deleted.

Table 29. Summary ⟨ANY, B1, C1, ANY⟩

C1             B1       Count  Count(%)
Friday         workday  18     36
Saturday       weekend  11     22
late weekday   workday  11     22
Sunday         weekend  6      12
early weekday  workday  4      8

Table 30. Summary ⟨ANY, B1, C3, ANY⟩

C3       B1       Count  Count(%)
weekend  workday  18     36
weekend  weekend  17     34
weekday  workday  15     30


Pruning can also be done based upon the table threshold, attribute threshold, and interest threshold. When pruning using the table threshold, all summaries containing more tuples than the table threshold are deleted, regardless of their degree of interest. For example, the discovery task generated 14 single-attribute summaries, as shown in Table 12. If we set the table threshold to three, then the number of single-attribute summaries remaining is reduced to nine. When pruning using the attribute threshold, all summaries containing an attribute whose number of distinct values is greater than the attribute threshold are deleted. For example, if we also set the attribute threshold to three, then the number of single-attribute summaries remaining is again reduced to nine. And finally, when pruning using the interest threshold, all summaries whose degree of interest is less than the interest threshold are deleted. For example, if we set the interest threshold to 0.01, then the number of single-attribute summaries remaining is reduced to seven.
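As an illustration of the three threshold-based heuristics, the sketch below filters a list of summaries. The summary representation (a dict holding the tuple rows, the number of distinct values per non-ANY attribute, and an interest value) and the threshold values are hypothetical, chosen only for illustration; the ancestor and descendant heuristics would additionally require the parent and child links of the generalization state space.

```python
def prune(summaries, table_threshold, attribute_threshold, interest_threshold):
    """Drop summaries that exceed the table or attribute thresholds,
    or whose interest falls below the interest threshold."""
    kept = []
    for s in summaries:
        if len(s["tuples"]) > table_threshold:
            continue                                      # table threshold
        if any(v > attribute_threshold for v in s["attr_values"]):
            continue                                      # attribute threshold
        if s["interest"] < interest_threshold:
            continue                                      # interest threshold
        kept.append(s)
    return kept

# Hypothetical two-summary example: only the first survives all three tests.
example = [
    {"tuples": [("pm", 39), ("am", 11)], "attr_values": [2], "interest": 0.0784},
    {"tuples": [(d, 1) for d in range(9)], "attr_values": [9], "interest": 0.00525},
]
print(len(prune(example, table_threshold=3, attribute_threshold=3,
                interest_threshold=0.01)))  # 1
```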

9. Experimental Results

We ran all of our experiments on a 64-node Alex AVX Series 2, a MIMD distributed-memory parallel computer. Each inside-the-box compute node consists of a T805 processor, with 8 MB of local memory, paired with an i860 processor, with 32 MB of shared memory (the pair communicates through the shared memory). Each i860 processor runs at 40 MHz, and each T805 processor runs at 20 MHz with a bandwidth of 20 Mbits/second of bi-directional data throughput on each of its four links. The compute nodes run version 2.2.3 of the Alex-Trollius operating system. The front-end host computer system is a Sun Sparc 20 with 32 MB of memory, running version 2.4 of the Solaris operating system.

DGG-Discover functions as three types of communicating modules: a slave program runs on an inside-the-box compute node and executes the discovery tasks that it is assigned, the master program assigns discovery tasks to the slave programs, and the bridge program coordinates access between the slave programs and the database.

Input data was from a large database supplied by a commercial partner in the telecommunications industry. This database has been used frequently in previous data mining research (Carter et al., 1997, Hilderman et al., 1998a, Hilderman et al., 1997a, Hilderman et al., 1998b). It consists of over 8,000,000 tuples in 27 tables describing a total of 56 attributes. The largest table contains over 3,300,000 tuples representing the account activity for over 500,000 customer accounts and over 2,200 products and services. Our queries read approximately 545,000 tuples from three tables, resulting in an initial input relation for the discovery tasks containing up to 26,950 tuples and five attributes. Our experience in applying data mining techniques to the databases of our commercial partners has shown that domain experts typically perform discovery tasks on a few attributes that have been determined to be relevant. Consequently, we present the results for experiments where two to five attributes are selected for generalization, and the DGGs associated with the selected attributes contained one to five unique paths.
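A minimal sketch of how the master might partition a discovery task into sub-tasks, one per unique combination of DGG paths (a notion discussed later in this section), and farm them out to slave processes. Python's multiprocessing pool stands in for the message passing used on the Alex machine, and the function and path names are hypothetical.

```python
from itertools import product
from multiprocessing import Pool

def generalize_along(path_combination):
    """Hypothetical worker: traverse the summaries reachable along one
    combination of DGG paths (one path per attribute). Here it simply
    returns the combination as a placeholder."""
    return path_combination

def run_discovery_task(paths_per_attribute, num_procs):
    """Partition a discovery task into one sub-task per unique path
    combination and distribute the sub-tasks across num_procs workers."""
    sub_tasks = list(product(*paths_per_attribute))
    with Pool(processes=min(num_procs, len(sub_tasks))) as pool:
        return pool.map(generalize_along, sub_tasks)

if __name__ == "__main__":
    # Attributes A and B have 5 and 4 DGG paths, giving 20 sub-tasks,
    # which is why the two-attribute task uses at most 20 processors.
    paths = [[f"A_path{i}" for i in range(5)],
             [f"B_path{i}" for i in range(4)]]
    print(len(run_discovery_task(paths, 32)))  # 20
```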


The characteristics of the DGGs associated with each attribute are shown in Table 31, where the No. of Paths column describes the number of unique paths, the No. of Nodes column describes the number of nodes, and the Avg. Path Length column describes the average path length.

Table 31. Characteristics of the DGGs associated with the selected attributes

Attribute  No. of Paths  No. of Nodes  Avg. Path Length
A          5             20            4.0
B          4             17            4.3
C          3             12            4.0
D          4             17            4.3
E          2             8             4.0
F          1             3             3.0
G          5             21            4.2
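The two sizing rules used later in this section (the number of summaries is the product of the node counts of the selected DGGs, and the number of sub-tasks is the product of their path counts) can be checked directly against Table 31; a small sketch:

```python
from math import prod

# (no. of paths, no. of nodes) for each attribute's DGG, from Table 31.
dggs = {"A": (5, 20), "B": (4, 17), "C": (3, 12), "D": (4, 17),
        "E": (2, 8), "F": (1, 3), "G": (5, 21)}

def task_size(attributes):
    """Return (no. of summaries, no. of sub-tasks) for a discovery task:
    the products of the node counts and of the path counts."""
    nodes = prod(dggs[a][1] for a in attributes)
    paths = prod(dggs[a][0] for a in attributes)
    return nodes, paths

print(task_size("AB"))     # (340, 20)
print(task_size("BCD"))    # (3468, 48)
print(task_size("BCDE"))   # (27744, 96)
print(task_size("CDEFG"))  # (102816, 120)
```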

From these experiments, we draw two main conclusions. First, as the complexity of the DGGs associated with a discovery task increases (either by adding more paths to a DGG, more nodes to a path, or more attributes to a discovery task), the complexity and traversal time of the generalization state space also increase. This was expected based upon the complexity analysis given in Section 4.2. And second, as the number of processors used in a discovery task increases, the time required to traverse the generalization state space decreases, resulting in significant speedups for discovery tasks run on multiple processors. These results are shown in the graphs of Figures 9 to 12, where the number of processors is plotted against execution time in seconds, and in Table 32, where the speedup and efficiency results are described. In each of the four experiments discussed here, we varied the number of processors assigned to the discovery tasks. A maximum of 32 processors were available.

Figure 9. Relative performance generalizing two attributes (execution time in seconds vs. number of processors).

Figure 10. Relative performance generalizing three attributes (execution time in seconds vs. number of processors).

Figure 11. Relative performance generalizing four attributes (execution time in seconds vs. number of processors).

Figure 12. Relative performance generalizing five attributes (execution time in seconds vs. number of processors).

The graphs show that as the complexity of the generalization state space increases, the time required to traverse the generalization state space also increases. For example, the two-, three-, four-, and five-attribute discovery tasks in Figures 9, 10, 11, and 12, respectively, required serial times of 36, 402, 3,732, and 25,787 seconds, respectively, to generate 340, 3,468, 27,744, and 102,816 summaries, respectively, on a single processor. A similar result was obtained when multiple processors were allocated to each discovery task. The number of summaries to be generated by a discovery task (i.e., the size of the generalization state space) is determined by multiplying the values in the No. of Nodes column of Table 31. For example, when attributes B, C, D, and E were selected for the four-attribute discovery task, 27,744 (i.e., 17 × 12 × 17 × 8) summaries were generated.

The graphs also show that as the number of processors assigned to a discovery task is increased, the time required to traverse the generalization state space decreases. Each discovery task can be divided into smaller discovery tasks (i.e., sub-tasks), each of which can be run independently on a separate processor. For example, the two-, three-, four-, and five-attribute discovery tasks that required 36, 402, 3,732, and 25,787 seconds, respectively, on a single processor, required only 3, 21, 167, and 1,245 seconds, respectively, on 32 processors. The two-attribute discovery task was partitioned across 20 of the 32 available processors, as there were only 20 possible sub-tasks (i.e., unique path combinations). The number of sub-tasks to be generated by a discovery task is determined by multiplying the values in the No. of Paths column of Table 31. For example, when attributes B, C, D, and E were selected for the four-attribute discovery task, the discovery task could be partitioned into 96 (i.e., 4 × 3 × 4 × 2) sub-tasks.

Speedups for the discovery tasks run on multiple processors are shown in Table 32, where the No. Nodes column describes the number of nodes in the generalization


state space, the No. Sub-Tasks column describes the number of unique path combinations that can be obtained from the set of DGGs associated with the attributes, the No. Procs. column describes the number of processors used, the Ser. Time column describes the time required to run the discovery task on a single processor, the Par. Time column describes the time required to run the discovery task on the corresponding number of processors, the Speedup column describes the serial time divided by the parallel time, and the Eff. column describes the speedup divided by the number of processors (i.e., the efficiency). Significant speedups were obtained when a discovery task was run on multiple processors. For example, the maximum speedups for the two-, three-, four-, and five-attribute discovery tasks, which were


obtained when the discovery tasks were run on 32 processors, were 12.0, 19.1, 22.3, and 20.7, respectively.

Table 32. Speedup and efficiency results obtained using the parallel algorithm

Exp.  Atts. Gen.  No. Nodes  No. Sub-Tasks  No. Procs.  Ser. Time  Par. Time  Speedup  Eff.
1     A,B         340        20             2           36         18         2.0      1.00
                                            4                      9          4.0      1.00
                                            8                      6          6.0      0.75
                                            16                     4          9.0      0.56
                                            20                     3          12.0     0.60
2     B,C,D       3468       48             2           402        199        2.0      1.00
                                            4                      104        3.9      0.98
                                            8                      56         7.2      0.90
                                            16                     32         12.6     0.79
                                            32                     21         19.1     0.60
3     B,C,D,E     27744      96             2           3732       1985       1.9      0.95
                                            4                      1017       3.7      0.93
                                            8                      506        7.4      0.93
                                            16                     273        13.7     0.86
                                            32                     167        22.3     0.70
4     C,D,E,F,G   102816     120            2           25787      13939      1.8      0.90
                                            4                      7264       3.5      0.88
                                            8                      3723       6.9      0.86
                                            16                     2080       12.4     0.78
                                            32                     1245       20.7     0.65
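The Speedup and Eff. columns of Table 32 follow directly from the definitions given above; for example:

```python
def speedup_and_efficiency(serial_time, parallel_time, num_procs):
    """Speedup is serial time divided by parallel time;
    efficiency is speedup divided by the number of processors."""
    speedup = serial_time / parallel_time
    return speedup, speedup / num_procs

# Five-attribute task on 32 processors (Table 32, experiment 4):
print(speedup_and_efficiency(25787, 1245, 32))  # (~20.7, ~0.65)
```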

10. Conclusion and Future Research

We presented serial and parallel versions of the Multi-Attribute Generalization algorithm for attribute-oriented generalization of a set of attributes. The algorithm generates all possible combinations of generalizations of concepts from the DGGs associated with a set of attributes. KL-distance and variance were used to rank the interestingness of the summaries generated by the algorithm. The variance measure appears to be a more useful measure than KL-distance because it tends to rank less complex summaries as more interesting. We believe this to be an important property of interestingness measures. The complexity of a summary is dependent upon the number of attributes and the number of unique values for each attribute. For example, generalizing just four attributes, where there are 4, 7, 2, and 10 possible values for the associated concepts at some arbitrary level of generalization, a summary could be generated with up to 560 tuples (i.e., 4 × 7 × 2 × 10). Thus, low complexity summaries, as quantified by the variance


measure, are attractive because they are more concise and, therefore, intuitively easier to understand and analyze.

Both interestingness measures generate too many summaries. We showed how five heuristics could be used to reduce the number of summaries under consideration. First, for a given summary, eliminate all summaries for ancestors that have a lower degree of interest. Second, eliminate all summaries for descendants that have a lower degree of interest. Third, eliminate all summaries where the number of tuples is greater than the table threshold. Fourth, eliminate all summaries containing an attribute where the number of distinct values is greater than the attribute threshold. Finally, eliminate all summaries whose degree of interest is less than the interest threshold. Combinations of these five heuristics can be used to further reduce the number of summaries under consideration. Unfortunately, a threshold is a subjective measure which could, in some cases, result in the loss of interesting summaries, because establishing a threshold requires some prior knowledge about the nature of the database. Thus, given the dynamic nature of databases, a trial-and-error approach would be required to locate a meaningful threshold, and this threshold could change over time. Another approach may be to use a multi-level scheme. Attributes in the database would be ranked a priori, and this ranking would contribute to how the interestingness of a summary is determined. Summaries containing higher ranking attributes would tend to be considered more interesting.

Increasing the complexity of the DGGs associated with a set of attributes, or increasing the number of attributes, increases the complexity of the generalization state space. Our experimental results showed that partitioning path combinations from the DGGs across multiple processors significantly reduces the time required to traverse the generalization state space.

Future research will focus on two areas. First, we need to be able to compare the probability distribution of the tuples in a summary to that of some expected or historical distribution. New interestingness measures will be developed which enable us to do this. To compare with an expected distribution, a Bayesian approach will be developed. To compare with an historical distribution, other non-Bayesian approaches will be developed. For example, new investigation shows that a non-Bayesian approach based upon a modified KL-distance measure may be useful. Other new investigations show that non-Bayesian approaches from mathematical ecology (MacArthur, 1965, Whittaker, 1972, Bray and Curtis, 1957) and economics (Theil, 1967, Schutz, 1951, Atkinson, 1970) may also be useful. Preliminary results using approaches based upon the widely known Simpson and Shannon indexes are described in (Hilderman and Hamilton, 1999) and (Hilderman et al., 1999). Second, we need a technique to visualize and navigate through the generalization state space (Hilderman et al., 1997b). Data visualization can be used in a knowledge discovery system to guide the discovery, to guide the presentation of the results, and to present the results themselves. We will concentrate on the second stage and develop techniques for managing the many possible summaries and for selecting the anomalous (and potentially interesting) summaries among them. We propose a tool with a graphical user interface which ranks and arranges


summaries according to user-defined criteria. This tool will support different views for visualizing and manipulating the summaries generated.

Acknowledgments

We give sincere thanks to the editor and anonymous reviewers, who provided many helpful and detailed comments. The authors are members of the Institute for Robotics and Intelligent Systems (IRIS), and wish to acknowledge the support of the Networks of Centres of Excellence Program of the Government of Canada, the Natural Sciences and Engineering Research Council of Canada, and Canadian Cable Labs. We also acknowledge the support and participation of Paradigm Consulting Group, PRECARN Associates, Rogers Cablevision, and SaskTel.

References

R. Agrawal, T. Imielinski, and A. Swami. Mining association rules between sets of items in large databases. In Proceedings of the ACM SIGMOD International Conference on the Management of Data (SIGMOD'93), pages 207-216, Washington, D.C., May 1993.
R. Agrawal, K. Lin, H.S. Sawhney, and K. Shim. Fast similarity search in the presence of noise, scaling, and translation in time-series databases. In Proceedings of the 21st International Conference on Very Large Databases (VLDB'95), pages 490-501, Zurich, Switzerland, September 1995.
R. Agrawal and R. Srikant. Mining sequential patterns. In Proceedings of the 11th International Conference on Data Engineering (ICDE'95), pages 3-14, 1995.
A.B. Atkinson. On the measurement of inequality. Journal of Economic Theory, 2:244-263, 1970.
D.B. Barber and H.J. Hamilton. Comparison of attribute selection strategies for attribute-oriented generalization. In Lecture Notes in Artificial Intelligence, The 11th International Symposium on Methodologies for Intelligent Systems (ISMIS'97), pages 106-116, Charlotte, North Carolina, 1997.
J.R. Bray and J.T. Curtis. An ordination of the upland forest communities of southern Wisconsin. Ecological Monographs, 27:325-349, 1957.
S. Brin, R. Motwani, and C. Silverstein. Beyond market baskets: Generalizing association rules to correlations. In Proceedings of the ACM SIGMOD International Conference on Management of Data (SIGMOD'97), pages 265-276, May 1997.
S. Brin, R. Motwani, J.D. Ullman, and S. Tsur. Dynamic itemset counting and implication rules for market basket data. In Proceedings of the ACM SIGMOD International Conference on Management of Data (SIGMOD'97), pages 255-264, May 1997.
Y. Cai, N. Cercone, and J. Han. Attribute-oriented induction in relational databases. In G. Piatetsky-Shapiro and W. Frawley, editors, Knowledge Discovery in Databases, pages 213-228, Cambridge, MA, 1991. AAAI/MIT Press.
C.L. Carter and H.J. Hamilton. Fast, incremental generalization and regeneralization for knowledge discovery from databases. In Proceedings of the 8th Florida Artificial Intelligence Symposium, pages 319-323, Melbourne, Florida, April 1995.
C.L. Carter and H.J. Hamilton. A fast, on-line generalization algorithm for knowledge discovery. Applied Mathematics Letters, 8(2):5-11, 1995.
C.L. Carter and H.J. Hamilton. Performance evaluation of attribute-oriented algorithms for knowledge discovery from databases. In Proceedings of the Seventh IEEE International Conference on Tools with Artificial Intelligence (ICTAI'95), pages 486-489, Washington, D.C., November 1995.
C.L. Carter and H.J. Hamilton. Efficient attribute-oriented algorithms for knowledge discovery from large databases. IEEE Transactions on Knowledge and Data Engineering, 10(2):193-208, March/April 1998.
C.L. Carter, H.J. Hamilton, and N. Cercone. Share-based measures for itemsets. In J. Komorowski and J. Zytkow, editors, Proceedings of the First European Conference on the Principles of Data Mining and Knowledge Discovery (PKDD'97), pages 14-24, Trondheim, Norway, June 1997.
W.W. Chu, K. Chiang, C.C. Hsu, and H. Yau. An error-based conceptual clustering method for providing approximate query answers. Communications of the ACM, 39(12):VE, December 1996. http://www.acm.org/pubs/cacm/extension.
R. Feldman and I. Dagan. Knowledge discovery in textual databases (KDT). In Proceedings of the First International Conference on Knowledge Discovery and Data Mining (KDD'95), pages 112-117, Montreal, August 1995.
R. Godin, R. Missaoui, and H. Alaoui. Incremental concept formation algorithms based on Galois (concept) lattices. Computational Intelligence, 11(2):246-267, 1995.
H.J. Hamilton and D.F. Fudger. Measuring the potential for knowledge discovery in databases with DBLearn. Computational Intelligence, 11(2):280-296, 1995.
H.J. Hamilton, R.J. Hilderman, and N. Cercone. Attribute-oriented induction using domain generalization graphs. In Proceedings of the Eighth IEEE International Conference on Tools with Artificial Intelligence (ICTAI'96), pages 246-253, Toulouse, France, November 1996.
J. Han. Towards efficient induction mechanisms in database systems. Theoretical Computer Science, 133:361-385, October 1994.
J. Han, Y. Cai, and N. Cercone. Knowledge discovery in databases: an attribute-oriented approach. In Proceedings of the 18th International Conference on Very Large Data Bases, pages 547-559, Vancouver, August 1992.
J. Han, Y. Cai, and N. Cercone. Data-driven discovery of quantitative rules in relational databases. IEEE Transactions on Knowledge and Data Engineering, 5(1):29-40, February 1993.
J. Han and Y. Fu. Discovery of multiple-level association rules from large databases. In Proceedings of the 1995 International Conference on Very Large Data Bases (VLDB'95), pages 420-431, September 1995.
J. Han, Y. Fu, and S. Tang. Advances of the DBLearn system for knowledge discovery in large databases. In Proceedings of the 1995 International Joint Conference on Artificial Intelligence (IJCAI'95), pages 2049-2050, 1995.
R.J. Hilderman, C.L. Carter, H.J. Hamilton, and N. Cercone. Mining association rules from market basket data using share measures and characterized itemsets. International Journal on Artificial Intelligence Tools, 7(2):189-220, June 1998.
R.J. Hilderman, C.L. Carter, H.J. Hamilton, and N. Cercone. Mining market basket data using share measures and characterized itemsets. In X. Wu, R. Kotagiri, and K. Korb, editors, Proceedings of the Second Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD'98), pages 159-173, Melbourne, Australia, April 1998.
R.J. Hilderman and H.J. Hamilton. Heuristics for ranking the interestingness of discovered knowledge. In N. Zhong, editor, Proceedings of the Third Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD'99), Beijing, China, April 1999.
R.J. Hilderman, H.J. Hamilton, and Brock Barber. Ranking the interestingness of summaries from data mining systems. In Proceedings of the 12th Annual Florida Artificial Intelligence Research Symposium (FLAIRS'99), Orlando, FL, May 1999.
R.J. Hilderman, H.J. Hamilton, R.J. Kowalchuk, and N. Cercone. Parallel knowledge discovery using domain generalization graphs. In J. Komorowski and J. Zytkow, editors, Proceedings of the First European Conference on the Principles of Data Mining and Knowledge Discovery (PKDD'97), pages 25-35, Trondheim, Norway, June 1997.
R.J. Hilderman, L. Li, and H.J. Hamilton. Data visualization in the DB-Discover system. In Proceedings of the Ninth IEEE International Conference on Tools with Artificial Intelligence (ICTAI'97), pages 474-477, Newport Beach, CA, November 1997.
T. Hu and N. Cercone. Object aggregation and cluster identification. Applied Mathematics Letters, 7(4):29-34, 1994.
H.-Y. Hwang and W.-C. Fu. Efficient algorithms for attribute-oriented induction. In Proceedings of the First International Conference on Knowledge Discovery and Data Mining (KDD'95), pages 168-173, Montreal, August 1995.
S. Kullback and R.A. Leibler. On information and sufficiency. Annals of Mathematical Statistics, 22:79-86, 1951.
R.H. MacArthur. Patterns of species diversity. Biological Reviews of the Cambridge Philosophical Society, 40:510-533, 1965.
R.S. Michalski. A theory and methodology of inductive learning. In R.S. Michalski, J.G. Carbonell, and T.M. Mitchell, editors, Machine Learning: An Artificial Intelligence Approach, pages 83-134. Tioga Publishing Company, 1983.
T.M. Mitchell. Version Spaces: An Approach to Concept Learning. PhD thesis, Stanford University, 1978.
W. Pang, R.J. Hilderman, H.J. Hamilton, and S.D. Goodwin. Data mining with concept generalization graphs. In Proceedings of the 9th Annual Florida Artificial Intelligence Research Symposium (FLAIRS'96), pages 390-394, Key West, FL, May 1996.
J.S. Park, M.-S. Chen, and P.S. Yu. An effective hash-based algorithm for mining association rules. In Proceedings of the ACM SIGMOD International Conference on Management of Data (SIGMOD'95), pages 175-186, May 1995.
G. Piatetsky-Shapiro. Discovery, analysis and presentation of strong rules. In Knowledge Discovery in Databases, pages 229-248. AAAI/MIT Press, 1991.
J.R. Quinlan. C4.5 Programs for Machine Learning. Morgan Kaufmann, 1993.
R.R. Schutz. On the measurement of income inequality. American Economic Review, 41:107-122, 1951.
R. Srikant and R. Agrawal. Mining generalized association rules. In Proceedings of the 21st International Conference on Very Large Databases (VLDB'95), pages 407-419, Zurich, Switzerland, September 1995.
R. Srikant and R. Agrawal. Mining sequential patterns: generalization and performance improvements. In Proceedings of the Fifth International Conference on Extending Database Technology (EDBT'96), Avignon, France, March 1996.
G. Stumme, R. Wille, and U. Wille. Conceptual knowledge discovery in databases using formal concept analysis methods. In J. Zytkow and M. Quafafou, editors, Proceedings of the Second European Conference on the Principles of Data Mining and Knowledge Discovery (PKDD'98), pages 450-458, Nantes, France, September 1998.
H. Theil. Economics and Information Theory. Rand McNally, 1967.
H. Toivonen. Sampling large databases for finding association rules. In Proceedings of the 22nd International Conference on Very Large Databases (VLDB'96), pages 134-145, Mumbai, India, September 1996.
R.H. Whittaker. Evolution and measurement of species diversity. Taxon, 21(2/3):213-251, May 1972.
R. Wille. Concept lattices and conceptual knowledge systems. Computers and Mathematics with Applications, 23:493-515, 1992.



Robert J. Hilderman received his B.A. degree in Mathematics and Computer Science from Concordia College at Moorhead, Minnesota in 1980. He worked as a consultant in the software development industry from 1980 to 1992, developing financial and management information systems. In 1994, he received his M.Sc. degree in Computer Science from the University of Regina at Regina, Saskatchewan, and was selected as the Governor-General's Award Nominee from the Department of Computer Science for 1994. Since 1995, he has been a Professional Research Associate and Ph.D. candidate with the Department of Computer Science at the University of Regina. His research interests include knowledge discovery and data mining using statistical and information-theoretic methods, parallel and distributed computing, and software engineering. He has authored research papers and articles in the areas of knowledge discovery and data mining, data visualization, parallel and distributed algorithms, and protocol verification and validation.

Howard J. Hamilton has been in the Department of Computer Science with the University of Regina at Regina, Saskatchewan since 1991. He received his B.Sc. degree with High Honours in Computational Science from the University of Saskatchewan at Saskatoon, Saskatchewan in 1980, followed by his M.Sc. degree in 1983. He spent the next few years in the software industry with Mitel Corporation at Kanata, Ontario. He received his Ph.D. degree in Computing Science from Simon Fraser University at Burnaby, British Columbia in 1992. His research interests include knowledge discovery and data mining, machine learning, distributed systems, and applying artificial intelligence techniques to computer systems.

Nick Cercone is Professor and Chair of Computer Science with the University of Waterloo at Waterloo, Ontario (since 1997), after having completed his term as Associate Vice President (Research), Dean of Graduate Studies and the International Liaison Officer with the University of Regina (1993-1997) at Regina, Saskatchewan. Formerly, he was Director of the Centre for Systems Science (1987-1992) and past chairman of the School of Computing Science (1980-1985) with Simon Fraser University at Burnaby, British Columbia. In 1986 he spent one year visiting the University of Victoria at Victoria, British Columbia to help establish the Laboratory for Communications, Information and Research. He is co-editor of Computational Intelligence, a journal which he started and published (by Blackwell's Publishers), and serves on the editorial board of six journals. He is a member of the ACM, IEEE, AAAI, AISB, AGS, and ACL, and a past president of the CSCSI/SCEIO (Canadian Society for Computational Studies of Intelligence), of the Canadian Society for Fifth Generation Research, and of the Canadian Association for Computer Science (CACS/AIC). He has also served on the Canadian Genome Assessment and Technology Board and the CANARIE Board, and the boards of CanWest, the Institute for Robotics and Intelligent Systems (IRIS) Research Management Committee, Regina Economic Development Authority (REDA) Information Technology, and the Saskatchewan Research Council. He also serves on NSERC committees, U.S. NSF committees, and recently a Canadian Foundation for Innovation (CFI) committee. In 1996 he won the Canadian Artificial Intelligence Society's Distinguished Service Award. His research interests include natural language processing, knowledge-based systems, knowledge discovery in databases, machine learning, and the design of human interfaces. He received his B.S. degree in Engineering Science from the Franciscan University at Steubenville in 1968, his M.S. degree in Computer and Information Science from Ohio State University in 1970, and his Ph.D. degree in Computing Science from the University of Alberta in 1975. He worked for IBM Corporation in 1969 and 1971 on design automation, and has authored over 175 technical papers, including the writing or editing of 5 books. He has been a consultant for Boeing Corporation, Rogers Cablesystems, Environment Canada, SaskTel, Showbase Inc., and Open Text Corporation.


Appendix

Table A.1. The 50-tuple table

Movie  Date  Day        Time
A      1     Tuesday    19:00
D      2     Wednesday  11:30
B      2     Wednesday  13:30
C      3     Thursday   22:30
B      3     Thursday   10:30
D      4     Friday     18:30
A      4     Friday     23:00
C      4     Friday     15:30
D      4     Friday     22:00
B      4     Friday     16:30
A      4     Friday     15:00
D      4     Friday     11:30
B      5     Saturday   16:30
A      5     Saturday   15:00
C      5     Saturday   22:30
D      5     Saturday   11:30
A      5     Saturday   15:00
C      6     Sunday     22:30
B      6     Sunday     16:30
A      6     Sunday     15:00
B      7     Monday     19:30
C      8     Tuesday    12:00
D      9     Wednesday  18:30
C      9     Wednesday  22:30
B      10    Thursday   13:30
C      11    Friday     15:30
A      11    Friday     11:00
D      11    Friday     11:30
D      11    Friday     18:30
C      11    Friday     15:30
A      12    Saturday   11:00
B      12    Saturday   16:30
D      12    Saturday   11:30
C      13    Sunday     15:30
B      15    Tuesday    10:30
D      17    Thursday   15:00
A      17    Thursday   19:00
C      18    Friday     22:30
A      18    Friday     19:00
B      18    Friday     19:30
C      18    Friday     22:30
D      19    Saturday   18:30
B      19    Saturday   13:30
C      20    Sunday     22:30
D      23    Wednesday  11:30
A      25    Friday     18:30
B      25    Friday     19:30
C      26    Saturday   19:00
D      27    Sunday     11:30
B      31    Thursday   13:00
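The planned biases in Table 10 can be checked directly against this table. For instance, counting the Day column reproduces the 18 Friday rentals of planned bias 3 and the 35/15 split of planned bias 2 (taking weekend to mean Friday through Sunday, which is what the 35/15 counts in Table 10 imply):

```python
from collections import Counter

# The Day column of Table A.1, in row order.
days = ("Tuesday Wednesday Wednesday Thursday Thursday " +
        "Friday " * 7 + "Saturday " * 5 + "Sunday " * 3 +
        "Monday Tuesday Wednesday Wednesday Thursday " +
        "Friday " * 5 + "Saturday " * 3 +
        "Sunday Tuesday Thursday Thursday " + "Friday " * 4 +
        "Saturday Saturday Sunday Wednesday Friday Friday "
        "Saturday Sunday Thursday").split()

counts = Counter(days)
weekend = sum(counts[d] for d in ("Friday", "Saturday", "Sunday"))
print(counts["Friday"], weekend, len(days) - weekend)   # 18 35 15
```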


Table A.2. Mapping of domain values for nodes in DGGs

Ag: ANY
A1: adult, general
Ad: A, C, B, D

Bg: ANY
B2: workday, weekend
B1: work week1, work week2, work week3, work week4, work week5, weekend1, weekend2, weekend3, weekend4
Bd: 1-4, 7-11, 14-18, 21-25, 28-31, 5-6, 12-13, 19-20, 26-27

B3: early month, mid month, late month
Bd: 1-10, 11-20, 21-31

B4: actual week1, actual week2, actual week3, actual week4, actual week5
Bd: 1-5, 6-12, 13-19, 20-26, 27-31

Cg: ANY
C2: workday, weekend
Cd: Monday, Tuesday, Wednesday, Thursday, Friday, Saturday, Sunday

C3: weekday, weekend
C1: early weekday, late weekday, Friday, Saturday, Sunday
Cd: Monday, Tuesday, Wednesday, Thursday, Friday, Saturday, Sunday

Dg: ANY
D4: am, pm
D1: early morning, morning, afternoon, evening
Dd: 0:00-5:59, 6:00-11:59, 12:00-17:59, 18:00-23:59

D5: traditional, non-traditional
D2: early morning, evening, morning, early afternoon, late afternoon
Dd: 0:00-5:59, 18:00-23:59, 6:00-11:59, 12:00-14:59, 15:00-17:59

D6: night time, day time, prime time
D3: late night, early morning, morning, mid day, afternoon, early evening, late evening, early night
Dd: 0:00-3:59, 4:00-6:59, 7:00-10:59, 11:00-12:59, 13:00-15:59, 16:00-17:59, 18:00-21:59, 22:00-23:59
