Attribute-Oriented Induction Using Domain Generalization Graphs

Howard J. Hamilton, Robert J. Hilderman, and Nick Cercone
Department of Computer Science
University of Regina
Regina, Saskatchewan, Canada, S4S 0A2
[email protected]
Abstract

Attribute-oriented induction summarizes the information in a relational database by repeatedly replacing specific attribute values with more general concepts according to user-defined concept hierarchies. We show how domain generalization graphs can be constructed from multiple concept hierarchies associated with an attribute, describe how these graphs can be used to control the generalization of a set of attributes, and present the Multi-Attribute Generalization algorithm for attribute-oriented induction using domain generalization graphs. Based upon a generate-and-test approach, the algorithm generates all possible combinations of nodes from the domain generalization graphs associated with the individual attributes, to produce all possible generalized relations for the set of attributes. We rank the interestingness of the resulting generalized relations using measures based upon relative entropy and variance. Our experiments show that these measures provide a basis for analyzing summary data from relational databases. Variance appears more useful because it tends to rank the less complex generalized relations (i.e., those with few attributes and/or few tuples) as more interesting.
1 Introduction

Knowledge discovery from databases (KDD) algorithms can be broadly classified into two general areas: summarization and anomaly detection. Summarization algorithms find concise descriptions of input data. For example, classificatory algorithms partition input data into disjoint groups. The results of such classification might be represented as a decision tree or a set of characteristic rules, as from C4.5 [15], DBMiner [10], and KID3 [14]. Anomaly-detection algorithms identify unusual features of data, such as combinations that occur with greater or lesser frequency than might be expected. For example, association algorithms find, from transaction records, sets of items that appear with sufficient frequency to merit attention [13]. Similarly, sequencing algorithms find relationships among items or events across time, such as events A and B usually precede event C [1]. A hybrid approach generates high-level association rules from input data and concept hierarchies [9].

Attribute-oriented induction (AOI) [7, 8] is a summarization algorithm that has been effective for KDD. AOI summarizes the information in a relational database by repeatedly replacing specific attribute values with more general concepts according to user-defined concept hierarchies (CHs). A concept hierarchy associated with an attribute in a database is represented as a tree where leaf nodes correspond to actual data values in the database, intermediate nodes correspond to a more general representation of the data values, and the root node corresponds to the most general representation of the data values. For example, a CH for the Location attribute in a sales database is shown in Figure 1.
Figure 1. Concept hierarchy (root ANY; divisions West and East; cities Vancouver, L.A., Toronto, and New York; offices Office1 through Office9 at the leaves)
KDD approaches are evaluated by the interestingness of the results and the efficiency with which they are obtained. Recent research has shown AOI methods to be among the most efficient for KDD [2, 3, 4, 7, 11] because only one path of generalization is followed for each attribute.
The complexity of CHs is a primary factor determining the interestingness of results [6]. If several CHs are available for the same attribute, meaning knowledge about the attribute can be expressed in different ways, current AOI methods require the user to select one, which may not generate the most interesting results. For example, the CH in Figure 1 does not provide any information regarding sales by country, but that in Figure 2 does.
Figure 2. Alternative concept hierarchy (root ANY; countries Canada and U.S.; the same cities and offices grouped by country)
A fundamental problem with AOI methods is the failure to evaluate the relative merits of other possible generalizations consistent with the CHs. To facilitate this, a higher level representation is needed to manage the process of generalizing in different ways. The levels of the CHs in Figures 1 and 2 correspond to the more general representation in the domain generalization paths (DGPs) of Figures 3(a) and 3(b), respectively, where the nodes at each level are a general description of the concepts at the same level in the corresponding CH. These DGPs are clearly similar to each other, and to formalize the relationship between them, a domain generalization graph (DGG) can be constructed as shown in Figure 3(c). We assume a common name used in multiple DGPs represents the same partition of the domain in the underlying CHs for the corresponding attribute.
Figure 3. Similar DGPs and a DGG: (a) a DGP with levels Product, City, Division, and ANY; (b) a DGP with levels Product, City, Country, and ANY; (c) the combined DGG, in which City generalizes to both Division and Country, each of which generalizes to ANY
In [12], we introduced efficient methods for generalizing individual attributes using DGGs. In this paper, we introduce the Multi-Attribute Generalization algorithm for generalizing a set of attributes. We consider the set of attributes to be a single attribute whose domain is the cross product of the individual attribute domains. The algorithm generates all possible combinations of generalizations of concepts from the DGGs associated with the set of attributes. Generalizing in this way creates many generalized relations, so two measures are used to rank their relative interestingness. This work was motivated by the need to automate the discovery process for cases where many ways of generalizing may be appropriate. For example, given a database with a time-related attribute, summaries can be done according to the concepts hour of day, part of day, day, day of week, day of month, week, week in month, week in quarter, month, year, and many others. Our system not only creates all of these summaries, but also ranks them to help identify any anomalies, such as a disproportionate percentage of activity in the first week of a month. Furthermore, all other attributes can have similarly complex summaries, and our system considers all resulting combinations. Our system automatically provides a database analyst with a variety of perspectives on the database.

The remainder of this paper is organized as follows. In the following section, we provide a formal definition of DGGs. In Section 3, we introduce the Multi-Attribute Generalization algorithm, All_Gen, for generalizing a set of attributes. In Section 4, we consider two interestingness measures for ranking the generalized relations. In Section 5, we present experimental results and compare the two measures. We conclude in Section 6.
2 Definitions

Given a set S = {s1, s2, ..., sn} (the domain of an attribute), S can be partitioned in many different ways, for example D1 = {{s1}, {s2}, ..., {sn}}, D2 = {{s1}, {s2, ..., sn}}, etc. Let D be the set of partitions of set S, and let ⪯ be a binary relation (called a generalization relation) defined on D such that Di ⪯ Dj if for every di ∈ Di there exists dj ∈ Dj such that di ⊆ dj. Then ⟨D, ⪯⟩ defines a partially ordered set from which we can construct a domain generalization graph ⟨D, E⟩ such that the nodes of the graph are elements of D and there is a directed arc from Di to Dj (denoted by E(Di, Dj)) iff Di ≠ Dj, Di ⪯ Dj, and there is no Dk ∈ D such that Di ⪯ Dk and Dk ⪯ Dj. If we let Dg = {S} and Dl = {{s1}, {s2}, ..., {sn}}, then for any Di ∈ D we have Dl ⪯ Di and Di ⪯ Dg. Dl and Dg are called the least and greatest elements of D, respectively. We call the nodes (elements of D) domains, where the least element is the most specific level of generality and the greatest element is the most general level. Two domains, Di and Dj, are comparable if either Di ⪯ Dj or Dj ⪯ Di; that is, in the DGG, there exists a path containing both Di and Dj. Since generalization begins at the least element and moves toward the greatest element, for each node Di in ⟨D, E⟩, we define descendants(Di) to be all nodes Dj such that Di ⪯ Dj and ancestors(Di) to be all nodes Dk such that Dk ⪯ Di. There is a trivial DGG consisting of a single DGP where the domain of the attribute is mapped to the most general level (i.e., Dl is mapped directly to Dg).

Example 1: For S = {0, 1, 2, 3, 4} and D = {D1, D2, D3, D4, D5}, where D1 = {{0}, {1}, {2}, {3}, {4}}, D2 = {{0, 1}, {2}, {3}, {4}}, D3 = {{0, 1, 2}, {3, 4}}, D4 = {{0, 1}, {2, 3}, {4}}, and D5 = {S} = {{0, 1, 2, 3, 4}}, we construct the DGG shown in Figure 4.
Figure 4. Domain generalization graph (arcs D1 to D2, D2 to D3, D2 to D4, D3 to D5, and D4 to D5)
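To make the generalization relation concrete, here is a minimal Python sketch (ours, not from the paper) that tests Di ⪯ Dj for partitions represented as lists of sets and derives the DGG arcs for Example 1:

    def finer_or_equal(Di, Dj):
        """Di <= Dj in the paper's sense: every block of partition Di is
        contained in some block of partition Dj."""
        return all(any(di <= dj for dj in Dj) for di in Di)

    def dgg_arcs(D):
        """Directed arcs of the DGG: Di -> Dj iff Di != Dj, Di <= Dj, and
        no third partition Dk lies strictly between them."""
        n = len(D)
        arcs = []
        for i in range(n):
            for j in range(n):
                if i == j or not finer_or_equal(D[i], D[j]):
                    continue
                between = any(
                    k != i and k != j
                    and finer_or_equal(D[i], D[k]) and finer_or_equal(D[k], D[j])
                    for k in range(n)
                )
                if not between:
                    arcs.append((i, j))
        return arcs

    # Example 1: S = {0, 1, 2, 3, 4}
    D = [
        [{0}, {1}, {2}, {3}, {4}],    # D1, the least element
        [{0, 1}, {2}, {3}, {4}],      # D2
        [{0, 1, 2}, {3, 4}],          # D3
        [{0, 1}, {2, 3}, {4}],        # D4
        [{0, 1, 2, 3, 4}],            # D5 = {S}, the greatest element
    ]
    names = ["D1", "D2", "D3", "D4", "D5"]
    print([(names[i], names[j]) for i, j in dgg_arcs(D)])
    # [('D1', 'D2'), ('D2', 'D3'), ('D2', 'D4'), ('D3', 'D5'), ('D4', 'D5')]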
3 The All_Gen Algorithm

Given a relation and n DGGs associated with a set of n attributes, the All_Gen algorithm, shown in Figure 5, creates all possible generalized relations consistent with the DGGs. In All_Gen, the function Generalize returns a generalized relation where attribute i with DGG D^i in the target relation has been generalized to the level of node D^i_k (any of the generalization algorithms presented in [3, 4, 7] may be used to implement this function), and the procedure Output saves a generalized relation. Node D^i_1 is the least element of DGG D^i. The initial call is All_Gen(relation, 1, n, ∅), where relation is the target relation being generalized, 1 is the number of the target attribute for this iteration, n is the total number of target attributes, and ∅ initializes S. All_Gen uses recursion to iterate over all n attributes. The output is all combinations of nodes in the DGGs associated with the n attributes. For m attributes, a database of n tuples, and an O(n) generalization algorithm, the computational complexity is O(n ∏_{i=1}^{m} |D^i|), where |D^i| is the number of nodes in DGG D^i.

    procedure All_Gen(relation, i, m, S)
    begin
        for k = 1 to |D^i| do
        begin
            if k > 1 then
                gen_relation ← Generalize(relation, D^i_k)
            else
                gen_relation ← relation
            end if
            if i < m then
                All_Gen(gen_relation, i + 1, m, S ∪ {D^i_k})
            else
                Output(gen_relation, S ∪ {D^i_k})
            end if
        end
    end
Figure 5. Multi-attribute generalization algorithm
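The pseudocode of Figure 5 translates almost line for line into Python. Below is a minimal sketch (ours; the paper's implementation is in C), where generalize and output are assumed callbacks standing in for the paper's Generalize and Output:

    def all_gen(relation, dggs, generalize, output, i=0, chosen=()):
        """Sketch of All_Gen: emit one generalized relation per combination
        of nodes, one node drawn from each attribute's DGG.

        dggs[i] lists the nodes of attribute i's DGG, least element first;
        generalize(relation, i, node) must return the relation with
        attribute i generalized to 'node'; output receives each result."""
        for k, node in enumerate(dggs[i]):
            # k == 0 is the least element: attribute i is left unchanged
            gen_rel = generalize(relation, i, node) if k > 0 else relation
            if i < len(dggs) - 1:
                all_gen(gen_rel, dggs, generalize, output, i + 1, chosen + (node,))
            else:
                output(gen_rel, chosen + (node,))

For m attributes the recursion emits one relation per element of the cross product of the DGG node sets, matching the O(n ∏ |D^i|) cost noted above.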
Example 2: For the simple DGGs for attributes A, B, and C shown in Figures 6(a), 6(b), and 6(c), respectively, the multi-attribute generalization graph, shown in Figure 7, describes all possible levels of generalization for the domain associated with the set of attributes. Node ⟨A1, B1, C1⟩ (lower left) contains the least element from each individual DGG. This node corresponds to the original domain of the attributes, and the values contained in this domain are represented by the cross product of the values contained in the individual attribute domains. All other nodes correspond to a possible level of generalization of node ⟨A1, B1, C1⟩. For example, assume that a, b1, b2, and c describe the generalization relations for attributes A, B, and C, respectively. Applying a, b1, b2, and c in order, we obtain generalizations corresponding to ⟨A2, B1, C1⟩, ⟨A2, B2, C1⟩, ⟨A2, B3, C1⟩, and ⟨A2, B3, C2⟩. Node ⟨A2, B3, C2⟩ (upper right) contains the greatest element from each individual DGG and corresponds to all attributes generalized to ANY.
Figure 6. A set of DGGs: (a) DGG for A with nodes A1 and A2 and arc a; (b) DGG for B with nodes B1, B2, and B3 and arcs b1 and b2; (c) DGG for C with nodes C1 and C2 and arc c
Figure 7. Multi-attribute generalization graph (the twelve nodes from ⟨A1, B1, C1⟩ to ⟨A2, B3, C2⟩, with arcs labeled a, b1, b2, and c)
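The node set of the multi-attribute generalization graph is simply the cross product of the individual DGG node sets; a quick sketch (ours, with labels taken from Example 2) reproduces the twelve nodes of Figure 7:

    from itertools import product

    A = ["A1", "A2"]          # nodes of the DGG for attribute A, least first
    B = ["B1", "B2", "B3"]    # nodes of the DGG for attribute B
    C = ["C1", "C2"]          # nodes of the DGG for attribute C

    nodes = list(product(A, B, C))
    print(len(nodes))   # 12 nodes, from ('A1', 'B1', 'C1') to ('A2', 'B3', 'C2')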
4 Interestingness Measures

Generalizing all possible combinations of concepts in a multi-attribute domain generalization graph has the potential to generate a large number of generalized relations, even for a few, small DGGs. To identify those generalized relations from which we may learn the most, two measures are used to rank their interestingness. The first measure, based upon the relative entropy measure (the Kullback-Leibler (KL) distance), has been suggested as an appropriate measure for comparing data distributions in unstructured textual databases [5]. Here we use the KL-distance to compare the distribution defined by the structured data in a generalized relation to that of a model distribution of the data. The second measure, variance, is the most common measure of variability used in statistics. Variance compares the distribution defined by the data in a generalized relation to a uniform distribution of the data.

4.1 KL-Distance
Given a concept A with domain S = {S1, S2, ..., Sm}, where the value of A in each tuple can be assigned any value in S, we measure the distance between the distribution of the Si's in the actual data and that of a model distribution. If we denote the distribution of the Si's in the actual data by p and the model distribution by q, then the distance from p(Si) to q(Si) measures the amount of information we lose by modelling p by q. This distance is called the relative entropy measure (or KL-distance). The relative entropy between two probability distributions p(S) and q(S) is defined as [5]:

$$D(p \parallel q) = \sum_{i=1}^{m} p(S_i) \log_2 \frac{p(S_i)}{q(S_i)}$$

The larger the KL-distance, the less similar are the distributions p and q.

Example 3: We measure the KL-distance of our actual distribution from the uniform distribution. Let X be a real random variable with uniform distribution on the interval [a, a+b] of length b. Then the probability density function p(x) is given by:

$$p(x) = \begin{cases} \frac{1}{b}, & \text{when } x \text{ is a point in } [a, a+b] \\ 0, & \text{otherwise} \end{cases}$$

Consider the concept A with domain S = {S1, S2, S3}, where the actual counts of the distinct values S1, S2, and S3 in the generalized relation for instance A1 are 20, 10, and 40, respectively. From the actual distribution of A1, we have p(S1) = 0.29, p(S2) = 0.14, and p(S3) = 0.57, and from the uniform distribution we have q(Si) = 0.33 for all i. The KL-distance is given by:

$$D(p \parallel q) = 0.29 \log_2 \frac{0.29}{0.33} + 0.14 \log_2 \frac{0.14}{0.33} + 0.57 \log_2 \frac{0.57}{0.33} = (-0.055) + (-0.173) + 0.449 = 0.221$$
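As a check on Example 3, here is a small Python sketch (ours) of the KL-distance computation; note that with unrounded probabilities the value is approximately 0.206, and the paper's 0.221 results from rounding the probabilities to two decimal places first:

    import math

    def kl_distance(p, q):
        """Relative entropy D(p || q) in bits for discrete distributions
        given as equal-length lists of probabilities."""
        return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

    counts = [20, 10, 40]                  # actual counts for S1, S2, S3
    p = [c / sum(counts) for c in counts]  # actual distribution
    q = [1 / len(counts)] * len(counts)    # uniform model distribution
    print(round(kl_distance(p, q), 3))     # 0.206; the paper's 0.221 uses the
                                           # rounded values 0.29, 0.14, 0.57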
4.2 Variance

Given a concept A with domain S = {S1, S2, ..., Sm}, where the value of A in each tuple can be assigned any value in S, we measure how much the distribution of the Si's in the actual data varies from the uniform distribution. If we denote the distribution of the Si's in the actual data by p and the uniform distribution by q, then the variance, σ², is defined as:

$$\sigma^2 = \frac{\sum_{i=1}^{m} \left( p(S_i) - \bar{q}(S) \right)^2}{m - 1}$$

where q̄(S) is the average probability for the Si's in q. The larger the variance, the less similar are the distributions p and q.

Example 4: We measure the variance of our actual distribution from the uniform distribution. Consider again the distribution of instance A1 and the uniform distribution described in Example 3. The variance for A1 from the uniform distribution is:

$$\sigma^2 = \frac{(0.29 - 0.33)^2 + (0.14 - 0.33)^2 + (0.57 - 0.33)^2}{2} = \frac{0.002 + 0.036 + 0.058}{2} = \frac{0.096}{2} = 0.048$$
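Example 4 can be verified the same way; a minimal sketch (ours), with m = 3 values:

    def variance_from_uniform(p):
        """Variance of the distribution p around the average probability
        (the uniform value), with the paper's m - 1 denominator."""
        m = len(p)
        q_bar = 1.0 / m
        return sum((pi - q_bar) ** 2 for pi in p) / (m - 1)

    p = [20 / 70, 10 / 70, 40 / 70]             # Example 4 distribution
    print(round(variance_from_uniform(p), 3))   # 0.048, matching the paper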
5 Experimental Results

The Multi-Attribute Generalization algorithm has been implemented in C as an extension to DB-Discover, a software tool utilizing attribute-oriented induction for knowledge discovery from databases. Series of discovery tasks were run on tables from two databases: one consisting of a single 50-tuple table and the other the 10,000-record NSERC research awards database that has been widely used in previous data mining research (e.g., [2, 4, 6]). Since the maximum number of generalized relations generated by these discovery tasks depends only on the size of the CHs associated with the attributes being generalized (i.e., it is not dependent upon the number of tuples), for simplicity, we report on the results obtained using the 50-tuple table.

5.1 The Database
The 50-tuple table represents a subset of pay-per-view movie rentals for the month of August, 1995. We primarily restrict our discussion to those generalized relations containing one and two attributes since these are more compact and less complex, yet representative of the results obtained. Tuples in the 50-tuple table consist of four attributes: Movie, Date, Day, and Time, with 4, 31, 7, and 15 possible distinct values, respectively. The DGGs associated with the attributes, shown in Figure 8, contain 3, 6, 5, and 8 nodes, respectively, and provide 1, 3, 2, and 3 possible DGPs, respectively. The multi-attribute generalization graph constructed from these DGGs consists of 720 nodes (i.e., 3 × 6 × 5 × 8). Of these, we considered only those where every attribute was generalized at least once (i.e., k ≥ 2 in All_Gen), giving 280 nodes (i.e., 2 × 5 × 4 × 7). Ignoring the most general case where all attributes are generalized to ANY, 279 possible generalized relations were generated and ranked for interestingness by each discovery task.

"Obvious" results were built into the 50-tuple table when it was created. It contains 35 weekend and 15 weekday rentals; 6 Sunday, 1 Monday, 3 Tuesday, 5 Wednesday, 6 Thursday, 18 Friday, and 11 Saturday rentals; 25 August 1 to August 10, 19 August 11 to August 20, and 6 August 21 to August 31 rentals; and 24 adult and 26 general classification rentals. In the discussion that follows, the domains of nodes A2, B4, C2, C4, and D5 are {general, adult}, {early month, mid month, late month}, {early weekday, late weekday, Friday, Saturday, Sunday}, {weekday, weekend}, and {am, pm}, respectively.
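The node counts above follow directly from the DGG sizes; a quick arithmetic check (ours, in Python):

    from math import prod

    dgg_sizes = [3, 6, 5, 8]    # nodes in the Movie, Date, Day, and Time DGGs

    total = prod(dgg_sizes)                       # 720 nodes in the graph
    generalized = prod(n - 1 for n in dgg_sizes)  # exclude each least element: 280
    relations = generalized - 1                   # drop the all-ANY node: 279
    print(total, generalized, relations)          # 720 280 279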
Figure 8. DGGs for the Movie, Date, Day, and Time attributes (nodes A1 to A3, B1 to B6, C1 to C5, and D1 to D8, respectively)

5.2 Discovery Tasks
During a discovery task, a set of generalized relations is generated and each is classified according to the number of non-ANY attributes it contains, as shown in Table 1 (all tables are given in Appendix A). A description of the generalized relations contained in the set is also generated, of which a sample subset is shown in Table 2, where the Relation column contains the unique identifier assigned to each generalized relation, the KL-Distance column is the KL-distance from the uniform distribution, and the Concept(s) column contains the concept and level of generalization for each concept in the corresponding generalized relation. Generalized relations are sorted by the number of non-ANY attributes (ascending) and KL-distance (descending).
Each discovery task generates the generalized relations corresponding to the "obvious" results described above. However, we had to visually search for these single-attribute relations, shown in Tables 3 through 6, to extract them from the complete set (i.e., a discovery task is not able to recognize what we consider to be "obvious"). In these tables, the first column is of the form x:y, where x is the attribute being generalized and y is the level of generalization, the Count column is the number of tuples which generalized to the concept in the first column, and the Count (%) column represents the Count as a percentage of all tuples generalized.

5.3 KL-Distance versus Variance
Here we compare the KL-distance (using the uniform distribution as the model distribution) and variance results. Due to the large number of generalized relations generated, even for a database consisting of only 50 four-attribute tuples, we only discuss a subset of the results. From the 279 generalized relations generated by the two discovery tasks, 67 contained two-attribute tuples. From this group of 67, we selected 17 for further discussion which we believe to be most illustrative. Descriptions for these generalized relations are shown in Table 7, where the Relation column specifies the unique identifier assigned to each generalized relation during the discovery task, the Concept(s) column has the same meaning as the one in Table 2, and the KL-Distance Rank and Variance Rank columns specify the rank assigned to the corresponding generalized relation by the KL-distance and variance measures, respectively (the lower the rank, the greater the degree of interestingness).

The variance measure tended to rank generalized relations with a small number of tuples, but with a large variance from the uniform distribution, as most interesting. For example, relations 8, 24, and 39 of Table 7, shown in Tables 8, 9, and 10, respectively, were ranked as the second, fifth, and eighth most interesting generalized relations, respectively. These tables are similar in the number of tuples they contain and the distribution of the counts to each tuple. Other generalized relations with a similar ranking (not shown) also had a similar number of tuples and distribution of counts. In contrast, the KL-distance measure ranked these generalized relations as 49th, 53rd, and 42nd, respectively. Although the order in which the KL-distance measure ranked these generalized relations is similar to how they were ranked by the variance measure, other generalized relations with a similar ranking did not have a similar number of tuples and distribution of counts. For example, relation 38, shown in Table 11, was ranked 43rd. Clearly, relation 38 does not appear similar to relations 8, 24, and 39 in terms of the number of tuples and distribution of counts.

Relations 8, 24, and 39 also emphasize how differently the two interestingness measures rank the generalized relations. For example, these generalized relations were considered interesting by the variance measure (indicated by the low ranking) and not interesting by the KL-distance measure (indicated by the high ranking). Close examination of Table 7 shows that there is almost an inverse relationship between the two measures, the most obvious example being relation 243, which is ranked most interesting by the KL-distance measure and least interesting by the variance measure. Relation 243 (not shown) contains 29 two-attribute tuples, where the distribution of counts is nearly uniform. In general, the KL-distance measure tended to rank generalized relations with a large number of tuples, regardless of their distribution, as more interesting.

The average numbers of tuples for the ten most interesting and ten least interesting generalized relations for both interestingness measures are shown in Tables 12 and 13, respectively. Table 12 shows that the most interesting single-attribute generalized relations generated under the variance measure have approximately 30% fewer tuples than those generated under the KL-distance measure. For four-attribute tuples, the variance measure generates approximately 70% fewer tuples. In contrast, Table 13 shows that the least interesting single-attribute generalized relations generated under the variance measure have approximately 40% more tuples than those generated under the KL-distance measure. For four-attribute tuples, the variance measure generates approximately twice as many tuples. This comparison clearly illustrates that the variance measure quantifies the simpler and less complex generalized relations (i.e., those with few attributes and/or tuples) as more interesting.
6 Conclusion and Future Work

We presented the Multi-Attribute Generalization algorithm for attribute-oriented induction of a set of attributes. The algorithm generates all possible combinations of generalizations of concepts from the DGGs associated with a set of attributes. KL-distance and variance were used to rank the interestingness of the generalized relations generated by the algorithm. We believe that an important property of interestingness measures should be to rank generalized relations of low complexity as more interesting. The complexity of a generalized relation is dependent upon the number of attributes and the number of unique values for each attribute. For example, generalizing just four attributes, where there are 4, 7, 2, and 10 possible values for the associated concepts at some arbitrary level of generalization, a generalized relation could be generated with up to 560 tuples (i.e., 4 × 7 × 2 × 10). Thus, low complexity generalized relations, as quantified by the variance measure, are attractive because they are more concise, and therefore, intuitively easier to understand and analyze.

The algorithm generates too many generalized relations. To remove some of these from consideration, additional heuristics are required. One approach may be to output only the n most interesting generalized relations. A second approach may be an interestingness threshold where we only output those generalized relations above the threshold. Unfortunately, a threshold is a subjective measure requiring some prior knowledge about the nature of the database. Given the dynamic nature of databases, this threshold could change over time. A third approach may be to eliminate generalized relations which are quantified similarly and which share common ancestors or descendants. If many generalized relations differ by only one or two levels of generalization along some common DGP, then we only output one. A fourth approach may be to use a multi-level quantification scheme. Attributes in the database could be ranked a priori such that generalized relations containing higher ranking attributes would be considered more interesting.

Another problem is the difficulty in determining the relative degree of interestingness of a generalized relation with some certainty or confidence. For example, using either the KL-distance or variance measure, how interesting is a result of 1.0 compared to 0.1, 0.01, or 0.001?
References

[1] R. Agrawal and R. Srikant. Mining sequential patterns. In Proceedings of the 11th International Conference on Data Engineering, pages 3-14, 1995.

[2] C. L. Carter and H. J. Hamilton. Fast, incremental generalization and regeneralization for knowledge discovery from databases. In Proceedings of the 8th Florida Artificial Intelligence Symposium, pages 319-323, Melbourne, Florida, April 1995.

[3] C. L. Carter and H. J. Hamilton. A fast, on-line generalization algorithm for knowledge discovery. Applied Mathematics Letters, 8(2):5-11, 1995.

[4] C. L. Carter and H. J. Hamilton. Performance evaluation of attribute-oriented algorithms for knowledge discovery from databases. In Proceedings of the Seventh IEEE International Conference on Tools with Artificial Intelligence (ICTAI'95), pages 486-489, Washington, D.C., November 1995.

[5] R. Feldman and I. Dagan. Knowledge discovery in textual databases (KDT). In Proceedings of the 1st International Conference on Knowledge Discovery and Data Mining (KDD-95), pages 112-117, Montreal, August 1995.

[6] H. J. Hamilton and D. F. Fudger. Measuring the potential for knowledge discovery in databases with DBLearn. Computational Intelligence, 11(2):280-296, 1995.

[7] J. Han. Towards efficient induction mechanisms in database systems. Theoretical Computer Science, 133:361-385, October 1994.

[8] J. Han, Y. Cai, and N. Cercone. Data-driven discovery of quantitative rules in relational databases. IEEE Transactions on Knowledge and Data Engineering, 5(1):29-40, February 1993.

[9] J. Han and Y. Fu. Discovery of multiple-level association rules from large databases. In Proceedings of the 1995 International Conference on Very Large Data Bases (VLDB'95), 1995.

[10] J. Han and Y. Fu. Attribute-oriented induction in data mining. In U. M. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy, editors, Advances in Knowledge Discovery and Data Mining. AAAI/MIT Press, 1996.

[11] H.-Y. Hwang and W.-C. Fu. Efficient algorithms for attribute-oriented induction. In Proceedings of the 1st International Conference on Knowledge Discovery and Data Mining (KDD-95), pages 168-173, Montreal, August 1995.

[12] W. Pang, R. J. Hilderman, H. J. Hamilton, and S. D. Goodwin. Data mining with concept generalization graphs. In Proceedings of the Ninth Annual Florida AI Research Symposium, Key West, Florida, May 1996.

[13] J. S. Park, M.-S. Chen, and P. S. Yu. An effective hash-based algorithm for mining association rules. SIGMOD Record, 25(2):175-186, 1995.

[14] G. Piatetsky-Shapiro. Discovery, analysis and presentation of strong rules. In Knowledge Discovery in Databases, pages 229-248. AAAI/MIT Press, 1991.

[15] J. R. Quinlan. C4.5: Programs for Machine Learning. Morgan Kaufmann, 1993.
7 Appendix

No. of Attributes    No. of Relations
1                    14
2                    67
3                    126
4                    72
Total                279

Table 1. Number of generalized relations classified by number of attributes
Relation    Concept(s)            KL-Distance
...
12          ⟨-, -, -, D2⟩         0.46317
55          ⟨-, B2, -, -⟩         0.31407
1           ⟨-, B5, -, -⟩         0.28439
...
109         ⟨-, B2, C3, -⟩        1.31407
163         ⟨-, B3, -, D3⟩        1.21477
207         ⟨-, B5, -, D4⟩        1.20285
...
111         ⟨-, B2, C3, D5⟩       1.63008
209         ⟨-, B5, C4, D4⟩       1.61846
222         ⟨A2, -, C2, D4⟩       1.61317
...
250         ⟨A2, B3, C4, D7⟩      1.19800
231         ⟨A2, B4, C4, D7⟩      1.18794
196         ⟨A2, B3, C3, D6⟩      1.17625
...

Table 2. Sample generalized relation records
Day:C4      Count    Count (%)
weekend     35       70
weekday     15       30
Total       50       100

Table 3. Rentals by weekend and weekday
Day:C2           Count    Count (%)
Friday           18       36
late weekday     11       22
Saturday         11       22
Sunday           6        12
early weekday    4        8
Total            50       100

Table 4. Rentals by day of the week
Date:B4        Count    Count (%)
early month    25       50
mid month      19       38
late month     6        12
Total          50       100

Table 5. Rentals by time of the month

Movie:A2    Count    Count (%)
general     26       52
adult       24       48
Total       50       100

Table 6. Rentals by classification
Relation    Concept(s)          KL-Distance Rank    Variance Rank
243         ⟨-, B2, -, D2⟩      1                   67
227         ⟨-, B4, -, D2⟩      13                  57
218         ⟨A2, -, -, D2⟩      16                  43
204         ⟨-, -, C2, D5⟩      18                  41
224         ⟨-, B4, -, D5⟩      19                  15
128         ⟨-, -, C4, D3⟩      23                  20
42          ⟨-, B4, -, D4⟩      28                  38
16          ⟨-, -, C2, D4⟩      29                  54
138         ⟨A2, -, -, D3⟩      35                  35
14          ⟨-, -, C4, D4⟩      36                  21
39          ⟨-, B4, -, D7⟩      42                  8
38          ⟨-, B4, C2, -⟩      43                  32
56          ⟨-, B3, C4, -⟩      45                  62
8           ⟨-, -, C4, D7⟩      49                  2
37          ⟨-, B4, C4, -⟩      51                  14
24          ⟨A2, -, -, D7⟩      53                  5
20          ⟨A2, -, C4, -⟩      61                  13

Table 7. Ranks assigned to generalized relations by KL-distance and variance measures
Time:D5    Day:C4     Count    Count (%)
pm         weekend    28       56
pm         weekday    11       22
am         weekend    7        14
am         weekday    4        8
Total                 50       100

Table 8. Relation 8
Movie:A2    Time:D5    Count    Count (%)
adult       pm         22       44
general     pm         17       34
general     am         9        18
adult       am         2        4
Total                  50       100

Table 9. Relation 24
Time:D5    Date:B4        Count    Count (%)
pm         early month    21       42
pm         mid month      14       28
am         mid month      5        10
am         early month    4        8
pm         late month     4        8
am         late month     2        4
Total                     50       100

Table 10. Relation 39

Day:C2           Date:B4        Count    Count (%)
Friday           mid month      9        18
late weekday     early month    7        14
Friday           early month    7        14
Saturday         early month    5        10
Saturday         mid month      5        10
Sunday           early month    3        6
early weekday    early month    3        6
Sunday           mid month      2        4
Friday           late month     2        4
late weekday     mid month      2        4
late weekday     late month     2        4
early weekday    mid month      1        2
Saturday         late month     1        2
Sunday           late month     1        2
Total                           50       100

Table 11. Relation 38
No. of Attributes    KL-Distance    Variance
1                    4.1            2.9
2                    16.7           4.2
3                    31.5           6.5
4                    39.9           12.1
Average              23.0           6.4

Table 12. Average number of tuples in most interesting generalized relations
No. of Attributes    KL-Distance    Variance
1                    2.9            4.1
2                    5.0            19.0
3                    9.7            35.9
4                    20.9           42.4
Average              9.6            25.4

Table 13. Average number of tuples in least interesting generalized relations