Discovering Roll-Up Dependencies
Jef Wijsen and Raymond T. Ng y
Abstract
Roll-up dependencies (RUD) generalize functional dependencies (FDs) for relational databases that support roll-up/drill-down. Consider the following example. Suppose we keep track of the daily price in cents of a particular product. Even though prices in cents can change from day to day, we may nd that within a week, the daily prices are the same if they are rounded to the nearest integral dollars. The RUD (Day : week) ! (Price : dollar) captures this pattern. Note that the functional dependency Day
!
Price
does not imply the above RUD, nor vice versa. The applicability of RUDs in database design and OLAP is discussed. A sound and complete axiomatization for logical implication is provided. The problem of mining RUDs is at the center of this paper. The problem is shown NP-hard, and an outline of an algorithm is provided.
1 Introduction The problem of discovering functional dependencies (FDs) from relational databases has been studied several years ago [8], before the explosive growth of data mining. Although FDs are the single most important dependencies in database design, FDs are not prevalent in the data mining literature. This may be attributed to the fact that FDs other than those known at design time, are fairly rare. In this paper we propose an extension of FD, called roll-up dependency (RUD), which has clear and interesting applications in data mining and OLAP. In particular, we generalize FDs for attribute domains (called levels) that are organized in a ner-than order. Such ner-than relations are very common in data mining and OLAP; they have been called concept hierarchies [6], or roll-up/drill-down lattices. For example, time can be measured in days (level day) or weeks (level week). day is ner than week, and there is a many-to-one mapping from day values to week values capturing the natural containment of days in weeks. A price can be expressed in cents (level cent) or integral dollars (level dollar). cent is ner than dollar, and a price in cents can be expressed to the nearest integral dollar. For example, 778 cents is rounded to 8 dollar. In this way, a many-to-one mapping is determined from cent prices to dollar prices. RUDs allow comparing attributes for equality at dierent levels. For example, consider the relation scheme f(Day : day); (Price : cent)g to store the daily price in cents of a particular product. We have one price per day, which can be expressed by the FD: Day ! Price : (1)
Vrije Universiteit Brussel, Departement Informatica, Pleinlaan 2, B-1050 Brussel, Belgium.
y
Department of Computer Science, University of British Columbia, Vancouver, B.C. V6T 1Z4 Canada. Email:
[email protected].
[email protected].
1
Email:
In general, cent prices can change from day to day, and we may be interested in nding regularities in price uctuations. We may nd, for example, that within a week, all cent prices are equal if they are expressed to the nearest integral dollar, which is captured by the RUD: Day : week ! Price : dollar:
(2)
The meaning is that whenever there are two tuples whose Day -values belong to the same week, then their Price -values round to the same integral dollar. It is worth noting that dependency 1 does not imply dependency 2, nor vice versa. One of the applications of RUDs lies in the integration of data mining and OLAP. This integration is an interesting and important research direction|Han uses the term OLAP mining in this respect [5]. OLAP users typically execute roll-up/drill-down operations to establish an appropriate abstraction level to look at the data. RUDs can express signi cant and useful information about the result of a roll-up operation. In the foregoing example, if prices in integral dollars are suciently accurate for a given application, then dependency 2 indicates that rolling-up from days to weeks and from cent to dollar prices is advantageous: it reduces the amount of data without introducing unacceptable inaccuracies. This paper is organized as follows. Section 2 introduces roll-up dependencies (RUDs). Section 3 discusses some interesting applications of RUDs. Section 4 provides a sound and complete axiomatization for reasoning about RUDs. In that section, a RUD is either satis ed or falsi ed|there is no other possibility. In section 5, we take a gradual approach to satisfaction by adapting the classical notions of support and con dence [3]. These notions characterize the \strength" of a certain RUD. In section 6, we address the problem of discovering strong RUDs from a given database. We rst provide a computational complexity analysis of the problem, and then we discuss its algorithmic aspects.
2 Roll-Up Dependency In this section we rst de ne roll-up lattices, and then we formalize the notion of RUD. Finally, we discuss the relationship between RUDs and grouping.
2.1 Roll-Up Lattices
Domain names are called levels. In all examples throughout this paper, levels appear in typewriter style. Examples used earlier include day, week, cent, and dollar. The family of all levels is a partially ordered set. The order is denoted , and corresponds to an is- ner-than relationship. For example, we have day week, and there is a many-to-one mapping that maps each day to the week it belongs to. The following de nition is adapted from [2].
De nition 1 We assume the existence of a nite, partially ordered set (L; ) of levels. Every level l 2 L has associated with it a domain of constants, denoted dom (l), such that distinct levels have disjoint domains.
A roll-up instantiation is a set I of functions as follows: For every l1 ; l2 2 L with l1 l2 , there is a total function, denoted Ill12 , from dom (l1 ) into dom (l2 ), satisfying the following conditions: Transitivity: For every l1 ; l2 ; l3 2 L with l1 l2 l3 , Ill13 = Ill23 Ill12 ; and Re exivity: For every l 2 L, Ill is the identity function on dom (l).
2
year
HHHacademicyear month HHHweek day
industry
HHH category HHHbrand item
HHHcsize city HHHchain
area
store
dollar
qscale
cent
quality
Figure 1: A partially ordered set of levels.
fg HHH (City : area) HHH(City : csize) (City : area)(City : csize) (City : city) Figure 2: The lattice ((City : city)3; ).
(Q : qscale ` )
fg``
````` (P : dollar ```)```
````` ` `: cent) ( P : dollar)(Q : qscale) (Q : quality ) ( P ````` ````` `` `` (P : dollar)(` Q : quality) (P : cent)(Q : qscale) ````` ` (P : cent)(Q : quality) Figure 3: The lattice ((P : cent)(Q : quality)3; ).
3
If Ill12 (v) = w, we also say that v in l1 rolls up to w in l2 (where l1 or l2 can be omitted if they are clear from the context). 2 It is worth noting that the only assumption we make about is that it is a partial order. We do not assume that the order is a lattice, nor do we partition the set of all levels into dierent hierarchies, as is done in other proposals.
Example 1 Figure 1 shows a diagram of a partially ordered set of levels. We have day month. Con-
month that maps every constant of dom (day) to a constant of dom (month). sequently, there is a function Iday month Iday formalizes the natural containment of days into months. Likewise month year, and every constant in month rolls up to a constant in year. By transitivity, it follows that day year. The Transitivity requirement in de nition 1 requires that whenever d in day rolls up to m in month, and m rolls up to y in year, then d rolls up to y in year. Note incidentally that week is not ner than month, as two days of the same week can belong to distinct months. A price can be expressed in cents (domain cent) or integral dollars (domain dollar). cent is ner than dollar, and a price in cents rolls up to the nearest integral dollar. For example, 827 cents dollar rolls up to 8 dollar. That is, Icent (v) = (v + 50)div 100. Note that this function is determined by the roll-up instantiation in hand. Another roll-up instantiation may choose a dierent mapping from cents to dollars. quality is a level used to measure the quality of life. A member of dom (quality) is a natural number between 1 and 99. We have quality qscale and dom (qscale) = flow, medium, highg. Qualities between 1 and 33 roll up to low, qualities between 34 and 66 to medium, and qualities between 67 and 99 to high. 2
The following de nition introduces schemes and patterns, and is illustrated by example 2. The notion of scheme corresponds to a conventional relation scheme of relational database theory.
De nition 2 We assume the existence of a set A of attributes. A relation scheme (or scheme) is a set f(A1 : l1); : : : ; (An : ln)g where n 0, and A1; : : : ; An are pairwise distinct attributes, and l1; : : : ; ln are levels. That is, a scheme is a total function from fA1 ; A2 ; : : : ; An g to L. The function domain of a scheme is denoted atts (). That is, atts (f(A1 : l1 ); : : : ; (An : ln )g) = fA1 ; : : : ; An g. Let be a scheme. A pattern over is a subset of atts () L satisfying the following conditions: Roll-up: Whenever (A : l) 2 , then (A) l; and Incomparability: Whenever (A : l1 ); (A : l2 ) 2 , then either l1 = l2 or l1 k l2 . 1 The set of all patterns over is denoted 3. A pattern f(A1 : l1 ); (A2 : l2 ); : : : ; (An : ln )g will often be denoted (A1 : l1 )(A2 : l2 ) : : : (An : ln ). The function atts () on schemes is extended to patterns in an obvious way: if is a pattern over , then atts ( ) is the smallest set containing the attribute A whenever (A : l) 2 for some level l. Let ; be patterns over the scheme (i.e., ; 2 3). We write iff for every (A : l) 2 , there exists some (A : l0 ) 2 such that l0 l. 2 1 We write l1 k l2 if l1 6 l2 and l2 6 l1 .
4
Example 2 An example scheme is f(Item : item); (City : city); (Day : day); (Price : cent)g: Given the set of levels of gure 1, there are a few hundreds patterns over the previous scheme (2178 to be precise). One such pattern is (Item : industry)(City : csize)(City : area): To verify that this is a pattern, note that item industry, city csize, city area, and that and area are not comparable by (see gure 1). The expression (City : city)(City : area) is not a pattern, as city and area are comparable by . The operator is illustrated by the following expression: (Day : month)(Item : item) (Item : category)(Item : brand): To verify the correctness of this expression, note that item category and item brand. 2 csize
Let be a scheme. It follows from de nition 2 that is a pattern over itself. In particular, a scheme is a pattern in which every attribute appears only once. Moreover, if is a pattern over then . While is only a partial order, the set of all patterns over , ordered by , is a complete lattice, as we are going to show next (theorem 1). We call this lattice the roll-up lattice over . Figure 2 shows the roll-up lattice over (City : city). Figure 3 shows the roll-up lattice over (P : cent)(Q : quality). Note that our notion of roll-up lattice extends and generalizes several earlier proposals found in the literature. Our notion is more general than the one in [7], because the same attribute can appear more than once in a lattice element, as in (City : csize)(City : area). This extension is both natural and useful. In an OLAP application, for example, one may want to group data about several cities by both city size and area, rather than only by size or only by area. Dimensionality reduction [4] is embedded naturally in our roll-up lattice. The proof of theorem 1 makes use of the constructs de ned next.
De nition 3 Let be a scheme. Let be a subset of atts () L satisfying the following condition: for every (A : l) 2 , (A) l. Note that need not be a pattern, as the Incomparability requirement in de nition 1 needs not be satis ed. From we derive a pattern, denoted bc, as follows: bc is the smallest subset of containing (A : l) 2 if for every (A : l0 ) 2 , l0 l implies l = l0 . It can be readily seen that bc is a pattern over . Let ; be patterns over . We write for the smallest set containing (A; l) if contains (A : l1 ) for some level l1 l, and contains (A : l2 ) for some level l2 l. 2
Example 3 b(Item : category)(Item : industry)c = (Item : category), as category industry. See gure 1. The operator combines two patterns; the result needs not be a pattern as it can violate the Incomparability requirement in de nition 1. We have (Item : brand)(City : area) (Item : category) = (Item : industry). 2 Theorem 1 Let be a scheme. The set of all patterns over , ordered by , is a complete lattice. Let be a scheme. It suces to show two things: First, that (3; ) is a partially ordered set; and second, that inf f ; g and supf ; g are de ned for all patterns and over . From this and the obvious fact that 3 is nite, it follows that (3; ) is a complete lattice. Proof.
5
We rst show that (3; ) is a partially ordered set. Proving re exivity and transitivity is straightforward. For antisymmetry, assume and are patterns over with and . We need to show = |i.e., and . We show |the proof of is symmetrical. Let (A : l) 2 . It suces to show (A : l) 2 . Since , we have that contains (A : l0 ) for some level l0 with l0 l. Since , we have that contains (A : l00 ) for some level l00 with l00 l0 . So contains both (A : l) and (A : l00 ) with l00 l0 l. By the de nition of pattern, l00 = l, hence l0 = l. Consequently, contains (A : l). We next show that inf f ; g and supf ; g are de ned for all patterns ; over . Let ; be patterns over . We show that inf f ; g = b [ c. Clearly, b [ c and b [ c . Let
be a pattern over with and . We need to show b [ c. Let (A : l) 2 b [ c. Hence, (A : l) 2 or (A : l) 2 . Since and , we have that contains (A : l0 ) for some level l0 with l0 l. Since (A : l) is an arbitrary member of b [ c, it is correct to conclude that
b [ c. Finally, we show that supf ; g = b c. Clearly, b c and b c. Let be a pattern over with and . We need to show b c . Let (A : l) 2 . Since and , we have that contains (A : l1 ) for some level l1 l, and contains (A : l2 ) for some level l2 l. Hence, contains (A : l). Hence, b c contains (A : l0 ) for some level l0 l. Since (A : l) is an arbitrary member of , it is correct to conclude that b c . This concludes the proof. 2 Let be a pattern. We shall refer to the lattice (3; ) as the roll-up lattice over . The bottom element of (3; ) is , and the top element is fg. Note that the proof of theorem 1 is constructive: it gives us an eective way of computing ^ and _ .2
2.2 Roll-Up Dependency
The following de nition introduces RUDs and de nes what it means for a relation to satisfy a RUD. The de nition is illustrated by example 4. Logical implication will be discussed in depth in section 4.
De nition 4 Let be a scheme. A tuple over is a total function t with domain atts () such that t(A) 2 dom ((A)), for each A 2 atts (). A relation over is a set of tuples over . Let be a pattern over . Let s; t be tuples over . Given a roll-up instantiation I , we say that s and t satisfy , denoted I (s; t), iff for every (A : l) 2 , l (s(A)) = I l (t(A)): I( A) (A)
A roll-up dependency (RUD) over is a statement of the form ! where and are patterns over . Let ! be a RUD over , and let R be a relation over . Given a roll-up instantiation I , we say that R satis es ! , denoted R j=I ! , iff for every s; t 2 R, if I (s; t) then I (s; t). Let be a set of RUDs over . We write R j=I iff R j=I for each 2 . Let be a RUD over . Given a roll-up instantiation I , we say that (logically) implies , denoted j=I , iff for every relation R over , R j=I implies R j=I . 2 As is usual, we write ^ for inf f ; g, and _ for supf ; g.
6
We write j= iff j=I for each roll-up instantiation I . 2
Example 4 Consider the scheme (City : city)(Quality : quality): A tuple f(City : x); (Quality : y)g over this scheme means that the quality of life in city x is considered to be y, where y is a natural number between 1 and 99. Two tuples s and t over this scheme are as follows:
s = f(City : Chicago); (Quality : 27)g t = f(City : New York); (Quality : 51)g We have city area and city csize (see gure 1). Further let dom (csize)=fhuge, large, smallg. Chicago and New York both roll up to the area U.S.A. and the size huge. Consequently, s and t satisfy the pattern (City : area)(City : csize). We have quality qscale with dom (qscale) =fhigh, medium, lowg. The qualities 1{33 all roll up to low, 34{66 to medium, and 66{99 to high. See gure 4 top. Consequently, s and t falsify the pattern (Quality : qscale). It follows that a relation containing s and t falsi es the RUD (City : area)(City : csize) ! (Quality : qscale); which expresses that two cities of the same size within the same area score equally well on the qualityof-life scale. Note that FDs are a very restricted type of RUD|one without roll-up. For example, the FD City ! Quality is equivalent to the RUD (City : city) ! (Quality : quality): 2
2.3 Roll-Up and Group-By
Let be a pattern, and R a relation (both over the same scheme). induces a partitioning of R into disjoint groups.
Example 5 Let R be a relation over (City : city)(Quality : quality). Consider the RUD (City : area) ! (Quality : qscale). Figure 4 left shows the relation R partitioned according to the pattern
(City : area). Two tuples belong to the same partition class if and only if they satisfy (City : area)| i.e., if their City -values roll up to the same area. Each partition class will be called a (City : area)group of R. Each (City : area)-group of R is a set of tuples, and hence a relation. Within each (City : area)-group we can distinguish one or more (Quality : qscale)-groups, as shown in gure 4 right. To conclude, the RUD (City : area) ! (Quality : qscale) rst induces a partitioning of R into (City : area)-groups, and then a further partitioning of each (City : area)-group into (Quality : qscale)-groups. Figure 5 is derived from gure 4 right by applying a roll-up operation. The last row, for example, states that there are two U.S.A. cities with a low quality of life. This representation will be called the histogram rep of (City : area) ! (Quality : qscale), where R is implicitly understood. 2
De nition 5 Groups are de ned relative to a roll-up instantiation I and a scheme . Let be a pattern over . Let R be a relation over . A -group of R (or simply group if is understood) is a maximal (w.r.t. set inclusion), non-empty subset G of R satisfying for every s; t 2 G, I (s; t). We write G (R; ) for the smallest set containing all -groups of R. 2 7
Lemma 1 Let be a scheme. Let be a pattern, and let R be a relation (all over ). G (R; ) is a partition of R. Proof.
Trivial. 2
The histogram rep of a relation does not have the formal status of relations or -groups. We assume it is intuitively clear how a histogram rep is computed from a relation and a RUD. Given a relation R and a RUD ! (both over the same scheme ), the format of the histogram rep of ! is shown in gure 6. Note that the attributes A1 ; : : : ; Aq are not necessarily distinct. The horizontal lines correspond to the partitioning of R into -groups. The histogram rep shows that the number of -groups of R equals n. The number of -groups of the ith -group equals mi . Let i 2 [1::n] and j 2 [1::mi ]. We have vi;h 2 dom (lh ) for each h 2 [1::p], and also vi;j;k 2 dom (lk ) for each k 2 [p + 1::q]. The template histogram rep shows that ci;j is the number of tuples t in R satisfying lh (t(A )) = v and I lk (t(A )) = v both I( h i;h k i;j;k for each h 2 [1::p]; k 2 [p + 1::q]. Ah ) (Ak )
3 Applications
3.1 Database Design Consider the scheme
(Item : item)(Day : day)(Store : store)(Price : cent): (3) A tuple f(Item : u); (Day : v); (Store : w); (Price : x)g expresses that the price of item u on day v in store w amounted to x cents. Assume the price of an item does not change within a month and store chain, as described by the RUD: (Item : item)(Day : month)(Store : chain) ! (Price : cent): For example, the price of the item \Lego System 6115" in all stores of the Toysrus chain invariably was 212 cents during January, 1997. Then a relation over scheme 3 will store a lot of redundant prices: whenever two tuples that agree on Item roll up to the same month and chain, they must necessarily agree on Price . The data redundancy can be avoided by decomposing scheme 3 into the schemes: (Item : item)(Day : month)(Store : chain)(Price : cent) and (4) (Item : item)(Day : day)(Store : store): (5) Scheme 4 is to store the price of an item for a particular month and chain. Scheme 5 is to store whether a particular item was sold in a particular store at a particular day, and can be dropped if such information is not needed. It might be the case, for example, that each store of a certain chain always sells all items distributed by the chain. Scheme 4 already stores which items are distributed by a particular chain. Database design based on RUDs is outside the scope of this paper. We believe that such theory can be built along the lines of the theory of temporal FDs (TFDs) and temporal normalization introduced by Wang et al. [11]. Loosely speaking, the major dierence between RUDs and TFDs is that RUDs support roll-up for any attribute, whereas TFDs only support roll-up for one dedicated timestamping attribute. As we have seen, data redundancy may arise in the presence of RUDs. Database design using RUDs should address the problem of decomposing a \universal" scheme into a number of smaller subschemes without loss of information so as to avoid data redundancy. Intuitively, relations over schemes 4 and 5 can store non-redundantly all data that can be contained in a relation over scheme 3. 8
qscale
Iquality
1{33 : low 34{66 : medium 67{99 : high
further partitioned by (Quality : qscale) City : city Quality : quality Paris 33 Marseille 32 Lyon 68 Bordeaux 80 Chicago 27 Detroit 26 New York 51 San Francisco 72 Tucson 73
partitioned by (City : area) City : city Quality : quality Paris 33 Marseille 32 Lyon 68 Bordeaux 80 Chicago 27 Detroit 26 New York 51 San Francisco 72 Tucson 73
R
R
Figure 4: Partitioning of R. City : area Quality : qscale { # France high { 2 low { 2 U.S.A. high { 2 medium { 1 low { 2
Figure 5: Histogram rep of (City : area) ! (Quality : qscale). members of members of }| { z }| z (Ap+1 : lp+1 ) . . . (Aq : lq ) (A1 : l1 ) . . . (Ap : lp ) v1;1 . . . v1;p v1;1;p+1 . . . v1;1;q ... ... ... v1;m1 ;p+1 . . . v1;m1 ;q ... ... ... vi;1 . . . vi;p vi;1;p+1 . . . vi;1;q ... ... ... vi;j;p+1 . . . vi;j;q ... ... ... vi;mi;p+1 . . . vi;mi ;q ... ... ... vn;1 . . . vn;p vn;1;p+1 . . . vn;1;q ... ... ... vn;mn;p+1 . . . vn;mn ;q Figure 6: Format of histogram rep of ! . 9
{ {# { c1;1 { ... { c1;m1 { { { ci;1 { ... { ci;j { ... { ci;mi { { { cn;1 { ... { cn;mn
3.2 Condensed Representations
On examination of a relation over scheme 4, we may nd out that, although cent prices change within a year, the price uctuations are small. In particular, we may nd that within a year and chain, an item is always sold at the same integral dollar price, as described by the RUD: (Item : item)(Day : year)(Store : chain) ! (Price : dollar): For certain applications, dollar prices are suciently accurate. In that case, we can replace scheme 4 by the following one: (Item : item)(Day : year)(Store : chain)(Price : dollar):
(6)
This schema transformation, unless the scheme decomposition in subsection 3.1, is not information preserving, but the loss of information is acceptable for the application in hand. For example, from a relation over the latter scheme, we may derive that the price of the item \Lego System 6115" in each store of the Toysrus chain was within 2 dollars 50 cents throughout 1997. The advantage is that relations over scheme 6 are, on average, twelve times smaller than relations over scheme 4 (as there are twelve months in a year). The signi cance of condensed representations is discussed in [9].
3.3 Data Cube Approximation
A relation over scheme 3 can be represented as a three-dimensional data cube that maps each triple of dom (item) dom (day) dom (store) to a member of dom (cent). The schema transformations discussed in subsections 3.1 and 3.2 are useful in OLAP. A relation over scheme 6 can be represented as a three-dimensional data cube that maps each triple of dom (item) dom (year) dom (chain) to a member of dom (dollar). It is clear that the second data cube is many times smaller than the rst one. The reduction factor is 365 s where s is the average number of stores in each chain. As mentioned above, the tradeo is that the second data cube is not as accurate as the rst one. In this regard, RUDs help to determine the state in which the size of the cube is much reduced and yet the loss of accuracy is small or acceptable. A histogram rep of of a RUD ! (see gure 6) can also be seen as a multidimensional data cube. Each member of is a dimension, and every -group constitutes a non-empty entry in the data cube. The template histogram rep of gure 6 is p-dimensional and contains n non-empty entries. The value found at any one entry is not a simple value|as in most OLAP applications|but an entire histogram. In gure 6, each entry has a histogram at the right-hand side of the double vertical bar (k). Importantly, if is a singleton and the RUD ! is satis ed, then each histogram reduces to a single value{count pair. The need for maintaining non-simple values at a data cube entry has already been recognized in the literature. Gray et al. [4] distinguish between distributive, algebraic, and holistic aggregation functions. The idea is as follows. S Let S1 ; : : : ; Sn be sets, let f be a set function, and let fi = f (Si ), for each i 2 [1::n]. Let S = ni=1 Si . For a distributive function f , f (S ) can be computed from the \intermediate" values f1 ; : : : ; fn . An example is SUM. For a holistic function f , the intermediate values f1 ; : : : ; fn are not sucient to compute f (S )|we need all members of S to compute f (S ) instead. An example is a function that computes the median of a set of numbers. The execution of a roll-up operation in a data cube results in values being grouped together. Typically several roll-up operations are executed one after another (e.g., rst from day to month, and then from month to year). If an OLAP cube has to compute a holistic function for each group at each roll-up, then each entry must keep track of all basic values that have so far been grouped in it|it is not sucient to store just a single function value. For a distributive function, on the other hand, it is sucient to store a single 10
function value per entry. To conclude, holistic functions require the maintenance of non-simple values (for example, sets) at data cube entries. It should be noticed here that nearly all research in OLAP has focused on distributive functions.
4 Axiomatization
De nition 4 introduces two notions of logical implication, denoted j=I and j=. The latter notion is the stronger one, as it is independent of a particular roll-up instantiation. In this section, we provide a sound and complete axiomatization of j=. Such an axiomatization is an interesting result, as it captures concisely the essential properties of RUDs. The distinction between j=I and j= is further illustrated by the following example.
Example 6 Assume the level academicyear to represent academic years; suppose each academic year starts on October, 1 and ends on September, 30. We have year k academicyear, and also week k academicyear because an academic year can start in the middle of a week. See gure 1.
Assume the scheme = (Day : day)(Price : cent). Consider the following set of RUDs = f (Day : year) ! (Price : cent); (Day : academicyear) ! (Price : cent) g and the RUD
= (Day : week) ! (Price : cent): The RUDs of state that two tuples over whose Day-values roll up to the same civil year or to the same academic year, must agree on Price . For a \legal" roll-up instantiation I , we would have j=I because two days of the same week cannot belong to two distinct civil years and to two distinct academic years at the same time. Note, however, 6j= since it is possible to nd a relation R over and a roll-up instantiation I such that R j=I and R 6j=I . One can easily verify that any such roll-up instantiation I must contain two distinct constants v; w 2 dom (day) that roll up to the same week, to distinct years, and to distinct academic years. Such roll-up instantiation exists, even though it does not correspond to any conventional calendar. In subsection 3.1, we indicated an important dierence between RUDs and TFDs [11]: RUDs support roll-up for any attribute, whereas TFDs only support time granularity for a special timestamping attribute. Another important dierence between both concepts lies in the treatment of time. In this study, we are merely interested in the RUDs that are implied by a set of RUDs irrespective of the actual roll-up instantiation. No special semantics is attached to time; \temporal" levels are treated in exactly the same way as any other level. Another research direction would be to study logical implication of RUDs for particular roll-up instantiations. That is, one could impose special constraints on I and study j=I . Such constraints could capture special time semantics, such as linearity. The treatment of TFDs in [11] corresponds to the latter approach. 2
4.1 The Axioms
The axioms for RUDs are as follows.
De nition 6 Let be a set of RUDs over the scheme . Let ; ; be patterns over . We de ne: A1 If then ! . 11
A2 If ! then ^ ! ^ . 3 A3 If ! and ! then ! . Let be a RUD over . We write ` if can be proved from using A1, A2, and A3. 2 A1, A2, and A3 are generalizations of Armstrong's axioms for FDs [1]. We have the following properties.
Lemma 2 Let ; be patterns over the scheme . Let I be a roll-up instantiation. Let s; t be tuples over .
1. If and I (s; t) then I (s; t).
2. ( ^ )I (s; t) iff I (s; t) and I (s; t).
Property 1. Assume and I (s; t). Let (A : l) 2 . Hence there exists some (0 A : l0 ) 2 l (s(A)) = I l0 (I l (s(A))). such that (A) l0 l. By the de nition of roll-up instantiation, I( l (A) A) 0 0 l (s(A)) = I l (t(A)). Hence, I l (s(A)) = I l0 (I l0 (t(A))) = I l (t(A)). Since I (s; t), we have I( l (A) (A) (A) (A) A) Since (A : l) is an arbitrary member of , it follows that I (s; t). Property 2 ) follows from property 1 and the fact that ^ and ^ . For property 2 (, note that if (A : l) 2 ^ then (A : l) 2 or (A : l) 2 . 2
Proof.
The following theorem states that the axiomatization of de nition 6 is sound.
Theorem 2 Soundness. Let be a set of RUDs, and let be a RUD (all over the same scheme). If ` then j= . Simple induction on the assumed derivation of from . Correctness of A1 and A2 follows immediately from lemma 2. Correctness of A3 is easy to prove. 2 Proof.
4.2 Completeness
We are now going to prove that the axiomatization of de nition 6 is complete. As is usually the case, the completeness proof is much harder than the soundness proof. It relies on the following construct, which extends the classical notion of \closure of a set of attributes w.r.t. a set of FDs".
De nition 7 Let be a scheme. Let be a set of RUDs over . Let be a pattern over . We de ne:
2
( ; )+ = bf(A : l) j A 2 atts () and (A) l and ` ! (A : l)gc
Lemma 3 Let be a set of RUDs over the scheme . Let ; ; be patterns over . The following rule is sound:
A4 If ! and ! then ! ^ .
3 Recall that ^ denotes the in mum of and in the roll-up lattice (3 ; ).
12
Assume ` ! and ` ! . By A2, ` ! ^ and ` ^ ! ^ . By A3, ! ^ . 2 Proof.
Lemma 4 Let be a scheme. Let be a set of RUDs over . Let ; be patterns over . ` ! iff ( ; )+ . ) : Assume ` ! . Let (A : l) 2 . Since (A : l) is obvious, we have ` ! (A : l) by A1. By A3, ` ! (A : l). Hence, ( ; )+ contains (A : l0 ) for some level l0 l. Since (A : l) is an arbitrary member of , it is correct to conclude that ( ; )+ . ( : Assume ( ; )+ . Let (A : l) 2 . Hence, ( ; )+ contains (A : l0 ) for some level l0 l|i.e., ` ! (A : l0 ). Since (A : l0 ) (A : l), we have ` (A : l0 ) ! (A : l) by A1. By A3, ` ! (A : l). Since (A : l) is an arbitrary member of , we have ` ! (A : l) for each (A : l) 2 . By repeated application of A4, ` ! . This concludes the proof. 2 Proof.
Theorem 3 Completeness. Let be a set of RUDs, and let be a RUD (all over the same scheme). If j= then ` . Let be a scheme. Let be a set of RUDs over , and let ! be a RUD over such that 6` ! . It suces to show 6j= ! |i.e., there exists a relation R over satisfying and falsifying ! . Choose tuples s; t over , and roll-up instantiation I such that for each A 2 atts (), l (s(A)) = I l (t(A)) iff for each level l with (A) l, the following condition is satis ed: I( A) (A) (; )+ (A : l).
Proof.
The construction of s; t; and I can proceed as follows. 1. For every level l, choose a constant, called l0 , from dom (l). For every level l, for every level l0 l, Ill0 (l00 ) = l0 . 2. For every attribute A 2 atts (), (a) for every level l, if there does not exist a constant (A : l0 ) 2 (; )+ with l0 l, then choose a constant, called lA , from dom (l). Ensure that lA 6= l0 , and that A 6= B implies lA 6= lB . (b) for every level l, for every level l0 l for which l0 A is de ned, let ( A lA is de ned l 0A Il0 (l ) = ll0 ifotherwise (c) let s(A) = (A)0 and
(
(A)A if (A)A is de ned t(A) = ( A)0 otherwise The construction assumes that for every level l, jdom (l)j jj + 1. 4 We show that s 6= t. Assume s = t. Hence, for every A 2 atts (), (A)A is unde ned. Hence, for every A 2 atts (), there exists a constant (A : l0 ) 2 (; )+ with l0 (A). Hence, (; )+ . We have since is a pattern 4 We use jS j to denote the cardinality of a set S .
13
over . Hence, (; )+ by transitivity. Consequently, ` ! by lemma 4, a contradiction. We conclude by contradiction that s 6= t. l (s(A)) = I l (t(A)). Let be a pattern over . By de nition, I (s; t) iff for every (A : l) 2 , I( A) (A) So by the construction, I (s; t) iff for every (A : l) 2 , (; )+ (A : l). Consequently, the construction ensures that I (s; t) iff (; )+ .
Let R = fs; tg. We show that R is the desired counterexample relation. We have to show two things: First, that R satis es ; and second, that R falsi es ! . Suppose R falsi es ! of |i.e.,
I (s; t) and not I (s; t). Hence, (; )+ and (; )+ 6 . By lemma 4, ` ! . By A3, ` ! . By lemma 4, (; )+ , a contradiction. We conclude by contradiction that R satis es . We nally show that R falsi es ! . Suppose R satis es ! . By A1, ` ! . Hence, (; )+ by lemma 4. Hence, I (s; t). Since R satis es ! , I (s; t). Hence, (; )+ . Hence, ` ! by lemma 4, a contradiction. We conclude by contradiction that R falsi es ! . This concludes the proof. 2
5 Relaxed Notions of Satisfaction The notion of satisfaction that has so far been used is an all-or-nothing concept: A RUD is either satis ed or falsi ed by a relation; there is no third possibility. In data mining, one is typically not only interested in rules that are fully satis ed, but also in rules that are highly satis ed. In this section, we introduce a gradual notion of satisfaction by adapting the classical notions of support and con dence [3]. In this section and the next one, a roll-up instantiation I is implicitly understood.
5.1 Support and Con dence
Let R be a relation over the scheme . Let ! be a RUD over . The con dence of ! is the conditional probability that two tuples, which are randomly selected from R without replacement, satisfy given they already satisfy . The support of ! is the probability that two tuples, which are randomly selected from R without replacement, satisfy . Note: The con dence is a conditional probability, and the support is the probability that the conditioning event occurs.
De nition 8 In this de nition, a roll-up instantiation I is implicitly understood. Let R be a relation
over the scheme . We de ne:
pairs (R) = fS R j S is a pair g:
Let ! be a RUD over the scheme . Let l be the number of members of pairs (R) satisfying . Let b be the number of members of pairs (R) satisfying both and . Let p = jpairs (R)j. Let ( ( if l 6= 0 l=p if p = 6 0 s = 0 otherwise and c = 0b=l otherwise Then ! is said to be satis ed by R with support s and con dence c, denoted support ( ! ; R) = s, and conf ( ! ; R) = c. 14
2 There are two main dierences between our notions of support and con dence and the notions with the same name used in association mining: Our support and con dence refer to pairs of tuples, rather than single tuples. This is because only pairs of tuples can give evidence against a RUD. Contrast this with association rules where a single transaction can give evidence for or against a rule. In the de nition of support, we only consider left-hand sides. In this way, support and con dence are independent|the support can be smaller than or greater than the con dence. In the association rule literature, the support is de ned as the fraction of transactions satisfying both the left-hand and the right-hand side of a rule, and the support cannot exceed the con dence. Note that in our framework, multiplying support and con dence gives the fraction of tuple pairs that satisfy both the left-hand and the right-hand pattern of the RUD under consideration. We like our notion of support because it has a natural characteristic: The con dence is a conditional probability; the probability that the conditioning event occurs, is the support. There is a direct translation of the above de nition in terms of the grouping construct introduced in subsection 2.3: conf ( ! ; R) is the conditional probability that two tuples of R satisfy given they belong to the same -group of R. support ( ! ; R) is the probability that two tuples of R belong to the same -group of R. The support and the con dence of ! can be computed from the histogram rep of ! . It can be easily veri ed that the practical, mathematical formulas are as follows. ! P jGj G2G (R; ) 2 ! support ( ! ; R) =
jRj 2
P conf ( ! ; R) =
G2G (R; )
P
!
H 2G (G;)
!
P
G2G (R; )
jGj 2
jH j 2
! i where 2 = 0 if i < 2 and a division by zero yields zero. In terms of the template histogram rep of gure 6, we have:
support ( ! ; R) =
Pn
ci i=1 2 ! jRj
!
2 ! Pn Pmi ci;j i=1 j =1 2 ! conf ( ! ; R) = Pn ci i=1 2 15
where
ci =
mi X j =1
ci;j for each i 2 [1::n]:
5.2 Average Con dence per Group
The con dence measure discussed so far gives equal weight to all tuple pairs. Since larger -groups contain more tuple pairs, this means that larger groups contribute more to the overall con dence than smaller groups. Thus, larger groups may dominate smaller groups in the overall con dence. For the example given in gure 4, there may be 10 times more U.S.A. tuples than Belgian tuples. Under the current con dence measure, the RUD may become more a re ection of the situation in the U.S.A. than of a global one. In certain applications, it is desirable that each group be assigned equal weight, independent of its size. To accommodate these applications, we de ne the average con dence per group, which treats all groups equally, so that larger groups cannot dominate smaller groups.
De nition 9 Let R be a relation over the scheme . Let = ! be a RUD over . We de ne: P
conf (; G) avgconf (; R) = G2G (R; ) jG (R; )j
2 Example 7 Consider the relation R of gure 4 and the RUD (City : area) ! (Quality : qscale)
(call it ):
The histogram rep is shown in gure 5. Then: conf (; R) = 1=4 support (; R) = 4=9 avgconf (; R) = 1=3 + 1=5 = 4=15 2
Note that all these gures can be computed from the histogram rep of gure 5. Since all values are low, we cannot conclude from this that cities in the same area tend to score equally well on the quality-of-life scale. 2
6 Mining RUDs In this section, we are going to study the problem of discovering RUDs from a given database relation. This problem is interesting and relevant. As mentioned in section 3, discovered RUDs can give us important indications about the bene t of certain roll-up operations. This section contains three subsections. Subsection 6.1 de nes the decision problem we are interested in. Subsection 6.2 studies its computational complexity. Finally, subsection 6.3 outlines an algorithm for mining RUDs.
16
6.1 The Problem
We start with an easy-to-formulate decision problem, called RUDMINE(D). The problem can be illustrated as follows. Assume a relation R over the scheme (Item : item)(Day : day)(Store : store)(Price : cent): An analyst wants to nd out whether there are temporal or spatial patterns governing the price. He decides that prices in integral dollars are suciently accurate to begin with. That is, he is looking for \strong" RUDs of the form ! (Price : dollar) where is any pattern that does not contain the attribute Price . An example is the RUD (Item : item)(Day : year)(Store : area) ! (Price : dollar); expressing that the dollar price of an item does not vary within a year and area. By a strong RUD, we mean a RUD of which the support and the con dence exceed speci ed threshold values. It is not clear a priori whether any strong RUD can be found.
De nition 10 RUDMINE(D) is de ned relative to a roll-up instantiation I . A RUDMINE(D) problem is a quintet (; ; ts ; tc ; R) where is a scheme, is a pattern over , ts ; tc 2 (0; 1), and 5 R is a relation over .
The answer to the RUDMINE(D) problem (; ; ts ; tc ; R) is \yes" if there exists a pattern over such that: support ( ! ; R) ts , conf ( ! ; R) tc , and atts ( ) \ atts () = fg. Otherwise the answer is \no". 2 So RUDMINE(D) asks whether certain speci ed support and con dence thresholds can be attained by a RUD with a xed right-hand pattern. We note that whenever 0 , then the support of ! 0 is equal to the support of ! , and the con dence of ! 0 is greater than or equal to the con dence of ! . So if 0 then ! 0 is stronger than ! . However, the pattern 0 may be too general to be practically useful. In the extreme case, 0 equals the top element fg of the roll-up lattice, in which case 0 I (s; t) is obviously true for any tuples s and t, and ! 0 is a trivial RUD|i.e., it cannot possibly be falsi ed. That is why it is reasonable to x the right-hand pattern in a RUDMINE(D) problem. 5 (0; 1) denotes the set of reals between 0 and 1.
17
6.2 Computational Complexity
We are now going to show that RUDMINE(D) is NP-hard.
Theorem 4 RUDMINE(D) is NP-hard. Proof.
formula:
We are going to prove that 3SAT can be reduced to RUDMINE(D). Consider the propositional ^ = i1 _ i2 _ i3 i=1::m
where each ij is either a variable or the negation of one. Let X = fx1 ; x2 ; : : : ; xv g be the smallest set containing each variable appearing in . We describe the reduction next. We assume that L = fnumeric; evenodd; oddeveng with the order and roll-up instantiation shown in gure 7. Let = fxi : numericgi2[1::v+1] ; where xv+1 62 X . Let = fxv+1 : numericg. For each i 2 [1::v], let ri0 be a tuple over such that ri0 (xi ) = 1 and ri0 (xj ) = 0 for each j 2 [1::v + 1] with j 6= i. For each i 2 [1::v], let ri1 be a tuple over such that ri1 (xi ) = 2 and ri0 (xj ) = 0 for each j 2 [1::v + 1] with j 6= i. For each i 2 [1::v], let si0 be a tuple over such that si0 (xi ) = 4, and si0 (xv+1 ) = 1, and si0 (xj ) = 0 for each j 2 [1::v] with j 6= i. For each i 2 [1::v], let si1 be a tuple over such that si1 (xi ) = 5, and si1 (xv+1 ) = 1, and si1 (xj ) = 0 for each j 2 [1::v] with j 6= i. For the ith term of (i 2 [1::m]), we de ne two tuples (call them ti0 and ti1 ) satisfying for each j 2 [1::v], if the ith term of does not contain xj then ti0(xj ) = ti1 (xj ) = 0; if the ith term of contains xj non-negated then ti0(xj ) = 4(i + 1) and ti1(xj ) = ti0 (xj ) + 1; if the ith term of contains xj negated then ti0 (xj ) = 4(i + 1) and ti1(xj ) = ti0(xj ) ? 1. Furthermore, for each i 2 [1::m], ti0 (xv+1 ) = 0 and ti1 (xv+1 ) = 1. No term of contains both x non-negated and the negation of x, is assumed without loss of generality. Let R be the smallest relation containing ri0 ; ri1 ; si0 ; si1 ; tj 0 ; tj 1 for each i 2 [1::v], j 2 [1::m]. The construction is illustrated in gure 8. Let d be the cardinality of pairs (R). Clearly, d > 0. Let
v d
ts = and tc = 1:
We claim that is a reduction from 3SAT to RUDMINE(D). To prove our claim, we have to establish two things: (1) that any formula has a satisfying truth assignment if and only if the answer to the RUDMINE(D) problem (; ; ts ; tc ; R) is \yes"; and (2) that can be computed in space log m. 18
HHHoddeven
evenodd
numeric
dom (numeric) = f0; 1; 2; : : : g dom (evenodd) = f[0::0]; [1::1]; [2::3]; [4::5]; : : : g dom (oddeven) = f[0::0]; [1::2]; [3::4]; [5::6]; : : : g evenodd Inumeric (i) = [a::b] iff a i b oddeven Inumeric (i) = [a::b] iff a i b
Figure 7: Diagram of (L; ) and roll-up instantiation.
10 r11 r20 r21 r
v0 rv 1 s10 s11 s20 s21 r
x
1
1 2 0 0 0 0 4 5 0 0
x
2
0 0 1 2 0 0 0 0 4 5
v0 0 0 sv 1 0 0 t10 8 8 9 7 t11 t20 12 12 t21 13 13 s
3
x
0 0 0 0 0 0 0 0 0 0
x
4
0 0 0 0
x
5
0 0 0 0
0 0 0 0 0 0
0 0 0 0 0 0
0 0 0 0 8 0 9 0 0 12 0 11
0 0 0 0 0 0
::: ::: ::: ::: ::: ::: ::: ::: ::: ::: ::: ::: ::: ::: ::: ::: ::: ::: :::
x
v
0 0 0 0
v+1
x
0 0 0 0
1 2 0 0 0 0
0 0 1 1 1 1
4 5 0 0 0 0
1 1 0 the rst term of is 1 1_: 2_ 3 0 the second term of is 1 1_ 2_: 4 x
x
x
x
:::
Figure 8: Example construction of R.
19
x
x
Assume that some RUD ! over with xv+1 62 atts ( ) is satis ed by R with support ts and con dence tc . We are going to show that has a satisfying truth assignment. Let support ( ! ; R) = s ts and conf ( ! ; R) = c tc . We consider several possibilities for . Let i 2 [1::v]. Assume x 6= xi for all (x : l) 2 |i.e., the attribute xi does not appear in . Then ri0 and si0 belong to the same -group, and consequently, c < tc , a contradiction. We conclude by contradiction that must contain at least one of (xi : numeric), (xi : evenodd), or (xi : oddeven), for every i 2 [1::v]. Consequently, one can easily check that the only possible -groups are: fri0 ; ri1g, for each i 2 [1::v], fsi0; si1 g, for each i 2 [1::v], fti0; ti1 g, for each i 2 [1::m]. Moreover, if ti0 and ti1 belong to the same -group, then c < tc , a contradiction. Consequently, the only possible -groups are: fri0 ; ri1g, for each i 2 [1::v], fsi0; si1 g, for each i 2 [1::v]. We also have that if fri0 ; ri1 g is a -group, then fsi0 ; si1 g is not a -group, and vice versa. Then to have s ts , must contain either (xi : evenodd) or (xi : oddeven), but not both, for each i 2 [1::v]. Let B be a truth assignment to the variables of X satisfying (i 2 [1::v]): ( true if contains (xi : oddeven) B (xi ) = false if contains (x : evenodd) i
c tc implies that B is a truth assignment satisfying . For example, in gure 8, t10 and t11 are derived from x1 _ :x2 _ x3 . The condition c tc implies that t10 and t11 do not belong to the same -group. This in turn implies that contains (x1 : oddeven) or (x2 : evenodd) or (x3 : oddeven). Conversely, it can now be easily seen that if has a satisfying truth assignment, then some RUD ! over with xv+1 62 atts ( ) is satis ed with support ts and con dence tc . To see that can be computed in log m space, note that () can be written directly from . This concludes the proof. 2
Theorem 4 gives us a lower bound for the complexity of RUDMINE(D). The question arises whether RUDMINE(D) is in NP. Consider the RUDMINE(D) problem (; ; ts ; tc ; R). A non-deterministic polynomial algorithm can guess a pattern over with atts ( ) \ atts () = fg, compute the support and the con dence of ! , and determine whether these values exceed the threshold values. The support and the con dence can be computed from the histogram rep of ! . The histogram rep can be constructed as follows. Let = f(Ai : li )gi2[1::p] and = f(Ai : li )gi2[p+1::q] where 0 p q. See also gure 6. Hence, [ = f(Ai : li )gi2[1::q] . For every tuple t in R, we generate a set li (t(A)) for each i 2 [1::q]. Note that t~ is not necessarily a t~ = f(Ai : vi)gi2[1::q] such that vi = I( Ai ) tuple, as the same attribute can appear several times in t~. Note also that we can have two distinct tuples t1 and t2 in R such that t~1 = t~2 . Let R~ be a multiset containing t~ for each t 2 R. Hence, jR~ j = jRj. We order R~ by A1; : : : ; Aq . In a single pass of this ordered R~ , we can construct the histogram rep of ! , and compute the support and the con dence. The procedure for computing the histogram rep is illustrated by gure 9. Let us now return to the non-deterministic polynomial 20
R
City : city Quality : quality Lyon 68 New York 51 Bordeaux 80 Chicago 27 Tucson 73 Paris 33
1 2 t3 t4 t5 t6 t t
~ City : area France U.S.A. France U.S.A. U.S.A. France
City : csize Quality : quality large high huge medium large high huge low large high huge low
~ ordered City : area France France France U.S.A. U.S.A. U.S.A.
City : csize Quality : quality large high large high huge low large high huge low huge medium
City : area France France U.S.A. U.S.A.
City : csize large huge large huge
R
R
~1 ~2 ~3 ~4 ~5 ~6
t t t t t t
~1 ~3 ~6 ~5 ~4 ~2
t t t t t t
Quality : qscale { # high { 2 low { 1 high { 1 low { 1 medium { 1
Figure 9: Computing the histogram rep of (City : area)(City : csize) ! (Quality : qscale).
21
algorithm described at the beginning of this paragraph. It is clear that RUDMINE(D) is in NP if the histogram rep can be constructed in polynomial time. It is also clear that once we have computed R~ , the histogram rep can be derived from it in polynomial time. So the remaining question is whether R~ , and hence each t~, can be computed in polynomial time. This, of course, depends on the complexity of the roll-up functions Ill12 for each l1 ; l2 with l1 l2 . So far we have said nothing about the computational complexity of rolling up a constant. For practical applications, we can assume that every function Ill12 can be computed in polynomial time; under this assumption, RUDMINE(D) is in NP.
6.3 Towards an Algorithm 6.3.1 Overall Approach
The problem we are interested in extends RUDMINE(D) by asking to nd each RUD that is satis ed with support and con dence above the speci ed threshold values|rather than merely asking whether such RUD exists.
De nition 11 A RUDMINE problem is a quintet (; ; ts ; tc ; R) where ; ; ts ; tc ; and R are as in de nition 10.
The answer to the RUDMINE problem (; ; ts ; tc ; R) is the set of patterns over containing whenever support ( ! ; R) ts , conf ( ! ; R) tc , and atts ( ) \ atts () = fg.
2 Clearly, RUDMINE is at least as hard as RUDMINE(D). Consequently, theorem 4 tells us that RUDMINE has exponential complexity. Assume we are given a RUDMINE problem (; ; ts ; tc ; R). Let be a pattern over . Our algorithm relies on the observation that whenever , then the histogram rep of ! contains all data that is needed to construct the histogram rep of ! . See also gure 10 and example 8. The relation R resides on secondary storage and, presumably, does not t into main memory. Nevertheless, the histogram rep of any RUD ! can be considerably smaller than R and can t into main memory. If so, we can build the histogram rep of ! in a single pass of the relation R. From this histogram rep we can compute the histogram rep of ! for any pattern with . The above discussion leads to the overall approach shown in gure 11.
Example 8 Let R be a relation over the scheme (City : city)(Quality : quality). Figure 10 shows the histogram reps of ! (Quality : qscale) for each pattern over (City : city). It is clear, for example, that the histogram rep of (City : area) ! (Quality : qscale) can be computed from the histogram rep of (City : area)(City : csize) ! (Quality : qscale). The RUD (City : csize) !
(Quality : qscale) is satis ed with a high support and con dence. This RUD states that two cities of the same size tend to have the same quality of life. Remark: It does not say that the quality of life is higher in smaller cities. Note incidentally that since R satis es the FD City ! Quality , every (City : city)-group is a singleton, which is clear from the bottom most histogram rep in gure 10. 2
22
support: 1, con dence: 0.55 Quality : qscale { # high { 70 medium { 25 low { 5
HHH H
support: 0.68, con dence: 0.55 City : area Quality : qscale { # U.S.A. high { 56 medium { 20 low { 4 France high { 14 medium { 5 low { 1
City : csize huge large
small
HH HH
support: 0.67, con dence: 0.78 Quality : qscale { # low { 3 medium { 15 low { 2 high { 70 medium { 10
support: 0.45, con dence: 0.78 City : area City : csize Quality : qscale { # U.S.A. huge low { 3 U.S.A. large medium { 12 low { 1 U.S.A. small high { 56 medium { 8 France large medium { 3 low { 1 France small high { 14 medium { 2 support: 0, con dence: 0 City : city Quality : qscale { # Bordeaux high { 1 Chicago low { 1 ... ...
Figure 10: Histogram reps of ! (Quality : qscale) for each pattern over (City : city). Support and con dence of ! (Quality : qscale) are indicated on top of each histogram rep. The lines between histogram reps re ect the lattice ((City : city)3 ; ), which is shown in gure 2.
23
0. 1. 2. 3. 4. 5. 6. 7. 8. 9. 10. 11. 12. 13. 14. 15. 16. 17. 18.
% Algorithmic framework for solving a RUDMINE problem ( ts tc ). % is a set of patterns containing whenever % the support and the con dence of ! have been computed. := fg := the set containing each pattern of 3 that has no attributes in common with while 6= loop nd some 2 n such that the histogram rep of ! ts into main memory if no such pattern can be found then exit loop ;
;
;
;R
U
U P
P
U
P
U
else
construct the histogram rep of ! into main memory for each 2 n with loop compute the support and the con dence of ! output if support and con dence exceed threshold values add to P
U
U
end-loop
end-if end-loop
process patterns remaining in n P
U
Figure 11: Algorithmic framework showing the overall approach. The overall approach can be improved in several ways. Note that line 10 in gure 11 implies a pass of the relation R. One improvement is to construct more than one histogram rep into main memory in a single pass of R. Then in line 6, instead of looking for a single pattern , we have to look for sets of patterns whose corresponding histogram reps t together into main memory. Another improvement is based on the fact that implies that the support of ! is less than or equal to the support of ! . This fact can be used for pruning away patterns: if the support of ! is below the threshold support, then so is the support of ! for any . On the other hand, if then conf ( ! ; R) can be less than or greater than conf ( ! ; R); so the con dence cannot be used for pruning. These improvements are implemented by the algorithm in gure 12. That algorithm uses constructs which are de ned next. De nition 12 The following constructs are de ned relative to a scheme . Let be a pattern over . Let N be a set of patterns over . We de ne: " = f 2 3 j g # = f 2 3 j g " N = f 2 3 j (9 2 N ) g # N = f 2 3 j (9 2 N ) g N is an up-set if " N = N . N is a down-set if # N = N . 2 The preamble of the algorithm in gure 12 explains the symbols used. The set U contains a pattern if the support and the con dence of ! have been computed. The set D contains a pattern if the support of ! is below the threshold support. D is a down-set, and at each resumption of the outer loop, U is an up-set. UD is a shorthand for U [ D. msize is the amount of main memory available. Line 17 looks for a number of histogram reps that t together into main memory. Lines 26 and 27 prune away patterns because of lack of support. This discussion leads to the following issues that need to be addressed: 24
0. 1. 2. 3. 4. 5. 6. 7. 8. 9. 10. 11. 12. 13. 14. 15. 16. 17. 18. 19. 20. 21. 22. 23. 24. 25. 26. 27. 28. 29. 30. 31. 32. 33.
% Algorithmic framework for solving a RUDMINE problem ( ts tc ). % R is the solution set containing whenever % the support and the con dence of ! exceed the thresholds ts and tc . % is a set of patterns containing whenever % the support and the con dence of ! have been computed. % is a set of patterns containing whenever % the support of ! is known to be below the threshold support ts . % is a shorthand for [ . % msize is the main memory size. R = fg = fg := the set containing each pattern of 3 satisfying atts ( ) \ atts () = fg = fg while 6= loop for each 2 n , compute an upper bound, denoted hsize ( ), for the size of the histogram rep of ! exit loop if hsize ( ) msize for each 2 nP determine n such that 6= fg and 2N hsize ( ) msize construct in main memory the histogram rep of each RUD in f ! j 2 g NUP :=" n while NUP 6= fg loop choose a pattern (call it ) of NUP NUP := NUP n f g := [ f g compute = support ( ! ) and = conf ( ! ) if ts then NUP := NUP n # := [ # elsif tc then R := R [ f g ;
;
;
;R
U
D
UD
U
D
U P
D
P
UD
P
UD
>
N
P
P
UD
UD
N
N
N
U
U
U
s
;R
c
;R
s