
An Axiomatic Approach to Defining Approximation Measures for Functional Dependencies

Chris Giannella⋆
Computer Science Department, Indiana University, Bloomington, IN 47405, USA
[email protected]

Abstract. We consider the problem of defining an approximation measure for functional dependencies (FDs). An approximation measure for X → Y is a function mapping relation instances, r, to non-negative real numbers. The number to which r is mapped, intuitively, describes the “degree” to which the dependency X → Y holds in r. We develop a set of axioms for measures based on the following intuition: the degree to which X → Y is approximate in r is the degree to which r determines a function from Π_X(r) to Π_Y(r). The axioms apply to measures that depend only on frequencies (i.e. the frequency of x ∈ Π_X(r) is the number of tuples containing x divided by the total number of tuples). We prove that a unique measure satisfies these axioms (up to a constant multiple), namely, the information dependency measure of [5]. We do not argue that this result implies that the only reasonable, frequency-based measure is the information dependency measure. However, if an application designer decides to use another measure, then the designer must accept that the measure used violates one of the axioms.

1 Introduction

In the last ten years there has been growing interest in the problem of discovering functional dependencies (FDs) that hold in a given relation instance (table), r [11,12,13,15,17,19,22]. The primary motivation for this work lies in knowledge discovery in databases (KDD). FDs represent potentially novel and interesting patterns present in r, and their discovery provides valuable knowledge of the “structure” of r. Unlike FD researchers in the 1970s, we are interested in FDs that hold in a given instance of the schema rather than FDs that are pre-defined to hold in any instance of the schema. The FDs of interest to us are instance-based: they represent structural properties that a given instance of the schema satisfies, rather than properties that any instance of the schema must satisfy to be considered valid. As such, our primary motivation is not database design but KDD. In some cases an FD may “almost” hold (e.g. first name → gender [11]). These are approximate functional dependencies (AFDs).

⋆ Work supported by National Science Foundation grant IIS-0082407.

Y. Manolopoulos and P. Návrat (Eds.): ADBIS 2002, LNCS 2435, pp. 37–50, 2002. © Springer-Verlag Berlin Heidelberg 2002


Approximate functional dependencies also represent interesting patterns contained in r, and their discovery can be valuable to domain experts. For example, paraphrasing from [11], page 100: an AFD in a table of chemical compounds relating various structural attributes to carcinogenicity could provide valuable hints to biochemists about potential causes of cancer (but cannot be taken as fact without further analysis by domain specialists).

Before algorithms for discovering AFDs can be developed, an approximation measure must be defined. Choosing the “best” measure is a difficult task because the decision is partly subjective: intuition developed from background knowledge must be taken into account. As such, efforts to define a measure should isolate the properties that the measure must satisfy, with assumptions from intuition taken into account in the definition of these properties; a measure is then derived from the properties.

In this paper, we develop an approximation measure following the above methodology. The intuition from which the properties are developed is the following. Given attribute sets X and Y, the degree to which X → Y is approximate in r is the degree to which r determines a function from Π_X(r) to Π_Y(r).¹ By “determines” we mean that each tuple in r is to be regarded as a data point that either supports or denies a mapping choice x ∈ Π_X(r) → y ∈ Π_Y(r). We prove that a unique measure (up to a constant multiple) satisfies these properties (regarded as axioms). The primary purpose of this paper is to develop a deeper understanding of the concept of FD approximation degree.

¹ We assume that the reader is familiar with the basic notation of relational database theory; for a review, see [21].

The paper is laid out as follows. Section 2 describes related work, emphasizing other approximation measure proposals from the literature. Section 3 gives a very general definition of approximation measures, namely, functions that map relation instances to non-negative real numbers; based on the fundamental concept of genericity in relational database theory, this definition is refined so that only attribute value counts are taken into account. Section 4 develops a set of axioms based on the intuition described earlier and proves that a unique measure (up to a constant multiple) satisfies these axioms. Finally, Section 5 gives conclusions.

2 Related Work

This section describes approximation measures that have already been developed in the literature, along with other related work. The first subsection describes previous measures based on information theoretic principles. The second subsection describes previous measures based on other principles. The third subsection describes other related works.

2.1 Information Theoretic Approaches

Nambiar [18], Malvestuto [16], and Lee [14] independently introduce the idea of applying the Shannon entropy function to measure the “information content”



of the data in the columns of an attribute set. They extend the idea to develop a measure that, given an instance r, quantifies the amount of information the columns of X contain about Y. This measure is the conditional entropy between the probability distributions associated with X and Y through frequencies (i.e. x ∈ Π_X(r) is assigned probability equal to the number of tuples containing x divided by the total number of tuples). All three authors show that this measure is non-negative and is zero exactly when X → Y is an FD. As such, an approximation measure is developed. However, the main thrust of [18], [16], and [14] was to introduce the idea of characterizing information content using entropy, not to explore the problem of defining an approximation measure for FDs.

Cavallo and Pittarelli [3] also develop an approximation measure: the conditional entropy normalized to lie between zero and one. However, the main thrust of [3] was to generalize the relational model to allow for probabilistic instances; they do not explore the problem of defining an approximation measure for FDs.

Finally, Dalkilic and Robertson [5] (also [4]) independently discover the idea of applying the Shannon entropy function to measure “information content”. They use the conditional entropy to quantify the amount of information the columns of X contain about the columns of Y in r. They call this the information dependency measure. While they explicitly mention the idea of using the information dependency as an FD approximation measure, they do not explore the idea further.

Interestingly, the authors of [3], [14], and [18] make little or no mention of the potential applicability of entropy, as a measure of information content, to knowledge discovery. This is probably because, at the time of their writing (all before 1988), KDD had not yet received the attention that it does today. Dalkilic [4], however, does explicitly mention the potential applicability to KDD.
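To make the conditional-entropy computation described above concrete, here is a minimal sketch (ours, not code from any of the cited papers); the dict-of-attributes row representation and the function name info_dep are illustrative assumptions.

```python
from collections import Counter
from math import log2

def info_dep(rows, X, Y):
    """Conditional entropy of Y given X, estimated from value frequencies.

    rows: list of dicts mapping attribute names to values (illustrative
    representation); X, Y: lists of attribute names.
    """
    n = len(rows)
    c_x = Counter(tuple(t[a] for a in X) for t in rows)
    c_xy = Counter((tuple(t[a] for a in X), tuple(t[a] for a in Y))
                   for t in rows)
    # Sum over (x, y) of f(x, y) * -log2(f(y|x)), with f taken from counts.
    return sum((c / n) * -log2(c / c_x[x]) for (x, _y), c in c_xy.items())

r = [{'A': 1, 'B': 1}, {'A': 1, 'B': 1}, {'A': 2, 'B': 3}]
print(info_dep(r, ['A'], ['B']))  # 0.0, since A -> B holds exactly in r
```

As the cited authors show, the value is zero exactly when the FD holds, which the small usage example illustrates.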

2.2 Other Approaches

Piatetsky-Shapiro [20] introduces the concept of a probabilistic data dependency, denoted pdep(X, Y), and uses it to develop an approximation measure. Given two arbitrarily chosen tuples, pdep(X, Y) is the probability that the tuples agree on Y given that they agree on X. The approximation measure developed is the same as the τ measure of Goodman and Kruskal [9]. Piatetsky-Shapiro develops a method for examining the significance of probabilistic data dependencies and, as such, touches on the issue of how the measure should be defined (see his Section 3.2). However, he does not examine the fundamental assumptions upon which the pdep measure is based.

Kivinen and Mannila [13] take a non-probabilistic approach to developing an approximation measure. They propose three measures, all based on counting the number of tuples, or pairs of tuples, that cause the dependency to break. For example, one of the proposed measures, denoted g3, is defined as

g3(r) = min{|s| : s ⊆ r and (r − s) ⊨ X → Y}/|r|.
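A small sketch (ours, not from [13]) of g3 under the usual count-based reading: for each X value, keep the tuples of its most frequent Y value and charge for the rest. The row representation is again an illustrative assumption.

```python
from collections import Counter

def g3(rows, X, Y):
    """Minimum fraction of tuples to remove so that X -> Y holds."""
    n = len(rows)
    c_xy = Counter((tuple(t[a] for a in X), tuple(t[a] for a in Y))
                   for t in rows)
    best = Counter()  # per X value, the count of its most frequent Y value
    for (x, _y), c in c_xy.items():
        best[x] = max(best[x], c)
    return 1 - sum(best.values()) / n

r = [{'A': 1, 'B': 1}, {'A': 1, 'B': 1}, {'A': 1, 'B': 2}]
print(g3(r, ['A'], ['B']))  # 1/3: remove the single (1, 2) tuple
```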


The main thrust of [13], however, is to develop an efficient algorithm that finds, with high probability, all FDs that hold in a given instance. The problem of how best to define an approximation measure is not considered.

Huhtala et al. [11] develop an algorithm, TANE, for finding all AFDs whose g3 value is no greater than some user-specified threshold. Again, though, they do not consider the problem of how best to define an approximation measure.

2.3 Other Related Works

De Bra and Paredaens [6] describe a method for horizontally decomposing a relation instance with respect to an AFD. The result is two instances: on one the AFD holds exactly, while the other contains the exceptions. The authors go on to develop a normal form with respect to their horizontal decomposition method. They do not consider the problem of defining an AFD measure.

Demetrovics, Katona, and Miklós [7] study a weaker form of functional dependency that they call partial dependency. They examine a number of combinatorial properties of partial dependencies and also investigate how certain related combinatorial structures can be realized in a database with a minimal number of tuples or columns. They do not consider the problem of defining an AFD measure.

Demetrovics et al. [8] study the average-case properties of keys and functional dependencies in random databases. They show that the worst-case exponential behavior of keys and functional dependencies (e.g. the number of minimal keys can be exponential in the number of attributes) is unlikely to occur. They do not consider the problem of defining an AFD measure.

3 FD Approximation Measures: General Definition

In this section we define FD approximation measures in very general terms in order to lay the framework for the axiomatic discussion later. Let S be some relation schema and X, Y be non-empty subsets of S. Let D be some fixed, countably infinite set that serves as a universal domain. Let I(S, D) be the set of all relation instances² over S whose active domain is contained in D. An approximation measure for X → Y over I(S, D) is a function from I(S, D) to R≥0 (the non-negative reals). Intuitively, the number to which an instance, r, is mapped describes the degree to which X → Y holds in r. In the remainder of this paper, for simplicity, we write “approximation measure” to mean “approximation measure for X → Y over I(S, D)”.

² For our purposes a relation instance could be a bag instead of a set.

3.1 Genericity

The concept of genericity in relational database theory asserts that the behavior of queries should be invariant with respect to the actual values in the database, up to equality [1].



In our setting, genericity implies that the actual values from D used in r are irrelevant for approximation measures, provided that equality is respected. More formally put: given any permutation ρ : D → D, any approximation measure should map r and ρ(r) to the same value.³ Therefore, the only information of relevance needed from r is the attribute value counts. Given x ∈ Π_X(r) and y ∈ Π_Y(r), let c(x) denote the number of tuples t ∈ r such that t[X] = x (the count of x); let c(y) denote the count of y; and let c(x, y) denote the number of tuples t where t[X] = x and t[Y] = y.

³ ρ(r) denotes the instance obtained by replacing each value a in r by ρ(a).
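As an illustration (ours), the counts can be read off an instance with a few lines of Python; the tuple encoding of r is an assumption made for the example.

```python
from collections import Counter

# Toy instance r over schema (A, B, C), with X = {A} and Y = {B}.
r = [(1, 1, 1), (1, 1, 2), (1, 2, 3), (2, 1, 4)]
c_x = Counter(t[0] for t in r)           # c(x): tuples t with t[X] = x
c_y = Counter(t[1] for t in r)           # c(y): tuples t with t[Y] = y
c_xy = Counter((t[0], t[1]) for t in r)  # c(x, y): tuples matching both
print(c_x[1], c_y[1], c_xy[(1, 1)])      # 3 3 2
# Renaming values via any permutation rho of D leaves every count unchanged,
# which is why a generic measure may depend on r only through the counts.
```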

3.2 Intuition

The degree to which X → Y is approximate in r is the degree to which r determines a function from Π_X(r) to Π_Y(r). Consider the two instances shown in Fig. 1. The one on the left has n tuples (n ≥ 2) while the one on the right has two tuples. Our hypothesis says that the degree to which A → B is approximate in each of these instances is the degree to which each determines a function from {1} to {1, 2}. We have a choice between 1 → 1 and 1 → 2. If we were to randomly draw a tuple from the instance on the right, there would be an equal likelihood of drawing (1, 1, ·) or (1, 2, ·). Hence, that instance does not provide any information to decrease our uncertainty in choosing between 1 → 1 and 1 → 2. On the other hand, if we were to randomly draw a tuple from the instance on the left, the likelihood of drawing (1, 1, ·) would be (n−1)/n and the likelihood of drawing (1, 2, ·) would be 1/n. Hence, if n is large, then the instance on the left decreases our uncertainty substantially. The tuples of the two instances can be thought of as data points supporting the choice between 1 → 1 and 1 → 2 (e.g. tuple (1, 1, 1) supports the choice 1 → 1). In the next section we unpack this intuition further by developing a set of axioms.

A B C          A B C
1 1 1          1 1 1
1 1 2          1 2 2
⋮ ⋮ ⋮
1 1 n−1
1 2 n

Fig. 1. Two instances over schema A, B, C
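The following sketch (ours) quantifies the intuition with Shannon entropy, anticipating the measure derived in Section 4; the choice of entropy as the uncertainty function here is our illustration, not yet part of the argument.

```python
from math import log2

def h(freqs):
    """Shannon entropy of a frequency vector (a stand-in for 'uncertainty')."""
    return -sum(f * log2(f) for f in freqs)

n = 10
print(h([(n - 1) / n, 1 / n]))  # ~0.47: the left instance of Fig. 1
print(h([1 / 2, 1 / 2]))        # 1.00: the right instance, no information
```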

4 Axioms

This section is divided into two subsections. In the first, the boundary case where |Π_X(r)| = 1 is considered and axioms are described that formalize our intuition. In the second, the general case is considered and one additional axiom is introduced. We close the section with a theorem stating that any approximation measure that satisfies the axioms must be equivalent to the information dependency measure [5] up to a constant multiple.


4.1 The |Π_X(r)| = 1 Case

The only information of relevance needed from r is the vector [c(y) : y ∈ Π_Y(r)],⁴ or, equivalently, the frequency vector [f(y) : y ∈ Π_Y(r)], where f(y) = c(y)/|r|. We may think of an approximation measure as a function mapping finite, non-zero, rational probability distributions into R≥0. Let Q(0,1] denote the set of rational numbers in (0,1]. Given an integer q ≥ 1, let F_q = {[f_1, ..., f_q] ∈ Q(0,1]^q : Σ_{i=1}^q f_i = 1}. Formally, we may think of an approximation measure as a function from ∪_{q≥1} F_q to R≥0. Equivalently, an approximation measure may be thought of as a family, Γ = {Γ_q | q = 1, 2, ...}, of functions Γ_q : F_q → R≥0.

⁴ The counts for the values of Π_X(r) are not needed because we are assuming, for the moment, that |Π_X(r)| = 1. In the next subsection, we drop this assumption and the counts for the values of Π_X(r) become important.

Example 1. Recall the approximation measure g3 described in Subsection 2.2. It can be seen that

g3(r) = 1 − (1/|r|) Σ_{x ∈ Π_X(r)} max{c(x, y) : y ∈ Π_Y(r)}.

In the case of |Π_X(r)| = 1, we have g3(r) = 1 − max{f(y) : y ∈ Π_Y(r)}. So, g3 can be represented as the family of functions Γ^{g3}, where Γ_q^{g3}([f_1, ..., f_q]) = 1 − max{f_j}. Consider the instance seen on the left in Fig. 1; call this instance s. We have g3(s) = Γ_2^{g3}([(n−1)/n, 1/n]) = 1 − max{(n−1)/n, 1/n} = 1/n. □

We now develop our axioms. The first axiom, called the Zero Axiom, is based on the observation that when there is only one Y value, X → Y holds. In this case, we require that the measure return zero; formally put: Γ_1([1]) = 0.

The second axiom, called the Symmetry Axiom, is based on the observation that the order in which the frequencies appear should not affect the measure. Formally stated: for all q ≥ 1 and all 1 ≤ i ≤ j ≤ q, we have Γ_q([..., f_i, ..., f_j, ...]) = Γ_q([..., f_j, ..., f_i, ...]).

The third axiom concerns the behavior of Γ on uniform frequency distributions. Consider the two instances seen in Fig. 2. The B column frequency distributions are both uniform: the instance on the left has frequencies 1/2, 1/2 while the instance on the right has frequencies 1/3, 1/3, 1/3.

A B C        A B C
1 1 1        1 1 1
1 1 2        1 1 2
1 2 3        1 1 3
1 2 4        1 2 4
             1 2 5
             1 2 6
             1 3 7
             1 3 8
             1 3 9

Fig. 2. Two instances over schema A, B, C


The degree to which A → B is approximate in the instance on the left (right) is the degree to which the instance determines a function from {1} to {1, 2} (from {1} to {1, 2, 3}). In the instance on the left, we have a choice between 1 → 1 and 1 → 2. In the instance on the right, we have a choice between 1 → 1, 1 → 2, and 1 → 3. If we were to randomly draw a tuple from either instance, each B value would have an equal likelihood of being drawn. Hence, neither instance decreases our uncertainty in making a mapping choice. However, the instance on the left has fewer mapping choices than the instance on the right (1 → 1, 2 vs. 1 → 1, 2, 3). Therefore, A → B is closer to an FD in the instance on the left than in the instance on the right. Since we assumed that an approximation measure maps an instance to zero when the FD holds, a measure should map the instance on the left to a number no larger than the instance on the right. Formalizing this intuition, we have: for all q′ ≥ q ≥ 2, Γ_{q′}([1/q′, ..., 1/q′]) ≥ Γ_q([1/q, ..., 1/q]). This is called the Monotonicity Axiom.

For the fourth axiom, denote the single X value of r as x and denote the Y values as y_1, ..., y_q (q ≥ 3). The degree to which X → Y is approximate in r is the degree of uncertainty we have in making the mapping choice x → y_1, ..., y_q. Let us group together the last two choices as G = {y_{q−1}, y_q}. The mapping choice can then be broken into two steps: (i) choose between y_1, ..., y_{q−2}, and G; (ii) choose between the elements of G, if G was chosen in the first step. The uncertainty in making the final mapping choice is the sum of the uncertainties of the choices in these two steps. Consider step (i). The uncertainty of making this choice is Γ_{q−1}([f(y_1), ..., f(y_{q−2}), f(y_{q−1}) + f(y_q)]). Consider step (ii). If one of y_1, ..., y_{q−2} was chosen in step (i), then step (ii) is not necessary (equivalently, step (ii) has zero uncertainty). If G was chosen in step (i), then an element must be chosen from G in step (ii). The uncertainty of making this choice is Γ_2([f(y_{q−1})/(f(y_{q−1}) + f(y_q)), f(y_q)/(f(y_{q−1}) + f(y_q))]). However, this choice is made with probability f(y_{q−1}) + f(y_q). Hence, the uncertainty of making the choice in step (ii) is (f(y_{q−1}) + f(y_q)) Γ_2([f(y_{q−1})/(f(y_{q−1}) + f(y_q)), f(y_q)/(f(y_{q−1}) + f(y_q))]). Our fourth axiom, called the Grouping Axiom, is: for q ≥ 3, Γ_q([f_1, ..., f_q]) = Γ_{q−1}([f_1, ..., f_{q−2}, f_{q−1} + f_q]) + (f_{q−1} + f_q) Γ_2([f_{q−1}/(f_{q−1} + f_q), f_q/(f_{q−1} + f_q)]).
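As a sanity check (ours), the entropy family Γ_q([f_1, ..., f_q]) = −Σ_j f_j log_2(f_j), which this section ultimately singles out, satisfies the Grouping Axiom exactly:

```python
from math import log2

def gamma(freqs):
    """The entropy family, used here only to check the Grouping Axiom."""
    return -sum(f * log2(f) for f in freqs)

f = [0.5, 0.3, 0.2]
g = f[1] + f[2]  # group the last two frequencies
lhs = gamma(f)
rhs = gamma([f[0], g]) + g * gamma([f[1] / g, f[2] / g])
print(abs(lhs - rhs) < 1e-12)  # True: Grouping holds for entropy
```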

4.2 The General Case

We now drop the assumption that |Π_X(r)| = 1. Consider the instance, s, seen in Fig. 3. The degree to which A → B is approximate in s is determined by the uncertainty in making the mapping choice for each A value, namely, 1 → 1, 2 and 2 → 1, 3, 4. The choice made for the A value 1 should not influence the choice for 2, and vice-versa. Hence, the approximation measure on s should be determined from the measures on s_1 := σ_{A=1}(s) and s_2 := σ_{A=2}(s). Each of these falls into the |Π_X| = 1 case. However, there are five tuples with A value 1 and only three with A value 2. So, intuitively, the measure on s_1 should contribute more to the total measure on s than the measure on s_2. Indeed, five-eighths of the tuples in s contribute to making the choice 1 → 1, 2 while only three-eighths contribute to making the choice 2 → 1, 3, 4. Hence, we assume that the measure on s is the weighted sum of the measures on s_1 and s_2, namely, (5/8)(measure on s_1) + (3/8)(measure on s_2).

A B C
1 1 1
1 1 2
1 1 3
1 2 4
1 2 5
2 1 6
2 3 7
2 4 8

Fig. 3. Instance, s, over schema A, B, C

Put in more general terms, the approximation measure for X → Y in r should be the weighted sum of the measures for each r_x, x ∈ Π_X(r). Before we can state our final axiom, we need to generalize the notation from the |Π_X| = 1 case. There, Γ_q was defined on frequency vectors [f_1, ..., f_q]. With the |Π_X| = 1 assumption dropped, we need a relative frequency vector for each x ∈ Π_X(r). Given y ∈ Π_Y(r), let f(y|x) denote the relative frequency of y with respect to x: f(y|x) = c(x, y)/c(x). The relative frequency vector associated with x is [f(y|x) : y ∈ Π_Y(σ_{X=x}(r))]. Notice that Y values that do not appear in any tuple with x are omitted from the relative frequency vector. Moreover, we also need the frequency vector for the X values, [f(x) : x ∈ Π_X(r)].

Let Π_X(r) = {x_1, ..., x_p} and q_i = |Π_Y(σ_{X=x_i}(r))|. Γ_q must be generalized to operate on the X frequency vector, [f(x_1), ..., f(x_p)], and the relative frequency vectors for Y associated with each X value, [f(y|x_i) : y ∈ Π_Y(σ_{X=x_i}(r))]. The next set of definitions makes the declaration of Γ precise. Given integers p, q_1, ..., q_p ≥ 1, let Q(0,1]^{p,q_1,...,q_p} denote Q(0,1]^p × (×_{i=1}^p Q(0,1]^{q_i}). Let F_{p,q_1,...,q_p} = {([f_1, ..., f_p], [f_{1|1}, ..., f_{q_1|1}], ..., [f_{1|p}, ..., f_{q_p|p}]) ∈ Q(0,1]^{p,q_1,...,q_p} : Σ_{j=1}^{q_i} f_{j|i} = 1 for each 1 ≤ i ≤ p, and Σ_{i=1}^p f_i = 1}. An approximation measure is a family, Γ = {Γ_{p,q_1,...,q_p} : p, q_1, ..., q_p = 1, 2, ...}, of functions Γ_{p,q_1,...,q_p} : F_{p,q_1,...,q_p} → R≥0. Our final axiom, called the Sum Axiom, is: for all p ≥ 2 and q_1, ..., q_p ≥ 1, Γ_{p,q_1,...,q_p}([f_1, ..., f_p], [f_{1|1}, ..., f_{q_1|1}], ..., [f_{1|p}, ..., f_{q_p|p}]) = Σ_{i=1}^p f_i Γ_{q_i}([f_{1|i}, ..., f_{q_i|i}]).

FD Approximation Axioms
1. Zero. Γ_1([1]) = 0.
2. Symmetry. For all q ≥ 1 and all 1 ≤ i ≤ j ≤ q, Γ_q([..., f_i, ..., f_j, ...]) = Γ_q([..., f_j, ..., f_i, ...]).
3. Monotonicity. For all q′ ≥ q ≥ 1, Γ_{q′}([1/q′, ..., 1/q′]) ≥ Γ_q([1/q, ..., 1/q]).
4. Grouping. For all q ≥ 3, Γ_q([f_1, ..., f_q]) = Γ_{q−1}([f_1, ..., f_{q−2}, f_{q−1} + f_q]) + (f_{q−1} + f_q) Γ_2([f_{q−1}/(f_{q−1} + f_q), f_q/(f_{q−1} + f_q)]).
5. Sum. For all p ≥ 2 and q_1, ..., q_p ≥ 1, Γ_{p,q_1,...,q_p}([f_1, ..., f_p], [f_{1|1}, ..., f_{q_1|1}], ..., [f_{1|p}, ..., f_{q_p|p}]) = Σ_{i=1}^p f_i Γ_{1,q_i}([f_{1|i}, ..., f_{q_i|i}]).
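The Sum Axiom is easy to read off an implementation. Below is a sketch (ours) of a general-case measure instantiated with the entropy family: the X frequency vector weights a per-x Γ applied to each relative frequency vector. The row representation and names are illustrative.

```python
from collections import Counter
from math import log2

def measure(rows, X, Y):
    """Sum-Axiom-shaped measure: sum over x of f(x) * Gamma([f(y|x) : y])."""
    n = len(rows)
    c_x = Counter(tuple(t[a] for a in X) for t in rows)
    c_xy = Counter((tuple(t[a] for a in X), tuple(t[a] for a in Y))
                   for t in rows)
    total = 0.0
    for x, cx in c_x.items():
        rel = [c / cx for (x2, _y), c in c_xy.items() if x2 == x]  # [f(y|x)]
        total += (cx / n) * -sum(f * log2(f) for f in rel)  # f(x) * Gamma(...)
    return total

# The instance s of Fig. 3: weight 5/8 on 1 -> {1, 2}, 3/8 on 2 -> {1, 3, 4}.
s = [{'A': a, 'B': b} for a, b in
     [(1, 1), (1, 1), (1, 1), (1, 2), (1, 2), (2, 1), (2, 3), (2, 4)]]
print(measure(s, ['A'], ['B']))  # (5/8)*H([3/5, 2/5]) + (3/8)*H([1/3]*3)
```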


We arrive at our main result.

Theorem 1. Assume Γ satisfies the FD Approximation Axioms. Then Γ_{p,q_1,...,q_p}([f(x_1), ..., f(x_p)], [f(y|x_1) : y ∈ Π_Y(σ_{X=x_1}(r))], ..., [f(y|x_p) : y ∈ Π_Y(σ_{X=x_p}(r))]) equals

−c Σ_{x ∈ Π_X(r)} f(x) Σ_{y ∈ Π_Y(σ_{X=x}(r))} f(y|x) log_2(f(y|x)),

where c = Γ_2([1/2, 1/2]) (c is a non-negative constant).

The information dependency measure of [5] (written H_{X→Y}) is defined as −Σ_{x ∈ Π_X(r)} f(x) Σ_{y ∈ Π_Y(σ_{X=x}(r))} f(y|x) log_2(f(y|x)). Theorem 1 thus shows that if Γ satisfies the FD Approximation Axioms, then Γ is equivalent to c·H_{X→Y}(r) for c a non-negative constant. To prove the theorem, we start by proving the result for the case |Π_X(r)| = 1. Namely, we prove the following proposition (the general result follows by the Sum Axiom).

Proposition 1. Assume Γ satisfies the FD Approximation Axioms. For all q ≥ 1, Γ_q([f_1, ..., f_q]) is of the form −Γ_2([1/2, 1/2]) Σ_{j=1}^q f_j log_2(f_j).

The case q = 1 follows directly from the Zero Axiom, so we now prove the proposition for q ≥ 2. The proof is very similar to that of Theorem 1.2.1 in [2]; however, for the sake of being self-contained, we include our proof here. We show four lemmas, the fourth of which serves as the base case of a straightforward induction proof of the proposition on q ≥ 2.

Lemma 1. For all q ≥ 2, Γ_q([1/q, ..., 1/q]) = Σ_{i=2}^q (i/q) Γ_2([1/i, (i−1)/i]).

Proof: Apply the Grouping Axiom q − 2 times. □
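Lemma 1 can be spot-checked numerically for the entropy family that Proposition 1 singles out; the check (ours) is an illustration, not part of the proof.

```python
from math import log2

def gamma(freqs):
    """The entropy family, -sum f log2 f, checked against Lemma 1."""
    return -sum(f * log2(f) for f in freqs)

q = 7
lhs = gamma([1 / q] * q)  # = log2(q) on the uniform vector
rhs = sum((i / q) * gamma([1 / i, (i - 1) / i]) for i in range(2, q + 1))
print(abs(lhs - rhs) < 1e-12)  # True: the telescoping identity of Lemma 1
```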

Lemma 2. For all q ≥ 2 and k ≥ 1, Γ_{q^k}([1/q^k, ..., 1/q^k]) = k Γ_q([1/q, ..., 1/q]).

Proof: Let q ≥ 2. We prove the desired result by induction on k ≥ 1. In the base case (k = 1), the result follows trivially. Consider now the induction case (k ≥ 2). By q − 1 applications of Grouping followed by an application of Symmetry, we have

Γ_{q^k}([1/q^k, ..., 1/q^k]) = Γ_{q^k−(q−1)}([1/q^{k−1}, 1/q^k, ..., 1/q^k]) + Σ_{i=2}^q (i/q^k) Γ_2([1/i, (i−1)/i]).   (1)

Repeating the reasoning that arrived at equation (1) q^{k−1} − 1 more times, we have

Γ_{q^k}([1/q^k, ..., 1/q^k]) = Γ_{q^k−q^{k−1}(q−1)}([1/q^{k−1}, ..., 1/q^{k−1}]) + q^{k−1} Σ_{i=2}^q (i/q^k) Γ_2([1/i, (i−1)/i])
= Γ_{q^{k−1}}([1/q^{k−1}, ..., 1/q^{k−1}]) + Σ_{i=2}^q (i/q) Γ_2([1/i, (i−1)/i]).

By Lemma 1, we have

Γ_{q^k}([1/q^k, ..., 1/q^k]) = Γ_{q^{k−1}}([1/q^{k−1}, ..., 1/q^{k−1}]) + Γ_q([1/q, ..., 1/q]).

So, by induction, we have

Γ_{q^k}([1/q^k, ..., 1/q^k]) = (k − 1) Γ_q([1/q, ..., 1/q]) + Γ_q([1/q, ..., 1/q]) = k Γ_q([1/q, ..., 1/q]). □

Lemma 3. For all q ≥ 2, Γ_q([1/q, ..., 1/q]) = Γ_2([1/2, 1/2]) log_2(q).

Proof: Let q ≥ 2. Assume Γ_q([1/q, ..., 1/q]) = 0. Then, by Lemma 1, 0 = Σ_{i=2}^q (i/q) Γ_2([1/i, (i−1)/i]). Since Γ_2 is non-negative by definition, Γ_2([1/2, 1/2]) = 0, so the desired result holds. Assume henceforth that Γ_q([1/q, ..., 1/q]) > 0. For any sufficiently large integer r ≥ 1, there exists an integer k ≥ 1 such that q^k ≤ 2^r ≤ q^{k+1}. Therefore, k/r ≤ 1/log_2(q) ≤ (k+1)/r. Moreover, by the Monotonicity Axiom, we have Γ_{q^k}(...) ≤ Γ_{2^r}(...) ≤ Γ_{q^{k+1}}(...). So, by Lemma 2, k/r ≤ Γ_2(...)/Γ_q(...) ≤ (k+1)/r. Therefore, |Γ_2(...)/Γ_q(...) − 1/log_2(q)| ≤ 1/r. Letting r → ∞, we have Γ_2(...)/Γ_q(...) = 1/log_2(q). So, Γ_q(...) = Γ_2(...) log_2(q), as desired. □

Lemma 4. For any p ∈ Q(0,1), Γ_2([p, 1−p]) = −Γ_2([1/2, 1/2])[p log_2(p) + (1−p) log_2(1−p)].

Proof: We show that for all integers s > r ≥ 1, Γ_2([r/s, 1 − r/s]) = −Γ_2([1/2, 1/2])[(r/s) log_2(r/s) + (1 − r/s) log_2(1 − r/s)]. Let s > r ≥ 1. If s = 2, then the result holds trivially, so assume s ≥ 3. By r − 1 applications of Grouping, followed by a single application of Symmetry, followed by another s − r − 1 applications of Grouping, we have

Γ_s([1/s, ..., 1/s]) = Γ_2([r/s, (s−r)/s]) + Σ_{i=2}^{s−r} (i/s) Γ_2([1/i, (i−1)/i]) + Σ_{i=2}^r (i/s) Γ_2([1/i, (i−1)/i])
= Γ_2([r/s, (s−r)/s]) + ((s−r)/s) Σ_{i=2}^{s−r} (i/(s−r)) Γ_2([1/i, (i−1)/i]) + (r/s) Σ_{i=2}^r (i/r) Γ_2([1/i, (i−1)/i]).

By Lemma 1 and Lemma 3, we have

Γ_s([1/s, ..., 1/s]) = Γ_2([r/s, (s−r)/s]) + ((s−r)/s) Γ_2([1/2, 1/2]) log_2(s−r) + (r/s) Γ_2([1/2, 1/2]) log_2(r).   (2)


By Lemma 3 again, we have

Γ_s([1/s, ..., 1/s]) = Γ_2([1/2, 1/2]) log_2(s).

From equation (2), it follows that

Γ_2([r/s, (s−r)/s]) = −Γ_2([1/2, 1/2])[((s−r)/s) log_2(s−r) + (r/s) log_2(r) − log_2(s)]
= −Γ_2(...)[((r/s) log_2(r) − (r/s) log_2(s)) + (((s−r)/s) log_2(s−r) − ((s−r)/s) log_2(s))]
= −Γ_2(...)[(r/s) log_2(r/s) + (1 − r/s) log_2(1 − r/s)]. □

Now we prove the proposition by induction on q ≥ 2. The base case, q = 2, follows directly from Lemma 4. Consider now the induction case, q ≥ 3. By Grouping, we have

Γ_q([f_1, ..., f_q]) = Γ_{q−1}([f_1, ..., f_{q−2}, f_{q−1} + f_q]) + (f_{q−1} + f_q) Γ_2([f_{q−1}/(f_{q−1} + f_q), f_q/(f_{q−1} + f_q)]).

Applying the induction assumption to both terms on the right-hand side, we get

Γ_q([f_1, ..., f_q]) = −Γ_2([1/2, 1/2])(Σ_{i=1}^{q−2} f_i log_2(f_i) + (f_{q−1} + f_q) log_2(f_{q−1} + f_q))
− (f_{q−1} + f_q) Γ_2([1/2, 1/2])((f_{q−1}/(f_{q−1} + f_q)) log_2(f_{q−1}/(f_{q−1} + f_q)) + (f_q/(f_{q−1} + f_q)) log_2(f_q/(f_{q−1} + f_q)))
= −Γ_2([1/2, 1/2])[Σ_{i=1}^{q−2} f_i log_2(f_i) + f_{q−1} log_2(f_{q−1}) + f_q log_2(f_q)]
= −Γ_2([1/2, 1/2]) Σ_{i=1}^q f_i log_2(f_i). □

Remark: All normalized measures violate one of the axioms, since the information dependency measure is unbounded (e.g. g3 does not satisfy the Grouping Axiom). We leave modifying the axioms to account for normalized approximation measures to future work.
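A concrete instance (ours) of the remark's parenthetical claim: the Γ^{g3} family from Example 1 fails Grouping.

```python
# Check that the g3 family of Example 1 violates the Grouping Axiom.
def gamma_g3(freqs):
    return 1 - max(freqs)

f = [0.5, 0.3, 0.2]
g = f[1] + f[2]  # group the last two frequencies
lhs = gamma_g3(f)                                       # 0.5
rhs = gamma_g3([f[0], g]) + g * gamma_g3([f[1] / g, f[2] / g])
print(lhs, rhs)  # 0.5 vs 0.7 -- Grouping fails for g3
```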

5 Conclusions

The primary purpose of this paper was to develop a deeper understanding of the concept of FD approximation degree. To do so, we developed a set of axioms based on the following intuition: the degree to which X → Y holds in r is the degree to which r determines a function from Π_X(r) to Π_Y(r). The axioms apply to measures that depend only on frequencies (i.e. the frequency of x ∈ Π_X(r) is c(x)/|r|). We proved that a unique measure (up to a constant multiple) satisfies the axioms, namely, the information dependency measure of [5].

Care must be taken in how the result is interpreted. We do not think it should be interpreted to imply that the information dependency measure is the only reasonable, frequency-based approximation measure. Other approximation measures may be reasonable as well. In fact, the determination of the “reasonability” of a measure is subjective (like the determination of the “interestingness” of rules in KDD). The way to interpret the result is as follows: it implies that frequency-based measures other than the information dependency measure must violate one of the FD Approximation Axioms. Hence, if a measure is needed for some application and the designers decide to use another measure, then they must accept that the measure they use violates one of the axioms.

There are two primary directions for future work. The first is to examine how the axioms can be modified to account for normalized approximation measures (see the remark at the end of Section 4). The second is based on work done to rank the “interestingness” of generalizations (summaries) of columns in relation instances (see [10] and the citations contained therein). The basic idea of this work is that it is often desirable to generalize a column along pre-specified taxonomic hierarchies; each generalization forms a different data set. There are often a large number of ways that a column can be generalized along a hierarchy (e.g. levels of granularity). Moreover, if there are many available hierarchies, then the number of possible generalizations increases yet further and can become quite large. Finding the right generalization can significantly improve the gleaning of useful information out of a column. A common approach to this problem is to develop a measure of the interestingness of generalizations and use it to rank them; such approaches base the interestingness of a generalization on the diversity of its frequency distribution. No work has been done in taking an axiomatic approach to defining a diversity measure. Our second direction for future work is to take such an axiomatic approach.

In conclusion, we believe that the problem of defining an FD approximation measure is interesting and difficult. Moreover, we feel that the study of approximate FDs more generally is worthy of greater consideration in the KDD community.

Acknowledgments

The author thanks the following people (in no particular order): Edward Robertson, Dirk Van Gucht, Jan Paredaens, Marc Gyssens, Memo Dalkilic, and Dennis Groth. The author also thanks a reviewer who pointed out several related works to consider.

References

1. Abiteboul S., Hull R., and Vianu V. Foundations of Databases. Addison-Wesley, Reading, Mass., 1995.
2. Ash R. Information Theory. Interscience Publishers, John Wiley and Sons, New York, 1965.
3. Cavallo R. and Pittarelli M. The Theory of Probabilistic Databases. In Proceedings 13th International Conference on Very Large Databases (VLDB), pages 71–81, 1987.
4. Dalkilic M. Establishing the Foundations of Data Mining. PhD thesis, Indiana University, Bloomington, IN, May 2000.
5. Dalkilic M. and Robertson E. Information Dependencies. In Proceedings 19th ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems (PODS), pages 245–253, 2000.
6. De Bra P. and Paredaens J. An Algorithm for Horizontal Decompositions. Information Processing Letters, 17:91–95, 1983.
7. Demetrovics J., Katona G.O.H., and Miklós D. Partial Dependencies in Relational Databases and Their Realization. Discrete Applied Mathematics, 40:127–138, 1992.
8. Demetrovics J., Katona G.O.H., Miklós D., Seleznjev O., and Thalheim B. Asymptotic Properties of Keys and Functional Dependencies in Random Databases. Theoretical Computer Science, 190(2):151–166, 1998.
9. Goodman L. and Kruskal W. Measures of Association for Cross Classifications. Journal of the American Statistical Association, 49:732–764, 1954.
10. Hilderman R. and Hamilton H. Evaluation of Interestingness Measures for Ranking Discovered Knowledge. In Lecture Notes in Computer Science 2035 (Proceedings 5th Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD 2001)), pages 247–259, 2001.
11. Huhtala Y., Kärkkäinen J., Porkka P., and Toivonen H. TANE: An Efficient Algorithm for Discovering Functional and Approximate Dependencies. The Computer Journal, 42(2):100–111, 1999.
12. Kantola M., Mannila H., Räihä K., and Siirtola H. Discovering Functional and Inclusion Dependencies in Relational Databases. International Journal of Intelligent Systems, 7:591–607, 1992.
13. Kivinen J. and Mannila H. Approximate Inference of Functional Dependencies from Relations. Theoretical Computer Science, 149:129–149, 1995.
14. Lee T. An Information-Theoretic Analysis of Relational Databases - Part I: Data Dependencies and Information Metric. IEEE Transactions on Software Engineering, SE-13(10):1049–1061, 1987.
15. Lopes S., Petit J., and Lakhal L. Efficient Discovery of Functional Dependencies and Armstrong Relations. In Lecture Notes in Computer Science 1777 (Proceedings 7th International Conference on Extending Database Technology (EDBT)), pages 350–364, 2000.
16. Malvestuto F. Statistical Treatment of the Information Content of a Database. Information Systems, 11(3):211–223, 1986.
17. Mannila H. and Räihä K. Dependency Inference. In Proceedings 13th International Conference on Very Large Databases (VLDB), pages 155–158, 1987.
18. Nambiar K. K. Some Analytic Tools for the Design of Relational Database Systems. In Proceedings 6th International Conference on Very Large Databases (VLDB), pages 417–428, 1980.
19. Novelli N. and Cicchetti R. Functional and Embedded Dependency Inference: a Data Mining Point of View. Information Systems, 26:477–506, 2001.
20. Piatetsky-Shapiro G. Probabilistic Data Dependencies. In Proceedings ML-92 Workshop on Machine Discovery, Aberdeen, UK, pages 11–17, 1992.
21. Ramakrishnan R. and Gehrke J. Database Management Systems, Second Edition. McGraw-Hill, New York, 2000.
22. Wyss C., Giannella C., and Robertson E. FastFDs: A Heuristic-Driven, Depth-First Algorithm for Mining Functional Dependencies from Relation Instances. In Lecture Notes in Computer Science 2114 (Proceedings 3rd International Conference on Data Warehousing and Knowledge Discovery (DaWaK)), pages 101–110, 2001.
