Efficient Discovery of Functional Dependencies with Degrees of Satisfaction

Qiang Wei,* Guoqing Chen†
School of Economics and Management, Tsinghua University, Beijing 100084, China
Functional dependency (FD) is an important type of semantic knowledge reflecting integrity constraints in databases, and it has attracted an increasing amount of research attention in data mining. Traditionally, FD is defined in the light of precise and complete data, and can hardly tolerate partial truth arising from imprecise or incomplete data (such as noise, nulls, etc.) that often exist in massive databases, or from negligible differences among a small fraction of tuples in a huge volume of data. Based on the notion of functional dependencies with degrees of satisfaction (FDs)d, this article presents an efficient approach to discovering all satisfied (FDs)d using some important results obtained from an exploration of (FDs)d properties, such as extended Armstrong-like axioms and their derivatives. In this way, many dependencies can be inferred from previously discovered ones without scanning databases, and unsatisfied ones can be filtered out inside (rather than after) the mining process. A fuzzy relation matrix operation is used to infer transitive dependencies in the mining algorithm. Finally, the efficiency is demonstrated with data experiments.
1. INTRODUCTION
Data mining is one of the important and interesting fields in computer science and computational intelligence, and is used to discover hidden, novel, and useful knowledge to support scientific, engineering, and business decisions.1,2 The knowledge to be discovered can take various forms, such as association, clustering, classification, summarization, and so forth, from different types of data sources. This article concentrates on a particular type of association in relational databases (RDB), which remain a mainstream data model in current research and applications.
The particular form of association of concern is functional dependency (FD), which is an important notion in RDB reflecting data integrity constraints as a sort of semantic knowledge from the universe of discourse. For two collections X and Y of data attributes, an FD X → Y means that X values uniquely determine Y values; in other words, it expresses the semantics that equal X values correspond to equal Y values. An example of X → Y is (student #, course #) → grade, meaning that the value of grade can be uniquely determined by a given value of student # and a given value of course #, in a database relation containing attributes student #, course #, and grade.

However, sometimes FDs are not explicitly known or are hidden, and therefore need to be discovered. This partly stems from the fact that several decades of database applications have resulted in a large number of databases in which useful and interesting functional dependencies may already lie hidden. Since the 1990s, an increasing effort has been devoted to mining FDs and related issues.3–10 Moreover, some other attempts have centered on extended forms of FD, such as functional dependencies with null values,11 partial determination,12 approximate functional dependencies,13,14 fuzzy functional dependencies (FFDs),15–20 functional dependencies with degrees of satisfaction (FDs)d,21–23 and so forth.

Formally, let R(I1, I2, ..., In) be an n-ary relation scheme on domains D1, D2, ..., Dn with Dom(Ii) = Di, let X and Y be subsets of the attribute set I = {I1, I2, ..., In}, that is, X, Y ⊆ I, and let R be a relation of scheme R, R ⊆ D1 × D2 × ⋯ × Dn. X functionally determines Y (or Y is functionally dependent on X), denoted by X → Y, if and only if ∀R ∈ R, ∀t, t′ ∈ R, if t(X) = t′(X) then t(Y) = t′(Y), where t and t′ are tuples of R, and t(X), t′(X), t(Y), and t′(Y) are the values of t and t′ on X and Y, respectively.24,25

It is important to note that functional dependency possesses several desirable properties, including the so-called Armstrong axioms that constitute an FD inference system. Concretely, the axiomatic system composed of the axioms (A1, A2, A3) is as follows24,25:

A1: If B ⊆ A, then A → B;
A2: If A → B, then AC → BC;
A3: If A → B and B → C, then A → C.
Suppose that F is a set of FDs on scheme R. Then the set of all the FDs that can be derived from F using the inference rules A1, A2, and A3 is denoted as F^A.

In discovering functional dependencies, there still exist some open problems to solve. First, large existing databases often contain noise, such as conflicts, nulls, and errors that may result from, for instance, inaccurate data entry, transformation, or updates. Apparently, by definition, FDs do not tolerate such noisy or disturbing data. Second, even without noisy data, a partial truth of an FD may still make sense. For instance, "an FD almost holds in a database" expresses a sort of partial knowledge, meaning that the functional dependency satisfies the RDB of concern to a large extent. In 2002, Huhtala et al.13,14 considered using the concept of approximate dependency to deal with so-called error tuples. In the meantime, Wei and Chen21–23 presented the notion of a functional dependency with a degree of satisfaction (FDd: (X → Y)a) to reflect the semantics that equal X values correspond to equal Y values to a certain degree (a). Third, in developing corresponding mining methods, FD inference is desirable but still needs to be further investigated. That is, deriving an FD by inference from already discovered FDs without scanning the database may help improve the computational efficiency of the mining process. For example, if both A → B and B → C satisfy an RDB, and if A → C can be inferred directly, then the effort of scanning the database to check whether A → C holds can be saved.

In this article, we investigate some of the important properties of (FDs)d and develop an approach to discovering (FDs)d that incorporates FD inference strategies into the mining process. The article is organized as follows. In Section 2, the preliminaries are discussed. Section 3 presents a set of extended Armstrong-like inference rules, along with some important notions and derivatives. In Section 4, to discover all (FDs)d efficiently, a fuzzy relation matrix (FRM) operation is proposed to perform transitivity-type FD inference. Accordingly, the algorithm for mining (FDs)d is provided in Section 5. Results of data experiments and analysis are presented in Sections 6 and 7 to show the scalability and semantics of the method.
2. PRELIMINARIES
Definition 1. Let R(I1, I2, ..., In) be a relation scheme on domains D1, D2, ..., Dn with Dom(Ik) = Dk, let X and Y be two subsets of the attribute set I = {I1, I2, ..., In}, that is, X, Y ⊆ I, and let R be a relation of R(I), R ⊆ D1 × D2 × ⋯ × Dn, with tuples ti, tj ∈ R and ti ≠ tj. Then Y is said to functionally depend on X for the tuple pair (ti, tj), denoted as (ti, tj)(X → Y), if ti(X) = tj(X) implies ti(Y) = tj(Y).

It can easily be seen that the FD for a tuple pair can be represented in terms of a truth value TRUTH(ti,tj)(X → Y): if ti(X) = tj(X) and ti(Y) ≠ tj(Y), then TRUTH(ti,tj)(X → Y) = 0; otherwise it is 1. Subsequently, the FD for a relation R can be defined in terms of a degree of satisfaction.

Definition 2. Let R(I) be a relation scheme, X, Y ⊆ I, and R be a relation of R(I) with n tuples. Then the degree to which R satisfies X → Y is TRUTH_R(X → Y):
TRUTH_R(X → Y) = ( Σ_{ti, tj ∈ R, ti ≠ tj} TRUTH(ti,tj)(X → Y) ) / NTP
where NTP represents the number of tuple pairs in R and equals n(n − 1)/2. For example, given the relation R shown in Table I, the degree of satisfaction of the (FD)d "Fruits → Drinks" is TRUTH_R(Fruits → Drinks) = 77.78%, meaning that "Fruits determine Drinks to some extent (0.7778)."
Table I. An example of relation R.

ID    Fruits    Drinks
1     Apple     Spirit
2     Apple     Spirit
3     Apple     Coca Cola
4     Apple     Coca Cola
5     Apple     N/A
6     Orange    Coca Cola
7     Orange    Coca Cola
8     Orange    Spirit
9     Banana    Coca Cola
10    Banana    Coca Cola
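To make Definition 2 concrete, the following is a minimal Python sketch (an illustrative reading of the definition, not the authors' implementation) that computes the degree of satisfaction by pairwise comparison; the relation below mirrors Table I, with N/A treated as an ordinary value.

```python
from itertools import combinations

def truth_degree(rows, lhs, rhs):
    """Degree to which rows satisfy lhs -> rhs (Definition 2):
    share of tuple pairs that do not violate the dependency."""
    pairs = list(combinations(rows, 2))          # NTP = n(n-1)/2 pairs
    violations = sum(
        1 for t1, t2 in pairs
        if all(t1[a] == t2[a] for a in lhs)      # equal on X ...
        and any(t1[a] != t2[a] for a in rhs)     # ... but different on Y
    )
    return 1 - violations / len(pairs)

# Relation R from Table I.
R = [
    {"Fruits": "Apple",  "Drinks": "Spirit"},
    {"Fruits": "Apple",  "Drinks": "Spirit"},
    {"Fruits": "Apple",  "Drinks": "Coca Cola"},
    {"Fruits": "Apple",  "Drinks": "Coca Cola"},
    {"Fruits": "Apple",  "Drinks": "N/A"},
    {"Fruits": "Orange", "Drinks": "Coca Cola"},
    {"Fruits": "Orange", "Drinks": "Coca Cola"},
    {"Fruits": "Orange", "Drinks": "Spirit"},
    {"Fruits": "Banana", "Drinks": "Coca Cola"},
    {"Fruits": "Banana", "Drinks": "Coca Cola"},
]
print(round(truth_degree(R, ["Fruits"], ["Drinks"]), 4))  # 0.7778
```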
In the context of data mining, by scanning the database and evaluating all possible combinations of attributes, any (FD)d could be obtained by definition.

Definition 3. Let R be a relation on scheme R(I). Then, given a minimum satisfaction threshold u, 0 ≤ u ≤ 1, and X, Y ⊆ I, if TRUTH_R(X → Y) ≥ u, then X → Y is called a satisfied functional dependency for R. For the sake of convenience, an (FD)d X → Y with TRUTH_R(X → Y) = a is denoted as (X → Y)a.

Thus, a straightforward way to find all satisfied (FDs)d is to scan the database for each possible combination of attributes, compute its TRUTH value, and filter the (FDs)d with the threshold. However, this is obviously costly and inefficient. Instead, optimization can be made based on the following properties, which are discussed in this and later sections.

Property 1 (Reflexivity). Let R be a relation on R(I) and X, Y ⊆ I. If Y ⊆ X, then TRUTH_R(X → Y) = 1.

Proof. Because Y ⊆ X, for each tuple pair (ti, tj) in R, if ti(X) = tj(X), then obviously ti(Y) = tj(Y), so according to Definition 1, TRUTH(ti,tj)(X → Y) = 1. If ti(X) ≠ tj(X), then whether ti(Y) equals tj(Y) or not, TRUTH(ti,tj)(X → Y) = 1. ■

Property 2 (Augmentation). Let R be a relation on R(I) and X, Y, Z ⊆ I. If TRUTH_R(X → Y) ≥ a, then TRUTH_R(XZ → YZ) ≥ a, 0 ≤ a ≤ 1.

Proof. First, all the combinations of X, Y, and Z can be listed exhaustively, yielding Table II. In the left part of the table, value 1 represents ti(X) = tj(X) and value 0 represents ti(X) ≠ tj(X); the same holds for Y and Z. In the right part of the table, value 1 represents TRUTH(ti,tj)(X → Y) = 1 and value 0 represents TRUTH(ti,tj)(X → Y) = 0; the same holds for XZ → YZ.
Table II. Truth values of Property 2.

ti(X)=tj(X)   ti(Y)=tj(Y)   ti(Z)=tj(Z)   TRUTH(ti,tj)(X→Y)   TRUTH(ti,tj)(XZ→YZ)
0             0             0             1                    1
0             0             1             1                    1
0             1             0             1                    1
0             1             1             1                    1
1             0             0             0                    1
1             0             1             0                    0
1             1             0             1                    1
1             1             1             1                    1
Based on Table II, it can be seen that for each tuple pair (ti, tj), TRUTH(ti,tj)(XZ → YZ) ≥ TRUTH(ti,tj)(X → Y). Since TRUTH_R(X → Y) ≥ a, we have TRUTH_R(XZ → YZ) = Σ TRUTH(ti,tj)(XZ → YZ)/NTP ≥ Σ TRUTH(ti,tj)(X → Y)/NTP = TRUTH_R(X → Y) ≥ a. ■

Property 3 (Partial Transitivity). Let R be a relation on R(I) and X, Y, Z ⊆ I. If TRUTH_R(X → Y) ≥ a and TRUTH_R(Y → Z) ≥ b, then TRUTH_R(X → Z) ≥ a + b − 1.

Proof. Similar to the proof of Property 2, we have the values listed in Table III. Only in rows No. 5 and No. 7 is TRUTH(ti,tj)(X → Z) = 0. Clearly, the percentage of tuple pairs of type No. 5 does not exceed 1 − a; indeed, the percentage of tuple pairs of types No. 5 and No. 6 together does not exceed 1 − a, because TRUTH_R(X → Y) ≥ a. Similarly, the percentage for No. 7 does not exceed 1 − b. It follows that 0 ≤ 1 − TRUTH_R(X → Z) ≤ (1 − a) + (1 − b), that is, 1 ≥ TRUTH_R(X → Z) ≥ a + b − 1. ■

The above three properties are similar to the three Armstrong inference rules, except that Property 3 only guarantees a lower-bound TRUTH value for a transitive (FD)d, which can nevertheless be inferred without scanning the database. Furthermore, another important property can be obtained as follows.

Property 4. Let R be a relation on R(I) and X, Y, Z ⊆ I. If TRUTH_R(X → Y) = a and TRUTH_R(Y → Z) = b, then a + b ≥ 1.

Proof. Similarly, we have Table IV. Because TRUTH_R(X → Y) = a, the percentage of tuple pairs of types No. 5 and No. 6 equals 1 − a. Since TRUTH(ti,tj)(Y → Z) = 1 in both of these rows, all such pairs contribute to b, so b ≥ 1 − a, that is, a + b ≥ 1. ■

In brief, Property 4 guarantees that invalid values less than 0 will not be generated in transitive inference.
Table III. Truth values of Property 3.

No.   ti(X)=tj(X)   ti(Y)=tj(Y)   ti(Z)=tj(Z)   TRUTH(ti,tj)(X→Y)   TRUTH(ti,tj)(Y→Z)   TRUTH(ti,tj)(X→Z)
1     0             0             0             1                    1                    1
2     0             0             1             1                    1                    1
3     0             1             0             1                    0                    1
4     0             1             1             1                    1                    1
5     1             0             0             0                    1                    0
6     1             0             1             0                    1                    1
7     1             1             0             1                    0                    0
8     1             1             1             1                    1                    1

Table IV. Truth values of Property 4 (the same eight combinations and truth values as in Table III).
In Ref. 22, an algorithm for mining all satisfied (FDs)d has been proposed, which is also constructed on a lattice structure in a manner similar to that of Huhtala's.13,14 This algorithm (called SFDD in this article) contains two parts: the first part generates all satisfied 1-consequent (FDs)d, and the second part generates the i-consequent (FDs)d based on all satisfied 1-consequent (FDs)d. Notably, by a q-consequent FDd (X → Y)a we mean an FDd with q attributes contained in Y; similarly, a p-antecedent (X → Y)a means an FDd with p attributes contained in X. The algorithm SFDD applies certain optimization strategies for improving efficiency. To further extend the algorithm, Property 3 needs to be incorporated in a more complete manner so as to infer all possible (FDs)d transitively, as Property 3 permits. For example, if we already have TRUTH_R(Y → Z) = 1.0 and TRUTH_R(X → Y) = 0.8, then it can be inferred that TRUTH_R(X → Z) ≥ TRUTH_R(X → Y) + TRUTH_R(Y → Z) − 1 = 0.8 + 1.0 − 1 = 0.8. Then (X → Z)0.8 is clearly satisfied, without any need to scan the database or calculate its TRUTH value.
3. A (FDs)d INFERENCE SYSTEM
This section presents a (FDs)d inference system in the form of extended Armstrong-like axioms.

Theorem 1 (Extended Armstrong-like Axioms). Let R be a relation on R(I) and X, Y, Z ⊆ I. Then for any R in R, the following inference rules, denoted as the extended Armstrong-like axioms (A1′, A2′, A3′), hold:

A1′: If Y ⊆ X, then (X → Y)u, u ∈ (0, 1);
A2′: If (X → Y)a, then (XZ → YZ)a;
A3′: If (X → Y)a and (Y → Z)b, then (X → Z)g, a + b − 1 ≤ g ≤ 1.

Proof. A1′, A2′, and A3′ can be proved directly from Definition 2 and Properties 1, 2, and 3, respectively. ■

Suppose that F is a set of (FDs)d on R. Then we denote by F^A′ the set of all (FDs)d that can be derived from F using A1′, A2′, and A3′. Moreover, because the (FDs)d derived by A3′ carry only lower-bound degrees of satisfaction, the degrees shown in F^A′ are all lower bounds. Therefore, we write (X → Y)a_ for such (FDs)d in F^A′ to denote a lower-bounded (FD)d. For instance, given F = {(A → B)0.8, (B → C)0.7, (A → C)0.7}, then F^A′ = {(A → B)0.8, (B → C)0.7, (A → C)0.7, (AC → B)0.8, (AB → C)0.7, (AB → C)0.7, (A → C)0.5_, (AB → C)0.5_}. The first three (FDs)d come directly from F. The next three (FDs)d are inferred based on A2′ and A1′. (A → C)0.5_ is inferred from (A → B)0.8 and (B → C)0.7 with A3′, and (AB → C)0.5_ is in turn inferred from (A → C)0.5_ based on A2′ and A1′. Moreover, in F^A′, some (FDs)d may be regarded as redundant, such as (A → C)0.5_, because it carries less information than (A → C)0.7. However, for the sake of simplicity, we will hereafter use (X → Y)a_ and (X → Y)a interchangeably for (FDs)d in F^A′ (unless otherwise indicated). Further, consider those functional dependencies with the highest degrees of satisfaction in F^A′.
Then F^{A+} can be defined as the set containing the (FDs)d with the upper-bound degrees of satisfaction, that is, F^{A+} = {(X → Y)a | (X → Y)a ∈ F^A′, where a = sup{b | (X → Y)b ∈ F^A′}}. For example, given F and F^A′ of the above example, the corresponding F^{A+} = {(A → B)0.8, (B → C)0.7, (A → C)0.7, (AC → B)0.8, (AB → C)0.7, (AB → C)0.7}.

Definition 4. Let F and G be two sets of (FDs)d. Then F and G are called equivalent if and only if F^{A+} = G^{A+}.

For example, given two sets of (FDs)d, say F = {(A → B)0.8, (B → C)0.9, (A → C)0.7} and G = {(A → B)0.8, (B → C)0.9}, then F is equivalent to G, because F^{A+} = {(A → B)0.8, (B → C)0.9, (A → C)0.7, (AC → B)0.8, (AB → C)0.9, (AB → C)0.7} = G^{A+}. Then, given the threshold u = 0.7, discovering F is equivalent to discovering G; compared with F, G is simpler and carries the same semantics as F at u = 0.7. Note that with a different F = {(A → B)0.8, (B → C)0.9, (A → C)0.8} and the same G, then F^{A+} = {(A → B)0.8, (B → C)0.9, (A → C)0.8, (AC → B)0.8, (AB → C)0.9, (AB → C)0.8} ≠ G^{A+}; that is, G is not equivalent to F. In the mining process, however, we are only concerned with those (FDs)d whose degrees are satisfied as mining outcomes. For u = 0.7, both (A → C)0.8 and (A → C)0.7 will be regarded as satisfied. If the set of (FDs)d is viewed as a fuzzy set with the corresponding degrees of satisfaction as the grades of membership, then F and G are the same in terms of the 0.7-cuts of F^{A+} and G^{A+}. That is, (F^{A+})0.7 = {A → B, B → C, A → C, AC → B, AB → C, AB → C} = (G^{A+})0.7. In general, the u-cut of a set F of (FDs)d is denoted as (F)u = {A → B | (A → B)a ∈ F and a ≥ u}. Based on (F)u, the notion of u-equivalence can be defined.

Definition 5. Let F and G be two sets of (FDs)d and u be a cut threshold with u ∈ [0, 1]. Then F and G are called u-equivalent if and only if the u-cuts of F^{A+} and G^{A+} are equal, that is, (F^{A+})u = (G^{A+})u.

Note that, for any set F of (FDs)d and its F^{A+}, it can be proved that there exists a smallest set MF of (FDs)d such that MF is a subset of F^{A+} and is u-equivalent to F^{A+}. It would therefore be desirable and efficient to develop an approach that discovers F^{A+} by scanning the database only for MF and deriving all the other (FDs)d in (F^{A+} − MF) using the extended Armstrong-like axioms.
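As a small illustration of the u-cut and of inference by A3′, the following Python sketch (an illustrative reading of the definitions, not the authors' code) closes a set of single-attribute (FDs)d under A3′ with lower-bound degrees and then takes a u-cut; the sets F and G mirror the example above.

```python
def close_a3(fds):
    """Closure under A3': (X->Y)_a and (Y->Z)_b imply (X->Z)_{a+b-1},
    keeping the highest degree known for each dependency."""
    closed = dict(fds)
    changed = True
    while changed:
        changed = False
        for (x, y1), a in list(closed.items()):
            for (y2, z), b in list(closed.items()):
                if y1 == y2 and x != z:
                    g = max(a + b - 1, 0.0)   # lower bound; never below 0 (Property 4)
                    if g > closed.get((x, z), 0.0):
                        closed[(x, z)] = g
                        changed = True
    return closed

def u_cut(fds, u):
    """u-cut: keep only the dependencies whose degree reaches the threshold."""
    return {dep for dep, a in fds.items() if a >= u}

F = {("A", "B"): 0.8, ("B", "C"): 0.9, ("A", "C"): 0.7}
G = {("A", "B"): 0.8, ("B", "C"): 0.9}
print(u_cut(close_a3(F), 0.7) == u_cut(close_a3(G), 0.7))  # True: F and G are 0.7-equivalent
```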
4. TRANSITIVITY OPERATION BASED ON FUZZY RELATION MATRIX
To apply the extended Armstrong-like axioms fully for (FDs)d inference in the mining process, this section introduces a fuzzy relation matrix-based transitivity operation. It should be mentioned that fuzzy logic has been widely applied in data mining methods and applications.26,27 Concretely, a fuzzy relation U from X = {x1, x2, ..., xm} to Y = {y1, y2, ..., yk} is a mapping from X × Y to [0, 1]. In other words, U is a fuzzy set on X × Y with membership function dU(x, y): X × Y → [0, 1]. For ∀(x, y) ∈ U, dU(x, y) is interpreted as the degree to which x relates to y. Thus an m × k matrix can be constructed to represent U as follows28:
U = [ d11  d12  ⋯  d1k
      d21  d22  ⋯  d2k
      ⋮    ⋮        ⋮
      dm1  dm2  ⋯  dmk ]
where dij = dU(xi, yj) ∈ [0, 1]. In particular, if X = Y, then dij = dU(xi, yj) = 1 when i = j. Given two fuzzy relations U and V from X to Y, we define U ≥ V if dU(xi, yj) ≥ dV(xi, yj), ∀xi ∈ X, ∀yj ∈ Y; if dU(xi, yj) = dV(xi, yj), ∀xi ∈ X, ∀yj ∈ Y, then U = V.

For a set of (FDs)d F, let I^p be the set of all p-antecedents of (FDs)d in F and Fp be the set of all p-antecedent 1-consequent (FDs)d in F. Then Fp can be represented as an s × n fuzzy relation matrix (FRM) on I^p × I, where s = |I^p| and n = |I| for 1 < p ≤ n. In particular, F1 is defined as an FRM on I × I. For convenience, we will hereafter refer to Fp (or F1) and its corresponding FRM interchangeably (unless otherwise indicated). We use these settings to discuss the discovery of satisfied 1-consequent (FDs)d in the mining process, where the major optimization takes place; multiconsequent (FDs)d will then be generated from the discovered satisfied 1-consequent (FDs)d using a procedure similar to that in SFDD.23

Definition 6. If U is a fuzzy relation from X to Y and V is a fuzzy relation from Y to Z, then the transitive relation from X to Z, denoted as UV, is defined as UV(x, z) = U(x, y) ∘ V(y, z), where (x, y) ∈ U, (y, z) ∈ V, and (x, z) ∈ UV. Concretely, dUV(x, z) = max_{y∈Y}(dU(x, y) + dV(y, z) − 1).

Obviously, the transitive operator in Definition 6 expresses the transitivity of Property 3. Furthermore, by Property 4, this operation will not generate any value less than zero. In addition, if U is a fuzzy relation from X to X, then U² = U ∘ U and U^{m+1} = U ∘ U^m, m ≥ 1. For an Fp, Fp ∘ F1 corresponds to one transitive inference operation, and Fp ∘ (F1)^k represents k transitive inference operations, where dij ∈ Fp ∘ (F1)^k is the inferred upper-bound degree of satisfaction for Ei → Ij with Ei ∈ I^p and Ij ∈ I. This corresponds to the process of computing (Fp^{A+}). The next question is whether the transitive inference operations stop within a limited number of steps. First, we have the following lemma.

Lemma 1. Given two relations V on Y × X and U on X × X, V ∘ U^{k+q} ≥ V ∘ U^k, q ≥ 1.

Proof. According to the definition, for ∀y ∈ Y and ∀x ∈ X, d_{VU}^{k+1}(y, x) = max_{x′∈X}(d_{VU}^{k}(y, x′) + dU(x′, x) − 1). Then it is clear that d_{VU}^{k+1}(y, x) ≥ d_{VU}^{k}(y, x) + dU(x, x) − 1, and because dU(x, x) = 1, d_{VU}^{k+1}(y, x) ≥ d_{VU}^{k}(y, x). Further, d_{VU}^{k+2}(y, x) ≥ d_{VU}^{k+1}(y, x) ≥ d_{VU}^{k}(y, x). Then, by induction,
d_{VU}^{k+q}(y, x) ≥ d_{VU}^{k}(y, x) for ∀y ∈ Y and ∀x ∈ X, q ≥ 1. That is, V ∘ U^{k+q} ≥ V ∘ U^{k}. ■
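For illustration, the composition of Definition 6 can be written in a few lines of Python/NumPy (a sketch under the assumption that FRMs are stored as dense arrays; the paper itself does not prescribe a data structure):

```python
import numpy as np

def frm_compose(V, U):
    """Transitive operation of Definition 6: (V o U)[y, x] = max over x'
    of (V[y, x'] + U[x', x] - 1).  By Property 4 the result is never
    negative for valid degree matrices, but we clip at 0 defensively."""
    out = np.max(V[:, :, None] + U[None, :, :] - 1.0, axis=1)
    return np.clip(out, 0.0, 1.0)

# Tiny example on attributes (A, B, C): F1[i, j] holds the degree of I_i -> I_j.
F1 = np.array([
    [1.0, 0.8, 0.0],   # A -> A, A -> B, A -> C (still unknown)
    [0.0, 1.0, 0.7],   # B -> B, B -> C
    [0.0, 0.0, 1.0],
])
print(frm_compose(F1, F1)[0, 2])   # 0.5 = 0.8 + 0.7 - 1, a lower bound for A -> C
```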
Theorem 2. For two sets X and Y (|X| = m, |Y| = k), given two fuzzy relations V on Y × X and U on X × X, respectively, V ∘ U^{m−1} = V ∘ U^{(m−1)+q}, q ≥ 1.

Proof. For V ∘ U, for ∀y ∈ Y and ∀x0 ∈ X, d_{VU}(y, x0) = max_{x1∈X}(dV(y, x1) + dU(x1, x0) − 1). Further, for V ∘ U²,

d_{VU}^{2}(y, x0) = max_{x2∈X}(d_{VU}(y, x2) + dU(x2, x0) − 1)
                 = max_{x2∈X}(max_{x1∈X}(dV(y, x1) + dU(x1, x2) − 1) + dU(x2, x0) − 1)
                 = max_{x1,x2∈X}(dV(y, x1) + dU(x1, x2) + dU(x2, x0) − 2).

So, after repeated iteration, for V ∘ U^{m−1} it can be derived that

d_{VU}^{m−1}(y, x0) = max_{x1,...,x_{m−1}∈X}(dV(y, x1) + dU(x1, x2) + ⋯ + dU(x_{m−2}, x_{m−1}) + dU(x_{m−1}, x0)) − (m − 1)

and

d_{VU}^{m}(y, x0) = max_{x1,...,x_m∈X}(dV(y, x1) + dU(x1, x2) + ⋯ + dU(x_{m−1}, x_m) + dU(x_m, x0) − m).

Because x1, x2, ..., xm, x0 all belong to X and |X| = m, at least two of them must be identical. Without loss of generality, suppose xi = xj, i < j, i, j = 0, 1, ..., m. Then

d_{VU}^{m}(y, x0) = max_{x1,...,xm∈X}(dV(y, x1) + dU(x1, x2) + ⋯ + (dU(xi, x_{i+1}) + ⋯ + dU(x_{j−1}, xj)) + ⋯ + dU(x_{m−1}, xm) + dU(xm, x0) − m)
                  = max_{x1,...,xm∈X}(dV(y, x1) + dU(x1, x2) + ⋯ + dU(x_{i−1}, xi) + dU(xj, x_{j+1}) + ⋯ + dU(xm, x0) − (m − (j − i)) + (dU(xi, x_{i+1}) + ⋯ + dU(x_{j−1}, xj) − (j − i))).

Because 0 ≤ dU(xi, xi′) ≤ 1 for xi, xi′ ∈ X, we have dU(xi, x_{i+1}) + ⋯ + dU(x_{j−1}, xj) − (j − i) ≤ 0, and therefore

d_{VU}^{m}(y, x0) ≤ max_{x1,...,xm∈X}(dV(y, x1) + dU(x1, x2) + ⋯ + dU(x_{i−1}, xi) + dU(xj, x_{j+1}) + ⋯ + dU(xm, x0) − (m − (j − i))) = d_{VU}^{m−(j−i)}(y, x0).

So d_{VU}^{m}(y, x0) ≤ d_{VU}^{m−(j−i)}(y, x0) ≤ d_{VU}^{m−1}(y, x0) (the latter by Lemma 1, since m − (j − i) ≤ m − 1) for ∀y ∈ Y and ∀x0 ∈ X; thus V ∘ U^{m} ≤ V ∘ U^{m−1}. According to Lemma 1, V ∘ U^{m} ≥ V ∘ U^{m−1}. That is, V ∘ U^{m} = V ∘ U^{m−1}. Further, for V ∘ U^{(m−1)+q}, q ≥ 1, V ∘ U^{(m−1)+q} = V ∘ U^{m} ∘ U^{q−1} = V ∘ U^{m−1} ∘ U^{q−1} = ⋯ = V ∘ U^{m−1}. Then V ∘ U^{m−1} = V ∘ U^{(m−1)+q}, q ≥ 1. ■
Briefly, Theorem 2 contains two aspects. First, if Y = I^p and X = I, then V = Fp and U = F1, and V ∘ U^{m−1} is stable, which means that (Fp^{A+}) can be obtained in a limited number of steps. Second, a constructive method based on the FRM operation is obtained for deriving (Fp^{A+}). Based on Theorem 2, given a relation R on R(I) with n tuples and |I| = m, the set of 1-consequent (FDs)d F can be divided into (m − 1) sets: the 1-antecedent set F1, the 2-antecedent set F2, ..., the (m − 1)-antecedent set F_{m−1}. If F1 is FRM-operated with F1, then the derived (F1)^m is (F1^{A+}). Similarly, after Fp is FRM-operated with F1, the derived (Fp) ∘ (F1)^{m−1} is (Fp^{A+}). Finally, combining all (Fp) ∘ (F1)^{m−1}, p = 1, 2, ..., m − 1, yields (F^{A+}).

The above discussion presented an operational FRM method to derive (F^{A+}). However, when only the dependencies satisfying a given threshold u are of concern, only (F^{A+})|u is needed, that is, (F^{A+})|u = {(A → B)a | (A → B)a ∈ F^{A+} and a ≥ u}. The following theorem guarantees that only the satisfied (FDs)d need to be retained during the (F^{A+}) computation; in other words, maintaining (F^{A+})|u suffices. Generally, given an FRM U = [dij], i = 1, 2, ..., mx, j = 1, 2, ..., my, U|u is the FRM U|u = [d′ij], i = 1, 2, ..., mx, j = 1, 2, ..., my, where d′ij = dij if dij ≥ u, and d′ij = 0 otherwise.

Theorem 3. Given two sets X and Y (|X| = m, |Y| = k), and two FRMs V on Y × X and U on X × X, (V|u ∘ U|u)|u = (V ∘ U)|u.

Proof. Recall that, for V ∘ U, d_{VU}(y, x) = max_{x′∈X}(dV(y, x′) + dU(x′, x) − 1). The geometric interpretation is to search for the maximum value of dV(y, x′) + dU(x′, x) − 1 over the routes y → x′ → x, and clearly dV(y, x′) + dU(x′, x) − 1 ≤ dV(y, x′) and dV(y, x′) + dU(x′, x) − 1 ≤ dU(x′, x). There are then two situations when evaluating the possible routes for d_{VU}(y, x): (1) if dV(y, x′) < u or dU(x′, x) < u, then dV(y, x′) + dU(x′, x) − 1 ≤ min(dV(y, x′), dU(x′, x)) < u, so such a route cannot contribute a value of at least u and the corresponding entries can be regarded as zero; (2) if both dV(y, x′) ≥ u and dU(x′, x) ≥ u, then the route is evaluated as usual. After scanning all possible routes, d_{VU}(y, x) is obtained. Hence any dV(y, x′) or dU(x′, x) less than u has no effect on computing (V ∘ U)|u, so computing (V ∘ U)|u is equivalent to computing (V)|u ∘ (U)|u, except for one small difference: after deriving (V)|u ∘ (U)|u, a further filtering with u is needed. Thus ((V)|u ∘ (U)|u)|u is finally derived; that is, ((V)|u ∘ (U)|u)|u = (V ∘ U)|u. ■
Theorem 3 makes it possible to eliminate the unsatisfied (FDs)d within the (F^{A+})|u operation. In actual FRM operation, entries dij < u need not be stored.
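Putting Theorems 2 and 3 together, the thresholded fixpoint computation can be sketched as follows (again a Python/NumPy illustration under the same dense-matrix assumption as the earlier sketch; F1 is assumed to carry 1.0 on its diagonal, since every attribute determines itself):

```python
import numpy as np

def frm_compose(V, U):
    """Composition of Definition 6 (same as the earlier sketch)."""
    return np.clip(np.max(V[:, :, None] + U[None, :, :] - 1.0, axis=1), 0.0, 1.0)

def threshold(M, u):
    """The |u filter of Theorem 3: entries below the threshold are set to 0."""
    return np.where(M >= u, M, 0.0)

def one_consequent_closure(F1, u):
    """u-filtered powers of F1: by Theorem 2 at most |I| - 1 compositions are
    needed, so the loop is bounded by the number of attributes."""
    base = threshold(F1, u)
    result = base
    for _ in range(F1.shape[0] - 1):
        nxt = threshold(frm_compose(result, base), u)
        if np.array_equal(nxt, result):
            break                      # fixpoint reached early
        result = nxt
    return result

F1 = np.array([[1.0, 0.8, 0.0],
               [0.0, 1.0, 0.9],
               [0.0, 0.0, 1.0]])
print(one_consequent_closure(F1, 0.7))   # A -> C appears with degree 0.7 = 0.8 + 0.9 - 1
```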
5. THE ALGORITHM FOR MINING SATISFIED (FDs)d BASED ON FRM OPERATION—MFDD
This section provides the details of an algorithm for discovering satisfied (FDs)d, called MFDD, which is an extension of SFDD. First, in MFDD, the FRM-based transitive operation performs the dependency inference of A3′ in the extended Armstrong-like axiomatic system. Second, MFDD adopts a "filtering while generating" strategy. Concretely, consider the algorithm in Figure 1. For a given (FD)d in the set of all possible i-antecedent 1-consequent candidate (FDs)d (CA_Fi), if it is not already in the set of inferred satisfied (FDs)d (IN_Fi), the database is scanned to calculate its degree of satisfaction, which is compared with u. If it is satisfied, it is added to the set of scanned satisfied (FDs)d (SC_Fi); otherwise, it is added to the set of scanned unsatisfied (FDs)d (UN_Fi). Based on the updated set of satisfied (FDs)d (SC_Fi ∪ IN_Fi), the FRM operation is performed to derive its (Fi^{A+})|u, and the newly inferred satisfied (FDs)d are added to IN_Fi. After going through all (FDs)d in CA_Fi, each (FD)d in CA_Fi has been assigned to SC_Fi, UN_Fi, or IN_Fi. Thus, all the SC_Fi together constitute the smallest set MF, which is u-equivalent to the whole set of satisfied 1-consequent (FDs)d. After deriving all the satisfied 1-consequent (FDs)d, including the scanned and the inferred ones, the multiconsequent (FDs)d can be generated.
Figure 1. Algorithm MFDD.
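Since the pseudocode of Figure 1 is not reproduced here, the following Python sketch gives one possible reading of the "filtering while generating" loop described above; the function names, the candidate generator, and the scan routine are illustrative placeholders, not the authors' actual code.

```python
def mfdd_one_consequent(candidates, scan_degree, infer_closure, u):
    """One possible reading of the MFDD loop for i-antecedent, 1-consequent
    candidates: scan only what cannot be inferred, and filter while generating.

    candidates    : iterable of (X, A) pairs, X an antecedent set, A an attribute
    scan_degree   : function (X, A) -> degree, obtained by scanning the database
    infer_closure : function mapping the current satisfied set {(X, A): degree}
                    to its FRM-inferred satisfied set (the u-cut of the closure)
    """
    sc, un, inf = {}, {}, {}            # SC_F, UN_F, IN_F
    for X, A in candidates:
        if (X, A) in inf:               # already inferred satisfied: skip the scan
            continue
        degree = scan_degree(X, A)      # database scan (the expensive step)
        if degree >= u:
            sc[(X, A)] = degree
            # FRM inference over everything satisfied so far (Theorems 2 and 3).
            for dep, d in infer_closure({**sc, **inf}).items():
                if dep not in sc:
                    inf[dep] = max(d, inf.get(dep, 0.0))
        else:
            un[(X, A)] = degree
    return sc, un, inf                  # sc plays the role of the minimal set MF
```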
Table V. Example of a noisy database.

No.   A   B   C
#1    1   A   10
#2    1   A   20
#3    2   B   20
#4    2   B   20
#5    2   D   N/A
In the algorithm, all the attributes are sorted alphabetically. It should be mentioned that, because the multiconsequent generation process is similar to that in SFDD, readers may refer to Refs. 22 and 23 for technical details.

Given the example in Table V with threshold u = 0.8, the mining process of MFDD is shown in Figure 2. The resulting (F^{A+})|u = {(A → B)0.8, (B → A)1.0, (B → C)0.9, (C → A)0.8, (C → B)0.8, (B → AC)0.9, (C → AB)0.8, (BC → A)1.0, (AB → C)0.9, (AC → B)0.9}, where (C → A)0.8 can be inferred directly from (C → B)0.8 and (B → A)1.0 without scanning the database, unlike in SFDD. Intuitively, the FRM operation allows more satisfied (FDs)d to be inferred without database scanning, which is usually time consuming. In MFDD, the computation of (F1^{A+})|u is the key step, because updating F1 requires iterative FRM operations whose computational complexity is O(2m³) at worst, where m is the number of attributes.
Figure 2. Mining process of MFDD.
If the FRM operation can infer a satisfied FDd, an operation of scanning the database is saved. According to the definition of FDd, the computational complexity of calculating one FDd is about O(n²/2) on average, where n is the number of tuples. Obviously, O(2m³) << O(n²/2), because in large-scale databases the number of tuples is much larger than the number of attributes (for instance, with n = 1000 tuples and m = 10 attributes, as in the experiments of Section 6, n²/2 = 500,000 whereas 2m³ = 2000). In fact, the cost of the FRM computation is limited compared with the database scanning operation. Moreover, because (F1^{A+})|u has already been derived, computing (Fi^{A+})|u, 1 ≤ i ≤ m − 1, requires only a single FRM operation, (Fi ∘ (F1^{A+})|u)|u; no iterative steps are needed. So, theoretically, MFDD is more efficient than SFDD.
6. EXPERIMENTAL ANALYSIS ON SCALABILITY
A real-world database was used to test the presented approach. It contains the real business data of The Insurance Company (TIC) Benchmark provided by the Dutch data mining company Sentient Machine Research,29 which is commonly used as benchmark data for evaluating data mining algorithms. The database is from "The Insurance Company 2000," containing 5822 tuples with 86 attributes. Table VI lists the parameters used in evaluating the efficiency of the algorithms. The experiments were conducted on a Pentium III 650 MHz computer with 256 MB of RAM, using Visual C++ 6.0. The algorithms compared, SFDD and MFDD, were both coded in the same environment.
6.1. Parameter u
Table VII shows that CA_F = SC_F + IN_F + UN_F and S_F = SC_F + IN_F, which is consistent with the theoretical analysis.
Table VI. Description of the main parameters.

Parameter   Description
u           The threshold u ∈ [0, 1]
N           The number of tuples in the database
M           The number of attributes in the database
Time        The whole time the entire process takes, in seconds
SC_F        The number of scanned satisfied (FDs)d
UN_F        The number of scanned unsatisfied (FDs)d
IN_F        The number of inferred satisfied (FDs)d
CA_F        The number of all candidate (FDs)d; actually CA_F = SC_F + UN_F + IN_F. This parameter is used for verification.
S_F         The number of satisfied 1-consequent (FDs)d found by the revised SFDD; it can be proved that S_F = SC_F + IN_F, which represents all the satisfied (FDs)d, including scanned satisfied and inferred satisfied ones.
Table VII. The threshold u on algorithms MFDD and revised SFDD (N = 1000, M = 10).

        MFDD                                        SFDD
u       SC_F   UN_F   IN_F   CA_F   Time (s)        S_F    UN_F   CA_F   Time (s)
0.00    18     0      72     90     0               90     0      90     4
0.10    18     0      72     90     0               90     0      90     4
0.20    25     0      65     90     1               90     0      90     4
0.30    25     1      64     90     1               89     1      90     4
0.40    21     5      64     90     1               85     5      90     4
0.50    18     8      64     90     1               82     8      90     4
0.60    23     9      58     90     1               81     9      90     4
0.70    48     18     36     102    3               84     18     102    4
0.80    47     61     75     183    6               122    61     183    10
0.90    161    149    133    443    19              294    149    443    27
0.92    226    196    136    558    27              362    196    558    35
0.94    303    228    88     619    34              391    228    619    39
0.96    458    382    65     905    54              523    382    905    59
0.98    608    655    58     1321   81              666    655    1321   85
0.99    697    1086   48     1831   112             745    1086   1831   116
1.00    1      4854   0      4855   262             1      4854   4855   261
It shows clearly that, in MFDD, IN_F can be inferred, whereas in SFDD it can be obtained only by database scanning. Generally, comparing MFDD and SFDD, S_F = SC_F + IN_F, UN_F (MFDD) = UN_F (SFDD), and CA_F (MFDD) = CA_F (SFDD). Further, as u increases, SC_F, IN_F, and S_F vary as shown in Figure 3. Figure 3 reveals that, when u ≥ 0.7, SC_F, IN_F, and S_F increased sharply. Where u is in the range 0.7–0.9, IN_F held a remarkable proportion. When u > 0.9, however, it became more and more difficult to infer satisfied (FDs)d.
Figure 3. Number of SC_F, S_F, IN_F.
Figure 4. Time of MFDD and SFDD.
At the extreme, when u = 1.00, there was only one scanned satisfied FDd, so nothing could be inferred. Figure 4 shows that the running time of MFDD was less than that of SFDD; as IN_F increased, the time saved increased as well, which is consistent with the theoretical analysis of the computational complexities of MFDD and SFDD: when m << n, the FRM operation saves more system consumption than database scanning, in general. Nevertheless, the case u = 1.00 is an exception, in that the time of MFDD was larger than that of SFDD. This may be because, with only one satisfied FDd obtained, the FRM operation could infer nothing, yet the computation still took place. However, in this extreme situation, MFDD cost only 1 s more than SFDD. Again, in general, because m << n, the cost of the FRM operation is limited. Additionally, though a higher u leads to more convincing (FDs)d, higher is not always better: a higher u may result in (FDs)d with more antecedents, which can be too specific, whereas a smaller u leads to (FDs)d with fewer antecedents, which are more general. For instance, with u = 0.95, (Customer Subtype → Average Size Household)0.97 could be obtained, whereas with u = 0.995, only satisfied (FDs)d of a form like "Customer Subtype & A & B & ... → Average Size Household" could be obtained.
6.2. Parameter N
In Figure 5, as N increases, SC_F, IN_F, and UN_F remain at stable levels, which means that only N affects the running time, whereas the structure of the data values does not change remarkably. Figure 6 shows more details on running time versus N.
Figure 5. N and SC_F, UN_F, and IN_F (u = 0.9, M = 10).
Generally, the computational complexity of calculating a degree of satisfaction is between O(N) and O(N²). Further, the running time as a function of N can be estimated as Estimated Time ≈ k × N². In Table VIII, Time/N² is the estimated value of k. In Figure 6, the estimated time fits the real running time very well, which corresponds to the complexity analysis.
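As a small illustration of this fit (values taken from Table VIII; here the constant k is simply Time/N² for the largest run, so the estimates differ slightly from the table's own Est. time column):

```python
# Quick check of the Estimated Time ~ k * N^2 fit using rows of Table VIII.
k = 749 / 5822**2                 # k estimated from the N = 5822 run (~2.21e-05 s)
for n, measured in [(1000, 19), (3000, 175), (5822, 749)]:
    print(n, round(k * n**2, 1), measured)   # estimated vs. measured seconds
```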
6.3. Parameter M
Table IX shows that, with the increase of M, SC_F, UN_F, IN_F, and CA_F increased.
Figure 6. N and Time, Estimated Time (u = 0.9, M = 10).
Table VIII. SC_F, UN_F, IN_F, CA_F, Time, Time difference, Time/N², and Estimated time on N (u = 0.9, M = 10).

N      SC_F   UN_F   IN_F   CA_F   Time (s)   Time difference   Time/N²        Est. time
1000   161    149    133    443    19         —                 1.90000E−05    20.12
1500   188    145    108    441    47         28                2.08889E−05    45.28
2000   173    146    123    442    79         32                1.97500E−05    80.51
2500   164    147    130    441    122        43                1.95200E−05    125.79
3000   164    146    126    436    175        53                1.94444E−05    181.14
3500   164    147    130    441    239        64                1.95102E−05    246.56
4000   163    146    127    436    312        73                1.95000E−05    322.04
4500   163    146    127    436    402        90                1.98519E−05    407.58
5000   162    146    128    436    513        111               2.05200E−05    503.19
5500   162    146    128    436    645        132               2.13223E−05    608.86
5822   164    145    121    430    749        104               2.20972E−05    682.24
It appears that a larger number of attributes leads to a larger candidate set of attributes, more (FDs)d, and more computational time. This is illustrated in Figure 7. Figure 8 shows that, with the increase of M, the number of (FDs)d increased quickly and the running time increased quickly as well. This is a common phenomenon, supposedly largely due to the nature of association mining of this kind; in general, the complexity can grow exponentially with the number of attributes. It is worth noticing that the running-time curve of MFDD is not exponential. This may be due to the fact that in the mining process many (FDs)d can be inferred or filtered out without scanning the database.
7. EXPERIMENTAL ANALYSIS OF SEMANTICS
For illustrative purposes, the same TIC database was used with the same experimental platform as in Section 6.

Table IX. SC_F, UN_F, IN_F, CA_F, and Time on M (u = 0.9, N = 1000).

M    SC_F   UN_F   IN_F   CA_F   Time (s)
2    1      1      0      2      0
3    3      4      0      7      0
4    5      13     0      18     1
5    12     21     0      33     1
6    25     41     1      67     4
7    41     62     17     120    6
8    80     90     31     201    10
9    115    117    67     299    14
10   161    149    133    443    19
11   251    196    178    625    28
12   336    231    251    818    36
13   443    273    367    1083   46
14   544    325    559    1428   55
15   644    367    686    1697   64
Figure 7. SC_F, UN_F, and IN_F on M (u = 0.9, N = 1000).
For simplicity, the experiment was carried out with all 5822 tuples on the first 10 major attributes, namely, customer subtype, number of houses, average size household, average age, customer main type, Roman Catholic, Protestant, other religion, no religion, and married. Given u = 0.8, 47 scanned satisfied (FDs)d were discovered, and 75 further satisfied (FDs)d were generated by inference. The satisfied (FDs)d listed in Table X are some of the remarkable ones (at degree 0.95 or above), with the following characteristics worth mentioning.

First, under the traditional notion of functional dependency, only FDd 9 ("Customer Subtype → Customer Maintype") could be discovered, because its degree of satisfaction is 100%. This is obvious semantically and would likely be identified at database modeling time. In addition, this shows that FDd is a generalization of the traditional FD.
Figure 8. Time on M (u = 0.9, N = 1000).
Table X. Satisfied (FDs)d.

Satisfied (FDs)d                                         Degree of satisfaction
1.  Customer Subtype → Number of Houses                  0.99
2.  Customer Maintype → Number of Houses                 0.98
3.  Protestant → Number of Houses                        0.97
4.  Other Religion → Number of Houses                    0.95
5.  No Religion → Number of Houses                       0.97
6.  Married → Number of Houses                           0.97
7.  Customer Subtype → Average Size Household            0.97
8.  Customer Subtype → Average Age                       0.97
9.  Customer Subtype → Customer Maintype                 1.00
10. Customer Subtype → Roman Catholic                    0.98
11. Customer Subtype → Protestant                        0.96
12. Customer Subtype → Other Religion                    0.97
13. Customer Subtype → No Religion                       0.96
14. Customer Subtype → Married                           0.96
Second, (FDs)d 1, 7, 8, 10, 11, 12, 13, and 14 were found to be associated with "Customer Subtype," which turns out to make sense. Each of them might reflect an important piece of hidden knowledge that can hardly be discovered with traditional FDs. Moreover, all of these taken together seem to imply that "Customer Subtype" could be regarded as a candidate for a key or partial key. Third, (FDs)d 1, 2, 3, 5, 6, and 7 appear to suggest that the attribute "Number of Houses" highly depends on other attributes; a further examination indicated that this attribute is involved in transitive dependencies. This is a useful piece of information not only as semantic knowledge but also for database modeling. It is important to note that FDd 1 can be inferred from (FDs)d 3 and 11, or 4 and 12, or 5 and 13, or 6 and 14. In total, the 75 inferred (FDs)d out of the 122 discovered allow saving a significant amount of time that would otherwise be spent scanning the database.
8. CONCLUDING REMARKS
This article has aimed at tolerating partial truth of functional dependencies in data mining. An efficient approach has been presented to discover all satisfied functional dependencies with degrees of satisfaction (FDs)d, using important results obtained from the extended Armstrong-like axioms and their derivatives. The approach enables us to derive many dependencies by inference from previously discovered ones without scanning the database, and to filter out the unsatisfied ones inside (rather than after) the generating process. The fuzzy relation matrix operation has been used to infer transitive dependencies. Finally, data experiments have shown satisfactory results on the efficiency of the proposed algorithm. Future research centers on further algorithmic optimization as well as on more real-data experiments.
Acknowledgments

This work was partly supported by the National Natural Science Foundation of China (79925001/70231010), the MOE Funds for Doctoral Programs (20020003095), and the Bilateral Scientific and Technological Cooperation between China and the Czech Republic.

References

1. Mitra S, Pal SK, Mitra P. Data mining in a soft computing framework: A survey. IEEE Trans Neural Networks 2002;13(1):3–14.
2. Fayyad U, Piatetsky-Shapiro G, Smyth P. From data mining to knowledge discovery: An overview. In: Fayyad U, Piatetsky-Shapiro G, Smyth P, Uthurusamy R, editors. Advances in knowledge discovery and data mining. Cambridge, MA: AAAI Press/The MIT Press; 1996. pp 1–30.
3. Andersson M. Extracting an entity relationship schema from a relational database through reverse engineering. Entity-Relationship Approach—ER '94. Lecture Notes in Computer Science, Vol. 881; 1994. pp 403–419.
4. Baudinet M, Chomicki J, Wolper P. Constraint-generating dependencies. J Comput Syst Sci 1999;59(1):94–115.
5. Bell S, Brockhausen P. Discovery of data dependencies in relational databases. LS-8 Report 14. University of Dortmund, Germany; 1995.
6. Wyss C, Giannella C, Robertson E. FastFDs: A heuristic-driven depth-first algorithm for mining functional dependencies from relation instances. Technical Report 551, Computer Science Department, Indiana University; July 2001.
7. Castellanos M, Saltor F. Extraction of data dependencies. Report LSI-93-2-R. Barcelona: University of Catalonia; 1993.
8. Flach PA, Savnik I. Database dependency discovery: A machine learning approach. AI Commun 1999;12(3):139–160.
9. Savnik I, Flach PA. Discovery of multivalued dependencies from relations. Technical Report 00135. Albert-Ludwigs-Universitaet Freiburg, Institut fuer Informatik; 2000.
10. Wijsen J, Ng RT, Calders T. Discovering roll-up dependencies. Proc of ACM SIGKDD Int Conf on Knowledge Discovery and Data Mining. San Diego, CA; 1999. pp 213–222.
11. Liao SY, Wang HQ, Liu WY. Functional dependencies with null values, fuzzy values, and crisp values. IEEE Trans Fuzzy Syst 1999;7:97–103.
12. Kramer S, Pfahringer B. Efficient search for strong partial determinations. In: Simoudis E, Han J, Fayyad U, editors. Proc 2nd Int Conf on Knowledge Discovery and Data Mining (KDD'96). Menlo Park, CA: AAAI Press; 1996. pp 371–378.
13. Huhtala Y, Karkkainen J, Porkka P, Toivonen H. TANE: An efficient algorithm for discovering functional and approximate dependencies. Computer Journal 1999;42(2):100–111.
14. Huhtala Y, Karkkainen J, Porkka P, Toivonen H. Efficient discovery of functional and approximate dependencies using partitions. Proc 14th Int Conf on Data Engineering. Los Alamitos, CA: IEEE Computer Society Press; 1998. pp 392–401.
15. Chen GQ, Kerre EE, Vandenbulcke J. A computational algorithm for the FFD closure and a complete axiomatization of fuzzy functional dependency (FFD). Int J Intell Syst 1994;9:421–439.
16. Chen GQ, Vandenbulcke J, Kerre EE. A step towards the theory of fuzzy database design. In: Lowen R, Roubens M, editors. Proc 4th World Congress of the International Fuzzy Systems Association (IFSA'91); 1991. pp 44–47.
17. Cubero JC, Medina JM, Pons O, Vila MA. Data summarization in relational databases through fuzzy dependencies. J Inform Sci 1999;121:233–270.
18. Cubero JC, Medina JM, Pons O, Vila MA. Rules of discovery in fuzzy relational databases. In: Proc Conf North American Fuzzy Information Processing Society (NAFIPS'95). Los Alamitos, CA: IEEE Computer Society Press; 1995. pp 414–419.
19. Wang SL, Shen JW, Hong TP. Incremental discovery of functional dependencies using partitions. Annual Conf of the North American Fuzzy Information Processing Society (NAFIPS 2001), Vol. 3. IEEE; 2001. pp 1322–1326.
20. Yang YP, Singhal M. Fuzzy functional dependencies and fuzzy association rules. In: Mohania M, Tjoa AM, editors. Proc of Data Warehousing and Knowledge Discovery. Germany: Springer-Verlag; 1999. pp 229–240.
21. Wei Q, Chen GQ. Mining a minimal set of functional dependencies with degrees of satisfaction. In: Liu YM, Chen GQ, Ying MS, Cai KY, editors. Proc Int Conf on Fuzzy Information Processing: Theories and Applications. Beijing: Tsinghua University Press; 2003. pp 379–384.
22. Wei Q, Chen GQ. An efficient algorithm on mining a minimal set of functional dependencies with degrees of satisfaction. In: Bilgic T, De Baets B, Kaynak O, editors. Conf of the International Fuzzy Systems Association. Germany: Springer; 2003. pp 376–379.
23. Wei Q, Chen GQ, Kerre EE. Mining functional dependencies with degrees of satisfaction in databases. In: Caulfield HJ, Chen SH, Duro R, Honavar V, Kerre EE, Lu M, Romay MG, Shih TK, Ventura D, Wang PP, Yang YY, editors. Proc 6th Joint Conf on Information Sciences. Durham, NC: Association of Intelligent Machinery; 2002. pp 184–187.
24. Ullman JD, Widom J. A first course in database systems. Upper Saddle River, NJ: Prentice Hall; 1997.
25. Chen GQ. Fuzzy logic in data modeling: Semantics, constraints and database design. Boston, MA: Kluwer Academic Publishers; 1998.
26. Kruse R, Nauck D, Borgelt C. Data mining with fuzzy methods: Status and perspectives. Proc 7th European Congress on Intelligent Techniques and Soft Computing (EUFIT'99). Aachen, Germany: Verlag Mainz; 1999 (CD-ROM).
27. Maimon O, Kandel A, Last M. Information-theoretic fuzzy approach to knowledge discovery in databases. In: Roy R, Furuhashi T, Chawdhry PK, editors. Advances in soft computing—engineering design and manufacturing. London: Springer-Verlag; 2001. pp 315–326.
28. Kerre EE. Introduction to basic principles of fuzzy set theory and some of its applications, 2nd ed. Ghent, Belgium: Communication & Cognition; 1993.
29. The Insurance Company 2000. © Sentient Machine Research, http://www.smr.nl.