Mining a Minimal Set of Functional Dependencies with Degrees of Satisfaction^1

Qiang Wei^2, Guoqing Chen
School of Economics and Management, Tsinghua University, Beijing 100084, China

Abstract

The concept of functional dependencies with degrees of satisfaction, (FDs)d, was presented in [10]; with it, conflicts and null values, which commonly exist, can be tolerated. In this paper, we further present the concept of a minimal set of (FDs)d, free of certain redundancies, from which the whole set can be inferred. More concretely, an optimized mining method is illustrated. Finally, some conclusions and directions for future study are given.

1. Introduction

Functional dependencies (FDs) are valuable for analyzing the associations and relationships among items in databases. Much attention has been devoted to discovering dependencies from conventional datasets, such as classical FDs [1, 3, 7], minimal keys [6], multi-valued dependencies [9], constraint-generating dependencies [2], roll-up dependencies [11], and other data dependencies [4, 5, 7, 8]. However, since nulls and conflicting tuples do exist in large real databases and data warehouses, many dependencies are missed because of a few tiny “exceptions”. In [10], Wei et al. presented the concept of an FD with a degree of satisfaction, which not only accommodates conflicts and null values but also provides a general setting for the situation in which an FD is satisfied by a relation to a certain degree. Some preliminaries are listed as follows.

Definition 1: Let RS(I), I = {I_1, I_2, …, I_m}, be a relation scheme on domains D_1, D_2, …, D_m with Dom(I_k) = D_k; let A and B be subsets of the attribute set I, i.e., A, B ⊆ I; let T be a relation of RS, T ⊆ D_1 × D_2 × … × D_m; and let t_i, t_j ∈ T, i ≠ j. Then B is said to functionally depend on A for the tuple pair (t_i, t_j), denoted (t_i, t_j)(A→B), if t_i(A) = t_j(A) implies t_i(B) = t_j(B). Let TRUTH_(t_i, t_j)(A→B) be the truth value of the statement that (t_i, t_j)(A→B) holds: if t_i(A) = t_j(A) and t_i(B) ≠ t_j(B), then TRUTH_(t_i, t_j)(A→B) = 0; otherwise it is 1.

Definition 2: Let RS(I) be a relation scheme on domains D_1, D_2, …, D_m with Dom(I_k) = D_k, let A, B ⊆ I, and let T be a relation of RS with n tuples, T ⊆ D_1 × D_2 × … × D_m. Let T(A→B) denote that B is functionally dependent on A for T. Then the degree α to which T satisfies A→B is TRUTH_T(A→B), where

\[
\alpha = \mathrm{TRUTH}_T(A \to B) = \frac{\sum_{t_i, t_j \in T,\ t_i \neq t_j} \mathrm{TRUTH}_{(t_i, t_j)}(A \to B)}{\text{number of tuple pairs of } T},
\]

where the number of tuple pairs of T is n(n–1)/2.
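As an illustration of Definition 2, the degree α can be computed by brute force over all tuple pairs. The following is a minimal Python sketch, not the authors' implementation: the function names, the dictionary-based rows and the sample data are assumptions made for the example, and null values are compared like ordinary values because the definition does not say how they should be treated.

```python
from itertools import combinations

def truth_pair(t_i, t_j, lhs, rhs):
    """TRUTH for one tuple pair: 0 if the pair agrees on lhs but differs on rhs, else 1."""
    agree_lhs = all(t_i[a] == t_j[a] for a in lhs)
    agree_rhs = all(t_i[b] == t_j[b] for b in rhs)
    return 0 if agree_lhs and not agree_rhs else 1

def degree_of_satisfaction(rows, lhs, rhs):
    """Share of the n(n-1)/2 unordered tuple pairs that do not violate lhs -> rhs."""
    pairs = list(combinations(rows, 2))
    if not pairs:
        raise ValueError("at least two tuples are needed")
    return sum(truth_pair(s, t, lhs, rhs) for s, t in pairs) / len(pairs)

# Hypothetical relation with one conflicting tuple (the third row).
rows = [
    {"dept": "A", "city": "X"},
    {"dept": "A", "city": "X"},
    {"dept": "A", "city": "Y"},
    {"dept": "B", "city": "Z"},
]
alpha = degree_of_satisfaction(rows, ["dept"], ["city"])
```

On these sample rows, two of the six tuple pairs agree on dept but differ on city, so α = 4/6, and dept→city would be qualified for any threshold θ up to that value.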

So, usually, an (FD)d A→B with degree of satisfaction α is denoted (A→B)_α. Furthermore, it can be proven that such (FDs)d as A→B have the following properties. Given a relation T on RS(I) and A, B, C ⊆ I:

P1: If B ⊆ A, then TRUTH_T(A→B) = 1.
P2: TRUTH_T(AC→BC) ≥ TRUTH_T(A→B).
P3: TRUTH_T(A→C) ≥ TRUTH_T(A→B) + TRUTH_T(B→C) – 1.
P4: TRUTH_T(A→B) + TRUTH_T(B→C) ≥ 1.

Given a threshold θ ∈ (0, 1], an (FD)d (A→B)_α is qualified if α ≥ θ. In [10], these properties are used in generating the whole set of qualified (FDs)d. However, P3 is only partly incorporated there, and P4 has not been used yet. In this paper, we consider more efficient optimization strategies that make full use of P3.

The whole set of qualified (FDs)d, however, contains some (FDs)d that are redundant. Briefly, there are two types of redundancy. First, any qualified (FD)d that can be inferred from other qualified (FDs)d is redundant. Second, since TRUTH_T(A→X) ≤ TRUTH_T(A→X’), where X’ ⊂ X, only 1-consequent (FDs)d need to be considered (this is discussed later). In this paper, we present a concept called the minimal set of qualified (FDs)d, which contains all non-redundant 1-consequent (FDs)d.

This paper is organized as follows. Section 2 introduces the minimal set of (FDs)d. A transitive closure is constructed in Section 3. Section 4 illustrates the optimized algorithm for mining the minimal set of qualified (FDs)d. Finally, conclusions and directions for further study are given in Section 5.

^1 Partly supported by the ‘Nation’s Outstanding Young Scientists Funds’ of China (No. 79925001), the Bilateral Scientific and Technological Cooperation Programme Between China & Flandres (174B0201) and Tsinghua’s Soft Science Key Project on E-Commerce.
^2 Corresponding author. Email address: [email protected].

2. Preliminaries and the Compacted Set of Qualified (FDs)d

First, Theorem 1 is stated. Based on Theorem 1, a lower bound on the degree of satisfaction of some (FDs)d can be inferred without scanning the database. If the lower bound of an inferred (FD)d is no less than the minimal threshold θ, the (FD)d can be determined to be qualified without scanning the database.

Theorem 1: Given a relation scheme RS(I), I = {I_1, I_2, …, I_m}, let T be a relation of RS. Let F be a set of (FDs)d on T and F^+ be the set of (FDs)d that are logically implied by F under the axiomatic system that contains Properties 1, 2 and 3, i.e.,
A1: If B ⊆ A, then (A→B)_θ ∈ F^+ for all θ ∈ (0, 1];
A2: If (A→B)_α ∈ F^+, then (AC→BC)_α ∈ F^+;
A3: If (A→B)_α and (B→C)_β ∈ F^+, then (A→C)_γ ∈ F^+ with γ = α + β – 1.

Definition 3: Let F_A be the set of all (FDs)d that are derived from F using the axioms (A1, A2, A3), i.e., F_A = {(A→B)_α | (A→B)_α is inferred from F using the axioms and α ∈ (0, 1]}.

Definition 4: Let F_A^+ be the set that contains all the upper-bound (FDs)d of F_A, i.e., F_A^+ = {(A→B)_α | (A→B)_α ∈ F_A and α = sup{β | (A→B)_β ∈ F_A}}. F_A^+ is also called the transitive closure on F.

Clearly, F_A^+ is more interesting than F_A. For example, given two (FDs)d (A→B)_0.5 and (A→B)_0.7, the latter is more interesting because it is semantically more reliable. Since in data mining the task is to drill F from F_A^+, in this paper F_A^+ specifically denotes the whole set of all possible (FDs)d with θ ≥ 0, which means that F_A^+ contains the functional dependencies composed of all possible combinations of items, listed along with their degrees of satisfaction. If F_A^+ is regarded as a fuzzy set whose elements are (FDs)d with membership degrees given by their degrees of satisfaction, the task of mining the set of qualified (FDs)d described in [10] is equivalent to discovering the θ-cut of F_A^+.

However, the θ-cut of F_A^+ is not enough, for some (FDs)d can be inferred from other (FDs)d according to Theorem 1. The minimal set can then be considered. Before the formal definition of the minimal set, equivalent sets should be introduced.

Definition 5: Let RS(I) be a relation scheme and T be a relation of RS. Two sets of (FDs)d, F and G, are called equivalent with respect to T if and only if F_A^+ = G_A^+.

Based on this definition, the task of discovering the set of (FDs)d can be further focused on discovering the smallest set equivalent to F_A^+. Further, a strong assumption can be incorporated to reduce the size of the discovered set. As discussed previously, A→X is redundant with respect to A→X’, where X’ ⊂ X. Accordingly, any k-consequent (FD)d A→I_1I_2…I_k, k ≤ m, can be deemed redundant with respect to all A→I_j, j = 1, …, k. In the mining task, if a 1-consequent A→I is not qualified, then no A→IX is qualified. On the other hand, if all A→I_j, 1 ≤ j ≤ k, are qualified, it cannot be inferred whether A→I_1I_2…I_k is qualified, but this need not be calculated, for semantically it provides no more useful information than A→I_j, j = 1, …, k. Thus, in mining qualified (FDs)d, only 1-consequent (FDs)d are considered, and the definition of a minimal set of (FDs)d can be given.

Definition 6: Given a relation scheme RS(I), let T be a relation of RS. Then, for a set F of (FDs)d on T, a set of (FDs)d M_F is called minimal if and only if:
1) The right side of each A→B ∈ M_F is a single item;
2) There is no (A→B)_α in M_F such that M_F – {(A→B)_α} is equivalent to M_F;
3) There is no (A→B)_α in M_F and no Z ⊂ A such that (M_F – {(A→B)_α}) ∪ {(Z→B)_α} is equivalent to M_F.
Further, if M_F is minimal and M_F is equivalent to F, then M_F is called the minimal set of F.

Theorem 2 guarantees that, if the θ-cut of the minimal set of F_A^+ is derived, then the θ-cut of the 1-consequent F_A^+ can be inferred.

Theorem 2: The θ-cut of the minimal set of F_A^+ is equivalent to the θ-cut of the 1-consequent F_A^+.

This theorem is important: based on it, the task of discovering the whole set of qualified (FDs)d in [10] is equivalent to discovering the minimal set of qualified (FDs)d. Thus, the above redundancies can be avoided.

3. Mining Minimal Set of (FDs)d based on Transitive Closure Operation

The definition of the minimal set describes what a set without redundancy is; however, it does not show how to discover the minimal set. In this section, a transitive-closure-based operation is introduced, with which the θ-cut of the minimal set can be derived.

Definition 7: A dependency relation R from X = {x_1, x_2, …, x_m} to Y = {y_1, y_2, …, y_k} is defined as a mapping X × Y → [0, 1]. In other words, a dependency relation R from X to Y is a fuzzy set on X × Y with a degree d_R(x, y): X × Y → [0, 1]. For all (x, y) ∈ R, d_R(x, y) is interpreted as the degree of satisfaction with which y depends on x. Thus an m_x × m_y matrix can be constructed to denote R, where d_ij = d_R(x_i, y_j) = TRUTH_T(x_i→y_j). In particular, d_ij = d_R(x_i, y_j) = 1 when y_j ⊆ x_i.

Then, for the set of 1-precedent 1-consequent (FDs)d, denoted F_1, a dependency relation on I × I can be constructed correspondingly. Since only 1-consequent (FDs)d are considered, for each set of p-precedent 1-consequent (FDs)d, denoted F_p, a dependency relation on I^p × I can be constructed correspondingly. For simplicity, F_p is also used to denote the dependency relation on I^p × I.

Definition 8: If S_1 is a dependency relation from X to Y and S_2 is a dependency relation from Y to Z, then the transitive relation from X to Z, denoted S_1S_2, is defined as follows: S_1S_2(x, z) = S_1(x, y) ⊗ S_2(y, z). Concretely, d_S1S2(x, z) = max_{y∈Y}(d_S1(x, y) + d_S2(y, z) – 1).

Definition 9: If R is a dependency relation from X to X, then denote R^(k+1) = R^k ⊗ R, k = 1, 2, ….

Theorem 3: If R is a dependency relation from X to X, then R^(a+b) = R^a ⊗ R^b.

Theorem 4: If R is a dependency relation from X to X, and S is a dependency relation from Y to X, then S ⊗ R^(a+b) = S ⊗ R^a ⊗ R^b.

With the above definitions and theorems, d_ij in (F_1)^k is the upper bound of the degree of satisfaction of I_i→I_j that can be inferred based on Theorem 1 in k transitive operations, and d_ij in (F_p) ⊗ (F_1)^k is the upper bound of the degree of satisfaction of E_i→I_j, E_i ∈ I^p. Further, a property can be derived, as shown in Theorem 5.

Lemma 1: Given two relations S on Y × X and R on X × X, S ⊗ R^(k+q) ≥ S ⊗ R^k, q ≥ 1.

Theorem 5: For two sets X and Y (|X| = m, |Y| = k), given two fuzzy relations S on Y × X and R on X × X respectively, S ⊗ R^(m–1) = S ⊗ R^((m–1)+q), q ≥ 1.
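The composition in Definition 8 and the stabilization stated in Theorem 5 can be tried on a small matrix. The sketch below is illustrative only: the helper names compose and power and the sample degrees are assumptions, exact fractions are used to avoid rounding noise, and the example relation has d_ii = 1 on its diagonal, as Definition 7 prescribes when y_j ⊆ x_i.

```python
from fractions import Fraction

def compose(s, r):
    """S ⊗ R: entry (x, z) is the max over y of s[x][y] + r[y][z] - 1."""
    return [
        [max(row[y] + r[y][z] - 1 for y in range(len(r))) for z in range(len(r[0]))]
        for row in s
    ]

def power(r, k):
    """R^k built as in Definition 9: R^(k+1) = R^k ⊗ R."""
    result = r
    for _ in range(k - 1):
        result = compose(result, r)
    return result

one, zero = Fraction(1), Fraction(0)

# Hypothetical 1-precedent relation on three items: I1→I2 at 0.9, I2→I3 at 0.8.
r = [
    [one, Fraction(9, 10), zero],
    [zero, one, Fraction(8, 10)],
    [zero, zero, one],
]
m = len(r)
r2 = power(r, m - 1)   # R^(m-1)
r3 = power(r, m)       # R^m, one more composition

# Hypothetical relation S with a single composite precedent E, where E→I2 holds at 0.95.
s = [[zero, Fraction(19, 20), zero]]
s_closed = compose(s, r2)
```

In this example the entry for I1→I3 becomes 0.9 + 0.8 – 1 = 0.7 after one composition, and a further composition changes nothing, so R^(m–1) and R^m coincide. Composing S with R^(m–1) gives E→I3 the inferred degree 0.95 + 0.8 – 1 = 0.75 in a single step.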
Theorem 5 guarantees that the inference operations stop within a limited number of steps (at most m steps). Since the transitive operation is based on the axioms in Theorem 1, (F_1)^m is equal to the 1-precedent 1-consequent F_A^+, and (F_p) ⊗ (F_1)^(m–1) is equal to the p-precedent 1-consequent F_A^+. After combining all the (F_p) ⊗ (F_1)^(m–1), p ∈ [1, m], the 1-consequent F_A^+ can be derived.

Definition 10: For a dependency relation R with d_ij, i ∈ [1, m_x], j ∈ [1, m_y], the θ-cut of R is the dependency relation θ-R in which d_ij is set to 0 if d_ij < θ and is left unchanged otherwise.

Theorem 6: (θ-R) ⊗ (θ-R) = θ-(R ⊗ R).

Theorem 6 guarantees that, while inferring qualified (FDs)d, only the qualified (FDs)d need to be considered.

4. The Process of Mining θ-cut of Minimal Set of (FDs)d

In this section, the process of mining the θ-cut of the minimal set of (FDs)d is illustrated.

First, all p-precedent 1-consequent (FDs)d are computed. At the beginning, the original relation F_1 on I × I is initialized with d_ij = 1 for i = j and d_ij = 0 for i ≠ j. Then, for a 1-precedent 1-consequent (FD)d I_i→I_j, if it does not belong to the set of inferred 1-precedent qualified (FDs)d, IN_F1, the database is scanned to compute its α. If α is no less than θ, d_ij in F_1 is replaced with α and (I_i→I_j)_α is added to the set of scanned 1-precedent qualified (FDs)d, SC_F1. Otherwise, I_i→I_j is added to the set of disqualified (FDs)d, DIS_F, and d_ij = 0 is kept. An updated F_1 is thus derived, on which the transitive closure operation can be performed to infer other (FDs)d (Theorem 6); after at most m – 1 operations, (F_1) ⊗ (F_1)^(m–1) = (F_1)^m is derived (Theorem 5). Then (F_1)^m is scanned: if d^m_ij is no less than θ, I_i→I_j with d^m_ij is inserted into IN_F1; otherwise, d^m_ij is set back to zero, and R^m_1 is set to R_1. The process then goes on to the next 1-precedent 1-consequent (FD)d.

Finally, after going through all the 1-precedent (FDs)d, all the 1-precedent 1-consequent (FDs)d are assigned to three separate sets: SC_F1, IN_F1 and DIS_F. Clearly, the (FDs)d in SC_F1 belong to the minimal set, and DIS_F is used to generate candidate 2-precedent (FDs)d [10]. This is because only when A→I and B→I are both not qualified does AB→I have the potential to be qualified and minimal. The finally updated F_1 is equal to the 1-precedent 1-consequent F_A^+, which is useful for further transitive closure operations.

Second, given a DIS_F of (p–1)-precedent (FDs)d, the candidate set of p-precedent (FDs)d, CA_F, can be generated, based on which F_p on Y × I is initialized, where Y is the set of all precedents in CA_F. Then, using a process similar to that for F_1, F_p can be derived.
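The first pass described above can be sketched in Python as follows. This is one reading of the prose, not the authors' code: the function and variable names are hypothetical, degrees are computed by a brute-force scan over tuple pairs, the thresholded closure is taken to replace F_1 after every qualified scan, and the sample relation is invented for the example.

```python
from fractions import Fraction
from itertools import combinations

def degree(rows, a, b):
    """Share of tuple pairs that do not violate a -> b (single attributes)."""
    pairs = list(combinations(rows, 2))
    ok = sum(1 for s, t in pairs if s[a] != t[a] or s[b] == t[b])
    return Fraction(ok, len(pairs))

def compose(s, r):
    """Entry (x, z) is the max over y of s[x][y] + r[y][z] - 1."""
    n = len(r)
    return [[max(row[y] + r[y][z] - 1 for y in range(n)) for z in range(n)] for row in s]

def closure(f1):
    """(F1)^m, obtained with m - 1 compositions."""
    result = f1
    for _ in range(len(f1) - 1):
        result = compose(result, f1)
    return result

def mine_one_precedent(rows, attrs, theta):
    m = len(attrs)
    # Initial F1: 1 on the diagonal, 0 elsewhere.
    f1 = [[Fraction(int(i == j)) for j in range(m)] for i in range(m)]
    scanned, inferred, disqualified = {}, {}, set()
    for i in range(m):
        for j in range(m):
            key = (attrs[i], attrs[j])
            if i == j or key in inferred:
                continue  # trivial, or already qualified by inference: no scan
            alpha = degree(rows, attrs[i], attrs[j])
            if alpha < theta:
                disqualified.add(key)
                continue
            scanned[key] = alpha
            f1[i][j] = alpha
            closed = closure(f1)
            for x in range(m):
                for z in range(m):
                    if closed[x][z] < theta:
                        closed[x][z] = Fraction(0)  # theta-cut of the closure
                    elif x != z and (attrs[x], attrs[z]) not in scanned:
                        inferred[(attrs[x], attrs[z])] = closed[x][z]
            f1 = closed  # assumption: the cut closure becomes the new F1
    return scanned, inferred, disqualified

# Hypothetical relation in which a -> b and b -> c hold exactly.
rows = [
    {"a": 1, "b": 1, "c": 1},
    {"a": 2, "b": 1, "c": 1},
    {"a": 3, "b": 2, "c": 2},
    {"a": 4, "b": 2, "c": 2},
]
scanned, inferred, disqualified = mine_one_precedent(rows, ["b", "a", "c"], Fraction(9, 10))
```

With this attribute order, b→c and a→b are scanned first, so a→c is placed in the inferred set without a database scan, while b→a and c→a (degree 2/3) are disqualified and would feed candidate generation for 2-precedent (FDs)d. The outcome depends on the order in which candidates are visited, which the prose leaves open.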
It should be mentioned that, since (F_1)^m has been derived, the transitive closure (F_p) ⊗ (F_1)^(m–1) can be derived in a one-step operation. After going through all the p-precedent candidate (FDs)d, SC_Fp, IN_Fp and DIS_F are obtained. Then, based on DIS_F, the CA_F of (p+1)-precedent (FDs)d can be obtained. When CA_F is empty, the whole process stops, and after combining all the SC_Fk, the minimal set is derived. Concretely, the pseudo-code of the algorithm is as follows:

Algorithm Minimal_Set_(FDs)d
    SC_Fp = ∅; IN_Fp = ∅; DIS_F = ∅; CA_Fp = ∅;
    CA_F1 = {(I_j → I_i)_α | I_j ∈ I, α = 0 while i ≠ j, and α = 1 while i = j};
    p = 1;
    WHILE CA_Fp ≠ ∅
        F_p = CA_Fp;
        FOR ALL f ∈ CA_Fp
            IF f ∉ IN_Fp THEN
                Degree = Degree_Satisfaction(f);
                IF Degree ≥ θ THEN f ⇒ SC_Fp and f ⇒ F_p;
                ELSE f ⇒ DIS_F;
                F_p = F_p ⊗ (F_1)^(m–1);
                FOR ALL f ∈ F_p
                    IF f.degree ≥ θ THEN f ⇒ IN_Fp;
                    ELSE f.degree = 0;
                ENDFOR
        ENDFOR
        CA_Fp+1 = Generate_Candidate(DIS_F);
        DIS_F = ∅;
        p++;
    END WHILE
    M_F = ∪_{1 ≤ k ≤ p} SC_Fk

Next, we give a brief analysis of the computational complexity compared with the algorithm in [10]. For example, while deriving a 1-precedent (FD)d, if it is qualified, then F_1 is modified, and on this basis the transitive closure operation is carried out for at most m steps. The computational complexity of this is at most O(2m^3). If the transitive closure operation generates an (FD)d whose degree of satisfaction is no less than θ, then this (FD)d can be inferred to be qualified, whereas scanning the database has a complexity on the order of O(n^2/2). It is clear that O(2m^3)
