Mining Functional Dependencies with Degrees of Satisfaction in Databases¹

Qiang Wei*², Guoqing Chen*
*School of Economics and Management, Tsinghua University, Beijing 100084, China

¹ Partly supported by the “Nation’s Outstanding Young Scientists Funds” of China (No. 79925001), the Bilateral Scientific and Technological Cooperation Programme Between China and Flandres (174B0201), and Tsinghua’s Soft Science Key Project on E-Commerce.
² Corresponding author. Email: [email protected].

Abstract

Mining functional dependencies (FDs) is valuable in analyzing the relationships among items in databases. This paper presents a notion of FDs with degrees of satisfaction, i.e., (FDs)d, aimed at reflecting the extent to which FDs are satisfied by given database relations. Furthermore, some desirable properties and derivatives are derived. Consequently, an algorithm is proposed to discover the set of qualified (FDs)d from large-scale databases using data mining approaches.

1. Introduction

Mining association rules is one of the important issues in analyzing the relationships among items in large-scale databases [1-2, 7]. As data dependencies in relational databases can be viewed as a form of association, both researchers and practitioners have been attracted to discovering dependencies from conventional datasets, such as functional dependencies [3, 8, 11], minimal keys [9], multi-valued dependencies [12], constraint-generating dependencies [4], roll-up dependencies [13], and other data dependencies [5, 6].

However, since nulls and conflicting tuples do exist in large real databases and data warehouses, many dependencies will be missed because of a few tiny “errors”. In 1998, Huhtala et al. considered using the concept of approximate dependency to deal with so-called error tuples; however, such tuples are only regarded as exceptions [10].

This paper focuses on functional dependencies (FDs) and defines a notion of FDs with degrees of satisfaction, which can not only accommodate conflicts and null values but also provide a general setting to deal with the situation in which an FD is satisfied by a relation to a certain degree.

2. FDs with Degrees of Satisfaction

Definition 1: Let R(I1, I2, …, Im) be a relation scheme on domains D1, D2, …, Dm with Dom(Ik) = Dk, let A and B be subsets of the attribute set I = {I1, I2, …, Im}, i.e., A, B ⊆ I, let T be a relation of R, T ⊆ D1 × D2 × … × Dm, and let ti, tj ∈ T, ti ≠ tj. Then B is said to functionally depend on A for a transaction pair (ti, tj), denoted as (ti, tj)(A→B), if ti(A) = tj(A) then ti(B) = tj(B).

Let µ(ti, tj)(A→B) be the truth value that (ti, tj)(A→B) holds. Then we have:

1) If ti(A) = tj(A) and ti(B) = tj(B), then µ(ti, tj)(A→B) = 1;
2) If ti(A) = tj(A) and ti(B) ≠ tj(B), then µ(ti, tj)(A→B) = 0;
3) If ti(A) ≠ tj(A), then µ(ti, tj)(A→B) = 1.

That is, (ti, tj) is said to satisfy A→B if µ(ti, tj)(A→B) = 1, while (ti, tj) dissatisfies A→B if µ(ti, tj)(A→B) = 0. Next, the degree to which T satisfies A→B is defined as follows:

Definition 2: Given T (n tuples in total) and I, with A ⊆ I and B ⊆ I, let T(A→B) denote that B is functionally dependent on A for T. Then the degree to which T satisfies A→B is µT(A→B), where

$$\mu_T(A \to B) = \frac{\sum_{t_i, t_j \in T,\; t_i \neq t_j} \mu_{(t_i, t_j)}(A \to B)}{\binom{n}{2}}$$

and each unordered pair of distinct tuples is counted once, matching the denominator.

Apparently, the notion of conventional FDs is the special case of Definition 2 with µT(A→B) = 1.

Furthermore, it can be proven that such functional dependencies with degrees of satisfaction, (FDs)d, as T(A→B) have the following properties.

Property 1: Given a T on R(I1, I2, …, Im) and A, B ⊆ I, if B ⊆ A, then µT(A→B) = 1.

Property 2: Given a T on R(I1, I2, …, Im) and A, B, C ⊆ I, if µT(A→B) ≥ α, then µT(AC→BC) ≥ α.

Property 3: Given a T on R(I1, I2, …, Im) and A, B, C ⊆ I, if µT(A→B) ≥ α and µT(B→C) ≥ β, then µT(A→C) ≥ α + β – 1.

Property 4: Given a T on R(I1, I2, …, Im) and A, B, C ⊆ I, if µT(A→B) = α and µT(B→C) = β, then α + β ≥ 1.

Given a threshold θ in [0, 1], a (FD)d T(A→B) is called a qualified (FD)d if µT(A→B) ≥ θ. Let us consider the following three important derivatives of (FDs)d.

D1: If A = ∅ or B = ∅, then A→B is generally meaningless. Further, if A ∩ B ≠ ∅, without loss of generality, suppose A = A’X and B = B’X, with A’ ∩ B’ = ∅. Then, if T(A’X→B’) is a qualified (FD)d, T(A’X→B’X) is qualified as well, due to the fact that µT(A’X→B’) = µT(A’X→B’X).

D2: If T(A→B) is qualified, then T(AX→B) is also qualified, because µT(AX→B) ≥ µT(AX→A) + µT(A→B) – 1 = 1 + µT(A→B) – 1 = µT(A→B) ≥ θ, according to Property 1 and Property 3.

D3: For X ⊆ I with X ∩ A = ∅ and X ∩ B = ∅, µT(A→B) ≥ µT(A→BX) + µT(BX→B) – 1 = µT(A→BX) + 1 – 1 = µT(A→BX). This means that if T(A→B) is not qualified, then T(A→BX) is not qualified either.

The above properties and derivatives are important because they can be used as inferring and pruning strategies in the (FDs)d mining process. For instance, if T(A→B) is qualified, then the fact that T(AX→B) is qualified can be directly inferred without scanning the database. On the other hand, if T(A→B) is already known to be disqualified, then T(A→BX) will be disqualified too, so further mining steps for T(A→BX) are not necessary.

3. Mining Algorithm

The task of mining functional dependencies with degrees of satisfaction can be regarded as discovering all qualified (FDs)d given a threshold θ, θ ∈ [0, 1]. Similar to mining association rules [1, 2], the mining algorithm is constructed on the lattice, which can be searched efficiently with a breadth-first strategy.

Let an i-antecedent (FD)d be a (FD)d with i attributes in the antecedent and a j-consequent (FD)d be a (FD)d with j attributes in the consequent (i, j = 1, 2, …, m). Then, given the set of all i-antecedent 1-consequent (FDs)d, filtering with θ yields the set of qualified i-antecedent 1-consequent (FDs)d (QFi1), from which the set of candidate i-antecedent 2-consequent (FDs)d (CFi2) is generated using D3. Then QFi2 is filtered out of CFi2, CFi3 is generated in turn, and so on until the set of generated candidate i-antecedent (FDs)d is empty. Thus all qualified i-antecedent (FDs)d are derived.

More concretely, the mining process begins with 1-antecedent (FDs)d. After generating all qualified 1-antecedent (FDs)d, QF21 can be derived from QF11 according to D2 without scanning the database, and CF21 can also be generated. After filtering, the qualified (FDs)d in CF21 are added to QF21, which composes the final QF21. Then all CF2l and QF2l can be derived. Likewise, all the qualified 2-antecedent (FDs)d are obtained. The process then goes on to 3-antecedent, 4-antecedent, ..., k-antecedent (FDs)d, until both the set of generated candidate 1-consequent (FDs)d and the set of generated qualified 1-consequent (FDs)d are empty. Finally, the whole set of qualified (FDs)d is obtained.
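The inference step D2 and the D3-based candidate generation described above can be illustrated with a small Python sketch. This is not the authors' implementation: the helper names `fd`, `qualified_generation`, and `candidate_generation` are hypothetical, dependencies are modeled as pairs of frozensets, and the candidate step only joins consequents without any further subset check. The input is the set of three qualified 1-antecedent dependencies B→A, B→C, C→A taken from the example in Section 4.

```python
from itertools import combinations


def fd(lhs, rhs):
    """Build a dependency lhs -> rhs as a pair of frozensets of attribute names."""
    return (frozenset(lhs), frozenset(rhs))


def qualified_generation(qualified, attributes):
    """D2: if A -> B is qualified, then AX -> B is qualified too (no database scan).

    Assumption: attributes already in the consequent are skipped, since
    overlapping antecedent and consequent add nothing new (see D1).
    """
    inferred = set()
    for lhs, rhs in qualified:
        for attr in attributes:
            if attr not in lhs and attr not in rhs:
                inferred.add((lhs | {attr}, rhs))
    return inferred


def candidate_generation(qualified):
    """D3: A -> BX can only be qualified if A -> B is qualified, so candidates
    are built by joining consequents of qualified dependencies that share
    the same antecedent."""
    candidates = set()
    for (lhs1, rhs1), (lhs2, rhs2) in combinations(qualified, 2):
        joined = rhs1 | rhs2
        same_size = len(rhs1) == len(rhs2)
        if lhs1 == lhs2 and same_size and len(joined) == len(rhs1) + 1:
            candidates.add((lhs1, joined))
    return candidates


# Qualified 1-antecedent 1-consequent dependencies from the paper's example.
qf11 = {fd("B", "A"), fd("B", "C"), fd("C", "A")}
cf12 = candidate_generation(qf11)          # the single candidate B -> AC
qf21 = qualified_generation(qf11, "ABC")   # BC -> A and AB -> C, inferred via D2
print(cf12)
print(qf21)
```

Under these assumptions, the only 2-consequent candidate is B→AC, and BC→A and AB→C are obtained without computing any degree of satisfaction.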

Algorithm Functional_Dependency
// QFij: the set of qualified i-antecedent j-consequent (FDs)d
// CFij: the set of candidate i-antecedent j-consequent (FDs)d
QFij = ∅; QF = ∅;
CF11 = {f: Ix → Iy, ∀ Ix, Iy ∈ I, Ix ≠ Iy};
i = 1;
WHILE QFi1 ≠ ∅ OR CFi1 ≠ ∅ DO
    j = 1;
    WHILE CFij ≠ ∅ OR QFij ≠ ∅ DO
        FOR ALL f ∈ CFij
            Degree = Degree_Satisfaction(f);
            IF Degree ≥ θ THEN f ⇒ QFij;
        END FOR
        // Generate CFij+1 based on QFij
        CFij+1 = Candidate_Generation(QFij);
        j ++;
    END WHILE
    // Generate QFi+11 from QFi1
    QFi+11 = Qualified_Generation(QFi1);
    Fi+11 = {f: I1I2…Ii+1 → Ii+2, ∀ Ix ∈ I, Ix ≠ Iy, x, y ∈ [1, i+2]};
    CFi+11 = Fi+11 – QFi+11;
    i ++;
END WHILE
QF = ∪ QFij

In this algorithm, the sub-procedure Degree_Satisfaction computes the degree of satisfaction of each (FD)d using relation grouping operations, which is more efficient than directly using Definition 2. The sub-procedure Candidate_Generation generates CFij+1 based on QFij using relation join operations on the consequents of different (FDs)d with the same antecedent. The sub-procedure Qualified_Generation generates QFi+11 based on QFi1 using join operations on the antecedents of different (FDs)d with the same consequent. Thus all qualified (FDs)d are finally derived and stored in QF. It is worth noting that the algorithm is efficient by combining data mining and database approaches, along with inferring and pruning strategies, and is at the same computational level as classical data mining methods [2].

4. An Example

Suppose a database is as shown in Table 1 with θ = 80%. The mining process and discovered results are shown in Figure 1.

Table 1. An Example (θ = 80%)

No.   A   B   C
#1    1   a   10
#2    1   a   20
#3    2   b   20
#4    2   c   20
#5    2   d   N/A

Figure 1. Mining Process and Results

Set    (FDs)d and µ
CF11   A→B 0.7, A→C 0.7, B→A 1.0, B→C 0.9, C→A 0.8, C→B 0.7
QF11   B→A, B→C, C→A
CF12   B→AC 0.9
QF12   B→AC
CF13   ∅
CF21   AC→B 0.9
QF21   BC→A, AB→C, AC→B
CF22   ∅
CF31   ∅
QF31   ∅
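The degrees listed in Figure 1 can be reproduced with a short Python sketch. It is only one plausible reading of the "relation grouping operations" mentioned for Degree_Satisfaction, not the authors' code: tuples are grouped on the antecedent and on antecedent plus consequent, and the violating pairs are those that agree on the former but not on the latter. The names `ROWS`, `COLUMNS`, `project`, and `degree_of_satisfaction` are hypothetical, and "N/A" is assumed to behave as an ordinary value distinct from all others, which is what matches the published numbers.

```python
from collections import Counter
from math import comb

# Table 1 of the example; "N/A" is treated as an ordinary distinct value (assumption).
COLUMNS = ("A", "B", "C")
ROWS = [
    (1, "a", 10),
    (1, "a", 20),
    (2, "b", 20),
    (2, "c", 20),
    (2, "d", "N/A"),
]


def project(row, attrs):
    """Return the values of the named attributes of one tuple."""
    return tuple(row[COLUMNS.index(a)] for a in attrs)


def degree_of_satisfaction(rows, lhs, rhs):
    """Degree to which rows satisfy lhs -> rhs, following Definition 2.

    A pair violates the dependency when it agrees on lhs but not on rhs.
    Such pairs are counted by grouping instead of enumerating all pairs.
    """
    total_pairs = comb(len(rows), 2)
    lhs_groups = Counter(project(r, lhs) for r in rows)
    both_groups = Counter(project(r, lhs) + project(r, rhs) for r in rows)
    agree_on_lhs = sum(comb(size, 2) for size in lhs_groups.values())
    agree_on_both = sum(comb(size, 2) for size in both_groups.values())
    violations = agree_on_lhs - agree_on_both
    return (total_pairs - violations) / total_pairs


# Antecedents and consequents are given as strings of attribute names.
for lhs, rhs in [("A", "B"), ("A", "C"), ("B", "A"), ("B", "C"),
                 ("C", "A"), ("C", "B"), ("B", "AC"), ("AC", "B")]:
    print(f"{lhs}->{rhs}: {degree_of_satisfaction(ROWS, lhs, rhs):.1f}")
```

For example, A→B is violated by the pairs (#3, #4), (#3, #5), and (#4, #5), so its degree is 7/10 = 0.7, below the threshold of 80%.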

As a result, all the qualified dependencies can be derived: QF = {B→A, B→C, C→A, B→AC, BC→A, AB→C, AC→B}.

Notably, there are three advantages for (FDs)d with its corresponding mining algorithm. First, partial degree of satisfaction for a FD is accommodated, considering the possible null values and semantic interests. This leads to several (FDs)d rather than only a single traditional FD: B→A. Second, incorporating D2 into the algorithm enables us to directly derive many (FDs)d, such as BC→A and AB→C, without scanning the database, which reduces the computational complexity. Third, with D3 as a pruning strategy in the algorithm, the procedure of generating multi-consequent (FDs)d is more efficient, as examining and generating many candidate attributes/rules can be avoided.

5. Conclusions and Future Work

In this paper, the notion of (FDs)d has been presented, which can reflect partial degrees of satisfaction for functional dependencies and is regarded as meaningful for large databases. Accordingly, the corresponding mining methods have been provided to discover all the qualified (FDs)d efficiently based on certain proven properties. Future work is being carried out to deal with such issues as further algorithmic optimizations, properties of the (FDs)d set, and extensive data experiments.

References

[1] Rakesh Agrawal, Tomasz Imielinski, Arun Swami, Mining Association Rules between Sets of Items in Large Databases, In Proc. of the ACM-SIGMOD 1993 Int'l Conference on Management of Data, Washington D.C., May 1993, 207-216.

[2] R. Agrawal, R. Srikant, Fast Algorithms for Mining Association Rules, Proc. of the 20th Int'l Conference on Very Large Databases, Santiago, Chile, Sept. 1994. Expanded version available as IBM Research Report RJ9839, June 1994.

[3] Martin Andersson, Extracting an Entity Relationship Schema from A Relational Database Through Reverse Engineering, http://citeseer.nj.nec.com/.

[4] Marianne Baudinet, Jan Chomicki, Pierre Wolper, Constraint-Generating Dependencies, http://citeseer.nj.nec.com/.

[5] Siegfried Bell, Peter Brockhausen, Discovery of Data Dependencies in Relational Databases, http://citeseer.nj.nec.com/.

[6] Malu Castellanos, Felix Saltor, Extraction of Data Dependencies, http://citeseer.nj.nec.com/.

[7] G. Q. Chen, Q. Wei, E. E. Kerre, Fuzzy Data Mining: Discovery of Fuzzy Generalized Association Rules. In Bordogna & Pasi (eds.), Recent Research Issues on Management of Fuzziness in Databases, Springer (Physica-Verlag), 2000.

[8] Peter A. Flach, Iztok Savnik, Database Dependency Discovery: A Machine Learning Approach, AI Communications, ISSN 0921-7126.

[9] C. Giannella, C. M. Wyss, Finding Minimal Keys in a Relation Instance, http://citeseer.nj.nec.com/.

[10] Yka Huhtala, Juha Karkkainen, Pasi Porkka, Hannu Toivonen, Efficient Discovery of Functional and Approximate Dependencies Using Partitions, http://citeseer.nj.nec.com/.

[11] Stefan Kramer, Bernhard Pfahringer, Efficient Search for Strong Partial Determinations, http://citeseer.nj.nec.com/.

[12] Iztok Savnik, Peter A. Flach, Discovery of Multivalued Dependencies From Relations, March 7, 2000, http://citeseer.nj.nec.com/.

[13] Jef Wijsen, Raymond T. Ng, Toon Calders, Discovering Roll-Up Dependencies, http://citeseer.nj.nec.com/.
