Mining Functional Dependencies with Degrees of Satisfaction in Databases¹
Qiang Wei*², Guoqing Chen*
*School of Economics and Management, Tsinghua University, Beijing 100084, China

¹ Partly supported by "Nation's Outstanding Young Scientists Funds" of China (No. 79925001), the Bilateral Scientific and Technological Cooperation Programme Between China and Flandres (174B0201) and Tsinghua's Soft Science Key Project on E-Commerce.
² Corresponding author. Email: [email protected].

Abstract
Mining functional dependencies (FDs) is valuable in analyzing the relationships among items in databases. This paper presents a notion of FDs with degrees of satisfaction, i.e., (FDs)d, aimed at reflecting the extent to which FDs are satisfied by given database relations. Furthermore, some desirable properties and derivatives are derived. Based on these, an algorithm is proposed to mine the set of qualified (FDs)d from large-scale databases using data mining approaches.

1. Introduction

Mining association rules is one of the important issues in analyzing the relationships among items in large-scale databases [1-2, 7]. As data dependencies in relational databases can be viewed as a form of association, both researchers and practitioners have been attracted to discovering dependencies from conventional datasets, such as functional dependencies [3, 8, 11], minimal keys [9], multi-valued dependencies [12], constraint-generating dependencies [4], roll-up dependencies [13], and other data dependencies [5, 6].

However, since nulls and conflicting tuples do exist in large real databases or data warehouses, many dependencies will be missed because of some tiny "errors". In 1998, Huhtala et al. considered using the concept of approximate dependency to deal with so-called error tuples; however, such tuples are only regarded as exceptions [10].

This paper focuses on functional dependencies (FDs) and defines a notion of FDs with degrees of satisfaction, which can not only accommodate conflicts and null values but also provide a general setting to deal with the situation in which an FD is satisfied by a relation to a certain degree.

2. FDs with Degrees of Satisfaction

Definition 1: Let R(I1, I2, …, Im) be a relation scheme on domains D1, D2, …, Dm with Dom(Ik) = Dk, let A and B be subsets of the attribute set I = {I1, I2, …, Im}, i.e., A, B ⊆ I, let T be a relation of R, T ⊆ D1 × D2 × … × Dm, and let ti, tj ∈ T, ti ≠ tj. Then B is said to functionally depend on A for a tuple pair (ti, tj), denoted as (ti, tj)(A→B), if ti(A) = tj(A) implies ti(B) = tj(B).

Let µ(ti, tj)(A→B) be the truth value that (ti, tj)(A→B) holds. Then we have:
1) If ti(A) = tj(A) and ti(B) = tj(B), then µ(ti, tj)(A→B) = 1;
2) If ti(A) = tj(A) and ti(B) ≠ tj(B), then µ(ti, tj)(A→B) = 0;
3) If ti(A) ≠ tj(A), then µ(ti, tj)(A→B) = 1.

That is, (ti, tj) is said to satisfy A→B if µ(ti, tj)(A→B) = 1, while (ti, tj) dissatisfies A→B if µ(ti, tj)(A→B) = 0. Next, the degree that T satisfies A→B is defined as follows:

Definition 2: Given T (n tuples in total) and I, and A ⊆ I, B ⊆ I. Let T(A→B) denote that B is functionally dependent on A for T. Then the degree that T satisfies A→B is µT(A→B), where

$$\mu_T(A \to B) = \frac{\sum_{\forall t_i, t_j \in T,\ t_i \neq t_j} \mu_{(t_i, t_j)}(A \to B)}{\binom{n}{2}}$$

with the sum taken over the unordered pairs of distinct tuples of T.

Apparently, the notion of conventional FDs is the special case of Definition 2 with µT(A→B) = 1.

Furthermore, it can be proven that such functional dependencies with degrees of satisfaction (FDs)d as T(A→B) have the following properties.

Property 1: Given a T on R(I1, I2, …, Im), A, B ⊆ I, if B ⊆ A, then µT(A→B) = 1.
Property 2: Given a T on R(I1, I2, …, Im), A, B, C ⊆ I, if µT(A→B) ≥ α, then µT(AC→BC) ≥ α.
Property 3: Given a T on R(I1, I2, …, Im), and A, B, C ⊆ I, if µT(A→B) ≥ α and µT(B→C) ≥ β, then µT(A→C) ≥ α + β - 1.
Property 4: Given a T on R(I1, I2, …, Im), and A, B, C ⊆ I, if µT(A→B) = α and µT(B→C) = β, then α + β ≥ 1.

Given a threshold θ in [0, 1], a (FD)d: T(A→B), is called a qualified (FD)d if µT(A→B) ≥ θ. Let us consider the following three important derivatives of (FDs)d.

D1: If A = ∅ or B = ∅, then A→B is generally meaningless. Further, if A ∩ B ≠ ∅, without loss of generality, suppose A = A'X and B = B'X, A' ∩ B' = ∅. Then T(A'X→B') being a qualified (FD)d implies that T(A'X→B'X) is qualified as well, since µT(A'X→B') = µT(A'X→B'X).

D2: If T(A→B) is qualified, then T(AX→B) is also qualified, because µT(AX→B) ≥ µT(AX→A) + µT(A→B) - 1 = 1 + µT(A→B) - 1 = µT(A→B) ≥ θ, according to Property 1 and Property 3.

D3: For X ⊆ I with X ∩ A = ∅ and X ∩ B = ∅, µT(A→B) ≥ µT(A→BX) + µT(BX→B) - 1 = µT(A→BX) + 1 - 1 = µT(A→BX). This means that if T(A→B) is not qualified, then T(A→BX) is not qualified either.

The above properties and derivatives serve as inference and pruning strategies in the (FDs)d mining process. For instance, if T(A→B) is qualified, then T(AX→B) being qualified can be directly inferred without scanning the database. On the other hand, if T(A→B) is already known to be disqualified, then T(A→BX) will be disqualified too, so that further mining steps for T(A→BX) are not necessary.

3. Mining Algorithm

The task of mining functional dependencies with degrees of satisfaction can be regarded as discovering all qualified (FDs)d given a threshold θ, θ ∈ [0, 1]. Similar to mining association rules [1, 2], the mining algorithm is constructed on the lattice, which can be searched efficiently with a breadth-first strategy.

Let an i-antecedent (FD)d be a (FD)d with i attributes in the antecedent and a j-consequent (FD)d be a (FD)d with j attributes in the consequent (i, j = 1, 2, …, m). Then, given the set of all i-antecedent 1-consequent (FDs)d, filtering with θ yields the set of qualified i-antecedent 1-consequent (FDs)d (QFi1), based on which the set of candidate i-antecedent 2-consequent (FDs)d (CFi2) can be generated using D3. Then QFi2 is filtered out of CFi2, from which CFi3 is generated, and so on until the set of generated candidate i-antecedent (FDs)d is empty. Thus all qualified i-antecedent (FDs)d are derived.

More concretely, the mining process begins with 1-antecedent (FDs)d. After generating all qualified 1-antecedent (FDs)d, QF21 can be derived from QF11 without scanning the database according to D2, and CF21 can then be generated. After filtering, the qualified (FDs)d in CF21 are added into QF21, which compose the final QF21. Then all CF2j and QF2j (j = 2, 3, …) can be derived. Likewise, all the qualified 2-antecedent (FDs)d are obtained. The process then goes on to 3-antecedent, 4-antecedent, ..., k-antecedent (FDs)d, until both the set of generated candidate 1-consequent (FDs)d and the set of generated qualified 1-consequent (FDs)d are empty. Finally, the whole set of qualified (FDs)d is obtained.
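The consequent-growing step described above can be sketched in Python as follows. This is only an illustration under stated assumptions, not the authors' implementation: the function names `degree` and `qualified_consequents` are hypothetical, tuples are represented as dictionaries, the degree is computed by direct pair counting as in Definition 2, and the threshold θ is passed as an exact `Fraction`.

```python
from fractions import Fraction
from itertools import combinations


def degree(rows, lhs, rhs):
    """Degree of satisfaction of lhs -> rhs by direct pair counting (Definition 2).

    Assumes the relation holds at least two tuples.
    """
    pairs = list(combinations(rows, 2))  # unordered pairs of distinct tuples
    satisfied = sum(
        1
        for t, u in pairs
        # a pair satisfies lhs -> rhs unless it agrees on lhs and differs on rhs
        if any(t[a] != u[a] for a in lhs) or all(t[b] == u[b] for b in rhs)
    )
    return Fraction(satisfied, len(pairs))


def qualified_consequents(rows, attrs, lhs, theta):
    """Breadth-first search over consequents for one fixed antecedent lhs."""
    lhs = frozenset(lhs)
    # 1-consequent candidates: every attribute outside the antecedent
    candidates = {frozenset([b]) for b in attrs if b not in lhs}
    qualified = []
    while candidates:
        # filter the current candidate level with the threshold
        level = {c for c in candidates if degree(rows, lhs, c) >= theta}
        qualified.extend(sorted(level, key=sorted))
        size = len(next(iter(level))) if level else 0
        # D3 pruning: keep a (j+1)-consequent only if all its j-subsets qualified
        candidates = {
            x | y
            for x, y in combinations(level, 2)
            if len(x | y) == size + 1
            and all(frozenset(s) in level for s in combinations(x | y, size))
        }
    return qualified
```

Passing θ as `Fraction(4, 5)` rather than the float `0.8` keeps the comparison exact, since the float `0.8` is slightly larger than 4/5. For the relation of Table 1 with θ = 4/5 and antecedent B, the sketch returns the consequents A, C and AC.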
Algorithm Functional_Dependency
// QFij: the set of qualified i-antecedent j-consequent (FDs)d
// CFij: the set of candidate i-antecedent j-consequent (FDs)d
QFij = ∅; QF = ∅;
CF11 = {f: Ix → Iy, ∀ Ix, Iy ∈ I, Ix ≠ Iy}
i = 1;
WHILE QFi1 ≠ ∅ OR CFi1 ≠ ∅ DO
    j = 1;
    WHILE CFij ≠ ∅ OR QFij ≠ ∅ DO
        FOR ALL f ∈ CFij
            Degree = Degree_Satisfaction(f);
            IF Degree ≥ θ THEN f ⇒ QFij;
        END FOR
        // Generate CFi(j+1) based on QFij
        CFi(j+1) = Candidate_Generation(QFij);
        j ++;
    END WHILE
    // Generate QF(i+1)1 from QFi1
    QF(i+1)1 = Qualified_Generation(QFi1);
    F(i+1)1 = {f: I1I2…Ii+1 → Ii+2, ∀ Ix ∈ I, Ix ≠ Iy, x, y ∈ [1, i+2]}
    CF(i+1)1 = F(i+1)1 - QF(i+1)1;
    i ++;
END WHILE
QF = ∪ QFij

In this algorithm, the sub-procedure Degree_Satisfaction is to compute the degree of satisfaction of each (FD)d using relation grouping operations, which is more efficient than directly using Definition 2. The sub-procedure Candidate_Generation is to generate CFi(j+1) based on QFij using relation join operations of consequents of different (FDs)d with the same antecedent. And the sub-procedure Qualified_Generation is to generate QF(i+1)1 based on QFi1 using join operations of antecedents of different (FDs)d with the same consequent. Thus all qualified (FDs)d could be derived finally and stored in QF. It is worth noting that the algorithm is efficient by combining with data mining and database approaches, along with inferring and pruning strategies, which is at the same computational level as classical data mining methods [2].

4. An Example

Suppose a database is as shown in Table 1 with θ = 80%. The mining process and discovered results are shown in Figure 1.

Table 1  An Example (θ = 80%)

No.   A   B   C
#1    1   a   10
#2    1   a   20
#3    2   b   20
#4    2   c   20
#5    2   d   N/A

Figure 1  Mining Process and Results

CF11 ((FDs)d, µ):  A→B 0.7;  A→C 0.7;  B→A 1.0;  B→C 0.9;  C→A 0.8;  C→B 0.7
QF11 ((FDs)d):     B→A;  B→C;  C→A
CF12 ((FDs)d, µ):  B→AC 0.9
QF12 ((FDs)d):     B→AC
CF13:              ∅
CF21 ((FDs)d, µ):  AC→B 0.9
QF21 ((FDs)d):     BC→A;  AB→C;  AC→B
CF22:              ∅
CF31:              ∅
QF31:              ∅
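The µ values of the example can be checked with a grouping-based computation in the spirit of Degree_Satisfaction. The text does not spell out the grouping operations, so the sketch below is one plausible reading rather than the authors' procedure: the number of violating pairs is the number of pairs agreeing on the antecedent minus the number agreeing on antecedent and consequent together. The names `pairs_agreeing_on` and `degree_by_grouping` are hypothetical, and N/A is stored as `None` and compared like any other value, which is an assumption.

```python
from collections import Counter
from fractions import Fraction
from math import comb

# Rows of Table 1; N/A is stored as None and treated as an ordinary value
# (an assumption, since the text does not fix a null semantics).
TABLE_1 = [
    {"A": 1, "B": "a", "C": 10},
    {"A": 1, "B": "a", "C": 20},
    {"A": 2, "B": "b", "C": 20},
    {"A": 2, "B": "c", "C": 20},
    {"A": 2, "B": "d", "C": None},
]


def pairs_agreeing_on(rows, attrs):
    """Number of unordered tuple pairs that agree on every attribute in attrs."""
    groups = Counter(tuple(t[a] for a in attrs) for t in rows)
    return sum(comb(size, 2) for size in groups.values())


def degree_by_grouping(rows, lhs, rhs):
    """Degree of satisfaction of lhs -> rhs from group sizes, without pair scans."""
    lhs, rhs = sorted(lhs), sorted(rhs)
    # violating pairs agree on lhs but not on lhs and rhs together
    violating = pairs_agreeing_on(rows, lhs) - pairs_agreeing_on(rows, lhs + rhs)
    return 1 - Fraction(violating, comb(len(rows), 2))


if __name__ == "__main__":
    candidates = [("A", "B"), ("A", "C"), ("B", "A"), ("B", "C"),
                  ("C", "A"), ("C", "B"), ("B", "AC"), ("AC", "B")]
    for lhs, rhs in candidates:
        print(f"{lhs}->{rhs}: {float(degree_by_grouping(TABLE_1, lhs, rhs))}")
```

Grouping needs one pass over the relation per attribute set, whereas Definition 2 applied literally inspects every pair of tuples.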
As a result, all the qualified dependencies can be derived: QF = {B→A, B→C, C→A, B→AC, BC→A, AB→C, AC→B}.

Notably, (FDs)d and the corresponding mining algorithm have three advantages. First, a partial degree of satisfaction for an FD is accommodated, considering possible null values and semantic interests. This leads to several (FDs)d rather than only a single traditional FD: B→A. Second, incorporating D2 into the algorithm enables us to directly derive many (FDs)d, such as BC→A and AB→C, without scanning the database, which reduces the computational complexity. Third, with D3 as a pruning strategy in the algorithm, the procedure of generating multi-consequent (FDs)d is more efficient, as examining and generating many candidate attributes/rules can be avoided.

5. Conclusions and Future Work

In this paper, the notion of (FDs)d has been presented, which can reflect partial degrees of satisfaction for functional dependencies and is regarded as meaningful for large databases. Accordingly, the corresponding mining methods have been provided to discover all the qualified (FDs)d efficiently based on certain proven properties. Future work is being carried out to deal with such issues as further algorithmic optimizations, properties of the (FDs)d set, and extensive data experiments.

References

[1] Rakesh Agrawal, Tomasz Imielinski, Arun Swami, Mining Association Rules between Sets of Items in Large Databases, In Proc. of the ACM-SIGMOD 1993 Int'l Conference on Management of Data, Washington D.C., May 1993, 207-216.
[2] R. Agrawal, R. Srikant, Fast Algorithms for Mining Association Rules, Proc. of the 20th Int'l Conference on Very Large Databases, Santiago, Chile, Sept. 1994. Expanded version available as IBM Research Report RJ9839, June 1994.
[3] Martin Andersson, Extracting an Entity Relationship Schema from A Relational Database Through Reverse Engineering, http://citeseer.nj.nec.com/.
[4] Marianne Baudinet, Jan Chomicki, Pierre Wolper, Constraint-Generating Dependencies, http://citeseer.nj.nec.com/.
[5] Siegfried Bell, Peter Brockhausen, Discovery of Data Dependencies in Relational Databases, http://citeseer.nj.nec.com/.
[6] Malu Castellanos, Felix Saltor, Extraction of Data Dependencies, http://citeseer.nj.nec.com/.
[7] G. Q. Chen, Q. Wei, E. E. Kerre, Fuzzy Data Mining: Discovery of Fuzzy Generalized Association Rules. In Bordogna & Pasi (eds.), Recent Research Issues on Management of Fuzziness in Databases, Springer (Physica-Verlag), 2000.
[8] Peter A. Flach, Iztok Savnik, Database Dependency Discovery: A Machine Learning Approach, AI Communications, ISSN 0921-7126.
[9] C. Giannella, C. M. Wyss, Finding Minimal Keys in a Relation Instance, http://citeseer.nj.nec.com/.
[10] Yka Huhtala, Juha Karkkainen, Pasi Porkka, Hannu Toivonen, Efficient Discovery of Functional and Approximate Dependencies Using Partitions, http://citeseer.nj.nec.com/.
[11] Stefan Kramer, Bernhard Pfahringer, Efficient Search for Strong Partial Determinations, http://citeseer.nj.nec.com/.
[12] Iztok Savnik, Peter A. Flach, Discovery of Multivalued Dependencies From Relations, March 7, 2000, http://citeseer.nj.nec.com/.
[13] Jef Wijsen, Raymond T. Ng, Toon Calders, Discovering Roll-Up Dependencies, http://citeseer.nj.nec.com/.