Efficient Algorithms for Mining Inclusion Dependencies

Fabien De Marchi¹, Stéphane Lopes², and Jean-Marc Petit¹

¹ Laboratoire LIMOS, CNRS FRE 2239, Université Blaise Pascal – Clermont-Ferrand II, 24 avenue des Landais, 63177 Aubière cedex, France
{demarchi,jmpetit}@math.univ-bpclermont.fr
² Laboratoire PRISM, CNRS UMR 8636, 45 avenue des Etats-Unis, 78035 Versailles Cedex, France
[email protected]

Abstract. Foreign keys form one of the most fundamental constraints for relational databases. Since they are not always defined in existing databases, algorithms need to be devised to discover foreign keys. One of the underlying problems is known to be the inclusion dependency (IND) inference problem. In this paper, a new data mining algorithm for computing unary INDs is given. From unary INDs, we also propose a levelwise algorithm to discover all remaining INDs, where candidate INDs of size i + 1 are generated from satisfied INDs of size i (i > 0). An implementation of these algorithms has been carried out and tested against synthetic databases. To the best of our knowledge, this paper is the first to address this data mining problem in a comprehensive manner, from algorithms to experimental results.

1 Introduction

Inclusion dependencies (INDs) are one of the most important kinds of integrity constraints in relational databases [6,4,22]. Together with functional dependencies, they represent an important part of database semantics. Some recent works have addressed the discovery of functional dependencies (FDs) holding in a relation [11,15,21,23], but IND discovery in databases has not raised great interest yet. We identify two reasons for this: 1) the difficulty of the problem, due to the potential number of candidate INDs (cf. [4,12] for complexity results), and 2) the fact that INDs "lack popularity". To illustrate this point, let us compare with FDs: FDs have been studied as a basic database concept since they are used to define normal forms (e.g. BCNF or 3NF) and keys, very popular constraints in practice. We think that what is good for FDs is good for INDs too: they can be used 1) to define other normal forms, such as IDNF (see [17,13,14] for details on such normal forms), 2) to avoid update anomalies and to ensure data coherence and integrity, and 3) to define foreign keys (or referential integrity constraints), recalling that a foreign key is the left-hand side of an IND whose right-hand side is a key.


Moreover, in many existing databases, foreign keys are only partially defined, or not defined at all. As an example, recall that old versions of many DBMSs (e.g. Oracle V6) did not support foreign key definitions. Thus, even if an old Oracle database has been upgraded to Oracle V7 or V8, analysts have probably not defined foreign keys during the migration process. So, there is an obvious practical interest in discovering this kind of knowledge. Another practical application of IND inference is pointed out in the CLIO project [20], devoted to data/schema integration: as a perspective, the authors mention the necessity of discovering keys and referential integrity constraints over existing databases. More generally, INDs are known to be a key concept in various applications, such as relational database design and maintenance [17,3,13], semantic query optimization [9,5] and database reverse engineering [19].

Contribution. In this paper, a new data mining algorithm for computing unary INDs is given. A data pre-processing step is performed in such a way that unary IND inference becomes straightforward. From the discovered unary INDs, a levelwise algorithm, fitting in the framework defined in [18], has been devised to discover all remaining INDs in a given database (i.e. INDs of size greater than 1). We propose an elegant Apriori-like algorithm to generate candidate INDs of size i + 1 from satisfied INDs of size i (i > 0). Despite the inherent complexity of this inference task, experiments on synthetic databases show the feasibility of this approach, even for medium-size databases (up to 500,000 tuples).

Paper organization. The layout of the rest of this paper is as follows: related work is introduced in Section 2. Section 3 recalls some basic concepts of relational databases. Section 4 deals with IND inference: a new approach for unary IND inference is given in Section 4.1, and a levelwise algorithm discovering all remaining INDs is proposed in Section 4.2. Experimental results on synthetic databases are presented in Section 5, and we conclude in Section 6.

2 Related Work

To the best of our knowledge, only a few papers deal with the IND inference problem. For unary IND inference, the domain of the attributes, their number of distinct values and the transitivity property can be used to reduce the number of data accesses, as proposed in [12,2]. Nevertheless, these techniques do not provide an efficient pruning, and a large number of tests still has to be performed against the database. IND discovery is also an instance (among many others) of the general framework for levelwise algorithms defined in [18]. However, unary IND inference is not considered there as an important sub-problem, no details are given about a key step of such an algorithm, i.e. the generation of candidate INDs of size i + 1 from satisfied INDs of size i, and no implementation is provided.


Only one implementation achieving IND inference has been presented, in [16]; the principle is to reduce the search space by considering only duplicated attributes. Such duplicated attributes are discovered from the SQL join statements executed on the database server over a period of time. This approach uses semantic information "to guess" relevant attributes from SQL workloads. However, it does not provide an exhaustive search of satisfied INDs, i.e. only a subset of the INDs satisfied in the database can be discovered. For instance, if an IND between A and B holds in the database but there is no join between these attributes in the workload, this IND will never be discovered. Moreover, even if we have A ⊆ C and B ⊆ D, the candidate AB ⊆ CD is never considered.

3 Basic Definitions

We briefly introduce some basic relational database concepts used in this paper (see e.g. [17,13] for details). Let R be a finite set of attributes. For each attribute A ∈ R, the set of all its possible values is called the domain of A and denoted by Dom(A). A tuple over R is a mapping t : R → ∪A∈R Dom(A), where t(A) ∈ Dom(A), ∀A ∈ R. A relation is a set of tuples. The cardinality of a set X is denoted by |X|. We say that r is a relation over R and that R is the relation schema of r. If X ⊆ R is an attribute set¹ and t is a tuple, we denote by t[X] the restriction of t to X. The projection of a relation r onto X, denoted by πX(r), is defined by πX(r) = {t[X] | t ∈ r}.

A database schema R is a finite set of relation schemas Ri. A relational database instance d (or database) over R is a set of relations ri, one over each Ri of R. Given a database d over R, the set of distinct domains (e.g. int, string, ...) is denoted by Dom(d). An attribute sequence (e.g. X = <A, B, C> or simply ABC) is an ordered set of distinct attributes. Given a sequence X, X[i] refers to the ith element of the sequence. When it is clear from the context, we do not distinguish a sequence from its underlying set. Two attributes A and B are said to be compatible if Dom(A) = Dom(B). Two distinct attribute sequences X and Y are compatible if |X| = |Y| = m and if, for j = 1, ..., m, Dom(X[j]) = Dom(Y[j]).

Inclusion dependencies and the notion of satisfaction of an inclusion dependency in a database are defined below. An inclusion dependency (IND) over a database schema R is a statement of the form Ri[X] ⊆ Rj[Y], where Ri, Rj ∈ R, X ⊆ Ri, Y ⊆ Rj, and X and Y are compatible sequences. An inclusion dependency is said to be trivial if it is of the form R[X] ⊆ R[X]. An IND R[X] ⊆ R[Y] is of size i if |X| = i. We call an IND of size 1 a unary inclusion dependency. Let d be a database over a database schema R, where ri, rj ∈ d are relations over Ri, Rj ∈ R respectively. An inclusion dependency Ri[X] ⊆ Rj[Y] is satisfied

¹ Letters from the beginning of the alphabet denote single attributes, whereas letters from the end denote attribute sets.


in a database d over R, denoted by d |= Ri[X] ⊆ Rj[Y], iff ∀u ∈ ri, ∃v ∈ rj such that u[X] = v[Y] (or, equivalently, πX(ri) ⊆ πY(rj)). Let I1 and I2 be two sets of inclusion dependencies; I1 is a cover of I2 if I1 |= I2 (this notation means that each dependency in I2 holds in any database satisfying all the dependencies in I1) and I2 |= I1.

A sound and complete axiomatization for INDs was given in [4]. Three inference rules form this axiomatization:
1. (reflexivity) R[A1, ..., An] ⊆ R[A1, ..., An]
2. (projection and permutation) if R[A1, ..., An] ⊆ S[B1, ..., Bn] then R[Aσ1, ..., Aσm] ⊆ S[Bσ1, ..., Bσm] for each sequence σ1, ..., σm of distinct integers from {1, ..., n}
3. (transitivity) if R[A1, ..., An] ⊆ S[B1, ..., Bn] and S[B1, ..., Bn] ⊆ T[C1, ..., Cn] then R[A1, ..., An] ⊆ T[C1, ..., Cn]

The satisfaction of an IND can be expressed in relational algebra in the following way [16]: let d be a database over a database schema R, where ri, rj ∈ d are relations over Ri, Rj ∈ R respectively. We have: d |= Ri[X] ⊆ Rj[Y] iff |πX(ri)| = |πX(ri) ⋈(X=Y) πY(rj)|. An SQL query can easily be devised from this property, performing two costly operations against the data: a join and a projection.
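To make this test concrete, the following minimal Python sketch (not part of the paper) checks both the set-inclusion definition and the join-count characterization above; the encoding of relations as sets of tuples is an assumption made only for illustration.

from itertools import product

# A relation is encoded as a set of tuples; its schema as a tuple of attribute names.
r = {(1, "X"), (2, "Y")}                      # r over (A, B)
s = {(1, "X"), (2, "Y"), (4, "Z")}            # s over (E, F)

def project(relation, schema, attrs):
    # pi_attrs(relation): keep only the columns listed in attrs
    idx = [schema.index(a) for a in attrs]
    return {tuple(t[i] for i in idx) for t in relation}

def satisfies(r, r_schema, X, s, s_schema, Y):
    # d |= r[X] <= s[Y]  iff  pi_X(r) is included in pi_Y(s)
    return project(r, r_schema, X) <= project(s, s_schema, Y)

def satisfies_by_join(r, r_schema, X, s, s_schema, Y):
    # Join-count formulation: |pi_X(r)| = |pi_X(r) join_{X=Y} pi_Y(s)|
    px, py = project(r, r_schema, X), project(s, s_schema, Y)
    return len(px) == len({(u, v) for u, v in product(px, py) if u == v})

print(satisfies(r, ("A", "B"), ("A",), s, ("E", "F"), ("E",)))                   # True
print(satisfies_by_join(r, ("A", "B"), ("A", "B"), s, ("E", "F"), ("E", "F")))   # True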

4 Inclusion Dependency Inference

The IND inference problem can be formulated as follows: "Given a database d over a database schema R, find a cover of all non-trivial inclusion dependencies R[X] ⊆ S[Y], R, S ∈ R, such that d |= R[X] ⊆ S[Y]". In this paper, we propose to reformulate this problem into two sub-problems: the former is the unary IND inference problem, and the latter is the IND inference problem given that unary INDs have already been discovered. Two reasons justify this reformulation: 1) INDs in real-life databases are most of the time of size one, and 2) no efficient pruning method can be applied to unary IND inference. Therefore, specialized algorithms need to be devised to discover unary INDs.

4.1 Unary Inclusion Dependency Inference

We propose a new and efficient technique to discover the unary INDs satisfied in a given database. The idea is to associate, for a given domain, each value with the set of attributes in which this value occurs. After this pre-processing step, we get a binary relation from which unary INDs can be computed.

Data pre-processing. Given a database d over a database schema R, for each data type t ∈ Dom(d), a so-called extraction context Dt(d) = (V, U, B) is associated with d, defined as follows:


– U = {R.A | Dom(A) = t, A ∈ R, R ∈ R}: the set of attributes² whose domain is t;
– V = {v ∈ πA(r) | R.A ∈ U, r ∈ d, r defined over R}: the set of values taken by these attributes in their relations;
– B ⊆ V × U is a binary relation defined by: (v, R.A) ∈ B ⇐⇒ v ∈ πA(r), where r ∈ d and r is defined over R.

Example 1. Let us consider the database d given in Table 1 as a running example.

Table 1. A running example

r:
 A  B  C  D
 1  X  3  11.0
 1  X  3  12.0
 2  Y  4  11.0
 1  X  3  13.0

s:
 E  F  G  H
 1  X  3  11.0
 2  Y  4  12.0
 4  Z  6  14.0
 7  W  9  14.0

t:
 I     J     K  L
 11.0  11.0  1  X
 12.0  12.0  2  Y
 11.0  14.0  4  Z
 11.0  9.0   7  W
 13.0  13.0  9  R
The domains of the attributes of these relations are of three types: int, real and string. For the type int, U = {A, C, E, G, K} and V = {1, 2, 3, 4, 6, 7, 9}. For instance, the value 1 appears in πA(r), πE(s) and πK(t), and thus (1, A), (1, E) and (1, K) ∈ B. Table 2 summarizes the binary relations associated with int, real and string.

Table 2. Extraction contexts associated with the database d

int:
 V  U
 1  AEK
 2  AEK
 3  CG
 4  CEGK
 6  G
 7  EK
 9  GK

real:
 V     U
 9.0   J
 11.0  DHIJ
 12.0  DHIJ
 13.0  DIJ
 14.0  HJ

string:
 V  U
 R  L
 X  BFL
 Y  BFL
 Z  FL
 W  FL
Such extraction contexts can be built from existing databases, for instance with an SQL query (requiring only one full scan of each relation) or with external programs and cursors computed via ODBC drivers.

² When clear from the context, we omit prefixing attributes by their relation schema.
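As an illustration only, the following Python sketch builds such contexts from an in-memory copy of the database; the dictionary layout (relation name to attribute to column values) and the attribute typing are assumptions, whereas a real implementation would rely on one SQL scan per relation, as described above.

from collections import defaultdict

# Hypothetical in-memory database: relation name -> {attribute -> list of column values}.
db = {
    "r": {"A": [1, 1, 2, 1], "B": ["X", "X", "Y", "X"]},
    "t": {"K": [1, 2, 4, 7, 9], "L": ["X", "Y", "Z", "W", "R"]},
}
attr_type = {"A": "int", "K": "int", "B": "string", "L": "string"}

def extraction_contexts(db, attr_type):
    # One binary relation per data type: value -> set of attributes R.A containing it.
    contexts = defaultdict(lambda: defaultdict(set))
    for rel, columns in db.items():
        for attr, values in columns.items():
            for v in set(values):                       # one scan of each column
                contexts[attr_type[attr]][v].add(f"{rel}.{attr}")
    return contexts

ctx = extraction_contexts(db, attr_type)
print(sorted(ctx["int"][1]))   # ['r.A', 't.K']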


A new algorithm for unary IND inference. With this new data organization, unary INDs can now be discovered efficiently. Informally, if all values of attribute A can be found among the values of B, then by construction B is present in every line of the binary relation that contains A.

Property 1. Given a database d and a triple Dt(d) = (V, U, B), t ∈ Dom(d),

d |= A ⊆ B ⇐⇒ B ∈ ∩_{v ∈ V | (v,A) ∈ B} {C ∈ U | (v, C) ∈ B}

where A, B ∈ U.

where A, B ∈ U. Proof. Let A ∈ R, B ∈ S such that d |= R[A] ⊆ S[B]. ⇐⇒ ∀v ∈ πA (r), ∃u ∈ s such that u[B] = v ⇐⇒ ∀v ∈ V such that (v, A) ∈ B, we have (v, B) ∈ B Thus, the whole task of unary IND inference can be done in only one pass of each binary relation. Algorithm 1 finds all unary INDs in a database d, between attributes defined on a type t ∈ dom(d), taking in input the extraction context as described before. For all attribute A, we note rhs(A) (for right-hand side) the set of attributes B such that A ⊆ B.

Algorithm 1. Unary IND inference
Input: the triple (V, U, B) associated with d and t.
Output: I1, the set of unary INDs verified by d between attributes of type t.
1: for all A ∈ U do rhs(A) = U;
2: for all v ∈ V do
3:   for all A s.t. (v, A) ∈ B do
4:     rhs(A) = rhs(A) ∩ {B | (v, B) ∈ B};
5: for all A ∈ U do
6:   for all B ∈ rhs(A) do
7:     I1 = I1 ∪ {A ⊆ B};
8: return I1.

Example 2. Let us consider the type int (cf. Table 2) of Example 1. The initialization phase (line 1) gives: rhs(A) = rhs(C) = ... = rhs(K) = {A, C, E, G, K}. Then, we consider the set of attributes in the first line of the binary relation: l1 = {A, E, K}. For each attribute in l1, its rhs set is updated (line 4) as follows: rhs(A) = {A, E, K}, rhs(E) = {A, E, K}, rhs(K) = {A, E, K}; rhs(C) and rhs(G) remain unchanged. These operations are repeated for each value of the binary relation (line 2). Finally, after one pass over the relation, the result is: rhs(A) = {A, E, K}, rhs(C) = {C, G}, rhs(E) = {E, K}, rhs(G) = {G}, rhs(K) = {K}. From these sets, the unary INDs between attributes of type int are (lines 5, 6 and 7): {A ⊆ E, A ⊆ K, C ⊆ G, E ⊆ K}.
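A direct Python transcription of Algorithm 1 (a sketch under the same context encoding as above, with attributes kept unqualified for brevity) reproduces the result of Example 2:

def unary_inds(context):
    # context: value -> set of attributes containing that value (the rows of B)
    attrs = set().union(*context.values())
    rhs = {a: set(attrs) for a in attrs}      # line 1: rhs(A) initialised to U
    for line in context.values():             # lines 2-4: one pass over the binary relation
        for a in line:
            rhs[a] &= line
    return {(a, b) for a in attrs for b in rhs[a] if a != b}   # lines 5-7: non-trivial INDs

# The int context of Table 2:
int_ctx = {1: {"A", "E", "K"}, 2: {"A", "E", "K"}, 3: {"C", "G"},
           4: {"C", "E", "G", "K"}, 6: {"G"}, 7: {"E", "K"}, 9: {"G", "K"}}
print(sorted(unary_inds(int_ctx)))
# [('A', 'E'), ('A', 'K'), ('C', 'G'), ('E', 'K')]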


The same operation has to be repeated for each context (i.e. each data type); then, thanks to Property 1, we deduce the following set of unary inclusion dependencies satisfied by d: {A ⊆ E, A ⊆ K, E ⊆ K, C ⊆ G, D ⊆ I, D ⊆ J, H ⊆ J, I ⊆ J, B ⊆ F, B ⊆ L, F ⊆ L}.

4.2 A Levelwise Algorithm

Once unary INDs are known, the problem we are interested in can be reformulated as follows: "Given a database d over a database schema R and the set of unary INDs satisfied by d, find a cover of all non-trivial inclusion dependencies R[X] ⊆ S[Y], R, S ∈ R, such that d |= R[X] ⊆ S[Y]". We first recall how IND properties justify a levelwise approach to their inference [18]. Then, we give an algorithm, with a natural but non-trivial method to generate candidate INDs of size i + 1 from satisfied INDs of size i [8].

Definition of the search space. Candidate INDs are composed of a left-hand side and a right-hand side. Given a set of attributes, we do not have to consider all the permutations to build a left-hand side or a right-hand side, thanks to the second inference rule presented in Section 3.

Example 3. Let R[AB] ⊆ S[EF] and R[AB] ⊆ T[KL] be two satisfied INDs. Then, thanks to the second inference rule of INDs (permutation), R[BA] ⊆ S[FE] and R[BA] ⊆ T[LK] are also satisfied.

We are thus faced with the following problem: in which order do attribute sequences have to be built so as to avoid considering several permutations of the same IND? We have chosen to fix an order for the left-hand side, namely the lexicographic order on attributes.

Reduction of the search space. On this set of candidates, a specialization relation ⪯ can be defined as follows [18]: let I1 : Ri[X] ⊆ Rj[Y] and I2 : Ri'[X'] ⊆ Rj'[Y'] be two candidate INDs. We define I1 ⪯ I2 iff:
– Ri = Ri' and Rj = Rj', and
– X' = <A1, ..., Ak>, Y' = <B1, ..., Bk>, and there exists a set of indices i1 < ... < ih ∈ {1, ..., k}, with h ≤ k, such that X = <Ai1, ..., Aih> and Y = <Bi1, ..., Bih>³.
Note that X, Y, X' and Y' are sequences, and thus the specialization relation respects the order of attributes.

Example 4. We have (Ri[AC] ⊆ Rj[EG]) ⪯ (Ri[ABC] ⊆ Rj[EFG]), but (Ri[AC] ⊆ Rj[GE]) ⋠ (Ri[ABC] ⊆ Rj[EFG]).

We write I1 ≺ I2 if I1 ⪯ I2 and I2 ⋠ I1. From the second inference rule of INDs, we can deduce the following property, which justifies a levelwise approach for IND inference.

³ This definition is slightly different from that given in [18]. Here, we impose an order on the indices i1, ..., ih, without any loss of information.
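For concreteness, the subsequence test underlying the specialization relation defined above can be written as follows; this is only a sketch, and the representation of an IND as a (R, X, S, Y) tuple with X and Y given as attribute strings is an assumption made for illustration.

def generalizes(ind1, ind2):
    # True iff ind1 is obtained from ind2 by projecting both sides on the same
    # positions i1 < ... < ih, i.e. ind1 precedes ind2 in the specialization relation.
    (r1, x1, s1, y1), (r2, x2, s2, y2) = ind1, ind2
    if r1 != r2 or s1 != s2:
        return False
    pos = 0
    for pair in zip(x1, y1):                  # scan the pairs (A_k, B_k) of ind1 in order
        while pos < len(x2) and (x2[pos], y2[pos]) != pair:
            pos += 1
        if pos == len(x2):
            return False
        pos += 1
    return True

# Example 4:
print(generalizes(("Ri", "AC", "Rj", "EG"), ("Ri", "ABC", "Rj", "EFG")))  # True
print(generalizes(("Ri", "AC", "Rj", "GE"), ("Ri", "ABC", "Rj", "EFG")))  # False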


Property 2. Let I1, I2 be two candidate INDs such that I1 ≺ I2.
– If d |= I2 then d |= I1, and
– if d ⊭ I1 then d ⊭ I2.

This property extends the Apriori property to our problem; we say that the relation ≺ is anti-monotone w.r.t. the satisfiability of INDs [10]. Knowing the unsatisfied INDs at a given level thus allows us to prune candidates at the next level: only satisfied INDs are used to generate candidate INDs for the next level. The search space is therefore considerably reduced for levels higher than one.

The algorithm. From now on, the notations given in Table 3 will be used throughout the paper.

Table 3. Notations

Set of candidate inclusion dependencies of size i.

Ii

Set of satisfied inclusion dependencies of size i.

I.lhs

Left-hand side sequence of the IND I

I.rhs

Right-hand side sequence of the IND I

X.rel

Relation schema of attributes of the sequence X

Algorithm 2 finds all INDs holding in a given database d, taking as input the set of unary INDs satisfied by d (cf. Section 4.1). The first phase consists in computing candidate INDs of size 2 from the satisfied INDs of size 1. These candidates are then tested against the database. From the satisfied ones, candidate INDs of size 3 are generated and then tested against the database. This process is repeated until no more candidates can be generated.

Algorithm 2. IND inference
Input: d, a database, and I1, the set of UINDs satisfied by d.
Output: the inclusion dependencies satisfied by d.
1: C2 := GenNext(I1);
2: i := 2;
3: while Ci ≠ ∅ do
4:   for all I ∈ Ci do
5:     if d |= I then
6:       Ii := Ii ∪ {I};
7:   Ci+1 := GenNext(Ii);
8:   i := i + 1;
9: end while
10: return ∪j Ij.
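The levelwise loop itself translates directly to code. The sketch below assumes caller-supplied gen_next (the candidate generation step) and is_satisfied (the SQL/join test of Section 3) functions; both are placeholders rather than the paper's exact procedures.

def mine_inds(database, unary, gen_next, is_satisfied):
    # Levelwise skeleton of Algorithm 2.
    satisfied = {1: set(unary)}               # I1 is given as input
    i, candidates = 2, gen_next(set(unary))   # C2 := GenNext(I1)
    while candidates:                         # while Ci is not empty
        satisfied[i] = {c for c in candidates if is_satisfied(database, c)}
        candidates = gen_next(satisfied[i])   # Ci+1 := GenNext(Ii)
        i += 1
    return set().union(*satisfied.values())   # union of all satisfied levels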