A framework for understanding existing databases

Stéphane Lopes, Jean-Marc Petit
Laboratoire LIMOS, FRE CNRS 2239
Université Blaise Pascal - Clermont-Ferrand II
24 avenue des Landais, 63 177 Aubière Cedex, France
{slopes,jmpetit}@libd2.univ-bpclermont.fr

Lotfi Lakhal
Laboratoire LIM, FRE 2246
Université de la Méditerranée
163, avenue de Luminy - Case 901, 13 288 Marseille Cedex 9, France
[email protected]

Abstract

In this paper, we propose a framework for a broad class of data mining algorithms for understanding existing databases: functional and approximate dependency inference, minimal key inference, example relation generation and normal form tests. We point out that the common data centric step of these algorithms is the discovery of agree sets. A set-oriented approach for discovering agree sets from database relations based on SQL queries is proposed. Experiments have been performed in order to compare the proposed approach with a data mining approach. We also present a novel way to extract approximate functional dependencies having minimal errors from agree sets.

1 Introduction

Today's database administrators are required to tune more and more parameters for an optimal use of their databases. The difficulty of such tasks is widely recognized, while many companies cannot afford a full-time DBA. Therefore, simplifying the administration of database systems is becoming a new and critical challenge for the database community [3]. Self-tuning of the physical database design is being investigated to improve system performance, for instance by defining indexes [9, 5] or by gathering sufficient statistics for query optimizers [6]. In this paper, instead of considering the physical level, we consider the problem of understanding existing databases at the logical level. For example, the DBA would have the opportunity to normalize relation schemas with respect to functional dependencies or to declare missing minimal keys over existing relations. In such a context, there is never enough useful information about how the database is operating and what is going wrong. Since such useful information is often hidden in the data themselves, data mining algorithms need to be devised to achieve these tasks. As a matter of fact, one of the most useful kinds of information is known to be functional dependencies. Many algorithms addressing the discovery of functional dependencies and some other related problems (approximate functional dependency inference, minimal key inference, data sampling through Armstrong relations and normal form tests) have been proposed in the literature [2, 18, 8, 19, 7, 20, 15, 17, 14, 16, 21, 22]. Nevertheless, rather than defining different algorithms for each of these related problems, emphasis should be put on identifying a common data centric step for a broad class of algorithms [4]. In this setting, a common data centric step of many algorithms is the discovery of agree sets [8, 7, 20, 16]. Based on this fact, we propose a framework for dealing with the following problems: functional and approximate dependency inference, minimal key inference, data sampling and normal form tests. In [16], we pointed out how functional dependencies and Armstrong relations can be discovered thanks to agree sets and minimal transversals of simple hypergraphs. In this paper, we make the following improvements: (1) we propose a set-oriented approach for discovering agree sets from database relations based on SQL queries, and we compare it with the data mining approach proposed in [16] through experimental results on various data sets; (2) we deal with the discovery of approximate functional dependencies having minimal errors from agree sets.

Paper organization In the next section, we give some necessary definitions. In section 3, we propose a framework to deal with a class of algorithms useful for understanding databases using agree sets. As an example of the use of agree sets, we address the approximate functional dependency inference problem. Our approach to discover agree sets is detailed in section 4. We conclude in section 5.

Proceedings of the International Database Engineering and Applications Symposium (IDEAS’01) 1098-8068/01 $10.00 © 2001 IEEE

2 Basic definitions

This section briefly summarizes definitions and results from relational database theory which are relevant for our work [20, 1]. Let R be a finite set of attributes. For each attribute A ∈ R, the set of all its possible values is called the domain of A and denoted by Dom(A). A tuple over R is a mapping t : R → ⋃_{A∈R} Dom(A), where t(A) ∈ Dom(A) for all A ∈ R. A relation is a set of tuples. We say that r is a relation over R and R is the relation schema of r. If X ⊆ R is an attribute set and t is a tuple, we denote by t[X] the restriction of t to X.

A functional dependency over R is an expression X → A where X ⊆ R and A ∈ R. The functional dependency X → A holds in a relation r (denoted by r ⊨ X → A) if and only if for all t_i, t_j ∈ r, t_i[X] = t_j[X] implies t_i[A] = t_j[A]. A functional dependency X → A is minimal if A is not functionally dependent on any proper subset of X. The functional dependency X → A is trivial if A ∈ X. We denote by dep(r) the set of all functional dependencies holding in r: dep(r) = {X → A | X ∪ A ⊆ R, r ⊨ X → A}.

Let F be a set of functional dependencies over R. The closure of X with respect to F, denoted by (X)⁺_F, is the set of attributes A ∈ R such that X → A can be derived from F (denoted by F ⊨ X → A): (X)⁺_F = {A ∈ R | F ⊨ X → A}. A set X ⊆ R is closed if and only if (X)⁺_F = X. We denote by CL(F) the family of closed sets induced by F and by GEN(F) the single minimal subfamily of generators in CL(F) such that each member of CL(F) can be expressed as an intersection of sets in GEN(F). Knowing GEN(F), the computation of the closure of an attribute set can be easily performed: (X)⁺_F = ⋂_{Y ∈ GEN(F), X ⊆ Y} Y. Let F and G be two sets of functional dependencies; F is a cover of G if GEN(F) = GEN(G).

Let us introduce agree and maximal sets. Let t_i and t_j be two tuples and X an attribute set. The tuples t_i and t_j agree on X if t_i[X] = t_j[X]. The agree set¹ of t_i and t_j is defined as follows: ag(t_i, t_j) = {A ∈ R | t_i[A] = t_j[A]}. If r is a relation, ag(r) = {ag(t_i, t_j) | t_i, t_j ∈ r, t_i ≠ t_j}. Given an attribute A, a maximal set is an attribute set X which is the largest possible set not determining A. We denote by max(F, A) the set of maximal sets for A w.r.t. F: max(F, A) = {X ⊆ R | F ⊭ X → A and ∀Y ⊆ R, X ⊊ Y ⟹ F ⊨ Y → A}, and MAX(F) = ⋃_{A∈R} max(F, A). Moreover, in [18, 20], a result relating maximal sets and intersection generators is given: MAX(F) = GEN(F).

¹ Also called equality set in [7].

Example 2.1 This example will be used throughout this paper for illustrating the different algorithms. Let us consider the relation Assign representing the assignment of employees to departments [16].

      empnum  depnum  year  depname       mgr
   1  1       1       85    Biochemistry  5
   2  1       5       94    Admission     12
   3  2       2       92    Computer Sce  2
   4  3       2       98    Computer Sce  2
   5  4       3       98    Geophysics    2
   6  5       1       75    Biochemistry  5
   7  6       5       88    Admission     12

For simplicity, attributes empnum, depnum, year, depname and mgr are renamed A, B, C, D and E respectively. The set of agree sets for this relation is: ag(r) = {∅, A, BDE, CE, E}.
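The agree set computation of example 2.1 can be reproduced in a few lines. The following sketch (Python, with tuples modeled as dictionaries; the names are ours, not from the paper) enumerates all couples of distinct tuples and collects the attributes on which each couple agrees:

```python
from itertools import combinations

# The relation Assign of example 2.1, attributes renamed A..E.
ASSIGN = [
    {"A": 1, "B": 1, "C": 85, "D": "Biochemistry", "E": 5},
    {"A": 1, "B": 5, "C": 94, "D": "Admission",    "E": 12},
    {"A": 2, "B": 2, "C": 92, "D": "Computer Sce", "E": 2},
    {"A": 3, "B": 2, "C": 98, "D": "Computer Sce", "E": 2},
    {"A": 4, "B": 3, "C": 98, "D": "Geophysics",   "E": 2},
    {"A": 5, "B": 1, "C": 75, "D": "Biochemistry", "E": 5},
    {"A": 6, "B": 5, "C": 88, "D": "Admission",    "E": 12},
]

def agree_sets(relation):
    """ag(r): the agree sets over all couples of distinct tuples of r."""
    result = set()
    for ti, tj in combinations(relation, 2):
        result.add(frozenset(a for a in ti if ti[a] == tj[a]))
    return result
```

Running `agree_sets(ASSIGN)` yields the five agree sets listed above, including the empty set.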

3 Framework for understanding existing relations

This section is devoted to the presentation of our framework. Section 3.1 presents how maximal sets are seen as the starting point of several analysis algorithms. As an example of such algorithms, section 3.2 exposes a new algorithm for approximate functional dependency inference.

3.1 Maximal sets for understanding relations

From an initial relation, let us assume that agree sets have been discovered (see section 4). From these sets, maximal sets can be deduced (cf. lemma 3.1). In the proposed approach, maximal sets, and hence agree sets, are seen as a common starting point for providing solutions to several problems. The problems we consider in the framework are: functional dependency inference, approximate functional dependency inference, minimal key inference, example relation generation, 3NF/BCNF tests and maximal set projection. Some of these problems have already been addressed using maximal sets [8, 7, 20, 16]. Our framework is depicted in figure 1. In this paper, we focus only on the approximate dependency inference problem (section 3.2). Note that database accesses are performed only during agree set computation. All other steps can be performed directly on the basis of agree sets in main memory, i.e. without any additional pass over the original data. Thus, agree set discovery is the most time-consuming part of the algorithms. The discovery of maximal sets from agree sets is based on the characterization provided by lemma 3.1: the maximal sets for an attribute are the agree sets which do not contain this attribute and which are maximal with respect to inclusion.
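Among the framework's applications, minimal key inference illustrates how maximal sets are reused. An attribute set is a superkey exactly when it is contained in no maximal set, so minimal keys are the inclusion-minimal such sets. A brute-force sketch under that standard characterization (Python; the function names are ours and the enumeration is exponential, for illustration only):

```python
from itertools import combinations

def minimal_keys(attributes, max_sets):
    """Minimal keys of r from MAX(dep(r)): enumerate candidate attribute
    sets by increasing size, keeping those that are contained in no
    maximal set and contain no smaller key."""
    keys = []
    for size in range(1, len(attributes) + 1):
        for cand in combinations(sorted(attributes), size):
            c = frozenset(cand)
            if any(k <= c for k in keys):
                continue  # a proper subset is already a key: not minimal
            if not any(c <= m for m in max_sets):
                keys.append(c)  # superkey with no smaller key inside
    return set(keys)

# MAX(dep(r)) = {A, BDE, CE} for the running example.
MAX_SETS = {frozenset("A"), frozenset("BDE"), frozenset("CE")}
```

On the running example, `minimal_keys("ABCDE", MAX_SETS)` returns the six minimal keys AB, AC, AD, AE, BC and CD.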


Figure 1. Framework for understanding databases (a database relation yields agree sets; maximal sets are derived from them via maximal set projection; maximal sets then feed functional dependency inference, approximate functional dependency inference, minimal key inference, example relation generation and 3NF/BCNF tests)

Lemma 3.1 [16] max(dep(r), A) = Max{X ∈ ag(r) | A ∉ X}.

Example 3.1 With our example relation, we obtain the following maximal sets: max(dep(r), A) = {BDE, CE}, max(dep(r), B) = {A, CE}, max(dep(r), C) = {A, BDE}, max(dep(r), D) = {A, CE}, max(dep(r), E) = {A}. Thus, we obtain: MAX(dep(r)) = {A, BDE, CE}.
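Lemma 3.1 translates directly into code: for each attribute, keep the agree sets that do not contain it and that are maximal with respect to inclusion among those. A sketch (Python; the names are ours):

```python
def maximal_sets(attributes, ag_r):
    """max(dep(r), A) for each attribute A, per lemma 3.1."""
    result = {}
    for a in attributes:
        # agree sets not containing a ...
        candidates = [x for x in ag_r if a not in x]
        # ... that are maximal with respect to inclusion
        result[a] = {x for x in candidates
                     if not any(x < y for y in candidates)}
    return result

# ag(r) for the running example.
AG_R = {frozenset(s) for s in ("", "A", "BDE", "CE", "E")}
MAX_PER_ATTR = maximal_sets("ABCDE", AG_R)
```

On the running example this reproduces example 3.1, e.g. max(dep(r), A) = {BDE, CE}.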

3.2 Approximate functional dependency inference

In this section, we discuss one of the applications described in the framework depicted in figure 1. We show how maximal sets (and hence agree sets) can be used to deal with approximate functional dependencies. An approximate functional dependency is a functional dependency that almost holds [15, 14]. To measure the approximateness of a dependency, several error measures have been proposed [15]. The definition we use is based on the minimum number of rows that need to be removed from a relation for the dependency to hold. This measure is called g3 in [15].

Definition 3.1 Let r be a relation and X → A a functional dependency.

g3(X → A) = 1 − max{|s| : s ⊆ r, s ⊨ X → A} / |r|

We denote by X →_{g3(X→Y)} Y the approximate functional dependency X → Y with an error measure equal to g3(X → Y). Each maximal set can be written and interpreted as an excluded functional dependency with maximal left-hand side, i.e. as an expression X ↛ A such that ∀B ∈ R ∖ X, X ∪ B → A [11]. Obviously, an excluded functional dependency is an approximate functional dependency. The

error measure can be easily computed, for instance by using the algorithm given in [14]. By considering maximal sets, we obtain all approximate functional dependencies with maximal left-hand sides and minimal error measures (without any user-supplied threshold). The left-hand side of an approximate dependency having a non-null error measure is obviously a subset of one or more maximal sets. Due to the following property, the error measure of this latter dependency is at least as great as the one associated with the maximal sets.

Property 3.1 Let X, Y ⊆ R and A ∈ R. X ⊆ Y ⟹ g3(X → A) ≥ g3(Y → A).

The obtained approximate functional dependencies are also the approximate dependencies with minimal error measures. This result is stated by the following lemma.

Lemma 3.2 Let X → A be a functional dependency and r a relation. Either r ⊨ X → A or there exists Y ∈ max(dep(r), A) such that g3(Y → A) ≤ g3(X → A).

Example 3.2 Let us suppose that we would like to find approximate functional dependencies with minimal error measures. We have: max(dep(r), A) = {BDE, CE}, max(dep(r), B) = {A, CE}, max(dep(r), C) = {A, BDE}, max(dep(r), D) = {A, CE}, max(dep(r), E) = {A}. Thus, the excluded functional dependencies are: BDE ↛ A, CE ↛ A, A ↛ B, CE ↛ B, A ↛ C, BDE ↛ C, A ↛ D, CE ↛ D, A ↛ E. The approximate functional dependencies with minimal error measures are: BDE →_{3/7} A, CE →_{1/7} A, A →_{1/7} B, CE →_{1/7} B, A →_{1/7} C, BDE →_{3/7} C, A →_{1/7} D, CE →_{1/7} D, A →_{1/7} E.
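The g3 measure of definition 3.1 can be computed one left-hand-side group at a time: within each group of tuples sharing the same X-value, the tuples carrying the most frequent A-value survive and the rest must be removed. A sketch (Python, reusing the Assign relation of example 2.1; the names are ours):

```python
from collections import Counter

# The relation Assign of example 2.1, attributes renamed A..E.
ASSIGN = [
    {"A": 1, "B": 1, "C": 85, "D": "Biochemistry", "E": 5},
    {"A": 1, "B": 5, "C": 94, "D": "Admission",    "E": 12},
    {"A": 2, "B": 2, "C": 92, "D": "Computer Sce", "E": 2},
    {"A": 3, "B": 2, "C": 98, "D": "Computer Sce", "E": 2},
    {"A": 4, "B": 3, "C": 98, "D": "Geophysics",   "E": 2},
    {"A": 5, "B": 1, "C": 75, "D": "Biochemistry", "E": 5},
    {"A": 6, "B": 5, "C": 88, "D": "Admission",    "E": 12},
]

def g3(relation, lhs, rhs):
    """g3(X -> A): minimal fraction of tuples to remove so X -> A holds."""
    groups = {}
    for t in relation:
        key = tuple(t[a] for a in sorted(lhs))
        groups.setdefault(key, []).append(t[rhs])
    # In each X-group, keep the tuples with the most frequent A-value.
    kept = sum(Counter(vals).most_common(1)[0][1] for vals in groups.values())
    return 1 - kept / len(relation)
```

For instance, `g3(ASSIGN, "BDE", "A")` gives 3/7 and `g3(ASSIGN, "CE", "A")` gives 1/7, matching example 3.2.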

4 Set-oriented mining of agree sets

The proposed approach can be seen as a database approach because agree sets are computed using SQL queries. In section 4.3, we compare it with the data mining approach using partitions presented in [16]. To our knowledge, no other approach deals with the discovery of agree sets.

4.1 Discovering agree sets using SQL queries

Let R(A1, ..., An) be a relation schema. To discover agree sets from a relation over R, we have to consider all couples of tuples that can be formed from the relation. This can be expressed in SQL using the cartesian product of the relation with itself. It is possible to decrease the number of


resulting tuples by noticing that the couples (ti, tj) and (tj, ti) produce the same agree set. We can apply this optimization by assuming that each tuple of the original relation has a unique ID (ROWID with Oracle). Finally, we have to compute the agree sets from each resulting tuple. We propose to use RDBMS built-in functions to solve this problem. With Oracle, the appropriate function is decode, which compares its first two arguments and returns the third if they are equal. This leads to the query denoted by Q1; figure 2 shows the Q1 query in the context of our example. The query could easily be adapted for processing by another system because it relies only on standard SQL and a simple vendor-specific function.

SELECT DISTINCT decode(R1.A, R2.A, 'A') ||
                decode(R1.B, R2.B, 'B') ||
                decode(R1.C, R2.C, 'C') ||
                decode(R1.D, R2.D, 'D') ||
                decode(R1.E, R2.E, 'E')
FROM Assign R1, Assign R2
WHERE R1.ROWID < R2.ROWID;

Figure 2. Q1 SQL query

By examining the query result on our example (ag(r) = {∅, A, BDE, CE, E}), we note that the empty set is an agree set. Moreover, it can be obtained many times, from many couples of tuples. We propose to modify the Q1 query to eliminate the empty set from the result. To achieve this goal, we add to the WHERE clause a disjunction of self-join conditions. This leads to the query denoted by Q2. In the context of our example, the Q2 query (see figure 3) returns the following result: ag(r) = {A, BDE, CE, E}.

SELECT DISTINCT decode(R1.A, R2.A, 'A') ||
                decode(R1.B, R2.B, 'B') ||
                decode(R1.C, R2.C, 'C') ||
                decode(R1.D, R2.D, 'D') ||
                decode(R1.E, R2.E, 'E')
FROM Assign R1, Assign R2
WHERE R1.ROWID < R2.ROWID
AND ( R1.A = R2.A OR R1.B = R2.B
   OR R1.C = R2.C OR R1.D = R2.D
   OR R1.E = R2.E );

Figure 3. Q2 SQL query

Note that Q2 has to perform 2 full scans for each OR clause (that is to say a total of 2|R| full scans for the complete query) whereas Q1 executes only 2 full scans. So, we could think that Q1 will be more efficient in some cases. To estimate when, for a given relation, it makes sense to use the OR clauses, we consider the sizes of the intermediate relations.

4.2 Cost model

Let R be a relation schema and r a relation over R. Assume that n = |R|, p = |r|, and let V(R, A) be the number of distinct values relation R has for attribute A (these notations come from [10]). We would like to estimate the sizes of the intermediate relations of Q1 and Q2, referred to as |Q1| and |Q2| respectively. For Q1, we have immediately the result, i.e. |Q1| = p(p−1)/2. For Q2, we first need to estimate the size of a join condition on one attribute, say A (recall that R1 = R2 = R) [10]: |R1 ⋈_{R1.A=R2.A} R2| = p²/V(R, A). When the ROWID condition is taken into account, we obtain: |R1 ⋈_{R1.A=R2.A ∧ R1.ROWID<R2.ROWID} R2| = p(p−1)/(2 V(R, A)). Finally, the size of Q2 can be estimated as |Q2| = (Σ_{i=1}^{n} 1/V(R, Ai)) · p(p−1)/2.

As a consequence, Q2 wins over Q1 when |Q1| − |Q2| is positive, i.e. Σ_{i=1}^{n} 1/V(R, Ai) ≤ 1. This estimation gives us valuable information:

- if at least one attribute has only one distinct value, Q1 should win over Q2;
- if every attribute of R has "enough" distinct values, Q2 should win over Q1.

Experiments, given in section 4.3, illustrate the usefulness of such a database approach. Note that, for the query with OR clauses (Q2), several physical query plans can be generated with respect to the join algorithms and the existence of indexes, while for Q1 the physical query plan cannot be improved since Q1 is a cartesian product.
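The estimates of section 4.2 amount to a one-line decision rule. The following sketch (Python; the function names are ours) compares |Q1| with the estimated |Q2|, given the number of tuples p and the per-attribute distinct-value counts V(R, Ai):

```python
def q1_size(p):
    """|Q1| = p(p-1)/2: one row per couple of distinct tuples."""
    return p * (p - 1) / 2

def q2_size_estimate(p, distinct_counts):
    """|Q2| ~= (sum_i 1/V(R, Ai)) * p(p-1)/2, per the cost model."""
    return sum(1 / v for v in distinct_counts) * q1_size(p)

def prefer_q2(p, distinct_counts):
    """Q2 wins when sum_i 1/V(R, Ai) <= 1, i.e. |Q2| <= |Q1|."""
    return q2_size_estimate(p, distinct_counts) <= q1_size(p)
```

As the text notes, a single one-valued attribute (V(R, A) = 1) is enough to push the sum above 1 and tip the choice back to Q1.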

4.3 Experiments

The aim of this section is twofold: first, to compare queries Q1 and Q2 and verify our cost model experimentally; second, to evaluate the performance of the database approach w.r.t. the data mining approach based on stripped partitions proposed in [16]. Instead of taking advantage of the DBMS querying capabilities, the latter introduces an efficient algorithm for yielding agree sets, based on a characterization of such sets which uses a reduced representation of the initial relation called the stripped partition database. This data mining approach will be referred to as "Partitions" in the sequel. To experiment with the two approaches, we performed several tests on an Intel Pentium II with a CPU clock rate of 350


MHz, 256 MB of main memory and running Windows NT 4. We implemented the Partitions algorithm using the C++ language and the STL (Standard Template Library). The RDBMS used during the tests is Oracle 8. For the tests, we generated synthetic data sets (i.e. relations) of several sizes, by varying the number of attributes, the number of tuples and the number of distinct values in columns.

Q1 vs. Q2 The first part of the tests compares the two queries (Q1 and Q2) on generated relations having 500 tuples and 20 attributes. The rate of identical values in columns, denoted by %d, has been set from 10% to 99%. For example, if this parameter has a value of 50% for an attribute and the number of tuples is 1000, each value for this attribute is chosen among 500 possible values. The resulting execution times are shown in table 1 and in figure 4.

%d   10    50    80    85    90    95    99
Q1   4.6   4.6   5.0   5.1   5.0   5.7   23
Q2   1.2   2.0   3.5   4.4   5.8   9.3   30

Table 1. Execution times in seconds

Figure 4. Execution times in seconds (Q1 and Q2 vs. the rate of identical values, plotting the data of table 1)

These experimental results show the validity of our estimation (cf. section 4.2): for a small rate of identical values, Q2 outperforms Q1; from a threshold (between 80% and 90%), Q1 becomes more efficient than Q2.

Database vs. data mining approach The second part of the tests concerns the comparison of Q2 with the data mining approach. Several execution plans for Q2 were studied against relations having many distinct values in columns. With Oracle 8, a sort merge join algorithm is used by default for each OR clause. We have also compelled the query engine to execute a hash join instead of a sort merge join: a hash table is built from one of the two relations and the other one is used to probe the hash table. Tests were executed to compare these two execution plans (see table 3). The previous tests were made without using indexes. To be complete, we have experimented with the behavior of the query in the presence of indexes. In this case, the DBMS chooses a nested loop join by default: it performs a full scan of one table and accesses the corresponding values in the other table using indexes. Query execution times are shown in table 3 (notations are explained in table 2) and figure 5. Figure 6 points out that the data mining approach remains more efficient than the database approach even when indexes are used (to our knowledge, no other approach exists for agree set mining). Note that execution times include the partition construction times or the index building time.

SQL1   SQL query (sort merge join without indexes)
SQL2   SQL query (nested loop join with indexes)
SQL3   SQL query (hash join without indexes)
SQL4   SQL query (hash join with indexes)
PART   Data mining approach with partitions

Table 2. Notations of table 3

|r| \ |R|          10     20     30
25000    SQL1     206    815   2178
         SQL2      82    180    327
         SQL3     204    814   2178
         SQL4     130    468   1160
         PART      88    181    264
50000    SQL1     507   2039   5127
         SQL2     194    443    732
         SQL3     507   2043   5123
         SQL4     318   1265   3090
         PART     201    410    625
75000    SQL1     895   3525   8633
         SQL2     347    747   1283
         SQL3     886   3534   8662
         SQL4     541   2238   5456
         PART     321    667   1045
100000   SQL1    1258   5286  12875
         SQL2     500   1115   1924
         SQL3    1286   5335  13096
         SQL4     817   3282   7502
         PART     734    947   1514

Table 3. Execution times in seconds

Figure 5. Execution times in seconds (20 attributes; SQL1-SQL4 and Partitions vs. the number of tuples)

We can make several remarks on these results with respect to the number of distinct values in each column:

1. The relation has many distinct values: Q2 is always better than Q1, and large relations (e.g. 100 000 tuples) can be handled (see table 3). Furthermore, the performance of Q2 is almost equivalent to that of the approach by partitions whenever indexes have been constructed. Note also that there is no significant difference between the SQL queries using the hash join algorithm and those using the sort merge join algorithm; the latter is even sometimes better. The presence of indexes greatly improves the query execution times. Nevertheless, indexes have to be created on each attribute of the relation, which leads to two problems: the size of the indexes is large and the cost of updates on the relation can be greatly increased. In practice, the approach using indexes seems unacceptable unless the indexes are created solely for this purpose.

2. The relation has few distinct values: Q1 is better than Q2, but large relations (e.g. 50 000 tuples) cannot be handled with reasonable execution times, neither by Q1 nor by the Partitions approach.

Figure 6. Execution times in seconds (20 attributes; nested loop join with indexes vs. Partitions, from 50 000 to 100 000 tuples)

5 Conclusion

In this paper, we identified the discovery of agree sets as the data centric step of many algorithms useful for understanding databases: functional and approximate dependency inference, minimal key inference, normal form tests and data sampling. Indeed, each of these algorithms takes maximal sets as input, such maximal sets being straightforward to derive from agree sets. Let us underline that our proposal follows principles similar to the idea under which frequent patterns solve a variety of problems (from association rules to episodes or correlations) [4, 13]. The other contribution of this paper is a study of how competitive the discovery of agree sets expressed in SQL can be, compared to the specialized implementation proposed in [16]. Rather surprisingly, experimental evaluations point out that the performance of the database approach is almost equivalent to that of the data mining approach (if indexes are created in the case of many duplicated values), even if it implies a higher storage penalty. However, when the relation is large and some attributes have few duplicated values, neither approach seems practicable, and the problem remains open. More research needs to be carried out, one interesting direction being to compute agree sets "on the fly" during query processing [12].

References

[1] S. Abiteboul, R. Hull, and V. Vianu. Foundations of Databases. Addison Wesley, 1995.
[2] C. Beeri, M. Dowd, R. Fagin, and R. Statman. On the structure of Armstrong relations for functional dependencies. Journal of the ACM, 31(1):30-46, 1984.
[3] P. A. Bernstein, M. L. Brodie, S. Ceri, D. J. DeWitt, M. J. Franklin, H. Garcia-Molina, J. Gray, G. Held, J. M. Hellerstein, H. V. Jagadish, M. Lesk, D. Maier, J. F. Naughton, H. Pirahesh, M. Stonebraker, and J. D. Ullman. The Asilomar report on database research. SIGMOD Record, 27(4):74-80, 1998.
[4] S. Chaudhuri. Data mining and database systems: Where is the intersection? Data Engineering Bulletin, 21(1):4-8, 1998.
[5] S. Chaudhuri and V. R. Narasayya. AutoAdmin 'what-if' index analysis utility. In Proceedings of the ACM SIGMOD International Conference on Management of Data, Seattle, Washington, USA, pages 367-378, 1998.
[6] S. Chaudhuri and V. R. Narasayya. Automating statistics management for query optimizers. In Proceedings of the 16th International Conference on Data Engineering, San Diego, California, USA, pages 339-348. IEEE Computer Society, 2000.
[7] J. Demetrovics, L. Libkin, and I. B. Muchnik. Functional dependencies in relational databases: A lattice point of view. Discrete Applied Mathematics, 40:155-185, 1992.
[8] J. Demetrovics and V. D. Thi. Relations and minimal keys. Acta Cybernetica, 8(3):279-285, 1988.
[9] S. J. Finkelstein, M. Schkolnick, and P. Tiberio. Physical database design for relational databases. ACM Transactions on Database Systems, 13(1):91-128, 1988.
[10] H. Garcia-Molina, J. D. Ullman, and J. Widom. Database System Implementation. Prentice Hall, 1999.
[11] G. Gottlob and L. Libkin. Investigations on Armstrong relations, dependency inference, and excluded functional dependencies. Acta Cybernetica, 9(4):385-402, 1990.
[12] G. Graefe, U. M. Fayyad, and S. Chaudhuri. On the efficient gathering of sufficient statistics for classification from large SQL databases. In Proceedings of the Fourth International Conference on Knowledge Discovery and Data Mining (KDD-98), New York City, New York, USA, pages 204-208. AAAI Press, 1998.
[13] J. Han, J. Pei, and Y. Yin. Mining frequent patterns without candidate generation. In Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data, Dallas, Texas, USA, pages 1-12. ACM, 2000.
[14] Y. Huhtala, J. Kärkkäinen, P. Porkka, and H. Toivonen. TANE: An efficient algorithm for discovering functional and approximate dependencies. The Computer Journal, 42(3):100-111, 1999.
[15] J. Kivinen and H. Mannila. Approximate inference of functional dependencies from relations. Theoretical Computer Science, 149(1):129-149, 1995.
[16] S. Lopes, J.-M. Petit, and L. Lakhal. Efficient discovery of functional dependencies and Armstrong relations. In Proceedings of the Sixth International Conference on Extending Database Technology, Konstanz, Germany, volume 1777 of Lecture Notes in Computer Science, pages 350-364. Springer, 2000.
[17] H. Mannila. Methods and problems in data mining. In Proceedings of the International Conference on Database Theory, Delphi, Greece, pages 41-55, 1997.
[18] H. Mannila and K.-J. Räihä. Design by example: An application of Armstrong relations. Journal of Computer and System Sciences, 33(2):126-141, 1986.
[19] H. Mannila and K.-J. Räihä. Practical algorithms for finding prime attributes and testing normal forms. In Proceedings of the Eighth ACM Symposium on Principles of Database Systems, Philadelphia, Pennsylvania, pages 128-133. ACM Press, 1989.
[20] H. Mannila and K.-J. Räihä. The Design of Relational Databases. Addison Wesley, 1994.
[21] N. Novelli and R. Cicchetti. FUN: An efficient algorithm for mining functional and embedded dependencies. In Proceedings of the Eighth International Conference on Database Theory, London, UK, volume 1973 of Lecture Notes in Computer Science, pages 189-203. Springer, 2001.
[22] C. Wyss, C. Giannella, and E. Robertson. FastFDs: A heuristic-driven depth-first algorithm for mining functional dependencies from relation instances. In Proceedings of the Third International Conference on Data Warehousing and Knowledge Discovery (DaWaK 2001), Munich, Germany. To appear in Lecture Notes in Computer Science.