Towards a parallel approach for incremental mining of functional dependencies on multi-core systems
Ghada Gasmi, Yahya Slimani and Lotfi Lakhal
Department, Faculty of Sciences of Tunis, University of Tunis El Manar, Tunisia
Laboratoire d’Informatique Fondamentale de Marseille (LIF), Aix-Marseille Université, IUT d’Aix en Provence, France
{
[email protected],
[email protected] and
[email protected]}
Abstract. A general assumption in all existing algorithms for mining functional dependencies is that the database is static. However, real-life databases are frequently updated. To the best of our knowledge, the discovery of functional dependencies in dynamic databases has never been studied. A naïve solution consists in re-applying one of the existing algorithms to discover the functional dependencies holding on the updated database. Nevertheless, in many domains where response time is crucial, re-executing algorithms from scratch would be unacceptable. To address this problem, we propose to harness multi-core systems for an incremental technique that discovers the new set of functional dependencies satisfied by the updated database. Through a detailed experimental study, we show that our parallel algorithm scales very well with the number of cores available.
Keywords: Multi-core systems, data mining, emergent architectures, parallel programming, incremental mining of functional dependencies.
1 Introduction
Originally, the study of functional dependencies (FDs) was motivated by the fact that they can express constraints holding on a relation independently of a particular instance [1]. Later, Mannila and Räihä studied FDs from a data mining point of view. Indeed, the idea was introduced in [9] as the problem of inferring functional dependencies. Its principle consists in determining a cover of all functional dependencies holding on a given relation r. Motivations for addressing functional dependency inference arise in several areas [10, 3, 5, 6, 14, 12, 15, 7, 2]. Indeed, FDs have been applied in database management [4, 8], data reverse engineering [13], query optimization and data mining [5]. A critical study of the dedicated literature shows that a general assumption in all existing algorithms is that the database is static. However, real-life databases are dynamic: they are constantly updated. Hence, a possible solution consists in re-applying one of the existing algorithms on the updated database. This
solution, though simple, has disadvantages. All previous computation done to discover FDs is wasted and the process of FD discovery must restart from scratch. Indeed, the more the size of the database and the frequency of its updates increase, the more this solution becomes time consuming and unacceptable in many applications. To address this problem, we harness multi-core systems to propose an incremental algorithm, called MT-IncFDs, which reuses previously mined FDs and combines them with the new tuples to insert into r in order to efficiently compute the new set of FDs.
We have been inspired by the following two advances. The first is to place multiple cores on a single chip, called Chip Multiprocessing (CMP). Each core is an independent computational unit, allowing multiple processes to execute concurrently. The second is to allow multiple processes to compete for resources simultaneously on a single core, called simultaneous multithreading (SMT). By executing two threads concurrently, an application can gain significant improvements in effective instructions per cycle (IPC). These advances arise in part because of the inherent challenges of increasing clock frequencies. The increase in processor frequencies over the past several years has required a significant increase in voltage, which has increased the power consumption and heat dissipation of the central processing unit. In addition, increased frequencies require considerable extensions to instruction pipeline depths. Finally, since memory latency has not decreased proportionally to the increase in clock frequency, higher frequencies are often not beneficial due to poor memory performance. Hence, by combining CMP and SMT technology, chip vendors can continue to improve IPC without raising frequencies.
In order to highlight the benefit of using multi-core systems, we conducted a set of experiments on synthetic databases.
The remainder of the paper is organized as follows. Section 2 presents some key notions used throughout the paper. In Section 3, we introduce the MT-IncFDs algorithm for incremental mining of FDs on CMP systems. Experimental evaluations are given in Section 4. Section 5 summarizes our study and points out some future research directions.
2 Preliminaries
Hereafter, we present some key notions that are needed to understand the remainder of the paper [1, 11, 6].
Relation: Let A = {a1, . . . , am} be a finite set of attributes. Each attribute ai has a finite domain, denoted dom(ai), representing the values that ai can take on. For a subset X = {ai, . . . , aj} of A, dom(X) is the Cartesian product of the domains of the individual attributes in X. A relation r on A is a finite set of tuples {t1, . . . , tn} from A to dom(A), with the restriction that for each tuple t ∈ r and each X ⊆ A, t[X] must be in dom(X), where t[X] denotes the restriction of the tuple t to X.
Functional dependency: Let r be a relation on A. A functional dependency (FD) over A is an expression X → A where X ⊆ A and A ∈ A. We refer to X as the antecedent and A as the consequent. An FD X → A holds on r (denoted r |= X → A) if and only if ∀(ti, tj) ∈ r, ti[X] = tj[X] ⇒ ti[A] = tj[A]. An FD X → A is minimal if and only if r |= X → A and ∀Z ⊂ X, r ⊭ Z → A. We denote by Fr the set of all functional dependencies satisfied by r.
Canonical cover: Let r be a relation on A. The canonical cover of Fr is defined as follows: Cover(Fr) = {X → A | X ⊂ A, A ∈ A, r |= X → A, X → A is minimal}.
Agree sets of a relation: Let t and t′ be two tuples of r and X ⊆ A be a set of attributes. Then, the tuples t and t′ agree on X if and only if t[X] = t′[X]. Hence, the agree set of t and t′ is Ag(t, t′) = {A | t[A] = t′[A], A ∈ A}. Ag(r) = {Ag(t, t′) | (t, t′) ∈ r, t ≠ t′} denotes all agree sets of r.
Agree sets induced by a tuple: Let r be a relation and t be a tuple. Then, with respect to r, the agree sets induced by the tuple t are {Ag(t, t′) | t′ ∈ r}, denoted by Ag(r)t.
Equivalence class: Let r be a relation on A and t be a tuple. Then, for a given attribute A ∈ A, the equivalence class of r with respect to t is r(A)t = {t′ | t[A] = t′[A], t′ ∈ r}. We denote by EC(r)t = {r(A)t | A ∈ A} all equivalence classes of r with respect to t.
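The definitions above can be made concrete with a minimal Python sketch. The toy relation and the function names (holds, is_minimal, agree_set, equivalence_class) are illustrative assumptions, not part of the paper; tuples are modelled as dictionaries from attribute names to values.

```python
from itertools import combinations

# Hypothetical toy relation: each tuple maps an attribute name to a value.
ATTRS = ("A", "B", "C")
r = [
    {"A": 1, "B": "x", "C": 10},
    {"A": 1, "B": "y", "C": 10},
    {"A": 2, "B": "x", "C": 20},
]


def holds(rel, X, a):
    """True iff rel |= X -> a: tuples that agree on X also agree on a."""
    seen = {}
    for t in rel:
        key = tuple(t[x] for x in sorted(X))
        # setdefault stores the first value of a seen for this key
        if seen.setdefault(key, t[a]) != t[a]:
            return False
    return True


def is_minimal(rel, X, a):
    """True iff X -> a holds and no proper subset of X determines a."""
    if not holds(rel, X, a):
        return False
    return not any(
        holds(rel, set(Z), a)
        for k in range(len(X))            # sizes 0 .. |X| - 1
        for Z in combinations(sorted(X), k)
    )


def agree_set(t, u, attrs=ATTRS):
    """Ag(t, u): the attributes on which both tuples take the same value."""
    return frozenset(a for a in attrs if t[a] == u[a])


def equivalence_class(rel, t, a):
    """r(a)_t: the tuples of rel that share t's value on attribute a."""
    return [u for u in rel if u[a] == t[a]]
```

On this toy relation, A → C holds and is minimal, whereas AB → C holds but is not minimal because A alone already determines C.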
3 The MT-IncFDs algorithm
We are at the beginning of the multi-core era. When combined with simultaneous multithreading technology, multi-core processors can provide powerful optimization opportunities, increasing system throughput substantially. Inspired by these modern technologies, we propose a multi-threaded algorithm, called MT-IncFDs, for incremental mining of functional dependencies, which harnesses multi-core systems.
Let r be a relation on A and t be a set of tuples to insert into r. The aim of MT-IncFDs is to harness multi-core systems to determine the canonical cover Cover(Fr′) of the updated relation r′ = r ∪ t by taking advantage of the already discovered Cover(Fr). Since r ⊂ r′, any FD that holds on r′ also holds on r, so Fr′ ⊆ Fr. In other words, there are two types of FDs in Fr:
– the winners: functional dependencies of Fr which still hold on r′;
– the losers: functional dependencies of Fr which do not hold on r′.
For that, MT-IncFDs proceeds in two steps:
1. initializing Cover(Fr′) with the winners that belong to Cover(Fr);
2. maintaining the losers of Cover(Fr) in order to deduce the new FDs of Cover(Fr′).
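To illustrate the winner/loser split, the sketch below uses the naïve baseline rather than MT-IncFDs itself: it enumerates by brute force every non-trivial FD holding before and after an insertion, then compares the two sets. The toy relation, the inserted tuple and the helper names are assumptions made for illustration only.

```python
from itertools import combinations

ATTRS = "ABC"


def holds(rel, X, a):
    """True iff rel |= X -> a."""
    seen = {}
    for t in rel:
        key = tuple(t[x] for x in X)
        if seen.setdefault(key, t[a]) != t[a]:
            return False
    return True


def all_fds(rel):
    """Every non-trivial FD (X, a) with a non-empty antecedent that holds on rel."""
    return {
        (X, a)
        for k in range(1, len(ATTRS))
        for X in combinations(ATTRS, k)
        for a in ATTRS
        if a not in X and holds(rel, X, a)
    }


# Hypothetical relation r and tuple to insert.
r = [dict(zip(ATTRS, row)) for row in [(1, "x", 10), (2, "y", 10)]]
r_prime = r + [dict(zip(ATTRS, (1, "y", 20)))]

before, after = all_fds(r), all_fds(r_prime)
winners = before & after   # FDs of r that still hold on r'
losers = before - after    # FDs of r broken by the insertion
```

Here the insertion breaks A → B, A → C, B → A and B → C, while AB → C, AC → B and BC → A survive; no FD appears that did not already hold on r. Recomputing everything this way is exactly the cost the incremental approach aims to avoid.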
3.1 Initialization of Cover(Fr′) with winners
MT-IncFDs relies on an alternate characterization of winners, which is based on the concept of agree sets induced by t.
Computing agree sets induced by t. In order to compute the agree sets induced by t, MT-IncFDs starts by computing the equivalence classes of r with respect to t. For that, we divide the attributes of r among the p available threads: each thread is given about |A|/p equivalence classes to compute.
      A    B    C    D    E
t1    1    100  1    2    50
t2    4    101  1    2    50
t3    1    102  2    2    70
t4    1    200  1    2    50
t5    2    101  3    3    100
t6    2    200  1    3    70
t7    1    100  3    2    50
Table 1. Example of a relation.
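The attribute-wise partitioning described above can be sketched in Python as follows, on the first six tuples of Table 1 with t7 as the inserted tuple. This is only an illustration under stated assumptions: the function names are hypothetical, the attributes are split into contiguous chunks of size ⌈|A|/p⌉, and CPython threads are used to show the structure of the computation (because of the global interpreter lock they would not give a real speed-up for CPU-bound work).

```python
from concurrent.futures import ThreadPoolExecutor
from math import ceil

ATTRS = "ABCDE"
ROWS = {  # first six tuples of Table 1
    "t1": (1, 100, 1, 2, 50),
    "t2": (4, 101, 1, 2, 50),
    "t3": (1, 102, 2, 2, 70),
    "t4": (1, 200, 1, 2, 50),
    "t5": (2, 101, 3, 3, 100),
    "t6": (2, 200, 1, 3, 70),
}
t7 = (1, 100, 3, 2, 50)


def classes_for(attr_indices, rows, t):
    """Equivalence classes r(A)_t for one thread's share of the attributes."""
    return {
        ATTRS[i]: {name for name, row in rows.items() if row[i] == t[i]}
        for i in attr_indices
    }


def equivalence_classes(rows, t, p):
    """Split the attributes into contiguous chunks, one chunk per worker."""
    size = ceil(len(ATTRS) / p)
    chunks = [
        range(i, min(i + size, len(ATTRS)))
        for i in range(0, len(ATTRS), size)
    ]
    result = {}
    with ThreadPoolExecutor(max_workers=p) as pool:
        # rows is only read, so the workers need no synchronization
        for part in pool.map(lambda c: classes_for(c, rows, t), chunks):
            result.update(part)
    return result


ec = equivalence_classes(ROWS, t7, p=3)
```

With p = 3 the chunks are {A, B}, {C, D} and {E}, and each worker writes only to its own partial dictionary, which is merged once the workers have returned.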
Example 1. Let us consider the relation of Table 1, suppose that it contains only the first six tuples, and suppose that we have three threads {T1, T2, T3}. In order to determine the equivalence classes of r with respect to t7, we assign to each thread at most ⌈5/3⌉ = 2 equivalence classes. Hence, T1 will compute r(A)t7 and r(B)t7, T2 will compute r(C)t7 and r(D)t7, and T3 will compute r(E)t7. The obtained equivalence classes are: r(A)t7 = {t1, t3, t4}; r(B)t7 = {t1}; r(C)t7 = {t5}; r(D)t7 = {t1, t2, t3, t4}; r(E)t7 = {t1, t2, t4}. Then, EC(r)t7 = {r(A)t7, r(B)t7, r(C)t7, r(D)t7, r(E)t7}.
It is important to mention that there is no dependence among the threads, because each thread has to compute the equivalence classes of a fixed set of attributes. Furthermore, the relation r is accessed read-only. Thus, concurrent accesses to r by multiple threads do not need synchronization and do not cause cache coherence invalidations and misses.
Afterwards, MT-IncFDs determines the maximal equivalence classes of r with respect to t, denoted MC(r)t. For that, each thread Ti checks the maximality of the equivalence classes at hand by comparing them to the equivalence classes computed by the remaining p − 1 threads. This test of maximality is done
concurrently without need of synchronization, since the equivalence classes are accessed read-only.
Example 2. Let us continue with the previous example. T1 has to check the maximality of r(A)t7 and r(B)t7 by comparing them to the equivalence classes computed by T2 and T3. T2 has to check the maximality of r(C)t7 and r(D)t7 by comparing them to the equivalence classes computed by T1 and T3. T3 has to check the maximality of r(E)t7 by comparing it to the equivalence classes computed by T1 and T2. The result of these tests is MC(r)t7 = {{t1, t2, t3, t4}, {t5}}.
Next, MT-IncFDs uses Proposition 1 in order to compute the agree sets induced by t.
Proposition 1. For a couple of tuples (t, t′), Ag(t, t′) = {A | t′ ∈ r(A)t, r(A)t ∈ EC(r)t}. Hence, given a relation r and a tuple t, the agree sets induced by t with respect to r are given by: Ag(r)t = {Ag(t, t′) | t′ ∈ c, c ∈ MC(r)t}.
For that, MT-IncFDs computes the number of couples (t, t′) which can be generated, in order to find the average number of couples that ought to be generated by each thread. If S is this average, MT-IncFDs goes over the list of maximal equivalence classes, sorted by size, and assigns maximal classes consecutively to each thread until the cumulative number of couples is equal to or greater than S. Once the collection of couples of tuples is generated, each thread computes the agree sets of the couples at hand. This step requires a scan of the equivalence classes. Since they are accessed read-only, they can be shared among all the threads without synchronization.
Example 3. Let us continue with the previous example. First, maximal equivalence classes are sorted by their respective sizes. The number of couples of tuples that ought to be generated from maximal equivalence classes is equal to 6. Then the average S = 2.
We start by assigning only the first equivalence class to T1, since the number of couples obtained from this equivalence class is greater than S. Hence, T1 generates the following couples of tuples: {(t7, t1), (t7, t2), (t7, t3), (t7, t4), (t7, t5)}. Afterwards, it deduces the agree sets induced by t7 as follows.
Ag(t7, t1) = ABDE, since t1 belongs to r(A)t7, r(B)t7, r(D)t7 and r(E)t7.
Ag(t7, t2) = DE, since t2 belongs to r(D)t7 and r(E)t7.
Ag(t7, t3) = AD, since t3 belongs to r(A)t7 and r(D)t7.
Ag(t7, t4) = ADE, since t4 belongs to r(A)t7, r(D)t7 and r(E)t7.
Ag(t7, t5) = C, since t5 belongs to r(C)t7.
Consequently, Ag(r)t7 = {ABDE, AD, ADE, DE, C}.
Identification of winners. MT-IncFDs uses Proposition 2 in order to identify the winners of Cover(Fr).
Proposition 2. Let Ag(r)t be the agree sets induced by t with respect to r and X → A be a functional dependency of Cover(Fr). X → A is a winner if and only if ∄Y ∈ Ag(r)t such that X ⊆ Y and A ∉ Y.
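Propositions 1 and 2 can be sketched sequentially in Python as below. The function names are hypothetical and the code makes no attempt at the thread-level load balancing described in the text; it only shows how agree sets follow from the equivalence classes and how the winner test is applied. The data are those of Table 1, and the cover used at the end is the one listed in Example 4.

```python
ATTRS = "ABCDE"
TABLE_1 = {
    "t1": (1, 100, 1, 2, 50),
    "t2": (4, 101, 1, 2, 50),
    "t3": (1, 102, 2, 2, 70),
    "t4": (1, 200, 1, 2, 50),
    "t5": (2, 101, 3, 3, 100),
    "t6": (2, 200, 1, 3, 70),
    "t7": (1, 100, 3, 2, 50),
}


def first(k):
    """The relation made of the first k tuples of Table 1."""
    return {f"t{i}": TABLE_1[f"t{i}"] for i in range(1, k + 1)}


def equivalence_classes(rows, t):
    """EC(r)_t as a dict: attribute -> names of tuples sharing t's value."""
    return {
        a: {n for n, row in rows.items() if row[i] == t[i]}
        for i, a in enumerate(ATTRS)
    }


def maximal_classes(ec):
    """MC(r)_t: non-empty classes not strictly contained in another class."""
    classes = {frozenset(c) for c in ec.values() if c}
    return {c for c in classes if not any(c < other for other in classes)}


def agree_sets_induced(rows, t):
    """Proposition 1: Ag(t, t') gathers the attributes whose class holds t'."""
    ec = equivalence_classes(rows, t)
    candidates = set().union(*maximal_classes(ec))
    return {"".join(a for a in ATTRS if n in ec[a]) for n in candidates}


def is_winner(X, a, agree_sets):
    """Proposition 2: X -> a survives iff no agree set contains X but not a."""
    return not any(set(X) <= set(Y) and a not in Y for Y in agree_sets)


ag_t7 = agree_sets_induced(first(6), TABLE_1["t7"])   # Example 3
ag_t6 = agree_sets_induced(first(5), TABLE_1["t6"])   # Example 4

COVER = [("BC", "A"), ("AB", "C"), ("A", "D"), ("AB", "E"),
         ("BD", "A"), ("BD", "C"), ("C", "D"), ("BD", "E"),
         ("BE", "A"), ("E", "C"), ("E", "D"), ("C", "E")]
winners = [fd for fd in COVER if is_winner(*fd, ag_t6)]
losers = [fd for fd in COVER if not is_winner(*fd, ag_t6)]
```

Only tuples that appear in some maximal class are paired with t, since any other tuple shares no value with t and cannot violate an FD with a non-empty antecedent.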
To achieve a suitable load balance, we divide Cover(Fr) among the p threads. Each thread is given |Cover(Fr)|/p FDs to check, as explained by Proposition 2. It is important to mention that this step does not need synchronization, since each thread can check its FDs independently.
Example 4. Let us consider the relation r of Table 1, suppose that it contains only the first five tuples, and suppose that we insert the tuple t6. The following table gives Cover(Fr).
BC → A    AB → C    A → D    AB → E
BD → A    BD → C    C → D    BD → E
BE → A    E → C     E → D    C → E
After computing the agree sets induced by t6, we obtain Ag(r)t6 = {C, E, BC, AD}. Since Cover(Fr) contains 12 FDs, each thread has to check 4 FDs. Indeed, T1 has to check BC → A, BD → A, BE → A and AB → C. T2 has to check BD → C, E → C, A → D and C → D. T3 has to check E → D, AB → E, BD → E and C → E. According to Proposition 2, the threads deduce that BD → A, BE → A, AB → C, BD → C, A → D, AB → E and BD → E are winners.
3.2 Maintaining losers
Once winners are identified, we can easily determine the set of losers, since the losers are exactly Cover(Fr) minus the winners. MT-IncFDs puts all losers in a queue and assigns them dynamically to a pool of threads on a "first come, first served" basis. In order to maintain a loser X → A at hand, each thread has to determine the agree sets induced by t which do not contain A. For that, each thread uses a recursive procedure based on a depth-first exploration of a search tree. The search tree has X as root, and its leaf nodes represent candidate antecedents of FDs which would replace X → A. An arbitrary node at level i represents an antecedent obtained by considering the sets {G1, G2, . . . , Gi}, where each Gj is the complement of a maximal agree set induced by t which does not contain A. Each node represents a potential antecedent of a new FD that replaces X → A. In order to obtain a candidate of level i + 1 from a node Y of level i, we consider the (i + 1)th set Gi+1. We distinguish two cases:
– if Y ∩ Gi+1 ≠ ∅, then this maximal agree set induced by t is ignored;
– else, we generate a child node equal to the union of Y and {e}, for each e ∈ Gi+1.
For each leaf node Y, we have to verify that Cover(Fr′) contains no FD whose antecedent is included in Y. Once a thread has finished its task, it dequeues another unprocessed loser. If the queue is empty, it polls the other running threads until it can steal work from a heavily loaded thread. For this step, there is no dependence among the threads, because each search tree corresponds to a disjoint set of potential antecedents of different FDs. Since each thread can proceed independently, there is no synchronization while maintaining losers.
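The dynamic, first come, first served assignment of losers to a pool of threads can be sketched in Python as follows. The names run_pool and maintain are hypothetical, the per-loser work is passed in as a callback, and the work stealing inside a single search tree is not shown. The lock only protects the shared result dictionary of this sketch; the tasks themselves share no state.

```python
import queue
import threading


def run_pool(losers, maintain, n_threads=4):
    """Apply `maintain` to every loser; idle workers pull the next one."""
    pending = queue.Queue()
    for fd in losers:
        pending.put(fd)

    results = {}
    lock = threading.Lock()  # guards `results` only

    def worker():
        while True:
            try:
                fd = pending.get_nowait()   # next unprocessed loser
            except queue.Empty:
                return                      # nothing left to do
            out = maintain(fd)
            with lock:
                results[fd] = out

    threads = [threading.Thread(target=worker) for _ in range(n_threads)]
    for th in threads:
        th.start()
    for th in threads:
        th.join()
    return results


# Placeholder task: a real one would explore the search tree of the loser.
demo_losers = [("BC", "A"), ("E", "C"), ("C", "D"), ("E", "D"), ("C", "E")]
demo = run_pool(demo_losers, maintain=lambda fd: len(fd[0]), n_threads=3)
```

Because each worker takes a new loser as soon as it finishes the previous one, a thread stuck on a large search tree does not hold back the others.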
Example 5. Through this example we explain how a thread maintains a loser. Let us continue with the previous example and maintain the loser C → E. The maximal agree sets induced by t6 that do not contain E are {AD, BC}, whose complements are BCE and ADE respectively. We build a search tree having C as root. First, we consider the agree set AD: since C intersects its complement (C ∩ BCE ≠ ∅), this agree set is ignored. Afterwards, we consider the agree set BC: C does not intersect its complement (C ∩ ADE = ∅). Consequently, we generate a child node AC that represents the antecedent of the functional dependency AC → E. Next, we generate a second child node CD that represents the antecedent of the functional dependency CD → E. (We ignore CE → E because it is a trivial FD, i.e., the consequent is contained in the antecedent.)
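The depth-first maintenance of a single loser can be sketched in Python as below. This is one possible reading, under the assumption that each set Gj is the complement of a maximal agree set not containing the consequent; the name maintain_loser is hypothetical, and the final check of each leaf against Cover(Fr′) is left out (only minimality among the generated leaves is enforced).

```python
ATTRS = "ABCDE"


def maximal(sets):
    """Keep only the sets that are not strictly contained in another one."""
    sets = {frozenset(s) for s in sets}
    return [s for s in sets if not any(s < other for other in sets)]


def maintain_loser(X, a, agree_sets, attrs=ATTRS):
    """Candidate antecedents that could replace the loser X -> a."""
    blockers = maximal(Y for Y in agree_sets if a not in Y)
    # G_j: complement of each maximal agree set (assumed reading),
    # put in a fixed order so that the traversal is deterministic.
    comps = sorted((frozenset(attrs) - Y for Y in blockers), key=sorted)
    leaves = set()

    def dfs(Y, level):
        if level == len(comps):
            leaves.add(Y)
            return
        G = comps[level]
        if Y & G:
            # Y already leaves this agree set: ignore it and go deeper.
            dfs(Y, level + 1)
            return
        for e in sorted(G - {a}):   # e == a would give a trivial FD
            dfs(Y | {e}, level + 1)

    dfs(frozenset(X), 0)
    # Drop leaves that strictly contain another leaf.
    minimal = [Y for Y in leaves if not any(Z < Y for Z in leaves)]
    return sorted("".join(sorted(Y)) for Y in minimal)


candidates = maintain_loser("C", "E", {"C", "E", "BC", "AD"})
```

On the data of Example 5 this yields the antecedents AC and CD, that is, the candidate FDs AC → E and CD → E.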
4 Experimental results
In this section, we assess the performance of MT-IncFDs and highlight the benefit of using multi-core systems. Unfortunately, large multi-core CPUs are still not available. We therefore conducted a set of experiments on a machine equipped with a 4-core Intel processor and 8 GB of RAM. We generated synthetic data sets (i.e., relations) in order to control various parameters during the tests. We first create a table with |A| attributes in the database and then insert |r| tuples one by one. Each inserted value depends on the parameter c, the rate of identical values, which controls the number of identical values in a column of the table. The experiments were carried out on: (1) a relation composed of 10000 tuples, 10 attributes and 50% of identical values per attribute, and (2) a relation composed of 10000 tuples, 20 attributes and 50% of identical values per attribute. The curves of Figure 1 illustrate the execution time of MT-IncFDs
[Figure: two panels, (1) and (2), each plotting execution time (ms) against the number of threads (1 to 4), with legend entries 1000, 4000, 7000 and 10000.]
Fig. 1. Execution time of MT-IncFDs vs the variation of the number of threads.
when the 1000th, 4000th, 7000th and 10000th tuple is added. According to these curves, we can note that:
– the more we increase the number of threads, the more the execution time of MT-IncFDs decreases, and we obtain a quasi-linear speed-up;
– the proposed solution scales with both the number of attributes and the number of threads;
– the speed-up of MT-IncFDs decreases slightly when we increase the number of tuples of the initial relation r. This is explained by the fact that the more the number of tuples increases, the more cache misses occur (in the step that computes the equivalence classes).
5 Conclusion
In this paper, we proposed the first parallel incremental algorithm for FD mining. MT-IncFDs is a multi-threaded algorithm which harnesses multi-core systems. We showed that our algorithm uses data structures exhibiting high locality. Moreover, they are accessed read-only. Thus, concurrent accesses to such data structures by multiple threads do not need synchronization and do not cause cache coherence invalidations and misses. The results of experiments carried out on synthetic databases showed a quasi-linear speed-up with the number of cores for the MT-IncFDs algorithm. As future work, it would be interesting to study the use of another parallel framework, namely clusters, including clusters of multi-core machines, in order to benefit from both architectures.
References
1. Serge Abiteboul, Richard Hull, and Victor Vianu. Foundations of Databases. Addison-Wesley, 1995.
2. J. Baixeries. A formal concept analysis framework to mine functional dependencies. In Proceeding of the Workshop on Mathematical Methods for Learning, Villa Geno, Italy, 2004.
3. J. Demetrovics, L. Libkin, and I. B. Muchnik. Functional dependencies in relational databases: A lattice point of view. Discrete Applied Mathematics, 40:155–185, 1992.
4. S. Flesca, F. Furfaro, S. Greco, and E. Zumpano. Repairing inconsistent XML data with functional dependencies. In Laura C. Rivero, Jorge Horacio Doorn, and Viviana E. Ferraggine, editors, Encyclopedia of Database Technologies and Applications, pages 542–547. Idea Group, 2005.
5. Y. Huhtala, J. Kärkkäinen, P. Porkka, and H. Toivonen. Tane: An efficient algorithm for discovering functional and approximate dependencies. Computer Journal, 42(2):100–111, 1999.
6. S. Lopes, J. Petit, and L. Lakhal. Efficient discovery of functional dependencies and Armstrong relations. In Proceedings of the 7th International Conference on Extending Database Technology, Konstanz, Germany, March 27-31, 2000, pages 350–364. Springer, 2000.
7. S. Lopes, J. Petit, and L. Lakhal. Functional and approximate dependency mining: database and FCA points of view. J. Exp. Theor. Artif. Intell., 14(2-3):93–114, 2002.
8. D. Maier. The Theory of Relational Databases. Computer Science Press, 1983.
9. H. Mannila and K. Räihä. Design by example: An application of Armstrong relations. Journal of Computer and System Sciences, 33(2):126–141, 1986.
10. H. Mannila and K. Räihä. Dependency inference. In Proceedings of 13th International Conference on Very Large Data Bases, September 1-4, 1987, Brighton, England, pages 155–158, 1987.
11. H. Mannila and K. Räihä. Algorithms for inferring functional dependencies from relations. Data Knowl. Eng., 12(1):83–99, 1994.
12. N. Novelli and R. Cicchetti. Fun: An efficient algorithm for mining functional and embedded dependencies. In Jan Van den Bussche and Victor Vianu, editors, Proceedings 8th International Conference on Database Theory, London, UK, January 4-6, volume 1973 of Lecture Notes in Computer Science, pages 189–203. Springer, 2001.
13. H. Beng Kuan Tan and Y. Zhao. Automated elicitation of functional dependencies from source codes of database transactions. Information & Software Technology, 46(2):109–117, 2004.
14. C. M. Wyss, C. Giannella, and E. L. Robertson. Fastfds: A heuristic-driven, depth-first algorithm for mining functional dependencies from relation instances - extended abstract. In Proceedings of the Third International Conference on Data Warehousing and Knowledge Discovery, DaWaK 2001, Munich, Germany, September 5-7, pages 101–110. Springer, 2001.
15. H. Yao, Howard J. Hamilton, and Cory J. Butz. Fd mine: Discovering functional dependencies in a database using equivalences. In Proceedings of the 2002 IEEE International Conference on Data Mining (ICDM 2002), 9-12 December, Maebashi City, Japan, pages 729–732, 2002.