Data Mining: A Tightly-Coupled Implementation on a Parallel Database Server Mauro Sousa
Marta Mattoso
Nelson Ebecken
Computer Science Department COPPE/UFRJ – Federal University of Rio de Janeiro P.O. Box 68511, Rio de Janeiro, RJ, Brazil, 21945-970 Fax: +55 21 2906626 e-mail: {mauros, marta}@cos.ufrj.br, [email protected]
Abstract

Recent years have shown the need for an automated process to discover interesting and hidden patterns in real-world databases, given the difficulty of analyzing large volumes of data using only OLAP tools. This sort of process demands substantial computational power, memory and disk I/O, which can only be provided by parallel computers. Our work studies the contributions of a tightly-coupled use of a parallel DBMS and parallel architectures when deploying data mining applications. We present our practical experience, addressing performance problems with parallel processing and data fragmentation, using a commercial database.

1. Introduction

With the recent explosion of information and the availability of cheap storage, it has been possible to gather massive amounts of data during the last decades. This data can be used to gain competitive advantages, by discovering previously unknown patterns that can guide decision making. Recent years have shown that it is becoming difficult to analyze large amounts of data using only OLAP tools, showing the need for an automated process to discover interesting and hidden patterns. Data mining techniques have been increasingly studied [12], especially in their application to real-world databases. One typical problem is that databases tend to be very large, and these techniques often repeatedly scan the entire set. Sampling has been used for a long time, but subtle differences among sets of objects in the database become less evident [9].
The need to handle large amounts of data demands substantial computational power, memory and disk I/O, which can only be provided by parallel computers. In particular, a parallel database system can be used to provide such performance improvements, although these engines are not specifically optimized for data mining applications.

Related Work

A variety of data mining algorithms have been proposed to exploit parallel processing. Most of them use flat files and exploit parallelism through parallel programming. Many data mining implementations have already used databases. DBMiner [8], for instance, is a system that implements data mining techniques integrated with database technologies, designed to extract rules from large relational databases. Although the idea of using a DBMS is not new, most implementations use databases only to issue queries that are then processed by a client machine. One such example is the SKICAT system [5], which uses a relational DBMS (Sybase) for catalog storage, modification and management. This means that most current data mining applications have a loose connection with databases, resulting in poor performance [3]. The Quest data mining system [2] has been addressing the development of fast, scalable algorithms. We can cite the parallel implementation of the mining of association rules (the APRIORI algorithm) and the SPRINT [16] classification algorithm - an evolution of SLIQ [10] - where multiple processors work together to construct a classification model, namely a decision tree. These algorithms run against data in flat files as well as the DB2 family of database products, but the databases are accessed in a loosely-coupled mode using dynamic SQL. Freitas [6] has presented a series of primitives for rule induction (RI), after analyzing many KDD algorithms of the RI paradigm. His conclusion is that the core operation of the candidate rule evaluation procedure is a primitive he calls Count by Group. This primitive can be translated into the usual GROUP BY clauses of SQL statements, which often provide a potential degree of parallelism in parallel DBMSs.

Our work addresses a solution that integrates a machine learning algorithm and a tightly-coupled use of a parallel DBMS for data mining applications, addressing performance problems with parallel processing and data fragmentation. To validate our experiments, we have used Oracle Parallel Server and the Oracle Parallel Query Option, configured to run on a multiprocessor IBM RS/6000 SP2 system. We have constructed a set of stored procedures, which are stored within the Oracle data dictionary, providing high performance and native integration with the database.

The remainder of this work is organized as follows. Section 2 presents advantages and shortcomings of implementing a data mining algorithm in parallel databases. Section 3 describes general characteristics of parallel database systems and data mining algorithms, and section 4 presents practical experiments, such as the development of tightly-coupled routines, data fragmentation and special features of the DBMS exploited. Finally, section 5 presents our conclusions and observations.

2. Parallel Databases and Data Mining

There are a variety of data mining algorithms constructed to run in parallel, taking advantage of parallel architectures through specific programming routines. Our implementation, on the other hand, is based on parallel database systems. There are many advantages to this approach; we cite some here:
• The implementation becomes simpler. There is no need to use parallel routines such as MPI libraries. The parallel DBMS is itself responsible for parallelizing queries that are issued against it.

• The DBMS can choose between serial and parallel execution plans transparently to the application. Depending on the data mining algorithm and technique used (classification, clustering, association rules, etc.), parallelism is useful only during phases where large amounts of data are being searched. If there are not many examples being analyzed, the communication overhead between coordinator and slave processes does not pay off.

• It is possible to work with data sets that are considerably larger than main memory. We have to translate most of the data mining algorithm into SQL statements. If data does not fit in memory, the database itself is responsible for handling the information, paging and swapping when necessary.

• Simplified data management. A DBMS has its own tools to manage data.

• Data may be updated or managed as part of a larger operational process. Data may become available through transformation techniques in data warehouses, or the database can be the data warehouse itself.

• Extensibility to mine complex data types. This characteristic is becoming more realistic as emerging object-relational databases provide the ability to handle image, video, voice and other multimedia objects.

• Ad-hoc and OLAP queries can be used to validate discovered patterns. Queries can be issued to verify rules discovered by the data mining process, providing means to easily evaluate results. Ad-hoc queries and OLAP can be used to complement the whole process.

• Opportunity for database fragmentation. Database fragmentation reduces the number of page accesses necessary for reading data: irrelevant data can simply be ignored, depending on the filter specified in SQL statements. Furthermore, some DBMSs can process each partition in parallel, using different execution plans when applicable. In general, DBMSs manage this functionality automatically.
However, some drawbacks have to be considered, such as:

• Less control over how parallelization occurs. Although some customization is possible, such as setting the nodes that will participate in the process, the DBMS itself is in charge of parallelization.

• The algorithm may become dependent on characteristics of the parallel DBMS used. In a parallel database, we have some ability to direct the optimizer to use a specific execution plan, but the algorithm may then depend on characteristics of that particular DBMS.

• Overhead of the database system kernel. The kernel of a database system is designed to handle a large set of operations, such as OLTP transactions, OLAP queries, etc. Although it is possible to minimize this overhead and customize the environment to take the best advantage of the corresponding architecture, there will always be some functionality that is not applicable to the data mining algorithm, which can degrade performance when compared to access to flat files.
3. Implementation Issues

3.1. Tightly-Coupled Integration
Even when using databases, most current data mining applications have only a loose connection with them. They treat the database simply as a container from which data is extracted directly into the main memory of the computer responsible for running the data mining algorithm. This approach limits the amount of data that can be handled, forcing applications to filter information (figure 1) and use only part of it to discover patterns, which is definitely not the best approach.
[Figure: a client-side process applies a filter to data retrieved from the server.]
Figure 1: Traditional approach used by current data mining applications. Occasionally, the client and the server may be the same machine.
The idea of executing user-defined computation within the database, thus avoiding unnecessary network traffic (figure 2), has been a major concern in many DBMSs. One example of this sort of implementation is the stored procedure of commercial databases. For instance, Oracle provides the possibility of implementing procedural code in PL/SQL and storing it within the data dictionary in a pre-compiled form.
[Figure: client calls MakeDecisionTree, PruneTree and ExtractRules; server-side functions such as errorRate, Classify and giniSplit are issued within SQL statements against the fragments.]
Figure 2: Integration with the DBMS. Simplified approach showing client calls and PL/SQL functions that are issued within SQL statements executed by the server.
Some authors have also developed a methodology for tightly-coupled integration of data mining applications with database systems [3], selectively pushing parts of the application program into the database system instead of bringing the records into the application program. What we present is a variation of these methodologies, taking advantage of the parallel features available in many parallel database systems, in particular the ones provided by Oracle Parallel Server. Besides, we should minimize the number of calls to the database server. Ideally, one should be able to issue a SQL statement such as

    SELECT EXTRACTRULES(depth, latitude, longitude) FROM mine_relation

where MINE_RELATION represents the data itself, and EXTRACTRULES a function that returns the list of rules discovered by analyzing the attributes passed as parameters. Because of the nature of the SQL offered by current database systems, it is not possible to perform such a task directly. The way we implemented this kind of feature was through a series of procedures and functions, logically grouped together in an Oracle object named a package. Therefore, the user can call a single procedure (e.g. MakeDecisionTree), interactively informing all available parameters. Ideally, an algorithm implemented to handle a large volume of data should run on the server where the data resides, in order to avoid network traffic. Furthermore, it is desirable that the algorithm has a close connection to the data to avoid, for example, casting of types, which can degrade performance.
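For illustration, a minimal sketch of what such a package specification might look like (the procedure names follow the ones cited above; the parameter lists are assumptions, not the actual interface):

    CREATE OR REPLACE PACKAGE DataMine AS
      -- grows the decision tree over the given mining relation
      PROCEDURE MakeDecisionTree(p_table IN VARCHAR2, p_class_col IN VARCHAR2);
      -- prunes the grown tree against a separate test relation
      PROCEDURE PruneTree(p_test_table IN VARCHAR2);
      -- derives and generalizes rules from the pruned tree
      PROCEDURE ExtractRules;
    END DataMine;
    /

The whole process can then be started from the client with a single server-side call, e.g. EXECUTE DataMine.MakeDecisionTree('mine_relation', 'class').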
3.2. Typical Queries in Rule Induction Algorithms
Most queries issued in rule induction algorithms, particularly decision trees, deal with two kinds of operations: sorts (ORDER BY) and groupings (GROUP BY). These queries represent the most time-consuming tasks in the implemented algorithm.

    SELECT temperature, class, COUNT(*)
    FROM mine_relation
    GROUP BY temperature, class
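In Oracle, such a histogram query can be explicitly directed to run in parallel through an optimizer hint; a minimal sketch (the degree of parallelism, 4, is an arbitrary choice for illustration):

    SELECT /*+ PARALLEL(mine_relation, 4) */ temperature, class, COUNT(*)
    FROM mine_relation
    GROUP BY temperature, class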
Instead of using only one relation, we can use several tables and perform joins to process them, but this operation is very expensive, even when processed in parallel. Alternatively, we could create a view containing the SQL statement corresponding to the join operation and use it as the base relation for the process (a sketch of such a view closes this section), but it would raise the same performance problems. Parallel database servers can easily process these typical queries, making good use of the resources of parallel machines. In Massively Parallel Processing (MPP) architectures, for example, each node can independently compute counts on the records stored locally. In the next phase, a coordinator process consolidates these counts and returns the result to the user [14].
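A sketch of the view alternative mentioned above (the table names and join columns are hypothetical):

    CREATE VIEW mine_relation AS
      SELECT s.depth, s.latitude, s.longitude, m.temperature, m.class
      FROM station s, measurement m
      WHERE s.station_id = m.station_id;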
3.3. Parallel Architecture Considerations
In order to process the presented typical queries efficiently in parallel, a parallel DBMS should be designed to take full advantage of the parallel architecture on which it runs, avoiding unnecessary communication between processors.

Parallel databases based on shared-nothing platforms often implement function shipping. This means that tables and other data sets are statically partitioned, and operations on these partitions are performed by the nodes that own them, in response to a message sent by the coordinator process. The main advantage of function shipping is scalability. Besides, contention on the shared network is minimized because only execution plan fragments and filtered responses are transmitted. On the other hand, function shipping in shared-nothing architectures presents the skew problem, since resources may be underutilized when data is not distributed uniformly among nodes, or when the data relevant to a query is located in only a subset of the existing nodes.

An approach that can be used in shared-disk and also in shared-nothing architectures is data shipping, in which any process can request any disk block, independent of its location. Data need not be statically partitioned among disks. Many cooperating processes running in parallel provide CPU and I/O parallelism. In shared-nothing systems, messages can be used to send database blocks to other nodes through the network. The main advantage of this approach is that load balance can be achieved automatically, since data can be partitioned among slave processes during query execution. Besides, the data dictionary is available throughout the cluster, so that any node can check SQL syntax and optimize statements. One shortcoming is the potentially inefficient remote access to disks in shared-nothing machines. Ideally, a better approach would be one that eliminates the disadvantages of both function and data shipping, using the best features available with the minimum possible overhead.
3.4. User-defined Functions
In order to provide close integration with the database, the construction of user-defined functions that can be called from within SQL statements is advisable. This decreases the number of calls to the database and makes our code considerably simpler.

    SELECT MIN(giniSplit(attribute, class, num))
    FROM (SELECT attribute, class, COUNT(*) num
          FROM mine_relation
          GROUP BY attribute, class)

The above query presents the SQL statement used during the growing phase of a decision tree; it is used to calculate the best splitting point of an internal node using a statistical measure called the gini index. The inner SQL statement sorts the training set based on attribute values (in Oracle, a GROUP BY operation implicitly causes a sort). The GROUP BY clause is used instead of a simple ORDER BY to count the number of occurrences of each attribute/class pair, since we cannot split cases having the same attribute value. Hence, the splitting criterion is computed faster than before, using only one database call per attribute in each phase. When an attribute has too many distinct values, discretization is advisable.
3.5. Fragmentation Techniques
One characteristic of SQL that can degrade the performance of a rule induction algorithm, in particular a decision tree one, is that its structure requires a separate query to construct the histogram for each attribute. Thus, it is not possible to write a single SQL statement that returns all the histograms reading the data just once. Fragmentation techniques come to play an important role in this context, since they tend to eliminate irrelevant information during query execution.

3.5.1. Horizontal Fragmentation
Horizontal fragmentation partitions a relation along its tuples, each fragment holding a subset of the tuples of the relation. Several commercial databases provide this feature, most of them transparently to the application. In a previous step, one can create several similar tables, each corresponding to a horizontal fragment, and associate special constraints with them. These constraints represent the partitioning functions.
[Figure: three fragments, Measurement_1 (depth < 500m), Measurement_2 (depth between 500m and 1500m) and Measurement_3 (depth > 1500m), each with columns depth, latitude, longitude and temperature, combined by a UNION ALL of SELECTs into the relation Measurement.]
Figure 3: Horizontal fragmentation. Several fragments are created based on partitioning functions. The whole data set is processed through a UNION ALL operator.
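A sketch of this scheme in SQL, following the fragments of figure 3 (the CHECK constraints play the role of the partitioning functions; the column types are assumptions):

    CREATE TABLE Measurement_1 (
      depth       NUMBER CHECK (depth < 500),
      latitude    NUMBER,
      longitude   NUMBER,
      temperature NUMBER);

    CREATE TABLE Measurement_2 (
      depth       NUMBER CHECK (depth BETWEEN 500 AND 1500),
      latitude    NUMBER,
      longitude   NUMBER,
      temperature NUMBER);

    CREATE TABLE Measurement_3 (
      depth       NUMBER CHECK (depth > 1500),
      latitude    NUMBER,
      longitude   NUMBER,
      temperature NUMBER);

    CREATE VIEW Measurement AS
      SELECT * FROM Measurement_1 UNION ALL
      SELECT * FROM Measurement_2 UNION ALL
      SELECT * FROM Measurement_3;

    -- with a filter like this one, the optimizer can skip Measurement_2 and
    -- Measurement_3 entirely, since their constraints contradict the predicate
    SELECT * FROM Measurement WHERE depth < 400;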
Figure 3 presents a horizontal fragmentation example, where depth < 500, depth between 500 and 1500, and depth > 1500 represent partitioning functions. The original relation can be obtained using a SQL statement that unions all the fragments into one data set. Some parallel DBMSs, particularly the Oracle server, are able to use these functions (table constraints) to eliminate some partitions when reading data, depending on the filter used in the corresponding SQL statement. Each partition can be processed in parallel, using different execution plans when applicable. Another problem is related to fragment allocation. Suppose we define F fragments to be distributed among N nodes. If we simply allocate F / N fragments to each node, there will be a load imbalance, since beyond a certain point the decision tree algorithm will work on only a few fragments, assuming the partitioning functions were appropriately defined.
[Figure: fragments F1-F4 allocated pairwise to the disks of nodes N1-N4, connected by the HPS interconnect.]
Figure 4: Load imbalance. Since beyond a certain point of the algorithm reads will be concentrated on only some partitions, most of the nodes are going to issue remote read requests. F1, F2, F3 and F4 represent fragments distributed among four nodes (N1, N2, N3 and N4). Grayed disks mean they are currently being accessed by the application.
Horizontal partitioning will provide the ability to discard some partitions, so that the algorithm can read less data, but, ideally, data should be evenly distributed or declustered among all nodes that will participate in the process [11].
[Figure: the same four fragments F1-F4, each striped across all four disks of nodes n1-n4, connected by the HPS interconnect.]
Figure 5: Load balance. We still have four partitions distributed among four nodes, with the difference that data is spread among all available disks. Grayed disk portions mean that data is currently being read by the data mining application.
Declustering exploits I/O parallelism and minimizes skew, but involves higher startup and termination costs, because a process has to be started and terminated on each of the nodes where the relation resides [16].

3.5.2. Vertical Fragmentation
A typical approach that allows user queries to deal with smaller relations, causing a smaller number of page accesses, is vertical fragmentation [13]. In the case of decision trees, it would be necessary to create one relation per attribute, each holding attribute/class pairs (remember that decision trees work with lists of attributes, in each phase trying to find the best splitting point among all attributes), as is done with flat files in SPRINT [16].

4. Practical Experiments

To validate the proposed approach, we have constructed a set of stored procedures (grouped into packages) that are stored within the Oracle Server data dictionary. We have used Oracle Parallel Server and the Oracle Parallel Query Option, configured to run on a multiprocessor IBM RS/6000 SP2 system. Our implementation is based on decision tree algorithms, because they are well-suited classifiers for data mining problems, producing similar accuracy when compared to other classification methods. Besides, they represent one of the most widely used and practical methods for inductive inference, and they can be built considerably faster than other machine learning methods. Furthermore, decision trees can easily be constructed from and transformed into SQL statements [1], which can be used to query a database system, particularly a parallel one. There are many implementations using decision tree learning algorithms, such as CART [4], ID-3 and its successor C4.5 [15], and SPRINT (SLIQ's parallel implementation) [16, 10].

A decision tree recursively subdivides the training set until each partition represents cases totally or dominantly belonging to the same class. In order to partition the training data, a statistical test is used as the splitting criterion. The test we implemented is the gini index [4, 16]. For a data set T containing examples from n classes, gini(T) is defined as

    gini(T) = 1 − Σj pj²

where pj is the relative frequency of class j in T.
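As a concrete illustration, this measure can be computed directly from the per-class counts of a partition; a minimal PL/SQL sketch (the collection type and function name are assumptions, not the routines actually used in our implementation):

    CREATE TYPE num_list AS TABLE OF NUMBER;
    /

    CREATE OR REPLACE FUNCTION gini_index(p_counts IN num_list) RETURN NUMBER IS
      -- p_counts(j) holds the number of examples of class j in the partition
      v_total NUMBER := 0;
      v_sum   NUMBER := 0;
    BEGIN
      FOR j IN 1 .. p_counts.COUNT LOOP
        v_total := v_total + p_counts(j);
      END LOOP;
      IF v_total = 0 THEN
        RETURN 0;
      END IF;
      FOR j IN 1 .. p_counts.COUNT LOOP
        -- accumulate pj squared, pj being the relative frequency of class j
        v_sum := v_sum + POWER(p_counts(j) / v_total, 2);
      END LOOP;
      RETURN 1 - v_sum;
    END gini_index;
    /

For instance, gini_index(num_list(50, 50)) returns 0.5, the maximum impurity for two equally frequent classes.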
The recursive partitioning method imposed on the training examples often results in a very complex tree that overfits the data by inferring more structure than is represented by the training cases [15]. We have also implemented a pruning module, which removes parts of the tree that do not contribute to classification accuracy, producing less complex and more comprehensible trees. Even after pruning, a tree can still represent a complex, incomprehensible structure at first glance. Large decision trees are difficult to understand because each node has a specific context established by the outcomes of tests at antecedent nodes [15]. Finally, there is also a rule extraction module, which tends to solve this problem, presenting a more compact and simpler representation once irrelevant expressions are eliminated from the originally generated tree.
4.1. PL/SQL
PL/SQL is Oracle's SQL extension that adds procedural constructs, variables and other features. A block of PL/SQL is sent to the server and executed as a unit, thus reducing network load and resulting in better processing. It is possible to develop procedures and functions in PL/SQL and store them within the Oracle data dictionary. PL/SQL is designed to integrate closely with the Oracle database system and with SQL; therefore, it supports all the internal datatypes native to the Oracle database. In PL/SQL, parallelism is achieved through queries and not through the language: the only way to parallelize code written in PL/SQL is to construct a stored function that can be called from within a SQL statement.
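A sketch of such a function and its use from SQL (the PARALLEL_ENABLE clause, available in later Oracle releases, declares the function safe for evaluation by parallel query slaves; earlier releases rely on PRAGMA RESTRICT_REFERENCES purity assertions instead, and the function shown is a trivial placeholder):

    CREATE OR REPLACE FUNCTION to_kelvin(p_temp IN NUMBER)
      RETURN NUMBER PARALLEL_ENABLE IS
    BEGIN
      -- trivial per-row computation, executed by each slave on its own rows
      RETURN p_temp + 273.15;
    END to_kelvin;
    /

    SELECT to_kelvin(temperature) FROM mine_relation;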
4.2. User-defined Functions
We have constructed many user-defined functions that are called within SQL statements, enforcing a tightly-coupled integration with the database and its query language and avoiding network traffic. One of these functions is the one presented in section 3.4 (giniSplit), which is used during the tree-growing phase. Besides, during the pruning phase, a user-defined function created from the original tree is used to classify test cases in parallel. Each processor computes values locally, and these values are then returned to a coordinator process. The function Classify returns the number of hits and misses during classification.

    SELECT DECODE(class, Classify(...), 1, 0) hits,
           DECODE(class, Classify(...), 0, 1) misses
    FROM mine_relation_test
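The per-case values can then be totalized in the same parallel scan; a sketch (the parameters of Classify are elided as above):

    SELECT SUM(DECODE(class, Classify(...), 1, 0)) hits,
           SUM(DECODE(class, Classify(...), 0, 1)) misses
    FROM mine_relation_test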
Once the classification model is constructed, a set of rules can be derived by reading the tree from top to bottom until each leaf node is reached. The n rules originally defined (where n represents the number of leaf nodes) are generalized, and some of them may be simplified or even eliminated. The rule generalization process is a time-consuming task, since we examine many possibilities in order to eliminate irrelevant expressions in rules, producing a simpler subset. This process issues many SQL statements, in which the WHERE conditions represent rules discovered in previous phases. These statements, each one processed in parallel, return the number of false positive and false negative examples, which are then used in conjunction with the well-known Minimum Description Length (MDL) method, as is done in C4.5 [15].
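For a hypothetical rule such as IF depth > 1500 THEN class = 1, these counting statements might look like the following sketch:

    -- cases covered by the rule but belonging to another class (false positives)
    SELECT COUNT(*) FROM mine_relation
    WHERE depth > 1500 AND class <> 1;

    -- cases of the rule's class that the rule fails to cover (false negatives)
    SELECT COUNT(*) FROM mine_relation
    WHERE class = 1 AND NOT (depth > 1500);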
As a final result, we reach an ordered rule set, such as the one shown below:

    IF a > 1 AND b ≤ 2 THEN RETURN 1
    ELSE IF a ...