Rule Based Classification on a Multi Node Scalable Hadoop Cluster

Shashank Gugnani, Devavrat Khanolkar, Tushar Bihany, and Nikhil Khadilkar

BITS Pilani K.K. Birla Goa Campus, Goa - 403726, India
[email protected]
Abstract. The Hadoop framework is a reliable, scalable framework for big data analytics. In this paper we investigate the Hadoop framework for distributed data mining to reduce the computational cost of processing exponentially growing scientific data. We use the RIPPER (Repeated Incremental Pruning to Produce Error Reduction) algorithm [5] to develop a rule-based classifier, and we propose a parallel implementation of RIPPER based on the Hadoop MapReduce framework. The data is horizontally partitioned so that each node operates on a portion of the dataset, and the results are finally aggregated to build the classifier. We tested our algorithm on two large datasets, and the results show that we can achieve a speed up as high as 3.7 on 4 nodes.

Keywords: Hadoop, Distributed Data Mining, Data-intensive computing.

1 Introduction
Computational power is increasing with time. However, the requirements to process the data generated every day are still challenging. Moreover, the data is distributed all over the globe. Silicon-based architectures have almost reached their upper limits in terms of processing capability (clock speed). At the same time, significant technological advancements have paved the way for cost-effective parallel computing systems. Hence, there has been a sharp increase in the importance of parallel and distributed computing.
Apache Hadoop [1] is an open-source framework which allows users to store and process huge datasets in a distributed environment. Today, it is used by many of the top corporations, e.g., Yahoo and Facebook. The Apache Hadoop framework is composed of the following modules:
1. Hadoop Common - contains libraries and utilities needed by other Hadoop modules.
2. Hadoop Distributed File System (HDFS) [4] - a distributed file system that stores data on commodity machines, providing very high aggregate bandwidth across the cluster.
3. Hadoop YARN - a resource-management platform responsible for managing compute resources in clusters and using them to schedule users' applications.
4. Hadoop MapReduce - a programming model for large-scale data processing.
Apache Hadoop uses MapReduce to process large datasets in parallel on thousands of nodes in a reliable, fault-tolerant manner. A Hadoop MapReduce job first divides the input data into chunks (the default size is 64 MB). These independent chunks are then processed by the map tasks on different nodes in a completely parallel manner. The map outputs are sorted and passed on to the reduce tasks, which collate the work and combine the results. Monitoring, scheduling and re-executing failed tasks are the responsibility of the MapReduce framework.
Fig. 1. Map-Reduce Framework
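To make the map/sort/reduce flow concrete, the sketch below shows the classic word-count job written against the Hadoop MapReduce Java API. It is an illustrative example only and is not part of the classifier described in this paper.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map: emit (word, 1) for every word in the input chunk assigned to this task.
  public static class TokenizerMapper
      extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reduce: all counts for a given word arrive at one reduce task; sum them.
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable v : values) {
        sum += v.get();
      }
      context.write(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```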
In this paper we present a parallel implementation of a rule-based classifier on Hadoop. We use the RIPPER (Repeated Incremental Pruning to Produce Error Reduction) algorithm for rule generation. Proposed by William Cohen, RIPPER employs a general-to-specific strategy to grow rules and FOIL's (First Order Inductive Learner) information gain measure to choose the next conjunct to be added to a rule. We propose an implementation of RIPPER on Hadoop using a data-parallel model and report the results of our experiments.
The rest of the paper is organized as follows. In Section 2 we discuss related work, and Section 3 presents the standard RIPPER algorithm and its parallel implementation on Hadoop. Section 4 gives the complexity analysis of our algorithm, Section 5 shows the experimental results, and finally in Section 6 we conclude the paper.
2 Related Work
Classification in data mining is a mapping from items in a collection to target categories. To learn this mapping, one needs to process datasets
which can be very large. To handle the problem of processing large datasets, the MapReduce processing model is a natural choice. Built-in features such as parallelization across large clusters, handling of node failures and efficient communication among machines have made this model widely used. Dean and Ghemawat [6] introduced this model and presented simple pseudo code for the mapper and reducer functions. Another illustration was given by Mackey et al. [9] of a Hadoop MapReduce implementation, together with a discussion of the Hadoop Distributed File System (HDFS). Dean and Ghemawat [7] also described MapReduce as a flexible data processing tool in a publication on the advantages of MapReduce over parallel databases. Nguyen et al. [10] implemented a complex computational problem, the N-body problem, using MapReduce; the N-body problem simulates the movement of particles under gravitational or electrostatic forces. Zhou et al. [12] used the Hadoop MapReduce framework to show that a parallel implementation of the Naïve Bayes algorithm is much faster than the standard algorithm.
Rule-based classification is a technique that classifies records using a collection of conditional rules. A general rule-based classification algorithm was presented by Qin et al. [11] in their uRule algorithm, which also includes criteria for growing and pruning rules. RIPPER is the most widely used rule induction algorithm and works well even with noisy datasets, since it uses a validation set to prevent overfitting. Basu and Kumaravel [3] elaborated further on RIPPER and implemented it as JRip. Cohen [5] formulated and implemented RIPPER, a fast and effective rule induction algorithm that builds on IREP (Incremental Reduced Error Pruning). Ishibuchi et al. [8] proposed an island model to build a fuzzy rule-based classifier. They divide the data equally among the islands (nodes) and regularly shift the data to adjacent nodes. A classifier is built for each set of data, and the classifier that performs best over all the sets is selected as a member of the final ensemble classifier. Even though the accuracy of the ensemble classifier is better than that of the individual classifiers, the total accuracy never exceeds 90%. Also, the classifiers built on each island represent only locally optimal solutions and may not be the best solution to the problem.
3 Rule Based Classification Using RIPPER

3.1 The RIPPER Algorithm
RIPPER is a widely used rule induction algorithm. It scales linearly with the number of training records and is suited to building models with imbalanced class distributions. In addition, it uses a validation set to prevent model over-fitting.
RIPPER orders the classes according to their frequencies. If (y1, y2, ..., yc) are the class labels, with y1 the least frequent and yc the most frequent, then RIPPER first builds rules for y1, treating the records of the remaining classes as negative records.
Next, RIPPER extracts rules for y2. This process is repeated until only yc is left, which is labeled as the default class.
For rule growing, RIPPER uses a general-to-specific strategy: initially each rule is empty, and it is built by adding conjuncts to it one at a time. It uses FOIL's information gain to select the conjunct to add. Suppose we have a rule R : A → class that covers p0 positive records and n0 negative records. After adding a new conjunct B, the rule R' : A ∧ B → class covers p1 positive records and n1 negative records. Then FOIL's information gain is calculated as

$$\text{FOIL's information gain} = p_1 \times \left( \log\frac{p_1}{p_1+n_1} - \log\frac{p_0}{p_0+n_0} \right) \tag{1}$$

Conjuncts are added until the rule no longer covers negative examples. The rule is then pruned based on its performance on the validation set using the metric (p − n)/(p + n), where p is the number of positive records covered by the rule in the validation set and n is the number of negative records covered by the rule in the validation set. If the value of this metric increases after removing a conjunct, the conjunct is removed and the rule is thus pruned. Once a rule has been generated, all records covered by the rule are eliminated, and the algorithm continues with building a new rule. Rules are built as long as the rule set does not violate the Minimum Description Length (MDL) principle and the error on the validation set is less than 50%.
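As a concrete illustration, the helper below computes FOIL's information gain for a candidate conjunct and the pruning metric from the coverage counts defined above. It is a minimal sketch; the class and method names are our own, not part of the original implementation.

```java
public final class RuleMetrics {

  // FOIL's information gain for extending a rule: p0/n0 are the positive and
  // negative records covered before adding the conjunct, p1/n1 after adding it.
  // Natural log is used; the log base only rescales the gain and does not
  // change which candidate conjunct maximises it.
  public static double foilGain(long p0, long n0, long p1, long n1) {
    if (p1 == 0) {
      return 0.0; // extended rule covers no positive records, so no gain
    }
    double before = Math.log((double) p0 / (p0 + n0));
    double after = Math.log((double) p1 / (p1 + n1));
    return p1 * (after - before);
  }

  // Pruning metric (p - n) / (p + n) evaluated on the validation set; a conjunct
  // is removed if dropping it increases this value.
  public static double pruningMetric(long p, long n) {
    return (double) (p - n) / (p + n);
  }
}
```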
3.2 Why Is Parallelizing RIPPER Important?
RIPPER is an iterative algorithm, and in each iteration it has to scan the complete dataset. Thus, for datasets on the order of $10^6$–$10^8$ records, even a single iteration of the algorithm becomes prohibitively expensive. It is therefore important to parallelize the work done in an iteration and to develop an algorithm that distributes this work among the nodes of a cluster.
3.3 RIPPER Implementation on Hadoop
We implemented RIPPER in Java using the Hadoop Java libraries. The dataset was partitioned horizontally to fit the Hadoop MapReduce framework and ensure parallel execution of the code. Three sets of mapper-reducer functions were used, one each for rule building, rule pruning and calculating accuracy. Each mapper executes its code on a portion of the dataset, and the reducer aggregates the mapper outputs to produce one common result.
For rule building, the mapper-reducer functions compute the p1 and n1 values needed for FOIL's information gain (the p0 and n0 values are simply the p1 and n1 values of the previous rule). When adding a conjunct, every possible value of every attribute is considered as a candidate conjunct. FOIL's information gain is calculated for each candidate, and the candidate with the maximum information gain is added as a conjunct to the rule.
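A hedged sketch of what the rule-growing mapper and reducer could look like is shown below. Each map output key encodes a candidate conjunct and the value carries a (positive, negative) coverage count; the class names, the Rule/Record helpers and the use of the job Configuration to pass the current rule are our assumptions, not the authors' code.

```java
import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Mapper: for every record covered by the rule grown so far, emit a (1,0) or
// (0,1) count for each candidate conjunct (attribute=value pair) the record
// satisfies. Records already covered by previously accepted rules are assumed
// to have been removed from the input.
public class GrowRuleMapper extends Mapper<LongWritable, Text, Text, Text> {

  private Rule currentRule;   // hypothetical helper: parses a rule from a string
  private String targetClass;

  @Override
  protected void setup(Context context) {
    currentRule = Rule.parse(context.getConfiguration().get("ripper.current.rule"));
    targetClass = context.getConfiguration().get("ripper.target.class");
  }

  @Override
  protected void map(LongWritable key, Text line, Context context)
      throws IOException, InterruptedException {
    Record record = Record.parse(line.toString()); // hypothetical CSV parser
    if (!currentRule.covers(record)) {
      return; // only records covered by the rule so far contribute to p1/n1
    }
    String count = record.label().equals(targetClass) ? "1,0" : "0,1";
    for (String conjunct : record.candidateConjuncts()) { // "attr=value" strings
      context.write(new Text(conjunct), new Text(count));
    }
  }
}

// Reducer: sum the partial counts so each candidate conjunct ends up with its
// global p1 and n1; the driver then picks the conjunct with maximum FOIL gain.
class GrowRuleReducer extends Reducer<Text, Text, Text, Text> {
  @Override
  protected void reduce(Text conjunct, Iterable<Text> counts, Context context)
      throws IOException, InterruptedException {
    long p1 = 0, n1 = 0;
    for (Text c : counts) {
      String[] parts = c.toString().split(",");
      p1 += Long.parseLong(parts[0]);
      n1 += Long.parseLong(parts[1]);
    }
    context.write(conjunct, new Text(p1 + "," + n1));
  }
}
```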
Rule pruning is done using the validation set as reference. The mapper-reducer functions for pruning calculate the p and n values for the metric (p − n)/(p + n). Depending on the value of the metric, the rule is pruned and then added to the rule set.
After all rules have been built, the rule set is validated on the test records. The accuracy mapper-reducer functions calculate the number of positive and negative records covered by each rule and by the whole rule set. These values are then used to calculate the accuracy of each rule as well as the overall accuracy.
The algorithm is briefly described below, the pseudo code is given in Figure 2, and a sketch of the corresponding driver loop follows the stage descriptions:
1. Rule Growing Stage. The rule is initialized as an empty rule, i.e., it covers all records. Conjuncts are then added to the rule one by one. The conjunct to be added is selected by the value of FOIL's information gain measure. The parameters of the measure are calculated using a MapReduce function whose key-value pairs carry the values of p0, p1, n0 and n1. Conjuncts are added until the rule no longer covers negative records.
2. Rule Pruning Stage. The rule generated in Stage 1 is then pruned using the (p − n)/(p + n) metric. To calculate the parameters p and n of this metric, a MapReduce function is called whose key-value pairs carry the values of p and n. Stages 1 and 2 are repeated until adding a new rule would violate the Minimum Description Length (MDL) principle.
3. Model Evaluation Stage. After the rule set has been generated, the rules are used to classify the test records. A MapReduce function is called to classify the records and calculate the accuracy of the model. The key-value pairs in this function contain the counts of positive and negative records covered by the model.
The code returns the pruned rule set together with the accuracy of the individual rules and of the rule set on the test records.
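The driver sketched below illustrates how the rule-growing MapReduce phase could be chained from a single Java program; the job names, configuration keys and the helper for reading the conjunct counts back from HDFS are assumptions made for illustration, not the authors' actual driver. The pruning and accuracy phases would be submitted in the same way.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class RipperDriver {

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Path train = new Path(args[0]);

    String rule = "";          // the empty rule covers every record
    int iteration = 0;

    // Grow one rule: run the grow job repeatedly, adding the best conjunct each
    // time, until no candidate conjunct yields a positive FOIL gain.
    while (true) {
      conf.set("ripper.current.rule", rule);
      Path out = new Path("grow-output-" + iteration++);

      Job grow = Job.getInstance(conf, "ripper-grow");
      grow.setJarByClass(RipperDriver.class);
      grow.setMapperClass(GrowRuleMapper.class);
      grow.setReducerClass(GrowRuleReducer.class);
      grow.setOutputKeyClass(Text.class);
      grow.setOutputValueClass(Text.class);
      FileInputFormat.addInputPath(grow, train);
      FileOutputFormat.setOutputPath(grow, out);
      if (!grow.waitForCompletion(true)) {
        throw new IllegalStateException("grow job failed");
      }

      // Hypothetical helper: scan the reducer output on HDFS, compute FOIL's
      // gain for every candidate conjunct and return the best one (or null).
      String best = BestConjunct.from(conf, out, rule);
      if (best == null) {
        break;
      }
      rule = rule.isEmpty() ? best : rule + " AND " + best;
    }
  }
}
```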
4 Complexity Analysis
We now derive the time complexity of both the sequential and the parallel implementation of RIPPER. Let the total number of training records be N, the total number of attributes in a record be A, the average number of possible values per attribute be V, and the number of nodes in the Hadoop cluster be K. Since the data is partitioned among the nodes, each node holds N/K records.
4.1 Sequential Implementation
For adding each conjunct to the rule, we calculate FOIL's information gain for all values of all attributes. To calculate the gain we must iterate over all records in the dataset.
Algorithm 1: RIPPER(Dataset D)
Input: Labeled Dataset D
Output: Rule Set R
 1. NR = New Rule
 2. R = Rule Set
 3. FIG: FOIL's Information Gain
 4. Max FIG: Maximum FIG among all conjuncts
 5. A = Accuracy
 6. Max Rules: Maximum Possible Rules (MDL)
 7. P: Pruning Metric
 8. while loop number < Max Rules do
 9.     Initialize new Rule NR to empty
10.     while Max FIG != 0 do
11.         MapReduce 1 (Train): calculate FIG for all possible conjuncts
12.         compute Max FIG
13.         Add conjunct having Max FIG to NR
14.     end while
15.     while Old P < New P do
16.         MapReduce 2 (Pruning): calculate P
17.         if Old P < New P then
18.             Prune last conjunct in NR
19.         end if
20.     end while
21.     Add NR to R
22. end while
23. MapReduce 3 (Accuracy): calculate A
24. return R, A

Fig. 2. Algorithm 1
The total number of conjuncts that can be added overall is limited by the Minimum Description Length (MDL). Hence, the time complexity of the sequential implementation is O(A · V · N · MDL).
4.2 Parallel Implementation
The runtimes of the mapper and reducer functions are as follows:
1. Mapper: In each mapper we calculate FOIL's information gain for all possible values of all attributes. Each mapper runs over N/K records. Hence, the time taken for each mapper to execute is A · V · N/K.
2. Reducer: In each reduce task, we simply shuffle the key-value pairs to the appropriate nodes and aggregate the results. Each node has N/K records, so each node may send out at most N/K pairs to other nodes. Assuming a completely connected network, the time to shuffle the pairs is N/K. We generate one key-value pair per record, hence the time
taken for aggregating the key-value pairs is N/K. The total time taken by the reducer is O(N/K).
Since RIPPER uses the Minimum Description Length (MDL) principle as a stopping condition, the number of conjuncts added is bounded by a constant. Each time MapReduce is called, one conjunct is added to the rule, so the MapReduce function is called at most MDL times. Hence, the total time complexity of the algorithm is O(A · V · N · MDL/K + MDL · N/K). The execution time of the algorithm is thus proportional to 1/K: by increasing the number of nodes in the Hadoop cluster, the execution time decreases.
4.3 Speed Up
We define the speed up factor of the Hadoop cluster as the ratio of the time taken by the sequential algorithm to the time taken to execute the parallel algorithm on K nodes. We represent this factor as S@K. Using the time complexities calculated above, the speed up is

$$S@K = \frac{A \cdot V \cdot N \cdot MDL}{\frac{A \cdot V \cdot N \cdot MDL}{K} + \frac{MDL \cdot N}{K}} = \frac{A \cdot V}{A \cdot V + 1} \times K = CK, \tag{2}$$
where C is a constant. We now see that the speed up we achieve on a Hadoop cluster is linear in the number of nodes in the cluster (K).
4.4 Cost Optimality
The cost of a parallel algorithm is the number of processors used times the time taken to execute the parallel algorithm (Tp). A parallel algorithm is cost optimal if the cost of the parallel algorithm is equal to the time taken by the sequential algorithm (Ts).

$$\mathrm{Cost} = K \cdot T_p = O((A \cdot V + 1) \cdot N \cdot MDL) = O(A \cdot V \cdot N \cdot MDL) = T_s \quad [\text{assuming } A \cdot V \gg 1]$$

Since Cost = Ts, our parallel RIPPER algorithm is cost optimal.
5 Experimental Results

5.1 Experimental Environment
We set up a Hadoop cluster with four nodes to test the algorithm. Tables 1 and 2 show the configuration of the cluster.
Table 1. Configuration of each node

SNo.  Software/Package
1     Ubuntu 13.04
2     Hadoop 1.1.2
3     sun java6-jdk
4     100 Mbps Ethernet

Table 2. Configuration of cluster

Node    No of cores  RAM   Clock Speed
Master  2            4GB   2.1GHz
Slave1  2            8GB   2.1GHz
Slave2  2            4GB   2.2GHz
Slave3  2            4GB   1.8GHz

5.2 Datasets Used
To test the accuracy of our algorithm on Hadoop we used two datasets: one randomly generated dataset of 100 million records with 22 categorical attributes, each attribute having an average of 6 values; the other dataset was extracted from the SDSS (Sloan Digital Sky Survey) server [2]. We used only a subset (6) of the attributes from the SDSS dataset and considered records for two classes only ('STAR' and 'GALAXY'). The total number of records extracted amounted to about 2.5 million. Table 3 describes the datasets used.

Table 3. Description of datasets

Dataset             No of records  No of attributes
Randomly generated  100 million    22
SDSS                2.5 million    6
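For reference, the snippet below sketches how a random categorical dataset of this shape could be generated as a CSV file. The attribute count, value count and record count are taken from the description above, while the file name, value labels and class balance are arbitrary assumptions.

```java
import java.io.BufferedWriter;
import java.io.FileWriter;
import java.io.IOException;
import java.util.Random;

public class RandomDatasetGenerator {

  public static void main(String[] args) throws IOException {
    long numRecords = 100000000L;   // 100 million records, as described above
    int numAttributes = 22;         // categorical attributes per record
    int valuesPerAttribute = 6;     // average number of values per attribute
    Random rnd = new Random(42);

    BufferedWriter out = new BufferedWriter(new FileWriter("random-dataset.csv"));
    try {
      StringBuilder line = new StringBuilder();
      for (long i = 0; i < numRecords; i++) {
        line.setLength(0);
        for (int a = 0; a < numAttributes; a++) {
          // attribute values are drawn uniformly from {v0, ..., v5}
          line.append('v').append(rnd.nextInt(valuesPerAttribute)).append(',');
        }
        // arbitrary two-class label; the real class distribution is not specified
        line.append(rnd.nextBoolean() ? "classA" : "classB").append('\n');
        out.write(line.toString());
      }
    } finally {
      out.close();
    }
  }
}
```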
5.3 Speed Up
To evaluate the performance of our algorithm, we calculated the speed up (S@K) by varying the number of nodes for both datasets. The results are shown in Figures 3 and 4. For the randomized dataset we achieve a speed up of almost 3.7 on 4 nodes. One can see that the speed up of the algorithm increases almost linearly with the number of nodes, as predicted by the complexity analysis. This shows that our parallel implementation of RIPPER is very efficient and scalable.
Fig. 3. Change in Speed Up Factor by varying number of nodes in the Cluster for randomly generated dataset
Fig. 4. Change in Speed Up Factor by varying number of nodes in the Cluster for SDSS dataset
Also, the final classifier built is a globally optimal solution, and the model is independent of the number of nodes and of the distribution of the data.
6 Conclusion and Future Work
We studied the Hadoop framework for reducing the computational cost of processing exponentially growing scientific data using a rule-based classifier. The results show that the parallel RIPPER algorithm is more efficient than the standard implementation of the algorithm. Experimental results show that using the MapReduce framework on multiple nodes reduces the computation time. In the future, we will implement our algorithm on GPUs (General-Purpose computation on Graphics Processing Units) and compare the results with the CPU implementation.
References
1. Apache Hadoop, http://hadoop.apache.org/
2. Sloan Digital Sky Survey Data Release 10, http://skyserver.sdss3.org/dr10/en/home.aspx
3. Basu, S., Kumaravel, A.: Classification by rules mining model with map-reduce framework in cloud. International Journal of Advanced and Innovative Research 2, 403–409 (2013)
4. Borthakur, D.: The Hadoop distributed file system: Architecture and design. Hadoop Project Website (2007)
5. Cohen, W.W.: Fast effective rule induction. In: Proceedings of the 12th International Conference on Machine Learning (ICML 1995), pp. 115–123 (1995)
6. Dean, J., Ghemawat, S.: MapReduce: Simplified data processing on large clusters. Commun. ACM 51, 107–113 (2008)
7. Dean, J., Ghemawat, S.: MapReduce: A flexible data processing tool. Commun. ACM 53(1), 72–77 (2010)
8. Ishibuchi, H., Yamane, M., Nojima, Y.: Ensemble fuzzy rule-based classifier design by parallel distributed fuzzy GBML algorithms. In: Bui, L.T., Ong, Y.S., Hoai, N.X., Ishibuchi, H., Suganthan, P.N. (eds.) SEAL 2012. LNCS, vol. 7673, pp. 93–103. Springer, Heidelberg (2012)
9. Mackey, G., Sehrish, S., Bent, J., Lopez, J., Habib, S., Wang, J.: Introducing MapReduce to high end computing. In: 3rd Petascale Data Storage Workshop (PDSW 2008), pp. 1–6 (2008)
10. Nguyen, T.-C., Shen, W.-F., Chai, Y.-H., Xu, W.-M.: Research and implementation of scalable parallel computing based on map-reduce. Journal of Shanghai University (English Edition) 15(5), 426–429 (2011)
11. Qin, B., Xia, Y., Prabhakar, S., Tu, Y.-C.: A rule-based classification algorithm for uncertain data. In: Ioannidis, Y.E., Lee, D.L., Ng, R.T. (eds.) ICDE, pp. 1633–1640. IEEE (2009)
12. Zhou, L., Wang, H., Wang, W.: Parallel implementation of classification algorithms based on cloud computing environment. Indonesian Journal of Electrical Engineering 10(5), 1087–1092 (2012)