An Optimized Formulation of Decision Tree Classifier

Fahim Irfan Alam1, Fateha Khanam Bappee2, Md. Reza Rabbani1, and Md. Mohaiminul Islam1

1 Department of Computer Science & Engineering, University of Chittagong, Chittagong, Bangladesh
  {fahim1678,md.reza.rabbani,mohaiminul2810}@gmail.com
2 Department of Mathematics, Statistics and Computer Science, St. Francis Xavier University, Antigonish, Canada
  [email protected]

Abstract. Analyzing, predicting and discovering previously unknown knowledge from a large data set requires an effective input dataset, valid pattern-spotting ability and sound evaluation of the discovered patterns. The criteria of significance, novelty and usefulness need to be fulfilled in order to evaluate the performance of the prediction and classification of data. Data mining, an important step in this process of knowledge discovery, extracts hidden and non-trivial information from raw data through useful methods such as decision tree classification. However, due to the enormous size, high dimensionality and heterogeneous nature of the data sets, traditional decision tree classification algorithms sometimes do not perform well in terms of computation time. This paper proposes a framework that uses a parallel strategy to optimize the performance of decision tree induction and cross-validation in order to classify data. Moreover, an existing pruning method is incorporated into our framework to overcome the overfitting problem and enhance generalization ability while reducing cost and structural complexity. Experiments on ten benchmark data sets suggest significant improvement in computation time and better classification accuracy from optimizing the classification framework.

Keywords: Data Mining, Decision Tree, Knowledge Discovery, Classification, Parallel Strategy.

1 Introduction

Recent times have seen significant growth in the availability of computers, sensors and information distribution channels, which has resulted in an increasing flood of data. However, these huge volumes of data are of little use unless they are properly analyzed and exploited and useful information is extracted. Prediction is a fascinating capability that can readily be adopted in data-driven knowledge extraction tasks, because if we cannot predict a useful pattern from an
enormous data set, there is little point in gathering these massive sets of data, since patterns are only valuable when they are recognized and acted upon in advance. Data mining [23], a relatively recently developed methodology and technology, aims to identify valid, novel, non-trivial, potentially useful and understandable correlations and patterns in data [6]. It does this by analyzing datasets in order to extract patterns that are too subtle or complex for humans to detect [11]. There are a number of applications involving prediction, e.g. predicting diseases [10], stock exchange crashes [9] and drug discovery [17], where we see an unprecedented opportunity to develop automated data mining techniques for extracting concealed knowledge. Extraction of hidden, non-trivial data from a large data set leads us to obtain useful information and knowledge to classify the data according to the pattern-spotting criteria [5]. The study of data mining builds upon ideas and methods from fields such as statistics and machine learning, database systems, and data visualization.

Decision tree classification [4][19] is one of the most widely used techniques for inductive inference in data mining applications. A decision tree, which is a predictive model, is a set of conditions organized in a hierarchical structure [19]. In this structure, an instance of data is classified by following the path of satisfied conditions from the root node of the tree until reaching a leaf, which corresponds to a class value. Some of the most well-known decision tree algorithms are C4.5 [19] and CART [4]. The research in decision tree algorithms is greatly influenced by the necessity of developing algorithms for data sets arising in various business, information retrieval, medical and financial applications. For example, business organizations use decision trees to analyze the buying patterns and needs of customers, medical experts use them to discover interesting patterns in data in order to facilitate the cure process, and the credit card industry uses them for fraud detection.

However, due to the enormous size, high dimensionality and heterogeneous nature of the data sets, traditional decision tree classification algorithms sometimes fail to perform well in applications that require computationally intensive tasks and fast computation of classification rules. Computation and analysis of massive data sets with decision trees is becoming difficult, and in some cases practically impossible. For example, in the medical domain, disease prediction tasks require learning the properties of at least thousands of cases for safe and accurate prediction and classification, and it is hardly possible for a human analyst to analyze and discover useful information from these data sets. An optimized formulation of decision trees holds great promise for developing new tools that can automatically analyze such massive data sets. Because large data sets and their high dimensionality make data mining applications computationally very demanding, high-performance parallel computing is fast becoming an essential part of the solution. We can also improve the performance of the discovered knowledge and classification rules by utilizing available computing resources which mostly
remain unused in an environment. This has motivated us to develop a parallel strategy for existing decision tree classification algorithms in order to save computation time and make better use of resources. Since decision tree induction exhibits natural concurrency, a parallel formulation is a suitable option for optimized performance. However, designing such a parallel strategy is challenging and requires several issues to be addressed. Computation and communication costs are two issues that most parallel algorithms must consider. Multiple processors work together to optimize performance, but the internal exchange of information (if any) between them increases the communication cost, which in turn affects the overall optimization.

SLIQ [16] is a fast, scalable decision tree algorithm which achieves good classification accuracy with small execution time. However, the performance of SLIQ is limited by its use of a centralized, memory-resident data structure, the class list, which puts a limit on the size of the datasets SLIQ can handle. SPRINT [21] removes this memory restriction of SLIQ and admits a parallel implementation by splitting the attribute lists evenly among processors and finding the split point for a node of the decision tree in parallel. However, in order to do that, the entire hash table is required on all processors. To construct the hash table, an all-to-all broadcast [12] is performed, which makes the algorithm unscalable with respect to runtime and memory requirements, because each processor requires O(N) memory to store the hash table and O(N) communication cost for the all-to-all broadcast, where N is the size of the dataset. ScalParC [8] is an improved version of SPRINT in the sense that it uses a distributed hash table to efficiently implement the splitting phase of SPRINT. Here the overall communication overhead of the phase does not exceed O(N), and the memory does not exceed O(N/p) per processor, which makes the algorithm scalable in terms of both runtime and memory requirements. Another optimized formulation of decision trees is the concatenated parallelism strategy for divide-and-conquer problems [7]. In this method, a combination of data parallelism and task parallelism is used to parallelize divide-and-conquer algorithms. In classification decision trees, however, the workload cannot be determined from the size of the data at a particular node of the tree, so the one-time load balancing used in this method is not suitable.

In this paper, we propose a strategy that keeps both computation and communication costs minimal. Moreover, our algorithm particularly considers the issue of load balancing, so that every processor handles a roughly equal portion of the task and no resources in the cluster are left underutilized. Another feature of our algorithm is that it works with both discrete and continuous attributes.

Section 2 discusses the sequential decision tree algorithm that we aim to optimize in this paper. The parallel formulation of the algorithm is explained in Section 3. Experimental results are shown in Section 4. Finally, concluding remarks are given in Section 5.

2 Sequential Decision Tree Classification Algorithm

Most of the existing induction-based decision tree classification algorithms, e.g. C4.5 [19], ID3 [18] and CART [4], use Hunt's method [19] as the basic algorithm. These algorithms mostly fail to scale to applications that require analysis of large data sets in a short time. The recursive description of Hunt's method for constructing a decision tree is given in Algorithm 1.

Algorithm 1. Hunt's Method [19]
Inputs: Training set T of n examples {T1, T2, ..., Tn} with classes {c1, c2, ..., ck} and attributes {A1, A2, ..., Am} that have one or more mutually exclusive outcomes {O1, O2, ..., Op}
Output: A decision tree D with nodes N1, N2, ...

Case 1: if {T1, T2, ..., Tn} belong to a single class cj then
    D ← leaf identifying class cj
Case 2: if {T1, T2, ..., Tn} belong to a mixture of classes then
    Split T into attribute-class tables Si
    for each i = 1 to m do
        for each j = 1 to p do
            Separate Si for each value of Ai
            Compute degree of impurity using either entropy, gini index or classification error
        Compute information gain for each Ai
    D ← node Ni with largest information gain
Case 3: if T is an empty set then
    D ← Ni labeled by the default class ck
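To make the three cases concrete, the following is a minimal Python sketch of Hunt's method, using entropy as the impurity measure and information gain for attribute selection (gini index or classification error could be substituted). The function and variable names are our own illustration rather than the paper's code.

from collections import Counter
from math import log2

def entropy(rows, class_attr):
    # Degree of impurity of a non-empty set of examples.
    counts = Counter(row[class_attr] for row in rows)
    total = len(rows)
    return -sum((c / total) * log2(c / total) for c in counts.values())

def information_gain(rows, attr, class_attr):
    # Entropy reduction obtained by splitting `rows` on attribute `attr`.
    total = len(rows)
    remainder = 0.0
    for value in {row[attr] for row in rows}:
        subset = [row for row in rows if row[attr] == value]
        remainder += (len(subset) / total) * entropy(subset, class_attr)
    return entropy(rows, class_attr) - remainder

def hunts_method(rows, attrs, class_attr, default_class):
    if not rows:                                   # Case 3: empty set -> default class leaf
        return default_class
    classes = {row[class_attr] for row in rows}
    if len(classes) == 1:                          # Case 1: single class -> leaf
        return classes.pop()
    majority = Counter(row[class_attr] for row in rows).most_common(1)[0][0]
    if not attrs:                                  # no attribute left to split on
        return majority
    # Case 2: mixture of classes -> split on the attribute with largest information gain
    best = max(attrs, key=lambda a: information_gain(rows, a, class_attr))
    tree = {best: {}}
    for value in {row[best] for row in rows}:
        subset = [row for row in rows if row[best] == value]
        tree[best][value] = hunts_method(subset, [a for a in attrs if a != best],
                                         class_attr, majority)
    return tree

Applied to the artificial training set of Table 1 (transcribed as TRAINING_SET in Section 3.1 below), hunts_method(TRAINING_SET, ["gender", "car", "cost", "income"], "class", "bus") returns a nested dictionary whose root splits on the travel-cost attribute.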

3 Optimized Algorithm

A decision tree is an important method for classification problems. A data set called the training set is given as input; it consists of a number of examples, each having a number of attributes. Attributes are either continuous, when the attribute values are ordered, or categorical, when the attribute values are unordered. One of the categorical attributes is called the class value or classifying attribute. The objective of inducing the decision tree is to use the training set to build a model of the class value based on the other attributes, such that the model can be used to classify new data not drawn from the training set.

In Section 3.1, we give a parallel formulation of classification decision tree construction using a partition and integration strategy, followed by measuring the predictive accuracy of the tree using a cross-validation approach [3]. We focus our presentation on discrete attributes only; the handling of continuous attributes is discussed in Section 3.2. Later, pruning is performed in order to optimize the size of the original decision tree and reduce its structural complexity, as explained in Section 3.3. Using an artificially created training set, we describe our parallel algorithm in the following steps.

3.1 Partition and Integration Strategy

Let us consider an artificial training set Tr with n examples, each having m attributes as shown in Table 1.

Table 1. Artificial Training Set

Gender  Car Ownership  Travel Cost  Income Level  Class
male    zero           cheap        low           bus
male    one            cheap        medium        bus
female  zero           cheap        low           bus
male    one            cheap        medium        bus
female  one            expensive    high          car
male    two            expensive    medium        car
female  two            expensive    high          car
female  one            cheap        medium        car
male    zero           standard     medium        train
female  one            standard     medium        train
female  one            expensive    medium        bus
male    two            expensive    medium        car
male    one            standard     high          car
female  two            standard     low           bus
male    one            standard     low           train
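For the illustrative code sketches used later in this section, the training set of Table 1 can be transcribed directly into Python; the field names are our own shorthand for the column headings.

# Plain transcription of Table 1; "class" holds the transportation mode to predict.
TRAINING_SET = [
    {"gender": "male",   "car": "zero", "cost": "cheap",     "income": "low",    "class": "bus"},
    {"gender": "male",   "car": "one",  "cost": "cheap",     "income": "medium", "class": "bus"},
    {"gender": "female", "car": "zero", "cost": "cheap",     "income": "low",    "class": "bus"},
    {"gender": "male",   "car": "one",  "cost": "cheap",     "income": "medium", "class": "bus"},
    {"gender": "female", "car": "one",  "cost": "expensive", "income": "high",   "class": "car"},
    {"gender": "male",   "car": "two",  "cost": "expensive", "income": "medium", "class": "car"},
    {"gender": "female", "car": "two",  "cost": "expensive", "income": "high",   "class": "car"},
    {"gender": "female", "car": "one",  "cost": "cheap",     "income": "medium", "class": "car"},
    {"gender": "male",   "car": "zero", "cost": "standard",  "income": "medium", "class": "train"},
    {"gender": "female", "car": "one",  "cost": "standard",  "income": "medium", "class": "train"},
    {"gender": "female", "car": "one",  "cost": "expensive", "income": "medium", "class": "bus"},
    {"gender": "male",   "car": "two",  "cost": "expensive", "income": "medium", "class": "car"},
    {"gender": "male",   "car": "one",  "cost": "standard",  "income": "high",   "class": "car"},
    {"gender": "female", "car": "two",  "cost": "standard",  "income": "low",    "class": "bus"},
    {"gender": "male",   "car": "one",  "cost": "standard",  "income": "low",    "class": "train"},
]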

1. A root processor M calculates the degree of impurity for Tr using either entropy, gini index or classification error. It then divides Tr into a set of subtables Tsub = {Tr1, Tr2, ..., Trm}, one for each of the m attributes, according to the attribute-class combination, and sends the subtables to the set of child processors C = {Cp1, Cp2, ...} according to the cases below.
Case 1: If |C| < |Tsub|, M assigns the subtables to C in such a way that the number of subtables handled by each child processor Cpi, i = 1, 2, ..., |C|, is roughly equal.
Case 2: If |C| ≥ |Tsub|, M assigns the subtables to C in such a way that each Cpi handles exactly one subtable.

2. Each Cpi simultaneously calculates the information gain of its respective attributes and returns the calculated information gain to M.
3. M compares the information gain received from each Cpi and selects as the root node the attribute that produces the maximum information gain. Our decision tree now consists of a single root node, as shown in Fig. 1 for the training data in Table 1, and will now be expanded.

Fig. 1. Root node of the decision tree

4. After obtaining the root node, M again splits Tr into subtables according to the values of the optimum attribute found in Step 3. It then sends the subtables in which impurities are found to the child processors, following the cases below; any pure class is assigned to a leaf node of the decision tree.
Case 1: If |C| < |Tsub|, Case 1 of Step 1 applies and each Cpi iterates in the same way as in the steps above.
Case 2: If |C| ≥ |Tsub|, Case 2 of Step 1 applies and each Cpi partitions its subtable into a set of sub-subtables Tsubsub = {Tsub1, Tsub2, ..., Tsubm} according to the attribute-class combination. The active child processors balance the load with the idle processors in such a way that the number of sub-subtables handled by each child processor is roughly equal. After computing the information gain, the child processors synchronize to find the optimum attribute and send it to M.
5. Upon receiving the optimum attributes, M adds them as child nodes of the decision tree. For the training data in Table 1, the current form of the decision tree is shown in Fig. 2.

Fig. 2. Decision Tree after First Iteration

6. This process continues until no more nodes are available for expansion. The final decision tree for the training data is given in Fig. 3. The entire process of our optimization is depicted in Algorithm 2.
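The subtable-assignment rule of Step 1 (reused in Step 4) amounts to dealing the attribute-class subtables out to the child processors in round-robin fashion; a minimal sketch, with illustrative names only:

def assign_subtables(subtables, num_children):
    # Case 1 (|C| < |Tsub|): each child receives a roughly equal share.
    # Case 2 (|C| >= |Tsub|): each child receives at most one subtable.
    assignment = {child: [] for child in range(num_children)}
    for j, subtable in enumerate(subtables):
        assignment[j % num_children].append(subtable)
    return assignment

# Example: 4 attribute subtables and 3 child processors -> loads of sizes 2, 1, 1.
print(assign_subtables(["gender", "car", "cost", "income"], 3))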

Fig. 3. Final Decision Tree

Next, we focus on determining the predictive accuracy of the generated hypothesis on our dataset. The hypothesis will produce the highest accuracy on the training data but may not work well on unseen, new data. This overfitting problem restricts the determination of the predictive accuracy of a model. To address it, we carry out cross-validation [3], a generally applicable and very useful technique for tasks such as accuracy estimation. It consists of partitioning a data set Tr into n subsets Tri and then running the tree generation algorithm n times, each time using a different training set Tr − Tri and validating the results on Tri. An obvious disadvantage of cross-validation is that it is computation intensive; an n-fold cross-validation requires running the algorithm n times. To reduce this overhead, a parallel strategy is again used to generate the n trees of an n-fold cross-validation, as explained below:

1. A root processor divides the original dataset into n folds, of which n − 1 folds are used as the training set and the remaining fold as the validation (test) set. The root continues dividing the dataset until every example in the dataset has been used as a validation example exactly once.
2. The root processor sends the divided datasets (each consisting of both training and test parts) to the child processors in such a way that the assignment of datasets to the child processors is roughly equal.
3. The respective child processors act as roots and form n temporary decision trees by following Algorithm 2.
4. Next, the child processors calculate the predictive accuracy of the temporary decision trees by running the validation sets and send the results to the root processor.
5. Finally, the root processor averages the results to determine the actual predictive accuracy of the original decision tree.
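A minimal Python sketch of this parallel n-fold cross-validation, using a process pool in place of the paper's MPI child processors; build_tree and accuracy are placeholders standing in for Algorithm 2 and the validation step (they must be module-level functions so the pool can pickle them), and the names are ours.

from multiprocessing import Pool

def split_folds(dataset, n):
    # Step 1: partition the dataset into n folds.
    return [dataset[i::n] for i in range(n)]

def evaluate_fold(args):
    # Steps 3-4, run on a child processor: train on n-1 folds, validate on the held-out fold.
    folds, held_out, build_tree, accuracy = args
    train = [row for i, fold in enumerate(folds) if i != held_out for row in fold]
    return accuracy(build_tree(train), folds[held_out])

def parallel_cross_validation(dataset, n, build_tree, accuracy, workers=4):
    folds = split_folds(dataset, n)
    tasks = [(folds, i, build_tree, accuracy) for i in range(n)]
    with Pool(workers) as pool:                   # child processors
        results = pool.map(evaluate_fold, tasks)  # temporary trees built in parallel
    return sum(results) / n                       # Step 5: root averages the accuracies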

Algorithm 2. Partition & Integration Algorithm
Inputs: Training set Tr of n examples {Tr1, Tr2, ..., Trn} with classes {c1, c2, ..., ck} and attributes {A1, A2, ..., Am} that have one or more mutually exclusive values {V1, V2, ..., Vp}, root processor M and child processors C = {Cp1, Cp2, ...}
Output: A decision tree D with nodes N1, N2, ...

Processor M:
    Compute degree of impurity for Tr
    Divide Tr into Tsub = {Tr1, Tr2, ..., Trm} for each m
    Send Tsub to C
    // Subtable assignment process to C
    if |C| < |Tsub| then
        j = 1
        while (j ≤ |Tsub|) do
            for each i = 1 to |C| do
                Cpi ← Trj
                j++
                if i == |C| then
                    i = 1
    if |C| ≥ |Tsub| then
        j = 1
        while (j ≤ |Tsub|) do
            for each i = 1 to |C| do
                Cpi ← Trj
                j++
Child processors Cpi, i = 1, ..., |C|:
    Compute information gain for each Trj
    Send computed information gain to M
Processor M:
    Find optimum attribute Aopt
    NROOT ← Aopt
    D ← NROOT
    Divide Tr into Tsub according to Aopt
    Send Tsub to C
    Repeat actions of the subtable assignment process to C
Child processors Cpi:
    Partition Trj into Tsubsub = {Tsub1, Tsub2, ..., Tsubm} for each m
    Repeat actions of the first iteration
    Compute information gain and send Aopt's to M for each Vx of Ay; x = 1, 2, ..., p, y = 1, 2, ..., m
Processor M:
    if Entropy == 0 then
        D ← C for Vx
    else
        N ← Aopt
        D ← N
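The heart of Algorithm 2, where the root partitions Tr into attribute-class subtables, the children compute information gain simultaneously, and the root integrates the results to choose the split attribute, can be sketched as follows. A process pool again stands in for the MPI child processors, information_gain is the routine sketched in Section 2, and this is an illustration of the strategy rather than the authors' implementation.

from functools import partial
from multiprocessing import Pool

def gain_for_attribute(attr, rows, class_attr):
    # Work done by one child processor on its attribute-class subtable.
    subtable = [{attr: row[attr], class_attr: row[class_attr]} for row in rows]
    return attr, information_gain(subtable, attr, class_attr)

def choose_split_attribute(rows, attrs, class_attr, workers=4):
    # Root processor M: distribute subtables, integrate gains, pick the optimum attribute.
    worker = partial(gain_for_attribute, rows=rows, class_attr=class_attr)
    with Pool(workers) as pool:             # child processors C
        gains = pool.map(worker, attrs)     # information gains computed simultaneously
    best_attr, _ = max(gains, key=lambda pair: pair[1])
    return best_attr                        # becomes the next node of the tree

For the Table 1 data, choose_split_attribute(TRAINING_SET, ["gender", "car", "cost", "income"], "class") selects the travel-cost attribute as the root split.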

3.2 Computing Continuous Attributes

Handling continuous attributes in decision trees differs from handling discrete attributes. Continuous attributes require special attention and cannot be fitted into the learning scheme if we treat them exactly like discrete attributes. However, continuous attributes are very common in practical tasks and cannot be ignored. One way of handling them is discretization, the process of converting or partitioning continuous attribute values into intervals [2]. One possible way of finding an interval is to select a splitting criterion that divides the set of continuous values into two sets [20]. This raises the issue of unsorted values, which makes it difficult to find the splitting point; for large data sets, sorting the values and then selecting the interval takes significant time. We therefore perform the sorting and the selection of the splitting point in parallel, avoiding the additional time needed for discretizing a large set of continuous values. The root processor divides the continuous attribute set into N/P groups of cases, where N is the number of training cases for that attribute and P is the number of processors, and sends the groups to the child processors, so that each processor holds N/P training cases, which it then sorts. After the individual sorting is done by each child processor, another sorting phase is carried out among the child processors using merge sort [13]. The final sorted values are sent to the root and the splitting criterion is decided accordingly. The overall process is depicted in Algorithm 3.

Algorithm 3. Discretize Continuous Attributes
Inputs: Training set Tr of n examples, continuous attributes {A1, A2, ..., Am} from the training set, root processor M and child processors C = {Cp1, Cp2, ...}
Output: Discretized continuous attribute values

Processor M:
    Send n/|C| cases of Ai, i = 1, ..., m, to each C
Child processors Cpi:
    Sort values of Ai using [13]
    Perform merge sort over the sorted individual groups from each Cpi
    Send the final sorted values to M, where a = lowest value and b = highest value
Processor M:
    Compute split point = (a + b)/2
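A minimal Python rendering of Algorithm 3, with a process pool standing in for the child processors and heapq.merge in place of the textbook merge sort [13]; names are illustrative only.

from heapq import merge
from multiprocessing import Pool

def discretize_split_point(values, processors=4):
    # Roughly N/P cases per child processor.
    chunk = max(1, len(values) // processors)
    chunks = [values[i:i + chunk] for i in range(0, len(values), chunk)]
    with Pool(processors) as pool:            # each child sorts its own cases
        sorted_runs = pool.map(sorted, chunks)
    ordered = list(merge(*sorted_runs))       # merge phase across the children
    a, b = ordered[0], ordered[-1]            # lowest and highest values
    return (a + b) / 2                        # split point computed at the root M

# Example: discretize_split_point([7.1, 3.4, 9.8, 0.5, 6.2]) returns (0.5 + 9.8) / 2 = 5.15.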

3.3 Pruning Decision Tree

Pruning is usually carried out to avoid overfitting the training data; it eliminates those parts of the tree that are not descriptive enough to predict future data. A pruned decision tree has a simpler structure and good generalization ability, which may come at the expense of classification accuracy. One pruning method, Reduced Error Pruning (REP), is simple in structure and quickly produces a reduced
tree, but classification accuracy suffers. For optimization purposes we want a simpler structure, but achieving good classification accuracy is also one of our main concerns. To balance this trade-off, we use a pruning method proposed in [24]. This method, Cost and Structural Complexity (CSC), takes into account both classification ability and structural complexity, where structural complexity is evaluated in terms of the number of nodes of all kinds, the depth of the tree, the number of conditional attributes and the possible class types. The cost and structural complexity of a subtree of the decision tree T (to be pruned) is defined as

CSC(Subtree(T)) = 1 − r(v) + Sc(v)

where r(v) is the explicit degree of the conditional attribute v and Sc(v) is the structural complexity of v. The explicit degree of a conditional attribute is measured in order to evaluate explicitness before and after pruning; maintaining explicitness as far as possible is desirable for achieving high classification accuracy. A useful feature of this pruning method is that it is a post-pruning procedure, which avoids the 'horizon effect' that causes inconsistency in pre-pruning [24]. The pruning addresses over-learning of the details of the data, which takes place when a subtree has many leaf nodes relative to the number of classes, and it also considers the number of conditional attributes of a subtree relative to the set of all possible conditional attributes. The final pruned tree is simple in structure, which is a very desirable property of an optimization algorithm. This pruned tree is then treated as the induced decision tree, whose classification accuracy we again estimate using cross-validation as described in Section 3.1. It should be noted that the time spent on pruning a large dataset is a small fraction, less than 1%, of the initial tree generation time [22]; therefore the pruning process does not add much overhead to the computational complexity. Experimental results on classification accuracy for both pruned and non-pruned decision trees are given in the following section.
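The role of the CSC score in the pruning loop can be sketched as a bottom-up traversal that collapses a subtree into a majority-class leaf when doing so does not worsen its CSC value. The exact definitions of r(v) and Sc(v), and the precise pruning criterion, are those of [24]; the explicit_degree, structural_complexity and majority_class callbacks below are placeholders for them, so this sketch only illustrates the control flow, not the method itself.

def csc(subtree, explicit_degree, structural_complexity):
    # CSC(Subtree(T)) = 1 - r(v) + Sc(v); the two callbacks are placeholders for [24].
    return 1 - explicit_degree(subtree) + structural_complexity(subtree)

def csc_prune(tree, explicit_degree, structural_complexity, majority_class):
    # Post-pruning: recurse to the leaves first, then consider collapsing each subtree.
    if not isinstance(tree, dict):          # already a leaf
        return tree
    attr, branches = next(iter(tree.items()))
    pruned = {attr: {value: csc_prune(sub, explicit_degree,
                                      structural_complexity, majority_class)
                     for value, sub in branches.items()}}
    leaf = majority_class(pruned)           # candidate replacement leaf
    if csc(leaf, explicit_degree, structural_complexity) <= \
            csc(pruned, explicit_degree, structural_complexity):
        return leaf
    return pruned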

Next, we perform a theoretical comparison between our proposed framework and other existing parallel strategies for decision tree algorithms. The strategies described in [14], namely dynamic data fragmentation and static (horizontal and vertical) data fragmentation, are compared in terms of load balancing and communication cost. Our framework pays particular attention to uniform load balancing among the child processors, as explained in Section 3.1, and also reduces communication cost. The comparison is given in Table 2.

Table 2. Comparison of Parallel Strategies

Strategies                Communication Cost  Load Balancing
Dynamic Data Fragment     High                Low
Vertical Data Fragment    High                Low
Horizontal Data Fragment  Low                 Medium
Our Framework             Low                 Excellent

4 Experimental Results

We used MPI (Message Passing Interface) [15] as the communications protocol for the implementation, allowing processes to communicate with one another by sending and receiving messages. The reason for choosing MPI is its support for both point-to-point and collective communication, features that are very significant in high-performance computing today and that make MPI a dominant and powerful model for computationally intensive work. We used the mpicc compiler for compilation. For testing purposes, we implemented and evaluated our proposed formulation on classification benchmark datasets from the UCI machine learning repository [1]. We tested ten datasets from this benchmark collection; Table 3 summarizes their important parameters.

Table 3. Benchmark Datasets

Dataset     # Examples  # Classes  # Conditional Attributes
Adult       32561       2          14
Australian  460         2          14
Breast      466         2          10
Cleve       202         2          13
Diabetes    512         2          8
Heart       180         2          13
Pima        512         2          8
Satimage    4485        6          36
Vehicle     564         4          18
Wine        118         3          13
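The sketches in Section 3 use a process pool for brevity; the actual implementation is written against MPI and compiled with mpicc. For readers who prefer Python, the same root/child scatter-gather pattern can be expressed with mpi4py roughly as follows (our substitution for illustration, not the paper's code; run with, e.g., mpiexec -n 4 python example.py).

from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()

if rank == 0:
    # Root processor M: deal the attribute subtables out, one bucket per process.
    subtables = ["gender", "car", "cost", "income"]
    buckets = [subtables[i::size] for i in range(size)]
else:
    buckets = None

mine = comm.scatter(buckets, root=0)            # distribution to the children
my_gains = [(attr, 0.0) for attr in mine]       # placeholder: compute information gain here
all_gains = comm.gather(my_gains, root=0)       # collective integration at the root

if rank == 0:
    print("gains gathered from", size, "processes:", all_gains)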

The comparison with the serial implementation in terms of computation time is given in Table 4. The effect of pruning with REP and CSC in our framework is shown in terms of the reduced tree size, which can be compared in Table 5. Classification accuracy is one of the main measures of a decision tree algorithm's performance. In the proposed framework, prediction was carried out using the cross-validation approach, and the average accuracy over all ten datasets was calculated both without and with pruning.

Table 4. Comparison of Execution Times

Dataset     Serial  Parallel
Adult       115.2   17.6
Australian  1.09    0.0027
Breast      1.08    0.0032
Cleve       0.55    0.0015
Diabetes    1.09    0.004
Heart       0.31    0.0008
Pima        1.01    0.0022
Satimage    7.56    1.42
Vehicle     1.02    0.0018
Wine        0.31    0.00003

Table 5. Tree Size before (Hunt's Method) and after pruning (REP and CSC)

Dataset     Hunt's Method  REP   CSC
Adult       7743           1855  246
Australian  121            50    30
Breast      33             29    15
Cleve       56             42    29
Diabetes    51             47    27
Heart       41             41    22
Pima        55             53    32
Satimage    507            466   370
Vehicle     132            128   112
Wine        9              9     9

Table 6. Classification Accuracy on datasets (in %)

Dataset     Without Pruning  REP   CSC
Adult       84               85    87.2
Australian  85.5             87    87
Breast      94.4             93.1  95
Cleve       75.4             77    77.2
Diabetes    70.2             69.9  70.8
Heart       82.7             82.7  84.1
Pima        73.1             73.2  74.9
Satimage    85               84.2  86.1
Vehicle     68.4             67.7  69.2
Wine        86               86    86

Table 6 shows the effect of pruning on classification accuracy for the test sets, which we believe is a standard way of measuring the predictive accuracy of our hypothesis. From the table we observe that the inclusion of pruning using the method in [24] achieves better accuracy.

5 Conclusion

In this paper, we proposed an optimized formulation of decision trees in terms of a parallel strategy for an inductive classification learning algorithm. We designed an algorithm in a partition and integration manner that reduces the workload of a single processor by distributing it among a number of child processors. Instead of performing the computation for the entire table, each child processor computes a particular portion of the training set, and upon receiving the results from the respective processors, the root processor forms the decision tree. An existing pruning method that draws a balance between structural complexity and classification accuracy is incorporated into our framework, producing a simply structured tree that generalizes well to new, unseen data. Our experimental results on benchmark datasets indicate that the combination of the parallel algorithm and pruning optimizes the performance of the classifier, as indicated by the classification accuracy. In future work we will concentrate on several issues regarding the performance of the decision tree. First, we will experiment with our algorithm on image datasets in order to facilitate different computer vision applications. We will also focus on selecting multiple splitting criteria for the discretization of continuous attributes.

References

1. UCI Machine Learning Repository, http://archive.ics.uci.edu/ml/datasets.html/
2. An, A., Cercone, N.J.: Discretization of Continuous Attributes for Learning Classification Rules. In: Zhong, N., Zhou, L. (eds.) PAKDD 1999. LNCS (LNAI), vol. 1574, pp. 509–514. Springer, Heidelberg (1999)
3. Blockeel, H., Struyf, J.: Efficient algorithms for decision tree cross-validation. In: Proceedings of ICML 2001 - Eighteenth International Conference on Machine Learning, pp. 11–18. Morgan Kaufmann (2001)
4. Breiman, L., Friedman, J.H., Olshen, R.A., Stone, C.J.: Classification and Regression Trees. Statistics/Probability Series. Wadsworth Publishing Company, Belmont (1984)
5. Chen, M.-S., Han, J., Yu, P.S.: Data mining: An overview from a database perspective. IEEE Transactions on Knowledge and Data Engineering 8, 866–883 (1996)
6. Chung, H., Gray, P.: Data mining. Journal of Management Information Systems 16(1), 11–16 (1999)
7. Goil, S., Aluru, S., Ranka, S.: Concatenated parallelism: A technique for efficient parallel divide and conquer. In: Proceedings of the 8th IEEE Symposium on Parallel and Distributed Processing (SPDP 1996), pp. 488–495. IEEE Computer Society, Washington, DC (1996)
8. Joshi, M.V., Karypis, G., Kumar, V.: ScalParC: A new scalable and efficient parallel classification algorithm for mining large datasets. In: Proceedings of the 12th International Parallel Processing Symposium (IPPS 1998), pp. 573–579. IEEE Computer Society, Washington, DC (1998)
9. Senthamarai Kannan, K., Sailapathi Sekar, P., Mohamed Sathik, M., Arumugam, P.: Financial stock market forecast using data mining techniques. In: Proceedings of the International MultiConference of Engineers and Computer Scientists (IMECS 2010), Hong Kong, March 17-19, pp. 555–559 (2010)
10. Koh, H.C., Tan, G.: Data mining applications in healthcare. Journal of Healthcare Information Management 19(2), 64–72 (2005)
11. Kreuze, D.: Debugging hospitals. Technology Review 104(2), 32 (2001)
12. Kumar, V., Grama, A., Gupta, A., Karypis, G.: Introduction to Parallel Computing: Design and Analysis of Algorithms. Benjamin-Cummings Publishing Co., Inc., Redwood City (1994)
13. Lipschutz, S.: Schaum's Outline of Theory and Problems of Data Structures. McGraw-Hill (1986)
14. Liu, X., Wang, G., Qiao, B., Han, D.: Parallel strategies for training decision tree. Computer Science J. 31, 129–135 (2004)
15. Madai, B., AlShaikh, R.: Performance modeling and MPI evaluation using Westmere-based InfiniBand HPC cluster. In: Proceedings of the 2010 Fourth UKSim European Symposium on Computer Modeling and Simulation, Washington, DC, USA, pp. 363–368 (2010)
16. Mehta, M., Agrawal, R., Rissanen, J.: SLIQ: A fast scalable classifier for data mining. In: Apers, P.M.G., Bouzeghoub, M., Gardarin, G. (eds.) EDBT 1996. LNCS, vol. 1057, pp. 18–32. Springer, Heidelberg (1996)
17. Milley, A.: Healthcare and data mining. Health Management Technology 21(8), 44–47 (2000)
18. Quinlan, J.R.: Induction of decision trees. Machine Learning 1(1), 81–106 (1986)
19. Quinlan, J.R.: C4.5: Programs for Machine Learning, 1st edn. Morgan Kaufmann, San Mateo (1992)
20. Quinlan, J.R.: Improved use of continuous attributes in C4.5. Journal of Artificial Intelligence Research 4, 77–90 (1996)
21. Shafer, J., Agrawal, R., Mehta, M.: SPRINT: A scalable parallel classifier for data mining. In: VLDB, pp. 544–555 (1996)
22. Srivastava, A., Han, E., Kumar, V., Singh, V.: Parallel formulations of decision-tree classification algorithms. Data Mining and Knowledge Discovery, 237–261 (1998)
23. Trybula, W.J.: Data mining and knowledge discovery. Annual Review of Information Science and Technology (ARIST) 32, 197–229 (1997)
24. Wei, J.M., Wang, S.Q., Yu, G., Gu, L., Wang, G.Y., Yuan, X.J.: A novel method for pruning decision trees. In: Proceedings of the 8th International Conference on Machine Learning and Cybernetics, July 12-15, pp. 339–343 (2009)