Intelligent Data Analysis 16 (2012) 649–664 DOI 10.3233/IDA-2012-0542 IOS Press
Building fast decision trees from large training sets

A. Franco-Arcega^a,b,*, J.A. Carrasco-Ochoa^a, G. Sánchez-Díaz^c and J. Fco. Martínez-Trinidad^a

^a Computer Science Department, National Institute of Astrophysics, Optics and Electronics, Puebla, Mexico
^b Research Center of Technologies on Information and Systems, Autonomous University of the State of Hidalgo, Hidalgo, Mexico
^c Faculty of Engineering, Universidad Autónoma de San Luis Potosí, SLP, Mexico

Abstract. Decision trees are commonly used in supervised classification. Supervised classification problems with large training sets are now very common, yet many supervised classifiers cannot handle this amount of data. Some decision tree induction algorithms are capable of processing large training sets, but almost all of them have memory restrictions because they need to keep the whole training set, or a large part of it, in main memory. Moreover, the algorithms that do not have memory restrictions either have to choose a subset of the training set, spending extra time on this selection, or require the user to specify values for parameters that can be very difficult to determine. In this paper, we present a new fast heuristic for building decision trees from large training sets, which overcomes some of the restrictions of the state-of-the-art algorithms by using all the instances of the training set without storing all of them in main memory. Experimental results show that our algorithm is faster than the most recent algorithms for building decision trees from large training sets.

Keywords: Decision trees, large datasets, supervised classification
1. Introduction

Nowadays there are many supervised classification problems described through large training sets, which contain a large number of previously classified instances described by numeric and non-numeric attributes (mixed data). Decision trees [29] are among the most widely used supervised classification algorithms. A Decision Tree (DT) is a tree structure formed by internal nodes and leaves. Each internal node has a splitting attribute and one or more children. If the splitting attribute of an internal node is non-numeric, then a child is associated with each value of the splitting attribute; if the splitting attribute is numeric, then the node has a splitting test of the form X ≤ V and two output edges. The leaf nodes contain a class label. Currently, there are many algorithms for building decision trees [3,5,6,8,26–29,33,34,36]; however, these algorithms cannot handle large training sets.

* Corresponding author: A. Franco-Arcega, Computer Science Department, National Institute of Astrophysics, Optics and Electronics, Luis Enrique Erro 1, Santa Maria Tonantzintla, C.P. 72840, Puebla, Mexico. E-mail:
[email protected].
There are some algorithms that have been developed for building DTs from large training sets: SLIQ [25], SPRINT [32], CLOUDS [2], RainForest [16], BOAT [15], ICE [40], VFDT [9], BOAI [39], IIMDT [10] and IIMDTS [11]; however, all of them have some restrictions (see Section 2). In this work, we introduce a new fast heuristic for building traditional DTs from large training sets described by numeric and non-numeric attributes that overcomes the restrictions of the above algorithms. The proposed heuristic processes all the instances of the training set without storing the whole training set in main memory. Besides, our heuristic uses only one parameter, which can be easily determined by the user.

The rest of the paper is organized as follows: Section 2 describes previous algorithms for DT induction from large datasets. In Section 3, the proposed heuristic for building DTs from large training sets is introduced. Section 4 presents our experimental results. Finally, in Section 5, conclusions are given.

2. Related work

Currently, there are many algorithms for building DTs, for example ID3 [24], C4.5 [29], CART [5], QUEST [33], Model Trees [8], CTC [28] and FDT [19]. Besides, several algorithms have been developed for building DTs in an incremental way, such as UFFT [13], ID5R [36], PT2 [37] and ITI [38]. However, all these algorithms have to keep the whole training set in memory for building a DT, therefore they cannot handle large datasets. For this reason, they are not used for comparison against our algorithm. Although our algorithm and VFDT [9] could be used over streaming data, both were designed for fast DT building from large datasets; therefore, we do not include a description of algorithms for building DTs over streaming data [4,18,20] nor a comparison against them.

The algorithms that have been developed for building DTs from large training sets are SLIQ [25], SPRINT [32], CLOUDS [2], RainForest [16], BOAT [15], ICE [40], VFDT [9], BOAI [39], IIMDT [10] and IIMDTS [11]. SLIQ, SPRINT and CLOUDS represent each attribute of the training set as a list that stores the value the attribute takes in each instance of the training set. With this representation, these algorithms require, for storing the lists, at least twice the space needed to store the whole training set. Following this approach, RainForest and BOAI also use lists for representing the attributes but, in order to keep the lists in main memory, they reduce the lists by storing only the distinct values of each attribute. However, when a large training set is processed, these structures are still too big to be stored in main memory.

BOAT and ICE do not store the whole training set in main memory; instead, these algorithms use subsets of instances for building the DT. However, BOAT and ICE spend extra time finding those subsets of instances. For example, ICE divides the training set into several parts (called epochs) and then uses a tree-based sampling technique based on local clustering [41] for finding subsamples, which together form the subset of instances used for building the DT. VFDT is an algorithm that uses Hoeffding bounds [17] for building DTs from large training sets. VFDT needs three parameters for building a DT, but values for these parameters can be very difficult for the user to determine a priori.
IIMDT [10] builds multidimensional DTs from large numeric training sets using the mean of the instances of a leaf as splitting value to create new nodes. The algorithm IIMDT uses the whole set of attributes to split nodes. In this way, the process to select a splitting attribute is avoided, which saves
processing time and allows handling large training sets. In IIMDTS [11] a similar approach is followed, but different subsets of attributes are used for expanding each node of the tree.

The previously described algorithms have some restrictions. Almost all of them have space restrictions (SLIQ, SPRINT, CLOUDS, RainForest and BOAI), because they commonly use a representation of the attributes that requires more space than the whole training set, or they have to keep the whole training set in main memory. Other algorithms (BOAT and ICE) use only a subset of instances for building a DT, but additional time is required for computing this subset, which could be too expensive for large training sets. In VFDT several parameters are used, which could be very difficult for the user to determine a priori. Finally, IIMDT and IIMDTS build multidimensional DTs, which are difficult for the user to interpret. Besides, these algorithms can only process large numeric training sets, since when a leaf is expanded they create a new node for each class of instances, using the mean of the instances of each class as splitting value. For these reasons, in this paper we propose a new fast heuristic for building traditional unidimensional DTs from large training sets described by numeric and non-numeric attributes. The proposed algorithm, named DTLT, uses all the instances of the training set without storing the whole training set in main memory. Besides, this heuristic only needs one parameter, which is easy to determine by the user.

3. Proposed algorithm

In this work, we introduce a new heuristic for building DTs from Large Training sets (DTLT) that overcomes some of the shortcomings of the algorithms described in the previous section. In order to avoid storing the whole training set in main memory, DTLT processes the training instances one by one, updating the DT with each one and discarding those instances that have been used for expanding a node. Besides, in order to allow fast building of the DT, DTLT expands a node when it has a small number of instances; this number is a parameter of DTLT. In this way, DTLT only keeps a few instances of the training set in main memory and uses a fast expansion process, which together allow DTLT to process large training sets.

A DT built by DTLT has the same structure as a DT built by a conventional DT induction algorithm. It contains internal and leaf nodes. An internal node has an associated splitting attribute. If the splitting attribute is numeric, then the node has a splitting test of the form X ≤ V and two output edges: the first edge is for instances whose value in the splitting attribute is smaller than or equal to V, and the second edge is for instances whose value in the splitting attribute is greater than V. If the splitting attribute is categorical, then the internal node has one edge for each possible value of its splitting attribute. On the other hand, each leaf has an associated class label. Since the structure of a DT built by DTLT is the same as that of a DT built by traditional DT induction algorithms (like C4.5), a DT built by DTLT is traversed in the same way as a traditional DT.

Several criteria for splitting attribute selection in DTs have been proposed [7,22]. In this work we use the Gain Ratio Criterion [29], one of the most widely used criteria for selecting the splitting attribute for expanding a node [30]. This criterion finds the attribute that divides the instances in a node in the most homogeneous way.
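To make this structure concrete, the following C sketch shows one possible node layout and the corresponding traversal. It is only an illustration under our own naming (is_leaf, split_value, etc.); it is not taken from the authors' implementation, and categorical values are assumed to be encoded as small integer indices.

  /* Hedged sketch of a DTLT-style node: internal nodes hold a splitting
     attribute (a test X <= V with two children for numeric attributes, or one
     child per value for categorical attributes); leaves hold a class label. */
  typedef enum { NUMERIC, CATEGORICAL } AttrType;

  typedef struct Node {
      int is_leaf;
      int class_label;        /* meaningful only when is_leaf != 0           */
      int attr;               /* index of the splitting attribute            */
      AttrType attr_type;
      double split_value;     /* V in the test "X <= V" (numeric attributes) */
      int num_children;       /* 2 for numeric, one per categorical value    */
      struct Node **children;
  } Node;

  /* Traverse the tree with one instance x (its attribute values, categorical
     values encoded as small integers) and return the label of the reached leaf. */
  int classify(const Node *n, const double *x)
  {
      while (!n->is_leaf) {
          if (n->attr_type == NUMERIC)
              n = (x[n->attr] <= n->split_value) ? n->children[0]
                                                 : n->children[1];
          else
              n = n->children[(int)x[n->attr]];
      }
      return n->class_label;
  }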
DTLT processes the training instances one by one, updating the DT with each one and discarding those instances that have already been used for expanding a node. However, incrementally processing a long sequence of instances from only one class would leave the nodes without diversity of instances (instances from different classes). In order to avoid this situation, DTLT reorganizes the training set in a preprocessing step, alternating instances from each class, i.e., the first instance is taken from the first class, the second instance from the second class and so on; if there are r classes, the instance in position r + 1 is taken from class 1 again. In our experiments, the runtime of DTLT includes the time spent by this reorganization process.
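As an illustration of this preprocessing step, the following C sketch interleaves the classes over instance indices. It assumes the class labels fit in memory (the actual preprocessing would work over the training file), and all the names are our own.

  #include <stdlib.h>

  /* Fill order[] with a permutation of 0..m-1 that alternates the r classes
     (labels[i] in 0..r-1): one instance of class 0, one of class 1, ..., one
     of class r-1, then class 0 again, skipping classes that run out early.
     For a fixed number of classes r this is linear in m, matching the single
     pass over the training set mentioned in Section 3.1. */
  void reorganize(const int *labels, int m, int r, int *order)
  {
      int *count   = calloc(r, sizeof(int));    /* instances per class        */
      int *pos     = calloc(r, sizeof(int));    /* next slot inside a bucket  */
      int **bucket = malloc(r * sizeof(int *));
      int i, c, written = 0;

      for (i = 0; i < m; i++) count[labels[i]]++;
      for (c = 0; c < r; c++) bucket[c] = malloc((count[c] ? count[c] : 1) * sizeof(int));
      for (i = 0; i < m; i++) bucket[labels[i]][pos[labels[i]]++] = i;

      for (c = 0; c < r; c++) pos[c] = 0;       /* reuse pos as read cursor   */
      while (written < m)                       /* round-robin over classes   */
          for (c = 0; c < r; c++)
              if (pos[c] < count[c])
                  order[written++] = bucket[c][pos[c]++];

      for (c = 0; c < r; c++) free(bucket[c]);
      free(bucket); free(count); free(pos);
  }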
Table 1
DTLT Algorithm

  DTLT Algorithm
  Input: TS, s
    TS – training set
    s – maximum number of instances in the nodes
    ReorganizeTS(TS)
    ROOT = CreateNode()
    for each instance I in TS, do
        UpdateDT(I, ROOT)
    AssignClassToLeaves()

Table 2
UpdateDT function

  UpdateDT(I, NODE)
    I – instance to be processed
    NODE – node in the tree to be traversed
    if NODE.numIns < s, then
        AddInsNode(NODE, I)
        NODE.numIns = NODE.numIns + 1
        if NODE.numIns = s, then
            NumClass = CountClassNode(NODE.Ins)
            if NumClass = 1, then
                Delete(NODE.Ins)
                NODE.numIns = 0
            else
                ExpandNode(NODE)
                NODE.numIns = NODE.numIns + 1
    else  /* NODE.numIns > s: internal node */
        if NODE.Attr is categorical, then
            Edge = E_valueIns        /* edge labeled with the value of I in NODE.Attr */
        else
            if valueIns ≤ NODE.AttrValue, then   /* NODE.AttrValue is the splitting value V */
                Edge = E_left
            else
                Edge = E_right
        UpdateDT(I, NODE.Edge)
The algorithm starts building a tree with an empty root node without descendants (a leaf). Each instance traverses the DT until it reaches a leaf, where the instance is stored. Since, at the beginning, the DT only has the root node, the first instances are stored in that node. When the root node has s instances (s is a parameter of DTLT), the node is expanded. In DTLT, unlike traditional DT algorithms, not all the training instances are used for selecting the splitting attribute at the first step. Instead, DTLT uses the s instances stored in the root node (the reorganizing step guarantees that the root node contains instances from more than one class) and takes only these s instances into account to select the splitting attribute for expanding the root node. We use the parameter s in order to process only a small number of instances for expanding a node; in this way the expansion of a node is fast. Table 1 shows the main algorithm of DTLT and Table 2 shows the recursive function for updating the DT with each instance of the training set.

For expanding a node, DTLT applies the Gain Ratio Criterion [29] to each attribute, but only using the s instances stored in the node. For each attribute X, first, we obtain the information gain from the s instances in the leaf using Eqs. (1) and (2):

  info(S) = - \sum_{i=1}^{c} p_i \log_2 p_i    (1)

  gain(X) = info(S) - \sum_{i=1}^{n} \frac{|S_i|}{|S|} info(S_i)    (2)

where S is the set of the s instances stored in the leaf, c is the number of classes, p_i is the proportion of S that belongs to class i, n is the number of possible splits of the attribute X, and S_i is the subset of S that falls into the i-th split.
Table 3
ExpandNode function

  ExpandNode(NODE)
    NODE – node in the tree to be expanded
    for each attribute X_i in TS, do
        GR[i] = GainRatio(X_i, NODE.Ins)
    NODE.Attr = ChooseBestAttr(GR)
    if NODE.Attr is categorical, then
        for each value V_j ∈ NODE.Attr, do
            E_j = CreateEdge()
            leaf_j = CreateNode()
    else
        E_left = CreateEdge()
        leaf_left = CreateNode()
        E_right = CreateEdge()
        leaf_right = CreateNode()
    DistributeInstances(NODE.Ins)
    AssignTemporalClasses(NODE)
    DeleteInstances(NODE.Ins)
Then, we use Eq. (3) for measuring the Gain Ratio of each attribute:

  gain ratio(X) = \frac{gain(X)}{- \sum_{i=1}^{n} \frac{|S_i|}{|S|} \log_2 \frac{|S_i|}{|S|}}    (3)
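A compact C sketch of Eqs. (1)–(3), restricted to the s instances stored in the leaf being expanded, is shown below. The counts layout and the function names are our own illustration, not the authors' code; the denominator is the usual split information of the Gain Ratio Criterion.

  #include <math.h>
  #include <string.h>

  /* Entropy of a class distribution: Eq. (1). */
  static double info(const int *class_counts, int c, int total)
  {
      double e = 0.0;
      for (int k = 0; k < c; k++)
          if (class_counts[k] > 0) {
              double p = (double)class_counts[k] / total;
              e -= p * log2(p);
          }
      return e;
  }

  /* counts[j*c + k]: how many of the s stored instances fall in split j of the
     candidate attribute X (n splits; n = 2 for a numeric attribute) and belong
     to class k (c classes).  Returns the Gain Ratio of X over those instances. */
  double gain_ratio(const int *counts, int n, int c, int s)
  {
      int class_tot[c];                  /* class totals over the s instances */
      double gain, split_info = 0.0;

      memset(class_tot, 0, sizeof class_tot);
      for (int j = 0; j < n; j++)
          for (int k = 0; k < c; k++)
              class_tot[k] += counts[j*c + k];

      gain = info(class_tot, c, s);                          /* info(S)           */
      for (int j = 0; j < n; j++) {
          int sj = 0;
          for (int k = 0; k < c; k++) sj += counts[j*c + k];
          if (sj > 0) {
              double w = (double)sj / s;
              gain -= w * info(&counts[j*c], c, sj);         /* Eq. (2)           */
              split_info -= w * log2(w);                     /* split information */
          }
      }
      return split_info > 0.0 ? gain / split_info : 0.0;     /* Eq. (3)           */
  }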
The attribute with the highest Gain Ratio is chosen as splitting attribute. The number of edges (splits) created depends on the type of the selected splitting attribute: for numeric attributes the expanded node has two edges, and for categorical attributes one edge is created for each different value of the splitting attribute. In both cases, each generated edge leads to a new empty leaf.

Once the edges have been created, DTLT assigns a temporary class to each new leaf. For this, DTLT first distributes the instances among the new leaves. If the node has a categorical splitting attribute, each instance stored in the expanded node is placed in the leaf whose edge is labeled with the value that the instance has in that attribute. Otherwise, if the expanded node has a numeric splitting attribute, each instance is placed according to the splitting test X ≤ V: the instance goes to the first leaf if its value in the splitting attribute is smaller than or equal to V, or to the second leaf if this value is greater than V. Afterward, the temporary class of each leaf is the majority class of the instances that arrive at it; if two classes tie for the majority, the first class is chosen. If a leaf does not receive any instance, the temporary class assigned to it is the majority class of the instances used for expanding the node. Finally, the s instances used for the expansion are deleted, leaving all the new leaves empty, and DTLT marks the expanded node as an internal node. The algorithm for the expansion process is shown in Table 3.

Once the root node has been expanded, DTLT continues processing the remaining instances in the training set. Each instance traverses the DT built so far until it reaches a leaf, where it is stored. When a leaf has s instances stored in it, DTLT expands it as explained before. However, in some of these leaves it could happen that all the s instances belong to only one class, i.e., they are homogeneous nodes. For a node of this kind, DTLT verifies whether the temporary class assigned to the node is the same as the class of the s stored instances; if it is not, DTLT assigns the class of these s instances as the new temporary class of the node. Finally, DTLT deletes the s instances stored in the leaf, so it can receive another s instances.
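The temporary class assignment just described can be sketched in C as follows; the array layout and the names are hypothetical and only illustrate the majority rule, its tie-breaking, and the treatment of empty leaves.

  /* arrived[j*c + k]: how many of the s expansion instances of class k were
     placed in new leaf j; node_counts[k]: class counts over all s instances. */
  static int majority(const int *cnt, int c)
  {
      int best = 0;
      for (int k = 1; k < c; k++)
          if (cnt[k] > cnt[best]) best = k;  /* strict ">" keeps the first class on ties */
      return best;
  }

  void assign_temporal_classes(const int *arrived, int n_leaves, int c,
                               const int *node_counts, int *temp_class)
  {
      int node_majority = majority(node_counts, c);   /* fallback for empty leaves */
      for (int j = 0; j < n_leaves; j++) {
          int total = 0;
          for (int k = 0; k < c; k++) total += arrived[j*c + k];
          temp_class[j] = (total > 0) ? majority(&arrived[j*c], c) : node_majority;
      }
  }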
Finally, when all the training instances have been processed, DTLT assigns to each leaf the class label of the majority class of the instances stored in that leaf. If a leaf does not have instances (it is empty), then DTLT assigns its temporary class as the class label of that leaf. Once a DT has been built, DTLT uses it for classifying new instances in the same way as a traditional DT: an instance traverses the DT until it reaches a leaf, and the class label associated to that leaf is assigned to the new instance.

3.1. Time complexity analysis of the DTLT algorithm

In order to obtain the time complexity of DTLT, we take into account its main steps for building a DT: reorganization of the training set, traversing the tree and expanding nodes. For a training set with m instances, described by x attributes and divided into c classes, the reorganization step is O(m), since only one traversal of the training set is needed for alternating instances from each class. The complexity of traversing the DT with each instance depends on the number of levels in the DT. In the worst case, the DT has at most O(m/s) levels, since DTLT uses s instances for each expansion and each expansion could generate a new level; therefore, traversing the DT with all m instances is O(m * m/s) = O(m^2) in the worst case. In the expansion process, for a single node, DTLT chooses the best splitting attribute by applying the Gain Ratio Criterion to each attribute, using only the s instances stored in the leaf to be expanded, which is O(s * x). Since the maximum number of expansions that DTLT does for building a DT is, in the worst case, O(m/s) (because each expansion uses s instances), the whole expansion process in DTLT is O(s * x * (m/s)) = O(x * m); however, for large training sets x ≪ m, so the complexity of all the expansions for building a DT is O(m). Finally, the complexity of building a DT with DTLT is the sum of the complexities of the reorganizing, traversing and expansion steps, that is, O(m + m^2 + m) = O(m^2) in the worst case.
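Written as a worked summary (treating s and x as constants, with x ≪ m), the derivation above is:

  \begin{align*}
  T_{\mathrm{reorganize}}(m) &= O(m),\\
  T_{\mathrm{traverse}}(m)   &= O\!\left(m \cdot \frac{m}{s}\right) = O(m^2),\\
  T_{\mathrm{expand}}(m)     &= O\!\left(s \cdot x \cdot \frac{m}{s}\right) = O(x \cdot m) = O(m),\\
  T_{\mathrm{DTLT}}(m)       &= O(m) + O(m^2) + O(m) = O(m^2).
  \end{align*}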
In [40], the complexity of ICE was analyzed; it is O(m * log(m)). The complexity of BOAI is also O(m * log(m)) [39], and the complexity of VFDT is, in the worst case, O(m^2) [23]. As can be observed, the time complexity of DTLT is, in the worst case, equal to VFDT's and greater than ICE's and BOAI's. Nevertheless, the complexities of DTLT and VFDT are worst-case bounds; therefore, in Section 4 we show an experimental analysis of the runtime spent by each algorithm for building DTs from large training sets.

3.2. Memory consumption analysis of the DTLT algorithm

For building a DT, DTLT has to keep in main memory only the instance that is being processed at each moment and the DT already built with the previous instances. The maximum number of expansions that DTLT can do is m/s, hence the space required by DTLT is O(m/s) = O(m).

The space that ICE needs for building the DT depends on the number of epochs (e) and the proportion of instances (0 < p ≤ 1) that must be extracted from each epoch. Each epoch contains m/e instances, thus the size of each subsample is p * (m/e), and the space required for storing all the subsamples is p * (m/e) * e = p * m. Finally, since only one epoch and all the subsamples need to be stored at each moment, the number of instances that ICE needs to store is O((p * m) + (m/e)) = O(m), since 0 < p ≤ 1 and, for large training sets, e ≪ m.
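The same bookkeeping for ICE can be written as a worked equation:

  \begin{equation*}
  \underbrace{p \cdot \frac{m}{e} \cdot e}_{\text{all subsamples}}
  \;+\; \underbrace{\frac{m}{e}}_{\text{current epoch}}
  \;=\; p\,m + \frac{m}{e} \;=\; O(m),
  \qquad 0 < p \leq 1,\; e \ll m.
  \end{equation*}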
Table 4
Description of the datasets used in the experiments

  Dataset            Classes   Instances    Numerical att.   Categorical att.
  Forest CoverType   7         581,012      10               44
  KDD Cup            2         4,800,000    34               7
  GalStar            2         4,000,000    30               0
  Agrawal F1         2         6,500,000    6                3
  Agrawal F2         2         6,500,000    6                3
  Agrawal F7         2         6,500,000    6                3
VFDT needs to store the instance that is being processed at each moment and the DT built previously. Therefore, the space required by this algorithm depends on the maximum number of expansions that VFDT can make, which is m/n, where n is the number of instances used for verifying whether a node must be expanded and n ≪ m; then the space required by VFDT is O(m/n) = O(m). BOAI needs to store the whole training set in main memory for building the DT; the instances are stored in the leaves of the DT as a set of lists, so the space that BOAI requires is O(m). According to [10,11], the space required by IIMDT and IIMDTS is O(m). As can be noticed, the space required by BOAI, VFDT, ICE, IIMDT, IIMDTS and DTLT for building a DT is, in the worst case, O(m); the difference is in the effective space (the amount of memory that they actually use for building a DT). Therefore, in Section 4 we show an experimental analysis of the amount of memory used by each algorithm.

4. Experimental results

In this section, first we analyze the behavior of DTLT when the maximum number of instances stored in a node (the parameter s) varies. Then, we show a comparison of DTLT against ICE, VFDT and BOAI (the most recent algorithms for building DTs) when the number of instances and attributes in the training set increases. For those tests using numeric training sets, we included IIMDT and IIMDTS in the comparison. Besides, we show the behavior of DTLT when the order of the training instances varies. Finally, we present a comparison of the amount of memory that each algorithm uses for building a DT, since in the previous section we observed that the memory consumption complexity of DTLT is the same as that of ICE, VFDT, BOAI, IIMDT and IIMDTS.

The datasets used in these experiments are described in Table 4. The first dataset was obtained from the UCI Repository [35], the second from the KDD Cup [21], the third from SDSS [31], and the last three are synthetic datasets built using the database generator developed by Agrawal et al., using Functions 1, 2 and 7 [1]. GalStar and all the synthetic datasets are balanced, since their classes have the same number of instances. The distribution of Forest CoverType is 211,840 instances in class 1, 283,301 in class 2, 35,754 in class 3, 2,747 in class 4, 9,493 in class 5, 17,367 in class 6 and 20,510 in class 7. The distribution of KDD Cup is 972,781 instances in class 1 and 3,827,219 in class 2. Therefore, Forest CoverType and KDD Cup are imbalanced datasets. Additionally, none of these datasets contains noise or missing values.

For all the experiments, we used 10-fold cross validation and each plotted result includes the 95% confidence interval; however, in some cases these confidence intervals are not visible in the figures because they are very small. The results show the processing time (including building and classification time, and for DTLT also the preprocessing time) and the accuracy rate for each dataset.
Fig. 1. Runtime and accuracy rate for DTLT, when the value of s changes.
Our algorithm was implemented in C and all our experiments were performed on a PC with a Pentium 4 at 3.06 GHz, with 2 GB of RAM, running Linux Kubuntu 7.10. For ICE, we implemented in C a version based on [40], using 5 epochs and extracting 10% of the instances in each epoch for generating the subsample used for building the final DT (this percentage of instances was chosen according to the experiments presented in [40]). For finding the subsample in each epoch of ICE, we used a tree-based sampling technique that chooses from each leaf node of the DT some instances of the corresponding epoch. For VFDT we used the authors' version (implemented in C) and we ran the experiments using the parameter values recommended by the authors [9]. For BOAI we also used the authors' version (an executable file), setting 30 iterations for building the DT (this value was chosen based on [39]). Finally, IIMDT and IIMDTS were implemented in C, according to [10,11].^1

4.1. Parameter s

This experiment was done in order to analyze the behavior of DTLT when the maximum number of instances stored in a node, the parameter s, changes. For each dataset in Table 4, we evaluated several values of s: 50, and from 100 to 600 with increments of 100. Figure 1 shows the results for DTLT when the value of the parameter s varies. As we can observe, the processing time increases when the value of s increases. With respect to the accuracy rate, for each dataset DTLT obtains similar accuracy no matter the value of s. Taking into account these results, we recommend using small values of s. In the next experiments, we will use s = 100, since with this value DTLT spent less time than with the other values, and the accuracy was the highest of all tested values of s on all the datasets.

4.2. Increasing the number of instances in the training set

For these experiments we used the datasets described in Table 4. For each dataset we created several training sets, increasing the number of instances in each one.

^1 The executable files and the databases needed to replicate the results are available at http://ccc.inaoep.mx/~ariel/DTLT/DTLT.html.
Fig. 2. Runtime and accuracy rate for DTLT, ICE and VFDT algorithms for Forest CoverType.
Fig. 3. Runtime and accuracy rate for DTLT, ICE and VFDT algorithms for KDD dataset.
With the Forest CoverType dataset we created training sets containing from 50,000 to 550,000 instances (with increments of 50,000 instances). On this dataset we could not apply BOAI, since that algorithm only handles two-class problems. Figure 2 shows the processing time and the accuracy rate obtained for this dataset. As we can notice, DTLT, VFDT and ICE obtained similar accuracy, but DTLT was up to 4.5 and 31.5 times faster than VFDT and ICE, respectively.

For the KDD dataset, we created different-sized training sets from 500,000 to 4,500,000 instances, with increments of 500,000. Figure 3 presents a comparison among DTLT, ICE and VFDT. BOAI does not appear in this figure because it could not process training sets with more than 200,000 instances; for bigger training sets the algorithm presented a memory failure. From Fig. 3, we can notice that DTLT was faster than ICE and VFDT: DTLT was up to 23.5 times faster than ICE and up to 22.5 times faster than VFDT. Regarding the accuracy rate, DTLT and VFDT obtained similar results, and both of them were better than ICE.
Fig. 4. Runtime and accuracy rate for DTLT, ICE, VFDT, IIMDT and IIMDTS algorithms for GalStar dataset.
It is important to highlight that, despite the fact that the Forest CoverType and KDD datasets are imbalanced, DTLT obtained good results.

For GalStar we created different-sized training sets (from 500,000 to 4,000,000 instances, with increments of 500,000). Figure 4 presents a comparison among DTLT, ICE and VFDT. BOAI does not appear in this figure because for this dataset it could not process training sets with more than 300,000 instances; the algorithm presented a memory failure. Besides, since GalStar is a numeric training set, we included the IIMDT and IIMDTS algorithms in the comparison. As can be noticed, all the algorithms obtained similar accuracy rates, but DTLT and IIMDTS were about 9.16, 1.13 and 1.10 times faster than ICE, VFDT and IIMDT, respectively. For this experiment IIMDTS was the fastest algorithm; however, it is important to highlight that IIMDTS can only process numeric training sets, and moreover the DT built by IIMDTS is a multidimensional DT.

For the Agrawal datasets, we created, for each one, training sets from 500,000 to 6,500,000 instances, with increments of 500,000. Figure 5 shows the results obtained by DTLT, ICE, VFDT and BOAI on the Agrawal F1, F2 and F7 datasets. In Fig. 5, we only show the results of BOAI up to the training set of 2,000,000 instances, since the processing time spent by this algorithm increased much faster than that of the other algorithms. We can notice from Fig. 5 that DTLT is the fastest. For the Agrawal datasets, DTLT was, on average, up to 2.72 and 6.25 times faster than ICE and VFDT, respectively. With respect to the accuracy rate, DTLT and VFDT obtained similar results, and both of them were better than ICE. ICE showed this behavior because, for these training sets, in our experiments it did not find a representative subset of instances for building the DTs.
Fig. 5. Runtime and accuracy rate for DTLT, ICE, VFDT and BOAI for Agrawal F1, Agrawal F2 and Agrawal F7, respectively.
Fig. 6. Runtime and accuracy rate for DTLT, ICE and VFDT algorithms for KDD, when the number of attributes is increased.
Fig. 7. Runtime and accuracy rate for DTLT for different training sets, using GalStar.
The results of this experiment are shown in Fig. 6. From this figure we observe that, when the number of attributes in the training set was increased, the processing time of DTLT increased slowly, while the processing time of ICE and VFDT increased much faster. With respect to the accuracy rate, DTLT and VFDT obtained similar results, and both of them were better than ICE.

4.4. Varying the order of the training instances

In order to show the behavior of DTLT with respect to the preprocessing step, we used the GalStar dataset to create ten different training set orders, each one with a different random order of the data. Results of accuracy and time are shown in Fig. 7, where we can observe that, no matter the order of the input data, DTLT has a stable behavior: the processing time spent by DTLT as well as the accuracy rate were very similar for all the training sets.
Fig. 8. Forest Memory.
Fig. 9. KDD Memory.
Fig. 10. GalStar Memory.
Fig. 11. Agrawal F1 Memory.
Fig. 12. Agrawal F2 Memory.
Fig. 13. Agrawal F7 Memory.
4.5. Memory use

In Section 3.2 we showed that ICE, VFDT, BOAI, IIMDT, IIMDTS and DTLT have the same memory consumption complexity. Therefore, in this experiment we show the difference among these algorithms in terms of the amount of memory that they use for building a DT. For measuring the amount of memory used by each algorithm, we used the memory measurement tool [14] for Linux. In this experiment, we used the training sets created in Section 4.2 for the datasets described in Table 4. Figures 8–13 show the results obtained for each dataset. BOAI was not included in these experiments, since it could not build a DT for all the training sets.

From Fig. 8, we observe that DTLT and ICE used less memory than VFDT for building a DT with the Forest CoverType dataset. From Fig. 9, we observe that, for the KDD dataset, DTLT used less memory than ICE and VFDT. With respect to GalStar (a numeric dataset), see Fig. 10, ICE and IIMDT used less memory for building a DT than DTLT and IIMDTS, which in turn used less memory than VFDT. For the three Agrawal datasets, Figs 11–13, we can notice that DTLT and ICE used a similar amount of memory for building a DT, and both algorithms used less memory than VFDT. From these experiments, we can observe that DTLT uses less memory than VFDT and a similar amount of memory to ICE, IIMDT and IIMDTS. However, IIMDT and IIMDTS can only process numeric training sets and, according to the experiments of Sections 4.2 and 4.3, ICE is slower and obtains lower accuracy than DTLT.

5. Conclusions

In this work, we have introduced a new fast heuristic for building DTs from large training sets (DTLT) that overcomes the restrictions of the state-of-the-art algorithms for building DTs from large training sets. For building a DT, DTLT processes all the instances in the training set without storing the whole training set in memory. DTLT processes the instances one by one, in an incremental way, updating the DT with each one. Besides, in order to process large training sets, DTLT uses a small number of instances for expanding each leaf, discarding them after each node expansion. DTLT has one user-defined parameter (s), but our experiments showed that, for different datasets, the behavior of DTLT with respect to this parameter is very stable.

For large training sets described only by numeric attributes, we observed that DTLT is competitive in accuracy and time with IIMDTS, which is an algorithm designed for this kind of training sets. However, IIMDTS builds multidimensional DTs, while DTLT builds DTs with the same structure as those built by a conventional DT induction algorithm.

From our experiments, we conclude that DTLT is faster than ICE, VFDT and BOAI, the most recent algorithms for building DTs from large training sets described by numeric and non-numeric attributes, while maintaining competitive accuracy. Besides, one important characteristic of our algorithm is that, when the number of attributes in the training set increases, DTLT behaves better than ICE, VFDT and BOAI, since our algorithm increases its processing time only slightly while ICE, VFDT and BOAI increase theirs drastically. Moreover, we also observed that the preprocessing step of DTLT does not affect the results of the algorithm, since the accuracy and processing time of DTLT show a similar behavior for different training set orders.
Based on the memory experiments, we noticed that DTLT uses less memory than VFDT and BOAI, while DTLT and ICE use a similar amount of memory; however, ICE is slower and obtains lower accuracy than DTLT.
Finally, we can conclude that DTLT is the best option for building DTs from large mixed training sets, especially when the dataset has a large number of attributes. As future work, we plan to modify our algorithm in order to deal with noise and missing data. Also, the application of our algorithm over streaming data will be studied.
References
[1] R. Agrawal, T. Imielinski and A. Swami, Database mining: A performance perspective, IEEE Transactions on Knowledge and Data Engineering 5(6) (1993), 914–925.
[2] K. Alsabti, S. Ranka and V. Singh, CLOUDS: A decision tree classifier for large datasets, in Proc. Conference on Knowledge Discovery and Data Mining (KDD'98), 1998, pp. 2–8.
[3] H. Altinçay, Decision trees using model ensemble-based nodes, Pattern Recognition 40 (2007), 3540–3551.
[4] Y. Ben-Haim and E. Tom-Tov, A streaming parallel decision tree algorithm, Journal of Machine Learning Research 11 (2010), 849–872.
[5] L. Breiman, J. Friedman and R. Olshen, Classification and Regression Trees, Wadsworth International Group, 1984.
[6] C.E. Brodley and P.E. Utgoff, Linear machine decision trees, Technical Report TR-91-10, University of Massachusetts, Department of Computer and Information Science, Amherst, MA, 1991.
[7] B. Chandra, R. Kothari and P. Paul, A new node splitting measure for decision tree construction, Pattern Recognition (2010), doi:10.1016/j.patcog.2010.02.025.
[8] C.S. Chih, P.H. Kuo and L. Yuh-Jye, Model trees for classification of hybrid data types, in Intelligent Data Engineering and Automated Learning – IDEAL: 6th International Conference, 2005, pp. 32–39.
[9] P. Domingos and G. Hulten, Mining high-speed data streams, in Proc. of the Sixth Int. Conference on Knowledge Discovery and Data Mining, ACM Press, 2000, pp. 71–80.
[10] A. Franco-Arcega, J.A. Carrasco-Ochoa, G. Sánchez-Díaz and J.F. Martínez-Trinidad, A new incremental algorithm for induction of multivariate decision trees for large datasets, in Proc. of the 9th International Conference on Intelligent Data Engineering and Automated Learning – IDEAL, LNCS 5326, 2008, pp. 282–289.
[11] A. Franco-Arcega, J.A. Carrasco-Ochoa, G. Sánchez-Díaz and J.F. Martínez-Trinidad, Multivariate decision trees using different splitting attribute subsets for large datasets, in Proc. of the 23rd Canadian Conference on Artificial Intelligence, LNAI 6085, Springer, 2010, pp. 370–373.
[12] Y. Freund and L. Mason, The alternating decision tree learning algorithm, in 16th International Conference on Machine Learning, 1999, pp. 124–133.
[13] J. Gama and P. Medas, Learning decision trees from dynamic data streams, Journal of Universal Computer Science 11(8) (2005), 1353–1366.
[14] G. García, Measure of time and memory of a program, Faculty of Informatics, University of Murcia, 2009, http://dis.um.es/~ginesgm/medidas.html.
[15] J. Gehrke, V. Ganti, R. Ramakrishnan and W. Loh, BOAT – Optimistic decision tree construction, ACM SIGMOD Record 28(2) (1999), 169–180.
[16] J. Gehrke, R. Ramakrishnan and V. Ganti, RainForest – A framework for fast decision tree construction of large datasets, Data Mining and Knowledge Discovery 4 (2000), 127–162.
[17] W. Hoeffding, Probability inequalities for sums of bounded random variables, Journal of the American Statistical Association 58 (1963), 13–30.
[18] G. Hulten, L. Spencer and P. Domingos, Mining time-changing data streams, in Proc. 7th ACM SIGKDD Int. Conf. on Knowledge Discovery and Data Mining, 2001, pp. 97–106.
[19] C.Z. Janikow, Fuzzy decision trees: Issues and methods, IEEE Transactions on Systems, Man and Cybernetics – Part B: Cybernetics 28(1) (1998), 1–14.
[20] R. Jin and G. Agrawal, Efficient decision tree construction on streaming data, in ACM KDD Conference, 2003, pp. 571–576.
[21] KDD Cup 1999 data, The UCI KDD Archive, Information and Computer Science, University of California, Irvine, 1999, http://kdd.ics.uci.edu/databases/kddcup99/kddcup99.html.
[22] R. Kohavi and J.R. Quinlan, Decision-tree discovery, in: W. Klosgen and J.M. Zytkow, eds, Oxford University Press, 2002.
[23] Z. Li, T. Wang, R. Wang, Y. Yan and H. Chen, A new fuzzy decision tree classification method for mining high-speed data streams based on binary search trees, in Proc. of FAW Conference, 2007, pp. 216–227.
[24] T. Mitchell, Machine Learning, McGraw Hill, 1997.
[25] M. Mehta, R. Agrawal and J. Rissanen, SLIQ: A fast scalable classifier for data mining, in Proc. 5th International Conference on Extending Database Technology (EDBT), Avignon, France, 1996, pp. 18–32.
[26] J. Ouyang, N. Patel and I. Sethi, Induction of multiclass multifeature split decision trees from distributed data, Pattern Recognition 42 (2009), 1786–1794.
[27] W. Pedrycz and Z.A. Sosnowski, C-fuzzy decision trees, IEEE Transactions on Systems, Man and Cybernetics – Part C: Applications and Reviews 35(4) (2005), 498–511.
[28] J. Perez, J. Muguerza, O. Arbelaitz, I. Gurrutxaga and J. Martin, Combining multiple class distribution modified subsamples in a single tree, Pattern Recognition Letters 28(4) (2007), 414–422.
[29] J.R. Quinlan, C4.5: Programs for Machine Learning, Morgan Kaufmann, San Francisco, CA, 1993.
[30] L. Rokach and O. Maimon, Top-down induction of decision trees classifiers – a survey, IEEE Transactions on Systems, Man, and Cybernetics, Part C: Applications and Reviews 35(4) (2005), 476–487.
[31] SDSS – J. Adelman-McCarthy, M.A. Agueros, S.S. Allam, Data Release 6, ApJS 175 (2008), 297.
[32] J.C. Shafer, R. Agrawal and M. Mehta, SPRINT: A scalable parallel classifier for data mining, in Proc. 22nd International Conference on Very Large Databases, 1996, pp. 544–555.
[33] Y.S. Shih and W.Y. Loh, Split selection methods for classification trees, Statistica Sinica 7(4) (1997), 815–840.
[34] H.W. Shin and S.Y. Sohn, Selected tree classifier combination based on both accuracy and error diversity, Pattern Recognition 38 (2005), 191–197.
[35] UCI Machine Learning Repository, University of California, 2008, http://archive.ics.uci.edu/ml.
[36] P.E. Utgoff, Incremental induction of decision trees, Machine Learning 4 (1989), 161–186.
[37] P.E. Utgoff and C.E. Brodley, An incremental method for finding multivariate splits for decision trees, in Proc. 7th International Conference on Machine Learning, 1990, pp. 58–65.
[38] P.E. Utgoff, N.C. Berkman and J.A. Clouse, Decision tree induction based on efficient tree restructuring, Machine Learning 29(5) (1997), 5–44.
[39] B. Yang, T. Wang, D. Yang and L. Chang, BOAI: Fast alternating decision tree induction based on bottom-up evaluation, in PAKDD, 2008, pp. 405–416.
[40] H. Yoon, K. Alsabti and S. Ranka, Tree-based incremental classification for large datasets, Technical Report TR-99-013, CISE Department, University of Florida, Gainesville, FL 32611, 1999.
[41] T. Zhang, R. Ramakrishnan and M. Livny, BIRCH: An efficient data clustering method for very large databases, in Proc. of ACM SIGMOD Int'l Conference on Management of Data, 1996, pp. 103–114.