ACE-Cost: Acquisition Cost Efficient Classifier by Hybrid Decision Tree with Local SVM Leaves

Liyun Li¹, Umut Topkara², and Nasir Memon¹

¹ Polytechnic Institute of New York University, 6 Metrotech Center, Brooklyn, NY 11201
[email protected], [email protected]
² IBM Watson Research Center, 19 Skyline Drive, Hawthorne, NY 10532
[email protected]

Abstract. The standard prediction process of SVM requires acquisition of all the feature values for every instance. In practice, however, a cost is associated with the mere act of acquiring a feature, e.g. the CPU time needed to compute the feature from raw data, the dollar amount spent for gleaning more information, or the patient wellness sacrificed by an invasive medical test. In such applications, a budget constrains the classification process from using all of the features. We present ACE-Cost, a novel classification method that reduces the expected test cost of SVM without compromising classification accuracy. Our algorithm uses a cost efficient decision tree to partition the feature space and obtain coarse decision boundaries, and local SVM classifiers at the leaves of the tree to refine them. The resulting classifiers are also effective in scenarios where several features share overlapping acquisition procedures, so that the cost of acquiring them as a group is less than the sum of the individual acquisition costs. Our experiments on standard UCI datasets, a network flow detection application, and synthetic datasets show that the proposed approach achieves the classification accuracy of SVM while reducing the test cost by 40%-80%.

Keywords: Hybrid SVM, Postpruning, Cost Efficient Decision Tree, Support Vector Machine

1 Introduction

Analytics capability is a major competitive advantage for a business enterprise with large amounts of information flow, shaping its efficiency in everything from office productivity to customer relations and marketing. Business intelligence plays a pivotal role for online businesses, which have to make large volumes of business decisions within a fraction of a second, e.g. to bid on a display ad or to select relevant products for consumers. In such applications of decision algorithms, one needs to take into account run-time concerns such as throughput, operational cost, and response time. In this paper, we study the classification problem under such real-life constraints.


Our focus is the set of classification problems in which the run-time efficiency of the decision process is as important as the accuracy of the decision, and the feature acquisition cost has a determining share in that run-time efficiency. Clearly, gleaning as much data as possible before making a decision will produce more accurate results, so limiting the resources available for feature acquisition might have a negative effect on decision accuracy. Despite the disadvantage of reducing the amount of information available to the classifier, we show that it is possible to increase the run-time efficiency of classification without compromising accuracy. We present ACE-Cost, a novel classification method that achieves the accuracy of SVMs while reducing the expected feature acquisition cost by 40% to 80%. ACE-Cost can be used in many practical applications of classification where feature acquisition cost takes different forms, such as patient wellness, CPU time, or money, as in medical diagnosis [10], network monitoring [13], spam filtering [11], and credit evaluation [20].

The Support Vector Machine (SVM) is a popular machine learning classifier which is built by computing a hyperplane in the multi-dimensional feature space so that the margin between the positive and negative examples is maximized. One of the most critical concerns in applying SVM is the test cost. In order to make a decision, the standard SVM decision function uses the values of all the features of an instance, which means that the prediction cost of SVM is the cost of acquiring all the feature values. It might be impossible for some online applications to compute SVM decisions if the total cost of extracting values for all the features is prohibitively expensive. For example, in medical diagnosis [10], it is almost impossible, or even harmful, to perform all the tests to diagnose a patient, because the total cost of all the tests might be too high, or performing all of them would significantly harm the patient. Under such scenarios, SVM cannot be applied directly unless the test cost is reduced.

In order to reduce the test cost of SVM, feature selection [21] can be performed beforehand to limit the number of features SVM needs. By choosing a subset of all the features, the classifier can focus on the more relevant features and still achieve acceptable accuracy. However, the limitation of feature selection is that all the examples are tested against the same subset of features, and features that are only useful for discriminating a part of the examples may get discarded in the overall feature selection process.

On the other hand, the decision tree (DT) is an efficient classifier which is naturally incremental and cost efficient. The test cost for an instance in a decision tree is the accumulated cost of the features along its root-to-leaf path, and in most cases this is significantly smaller than the cost of obtaining all the feature values. In this sense, decision trees are inherently cost efficient, and the average cost of prediction is the cost averaged over the root-to-leaf paths of all the instances. The limitation of decision trees is that they are not always accurate enough in every application [23], and the problem of overfitting [2] occurs when the tree is too big. To avoid overfitting and increase prediction accuracy,


prepruning or postpruning [22] is performed during or after decision tree construction. For example, in C4.5 [2] trees, reduced-error postpruning [8] can reduce the tree size while maintaining the prediction accuracy.

In many cases, the costs of acquiring different feature values are independent. But there are circumstances [4] where the cost of acquiring two features together is less than the sum of their individual acquisition costs. For instance, if computing several feature values requires a Fourier transform, then computing one of these features may reduce the cost of obtaining the others, because the result of the Fourier transform can be re-used. Also, in medical applications [10], all blood tests require a sampling of blood from the patient, and therefore these features share the cost of blood sampling. This dependent cost property needs to be taken into account when performing cost-effective versions of both SVM feature selection and the decision tree heuristic calculation.

We propose ACE-Cost, a hybrid acquisition cost efficient classifier which combines the accuracy advantage of SVM and the cost efficiency of decision trees. ACE-Cost first uses a decision tree to sketch the decision boundary, then replaces subtrees with local SVM leaves to construct finer hyperplanes during post-pruning. Using the features available at the non-pure leaves to build locally focused SVMs allows us to maintain the cost efficiency of decision trees while increasing the prediction accuracy.

The paper is organized as follows: in Section 2, we introduce the preliminary method of calculating the average test cost and related work on cost efficient decision trees and SVM. Section 3 describes the hybrid SVM construction algorithm in detail. Experimental results which demonstrate the performance of our algorithm on three different categories of data are in Section 4. We discuss our results and lay out future work in Section 5, and present our conclusions in Section 6.

2 Preliminaries and Related Work

ACE-Cost focuses on reducing the average test cost of SVM classification by using a hybrid algorithm that uses cost efficient decision trees. In this section, we first give some background on how the average test cost is calculated in SVM and decision tree classifiers. Then we give an overview of the methods for reducing the test cost of these classifiers, as well as previous work on hybrid algorithms that achieve good classification accuracy with low cost.

2.1 Computing Average Test Cost in Decision Tree and SVM

A machine learning classifier aims to build a hypothesis H which predicts the unknown label of new instances, given a set of n training instances (Xi, Yi), where Xi is a vector of feature values (f1, f2, ..., fm) and Yi is the corresponding label. A test cost c(j) is associated with each feature fj. To predict the unknown label, the classifier may query the values of a subset of the features of the new instance. If all feature costs are independent,


and there are no overlapping feature costs, then querying a feature set Qi for an unlabeled instance i will cost:

testcost_i = Σ_{j ∈ Q_i} c(j)   (1)

SVM requires acquiring all the feature values (f1, ..., fm) to classify an unlabeled instance, therefore the test cost of SVM is simply the cost of extracting all the feature values. In a decision tree, the test cost of an instance is the sum of the costs of the feature nodes along its root-to-leaf path. We define the test cost of instance i as:

pathcost_i = Σ_{j ∈ π_i} c(j)   (2)

where c(j) is the cost of feature fj and π_i denotes the set of features along the decision path of instance i. The average cost of a decision tree is then the cost averaged over all instances:

avgCost = (1/n) Σ_{i=1}^{n} pathcost_i   (3)
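As a concrete illustration of Equations (1)-(3), the following sketch computes the path cost and average cost of a plain decision tree; the nested-dict tree encoding and the cost dictionary are our own illustrative assumptions, not the representation used in the paper.

# Illustrative sketch of Equations (1)-(3). A tree node is assumed to be either
# {"feature": j, "threshold": t, "left": ..., "right": ...} or a leaf {"label": y};
# cost is a dict mapping feature index -> acquisition cost.

def path_cost(tree, x, cost):
    """Accumulated acquisition cost of the features queried on x's root-to-leaf path."""
    acquired, total = set(), 0.0
    node = tree
    while "label" not in node:
        j = node["feature"]
        if j not in acquired:          # pay for each feature at most once
            total += cost[j]
            acquired.add(j)
        node = node["left"] if x[j] <= node["threshold"] else node["right"]
    return total

def average_cost(tree, X, cost):
    """Equation (3): path cost averaged over all instances."""
    return sum(path_cost(tree, x, cost) for x in X) / len(X)

def svm_test_cost(cost):
    """For a standard SVM, Equation (1) degenerates to the cost of acquiring every feature."""
    return sum(cost.values())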

Note that in practical applications [4], [10], groups of features may have overlapping acquisition costs. If features j and k have overlapping costs, then c(j, k) < c(j) + c(k), where c(j, k) is the cost of acquiring both features together. Once either of j or k has been acquired for an instance, the marginal cost of acquiring the other feature for the same instance decreases. We present our results on such applications in Section 4. Also note that, in ACE-Cost, some of the leaf nodes are local SVM classifiers, so the cost of any additional features used by the local SVM is added to the pathcost of each instance reaching such nodes.

2.2 Cost Efficient SVM

SVM computes a decision function that represents a hyperplane discriminating the data examples. All of the feature values are required to compute the outcome of the SVM decision function. The most basic form of test cost reduction is to apply feature selection and use a single subset of the features for classifying all unlabeled instances. Standard feature selection can be forward, backward, or bidirectional [21]. Backward feature selection starts from the complete set of features and adaptively tries to remove some of them, whereas forward feature selection starts from a small or empty set of features and tries to add more features to it.

Both Bennett and Blue [15] and Madzarov et al. [16] use decision trees to reduce the number of support vectors, in an effort to reduce the computational cost of executing an SVM classifier given the feature values in a multi-class setting. In these trees, each internal node is a binary-class SVM and the total number of support vectors is reduced by a log factor. Kumar and Gopal [14] also tried to


reduce the execution time by approximating an SVM decision boundary, using the SVM only in a subset of the leaves. The resulting structure of their classifier is similar to the classifiers produced by our algorithm. However, there is a fundamental difference: Kumar and Gopal's aim is to approximate a single SVM, which means that all the feature costs must be paid whenever the SVM is used at a leaf, and the accuracy at an SVM leaf is bounded by that of the original SVM. In ACE-Cost, the leaves are replaced by local SVM classifiers, which i) are more accurate for the local data samples that reach the specific node, and ii) use only a subset of the features, including the features on the path to the leaf. Therefore, our hybrid classifier approach not only reduces the test cost significantly, but also retains the possibility of achieving even better accuracy.

2.3 Cost Efficient Decision Trees

The decision tree is a versatile classifier that can be used in many applications. The leaves in a classical decision tree are nodes associated with labels, i.e. they predict decisions, while the internal nodes are feature nodes which split the data among their children. The problem of constructing a decision tree of minimum height or size for given training data has been proved to be NP-hard [19]. Therefore, most decision tree algorithms adopt a top-down approach, choosing the feature at each node using heuristics. The most popular heuristics in decision tree construction are the information gain (entropy gain), which is used in C4.5 [2], and the gini gain, which is the heuristic in CART [1]. The information gain is the decrease of entropy after using a feature f_i to split, and can be written as ΔI_{f_i} = H(C) − H(C|f_i). The heuristic in CART is similar, except that the measure of uncertainty is no longer the information entropy H = −Σ_i p_i log₂ p_i, but the Gini index defined as Gini = 1 − Σ_i p_i².

There are many existing works on constructing decision trees cost effectively. Most of them are variants of C4.5 trees in which the heuristic is replaced by a function of the entropy gain and the feature cost. The heuristic functions of the IDX [5], CS-ID3 [6], EG2 [7] and LASC [3] trees are listed in Table 1, where ΔI_i is the information gain of a feature and c(i) is the cost of the feature. Note that the LASC heuristic also takes the size of a node, freq, into consideration, and the choice of feature becomes less cost-sensitive at smaller nodes.

Table 1. Different Heuristics of Cost Efficient Trees

Tree Type:  CS-ID3        IDX          EG2                         LASC
Heuristic:  ΔI_i²/c(i)    ΔI_i/c(i)    (2^{ΔI_i} − 1)/(c(i)+1)^ω   ΔI_i/(freq^α·c(i) + (1 − freq^α))
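For concreteness, the split measure and the cost-sensitive scores of Table 1 can be written as in the sketch below; this is our own illustration, with ω and α as the tunable parameters mentioned above, and the exact forms follow our reading of the table.

import math
from collections import Counter

def entropy(labels):
    """H = -sum_i p_i log2 p_i over the class distribution of `labels`."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def info_gain(labels, left_labels, right_labels):
    """Information gain of a binary split: H(C) - H(C | f)."""
    n = len(labels)
    conditional = (len(left_labels) / n) * entropy(left_labels) \
                + (len(right_labels) / n) * entropy(right_labels)
    return entropy(labels) - conditional

# Cost-sensitive split scores of Table 1 (dI: information gain, c: feature cost).
def idx_score(dI, c):            return dI / c
def cs_id3_score(dI, c):         return dI ** 2 / c
def eg2_score(dI, c, omega=1.0): return (2 ** dI - 1) / (c + 1) ** omega
def lasc_score(dI, c, freq, alpha=1.0):
    # freq: fraction of the training data reaching the node; small nodes become
    # less sensitive to the feature cost.
    return dI / (freq ** alpha * c + (1 - freq ** alpha))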

2.4 Preprune and Postprune

When the training data is noise free, a decision tree constructed without pruning will fit the data better than any pruned tree. In practice, however, the data is usually noisy, and the resulting large tree suffers from overfitting, with poor


prediction accuracy. To solve this problem, pre-pruning or post-pruning is usually performed in decision tree induction to limit or reduce the size of the tree. Since the tree size is reduced, the average cost of feature acquisition is also reduced.

In pre-pruning, a predefined threshold on the smallest allowed leaf is established. Whenever the number of instances reaching a node is smaller than this threshold, a leaf node labeled with the majority class is made. The preprune process helps to prevent generating a tree with too many nodes. The disadvantage is that it is hard to set the leaf size constraint beforehand: if the threshold is too high, the resulting tree may not be sufficiently accurate, and accuracy that could be gained with further splits is sacrificed.

Postpruning is adaptive and, unlike prepruning, requires no prior knowledge to predefine a threshold. Instead, the tree is first grown to the utmost. In simple bottom-up post-pruning, sibling leaf nodes with the same label are recursively merged. Reduced-error pruning [8] is a procedure used in C4.5, where a similar process is executed on a set of validation data instances: if pruning a node with its two leaf children into a single leaf does not reduce the accuracy on the validation data, then the subtree is replaced by a single leaf node.

3 ACE-Cost Approach: The Hybrid Decision Tree with Local SVM Leaves

In this section, we present our ACE-Cost approach using cost efficient decision trees. The algorithm consists of three steps, which are also shown in Figure 1:

Step 1: Use a cost efficient tree heuristic to grow the sketch tree.

Step 2: Post-prune using the validation data, which is a portion of the training data we reserved. By judiciously pruning some leaves and replacing some decision leaf nodes by different SVMs, we not only reduce the test cost, but also achieve better accuracy.

Step 3: Perform look-ahead feature selection in each of the local SVM leaves generated in the post-pruning process.

The sketch below illustrates how the resulting hybrid classifier is applied to a new instance at prediction time.
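In this illustrative sketch (not the authors' code), features are acquired lazily along the root-to-leaf path, and at an SVM leaf the extra features selected for that leaf are also paid for. The node attributes used here are assumptions made for this example.

# Assumed node attributes: is_leaf(), feature, threshold, left, right, label,
# svm (None for a plain leaf), svm_features.

def predict_with_cost(node, x, cost):
    """Classify one instance; return (predicted label, acquisition cost paid)."""
    acquired, paid = set(), 0.0

    def acquire(j):
        nonlocal paid
        if j not in acquired:          # each feature is paid for at most once
            paid += cost[j]
            acquired.add(j)

    while not node.is_leaf():
        acquire(node.feature)
        node = node.left if x[node.feature] <= node.threshold else node.right

    if node.svm is None:               # plain decision-tree leaf
        return node.label, paid

    for j in node.svm_features:        # SVM leaf: pay for any extra selected features
        acquire(j)
    return node.svm.predict([[x[j] for j in node.svm_features]])[0], paid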

3.1 Decision Tree Sketch

ACE-Cost starts by building a cost-efficient decision tree. As discussed in Section 2, decision trees are inherently cost efficient, since there is no need to pre-compute all the feature values for all the instances; feature values are extracted only when required by some node along the root-to-leaf path. To improve the cost efficiency of decision trees even further, variants of the C4.5 heuristic have been proposed which take the feature extraction cost into consideration. Examples of such cost efficient trees include CS-ID3, IDX, EG2 and LASC, as described in Section 2.

In ACE-Cost, we experimented with all these cost efficient trees and compared their performance. Although any cost-efficient decision tree can be plugged into ACE-Cost, we suggest using the LASC or EG2 tree, because


Fig. 1. Three steps of ACE-Cost construction: i) build a cost efficient decision tree sketch; ii) postprune with local SVM leaf candidates; iii) cost-sensitive feature selection to add more features to the local SVMs built.

they generate more efficient trees in most cases and their heuristics are more flexible. Any newly developed cost efficient decision tree algorithm can easily be incorporated into the ACE-Cost structure.

Given the choice of an efficient decision tree algorithm, we grow the tree to the utmost. This implies that the resulting tree will be large and overfitted. We then perform the post-pruning process described in the next subsection. The reason we grow the tree without any prepruning and prefer postpruning is that we have no prior knowledge of the data, and postpruning can reduce the complexity of the hypothesis while maintaining the accuracy, given enough training and validation data.

During decision tree construction, ACE-Cost handles dependent costs by continuously bookkeeping the updated costs after choosing a feature. Therefore, after selecting one of a group of dependent features, the costs of the related features are recalculated and affect the future choice of features.

3.2 Postpruning with Local SVM

ACE-Cost utilizes the benefits of post-pruning, and adaptively chooses to replace tree structures with an SVM or a leaf node in a bottom-up manner. The procedure is depicted in Figure 1, and Algorithm 1 describes its details. More specifically, the postpruning process recursively works on two adjacent leaf nodes and their common parent, and considers replacing this substructure with a more efficient candidate as described in Algorithm 2.


Algorithm 1 PostpruneSVM(T, VAL)
Input: a pointer to the tree root T, and the validation data VAL that reaches T.
if T is a leaf node then
    return
else
    if isLeaf(T.leftChild) and isLeaf(T.rightChild) then
        CheckToPrune(T, VAL)   // CheckToPrune compares three candidate structures
        return
    else
        split the data VAL using the attribute at node T into VAL.left and VAL.right
        PostpruneSVM(T.leftChild, VAL.left)
        PostpruneSVM(T.rightChild, VAL.right)
        if isLeaf(T.leftChild) and isLeaf(T.rightChild) then
            CheckToPrune(T, VAL)
        end if
        return
    end if
end if
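A direct Python transliteration of Algorithm 1 might look as follows; the node interface (is_leaf, left, right, feature, threshold) and numpy arrays for the validation data are assumed for illustration, and check_to_prune corresponds to Algorithm 2 (a possible realization is sketched after Algorithm 2 below).

def postprune_svm(node, X_val, y_val):
    """Bottom-up postpruning: recurse into the children, then consider collapsing
    a parent whose children are both leaves (Algorithm 1)."""
    if node.is_leaf():
        return
    if node.left.is_leaf() and node.right.is_leaf():
        check_to_prune(node, X_val, y_val)     # compare the three candidate structures
        return
    # route the local validation data down the split at this node
    mask = X_val[:, node.feature] <= node.threshold
    postprune_svm(node.left, X_val[mask], y_val[mask])
    postprune_svm(node.right, X_val[~mask], y_val[~mask])
    if node.left.is_leaf() and node.right.is_leaf():   # children may have been pruned
        check_to_prune(node, X_val, y_val)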

The candidate structures are: i) the original structure; ii) a single leaf node labeled with the majority class; iii) an SVM trained with the features on the path from the root to the parent node, with respective accuracies A0, A1, and A2 calculated on the validation data. To prevent overfitting the SVM, we perform 5-fold cross validation and use this cross-validation accuracy for comparison with the first two accuracies. We then update the tree by replacing the substructure with the candidate that has the highest accuracy on the local validation set.

Note that if the SVM box is chosen as the new structure of the leaf node, no additional cost has been incurred, since only the features already available at the node are used to build the SVM. It may, however, be possible to improve the accuracy of the SVM further by adding more features to the SVM box. For this reason, we perform featureSelection as the third step of ACE-Cost, in which we start from all the currently available features and adaptively try to add more features to increase the accuracy. The feature selection process is discussed in the next subsection.

The advantage of our postpruning with SVM is threefold. First, by postpruning and deleting unnecessary leaves and internal nodes, the test cost is reduced. Second, we extend the decision tree leaf nodes into different local SVM classifiers, which further improves the discriminating power without incurring any additional cost, as the SVM boxes only use known features whose values have already been extracted along the path to the leaf. In addition, by deploying feature selection to explore and add more unextracted features to the SVM box (the featureSelection procedure), we are able to boost the accuracy even higher at the cost of some additional features.


Algorithm 2 CheckToPrune(T, VAL)
Calculate three accuracies A0, A1, and A2 using the local validation data VAL:
    A0: the original accuracy of T with its two leaf children;
    A1: the accuracy based on the majority class of the local validation data VAL;
    A2: the 5-fold cross-validation accuracy of the SVM built on VAL.
if A1 ≥ A0 and A1 ≥ A2 then
    delete T.leftChild and T.rightChild
    make T a leaf node labeled with the majority class at VAL
else if A2 > A0 and A2 > A1 then
    make T an SVM leaf using all the features available at T
    featureSelection(T, VAL)
end if
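One possible realization of CheckToPrune, using scikit-learn's SVC and 5-fold cross-validation as described above; the node attributes and mutator methods (path_features, classify, make_majority_leaf, make_svm_leaf) and the feature_selection hook are assumptions made for this sketch.

import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

def check_to_prune(node, X_val, y_val):
    """Keep the two-leaf subtree, collapse it into a majority-class leaf, or install a
    local SVM leaf, whichever is most accurate on the local validation data."""
    # A0: accuracy of the current subtree (node.classify is an assumed method)
    a0 = np.mean([node.classify(x) == y for x, y in zip(X_val, y_val)])

    # A1: accuracy of always predicting the local majority class (integer labels assumed)
    majority = np.bincount(y_val).argmax()
    a1 = np.mean(y_val == majority)

    # A2: 5-fold CV accuracy of an RBF SVM restricted to the features already on the path
    feats = sorted(node.path_features)                 # assumed attribute
    svm = SVC(kernel="rbf")
    a2 = cross_val_score(svm, X_val[:, feats], y_val, cv=5).mean()

    if a1 >= a0 and a1 >= a2:
        node.make_majority_leaf(majority)              # assumed mutator
    elif a2 > a0 and a2 > a1:
        node.make_svm_leaf(svm.fit(X_val[:, feats], y_val), feats)   # assumed mutator
        feature_selection(node, X_val, y_val)          # Section 3.3 hook (assumed)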

3.3 Feature Selection at Local SVM Leaves

After establishing an SVM leaf in the post-pruning process, the hybrid classifier further boosts the accuracy by attempting to add more features to aid the prediction. In the ACE-Cost approach, we start from the set of features already acquired along the path from the root node, and feature selection adds more features to this set. The process first chooses a new feature, or a new set of features, puts them into the SVM box, and obtains the new cross-validation accuracy using the validation data that reaches the node. The criterion is to choose the feature or set of features which maximizes the marginal utility defined as:

H = ΔAcc / (freq^α · ΔC + (1 − freq^α))   (4)

This criterion implies that we keep adding the feature(s) that maximize the marginal efficiency, while the sensitivity of adding a feature is tuned by the size of the node, freq. The process stops when the best marginal efficiency is low or a satisfactory accuracy is already achieved (e.g. the node is pure).

Note that the proposed feature selection procedure employs lookahead in order to account for features with cost dependencies. If the size of a set of dependent features is no larger than the number of lookahead steps, the cost dependency has sufficient scope to be reflected in the changes of the marginal accuracy/cost value.
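A greedy sketch of this selection loop with a configurable lookahead step; the marginal_cost callable stands in for the dependent-cost bookkeeping of Section 3.1, and the stopping threshold min_gain is an illustrative assumption.

from itertools import combinations
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

def leaf_feature_selection(X_val, y_val, selected, remaining, marginal_cost,
                           freq, alpha=1.0, steps=1, min_gain=1e-3):
    """Greedily add the feature set maximizing Equation (4).
    `selected` starts as the (non-empty) set of features already on the path;
    `marginal_cost(f, acquired)` returns the current cost of f given what is acquired."""
    def cv_acc(feats):
        return cross_val_score(SVC(kernel="rbf"), X_val[:, sorted(feats)], y_val, cv=5).mean()

    acc = cv_acc(selected)
    while remaining:
        best, best_h, best_acc = None, min_gain, acc
        for k in range(1, steps + 1):                    # lookahead of up to `steps` features
            for cand in combinations(remaining, k):
                acquired, d_cost = set(selected), 0.0
                for f in cand:                           # within-candidate dependencies count too
                    d_cost += marginal_cost(f, acquired)
                    acquired.add(f)
                new_acc = cv_acc(selected | set(cand))
                h = (new_acc - acc) / (freq ** alpha * d_cost + (1 - freq ** alpha))
                if h > best_h:
                    best, best_h, best_acc = set(cand), h, new_acc
        if best is None:                                 # no candidate is efficient enough
            break
        selected |= best
        remaining -= best
        acc = best_acc
    return selected, acc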

4 Experimental Results

We performed comparative experiments with ACE-Cost on three types of data: standard UCI datasets with nonuniform but constant feature costs, synthetic datasets with dependent costs, and a practical network flow type detection application. Every dataset is randomly split into ten folds, where 7 folds are used for training (including validation in the pruning process) and 3 folds are reserved for testing. The SVM kernel is the RBF kernel and its parameters are chosen using 5-fold cross validation. We compare the test cost and accuracy


of ACE-Cost with the best single SVM and with single decision tree approaches. The results show that, compared to a single SVM with high test cost and a single decision tree with low cost, ACE-Cost combines the best of both worlds: its accuracy is similar or even better, while its cost is much lower than that of SVM.

4.1 Performance Comparison on Standard Datasets

The experiments on the standard UCI datasets verify two expectations: comparable or even better accuracy than a single SVM, and low test cost. The five UCI datasets we picked have features with nonuniform costs; many of them are actual medical diagnosis problems. Detailed descriptions of these datasets and the feature costs are available at [9].
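For reference, the SVM baselines in these experiments use an RBF kernel with parameters chosen by 5-fold cross-validation; a minimal scikit-learn sketch of such a baseline is shown below, with an illustrative parameter grid of our own choosing.

from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

def fit_svm_baseline(X_train, y_train):
    """RBF-kernel SVM with C and gamma chosen by 5-fold cross-validation."""
    grid = {"C": [0.1, 1, 10, 100], "gamma": [0.01, 0.1, 1, 10]}   # illustrative grid
    search = GridSearchCV(SVC(kernel="rbf"), grid, cv=5)
    search.fit(X_train, y_train)
    return search.best_estimator_

# Usage sketch, following the 7/3 train/test split of the protocol above:
#   svm = fit_svm_baseline(X_train, y_train)
#   accuracy = svm.score(X_test, y_test)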

Fig. 2. Baseline performance: accuracy and cost of SVM, C4.5, LASC (α=1), and ACE-Cost hybrids with LASC, EG2, IDX and CS-ID3 on the Australia, Breast, Bupa-Liver, Heart and Thyroid datasets. The normalized average test cost is denoted by the radius of each point and accuracy is the y-value. The hybrid classifier consistently has better accuracy and low cost; hybridizing with LASC or EG2 always gives the smallest cost and best accuracy.

Among the 7 folds of training data, 5 folds are used to train the decision tree sketch and the remaining 2 folds are used in postpruning and SVM feature selection. All the experiments are repeated 10 times and the results are averaged. To


get an insight into the baseline performance, we compared ACE-Cost with standard C4.5, LASC, and SVM. We also used the different cost efficient trees (CS-ID3, IDX, EG2 and LASC) to compare their fitness and efficiency as the sketch structure of the hybrid SVM approach. The ω in EG2 and the α in LASC are both set to 1 for simplicity. The step size of the local SVM feature selection process is also set to 1, which means we adaptively add features one by one based on the marginal efficiency criterion. The feature selection process can also be extended by looking ahead, which is discussed later in this section.

With localized SVM nodes and an efficient decision tree sketch, ACE-Cost exhibited consistently better accuracy than any single SVM, while the test cost is much smaller and even comparable to simple decision trees. The detailed results are shown in Figure 2. ACE-Cost achieves the highest accuracy on all 5 datasets; four of the highest accuracies are obtained using the LASC structure, and the other one using EG2. In all cases, LASC and EG2 perform better than IDX and CS-ID3. The differences in test cost consumption are even more significant. With accuracy similar to the most accurate SVM, the test cost of ACE-Cost is around 40%-80% of that of the single SVM on all five datasets. The 'heart' dataset has the most significant test cost reduction, to 1/6 of the single most accurate SVM; this is possibly due to the highly non-uniform cost distribution of the features in the heart data. Also, on the Australia dataset, the hybrid approach reaches an accuracy significantly better than the original SVM.

Lookahead in feature selection is used to boost the accuracy even further, since providing the SVM leaves with additional informative features can help. However, the increased accuracy comes with the additional cost incurred by the new features, and it is important to choose features which do not incur large costs. In the feature selection process, instead of greedily choosing the single most efficient feature to add to the SVM, we can use lookahead and choose combinations of features. To get a more direct view of how the accuracy/cost performance changes with feature selection lookahead, we varied the lookahead step from 1 to 5. The results for each dataset are shown in Figure 3. It can be seen that the marginal efficiency does not increase much when the step size is larger than 3.

4.2 Synthetic Dataset

To verify that the lookahead in the SVM feature selection process handles features with dependent costs properly, we created synthetic datasets and conducted experiments with lookahead in the SVM feature selection step of ACE-Cost postpruning. The underlying functions of our synthetic datasets are real-valued LTFs (Linear Threshold Functions). We chose LTFs as the underlying functions because they are simple but also widely encountered in real applications. We randomly selected a set of variables from a pool of 50 variables X1, ..., X50. The weights on these variables are randomly generated from a uniform distribution on [0,1]. The cost of each variable is also uniformly drawn from [0.5,1].

Fig. 3. SVM feature selection with lookahead: accuracy versus lookahead step size on the Australia, Breast, Bupa-Liver and Heart datasets.

Each Xi takes a random value from [0,1]. After establishing the underlying truth function, 1000 examples are generated and labeled with it. We make the selected variables have dependent conditional costs. The dependency is manually set for disjoint pairs or triplets of variables. If a pair or a triplet of variables is chosen to have cost dependency, the rule is that once one of the correlated variables has been acquired, the cost of the remaining variables in the pair or triplet shrinks to 50% of its original value. Each variable is permitted to be involved in only one cost dependency relationship. For example, a generated LTF may be 0.3X1 + 0.5X7 − X10 + 2X2 + 1 > 0, with (X1, X2, X10) as a triplet of variables with dependent costs. The data generation and cost dependency rules are summarized in the sketch below.

The accuracy and cost of the proposed hybrid SVM with one- and two-step lookahead feature selection are shown in the bar graphs of Figure 4, as the number of variables and the number of cost dependent features increase. The trend on the synthetic data is straightforward: with more cost dependent features, the proposed hybrid SVM becomes increasingly more cost efficient than the single SVM approach (the cost actually decreases with more judicious feature selection). The reason is that our algorithm adaptively changes the cost vector of the remaining features, adjusting their costs according to the features that have already been acquired.
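A minimal sketch of this data generation and cost-dependency scheme; the median thresholding of the linear score (to balance the classes) and the specific group layout are our own illustrative choices.

import numpy as np

rng = np.random.default_rng(0)

def make_ltf_dataset(n_vars=15, n_groups=3, group_size=3, n_samples=1000, pool=50):
    """Synthetic LTF data with disjoint groups of cost-dependent variables."""
    chosen = rng.choice(pool, size=n_vars, replace=False)       # variables used by the LTF
    w = rng.uniform(0, 1, size=n_vars)                          # weights ~ U[0,1]
    cost = rng.uniform(0.5, 1.0, size=pool)                     # feature costs ~ U[0.5,1]
    groups = [set(chosen[i * group_size:(i + 1) * group_size].tolist())
              for i in range(n_groups)]                         # disjoint pairs/triplets

    X = rng.uniform(0, 1, size=(n_samples, pool))               # feature values ~ U[0,1]
    score = X[:, chosen] @ w
    y = (score > np.median(score)).astype(int)                  # median threshold balances classes (our choice)
    return X, y, cost, groups

def marginal_cost(j, acquired, cost, groups):
    """Once any variable of a dependent group is acquired, the rest cost 50% of the original."""
    for g in groups:
        if j in g and any(v in acquired for v in g):
            return 0.5 * cost[j]
    return cost[j]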

4.3 A Practical Application with Dependent Cost

We have also experimented with ACE-Cost in a practical network flow type detection application. A detailed description of the dataset is available in [3]. The goal is to classify network flow types using collected network packets. 88 continuous features are extracted from different sizes of flow buffers, and there are 8 classes such as TXT, multimedia and encrypted files. An interesting property of this dataset is that there are four groups of features which share a


Fig. 4. Accuracy and cost performance on synthetic LTF functions as cost dependencies increase (from 5 variables with 1 dependent pair up to 15 variables with 3 dependent triplets): single SVM versus hybrid SVM with one-step and two-step lookahead feature selection. Lookahead in SVM feature selection works better as cost dependencies increase.

significant portion of their costs among the group. The features in every group share an FFT transformation of the respective packet byte data, and therefore acquiring one feature in the group reduces the costs of the others. Moreover, the 88 features are all real valued, which gives an implicit advantage to the nonlinear local SVM leaves. The results are shown in Table 2. They show that ACE-Cost works well in practice when features with heavily dependent costs exist. In addition, the results confirm our expectation that the hybrid ACE-Cost can perform even better than SVM when sufficient feature selection is performed with lookahead.

Table 2. Experimental Results on Network Flow Detection

Classifier   SVM      ACE-Cost   ACE-Cost (1-Step Lookahead)   ACE-Cost (2-Step Lookahead)
Accuracy     88.75%   86.25%     87.75%                        89.00%
Cost         189.2    38.02      40.41                         55.23

5 Discussion and Future Work

The most attractive aspect of the proposed hybrid classifier is its high cost efficiency with accuracy still comparable to SVM. The ability to handle cost dependent features, and the property that the classifier becomes more efficient relative to a single SVM as the number of dependent features grows, make the hybrid classifier even more promising. The intuition of first sketching the decision boundary with a cost efficient decision tree and then drawing the fine boundaries with SVMs ensures satisfactory accuracy. The price we pay is that training is more complicated, and the lookahead feature selection time increases exponentially with the lookahead step size. The proposed approach is also data intensive, because the


critical postpruning part is essential to the performance of the classifier. Despite these constraints, the classifier performs satisfactorily in most scenarios, especially when most of the features are continuous and the cost dependencies are heavy. More theoretical justification remains to be done. It is also plausible to consider more elaborate decision criteria than univariate feature tests at the internal nodes. We leave this as future work.

6 Conclusion

In conclusion, we presented a hybrid classifier which fuses the desirable properties of cost efficient decision trees, reduced-error post-pruning, and SVM feature selection. Experimental results show that the proposed classifier has accuracy comparable to SVM, while its test cost is only 40%-80% of that of a single SVM using all the features. In addition, our classifier can handle features with dependent costs, and it performs consistently better as more of the features are continuous and the cost dependencies become heavier.

References

1. Breiman, L., Friedman, J.H., Olshen, R.A., Stone, C.J.: Classification and Regression Trees. Wadsworth & Brooks/Cole Advanced Books & Software, Monterey, CA (1984). ISBN 978-0412048418
2. Quinlan, J.R.: Bagging, boosting, and C4.5. In: Proceedings of the Thirteenth National Conference on Artificial Intelligence, pp. 725-730. AAAI Press (1996)
3. Li, L., Topkara, U., Coskun, B., Memon, N.: CoCoST: A Computational Cost Efficient Classifier. In: Proceedings of the 9th International Conference on Data Mining (ICDM 2009), Miami, FL, December 2009
4. Shanmugasundaram, K., Kharrazi, M., Memon, N.: Nabs: A System for Detecting Resource Abuses via Characterization of Flow Content Type. In: Annual Computer Security Applications Conference, pp. 316-325, December 2004
5. Tan, M., Schlimmer, J.C.: Two Case Studies in Cost-Sensitive Concept Acquisition. In: Proceedings of the Eighth National Conference on Artificial Intelligence (1990)
6. Tan, M.: Cost-sensitive learning of classification knowledge and its applications in robotics. Machine Learning 13, 7-33 (1993)
7. Nunez, M.: The use of background knowledge in decision tree induction. Machine Learning 6(3), 231-250 (1991)
8. Mansour, Y.: Pessimistic decision tree pruning based on tree size. In: Proceedings of the 14th International Conference on Machine Learning, pp. 195-201 (1997)
9. Murphy, P.M., Aha, D.W.: UCI Repository of Machine Learning Databases. University of California at Irvine, Department of Information and Computer Science (1994)
10. Kapoor, A., Greiner, R.: Learning and classifying under hard budgets. In: Machine Learning: ECML 2005, pp. 170-181 (2005)
11. Kolcz, A., Alspector, J.: SVM-based filtering of e-mail spam with content-specific misclassification costs. In: Proceedings of the Workshop on Text Mining (TextDM 2001) (2001)
12. Osuna, E., Freund, R., Girosi, F.: Training Support Vector Machines: an Application to Face Detection. In: Computer Vision and Pattern Recognition (CVPR 1997), pp. 130-136, June 1997


13. Abbes, T.: Protocol analysis in intrusion detection using decision tree. In: Proceedings of ITCC 2004, pp. 404-408 (2004)
14. Kumar, M.A., Gopal, M.: A hybrid SVM based decision tree. Pattern Recognition 43(12), 3977-3987 (2010). Elsevier Science Inc., New York, NY, USA
15. Bennett, K.P., Blue, J.A.: A Support Vector Machine Approach to Decision Trees. Math Report No. 97-100, Department of Mathematical Sciences, Rensselaer Polytechnic Institute (1997)
16. Madzarov, G., Gjorgjevikj, D., Chorbev, I.: A Multi-class SVM Classifier Utilizing Binary Decision Tree (2008)
17. Fei, B., Liu, J.: Binary Tree of SVM: A New Fast Multiclass Training and Classification Algorithm. IEEE Transactions on Neural Networks 17(3), 696-704 (2006)
18. Seewald, A.K., Petrak, J., Widmer, G.: Hybrid Decision Tree Learners with Alternative Leaf Classifiers: An Empirical Study. In: Proceedings of the 14th FLAIRS Conference, pp. 407-411. AAAI Press (2000)
19. Hyafil, L., Rivest, R.L.: Constructing Optimal Binary Decision Trees is NP-complete. Information Processing Letters 5(1), 15-17 (1976)
20. Karakoulas, G.J.: Cost-Effective Classification for Credit Decision Making Knowledge (1995)
21. Chen, Y.-W.: Combining SVMs with Various Feature Selection. National Taiwan University, Springer-Verlag (2005)
22. Esposito, F., Malerba, D., Semeraro, G.: A Comparative Analysis of Methods for Pruning Decision Trees. IEEE Transactions on Pattern Analysis and Machine Intelligence 19, 476-491 (1997)
23. Huang, J., Lu, J., Ling, C.X.: Comparing Naive Bayes, Decision Trees, and SVM with AUC and Accuracy. In: Third IEEE International Conference on Data Mining (ICDM 2003), Melbourne, Florida (2003)
