Fifth International Conference on Fuzzy Systems and Knowledge Discovery

Bagging One-Class Decision Trees

Yang Zhang
College of Information Engineering, Northwest A&F University, Yangling, Shaanxi Province, P.R. China, 712100
[email protected]

Chen Li
College of Information Engineering, Northwest A&F University, Yangling, Shaanxi Province, P.R. China, 712100
lichen [email protected]

Abstract

POSC4.5 is a one-class decision tree classifier with good classification accuracy that learns from both positive and unlabeled examples. To further improve the classification accuracy and robustness of POSC4.5, in this paper we ensemble POSC4.5 trees by bagging and classify testing samples by majority voting. Experimental results on 5 UCI datasets show that the classification accuracy and robustness of POSC4.5 can be improved by our approach.

1. Introduction

One-class classification is a special kind of classification task. One-class classifiers are trained to accept data from the target class and to reject outlier data [2], which differs from the traditional machine learning approach to pattern recognition. In contrast with normal classification problems, where one tries to distinguish between two (or more) classes of objects, one-class classification tries to describe one class of objects and distinguish it from all other possible objects. At present, one-class classification is widely used in many research fields, including document classification, texture recognition, image retrieval, and so on.

Generally speaking, it is rather expensive to obtain labeled training samples, while unlabeled data are easy to get. Hence, it is very helpful if we can learn from labeled and unlabeled data together. Denis et al. proposed POSC4.5, a one-class decision tree based on C4.5 [2], which learns from positive and unlabeled data and has good classification accuracy. In order to further improve the classification performance of POSC4.5, in this paper we ensemble POSC4.5 by bagging and classify testing samples by majority voting. The experimental results show that the classification accuracy and robustness of POSC4.5 can be improved by our approach.

The paper is organized as follows. Section 2 reviews the related work. The algorithm for ensembling POSC4.5 trees is described in Section 3. The detailed experiment settings and results are shown in Section 4, followed by our conclusion and future work in Section 5.

2. Related works

To the best of our knowledge, there is no prior work on ensembling one-class decision trees. Rätsch et al. proposed a boosting-like one-class classifier ensemble method [4]; Mazhelis et al. compared combining rules for one-class classifiers in a mobile-masquerader detection application; Giacinto et al. combined one-class classifiers for network intrusion detection [3]; and Perdisci et al. used an ensemble of one-class SVM classifiers to harden payload-based anomaly detection systems [7]. Tax et al. gave a theory of combining one-class classifiers by proposing 5 combining rules and analyzing the differences among them [1], which can be regarded as the theoretical basis of our work. Furthermore, Banfield et al. and Dietterich reported experimental comparisons of ensemble creation methods for decision trees, showing that ensembles can improve performance [6, 10], and Dietterich has shown that Multiple Classifier Systems (MCS) can improve classification performance in many applications [9]. As POSC4.5 is based on a successful decision tree algorithm, C4.5, the above works suggest that ensembling POSC4.5 trees could improve its classification performance.



3. Ensemble one-class decision trees

3.1. A brief introduction to POSC4.5

POSC4.5 is a top-down decision tree induction algorithm that learns from positive and unlabeled examples only. C4.5PosUnl is a version of C4.5 in which the statistical queries are estimated from positive and unlabeled examples [2]. Denis et al. described the differences between C4.5 and POSC4.5 as follows [2]:

1. POSC4.5 takes the following as input: (a) a set POS of positive examples together with a set UNL of unlabeled examples; (b) an estimate of the percentage of positive samples in the dataset.
2. For the current node, the entropy and gain are calculated in the one-class way.
3. The gain ratio is used instead of the information gain; the split information is calculated from the unlabeled examples.
4. The criterion for halting the top-down tree generation depends on an evaluation on the unlabeled data.

For the details of the algorithm, please refer to [2].
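To make the one-class estimation in item 2 concrete, the sketch below shows one plausible way a node's positive fraction and entropy could be estimated from positive and unlabeled counts plus the user-supplied positive percentage. The estimator and method names are our own illustrative assumptions, not the exact statistical-query formulas of POSC4.5; see [2] for those.

```java
// Illustrative sketch only: a rough one-class entropy estimate at a tree node,
// assuming the unlabeled set approximates the overall data distribution.
public final class OneClassEntropySketch {

    /**
     * @param posAtNode   number of positive training examples reaching the node
     * @param posTotal    total number of positive training examples
     * @param unlAtNode   number of unlabeled examples reaching the node
     * @param unlTotal    total number of unlabeled examples
     * @param posFraction user-supplied estimate of the fraction of positives in the data
     */
    static double estimatePositiveFraction(int posAtNode, int posTotal,
                                           int unlAtNode, int unlTotal,
                                           double posFraction) {
        if (unlAtNode == 0 || posTotal == 0) {
            return 0.0;
        }
        // P(node | positive) estimated from POS, P(node) estimated from UNL,
        // combined with the class prior via Bayes' rule and clamped to 1.
        double pNodeGivenPos = (double) posAtNode / posTotal;
        double pNode = (double) unlAtNode / unlTotal;
        return Math.min(1.0, pNodeGivenPos * posFraction / pNode);
    }

    // Binary entropy of the estimated positive fraction at the node.
    static double entropy(double p) {
        if (p <= 0.0 || p >= 1.0) {
            return 0.0;
        }
        return -p * log2(p) - (1.0 - p) * log2(1.0 - p);
    }

    private static double log2(double x) {
        return Math.log(x) / Math.log(2.0);
    }
}
```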

3.2. Ensemble POSC4.5 trees by bagging

In this paper, we ensemble POSC4.5 by bagging and classify testing samples by majority voting.

(1) Training algorithm. Bagging, short for bootstrap aggregating [8], is one of the earliest ensemble-based algorithms. It is also one of the most intuitive and simplest to implement, with surprisingly good performance [5]. Given a dataset D of |D| samples, for each iteration i (i = 1, 2, ..., k), bagging builds a new training set D_i of |D| samples by sampling randomly with replacement from the original dataset D, and then one base classifier e_i is built on D_i.

The training algorithm is described in Algorithm 1. Its input includes POS, the set of positive training samples, and UNL, the set of unlabeled samples. Random sampling with replacement is performed at steps 2 and 3; at step 4, a base POSC4.5 classifier is trained on the dataset obtained in the previous two steps.

Algorithm 1 Training Algorithm
Input: the set of positive training samples, POS; the set of unlabeled samples, UNL; the capacity of the ensemble, EnsembleSize.
Output: the ensemble of POSC4.5 trees, E.
1: for each i ∈ [1, EnsembleSize] do
2:   POS_i = RandomResampling(POS);
3:   UNL_i = RandomResampling(UNL);
4:   e_i = POSC4.5(POS_i, UNL_i);
5:   E = E ∪ {e_i};
6: end for
7: return E

(2) Classification algorithm. Majority voting is taken as our strategy for classifying testing samples with the classifier ensemble. Let t be the testing sample, vote_pos be the number of base classifiers in E which classify t into the positive class, and vote_neg be the number of classifiers in E which classify t into the negative class. The class label predicted by E is then determined by max(vote_pos, vote_neg). This classification procedure is illustrated in Algorithm 2.

Algorithm 2 Classification Algorithm
Input: the ensemble of POSC4.5 trees, E; the testing sample, t.
Output: the class label.
1: vote_pos = 0; vote_neg = 0;
2: for each e_i ∈ E do
3:   if e_i.classify(t) == +1 then
4:     vote_pos++;
5:   end if
6:   if e_i.classify(t) == −1 then
7:     vote_neg++;
8:   end if
9: end for
10: if vote_pos ≥ vote_neg then
11:   return +1;
12: else
13:   return −1;
14: end if
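As a concrete illustration of Algorithms 1 and 2, here is a minimal, self-contained Java sketch of the bagged training and majority-vote classification. The Example type and the PosC45 / PosC45Trainer interfaces are hypothetical stand-ins for whatever POSC4.5 implementation is plugged in (the authors used a WEKA-based Java implementation); they are not part of any published API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// Hypothetical placeholders for a sample type and a POSC4.5 base learner.
interface Example {}

interface PosC45 {
    int classify(Example t);   // returns +1 or -1
}

interface PosC45Trainer {
    PosC45 train(List<Example> pos, List<Example> unl);
}

public final class BaggedPosC45 {
    private final List<PosC45> ensemble = new ArrayList<>();
    private final Random rng = new Random();

    // Algorithm 1: build EnsembleSize base trees on bootstrap resamples of POS and UNL.
    public void train(List<Example> pos, List<Example> unl,
                      int ensembleSize, PosC45Trainer trainer) {
        for (int i = 0; i < ensembleSize; i++) {
            List<Example> posI = bootstrap(pos);
            List<Example> unlI = bootstrap(unl);
            ensemble.add(trainer.train(posI, unlI));
        }
    }

    // Algorithm 2: classify a testing sample by majority voting over the ensemble.
    public int classify(Example t) {
        int votePos = 0, voteNeg = 0;
        for (PosC45 tree : ensemble) {
            if (tree.classify(t) == +1) votePos++;
            else voteNeg++;
        }
        return (votePos >= voteNeg) ? +1 : -1;
    }

    // Random sampling with replacement, keeping the original set size.
    private List<Example> bootstrap(List<Example> data) {
        List<Example> sample = new ArrayList<>(data.size());
        for (int i = 0; i < data.size(); i++) {
            sample.add(data.get(rng.nextInt(data.size())));
        }
        return sample;
    }
}
```

Any base learner that accepts positive and unlabeled training sets can be plugged in through PosC45Trainer; the ensemble logic itself does not depend on POSC4.5 internals.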

4. Experimental results

In order to test the classification accuracy of the proposed approach, we ran experiments on 5 datasets from the UCI Machine Learning Repository. We implemented our algorithm in Java with the help of the WEKA software package (available at http://www.cs.waikato.ac.nz/ml/weka/). Our experiments were run on a PC with a Core 2 CPU, Windows XP, and 1GB of memory.

We ran two groups of experiments with different settings; we now report the experimental settings and results in detail.

4.1. Experiment A

Let us write POS for the positive dataset, UNL for the unlabeled dataset, and D for the whole dataset, so that D = POS ∪ UNL. Let |NEG_UNL| be the number of negative samples in UNL, |POS_UNL| the number of positive samples in UNL, |POS_ALL| the total number of positive samples in POS and UNL, and r the percentage of negative samples in UNL. We can then set the sizes of the POS and UNL datasets according to r by:

|UNL| = |NEG_UNL| / r;
|POS_UNL| = |UNL| − |NEG_UNL|;
|POS| = |POS_ALL| − |POS_UNL|.
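As a worked example (our own illustration, with rounding), consider kr-vs-kp from Table 1, which has 1527 positive and 1669 negative samples. Since POS contains only positive examples, all 1669 negatives fall into UNL, so with r = 90% we get |UNL| = 1669 / 0.9 ≈ 1854, |POS_UNL| = 1854 − 1669 = 185, and |POS| = 1527 − 185 = 1342.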

This group of experiments was run in a hold-out fashion by dividing the dataset into a POS set and an UNL set randomly, according to the formulas above. For each dataset, one class label was selected randomly as the positive class among all the class labels provided by the dataset, and the remaining class labels were all regarded as the negative class. We set the capacity of the ensemble to 20. For each ratio r, 100 experiments were run, and we report the averaged accuracy as the final result. We also checked the mean square deviation (MSD) to examine whether our classifier has better robustness. Furthermore, we used a t-Test on arcsin(Accuracy) to examine whether the improvement in classification accuracy of our ensemble approach over POSC4.5 is merely noise. Here, arcsin(Accuracy) denotes an arcsine transformation of the accuracy, which makes the classification accuracy values better fit a normal distribution.

The final results are reported in Table 1. In Table 1, column 1 lists the name of the UCI dataset; columns 2 and 3 list the numbers of positive and negative samples, with the number of attributes shown in column 4; column 5 gives the percentage r of negative samples in UNL; the following columns present the classification accuracy, MSD, and runtime of our ensemble approach and of a single POSC4.5 tree, respectively. Notice that the runtime of the ensemble is not 20 times that of a single POSC4.5 tree, because our algorithm reads the training and testing datasets only once. The last column shows the result of the t-Test, where "+" means the result is significant.

In dataset monks-problems-1, there are 278 positive samples and 278 negative samples, so we took 70% as the lowest r; if we chose an even smaller r value, the positive training dataset would be too small to generate a reasonable classifier, so the experimental results for those r values are omitted from Table 1. From Table 1, it is clear that our ensemble classifier has better classification accuracy and robustness than a single POSC4.5 tree.

Table 1. The results of experiment A

DataSet          | #Pos | #Neg | #Att | r(%) | Ensemble Acc(%) | Ensemble MSD | Ensemble Time(ms) | POSC4.5 Acc(%) | POSC4.5 MSD | POSC4.5 Time(ms) | t
kr-vs-kp         | 1527 | 1669 | 36   | 90   | 94.2 | 0.012 | 2061 | 90.3 | 0.070 | 287 | +
                 |      |      |      | 80   | 90.1 | 0.024 | 2328 | 87.1 | 0.053 | 290 | +
                 |      |      |      | 70   | 72.5 | 0.032 | 2974 | 66.5 | 0.047 | 302 | +
tic-tac-toe      | 332  | 626  | 9    | 90   | 85.6 | 0.020 | 662  | 83.4 | 0.030 | 161 | +
                 |      |      |      | 80   | 75.4 | 0.018 | 656  | 74.5 | 0.018 | 161 | +
spect            | 212  | 55   | 22   | 90   | 75.4 | 0.019 | 344  | 74.6 | 0.019 | 125 | +
                 |      |      |      | 80   | 66.8 | 0.065 | 359  | 61.5 | 0.064 | 136 | +
                 |      |      |      | 70   | 73.7 | 0.044 | 359  | 66.9 | 0.073 | 133 | +
                 |      |      |      | 60   | 75.2 | 0.032 | 391  | 68.6 | 0.066 | 141 | +
                 |      |      |      | 50   | 75.7 | 0.036 | 391  | 68.6 | 0.062 | 133 | +
                 |      |      |      | 40   | 75.9 | 0.042 | 383  | 69.2 | 0.055 | 141 | +
                 |      |      |      | 30   | 74.5 | 0.039 | 399  | 66.5 | 0.082 | 125 | +
monks-problems-1 | 278  | 278  | 6    | 90   | 94.8 | 0.023 | 396  | 89.2 | 0.097 | 137 | +
                 |      |      |      | 80   | 88.7 | 0.029 | 411  | 87.5 | 0.044 | 147 | +
                 |      |      |      | 70   | 71.9 | 0.027 | 390  | 68.4 | 0.057 | 137 | +
                 |      |      |      | 60   | 57.1 | 0     | 364  | 57.1 | 0     | 135 |
monks-problems-3 | 288  | 266  | 6    | 90   | 95.7 | 0.025 | 333  | 89.2 | 0.107 | 135 | +
                 |      |      |      | 80   | 96.7 | 0.016 | 343  | 90.4 | 0.086 | 141 | +
                 |      |      |      | 70   | 57.7 | 0.033 | 349  | 56.5 | 0.041 | 141 | +
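For readers who want to reproduce the significance test described above, the following is a minimal sketch, our own illustration rather than the authors' code, of a paired t-statistic computed on arcsine-transformed accuracies over the 100 repeated trials. The paired form and the common variance-stabilizing transformation arcsin(√acc) are assumptions on our part; the paper only states that a t-Test on arcsin(Accuracy) was used.

```java
// Illustrative sketch of a paired t-statistic on arcsine-transformed accuracies.
// Assumes accuracies are given in [0, 1] and both arrays come from the same trials.
public final class PairedTTestSketch {

    // Variance-stabilizing transformation for a proportion (an assumption, see above).
    static double arcsineTransform(double accuracy) {
        return Math.asin(Math.sqrt(accuracy));
    }

    // Paired t-statistic for ensemble vs. single-tree accuracies over repeated trials.
    static double pairedT(double[] ensembleAcc, double[] singleAcc) {
        int n = ensembleAcc.length;
        double[] diff = new double[n];
        for (int i = 0; i < n; i++) {
            diff[i] = arcsineTransform(ensembleAcc[i]) - arcsineTransform(singleAcc[i]);
        }
        double mean = 0.0;
        for (double d : diff) mean += d;
        mean /= n;
        double var = 0.0;
        for (double d : diff) var += (d - mean) * (d - mean);
        var /= (n - 1);
        return mean / Math.sqrt(var / n);
    }
}
```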


[Figure 1. The results of experiment B. Each panel plots classification accuracy (y-axis) against |POS| (x-axis) for the ensemble of POSC4.5 trees versus a single POSC4.5 tree, on (A) kr-vs-kp, (B) tic-tac-toe, (C) spect, (D) monks-problems-1, and (E) monks-problems-3.]

4.2. Experiment B

In this group of experiments, we followed the experimental setting in [2]. One third of the original dataset was sampled randomly as the testing dataset. The POS and UNL datasets were sampled randomly from the rest of the original dataset for training, and we kept the class distribution in POS ∪ UNL the same as that of the original dataset. We ran 50 trials for each experimental setting and report the averaged results in Figure 1.

Figure 1 presents our experiments on the datasets kr-vs-kp, tic-tac-toe, spect, monks-problems-1, and monks-problems-3 in panels (A) through (E), respectively. We changed the positive label in kr-vs-kp and spect. In Figures 1(A), 1(D), and 1(E), we set |UNL| = |POS|, stepping |POS| by 100 in (A) and by 50 in (D) and (E); in Figure 1(B), we set |UNL| = 2|POS| and stepped |POS| by 50; in Figure 1(C), we set |UNL| = 4|POS| and stepped |POS| by 10. From Figure 1, we can see that the classification accuracy of our approach is better than that of a single POSC4.5 tree.
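As an illustration of this hold-out protocol, here is a sketch of how one might split a dataset into a one-third test set and POS/UNL training sets whose union preserves the original class distribution. This is our own reading of the protocol, not the authors' code; the LabeledExample record is hypothetical, and the sketch assumes the pool contains enough positives and negatives for the requested sizes.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

// Hypothetical sample type carrying the true class label; for UNL the label is
// simply never shown to the learner.
record LabeledExample(boolean positive, Object features) {}

final class HoldOutSplitSketch {

    // Returns {POS, UNL, TEST}: one third of data goes to TEST; POS holds posSize
    // positives; UNL holds posSize + unlSize - posSize ... i.e. unlSize examples drawn
    // so that the class distribution of POS ∪ UNL matches the original dataset.
    // Assumes posSize <= positivesNeeded and the pool has enough examples of each class.
    static List<List<LabeledExample>> split(List<LabeledExample> data,
                                            int posSize, int unlSize, Random rng) {
        List<LabeledExample> shuffled = new ArrayList<>(data);
        Collections.shuffle(shuffled, rng);

        int testSize = shuffled.size() / 3;
        List<LabeledExample> test = new ArrayList<>(shuffled.subList(0, testSize));

        // Separate the remaining training pool by true label.
        List<LabeledExample> poolPos = new ArrayList<>();
        List<LabeledExample> poolNeg = new ArrayList<>();
        for (LabeledExample e : shuffled.subList(testSize, shuffled.size())) {
            (e.positive() ? poolPos : poolNeg).add(e);
        }

        // Number of positives POS ∪ UNL must contain to match the original positive fraction.
        double posFraction = data.stream().filter(LabeledExample::positive).count()
                             / (double) data.size();
        int totalTrain = posSize + unlSize;
        int positivesNeeded = (int) Math.round(posFraction * totalTrain);

        List<LabeledExample> pos = new ArrayList<>(poolPos.subList(0, posSize));
        List<LabeledExample> unl = new ArrayList<>();
        unl.addAll(poolPos.subList(posSize, positivesNeeded));        // hidden positives
        unl.addAll(poolNeg.subList(0, totalTrain - positivesNeeded)); // negatives
        Collections.shuffle(unl, rng);

        return List.of(pos, unl, test);
    }
}
```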

5. Conclusions and future work

In this paper, we proposed bagging one-class decision trees so as to improve their classification performance. The experimental results show that, compared with a single POSC4.5 tree, our ensemble-based approach has better classification accuracy and robustness. In the future, we plan to study other approaches for combining one-class base classifiers to further improve the classification performance of POSC4.5.

References

[1] D. M. J. Tax and R. P. W. Duin. Combining one-class classifiers. MCS 2001, pages 299–308, 2001.
[2] F. Denis, R. Gilleron, and F. Letouzey. Learning from positive and unlabeled examples. Theoretical Computer Science, 348:70–73, 2005.
[3] G. Giacinto, R. Perdisci, and F. Roli. Network intrusion detection by combining one-class classifiers. ICIAP 2005, pages 58–65, 2005.
[4] G. Rätsch, S. Mika, B. Schölkopf, and K. Müller. Constructing boosting algorithms from SVMs: an application to one-class classification. IEEE Transactions on Pattern Analysis and Machine Intelligence, 24(9):1184–1199, September 2002.
[5] L. Breiman. Bagging predictors. Machine Learning, 24:123–140, 1996.
[6] R. Banfield, L. Hall, K. Bowyer, and W. Kegelmeyer. A comparison of decision tree ensemble creation techniques. IEEE Transactions on Pattern Analysis and Machine Intelligence, 29(1):173–180, January 2007.
[7] R. Perdisci, G. Gu, and W. Lee. Using an ensemble of one-class SVM classifiers to harden payload-based anomaly detection systems. ICDM 2006, 2006.
[8] R. Polikar. Ensemble based systems in decision making. IEEE Circuits and Systems Magazine, 2006.
[9] T. G. Dietterich. Ensemble methods in machine learning. In Multiple Classifier Systems (MCS), 2000.
[10] T. G. Dietterich. An experimental comparison of three methods for constructing ensembles of decision trees: bagging, boosting, and randomization. Machine Learning, 40:139–157, 2000.
