A Combination Classification Algorithm Based on Outlier Detection and C4.5

ShengYi Jiang and Wen Yu
School of Informatics, Guangdong University of Foreign Studies, 510006 Guangzhou, Guangdong
{jiangshengyi,yuyu_588}@163.com

Abstract. The performance of traditional classifiers skews towards the majority class on imbalanced data, resulting in a high misclassification rate for minority samples. To solve this problem, a combination classification algorithm based on outlier detection and C4.5 is presented. The basic idea of the algorithm is to balance the data distribution by grouping the whole data into rare clusters and major clusters through the outlier factor. The C4.5 algorithm is then used to build decision trees on the rare clusters and on the major clusters respectively. When classifying a new object, the decision tree used for evaluation is chosen according to the type of the cluster to which the new object is nearest. We use datasets from the UCI Machine Learning Repository to perform the experiments and compare the results with other classification algorithms; the experiments demonstrate that our algorithm performs much better on extremely imbalanced data sets.

1 Introduction

A dataset is imbalanced if its classes are not equally represented and the examples of one class (the majority class) greatly outnumber those of the other (the minority class). Imbalanced data sets arise in many real-world domains, such as gene profiling, text classification, credit card fraud detection, and medical diagnosis. In these domains the ratio of the minority class to the majority class can be as drastic as 1 to 100 or 1 to 1,000, yet it is the minority class that we are really interested in, so a fairly high prediction accuracy is needed for the minority class. Traditional data mining algorithms, however, behave poorly on imbalanced data sets, since they tend to classify almost all instances as negative in order to maximize the overall prediction accuracy.

A number of algorithms for the imbalanced classification problem have been developed [1]; representatives are resampling methods [2],[3], boosting-based algorithms [4],[5], SVM-based methods such as kernel methods [6], and knowledge acquisition via information granulation [7]. However, the technique applied in previous work corrects the skewness of the class distribution in each sampled subset by over-sampling or under-sampling. Over-sampling may make the decision regions of the learner smaller and more specific, causing the learner to over-fit, while under-sampling inherently loses valuable information [3]. Many classifiers, such as the C4.5 tree classifier, perform well on balanced datasets but poorly on imbalanced ones. To remedy the drawbacks of random sampling, we introduce a combination classification algorithm based on outlier detection and C4.5, which balances datasets by clustering them instead of simply eliminating or duplicating samples. The basic idea of our algorithm is to combine one-pass clustering [8] with C4.5, building decision trees on the two relatively balanced subsets respectively to achieve better classification accuracy. The experimental results demonstrate that the presented algorithm performs much better than C4.5 on extremely imbalanced data sets.

2 A Clustering-Based Method for Unsupervised Outlier Detection

Some concepts of the clustering-based method for unsupervised outlier detection [8] are briefly described as follows:

Definition 1. For a cluster C and an attribute value $a \in D_i$, the frequency of $a$ in C with respect to $D_i$ is defined as: $Freq_{C|D_i}(a) = |\{object \mid object \in C,\ object.D_i = a\}|$.

Definition 2. For a cluster C, the cluster summary information (CSI) is defined as $CSI = \{kind, n, ClusterID, Summary\}$, where $kind$ is the type of the cluster C (with the value 'Major' or 'Rare'), $n$ is the size of the cluster C ($n = |C|$), $ClusterID$ is the set of class labels of the objects in cluster C, and $Summary$ holds the frequency information for the categorical attribute values and the centroid of the numerical attributes:
$Summary = \{\langle Stat_i, Cen \rangle \mid Stat_i = \{(a, Freq_{C|D_i}(a)) \mid a \in D_i\},\ 1 \le i \le m_C,\ Cen = (c_{m_C+1}, c_{m_C+2}, \ldots, c_{m_C+m_N})\}$.
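To make Definitions 1 and 2 concrete, the following sketch shows one possible in-memory representation of the CSI. The field names, the Python representation, and the incremental centroid update are our own illustrative assumptions, not part of the original algorithm.

```python
# A minimal sketch of the CSI structure (Definition 2); field names and the
# incremental centroid update are illustrative assumptions, not from the paper.
from collections import Counter
from dataclasses import dataclass, field
from typing import List, Set

@dataclass
class CSI:
    kind: str = "Major"                                  # 'Major' or 'Rare'
    n: int = 0                                           # cluster size |C|
    class_labels: Set[str] = field(default_factory=set)  # ClusterID
    freqs: List[Counter] = field(default_factory=list)   # Freq_{C|Di}(a) per categorical attribute
    centroid: List[float] = field(default_factory=list)  # mean of each numerical attribute

    @classmethod
    def from_object(cls, cat_values, num_values, label):
        """Build a singleton cluster from one object."""
        return cls(n=1, class_labels={label},
                   freqs=[Counter([v]) for v in cat_values],
                   centroid=[float(v) for v in num_values])

    def add(self, cat_values, num_values, label):
        """Merge one object into the cluster and update the summary."""
        for i, v in enumerate(cat_values):
            self.freqs[i][v] += 1
        for i, v in enumerate(num_values):               # incremental mean update
            self.centroid[i] += (v - self.centroid[i]) / (self.n + 1)
        self.class_labels.add(label)
        self.n += 1
```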

Definition 3. The distance between clusters $C_1$ and $C_2$ is defined as
$d(C_1, C_2) = \sqrt{\frac{\sum_{i=1}^{m} dif(C_i^{(1)}, C_i^{(2)})^2}{m}}$,
where $dif(C_i^{(1)}, C_i^{(2)})$ is the difference between $C_1$ and $C_2$ on attribute $D_i$. For categorical attributes,
$dif(C_i^{(1)}, C_i^{(2)}) = 1 - \frac{1}{|C_1| \cdot |C_2|} \sum_{p_i \in C_1} Freq_{C_1|D_i}(p_i) \cdot Freq_{C_2|D_i}(p_i) = 1 - \frac{1}{|C_1| \cdot |C_2|} \sum_{q_i \in C_2} Freq_{C_1|D_i}(q_i) \cdot Freq_{C_2|D_i}(q_i)$,
while for numerical attributes $dif(C_i^{(1)}, C_i^{(2)}) = |c_i^{(1)} - c_i^{(2)}|$. Since a cluster may contain only one object, Definition 3 also yields the distance between two objects and the distance between an object and a cluster.

Definition 4. Let $C = \{C_1, C_2, \ldots, C_k\}$ be the result of clustering the training data D, with $D = \bigcup_{i=1}^{k} C_i$ and $C_i \cap C_j = \emptyset$ for $i \ne j$. The outlier factor of cluster $C_i$, $OF(C_i)$, is defined as the power mean of the distances between cluster $C_i$ and the rest of the clusters:
$OF(C_i) = \frac{\sum_{j \ne i} d(C_i, C_j)^2}{k - 1}$.
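As a worked illustration of Definitions 3 and 4, the following sketch computes the mixed-attribute distance and the outlier factor, reusing the hypothetical CSI class sketched above; it assumes every cluster carries the same attribute layout.

```python
# A sketch of the cluster distance (Definition 3) and outlier factor
# (Definition 4), reusing the hypothetical CSI class sketched earlier.
import math

def dif_categorical(c1: CSI, c2: CSI, i: int) -> float:
    # 1 - (1 / (|C1|*|C2|)) * sum_a Freq_{C1|Di}(a) * Freq_{C2|Di}(a)
    overlap = sum(c1.freqs[i][a] * c2.freqs[i][a] for a in c1.freqs[i])
    return 1.0 - overlap / (c1.n * c2.n)

def distance(c1: CSI, c2: CSI) -> float:
    difs = [dif_categorical(c1, c2, i) for i in range(len(c1.freqs))]
    difs += [abs(a - b) for a, b in zip(c1.centroid, c2.centroid)]  # numerical part
    return math.sqrt(sum(d * d for d in difs) / len(difs))

def outlier_factor(clusters: list, i: int) -> float:
    # mean of the squared distances from cluster i to every other cluster
    return sum(distance(clusters[i], c) ** 2
               for j, c in enumerate(clusters) if j != i) / (len(clusters) - 1)
```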

The one-pass clustering algorithm employs the least-distance principle to divide the dataset. The clustering algorithm is described as follows, with a runnable sketch after the steps:


(1) Initialize the set of clusters, S, as the empty set, and read a new object p.
(2) Create a cluster with the object p.
(3) If no objects are left in the database, go to (6); otherwise read a new object p and find the cluster $C_1$ in S that is closest to p, i.e., a cluster $C_1$ in S such that $d(p, C_1) \le d(p, C)$ for all C in S.
(4) If $d(p, C_1) > r$, go to (2).
(5) Merge object p into cluster $C_1$, modify the CSI of cluster $C_1$, and go to (3).
(6) Stop.
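A minimal sketch of this loop follows, assuming the hypothetical CSI class and distance function from the earlier sketches, with objects given as (categorical values, numerical values, class label) triples.

```python
# A sketch of the one-pass clustering algorithm; the data layout is an
# assumption carried over from the earlier CSI sketch.
def one_pass_cluster(objects, r):
    clusters = []                                    # (1) S starts empty
    for cat, num, label in objects:
        point = CSI.from_object(cat, num, label)     # treat the object as a singleton
        if not clusters:
            clusters.append(point)                   # (2) first object seeds a cluster
            continue
        nearest = min(clusters, key=lambda c: distance(point, c))  # (3) least distance
        if distance(point, nearest) > r:
            clusters.append(point)                   # (4) too far from every cluster
        else:
            nearest.add(cat, num, label)             # (5) merge and update the CSI
    return clusters                                  # (6)
```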

3 Combination Classification Algorithm Based on Outlier Detection and C4.5

3.1 A Description of the Algorithm

The classification algorithm is composed of model building and model evaluation. Model building proceeds as follows:

Step 1 Clustering: Cluster the training set D to produce clusters $C = \{C_1, C_2, \ldots, C_k\}$.

Step 2 Labeling clusters: Sort the clusters so that $OF(C_1) \ge OF(C_2) \ge \cdots \ge OF(C_k)$. Search for the smallest $b$ that satisfies $\frac{\sum_{i=1}^{b} |C_i|}{|D|} \ge \varepsilon$ $(0 < \varepsilon < 1)$, and label clusters $C_1, C_2, \ldots, C_b$ 'Rare' and clusters $C_{b+1}, C_{b+2}, \ldots, C_k$ 'Major'.

Step 3 Building the classification model: Using the C4.5 algorithm, build a decision tree named RareTree on the clusters labeled 'Rare' and a decision tree named MajorTree on the clusters labeled 'Major'. Together, RareTree and MajorTree constitute the classification model of the whole data. If the clusters labeled 'Rare' or 'Major' contain only one class, no decision tree can be built for them; instead, that class is recorded as the default class of those clusters.

Model evaluation proceeds as follows: to classify a new object p, compute the distance between p and each of the clusters $C_1, C_2, \ldots, C_k$, and find the nearest cluster $C_j$. If $C_j$ is labeled 'Rare', the decision tree RareTree classifies p; otherwise MajorTree classifies p. If no decision tree could be built on the corresponding clusters, p is assigned their default class. A sketch of these steps is given below.
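The following sketch covers Steps 2 and 3 and the evaluation rule. Note that scikit-learn's DecisionTreeClassifier (CART) merely stands in for C4.5 here, and the feature-matrix layout and variable names are our assumptions.

```python
# A sketch of labeling (Step 2), model building (Step 3) and evaluation.
# DecisionTreeClassifier (CART) is a stand-in for C4.5, not the same algorithm.
# X, y are assumed NumPy arrays; cluster_of[i] is the cluster index of row i.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def label_clusters(clusters, eps):
    """Mark clusters 'Rare' by descending OF until they cover a fraction eps of D."""
    order = sorted(range(len(clusters)),
                   key=lambda i: -outlier_factor(clusters, i))
    total, covered = sum(c.n for c in clusters), 0
    for i in order:
        clusters[i].kind = "Rare" if covered / total < eps else "Major"
        if clusters[i].kind == "Rare":
            covered += clusters[i].n

def build_model(X, y, cluster_of, clusters):
    """Fit RareTree / MajorTree; a single-class side keeps a default label instead."""
    trees = {}
    for kind in ("Rare", "Major"):
        rows = [i for i, c in enumerate(cluster_of) if clusters[c].kind == kind]
        if len(set(y[rows])) == 1:
            trees[kind] = ("default", y[rows][0])    # no tree can be built
        else:
            trees[kind] = ("tree", DecisionTreeClassifier().fit(X[rows], y[rows]))
    return trees

def classify(point_csi, x_row, clusters, trees):
    nearest = min(clusters, key=lambda c: distance(point_csi, c))
    tag, model = trees[nearest.kind]
    return model if tag == "default" else model.predict(np.asarray([x_row]))[0]
```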

3.2 Factors for Classification Performance

① Selecting the neighbor radius r

We use a sampling technique to determine the neighbor radius r; the details are as follows:


(1) Randomly choose N0 pairs of objects from the data set D.
(2) Compute the distance between each pair of objects.
(3) Compute the mathematical expectation EX and the variance DX of the distances from (2).
(4) Select r in the range [EX − 0.5·DX, EX].

The experiments demonstrate that the results are stable when r lies in the range [EX − 0.5·DX, EX]. We set r to EX − 0.5·DX in the following experiments, as in the sketch below.
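A sketch of this procedure follows. The default N0 is our assumption, and DX is computed here as a standard deviation so that EX − 0.5·DX stays on the distance scale; the paper itself calls DX a variance.

```python
# A sketch of the sampling-based selection of the neighbor radius r.
# n0 and the use of a standard deviation for DX are assumptions.
import random

def select_radius(objects, n0=1000):
    pairs = [random.sample(objects, 2) for _ in range(n0)]       # (1) N0 random pairs
    dists = [distance(CSI.from_object(*a), CSI.from_object(*b))  # (2) pairwise distances
             for a, b in pairs]
    ex = sum(dists) / len(dists)                                 # (3) expectation EX
    dx = math.sqrt(sum((d - ex) ** 2 for d in dists) / len(dists))  # spread DX
    return ex - 0.5 * dx                                         # (4) r = EX - 0.5*DX
```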

② Selecting the parameter ε

ε can be set to twice the ratio of minority samples when prior knowledge about the data distribution is available. If there is no prior knowledge, or the data set is only moderately imbalanced, ε can be set to 30%.

4 Experimental Results

We use Recall, Precision, F-measure and Accuracy to evaluate our algorithm. If there is more than one minority class, Recall, Precision and F-measure are reported as the weighted averages over the minority classes. Eleven extremely imbalanced data sets and five moderately imbalanced data sets from the UCI machine learning repository [9] were chosen for the experiments. The data set KDDCUP99 contains around 4,900,000 simulated intrusion records, with a total of 22 attack types and 41 attributes (34 continuous and 7 categorical). As the whole dataset is too large, we randomly sampled a subset with 249 attack records (neptune, smurf) and 19,542 normal records. A summary of the 16 data sets is provided in Table 1.

Table 1. Summary of Datasets

Data set      Instance Size   Number of Features
Anneal        798             38
Breast        699             9
Car           1354            6
Cup99         19791           40
German        1000            20
Glass         214             9
Haberman      306             3
Hypothyroid   3163            25
Musk_clean2   6598            166
Mushroom      8124            22
Page-block    5473            10
Pendigits     3498            16
Pima          768             8
Satimage      2990            36
Sick          3772            29
Ticdata2000   5822            85

To make the comparison, we ran our algorithm, C4.5 and Ripper with 10-fold cross-validation on each data set. In our experiments, our algorithm is implemented as a C++ program; C4.5 and Ripper are used as implemented by J48 and JRip respectively in WEKA. A sketch of the evaluation protocol follows.
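The sketch below illustrates the protocol in scikit-learn; DecisionTreeClassifier is a CART stand-in for C4.5/J48 rather than the WEKA implementation, and the dataset loading and minority-label list are assumed.

```python
# A sketch of the evaluation protocol: 10-fold CV with weighted minority-class
# metrics. DecisionTreeClassifier stands in for C4.5/J48; data loading is assumed.
from sklearn.metrics import accuracy_score, precision_recall_fscore_support
from sklearn.model_selection import cross_val_predict
from sklearn.tree import DecisionTreeClassifier

def evaluate(X, y, minority_labels):
    """minority_labels: the class labels treated as minority classes."""
    pred = cross_val_predict(DecisionTreeClassifier(), X, y, cv=10)
    # weighted averages of the per-class scores over the minority classes only
    p, r, f, _ = precision_recall_fscore_support(
        y, pred, labels=minority_labels, average="weighted")
    return {"Recall": r, "Precision": p, "F-measure": f,
            "Accuracy": accuracy_score(y, pred)}
```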


4.1 Performance Comparison on Extremely Imbalanced Data Sets

The performance on the 11 extremely imbalanced data sets is reported in Table 2.

Table 2. The classification evaluation of minority classes on the extremely imbalanced data

Data set      Algorithm       Recall   Precision   F-measure   Accuracy
Anneal        C4.5            85.26    89.41       86.65       92.61
              Ripper          94.12    96.66       95.26       94.11
              Our algorithm   90.20    93.25       90.87       93.36
Car           C4.5            59.25    62.87       60.63       90.77
              Ripper          50.48    49.94       50.21       84.27
              Our algorithm   67.96    67.34       67.65       90.92
Cup99         C4.5            99.6     98.79       99.18       99.98
              Ripper          98.36    99.19       98.77       99.97
              Our algorithm   100      100         100         100
Glass         C4.5            41.03    52.31       45.7        67.29
              Ripper          43.6     53.58       46.86       66.82
              Our algorithm   58.97    59.04       58.65       66.36
Hypothyroid   C4.5            91.4     92.6        92          99.24
              Ripper          92.1     90.8        91.4        99.18
              Our algorithm   91.39    92.62       92          99.24
Musk_clean2   C4.5            89.4     90.3        89.8        96.88
              Ripper          84.9     90.6        87.6        96.3
              Our algorithm   92.53    93.63       93.08       97.88
Page-block    C4.5            81.46    84.54       82.84       96.88
              Ripper          84.31    84.47       84.35       97
              Our algorithm   84.11    84.49       84.19       97.06
Satimage      C4.5            49.1     50.9        50          86.12
              Ripper          52       62.8        56.9        86.15
              Our algorithm   49.46    52.67       51.02       85.08
Sick          C4.5            88.3     91.5        89.9        98.78
              Ripper          87.9     84.6        86.2        98.28
              Our algorithm   87.01    95.26       90.95       98.94
Ticdata2000   C4.5            0        0           0           93.97
              Ripper          1.1      23.5        2.2         93.87
              Our algorithm   1.33     25          2.53        93.38
Zoo           C4.5            64.71    75          69.19       92.08
              Ripper          64.71    76.99       62.12       89.11
              Our algorithm   76.47    74.51       74.78       92.08
Average       C4.5            68.14    71.66       69.63       92.24
              Ripper          68.51    73.92       69.26       91.37
              Our algorithm   72.68    76.16       73.25       92.21


4.2 Performance Comparison on Moderately Imbalanced Data Sets

The performance on the 5 moderately imbalanced data sets is reported in Table 3.

Table 3. The classification evaluation on the moderately imbalanced data

Data set    Algorithm       Recall   Precision   F-measure   Accuracy
Breast      C4.5            94.4     94.42       94.39       94.28
            Ripper          95.71    95.69       95.7        95.57
            Our algorithm   95.21    95.20       95.20       95.1
German      C4.5            70.36    68.55       69.01       70.3
            Ripper          70.84    68.45       68.91       70.8
            Our algorithm   72.63    71.17       71.57       72.53
Mushroom    C4.5            100      100         100         100
            Ripper          100      100         100         100
            Our algorithm   99.99    99.99       99.99       99.98
Pendigits   C4.5            94.2     94.22       94.2        94.17
            Ripper          93.76    93.8        93.77       93.74
            Our algorithm   94.88    94.91       94.89       94.86
Pima        C4.5            72.1     71.45       71.62       72.01
            Ripper          71.7     70.86       71.03       71.61
            Our algorithm   73.03    72.34       72.5        72.95
Average     C4.5            86.21    85.73       85.84       86.15
            Ripper          86.40    85.76       85.88       86.34
            Our algorithm   87.15    86.72       86.83       87.08

The experiments imply that the presented algorithm performs much better than C4.5 in terms of F-measure when the dataset is extremely imbalanced, while showing slight improvement or comparable results on moderately imbalanced data; it does not sacrifice one class for the other, but attempts to improve the accuracy of the majority as well as the minority class. Above all, for any degree of imbalance, our algorithm performs better than, or at least comparably to, C4.5.

4.3 Performance Comparison with Classifiers for Imbalanced Datasets

To compare against classifiers designed for imbalanced datasets, we conducted further experiments with the same evaluation setting as the kernel-based two-class classifier [6]. The performance for each data set is reported in Tables 4, 5 and 6 respectively. The results in Tables 4 to 6 imply that the algorithm presented in this paper performs better than the other classifiers for imbalanced datasets, such as k-nearest neighbor (1-NN and 3-NN), a kernel classifier and LOO-AUC+OFS, in terms of Accuracy, and is comparable in terms of Precision, F-measure and G-mean.

Table 4. Eight-fold cross-validation classification performance for the Pima data set

Algorithm                       Precision   F-measure   G-mean      Accuracy
KRLS with all data as centres   0.68±0.07   0.61±0.04   0.69±0.03   0.71±0.02
1-NN                            0.58±0.06   0.56±0.04   0.65±0.02   0.66±0.02
3-NN                            0.65±0.07   0.61±0.04   0.69±0.04   0.70±0.03
LOO-AUC+OFS(ρ=1)                0.70±0.09   0.63±0.05   0.71±0.03   0.72±0.03
LOO-AUC+OFS(ρ=2)                0.62±0.07   0.67±0.05   0.74±0.04   0.74±0.04
Our algorithm                   0.62        0.62        0.7         0.74

Table 5. Two-fold cross-validation classification performance for Haberman data set

Algorithm                       Precision   F-measure   G-mean      Accuracy
KRLS with all data as centres   0.63±0.07   0.41±0.05   0.54±0.04   0.61±0.03
1-NN                            0.36±0.01   0.38±0.02   0.50±0.02   0.56±0.01
3-NN                            0.30±0.07   0.22±0.04   0.38±0.04   0.51±0.03
LOO-AUC+OFS(ρ=1)                0.61±0.05   0.31±0.03   0.45±0.02   0.58±0.01
LOO-AUC+OFS(ρ=2)                0.51±0.02   0.44±0.06   0.57±0.05   0.63±0.03
Our algorithm                   0.44        0.52        0.68        0.71

Table 6. Ten-fold cross-validation classification performance for Satimage data set

Algorithm            Precision   F-measure   G-mean   Accuracy
LOO-AUC+OFS(ρ=1)     0.6866      0.5497      0.6689   0.7187
LOO-AUC+OFS(ρ=2)     0.5279      0.5754      0.7708   0.7861
Our algorithm        0.59        0.58        0.72     0.87

5 Conclusion and Future Work

In this paper, we present a combination classification algorithm based on outlier detection and C4.5, which groups the whole data into two relatively balanced subsets by outlier factor. As a result, the skewness of both the major clusters and the rare clusters is reduced, so the bias towards the majority class in the two decision trees built on them is alleviated. The experiments on UCI datasets imply that the performance of our algorithm is superior to C4.5, especially on extremely imbalanced data sets. In future research, other classification algorithms will replace C4.5 in the combination with clustering-based data division to solve the imbalanced classification problem.

Acknowledgments. This work is supported by the National Natural Science Foundation of China (No. 60673191) and the Natural Science Research Programs of Guangdong Province's Institutes of Higher Education (No. 06Z012).

References

1. Weiss, G.M.: Mining with Rarity: A Unifying Framework. SIGKDD Explorations 6(1), 7–19 (2004)
2. Marcus, A.: Learning when data sets are imbalanced and when costs are unequal and unknown. In: Proc. of the Workshop on Learning from Imbalanced Data Sets II, ICML, Washington DC (2003)


3. Liu, X.-Y., Wu, J., Zhou, Z.-H.: Exploratory Undersampling for Class-Imbalance Learning. IEEE Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics 39(2), 539–550 (2009)
4. Han, H., Wang, W.-Y., Mao, B.-H.: Borderline-SMOTE: A New Over-Sampling Method in Imbalanced Data Sets Learning. LNCS, pp. 878–887. Springer, Heidelberg (2005)
5. Guo, H., Viktor, H.L.: Learning from Imbalanced Data Sets with Boosting and Data Generation: The DataBoost-IM Approach. SIGKDD Explorations 6, 30–39 (2003)
6. Hong, X., Chen, S., Harris, C.J.: A Kernel-Based Two-Class Classifier for Imbalanced Data Sets. IEEE Transactions on Neural Networks 17(6), 786–795 (2007)
7. Su, C.-T., Chen, L.-S., Yih, Y.: Knowledge acquisition through information granulation for imbalanced data. Expert Systems with Applications 31, 531–541 (2006)
8. Jiang, S., Song, X.: A clustering-based method for unsupervised intrusion detections. Pattern Recognition Letters 5, 802–810 (2006)
9. Asuncion, A., Newman, D.J.: UCI Machine Learning Repository (2007), http://www.ics.uci.edu/~mlearn/MLRepository.html
