
A Novel Distance-Based Classifier Built on Pattern Ranking

Dipankar Bachar
Università degli Studi di Torino, Italy
[email protected]

Rosa Meo
Università degli Studi di Torino, Italy
[email protected]

ABSTRACT

Instance-based classifiers that compute similarity between instances suffer from the presence of noise in the training set and from overfitting. In this paper we propose a new type of distance-based classifier that, instead of computing distances between instances, computes the distance between each test instance and the classes. Both are represented by patterns in the space of the frequent itemsets. We rank the itemsets by metrics of itemset significance and then consider only the top portion of the ranking that leads the classifier to its maximum accuracy. We have experimented on a large collection of datasets from the UCI archive with different proximity measures and different metrics of itemset ranking. We show that our method has many benefits: it reduces the number of distance computations, improves on the classification accuracy of state-of-the-art classifiers, such as decision trees, SVM, k-nn, Naive Bayes, rule-based and association-rule-based classifiers, and outperforms the competitors especially on noisy data.

Categories and Subject Descriptors: H.2.8 [Database Applications]: Data Mining; I.2.6 [Learning]: Concept Learning

General Terms: Algorithms, Experimentation, Reliability

Keywords: instance-based learning, frequent itemsets

1. INTRODUCTION

In this paper we design a new distance-based classifier with the intention of overcoming some well known limitations of traditional Instance Based Learners (IBL). In general, IBLs such as Knn perform classification by directly using single training instances: as Figure 1 (left) shows, an IBL calculates the proximity between a test instance and each training instance in order to select the K nearest neighbors of the test instance (when no index or other optimization is available, all training instances are checked).


Majority voting is then used to assign a class label to the test instance [2, 6, 19]. The most widely used proximity measures are Euclidean distance and cosine similarity: with instances described by the values of d attributes, each instance is treated as a vector in a d-dimensional space and proximity is computed between the two vectors. These classifiers are simple and powerful, but some well known limitations of Knn are: (i) if there are many training instances, Knn requires many distance calculations as well; (ii) it suffers from over-fitting, because it relies too heavily on the training data for its predictions and is not able to generalize its model to new test data. Subsequent research [2] adopted reductionist approaches and edited versions, in which learning occurs incrementally; it also tried the generalization of training instances with prototypes, and so on. In summary, improvements mainly focused on a better usage of the storage space, on the use of similarity-based indexes and on attempts to make instance-based learners more robust to noise [17, 19]. Further research has tried to overcome these limitations with a different strategy: on one side, [7] joined instance-based learning with rule-based learning in a multi-strategy system; on the other side, classical methods that generate classification and decision rules by rule covering (such as RIPPER) have been compared with methods directly based on association rules and frequent patterns [4, 8, 13, 14]. In particular, [4] observed the power of using frequent patterns in classification by studying, with Fisher score and Information Gain, their direct relationship with the most discriminative and informative patterns of the classes. In addition, [8] decided to keep an abundance of rules, giving priority to frequent, accurate and discriminative rules. Finally, [16] uses frequent itemsets as the characteristic features of the class and adopts their conditional probability in the class with a Naive Bayes approach. The above observations and the stated limitations of Knn are the main motivation behind our approach. Similarly to [4, 16], we also keep an abundance of frequent patterns (in our case itemsets, frequent within each class), but we eliminate the less useful ones by first ranking the frequent itemsets with a measure of itemset evaluation and then retaining only the top part of the ranking. The class model is built as an ensemble of the single itemsets, which are weighted by their probability of occurring in instances of the class. The underlying idea is that the selected, highly significant frequent itemsets are collectively able to provide a reliable model of the class. As Figure 1 (right) shows, our learner is model-based: it computes the proximity between single test instances and the model of

each class. Thus we overcome the critical and risky decision of which representative instances to keep in order to represent the classes.

[Figure 1 schematic: on the left, an instance-based classifier computes distances between the test instance and each training instance; on the right, a model-based classifier computes distances between the test instance and each class model.]

Figure 1: Difference between our classifier and Knn.

We think that such a class model can be more reliable and robust against the risk of over-fitting, in which classifiers incur both as a consequence of instance selection and as a consequence of rule selection. In addition, it allows a reduction in the number of necessary distance computations w.r.t. IBLs. In Section 2 we discuss the proposed classifier in more detail and include the pseudo-code of the algorithm. In the experimental section we demonstrate that our classifier classifies better not only than K-nn, but also than many other learners, including rule-based and association-rule-based ones. As we discuss in later sections, we also experiment with, and report results on, our model-based classifier under different measures of itemset evaluation. With the same experiments we also aim to verify that the classification improvements are due not only to the mechanism of distance computation on class probabilistic models but also to the adopted measure of itemset selection. Finally, we draw conclusions.

2. MBL: THE MODEL-BASED LEARNER

Our new classifier is called Model-Based Learner (MBL). It first constructs a classification model for each class: a probabilistic description of the class, made of the set of itemsets that are frequent in the training instances of the class. Then, it uses the model of each class to predict the previously unknown class label of test instances by calculating the distance between the test instance and each class model. In Section 2.2 we discuss class models based on itemsets in more detail.

2.1 Benefits of the class model

The probabilistic description of the class is more robust to the presence of noise in some local portions of the dataset. Figure 2 exemplifies a noisy case, with training instances carrying a different class label inside a region characterized by certain features.

Figure 2: Misclassification of some examples by Knn due to the presence of noise.

By using a probabilistic description of the entire set of training examples belonging to the class, the most frequent features contribute more to reducing the distance of instances that share those features. Without this probabilistic model description, the presence of

noise in local portions of the training set could lead to wrong predictions for some test instances in the noisy region. Our main assumption is that the new classifier achieves two main benefits over Knn: 1) MBL builds a generalization model for each class, which is quite useful for withstanding noisy data; thus, differently from Knn, it is not a lazy learner. 2) MBL performs fewer distance computations, since it calculates the distance between a test instance and each class model, in contrast to the K nearest neighbor classifier, where a prediction requires the distances between a test instance and every single training instance.

2.2 Multi-dimensional Feature Space Determined by Itemsets

We construct the descriptive model of a class by means of the frequent itemsets extracted from the examples of that class. The model is probabilistic, since for each itemset we store the probability with which the itemset has been observed in the class. In this way the class model is also easily interpretable and self-explanatory. An itemset reveals hidden relationships among attribute values in the database [1]. In a dataset of examples, we can describe the regularities in the examples of the different classes by extracting frequent itemsets in which an item is an attribute-value pair. An itemset describes recurrent values in the instance attributes: for continuous attributes, it is very difficult to find an actually recurrent value, so continuous attributes are usually discretized into intervals. For them we used the same supervised discretization step described in [9] (it chooses the intervals that present a higher correlation with the target class).

Since the number of frequent itemsets is often very high, as we will see in Section 2.4, we perform a selection of the itemsets in order to: (1) reduce the computational workload, (2) identify the itemsets that are important for the characterization and prediction of each class, (3) select the right cardinality level of the itemsets. All possible frequent itemsets from a set of examples can be arranged in a lattice structure. Since the lattice is exponentially large in the number of items, and because itemsets at different levels of the lattice might have dependencies, in order to create the model of a class we select only the frequent itemsets of a certain level in the lattice. We call the itemsets at level l the l-itemsets. Each level of the lattice is a possible multi-dimensional space in which a class might be represented: a different choice of the level l gives rise to a different multi-dimensional space whose total number of dimensions is determined by the number of frequent l-itemsets in the examples of that class. We call this multi-dimensional space the feature space of the model, in which any l-itemset is a different feature. Any l-itemset occurring frequently in the examples of a class is one of the dimensions of this multi-dimensional space. The class model is represented as a vector in this space whose i-th component is the probability with which the corresponding itemset occurs in the examples of that class. Similarly, any test instance can be seen as a vector in the same multi-dimensional space. However, it is not probabilistic: it has a 1 or a 0 value in correspondence to each l-itemset of the feature space, chosen according to the presence or absence of the l-itemset in the instance. The class model vector construction is exemplified in Figure 3, taking into consideration the lattice level l = 2. In this simple example we refer to a dataset with boolean attributes, in which the possible items are only {A, B, C, D} and a test instance is described by the items ABC.
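As an illustration (not part of the original paper), the following is a minimal Python sketch of the vector construction described above, in the toy setting of Figure 3 (boolean items A-D, lattice level l = 2). The function names and the minsup value used in the example are our own assumptions.

from itertools import combinations
from typing import Dict, List

def frequent_l_itemsets(instances: List[set], l: int, minsup: float) -> Dict[frozenset, float]:
    # Return the l-itemsets whose relative frequency in the class examples is >= minsup,
    # together with that frequency (the probability stored in the class model).
    counts: Dict[frozenset, int] = {}
    for inst in instances:
        for combo in combinations(sorted(inst), l):
            key = frozenset(combo)
            counts[key] = counts.get(key, 0) + 1
    n = len(instances)
    return {itemset: c / n for itemset, c in counts.items() if c / n >= minsup}

def class_model_vector(model: Dict[frozenset, float], feature_space: List[frozenset]) -> List[float]:
    # i-th component = probability of the i-th l-itemset in the class.
    return [model.get(f, 0.0) for f in feature_space]

def instance_vector(instance: set, feature_space: List[frozenset]) -> List[int]:
    # Binary vector: 1 if the instance contains the l-itemset, 0 otherwise.
    return [1 if f <= instance else 0 for f in feature_space]

# Toy class in the spirit of Figure 3: items {A, B, C, D}, lattice level l = 2.
train_class = [{"A", "B", "C"}, {"A", "B", "D"}, {"A", "C", "D"}]
model = frequent_l_itemsets(train_class, l=2, minsup=0.3)
feature_space = sorted(model, key=sorted)                 # AB, AC, AD, BC, BD, CD
print(class_model_vector(model, feature_space))           # probabilities, e.g. 2/3 for {A, B}
print(instance_vector({"A", "B", "C"}, feature_space))    # [1, 1, 0, 1, 0, 0], as in Figure 3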

2.3 Construction and Operation of MBL

[Figure 3 schematic: with items {A, B, C, D} and lattice level l = 2, the feature space is given by all two-dimensional combinations of items, AB, AC, AD, BC, BD, CD; the class model vector is <P_AB, P_AC, P_AD, P_BC, P_BD, P_CD> and the test instance ABC is represented by the binary vector <1, 1, 0, 1, 0, 0>.]

Figure 3: Vector Creation for a Class and a Test Instance in the Two-Dimensional Feature Space.

For the prediction of the class of a test instance, MBL computes the distances between the test instance and each class in the multi-dimensional feature space, as exemplified by Figure 3. In our classifier we used association mining algorithms to generate the frequent itemsets, which we extract from the given training instances of the different classes. After a choice of the itemsets cardinality l, the multi-dimensional feature space of representation is selected and the l-itemsets are used for constructing the vectors of the respective classes and of each test instance. Then we calculate the distances between the test instance vector and the class vectors. The higher the probability of an itemset in one class, the higher the weight of that itemset in the class model and the lower its contribution to the distance from test instances that contain that itemset. The prediction of the class label for the test instance is made according to the result of this distance calculation. The prediction rule is quite simple: the class with the smallest distance (or highest similarity) is predicted as the class of the test instance.
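A minimal, self-contained Python sketch of this prediction rule, under our own assumptions: class models are given as dictionaries mapping each frequent l-itemset to its probability in the class, and Euclidean distance is used, although cosine or (extended) Jaccard similarity could be plugged in the same way.

import math
from typing import Dict

def mbl_predict(instance: set, class_models: Dict[str, Dict[frozenset, float]]) -> str:
    # class_models maps each class label to its probabilistic model,
    # i.e. {l-itemset: probability of the itemset in the class}.
    best_label, best_dist = None, float("inf")
    for label, model in class_models.items():
        # One distance per class, computed in the class-specific feature space.
        dist_sq = 0.0
        for itemset, prob in model.items():
            present = 1.0 if itemset <= instance else 0.0
            dist_sq += (prob - present) ** 2
        dist = math.sqrt(dist_sq)
        if dist < best_dist:
            best_label, best_dist = label, dist
    return best_label

# Hypothetical usage:
# models = {"c1": {frozenset("AB"): 0.9, frozenset("AC"): 0.7},
#           "c2": {frozenset("CD"): 0.8, frozenset("BD"): 0.6}}
# print(mbl_predict({"A", "B", "C"}, models))   # "c1", whose model vector is closer

Note that only one distance is computed per class model, which is where the reduction in distance computations w.r.t. Knn discussed in Section 2.1 comes from.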

2.4 Itemsets Selection

In the model construction for the classes, we want to consider a selection of the itemsets. This selection proceeds as follows: first we rank the itemsets by a measure of itemset evaluation; then we retain only the top portion of this ranking, where the fraction of the itemsets to be retained must be determined. Itemset selection is useful because it reduces the complexity of the class models (and thus over-fitting), it improves the models so that they characterize the classes effectively and predict correctly, and finally it reduces the computational workload of the distance computations. We have tried different mechanisms for itemset selection reported in the classification literature. Here, for the sake of space, we report results on some of them: classification accuracy (also known as confidence of class association rules), entropy, Kullback-Leibler divergence (based on relative entropy) and strong emerging patterns. (1) Classification accuracy is a classical measure in rule-based and association-rule-based classifiers that gives priority to the rules for which the lowest classification error is expected. It is employed in CBA [14], RIPPER [5], etc. (2) Δ is a measure proposed in [15] for the determination of the existing dependencies among the items in an itemset. It seems appropriate for determining the right level of specificity (cardinality) of the itemsets. Since a dependency among the items of an itemset can be 'inherited' from the existing dependencies among the items of its subsets, we

need to contrast intrinsic dependencies in the itemset with the inherited ones. Δ does this job: it is defined as the residual between the probability of an itemset and a referential probability, estimated in the condition of maximum entropy of the itemset. This condition can be considered as a generalized condition of independence of the subsets in the itemset. The higher Δ, the stronger the evidence that the existing dependency among the items is intrinsic to the itemset and not due to dependencies already present in its subsets. In contrast to [15], we normalize Δ by the itemset probability, since Δ is lower for low-probability itemsets. (3) Kullback-Leibler divergence [12] is defined instead as a relative entropy and measures the distance between two probability distributions (even if it is not symmetric). It has often been used in classification, both to measure the distance between the probability distribution of the observation and the probability distribution of the class [3] and with the aim of discriminating between two classes. (4) Strong jumping emerging patterns [8] constitute one of the latest proposals for classifiers based on itemsets. Similarly to our method and to Kullback-Leibler divergence, strong jumping emerging patterns are itemsets that occur with very different frequencies in the instances of two classes. Therefore, they are good candidates to discriminate between the classes and give a probabilistic indication of the class. We experimented with the above four measures to rank the itemsets of each class, to identify (at the top of the ranking) the most important ones for the class characterization task and to select the itemsets cardinality (feature space and lattice level l). Then, we perform our model-based classification in the reduced feature space. At the current stage of our work on the class models, for simplicity, we decided to keep itemsets of the same cardinality l for all the classes. In the same way, we retain the same percentage of itemsets for all the classes and adopted the same itemset evaluation measure. In future work, especially for unbalanced datasets, we could modify these choices from class to class. As regards the percentage of itemsets to be retained in the class models, we tested many selection thresholds on the itemset ranking. By decreasing the percentage of retained itemsets, at first we observed an improvement in the classification accuracy. This is a well-known effect of feature selection, due to the elimination of noisy features and the simplification of the feature space. Continuing to decrease the selection percentage, the classification error reaches a minimum and then starts to increase again, due to an over-simplification of the model. This behavior motivates us to employ Algorithm 1 to select the itemsets from the ranking. It takes as input Fi, the rankings of the frequent itemsets of the classes by decreasing value of significance (according to one of the four

cited measures), the class vector models Ci, and r, the percentage of reduction of the ranking at each iteration. The algorithm performs a validation and update of the class vectors on the test instances: it calls MBL in the test phase (Algorithm 3). The latter, on the basis of the class vector models built on Fi, makes class predictions. Then the classification error is updated and the resulting error is compared with the error obtained by MBL with the previous class models (built on the basis of the rankings stored in prevFi). When the reduction in error reaches convergence (a minimum) the procedure stops; otherwise it continues to eliminate r% of the itemsets at the bottom of the rankings.
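Algorithm 1 below gives the pseudo-code of this selection procedure. Purely as an illustration, the following Python sketch shows the same ranking-and-pruning loop under our own assumptions: confidence is used as the example ranking measure, and evaluate_error stands for a run of MBL over the test instances (the role played by MBL-test in the paper).

from typing import Callable, Dict, List

def confidence(itemset: frozenset, class_instances: List[set], all_instances: List[set]) -> float:
    # Measure (1): confidence of the class association rule "itemset -> class".
    covering_all = sum(1 for x in all_instances if itemset <= x)
    covering_class = sum(1 for x in class_instances if itemset <= x)
    return covering_class / covering_all if covering_all else 0.0

def select_itemsets(rankings: Dict[str, List[frozenset]],
                    evaluate_error: Callable[[Dict[str, List[frozenset]]], float],
                    r: float = 0.1,
                    err_tolerance: float = 0.001) -> Dict[str, List[frozenset]]:
    # Iteratively drop the bottom r fraction of every class ranking while the
    # misclassification error keeps improving by more than err_tolerance.
    # rankings maps each class to its itemsets, sorted by decreasing significance;
    # evaluate_error rebuilds the class vectors from the rankings and returns
    # the error measured on the test instances.
    prev_rankings, prev_error = rankings, 1.0     # 100% initial error
    current = rankings
    while True:
        error = evaluate_error(current)
        if prev_error - error > err_tolerance:
            prev_rankings, prev_error = current, error
            current = {c: items[: max(1, int(len(items) * (1 - r)))]
                       for c, items in current.items()}
        else:
            return prev_rankings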

Algorithm 1 Itemsets Selection Algorithm.
1: Input: D test set
2: Input: all Fi, rankings of frequent itemsets from class Ci
3: Input: Ci class models
4: Input: r, percentage of ranking reduction at each iteration
5: Output: reduced Fi based on misclassification error
6: exit-Loop = FALSE
7: prevFi = Fi // temp best Fi
8: prevCi = Ci // temp best Ci
9: prevError = 100%
10: initialize to empty the misclassification matrix M
11: // Test class models by calling MBL on test set
12: REPEAT
13:   error = 100%
14:   FORALL j in D:
15:     MBL-test(j, all Fi, all Ci)
16:     Update error
17:   END FOR
18:   IF (prevError - error) > errTolerance THEN
19:     prevError = error
20:     prevFi = Fi
21:     prevCi = Ci
22:     eliminate from Fi the bottom part equal to r%
23:     build class vector Ci based on Fi
24:   ELSE exit-Loop = TRUE
25: UNTIL exit-Loop
26: return prevFi, prevCi

2.4.1 Selection of itemsets cardinality

The choice of the itemsets cardinality is a critical factor. Setting the cardinality too high yields only a small number of frequent itemsets; setting it too low yields too many itemsets, which are not able to capture a sufficient number of recurrent attributes.

We propose and experimented with two strategies. (1) The value of the cardinality l can be determined from dataset to dataset, with a wrapper approach, by cross-validation: the performance of the classifier is evaluated under different feature spaces (itemsets of different cardinalities), and the best performing feature space is determined by the cardinality of the itemsets at which the classifier reaches the highest accuracy. Unfortunately, this strategy has a high computational workload. (2) The second strategy selects the cardinality l by observing the most frequent cardinality among the itemsets at the top of the ranking. This second strategy requires only the extraction of the frequent itemsets and their ranking by a measure of itemset significance. We experimented with both strategies and saw, over several experiments, that the two methods almost always agree.
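A small sketch of strategy (2), under the assumption that the itemsets are already sorted by decreasing significance; the fraction of the ranking inspected (10% here) is an illustrative choice, not a value taken from the paper.

from collections import Counter
from typing import List

def most_frequent_cardinality(ranked_itemsets: List[frozenset], top_fraction: float = 0.1) -> int:
    # Strategy (2): choose the cardinality l that occurs most often among
    # the itemsets at the top of the significance ranking.
    top_k = max(1, int(len(ranked_itemsets) * top_fraction))
    lengths = Counter(len(itemset) for itemset in ranked_itemsets[:top_k])
    return lengths.most_common(1)[0][0]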

2.4.2 Selection of the remaining parameters

Another critical parameter is the minimum support threshold minsup of the frequent itemset mining algorithm. We saw that letting minsup decrease improves the classification performance, but it causes the number of itemsets to increase significantly and makes the itemset selection phase heavier. We tried different values of minsup in the experiments. In Section 3 we report the support threshold adopted case by case, chosen as a tradeoff between the accuracy improvement and the volume of the itemsets (capped at a maximum of 8 thousand). As a general rule of thumb, in order to determine a suitable range of values of minsup for the extraction of frequent itemsets from the training examples of the different classes, we set the initial value of minsup as a function of the number of attributes of the dataset (see the table in Figure 4).

N. of attributes    range of values of minsup
≤ 10                0.01 – 0.1
10..20              0.1 – 0.3
≥ 20                0.3 – 0.8

Figure 4: minsup as a function of the number of attributes.
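The rule of thumb of Figure 4 can be expressed as a small helper; the handling of the boundary values 10 and 20 is our own assumption.

from typing import Tuple

def initial_minsup_range(n_attributes: int) -> Tuple[float, float]:
    # Initial minsup range as a function of the number of attributes (Figure 4).
    if n_attributes <= 10:
        return (0.01, 0.1)
    elif n_attributes < 20:   # the 10..20 band; the exact boundary handling is a guess
        return (0.1, 0.3)
    return (0.3, 0.8)         # 20 attributes or more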

We have also experimented with four different proximity measures: 1) Euclidean, 2) Cosine, 3) Jaccard and 4) Extended Jaccard. We noticed that they were not a critical factor: all of them gave almost the same accuracy. In particular, Euclidean distance and Cosine similarity perform equally well and give the best performance. The pseudo-code of our model-based classification algorithm is given as Algorithm 2. It uses cross-validation (with ten folds) for the computation of the classification error (lines 6–19). From line 8 to line 13 it builds the class vector of each class Ci: it takes as the vector components the probabilities of occurrence of the itemsets of the ranking Fi in the training instances of Ci. From line 14 to line 18 it makes predictions for the test instances by calling Algorithm 3 (MBL-test). The latter iterates over each class Ci and generates the vector of

a test instance j, taking into consideration the feature space Fi of the class. Then it computes the distance between the class vector and the instance vector. Finally, the class prediction will be the class whose distance to the test instance is minimum.

Algorithm 2 MBL: Class Model Learning Algorithm.
1: Input: D dataset
2: Input: minsup frequency threshold for itemsets
3: Output: Fi, for i = 1 to Number of classes
4: Output: M Misclassification Matrix
5: initialize to empty the misclassification matrix
6: for all Fold-num = 1 .. TotFolds do
7:   Divide dataset D into Test set (take fold = Fold-num) and Training set (D \ Test set)
8:   // Build class models
9:   FORALL class Ci in Training set:
10:    // Extract set of frequent itemsets in examples of class Ci
11:    Fi = FIMI-algo(Training set, minsup)
12:    build class vector Ci based on Fi
13:  ENDFOR
14:  // Make predictions
15:  FORALL instance j in Test set:
16:    MBL-test(j, all Fi, all Ci)
17:    Update Misclassification matrix M
18:  ENDFOR
19: end for
20: return all Fi, M

Algorithm 3 MBL-test: Class Prediction.
1: MBL-test(test instance j, all Fi, all Ci)
2: Input: test instance j
3: Input: all Fi, for i = 1 to Number of classes
4: Input: Ci class models
5: Output: class prediction
6: extract itemsets from j
7: FORALL class Ci:
8:   build instance vector j on the feature space Fi
9:   disti = distance between Ci and j
10: ENDFOR
11: Output for instance j the class Ci: argmin_i {dist_i}
12: return Ci
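Again purely as an illustration, a compact Python sketch of the ten-fold loop of Algorithm 2; build_model and predict are injected so that the hypothetical helpers sketched earlier (frequent itemset extraction and MBL prediction) can be plugged in.

import random
from typing import Callable, Dict, List, Sequence, Set, Tuple

Instance = Set[str]
Model = Dict[frozenset, float]

def cross_validate(dataset: Sequence[Tuple[Instance, str]],
                   build_model: Callable[[List[Instance]], Model],
                   predict: Callable[[Instance, Dict[str, Model]], str],
                   n_folds: int = 10, seed: int = 0) -> float:
    # Ten folds, as in Algorithm 2: build one probabilistic model per class on the
    # training folds, predict every test instance, return the overall error rate.
    data = list(dataset)
    random.Random(seed).shuffle(data)
    folds = [data[i::n_folds] for i in range(n_folds)]
    errors = 0
    for k in range(n_folds):
        test = folds[k]
        train = [example for i, fold in enumerate(folds) if i != k for example in fold]
        models = {label: build_model([x for x, y in train if y == label])
                  for label in {y for _, y in train}}
        errors += sum(1 for x, y in test if predict(x, models) != y)
    return errors / len(data)

# Hypothetical usage with the helpers sketched earlier:
# err = cross_validate(data, lambda insts: frequent_l_itemsets(insts, l=2, minsup=0.1), mbl_predict)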

3. EXPERIMENTAL EVALUATION

We have performed classification experiments with our model-based classifier on several datasets from the Machine Learning Repository, maintained by UCI as a service to the machine learning community (http://archive.ics.uci.edu/ml/). We compared the classification performance of our classifier with many well known classification algorithms: Knn [2], J48 (an implementation of decision trees) [18], Naive Bayes [18], SVM [10], and CBA [14], RIPPER [5] and decision tables [11] as representative learners for conjunctive rule-based classifiers. We used the implementations of these classifiers available in Weka (http://www.cs.waikato.ac.nz/ml/weka/), a collection of machine learning algorithms for data mining tasks.

The overall results of our experiments are presented in Figure 6. Our classifier is included in four versions, according to the method of itemset evaluation chosen to rank the itemsets: accuracy/confidence (indicated by Conf), normalized Δ, KL divergence and strong jumping emerging patterns (SJEP). This distinction is necessary in order to clearly determine whether the superior performance of our classifier is due to the itemset selection method or to the adoption of the mechanism of distance-based computation on a class probabilistic model. It turns out that all the methods of itemset ranking are beneficial and give good results in comparison with the competitors. Nevertheless, normalized Δ is preferable to the other measures for itemset selection, since it allows the classifier to obtain higher or equal accuracy with a lower number of itemsets. Figure 6 also contains some important information about our classifier for each experiment, such as the frequency threshold used for the itemsets, the itemset length (which determines the multi-dimensional feature space) and the percentage of retained features. We performed a careful and extensive tuning of the relevant parameters of each competitor learner, as shown in the table of Figure 5. For each learner and each dataset we reported the best result obtained.

learner     parameter                                  range of values         step
J48         pruning confidence                         0.05 – 0.5              0.05
J48         instances per leaf                         2 – 10                  1
DTABLE      n. folds cross-valid.                      1 – 10                  1
DTABLE      perf. eval. measure                        [acc, rmse, mae, auc]   1
NB          no parameters needed                       –                       –
SVM (SMO)   complexity C                               -5 – 5                  0.25
SVM (SMO)   polyKernel exp.                            1 – 3                   0.5
RIPPER      n. folds for REP (1 fold as pruning set)   1 – 10                  1
RIPPER      min inst. weight in split                  0.5 – 5                 0.5
RIPPER      n. optimiz. runs                           1 – 10                  1
KNN         k                                          1 – 10                  1
CBA         minsup                                     0.01%                   –
CBA         minconf                                    50%                     –

Figure 5: Tuning of the learners' parameters.

In the tables the best results are those with the lowest classification error in test. We can notice that, apart from rare exceptions, our model-based learner outperforms the other learners. As regards the computational workload, in comparison with Knn, we saw that the total number of distance computations performed by MBL (given by the number of itemsets multiplied by the number of classes) and by an IBL like Knn (given by the number of

instances multiplied by the number of attributes) is always much lower for MBL (by one or two orders of magnitude, depending on the dataset). This comparison clearly shows that MBL is superior. One of the believed benefits of a model-based classifier is that it is supposed to be more robust to the local presence of noise in the data. In order to verify this claim concretely, we experimented on the same datasets with a variable amount of added noise. We added noise to the above datasets in the form of a random change of the class label, varying the percentage of noisy instances from 5% to 20%. Figure 7 shows the classification results for the two extreme cases (5% and 20% of noise) only. These experiments clearly show that MBL outperforms both Knn and all the other classifiers, and the improvement increases especially with an increasing amount of noise. These good results highlight some important conclusions: 1) our model-based classifier is a valid one, which performs better than state-of-the-art classifiers such as SVM, CBA, RIPPER, decision tables, Knn, decision trees and Naive Bayes. 2) If we compare the four versions of our MBL with the other rule-based classifiers, like CBA and RIPPER, that adopt the same measure for rule selection (rule accuracy), our learner outperforms them. This is a clear indication that model-based classification is more robust than rule-based classification. 3) Ranking the itemsets by the normalized Δ provides an effective feature selection method: the classification error is often lower than with the other measures and it requires fewer itemsets. 4) MBL is more robust w.r.t. the presence of noise.

4. CONCLUSIONS

In this paper we have proposed a new distance-based classifier which is model-based. The class model is built on the frequent itemsets extracted from the training instances of each given class. The adoption of a model-based distance is an advantage because it reduces the chance of model overfitting and is more robust w.r.t. the presence of noise. We have validated this claim with several experiments on many UCI datasets, with and without noise, in which we have shown that our model-based classifier outperforms traditional IBLs such as Knn and other state-of-the-art classifiers, like decision trees, SVM, NB and rule-based classifiers. We have also experimented with different techniques of itemset ranking: accuracy, KL divergence, strong emerging patterns and an entropy-based measure (normalized Δ). Then we reduced the rankings with a wrapper approach. From the experiments, the number of retained itemsets goes down, in the best case, to 67% of the itemsets at a given lattice level l. The experiments showed that the observed good performance of MBL is due not only to the mechanism of model-based distance computation but also to the effectiveness of Δ.

Acknowledgements: We thank Dino Ienco for the extensive tuning of the competitor learners.


Dataset Name (in order): Analcatdata-Bankruptcy, Analdata-cyyoung8092, Analcatdata-Creditscore, Analcatdata-Lawsuit, BioMed, Bupa, Credit-a, Diabetes, Haberman, Horse, HD, Hepatitis, Heartstatlog, Monks1, Mushroom, Prnsynth, Titanic, Vote, Wisconsin-Breast-Cancer
Itemset length: 3, 3, 4, 3, 5, 3, 3, 3, 1, 3, 3, 3, 3, 5, 4, 2, 2, 4, 4
Frequency threshold: 0.01, 0.05, 0.01, 0.01, 0.02, 0.05, 0.3, 0.05, 0.01, 0.4, 0.2, 0.3, 0.1, 0.05, 0.5, 0.01, 0.01, 0.2, 0.05
Our classifier (Δ), test error: 6, 13.36, 1, 1.05, 9, 31.05, 12, 17.66, 22.73, 15.7, 15.12, 11.35, 12.37, 44.67, 0, 12.4, 20.1, 3, 2.27
Our classifier (Δ), % of retained itemsets: 100%, 75%, 100%, 85%, 80%, 95%, 73%, 77%, 93%, 86%, 67%, 70%, 73%, 100%, 77%, 92%, 87%, 83%, 77%
Our classifier (Confidence), test error: 6, 17.7, 1, 1.1, 13.7, 36.6, 12.7, 19.3, 24.8, 17.9, 16.38, 16.3, 17.7, 44.67, 1.7, 12.69, 21.7, 4.1, 4.39
Our classifier (Confidence), % of retained itemsets: 100%, 80%, 100%, 90%, 91%, 100%, 75%, 90%, 90%, 69%, 85%, 92%, 85%, 100%, 83%, 90%, 91%, 92%, 89%
Our classifier (K-L), test error: 6, 22.22, 1, 1.15, 12.5, 35.35, 14.49, 24.48, 23.33, 16.6, 17.21, 13.33, 22.22, 44.67, 0.24, 12.69, 21.2, 5.7, 2.89
Our classifier (K-L), % of retained itemsets: 100%, 100%, 100%, 100%, 100%, 100%, 95%, 100%, 95%, 90%, 100%, 90%, 85%, 100%, 80%, 100%, 100%, 95%, 80%
Our classifier (SJEP), test error: 6, 22.22, 1, 1.15, 12.5, 36.8, 14.49, 24.48, 23.33, 16.6, 17.21, 13.33, 22.22, 44.67, 0.29, 12.69, 21.2, 5.1, 2.89
Our classifier (SJEP), % of retained itemsets: 100%, 100%, 100%, 100%, 100%, 100%, 95%, 100%, 95%, 90%, 100%, 90%, 85%, 100%, 80%, 100%, 100%, 90%, 80%
Knn, test error: 6, 14.43, 1, 1.13, 9.56, 36.8, 14.05, 22.26, 23.20, 22.66, 17.82, 11.61, 16.29, 50, 0, 12.8, 21.08, 7.12, 3
J48, test error: 10, 16.49, 1, 1.13, 10.52, 36.30, 12.31, 21.22, 21.56, 16.66, 20.79, 16.77, 16.66, 50, 0, 12.8, 20.94, 3.67, 4
NB, test error: 6, 20.61, 1, 1.13, 7.17, 36.8, 13.47, 22.13, 23.85, 22.66, 15.51, 14.83, 16.66, 50, 4.17, 12.8, 22.12, 9.88, 3
SVM, test error: 8, 14.43, 1, 1.13, 12.44, 36.73, 13.91, 21.74, 21.56, 20.66, 12.54, 9.67, 14.81, 50, 4.79, 12.4, 20.94, 3.21, 3
D-Table, test error: 8, 16.49, 1, 1.11, 7.17, 36.81, 12.75, 21.09, 25.81, 17.66, 17.82, 19.35, 14.81, 50, 0, 12.4, 20.99, 4.13, 4
Ripper, test error: 6, 16.49, 1, 1.12, 7.17, 36.8, 12.46, 20.96, 24.83, 16.66, 14.52, 13.54, 14.07, 50.8, 0, 12.8, 21.58, 4.13, 3
CBA, test error: 8.5, 18.44, 1, 1.13, 12.25, 36.8, 13.47, 27.1, 26.26, 21.38, 18.24, 19.8, 22.8, 50, 0.17, 12.8, 22.12, 6.5, 4.7

Figure 6: Classification results of MBL in comparison with state of the art classifiers (each row lists one value per dataset, in the order of the Dataset Name row).

Datasets: the same 19 datasets, in the same order as in Figure 6. Each row lists the test error per dataset at the given level of added noise.
Our Classifier, 5% noise: 8, 16.78, 3.8, 3.43, 14.87, 32.89, 16.31, 25.37, 26.9, 20, 19.36, 16.93, 23.29, 43.9, 3.37, 14.92, 23.2, 10.79, 5.9
Our Classifier, 20% noise: 15, 29.39, 13.5, 23.6, 24.74, 40.37, 25.73, 33.19, 33.93, 32.49, 32.29, 25.79, 29.27, 45.12, 14.98, 28.67, 31.3, 18.5, 17.73
Knn, 5% noise: 22, 30.92, 27, 15.9, 19.61, 37.10, 23.33, 33.98, 36.27, 36, 30.69, 21.93, 25.18, 50, 5, 18.4, 23.8, 11.72, 13.8
Knn, 20% noise: 40, 41.23, 40, 24.24, 37.79, 49.73, 35.94, 44.66, 39.86, 43.66, 48.51, 38.7, 44.81, 50, 21.31, 40.4, 31.48, 24.13, 36.4
J48, 5% noise: 14, 27.83, 6, 12.5, 14.83, 35.94, 19.85, 30.85, 30.71, 28.66, 28.38, 19.35, 26.29, 50, 4.99, 19.2, 24.3, 8.27, 7.4
J48, 20% noise: 18, 31.95, 25, 21.96, 31.57, 45.21, 26.08, 37.23, 34.64, 39.33, 45.21, 34.83, 44.07, 50, 20.02, 36.4, 32.34, 18.39, 23
NB, 5% noise: 12, 34.02, 21, 15.53, 16.74, 46.66, 23.91, 27.73, 26.47, 39, 21.12, 19.35, 20.37, 50, 11.12, 18.8, 24.03, 13.56, 8.8
NB, 20% noise: 16, 36.08, 35, 24.24, 25.53, 49.56, 26.66, 37.5, 33.06, 40.33, 34.98, 29.67, 30, 50, 25, 32.4, 34.21, 22.29, 22.2
SVM, 5% noise: 14, 24.7, 23, 15.9, 18.18, 43.18, 17.68, 25.65, 29.41, 29.33, 19.8, 18.7, 19.62, 50, 11.97, 18.8, 24.62, 8.73, 7.4
SVM, 20% noise: 24, 37.11, 31, 24.24, 26.8, 45.50, 26.37, 35.41, 35.94, 36.66, 36.3, 28.3, 30.74, 50, 25.52, 32, 32.62, 18.62, 21.6
D-table, 5% noise: 10, 25.77, 6, 8.33, 14.83, 42.89, 19.56, 26.69, 27.45, 23, 26.07, 17.41, 22.22, 50, 4.99, 15.6, 24.03, 9.42, 10.30
D-table, 20% noise: 20, 43.29, 19, 24.24, 32.05, 45.21, 31.44, 35.41, 36.92, 35, 38.94, 30.32, 32.59, 50, 20.02, 30, 32.39, 25.74, 23.03
Ripper, 5% noise: 10, 21.64, 6, 9.09, 11.96, 42.89, 18.98, 27.47, 27.12, 20, 23.43, 20, 24.07, 50, 4.99, 15.6, 24.12, 8.5, 9.01
Ripper, 20% noise: 20, 41.23, 19, 24.24, 30.62, 45.21, 30.28, 35.42, 36.92, 32.33, 33.35, 25.80, 32.96, 50, 20.48, 30.4, 33.66, 22.98, 23.17
CBA (Supp = 1%, Conf = 50%), 5% noise: 8.5, 17.69, 6.11, 14.36, 12.12, 43.08, 23.36, 26.11, 26.58, 26.09, 22.78, 19.3, 20.47, 50, 5.41, 15.62, 31.44, 9.88, 8.75
CBA (Supp = 1%, Conf = 50%), 20% noise: 21, 37.91, 20.22, 24.48, 27.52, 45.43, 31.35, 35.74, 36.03, 41.78, 38.48, 35.41, 30.49, 50, 21.06, 29.75, 38.59, 27.13, 23.53

Figure 7: Misclassification at different levels of noise.

5. REFERENCES

[1] R. Agrawal and R. Srikant. Fast algorithms for mining association rules in large databases. In Proc. VLDB'94.
[2] D. Aha and D. Kibler. Instance-based learning algorithms. Machine Learning, 6:37–66, 1991.
[3] B. Bigi. Using K-L distance for text categorization. Advances in Information Retrieval, 2633:76, 2003.
[4] Hong Cheng, Xifeng Yan, Jiawei Han, and Chih-Wei Hsu. Discriminative frequent pattern analysis for effective classification. ICDE, 0:716–725, 2007.
[5] W. Cohen. Fast effective rule induction. In Proc. Int. Conf. Machine Learning, pages 115–123, 1995.
[6] T. M. Cover and P. E. Hart. Nearest neighbor pattern classification. IEEE Transactions on Information Theory, 13:21–27, 1967.
[7] Pedro Domingos. Unifying instance-based and rule-based induction. Machine Learning, 24(2):141–168, 1996.
[8] H. Fan and K. Ramamohanarao. Fast discovery and the generalization of strong jumping emerging patterns for building compact and accurate classifiers. IEEE Trans. Knowl. Data Eng., 18(6):721–737, 2006.
[9] Usama M. Fayyad and Keki B. Irani. Multi-interval discretization of continuous valued attributes for classification learning. In Proc. IJCAI'93, pages 1022–1027.
[10] S. S. Keerthi, S. K. Shevade, C. Bhattacharyya, and K. R. K. Murthy. Improvements to Platt's SMO algorithm for SVM classifier design. Neural Computation, 13(3):637–649, 2001.
[11] Ron Kohavi. The power of decision tables. In Proc. ECML'95, LNAI 914, pages 174–189. Springer Verlag.
[12] S. Kullback and R. A. Leibler. On information and sufficiency. Annals of Mathematical Statistics, 22:79–86, 1951.
[13] Wenmin Li, Jiawei Han, and Jian Pei. CMAR: Accurate and efficient classification based on multiple class-association rules. In ICDM, Int. Conf. Data Mining, pages 369–376, 2001.
[14] Bing Liu, Wynne Hsu, and Yiming Ma. Integrating classification and association rule mining. In SIGKDD Int. Conf. on Knowledge Discovery and Data Mining, pages 80–86, 1998.
[15] R. Meo. Theory of dependence values. ACM TODS, 25(3), 2000.
[16] Dimitris Meretakis and Beat Wüthrich. Extending Naïve Bayes classifiers using long itemsets. In Proc. KDD'99, pages 165–174, 1999.
[17] R. F. Sproull. Refinements to nearest-neighbor searching in k-dimensional trees. Algorithmica, 6(1-6):579–589, 1991.
[18] P.-N. Tan, M. Steinbach, and V. Kumar. Introduction to Data Mining. Pearson Education, 2006.
[19] D. Randall Wilson and Tony R. Martinez. Reduction techniques for instance-based learning algorithms. Machine Learning, 38(3):257–286, 2000.
