Evaluating Algorithms for Concept Description

Cecilia Sönströd, Ulf Johansson and Tuve Löfström

All authors are with the CSL@BS at the School of Business and Informatics, University of Borås, Sweden. Tuve Löfström is also with the School of Humanities and Informatics, University of Skövde, Sweden. e-mail: {cecilia.sonstrod, ulf.johansson, tuve.lofstrom}@hb.se
Abstract—When performing concept description, models need to be evaluated both on accuracy and comprehensibility. A comprehensible concept description model should present the most important relationships in the data in an accurate and understandable way. Two natural representations for this are decision trees and decision lists. In this study, the two decision list algorithms RIPPER and Chipper, and the decision tree algorithm C4.5, are evaluated for concept description, using publicly available datasets. The experiments show that C4.5 performs very well regarding accuracy and brevity, i.e. the ability to classify instances with few tests, but also produces large models that are hard to survey and contain many extremely specific rules, thus not being good concept descriptions. The decision list algorithms perform reasonably well on accuracy, and are mostly able to produce small models with relatively good predictive performance. Regarding brevity, Chipper is better than RIPPER, using on average fewer conditions to classify an instance. RIPPER, on the other hand, excels in relevance, i.e. the ability to capture a large number of instances with every rule.
I. INTRODUCTION
In the data mining task concept description [1], the aim is to gain insights. The focus is not to produce models with high predictive accuracy, but to adequately describe the most important relationships in the data. Recommended techniques for this task are rule induction and conceptual clustering. In many concept description situations, what is actually needed is a highly understandable classification model, indicating that rule induction is most suitable. However, most rule induction algorithms focus on optimizing predictive performance, i.e. accuracy, often at the expense of comprehensibility, and few, if any, include direct ways for the user to control the tradeoff between accuracy and comprehensibility. Of course, the term comprehensibility is far from unproblematic, since many factors influence whether a model is understandable or not. Often, model size is used to estimate comprehensibility, with the implicit assumption that small models are easier to interpret. In [2], we proposed that concept description models should be evaluated using accuracy and three properties that capture different aspects of comprehensibility. Hence, the four criteria for evaluating concept description models become:
• Accuracy: The model should have high accuracy on unseen data to guarantee that relationships found hold in general
• Brevity: The model should classify as many instances as possible with few conditions
• Interpretability: The model should express conditions in a clear and readable way
• Relevance: Only those relationships that are general and interesting should be included in the model

The most common representation language for transparent models is decision trees, and many successful decision tree algorithms exist. Looking at the above criteria, it is clear that standard decision tree algorithms such as C4.5 [3] and CART [4] are capable of fulfilling the accuracy criterion. Intuitively, the decision tree representation also seems suitable for obtaining good brevity, since a balanced decision tree will make many different classifications with relatively few conditions used for each path to a leaf node. However, regarding the two other criteria, decision trees have some serious drawbacks, since the representation leads to models that are hard to survey, and typically also contain many branches that classify only a few instances. Looking at a decision tree, especially in its textual representation, a decision-maker will have a hard time finding and interpreting the most important relationships; indeed, he would probably trace the sequence of tests and write out a separate rule, consisting of a conjunction of tests, for each branch of interest.

The above is an argument for the alternative representation decision lists, or ordered rule sets. A decision list is, in essence, a maximally unbalanced decision tree, where each rule, containing one or more tests, directly classifies a number of instances. Those instances not covered by one rule are tested on the next rule, and this proceeds until a default rule classifies all remaining instances. This representation is especially suitable for concept description, since it admits a rule construction method prioritizing high coverage for the top rules, thereby obtaining good brevity and relevance by capturing the most general relationships in the dataset with very few conditions; or, put another way, describing the most important relationships with very simple rules. Furthermore, since the top rules in a decision list are very easy to identify, a decision list algorithm with high coverage in its top rules will also have good interpretability. Finally, the default rule used in decision lists also provides a mechanism for increasing both interpretability and relevance, since models can avoid formulating rules that only classify a few instances.
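To make the representation concrete, the following minimal sketch (our own illustration, not code from any of the evaluated systems; all names are hypothetical) shows how an ordered rule set with a default rule classifies an instance: rules are tried from the top, the first rule whose conditions all hold assigns the label, and uncovered instances fall through to the default class.

```java
import java.util.List;

// Minimal illustration of decision list classification: rules are tried in
// order, the first rule whose conditions all hold assigns the class label,
// and instances covered by no rule fall through to the default class.
public class DecisionListDemo {

    // A single test on one attribute, e.g. "attribute 3 >= 4.1".
    record Condition(int attribute, String op, double threshold) {
        boolean holds(double[] instance) {
            double v = instance[attribute];
            return switch (op) {
                case "<=" -> v <= threshold;
                case ">=" -> v >= threshold;
                case "==" -> v == threshold;
                default -> throw new IllegalArgumentException(op);
            };
        }
    }

    // A rule is a conjunction of conditions plus a predicted class label.
    record Rule(List<Condition> conditions, String label) {
        boolean covers(double[] instance) {
            return conditions.stream().allMatch(c -> c.holds(instance));
        }
    }

    static String classify(List<Rule> rules, String defaultLabel, double[] instance) {
        for (Rule r : rules) {
            if (r.covers(instance)) {
                return r.label();          // first covering rule wins
            }
        }
        return defaultLabel;               // default rule catches the rest
    }

    public static void main(String[] args) {
        List<Rule> rules = List.of(
                new Rule(List.of(new Condition(0, ">=", 4.1)), "LIVE"),
                new Rule(List.of(new Condition(1, "<=", 3.5)), "DIE"));
        System.out.println(classify(rules, "LIVE", new double[] {3.0, 2.0})); // prints DIE
    }
}
```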
Ultimately, a decision list algorithm suitable for concept description should consistently obtain accuracy and brevity comparable to standard decision tree algorithms, with its representation offering benefits for interpretability and relevance. It must be noted that this is a tough challenge, since industrial-strength decision tree algorithms are very good at optimizing accuracy and have a representation that lends itself very well to good brevity.
II. BACKGROUND

Most decision list algorithms are based on the sequential covering algorithm, which in each step constructs a rule covering a number of instances and then removes those instances from the dataset before the next rule is found. This procedure is repeated until some stop criterion is met, when a default rule, classifying all remaining instances, is formulated. Examples of decision list algorithms based on this idea are AQ [5], CN2 [6], IREP [7] and RIPPER [8].

RIPPER (Repeated Incremental Pruning to Produce Error Reduction) is widely used and considered to be the most successful decision list algorithm to date [9], and is reported to have good performance regarding accuracy [10], running time and ability to handle noise and unbalanced data sets [11]. RIPPER works by constructing rules only for the minority class(es) and using the default rule for the majority class. For multiclass problems, rules are constructed for classes in order of class distribution. In [9], most of RIPPER's power is attributed to its optimization (post-pruning) procedure, and the authors argue that post-pruning is probably the best way to obtain good predictive performance for decision list algorithms.

However, some of the properties that give RIPPER good performance regarding accuracy and speed are detrimental to its concept description ability. First and foremost, for binary problems with relatively even class distributions, rules are only formulated for the class that happens to have the lowest number of instances, and hence no explicit description (other than as the negation of the rules for the minority class) is offered for instances belonging to the other class. This will also often result in RIPPER obtaining relatively poor brevity even for quite short rule sets. Furthermore, RIPPER is built to optimize accuracy and will sometimes do so at the expense of comprehensibility, without any clear possibility for the user to control this tradeoff via parameter settings.

We have previously, see [12], introduced the decision list algorithm Chipper, aimed at performing concept description. Chipper is a greedy algorithm based on sequential covering, and the basic idea is to, in each step, find the rule that classifies the maximum number of instances using only one split on one attribute. The algorithm uses a parameter, called ignore, to control the tradeoff between accuracy and coverage. This parameter specifies the acceptable misclassification rate for each rule, expressed either as an absolute number of instances or as a proportion of remaining instances in the data set. The ignore parameter can thus be used to control the granularity of the rules produced, with high ignore values being used for finding rules with high coverage (possibly with lower rule accuracy) and low values for more specific rules. The other main parameter in Chipper is called stop and is used to determine the proportion of instances that should be described by rules before the default rule is formulated. Thus, a stop value of 80% means that the default rule will be formed when at least 80% of the instances are covered by the rule set.
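The following is a simplified sketch of a Chipper-style rule-construction loop, written only to illustrate how the ignore and stop parameters steer the greedy search described above. It is our own reconstruction, not the authors' implementation; the candidate-split enumeration, the handling of ties and categorical attributes, and all identifiers are assumptions.

```java
import java.util.ArrayList;
import java.util.List;

// Simplified Chipper-style sequential covering: in each step, pick the
// single-attribute split (<= or >= against one threshold) that covers the most
// remaining instances while misclassifying no more than the ignore proportion
// of the remaining instances; stop once the stop proportion of all instances
// is covered and leave the rest to the default rule.
public class ChipperSketch {

    record Rule(int attribute, boolean lessOrEqual, double threshold,
                int label, int covered, int errors) { }

    static List<Rule> induceRules(double[][] x, int[] y, int numClasses,
                                  double ignore, double stop) {
        List<Rule> rules = new ArrayList<>();
        List<Integer> remaining = new ArrayList<>();
        for (int i = 0; i < x.length; i++) remaining.add(i);
        int mustCover = (int) Math.ceil(stop * x.length);
        int coveredSoFar = 0;

        while (coveredSoFar < mustCover && !remaining.isEmpty()) {
            Rule best = null;
            for (int a = 0; a < x[0].length; a++) {
                for (int i : remaining) {                    // candidate thresholds
                    for (boolean le : new boolean[] {true, false}) {
                        Rule cand = evaluate(x, y, numClasses, remaining, a, le, x[i][a], ignore);
                        if (cand != null && (best == null || cand.covered() > best.covered())) {
                            best = cand;
                        }
                    }
                }
            }
            if (best == null) break;                         // no admissible rule left
            rules.add(best);
            coveredSoFar += best.covered();
            final Rule b = best;
            remaining.removeIf(i -> matches(x[i], b));       // sequential covering
        }
        // The default rule (majority class of 'remaining') is added by the caller.
        return rules;
    }

    // Evaluate one candidate split on the remaining instances. The rule predicts
    // the majority class among the instances it covers; it is rejected if its
    // misclassifications exceed the ignore proportion of the remaining instances.
    static Rule evaluate(double[][] x, int[] y, int numClasses, List<Integer> remaining,
                         int a, boolean le, double thr, double ignore) {
        int[] counts = new int[numClasses];
        int covered = 0;
        for (int i : remaining) {
            if (le ? x[i][a] <= thr : x[i][a] >= thr) {
                covered++;
                counts[y[i]]++;
            }
        }
        if (covered == 0) return null;
        int label = 0;
        for (int c = 1; c < numClasses; c++) if (counts[c] > counts[label]) label = c;
        int errors = covered - counts[label];
        return errors <= ignore * remaining.size()
                ? new Rule(a, le, thr, label, covered, errors) : null;
    }

    static boolean matches(double[] instance, Rule r) {
        return r.lessOrEqual() ? instance[r.attribute()] <= r.threshold()
                               : instance[r.attribute()] >= r.threshold();
    }
}
```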
In [12], Chipper was evaluated on 9 binary datasets from the UCI machine learning repository [13], which were chosen to enable interpretation of the rule sets found, and compared to two standard algorithms for generating transparent models: RIPPER and C4.5. Results were very encouraging, with Chipper obtaining accuracy comparable to the other techniques, but having superior comprehensibility. However, it was noted that Chipper sometimes performed very badly regarding accuracy and also was very sensitive to the value of the ignore parameter. This is, of course, not entirely a bad thing, since it means that the parameter works as intended, letting the user choose the granularity of the rule set. In some cases, though, it is desirable to have the option of finding "the best possible" model without having to manually search for optimal parameter settings.

The main purpose of this study is to conduct a more thorough evaluation of a slightly improved version of Chipper, as well as RIPPER and C4.5, regarding suitability for concept description. It is of course interesting to investigate how existing algorithms perform regarding brevity and relevance, especially for decision trees, which are not normally evaluated on these criteria.

III. METHOD

The main change in Chipper from [12] is that the algorithm is now able to handle multiclass problems; it is implemented in Java and incorporated in the WEKA [9] data mining framework. Implementation in WEKA has facilitated solving the problem with excessive sensitivity to the value of the ignore parameter, by using the built-in meta procedure for cross-validation parameter selection (CVParameterSelection). CVParameterSelection uses internal cross-validation to find the best values for one or more parameters within a specified range. Obviously, most standard techniques, such as RIPPER and C4.5, use an internal cross-validation procedure as a part of their algorithm to optimize models on accuracy. Using this procedure to set parameter values in Chipper gives the desired flexibility regarding parameter use, since the user can either specify a given range or set specific values. Finally, using techniques all implemented in the same framework means an improved ability to carry out comparisons by using controlled experiments with fixed folds for all techniques.
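As a brief illustration of the meta procedure, the sketch below wraps a WEKA classifier in CVParameterSelection and estimates performance with 10-fold cross-validation using a fixed random seed, so that folds can be kept identical across techniques. Since Chipper is not part of the public WEKA distribution, J48 and its pruning-confidence option -C are used merely to show the mechanism; the parameter range and the ARFF file name are our own choices.

```java
import java.util.Random;

import weka.classifiers.Evaluation;
import weka.classifiers.meta.CVParameterSelection;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

// Sketch of cross-validation parameter selection in WEKA: an inner
// cross-validation picks the best parameter value, and an outer 10-fold
// cross-validation (with a fixed seed, so folds are reproducible across
// techniques) estimates accuracy and AUC.
public class CVParameterSelectionDemo {
    public static void main(String[] args) throws Exception {
        // "iris.arff" stands in for any local ARFF file.
        Instances data = new DataSource("iris.arff").getDataSet();
        data.setClassIndex(data.numAttributes() - 1);

        CVParameterSelection ps = new CVParameterSelection();
        ps.setClassifier(new J48());
        // Search J48's pruning confidence -C from 0.1 to 0.5 in 5 steps.
        ps.addCVParameter("C 0.1 0.5 5");
        ps.setNumFolds(10);

        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(ps, data, 10, new Random(1));
        System.out.printf("Accuracy: %.1f%%  AUC (class 0): %.2f%n",
                eval.pctCorrect(), eval.areaUnderROC(0));
    }
}
```

Several parameters can be searched at once by adding one addCVParameter string per option.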
A. Datasets

26 publicly available datasets from the UCI repository [13] were used for the experiments. As seen in Table 1 below, where the dataset characteristics are presented, both binary and multiclass problems are used, and the datasets have different properties regarding number of instances, as well as the number of continuous (Cont.) and categorical (Cat.) attributes.

TABLE 1. DATA SETS USED

Data set                 Instances  Classes  Cont.  Cat.
Breast cancer                  286        2      0      9
Cmc                           1473        3      2      7
Colic – Horse                  368        2      7     15
Credit – American              690        2      6      9
Credit – German               1000        2      7     13
Cylinder                       540        2     18     21
Diabetes – Pima                768        2      8      0
E-coli                         336        8      7      0
Glass                          214        6      9      0
Haberman                       306        2      2      1
Heart – Cleveland              303        2      6      7
Heart – Statlog                270        2     10      3
Hepatitis                      155        2      6     13
Ionosphere                     351        2     34      0
Iris                           150        3      4      0
Labor                           57        2      8      8
Liver disorders – BUPA         345        2      6      0
Lymphography                   148        4      3     15
Sick                          3772        2     22      7
Sonar                          208        2     60      0
Tae                            151        3      3      2
Vehicle                        846        4     18      0
Votes                          435        2      0     16
Wine                           178        3     13      0
WBC – Wisconsin                699        2      9      0
Zoo                            101        7     16      1
B. Experiments

In the experiments, Chipper is compared to two techniques producing transparent models, but using different representations. The chosen decision tree algorithm is again C4.5, implemented as J48 in WEKA, and the decision list technique is RIPPER, implemented as JRip in WEKA. The motivation for these choices is that the two algorithms are standard techniques for their respective representations.

In the first experiment, the aim is to investigate whether Chipper is able to obtain acceptable predictive performance, measured as accuracy and area under the ROC curve (AUC), over a large number of datasets. The motivation for including both accuracy and AUC is that they measure quite different things; while accuracy is based only on the final classification, AUC measures the ability of the model to rank instances according to how likely they are to belong to a certain class; see e.g. [14]. AUC can be interpreted as the probability that a randomly chosen positive instance is ranked ahead of a randomly chosen negative instance; see [15]. In this experiment, Chipper is used with parameter settings favoring accuracy, using CVParameterSelection for
both stop and ignore: ignore took 10 different values between 0.5% and 5%, and stop took 6 different values between 70% and 95%. In this experiment, 10 x 10-fold cross-validation is used, with identical folding for all three techniques, for measuring accuracy and AUC. J48 and JRip were used with their default settings in WEKA, which means that J48 trees are pruned.

In the second experiment, the tradeoff between accuracy and comprehensibility is investigated by using Chipper with settings favoring comprehensible models, meaning lower stop values and higher ignore values. The choice was to use stop values between 60% and 80%, and ignore values between 4% and 8%. For this kind of rule set, where the aim is to present the most important relationships in the data, the evaluation should primarily be on comprehensibility, with overall model accuracy serving only to guarantee that the relationships found will hold in general.

C. Evaluation of comprehensibility

For measuring the brevity aspect of comprehensibility, we have previously, in [2], suggested the classification complexity (CC) measure. The classification complexity for a model is the average number of tests (i.e. simple conditions using only one attribute) needed to classify an instance. A low value thus means good brevity. The interpretability and relevance aspects, however, have not as yet been formulated as numeric measures. The most problematic of these is arguably interpretability; in Chipper, interpretability is ensured by the very simple representation language, allowing only one test in each rule. For relevance, a measure would have to differentiate between models containing few and many rules, but also be more refined than just model size. A model should be deemed to have high relevance when every part of the model classifies a large number of instances. To make comparison between different representations possible, we introduce the term classification point, and define it as a place where instances are assigned class labels in a classification model. For a decision list, the classification points consist of all rules, and for a decision tree, every leaf is a classification point. We propose that relevance can be measured by calculating the average number of instances that reach each classification point in a model. Thus, a high value represents good relevance. This, in essence, means calculating the load factor for each rule or leaf. This measure will simply be called relevance and is calculated using (1) below:

relevance = (#instances in dataset) / (#classification points)    (1)
Comprehensibility for the models produced in both experiments is consequently evaluated using classification complexity (CC), and the relevance measure introduced above.
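For concreteness, the sketch below computes both measures for a decision list, under the assumption that an instance classified by rule i is counted as having been tested on all conditions of rules 1 through i, and that the default rule counts as a classification point; the rule-set representation and the example numbers are hypothetical.

```java
// Sketch of the two comprehensibility measures for a decision list.
// An instance classified by rule i has been tested against rules 1..i, so the
// number of conditions evaluated for it is the cumulative number of conditions
// up to and including rule i; the default rule adds no conditions of its own.
public class ComprehensibilityMeasures {

    // Classification complexity (CC): average number of tests per instance.
    static double classificationComplexity(int[] conditionsPerRule,
                                           int[] instancesPerRule,
                                           int instancesDefault) {
        int cumulativeConditions = 0, totalInstances = instancesDefault;
        double totalTests = 0;
        for (int i = 0; i < conditionsPerRule.length; i++) {
            cumulativeConditions += conditionsPerRule[i];
            totalTests += (double) cumulativeConditions * instancesPerRule[i];
            totalInstances += instancesPerRule[i];
        }
        // Instances reaching the default rule have been tested on every rule.
        totalTests += (double) cumulativeConditions * instancesDefault;
        return totalTests / totalInstances;
    }

    // Relevance, as in (1): instances in the dataset per classification point,
    // counting every rule plus the default rule as a classification point
    // (an assumption on our part).
    static double relevance(int numRules, int totalInstances) {
        return (double) totalInstances / (numRules + 1);
    }

    public static void main(String[] args) {
        // Hypothetical 3-rule list covering 80 + 40 + 20 instances, 10 left to the default rule.
        int[] conds = {1, 1, 2};
        int[] covered = {80, 40, 20};
        System.out.printf("CC = %.2f, relevance = %.1f%n",
                classificationComplexity(conds, covered, 10),
                relevance(conds.length, 150));
    }
}
```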
IV. RESULTS

The results regarding accuracy and AUC from Experiment 1 are shown in Table 2 below. Bold numbers indicate the best result for a specific dataset.

TABLE 2. RESULTS ACCURACY AND AUC, 10x10 CV

              J48            JRip           Chipper
Data set      Acc     AUC    Acc     AUC    Acc     AUC
Breast-cancer 70.5%   0.59   71.5%   0.60   67.8%   0.54
Cmc           52.3%   0.69   52.4%   0.64   48.2%   0.66
Colic         85.3%   0.85   85.1%   0.83   77.6%   0.78
Credit - A    85.2%   0.87   85.2%   0.87   85.1%   0.91
Credit - G    70.6%   0.64   72.2%   0.63   70.1%   0.65
Cylinder      73.2%   0.76   67.1%   0.66   66.9%   0.71
Diabetes      74.5%   0.75   75.2%   0.72   76.0%   0.79
E-coli        82.8%   0.96   81.4%   0.95   71.1%   0.95
Glass         67.6%   0.79   66.8%   0.80   61.5%   0.73
Haberman      72.2%   0.58   72.4%   0.60   76.3%   0.61
Heart – C     78.2%   0.78   80.0%   0.81   71.2%   0.74
Heart – S     78.2%   0.79   78.8%   0.80   72.0%   0.73
Hepatitis     79.2%   0.67   78.1%   0.62   80.8%   0.74
Ionosphere    89.7%   0.89   89.2%   0.89   87.2%   0.85
Iris          94.7%   0.99   93.9%   0.99   93.1%   0.99
Labor         78.4%   0.72   83.7%   0.82   78.9%   0.73
Liver         65.8%   0.65   66.6%   0.65   63.2%   0.63
Lymph         75.6%   0.77   76.3%   0.40   79.2%   0.46
Sick          98.9%   0.96   98.3%   0.94   96.5%   0.94
Sonar         73.6%   0.75   73.4%   0.75   75.5%   0.78
Tae           57.4%   0.75   43.8%   0.60   42.1%   0.60
Vehicle       72.3%   0.76   68.3%   0.77   59.9%   0.73
Vote          96.6%   0.98   95.8%   0.96   94.7%   0.97
Wine          93.2%   0.97   93.1%   0.96   91.9%   0.97
WBC           95.0%   0.96   95.6%   0.96   93.8%   0.95
Zoo           91.6%   1.00   86.6%   0.93   87.9%   1.00
#Wins         13      14     9       10     5       9
As can be seen from the table, J48 performs best overall, both regarding accuracy and AUC. J48 obtains the highest accuracy on 13 out of 26 datasets, and JRip performs slightly better than Chipper. However, for some datasets, there are significant differences in accuracy, with Chipper quite often losing by a large margin. This is probably due to there being no global rule set optimization targeting accuracy in Chipper, whereas the two other techniques spend quite a lot of effort on post-pruning and optimizing their models on accuracy, using internal cross-validation. When measuring predictive performance using AUC, J48 still stands out, winning on 14 datasets, albeit 5 of them tied with other techniques. In contrast to accuracy, there is very little difference between Chipper and JRip on the AUC measure.

The comprehensibility results from Experiment 1 are shown in Table 3 below. Classification complexity (CC) is given as the average number of tests needed to classify an instance, size is measured simply as the total number of tests in the rule set, serving as an indicator of overall model complexity, and relevance (Rel) is calculated according to (1) above. All these measures are given for the model that WEKA outputs when trained on the entire dataset. When calculating CC, everything using more than 30
conditions is lumped together as if it were in the default rule; the only dataset where this happens is Vehicle, for both JRip and Chipper. Obviously, J48 has an advantage for the CC measure, because of its different representation.

TABLE 3. COMPREHENSIBILITY RESULTS FOR EXPERIMENT 1

              J48                 JRip                Chipper
Data set      Size  CC    Rel     Size  CC    Rel     Size  CC    Rel
Breast-c      21    5.1   13      4     4.6   95      8     7.8   32
Cmc           131   8.9   11      14    12.6  295     13    10.0  105
Colic         5     1.7   61      6     5.4   92      13    5.0   26
Credit - A    24    3.8   28      7     6.0   173     7     2.4   86
Credit - G    98    9.2   10      11    5.3   333     8     5.1   111
Cylinder      66    23    8       6     6.0   180     22    9.9   23
Diabetes      19    3.7   38      9     8.0   192     10    3.8   70
E-coli        21    4.4   15      19    16.7  34      13    4.8   24
Glass         29    5.9   7       18    14.9  31      25    13.1  8
Haberman      10    2.3   28      3     3.9   153     12    7.1   24
Heart – C     17    3.6   17      6     5.3   76      11    6.7   25
Heart – S     17    3.0   15      8     6.7   54      7     4.2   34
Hepatitis     10    2.9   14      5     5.3   39      6     3.2   22
Ionosphere    17    5.8   20      2     2.4   117     13    7.7   25
Iris          4     2.1   30      3     2.3   38      3     2.2   38
Labor         6     2.7   8       4     3.7   19      5     3.2   10
Liver         25    4.7   13      4     4.2   115     13    7.7   25
Lymph         13    5.0   11      8     7.5   25      13    5.3   11
Sick          22    3.2   164     10    9.5   943     2     1.4   1257
Sonar         17    4.4   12      9     7.6   42      5     3.2   35
Tae           33    6.4   4       1     1.8   76      20    9.0   7
Vehicle       97    7.2   9       43    23.1  53      49    18.3  17
Vote          5     1.6   73      6     5.2   145     3     1.6   109
Wine          4     2.3   36      4     3.9   60      5     2.5   30
WBC           13    2.7   50      9     7.4   140     4     1.9   140
Zoo           8     2.8   11      6     5.8   20      6     2.7   14
#Wins         2     15    0       17    5     25      10    7     4
Average       28.2  4.9   –       8.7   7.1   –       11.4  5.8   –
As can be seen from the table, J48 is able to use its representation to obtain very good classification complexity, despite having by far the largest average model size. However, J48 performs very badly on relevance, not winning a single dataset. This, together with the very good classification complexity, indicates that the decision trees produced by J48 contain a lot of branches that classify very few instances each, which means that the overall comprehensibility of the models is limited.

Turning to a direct comparison between Chipper and RIPPER, the results are very clear. Chipper is superior to RIPPER on classification complexity, performing better on 19 out of 26 datasets and also obtaining a lower average number of tests. Averaging over datasets in this way should of course be interpreted with caution, but what it says is that over all datasets, Chipper needs just under six tests to classify an instance. JRip performs best on relevance, winning on all but one dataset, which of course is a consequence of its smaller model size. Indeed, RIPPER rule sets typically consist of between three and four rules and seldom contain rules classifying only a few instances. Below are some sample rule sets for Chipper and JRip, where the differences in how the two techniques work become very apparent.
IF Color_intensity = 885.0 THEN 0 [49/0]
IF Flavanoids = 760 AND Color_intensity >= 3.52 THEN 1 [57/0]
DEFAULT: 2 [74/3]
Figure 2: JRip sample rule set for Wine

IF FATIGUE == 0.0 THEN LIVE [54/2]
IF PROTIME >= 51.0 THEN LIVE [34/1]
IF ALBUMIN >= 4.1 THEN LIVE [12/1]
IF BILIRUBIN >= 3.5 THEN DIE [8/1]
IF SEX == 0.0 THEN LIVE [7/0]
IF ALBUMIN