Convex Hulls in Concept Induction

Douglas A. Newlands  B.Sc.(Glasgow), M.Sc.(Deakin)

School of Computing and Mathematics
Deakin University
Geelong, Victoria, 3217, Australia.
Tel: +61-(03)-52271165   Fax: +61-(03)-52272028

A Thesis submitted in Complete Fulfilment of the Requirements for the Degree of Doctor of Philosophy

March 27, 1998
Abstract

Classification learning is dominated by systems which induce large numbers of small axis-orthogonal decision surfaces. This strongly biases such systems towards particular hypothesis types, but there is reason to believe that many domains have underlying concepts which do not involve axis-orthogonal surfaces. Further, the multiplicity of small decision regions militates against any holistic appreciation of the theories produced by these systems, notwithstanding the fact that many of the small regions are individually comprehensible. This thesis investigates modeling concepts as large geometric structures in n-dimensional space. Convex hulls are a superset of the set of axis-orthogonal hyperrectangles into which axis-orthogonal systems partition the instance space. In consequence, there is reason to believe that convex hulls might provide a more flexible and general learning bias than axis-orthogonal regions. The formation of convex hulls around a group of points of the same class is shown to be a usable generalisation and is more general than generalisations produced by axis-orthogonal based classifiers, without constructive induction, like decision trees, decision lists and rules. The use of a small number of large hulls as a concept representation is shown to provide classification performance which can be better than that of classifiers which use a large number of small fragmentary regions for each concept. A convex hull based classifier, CH1, has been implemented and tested. CH1 can handle categorical and continuous data. Algorithms for two basic generalisation operations on hulls, inflation and facet deletion, are presented. The two operations are shown to improve the accuracy of the classifier and provide moderate classification accuracy over a representative selection of typical, largely or wholly continuous-valued machine learning tasks. The classifier exhibits superior performance to well-known axis-orthogonal-based classifiers when presented with domains where the underlying decision surfaces are not axis-parallel. The strengths and weaknesses of the system are identified. One particular advantage is the ability of the system to model domains with approximately the same number of structures as there are underlying concepts. This leads to the possibility of extraction of higher-level mathematical descriptions of the induced concepts, using the techniques of computational geometry, which is not possible from a multiplicity of small regions.
Contents

1 Introduction   11
  1.1 Background to Research   12
  1.2 Research Objectives   13
  1.3 Principal Outcomes of this Thesis   15
  1.4 Methodology   18
  1.5 Structure of this Thesis   19

2 Review   22
  2.1 Chapter Outline   22
    2.1.1 Geometric View of Generalisation   23
  2.2 Attribute Types   24
    2.2.1 Underlying and Imposed Metrics   24
    2.2.2 Granularity in Discrete and Discontinuous Spaces   26
    2.2.3 Continuous Attributes   27
    2.2.4 Taxonomy of Actual Concepts   27
  2.3 Rule-based Systems   29
  2.4 Decision Trees   32
  2.5 Decision Trees using Attribute Combinations   33
    2.5.1 Boolean Combinations of Attributes   33
    2.5.2 Synthetic Attributes   34
    2.5.3 Oblique Decision Trees   34
    2.5.4 Decision Lists   35
  2.6 Exemplar-based Techniques   36
    2.6.1 Nearest Neighbour Methods   36
    2.6.2 Nested Rectangles   37
  2.7 Connectionist Methods   38
  2.8 Statistical Methods   39
    2.8.1 DIPOL92   40
  2.9 Convex Hulls   40
    2.9.1 Implementations of Convex Hull Forming Algorithms   42
  2.10 Survey of Convex Hull Software   43
    2.10.1 cdd   43
    2.10.2 chD   44
    2.10.3 Hull   44
    2.10.4 Porta   44
    2.10.5 lrs, qrs, rs   45
    2.10.6 qhull   45
  2.11 Choice of Package   46
  2.12 Performance Metrics   46
    2.12.1 Accuracy   46
  2.13 Relative Operating Characteristic   48
  2.14 More Informative Performance Metrics   49
  2.15 Misclassification Cost-based Metrics   51
  2.16 Statistical Measures   52
  2.17 Summary   53

3 A Prototype Polygonal Generalisation System   54
  3.1 Introduction   54
  3.2 The Algorithm   55
  3.3 The Prototype   57
    3.3.1 Cover and Generalisation   57
    3.3.2 Spiking   60
  3.4 Evaluation   62
  3.5 Conclusions   65

4 Implementation and Proof of Concept   67
  4.1 Introduction   67
  4.2 The CH0 Algorithm   68
  4.3 Refining the CH0 Algorithm   70
    4.3.1 Time Complexity of CH1   74
    4.3.2 Implementation of CH1   76
  4.4 Comparison of the Classification Performance of CH1 and C4.5   76
    4.4.1 Analysis of NPV Results   78
    4.4.2 Analysis of PPV Results   79
    4.4.3 Analysis of Sensitivity Results   82
    4.4.4 Analysis of Specificity Results   84
    4.4.5 Analysis of Accuracy Results   84
  4.5 Comparison of Performance Metric Results   86
  4.6 Conclusions   87

5 Inflation of Convex Hulls   94
  5.1 Introduction   94
  5.2 Inflating a Convex Hull   96
  5.3 Algorithm for Per Hull Inflation   99
    5.3.1 Test of Implementation of Per Hull Inflation   99
  5.4 Evaluation of Per Hull Inflation Strategies   102
    5.4.1 Interaction between Inflation, Decision Lists and Performance Metrics   104
    5.4.2 Summary of Discussion   109
  5.5 Algorithm for Per Facet Inflation   110
  5.6 Evaluation of Per Facet Inflation   110
  5.7 Comparison of Inflation Types   112
  5.8 Conclusions   112

6 Facet Deletion   115
  6.1 Introduction   115
  6.2 Deletion of Non-essential Facets   116
    6.2.1 Basic Characteristics of Non-Essential Deletion   117
    6.2.2 Adding Inflation to Non-Essential Deletion   118
    6.2.3 Time Complexity of Non-Essential Deletion   121
  6.3 Retention of Redundant Facets   121
    6.3.1 Unordered Retention   121
    6.3.2 Evaluation of Unordered Retention with Inflation   123
    6.3.3 Ordered Retention   123
    6.3.4 Time Complexity of Ordered Retention Strategy   126
    6.3.5 Evaluation of Ordered Retention with Inflation   126
    6.3.6 Comparison of Retention Strategies   127
  6.4 Conclusions   127

7 Evaluation of CH1   131
  7.1 Introduction   131
  7.2 Evaluation on Selected Domains   132
    7.2.1 Body Fat   132
    7.2.2 POL   135
    7.2.3 Summary of Evaluation   136
  7.3 Evaluation on a Variety of Domains   136
  7.4 Complexity of Domain Representations   140
  7.5 Conclusions   141

8 Large Axis Orthogonal Hulls   144
  8.1 Introduction   144
  8.2 Axis Orthogonal Hulls   145
  8.3 Evaluation of Per Hull Inflation Strategies   145
  8.4 Evaluation of Per Facet Inflation   145
  8.5 Evaluation of Non-Essential Deletion   148
  8.6 Comparison of Non-Essential Deletion and Inflation   148
  8.7 Evaluation of Unordered Retention   151
  8.8 Evaluation of Ordered Retention   151
  8.9 Comparison of Retention Strategies   151
  8.10 Comparison of CH1 and AOH   155
  8.11 Comparison of AOH with C4.5 and CN2   155
  8.12 Conclusions   155

9 CH1-CN2 Hybrid   159
  9.1 Introduction   159
  9.2 Experimental Design   159
  9.3 Conclusions   165

10 Conclusions and Future Research   166
  10.1 Summary   166
  10.2 Summary of Software Designed and Implemented   171
  10.3 Future Research   173
  10.4 Conclusions   174

A Data Sets   177
  A.1 Description of Data Sets   177
    A.1.1 balance-scale   177
    A.1.2 bcw   178
    A.1.3 bf   178
    A.1.4 bupa   178
    A.1.5 Cleveland   179
    A.1.6 echocardiogram   180
    A.1.7 german   180
    A.1.8 glass   181
    A.1.9 glass7   181
    A.1.10 heart   181
    A.1.11 hepatitis   182
    A.1.12 horse-colic   182
    A.1.13 hungarian   182
    A.1.14 ionosphere   183
    A.1.15 iris   183
    A.1.16 new thyroid   183
    A.1.17 page-blocks   184
    A.1.18 pid   184
    A.1.19 POL   184
    A.1.20 satimage   185
    A.1.21 segment   185
    A.1.22 shuttle   185
    A.1.23 sonar   185
    A.1.24 soybean-large   185
    A.1.25 vehicle   186
    A.1.26 waveform   186
    A.1.27 wine   186
List of Figures

2.1 A Complex Representation of a Simple Concept   24
3.1 Comparison of Minimality of Generalisation   56
3.2 Cover Example   58
3.3 Faulty Generalisation   58
3.4 Examples of Generalisation   60
3.5 Spiking   60
3.6 Early Spiking   61
3.7 Test Concepts   63
4.1 Shortest Decision List   71
4.2 The "RCC" Universe   78
4.3 Negative Predictive Value Graphs for CH1 and C4.5   89
4.4 Positive Predictive Value Graphs for CH1 and C4.5   90
4.5 Sensitivity Graphs for CH1 and C4.5   91
4.6 Specificity Graphs for CH1 and C4.5   92
4.7 Accuracy Graphs for CH1 and C4.5   93
5.1 Performance Characteristics   94
5.2 Differing Inflation Strategies   98
5.3 Implementation Test: Square Sets   100
5.4 Implementation Test: Quad Sets   101
5.5 Basic Situation   105
5.6 Overlapping Situation without Misclassification   106
5.7 Overlapping Situation with Misclassification   107
6.1 Example of Non-Essential Deletion   117
6.2 Minimal Facet Deletion   122
7.1 Learning Curves for Body Fat   134
List of Tables

2.1 2x2 Contingency Table   49
3.1 Comparison of Predictive Accuracy for PIGS and OC1   63
3.2 Comparison of Predictive Accuracy for PIGS and C4.5   64
4.1 Negative Predictive Values   80
4.2 Positive Predictive Values   81
4.3 Sensitivity   83
4.4 Specificity   85
4.5 Accuracy   86
5.1 Confusion Matrix: square   100
5.2 Confusion Matrix: quad   101
5.3 Various Amounts of Per Hull Inflation   103
5.4 Various Amounts of Per Facet Inflation   111
5.5 Per Hull and Per Facet Comparison   113
6.1 Comparison of Non-Essential Deletion and Inflation   119
6.2 Comparison of Non-Essential Deletion with Inflation against Inflation   120
6.3 Accuracy using Unordered Retention   124
6.4 Accuracy using Ordered Retention   128
6.5 Comparison of Retention Strategies   129
7.1 Evaluation on Body Fat Data Set   133
7.2 Evaluation on POL Data Set   135
7.3 Comparison of CH1, C4.5, CN2 and OC1   138
7.4 Number of Regions Induced for each Domain   142
8.1 Various Amounts of Per Hull Inflation   146
8.2 Various Amounts of Per Facet Inflation   147
8.3 Per Facet Inflation after Non-Essential Deletion   149
8.4 Comparison of Non-Essential Deletion with Inflation against Inflation   150
8.5 Accuracy using Unordered Retention   152
8.6 Accuracy using Ordered Retention   153
8.7 Comparison of Retention Strategies   154
8.8 Comparison of CH1 and AOH   156
8.9 Comparison of AOH, C4.5 and CN2   157
9.1 Exploration of Hybrid Classifier Operation   163
9.2 Comparison of CN2, hybrid and C4.5   164
Chapter 1

Introduction

Classification learning has been dominated by the induction of axis-orthogonal decision surfaces in the form of rule-based systems, decision trees, inductive logic programming and decision graphs. Axis-orthogonal systems are characterised by boolean combinations of tests of single attributes against a single value such as "if length < 5.0 and width > 1.5 then ...". Each component of the rule antecedent implicitly produces a division of the instance space at right angles to the axis, hence the appellation of axis-orthogonal. While the induction of alternative forms of decision surface has received some attention, in the context of non-axis-orthogonal decision trees, statistical clustering algorithms, instance based learning and regression techniques [62, 66, 1, 152, 80, 17, 15, 131, 49], this issue has received little attention in the context of decision rules. This thesis is concerned with the construction of convex polytopes in N-dimensional instance space and the interpretation of these as rule-like structures. The polytopes are to be constructed by examining a training set of attribute-class tuples. It is conjectured that using arbitrarily-shaped decision surfaces will result in a system which performs well on a wide range of target concepts, particularly those concepts that are not readily represented by long, flat decision surfaces. It is also expected that, although individual rules or groups of decision surfaces may be complex, the collection of rules describing the domain will be both simple and small. The speed of modern computer hardware and the development of new geometric algorithms seem to have facilitated a frontal geometric assault on classification.
1.1 Background to Research

One of the major problems in expert systems development is obtaining domain knowledge for use in expert systems. Conventional knowledge acquisition from an expert is the bottleneck, taking much time and effort to represent even small parts of the expert's knowledge formally [138, 124, 84, 144]. Machine learning provides a way past this bottleneck. Supervised learning is well developed and well understood [123, 115, 42, 31, 128]. Concept learning is a form of supervised learning wherein an implementation of an algorithm, supplied with attribute-class pairs, will emit a concept description, or hypothesis, consistent with the data. The emitted concept description is a classifier and can be used to classify new items presented to it. The possible concepts emitted are determined by the biases [72] of the system, most particularly the language bias. This bias reflects the forms of concept which are expressible in the output language of the classification algorithm. In learning systems which divide the concept space using axis-parallel decision surfaces, the learned concept can only be expressed in terms of a collection of hyperrectangles. If the underlying concepts in the domain being explored do not have axis-parallel surfaces, then a learning system which expresses learned concepts using hyperrectangles will have difficulty in accurately and succinctly describing the underlying concepts. The consequence of this difficulty is the generation of a multiplicity of small and inappropriately shaped regions, the sum of which gives some degree of approximation to the underlying concepts. In some contexts, it will be desirable that the developed rules be comprehensible by humans. Typically, machine learning systems produce many rules per class and, although each rule may be individually comprehensible, not all will be, and holistic appreciation of the concepts modeled may be impossible due to their fragmentary presentation. It can be contended that comprehensibility of the rules is quite different from comprehensibility of the domain and that claims of human comprehensibility of rule sets may be quite unjustified. This comprehension is more problematic in neural networks, though some progress has been made recently [48, 129]. Each concept, constructed as a large convex polytope in this thesis, is expected to correspond closely to a single underlying concept of the domain. Although the structure of such concepts is not directly comprehensible, the form of the concepts gives access to work on extracting mathematical descriptions via the techniques of computational geometry, including diameters of polytopes, intersections and equations for the surfaces of polytopes [96].
1.2 Research Objectives

The modeling of induced concepts by a single, large region rather than many small regions is conjectured to be epistemologically desirable. Such a representation should also give access to results in computational geometry which permit the extraction of high-level mathematical descriptions of the concepts. It has to be demonstrated that concept representation using regions which are less strongly constrained in their shape than hyperrectangles will reduce the hypothesis language bias, which might be expected to be particularly strong in axis-orthogonal systems. The use of convex, but otherwise arbitrarily shaped, regions will be an important characteristic of systems developed in this work. The problems associated with constructing the simplest groupings must be clarified, and the minimally complex but satisfactory polytopes will be identified to be convex hulls. This will offer a much larger set of concept geometries than axis-orthogonal systems can. As well as being geometrically adequate, the chosen representation must be shown to be a viable basis for a classification system. Convex hulls are a superset of axis-orthogonal hyperrectangles, which are a common approach to division of the instance space, and there are reasons to believe they might provide a more flexible and general learning bias. It has to be demonstrated that convex hulls constitute a useful, least general generalisation of a group of points and that induction systems producing a small number of large, convex regions can be constructed. An important characteristic of systems developed in this thesis will be the small number of the regions representing a concept. It will be necessary to investigate the nature of possible generalisation operators in such a representation. A tight-fitting n-dimensional polygon (a polytope) offers generalisations, via geometric operations, which are of much smaller volume than generalisations which are minimal in terms of axis-orthogonally biased hypothesis languages. Such generalisation offers a much more conservative induction process and might be expected to make fewer errors in making positive predictions. The existence and usefulness of such geometric operators has to be demonstrated. It is important that the concept representation should be flexible enough so that its performance can be modified via the generalisation operators to minimise resubstitution errors or any other heuristic used to guide the system. Methods of achieving such performance modification will be investigated. Several prototype systems will be designed and implemented and their classification behaviour will be examined to guide the design of subsequent versions. The classification performance of the final version will be investigated and compared to other well-understood systems. The interaction between the characteristic parts of the representation (the convexity, largeness and small number of the hulls and the use of a decision list) will be investigated to attempt to identify how each contributes to the performance of the system. The types of tasks for which this approach is best suited will be identified. The viability of the geometric approach to classification will be assessed.
1.3 Principal Outcomes of this Thesis

The principal outcomes of this thesis are:

1. It has been demonstrated that efficient algorithms for the construction of convex hulls and the speed of modern computers have facilitated a direct geometric modeling of generalisation in N-dimensional space.

2. It has been demonstrated that the formation of convex hulls around a group of points of the same class is a usable generalisation and is less general than generalisations produced by axis-orthogonal based classifiers like decision trees, decision lists and rules (without constructive induction).

3. By highlighting the problems of constructing arbitrarily shaped polygons, it has been demonstrated that convex hulls are a simple, satisfactory method of constructing useful polytopes.

4. It has been demonstrated that the use of a small number of large hulls as a concept representation can provide classification performance superior to that of classifiers which use a large number of small fragmentary regions for each concept when the underlying concepts do not have axis-orthogonal decision surfaces. If the underlying concepts have AO decision surfaces, higher data densities are necessary to match the performance of AO systems. The reasons for this behaviour are explained.

5. The difficulty of constructing arbitrarily-shaped hulls incrementally is demonstrated through the construction of a prototype system (PIGS) and the investigation of its performance. The adoption of the convexity restraint solves the problems of arbitrary shapes elegantly.

6. Difficulties with run-times still occur when constructing convex hulls incrementally. In this approach, the hull is generalised, one point at a time, using successive training items of the correct class. After each new point is added, the hull is tested to ensure it does not now cover any training point of another class. When a contradictory example is found, the hull constructed so far needs to be undone and another hull of the same class is generalised to include it. If there is no hull which can be so generalised, a new hull must be created. This back-tracking, in an algorithm with potentially long run-times, is a practical reason for not using an incremental algorithm to build the classifier. Another reason for not using an incremental algorithm is that the construction of many small hulls is a much slower process than the construction of a few large hulls. This is because there will be few points which are already covered and which could be ignored. Also, every hull has to be tested to see if a point is covered before it can be decided whether to increment a hull and, indeed, which hull to increment. The consequential use of a small number of large convex hulls, rather than many small ones, in a decision list as a concept representation is desirable in its economy of representation of concepts.

7. A convex hull based classifier, CH1, has been implemented and tested. CH1 can handle categorical and continuous data. This demonstrates that a useful classifier can be constructed using convex hulls held in a decision list. The basic convex hulls are shown to be too specific to provide good classification performance for general use.

8. Algorithms for two basic generalisation operations on hulls, inflation and facet deletion, are presented. The two operations are shown to increase the predictive accuracy of the classifier.

9. The classifier was demonstrated to provide superior performance to well-known axis-parallel biased learning systems on data sets which have curved or non-axis-parallel decision surfaces. The general performance on a wide range of data sets from the UCI Repository was less good. It was hypothesised that, since most data sets are submitted by researchers using axis-orthogonally biased classifiers, there is a tendency for these data sets to provide good performance on such classifiers. This suggests that the data sets in the UCI Repository are less wide-ranging in character than might be thought at first.

10. The classifier was modified in two ways to investigate the contribution of the characteristics of CH1 to the performance on the UCI data sets. The first modification used a few large AO hulls instead of a few large convex hulls to investigate the contribution of convex hulls. The second was a hybrid system which used many small convex hulls rather than fewer, larger convex hulls. The first showed that the convex hulls perform less well than AO systems over a range of UCI data sets, especially when the underlying classes have, or are likely to have, axis-orthogonal decision surfaces. The second showed that even many small convex hulls cannot reach the same level of performance as an AO-based system when the data set has AO underlying concepts. These conclusions reflect more on the inadvisability of using a learning system on a data set for which it is inappropriately biased than on any inherent fault in the design or operation of CH1.
1.4 Methodology

Analytical studies of the performance of complex learning algorithms are very difficult since they depend upon a detailed knowledge of the data distribution and ordering in the domain. Thus this work will, as does most machine learning, depend upon empirical results from experiments to evaluate the concept learning system. Experiments will, unless noted otherwise, consist of 100 repetitions of:

- randomly select and shuffle a test (80%) and training set (20%) from the domain data set;
- construct classifiers from one or more learning systems using the training set;
- evaluate the performance of the above classifiers using the test set;
- compare the performance of the classifiers.

This should ensure that performance comparisons will be as fair and unbiased as possible. The comparisons will be done using matched-pair two-tailed t-tests on the predictive accuracies of two systems on each of the 100 runs of each experiment, and using a sign test on the average predictive accuracies of each system over the 100 runs for each of a number of different domains. C4.5 [104], CN2 [26] and OC1 [80] will be used to provide a basis for comparison of the techniques being developed against well-established, modern axis-orthogonal classifiers. Since the classifier construction method is applicable only to continuous variables, the data sets chosen for use have all or most attributes continuous. The algorithm handles categorical attributes so that domains with some categorical attributes but mainly continuous ones can be explored. The classifier cannot handle missing values and these are treated in two ways. Where there are a small number of instances with missing values in a big set, the instances with missing values are simply removed. Where this approach is not feasible, missing values are replaced by the mean for that attribute. The run times for some data sets are unacceptably long and some data sets are consequently reduced in size to facilitate better run times. The quickhull software is the cause of the problem. In some data sets the heuristic guiding its hull formation works well but not in others. It is also possible that the innate roughness of the concept being modelled requires many surface facets, thus causing long run times. Details of the data sets used and modifications to them are in the appendices.
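The matched-pair comparison on per-run accuracies reduces to a standard paired t statistic. The following C sketch shows the calculation; the function name and calling convention are illustrative rather than taken from the thesis software, and the resulting statistic would be compared against the two-tailed critical value of the t distribution with n - 1 degrees of freedom.

#include <math.h>

/* Matched-pair t statistic for comparing two classifiers over the same runs.
 * acc_a[i] and acc_b[i] are the predictive accuracies of the two systems on
 * run i (i = 0..n-1 over the repetitions of an experiment). */
double paired_t_statistic(const double *acc_a, const double *acc_b, int n)
{
    double mean = 0.0, ss = 0.0;

    for (int i = 0; i < n; i++)
        mean += acc_a[i] - acc_b[i];
    mean /= n;

    for (int i = 0; i < n; i++) {
        double dev = (acc_a[i] - acc_b[i]) - mean;
        ss += dev * dev;
    }

    double sd = sqrt(ss / (n - 1));   /* sample std. dev. of the differences */
    return mean / (sd / sqrt((double)n));
}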
1.5 Structure of this Thesis

This introduction describes the issues and methodology of the thesis. Attention is drawn to the interrelation of the output language bias of classifiers and the geometry of the underlying actual concepts, and how the geometric approach might markedly reduce the conflicts in this relationship. The objectives and main outcomes are described.

Chapter 2 is a survey of literature relevant to the geometric approach, including geometric appreciations of learning systems and a survey of software systems which were considered for the experimental work. A software package, which will be used for constructing convex hulls for experimental purposes within the thesis, is chosen and the decision is justified. The literature on performance measurement for classifiers is reviewed and recommendations are made as to which ones will be used and why.

Chapter 3 describes a simple prototype system which examines a naive approach to enclosing groups of points. A number of simple experiments are analysed to identify the problems arising. It is noted that the use of convex hulls will address these problems without any arbitrary constraints which would otherwise be necessary. Other advantages consequent upon using convex hulls are described elsewhere.

Chapter 4 describes the implementation of a convex hull based classification algorithm and examines its performance on a variety of artificial data sets to provide a preliminary indication of the characteristics of its classification performance. Some deficiencies in its performance are shown and the problems are hypothesised to be caused by the maximal specialisation of the convex hulls. Attention is drawn to how these characteristics might be advantageously employed in a classifier and under what circumstances this would be possible.

Chapter 5 introduces the idea of inflation of hulls as a possible approach to modifying the performance of the classifier. It is demonstrated that convex hulls can be inflated, much like a balloon, and various strategies for selecting the amount of inflation are discussed and two are implemented. Experimental results are obtained which show that inflation can reduce the specialisation of hulls and improve their classification performance.

Chapter 6 introduces the idea of reducing hull specialisation by deleting facets that do not contribute to the resubstitution accuracy of the classifier, in the belief that these deleted facets are unlikely to contribute to the accuracy of the final classifier. Experiments are performed to validate this belief and to compare deletion with inflation. Deletion alone is shown to be less useful than inflation. The reason for this is explained. Deletion with subsequent inflation is shown to be superior to inflation alone.

In Chapter 7, the final, optimised version of CH1 is compared with CN2, C4.5 and OC1 on some data sets which are known to not be axis-orthogonal. CH1 is shown to be superior to C4.5, CN2 and OC1 in particular respects. CH1 is then tested against C4.5 and CN2 on a variety of data sets from the UCI Repository. The results are mixed and some reasons for this are discussed, particularly implicit biases in the data sets, the convexity of the hulls and the largeness of the hulls. The characteristics of domains where CH1 provides good performance are identified.

Chapter 8 explores the contribution of the convexity of the hulls to the performance of CH1 on the UCI data sets by replacing large convex hulls with large axis-orthogonal hulls. The experiments of Chapters 5 and 6 are repeated and the outcomes using the two different types of hull, while keeping all other factors equal, are compared and contrasted. It is concluded that the bias implicit in most of the data sets favours AO systems and that convex hulls need much more data to compete on that kind of data set.

Chapter 9 explores the contribution of the largeness of the hulls to the performance of CH1, and another version of the software which uses many small convex hulls is implemented. Another classifier, CN2, is used to pre-classify groups of points and hulls are built round these groups. The performance of this hybrid classifier is investigated experimentally. It is concluded that even many small hulls will not compensate for the mismatch between the underlying bias in the data and the bias of the classifier.

Chapter 10 summarises the work done in this thesis and the conclusions drawn. Some interesting issues for future research are also noted.
Chapter 2

Review

2.1 Chapter Outline

This chapter examines the underlying ideas of a geometric representation of concepts. A simple taxonomy of concept geometries is introduced to facilitate discussion of the interaction of the geometries of concepts and of generatable hypotheses. Machine learning algorithms which have relevance to the convex hull/decision list methodology pursued in this work are reviewed. Since the novelty of the proposed method lies in its style of generalisation, the method by which other algorithms form generalisations and the geometry of the generalisations will be the main focus of the examination. Other aspects of each algorithm which have no direct relevance to the geometric approach being adopted will be dealt with summarily. Some of the basic theory behind convex hulls is reviewed, as are a number of implementations of convex hull constructing algorithms. An implementation is chosen for use in this work and the reasons for the choice are presented. Several different, although complementary, methods of evaluation are examined. Accuracy is the most commonly used measure and so is examined in detail. The ROC measure is quite different from accuracy and gives a view of overall performance at a range of settings. Cost-sensitive techniques offer a set of different metrics against which one can optimise classification performance and they also provide insight into the characteristics of the operation of the classifier.
2.1.1 Geometric View of Generalisation

All types of classifier can be regarded as implicitly partitioning a volume of instance space. The characteristics of these volumes for different types of classifier are described in the following sections. The geometric interpretation is essentially an afterthought for these methodologies. It is not a concept of the classifier construction process or of the classification procedure, just a convenient visualisation of the underlying concept. This thesis pursues an explicitly geometric approach to classification learning in order to seek new insights. A geometric view will be taken of both classifier construction and subsequent classification. One of the problems of many classifier algorithms is that a single underlying concept may be described by a large number of small instance space divisions, which reduces the probability of human comprehension of the rules. The complexity of the representation of an hypothesis produced by a given computational learning system is a function of both the hypothesis language and the target concept. If the concept being learned and the hypothesis language are geometrically similar, the concept will be concisely represented if appropriate attributes are available and used by the algorithm. Otherwise the representation of the concept will be by a large number of inappropriately shaped decision surfaces. The typical axis-orthogonal decision tree representation of the concept shown in Figure 2.1 is an example of the problem. The other problem is that when the classifier produces concepts of a different shape from the underlying concept, there will always be error in the representation.
(Figure 2.1: A Complex Representation of a Simple Concept — a target concept with two decision surfaces and its typical approximation with AO surfaces.)

This is expected from the Law of Conservation of Generalisation Performance [116] or the No Free Lunch Theorem [151], notwithstanding the caveats of Rao et al. [108], but is still a problem as limited data limits the accuracy of the representation. Since the geometric approach is considerably less limited in the orientation and placement of decision surfaces, one might expect better performance than axis-orthogonal systems can provide over a range of domains where the underlying decision surfaces are not axis-orthogonal.
2.2 Attribute Types

2.2.1 Underlying and Imposed Metrics

Consider any arbitrary data space with a set of n observed and classified instances with numerical attributes $a_1, \ldots, a_d$, where d is the dimensionality of the attribute space. Nothing else is known or assumed (an oracle is excluded) about the data space and classes within it. The attributes appear uniformly continuous because a continuous scale has been imposed on the underlying metric of the attribute space. Consider a 2-D space and suppose there are 3 points, $(x_1, y_1)$, $(x_2, y_2)$ and $(x_3, y_3)$, such that $|x_2 - x_1|$

$$F(d, N) = \begin{cases} \dfrac{2N}{d}\dbinom{N - \lfloor d/2 \rfloor - 1}{d/2 - 1} & \text{for } d \text{ even} \\[8pt] 2\dbinom{N - \lfloor d/2 \rfloor - 1}{\lfloor d/2 \rfloor} & \text{for } d \text{ odd} \end{cases}$$

However, the expected number of facets for random points is proportional to $\log^{d-1} n$ [10].

6. the algorithm should output a facet list with components of the unit normal and the distance from the origin, to facilitate later tests for inclusion of points in the hull by the concept learning software.
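As a quick check of the facet bound as reconstructed here, the odd-dimensional case reproduces the familiar three-dimensional result: $F(3, N) = 2\binom{N-2}{1} = 2N - 4$, the maximum number of triangular facets of a 3-D convex hull on N vertices.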
2.10 Survey of Convex Hull Software

2.10.1 cdd

cdd [22] is a C implementation by Fukuda of the Double Description Method [73] and generates the vertices of a general convex polyhedron given by a system of linear inequalities. Input and output are in polyhedra format [5]. The implementation can do the opposite transformation: when supplied with a list of vertices and rays, it will construct the hull. The computations are done using floats and not in infinite precision arithmetic. This software was not attractive since attribute vector data sets would need preprocessing before use and the output format is, similarly, not immediately useful. Also, the data sets provided are not lists of vertices: the majority of points will be internal to the implied convex hull.
2.10.2 chD

Emiris et al. [35] describe a convex hull construction algorithm combined with perturbations to side-step the typical problems of having too many points on a single plane or having points covertical when a sweep-line search algorithm is used. The implementation is chD [24]. All coordinates must be integers with an absolute value less than 2^31 and the number of dimensions must be in the range 2 to 20. There is a scaling such that the initial simplex's volume is 1, and this will make scaling of the test set a problem as the scaling will change from run to run.
2.10.3 Hull

This software constructs convex hulls in general, but small, dimension. The incremental algorithm is described by Clarkson et al. [28] and the numerical code for normals to facets is described in [27]. While points may be input as floats, they are rounded to integers. Facets are output as lists of their vertices. This implementation has only been ported to Crays and SGIs.
2.10.4 Porta

Christof and Loebel [94] present a collection of routines for analyzing polytopes. Polyhedra are transformed, using Fourier-Motzkin elimination, between two representations:

1. the convex hull of a set of points and the convex cone of a set of vectors;
2. a system of linear equations and inequalities.

Numerical quantities are integers and operations are performed using rational numbers. Integer points can be determined to be inside the hull.
2.10.5 lrs, qrs, rs

lrs, like the earlier versions rs [5] and qrs, finds all the vertices of an intersection of halfspaces by walking from vertex to vertex. It is written in C, with exact rational arithmetic and optional lexicographic symbolic perturbation to handle degenerate polytopes. These programs use integer input only.
2.10.6 qhull

The Quickhull Algorithm [6] uses a variant of Grunbaum's Beneath-Beyond Theorem [46]. An initial set of points which forms a simplicial complex [96] is selected. For each facet, a list of points beyond the facet is constructed (there is a visibility constraint on which facets a point can be associated with), then the point farthest beyond the facet is processed. Facets which become internal are deleted from the facet list. This use of the farthest point should minimise the number of points to be processed. For a point to be within a convex hull, it must be beneath every hyperplane, and this is easily ascertained by comparing the inner product of the point and the hyperplane normal with the distance from the origin. The Qhull implementation [125] accepts lists of points expressed in floats or doubles and outputs a facet list, with each facet being represented by the components of a unit outward-pointing normal to the hyperplane and the minimum distance from the hyperplane to the origin. The software has a large number of options, particularly variable convexity constraints which give some control over the merging of facets and, thus, of the final number of facets for a set of points.
2.11 Choice of Package

The qhull implementation was chosen for the construction of convex hulls and will be called from the classifier software written for this thesis. This choice was made because it provides:

1. straightforward use of data sets from the UCI repository without transformation;
2. output in an immediately useful form;
3. control over the size of the facet list;
4. easy access to testing new points for inclusion;
5. floating point arithmetic;
6. no rotation of points;
7. scaling can be done simply to both training and test data sets if necessary.
2.12 Performance Metrics

2.12.1 Accuracy

The simplest measure of the performance of a classifier is its accuracy or "true misclassification rate". There are a number of estimates of this accuracy which can be used. Assume (following Breiman et al. [15]) that instance x, with d attributes $x_1, \ldots, x_d$, is drawn from the set of all instances. Associated with each instance is a class, c, drawn from the set of all classes, C, so that the training and test sets consist of ordered tuples (x, c). The classifier acts like a function d(x) which predicts a class, $c' \in C$. An evaluation function, E, is defined to return 1 if d(x) = c, the actual class of x, and 0 if $d(x) \neq c$. The data set is of size N.

Resubstitution Accuracy

This is computed using the same data as was used to construct the classifier:

$$acc_{resub} = \frac{1}{N} \sum_{i=1}^{N} E(d(x_i), c_i)$$

Many classifiers try to maximise this value and it can clearly give a very optimistic estimate. Indeed, many systems will produce a resubstitution accuracy of 100% for noise-free data. However, classifiers that have been pruned will give lower resubstitution accuracy, though this measure will still tend to overestimate predictive accuracy on previously unsighted cases.

Test Sample Accuracy

In this estimation, some proportion, say 1/3, of the data is reserved for testing and the classifier is constructed using the remaining 2/3. This method has drawbacks in that the classifier is constructed with a smaller sample size and that care has to be taken in selection of the test sample. The most common approach is to select the test set randomly, but some experimenters ensure that the frequency of classes is the same in the test and training data. Assuming there are $N_2$ items in the test set, T, the accuracy from this technique is

$$acc_{test} = \frac{1}{N_2} \sum_{(x_n, c_n) \in T} E(d(x_n), c_n)$$

There is no established justification for a 1/3 to 2/3 split rather than 1/5 to 4/5 or any other particular values. If the data set is large, this method is an excellent approach because the large number of training items will ensure construction of as good concepts as possible. Similarly, the large number of unseen test items will ensure thorough and fair testing of the classifier.

Cross-validation Accuracy

For smaller sample sizes, using the test and train method above will not be desirable since there will not be enough training items to ensure the construction of good concepts after the usual proportion of test items has been removed. One approach to this problem is for the data to be split into V equal sections, a classifier formed using V - 1 of the sections and a test done using the Vth section. For each section omitted, a classifier $d_v$ can be constructed and its test accuracy evaluated as

$$acc_{test}(d_v) = \frac{1}{N_v} \sum_{(x_n, c_n) \in T_v} E(d_v(x_n), c_n)$$

This is repeated leaving each of the V sections out in turn and using the average accuracy as the estimate of the true accuracy:

$$acc_{cv} = \frac{1}{V} \sum_{v=1}^{V} acc_{test}(d_v)$$

There is no established correct value of V and values of 5 to 10 are seen in the literature. For very small data sets, it may be desirable to use "leave one out" training and testing, so that V = N.
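A minimal C sketch of the V-fold estimate follows. The train/predict interface is hypothetical (the thesis's own classifier code is not reproduced here), folds are assigned round-robin by index for brevity, and n is assumed to be at least V.

#include <stdlib.h>

/* Hypothetical classifier interface. */
typedef struct Model Model;
extern Model *train(const double *X, const int *y, size_t n, size_t d);
extern int    predict(const Model *m, const double *x);
extern void   free_model(Model *m);

/* V-fold cross-validation accuracy: train on V-1 folds, test on the
 * held-out fold, and average the V test accuracies. */
double cv_accuracy(const double *X, const int *y, size_t n, size_t d, size_t V)
{
    double *Xtr   = malloc(n * d * sizeof *Xtr);
    int    *ytr   = malloc(n * sizeof *ytr);
    double  total = 0.0;

    for (size_t v = 0; v < V; v++) {
        size_t ntr = 0, ntest = 0, correct = 0;

        for (size_t i = 0; i < n; i++) {        /* assemble the training folds */
            if (i % V == v)
                continue;                       /* held out for testing        */
            for (size_t j = 0; j < d; j++)
                Xtr[ntr * d + j] = X[i * d + j];
            ytr[ntr++] = y[i];
        }

        Model *m = train(Xtr, ytr, ntr, d);
        for (size_t i = 0; i < n; i++) {        /* test on the omitted fold    */
            if (i % V != v)
                continue;
            ntest++;
            if (predict(m, &X[i * d]) == y[i])
                correct++;
        }
        free_model(m);

        total += (double)correct / ntest;
    }

    free(Xtr);
    free(ytr);
    return total / V;
}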
2.13 Relative Operating Characteristic

Swets [126] points out that accuracy measures can be obtained in misleading ways, particularly when the relative frequencies of events are very different. When systems are required to distinguish between just two different cases, there are only a small number of possible outcomes, as shown in Table 2.1.

Table 2.1: 2x2 Contingency Table

                    Actual Pos. (AP)   Actual Neg. (AN)
Classed Pos. (CP)   True Pos. (TP)     False Pos. (FP)    TP+FP
Classed Neg. (CN)   False Neg. (FN)    True Neg. (TN)     FN+TN
                    TP+FN              FP+TN              N = TP+FP+TN+FN

If we consider proportions rather than actual frequencies then, because of the nature of the table, it is only necessary to record one pair of the complementary proportions, and usually the top two in the table are used (TP/(TP+FN) and FP/(FP+TN)). Consequently,

$$\frac{FN}{TP + FN} = 1 - \frac{TP}{TP + FN}$$

and

$$\frac{TN}{FP + TN} = 1 - \frac{FP}{FP + TN}$$

giving the other table entry proportions. When a positive diagnosis is made according to more lenient criteria, the proportions of both true and false positives will tend to rise while the proportions of true and false negatives will tend to decline. If the decision criteria are made more strict to reduce the proportion of false positives, then the proportion of true positives will also decline. A measure of accuracy should be valid for all settings which can be made. The true positive value is plotted against the corresponding false positive value for many settings of the diagnostic system. The curve is then characterised by the proportion of the area which is under the graph: 0.5 implies the system provides no discrimination and 1.0 that it provides perfect discrimination. Studies in the areas of weather forecasting [2], information retrieval, aptitude testing, medical imaging, materials testing and polygraph lie detection are discussed. This metric has not been used in any other machine learning work to this investigator's knowledge.
2.14 More Informative Performance Metrics

Although accuracy is a simple, single metric for the evaluation of classification systems, its very simplicity reduces its applicability in real-life situations since it gives no indication of the types of errors a system will make, and this can be an important consideration in practical systems. Consider a system for diagnosing rabies: the disease is deadly and the treatment itself is dangerous. One must avoid giving the treatment unnecessarily, so a diagnostic system wherein the errors are false positives is totally unacceptable even if its accuracy is superior to all other diagnostic systems. One which gives false negatives is much more acceptable since one can then wait for more definite symptoms before embarking on a dangerous cure. Contrariwise, consider a system for diagnosing incipient appendicitis for use on astronauts going on a long flight. This time the treatment is almost entirely safe although the complications from the appendicitis can be fatal to someone without access to a hospital. Now a system which gives false positives is acceptable but one which gives false negatives is totally unacceptable even if it possesses high accuracy. Clearly, it is necessary to have separate measures for each class so that the performance of the classifier can be understood intimately and the behaviour of the classifier tuned to suit misclassification costs. Thus more sensitive performance measures, such as those described by Weiss et al. [149] and illustrated in Table 2.1, are required. These measures are:

Positive Predictive Value (PPV), which is the fraction of those items identified as belonging to a class which actually belong to the class. PPV = TP / (TP + FP)

Negative Predictive Value (NPV), which is the fraction of items identified as not belonging to a class which actually do not belong to that class. NPV = TN / (TN + FN)

Sensitivity, which is the fraction of the items which belong to a class which are correctly identified as belonging to that class. Sensitivity = TP / (TP + FN)

Specificity, which is the fraction of items which do not belong to a class and are correctly identified as not belonging to that class. Specificity = TN / (TN + FP)

Accuracy = (TP + TN) / (TP + FP + TN + FN)

With these measures it is possible to identify the characteristics of a system which cause poor performance and, perhaps, be able to adjust the system to minimise the effect on overall accuracy.
2.15 Misclassification Cost-based Metrics

This section is not concerned with costs incurred in evaluating attribute values [127, 130] but with trading one metric against another. Pazzani [88] discusses trading off cover, the ability to classify every instance, against accuracy. Subsequently, Pazzani et al. [86, 87] discuss strategies for reordering rules in decision lists to minimise misclassification costs. It will be demonstrated in this thesis that the convex hull based methodology will allow the trading off of errors in predicting one class for errors predicting another class by modifying individual rules without reordering. If the consequences of mistaken classifications are not of equal cost for all classes, the placement of decision surfaces can be adjusted so that the total cost of misclassifications is minimised rather than the number of points misclassified. Most research in machine learning assumes all misclassifications have the same cost, but this is not necessarily the case, as was demonstrated above. Nonetheless, the assignment of actual values for misclassification costs is difficult and somewhat arbitrary. Webb [147] describes four approaches to minimising misclassification costs:

1. Divide the data into subsets and perform experiments with different learning biases to see which one to choose for the main task [98].
2. "Better safe than sorry" [97], which permits rules for classes with high misclassification cost if they have high empirical support.
3. Vary the empirical bias of the learning system to reduce the occurrence of high cost misclassifications [15, 87, 97].
4. Use background knowledge to bias the system toward suitable hypotheses [45].

Suppose, following Michie et al. [69] (ch. 2), the cost of misclassifying an object of class i as class j is c(i, j), the probability of an item of class i is $p_i$, the cost of correctly classifying an item is zero and all misclassifications have the same cost c = c(i, j) for $i \neq j$. Now, if all observations are assigned to class d, the cost will be

$$C_d = \sum_{i} p_i \, c(i, d) = \sum_{i \neq d} p_i \, c = c \sum_{i \neq d} p_i = c(1 - p_d)$$

So the cost will be minimised by defaulting all classifications to the class of highest probability for the object. Misclassification costs can be positive [147], where the cost is a function of the class to which an object is assigned, or negative, where the costs are a function of the actual class of an object.
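As a numerical illustration (the priors are invented for this example): with three classes of prior probabilities p = (0.5, 0.3, 0.2) and uniform misclassification cost c, defaulting every prediction to class 1 incurs expected cost $C_1 = c(1 - 0.5) = 0.5c$, whereas defaulting to class 3 incurs $C_3 = c(1 - 0.2) = 0.8c$, so the majority class is the cheapest default.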
2.16 Statistical Measures

In comparing two or more classification systems on a single domain, matched-pair training and test sets will be used and the statistical comparison will use a t-test with p

WHILE ... > 0 AND num-misclassified-pts < prev-num-misclassified-pts[this class]
    prev-num-misclassified-pts[this class] = num-misclassified-pts
    misclassified-points = all points in training-set not correctly classified by Rule-list
    construct-next-rule and update num-misclassified-pts
ENDWHILE
END.

CONSTRUCT-NEXT-RULE
    SET best-rule to empty
    COUNT number of each class in misclassified-points
    FOR all classes
        IF number in this class is zero
            continue to next class
        ENDIF
        extract-next-rule
        FIND best position for next rule
        IF this rule is best so far
            SAVE rule to best-rule
        ENDIF
    ENDFOR
    INSERT best-rule in DL in appropriate position
    CONSTRUCT new misclassified-points list
END.

EXTRACT-NEXT-RULE
    FOR all points in misclassified-points list
        FOR all attributes
            IF categorical
                SET array entry in rule
            ELSE
                PUT in continuous attribute file
            ENDIF
        ENDFOR
    ENDFOR
    IF not enough points to construct hull OR forcing AO HULL
        construct AO HULL
    ELSE
        construct convex hull
        IF qhull fails
            construct AO HULL
        ENDIF
    ENDIF
END.
4.3.1 Time Complexity of CH1
Consider the classification of n points in d-dimensional space where there are c actual concepts. It is a design expectation that there will be approximately c hulls. It is also the case that a point which is outside any given hull will be beneath approximately half the facets of that hull and beyond the other half.
Testing for Coverage of a Point
Firstly, the time for considering the coverage of categorical values is negligible compared to that for continuous attributes and will be ignored. A point is considered to be covered only when it is beneath every facet of a given hull. Since there are expected to be few hulls, a sequential inspection is used. Since the facets are unordered, they are also inspected sequentially, but as soon as the current point is found to be beyond the current facet, inspection of the facets of the current hull can be abandoned. Assuming the data points are approximately evenly divided between hulls, a hull will have about log^{d−1}(n/c) facets (see Section 2.9.1). Typically, we would expect to test half the hulls before finding one which does not cover the current point. Thus testing a single point has time complexity O((c/2) log^{d−1}(n/c)) = O(c log^{d−1}(n/c)), and testing all points for classification has time complexity O(nc log^{d−1}(n/c)).
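A minimal C sketch of this coverage test follows (the Facet and Hull types and the point_in_hull function are hypothetical names chosen for illustration; the actual CH1 data structures are described in Section 4.3.2):

    typedef struct {
        double  offset;      /* signed offset of the hyperplane from the origin     */
        double  inflation;   /* distance above the plane still treated as "beneath" */
        double *normal;      /* unit outward-pointing normal, one component per dim */
    } Facet;

    typedef struct {
        int    dim;          /* number of continuous attributes */
        int    n_facets;
        Facet *facets;
    } Hull;

    /* Returns 1 if point x lies beneath every facet of the hull, 0 otherwise.
       The scan of a hull's facets is abandoned at the first facet the point
       is beyond, as described in the text. */
    static int point_in_hull(const Hull *h, const double *x)
    {
        for (int f = 0; f < h->n_facets; f++) {
            const Facet *fc = &h->facets[f];
            double dist = -fc->offset;            /* signed distance above the plane */
            for (int a = 0; a < h->dim; a++)
                dist += fc->normal[a] * x[a];
            if (dist > fc->inflation)             /* beyond this facet */
                return 0;
        }
        return 1;
    }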
Constructing the List of Hulls
It can be expected from the design criteria that there will be approximately c groups of unclassified points to be processed. Finding the most populous class in each pass involves counting classes and takes one pass through O(n) points. Forming the convex hull has time complexity (Section 2.9) of O(n^⌊(d+1)/2⌋). Finding the best position for the new rule takes approximately c² operations, and reclassifying all points each time takes O(c log^{d−1} n) operations. Thus the time complexity of the hull creating algorithm is O(c³ log^{d−1}(n) n^⌊(d+1)/2⌋).
4.3.2 Implementation of CH1
The algorithm was implemented in C and interfaced to the quickhull software [6]. When a convex hull has been created, it is stored in the calling program's rule list, in the appropriate position, for use when classifying test points. If a convex hull cannot be created because there are insufficient points or the hull is degenerate, an axis orthogonal hull is substituted for it. Each rule then contains a set of categorical attribute values and either a list of convex hull facets or a set of minimum and maximum values for each attribute representing the axis orthogonal hull. Each facet in the list contains:
- the signed offset of the hyperplane from the origin (the sign specifies which side is beneath and which is beyond);
- a list of the components of a unit outward-pointing normal to the hyperplane;
- the distance above the plane which is considered to be beneath the plane. This is usually a small number reflecting the rounding errors in the calculations, but it will be manipulated later when inflating hulls.
Thus, in an N-dimensional domain, each facet is represented by N + 2 floating point numbers.
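A sketch of this rule representation, building on the hypothetical Facet type above (the field and type names are illustrative assumptions, not CH1's actual source): a rule pairs its categorical attribute values with either a facet list or an axis orthogonal bounding box, and each facet occupies N + 2 floating point values (offset, tolerance and N normal components).

    #define MAX_CAT_ATTRS 32                 /* illustrative bound on categorical attributes */

    typedef struct {
        int     cat_value[MAX_CAT_ATTRS];    /* required categorical attribute values        */
        int     n_facets;                    /* > 0: convex hull rule; 0: AO hull rule       */
        Facet  *facets;                      /* offset + tolerance + N normal components
                                                = N + 2 floating point numbers per facet     */
        double *ao_min, *ao_max;             /* per-attribute bounds for the AO fallback     */
        int     class_label;                 /* class predicted when this rule fires         */
    } Rule;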
4.4 Comparison of the Classification Performance of CH1 and C4.5
In order to investigate the differences in the performance of CH1 and C4.5 in a systematic way, artificial data sets were used so that the density of data points, the dimensionality of the data set and the complexity of the data set could be controlled. C4.5 was chosen as a well-known exemplar of an SAP or AO system and will be used as such for comparisons throughout this thesis.
Since these artificial data sets are easy to visualize, they were employed until CH1 was reasonably functional (the early versions could not handle categorical data); evaluation could then be done using real-world data sets. Some artificial data sets which will cause an SAP classifier to produce many small, rectangular regions were designed to compare the performance of an SAP system and the convex hull system. It is expected that the convex hull system is better biased for learning these concepts, and differences in performance can be expected on that basis. The artificial data sets consisted of:
- a circle embedded in a rectangular universe and its 3 and 4 dimensional analogues (class 0 is the outer area of the universe and class 1 is inside the circle). The data sets, with typical point populations in classes 0 and 1 in brackets, are identified as "circle" (47.5%:52.5%), "sphere" (62.5%:37.5%) and "hyp-sphr" (80.0%:20.0%);
- two concentric circles embedded in a rectangular universe and the 3D and 4D analogues (class 0 is the outer area of the universe, class 1 is the annulus and class 2 is inside the inner circle). The data sets, with typical populations as before, are identified as "CnCrcl" (or "polo") (34.5%:14.0%:51.5%), "CnSphr" (or "solidpolo" or "spolo") (67.5%:23.0%:9.5%) and "CnHySp" (or "hyperpolo" or "hpolo") (71.5%:18.0%:10.5%);
- a more complex 2D universe (called RCC) containing a rectangle and two circles, shown in Figure 4.2 (50%:30%:20%). This universe exhibits disjunction for class 1, which makes it more complex.
For each experiment a data set of size 1000, 2000, 4000, 8000 or 15000 items was generated, with elements randomly placed within the data universe (range 0 to 20 in each dimension) and classified according to their location.
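A minimal C sketch of how such an artificial set can be generated. The circle's centre and radius are not stated in the text; the values below are assumptions chosen only to give roughly the class proportions quoted above.

    #include <stdio.h>
    #include <stdlib.h>

    /* Generate n points uniformly in the 20x20 "circle" universe and label
       them by location: class 1 inside the circle, class 0 outside.
       Centre (10,10) and radius 8.2 are illustrative assumptions. */
    static void generate_circle_data(int n, unsigned seed)
    {
        const double cx = 10.0, cy = 10.0, r = 8.2;
        srand(seed);
        for (int i = 0; i < n; i++) {
            double x = 20.0 * rand() / (double)RAND_MAX;
            double y = 20.0 * rand() / (double)RAND_MAX;
            int cls = ((x - cx) * (x - cx) + (y - cy) * (y - cy) <= r * r) ? 1 : 0;
            printf("%.4f,%.4f,%d\n", x, y, cls);
        }
    }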
Figure 4.2: The "RCC" Universe

The data set was then randomly partitioned into 80% training examples and 20% testing examples, with random ordering within each, and the resultant sets were presented to both C4.5 and CH1. The partitioning, shuffling and testing were carried out 100 times for each data set. The performance metrics, described in Section 2.14, are extracted and used to understand the differing performances of the two systems to a depth which the use of accuracy alone would not permit. The results for each experiment are summarised in Tables 4.1 to 4.5 and the accompanying graphs in Figures 4.3 to 4.7. The first result in each box is the C4.5 result and the second is the CH1 result. Results which are significantly different at the 0.05 level are marked with an "*" beside the superior result. Lack of an "*" implies there is no significant difference between the values.
4.4.1 Analysis of NPV Results
These results are presented in Table 4.1 and Figure 4.3. Examining the performance of the classifiers (CH1 and C4.5) on every data set, it can be seen that CH1 is almost always significantly better for the rule for the outermost class and never significantly worse. The value is always very high and, for other than the 2D case, is 100%, implying that everything which CH1 classifies as not belonging to this class does not, in fact, belong to this class.
This is as one would expect, as the CH1 method of constructing the hull for inner objects will not overgeneralise and include inappropriate objects; thus all objects of the outer class will be seen as such. For the rule for the innermost class, C4.5 is usually significantly better and never worse on any of the tests, although one can see from the graphs that the performance of CH1 approaches that of C4.5 as the data set gets very large. One might surmise that CH1 will always eventually achieve comparable or better performance if the data set is sufficiently large. In the concentric data sets, where there is a middle class, it appears that, if there is sufficient data, CH1 will have a better NPV than C4.5. However, the necessary amount of data increases with the dimensionality of the data set. In a 2D data set, 1000 points is enough for CH1 to be superior, but in 3D it needs somewhat over 2000 points and, in 4D, it needs considerably more than the 15000 points used in the experiment. In the "RCC" data set, CH1 does better on the outer class and C4.5 on the other two but, as the data density rises, the performance of CH1 approaches and eventually passes that of C4.5.
4.4.2 Analysis of PPV Results
These results are presented in Table 4.2 and Figure 4.4. For the outermost class, the PPV of C4.5 is almost always significantly better than that of CH1, although the CH1 value approaches that of C4.5 as the data density increases, becoming negligibly different for "circle" and becoming superior for "CnCrcl". The same effect is present, but less dramatic, at higher dimensionality. In this case the tight bound which CH1 puts on the inner class is counterproductive, since it leaves a possibly large number of inner class points outside the boundary, and these are subsequently wrongly classified as the outer class, leading to a low PPV for CH1 for the outer class.
Data size     1000            2000            4000            8000            15000
circle 0      95.09/95.49     96.80/97.74*    97.44/99.87*    98.31/99.43*    98.82/99.99*
circle 1      *95.88/93.46    *97.71/96.18    *98.11/97.49    *98.76/98.40    99.07/99.02
sphere 0      89.31/100*      90.37/100*      93.11/100*      93.78/100*      94.73/100*
sphere 1      *92.85/87.27    *94.47/90.79    *95.81/92.45    *96.39/94.92    *97.07/95.95
hyp-sphr 0    73.74/100*      82.06/100*      84.13/100*      no result       89.71/100*
hyp-sphr 1    *94.74/88.92    *95.04/89.51    *95.55/91.43    *96.64/93.40    *97.21/94.69
CnCrcl 0      96.83/96.74     97.26/97.74*    98.37/98.98*    98.63/99.40*    98.97/99.66*
CnCrcl 1      94.84/96.20*    96.65/97.47*    97.44/98.45*    98.11/98.97*    98.65/99.36*
CnCrcl 2      *96.53/93.26    *97.50/96.33    *98.01/97.41    *98.60/98.28    *99.22/98.97
CnSphr 0      89.28/100*      92.05/100*      93.87/100*      94.12/100*      95.28/100*
CnSphr 1      *88.79/87.63    90.86/90.56     92.20/93.28*    93.79/95.01*    94.89/96.38*
CnSphr 2      *94.20/90.82    *95.17/91.54    *96.13/93.82    *97.08/95.63    *97.62/96.83
CnHySph 0     79.42/100*      82.51/100*      85.88/100*      87.59/100*      89.59/100*
CnHySp 1      *88.98/83.78    *90.39/86.56    *92.01/90.13    *93.35/91.82    *94.62/93.59
CnHySp 2      *95.24/90.83    *97.07/93.64    *97.24/94.81    *97.74/95.67    *98.23/96.66
RCC 0         95.70/97.34*    97.30/98.34*    97.82/99.19     98.36/99.55*    98.86/99.74*
RCC 1         *98.05/96.52    *98.47/97.37    *99.03/98.42    *99.30/99.01    99.34/99.40*
RCC 2         *98.11/96.50    *99.19/98.08    *99.15/98.65    *99.31/99.24    99.50/99.50

Table 4.1: Negative Predictive Values
Data size     1000            2000            4000            8000            15000
circle 0      *95.88/93.45    *97.70/96.17    *98.11/97.50    *98.76/98.40    99.07/99.02
circle 1      95.09/95.47     96.79/97.75*    97.45/99.87*    98.32/99.43*    98.82/99.99*
sphere 0      *92.80/87.27    *94.46/90.81    *95.81/92.46    *96.39/94.92    *97.07/95.95
sphere 1      89.29/100*      90.38/100*      93.10/100*      93.78/100*      94.72/100*
hyp-sphr 0    *94.74/88.95    *95.04/89.51    *95.55/91.43    *96.64/93.39    *97.21/94.70
hyp-sphr 1    73.85/100*      82.06/100*      84.12/100*      no result       89.73/100*
CnCrcl 0      *94.47/91.38    *96.20/94.09    *97.48/96.40    *97.82/97.53    98.35/98.49*
CnCrcl 1      71.35/74.46*    78.89/86.47*    84.72/90.25*    88.34/93.67*    91.95/96.05*
CnCrcl 2      94.93/95.81*    96.95/97.19*    97.42/98.70*    98.32/99.24*    98.88/99.57*
CnSphr 0      *88.76/82.85    *90.84/85.96    *92.51/89.86    *93.95/92.17    *95.28/94.25
CnSphr 1      55.91/58.59*    63.44/65.55*    71.92/77.49*    75.50/84.19*    79.59/88.49*
CnSphr 2      86.89/100*      89.74/100*      91.39/100*      93.49/100*      94.32/100*
CnHySp 0      *90.96/82.72    *91.93/85.81    *93.56/89.24    *94.78/91.09    *95.69/92.90
CnHySp 1      *46.97/28.64    *57.51/49.57    60.89/61.95*    67.34/70.26*    77.80/79.10*
CnHySp 2      72.62/100*      78.06/100*      81.67/100*      83.36/100*      87.60/100*
RCC 0         *96.59/90.34    *97.44/93.75    *97.97/96.05    *98.49/97.60    *98.65/98.43
RCC 1         *94.73/92.78    *96.74/95.61    97.13/97.37     97.96/98.64*    98.71/99.22*
RCC 2         92.19/100*      95.87/100*      97.00/100*      97.53/100*      98.11/100*

Table 4.2: Positive Predictive Values
C4.5's somewhat larger generalisation catches more of these peripheral, inner-class points, and so its PPV for the outer class is good. However, for the innermost class, CH1's tight, minimal generalisation leads to 100% performance on all but the 2D cases, where it is still high. C4.5 does more poorly because its looser generalisation includes some outer class points in its characterisation of the innermost class. For the "RCC" data set, C4.5 does better at all densities tested, although CH1 approaches the same performance at higher densities. CH1 always does better for class 2, and also for class 1 when there are more than 4000 points. Looking at Figure 4.2, it can be seen that class 2 is visually in front of class 1 and that the bound around class 2 will be tight, with some items escaping into class 1 but not vice-versa.
4.4.3 Analysis of Sensitivity Results
These results are presented in Table 4.3 and Figure 4.5. The sensitivity of CH1 for the outer class is significantly better than that of C4.5, except for two cases where they are not significantly different. Since CH1 puts a tight bound around the innermost class, one would expect that almost all items of the outermost class would be placed in the outermost class, and hence the 100% performance on all but the 2D data sets is not unexpected. Similarly, for the innermost class, C4.5's looser bound enables it to capture most of the peripheral class members, while they may escape CH1. As the data density rises, the performance of CH1 approaches that of C4.5. When a middle class is present, CH1 is superior at low dimensionality, but its performance approaches that of C4.5 as the data set size increases. For the "RCC" data set, CH1 performs better on the outer class and C4.5 on the inner one although, as the data set size increases, this difference becomes negligible. For class 1, C4.5 performs better at low data densities and CH1 when there are very large data sets.
Data size     1000            2000            4000            8000            15000
Circle 0      94.70/95.29     96.61/97.68*    97.44/99.89*    98.26/99.42*    98.84/99.99*
Circle 1      *96.18/93.75    *97.82/96.28    *98.09/97.39    *98.79/98.41    *99.06/98.99
sphere 0      93.38/100*      94.03/100*      95.57/100*      96.14/100*      96.64/100*
sphere 1      *88.35/76.53    *91.05/83.54    *93.49/87.29    *94.19/91.36    *95.39/93.33
hyp-sph 0     94.26/100*      95.33/100*      96.00/100*      96.82/100*      97.47/100*
hyp-sph 1     *75.50/41.91    *80.99/55.35    *82.57/63.44    *86.38/71.61    *88.71/77.43
CnCircl 0     94.58/94.53     95.18/96.09*    97.16/98.23*    97.52/98.92*    98.21/99.42*
CnCrcl 1      66.60/75.64*    79.45/84.39*    83.41/90.03*    87.89/93.38*    91.11/95.75*
CnCrcl 2      *96.53/92.96    *97.49/96.25    *98.04/97.40    *98.66/98.33    *99.22/98.97
CnSphr 0      89.05/100*      91.22/100*      93.30/100*      93.42/100*      94.82/100*
CnSphr 1      *55.61/49.05    *62.14/60.25    70.24/74.27*    75.97/80.44*    79.58/85.41*
CnSphr 2      *86.55/76.70    *90.23/81.28    *91.57/85.69    *93.87/90.39    *94.99/93.10
CnHySp 0      90.98/100*      93.27/100*      94.49/100*      95.32/100*      95.90/100*
CnHySp 1      *49.21/19.32    *55.16/33.50    *59.15/47.55    *66.49/57.34    *73.86/67.91
CnHySp 2      *67.10/30.74    *75.50/43.31    *79.85/60.17    *81.87/63.73    *86.24/72.96
RCC 0         93.98/96.55*    96.51/97.95*    97.18/98.98*    98.00/99.46*    98.54/99.68*
RCC 1         *95.44/91.86    *96.50/93.96    *97.75/96.32    *98.38/97.72    99.43/99.65*
RCC 2         *95.18/90.59    *97.64/94.27    *97.60/96.15    *97.89/97.62    98.52/98.50

Table 4.3: Sensitivity
4.4.4 Analysis of Specificity Results
These results are presented in Table 4.4 and Figure 4.6. C4.5 usually has significantly better specificity for the outermost rule, since its larger generalisation strategy correctly identifies most of the items belonging to the inner class, or classes, so it correctly excludes from the outermost class almost all of the innermost class. As the data set size increases, CH1 exhibits similar or better behaviour on the outermost class. For the innermost class, CH1 performs significantly better, often at or close to 100%. It is simple to see that the tight bound ensures that almost all members not of the inner class are excluded from it. When there is a middle class, CH1 always exhibits significantly better performance. The performance for the "RCC" data set is similar, except that, for the partially occluded class 1, C4.5 exhibits better performance at lower dimensionality and CH1 at higher dimensionality.
4.4.5 Analysis of Accuracy Results
These results are presented in Table 4.5 and Figure 4.7. For small data sets, C4.5 tends to have better accuracy, although some differences are not significant. As the size of the data set rises, CH1 always becomes significantly better. The point at which the change takes place appears to depend on the dimensionality of the data, being higher as the dimensionality increases. It is possible that CH1's need for large numbers of data points is a consequence of its less stringent geometric bias: it chooses between a greater number of classifiers because it has more options in terms of position and orientation of each decision surface and, thus, it needs more data to make good choices. If it has sufficient data, then its greater flexibility should enable it to make better classifiers. There is an oddity in the accuracy for CnSphr in that the accuracy drops from the 1000 item data set to the 2000 item set.
Data size     1000            2000            4000            8000            15000
Circle 0      *96.18/93.73    *97.82/96.27    *98.09/97.40    *98.80/98.42    99.05/98.99
Circle 1      94.71/95.27     96.61/97.69*    97.46/99.89*    98.27/99.43*    98.85/99.99*
Sphere 0      *88.27/76.53    *91.03/83.55    *93.49/87.30    *94.18/91.36    *95.39/93.33
Sphere 1      93.35/100*      94.03/100*      95.56/100*      96.13/100*      96.64/100*
Hyp-sph 0     *75.44/42.01    *80.97/55.33    *82.57/63.43    *86.38/71.59    *88.73/77.46
Hyp-sph 1     94.27/100*      95.33/100*      96.00/100*      96.82/100*      97.47/100*
CnCrcl 0      *96.81/94.82    *97.86/96.57    *98.55/97.90    *98.79/98.61    99.05/99.12*
CnCrcl 1      95.78/95.95*    96.52/97.83*    97.67/98.49*    98.19/99.02*    98.80/99.40*
CnCrcl 2      94.88/95.99*    96.96/97.17*    97.38/98.70*    98.25/99.22*    98.86/99.57*
CnSphr 0      *88.85/79.56    *91.69/85.26    *93.12/89.73    *94.61/92.40    *95.69/94.41
CnSphr 1      88.86/91.16*    91.31/92.30*    92.75/94.30*    93.64/96.10*    94.89/97.22*
CnSphr 2      94.35/100*      94.88/100*      96.03/100*      96.89/100*      97.28/100*
CnHySp 0      *79.27/52.30    *79.41/58.45    *83.72/69.77    *86.31/74.44    *89.09/80.71
CnHySp 1      88.00/89.60*    91.19/92.63*    92.52/94.24*    93.59/95.18*    94.61/96.31*
CnHySp 2      96.23/100*      97.42/100*      97.51/100*      97.95/100*      98.44/100*
RCC 0         *97.55/92.45    *98.00/94.87    *98.43/96.84    *98.76/97.98    *98.94/98.75
RCC 1         *97.69/96.91    *98.55/98.09    98.76/98.88     99.11/99.41*    99.43/99.65*
RCC 2         96.94/100*      98.56/100*      98.93/100*      99.20/100*      99.36/100*

Table 4.4: Specificity
Data size     1000            2000            4000            8000            15000
Circle        96.28/96.55     97.46/97.25     97.73/98.20*    98.54/98.86*    98.89/99.29*
Sphere        90.66/90.69     92.84/93.36*    94.57/95.19*    95.31/96.57*    95.97/97.47*
Hyp-sphr      *90.42/89.63    *92.41/90.70    *94.04/93.50    *94.87/94.10    95.81/95.83*
CnCrcl        92.16/92.89*    93.97/94.52*    95.57/96.83*    96.84/98.80*    97.65/98.7*
CnSphr        *88.95/88.32    85.01/86.29*    87.89/90.50*    90.54/92.92*    92.13/94.71*
CnHySp        *81.36/78.12    *84.21/81.92    *87.21/86.24    *89.88/89.49    90.78/91.62*
RCC           *95.06/93.80    *96.75/95.97    97.45/97.62*    98.10/98.42*    98.74/99.01*

Table 4.5: Accuracy

When compared to the other graphs in Figure 4.7, it can be seen that the accuracy for the 1000 item set is abnormally high. This is assumed to be an artefact of the 1000 item data set, for which no explanation is offered. The "RCC" data set shows the same dependency of relative performance on the size of the data set.
4.5 Comparison of Performance Metric Results
The performances of these two systems are quite different when examined closely, most markedly in the areas where CH1 can achieve 100% results even at low data densities. Clearly, if a classification problem can be cast in a form where CH1 delivers 100% performance, it will be a very powerful tool. More generally, C4.5 tends to offer better accuracy at low data densities and CH1 at higher densities, but performance on the other metrics tends to be balanced overall at low densities and to favour CH1 at higher densities. The initial description of concepts as "innermost", "outermost" and "middle" seems unhelpful; if one looks at the data sets as a visual depth problem, with all data sets formed into convex hulls, then the visual ordering of "front", "back" etc. seems more useful.
Another aspect of performance is the output. C4.5 produces trees with 40 to 100 nodes, whereas CH1 produces a much smaller number of decision regions, often the same number as there are classes in the domain. However, there are many domains where the heuristic rule insertion technique becomes trapped in a local minimum and cannot escape. This is caused by subsequent rules being placed deep in the rule list to avoid a rise in the number of resubstitution errors. This placement of the rule leaves the same group of points still unclassified, and the algorithm can make no further progress. As a consequence of this behaviour, the hill-climber which inserted new rules in the apparently best position has been abandoned in favour of simple prepending of new rules. The default rule has thus become the first rule constructed, which is the last rule in the evaluation order of the list. This rule covers the most populous class in the training set.
4.6 Conclusions
The basic idea of concept formation through the construction of geometric entities has been established and demonstrated. The performance of the system, in comparison with C4.5, has been as would be theoretically expected from the different sizes of generalisation produced by each methodology. On the metrics examined, there is often a data set size below which C4.5 performs better and above which CH1 is better. This turn-over point appears to increase with the dimensionality, or number of attributes, of the data set. It is possible that CH1's need for large numbers of data points is a consequence of its less stringent geometric bias: it chooses between a greater number of classifiers because it has more options in terms of position and orientation of each decision surface and, thus, it needs more data to make good choices. CH1 also "hugs" the data points more closely than a hyper-rectangle, leading to more conservative generalisations, so more data is needed to force the hull to cover an appropriate region of instance space.
If it has sufficient data, then its greater flexibility should enable it to make better classifiers. The importance of using metrics which are more informative than accuracy alone about how a classifier performs has been stressed. Of particular interest is the fact that CH1 can obtain 100% performance on some metrics; the metrics on which this result is obtained depend on the geometry of the actual concepts. In the simple cases examined in this chapter, the concept which is visually the topmost exhibits this behaviour for NPV and sensitivity for sphere, hypersphere, spolo and hpolo. For the background concept, CH1 obtains 100% performance on PPV and specificity for sphere, hypersphere, spolo, hpolo and multi. C4.5 never obtains this singular result. If the classification task can be framed appropriately, and the actual concept geometry permits, CH1 may provide a very powerful classification tool. However, the artificial data sets used so far are very simplistic (only one has any disjunctive concepts), and real world data sets will be examined later once the algorithm is fully understood and optimised. The fact that the data sets used have precise boundaries may also mislead as to the performance obtainable on real world data sets, since there may be no precise boundaries expressed in the attribute sets which have been measured. An important advantage of the succinct geometric representation is that it gives access to work on diameters of hulls, proximity, n-dimensional viewing techniques [125] and the possibility of higher level mathematical descriptions of concepts. The discovery of a problem with the dynamic rule-ordering heuristic has resulted in the abandonment of dynamic rule ordering and reversion to the simple prepending of rules to the decision list. This implies that the default class is the most populous one in the domain.
[Figure 4.3: Negative Predictive Value Graphs for CH1 and C4.5; NPV versus data size (,000) for each data set and class]
[Figure 4.4: Positive Predictive Value Graphs for CH1 and C4.5; PPV versus data size (,000) for each data set and class]
[Figure 4.5: Sensitivity Graphs for CH1 and C4.5; sensitivity versus data size (,000) for each data set and class]
[Figure 4.6: Specificity Graphs for CH1 and C4.5; specificity versus data size (,000) for each data set and class]
[Figure 4.7: Accuracy Graphs for CH1 and C4.5; accuracy versus data size (,000) for each data set]
Chapter 5
Inflation of Convex Hulls
5.1 Introduction
The performance characteristics of a classifier using convex hulls can be considered by examining Figure 5.1.
Figure 5.1: Performance Characteristics

The actual concept is shown as the rectangle C. Typically, a convex hull based system will construct an enclosure which is a subset of C, such as oblong B in the figure, simply because the random sampling technique used to construct the training set will omit some extremal data points. Clearly, every item which is classified as B will also be a member of C, so the PPV of this classification will be 100%.
However, as membership of B is a subset of the actual concept C, the sensitivity of this classification will reflect this by being less than 100%. If the sample points are uniformly distributed through the data space, the sensitivity will be the ratio of the areas of B and C. If the points outside B are examined, it is seen that they are not all outside of C, and so the NPV of the classifier for this class will be the ratio of ¬C to ¬B (assuming a uniform distribution as before). Since ¬C is a subset of ¬B, the specificity of the classifier will be 100%. If the classifier is, for the moment, assumed to produce an overly large classifying hull, A, instead of B, the performance characteristics are quite different. C4.5 will, in principle, produce this kind of performance for underlying convex concepts if the available counter-examples do not fall close to the decision boundary. It will, of course, be over-specialising the neighbouring concept, but if the first region is the important one for the classifying task then the effect will be marked. This can be seen immediately by considering that C4.5 will induce a rectangular concept for a 2D concept which is triangular. Now, since A subsumes C, not all members of A will be members of C, and so the PPV will be the ratio of C to A. The sensitivity will also be the ratio of C to A. The NPV will be 100%, since all items of ¬A are also in ¬C. The specificity will be the ratio of ¬A to ¬C, since the classifier will miss items in the volume A − C. Clearly, the classification performance of A could be improved by shrinking it, and that of B by inflating it. This chapter investigates the consequences of inflating convex hulls in various ways on the classification performance of the representation. A simple inflation algorithm is described. Limiting the amount of inflation applied is explored using limits at facet, hull, class and domain level.
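The expected metrics in the two cases above can be summarised compactly. Writing |X| for the area (or volume) of a region X and assuming uniformly distributed points, the argument above gives (this is only a restatement of the reasoning, not a new result):

    Induced hull B, a subset of C:  PPV = 100%,  Sensitivity = |B|/|C|,  NPV = |¬C|/|¬B|,  Specificity = 100%
    Induced hull A, a superset of C:  PPV = Sensitivity = |C|/|A|,  NPV = 100%,  Specificity = |¬A|/|¬C|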
5.2 Inflating a Convex Hull
The edge of a concept has been assumed to be the hyperplane surface of the convex hull but, as we have seen, this is very conservative in its location, since these hyperplanes pass through points occupied by positive instances. There may be some not inconsiderable distance from a facet to the first point which is of another class. Test points in this volume will default to being classified as belonging to a different class from the nearby hull. This may result in the outer concept being over-generalised and, similarly, the nearby inner convex hull being under-generalised. This can be adjusted by moving all facets of the hull outwards, which is analogous to inflating it like a balloon. This will generalise the concept represented by the convex hull and specialise the external concept. The hull inflation can be implemented by allowing points some small distance beyond each hyperplane to be considered to be beneath it. The CH1 implementation will be altered so that a variable, facet_inflation, which stores the amount of the inflation, can be set for each facet or for the hull owning the facet. This can be set from the variable min_dist_beyond, which stores the distance to the nearest point beyond a facet, modified by the amount of inflation to be applied. A simplistic view of inflation would lead one to expect that, as a hull is inflated, the PPV will rise, as it becomes less likely that any given point of the class is omitted due to the hull's greater size. The sensitivity should start to fall at some stage, since points of other classes start to be captured by the enlarged hull and, similarly, the specificity will initially rise and then start falling if the hull is over-inflated. The NPV should rise towards 100% as the hull inflates and the only points not included are all not members of the enclosed class. The maximum amount of the inflation can clearly be limited at several different levels, as shown in Figure 5.2.
The numbers "1" and "2" show the position of an instance of that class, and the original uninflated, per hull limited and per facet limited inflated hulls are shown. The maximum inflation can be limited in the following ways:
- Each facet of every hull can be inflated individually until resubstitution errors arise, that is, until the facet effectively includes the nearest external point. All facets may be inflated by different amounts. Some facets may be inflatable by an infinite amount because no instance of another class exists beyond them. They may become detached from the rest of the convex hull, but this will not affect the correctness of the inclusion test, so no attempt will be made to detect detached facets. This inflation will be called "per facet" inflation.
- All facets of a single hull can be inflated simultaneously until resubstitution errors arise. As soon as any resubstitution error occurs, all inflation stops, and all facets are inflated by the same amount. This will be called "per hull" inflation.
- All facets of all hulls of a single concept can be inflated simultaneously and by the same amount until resubstitution errors arise. This has the problem that, if one hull cannot be inflated, then none can. This will be called "per concept" inflation. It is not expected to be useful, since inflation is so restricted.
- All facets of all hulls of all concepts in the domain are inflated simultaneously until resubstitution errors arise in any hull. This will be called "per domain" inflation. Since it is more limited than per concept inflation, it is not expected to be useful.
The first two modes of inflation will be investigated experimentally. It is expected that accurate expectations of the performance of the latter two can be formed from consideration of the former two.
Figure 5.2: Differing Inflation Strategies

The final full inflation value (the distance beyond the hyperplane) for each facet is the maximum inflation value and produces the greatest generalisation of the internal hull and the greatest specialisation of the outer concept without any resubstitution errors. When there are no differential misclassification costs and raw accuracy is to be maximised, there is clear reason for preferring this to the original hulls, since the original is too specialised and will miss nearby points of the same class in the region for which there was no evidence. An alternative to full inflation may be to inflate the inner hull so that both concepts are approximately equally generalised, relative to the empirical evidence, in terms of the volume of the space which contained no evidence. This will be called semi inflation.
For simplicity, all dimensions will be assumed to be similar, and a simple geometric calculation, based on the simplifying assumption that the hulls are spherical, will yield the fraction of the maximum distance beyond the hyperplane which is equivalent to semi inflation (the two radii, r1 and r2, are determined such that the corresponding volumes, V1 and V2, are in the ratio 2:1). The experimental section will examine the use of full and semi inflation for guidance in their use.
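One reading of the spherical simplification above, stated here only as a sketch since the calculation is not spelled out at this point: if a hull is treated as a d-dimensional sphere, volume scales as the d-th power of the radius, so V1/V2 = (r1/r2)^d = 2 gives r1/r2 = 2^(1/d), and the semi-inflation distance is then the fraction of the full-inflation distance implied by this radius ratio.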
5.3 Algorithm for Per Hull Inflation
After the construction of the set of facets (hyperplanes) owned by a rule, each facet is annotated with the distance to the nearest point beyond it which is of a different class (min_dist_beyond). If there is no such point, min_dist_beyond is set to infinity. This distance represents the maximum possible inflation for that facet. The rule itself is annotated with the smallest of these limits, and that will be the maximum permissible per hull inflation amount. The algorithm expects a pointer to a rule as input and modifies the contents of the facet_inflation field of each facet, for per facet inflation, or of the rule, for per hull inflation. The value of facet_inflation is the distance beyond a facet which is subsequently considered to be beneath the facet. The appropriate value is chosen by inspecting the inflation mode currently in use.
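A minimal C sketch of this annotation and inflation step, reusing the hypothetical Facet and Hull types sketched in Chapter 4 and assuming an extra min_dist_beyond field on each facet (all names are illustrative, not CH1's source):

    #include <float.h>

    /* Signed distance of point x above a facet's hyperplane. */
    static double dist_above(const Facet *fc, const double *x, int dim)
    {
        double d = -fc->offset;
        for (int a = 0; a < dim; a++)
            d += fc->normal[a] * x[a];
        return d;
    }

    /* Annotate every facet with the distance to the nearest point of another
       class beyond it (DBL_MAX standing in for "infinity"), then set the
       inflation amounts.  fraction is 0.0 for nil, the semi-inflation fraction,
       or 1.0 for full inflation; per_facet selects per facet rather than
       per hull limiting (per hull uses the smallest limit for every facet). */
    static void annotate_and_inflate(Hull *h, const double **other_class_pts,
                                     int n_pts, int per_facet, double fraction)
    {
        double hull_limit = DBL_MAX;

        for (int f = 0; f < h->n_facets; f++) {
            double limit = DBL_MAX;
            for (int p = 0; p < n_pts; p++) {
                double d = dist_above(&h->facets[f], other_class_pts[p], h->dim);
                if (d > 0.0 && d < limit)        /* this point lies beyond the facet */
                    limit = d;
            }
            h->facets[f].min_dist_beyond = limit;
            if (limit < hull_limit)
                hull_limit = limit;              /* per hull limit = smallest facet limit */
        }

        for (int f = 0; f < h->n_facets; f++) {
            double limit = per_facet ? h->facets[f].min_dist_beyond : hull_limit;
            h->facets[f].inflation = (limit == DBL_MAX) ? DBL_MAX : fraction * limit;
        }
    }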
5.3.1 Test of Implementation of Per Hull Inflation
The implementation was tested on the training and test sets called "square" and "quad", which are shown in Figures 5.3 and 5.4. In each case, the empty area in the training set is populated by class 1 in the test set, so that the effect of inflation can be demonstrated. The uninflated hull will misclassify all items in the test set which are in the initially empty area; however, after inflation it should correctly classify all these points. The data points are randomly distributed over the whole area in each case.
Figure 5.3: Implementation Test: Square Sets

The "square" data set should result in a tight convex hull around the concept labelled "1" and, when used to classify the test set, it should give erroneous classifications for all test points in the area which was "empty" in the training set. After inflation, the hull corresponding to concept "1" should have been generalised to cover essentially all of the "empty" area and should give very few erroneous classifications. The uninflated hull gives 27.60% errors on the test set while the inflated one gives only 1.32% errors. The confusion matrix is shown in Table 5.1.

              Uninflated                 Inflated
              Actual 0    Actual 1       Actual 0    Actual 1
Classified 0  1462        1104           1462        0
Classified 1  0           1434           53          2485

Table 5.1: Confusion Matrix: square

As would be expected, the number of points in each area is approximately proportional to the area. The small number of errors can be ascribed to the fact that the convex hull will not necessarily be rectangular and will not necessarily cover small areas in the corners. The "quad" data set should not be inflatable in per hull mode, as both of the straight edges should be closely represented in the convex hull and, therefore, be uninflatable.
Figure 5.4: Implementation Test: Quad Sets

As expected, the test performance is the same in both cases, confirming that no inflation took place. The results, as a confusion matrix, are shown in Table 5.2.

              Uninflated                 Inflated
              Actual 0    Actual 1       Actual 0    Actual 1
Classified 0  2435        456            2435        456
Classified 1  0           1089           0           1089

Table 5.2: Confusion Matrix: quad

These experiments give some confidence regarding the implementation of per hull inflation. This data set also shows the attraction of having inflation done on a per facet basis rather than on a per hull basis.
5.4 Evaluation of Per Hull Inflation Strategies
In this section, the classification performance of convex hulls is compared with no inflation, semi inflation and full inflation. The experiments use the per hull inflation mode. Experiments were carried out on a range of data sets from the UCI Repository. The sets were chosen because they have mainly or entirely continuous attributes, which CH1 is designed to handle, and because only numerical attributes can be inflated. Missing values cannot be handled by CH1 and are replaced by the mean for that attribute. The experiments were carried out by shuffling the data randomly and dividing it randomly into 80% training and 20% test items. With each set of data, the classification performance was measured with nil, semi and full per hull inflation. Average values for predictive accuracy over 100 runs are shown in the result tables. A value is marked with the number of each column over which it is significantly better, at p = 0.05, using a matched-pair, two-tailed t-test (column 1 = nil, column 2 = semi, column 3 = full inflation). The results are shown in Table 5.3.
Comparing nil and semi inflation, it can be seen that semi inflation is superior by a ratio of 6:0 which, using a two-tailed sign test, is significant at p = 0.05. Similarly, semi inflation has a win-loss ratio of 8:7 over full inflation, but this is not significant. Full inflation has a win-loss ratio of 10:8 over nil inflation, but again this is not significant. It would appear that semi inflation is the preferred method for per hull inflation, but this conclusion is not very strong since there is little difference between semi and full inflation.
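As a check on the sign-test arithmetic quoted above: under the null hypothesis that wins and losses are equally likely, a 6:0 win-loss record has two-tailed probability p = 2 × (1/2)^6 = 0.03125 < 0.05, so it is significant at the 0.05 level, whereas 8:7 (out of 15 informative comparisons) and 10:8 (out of 18) give two-tailed probabilities well above 0.05, consistent with the statements above.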
Accuracy          Nil Infl.   Semi Infl.   Full Infl.
balance-scale     81.36       81.36        81.36
bcwo              96.93^3     96.93^3      54.48
bupa              58.26       58.26        58.26
cleveland         50.83^3     50.83^3      44.52
echocardiogram    70.35       70.35        70.35
german            50.89^3     50.89^3      50.46
glass             60.08       60.17^1      60.17^1
glass7            47.78^3     48.44^1,3    24.12
heart             64.24^3     64.24^3      44.10
hepatitis         30.02       38.69^1      40.23^1,2
horse-colic       55.58^3     55.58^3      54.77
hungarian         60.28^3     60.28^3      53.00
ionosphere        87.97       87.97        88.12^1,2
iris              60.80       60.86^1      60.86^1
new-thyroid       73.79       73.81^1      73.81^1
page-blocks       48.96       49.59^1      49.95^1,2
pid               78.04^3     78.04^3      67.32
satimage          49.00       49.00        49.00
segment           83.26       83.26        83.27^1,2
shuttle1          60.04       60.04        60.04
sonar             54.63       54.66^1      54.75^1,2
soybean-large     64.05       64.05        64.05
vehicle           35.08       35.08        35.08
waveform          27.52       27.52        27.63^1,2
wine              55.67       55.67        57.66^1,2

Table 5.3: Various Amounts of Per Hull Inflation
Unfortunately, the underlying actuality may be that full inflation is, in effect, just like semi inflation: high points on a hull, resulting in a very small value of min_dist_beyond, may impinge upon a point of a different class, terminating inflation, while in other regions of space large amounts of possible inflation never occur as a result. Comparison of these results with those of an experiment using per facet inflation may yield some insight into this. It would appear that in most (18) of the data sets per hull inflation has little effect, as might be expected from such a constrained method. In the hepatitis set, the inflation produces some improvement in performance, but in 6 domains it results in some decrease in performance, which is unexpected and needs to be understood.
5.4.1 Interaction between Inflation, Decision Lists and Performance Metrics
A simplistic view of inflation of polytopes as a method of altering the performance of the classifier leads to expectations about the behaviour of the performance metrics which are not observed in practice. As a polytope is inflated, one would expect the sensitivity and NPV to rise, as there are fewer false negatives, and the PPV and specificity to decrease, as there are more false positives. However, this ignores the interaction between neighbouring polytopes as inflation takes place, and it also ignores the effects of the fixed ordering of tests implied by the decision list structure.
The Basic Situation
Consider the concepts A and B in the rectangular universe shown in Figure 5.5. Al and Bl respectively show to where the concepts learned from the training set extend (the geometry is exaggerated for clarity). Ai and Bi show to where the inflated concepts extend. In the statements below, A is taken to mean everything to the left of the full centre line and B everything to the right; these represent the actual, underlying concepts.
Figure 5.5: Basic Situation

Similarly, Al and Ai are taken to mean everything to the left of the named line, and Bl and Bi similarly but to the right. If the classifier is used with no inflation, the expected values for the classifier for A, using the formulae in Section 2.14, are

    PPV = Al/(Al + 0) = 100%
    Sensitivity = Al/(Al + (A − Al)) = Al/A ≤ 100%
    NPV = B/(B + (A − Al)) ≤ 100%
    Specificity = B/(B + 0) = 100%

The classifier for B performs exactly symmetrically. Now consider that the constructed concepts have been inflated. The expected values for the classifier for A are

    PPV = Ai/(Ai + 0) = 100%
    Sensitivity = Ai/(Ai + (A − Ai)) = Ai/A ≤ 100%, but greater than in the previous case since Ai > Al
    NPV = B/(B + (A − Ai)) ≤ 100%, but greater than in the previous case since (A − Ai) < (A − Al)
    Specificity = B/(B + 0) = 100%
It is still the case that the expected values of the metrics for concepts A and B are completely symmetric.
Basic Situation with Overlap
Figure 5.6: Overlapping Situation without Misclassification

All annotations in Figure 5.6 have the same meaning as in Figure 5.5. The only difference is that, during inflation, Ai has strayed over into actual concept B. The consequences can be seen in the values of the metrics for A shown below:

    PPV = A/(A + (Ai − A)) = A/Ai ≤ 100%
    Sensitivity = A/(A + 0) = 100%
    NPV = (A + B − Ai)/((A + B − Ai) + 0) = 100%
    Specificity = (A + B − Ai)/((A + B − Ai) + (Ai − A)) = 1 − (Ai − A)/B ≤ 100%

For B, the metrics would be unaltered from the previous case if there were no ordering of rules. However, in a decision list there is a distinct and fixed ordering of the rules, and anything which has fallen within the scope of an earlier rule can never be classified positive by a later one. Consider now rule B with the consequence of the ordering that it can never classify as positive anything previously classified as belonging to A. The metrics for B will be

    PPV = Bi/(Bi + 0) = 100%
    Sensitivity = Bi/(Bi + (Ai − A)) ≤ 100%
    NPV = A/(A + (B − Bi)) ≤ 100%
    Specificity = A/(A + 0) = 100%

It can be seen that the symmetry in the behaviour of the metrics is now gone, as a consequence of the intrusion of one inflated rule into the domain of a different actual concept.
More Complex Situation
In Figure 5.7, Al and Bl are both at an angle but, of course, completely within their respective actual concepts.

Figure 5.7: Overlapping Situation with Misclassification (the regions created by the overlapping boundaries are labelled k, l, m, n, o and p)

However, when inflation is applied, it can be seen that Ai and Bi intrude into the other concept and do not completely cover their own actual concept. Examining the metrics for concept A after inflation:
    PPV = (Ai − o − n)/((Ai − o − n) + (o + n)) = (Ai − o − n)/Ai = 1 − (o + n)/Ai ≤ 100%

The behaviour of PPV is going to depend on how (o + n) and Ai behave on inflation, and conceivably PPV will be able both to increase and to decrease in general.

    Sensitivity = (Ai − o − n)/((Ai − o − n) + m) = 1/(1 + m/(Ai − o − n)) ≤ 100%

This increases or decreases as the concept is inflated, depending on the relative sizes of (o + n) and m.

    NPV = (B − o − n)/((B − o − n) + m) = 1/(1 + m/(B − o − n))

This similarly increases or decreases as the concept is inflated, depending on the relative sizes of (o + n) and m.

    Specificity = (B − o − n)/((B − o − n) + (o + n)) = 1 − (o + n)/B ≤ 100%

As inflation proceeds, (o + n) increases, so the specificity decreases. Examining the metrics for concept B, assuming prior application of the rule for concept A:

    PPV = (B − o − n)/((B − o − n) + m) = 1/(1 + m/(B − o − n)) < 100%

This will increase or decrease during inflation, depending on the relative sizes of m and (o + n).

    Sensitivity = (B − o − n)/((B − o − n) + (o + n)) = 1 − (o + n)/B

As inflation increases, (o + n) will increase and sensitivity will decrease.

    NPV = (A − m)/((A − m) + n) = 1/(1 + n/(A − m))

As m increases, the NPV increases; as n increases, the NPV decreases. Therefore, NPV may increase or decrease during inflation.

    Specificity = (A − m)/((A − m) + m) = 1 − m/A

As inflation increases, m decreases and so the specificity will increase.
5.4.2 Summary of Discussion
PPV: When the inflated polytope is still contained in the actual concept, the PPV will remain 100%, but as overlap occurs it may drop or rise depending on how the induced and actual concepts overlap and on the order of application of the rules. If the concept is over-inflated, the PPV will drop due to the presence of false positives; this effect may be inhibited by earlier evaluated rules which prevent false positives.
Sensitivity: This is less than 100% when the actual concept contains the induced concept, and 100% when the induced concept contains the actual concept. When the inferred and actual concepts partially overlap each other, the sensitivity may rise or fall at any given point during the inflation. It has also been shown that sensitivity need not reach 100% because of the effect of earlier rules.
NPV: In general the NPV will increase as more actual positives are included, but during the inflation process it may rise or fall at any given point. Further, the presence of earlier rules may prevent it from reaching 100% no matter how much the hull is inflated.
Specificity: This is 100% for simple induced concepts which are contained within the actual concept, but falls as concepts are inflated too much. However, it has been shown that specificity can increase or decrease during inflation, depending on the order of evaluation of rules.
Inflation of all rules in a decision list does not necessarily generalise all rules; some may be specialised as a result of the generalisation of other rules. This means that no firm expectations about the classification accuracy can be justified in general, and empirical study is necessary to evaluate inflation strategies.
5.5 Algorithm for Per Facet Inflation
This algorithm expects a pointer to a rule as input and modifies the contents of the facet_inflation field of each facet attached to the rule. Since min_dist_beyond is different for each facet, different facets may be inflated by radically different amounts. It may be that a facet is inflated to the extent that it is no longer in contact with the rest of the hull, but this cannot be detected. Since it will have no bad consequences, other than an unnecessary inclusion test, such detached facets will not be tested for, nor will they be handled specially in any way.
5.6 Evaluation of Per Facet Inflation
The experiments described in Section 5.4 were replicated to test various levels of inflation within per facet inflation limits, for the same range of data sets as before. The results are shown in Table 5.4. As before, a value is marked with the number of each column from which it is significantly different in a two-tailed t-test at the 0.05 level. It can be seen that semi inflation has a win-loss advantage of 14:0 over nil inflation, and this is significant at p = 0.01 for a two-tailed sign test. Full inflation has a win-loss ratio of 17:1 over semi inflation, which is significant at p = 0.01. Full inflation has a win-loss ratio of 23:1 over nil inflation, which is significant at p = 0.01. It is thus clear that semi inflation is superior to nil inflation and that full inflation is superior to both for per facet inflation over this range of data sets. Unlike per hull inflation, there are no cases where inflation has markedly reduced the classification accuracy (only a slight reduction for full inflation on german) and 13 cases where there is a marked increase. The empirical evidence is that per facet inflation is less prone to anomalous decreases in performance than per hull inflation.
Accuracy          Nil Infl.   Semi Infl.   Full Infl.
balance-scale     81.36       83.42^1      83.42^1
bcwo              96.93       96.93        96.93
bupa              58.26       58.28^1      58.28^1
cleveland         50.83       50.83        54.02^1,2
echocardiogram    70.35       70.87^1      70.87^1
german            50.89^3     50.89^3      50.73
glass             60.08       61.47^1      61.47^1
glass7            47.78       50.24^1      51.07^1,2
heart             64.24       64.24        67.70^1,2
hepatitis         30.02       39.26^1      41.39^1,2
horse-colic       55.58       55.58        62.06^1,2
hungarian         60.28       60.38^1      61.69^1,2
ionosphere        87.97       87.97        88.97^1,2
iris              60.80       69.85^1      69.85^1
new-thyroid       73.79       75.72^1      75.72^1
page-blocks       48.96       51.78^1      54.03^1,2
pid               78.04       78.06^1      79.05^1,2
satimage          49.00       49.03^1      57.72^1,2
segment           83.26       83.26        84.29^1,2
shuttle1          60.04       60.04        60.31^1,2
sonar             54.63       54.91^1      55.64^1,2
soybean-large     64.05       64.05        64.29^1,2
vehicle           35.08       35.40^1      39.15^1,2
waveform          27.52       27.52        30.47^1,2
wine              55.67       55.67        68.30^1,2

Table 5.4: Various Amounts of Per Facet Inflation
5.7 Comparison of Inflation Types
The experiment described in Section 5.4 was carried out to compare per facet and per hull inflation on matched data sets. Full inflation was used in the case of per facet inflation and semi inflation for per hull inflation, based on the conclusions of the previous sections. The results are shown in Table 5.5. Examining the results, per facet inflation has a win-loss ratio of 23:1 over per hull inflation, and this is significant at p = 0.01. Clearly, full per facet inflation is the strongest inflation technique over the data sets used in the experiments and, consequently, all pure inflation experiments in the rest of this thesis will use this method. There is no clear reason why the performance on the german data set differs from the rest.
5.8 Conclusions
Inflation was introduced as a method of modifying the over-specialisation of the convex hulls and has been shown to be a viable operator for modifying classifier performance. Experiments were carried out to explore the consequences of various amounts of inflation in various modes. Per hull, per concept and per domain modes are all limited by any nearby point in any direction, since the first point encountered in any direction stops the inflation in all directions. The per facet inflation mode has been shown not to suffer from this over-limitation, since nearby points in one direction do not inhibit inflation in other directions, as happens with per hull inflation. Therefore, per facet inflation will generally produce greater concept generalisation than the other inflation modes.
Accuracy          Per Hull   Per Facet
balance-scale     81.36      83.42^1
bcwo              96.93      96.93
bupa              58.26      58.28^1
cleveland         50.83      54.02^1
echocardiogram    70.35      70.87^1
german            50.89^2    50.73
glass             60.17      61.47^1
glass7            48.44      51.07^1
heart             64.24      67.70^1
hepatitis         38.69      41.39^1
horse-colic       55.58      62.06^1
hungarian         60.28      61.69^1
ionosphere        87.97      88.97^1
iris              60.86      69.85^1
new-thyroid       73.81      75.72^1
page-blocks       49.59      54.03^1
pid               78.04      79.05^1
satimage          49.00      57.72^1
segment           83.26      84.29^1
shuttle1          60.04      60.31^1
sonar             54.66      55.64^1
soybean-large     64.05      64.29^1
vehicle           35.08      39.15^1
waveform          27.52      30.47^1
wine              55.67      68.30^1

Table 5.5: Per Hull and Per Facet Comparison
to a range of learning tasks and the consequences examined. For per hull inflation, classifier performance was insensitive to the amount of inflation applied, with any amount leading to better classification performance than none over the range of tasks. For per facet inflation, any amount of inflation was statistically significantly superior to no inflation, and full inflation was similarly superior to semi inflation over the range of tasks. A comparison of per hull and per facet inflation showed that per facet inflation was clearly superior and it will be used in future experiments. The use of a decision list to hold the convex hulls was shown, both theoretically and experimentally, to interact strongly with inflation, and the behaviour of the chosen metrics was generally unpredictable. Empirical results substantiated theoretical expectations of the unpredictability of the consequences of inflation, particularly in per hull mode. The only reliable aspect was that inflating a single concept would cause its predictive accuracy to rise.
Chapter 6
Facet Deletion
6.1 Introduction
Generalisation, by inflation, of individual hulls in the list constructed by CH1 has been shown to improve the performance of the classifier. Another possible method to reduce the specialisation of the convex hulls is to remove facets which might not contribute to the classification performance of the hulls. Such facets will not have any training set points beyond them. These facets are deemed to be non-essential facets. These deletions will result in a, possibly very large, increase in the volume of the hull and a corresponding decrease in the number of facets retained. A simple method for facet deletion is described and tested. This will be used to examine how deleting facets affects the predictive accuracy of a classifier which uses convex hulls. It will be shown to dramatically reduce the number of facets which need to be stored in the decision list. The edges represented by the remaining facets after the deletion operation will still, individually, be too specialised and so inflation will be applied to these facets. Experimental tests will evaluate the performance of hulls which have had facet deletion and subsequent inflation. After non-essential facet deletion, some facets will only exclude points
already excluded by other facets. It seems reasonable to remove these facets also, since there is less empirical evidence for their presence as they exclude fewer negative points than other facets. It is also the case that these surfaces were constructed to delimit the convex hull and the intent now is to generalise the hull, so they are no longer needed. Two algorithms for this purpose are formed and experimentally evaluated.
6.2 Deletion of Non-essential Facets
Any facet belonging to a hull which does not have any instance in the training set beyond it is clearly not contributing to the resubstitution accuracy and thus, it is hoped, not to the classification performance, since the test for points being beyond it always fails. Thus the deletion of such a facet can never contribute to a fall in resubstitution accuracy and, by increasing the volume of the polytope, generalises the represented concept. Such generalisations may extend to infinity in directions where there are no negative cases; see Figure 6.1. Clearly, deletion of a facet is equivalent to inflating it to infinity and produces the same effect, except that the test for inclusion no longer exists after deletion. Two regions of different classes which are side by side may have all facets removed except for a single plane separating the two classes, considerably simplifying the domain model. The facets which are removed are artefacts of the hull formation at extreme points and their removal allows the class to extend to infinity. This removal of non-essential facets will be called non-essential deletion. It is proposed that all such facets be deleted and the resultant classification performance examined. The simplest method for such deletion is to physically delete facets which have a min dist beyond of infinity after the pass which annotates facets with min dist beyond distances. This requires a single pass through the list of facets for each hull.
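A minimal sketch of this deletion pass is given below, assuming that an earlier pass has already annotated each facet with a min_dist_beyond value; the field and object layout are illustrative rather than the actual CH1 data structures.

import math

def delete_non_essential_facets(hull):
    # A facet whose min_dist_beyond is infinite has no training instance
    # beyond it, so removing it cannot lower resubstitution accuracy and
    # simply generalises the hull in that direction.
    hull.facets = [f for f in hull.facets if not math.isinf(f.min_dist_beyond)]
    return hull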
Figure 6.1: Example of Non-Essential Deletion (class 1 and class 2 points shown before and after deletion)
Since the last rule is the default rule, all of its facets may be deleted without any tests, since it is used to classify everything which no other hull will classify. If categorical attributes are to be deleted, then the set of values covered by the rule is completed so that all possible values are covered. This is analogous to moving a facet to infinity, where it covers all values and excludes none and is thus, effectively, deleted.
6.2.1 Basic Characteristics of Non-Essential Deletion
To investigate the characteristics of deleting non-essential facets, the standard experiment was run 100 times using CH1 with facet deletion but no inflation, CH1 with full inflation but no facet deletion, and CH1 with neither, on a range of data sets. This allows a comparison of the effectiveness of both techniques. The results are shown in Table 6.1. When there is a statistically significant difference, at p = 0.05, in a two-tailed t-test with matched pairs, the superior result is superscripted by the number of the column to which it is superior. As before, Facet Inflation is superior to "Neither" with a win-loss ratio of 23:1 which is significant at p = 0.01. However, Facet Deletion is also superior to "Neither", with a win-loss ratio of 22:0 which is significant at p = 0.01. This confirms that facet deletion is a useful generalisation tool. Comparing Facet Deletion and Inflation, it is found that inflation is superior with a win-loss ratio of 20:3 which is significant at p = 0.01. Since facet deletion is a greater generalisation than inflation this is, at first sight, a surprising result. However, all deletion takes place away from the area between concepts, where it is less likely to affect unseen points, which will tend to occur between concepts. Also, the inflation may extend beyond the part of the instance space occupied by instances, so that inflation effectively combines the effects of deletion and inflation. One would expect that inflating the remaining facets would improve the performance of facet deletion because of the supposed location of the instances causing the errors.
6.2.2 Adding Inflation to Non-Essential Deletion
The edges represented by the remaining facets after the deletion operation may still, individually, be too specialised and so full, per facet inflation will be applied to these facets. The previous experiment was repeated comparing facet deletion with subsequent inflation against inflation only. The results are shown in Table 6.2 with the usual annotations. Inspecting Table 6.2, facet deletion with inflation can be seen to be superior to inflation with a win-loss ratio of 19:6 which is significant at p = 0.05. It can be concluded that inflation has markedly improved the few facets left after non-essential deletion, so that the performance of facet deletion and inflation is much superior to either
Accuracy         Neither    Inflation   Facet Deletion
balance-scale    81.36      83.42^13    83.42^1
bcwo             96.93      96.93       96.93
bupa             58.26      58.28^1     58.32^12
cleveland        50.83      54.02^13    53.29^1
echocardiogram   70.35      70.87^13    70.35
german           50.89^2    50.73       50.89^12
glass            60.08      61.47^13    60.73^1
glass7           47.78      51.07^13    48.09^1
heart            64.24      67.70^13    66.80^1
hepatitis        30.02      41.39^13    30.03^1
horse-colic      55.58      62.06^13    57.99^1
hungarian        60.28      61.69^1     61.73^12
ionosphere       87.97      88.97^13    87.97
iris             60.80      69.85^13    65.30^1
new-thyroid      73.79      75.72^13    75.06^1
page-blocks      48.96      54.03^13    49.09^1
pid              78.04      79.05^13    78.89^1
satimage         49.00      57.72^13    54.40^1
segment          83.26      84.29^13    83.79^1
shuttle1         60.04      60.31^13    60.19^1
sonar            54.63      55.64^13    55.03^1
soybean-large    64.05      64.29^1     64.29^1
vehicle          35.08      39.15^13    36.81^1
waveform         27.52      30.47^13    29.04^1
wine             55.67      68.30^13    64.99^1
Table 6.1: Comparison of Non-Essential Deletion and Inflation
Accuracy         Inflation   Deletion with Inflation
balance-scale    83.42^2     83.25
bcwo             96.93       98.95^1
bupa             58.28       58.83^1
cleveland        54.02       55.09^1
echocardiogram   70.87^2     70.29
german           50.73       51.12^1
glass            61.47       63.31^1
glass7           51.07       52.13^1
heart            67.70^2     67.69
hepatitis        41.39^2     30.07
horse-colic      62.06       62.89^1
hungarian        61.69       62.88^1
ionosphere       88.97       89.84^1
iris             69.85^2     65.32
new-thyroid      75.72^2     75.12
page-blocks      54.03       54.12^1
pid              79.05       80.14^1
satimage         57.72       72.49^1
segment          84.29       91.68^1
shuttle1         60.31       64.95^1
sonar            55.64       56.45^1
soybean-large    64.29       81.01^1
vehicle          39.15       44.34^1
waveform         30.47       34.74^1
wine             68.30       83.73^1
Table 6.2: Comparison of Non-Essential Deletion with Inflation against Inflation
inflation or deletion alone. Inspecting the numbers of facets before and after deletion, it can be seen that all concepts have between 65 and 105 facets before non-essential deletion and five or fewer after deletion. Indeed, most concepts have 70 to 90 facets before deletion and 1 or 2 after it, while the default hull always has none, by definition, since it excludes nothing. Thus facet deletion, with subsequent inflation, can be seen to offer the possibility of being an effective tool for generalising hulls and simplifying classifiers.
6.2.3 Time Complexity of Non-Essential Deletion
This operation requires a single pass through all the facets (on average f, say) of each of the h hulls, deleting those facets whose min dist beyond is infinity. This involves processing O(hf) facets.
6.3 Retention of Redundant Facets
After the initial removal of non-essential facets, it is still the case that many facets will exclude only points already excluded by other facets and, consequently, these facets can be deleted without altering resubstitution accuracy. This will result in further generalisation of the hulls. See Figure 6.2 for a simple example in which facets a and c are unnecessary in separating classes 1 and 2. There is more than one approach to deciding which facets to retain and these are discussed next.
6.3.1 Unordered Retention
The simplest technique is to process the facets in order of occurrence, which is essentially random, being dependent on the order in which the training points were processed.
Figure 6.2: Minimal Facet Deletion (the lower part of a hull formed around class 1, with facets a, b and c; no class-2 instances lie beneath facet b)
The list of remaining facets is traversed and any facet which does not exclude at least one point not excluded by a previous facet is deleted. If the resulting resubstitution accuracy has fallen, the facet is restored; otherwise the deletion is made permanent. The algorithm, called unordered retention, is
UNORDERED RETENTION
    SET global excluded point list to EMPTY
    SET f to first facet in list
    WHILE (f not at end of list)
        IF f excludes points not in global excluded point list
            COPY points excluded by f to global excluded point list
        ELSE
            DELETE f from facet list
        ENDIF
        MOVE f to next facet
    ENDWHILE
END.
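The same pass can be expressed compactly as follows, assuming a helper excluded_points(facet) that returns the set of (indices of) negative training points lying beyond the facet; this helper and the overall framing are assumptions for illustration, not the thesis implementation.

def unordered_retention(facets, excluded_points):
    # Scan facets in list order, keeping a facet only if it excludes at
    # least one negative point not excluded by an earlier retained facet.
    covered = set()
    retained = []
    for f in facets:
        new_points = excluded_points(f) - covered
        if new_points:
            covered |= new_points
            retained.append(f)
        # otherwise the facet is dropped from the list
    return retained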
6.3.2 Evaluation of Unordered Retention with Inflation
The standard experiment of 100 repetitions on matched data sets was carried out for unordered retention with varying amounts of inflation and non-essential deletion. The results are shown in Table 6.3. Comparing nil and semi inflation, it can be seen that semi has a win-loss ratio of 15:4 which is significant at p = 0.05. Full inflation has a win-loss ratio of 18:5 over nil inflation which is also significant at p = 0.05. Full has a win-loss ratio of 13:5 against semi which is only significant at p = 0.1. It can be concluded that any inflation is superior to none after unordered retention and that full inflation is likely to be best.
6.3.3 Ordered Retention
The simple approach of the previous section might not produce minimal results in cases where either a single facet, or a pair of facets at a small angle to that single facet, are equivalent in their effect. Consider Figure 6.2, where a hull has been formed around class one and the lower part is shown. There are no instances of class 2 beneath edge b. If the single facet b is reached first in the list, it will be permanently deleted since the pair (a, c) classifies everything correctly. Thereafter, neither a nor c can be deleted since the resubstitution accuracy would fall. This outcome is not minimal for this simple situation. If one of the two other facets, a or c, is found first, it will be deleted permanently and consequently the other will also be deleted, since both members of the pair are needed to produce the same effect as b alone. In general, the outcome is likely to be minimal but cannot be guaranteed to be so.
Unordered Retention
Accuracy         Nil Infl.    Semi Infl.   Full Infl.
balance-scale    87.61        87.61        87.61
bcwo             98.95        98.95        99.06^12
bupa             57.88^23     57.84        57.84
cleveland        57.74        58.67^1      59.51^12
echocardiogram   69.82^23     69.65        69.65
german           51.12^3      51.12^3      50.71
glass            52.31        52.83^1      52.83^1
glass7           31.12        32.07^1      32.75^12
heart            68.28        68.83^1      69.32^12
hepatitis        29.80        38.84^1      39.80^12
horse-colic      58.51^23     58.24^3      58.18
hungarian        71.57        72.77^13     72.47^1
ionosphere       76.24        76.24        76.76^12
iris             79.60        81.94^1      81.94^1
new-thyroid      86.11        86.73^1      86.73^1
page-blocks      39.90        41.24^1      41.61^12
pid              79.87        79.87^1      80.12^12
satimage         37.87        38.26^1      38.44^12
segment          18.13        18.13        18.13^12
shuttle1         64.79        72.01^13     71.42^1
sonar            56.39^23     56.18^3      56.09
soybean-large    29.92        29.92        29.92
vehicle          44.29        47.70^1      48.23^12
waveform         32.63        32.88^1      33.68^12
wine             77.43        77.46^1      77.53^12
Table 6.3: Accuracy using Unordered Retention
Another justification is that, at each stage, the facet which excludes the most negative points is the one for which there is most empirical evidence of relevance and, hence, it is the best one to retain. The desired type of outcome can be attained by scanning the facet list for the facet which excludes the most new points and transferring it to the retained facet list. Thereafter the original facet list is repeatedly scanned to ascertain which remaining facet excludes the most points not already excluded, and that facet is transferred to the retained facet list. At any stage, when a facet is discovered which excludes no new points, it is deleted. This process is repeated until the original list is empty. This simple, hill-climbing algorithm, called ordered retention of facets, is
ORDERED RETENTION
    SET retained facet list to NULL
    SET global excluded point list to EMPTY
    WHILE facet list is not EMPTY
        SET f to first facet in list
        WHILE f not = NULL
            SET f->number excluded points to ZERO
            SET f->excluded point list to NULL
            FOR all points
                IF point beyond f->facet AND point not in global excluded point list
                    ADD point to f->excluded point list
                    ADD 1 to f->number excluded points
                ENDIF
            ENDFOR
            IF f->number excluded points = 0
                SET tmp = f
                MOVE f to next facet
                DELETE tmp->facet
            ELSE
                MOVE f to next facet
            ENDIF
        ENDWHILE
        FIND facet excluding most points
        COPY facet to retained facet list
        APPEND facet->excluded point list to global excluded point list
    ENDWHILE
END.
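Expressed in the same illustrative style as the earlier sketch, the greedy selection can be written as repeatedly choosing the facet that excludes the most not-yet-covered points; again, excluded_points is an assumed helper (and facets are assumed to be hashable objects), so this is a sketch of the idea rather than the thesis code.

def ordered_retention(facets, excluded_points):
    # Greedily retain, at each step, the facet excluding the most negative
    # points not already covered; facets that add nothing new are discarded.
    remaining = {f: set(excluded_points(f)) for f in facets}
    covered = set()
    retained = []
    while remaining:
        # drop facets that no longer exclude anything new
        remaining = {f: pts - covered for f, pts in remaining.items() if pts - covered}
        if not remaining:
            break
        best = max(remaining, key=lambda f: len(remaining[f]))
        covered |= remaining.pop(best)
        retained.append(best)
    return retained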
The basic problem of the over-specialisation of a facet defining a highly specialised edge, by being a plane occupied by positive points, will, however, still be true for both algorithms. Therefore, full, per facet inflation will be applied to the remaining facets after deletion and this should produce some further improvement in overall system accuracy.
6.3.4 Time Complexity of Ordered Retention Strategy
For ordered retention, the algorithm processes the facet list of each hull, containing f facets, at most f times, which occurs only if no facets are ever deleted for failing to exclude new points. Each facet is annotated with the outcome of processing at most all p points. Subsequently, there is a linear pass through the facet list to find and transfer the best facet to the retained list. Thus, since there are few hulls, the time complexity is O(f^2 p) + O(f) = O(f^2 p).
6.3.5 Evaluation of Ordered Retention with Inflation
The standard experiment of 100 repetitions on matched data sets was carried out for ordered retention with varying amounts of inflation and non-essential deletion. The results are shown in Table 6.4. Comparing nil and semi inflation, it is found that semi has a win-loss ratio of 13:1 which is significant at p = 0.01. Full inflation has a win-loss ratio of 19:5 over nil inflation which is significant at p = 0.01. Full has a win-loss ratio of 12:6 over semi which is significant only at p = 0.25. Thus some inflation should always be used after ordered retention and it is probably best to use full inflation.
6.3.6 Comparison of Retention Strategies
The standard experiment of 100 repetitions on matched data sets was carried out for CH1 with ordered and unordered retention, with full inflation and non-essential deletion. The results are shown in Table 6.5. Comparing unordered and ordered retention, ordered retention is superior with a win-loss ratio of 15:8, which is significant at p = 0.25, over the range of data sets. Thus the ordered retention strategy is to be preferred, but not strongly so. Comparing these results with those of Table 6.2 (possible because all tables of results are from the one group of experiments unless noted otherwise), it can be seen that non-essential deletion, ordered retention and inflation has a win-loss ratio of 19:5 relative to non-essential deletion and inflation only. This is significant at p = 0.01 using a sign test. Comparing non-essential deletion, unordered retention and inflation with non-essential deletion and inflation only, the win-loss ratio is 15:10 in favour of the latter. This is not significant at p = 0.25 using a sign test. These results reinforce the decision to use ordered retention in future.
6.4 Conclusions
It has been demonstrated that facet deletion generalises concepts defined by convex hulls. The simple strategy of deleting facets which exclude no
Ordered Retention
Accuracy         Nil Infl.    Semi Infl.   Full Infl.
balance-scale    83.25        83.26^1      83.26^1
bcwo             98.95        98.95        99.06^12
bupa             58.83^23     58.77        58.77
cleveland        55.09        55.09        55.54^12
echocardiogram   70.29        70.81^1      70.81^1
german           51.12^3      51.12^3      50.71
glass            63.31        64.14^1      64.14^1
glass7           52.13        54.21^1      55.21^12
heart            67.69        67.69        67.71^12
hepatitis        30.07        40.51^13     40.51^1
horse-colic      62.89^3      62.89^3      62.01
hungarian        62.88^3      62.97^13     62.48
ionosphere       89.84        89.84        90.83^12
iris             65.32        69.85^1      69.85^1
new-thyroid      75.12        75.78^1      75.78^1
page-blocks      54.12        56.90^1      57.47^12
pid              80.14        80.16^1      80.30^12
satimage         72.49^3      72.51^13     72.36
segment          91.68        91.68        91.94^12
shuttle1         64.95        64.95        66.32^12
sonar            56.45        56.74^13     56.64^1
soybean-large    81.01        81.01        81.01
vehicle          44.34        44.67^1      45.03^12
waveform         34.74        34.74        35.80^12
wine             83.73        83.73        84.04^12
Table 6.4: Accuracy using Ordered Retention
Accuracy         Ordered Retention   Unordered Retention
balance-scale    83.26               87.61^1
bcwo             99.06               99.06
bupa             58.77^2             57.84
cleveland        55.54               59.51^1
echocardiogram   70.81^2             69.65
german           50.71               50.71
glass            64.14^2             52.83
glass7           55.21^2             32.75
heart            67.71               69.32^1
hepatitis        40.51^2             39.80
horse-colic      62.01^2             58.18
hungarian        62.48               72.47^1
ionosphere       90.83^2             76.76
iris             69.85               81.94^1
new-thyroid      75.78               86.73^1
page-blocks      57.47^2             41.61
pid              80.30^2             80.12
satimage         72.36^2             38.44
segment          91.94^2             18.13
shuttle1         66.32               71.42^1
sonar            56.64^2             56.09
soybean-large    81.01^2             29.92
vehicle          45.03               48.23^1
waveform         35.80^2             33.68
wine             84.04^2             77.53
Table 6.5: Comparison of Retention Strategies
points of other classes has been shown to be somewhat ineffective, because the deleted facets are not located between classes but on the periphery of the data points, with nothing beyond them. Nonetheless, their deletion simplifies hulls by dramatically reducing the number of facets necessary to define a concept. The remaining facets have also been shown to be over-specialised and subsequent inflation markedly improves their performance. It was noted that the remaining facets contained some redundancy in terms of the data points which they exclude. Two algorithms for removing this redundancy were examined. One followed an unordered strategy for facet retention and the other an ordered strategy. Experimental results showed that the performance of CH1 was not significantly different for the two deletion strategies but that ordered retention should be preferred. As before, after facet deletion, the remaining facets are still overly specialised in their location and full, per facet inflation should be applied to them. Comparison of simple inflation and deletion with subsequent inflation shows that the latter is statistically significantly superior. As a result of this, and other experiments, it was concluded that CH1 should always be used with non-essential deletion, ordered retention and full, per facet inflation in all future experiments.
Chapter 7
Evaluation of CH1
7.1 Introduction
The previous chapters describe the development of a form of CH1, incorporating inflation and two forms of facet deletion, which is optimised over a set of well-known domains. This learning system has been designed not to have any strong bias in terms of the position or orientation of decision surfaces, and it is assumed that it has not been accidentally so biased during the optimisation. The early version of the system demonstrated strong performance on some artificial data sets which were constructed to be not inappropriately biased for the system. However, it is necessary to broaden the evaluation to include "real-world" data sets and tasks that others have constructed. It will initially be evaluated on domains for which there is reasonable expectation that the decision surfaces are either curved or are straight but not axis orthogonal. It is difficult to identify, for arbitrary data sets, whether they have these properties since geometric analysis of the shapes of surfaces in spaces of high dimension is difficult. Such analysis is rare and tends to identify simple features like flat surfaces, which favour SAP systems, and not curved features, where CH1 is expected to be superior. For instance, there is a known flat surface between classes in the iris data set [39]. The data sets to be used are known, for different reasons, to provide the decision surface characteristics which are required for the comparisons. The performance of the complete system will be compared with that of C4.5, CN2 and OC1 on a data set concerned with body fat [117] and the POL [80] (parallel oblique lines) data set. Subsequently, CH1 will be compared to the same systems on a range of data sets from the UCI Repository.
7.2 Evaluation on Selected Domains
7.2.1 Body Fat
The Ph.D. thesis of Yip [152] which, inter alia, explores constructive induction provides a data set where the decision surfaces, when expressed in the natural attributes, are known to be strongly curved. This data set has attributes of height and weight of persons and they are classified according to their Body Mass Index (BMI) [117], which is the weight (kg) divided by the square of the height (metres) (see also A.1.3 for further description). A BMI less than 20 is categorised as underweight, 20 to 25 as normal, 25 to 30 as fat and over 30 as obese. In this experiment, the height and the weight will be used as the attributes and the decision surfaces are known, from a consideration of the fact that BMI is proportional to 1/h^2, to be distinctly curved. Thus, in this domain, it is expected that CH1 will not be poorly biased whereas C4.5, CN2 and OC1, with their preference for long straight lines, will be poorly biased. The data set was derived from frequency tables for height and weight in [117] and, hence, can be viewed as a realistic simulation of a real-world classification problem. This experiment was carried out with a range of data set sizes with the usual shuffling and partitioning. Each experiment was carried out 20 times, as this was sufficient to obtain statistically significant results.
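As an illustration of the kind of data involved, the sketch below generates a synthetic (height, weight) sample and labels it with the BMI categories given above. The sampling ranges are arbitrary choices for illustration only and do not reproduce the frequency tables of [117].

import random

def bmi_class(height_m, weight_kg):
    # BMI = weight (kg) / height (m) squared, with the thresholds given above.
    bmi = weight_kg / height_m ** 2
    if bmi < 20:
        return "underweight"
    if bmi < 25:
        return "normal"
    if bmi < 30:
        return "fat"
    return "obese"

def make_body_fat_data(n, seed=0):
    rng = random.Random(seed)
    return [(h, w, bmi_class(h, w))
            for h, w in ((rng.uniform(1.5, 2.0), rng.uniform(45.0, 120.0))
                         for _ in range(n))]

The class boundaries w = 20h^2, 25h^2 and 30h^2 are curves in the (height, weight) plane, which is exactly the property the experiment is designed to exploit.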
Size   CH1        CN2    C4.5   OC1
100    95.6       93.7   93.3   95.1
200    97.7^23    95.4   92.0   97.1
300    99.5^234   97.0   96.1   97.9
400    98.8^23    96.9   96.1   97.2
500    99.2^24    97.1   94.7   98.2
600    99.0^23    97.4   97.1   98.9
700    99.3^23    97.6   92.3   99.2
800    99.2^23    98.6   98.0   98.6
900    99.0^23    98.4   97.4   98.9
1000   99.3^23    98.4   97.8   99.2
1100   99.7^23    99.1   98.7   99.5
1200   99.8^3     99.3   98.9   99.5
1500   99.6^23    98.7   98.7   99.7
2000   99.8^3     99.5   99.3   99.9^1
2500   99.8^23    99.4   99.3   99.9
Table 7.1: Evaluation on Body Fat Data Set
The results are shown in Table 7.1 and the corresponding graphs in Figure 7.1. The superscript shows to which columns the average for CH1 is statistically significantly superior at p = 0.05 using a two-tailed matched-pairs t-test. The only result which is superior to CH1 is also shown. No comparisons between C4.5, CN2 and OC1 are shown. The win-loss ratio for CH1 against CN2 is 20:0 and this is significant at p = 0.01 using a sign test. Looking only at where the averages are significantly different, it is found that CH1 is superior 12 times and CN2 never is. Similarly, the win-loss ratio for CH1 versus C4.5 is also 20:0 and is significant at the same level.
Figure 7.1: Learning Curves for Body Fat (accuracy against data set size for CH1, CN2, C4.5 and OC1)
Looking only at where the averages are significantly different, it is found that CH1 is superior 13 times and C4.5 never is. Comparing CH1 with OC1, the win-loss ratio is 12:3 which is significant at p = 0.05 using a sign test. Although CH1 only has a superior average on the t-test twice, to once for OC1, the pattern of superiority is clear. On all data sets of size less than 1500, CH1 provides a better model of the data. With larger amounts of data, the performance of OC1 is slightly superior. A possible explanation is that, with low data densities, OC1 generates unsuitably large decision surfaces which do not match the underlying concepts closely but which are not invalidated by the training set. At high data densities, OC1 is constrained from constructing overly large decision surfaces and its performance becomes very close to that of CH1, since its biases happen to suit the class distributions in this domain marginally better than those of CH1.
            CH1       CN2    C4.5   OC1
Mean        97.3^23   95.1   95.2   99.2^1
Std. Dev.   1.4       1.1    1.1    0.4
Table 7.2: Evaluation on POL Data Set
7.2.2 POL
The POL data set (description in A.1.19), created by Murthy et al. [80], consists of a 2-D rectangular universe with 4 parallel oblique lines, approximately equally spaced, dividing it into 5 regions with 2 classes. Since the decision surfaces are known to be at 45 degrees, the strong SAP bias of C4.5 and CN2 should reduce their performance, but the performance of CH1 should be superior since it can provide decision surfaces of the correct orientation. It will not necessarily produce single large decision surfaces but may induce several almost coplanar decision surfaces, which will provide performance slightly worse than a single flat surface. Of course, OC1 is perfectly biased for this data set since it provides surfaces which are both long and of the appropriate orientation. OC1 uses all of the neighbouring points to orient a single large surface where CH1 has to place and orient, possibly, several surfaces from the same amount of information. Therefore, OC1 should provide the best performance of the methods being compared. The usual experiment, shuffling and selecting 80% of the data for a training set with the remainder being reserved for testing, was done for various data set sizes. The set for 500 points is typical and is shown in Table 7.2. Twenty repetitions were sufficient to obtain statistically significant results. The mean accuracy for CH1 is superior to that of both CN2 and C4.5 at p = 0.01 using a matched-pairs t-test. However, the mean accuracy of OC1 is similarly superior to that of CH1 at p = 0.01. Comparing CH1 and CN2 using a sign test, the win-loss ratio is 18:2 in favour of CH1 which is significant at p = 0.01. Comparing CH1 and C4.5 using a sign test, the win-loss ratio is 17:3 in favour of CH1 which is significant at p = 0.01. Lastly, comparing CH1 with OC1, the win-loss ratio is 1:19 in favour of OC1 and is significant at p = 0.01. These results clearly show that CH1 outperforms the SAP systems on a distinctly SNAP domain. However, the bias of OC1 for straight SNAP decision surfaces exactly suits this domain and OC1 is far superior to CH1, CN2 and C4.5. Nonetheless, these results are very encouraging and accord exactly with our expectations.
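The description above is enough to construct a data set of the same character: points in a rectangle split into five bands by four parallel 45-degree lines, with the class alternating between adjacent bands. The particular line positions and class assignment below are illustrative assumptions, not the exact generator of Murthy et al. [80].

import random

def make_pol_data(n, seed=0):
    # Points in the unit square; four parallel oblique lines x + y = c divide
    # it into five bands and the two classes alternate between bands.
    rng = random.Random(seed)
    cuts = [0.4, 0.8, 1.2, 1.6]
    data = []
    for _ in range(n):
        x, y = rng.random(), rng.random()
        band = sum(x + y > c for c in cuts)   # 0..4
        data.append((x, y, band % 2))
    return data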
7.2.3 Summary of Evaluation
From the evidence of these two experiments, and the evaluation of the prototype in Chapter 4, it can be concluded that CH1 is likely to provide significantly superior performance to axis-orthogonally biased classifier systems on strongly SNAP or curved concepts. In particular, on curved concepts, it provides superior performance to SAP systems and better performance than OC1 at low and medium data densities. At high data densities, the performance of OC1 may overtake that of CH1.
7.3 Evaluation on a Variety of Domains
Having concluded that CH1 is likely to provide good performance on learning domains where the decision surfaces are not straight and axis parallel, it must now be evaluated on a range of domains from the UCI Repository to investigate how it performs on well-known data sets. Since CH1 is principally trying to use convex hulls, domains with few or no continuous values will not be used. Domains which are wholly continuous are of the most interest but, since categorical attributes can be handled, domains with a small number of categorical attributes and many continuous ones can be used. The standard experiment was run 100 times on each domain using CH1, C4.5, CN2 and OC1, with the results shown in Table 7.3. Unfortunately, the need for OC1 was not foreseen at the time of the original experiments and so the OC1 results are not from identical, although they are similar, training and evaluation splits. They were generated using the same random process but different random values will have resulted in different splits. Thus comparisons between results for CH1 and OC1 are tested for significance using a z test for populations of known size and variance. Comparisons between OC1 and the other systems are not made. Because of prohibitive run-times, only subsets of the available data were used for some domains. Comparing CH1 with C4.5, a win-loss ratio of 3:22 is significant at p = 0.01. Similarly, comparing CH1 and CN2, a win-loss ratio of 4:21 is also significant at p = 0.01. There is no significant difference between C4.5 and CN2, with a win-loss ratio of 14:10. Comparing CH1 with OC1, a win-loss ratio of 7:18 is significant at p = 0.05. The z test shows that all differences in mean accuracy between CH1 and OC1 are significant at p = 0.05. Worse is the fact that when CH1 is superior, it is never markedly superior but, when it is worse, it can be considerably worse. This is a disappointing result and CH1 is only superior to C4.5 and CN2 on balance-scale, echocardiogram and ionosphere. Balance-scale is a simple domain where what is being measured is the turning moment of two weights on a beam and, since (weight, length) pairs of (12,1), (6,2), (4,3), (3,4), (2,6) and (1,12) are equivalent, it can be seen that the decision surfaces are curved. Thus it should be expected, from previous experimental results, that CH1 will perform well on this domain. Surprisingly, OC1 does better than CH1 but, reflecting on the data sets used, the values are all integers so that, although the underlying reality is a domain with curved surfaces, the actuality is a set of large flat facets, which is ideal for the bias of OC1.
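The z test referred to above, for two samples whose sizes and variances are treated as known, reduces to a standard calculation. The sketch below is that generic calculation (assuming the accuracies are approximately normal); it is not the exact code used for the thesis experiments.

from math import sqrt, erf

def z_test_p(mean1, sd1, n1, mean2, sd2, n2):
    # Two-tailed z test for a difference of means with known variances.
    z = (mean1 - mean2) / sqrt(sd1 ** 2 / n1 + sd2 ** 2 / n2)
    p = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))
    return z, p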
Accuracy         CH1         C4.5        CN2         OC1
balance-scale    83.26^23    76.67       81.30^2     91.50
Br.Canc.Wisc.    99.06^34    99.42^13    97.75       95.46
bupa             58.77       63.49^1     64.42^12    68.37
cleveland        55.54       87.73^13    87.64^1     72.22
echocardiogram   70.81^234   69.11^3     67.90       62.24
german           50.71       60.66^13    57.13^1     72.91
glass            64.14       73.31^1     76.28^12    70.51
glass7           55.21       66.18^1     66.34^12    60.27
heart            67.71       80.29^1     81.53^12    76.37
hepatitis        40.51       41.57^13    41.27^1     93.51
horse-colic      62.01       77.03^1     87.78^12    81.13
hungarian        62.48       88.55^13    86.07^1     79.66
ionosphere       90.83^234   87.71       89.69^2     77.55
iris             69.85       94.74^13    94.13^1     95.01
new-thyroid      75.78       91.83^1     95.30^12    90.72
page-blocks      57.47       87.20^1     90.34^12    85.37
pid              80.30^4     82.34^1     82.95^12    71.98
satimage         72.36       91.33^13    90.16^1     87.05
segment          91.94^4     94.41^13    94.32^1     58.60
shuttle1         66.32       93.84^13    92.05^1     74.83
sonar            56.64       74.12^13    70.50^1     70.19
soybean-large    81.01^4     90.68^13    84.60^1     70.52
vehicle          45.03       80.61^13    79.36^1     74.57
waveform         35.80       72.76^13    65.70^1     52.43
wine             84.04^4     87.01^13    85.45^1     81.06
Table 7.3: Comparison of CH1, C4.5, CN2 and OC1
CH1 is also superior to C4.5 and CN2 on the echocardiogram and ionosphere domains. These may be domains where the underlying concepts have curved surfaces, and certainly the truly continuous attributes will obviate the problem of flat surfaces arising as an artefact of the domain sampling. Unfortunately, these domains are not susceptible to any easy analysis. However, OC1 is also inferior to CH1, so we may infer that large flat surfaces are inappropriate for these domains, and this strengthens the conclusion about the curved decision surfaces. It is notable that OC1 performs worse than all other systems on the domains breast cancer Wisconsin, pid, segment, soybean-large and wine. Perhaps these domains have markedly axis orthogonal decision surfaces and the actual data sets do not mask this. Thus these domains are well biased for C4.5 and CN2. There is a possibility that, since most UCI Repository data sets have been collected and defined in the context of SAP-based systems, the Repository may have a preponderance of data sets suited to such systems. Such a preponderance might be strong and would adversely affect the results of systems with different biases. If our understanding of how the performance of various systems depends on their language bias and the shapes of the underlying concepts is correct, it is also possible that low density sampling of the domain has created artefacts in the decision surface shapes. Nonetheless, it would be informative to investigate what aspects of CH1 lead to this poor performance, since a weakly biased system should not necessarily be a bad system. If the data sets are SAP biased, then SAP classifiers have the correct orientation of the decision surface automatically and only have to position the surface from the evidence. A SNAP system has to decide both the orientation and the position of the decision surface from the same amount of evidence, so it has a bigger space of theories to explore. These results reflect those of Chapter 4, where it was also found that the performance of C4.5 and CH1 varied with the data set size and CH1 required more data to provide similar predictive accuracy. Since most of the UCI data sets may be such that CH1 is poorly biased for learning from them, that may explain the relatively poor performance of CH1. Also, the time complexity of the quickhull algorithm required the use of subsets of many of the UCI data sets and, in view of the Chapter 4 results suggesting better performance with large amounts of data, this paucity of data might well imply that the results obtained are at the lower end of the spectrum of possible system performance. One possible investigation is to substitute large SAP hulls for the large convex hulls to see if the convex hulls cause the lower performance. The other possibility is that the largeness of the hulls causes the poor performance. Another variation of CH1, which uses many small hulls, could be used to investigate this possibility. Both of these possibilities will be investigated in later chapters.
7.4 Complexity of Domain Representations
In a domain containing a small number of concepts, it is debatable whether a representation which involves tens or hundreds of small regions contributes to human comprehensibility. Certainly each small area may be individually explicable but holistic comprehension may be impossible. Also, the hyperrectangular structure imposed on the domain may be a subset or superset of the volume for which the interpretation is true; thus the underlying hypothesis language may exclude volumes which are explicable and include volumes which are not. One advantage of convex hulls is that, from a human comprehension viewpoint, the number of structures induced is similar to the number of underlying concepts. For CH1, the number of structures is the number of convex hulls which are constructed and for C4.5 the number of structures is the number of hyperrectangular regions which are identified. The number of actual concepts in each domain and the number of concepts induced by CH1 and C4.5, averaged over 100 runs, are shown in Table 7.4. The average number of concepts for satimage and shuttle1 is lower than one might expect because the sample used contains effectively only 5 classes for satimage rather than 6 and, for shuttle1, contains about 3 classes rather than 7. The smallness of the number of concepts induced by CH1 can be seen in comparison with C4.5. The number of hulls induced by CH1 is always very close to the number of actual concepts in each domain. Using a sign test, the number of hulls produced by CH1 is superior to that of C4.5 at p = 0.01.
7.5 Conclusions
Testing CH1 on domains which are known to have SNAP or curved decision surfaces produced the expected superiority of performance over SAP systems like C4.5 and CN2. Wider testing on domains where the underlying characteristics of the decision surfaces are unknown produced rather disappointing results. Of the domains where CH1 performed well, it was shown that echocardiogram and ionosphere are likely to have curved decision surfaces. It was also shown that the balance-scale domain tended to have curved surfaces but that the sampling of data points disguised this. It is further noted that the UCI Repository data sets, which largely came from work on SAP classifiers, might be biased towards domains on which these classifiers will work well. That is, the data sets are suggested to have mainly SAP decision surfaces. Another factor at work is the need, demonstrated in Chapter 4, for CH1 to have large bodies of data to enable it to select concepts from a very large concept space. Unfortunately, large data sets are difficult to process because of the computational demands of the
Domain           No. Concepts   No. Hulls: CH1   No. Hulls: C4.5
balance-scale    3              5.6              108.2
bcwo             2              2.8              33.2
bupa             2              3.7              95.8
cleveland        2              3.7              66.8
echocardiogram   2              2.6              13.5
german           2              7.3              76.4
glass            3              5.9              43.5
glass7           7              13.0             49.4
heart            2              3.7              11.4
hepatitis        2              2.0              4.4
horse-colic      2              2.3              9.6
hungarian        2              3.6              55.2
ionosphere       2              5.0              29.4
iris             3              3.0              8.6
new-thyroid      3              3.0              14.5
page-blocks      5              7.6              14.8
pid              2              5.1              39.0
satimage         6              5.8              19.4
segment          7              22.0             63.0
shuttle1         7              4.8              9.0
sonar            2              2.5              31.6
soybean-large    19             19.7             70.2
vehicle          4              4.5              20.2
waveform         3              3.3              17.2
wine             3              3.0              39.5
Table 7.4: Number of Regions Induced for each Domain
convex hull software. Possible factors which lead to the poor performance of CH1 on these data sets from the Repository are the convexity of the hulls and the largeness of the hulls. The contributions of each of these will be examined in the next two chapters. The closeness of the number of regions induced by CH1 to the actual number of underlying concepts in each domain is suggestive of some underlying suitability of the approach and may lead to the possibility of high level mathematical descriptions of concepts.
Chapter 8
Large Axis Orthogonal Hulls
8.1 Introduction
The performance of CH1, on a range of domains from the UCI Repository, is rather worse than that of C4.5 and CN2 and this may be due to the use of convex hulls rather than axis orthogonal structures. To investigate how this affects the performance of the classifier, it is necessary to isolate other aspects of the induction algorithm that may be confounding the results. Unfortunately, in this respect, the computational demands of convex hull formation have necessitated the development of induction techniques that depart from previous techniques in a number of respects other than the use of convex hulls, most notably in forming successive rules, each of which covers all remaining objects of a class. The differences between a few large convex hulls and a few large axis orthogonal hulls can be investigated by replacing the large convex hulls currently generated by CH1 with large axis orthogonal hulls. This alters no other aspect of the construction of the classifier, particularly the interaction of the decision list with the hulls. This version of the system will be thoroughly explored, as was CH1, to find its optimal operational settings.
8.2 Axis Orthogonal Hulls
To understand how each part of the data structure contributes to the performance, a replacement function for qhull(), which constructs the convex hulls, was designed. This new function, ao_hull(), outputs a hyperrectangle with axis orthogonal faces. There are two faces per attribute: one for the minimum value of that attribute, which is enclosed in the prism, and the other for the maximum value. Since this implementation, called AOH, does not use convex hulls, it is much faster than CH1. The experiments establishing the performance of CH1 will be repeated to establish how AOH performs, and then the chosen form of AOH will be compared with CH1, C4.5 and CN2. The descriptions of each section will be very abbreviated because they are identical to those for CH1 in the preceding chapters.
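In effect, ao_hull() replaces the convex hull of a group of points by their axis-orthogonal bounding box. A minimal sketch of that computation, for continuous attributes only, is shown below; the list-of-(min, max)-pairs representation is an illustrative choice rather than the actual CH1/AOH data structure.

def ao_hull(points):
    # Axis-orthogonal bounding hyperrectangle of a group of points: for each
    # attribute, the (minimum, maximum) pair gives the two faces described above.
    dims = len(points[0])
    return [(min(p[d] for p in points), max(p[d] for p in points))
            for d in range(dims)]

def inside_ao_hull(hull, point):
    # Membership test: the point must lie between every pair of faces.
    return all(lo <= x <= hi for (lo, hi), x in zip(hull, point))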
8.3 Evaluation of Per Hull Inflation Strategies
The performance of AOH with varying amounts of per hull inflation is shown in Table 8.1. Semi inflation has a win-loss ratio of 1:0 over nil inflation, which is not significant, and a win-loss ratio of 7:8 relative to full inflation, which is also not significant. Full inflation has a win-loss ratio of 9:7 relative to nil inflation which is not significant. Clearly, per hull inflation has very little effect on AOH classifiers.
8.4 Evaluation of Per Facet Inflation
The performance of AOH with varying amounts of per facet limited inflation is shown in Table 8.2. Semi inflation has a win-loss ratio of 4:0 over nil inflation and this is significant only at p = 0.25. Full inflation has a win-loss
Accuracy         Nil Infl.   Semi Infl.   Full Infl.
balance-scale    45.17       45.17        45.17
bcwo             96.93^3     96.93^3      54.48
bupa             49.19       49.19        49.19
cleveland        71.59^3     71.59^3      52.67
echocardiogram   66.54       66.54        66.54
german           50.89^3     50.89^3      50.46
glass            62.10       62.11^1      62.11^1
glass7           50.85^3     50.85^3      32.40
heart            71.01^3     71.01^3      44.10
hepatitis        30.02       30.02        30.60^12
horse-colic      61.10       61.10        61.39^12
hungarian        76.57^3     76.57^3      49.78
ionosphere       87.97       87.97        88.12^12
iris             84.98       84.98        84.98
new-thyroid      90.63       90.63        90.63
page-blocks      67.51       67.51        68.73^12
pid              76.98^3     76.98^3      65.86
satimage         60.16       60.16        60.16
segment          83.26       83.26        83.27^12
shuttle1         80.91       80.91        80.91
sonar            55.55       55.55        55.70^12
soybean-large    64.05       64.05        64.05
vehicle          58.14       58.14        58.14
waveform         38.55       38.55        39.71^12
wine             56.26       56.26        58.25^12
Table 8.1: Various Amounts of Per Hull Inflation
Accuracy         Nil Infl.    Semi Infl.   Full Infl.
balance-scale    45.17        45.17        45.17
bcwo             96.93        96.93        96.93
bupa             49.19        49.23^1      49.23^1
cleveland        71.59        71.59        77.45^12
echocardiogram   66.54^23     66.48        66.48
german           50.89^3      50.89^3      50.73
glass            62.10        62.34^1      62.34^1
glass7           50.85        50.85        53.37^12
heart            71.01        71.01        75.17^12
hepatitis        30.02        30.02        33.20^12
horse-colic      61.10        61.10        72.72^12
hungarian        76.57        76.57        78.34^12
ionosphere       87.97        87.97        88.97^12
iris             84.98        86.43^1      86.43^1
new-thyroid      90.63        90.76^1      90.76^1
page-blocks      67.51        67.51        73.40^12
pid              76.98        76.98        78.00^12
satimage         60.16        60.16        71.29^12
segment          83.26        83.26        84.29^12
shuttle1         80.91        80.91        85.57^12
sonar            55.55        55.55        58.88^12
soybean-large    64.05        64.05        64.29^12
vehicle          58.14        58.14        67.22^12
waveform         38.55        38.55        47.43^12
wine             56.26        56.26        69.19^12
Table 8.2: Various Amounts of Per Facet Inflation
ratio of 21:2 over nil inflation, which is significant at p = 0.01, and a win-loss ratio of 17:1 over semi inflation, which is also significant at p = 0.01. Clearly, full inflation is superior to both other amounts. Since there was no difference between the various amounts of inflation for the per hull mode, it is also immediately clear that full, per facet inflation is preferable to any other mode and amount for AOH.
8.5 Evaluation of Non-Essential Deletion
Non-essential deletion was applied to the classifiers constructed by AOH on a variety of domains and various amounts of per facet inflation were applied. The results are shown in Table 8.3. Semi inflation has a 3:0 win-loss ratio over nil inflation but this is not significant. Full inflation has a win-loss ratio of 18:3 over nil inflation, which is significant at p = 0.01, and a 16:2 win-loss ratio over semi inflation, which is significant at p = 0.01. Clearly, non-essential deletion is best when accompanied by full, per facet inflation.
8.6 Comparison of Non-Essential Deletion and Inflation
The next comparison is of full, per facet inflation alone and with non-essential deletion. The results are shown in Table 8.4. Non-essential deletion has a win-loss ratio of 4:1 over inflation alone, which is not significant. There is little to choose between these techniques, but non-essential deletion plus inflation will be chosen so that it is like CH1.
Accuracy         Nil Infl.    Semi Infl.   Full Infl.
balance-scale    45.17        45.17        45.17
bcwo             96.93        96.93        96.93
bupa             51.53        51.53        51.53
cleveland        76.22        76.22        77.45^12
echocardiogram   66.09        66.09        66.09
german           50.89^3      50.89^3      50.73
glass            62.93        63.15^1      63.15^1
glass7           50.90        50.90        53.37^12
heart            74.34        74.34        75.17^12
hepatitis        33.06        33.06        33.20^12
horse-colic      65.80        65.80        72.72^12
hungarian        78.56^3      78.56^3      78.34
ionosphere       87.97        87.97        88.97^12
iris             85.45        86.81^1      86.81^1
new-thyroid      91.22        91.35^1      91.35^1
page-blocks      69.25        69.25        73.40^12
pid              77.84        77.84        78.00^12
satimage         67.11        67.11        71.29^12
segment          83.79        83.79        84.29^12
shuttle1         84.77        84.77        85.57^12
sonar            57.41        57.41        58.88^12
soybean-large    64.29        64.29        64.29
vehicle          62.44        62.44        67.22^12
waveform         44.00        44.00        47.43^12
wine             65.78        65.78        69.19^12
Table 8.3: Per Facet Inflation after Non-Essential Deletion
Accuracy         Inflation   Deletion with Inflation
balance-scale    45.17       45.17
bcwo             96.93       96.93
bupa             49.23       51.53^1
cleveland        77.45       77.45
echocardiogram   66.48^2     66.09
german           50.73       50.73
glass            62.34       63.15^1
glass7           53.37       53.37
heart            75.17       75.17
hepatitis        33.20       33.20
horse-colic      72.72       72.72
hungarian        78.34       78.34
ionosphere       88.97       88.97
iris             86.43       86.81^1
new-thyroid      90.76       91.35^1
page-blocks      73.40       73.40
pid              78.00       78.00
satimage         71.29       71.29
segment          84.29       84.29
shuttle1         85.57       85.57
sonar            58.88       58.88
soybean-large    64.29       64.29
vehicle          67.22       67.22
waveform         47.43       47.43
wine             69.19       69.19
Table 8.4: Comparison of Non-Essential Deletion with Inflation against Inflation
8.7 Evaluation of Unordered Retention
This section evaluates various amounts of per facet inflation applied after non-essential deletion and unordered retention. The results are shown in Table 8.5. Semi inflation has a win-loss ratio of 3:0 over nil inflation, which is not significant. Full inflation has a win-loss ratio of 14:7 over nil inflation, which is significant only at p = 0.25, and a win-loss ratio of 11:7 over semi inflation, which is also significant only at p = 0.25. Clearly, full, per facet inflation is the preferred choice, although not strongly so.
8.8 Evaluation of Ordered Retention
This section evaluates various amounts of per facet inflation applied after non-essential deletion and ordered retention. The results are shown in Table 8.6. Semi inflation has a win-loss ratio of 3:0 over nil inflation, which is not significant. Full inflation has a win-loss ratio of 14:7 over nil inflation, which is significant only at p = 0.25, and a win-loss ratio of 11:7 over semi inflation, which is also significant only at p = 0.25. Clearly, full, per facet inflation is the preferred choice, although not strongly so.
8.9 Comparison of Retention Strategies
Lastly, ordered and unordered retention, each with non-essential deletion and full, per facet inflation, are compared to decide which will be compared to CH1. The results are shown in Table 8.7. The ordered retention strategy has a win-loss ratio of 19:1 over the unordered strategy, which is significant at p = 0.01. This is a much clearer-cut preference for the ordered strategy than was obtained for CH1. Clearly, there is some characteristic of AOH which is favourable to it.
Unordered Retention
Accuracy         Nil Infl.    Semi Infl.   Full Infl.
balance-scale    45.17        45.17        45.17
bcwo             98.95        98.95        99.06^12
bupa             52.17        52.17        52.17
cleveland        73.98        73.98        75.07^12
echocardiogram   63.05        63.05        63.05
german           51.12^3      51.12^3      50.71
glass            52.65        52.78^1      52.78^1
glass7           31.78        31.78        32.89^12
heart            74.99        74.99        75.45^12
hepatitis        33.35        33.35        33.48^12
horse-colic      68.73^3      68.73^3      68.48
hungarian        80.63^3      80.63^3      79.82
ionosphere       76.24        76.24        76.76^12
iris             85.32        86.55^1      86.55^1
new-thyroid      84.82        84.86^1      84.86^1
page-blocks      49.13^3      49.13^3      48.87
pid              77.85^3      77.85^3      77.13
satimage         34.85        34.85        35.01^12
segment          18.13        18.13        18.13^12
shuttle1         75.56^3      75.56^3      74.28
sonar            58.25^3      58.25^3      58.10
soybean-large    29.92        29.92        29.92
vehicle          54.34        54.34        54.66^12
waveform         39.10        39.10        39.42^12
wine             78.50        78.50        78.64^12
Table 8.5: Accuracy using Unordered Retention
Ordered Retention
Accuracy         Nil Infl.    Semi Infl.   Full Infl.
balance-scale    45.17        45.17        45.17
bcwo             98.95        98.95        99.06^12
bupa             52.17        52.17        52.17
cleveland        77.85        77.85        79.26^12
echocardiogram   63.38        63.38        63.38
german           51.12^3      51.12^3      50.71
glass            67.90        68.12^1      68.12^1
glass7           58.71        58.71        59.59^12
heart            75.25^3      75.25^3      75.16
hepatitis        33.35        33.35        33.48^12
horse-colic      74.55^3      74.55^3      73.49
hungarian        81.62^3      81.62^3      80.73
ionosphere       89.84        89.84        90.83^12
iris             86.69        87.92^1      87.92^1
new-thyroid      90.88        90.98^1      90.98^1
page-blocks      79.79^3      79.79^3      79.56
pid              81.02^3      81.02^3      80.22
satimage         88.76^3      88.76^3      88.59
segment          91.68        91.68        91.94^12
shuttle1         94.79        94.79        95.24^12
sonar            62.90        62.90        63.35^12
soybean-large    81.01        81.01        81.01
vehicle          77.72        77.72        78.33^12
waveform         59.87        59.87        62.49^12
wine             84.97        84.97        85.35^12
Table 8.6: Accuracy using Ordered Retention
Accuracy         Ordered Retention   Unordered Retention
balance-scale    45.17               45.17
bcwo             99.06               99.06
bupa             52.17               52.17
cleveland        79.26^2             75.07
echocardiogram   63.38^2             63.05
german           50.71               50.71
glass            68.12^2             52.78
glass7           59.59^2             32.89
heart            75.16               75.45^1
hepatitis        33.48               33.48
horse-colic      73.49^2             68.48
hungarian        80.73^2             79.82
ionosphere       90.83^2             76.76
iris             87.92^2             86.55
new-thyroid      90.98^2             84.86
page-blocks      79.56^2             48.87
pid              80.22^2             77.13
satimage         88.59^2             35.01
segment          91.94^2             18.13
shuttle1         95.24^2             74.28
sonar            63.35^2             58.10
soybean-large    81.01^2             29.92
vehicle          78.33^2             54.66
waveform         62.49^2             39.42
wine             85.35^2             78.64
Table 8.7: Comparison of Retention Strategies
8.10 Comparison of CH1 and AOH
Having completed the evaluation of AOH, it is found that the versions of CH1 and AOH which are to be compared both use non-essential deletion, ordered retention and full, per facet inflation. This means that everything about the two classifier construction methods is as alike as possible. Despite this, the choices which were made were afforded quite different experimental support, so it is difficult to anticipate the outcome of the final comparison. The results are shown in Table 8.8. Inspecting Table 8.8, AOH is seen to have a win-loss ratio of 15:5 over CH1, which is significant at p = 0.05. This result seems to confirm that large convex hulls have no advantage over large axis-orthogonal hulls on the set of domains explored.
8.11 Comparison of AOH with C4.5 and CN2
Having ascertained that the axis-orthogonal version, AOH, is superior to CH1, it is necessary to compare its performance to C4.5 and CN2 to see how much better it is. The comparison is done as before and the results are shown in Table 8.9. Unfortunately, AOH has loss-win ratios of 2:23 and 3:22 against C4.5 and CN2 respectively and these are both significant at p = 0.01. Clearly, the overall performance of AOH is still not close to that of other established systems. There must be deficiencies in the system design other than the use of large convex hulls.
8.12 Conclusions
The axis-orthogonal version, AOH, performed considerably better than CH1, especially when its simplicity is considered. However, it does not perform as well as established systems. The comparison of the performances of CH1 and
Accuracy         CH1         AOH
balance-scale    83.26^2     45.17
bcwo             99.06       99.06
bupa             58.77^2     52.17
cleveland        55.54       79.26^1
echocardiogram   70.81^2     63.38
german           50.71       50.71
glass            64.14       68.12^1
glass7           55.21       59.59^1
heart            67.71       75.16^1
hepatitis        40.51^2     33.48
horse-colic      62.01       73.49^1
hungarian        62.48       80.73^1
ionosphere       90.83       90.83
iris             69.85       87.92^1
new-thyroid      75.78       90.98^1
page-blocks      57.47       79.56^1
pid              80.30^2     80.22
satimage         72.36       88.59^1
segment          91.94       91.94
shuttle1         66.32       95.24^1
sonar            56.64       63.35^1
soybean-large    81.01       81.01
vehicle          45.03       78.33^1
waveform         35.80       62.49^1
wine             84.04       85.35^1
Table 8.8: Comparison of CH1 and AOH
Accuracy         AOH         C4.5        CN2
balance-scale    45.17       76.67^1     81.30^12
bcwo             99.06^3     99.42^13    97.75
bupa             52.17       63.49^1     64.42^12
cleveland        79.26       87.73^13    87.64^1
echocardiogram   63.38       69.11^13    67.90^1
german           50.71       60.66^13    57.13^1
glass            68.12       73.31^1     76.28^12
glass7           59.59       66.18^1     66.34^12
heart            75.16       80.29^1     81.53^12
hepatitis        33.48       41.57^13    41.27^1
horse-colic      73.49       77.03^1     87.78^12
hungarian        80.73       88.55^13    86.07^1
ionosphere       90.83^23    87.71       89.69^2
iris             87.92       94.74^13    94.13^1
new-thyroid      90.98       91.83^1     95.30^12
page-blocks      79.56       87.20^1     90.34^12
pid              80.22       82.34^1     82.95^12
satimage         88.59       91.33^13    90.16^1
segment          91.94       94.41^13    94.32^1
shuttle1         95.24^23    93.84^3     92.05
sonar            63.35       74.12^13    70.50^1
soybean-large    81.01       90.68^13    84.60^1
vehicle          78.33       80.61^13    79.36^1
waveform         62.49       72.76^13    65.70^1
wine             85.35       87.01^13    85.45^1
Table 8.9: Comparison of AOH, C4.5 and CN2
AOH has established that the large convex hulls perform less well than large axis-orthogonal hulls on this set of domains. However, replacing the convex hulls with axis-orthogonal hulls does not close the performance gap to C4.5 and CN2, so another aspect of the system design must also contribute to the poor performance. The matter of how the size of the induced structures affects the performance of the classifier will be pursued in the next chapter.
Chapter 9
CH1-CN2 Hybrid
9.1 Introduction
The comparison of the axis orthogonal version, AOH, and CH1 in the last chapter showed that the relative performances of AOH and CH1 were dependent on the type of hulls being constructed, with large convex hulls being distinctly inferior to large AO hulls. However, the performance of AOH was still distinctly worse than C4.5 and CN2, so we must look at causes, other than large convex hulls, for the poor performance of CH1. One other major difference between CH1 and other machine learning systems is the formation of few, large hulls rather than many, small hulls, and this can be investigated. It will be necessary to model domains with many small convex hulls and compare that with few large convex hulls (CH1) and many small AO structures (CN2 or C4.5).
9.2 Experimental Design
This will need a different hull construction algorithm to allow the building of concepts consisting of many small, disjunctive regions. Direct induction of multiple hulls for a class, in a manner such as that employed by the PIGS prototype, was not considered viable for CH1 for two major reasons. The first is that there would be some difficulty creating the initial simplex efficiently, since the group of points chosen might produce a degenerate and, thus, useless hull. The second problem is that using a large or widespread group of points may lead to a hull which is immediately invalidated by a negative example. If a group of close points is used to avoid the second problem then the first may arise, as well as leading to long run times. If a group of distant points is chosen to avoid the first problem and to improve performance, in the manner of the quickhull algorithm, then the second problem arises. There are also the problems of spiking, found in the prototype, to be considered, since these will reduce classification accuracy. Indeed, convex hulls were chosen partly to overcome these problems and partly because they seem an attractive form of generalisation. Since CH1 cannot form disjunctive concepts, it is proposed that data points be preclassified into disjunctive concepts by a classifier which supports this type of concept. Thereafter, convex hulls will be formed around these groupings of points, producing a rather different classifier, especially after facet deletion and inflation. For these experiments, the data points will be preclassified by CN2 and the groupings formed will be converted to convex hulls. A hybrid system, which reads the rules output by CN2 and uses these to determine the groups of points around which small hulls will be created, has been designed. The CN2 rules are directly representable as AO HULLS in the decision list. These hulls can then be used to pregroup the points and then convex hulls can be formed around the groups. The convex hull will then replace the AO HULL unless there were insufficient points to create a convex hull or the convex hull was degenerate, in which cases the AO HULL is retained. There may be a problem if the small groupings have too few points to form a hull in the domain attribute space or if the hull is often degenerate but, since the hulls which remain as AO hulls will have identical performance to CN2, any difference in performance between the hybrid and CN2 will be due to the formation of convex hulls on other groupings. This experiment will enable a comparison of the performance of a convex hull based classifier which uses many small disjunctive hulls with one which uses a small number of large convex hulls, and with the many small AO structures of the pre-classifying algorithm and C4.5. The algorithm for the hybrid is
HYBRID ALGORITHM
    RUN CN2
    READ CN2 rules and construct Decision List of AO HULLS
    FOR each rule
        CONSTRUCT corresponding group of instances
        FORM a convex hull round group of instances
        IF convex hull is not degenerate
            REPLACE AO HULL by convex hull
        ENDIF
    ENDFOR
    do non-essential deletion
    do ordered retention
    INFLATE facets
END.
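To make the grouping and replacement step concrete, here is a minimal Python sketch of the loop above. It is illustrative only: the rule representation (an axis-orthogonal box given as per-attribute bounds, standing in for a parsed CN2 rule) and the helper names are assumptions rather than the actual CH1 code, and hull construction is delegated to scipy.spatial.ConvexHull (a quickhull implementation) instead of the thesis software. The deletion, retention and inflation steps would then follow as in CH1.

# Illustrative sketch of the hybrid's grouping step (assumed names, not the thesis code).
# A "rule" stands in for a parsed CN2 rule: a list of (low, high) bounds, one per attribute.
import numpy as np
from scipy.spatial import ConvexHull, QhullError

def covered_by_rule(x, rule):
    # True when instance x lies inside the axis-orthogonal box described by the rule.
    return all(lo <= v <= hi for v, (lo, hi) in zip(x, rule))

def build_hybrid(rules, X):
    # For each CN2 rule, try to replace its AO hull by the convex hull of the
    # instances it covers; keep the AO hull if the hull cannot be formed.
    decision_list = []
    for rule in rules:
        group = np.array([x for x in X if covered_by_rule(x, rule)])
        try:
            hull = ConvexHull(group)              # needs more than d affinely independent points
            decision_list.append(("convex", hull))
        except (QhullError, ValueError):
            decision_list.append(("ao", rule))    # too few points or degenerate hull
    return decision_list                          # deletion, retention and inflation follow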
Apart from the formation of the groups, the CH1-CN2 hybrid is identical to CH1 and will use a decision list which exactly matches that output by CN2 in terms of rule ordering. The performance of the CH1-CN2 hybrid will be compared with CN2 itself, C4.5 and CH1.
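A matching sketch of classification with such a mixed decision list is given below. The (kind, region, class) tuple layout and the tolerance are illustrative assumptions, and the convex branch assumes facet equations in the form produced by scipy's ConvexHull (outward normal and offset, with interior points satisfying normal . x + offset <= 0).

import numpy as np

def classify(x, decision_list, default_class):
    # Walk the decision list in CN2 rule order; the first region containing x wins.
    # Each entry is (kind, region, cls), where kind is "convex" or "ao" (assumed layout).
    for kind, region, cls in decision_list:
        if kind == "convex":
            # region: array of facet equations [normal | offset] from ConvexHull.
            inside = bool(np.all(region[:, :-1] @ x + region[:, -1] <= 1e-9))
        else:
            # region: per-attribute (low, high) bounds of the retained AO hull.
            inside = all(lo <= v <= hi for v, (lo, hi) in zip(x, region))
        if inside:
            return cls
    return default_class    # fall back to the default rule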
Most AO hulls will convert to convex hulls, but some will fail due to there being insufficient points to define a hull in a space of that dimensionality, and some will fail because only a degenerate hull can be formed. In the case of these failures, the AO hull representation is retained. Each rule is marked as AO or convex so that the classification is done using the appropriate data structure. The classifier is tested immediately after reading in the CN2 rules as AO hulls to verify the correctness of the reading. After the conversion to convex hulls, the classifier is again tested. Then facet deletion and inflation are performed and the classifier is tested after each of these operations.

Some exploratory single experiments were performed to understand how the hybrid would operate. Some typical results are shown in Table 9.1, with each row being the confusion matrix for a 2 class concept as formed by the classifier at that stage. The first row shows the matrix for CN2, the second the matrix for the hybrid AO classifier, the third is the matrix for the raw convex hull, the fourth is the matrix after ordered deletion of facets, the fifth is the matrix after inflation and the sixth is the performance of C4.5 on the same data and test set. A confusion matrix entry of aa, ab, ba, bb implies that aa items of class a were identified as such, ab items of class a were identified as being of class b, ba items of class b were identified as being of class a and bb items of class b were correctly identified. Clearly, having high values for aa and bb and low values for ab and ba is characteristic of good performance.
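For reference, the sketch below shows how the per-class metrics used elsewhere in the thesis (sensitivity, specificity, PPV, NPV and accuracy) would be computed from one such aa, ab, ba, bb entry, treating class a as the positive class; it is a minimal illustration rather than the actual metric software of Section 10.2.

def metrics(aa, ab, ba, bb):
    # Per-class metrics for a two-class confusion matrix; class a is taken as positive.
    sensitivity = aa / (aa + ab)               # fraction of class a items recovered
    specificity = bb / (bb + ba)               # fraction of class b items recovered
    ppv = aa / (aa + ba)                       # reliability of an "a" classification
    npv = bb / (bb + ab)                       # reliability of a "b" classification
    accuracy = (aa + bb) / (aa + ab + ba + bb)
    return sensitivity, specificity, ppv, npv, accuracy

# First row (CN2) of Expt1 in Table 9.1: aa=20, ab=4, ba=25, bb=17.
print(metrics(20, 4, 25, 17))                  # accuracy is about 0.56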
             Expt1               Expt2               Expt3               Expt4
After    aa  ab  bb  ba      aa  ab  bb  ba      aa  ab  bb  ba      aa  ab  bb  ba
CN2      20   4  17  25      15  13   7  28      23  10  18  29      14  12  17  16
AO       20   4  17  25      15  13   7  28      23  10  18  29      14  12  17  16
cvx       7  17   7  35       5  23   2  33       4  29   2  45      22   4  25   8
del      16   8   9  33      17  11   2  33      17  16   5  42      15  11  13  20
inf      16   8   9  33      17  11   2  33      17  16   5  42      15  11  13  20
C4.5     15   9  11  31      13  15   8  27      23  10  12  35      13  13  11  22

Table 9.1: Exploration of Hybrid Classifier Operation
In every case, the CN2 matrix and the AO matrix are identical, which is as expected from a correct reading of the CN2 rules into the hybrid. However, it is almost always the case that the performance decreases on the change from AO hulls to convex hulls. Usually the performance on one class improves while that on the other class decreases. Occasionally both decrease by a small amount. The problem here is the creation of the highly specialised hulls which, as has been seen earlier, have only moderate overall classification accuracy although the performance for an individual class might be very good. The deletion operation always causes the accuracy to rise, as one would expect from earlier work, due to the generalisation of the hulls, and the performance seems good compared to CN2 and C4.5. The inflation operation causes a small or zero further rise in the accuracy. This final performance is always comparable to that of CN2 and can be slightly worse or better. It is similarly comparable to C4.5.

The next experiment is the usual 100 runs on a variety of domains comparing the accuracy of CN2, the hybrid, C4.5 and CH1, but the results in this table are from a different set of experiments to every other table in this thesis and so are not directly comparable to any other table. The results are shown in Table 9.2 with the usual annotations. Inspecting Table 9.2 and comparing CH1 and the hybrid, it can be seen that the hybrid has a win-loss ratio of 11:13, which is not significant. Thus using many, small hulls has not changed the performance from that obtained with few, large hulls. CN2 has a win-loss ratio of 23:1 over the hybrid and C4.5 a ratio of 22:3 over the hybrid, and both of these results are significant at p = 0.01. This change to small hulls has not improved the performance relative to any other system. Therefore, it appears that the pernicious effect of the convex hulls applies to both large and small convex hulls.
Domain            CH1         CN2         Hybrid       C4.5
balance-scale     80.37       82.26^13    81.30^14     77.69
bcwo              99.00^3     99.31^13    96.33        98.40^3
bupa              59.56^3     63.69^13    57.47        64.57^3
cleveland         57.55       88.78^13    76.24^1      89.18^3
echocardiogram    72.33^23    68.31^3     57.20        69.41^3
german            50.20       60.76^1     60.97^124    57.43
glass             64.09^3     76.02^13    63.65        73.21^3
glass7            56.01^3     67.66^13    50.27        66.33^3
heart             69.36       81.03^13    70.93^1      82.22^3
hepatitis         36.13       40.78^13    39.13^1      40.93^3
horse-colic       60.76       76.09^13    76.07^1      86.63^3
hungarian         65.31       89.61^13    69.09^1      85.44^3
ionosphere        91.31^23    87.13^3     84.57        90.94^3
iris              70.56       93.76^13    72.68^1      94.20^3
new-thyroid       76.94^3     93.99^13    58.88        91.66^3
page-blocks       55.70       87.47^13    63.12^1      91.10^3
pid               76.97       85.17^1     53.55        84.88^3
satimage          77.16       91.05^13    85.88^1      87.87^3
segment           91.61       94.14^13    93.43^1      94.15^3
shuttle1          61.66^3     90.64^13    13.80        90.88^3
sonar             57.76^3     74.53^13    55.66        70.05^3
soybean-large     80.22       90.11^13    83.91^1      84.33^3
vehicle           56.24^3     88.48^13    36.27        74.78^3
waveform          37.98       72.11^13    67.58^14     65.48
wine              70.55^3     88.24^13    30.29        87.61^3

Table 9.2: Comparison of CN2, hybrid and C4.5
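The significance statements attached to these win-loss records (23:1 and 22:3 significant at p = 0.01, 11:13 not significant) are consistent with a two-sided sign test over the domains; the exact test is specified earlier in the thesis, so the following sketch is only a plausible reconstruction.

from math import comb

def sign_test_p(wins, losses):
    # Two-sided sign test p-value for a win-loss record (ties ignored).
    n = wins + losses
    k = max(wins, losses)
    tail = sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n
    return min(1.0, 2 * tail)

print(sign_test_p(23, 1))    # far below 0.01
print(sign_test_p(22, 3))    # also below 0.01
print(sign_test_p(11, 13))   # not significant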
9.3 Conclusions

The hybrid classifier, which uses CN2 to pre-group data points, has been designed, implemented and tested, and shown to operate as expected in Section 9.2. The simple representation of a CN2 rule as a convex hull is inferior and, although deletion and inflation improve the performance of the convex hulls in the hybrid, the performance never recovers to the level of CN2. When the hybrid classifier is used on a range of domains, its accuracy is no different from that of CH1, and it can be concluded that many, small convex hulls are no better than few, large hulls. Since large hulls, each modeling one concept region, have other potential advantages, large hulls might be preferred over small hulls if they provide sufficient accuracy in a particular domain.
Chapter 10

Conclusions and Future Research

This thesis proposes a novel, geometric approach to inductive generalisation using convex hulls and empirically evaluates the method. The objective is to discover whether using a less strongly biased hypothesis representation can yield superior classifiers. Section 10.1 summarises the contribution of the thesis. Section 10.4 recapitulates the conclusions from the results of the experiments. Section 10.2 summarises the software that has been designed and implemented. Finally, Section 10.3 discusses some interesting issues for future research.
10.1 Summary

This thesis proposed that delimiting groups of points of a single class by a large, arbitrarily shaped polytope would yield a classifier with a less strong built-in bias in its hypothesis language than those classifiers which rely on SAP divisions of space. Such a classifier might have been expected to have a better classification performance over a variety of domains if it is assumed
that the SAP bias of many systems is not particularly well suited to the domains. This thesis investigates how to build such a classifier and examines its utility and performance.

Chapter 3 describes a simple prototype algorithm, PIGS, which constructs 2-D concepts by fitting, around the points, the tightest polygon which does not include any negative examples. The main feature of the algorithm is that it builds the polygon incrementally, and spiking was identified as the major problem of doing this. A temporary solution to spiking was to constrain the new polygons formed to have limits on their sizes or shapes. This simple classifier was tested on some artificial data sets against C4.5 and OC1. The experimental results showed that a polytope based classifier was viable. The classifier gave significantly better results than both C4.5 and OC1 except on data sets for which they were particularly well biased ("squares" and "POL" respectively). It was noted that the polytope-based classifier produced better performance on dense data sets and that this was because it needed to position and orient each line segment whereas SAP systems only have to choose a position. This may be a problem which will disadvantage CH1 on natural data sets, which are all rather sparse, particularly those with large numbers of attributes. To solve spiking, it was concluded that constraining polygons to being convex would be satisfactory as well as being epistemologically appealing as a representation. It was noted that the use of large convex hulls could give access to higher level mathematical descriptions of induced concepts.

Chapter 4 describes a classifier, CH1, which forms large convex hulls around points of the same class and, subsequently, forms smaller convex hulls of exceptions within the initial hull. Recently published algorithms for convex hulls and the power of modern computers make this approach more attractive and perhaps feasible. All such hulls are maintained in a decision list. Experimental work establishes the satisfactory performance of this implementation of an N-dimensional convex hull based classifier. Various
methods of ordering the decision list are discussed and an empirical approach is adopted, after experimentation, wherein each new rule is prepended to the decision list and the first, most populous rule is the default rule. The detailed performance metrics of CH1 and C4.5, on a series of artificial data sets, are examined. The performance clearly depends on the data density. When the data density is low, because of high dimensionality or a low number of data points, the performance of C4.5 tends to overtake that of CH1. On some metrics, CH1 produces 100% performance, and if classification queries can be suitably framed, CH1 will provide very good performance.

In Chapter 5, the insights provided by the detailed performance metrics in the previous chapter were applied to compare and contrast the performances of classifiers using differently biased hypothesis languages. It is demonstrated that convex hull based systems tend to produce conservative, highly specialised regions whereas SAP based systems tend to produce greater generalisations. The consequences of this are that classifications of instances as being within a hull will tend to be very reliable, but that some positive instances will be misclassified due to the inherent conservatism of the convex hull. Contrariwise, the SAP regions will be somewhat larger and be slightly less reliable at classifying points which are contained in them, but will miss fewer positive instances than the convex hull because the hyperrectangle volume will be larger than that of the convex hull. It is proposed that the representation using tight-fitting surfaces, and consequent performance, of classifiers can be modified by inflation. Experimental results show that inflation is a viable operator in reducing the overspecialisation of concepts represented by convex hulls. This reduction in overspecialisation was accompanied by a rise in classification accuracy. Various possible modes of inflation are discussed and the more promising ones are examined experimentally. It is demonstrated that the per facet mode of inflation allowed the greatest improvement in classification accuracy and the theoretical reasons for this are examined.
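The per facet view referred to above can be made concrete: a convex hull is the intersection of the halfspaces defined by its facets, so both membership testing and inflation reduce to work on the facet equations. The sketch below uses scipy.spatial.ConvexHull to obtain those equations; the single uniform outward offset used for inflation is an illustrative simplification of CH1's per facet inflation, not its exact rule.

import numpy as np
from scipy.spatial import ConvexHull

def facet_equations(points):
    # Facets of the convex hull as rows [normal | offset], with normal . x + offset <= 0 inside.
    return ConvexHull(points).equations

def inside(x, equations, inflation=0.0):
    # Membership test against the (possibly inflated) facets. Inflating moves each
    # facet hyperplane outward by `inflation`, enlarging the region the hull accepts.
    normals, offsets = equations[:, :-1], equations[:, -1]
    return bool(np.all(normals @ x + offsets <= inflation))

# Toy example: the unit square in 2-D and a query point just outside one facet.
square = np.array([[0.0, 0.0], [1.0, 0.0], [1.0, 1.0], [0.0, 1.0]])
eqs = facet_equations(square)
print(inside(np.array([1.05, 0.5]), eqs))                   # False: outside the hull
print(inside(np.array([1.05, 0.5]), eqs, inflation=0.1))    # True after inflation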
It is also demonstrated that any inflation improves the classification performance relative to no inflation and that generally larger amounts of inflation provide better performance. The strong interaction between inflation and the decision list structure is noted, particularly that there is no simple way to predict how inflation will affect performance metrics for individual classes. The only safe expectations are that the sensitivity, NPV and predictive accuracy of a class will rise when the corresponding hull is inflated.

Another method of altering the overspecialisation of the convex hulls is proposed in Chapter 6. This involves the deletion of facets which do not contribute to the classification performance. Facets which are on the outer edges of the data area, with nothing beyond them, will never exclude any point and so these are all deleted. Deletion of these facets has less effect than expected because the deleted facets tend to be at the outer edge of the domain and the remaining edges are still too specialised. When these remaining facets are inflated, classification performance improves. The remaining facets are in the area between concepts and there is generally considerable redundancy in their exclusion of points of other classes. Two algorithms for facet retention/deletion were evaluated and ordered retention was found to be superior. The use of these algorithms to reduce the number of retained facets to the minimum, consistent with no rise in resubstitution errors, simplifies the classifier enormously. The number of facets retained is typically reduced by two orders of magnitude. The study of the number of concepts in a domain and the number of concepts induced showed that CH1 produced approximately the same number of regions as there were underlying concepts. The similarity of the number of hulls induced to the number of concepts in the domain is noted and is asserted to be a good characteristic, suggesting that the induced representation is good, especially compared to the large numbers of regions identified by systems like C4.5.
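The retention step recapitulated above is essentially a covering problem: each surviving facet excludes some of the negative instances, and only a small subset of facets is needed to exclude them all. The greedy sketch below illustrates the idea; the ordering used here (facets ranked by how many negatives they exclude) is an assumption standing in for the ordering actually used by CH1.

import numpy as np

def ordered_retention(equations, negatives):
    # Greedy facet retention: visit facets in order and keep a facet only if it
    # excludes some negative instance not already excluded by a kept facet.
    normals, offsets = equations[:, :-1], equations[:, -1]
    excl = (negatives @ normals.T + offsets).T > 0    # excl[i, j]: facet i excludes negative j
    order = np.argsort(-excl.sum(axis=1))             # assumed ordering: most exclusions first
    kept, covered = [], np.zeros(len(negatives), dtype=bool)
    for i in order:
        if np.any(excl[i] & ~covered):
            kept.append(int(i))
            covered |= excl[i]
    return kept                                       # indices of retained facets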
In Chapter 7, CH1 is evaluated on data sets where it is expected to provide better performance than SAP-based classifiers. The experiments verify the expected good performance. In a wider range of domains from the UCI Repository, however, CH1 is generally inferior. On domains where it is superior, the underlying concepts plausibly have SNAP decision surfaces. There is a real possibility that the Repository contains a preponderance of data sets well suited to SAP-based classifiers. Of course, this may be the reality of the world, but it is not clear that the general performance of CH1 is as disappointing as it might appear on these domains. The performance of CH1 is also limited by the comparative sparseness of the population of the Repository data sets, since CH1 needs more data to provide good performance relative to systems which are well biased for the underlying concepts.

In Chapter 8, an SAP version of CH1, called AOH, is implemented. AOH is identical to CH1 except that it constructs large axis orthogonal hulls. The performances of CH1 and AOH are compared in an effort to understand the contribution of convex hulls to the classifier performance. Overall, AOH with large AO hulls outperformed CH1 with large convex hulls. A study was made of the number of regions induced by both classifiers, compared to the number of underlying concepts, to evaluate the quality of concept representations, and both were found to produce similar numbers of hulls. It was concluded that large AO hulls are superior to large convex hulls as a concept representation format on the set of domains used. Comparisons with CN2 and C4.5, on UCI data sets, show that few, large AO hulls provide inferior performance to many, small AO hulls. Thus large hulls are not attractive on the UCI data sets used in this thesis and may not be generally attractive. The results of this chapter may be misleading if the domains used preponderantly contain SAP concepts, as suggested previously.

In Chapter 9, an investigation of the comparative performance of many, small AO hulls and many, small convex hulls is made. Since incremental building of convex hulls is not viable, to enable CH1 to have many, small
convex regions, the data points are pregrouped by CN2 and groupings of points can be extracted, around which convex hulls can be constructed. Thus it was possible to model concepts as many, small and disjunctive convex hulls. This experiment allowed the comparison of the efficacy of modeling concepts using many, small, convex hulls and many, small, axis orthogonal hulls. The hybrid algorithm was found to perform essentially identically to the original CH1 algorithm and worse than CN2 and C4.5, each with many, small AO hulls. This suggests that the poor performance of CH1, on the UCI data sets, is not due to the size of the hulls but simply to their convexity. Thus, on the evidence from the data sets used in this thesis, convex hulls are not an effective method, whether large or small. The same caveat about underlying SAP concepts in the data sets applies to this conclusion.
10.2 Summary of Software Designed and Implemented

The following is a brief summary of the software designed and implemented during this project.

- A prototype 2-D polygonal induction system (PIGS) was written and tested.
- Six artificial data sets were created to evaluate PIGS.
- An n-dimensional system for creating polytopes, CH1, was created and tested.
- Software to generate seven artificial datasets was constructed to evaluate its performance.
- Software to convert the confusion matrix output of C4.5, CH1, CN2 and any other classifiers to various metrics (sensitivity, specificity, PPV, NPV, accuracy) was created and tested.
- Software was added to CH1 to inflate hulls, using per hull and per facet strategies, to evaluate inflation.
- Software to delete facets which are not excluding any points was written and tested.
- Software to perform ordered retention of facets was written and tested.
- Software to perform unordered retention of facets was written and tested.
- The software was modified to produce large axis orthogonal hulls instead of convex hulls.
- Software was designed and tested to read CN2 rules, partition the data set according to these rules and construct a convex hull for each CN2 decision region.
- The software was modified to produce axis orthogonal hulls or convex hulls depending on whether or not the set of points defined a convex hull. All classification functions were modified to use the appropriate data structure.
- Software was designed and implemented to handle categorical data. Facet deletion was modified to delete either an AO hull facet, a convex hull facet or a categorical value, as appropriate, in facet deletion or retention.
- The command files to run all experiments, handle shuffling and partitioning of training and test sets, and process all output were written and tested.
10.3 Future research

Most of the practical difficulties with CH1 centre on the sometimes long compute times for the convex hulls, and yet these hulls are promptly dismantled and inflated in the search for classification accuracy. Some aspects of the process need to be retained to get large hulls, but if the important facets can be obtained more quickly it would be advantageous. Perhaps some regression technique or genetic programming technique can obtain a set of discriminating hyperplanes, and these can be subjected to the usual deletion and inflation operators. This would side-step the long compute times associated with quickhull and allow exploration of really large datasets, where a variant of CH1 might be superior to C4.5 and the like. Other systems, for instance DIPOL92, which use regression, do not position the hyperplanes in the same way that CH1 does, and it would be interesting to explore the differences. It would be particularly valuable to obtain a copy of DIPOL92 for comparison purposes, but this has not been possible as yet. A convex hull which is approximate, in the sense that it is not as tight as possible to the points, would also be much quicker to construct. This would be a more attractive approach than the current one, since it could be much less tight-fitting and therefore have fewer facets and better run times. Another approach is the quick construction of angular hulls and the smoothing of such a hull into an approximation of a convex hull. The possibility of sets of small hulls being combined to form larger structures would offset the consequences of using many small hulls to model
difficult domains and should be investigated. A demonstration of actual progress towards the extraction of a higher level mathematical description of the concepts modeled by the hulls would also be useful.
10.4 Conclusions

At the outset of this research, the use of convex hulls for induction seemed attractive since it offered a classifier with a much less strong hypothesis language bias than axis orthogonal systems and the possibility of better classification performance over a set of domains than axis orthogonal systems. They also offered the possibility of representing concepts as a few convex hulls, rather than a multiplicity of small, inappropriately shaped regions, which would allow holistic appreciation of the concept. The possibility of using the tools of computational geometry to extract higher level mathematical descriptions of concepts is also attractive.

It has been demonstrated that the speed of modern computers and new algorithms for constructing convex hulls make geometric modelling of concepts viable in spaces of low dimensionality. Convex hulls have been shown to be a simple, satisfactory method for constructing useful polytopes. Convex hulls offer smaller generalisations than other techniques which use hypothesis language based generalisation, especially in continuous domains, since convex hulls do not use discretisation. The use of facet deletion and inflation has been shown to improve the performance of the classifier.

Understanding the relative performance of CH1 and SAP based classifiers is a matter of noting that CH1 needs to position and orient decision surfaces whereas SAP classifiers only have to place the decision surfaces, since there is no choice of orientation. If the SAP classifier has the same orientation for its decision surfaces as the underlying concepts of the domain, then it will
always outperform CH1 because it is well biased. If the SAP classifier is poorly biased for the domain, then CH1 will perform better, since the SAP classifier now has to use the data to place many, small, inappropriately shaped regions. If the data density for the domain is high, then there may be much more information than an SAP classifier needs but a sufficiency for CH1, and so its performance, on underlying SAP concepts, rises to match, and perhaps surpass, that of the SAP classifier.

It has been demonstrated that the advantages of convex hulls, relative to SAP-based classifiers, are:

- a classifier with less bias in terms of the geometry of induced concepts.
- a classifier which induces one large structure per concept rather than many small structures is philosophically appealing in its economy of representation.
- CH1 with fewer, larger hulls offers classification accuracy which is superior to well-known systems on data sets where the underlying concepts are known to be SNAP. On data sets with SAP underlying concepts, CH1 needs higher data densities to be competitive.
- large hulls offer access to mathematical descriptions of concepts which can be extracted from convex hulls and should be used when the training set can be fully resolved with them.

The disadvantages of CH1 are:

- the use of a successive generalisation algorithm is contraindicated by difficulties with spiking and infeasible run times.
- after the effort to create the convex hull, it is immediately deconstructed using facet deletion and inflation.
- for domains with large numbers of attributes and/or large numbers of instances, run times can still be infeasible.
- many commonly used data sets may have underlying SAP concepts, and CH1 needs more data than an SAP system since it has to position and orient line segments rather than just place them as an SAP system does.
- run times for the quickhull algorithm are very variable and can be prohibitive even for small data sets of tens of items.

The initial hopes for the power of modern computers and algorithms have not been as well supported as might have been expected. The implementation of CH1 has produced results, at all stages of its development, which match theoretical expectations, and so there can be reasonable confidence in the implementation itself. Experiments on selected data sets suggest that CH1 can outperform SAP-based systems in both accuracy and economy of representation. However, it appears that conventional SAP approaches such as C4.5 and CN2 provide better predictive accuracy at lower computational cost on the types of learning task found in the UCI Repository. It was demonstrated that the convexity of the hulls, rather than their largeness, was the major contribution to the performance of CH1 on "real world" data sets from the UCI Repository. It may be that the lack of a strong bias in convex hulls, the very feature that made them attractive initially, is their Achilles' heel for real world data sets, since they require far greater data density to choose the position and orientation of each decision surface.
Appendix A

Data Set

A.1 Description of Data Sets

Some of the data sets proved to have infeasible run times on the quickhull software and so a subset was used. This was determined empirically by trying different sizes until an acceptable run time with as large a subset as possible was found. The subsets were chosen to have class populations approximately the same as the whole set. Occasionally, some very small classes were deleted completely to avoid having examples of a class appear in only one of the training and test sets. The chosen subset was randomly partitioned into training and test sets and both sets were shuffled before each experiment.
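The subsetting and partitioning procedure can be summarised by the sketch below; the retained fraction is a placeholder (it was chosen empirically per domain), and the helper is illustrative rather than the command files of Section 10.2.

import random
from collections import defaultdict

def subsample_and_split(instances, labels, keep=0.3, train_fraction=0.5, seed=0):
    # Keep roughly the same class proportions as the full data set, then shuffle
    # and randomly partition the chosen subset into training and test sets.
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for x, y in zip(instances, labels):
        by_class[y].append((x, y))
    subset = []
    for items in by_class.values():
        rng.shuffle(items)
        subset.extend(items[:max(1, int(keep * len(items)))])
    rng.shuffle(subset)
    cut = int(train_fraction * len(subset))
    return subset[:cut], subset[cut:]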
A.1.1 balance-scale This data set has 4 continuous attributes and 3 classes with 625 instances with no missing values. All data items are used.
A.1.2 bcw This is the Breast Cancer Wisconsin data set with patient id removed. The data attributes are:

Clump-Thickness: continuous
Uniformity-of-Cell-Size: continuous
Uniformity-of-Cell-Shape: continuous
Marginal-Adhesion: continuous
Single-Epithelial-Cell-Size: continuous
Bare-Nuclei: continuous
Bland-Chromatin: continuous
Normal-Nucleoli: continuous

There are two data classes and 699 data items. Those with missing values are removed, leaving 683 items in this study. All data items are used.
A.1.3 bf This is the body fat data set. The attributes, both continuous, are weight (kg) and height (m). The data sets are derived from height and weight frequency tables in [117]. There are 4 classes: underweight, normal, fat and obese, with relative frequencies of approximately 2:6:5:2. Data points can be created as necessary.
A.1.4 bupa This is a renal function data set with data attributes:
mcv: continuous
alkphos: continuous
sgpt: continuous
sgot: continuous
gammagt: continuous
drinks: continuous

There are two classes and 345 data instances. All data items are used.
A.1.5 Cleveland This is heart disease data collected at the V.A. Medical Center, Long Beach, and the Cleveland Clinic Foundation. The principal investigator responsible for the data collection was Robert Detrano, M.D., Ph.D. The data attributes are:

age: continuous
sex: 0, 1
cp: 1, 2, 3, 4
trestbps: continuous
chol: continuous
fbs: 0, 1
restecg: 0, 1, 2
thalach: continuous
exang: 0, 1
oldpeak: continuous
slope: 1, 2, 3
ca: continuous
thal: 3, 4, 5, 6, 7

There are two classes for the data. There are 303 data instances with less than 1% missing values, which have been replaced by mean values in this study. This data set has long run times and so is cut to 30 items in each class. Adding another 10 items produces a 10-fold increase in the run times.
A.1.6 echocardiogram This data set has 6 attributes, 2 classes and 74 data instances. The attributes are:

age-at-heart-attack: continuous
pericardial-effusion: 0, 1
fractional-shortening: continuous
epss: continuous
lvdd: continuous
wall-motion-index: continuous

The ratio of the two classes is 2:1 and 12 instances with missing values were removed for this study. All data items were used.
A.1.7 german This version of this data set has 24 continuous attributes in 2 classes with 1000 instances. Only 135 instances per class were used.
A.1.8 glass This glass data set has 6 attributes, 3 classes and 214 data instances. The attributes are:

RI: continuous
Sodium: continuous
Magnesium: continuous
Aluminum: continuous
Silicon: continuous
Potassium: continuous
Calcium: continuous
Barium: continuous
Iron: continuous

The classes are float (41%), not float (35%) and other (24%). There are no missing values. All data items were used.
A.1.9 glass7 This is identical to the glass dataset except that there are 7 classes. All data items were used.
A.1.10 heart This data set has 12 continuous variables, 2 classes and 270 data instances. The algorithm was also extremely slow on this data set so only 34 items per class were used.
A.1.11 hepatitis This data set has 7 continuous and 12 binary attributes with 2 classes. There are 155 instances of which only 35 per class are used.
A.1.12 horse-colic This data set has 8 continuous attributes, 12 which can be considered continuous and 1 which is binary and 2 classes. Only 40 items per class are used.
A.1.13 hungarian This is a heart disease data set with 10 attributes, 2 classes and 294 data instances. The attributes are:

age: continuous
sex: 0, 1
cp: 1, 2, 3, 4
trestbps: continuous
chol: continuous
fbs: 0, 1
restecg: 0, 1, 2
thalach: continuous
exang: 0, 1
oldpeak: continuous

There are missing values, and 24 data instances with these were removed for this study. Only 70 items per class were used.
A.1.14 ionosphere This data set has 34 continuous attributes and 2 classes with 351 instances all of which were used.
A.1.15 iris This is the iris data set of Fisher [39]. There are 4 attributes, 3 classes and 150 data instances. The attributes are:

sepal-length-in-cm: continuous
sepal-width-in-cm: continuous
petal-length-in-cm: continuous
petal-width-in-cm: continuous

The three classes are present in equal proportions and there are no missing values. All instances were used.
A.1.16 new thyroid This data set has 5 attributes, 3 classes and 215 data instances. The data attributes are:

T3-resin-uptake: continuous
Total-Serum-thyroxin: continuous
Total-serum-triiodothyronine: continuous
basal-TSH: continuous
mod-TSH: continuous

The classes are present in the ratios 150:35:30 and there are no missing values. All instances were used.
A.1.17 page-blocks This data set has 10 continuous attributes and 5 classes with 5473 instances. A small set totalling 140 items was used. Many classes are present only in numbers too small for a convex hull, and so more items were used for classes where there were sufficient.
A.1.18 pid This is the Pima Indian Diabetes data set. There are 8 attributes, 2 classes and 768 data instances. The attributes are:

Number-of-times-pregnant: continuous
Oral-glucose-tolerance: continuous
Diastolic-blood-pressure-(mm-Hg): continuous
Triceps-skin-fold-thickness-(mm): continuous
Two-Hour-serum-insulin-(mu-U/ml): continuous
Body-mass-index-(weight-in-kg/(height-in-m)^2): continuous
Diabetes-pedigree-function: continuous
Age-(years): continuous

The classes are present in the ratio 500:268 and there are no missing values. This data set usually exhibits long run times and only 150 items per class were used.
A.1.19 POL This is an artificial data set described in [80]. The attributes are continuous x and y values in the same range, and there are 2 classes, each with disjoint regions
caused by 4 parallel oblique lines at 45 degrees. Data points can be created as necessary.
A.1.20 satimage This is part of a frame of landsat MSS imagery. There are 6 decision classes, 36 integer attributes and 4435 training and 2000 test instances. This data set has prohibitively long run times so only 225 instances were used.
A.1.21 segment This is an image segmentation data set with 19 attributes, 7 classes and 2310 instances and no missing values. This data set has prohibitively long run times and so only 45 items per class were used.
A.1.22 shuttle This is the space shuttle data set with 9 numerical attributes, 7 classes, 43500 training instances and 14500 test instances. Only 100 instances were used, with some classes not present.
A.1.23 sonar This data set has 60 continuous attributes in 2 classes and 208 instances. All data items were used.
A.1.24 soybean-large This data set has 35 numerical attributes, 19 classes and 307 data instances with missing values. Instances with missing values were removed for this study. A set of 200 items was used with some classes not present.
A.1.25 vehicle This is a vehicle recognition database with 18 continuous attributes, 4 classes and 846 instances with no missing values. The four classes are opel, saab, bus, van. Only 30 items per class were used.
A.1.26 waveform This data set has 20 continuous attributes, 3 classes and 300 instances with no missing values. Only 30 items per class were used.
A.1.27 wine This data set has 3 classes, 13 continuous attributes and 178 instances with no missing values. Only 35 items per class were used.
Bibliography 1] D.W. Aha, D. Kibler, and M.K. Albert. Instance-based learning algorithms. Machine Learning, 6:37{66, 1991. 2] G. Allen, V. Ciesielski, and W. Bolam. Evaluation of an expert system for predicting rain in Melbourne. In First Australian AI Congress, pages I6{I17, 1986. 3] W. Altherr. An algorithm for enumerating the vertices of a convex polyhedron. Computing, 15:181{193, 1975. 4] D. Angluin, M. Frazier, and L. Pitt. Learning conjunctions of horn clauses. Machine Learning, 9:147{164, 1992. 5] D. Avis and K. Fukuda. A pivoting algorithm for convex hulls and vertex enumeration of arrangements and polyhedra. Discrete Computational Geometry, 8:295{313, 1992. 6] C.B. Barber, D.P. Dobkin, and H. Huhdanpaa. The quickhull algorithm for convex hulls. Submitted to ACM Trans. Mathematical Software, May 1995. 7] N. Beckmann and H-P. Kriegel. The r*-tree:an ecient and robust access method for points and rectangles. Proc. ACM SIGMOD Int. Conf. on Management of Data, pages 322{331, 1990. 187
8] C.B. Begg and R. Gray. Calculation of polychotomous logistic regression parameters using individualised regressions. Biometrika, 71:11{18, 1984. 9] J.L. Bentley, M.G. Faust, and F.P. Preparata. Approximation algorithms for convex hulls. Comms. of the ACM, 25(1):64{68, 1982. 10] J.L. Bentley, H.T. Kung, M. Schkolnick, and C.D. Thompson. On the average number of maxima in a set of vectors. J. ACM, 25:536{543, 1987. 11] E. Bloedorn and R.S. Michalski. Data driven constructive induction in aq17-pre: a method and experiments. In Proc. IEEE 3rd Conference on Tools for Articial Intelligence, pages 30{37. IEEE Computer Society Press, 1991. 12] E. Bloedorn, R.S. Michalski, and J. Wnek. Multistrategy constructive induction aq17-mci. In Proc. 2nd Workshop on Multistrategy Learning, pages 188{203, 1992. 13] D.M. Boulton and C.S. Wallace. A program for numerical calculation. Comp. Journal, 11(1):63{69, 1970. 14] D.M. Boulton and C.S. Wallace. An information measure for hierarchic classi cation. Comp. Journal, 16(3):254{261, 1973. 15] L. Breiman, J.H. Friedman, R.A. Olshen, and C.J. Stone. Classication and Regression Trees. Wadsworth Int. Group, Belmont, California, 1984. 16] C.E. Brodley and P.E. Utgo. Multivariate versus univariate decision trees. Technical Report 92{8, Dept. of Computer Science, Uni. of Massachussets, Amherst, M.A., 1992. 188
17] C.E. Brodley and P.E. Utgo. Multivariate decision trees. Machine Learning, 19:45{77, 1995. 18] W. Buntine. Generalised subsumption and its application to induction and redundancy. Articial Intelligence, 36:149{176, 1988. 19] W. Buntine and T. Niblett. Technical note: A further comparison of splitting rules for decision tree induction. Machine Learning, 8:75{85, 1992. 20] R.M. Cameron-Jones. Minimum description length instance-based learning. In Proceedings of 5th Australian Joint Conf. on A.I., 1992. 21] J. Catlett. On changing continuous attributes into ordered discrete attributes. In Proceedings of 5th European Working Session on Learning, pages 164{178. Springer Verlag, 1991. 22] cdd can be obtained from
[email protected].
[email protected].ch
or
23] B. Cestnik, I. Kononenko, and I. Bratko. Assistant 86: A knowledgeelicitation tool for sophisticated users. In Progress in Machine LearningProc.of EWSL 87. Sigma Press, Wilmslow, 1987. 24] chD can be obtained ftp://robotics.eecs.Berkeley.edu/pub/ConvexHull.
from
25] P. Clark and R. Boswell. Rule induction with CN2: Some recent improvements. In Proc. 5th European Working Session on Learning, pages 151{163. Springer Verlag, 1991. 26] P. Clark and Tim Niblett. The cn2 induction algorithm. Machine learning, 3:261{284, 1989. 189
27] K.L. Clarkson. Safe and eective determinant evaluation. In Proc. 31st IEEE Symposium on Foundations of Computer Science, pages 387{395, 1992. 28] K.L. Clarkson, K. Mehlhorn, and R. Seidel. Four results on randomized incremental constructions. In Proc. Symp. Theor. Aspects of Comp. Sci., 1992. 29] W.W. Cohen. Fast eective rule induction. In Machine LEarning: Proceedings of 12th International Conference, pages 115{123. Morgan Kaufmann, 1995. 30] P.A. Devijer and J.V. Kittler. Pattern Recognition: A Statistical Approach. Prentice Hall, 1982. 31] T. Dietterich, B. London, K. Clarkson, and G Dromey. Learning and inductive inference. In The Handbook of Articial Intelligence, volume 3. Kaufmann, 1982. 32] J. Dougherty, R. Kohavi, and M. Sahami. Supervised and unsupervised discretization of continuous features. In The 12th International Conference on Machine learning, pages 194{202, 1995. 33] B.A. Draper, C.E. Brodley, and P.E. Utgo. Goal-directed classi cation using linear machine decision trees. IEEE PAMI, 16(9):888{893, 1994. 34] Edelsbrunner. Algorithms in Combinatorial Geometry. Springer Verlag, 1987. 35] I.Z. Emiris, J.F. Canny, and R. Seidel. An ecient approach to removing geometric degeneracies. In Proceedings of the 8th Annual ACM Symposium on Computational Geometry, pages 74{82, 1992. 190
36] F. Esposito, D. Malerba, and G. Semeraro. Decision tree pruning as a search in the state space. In Proc. European Conference on Machine Learning, pages 165{184, 1993. 37] U.M. Fayyad and K.B. Irani. On the handling of continuous-valued attributes in decision tree generation. Machine Learning, 8(1):87{102, 1992. 38] D.H. Fisher. Knowledge acquisition via incremental conceptual clustering. Machine Learning, 2:139{172, 1987. 39] R. Fisher. The use of multiple measurements in taxonomic problems. Annals of Eugenics, 7:179{188, 1936. 40] J. Furnkranz and G. Widmer. Incremental reduced error pruning. In Machine Learning: Proceedings of the Eleventh Annual Conference. Morgan Kaufmann, 1994. 41] S.I. Gallant. Optimal linear discriminants. In Proc. of Int. Conf. on Pattern Recognition, pages 849{852. IEEE Computer Society Press, 1986. 42] M. Gams and N. Lavrac. Review of ve empirical learning systems within a proposed schemata. In Progress in Machine LearningProc.of EWSL 87. Sigma Press, Wilmslow, 1987. 43] J-G. Ganascia. Learning with hilbert cubes. In Progress in Machine LearningProc.of EWSL 87. Sigma Press, Wilmslow, 1987. 44] R. Gemello, F. Mana, and L. Saita. Rigel:an inductive learning system. Machine Learning, 6:7{35, 1991. 45] D. Gordon and D. Perlis. Explicitly biased generalisation. Computational Intelligence, 5(2):67{81, 1989. 191
46] B. Grunbaum. Measures of symmetry for convex sets. In Proc. 7th Symposium in Pure Mathematics of the AMS, pages 233{270, 1961. 47] A. Guttman. R-trees: A dynamic index structure for spatial searching. ACM, pages 47{56, 1984. 48] Y. Hayashi. A neural expert system with automated extraction of fuzzy if-then rules and its application to medical diagnosis. In Advances in Neural Information Processing Systems. Morgan Kaufmann, 1990. 49] D. Heath, S. Kasif, and S. Salzberg. Learning oblique decision trees. In Proc. 13th IJCAI, pages 1002{1007. Morgan Kaufmann, 1993. 50] N. Helft. Inductive generalisation: A logical approach. In Progress in Machine LearningProc.of EWSL 87. Sigma Press, Wilmslow, 1987. 51] J. Herz, A. Krogh, and R. Palmer. Introduction to the Theory of Neural Computation. Addison Wesley, 1991. 52] E.B. Hunt, J. Marin, and P.I. Stone. Experiments in Induction. Academic Press, 1966. 53] D. Hunter, R.R. Bomford, and D.G. Penington. Hutchinson's Clinical Methods. Bailliere, Tindall and Cassell, 15th edition, 1970. 54] G.H. John, R. Kohavi, and K. Peger. Irrelevant features and the subset selection problem. In Proceedings of 11th International Conference on Machine Learning, pages 121{129, 1994. 55] M. Kallay. Convex hulls in higher dimensions. Technical report, Dept. Math., University of Oklahoma, Norman, Oklahoma, 1981. 56] D. Kibler and D.W. Aha. Learning representative examples of concepts: an initial case study. In Proceedings of 4th International Workshop on Machine Learning, pages 24{30. Morgan Kaufmann, 1987. 192
57] K. Kira and L.A. Rendell. The feature selection problem and a new algorithm. In Proceedings of the 10th National Conference on Artical Intelligence, pages 129{134, 1992. 58] V. Klee. Convex polytopes and linear programming. In Proc. IBM Sci. Comput. Symp: Combinatorial Problems, pages 123{158, 1966. 59] U. Knoll, G. Nakhaeizadeh, and B. Tausend. Cost-sensitive pruning of decision trees. In Proc. 8th European Conf. on Machine Learning, pages 383{386, 1994. 60] M. Lebowitz. Experiments with incremental concept formation: Unimem. Machine Learning, 2:103{138, 1987. 61] W. Lu and M. Sakauchi. A new algorithm for handling continuousvalued attributes in decision tree generation and its application to drawing recognition. In Industrial and Engineering Applications of Articial Intelligence and Expert Systems, pages 435{442, 1995. 62] C. Matheus and L.A. Rendell. Constructive induction on decision trees. In Proceedings of IJCAI, pages 645{650, 1989. 63] W. McCulloch, W.S.and Pitts. A logical calculus of the ideas immanent in nervous activity forms. In Bulletin of MAthematical Biophysics, volume 9, pages 127{147, 1943. 64] C. McMillan, M.C. Mozer, and P. Smolensky. Rule induction through integrated symbolic and subsymbolic processing. In Advances in Neural Information Processing Systems, pages 969{976. Morgan Kaufmann, 1992. 65] P. McMullen and G.C. Shephard. Convex Polytopes and the Upper Bound Conjecture. Cambridge University Press, Cambridge, England, 1971. 193
66] Michalski, R.S., Mozetic, I., Hong, and N. J. Lavrac. The multi-purpose incremental learning system aq15 and its testing and application to three medical domains. In Proceedings of the Fifth National Conference on Articial Intelligence, pages 1041{1045. Morgan Kaufman, 1986. 67] R.S. Michalski. Knowledge acquisition through conceptual clustering: a theoretical framework and an algorithm for partitioning data into conjunctive concepts. International Journal of Policy Analysis and Information Systems, pages 63{87, 1980. 68] R.S. Michalski. A theory and methodology of inductive learning. In Michalski, R.S., Carbonell, J.G., Mitchell, and T.M., editors, Machine Learning:An Articial Intelligence Approach, pages 83{134. SpringerVerlag, 1984. 69] D. Michie, D.J. Spiegelhalter, and C.C. Taylor. Machine Learning, Neural and Statistical Classication. Ellis Horwood, 1994. 70] J. Mingers. An empirical comparison of pruning methods for decision tree induction. Machine Learning, 4:227{243, 1989. 71] T.M. Mitchell. Version spaces:a candidate elimination approach to rule learning. Proceedings of the Fifth International Joint Conference on Articial Intelligence, pages 305{310, 1978. 72] Tom M. Mitchell. The need for biases in learning generalizations. Technical Report CBM-TR-117, Rutgers University, Department of Computer Science, New Brunswick, NJ, 1980. 73] T.S. Motzkin, H. Raia, G.L. Thompson, and R.M. Thrall. The double description method. In H.W. Kuhn and A.W. Tucker, editors, Contribution to the Theory of Games, Vol. 2, volume 2, pages 81{103. Princeton University Press, 1953. 194
74] S. Muggleton. Duce, an oracle based approach to constructive induction. Proceedings of Iternational Joint Conference on Articial Intelligence, pages 287{292, 1987. 75] S. Muggleton. Inductive logic programming: derivations, successes and shortcomings. Proceedings of European Conference on Machine Learning, 1993. 76] S. Muggleton and W. Buntine. Machine invention of rst-order predicates by inverting resolution. In Proceedings of the Fifth International Conference on Machine Learning, pages 339{352. Kaufmann, 1988. 77] S. Muggleton and C. Feng. Ecient induction of logic programs. In Proceeding of the First Conference on Algorithmic Learning Theory. OHMSHA, 1990. 78] P.M. Murphy and D.W. Aha. The uci repository of machine learning databases, http://www.ics.uci.edu/ mlearn/mlrepository.html. 79] K.S. Murray. Multiple convergence:an approach to disjunctive concept acquisition. In Proceedings of the Tenth International Joint Conference on Articial Intelligence, pages 297{300. Morgan Kaufman, 1987. 80] S.K. Murthy, S. Kasif, and S. Salzberg. A system for induction of oblique decision trees. Journal of Articial Intelligence Research, 2:1{ 32, 1994. 81] J. Oliver. Decision graphs - an extension of decision trees. International Joint Conference on AI, 1993. 82] J. Oliver, D.L. Dowe, and C.S. Wallace. Inferring decision graphs using the minimum message length principle. In Proceedings of 5th Australian Joint Conference on Articial Intelligence, pages 361{367. World Scienti c, 1992. 195
83] J. Oliver and C.S. Wallace. Inferring decision graphs. International Joint Conference on AI, 1991. 84] J.L. O'Neill and R.A. Pearson. A development environment for inductive learning systems. In Proc. 1987 Australian Joint Articial Intelligence Conference, pages 134{145, 1987. 85] G. Pagallo. Adaptive Decision Tree Algorithms for Learning from Examples. PhD thesis, U. of California at Santa Cruz, 1990. 86] M. Pazzani and C. Brunk. Finding accurate frontiers: A knowledge intensive approach to relational learning. In National Conference on Articial Intelligence, pages 328{334, 1993. 87] M. Pazzani, C. Merz, P. Murphy, K. Ali, T. Hume, and C. Brunk. Reducing misclassi cation costs. In Proc. 11th International Conference on Machine Learning, ML-94, pages 217{225, 1994. 88] M. Pazzani, P. Murphy, K. Ali, and D. Schulenburg. Trading o coverage for accuracy in forecasts: Applications to clinical data analysis. In AAAI Symposium on AI in Medicine, pages 106{110, 1993. 89] M.J. Pazzani. Constructive induction of Cartesian product attributes. In D.L.Dowe, K.B.Korb, and J.J.Oliver, editors, Information, Statistics and Induction in Science, pages 66{77. World Scienti c, 1996. 90] B. Pfahringer. Compression-based discretization of continuous attributes. In The 12th International Conference on Machine learning, pages 456{463, 1995. 91] G.D. Plotkin. A note on inductive generalisation. In B. Melzer and D. Michie, editors, Machine Intelligence 5, pages 153{163. Edinburgh University Press, 1970. 196
92] G.D. Plotkin. A further note on inductive generalisation. In B. Melzer and D. Michie, editors, Machine Intelligence 6, pages 101{124. Edinburgh University Press, 1971. 93] R.J. Popplestone. An experiment in automatic induction. In B. Melzer and D. Michie, editors, Machine Intelligence 5, pages 203{215. Edinburgh University Press, 1970. 94] porta can be obtained from
[email protected]. 95] F.P. Preparata. An optimal real-time algorithm for planar convex hulls. Comms. of ACM, 22(7):402{405, 1979. 96] F.P. Preparata and M.I. Shamos. Computational Geometry. Texts and Monographs in Computer Science. Springer-Verlag, New York, 1985. 97] F.J. Provost. Goal directed inductive learning: Trading o accuracy for reduced error cost. In Proc. AAAI Spring Symposium on Goal Directed Learning, pages 94{101, 1994. 98] F.J. Provost and B.G. Buchanan. Inductive policy. In Proc. 10th National Conf. on AI, pages 255{261, 1992. 99] J.R. Quinlan. Induction of decision trees. Machine Learning, 1:81{106, 1986. 100] J.R. Quinlan. Simplifying decision trees. International Journal of ManMachine Studies, 27:221{234, 1987. 101] J.R. Quinlan. Learning logical de nitions from relations. In Machine Learning, volume 5, pages 239{266. Kluwer Academic Publishers, 1990. 102] J.R. Quinlan. Determinate literals in inductive logic programming. In Proceedings of the Twelfth International Joint Conference on Articial Intelligence, pages 746{750. Morgan Kaufman, 1991. 197
103] J.R. Quinlan. C4.5 programs for Machine Learning. Morgan Kaufmann, San Mateo, California, 1995. 104] J.R. Quinlan and R.M. Cameron-Jones. Avoiding pitfalls when learning recursive theories. In Proceedings of IJCAI 93, pages 1050{1055, 1993. 105] J.R. Quinlan and R.M. Cameron-Jones. First order learning, zeroth order data. Sixth Australian Joint Conference on Articial Intelligence (forthcoming), pages 316{321, 1993. 106] J.R. Quinlan and R.M. Cameron-Jones. Foil: A midterm report. Proceedings of European Conference on Machine Learning, pages 3{20, 1993. 107] J.R. Quinlan and R.M. Cameron-Jones. Oversearching and layered search in empirical learning. IJCAI, pages 1019{1025, 1995. 108] R.B. Rao, D. Gordon, and W. Spears. For every generalisation action, is there really an equal and opposite reaction? analysis of the conservatiob law for generalisation performance. In The 12th International Conference on Machine Learning, pages 471{479, 1995. 109] J. Rissanen. A universal prior for integers and estimation by minimum description length. The Annals of Statistics, 11(2):416{431, 1983. 110] R.L. Rivest. Learning decision lists. Machine Learning, 2:229{246, 1987. 111] M. Sahami. Learning non-linearly separable boolean functions with linear threshhold unit trees and madaline-style networks. In Proc. 11th National Conf. on AI, pages 335{341. AAAI Press, 1993. 112] K. Saito and R. Nakano. Medical diagnostic expert system based on pdp model. In Proc. of ICNN, pages 255{262, 1988. 198
113] S. Salzberg. A nearest hyperrectangle learning method. Machine Learning, 6:251{276, 1991. 114] C. Sammut. The origins of inductive logic programming: A prehistoric tale. Inductive Logic Programming, pages 127{147, 1993. 115] C. Sammut and R.B. Banerji. Learning concepts by asking questions. Machine Learning: An Articial Intelligence Approach, 2:167{ 191, 1986. 116] C. Schaer. A conservation law for generalisation performance. In Machine Learning: Proc. of the 11th International Conference. Morgan Kaufmann, San Francisco, 1993. 117] J. Schell and B. Leelarthaepin. Physical Fitness Assessment. Leelar Biomediscience Services, Box 283, Matraville 2036, NSW, Australia, 1994. 118] R. Schneider and H-P. Kriegel. The tr*-tree: A new representaion of polygonal objects supporting spatial queries and operations. 119] S. Schuierer, G.J.E. Rawlins, and D. Wood. A generalisation of staircase visibility. 120] B. Schulmeister and Wysotzki. The piecewise linear classi er dipol92. In Proc. ECML94, pages 411{414, 1994. 121] R. Seidel. A convex hull algorithm optimal for points in even dimensions. Master's thesis, U. of B.C., Canada, 1981. 122] E.Y. Shapiro. An algorithm that infers theories from facts. Proceedings of the Seventh International Joint Conference on Articial Intelligence, pages 446{451, 1981. 199
123] J.W. Shavlik and T.G. Dietterich. Readings in Machine Learning, page 1. Morgan Kaufmann, 1990. 124] Attar Software. Structured decision tasks methodology for developing and integrating knowledge base systems, 1989. 125] Software Development Group, Geometry Center, 1300 South Second Street, Suite 500, Minneapolis, MN 55454, USA. Geomview Manual. 126] J.A. Swets. Measuring the accuracy of diagnostic systems. Science, 40:1285{1293, 1988. 127] M. Tan and J.C. Schlimmer. Two case studies in cost-sensitive concept acquisition. In Proc. 8th National Conf. on AI, pages 854{860, 1990. 128] C.J. Thornton. Techniques in Computational Learning. Chapman and Hall Computing, 1992. 129] G.G. Towell. Symbolic Knowledge and Neural Networks: Insertion, Renement and Extraction. PhD thesis, U. Wisconsin, Madison, 1991. 130] P.D. Turney. Cost-sensitive classi cation: Empirical evaluation of a hybrid genetic decision tree induction algorithm. Journal of Articial Intelligence Research, 2:369{409, 1995. 131] P.E. Utgo and C.E. Brodley. Linear machine decision trees. Technical report, U. Mass. at Amherst, 1991. 132] K.S. Van Horn and T.R. Martinez. The bbg rule induction algorithm. In Proc. 6th Australian Joint Conf. on AI, pages 348{355, 1993. 133] C.S. Wallace. Classi cation by minimum-message-length inference. In Lecture Notes in Computer Science No. 468. Springer Verlag, 1990. 200
134] C.S. Wallace and D.M. Boulton. An information measure for classi cation. Comp. Journal, 11:185{195, 1968. 135] C.S. Wallace and J.D. Patrick. Coding decision trees. Machine Learning, 11:7{22, 1993. 136] P.D. wasserman. Neural Computing Theory and Practice. Van Norstrand Reinhold, 1989. 137] L. Watanabe and R. Eloi. Guiding constructive induction for incremental learning from examples. Knowledge Acquisition, pages 293{296, 1987. 138] D.A. Waterman. A Guide to Expert Systems. Addison Wesley, 1986. 139] C.J.C.H. Watkins. Combining cross-validation and search. In Progress in Machine LearningProc.of EWSL 87. Sigma Press, Wilmslow, 1987. 140] G.I. Webb. Einstein: an interactive inductive knowledge-acquisition tool. In Proceedings of the 6th Ban Knowledge-Acquisition for Knowledge-based Systems Workshop, pages 22{1{22{16, 1991. 141] G.I. Webb. Accommodating noise during induction by generalisation. Technical Report C92/13, Deakin University, 1992. 142] G.I. Webb. Data-driven inductive knowledge-base re nement. Technical Report C92/10, Deakin University, 1992. 143] G.I. Webb. Learning disjunctive class descriptions by least generalisation. Technical Report C92/9, Deakin University, 1992. 144] G.I. Webb. Control, capabilities and communication: Three key issues for machine-expert collaborative knowledge acquisition. Technical Report C93/04, Deakin University, 1993. 201
145] G.I. Webb. Systematic search for categorical attribute-value datadriven machine learning. In Proc. 6th Australian Joint Conf. on AI, pages 342{247, 1993. 146] G.I. Webb. Recent progress in learning decision lists by prepending inferred rules. In Second Singapore International Conference on Intelligent Systems, pages B280{B285, 1994. 147] G.I. Webb. Cost-sensitive specialization. In Pacic Rim International Conference on AI, pages X{X, 1996. 148] G.I. Webb and P.A. Smith. The least generalisation algorithm. Technical report, Deakin University, 1993. 149] S.M. Weiss, R.S. Galen, and P.V. Tadepalli. Maximizing the predictive value of production rules. Articial Intelligence, 45:47{71, 1990. 150] D. Wettschereck and T.G. Dietterich. An experimental comparison of nearest-neighbour and nearest hyperrectangle algorithms. Machine Learning, 19:5{27, 1995. 151] D.H. Wolpert. O-training set error and a priori distinctions between learning algorithms. Technical Report SFI TR 95-01-00, Santa Fe Institute, 1995. 152] S.P. Yip. Empirical Attribute Space Renement in Classication Learning. PhD thesis, Deakin University, Geelong, Victoria 3217, Australia, 1995. 153] P. Young. Recursive Estimation and Time Series Analysis. SpringerVerlag, 1984.
154] Z. Zheng. Constructing nominal x-of-n attributes. In Proceedings of the 14th International Joint Conference on Articial Intelligence, pages 1064{1070. Morgan Kaufmann, 1995.