Local Classification and Global Estimation
Explorations of the k-nearest neighbor algorithm
Iris Hendrickx
Dissertation

submitted to obtain the degree of doctor at Tilburg University, on the authority of the rector magnificus, Prof. dr. F.A. van der Duyn Schouten, to be defended in public before a committee appointed by the college for doctoral degrees, in the auditorium of the University, on Monday 21 November 2005 at 14:15

by

Iris Helena Emilie Hendrickx, born on 4 May 1979 in Breda
Promotores: Prof. dr. W.P.M. Daelemans, Prof. dr. H.C. Bunt
Copromotor: Dr. A.P.J. van den Bosch
© 2005 Iris Hendrickx
ISBN-10: 90-9019978-0
ISBN-13: 978-90-9019978-8
NUR 980
Printed by Koninklijke drukkerij Broese & Peereboom, Breda
Typeset in LaTeX
Cover: photo by Roser Morante, design by Bas de Jong
Acknowledgments

I would like to thank everyone who contributed to the last four years in which I worked on this dissertation. This research was done as part of the NWO Vernieuwingsimpuls project ‘Memory Models of Language’, carried out at Tilburg University in the ILK research group. I would like to thank Prof. Walter Daelemans and Prof. Harry Bunt for being willing to be my promotores, and for their useful comments and advice. I am also indebted to the members of the reading committee: Prof. Hendrik Blockeel, Prof. William Cohen, Dr. Emiel Krahmer and Dr. Maarten van Someren. My special thanks go to my daily supervisor Antal van den Bosch for always being there for support, encouragement, inspiration and guidance. I have learned a lot from his advice, the pleasant co-operation and the many fruitful discussions, which in the end have led to finishing this thesis.

A large part of the research described here was conducted on the basis of software provided by others. I would like to thank the programmers and authors of these software programs: William Cohen for his program Ripper and Dan Roth for the SNoW package. I am grateful to Zhang Le for his Maxent toolkit and his willingness to share information about the internals of his implementation. I want to thank the people of the ILK research group, in particular Ko van der Sloot, for the Timbl software. I thank Bertjan Busser for always keeping all computers and software up and running, Antal van den Bosch for his WPS program and Erik Tjong Kim Sang for his random sampling program. I thank Martin Reynaert for doing an excellent job of spell-checking the text of this thesis (as no current automatic program can beat his performance level).

I would like to acknowledge the support of Fonz and Tippie, who were always there during the long hours at home writing this dissertation. They helped me to get the necessary distraction and occasionally attempted to help me type. I thank my family and friends for their interest and support, and most of all my parents, who stimulated me to choose my own path. From a practical point of view, I would like to thank Lambert Dekkers for arranging the process of printing this thesis. I am also very happy that my brother Marco Hendrickx and my roommate Roser Morante have agreed to stand
by me as paranymphs. My deepest gratitude goes to my best friend and dearest partner Léon Dekkers for his care and patience, especially during this last year when I had to spend much time typing and writing.

I would like to thank my colleagues of the Computational Linguistics and Artificial Intelligence department at Tilburg University for their enthusiasm and help, and for creating a nice and stimulating working environment during the past four years: Anne Adriaensen, Toine Bogers, Antal van den Bosch, Sabine Buchholz, Harry Bunt, Bertjan Busser, Sander Canisius, Walter Daelemans, Hans van Dam, Federico Divina, Jeroen Geertzen, Yann Girard, Simon Keizer, Piroska Lendvai, Anders Nøklestad, Erwin Marsi, Roser Morante, Reinhard Muskens, Hans Paijmans, Martin Reynaert, Mandy Schiffrin, Ko van der Sloot, Ielka van der Sluis, Elias Thijsse, Erik Tjong Kim Sang, Menno van Zaanen, and Paul Vogt.
Contents

1 Introduction
  1.1 Machine Learning
  1.2 Hybrid Machine Learning algorithms
  1.3 Research Questions
  1.4 Thesis Outline

2 Algorithms, data and methods
  2.1 Learning Classifiers
    2.1.1 Memory-based learning
    2.1.2 Hyperplane discrimination
    2.1.3 Rule induction
    2.1.4 Maximum entropy modeling
  2.2 Data
    2.2.1 UCI benchmark tasks
      2.2.1.1 Discretization of continuous features
    2.2.2 NLP tasks
      2.2.2.1 Part-of-speech tagging
      2.2.2.2 Phrase chunking
      2.2.2.3 Named entity recognition
      2.2.2.4 Prepositional phrase attachment
  2.3 Experimental methods
    2.3.1 Optimization of algorithmic parameters
    2.3.2 Evaluation methods

3 Hybrids
  3.1 k-nn and maximum entropy modeling
    3.1.1 Comparison between maxent-h, maxent and k-nn
  3.2 k-nn and rule induction
    3.2.1 Comparison between the rule hybrids, rules and k-nn
    3.2.2 Comparing the two rule learners
  3.3 k-nn and winnow
    3.3.1 Error correcting output codes
    3.3.2 Construction of the hybrid with ECOC
    3.3.3 Comparison between winnow-h, winnow and k-nn
  3.4 Additional Comparisons
    3.4.1 Comparison between the basic algorithms
    3.4.2 AUC compared to Accuracy
    3.4.3 Results on NLP data sets
  3.5 Related work
  3.6 Summary

4 Comparison with stacking
  4.1 Classifier ensembles
    4.1.1 Stacking
  4.2 Experimental setup
  4.3 Results
    4.3.1 Comparison with stack-maxent
    4.3.2 Comparison with stack-rules
    4.3.3 Comparisons on NLP data sets
  4.4 Related work
  4.5 Summary

5 Analysis
  5.1 Bias and Variance
    5.1.1 Experimental setup
    5.1.2 Results
  5.2 Error analysis
    5.2.1 Complementary rates
      5.2.1.1 Complementary rates of the hybrid algorithms
      5.2.1.2 Complementary rates of the stacked classifiers
    5.2.2 Error overlap with parent algorithms
      5.2.2.1 Results of the hybrid algorithms
      5.2.2.2 Results of the stacked algorithms
  5.3 Illustrative analysis
  5.4 Related work
  5.5 Summary

6 Conclusions
  6.1 Research questions
    6.1.1 Construction and performance of hybrid algorithms
    6.1.2 Differences and commonalities between hybrids and their parents
    6.1.3 Comparison between hybrids and classifier ensembles
    6.1.4 Performance of hybrids on NLP tasks
  6.2 k-nn classification

Bibliography

Appendices
A UCI data sets
B Summary
C Samenvatting
Chapter 1

Introduction

This chapter introduces the research questions investigated in this thesis and outlines the structure of the book.
1.1 Machine Learning
In this thesis we describe several hybrid machine learning algorithms. We detail the construction and motivation of the hybrid algorithms, we measure their classification performance, and we perform error analyses to investigate their functional classification behavior. Before we outline the research questions investigated and the structure of the thesis, we start with a short introduction to machine learning.

When a computer needs to perform a certain task, a programmer's solution is to write a computer program that performs the task. A computer program is a piece of code that instructs the computer which actions to take in order to perform the task. A problem arises with tasks for which the sequence of actions is not well defined or unknown. An example of such a task is face recognition. Humans have no problem recognizing the faces of people; however, writing down a unique description of a face is difficult. A description of a particular face will often also apply to hundreds of other faces. This implies that we cannot simply write a sequence of instructions for the computer to perform this task. One solution to this problem is to make a special computer program that has the capacity to learn a task from examples or past experience. Such a program is given a large set of pictures of faces labeled with names, and its task is to learn the mapping between the pictures and the names. We call such a program, which learns some function from an input to an output, a machine learning algorithm.
Figure 1.1: A machine learning algorithm consists of three parts: a learning module, a model and a classification module.
Machine learning is part of the research field of Artificial Intelligence and has strong relations with the field of statistics. (For more information about machine learning, we refer to Alpaydin (2004), Langley (1996), Mitchell (1997), and Weiss and Kulikowski (1991).) Machine learning algorithms can be divided into two main categories: supervised and unsupervised machine learning algorithms. In supervised learning, the input of the learning algorithm consists of examples (in the form of feature vectors) with a label assigned to them. The objective of supervised learning is to learn to assign correct labels to new unseen examples of the same task. Unsupervised algorithms learn from unlabeled examples. The objective of unsupervised learning may be to cluster examples together on the basis of their similarity. In this thesis we concentrate on supervised learning algorithms.

In Figure 1.1 a schematic visualization of a supervised machine learning algorithm is shown. The algorithm consists of three parts: a learning module, a model and a classification module. The learning module constructs a function on the basis of labeled examples. We refer to labeled examples as instances. The model is the stored estimation of the function induced by the learning module. The classification module takes as input unseen instances and applies the model to predict class labels for the new instances.

More formally, supervised machine learning methods learn classification tasks on the basis of instances. The set of labeled training instances represents some function f : X → Y that maps an instance x ∈ X to a class y ∈ Y. The true target function is not directly available to the learner, only implicitly through the class labels assigned to the set of instances. A machine learning experiment consists of two phases: a learning phase and a classification phase. First the machine learning algorithm induces a hypothesis f′ of
the target function f on the basis of a set d of labeled training instances. This induced hypothesis is stored in the form of some model of the target function. In the classification phase, new examples are classified by applying the model (Mitchell, 1997).

Inducing a hypothesis f′ of the true target function f can be viewed as a search in a hypothesis space that represents all possible hypotheses. This hypothesis space is defined by the hypothesis representation chosen by the designer of the learning algorithm. There are two criteria that guide the search for the hypothesis with the closest resemblance to the target function. In the first place, the hypothesis should fit the set of training instances. The second criterion concerns the machine learning bias of the algorithm. The term machine learning bias is defined as the “set of assertions that form the basis for a machine learning algorithm to choose a certain hypothesis, besides being consistent with the training data” (Mitchell, 1997). Each algorithm uses its own set of heuristics, constraints or rules to choose a certain hypothesis. For example, a well-known guideline that is used in many algorithms is the Minimum Description Length (MDL) principle (Rissanen, 1983). This principle states that, given a hypothesis space and a set of labeled training instances d, one should choose the hypothesis that provides the smallest description of d. (Detailed information about the MDL principle can be found in Grünwald et al. (2005).)

It is desirable to know beforehand which machine learning algorithm has the right bias for a particular task. Even better would be to have a machine learning algorithm that performs well on any task. Common sense dictates, and studies have confirmed, that there does not exist one universal machine learning algorithm that performs best on all types of classification tasks (Culberson, 1998; Wolpert and Macready, 1995). Besides the fact that different types of tasks ask for different types of solutions, it is also difficult to determine beforehand which type of solution best suits a particular task. These observations point to the need for an empirical approach. For each task, one needs to find out experimentally which types of solution work well; in other words, which machine learning bias is suited to the task. In this thesis we take an empirical approach and perform a large range of experiments to investigate this question.
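To make the two-phase setup of Figure 1.1 concrete, the following minimal Python sketch separates a learning module from a classification module. The class and the trivial majority-class hypothesis are illustrative placeholders only, not software used in this thesis; any of the algorithms discussed in Chapter 2 fits the same mould.

```python
from collections import Counter

# A minimal sketch of the two-phase setup in Figure 1.1: a learning module
# induces a model from labeled training instances, and a classification
# module applies that stored model to new, unseen instances.
class MajorityClassLearner:
    def learn(self, instances, labels):
        # Learning phase: induce a (here trivial) hypothesis from the set d.
        self.model = Counter(labels).most_common(1)[0][0]
        return self

    def classify(self, instance):
        # Classification phase: apply the stored model to a test instance x.
        return self.model

learner = MajorityClassLearner().learn([("a", 1), ("b", 2)], ["pos", "pos"])
print(learner.classify(("c", 3)))  # prints "pos"
```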
1.2 Hybrid Machine Learning algorithms
Most machine learning algorithms construct a model by abstracting from the set of labeled instances. An exception is formed by memory-based learning algorithms, as they simply store all instances in memory without generalization. Memory-based learners are also called lazy learners, as they do not put any effort into the learning phase, as opposed to eager learners (Aha, 1997).

Eager learning algorithms invest most of their effort in the learning phase. They construct a compact representation of the target function by generalizing from the training instances. Classification of new instances is usually a straightforward application of simple classification rules that employ the eager learner's model. In contrast, lazy learning algorithms put no effort or computation into the learning phase. The learning phase consists of storing all training instances in memory. The search for the optimal hypothesis does not take place in the learning phase, but in the classification phase. Memory-based learners make local approximations of the target function depending on the test instance. To classify a new instance, the memory-based learner searches in memory for the most similar instances. The result of this local search, usually a set of nearest neighbor instances, is used to assign a class label to the new instance.

Thus, eager and lazy learning algorithms differ on three points. First, eager learners put effort into the learning phase, while lazy learners divert their effort to classification. Second, eager learners form a global hypothesis of the target function that describes the complete instance base. In contrast, lazy learners produce for each new instance a local hypothesis based on a part of the instance base. Third, the hypothesis constructed by an eager learner is independent of the new instance to be classified, while the hypothesis of the lazy learner depends on the new instance.

The contrast between memory-based learning and eager learning forms the motivation for constructing hybrids. We describe hybrid algorithms in which we combine eager learning with memory-based classification. We put effort into the learning phase as well as into the classification phase, as we expect this double effort to be repaid with improved performance. We take the system as constructed by the eager learner and replace its standard classification component with memory-based classification. The hybrid uses both the global hypothesis as induced by the eager learner and the local hypothesis of the memory-based learner based on the test instance to be classified. (We call the eager learning algorithm and the memory-based learner the parent algorithms of the hybrid that combines them.)

A visualization of the hybrid algorithm is shown in Figure 1.2. The training instances form the input for the learning module of the eager parent and are also stored by the memory-based algorithm. In the classification phase, both the model produced by the eager learner and the stored instances are used by the memory-based classification method to classify new instances.

We perform our study using instantiations of three quite different types of eager machine learning algorithms: a rule induction algorithm, which produces a symbolic model in the form of lists of classification rules; maximum entropy modeling, a probabilistic classifier which produces a matrix of weights associating feature values to classes; and multiplicative update learning, a hyperplane discriminant algorithm which produces a (sparse) associative network with weighted connections between feature value and class units as a model.
Figure 1.2: Representation of a hybrid learning algorithm that combines eager learning with memory-based classification.
The research presented in this thesis is motivated by research in machine learning for Natural Language Processing (NLP). (This research is conducted within the Induction of Linguistic Knowledge research group in the department of Language and Information Science at Tilburg University, which applies inductive learning algorithms to natural language problems.) NLP is the cross-disciplinary field between Artificial Intelligence and Linguistics, and it studies problems in the processing, analysis and understanding of natural language by computers. This background of NLP research influences the choice of machine learning algorithms. Memory-based learning has been shown to be well suited for NLP tasks (Daelemans and van den Bosch, 2005; Daelemans et al., 1999). Many language processing tasks are complex tasks with many subregularities and exceptions. Because a memory-based learning algorithm stores all instances in memory, all exceptions and infrequent regularities are kept. As every algorithm has its strengths and weaknesses, this strong feature of memory-based learning is at the same time its weak point, as it makes the algorithm sensitive to noise and irrelevant information.

The embedding in NLP research also influences the type of classification tasks we use to study the performance of machine learning algorithms. We choose to perform our experiments on two types of data. As the main goal in this research is the study of machine learning algorithms, we use a set of benchmark data sets that are used frequently in the field of machine learning research. The other type of data concerns four NLP data sets which are publicly available and have been used in previous research.
1.3 Research Questions
From the perspective of the memory-based learner, we hypothesize that combining the eager learner's representation with memory-based classification can repair a known weakness of memory-based learning, namely its sensitivity to noise (Aha et al., 1991) and irrelevant features (Tjong Kim Sang, 2002; Wettschereck et al., 1997). From the perspective of the eager learners, we hypothesize that replacing their simple classification method with the more sensitive local k-nn classification method could improve generalization performance. By combining the learning module of an eager learner with the classification module of a lazy learner, we hope to combine the best of both worlds. The main research question investigated in this thesis is:

Can we, by replacing the simple classification method of an eager learner with the memory-based classification method, improve generalization performance upon one or both of the parent algorithms?

Answering this research question will also tell us whether the model of an algorithm only has value within its intended context, i.e. in combination with the usual classification component of the learner, or whether it is still valuable when pulled out of that context.

Besides generalization performance, we are also interested in the differences and commonalities between the hybrid algorithms and their parents. We formulate the following research question:

To what extent does the functional classification behavior of the hybrid algorithms differ from the functional behavior of their parent algorithms?

We try to answer this question by analyzing the outputs of the algorithms. We investigate the types of errors the algorithms make and the overlap in error between the hybrid algorithms and their parent algorithms.

The hybrid algorithms combine the models of two algorithms to improve performance. In this respect, classifier ensembles are similar, as they also put several classifiers together to reach a better performance. This correspondence between hybrid algorithms and classifier ensembles is investigated by asking the following question:

How do hybrid algorithms compare to classifier ensembles?

We design classifier ensembles that have a close analogy to the hybrid algorithms. We compare the algorithms in terms of generalization performance, and we analyze the differences and commonalities in their functional classification behavior.
We perform our experiments to estimate the generalization performance of the algorithms on two types of tasks, machine learning benchmark tasks and natural language processing tasks. The latter are characterized by the fact that they are quite complex and contain many exceptions. We investigate the following question:

Is there a difference in performance of the hybrid algorithms on natural language processing tasks?
1.4 Thesis Outline
This thesis is structured as follows. Chapter 2 provides an overview of the experimental methods. First we describe the four machine learning algorithms that we use: memory-based learning, rule induction, maximum entropy modeling and multiplicative update learning. We give a description of the data sets, the experimental setup and the evaluation methods.

In Chapter 3 we discuss and motivate the construction of four hybrid algorithms that combine lazy and eager learning. We compare the generalization performance of the hybrid algorithms to the performance of their parent algorithms. Comparing the four hybrid algorithms to their parents, two of the hybrids will be seen to outperform their eager learning parent algorithm and to perform equal to or slightly better than memory-based learning. The other two hybrids are less successful, as both have a lower performance than memory-based learning and equal or lower performance than the eager parent.

Chapter 4 offers a comparison between the hybrid algorithms and classifier ensembles. We discuss the construction and performance of two classifier ensembles that have a close resemblance to the two successful hybrids discussed in Chapter 3. Experimental results show that the classifier ensembles perform rather similarly to the hybrid algorithms.

In Chapter 5 we investigate the differences and commonalities in the functional classification behavior of hybrid algorithms and their parent algorithms. We conduct a bias–variance decomposition of error rates and we perform error analyses of the results presented in Chapter 3. As a comparison, we perform the same analyses for the classifier ensembles and their parent algorithms.

Chapter 6 summarizes the answers to the research questions central to this thesis.
Chapter 2

Algorithms, data and methods

This chapter describes the experimental setup of the experiments presented in this thesis. We provide a description of the machine learning algorithms, data sets and experimental methods used in this research.

In this chapter we detail the setup of our experiments and briefly describe the machine learning algorithms used in this research. First we discuss four machine learning classifiers: memory-based learning, maximum entropy modeling, rule induction, and multiplicative update learning, in terms of their learning and classification methods, their models, and issues related to the particular software implementations we use. Next we describe the classification tasks and data sets. The last sections of the chapter detail the experimental setup and the methods to evaluate the generalization performance of the algorithms.
2.1 Learning Classifiers
The key distinction between lazy machine learning algorithms and eager algorithms is the difference in effort in the learning phase of the algorithm. The eager learning algorithms put significant effort into abstracting a model from the training instances. Classification, on the other hand, is reduced to a relatively effortless application of a classification rule that applies the abstracted model to new instances. In contrast, lazy learning algorithms simply store all training instances in memory as their model. All effort is diverted to the classification phase. The lazy algorithm creates for each new test instance a local estimate of the target function by searching for the most similar stored instances. As explained in Chapter 1, we use the contrast between lazy learning algorithms and eager learners to construct hybrid algorithms that combine the learning phase of an eager learning algorithm with the classification phase of a lazy learning algorithm.
Figure 2.1: The learning module of the k-nn algorithm consists of a simple storage step of the instance base as the model of the target function. The most important module of the k-nn algorithm is the classification module.
We choose three eager machine learning algorithms: rule induction, maximum entropy modeling and multiplicative update learning. We choose these three eager learners because we can design a conceptual bridge between their model and k-nn classification to form the hybrid algorithms. Detailed motivations for the construction of the hybrids can be found in Chapter 3. In the next sections we discuss the four parent algorithms in light of the chosen search strategy to find a hypothesis, the model and classification method.
2.1.1 Memory-based learning
Memory-based learning is a similarity-based method (Aha et al., 1991; Cover and Hart, 1967; Fix and Hodges, 1951) and goes by many names such as instance-based learning, case-based learning, non-parametric learning, local learning and nearest neighbor learning. Memory-based learning relies on the hypothesis that every instance has predictive power to classify other similar instances. A prominent attribute of memory-based learning algorithms is that they delay generalization from the set of labeled instances to the classification phase. Memory-based learning includes methods such as k-nearest neighbor classification, locally-weighted regression and case-based reasoning. Locally-weighted regression (Bottou and Vapnik, 1992) is a memory-based method that performs regression around the point to be predicted using only training data in the local context. Case-based reasoning (Kolodner, 1993) is a technique that uses complex instance representations such as relational descriptions (Mitchell, 1997).
In this research we focus on the k-nearest neighbor algorithm (k-nn) (Aha et al., 1991; Cover and Hart, 1967; Fix and Hodges, 1951). Figure 2.1 shows a schematic visualization of the k-nn algorithm. The learning module is presented as a simple storage step; the most important module in the algorithm is the classification module. The k-nn algorithm uses all labeled training instances as a model of the target function. In the classification phase, k-nn uses a similarity-based search strategy to determine a locally optimal hypothesis function. Test instances are compared to the stored instances and are assigned the same class label as the k most similar stored instances. We use the k-nn algorithm as implemented in the Timbl software package (Daelemans et al., 2004). In this variant of k-nn, the k does not point to the k nearest cases but to the k nearest distances, as proposed by Aha et al. (1991). The similarity or distance between two instances A and B is measured according to the distance function in Equation 2.1:
\Delta(A, B) = \sum_{i=1}^{n} w_i \, \delta(a_i, b_i) \qquad (2.1)
where n is the number of features, w_i is a global weight for feature i, and the distance metric δ estimates the difference between the feature values of the two instances. The distance metric δ and the feature weighting w are both algorithmic parameters besides k. The class of feature weighting methods and distance metrics is open; many methods have been proposed in the literature. We discuss three different distance metrics and four feature weighting methods which we use in our k-nn implementation.

A simple distance metric is the overlap metric. For symbolic features this metric estimates the distance between two mismatching feature values as one and the distance between two matching values as zero. For numeric values the difference is computed as illustrated in Equation 2.2.
\delta(a_i, b_i) =
\begin{cases}
\frac{a_i - b_i}{\max_i - \min_i} & \text{if numeric, else} \\
0 & \text{if } a_i = b_i \\
1 & \text{if } a_i \neq b_i
\end{cases} \qquad (2.2)
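As an illustration, a small Python sketch of the weighted overlap distance of Equations 2.1 and 2.2 follows. This is only a sketch, not the actual Timbl implementation; the feature weights and numeric ranges are supplied by the caller, and the absolute value is used to keep numeric differences positive.

```python
# Weighted overlap distance (Equations 2.1 and 2.2). Numeric features are
# scaled by their observed range; symbolic features count 0 for a match and
# 1 for a mismatch.
def overlap_delta(a, b, lo=None, hi=None):
    if isinstance(a, (int, float)) and isinstance(b, (int, float)):
        return abs(a - b) / (hi - lo) if hi is not None and hi != lo else abs(a - b)
    return 0.0 if a == b else 1.0

def distance(A, B, weights=None):
    w = weights or [1.0] * len(A)
    return sum(w[i] * overlap_delta(A[i], B[i]) for i in range(len(A)))

print(distance(("the", "DT", 3), ("the", "NN", 5), weights=[1.0, 2.0, 0.5]))
```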
The overlap metric is limited to exact matches between feature values. To allow relative differences between symbolic values to be acknowledged by the classifier, the (Modified) Value Difference Metric (mvdm) was introduced by Stanfill and Waltz (1986) and further refined by Cost and Salzberg (1993). mvdm estimates the distance between two values v_1 and v_2 as the difference of their mutual conditional distributions of the classes, calculated from the set of labeled instances. Equation 2.3 shows the calculation of mvdm, where j represents the total number of classes and P(C_i|v_1) is the probability of class C_i given the presence of feature value v_1.
\delta(v_1, v_2) = \sum_{i=1}^{j} |P(C_i|v_1) - P(C_i|v_2)| \qquad (2.3)
When two feature values occur with the same distribution over the classes, mvdm regards them as identical and their estimated distance is zero. In contrast, when two feature values never occur with the same classes, their distance will be maximal (2.0 according to Equation 2.3).

The Jeffrey divergence distance metric is a symmetric variant of the Kullback-Leibler distance metric. Similar to mvdm, Jeffrey divergence uses class distributions to estimate the distance between two feature values, but instead of simply summing over the subtractions of two feature values, it uses a logarithmic term. Equation 2.4 shows the distance calculation between two feature values v_1 and v_2. The denominator 0.5 * z is the normalization factor.
\delta(v_1, v_2) = \sum_{i=1}^{n} \left( P(C_i|v_1) \log \frac{P(C_i|v_1)}{0.5 z} + P(C_i|v_2) \log \frac{P(C_i|v_2)}{0.5 z} \right) \qquad (2.4)

z = P(C_i|v_1) + P(C_i|v_2) \qquad (2.5)
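The following Python sketch illustrates how mvdm (Equation 2.3) and Jeffrey divergence (Equations 2.4–2.5) can be computed from class-conditional distributions estimated by simple counting; the toy class labels and counts are invented for illustration.

```python
import math

# Estimate P(C_i|v) for a feature value from class co-occurrence counts.
def conditional(counts):
    total = sum(counts.values())
    return {c: n / total for c, n in counts.items()}

# mvdm (Equation 2.3): sum of absolute differences of the class distributions.
def mvdm(p1, p2, classes):
    return sum(abs(p1.get(c, 0.0) - p2.get(c, 0.0)) for c in classes)

# Jeffrey divergence (Equations 2.4-2.5); eps avoids log(0) for unseen classes.
def jeffrey(p1, p2, classes, eps=1e-12):
    d = 0.0
    for c in classes:
        a, b = p1.get(c, 0.0), p2.get(c, 0.0)
        m = 0.5 * (a + b) + eps          # 0.5 * z in Equation 2.4
        d += a * math.log((a + eps) / m) + b * math.log((b + eps) / m)
    return d

classes = ["NP", "VP"]
p_the = conditional({"NP": 90, "VP": 10})
p_a = conditional({"NP": 85, "VP": 15})
print(mvdm(p_the, p_a, classes), jeffrey(p_the, p_a, classes))
```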
Feature weighting is an important algorithmic parameter of the k-nn algorithm. Not all features have the same global informativeness, and it is a good heuristic to give more important features a higher weight. A mismatch on a feature with a high weight creates a larger distance between two instances than a mismatch on a feature with a low weight. Information Gain weighting (Quinlan, 1993) looks at each feature in isolation and measures how important it is for the prediction of the correct class. The Information Gain of a feature is the difference in entropy in the situation with and without the knowledge of that feature. Equation 2.6 shows the calculation of Information Gain:

w_i = H(C) - \sum_{v \in V_i} P(v) \cdot H(C|v) \qquad (2.6)
where C is the set of class labels, V_i is the set of values for feature i, and H(C) is the entropy of the class labels. One weak point of Information Gain is that it tends to overestimate the relevance of features that have many different values. To remedy this, Quinlan (1993) proposed Gain Ratio, which is a normalized version of Information Gain in which the Information Gain weight is divided by the entropy of the feature values (si(i)), as shown in Equations 2.7 and 2.8.
w_i = \frac{H(C) - \sum_{v \in V_i} P(v) \cdot H(C|v)}{si(i)} \qquad (2.7)

si(i) = - \sum_{v \in V_i} P(v) \log_2 P(v) \qquad (2.8)
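A short Python sketch of Information Gain (Equation 2.6) and Gain Ratio (Equations 2.7–2.8) for a single feature, computed by plain counting over (feature value, class label) pairs, follows; the example values and labels are invented.

```python
import math
from collections import Counter, defaultdict

def entropy(labels):
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())

def ig_and_gain_ratio(values, labels):
    h_c = entropy(labels)                       # H(C)
    by_value = defaultdict(list)
    for v, c in zip(values, labels):
        by_value[v].append(c)
    total = len(labels)
    # Information Gain (Equation 2.6): H(C) minus the value-weighted entropy.
    ig = h_c - sum(len(ls) / total * entropy(ls) for ls in by_value.values())
    # Split info si(i) (Equation 2.8) normalizes IG into Gain Ratio (Equation 2.7).
    si = -sum(len(ls) / total * math.log2(len(ls) / total) for ls in by_value.values())
    return ig, (ig / si if si > 0 else 0.0)

values = ["det", "det", "noun", "noun", "verb", "verb"]
labels = ["B-NP", "B-NP", "I-NP", "I-NP", "O", "B-VP"]
print(ig_and_gain_ratio(values, labels))
```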
White and Liu (1994) showed that Gain Ratio weighting still has the property of assigning higher weights to features with many values. They propose to use a statistical weighting method without this property: Chi-Square. This weighting method estimates the significance, or degree of surprise, of the number of feature value occurrences with respect to the expected number of occurrences (the a priori probability). Equation 2.9 shows the calculation of Chi-Square:
\chi^2 = \sum_{n=1}^{|C|} \sum_{m=1}^{|V_i|} \frac{(E_{nm} - O_{nm})^2}{E_{nm}} \qquad (2.9)
where O_{nm} and E_{nm} are the observed and expected frequencies in a contingency table t which records how often each feature value co-occurs with each class. O_{nm} is simply the number of co-occurrences of value m with class n. E_{nm} is the number of cases expected in cell t_{nm} if the null hypothesis (of no predictive association between feature value and class) is true. In Equation 2.10 it is defined that t_{.m} is the sum of column m and t_{.n} is the sum of row n in the contingency table; t_a is the total number of instances and equals the sum of all cells in the table.

E_{nm} = \frac{t_{.m} \, t_{.n}}{t_a} \qquad (2.10)
Shared variance (Equation 2.11) is a normalization of the Chi-Square weighting in which the dimensionality of the contingency table is taken into account.
SV_i = \frac{\chi^2}{N \cdot (\min(|C|, |V|) - 1)} \qquad (2.11)

where N is the number of instances, |C| is the number of classes, |V| is the number of values, and min(|C|, |V|) expresses that only the lowest value (|V| or |C|) is used.

Another possibility to further refine the behavior of the k-nn algorithm is to use distance weighting of the nearest neighbors. When k is larger than one and multiple nearest neighbor instances are involved in classification, the simplest method to classify a new instance is to give this instance the same classification as the majority class of the k neighbors. This method is called majority voting. When larger k values are used, the nearest neighbor set can include, besides the similar and nearby instances, also a large group of less similar instances. In such a case majority voting may be less appropriate, because the group of less similar instances can override the classification of the nearby instances.
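Returning to the weighting metrics, the following sketch computes the Chi-Square (Equations 2.9–2.10) and shared variance (Equation 2.11) weights from a class-by-value contingency table; the table layout (a dict of dicts of counts) and the example counts are illustrative, not taken from the thesis data.

```python
# Chi-square and shared variance weights from a contingency table whose rows
# are classes and whose columns are feature values.
def chi_square_and_sv(table):
    classes = list(table)
    values = sorted({v for row in table.values() for v in row})
    row_tot = {c: sum(table[c].values()) for c in classes}
    col_tot = {v: sum(table[c].get(v, 0) for c in classes) for v in values}
    n = sum(row_tot.values())
    x2 = 0.0
    for c in classes:
        for v in values:
            expected = row_tot[c] * col_tot[v] / n          # Equation 2.10
            observed = table[c].get(v, 0)
            if expected > 0:
                x2 += (expected - observed) ** 2 / expected  # Equation 2.9
    sv = x2 / (n * (min(len(classes), len(values)) - 1))     # Equation 2.11
    return x2, sv

table = {"pos": {"red": 20, "blue": 5}, "neg": {"red": 5, "blue": 20}}
print(chi_square_and_sv(table))   # (18.0, 0.36)
```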
Figure 2.2: Visualization of the exponential decay distance weighting function with varied parameters α and β.
Dudani (1976) proposes Inverse Linear Distance, a method in which votes of instances are weighted in relation to their distances to the new instance. Instances that are close to the new instance are given a higher weighted vote than instances that are less similar to the new instance. The nearest neighbor gets a weight of 1, the most dissimilar neighbor a weight of 0 and the others are given a weight in between. Inverse Linear Distance is expressed in Equation 2.12.
w_j = \begin{cases} \frac{d_k - d_j}{d_k - d_1} & \text{if } d_k \neq d_1 \\ 1 & \text{if } d_k = d_1 \end{cases} \qquad (2.12)
where d_j is the distance of the new instance to nearest neighbor j, d_1 is the distance of the nearest neighbor, and d_k the distance of the furthest neighbor (k).

Inverse Distance Weighting (Dudani, 1976) is another distance weighting method, shown in Equation 2.13; w_j is the inverse distance of the new instance to the nearest neighbor j. Usually a small constant is added to d_j to prevent division by zero.

w_j = \frac{1}{d_j} \qquad (2.13)
Exponential decay distance weighting (Shepard, 1987) scales the distances of the nearest and furthest neighbors according to an exponential decay function, shown in Equation 2.14.

w_j = e^{-\alpha d_j^{\beta}} \qquad (2.14)
This function has two parameters α and β that determine the slope and power of the function. Some examples of the function can be found in Figure 2.2. We see that α influences the steepness of the curve. A steeper curve implies that relatively less weight is given to neighbors at a further distance. When β has a value higher than one, the curve becomes bell shaped and assigns a high vote to all nearest neighbors up to a certain distance.
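The three neighbor-vote weighting schemes can be sketched in a few lines of Python; `neighbors` is assumed to be a list of (distance, class label) pairs sorted by distance, the parameter values are illustrative, and the exponential decay variant follows the reconstruction of Equation 2.14 above.

```python
import math
from collections import defaultdict

# Vote weight of a neighbor at distance d, given the nearest (d1) and the
# furthest (dk) distance in the neighbor set.
def vote_weight(d, d1, dk, scheme="inverse_linear", alpha=1.0, beta=1.0):
    if scheme == "inverse_linear":                    # Equation 2.12
        return (dk - d) / (dk - d1) if dk != d1 else 1.0
    if scheme == "inverse_distance":                  # Equation 2.13
        return 1.0 / (d + 1e-6)                       # small constant avoids /0
    if scheme == "exp_decay":                         # Equation 2.14
        return math.exp(-alpha * d ** beta)
    raise ValueError(scheme)

def weighted_vote(neighbors, scheme="inverse_linear"):
    d1, dk = neighbors[0][0], neighbors[-1][0]
    scores = defaultdict(float)
    for d, label in neighbors:
        scores[label] += vote_weight(d, d1, dk, scheme)
    return max(scores, key=scores.get)

neighbors = [(0.1, "NP"), (0.4, "VP"), (0.5, "NP"), (0.9, "VP")]
print(weighted_vote(neighbors, "exp_decay"))   # "NP"
```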
2.1.2 Hyperplane discrimination
Hyperplane discriminant algorithms include algorithms such as Perceptron (Rosenblatt, 1958), Winnow (Littlestone, 1988) and Support Vector Machines (Vapnik, 1998). Hyperplane discriminant algorithms search for an optimal hypothesis of the target function in a hypothesis space of which the dimensions are determined by the features present in the set of labeled instances. The labeled training instances can then be considered as points in this space. The hyperplane discriminant algorithm tries to find a hypothesis function that divides the space in such a way that it separates the points representing instances belonging to different classes.

We choose the sparse winnow algorithm (winnow) as implemented in the SNoW software package (Carlson et al., 1999; Roth, 1998) as our hyperplane discriminant algorithm. winnow is an efficient learning algorithm designed for classification tasks with many features, and is robust against noisy data and irrelevant features. It has been applied successfully to many classification tasks in natural language processing: shallow parsing (Munoz et al., 1999), part-of-speech tagging (Roth and Zelenko, 1998), text categorization (Dagan et al., 1997), spelling correction (Golding and Roth, 1999), word sense disambiguation (Escudero et al., 2000) and semantic role labeling (Ngai et al., 2004; Punyakanok et al., 2004). Winnow has also been applied to other tasks such as face detection (Yang et al., 2000) and patent classification (Koster et al., 2003).

In Figure 2.3 a visualization of winnow is given. The winnow algorithm uses a sparse network architecture as a model of the target function. The sparse network consists of weighted connections between input nodes and output nodes. The input nodes symbolize the feature values present in the labeled training instances and the target nodes denote the class labels. The classification module in the picture is labeled with "summed winner-take-all". When a test instance is matched against the network, the feature values present in the instance activate the weighted connections of the matching input nodes. Each output node in the network has a threshold T and predicts positively for the target class when the sum of the active weighted connections exceeds the threshold, and negatively otherwise.
Figure 2.3: The Winnow algorithm uses a sparse associative network as a model. When classifying a new instance, the weights connected to the feature values present in the instance are summed for each class and the class label with the highest sum is chosen.
In the next part of this section we first describe the network architecture and explain the learning phase of winnow, followed by a description of the classification phase.

The learning module is the heart of the winnow algorithm. The construction of the network is data-driven; connections between input nodes and target nodes are only created when the data provides evidence in the form of at least one instance that contains that particular feature value and is labeled with the target class, hence the term sparse in the name of the algorithm. The weights of the connections are learned using the Littlestone Winnow update rule (Littlestone, 1988). The Littlestone update rule is multiplicative and has two parameters: a promotion parameter α > 1 and a demotion parameter β between 0 and 1. The learning of the weights starts with a random initialization. The learning process is error-driven. Each training instance is tested against the network; only when the instance is misclassified are the weights updated. The output node with the same class label as the training instance needs to give a positive prediction. When the target node misclassifies the instance by predicting negative, the weights of the active connections leading to that target node are promoted by multiplication by α. The target nodes that do not have the same class label as the instance to be learned need to give a negative prediction for this instance. When one of these nodes misclassifies the instance as positive, the weights of the active connections to this node are demoted by multiplication by β. Other weights are unchanged.
When the algorithm classifies a new instance, the features present in the new instance activate connections to certain target nodes. The activation score for each target node is calculated by summing its active connections. The class label of the target node with the highest score is assigned to the new instance. Equation 2.15 shows the summation of weights for target node t, where n is the total number of active features connected to target node t and w_{it} is the weight of active feature i connected to t.

\sum_{i=1}^{n} w_{it} > T \qquad (2.15)
We already mentioned three algorithmic parameters that need to be set: the promotion parameter α, demotion parameter β and threshold T . The SNoW software package also implements many other algorithmic parameters that can be specified by the user. We briefly mention three parameters that we use in our experiments. The first parameter is the number of cycles through the training data in which the weights of the connections are adjusted with the winnow update rule. The second parameter offers the possibility to change the thickness of the separator between the positive and negative instances. This floating point value indicates the distance between the threshold and negative or positive instances. The third parameter performs feature selection or noise reduction by simply ignoring the values that occur less than n times in the training set (we use in our experiments the default setting, n=2).
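A compact Python sketch of the multiplicative update and the winner-take-all classification described above follows. It is an illustration of the update rule only, not the SNoW implementation: weights start at 1.0 rather than at random values, connections are created lazily for any (class, feature) pair that is touched, and the class names and parameter values are invented.

```python
from collections import defaultdict

class SparseWinnow:
    def __init__(self, classes, alpha=1.5, beta=0.5, threshold=1.0):
        self.classes, self.alpha, self.beta, self.T = classes, alpha, beta, threshold
        self.w = defaultdict(lambda: 1.0)   # weights of (class, feature) connections

    def score(self, features, c):
        return sum(self.w[(c, f)] for f in features)

    def update(self, features, true_class):
        # Error-driven multiplicative update (Littlestone, 1988).
        for c in self.classes:
            fires = self.score(features, c) > self.T
            if c == true_class and not fires:    # missed positive: promote by alpha
                for f in features:
                    self.w[(c, f)] *= self.alpha
            elif c != true_class and fires:      # false positive: demote by beta
                for f in features:
                    self.w[(c, f)] *= self.beta

    def classify(self, features):
        # Summed winner-take-all over the active connections.
        return max(self.classes, key=lambda c: self.score(features, c))

clf = SparseWinnow(["NP", "VP"])
for _ in range(3):                               # a few training cycles
    clf.update({("word", "the"), ("pos", "DT")}, "NP")
    clf.update({("word", "runs"), ("pos", "VBZ")}, "VP")
print(clf.classify({("word", "the"), ("pos", "DT")}))   # "NP"
```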
2.1.3 Rule induction
In this section we discuss rule induction algorithms. First we mention some general issues in rule learning algorithms. In the next part of the section we describe in detail ripper, the rule induction algorithm we use in our experiments. We also discuss a second rule induction algorithm, cn2, which we use for some additional experiments.

Rule induction algorithms induce a set of classification rules from labeled training data as a model of the target function. The condition part of a rule consists of tests on the presence of feature values combined with boolean operators. Many variants of rule induction algorithms exist; the method for inducing rules, the type of rules, and the classification method differ among the various implementations of rule induction algorithms. We mention some of the general choices to be made in the design of a rule induction algorithm.

One large group of rule induction algorithms are separate-and-conquer or sequential covering algorithms. They learn one rule at a time in an iterative process, and remove all instances from the data that are covered by this rule. Fürnkranz (1999) gives an overview of sequential covering algorithms. In contrast, divide-and-conquer or simultaneous covering (Mitchell, 1997) rule induction algorithms split the data into disjoint sets and construct rules for each of these sets. C4.5rules (Quinlan, 1993) is an example of such an algorithm.
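The separate-and-conquer strategy can be summarized in a short, schematic Python sketch; `grow_rule` and `covers` are placeholder callables standing in for a concrete rule learner, not functions of ripper or cn2.

```python
# Schematic sequential covering: learn one rule on the remaining data,
# remove the instances it covers, and repeat until nothing is left (or no
# further rule can be grown).
def sequential_covering(instances, grow_rule, covers, max_rules=100):
    rules = []
    remaining = list(instances)
    while remaining and len(rules) < max_rules:
        rule = grow_rule(remaining)
        if rule is None:
            break
        rules.append(rule)
        remaining = [x for x in remaining if not covers(rule, x)]
    return rules
```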
Another key aspect in the design of the learning module is the choice between a general-to-specific or a specific-to-general search strategy. Working from the general-to-specific principle, the rule learner starts with inducing one general rule, and incrementally induces more specific rules to fit the data better. Alternatively, the learner can work specific-to-general; it starts with constructing several specific rules and iteratively makes this rule set more general, by merging rules or by modifying individual rules to become more general. This approach is also called overfit-and-simplify.

A choice that influences all parts of the rule induction algorithm is the choice between an ordered or unordered rule set. In the case of an unordered rule set, the learning module induces a rule set on the basis of all labeled training instances. In the classification phase, each rule is matched against the new instance. When more than one class label is assigned by the matching rules, an extra mechanism such as voting or rule weighting must be used to determine the final class label. In case the model of the algorithm is an ordered rule set, the learning module has the extra task of finding an optimal ordering of the rules. Classification, on the other hand, is straightforward: the first matching rule fires and assigns the class label, discarding all other rules in the remainder of the ordered rule set.

We choose ripper (Cohen, 1995) as the rule induction algorithm in our experiments. We also ran some additional experiments with another rule induction algorithm, cn2 (Clark and Boswell, 1991), which differs on various points from ripper.

ripper is an abbreviation of Repeated Incremental Pruning to Produce Error Reduction (Cohen, 1995) and is based on the irep algorithm developed by Fürnkranz and Widmer (1994). ripper is a sequential covering algorithm that uses, by default, the specific-to-general technique to induce rules. Figure 2.4 presents a global schematic visualization of ripper. When used without any specification of algorithmic parameters, ripper produces an ordered rule set as its model and classifies a new instance by assigning the class label of the first matching rule in the rule set.

In the learning phase, ripper starts by making a list of the classes. Standard ripper orders the classes on the basis of their frequency, and starts with inducing rules for the least frequent class. For each class, the instances labeled with this class are considered as positive examples and all other instances are regarded as negative examples. The set of training instances is split into a growing set and a pruning set. First a specific rule is constructed by repeatedly adding feature-value tests until the rule covers no negative examples in the growing set. Next the rule is pruned by deleting conditions from the rule, maximizing the function f(rule) in Equation 2.16:

f(rule) = \frac{p - n}{p + n} \qquad (2.16)

where p is the number of positive examples in the pruning set covered by the rule, and n is the number of negative examples in the pruning set covered by the rule.
Figure 2.4: The rule induction algorithm ripper uses an ordered rule set as its model. In the classification phase, the first rule in the rule set that matches the new instance assigns the class label.
When a rule is created, it is added to the rule set and the positive and negative examples covered by the rule are removed from the training set. In the next step the process is repeated. The stopping criterion for adding rules to the rule set is based on the minimum description length principle. The description length of the rule set and its coverage of instances is computed after adding a rule to the rule set. When this description length is d bits longer than previous descriptions, ripper stops adding new rules. The default setting of d is 64.

After the construction of a rule set, ripper has a post-processing step in which it performs an optional number of extra optimization rounds. Each rule is pruned and evaluated in the context of optimizing the error rate of the rule set as a whole. ripper constructs two alternative rules for each rule: a replacement rule and a revision rule. The replacement rule is constructed by growing and pruning a new rule while minimizing the error of the entire rule set on the pruning set. The construction of the revision rule starts with the original rule, which is modified with respect to the minimal error of the complete rule set. The rule versions are evaluated by measuring the description length of the whole rule set, and the rule version that yields the smallest description length is chosen as the final version of the rule.

ripper has several algorithmic parameters that can affect the behavior of the algorithm. Here we describe the parameters that we use in our experiments. We already mentioned class ordering: ripper offers several possibilities to order the classes, and the ordering of the classes determines the order in which rules are induced. ripper also has the
option to produce an unordered rule set. The number of optimization rounds in the post-processing step can be specified by the user; by default it is set to two. The amount of pruning applied to the rule set, also called rule simplification, can be varied with an algorithmic parameter. It is optional to perform negative tests for nominal-valued features; this allows conditions of the form 'if not ..'. The minimal number of instances covered by a rule can be set; the default choice of ripper is one, allowing the algorithm in principle to make a rule for every instance. The expectancy of noisy training material can be signaled to the algorithm with a binary noise expectancy parameter. The misclassification cost of an instance can be changed: ripper offers the possibility to give a higher cost to false positives or false negatives. The default value is equal cost for all errors.

cn2 (Clark and Boswell, 1991) is a sequential covering algorithm that produces unordered rule sets. (The first version of the algorithm (Clark and Niblett, 1989) produces ordered rule sets.) Rules are learned using a general-to-specific beam search for each class in turn. The creation of a rule starts with the most general rule: a rule with empty conditions. Iteratively, feature-value tests are added, creating more specific rules. In each iteration the search algorithm keeps a list of the n best rule candidates (in the implementation n has the value of 5). The performance of a rule is evaluated on the training set by computing the Laplace Accuracy Acc_L:

Acc_L(rule) = \frac{p + 1}{p + n + C} \qquad (2.17)
where p is the number of positive examples covered by the rule, n is the number of negative examples covered by the rule, and C is the total number of classes. Significance tests are used as a stopping criterion for the search for further rules. The tests determine whether the distribution over the classes of the examples covered by the rule is significantly different from a uniform distribution. After learning a rule, this rule is added to the general rule set together with the class distribution, i.e. the number of training instances per class that are covered by the rule. Only the positive examples covered by the rule are removed, which allows the rules to overlap in the examples they cover. Instances that are positive examples for one rule can also be negative examples for one or more other rules. In classification, rules are matched against the new instance. The stored class distributions of the matching rules are summed for each class, and the class label with the highest summed value is assigned to the new instance.

In sum, ripper and cn2 differ on several key points. ripper uses the specific-to-general search strategy for producing rules, while cn2 uses the general-to-specific approach. The algorithms use different performance evaluation methods and stopping criteria in the learning module. ripper removes all examples covered by a rule, whereas cn2 only removes the positive examples. The algorithms have in common that they are both sequential covering algorithms and both can produce unordered rule sets. cn2 has the advantage that it has no algorithmic parameters to be tuned. In terms of efficiency and computation time, ripper is more efficient and faster than cn2.
Maximum entropy modeling
Maximum entropy modeling (Berger et al., 1996; Guiasu and Shenitzer, 1985) is a statistical machine learning approach that derives as a model a conditional probability distribution from labeled training data. This technique is based on the principle of insufficient reason: “When there is no reason to distinguish between two events, treat them as equally likely” (Keynes, 1921). Maximum entropy models (maxent) only represent what is known from the labeled training instances and assume as little as possible about what is unknown. In other words, maxent derives a probability distribution with maximal entropy. The hypothesis of the target function constructed by a probabilistic learning algorithm can be formulated as finding a conditional probability p(y|x) that assigns class y given a context x. The features present in the training instances of set S constitute context x. These features are used to derive constraints or feature functions of the form:
f (x, y) =
&
1 0
if y=y’ and cp(x)=true otherwise
(2.18)
where cp is a predicate to map the pair of class y and context x to {true, f alse} (Le, 2004; Ratnaparkhi, 1998). maxent searches for a hypothesis in the form of a distribution that satisfies the imposed constraints and maximizes the entropy. Della Pietra et al. (1997) show that the distribution is of the exponential form as displayed in Equation2.19:
p(y|x) =
j ! 1 exp( λi fi (x, y)) Z(x) i=1
(2.19)
where fi (x, y) is the feature function of feature i as detailed in 2.18, j represents the total number of features, λi is a parameter to be estimated, namely the weighting parameter of fi (x, y), and Z(x) is a normalization factor to ensure that the sum over all classes equals one as shown in Equation 2.20 (Nigam et al., 1999):
Z(x) =
k !
y=1
j ! exp( λi fi (x, y))
(2.20)
i=1
The search strategy for the optimal distribution of the λ weighting parameters is done in an iterative way with algorithms such as Generalized Iterative Scaling (GIS) (Darroch
22
Chapter 2: Algorithms, data and methods training instances
maxent weights matrix summed winner−take−all
predicted label y
test instance x
Figure 2.5: The maximum entropy modeling algorithm produces a probability matrix as internal model.
and Ratcliff, 1972) and limited-memory quasi-Newton (L-BFGS) (Nocedal, 1980). A general method against overfitting on the training instances is to perform smoothing. When information in the training instances is sparse and can not be used to make reliable probability estimations, smoothing is a common technique to produce more accurate estimations (Chen and Goodman, 1996). Chen and Rosenfeld (1999) tested several smoothing methods for maximum entropy modeling and found the maximum entropy smoothing method with a Gaussian prior to perform best. This smoothing method forces each λ-parameter to be distributed according to a Gaussian with a mean µ and a variance σ 2 as shown in Equation 2.21.
P(\lambda_i) = \frac{1}{\sigma_i \sqrt{2\pi}} \exp\left[ -\frac{(\lambda_i - \mu_i)^2}{2\sigma_i^2} \right] \qquad (2.21)
This smoothing method penalizes the weighting parameters λ when they deviate too much from their prior mean µ, which usually has the value zero. On the basis of sparse evidence in the training instances, the λ weights can be estimated to be very large and possibly infinite. By assuming a prior expectation that the λ parameters are not large, and balancing that expectation against the evidence in the data, the λ parameters are forced to be smaller and finite (Klein and Manning, 2003). We depict the maximum entropy modeling algorithm in Figure 2.5. The learning module produces a distribution of λ weight parameters of the feature functions. This distribution can also be viewed as a matrix filled with weights between classes and feature values. In the classification phase, the distribution in Equation 2.19 is applied to the test instance. For each class a probability is calculated: the λ weight parameters of the features present in the instance are summed, and the exponent of the sum is calculated and normalized.
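As an illustration of this classification step, the Python sketch below computes p(y|x) from a matrix of λ weights indexed by (feature value, class) pairs, following Equations 2.19 and 2.20. The weights and feature values in the example are invented; the sketch is not the Maxent toolkit used in our experiments.

import math

def maxent_probabilities(lambda_weights, active_values, classes):
    """Compute p(y|x) for every class y: sum the lambda weights of the
    feature values active in instance x, exponentiate and normalise."""
    scores = {}
    for y in classes:
        total = sum(lambda_weights.get((value, y), 0.0) for value in active_values)
        scores[y] = math.exp(total)
    z = sum(scores.values())                 # normalisation factor Z(x)
    return {y: s / z for y, s in scores.items()}

# Hypothetical weights for a two-class toy task.
weights = {("word=bank", "NN"): 1.2, ("word=bank", "VB"): -0.4,
           ("prev=the", "NN"): 0.8,  ("prev=the", "VB"): -0.6}
print(maxent_probabilities(weights, ["word=bank", "prev=the"], ["NN", "VB"]))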
2.2 Data
We perform our experiments on two types of data sets: data taken from the UCI collection of machine learning benchmark data sets, and data representing natural language processing (NLP) tasks. In the next two sections we detail the data sets we use in our experiments.
2.2.1 UCI benchmark tasks
We choose 29 data sets from the UCI (University of California, Irvine) benchmark repository for machine learning (Blake and Merz, 1998). These tasks are well known and frequently used in comparisons in machine learning research. Table 2.1 shows some basic statistics; it lists for each data set the number of instances, the number of classes, the number of features, the average number of values per feature (for numeric features this average is computed on the discretized versions of the data sets), and the percentages of numeric and symbolic features. The name cl-h-disease stands for 'Cleveland-heart-disease' and segment refers to 'Image Segmentation data'. Sixteen data sets consist of symbolic features only, ten have only numeric features, and the other three data sets have mixed symbolic and numeric features. The data sets are diverse: the number of classes varies from 2 to 28, and the number of instances ranges from 32 for the lung-cancer data set to 67,557 for connect4. Some data sets are artificial, i.e. designed by humans, such as the monks data sets, while others consist of data sampled from a real-world domain, such as the medical records in Cleveland-heart-disease. The UCI benchmark tasks have a given feature representation. We made some minor modifications to some data sets, such as removing non-informative features (for example, instance identifiers) or choosing one of the features as class label when this was not specified beforehand. In Appendix A we detail these modifications. Some features in the UCI data sets have missing feature values; we do not treat these specially, but consider the 'unknown' value as any other feature value.
2.2.1.1 Discretization of continuous features
Different machine learning algorithms have different methods to handle continuous feature values. In order to rule out these differences, we discretize numeric features in a preprocessing step. We use the entropy-based discretization method of Fayyad and Irani (1993). Ting (1994) proposed a globalized version of this method and applied it successfully to instance-based learning. Dougherty et al. (1995) and Kohavi and Sahami (1996) apply the entropy-based method to a range of supervised machine learning algorithms and show that it works well. This method was advised as the most appropriate method for discretization by Liu et al. (2002).
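The sketch below illustrates the core idea of entropy-based discretization in a much-simplified form: it selects the single cut point that minimizes class entropy, whereas the method of Fayyad and Irani (1993) applies such splits recursively with an MDL-based stopping criterion.

import math

def entropy(labels):
    """Class entropy of a list of labels."""
    total = len(labels)
    counts = {}
    for label in labels:
        counts[label] = counts.get(label, 0) + 1
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def best_cut_point(values, labels):
    """Return the boundary that minimizes the weighted class entropy of the
    two resulting intervals (one binary split only)."""
    pairs = sorted(zip(values, labels))
    best, best_score = None, float("inf")
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue                       # no boundary between equal values
        cut = (pairs[i - 1][0] + pairs[i][0]) / 2.0
        left = [l for v, l in pairs[:i]]
        right = [l for v, l in pairs[i:]]
        score = (len(left) * entropy(left) + len(right) * entropy(right)) / len(pairs)
        if score < best_score:
            best, best_score = cut, score
    return best

print(best_cut_point([1.0, 1.5, 2.0, 6.0, 7.0, 8.0], ["a", "a", "a", "b", "b", "b"]))
# -> 4.0, the boundary separating the two classes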
2.2.2 NLP tasks
The data representing the natural language processing tasks is based on human annotation of linguistic phenomena in text. In the data we use, the text consists of newspaper articles. Before annotation, the text is tokenized, which means that punctuation marks attached to words are separated from the words by white space. The elements in the text that are separated by white space are called tokens. The natural language processing (NLP) tasks are characterized by the fact that they are quite complex and contain many exceptions. We choose four NLP tasks: phrase chunking, named entity recognition, prepositional phrase attachment, and part-of-speech tagging. Two of these tasks, prepositional phrase attachment and part-of-speech tagging, can be considered lexical disambiguation tasks. Lexical disambiguation concerns the labeling of words in text: words can have multiple labels, and the task is to choose the label that fits best given the context of the word. The other two tasks are sequential tasks: phrase identification and classification. In a sequential task a label is assigned to a sequence of words. The task is first to correctly identify the boundaries of the sequence and second, to assign the most appropriate label given the context. Table 2.2 lists the number of instances, the number of classes, the number of features and the average number of values per feature of the four NLP tasks. The NLP data sets only contain symbolic features. Compared to the statistics of the UCI data sets we observe that the NLP data sets have a markedly higher average number of values per feature. Three of the NLP data sets are also much larger than any of the UCI data sets. For the natural language tasks we choose a simple feature representation and we keep the feature representation constant for each task. The prepositional phrase attachment data set has a given feature representation which we use in our experiments. For the other three natural language tasks, an instance is created for each token in the text, representing some information about the token and its local context. Each instance consists of the focus surrounded by three tokens to the left and right (a so-called 3-1-3 window representation).
Task            # inst   # classes  # feats  av. v/f  % num.  % symb.
abalone         4177     28         8        149.1    87.5    12.5
audiology       226      24         69       2.3      0       100
bridges         104      6          7        11.3     0       100
car             1728     4          6        3.5      0       100
cl-h-disease    303      5          13       27.4     100     0
connect4        67557    3          42       3.0      0       100
ecoli           336      8          7        21.7     100     0
flag            194      8          28       10.8     35.7    64.3
glass           214      6          9        58.2     100     0
kr-vs-kp        3196     2          36       2.0      0       100
letter          20000    26         16       15.5     100     0
lung-cancer     32       3          56       2.8      0       100
monks1          432      2          6        2.8      0       100
monks2          432      2          6        2.8      0       100
monks3          432      2          6        2.8      0       100
mushroom        8124     2          21       5.5      0       100
nursery         12960    5          8        3.4      0       100
optdigits       5620     10         64       11.1     100     0
pendigits       10992    10         16       29.9     100     0
promoters       106      2          57       4.0      0       100
segment         2310     7          19       53.5     100     0
solar-flare     1389     6          12       3.6      83.3    16.7
soybean-large   683      19         35       3.8      0       100
splice          3190     3          60       4.8      0       100
tictactoe       958      2          9        3.0      0       100
vehicle         846      4          18       60.6     100     0
votes           435      2          16       3.0      0       100
wine            178      3          13       19.2     100     0
yeast           1484     10         8        51.4     100     0

Table 2.1: Basic statistics of the 29 UCI repository data sets. For each task is presented: the number of instances, the number of classes, the number of features, the average number of values per feature (av. v/f), the percentage of numeric features and the percentage of symbolic features.
For the named entity recognition task and the chunking task, the part-of-speech tags of the seven tokens are also included as features. In the next sections we describe each of the four data sets and the feature representations used to represent the instances.
data    # instances  # classes  # features  av. v/f
chunk   211,727      22         14          9134.8
ner     203,621      8          14          10799.1
pos     211,727      45         7           18225.0
pp      20,801       2          4           3268.8

Table 2.2: Number of instances, classes, features and average number of values per feature of the four NLP data sets.
2.2.2.1 Part-of-speech tagging
Part-of-speech tagging (pos) is the task of assigning syntactic labels to tokens in text. pos tagging aims to solve the problem of lexical syntactic ambiguity and can be used to reduce sparseness in data, as there are fewer different part-of-speech labels than tokens. Information about syntactic labels is considered useful for a large number of natural language processing tasks such as parsing, information extraction, word sense disambiguation and speech synthesis. In our experiments we use sections 15-18 of the Wall Street Journal (WSJ) corpus as our data set. The WSJ corpus is annotated with part-of-speech information as part of the Penn Treebank project (Marcus et al., 1993). Example 2.22 shows a sentence from the WSJ corpus in which each token is annotated with its part-of-speech tag. For instance, the first two words Mr. Krenz have the label NNP, which denotes a singular proper noun; the third word is labeled VBZ, indicating that the word is a verb, third person singular, present tense. (Detailed information about the tag set can be found in Santorini (1990).)

(2.22) Mr._NNP Krenz_NNP is_VBZ such_JJ a_DT contradictory_JJ figure_NN that_IN nobody_NN has_VBZ even_RB come_VBN up_RP with_IN any_DT good_JJ jokes_NNS about_IN him_PRP ._POINT

We use the following simple feature representation. For each token in the text an instance is created, representing some information about the token and its local context. Each instance consists of the focus token and a 3-1-3 window of the tokens to the left and right. Example 2.23 shows two instances derived from the sentence presented in Example 2.22. The first instance presents the focus token such, which is labeled with the class JJ, denoting that the word is an adjective in this sentence. The second instance represents a, labeled with the class DT, indicating that the word is a determiner.
(2.23) Mr. Krenz is such a contradictory figure JJ
Krenz is such a contradictory figure that DT
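The following sketch illustrates how such 3-1-3 window instances can be derived from a tagged token sequence; the padding symbol for positions outside the sentence is our own assumption, and because the example sentence is truncated, the right context of the second instance differs slightly from Example 2.23.

def window_instances(tokens, labels, left=3, right=3, pad="_"):
    """Create one instance per token: three tokens of left context, the
    focus token, three tokens of right context, plus the class label."""
    padded = [pad] * left + list(tokens) + [pad] * right
    instances = []
    for i, label in enumerate(labels):
        features = padded[i:i + left + 1 + right]
        instances.append((features, label))
    return instances

sentence = ["Mr.", "Krenz", "is", "such", "a", "contradictory", "figure"]
tags = ["NNP", "NNP", "VBZ", "JJ", "DT", "JJ", "NN"]
for features, label in window_instances(sentence, tags)[3:5]:
    print(" ".join(features), label)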
2.2.2.2 Phrase chunking

Phrase chunking consists of splitting sentences into non-overlapping syntactic phrases or constituents. Chunking is an intermediate step towards full parsing (Abney, 1991; Tjong Kim Sang and Buchholz, 2000). Example 2.24 presents a sentence with the chunks marked with brackets and labeled with their syntactic function. For instance, the first chunk contains the word Late and its label ADVP represents the syntactic category adverbial phrase. (Detailed information about the labeling can be found at http://www.cis.upenn.edu/~treebank/home.html.)

(2.24) [Late]_ADVP [in]_PP [the New York trading day]_NP KOMMA [the dollar]_NP [was quoted]_VP [at]_PP [1.8578 marks]_NP KOMMA [up]_ADVP [from]_PP [1.8470 marks]_NP [late Thursday]_NP [in]_PP [New York]_NP .

Phrase chunking involves the recognition of sequences of words. However, many machine learning algorithms can only predict one label per item and cannot handle sequences as a whole. This implies a conversion step to map the sequences to single items, and for this purpose IOB tags (Ramshaw and Marcus, 1995) are used. Each token is annotated with a class label that indicates whether it is inside or outside of a chunk. There are several possible variants to annotate this information (Munoz et al., 1999; Tjong Kim Sang and Veenstra, 1999). We use the training set of the CoNLL-2000 shared task (Tjong Kim Sang and Buchholz, 2000) in our experiments. This data set is extracted from the Penn Treebank by Buchholz (2002), and the training set covers sections 15-18 of the WSJ corpus. The CoNLL-2000 shared task uses the following IOB tags: the start of a chunk (B), inside a chunk (I) or outside a chunk (O). When the token is part of a chunk, the class label also marks the syntactic label of the chunk. The data set contains three columns that represent tokens, their part-of-speech tags and class labels. The part-of-speech tags are predicted with the Brill tagger (Brill, 1994) trained on a different part of the WSJ corpus. Table 2.3 presents a fragment of the English training set of the CoNLL-2000 shared task.

We create an instance for each token and its local context. Each instance presents the focus and its part-of-speech tag, and three tokens and part-of-speech tags to the left and right. Example 2.25 displays two instances, representing the local context of the chunk late Thursday of the sentence displayed in Table 2.3. The first instance represents the word late with part-of-speech tag JJ and class label B-NP, the three words to the right Thursday, in, New with their part-of-speech tags NNP, IN, NNP, and the three words to the left from, 1.8470, marks with their part-of-speech tags.
Late     RB    B-ADVP
in       IN    B-PP
the      DT    B-NP
New      NNP   I-NP
York     NNP   I-NP
trading  NN    I-NP
day      NN    I-NP
,        ,     O
the      DT    B-NP
dollar   NN    I-NP
was      VBD   B-VP
quoted   VBN   I-VP
at       IN    B-PP
1.8578   CD    B-NP
marks    NNS   I-NP

Table 2.3: A fragment of the English training set of the CoNLL-2000 shared task. Each token is mapped to a part-of-speech tag and a class label containing IOB-tags.
(2.25) from IN 1.8470 CD marks NNS late JJ Thursday NNP in IN New NNP B-NP
1.8470 CD marks NNS late JJ Thursday NNP in IN New NNP York NNP I-NP
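To illustrate how the IOB labels relate back to phrases, the sketch below groups (token, IOB tag) pairs into labeled chunks following the B-X/I-X/O convention; it is an illustration, not the evaluation script of the shared task.

def iob_to_chunks(tokens, tags):
    """Collect (chunk label, list of tokens) pairs from B-X / I-X / O tags."""
    chunks, current_label, current_tokens = [], None, []
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-") or (tag.startswith("I-") and current_label != tag[2:]):
            if current_tokens:
                chunks.append((current_label, current_tokens))
            current_label, current_tokens = tag[2:], [token]
        elif tag.startswith("I-"):
            current_tokens.append(token)
        else:                                   # tag == "O"
            if current_tokens:
                chunks.append((current_label, current_tokens))
            current_label, current_tokens = None, []
    if current_tokens:
        chunks.append((current_label, current_tokens))
    return chunks

tokens = ["Late", "in", "the", "New", "York", "trading", "day", ","]
tags = ["B-ADVP", "B-PP", "B-NP", "I-NP", "I-NP", "I-NP", "I-NP", "O"]
print(iob_to_chunks(tokens, tags))
# [('ADVP', ['Late']), ('PP', ['in']), ('NP', ['the', 'New', 'York', 'trading', 'day'])]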
2.2.2.3 Named entity recognition
The task of named entity recognition (ner) is to recognize and classify named entities in text. ner is a subtask of information extraction. Information extraction is concerned with the extraction of facts in unstructured data (Appelt and Israel, 1999; Jackson and Moulinier, 2002). The facts to be extracted depend on the type of application, but usually information about named entities such as persons, locations, companies, organizations, dates or amounts is considered valuable information. Information about named entities is not only important for information extraction, but is also considered helpful for other natural language processing tasks such as machine translation and information retrieval. The recognition of names in English text seems easy as names start with a capital letter. Marking words in the text that start with a capital letter but are not the first word in a sentence will already cover a large part of the names in the text. However, classifying the type of the names is the harder part of the job, as names can be ambiguous. Names such as Heineken can point to a person as well as an organization, and when used in a sentence such as (2.26) the name refers to an object. (2.26) Could you bring me a Heineken and a coffee?
The CoNLL-2003 shared task (Tjong Kim Sang and De Meulder, 2003) concerns language-independent named entity recognition, using four name categories: persons, organizations, locations and miscellany names. This last category contains all other names and can refer to special events (e.g. World Cup) or adjectival names (e.g. British). In our experiments we use only the English training set of the data provided in the CoNLL-2003 shared task. This data set is a collection of newswire articles from the Reuters Corpus (http://about.reuters.com/researchandstandards/corpus/). Example 2.27 shows two sentences from the data set with the named entity information between brackets.

(2.27) [Rabinovich]_person is winding up his term as ambassador . He will be replaced by [Eliahu Ben-Elissar]_person , a former [Israeli]_misc envoy to [Egypt]_location and right-wing [Likud]_organization party politician .
Named entity recognition, like the chunking task, involves both the identification of units of one or more words and the assignment of a label to these units. The IOB-tagging convention is used to represent the task as a classification task. This data set is tagged with a slightly different variant of IOB tags than the chunk data set: (I) indicates that the token is inside a named entity phrase, (B) indicates a border between two named entities, and tokens labeled with (O) are not part of a name. The data set consists of four columns: tokens, part-of-speech tags, phrase chunks and class labels. The part-of-speech tags and phrase chunks are generated by the memory-based MBT tagger (Daelemans et al., 2003) and the named entity labels are annotated manually by the organizers of the shared task. We use the same simple feature representation as for the chunk task: each instance consists of a 3-1-3 window of the focus surrounded by three tokens and part-of-speech tags to the left and right. The first instance displayed in Example 2.28 presents the token by of the sentence presented in Example 2.27. The instance is labeled with the class O, which means that the token is not part of a named entity phrase. The second instance presents the name Eliahu and is labeled with I-PER, which stands for 'inside phrase, person name'.

(2.28) will MD be VB replaced VBN by IN Eliahu NNP Ben-Elissar NNP KOMMA KOMMA O
be VB replaced VBN by IN Eliahu NNP Ben-Elissar NNP KOMMA KOMMA a DT I-PER
2.2.2.4 Prepositional phrase attachment
Parsing is the task of analyzing the syntactic structures of sentences. When parsing a sentence, some elements are difficult to place in the structure as there are several possibilities. Prepositional phrases are typical examples of such difficult elements.
The position of a prepositional phrase in a syntactic structure can be ambiguous. Resolving this ambiguity may require lexical or structural information, but sometimes more complex information such as discourse or world knowledge is needed. Assigning prepositional phrases their correct positions (attachments) in a syntactic structure is also called pp-attachment (pp). A well-known data set that represents an instantiation of pp-attachment ambiguity is the data set collected by Ratnaparkhi et al. (1994). This data set is extracted from the IBM-Lancaster Treebank of Computer Manuals and the WSJ part of the Penn Treebank. The data contains extracted verb phrases with an object phrase and a prepositional phrase. The task is to determine whether the prepositional phrase has to be attached to the object phrase or to the verb phrase. Each instance consists of a class label and four features representing the head of the verb phrase, the head noun of the object, the preposition and the head noun of the prepositional phrase. We use the same feature representation as Ratnaparkhi. Example 2.29 shows a sentence from the WSJ corpus; Example 2.30 presents the instance extracted from this sentence. Each instance consists of a verb, an object, a preposition and the head noun of the prepositional phrase. The class label V indicates that for this sentence the prepositional phrase has to be attached to the verb phrase.

(2.29) Pierre Vinken, 61 years old, will join the board as nonexecutive director Nov 29.

(2.30) join board as director V

The data as collected by Ratnaparkhi is split into three sets: a training, a development and a test set. We only use the training set of 20,801 instances in our experiments.
2.3 Experimental methods
For the experiments presented in Chapters 3 and 4 we perform ten-fold cross-validation (10-fold cv) on the data sets, an established method in machine learning to estimate the generalization performance of a classifier on a particular classification task (Weiss and Kulikowski, 1991). The data set representing the classification task is split into ten folds of approximately the same size. Splitting into ten folds was done at the instance level for the 29 UCI benchmark data sets, and at the sentence level for the natural language data sets. Each fold is used as test material once, while the other nine parts form the training set. For the data sets with numerical features we apply discretization as described in Section 2.2.1.1. We base our discretization intervals solely on the training data, and use these intervals on both training and test material. This means that we perform a separate discretization for each training fold in the ten-fold cross-validation experiments. We use
this discretized data for each of the machine learning algorithms to keep all circumstances in the experiments as equal as possible. A classifier is trained on the training part and optimized with respect to its algorithmic parameters (described in Section 2.3.1). We measure the accuracy of the classifier on each of the test folds and calculate the mean and the standard deviation over the ten folds. We perform paired t-tests to determine the significance of differences in the performance of two classifiers (detailed in Section 2.3.2).
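A schematic version of this procedure is sketched below. The function train_and_test is a placeholder for the actual training, parameter optimization and discretization steps, which in our experiments are carried out by the external software packages described earlier; fold splitting at the sentence level for the NLP data is also omitted here.

import random
import statistics

def ten_fold_cv(instances, train_and_test, folds=10, seed=1):
    """Split the data into folds, use each fold once as test material,
    and report the mean accuracy and standard deviation over the folds."""
    data = list(instances)
    random.Random(seed).shuffle(data)
    parts = [data[i::folds] for i in range(folds)]
    accuracies = []
    for i in range(folds):
        test = parts[i]
        train = [inst for j, part in enumerate(parts) if j != i for inst in part]
        # train_and_test should derive discretization intervals and optimized
        # parameters from `train` only, and return the accuracy on `test`.
        accuracies.append(train_and_test(train, test))
    return statistics.mean(accuracies), statistics.stdev(accuracies)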
2.3.1 Optimization of algorithmic parameters
Machine learning algorithms tend to be equipped with several algorithmic parameters that affect their behavior. Performing experiments with machine learning algorithms with their default parameter settings might not be the best choice for the task and data set at hand. With default parameter settings we mean the settings that the algorithm uses when nothing is specified by the user, that is, the settings as coded by the designer of the particular algorithm implementation. Determining an optimal algorithmic parameter setting for a particular task and data set is not simple, especially as the parameters are usually not independent of each other. This implies that combinations of parameter settings can influence the behavior of the algorithm in a way that cannot be predicted from the effects of the parameter settings individually. When the data set and the number of algorithmic parameters are sufficiently small, and numerical algorithmic parameters are tested at a limited number of discrete values, one can perform an exhaustive search and test all possible parameter combinations to find an optimal parameter setting automatically (Kohavi and John, 1997). In our experiments we perform a pseudo-exhaustive search for data sets with fewer than 1000 instances in the training set: for each training fold we perform internal 10-fold cv experiments to test all combinations of a broad range of possible parameter settings, as listed in the last part of this section. For larger data sets we use a heuristic wrapping-based approach to automatically select a parameter setting, called wrapped progressive sampling (WPS) (van den Bosch, 2004b). In this method, the training set is split into an 80% training part and a 20% held-out part, from which a sequence of growing training and test subsets is sampled respectively. The training part sequence consists of d increasingly-sized subsets. Each subset is created by multiplication with a factor f = \sqrt[d]{n}, where n is the number of instances in the 80% training part and d = 20. The size s_i of subset i is s_i = s_{i-1} \cdot f, where i varies from 1 to d. Only subsets with more than 500 instances are used; we also include a training subset of 500 instances itself as the first subset. Each training subset is paired with a test set of size 0.2 \cdot s_i created from the 20% test part. Each subsequent test subset contains the instances of the previous subset plus extra added instances.
Figure 2.6: Visualization of the selection process in WPS. The settings are divided over 10 equally sized bins according to their score. Selection goes from right to left; subsequently all bins containing equal or more settings than the bin with the highest score are selected.
The first step in WPS is to test all possible parameter combinations on the first train and test subset. Accuracies are measured on the test set and the parameter combinations are ranked accordingly. The next step is to remove the setting combinations with low performance. The WPS procedure does not simply cut off some percentage of the ranked settings but tries to estimate which part of the ranked list performed best. The difference between the highest and lowest accuracy score in the ranking is taken. This difference is divided into 10 equally-sized bins; the setting combinations are sorted into the bins. The number of settings contained in one bin can differ for each run. First the bin that contains the settings with the highest 10% accuracies is chosen. Next, the bin that contains the same number of settings or more is also selected. This selection process is visualized in Figure 2.6. Selection proceeds from right to left, starting with the selection of the bin that contains the highest scores. In the example this bin contains settings that score between 58.2 and 59.1% accuracy. The next three bins are also selected because each contains more elements than its preceding bin (looking right to left). The subset of settings in the selected bins will be tested again in the next iteration with the training- and test subset that are the next in the sequence of subsets. This iteration stops when there is one best performing setting found in the first bin. When there are several best settings left in the final iteration, one of them is selected randomly. WPS is a generic method and can be used for any algorithm that has several algorithmic parameters to be tuned. The parameters of the maximum entropy modeling algorithm are not optimized as it was shown in (van den Bosch, 2004b) that optimizing the two
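Our reading of this bin-based selection step can be sketched as follows; the actual WPS implementation is described in van den Bosch (2004b).

def select_settings(scored_settings, bins=10):
    """scored_settings: list of (setting, accuracy) pairs.  Returns the
    settings that survive the right-to-left bin selection."""
    accuracies = [acc for _, acc in scored_settings]
    low, high = min(accuracies), max(accuracies)
    if high == low:                      # all settings scored the same
        return [s for s, _ in scored_settings]
    width = (high - low) / bins
    binned = [[] for _ in range(bins)]   # equal-width accuracy bins
    for setting, acc in scored_settings:
        index = min(int((acc - low) / width), bins - 1)
        binned[index].append(setting)
    selected = list(binned[-1])          # bin holding the highest scores
    previous = len(binned[-1])
    for index in range(bins - 2, -1, -1):
        # keep walking left while each bin holds at least as many settings
        # as the bin selected just before it
        if len(binned[index]) < previous:
            break
        selected.extend(binned[index])
        previous = len(binned[index])
    return selected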
available parameters did not substantially increase the performance of the algorithm. Here we list the algorithmic parameters of k-nn, ripper and winnow that are tested in the WPS procedure or in a pseudo-exhaustive search. We indicate the default parameter settings in bold. We test five algorithmic parameters of k-nn with a total of 925 parameter combinations. Details about the parameters can be found in Section 2.1.1 and the TiMBL software manual (Daelemans et al., 2004).

• number of nearest neighbors: 1, 3, 5, 7, 9, 11, 13, 15, 19, 25, 35;
• feature weighting: none, gain ratio, information gain, shared variance, chi-square;
• distance metric: overlap, mvdm, Jeffrey divergence;
• neighbor weighting: normal majority voting, inverse linear weighting, inverse distance weighting (only when k > 1);
• frequency threshold for switching from the mvdm distance metric to the overlap metric: 1, 2.

For the rule induction algorithm ripper we test seven algorithmic parameters, which leads to a total of 972 parameter combinations to be tested. We refer to Section 2.1.3 for further details about the algorithmic parameters.

• number of extra optimization rounds: 0, 1, 2;
• order of the classes: start by making rules for the most frequent classes, start with the least frequent classes, unordered;
• rule simplification: 0.5, 1.0, 2.0;
• misclassification cost: 0.5, 1.0, 2.0;
• minimum number of instances covered by a rule: 1, 2, 5, 10, 20, 50;
• noise expectancy: yes, no;
• negative tests for nominal valued features: yes, no.

For the winnow algorithm as implemented in the SNoW software package, we test five algorithmic parameters and a total of 756 parameter combinations. Further information about the parameters can be found in Section 2.1.2 and the user manual of SNoW (Carlson et al., 1999).

• target threshold: 1.0, 2.0, 4.0;
• demotion parameter: 0.75, 0.9, 0.95 (default 0.74);
• promotion parameter: 1.05, 1.1, 1.25, 1.5 (default 1.35);
• number of cycles: 2, 5, 10;
• thickness separator between positive and negative examples: 0.0, 0.1, 0.2, 0.5, 1.0, 2.0, 4.0.

                  actual class
prediction        yes      no
   yes            TP       FP
   no             FN       TN

Table 2.4: Contingency table
2.3.2 Evaluation methods
Several methods are available to evaluate the generalization performance of a classifier on a particular task. In such an evaluation the classifier labels a set of unseen instances. A well known measurement of generalization performance is accuracy, i.e. the proportion of correctly classified instances in relation to the total number of instances of the test set as shown in Equation 2.31.
\text{accuracy} = \frac{\#\ \text{correctly classified instances}}{\#\ \text{total instances}} \qquad (2.31)
Sometimes not every classification of the classifier is considered equally important; the performance on some classes may be more interesting than on other classes. For instance, the named entity recognition task has four named entity class labels and one no-name class. The task is to recognize named entities, and therefore the performance on the four named entity classes is more important than the score on the no-name class. Furthermore, the performance of the classifier on the no-name class can be expected to be high: the a priori probability of this class is high, so predicting only this class already leads to a relatively high accuracy. There are several evaluation methods that measure a score per class. For each class, instances labeled with that class are considered positive examples of the class and all other instances are seen as negative examples. When a classifier predicts a class label for a certain instance, this can have four possible outcomes, as shown in Table 2.4, a so-called contingency table or confusion matrix. A positive prediction for a positive example is called a true positive (TP), while a negative prediction for a positive example is labeled as
false negative (FN). True negatives (TN) are negative predictions for negative examples and false positives (FP) are positive predictions for negative examples. On the basis of this contingency table several metrics can be calculated. Precision measures the ratio of the number of correctly classified instances to the total number of positive predictions made by the classifier.
\text{precision} = \frac{TP}{TP + FP} \qquad (2.32)
Recall measures the number of correctly classified instances relative to the total number of positive instances. This measure is also called True Positive Rate (TPR).
\text{recall} = TPR = \frac{TP}{TP + FN} \qquad (2.33)
The False Positive Rate (FPR) measures the number of negative instances that are incorrectly classified as positive, relative to the total number of negative instances.
FPR = \frac{FP}{FP + TN} \qquad (2.34)
F-score (van Rijsbergen, 1979) is a method to balance between recall and precision and has a parameter β to give a higher importance to recall or precision. To calculate the harmonic mean between precision and recall, parameter β is given value 1.
F_\beta = \frac{(\beta^2 + 1) \cdot \text{precision} \cdot \text{recall}}{\beta^2 \cdot \text{precision} + \text{recall}} \qquad (2.35)
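In code, these per-class measures follow directly from the contingency counts; the sketch below is a direct transcription of Equations 2.32, 2.33 and 2.35, with guards against division by zero added.

def precision(tp, fp):
    return tp / (tp + fp) if tp + fp else 0.0

def recall(tp, fn):
    return tp / (tp + fn) if tp + fn else 0.0

def f_score(tp, fp, fn, beta=1.0):
    """F-score weighting recall against precision with beta (Equation 2.35)."""
    p, r = precision(tp, fp), recall(tp, fn)
    if p == 0.0 and r == 0.0:
        return 0.0
    return (beta ** 2 + 1) * p * r / (beta ** 2 * p + r)

# Example counts: 80 true positives, 20 false positives, 10 false negatives.
print(precision(80, 20), recall(80, 10), f_score(80, 20, 10))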
An alternative method is the Area Under the ROC (Receiver Operating Characteristics) Curve, abbreviated as AUC. This measure originates from the field of signal detection and is now also common in the field of machine learning. ROC graphs are two-dimensional graphs that plot TPR against FPR and show the ratio between benefits and costs (Fawcett, 2003). The predicted class labels of a discrete classifier can be used to calculate the TPR and FPR of the classifier; the calculated TPR and FPR produce one point in ROC space. Other types of classifiers, such as probabilistic classifiers (e.g. Naive Bayes) or hyperplane discriminant algorithms such as Winnow and SVM, do not produce class labels for each instance, but a numeric value for each class: a probability or a weight indicating the appropriateness of the class. A threshold on these numeric values can force the algorithm to assign the class label when the value is above the threshold, or not when it is below the threshold. Different thresholds result in different points in ROC space, and a range of these points produces a curve. AUC scores are mappings of curves in ROC space to one number, denoting the surface below the curve.
Figure 2.7: Two curves in ROC space, A denotes the angular curve of a discrete classifier and B denotes the curve of a probabilistic classifier.
A discrete classifier produces only one point in ROC space, but one can also draw a curve by connecting this point to the two corners (0,0) and (1,1). Figure 2.7 presents a ROC space exemplifying the angular curve of a discrete classifier (A) and the curve of a probabilistic classifier (B). In this thesis we consider all classifiers to be discrete, producing only one point in ROC space. The AUC score of a discrete classifier can then be calculated as follows:

AUC = TPR \cdot (1 - FPR) + \frac{TPR \cdot FPR}{2} + \frac{(1 - TPR) \cdot (1 - FPR)}{2} \qquad (2.36)
Because both F-score and AUC are calculated for each class separately, we need a method to calculate an average score over all classes of a particular task. Such an average can be calculated in two ways: micro- and macro-average. When calculating a micro-average, each classification of an instance is counted and an average is computed over the total number of instances. With macro-averaging a score per class is calculated and averaged over the total number of classes. When class distributions are skewed, the scores on low frequency classes have a low vote in a micro-averaged score, while they have an equal weight compared to the other classes in macro-averaging. Therefore the choice for micro- or macro-average depends on the data set and the goals of the research. For the AUC score, Provost and Domingos (2003) propose micro-averaging to calculate an average AUC for data sets with multiple classes as shown in Equation 2.37. First a separate AUC score for each class ci is calculated where all instances labeled with that
class are considered as positive examples and all other instances are considered negative examples (one-versus-all). These individual AUC scores AUC(c_i) are multiplied by the relative weight p(c_i) of that class. This weight is estimated by the frequency of the class in the data set.

AUC_{multi\text{-}class} = \sum_{i=1}^{n} AUC(c_i)\, p(c_i) \qquad (2.37)
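The sketch below transcribes Equations 2.36 and 2.37: the AUC of a discrete classifier is computed from its TPR and FPR, and the per-class AUC scores are averaged, weighted by the class frequencies.

def discrete_auc(tpr, fpr):
    """Area under the two-segment ROC curve of a discrete classifier
    (Equation 2.36)."""
    return tpr * (1 - fpr) + tpr * fpr / 2.0 + (1 - tpr) * (1 - fpr) / 2.0

def multi_class_auc(per_class_auc, class_frequencies):
    """Frequency-weighted average of one-versus-all AUC scores (Equation 2.37).
    Both arguments are dictionaries indexed by class label."""
    total = sum(class_frequencies.values())
    return sum(per_class_auc[c] * class_frequencies[c] / total
               for c in per_class_auc)

print(discrete_auc(0.9, 0.2))  # -> 0.85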
In this research we use accuracy as our evaluation method, because the machine learning algorithms are optimized with respect to their accuracy. Huang and Ling (2005) show, formally and empirically, that AUC is a better measurement than accuracy for the evaluation of learning algorithms. Other work (Bradley, 1997; Provost et al., 1998) also discusses ROC analysis, AUC and accuracy. In recent years AUC has become an important measurement in the field of machine learning. Therefore we also pay attention to AUC, although we find it inappropriate to use AUC as our main evaluation method while we optimize the algorithms with respect to their accuracy. In Section 3.4.2 we compare the performance of the algorithms in terms of micro-averaged AUC against accuracy.

We use one-tailed paired t-tests, also known as Student's t-tests, to measure whether two algorithms have a significantly different score on a particular task. A paired t-test measures whether two sets of values are significantly different from each other and assumes that the data are dependent, i.e. there is a one-to-one correspondence between the values in the two samples. In our case a sample contains the ten measured accuracies on the test folds of the 10-fold cv experiments. As each algorithm is tested on the same ten test folds, these pairs of values are indeed dependent. We use a one-tailed t-test because we test in one direction only, namely whether one algorithm scores higher than the other.
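With a statistics library such as SciPy such a test can be carried out as sketched below; scipy.stats.ttest_rel returns a two-tailed p-value, which is halved here to obtain the one-tailed value (assuming the observed difference is in the hypothesized direction). The accuracy values are invented.

from scipy.stats import ttest_rel

def one_tailed_paired_ttest(accuracies_a, accuracies_b):
    """Paired t-test on per-fold accuracies of two classifiers.  Returns the
    t statistic and a one-tailed p-value for 'A scores higher than B'."""
    t_statistic, two_tailed_p = ttest_rel(accuracies_a, accuracies_b)
    one_tailed_p = two_tailed_p / 2.0 if t_statistic > 0 else 1.0 - two_tailed_p / 2.0
    return t_statistic, one_tailed_p

# Ten accuracies per classifier, one per cross-validation fold.
a = [0.81, 0.79, 0.83, 0.80, 0.82, 0.78, 0.84, 0.81, 0.80, 0.82]
b = [0.79, 0.78, 0.80, 0.79, 0.81, 0.77, 0.82, 0.80, 0.78, 0.80]
print(one_tailed_paired_ttest(a, b))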
Chapter 3

Hybrids

We describe and motivate the construction of the hybrids that combine eager learning with k-nn classification. We present experimental results of the hybrids and compare their performance to their parent algorithms.

In this chapter we explain how the hybrid algorithms combine the classification method of k-nn with the model of an eager learning algorithm. We combine k-nn with maximum entropy modeling, rule induction, and multiplicative update learning. As argued in Chapter 1, our motivation for constructing these hybrid algorithms lies in the interesting contrast between eager and lazy learning. Eager learners put most of their time and effort in the learning phase, while lazy learners divert their effort to the classification phase. The hybrid algorithms combine the two approaches and put effort in both learning and classification, with the intent of combining the best of both worlds.

In the next three sections we describe and motivate the construction of the hybrid algorithms and present the results of comparative experiments with them. For these experiments we use the methods and data sets described in Chapter 2. Section 3.1 describes the construction and experimental results of the hybrid which combines maximum entropy modeling with k-nn. Section 3.2 details the design and results of two hybrids which combine the rule induction algorithm ripper with k-nn. We also describe additional experiments with the rule induction algorithm cn2 and compare these to the results of ripper. Section 3.3 describes the construction and the results of the hybrid algorithm which combines multiplicative update learning with k-nn classification. In Section 3.4 we describe three additional comparisons. First we compare the generalization performance of the four parent algorithms k-nn, rules, maxent and winnow to each other. Next we pay attention to the evaluation method accuracy used in our experiments and compare this to another popular evaluation method, AUC. In the last part of this section we analyze the results of the algorithms on the NLP data sets and investigate whether these differ from the results obtained on all tasks. Section 3.5 describes related work and Section 3.6 presents a summary of this chapter.
3.1 k-nn and maximum entropy modeling
In this section we describe the hybrid algorithm that represents a bridge between a maximum entropy modeling algorithm and a k-nearest neighbor classifier using the Modified Value Difference metric. As described in Section 2.1.1, the mvdm distance metric estimates distances between symbolic feature values as the difference between the conditional class distributions of the two values. A known weakness of the mvdm metric is its sensitivity to data sparseness. mvdm probabilities are based on counts of class–value co-occurrences. When these probabilities are based on low counts, this can lead to maximal distance between feature values that only accidentally occur with different classes, and minimal distance between feature values that only accidentally occur with the same class. Re-estimations of the probabilities could enhance the mvdm metric. One such reestimation of the value–class matrix is the one produced by the maximum entropy modeling algorithm. maxent re-estimates probabilities based on co-occurrence counts of classes and values in an iterative way to maximize the entropy of the value–class distribution. These re-estimated probabilities are referred to as λ weight parameters. Further details of maxent can be found in Section 2.1.4. We construct a hybrid algorithm which we name maxent-h (the ’h’ denotes hybrid) as reported earlier in (Hendrickx and van den Bosch, 2004). We replace the probabilities of the class distribution matrix of k-nn’s mvdm metric with the λ weight parameters distribution produced by the maximum entropy modeling algorithm. We take the λ weight parameters as estimated by maxent and scale these numbers linearly between zero and one, as the original class distribution probabilities used in mvdm are also numbers between zero and one. We refer to this new metric which uses maxent λ parameters as the maxent–mvdm distance metric. maxent-h combines the λ parameter distribution matrix of the maximum entropy modeling algorithm with k-nearest neighbor classification. The hybrid is depicted in Figure 3.1. In the learning phase, the hybrid reads the training instances in memory like a normal k-nn algorithm, but the training instances are also used to train the learning module of maxent which produces a value–class matrix of λ weight parameters that has the same dimensions as the mvdm matrix. The classification module uses these weights in the maxent–mvdm distance metric to determine the similarity between instances. From the perspective of k-nn, we expect that the hybrid repairs its sensitivity to data sparseness by smoothing the mvdm metric. The model of the eager learner is used
to modify the distance calculation between instances in k-nn.

Figure 3.1: The hybrid maxent-h combines the learning module of maximum entropy modeling with k-nn classification.

From the perspective of maximum entropy modeling, we may witness improved performance due to the fact that its default classification module, a normalized additive function, is replaced by the local maxent–mvdm distance metric to find the k nearest neighbors in the data. The two parent algorithms do not use the same types of values in their internal models: maxent uses a matrix of λ weights whereas k-nn uses class distributions. Although the values are different, we can apply the mvdm principle to both: feature values with similar class distributions are regarded as similar.
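A schematic rendering of the maxent–mvdm metric is given below: the λ weights are first scaled linearly to [0, 1], and the distance between two symbolic feature values is then the summed absolute difference of their scaled per-class weights, in the same way mvdm uses conditional class probabilities. The data structures are our own simplification of the internal representation used by Timbl.

def scale_weights(lambda_matrix):
    """Linearly rescale all lambda weights to the interval [0, 1].
    lambda_matrix maps (feature value, class) to a lambda weight."""
    values = list(lambda_matrix.values())
    low, high = min(values), max(values)
    span = (high - low) or 1.0
    return {key: (w - low) / span for key, w in lambda_matrix.items()}

def maxent_mvdm_distance(v1, v2, scaled, classes):
    """mvdm-style distance between two feature values, using scaled lambda
    weights instead of conditional class probabilities."""
    return sum(abs(scaled.get((v1, c), 0.0) - scaled.get((v2, c), 0.0))
               for c in classes)

def instance_distance(x1, x2, scaled, classes):
    """Distance between two instances: sum the per-feature value distances."""
    return sum(maxent_mvdm_distance(a, b, scaled, classes)
               for a, b in zip(x1, x2))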
3.1.1 Comparison between maxent-h, maxent and k-nn
We run experiments with the hybrid maxent-h and both parent algorithms maxent and k-nn, following the experimental methods described in Chapter 2. The experimental setup of the experiments with the hybrid has one minor modification: as described in Section 2.3.1, the WPS optimization process of the algorithmic parameters tests three different distance metrics for the k-nn algorithm; since the hybrid maxent-h is based on a k-nn algorithm with an mvdm distance metric, the WPS optimization of maxent-h excludes the choice between different δ functions in the distance metric. Table 3.1 lists the mean accuracies and standard deviations of all algorithms, calculated on the basis of 10-fold cv experiments on the 33 data sets. The mean accuracies of k-nn, maxent and maxent-h are shown in the third, fourth and seventh column of the table respectively. The bottom row of the table shows the macro-averaged accuracy over all 33 data sets.
data set abalone audiology bridges car cl-h-disease connect4 ecoli flag glass kr-vs-kp letter lung-cancer monks1 monks2 monks3 mushroom nursery optdigits pendigits promoters segment solar-flare soybean-l splice tictactoe vehicle votes wine yeast pos chunk ner pp macro av.
# inst 4177 226 104 1728 303 67557 336 194 214 3196 20000 32 432 432 432 8124 12960 5620 10992 106 2310 1389 683 3190 958 846 435 178 1484 187209 211727 203621 20801
k-nn 24.6 ± 2.8 80.5 ± 6.3 54.7 ± 10.6 96.5 ± 1.3 55.7 ± 5.4 77.7 ± 1.8 79.5 ± 4.9 66.9 ± 11.4 67.7 ± 8.4 96.8 ± 1.2 95.6 ± 0.5 33.3 ± 12.9 100.0 ± 0.0 94.0 ± 11.8 97.2 ± 2.5 100.0 ± 0.0 99.4 ± 0.5 98.0 ± 0.7 93.4 ± 0.9 87.0 ± 7.1 95.7 ± 0.9 94.2 ± 2.2 92.8 ± 4.2 95.3 ± 1.0 95.8 ± 3.8 67.6 ± 4.5 95.2 ± 2.4 96.1 ± 2.6 53.3 ± 2.9 94.8 ± 0.1 95.5 ± 0.2 97.4 ± 1.2 80.1 ± 1.3 83.4 ± 3.6
maxent 23.6 ± 1.7 80.9 ± 5.0 61.6 ± 9.1 90.9 ± 2.2 55.1 ± 5.0 75.7 ± 0.5 76.5 ± 7.8 69.8 ± 13.4 70.1 ± 11.6 96.8 ± 0.6 85.0 ± 0.7 39.2 ± 24.7 75.0 ± 4.0 65.1 ± 5.5 97.2 ± 2.5 100.0 ± 0.0 92.4 ± 0.4 95.8 ± 0.8 86.0 ± 1.2 92.5 ± 9.0 92.1 ± 2.8 94.7 ± 1.8 92.2 ± 2.8 94.6 ± 0.8 98.3 ± 0.7 63.5 ± 5.2 96.5 ± 2.1 94.9 ± 5.3 49.3 ± 3.9 92.5 ± 0.1 95.0 ± 0.2 96.9 ± 0.1 81.6± 1.0 81.0 ± 4.0
rules 18.1 ± 1.7 76.5 ± 7.7 53.8 ± 14.3 97.6 ± 1.1 58.4 ± 5.9 76.3 ± 1.7 69.7 ± 10.9 61.8 ± 8.8 60.7 ± 6.5 99.2 ± 0.5 73.8 ± 1.5 31.7 ± 24.1 99.3 ± 2.0 72.0 ± 8.1 97.2 ± 2.5 100.0 ± 0.0 97.7 ± 0.8 89.3 ± 1.0 82.6 ± 1.7 79.4 ± 7.9 90.5 ± 3.6 94.6 ± 1.5 91.1 ± 3.0 94.1 ± 1.6 99.7 ± 0.7 55.7 ± 5.0 94.2 ± 1.6 93.2 ± 6.5 42.2 ± 3.2 68.8 ± 0.5 92.9 ± 0.5 90.3 ± 0.9 77.0 ± 1.9 78.2 ± 4.2
winnow 24.3 ± 1.8 72.9 ± 8.5 57.7 ± 10.6 89.2 ± 1.9 55.7 ± 7.7 70.1 ± 1.3 74.4 ± 6.6 66.9 ± 12.2 60.7 ± 9.6 93.7 ± 1.4 65.3 ± 2.0 43.3 ± 24.9 74.3 ± 4.2 65.0 ± 6.1 96.1 ± 5.3 100.0 ± 0.0 90.8 ± 0.5 90.1 ± 1.2 79.5 ± 1.9 87.0 ± 11.7 90.5 ± 2.2 94.4 ± 1.9 90.6 ± 4.2 94.5 ± 1.8 98.0 ± 1.3 58.0 ± 6.2 93.8 ± 3.1 93.2 ± 6.0 49.5 ± 3.9 76.5 ± 5.2
maxent-h 22.8 ± 2.2 81.3 ± 5.8 55.7 ± 13.1 96.5 ± 1.5 54.8 ± 5.8 78.1 ± 1.9 78.6 ± 2.8 68.9 ± 14.9 61.5 ± 9.9 99.1 ± 0.4 95.9 ± 0.7 43.3 ± 13.3 93.7 ± 10.5 96.3 ± 8.3 97.2 ± 2.5 100.0 ± 0.0 97.9 ± 1.1 97.3 ± 0.6 92.0 ± 1.4 93.5 ± 8.2 95.9 ± 1.1 94.2 ± 1.8 93.1 ± 3.2 94.8 ± 1.5 100.0 ± 0.0 66.3 ± 5.4 95.4 ± 2.3 96.6 ± 2.7 55.4 ± 3.0 93.5 ± 0.2 95.5 ± 0.1 98.0 ± 0.4 78.3 ± 0.9 83.7 ± 3.9
rules-r-h 18.0 ± 1.7 61.0 ± 9.4 52.9 ± 17.2 94.0 ± 4.0 58.4 ± 5.9 75.2 ± 1.3 72.6 ± 12.1 61.8 ± 7.8 60.8 ± 8.0 99.2 ± 0.5 74.6 ± 1.4 25.0 ± 12.9 100.0 ± 0.0 75.5 ± 12.2 96.5 ± 2.6 98.7 ± 2.6 97.9 ± 0.7 89.9 ± 1.0 81.7 ± 3.5 80.3 ± 8.7 90.6 ± 3.6 94.6 ± 1.5 91.8 ± 3.2 94.2 ± 1.1 100.0 ± 0.0 56.0 ± 5.2 94.2 ± 1.6 92.7 ± 7.0 40.2 ± 2.1 68.8 ± 0.5 92.5 ± 0.2 90.5 ± 0.9 77.4 ± 1.5 77.5 ± 4.3
rules-a-h 25.0 ± 2.5 81.8 ± 5.2 55.7 ± 13.1 98.4 ± 0.9 58.4 ± 5.2 78.6 ± 2.5 78.0 ± 6.3 65.8 ± 10.9 66.9 ± 10.1 99.2 ± 0.5 95.6 ± 0.4 34.2 ± 16.0 100.0 ± 0.0 97.0 ± 4.2 97.2 ± 2.5 100.0 ± 0.0 99.2 ± 0.4 97.9 ± 0.6 92.5 ± 1.5 83.5 ± 11.2 95.8 ± 1.3 94.2 ± 1.9 92.8 ± 3.5 95.8 ± 0.7 100.0 ± 0.0 64.2 ± 5.7 94.9 ± 1.3 95.5 ± 4.9 52.0 ± 4.7 95.0 ± 0.19 95.5 ± 0.29 97.5 ± 1.09 80.2 ± 1.9 83.6 ± 3.7
winnow-h 20.1 ± 2.0 69.4 ± 6.9 58.6 ± 12.8 78.1 ± 4.4 57.1 ± 4.1 71.6 ± 2.0 64.6 ± 10.4 66.4 ± 13.0 45.2 ± 11.5 98.4 ± 0.8 82.0 ± 26.3 43.3 ± 20.0 98.4 ± 4.9 76.9 ± 18.3 97.2 ± 2.5 100.0 ± 0.0 74.0 ± 6.9 89.8 ± 1.8 80.6 ± 1.4 86.8 ± 9.7 89.4 ± 2.9 94.7 ± 1.7 87.8 ± 3.8 93.8 ± 1.7 92.4 ± 3.6 39.6 ± 6.1 94.0 ± 2.6 88.2 ± 5.8 49.6 ± 5.3 75.4 ± 6.7
Table 3.1: Mean accuracy and standard deviation of the 10-fold cv experiments on the 33 tasks for all algorithms. Highest scores are printed in bold-face.
Figure 3.2: Accuracies on the 33 data sets of the hybrid maxent-h plotted against the accuracies of k-nn (left) and against maxent (right).
We see that k-nn has a macro-average of 83.4%, which is higher than the average of 81.0% of maxent. The hybrid maxent-h has a macro-average of 83.7%, which is higher than both parents. We plot the accuracies of the hybrid maxent-h against both parents. The graph on the left in Figure 3.2 displays the accuracies of maxent-h plotted against the accuracies of k-nn; each point represents one of the 33 data sets. We observe that the points center around the diagonal of the graph, indicating that both algorithms have a similar performance for most data sets. The graph on the right presents the accuracies of maxent-h plotted against maxent. We observe a majority of points to the left of the diagonal, which means that for these data sets the hybrid maxent-h has a higher accuracy than maxent. In this graph the points are located further away from the diagonal than in the graph on the left. This indicates that there are more data sets for which the accuracy of the hybrid differs from the accuracy of maxent than is the case for the comparison against k-nn.

We perform paired t-tests on the results of the hybrid and both parents to estimate the significance of the observed differences. Table 3.2 shows the results of paired t-tests summarized over all 33 data sets between the hybrid maxent-h, k-nn and maxent. The counts in the column labeled "won/tied/lost" are based on paired t-tests; a significant difference with p < 0.05 is counted as "lost" or "won" depending on which algorithm has the higher average accuracy. The counts for "tied" represent the cases in which no significant difference with p < 0.05 is found. The table shows that no significant difference is found between maxent-h and k-nn in 23 cases. The hybrid wins four times against k-nn while losing six times. maxent-h outperforms its parent maxent 15 times and only loses once. When comparing the two parents to each other, k-nn wins 12 times against maxent and loses three times.
algorithm pair       won/tied/lost
maxent-h – k-nn      4/23/6
maxent-h – maxent    15/17/1
k-nn – maxent        12/18/3

Table 3.2: Summarized results of significance tests on the accuracies of the 10-fold cv experiments on the 33 data sets. maxent-h is compared to its two parent algorithms.
Thus, the hybrid maxent-h has a generalization performance quite similar to that of its parent k-nn. The hybrid performs better than its parent maxent, just as k-nn also performs better than maxent.
3.2 k-nn and rule induction
In this section we give a description of the hybrid algorithm that combines rule induction with k-nn and present experimental results. (Rule induction algorithms are introduced in Section 2.1.3.) A guiding principle of most rule induction algorithms is the minimal description length principle (Cohen, 1995; Quinlan, 1993): the extracted model should be as simple and compact as possible while attaining the best possible generalization performance. Typically, induced rules have a high level of abstraction. In contrast, k-nn does not follow the minimal description length principle and does not perform any abstraction, but stores all instances in memory instead. In terms of the models of the algorithms, the difference between the rule induction algorithm and k-nn is not as big as one might think. There is a continuum between rules and instances: rules can be considered as generalized instances, and instances can be considered as specific rules. This relationship between rules and instances implies that the classification method of k-nn can be applied to rules straightforwardly; classification can be based on the k nearest rules. Furthermore, a rule generalizes over a set of training instances, and in this sense the information contained in a rule can be described as more global than the local neighborhood information used in k-nn. For the construction of the hybrid we first train a rule induction algorithm on the training part of the data set. We optimize the algorithmic parameters of the rule learner to obtain the best possible rule set for this particular data set. In the next step we convert the model of the rule learning algorithm to the model of k-nn. More concretely, we map the rule set to the instance base. Previous work on this mapping can be found in van den Bosch (2004a). This mapping works as follows. The rule set is represented as a vector of bits (binary
rule-features), where each bit represents a certain rule. For each instance we create such a vector and activate the bits of the rules that apply to this instance. An example of this process is depicted in Figure 3.3. It shows four rules from a rule set and five instances that consist of five features and a class label (yes/no). For each instance a bit vector is created in which each binary feature denotes one of the rules. Next, each instance is matched against each of the rules. When a rule fires for an instance, the bit representing this rule receives a value of one, and zero otherwise. Some instances will have multiple active bits while others might have zero active bits (i.e. they match no rule, or only the default rule). Note that we only use the condition part of the rule; the information of the class labels attached to the rules is not conveyed in the binary feature vector. This implies that bits representing rules that predict different class labels can co-occur in an instance. These class labels can even be different from the actual class of the instance. In the example, the last instance has two active binary rule-features: bit one represents rule R1, which points to class label 'yes', while bit three represents rule R3, which points to class 'no'.

Figure 3.3: Conversion of rules to binary features for each instance. Each binary rule-feature represents whether a particular rule fires for this instance.

We create two versions of the hybrid algorithm. In the first version the binary rule-features replace the original features in the instances. In other words, this operation transforms the original feature space into a new one (Sebag and Schoenauer, 1994). We convert all training and test instances to this binary format and feed them to the memory-based learner. We refer to this hybrid as rules-r-h, where the middle r denotes replace. From the k-nn perspective, rules-r-h attempts to repair k-nn's sensitivity to noise and irrelevant features, since induced rules will typically not cover low-frequency noise and will not test on irrelevant features. Replacing the original features of the instances by rule-features can thus be considered as a compression and noise filtering step in which the rule induction algorithm has grouped interacting feature values together, something the normal k-nn algorithm is incapable of. From the perspective of the rule induction algorithm, we no longer have the simple classification strategy of taking the class of the rule that fires first, but the local classification method of k-nn.
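The conversion itself can be sketched as follows. A rule is represented here simply as a dictionary of feature tests, and the function returns the bit vector of rules that fire on an instance; rules-r-h would use this vector instead of the original features, rules-a-h in addition to them. The rule conditions correspond to the four example rules of Figure 3.3.

def rule_fires(condition, instance):
    """A rule fires when all of its feature tests hold for the instance."""
    return all(instance[i] == value for i, value in condition.items())

def rule_features(rule_conditions, instance):
    """Binary rule-feature vector: one bit per rule, 1 if the rule fires."""
    return [1 if rule_fires(condition, instance) else 0
            for condition in rule_conditions]

# The four example rules of Figure 3.3, conditions only (class labels are
# deliberately not encoded in the bit vector).  Indices are 0-based: F1 -> 0.
rules = [{2: "T", 3: "A"},          # R1: if F3=T and F4=A
         {2: "G", 4: "A"},          # R2: if F3=G and F5=A
         {0: "C"},                  # R3: if F1=C
         {0: "T", 4: "T"}]          # R4: if F1=T and F5=T

print(rule_features(rules, ["C", "G", "T", "A", "T"]))  # -> [1, 0, 1, 0]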
In the hybrid, rules are represented as active or inactive binary features, and more than one rule can be active for a particular instance. As k-nn can be used with k > 1, the nearest neighbors can also contain different active rules that are applied to the new instance. Several rules may thus be involved in the classification, instead of the one rule that determines the class in a standard ripper classifier.

In the other version of the hybrid, the binary rule-features are added to the original instances instead of replacing them. We refer to this hybrid as rules-a-h, where a stands for add. Thus, this hybrid is a k-nn classifier with extra added features that represent the per-instance firing patterns of the induced rule set. In this case the rule-features cannot be considered as a compression and noise filtering step, but we still expect that the rule-features can repair k-nn's sensitivity to noise in a more implicit way. As shown in Section 2.1.1, feature weighting in k-nn gives a higher weight to more important features. As many of the created rule-features will have a strong predictive power, they are likely to receive high feature weights, enabling them to influence the distance calculation. We hypothesize that the extra rule-features can overrule the influence of noise and irrelevant features.

In Section 3.2.1 we describe the results of comparative experiments in which we test the generalization performance of the two rule hybrid algorithms rules-r-h and rules-a-h against the performance of their two parent algorithms, the rule induction algorithm ripper and k-nn. In Section 3.2.2 we discuss additional comparative experiments with variants of the rule hybrids that combine another rule induction algorithm, cn2, with k-nn. In Section 2.1.3 we described ripper and cn2. Both rule induction algorithms are sequential covering algorithms and can produce unordered rule sets, but they use slightly different strategies for learning these rules. ripper learns one rule at a time and removes all instances covered by the rule from its learning set. cn2 also produces one rule at a time, but only removes the positive examples covered by the rule. This implies that cn2 potentially produces larger rule sets and, more importantly, that rules can overlap to a higher extent in the instances that they cover. We investigate whether rule hybrids that combine cn2 with k-nn can benefit from having a larger and more overlapping set of binary rule-features compared to the hybrids that combine ripper with k-nn.
3.2.1 Comparison between the rule hybrids, rules and k-nn
In this section we describe the results of comparative experiments in which we measure the generalization performance of the hybrids that combine rule induction with k-nn classification. We compare the performance of the hybrids to the performance of their parents and present the outcomes of paired t-tests. Table 3.1 shows the mean accuracies and standard deviations of the hybrids and their parent algorithms on the 29 UCI benchmark tasks and the four NLP tasks.
3.2 k-nn and rule induction 100
100
90
90
80
80
70
70
accuracy rules-r-h
accuracy rules-r-h
47
60 50 40 30
60 50 40 30
20
20
10
10
0
0 0
10
20
30
40 50 60 accuracy k-nn
70
80
90
100
0
10
20
30
40 50 60 accuracy rules
70
80
90
100
Figure 3.4: Accuracies on the 33 data sets of the hybrid rules-r-h plotted against the accuracies of k-nn (left) and against rules (right).
averages in the last row of the table show that the hybrid rules-r-h has an average of 77.5% which is lower than both parents. rules has an average of 78.2% and k-nn has a macro-average of 83.4%. The hybrid rules-a-h has a higher average of 83.6% than both parent algorithms. Figure 3.4 displays the accuracies of the hybrid rules-r-h plotted against the accuracies of both parent algorithms on the 33 data sets. The points in the graph on the left are located to the right of the diagonal which indicates that k-nn has a higher accuracy for most data sets than the hybrid. The points in the graph on the right are very close to the diagonal which means that the accuracies of rules and rules-r-h are very similar for many data sets. The accuracies of the hybrid rules-a-h and both parents are shown in Figure 3.5. The graph on the left displays the accuracies of the hybrid plotted against k-nn. We observe that the points are somewhat scattered around the diagonal, indicating that both algorithms have a quite equal performance. In the graph on the right the points are located to the left of the diagonal which means that the hybrid rules-a-h has a higher accuracy for almost all data sets. Table 3.3 summarizes the outcomes of paired t-tests on mean accuracies on all 33 data sets. The hybrid rules-r-h has a significantly lower accuracy than k-nn in 17 cases and only wins twice. The hybrid performs rather similar to rules as no significant difference is found on 26 of the data sets. The hybrid rules-a-h performs similarly to k-nn as there is no significant difference between them in 27 cases. However, rules-a-h wins five times against k-nn and loses only once. Also, most interestingly, rules-a-h performs much better than rules; the hybrid wins on 20 data sets against rules and is never outperformed by the rule learner.
Figure 3.5: Accuracies on the 33 data sets of the hybrid rules-a-h plotted against the accuracies of k-nn (left) and against rules (right).

algorithm pair       won/tied/lost
rules-r-h – k-nn     2/14/17
rules-r-h – rules    3/26/4
rules-a-h – k-nn     5/27/1
rules-a-h – rules    20/13/0
k-nn – rules         18/12/3

Table 3.3: Summarized results of paired t-tests on mean accuracy obtained in 10-fold cv experiments on the 33 data sets. The rule hybrids are compared to their two parent algorithms.
rules-r-h has a generalization performance similar to that of the rule induction algorithm, while the other rule hybrid, rules-a-h, performs much better than rules. Our expectation was that the hybrids would differ from rules in classification behavior. We based this expectation on the fact that more than one binary rule-feature can be active in the feature representation of the hybrid, and that larger k values allow several nearest instances with different active bits to be involved. We looked at the automatically selected algorithmic parameters of the two rule hybrids. In 70% of the experiments with rules-r-h the automatic algorithmic parameter selection chose k = 1, meaning that the potential benefit of k-nn classification is not fully exploited. The hybrid rules-a-h uses k = 1 in only 23% of the experiments, apparently profiting from the possibility to set k > 1 in many cases.
                            rules-r-h_ripper   rules-r-h_cn2
average unique instances    547.6              1413.7
bits per instance           67.5               290.4
active bits per instance    1.3                1.5

Table 3.4: Statistics that show some of the effects of choosing ripper or cn2 as rule induction algorithm. The numbers represent averages over the ten training folds, macro-averaged over the 29 UCI data sets.

algorithm pair                        won/tied/lost
ripper – cn2                          9/18/2
rules-r-h_ripper – rules-r-h_cn2      10/15/4
rules-a-h_ripper – rules-a-h_cn2      8/21/0
rules-r-h_cn2 – k-nn                  2/10/17
rules-r-h_cn2 – cn2                   7/17/5
rules-a-h_cn2 – k-nn                  2/19/8
rules-a-h_cn2 – cn2                   14/12/3
rules-r-h_ripper – k-nn               2/14/13
rules-r-h_ripper – ripper             3/23/3
rules-a-h_ripper – k-nn               4/24/1
rules-a-h_ripper – ripper             16/13/0

Table 3.5: Summarized results of paired t-tests on mean accuracy obtained in 10-fold cv experiments on the 29 UCI data sets. The rule hybrid variants with cn2 are compared to their two parent algorithms and to the hybrid variants with ripper.
3.2.2 Comparing the two rule learners
During the experiments with the hybrids rules-r-h and rules-a-h we also investigated the bit vectors that represent the rule sets produced by ripper. We noted that the bit vectors are often quite short and that the number of active bits is low. For the hybrid rules-r-h, short vectors and a low number of active bits have the effect that the number of unique instances in the training set drops drastically. In this section we discuss additional experiments in which we try a second rule induction algorithm, cn2, which produces larger rule sets containing rules that can overlap in the instances that they cover. Further information about cn2 can be found in Section 2.1.3. We expect that the larger rule sets of cn2 will lead to a more diverse set of instances as input for the hybrids, and in turn will improve the performance of the rule hybrids. We run 10-fold cv experiments with cn2 and with the hybrids based on cn2 on the 29 UCI data sets and compare the results against the results obtained with ripper. (Due to time constraints we did not run experiments on the larger NLP data sets.) In this section we refer to the rule hybrids that combine ripper with k-nn as rules-r-h_ripper and rules-a-h_ripper. The hybrids that have cn2 as parent algorithm are called rules-r-h_cn2 and rules-a-h_cn2.

Table 3.4 displays statistics that demonstrate some of the effects of using ripper or cn2 on the hybrid rules-r-h. We calculate for each data set the number of unique instances (rule vectors) in the training set of rules-r-h, averaged over the ten folds. The table lists macro-averages over the 29 UCI data sets. We observe that cn2 indeed produces a more diverse set of training instances for the hybrid rules-r-h. As a comparison we also computed the macro-average number of unique instances in the original training data. This average is 4507.3, so the cn2 variant still reduces the number of unique instances drastically. The fact that cn2 produces larger rule sets can be seen in the second row of the table: the average number of bits (rules) per instance is larger for the cn2 variant than for the ripper variant. The last row of the table displays the average number of active bits per instance. Here we see that although the cn2 variant has much larger vectors, this does not result in a much higher number of active bits. cn2 makes specific rules that are only active for a small number of instances; ripper, on the other hand, makes relatively smaller rule sets whose rules have a higher coverage.

Next we look at the performance of cn2 and the hybrid algorithms. Table 3.6 shows the mean accuracies and standard deviations of the two parent algorithms k-nn and cn2, and of the two hybrids, on the 29 UCI data sets. The last row of the table lists macro-averages over the 29 data sets. The hybrid rules-r-h_cn2 has the same average as cn2 and a lower average than k-nn. rules-a-h_cn2 has a lower macro-average than k-nn and a higher average than cn2. The results of ripper and its hybrids are already discussed above and are listed in Table 3.1.

data set       k-nn           cn2            rules-r-h_cn2   rules-a-h_cn2
abalone        24.6 ± 2.8     24.1 ± 2.6     21.8 ± 1.8      21.7 ± 1.5
audiology      80.5 ± 6.3     77.8 ± 5.2     80.5 ± 5.4      83.1 ± 6.0
bridges        54.7 ± 10.6    56.7 ± 7.5     56.8 ± 11.2     55.8 ± 15.0
car            96.5 ± 1.3     96.4 ± 1.8     93.3 ± 1.4      93.5 ± 1.8
cl-h-disease   55.7 ± 5.4     52.8 ± 8.8     44.5 ± 7.9      51.5 ± 7.2
connect4       77.7 ± 1.8     74.1 ± 0.6     72.3 ± 2.6      77.9 ± 1.3
ecoli          79.5 ± 4.9     72.1 ± 10.4    73.2 ± 8.8      77.1 ± 5.7
flag           66.9 ± 11.4    58.5 ± 12.9    56.0 ± 13.0     63.7 ± 13.7
glass          67.7 ± 8.4     60.7 ± 9.4     62.1 ± 9.5      62.1 ± 10.3
kr-vs-kp       96.8 ± 1.2     98.6 ± 0.9     98.4 ± 1.0      99.2 ± 0.4
letter         95.6 ± 0.5     67.7 ± 0.9     69.6 ± 0.7      95.5 ± 0.5
lung-cancer    33.3 ± 12.9    36.7 ± 22.1    34.2 ± 16.0     40.8 ± 20.6
monks1         100.0 ± 0.0    100.0 ± 0.0    100.0 ± 0.0     100.0 ± 0.0
monks2         94.0 ± 11.8    57.6 ± 5.4     51.6 ± 9.0      57.4 ± 9.8
monks3         97.2 ± 2.5     97.2 ± 2.5     91.4 ± 13.6     96.5 ± 2.6
mushroom       100.0 ± 0.0    100.0 ± 0.0    100.0 ± 0.0     100.0 ± 0.0
nursery        99.4 ± 0.5     98.1 ± 0.3     98.4 ± 0.3      98.7 ± 0.3
optdigits      98.0 ± 0.7     71.2 ± 2.5     75.7 ± 2.7      98.0 ± 0.3
pendigits      93.4 ± 0.9     79.0 ± 1.0     79.7 ± 1.0      91.3 ± 1.8
promoters      87.0 ± 7.1     75.4 ± 11.8    77.5 ± 7.2      88.6 ± 9.2
segment        95.7 ± 0.9     83.5 ± 4.5     86.9 ± 4.3      94.9 ± 1.8
solar-flare    94.2 ± 2.2     94.9 ± 1.7     93.6 ± 1.8      93.7 ± 1.9
soybean-l      92.8 ± 4.2     92.5 ± 3.5     92.5 ± 3.0      94.1 ± 2.9
splice         95.3 ± 1.0     78.0 ± 5.8     87.9 ± 2.8      95.2 ± 0.9
tictactoe      95.8 ± 3.8     99.6 ± 0.7     99.8 ± 0.6      100.0 ± 0.0
vehicle        67.6 ± 4.5     44.2 ± 5.7     45.1 ± 4.5      62.2 ± 3.2
votes          95.2 ± 2.4     94.0 ± 2.9     93.6 ± 2.7      95.2 ± 3.2
wine           96.1 ± 2.6     91.0 ± 6.3     92.1 ± 5.7      94.3 ± 5.0
yeast          53.3 ± 2.9     43.7 ± 3.7     46.5 ± 4.0      48.3 ± 4.6
macro av.      82.2 ± 4.0     75.0 ± 4.9     75.0 ± 4.9      80.4 ± 4.5

Table 3.6: Mean accuracies and standard deviations on the 29 UCI tasks of the two rule hybrids that combine k-nn with the rule sets produced by cn2.

We perform paired t-tests on the results of ripper and cn2 on the 29 UCI tasks to estimate the significance of the differences between the results. The outcomes of paired t-tests between ripper and cn2 are shown in the first row of Table 3.5. We see that ripper performs better than cn2: ripper has a significantly higher score than cn2 on nine of the 29 data sets, while cn2 outperforms ripper only twice.

We also compare the hybrid algorithms that have cn2 as their parent rule learner to the hybrids with ripper as parent algorithm and perform significance tests. The second and third rows of Table 3.5 show the outcomes. The ripper hybrids outperform the cn2 hybrids. The hybrid rules-r-h_ripper wins on ten data sets, while rules-r-h_cn2 only wins on four data sets. rules-a-h_ripper wins on eight data sets against rules-a-h_cn2, and no significant difference is found on the other 21 data sets.

Next we compare the hybrids rules-r-h_cn2 and rules-a-h_cn2 to their parents cn2 and k-nn. The outcomes of the paired t-tests on the results of these algorithms are shown in rows four to seven of Table 3.5. We also list the outcomes on the 29 UCI data sets of paired t-tests on the results of the ripper hybrids and their parents in the last rows of the table. The relations between these 'ripper' pairs and the 'cn2' pairs in rows four to seven are quite alike. Both variants of rules-a-h are more similar to k-nn in their performance than to the rule learner. The two variants of rules-r-h both perform worse than k-nn and roughly equally to the rule induction algorithm. The difference between the hybrid-variant versus parent pairs is that we find more significant differences between the cn2 hybrids and their parents than between the ripper hybrids and their parents. This indicates
that the ripper hybrids have a closer resemblance to their parent algorithms in terms of their generalization performance.

In sum, the results presented in this section show that the hybrids that are based on the rule sets produced by ripper perform better than the hybrids that have cn2 as parent. The fact that cn2 produces larger rule sets, leading to larger bit vectors and more unique instances, does not produce better hybrids. ripper's method of producing a smaller rule set containing rules with a higher coverage leads to better-performing hybrids.
3.3 k-nn and winnow
In this section we describe the hybrid algorithm that combines multiplicative update learning with k-nn classification. Our motivation for merging winnow and k-nn is based on the difference in weight calculations. k-nn calculates feature weights on the basis of counts in the training data. As mentioned earlier, calculations based on low-frequency counts can be unreliable. The architecture of winnow consists of a network of weighted connections between input nodes and target nodes. winnow learns these weights in an iterative fashion, instance by instance, and also applies feature selection by ignoring feature values that have a frequency below two in the training set. The winnow algorithm is described in more detail in Section 2.1.2. A weighted connection between an input node and a target node as produced by winnow can directly be translated into a feature weight that a certain feature value has for a particular class. This link between the weighted connections and feature-value weights gives us the possibility to map the information conveyed in the model of winnow to the model of k-nn.

The mapping between the feature weights of k-nn and the weighted connections of winnow is not straightforward, as the types of weights are different: winnow creates relations between classes and feature values, while k-nn normally assigns a single weight to a multi-valued feature. We visualize this difference in Figure 3.6. On the right the weighting method of winnow is shown: each feature value and class have a weight assigned to their connection. On the left, the normal feature weighting method of k-nn is shown: k-nn assigns a weight to a (multi-valued) feature and not to each feature value separately. This weight expresses, with respect to the four weighting methods described in Section 2.1.1, the overall discriminative power of the feature in relation to the classes. Although k-nn is normally used with multi-valued features, it is not restricted to this format and can also easily be trained with binary features, as was shown by van den Bosch and Zavrel (2000). Our expectation is that the weights as learned by winnow are more robust than weights for binary features produced by k-nn, because winnow puts effort into learning these weights and ignores low-frequency observations. We regard weights such as the ones produced by winnow as more expressive than the weights for multi-valued features of k-nn, because all relations between classes and values are assigned a separate weight.
Figure 3.6: k-nn uses global feature weights indicating the importance of the feature for all classes. In contrast, winnow uses weights representing the importance of each feature value for each class.
Thus, we need a transformation to map the weights of winnow to the weights in k-nn. To construct the hybrid algorithm, we first train winnow, which produces a sparse network. For each class, the target node representing that class, the weighted connections attached to that target node, and the input nodes attached to those connections are mapped to a separate binary k-nn classifier. We map input nodes to instances on which we can train the k-nn classifier. The multi-valued features in the instances are translated to binary features, where each bit represents one feature value. Not all feature values of the original instance base are used as input nodes: the SNoW implementation of winnow used in our experiments performs an implicit feature selection. winnow does not make a weighted connection between a target node and a feature value when there is not at least one instance in which the class of the target node and the feature value occur together, and winnow does not make connections for feature values with a frequency below two. We only make binary features for the feature values selected by winnow. We also map the accompanying weights of winnow's connections to feature weights for k-nn.

For a two-class task we construct the hybrid algorithm as follows. winnow is trained on the training set to learn one target node representing the task. We map the input nodes and weights produced by winnow to binary feature representations as described above. The hybrid uses these binary features and weights to apply k-nn classification. Carlson et al. (1999) advise to represent a binary task with two target nodes. However, for the construction of winnow-h it is preferable to use only one bit. Creating a winnow classifier with two target nodes would result in two separate k-nn classifiers in the hybrid: the first classifier predicts class one or not, and the second classifier predicts class two or not. Basically these two classifiers predict the same thing, and for the case in which they give opposite predictions, the only available simple solution is random tie resolution (due to the algorithmic parameter selection process the classifiers can have different parameter settings, which influence the distance calculations between instances and prevent the use of the nearest-neighbor distances in tie resolution).

For multi-class tasks, the construction of the hybrid involves an extra mapping from a classification task with multiple class labels to a set of binary classifiers. We choose to encode the problem using error correcting output codes (ECOC). ECOC is a generic and elegant method to reduce a multi-class problem to a set of binary problems; Section 3.3.1 details this approach.
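As a minimal illustration of the two-class mapping described above (hypothetical data layout, not the actual SNoW or TiMBL internals), winnow's weighted connections can serve directly as feature-value weights in a weighted overlap distance:

```python
# Sketch (hypothetical weights and data): using winnow's weighted connections
# as per-value feature weights in a weighted overlap k-nn distance for a
# two-class task. Values never connected by winnow (e.g. frequency below two)
# are simply absent from the weight table and therefore ignored.

winnow_weights = {
    (0, "sunny"): 2.4,
    (0, "rainy"): 0.3,
    (1, "high"):  1.7,
    (1, "low"):   0.2,
}

def distance(test, memory_instance, weights):
    """Weighted overlap: a mismatch adds the weight of the value seen in the
    memory instance; values without a weight contribute nothing."""
    return sum(weights.get((i, mv), 0.0)
               for i, (tv, mv) in enumerate(zip(test, memory_instance)) if tv != mv)

memory = [(["sunny", "high"], "pos"), (["rainy", "low"], "neg")]
test = ["sunny", "low"]
# 1-nearest neighbor under the winnow-derived weights
label = min(memory, key=lambda m: distance(test, m[0], winnow_weights))[1]
print(label)   # -> 'neg' (the mismatch on a low-weighted value costs less)
```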
3.3.1 Error correcting output codes
Error correcting output codes are a method to encode a multi-class task as a set of binary tasks, allowing for post-hoc error correction (Dietterich and Bakiri, 1995). Each class is coded as a codebook bit vector of length l with maximal Hamming distance among all codebook vectors. For each bit a separate classifier is trained. To classify a new example, each of the l binary classifiers predicts one bit. The resulting bit vector is then compared to the codebook vectors that code the original classes. The codebook vector that most closely resembles the predicted vector assigns its class to the new example.

Kong and Dietterich (1995) explain that ECOC can increase performance on a multi-class task because it can correct errors if the errors in the bits are uncorrelated. It is also stated that ECOC is less effective for local learning algorithms, as the assumption of independence between the errors does not hold. In contrast, Aha and Bankert (1997) show that ECOC can be used successfully for local learners: the machine learning bias of the binary local learners can be varied so that the binary classifiers make partly independent mistakes. Aha and Bankert (1997) and Ricci and Aha (1998) use feature selection to create different biases for the binary local learners. Raaijmakers (2000) also uses algorithmic parameters to change the bias. We follow Raaijmakers and change the bias of the binary classifiers by setting the algorithmic parameters for each classifier individually. We also use a coarse type of feature selection, because winnow removes low-frequency feature values. The method used to choose the algorithmic parameters is described in Section 2.3.1.

Several methods have been proposed to generate ECOC codes, and finding an optimal coding is still an open problem (Masulli and Valentini, 2004). For classification tasks with fewer than 8 classes we use exhaustive encoding as described by Dietterich and Bakiri (1995). We construct exhaustive codes with a bit vector length of 2^(k-1) - 1, where k stands for the number of classes. The ECOC bit vector for the first class contains only active bits (all bits have value 1). The second vector first contains 2^(k-2) inactive bits, followed by 2^(k-2) - 1 active bits. The third vector first has 2^(k-3) inactive bits, then 2^(k-3) active bits, then 2^(k-3) inactive bits, followed by 2^(k-3) - 1 active bits. In general, the nth vector (n >= 2) consists of alternating runs of inactive and active bits, each run of size 2^(k-n), starting with inactive bits, until the vector is filled. An example of this encoding is shown in Table 3.7: a task with five classes is coded with an ECOC bit vector of 2^(5-1) - 1 = 15 bits.

class   binary coding
a       1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
b       0 0 0 0 0 0 0 0 1 1 1 1 1 1 1
c       0 0 0 0 1 1 1 1 0 0 0 0 1 1 1
d       0 0 1 1 0 0 1 1 0 0 1 1 0 0 1
e       0 1 0 1 0 1 0 1 0 1 0 1 0 1 0

Table 3.7: Exhaustive ECOC encoding of a five-class task in 15 bits.

For tasks with 8 or more classes we use a method called random dense codes, proposed by Allwein et al. (2000). This method has the following steps:

• calculate the size s of the bit vector, s = 10 * log2(number of classes);
• generate 10,000 k-by-s matrices, where k is the number of classes and s is the size of the bit vector, and fill each matrix randomly with 1s and 0s;
• for each matrix, measure the Hamming distance between all rows, and store the minimal distance found for that matrix;
• from the 10,000 matrices, choose the matrix with the highest minimal distance that does not contain identical columns.
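The exhaustive construction can be written down in a few lines. The sketch below is an illustration, not the thesis code; for five classes it reproduces the codebook of Table 3.7.

```python
# Sketch of exhaustive ECOC codebook construction for k < 8 classes, following
# the description above. The five-class output matches Table 3.7.

def exhaustive_ecoc(k):
    length = 2 ** (k - 1) - 1
    codebook = [[1] * length]                 # first class: all bits active
    for n in range(2, k + 1):
        run = 2 ** (k - n)                    # run length for the nth vector
        # alternating runs of 0s and 1s, starting with 0s, truncated to length
        codebook.append([(i // run) % 2 for i in range(length)])
    return codebook

for label, row in zip("abcde", exhaustive_ecoc(5)):
    print(label, " ".join(map(str, row)))
```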
3.3.2 Construction of the hybrid with ECOC
We use the following method to construct the hybrid algorithm for multi-class tasks. The class labels of a data set are converted to ECOC labels. For each ECOC bit a winnow classifier is built with one target node representing that bit. This classifier is trained on the training set and optimized with respect to its algorithmic parameters. The resulting model of winnow is then translated to k-nn as follows. For each winnow classifier that represents an ECOC bit, the input nodes and the weighted connections of winnow are translated to features and feature weights. We take the training instances and code them according to the features and weights produced by winnow. These instances with 'winnow weights' constitute the input of a binary k-nn classifier trained to classify the same ECOC bit as the winnow classifier. This implies that each binary classifier has a different feature selection and different weights. We call this hybrid, which combines winnow with k-nn, winnow-h.

The construction process is depicted in Figure 3.7. The first step is duplicating the training data in as many sets as there are ECOC bits. The example in the figure presents a four-class task which receives a seven-bit ECOC coding. The data is duplicated seven times, and the class labels of these seven data sets are adapted to ECOC coding by replacing the class labels with the ECOC bits. (This is similar to training winnow with seven target nodes and specifying seven class labels for each instance.) Next, we train a separate winnow classifier on each of the seven data sets. For each data set, a winnow-h classifier is produced by mapping the input features and weights as learned by winnow to k-nn. In the classification phase, the test instance is given to each of the seven winnow-h classifiers. Each classifier predicts a bit for the test instance, and the bits are put together to form a binary bit vector. This bit vector is compared to each of the codebook vectors that encode the classes, and the class connected to the codebook vector with the smallest Hamming distance is assigned to the test instance.

Figure 3.7: Each task is split into a set of binary tasks according to ECOC encoding, used both by winnow and winnow-h.

We use the ECOC coding scheme both for winnow and winnow-h to keep the two algorithms as similar as possible. This ECOC encoding has an effect on the parameter optimization step in the experimental setup. Some of the randomly generated ECOC codes cause an extremely skewed class distribution in some of the output bits. For these cases we put an extra constraint on the algorithmic parameter optimization process: we do not use optimization when the ratio between the classes is higher than 1:100. In those cases we simply use the default parameter settings.
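The per-bit training and Hamming-distance decoding described in this section can be sketched as follows; train_bit_classifier is a placeholder for training a single binary classifier (winnow or winnow-h in our case) and is assumed to return a function that maps an instance to a bit.

```python
# Sketch of the ECOC pipeline described above (placeholder per-bit trainer).

def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))

def ecoc_train(instances, labels, codebook, classes, train_bit_classifier):
    n_bits = len(codebook[0])
    classifiers = []
    for b in range(n_bits):
        # relabel every training instance with bit b of its class codeword
        bit_labels = [codebook[classes.index(y)][b] for y in labels]
        classifiers.append(train_bit_classifier(instances, bit_labels))
    return classifiers

def ecoc_classify(instance, classifiers, codebook, classes):
    predicted = [clf(instance) for clf in classifiers]   # one bit per classifier
    # assign the class whose codeword is closest in Hamming distance
    best = min(range(len(classes)), key=lambda c: hamming(predicted, codebook[c]))
    return classes[best]
```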
Figure 3.8: Accuracies on the 29 UCI data sets of the hybrid winnow-h plotted against the accuracies of k-nn (left) and against winnow (right).
3.3.3 Comparison between winnow-h, winnow and k-nn
In this section we present the results of a comparison between the hybrid winnow-h and its parents winnow and k-nn. We perform comparative experiments on the 29 UCI data sets. The hybrid and winnow both use the ECOC decoding scheme in these experiments. (We did not run experiments on the large NLP data sets due to an excessive overhead in processing time caused by ECOC encoding in combination with the large number of feature values.) The mean accuracies of the hybrid winnow-h and its parents are shown in Table 3.1. Figure 3.8 displays the accuracies of the hybrid plotted against both parents on the 29 UCI data sets. The graph on the left shows that for the majority of the data sets k-nn has a higher accuracy than winnow-h. In the graph on the right we observe slightly more points to the right of the diagonal, which indicates that for these data sets winnow has a higher accuracy. In both graphs we observe that many points are distant from the diagonal, which means that for many data sets the hybrid and its parent have a different accuracy.

We perform paired t-tests to determine significant differences in performance between the hybrid and its parents. Table 3.8 summarizes the outcomes of these tests. The hybrid algorithm performs significantly worse than both parent algorithms: winnow-h loses 17 times to k-nn and 8 times to winnow. winnow-h wins only twice against winnow and wins once against k-nn, namely on the kr-vs-kp data set.

algorithm pair      won/tied/lost
winnow-h – k-nn     1/11/17
winnow-h – winnow   2/19/8
k-nn – winnow       15/14/0

Table 3.8: Summarized results of paired t-tests on mean accuracy obtained in 10-fold cv experiments on the 29 UCI data sets. winnow-h is compared to its two parent algorithms.

As explained in Section 3.3.1 we use ECOC to encode multi-class tasks as binary tasks for winnow-h and winnow. We need this conversion to integrate the weights produced by the learning module of winnow in the model of k-nn, as detailed in Section 3.3. We discuss some additional experiments to ensure that the low performance of winnow-h is unrelated to this extra ECOC encoding step. We check the effect of ECOC encoding on the parent algorithm winnow. So far we have only reported on experiments with winnow with ECOC decoding (referred to here as winnow-with), to keep a close resemblance between the hybrid and the eager parent. We also ran similar 10-fold cv experiments with standard winnow without ECOC decoding (winnow-without) on the 29 UCI tasks, and conducted paired t-tests comparing these results to the results of winnow-with. The variant winnow-without is slightly better than winnow-with: winnow-with wins on three tasks (car, nursery, yeast), while winnow-without wins on six tasks (connect4, kr-vs-kp, letter, optdigits, pendigits, votes). The observation that winnow-without wins on two data sets that have two classes may seem strange, because ECOC decoding only applies to multi-class tasks. However, winnow-with and winnow-without also treat binary tasks differently: winnow-with uses one target node to represent the two classes, while standard winnow without ECOC uses two target nodes to represent a binary problem, as explained in Section 3.3 and as recommended by Carlson et al. (1999). Using one target node to represent a two-class task might have a slightly negative effect on the performance of winnow-with, but this is only speculation: on the other six two-class tasks no significant negative effect is found.

It is also possible that the choice of representing a two-class task with one target node in winnow has a negative effect on the hybrid winnow-h. winnow uses a threshold to differentiate between the two classes. As an example we consider a two-class task for which winnow creates a network. The sum of the weights connected to input features that are strongly predictive for class A has to be below the threshold to predict class A. Another group of feature values, which are correlated with class B, receive a high weight, as
their summed weight needs to be above the threshold to predict class B. This implies that the weights as assigned by winnow do not express importance in the same manner as the feature weighting methods used in our k-nn implementation do. It need not be the case that features with a high predictive capacity receive a high weight: in winnow, an important predictive feature for the class below the threshold will receive a low weight. Low-weighted features are regarded as less important in the search for a nearest neighbor set. The use of winnow's weights can thus cause the k-nn algorithm to find a less optimal set of neighbors, which may have the effect of predicting the incorrect class. We consider this effect of using weights that are tuned to a threshold as the possible reason for the low performance of the hybrid winnow-h. However, the two tasks for which the hybrid outperforms its parent winnow have two classes, and for these data sets the winnow weights clearly work well as feature weights. Furthermore, on one of these tasks, kr-vs-kp, the hybrid also performs significantly better than k-nn. Again the data does not provide enough evidence to draw conclusions.
3.4 Additional Comparisons
In this section we discuss three additional comparisons which give extra information relevant to the main comparative experiments presented in the previous part of this chapter. In subsection 3.4.1 we present the experimental results of the four parent algorithms: k-nn classification, rule induction, maximum entropy modeling and multiplicative update learning. We compare the performance of these four algorithms to each other. Part of these comparisons, the comparisons between k-nn and the eager parents, has been discussed incidentally in the previous sections. To be complete, in this section we put all algorithms next to each other and present the results. In subsection 3.4.2 we pay attention to accuracy, the evaluation method used in our experiments, and compare it to another popular evaluation method, AUC. In subsection 3.4.3 we analyze the results of the algorithms on the NLP data sets and investigate whether they differ from the results obtained on all tasks.
3.4.1 Comparison between the basic algorithms
We compare the results of the four parent algorithms: k-nn, rules, maxent and winnow. To determine which algorithm has a significantly higher generalization performance, we perform paired t-tests between all pairs of accuracy scores listed in the first four columns of Table 3.1. For each pair of algorithms we count how often a significant difference between the algorithms is found.
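As an illustration of this procedure, a paired t-test over the per-fold accuracies of two algorithms can be computed as follows (the fold scores below are made up for the example and are not results from the thesis):

```python
# Sketch: paired t-test over per-fold accuracies of two algorithms on the
# same 10 folds (illustrative numbers only).
from scipy.stats import ttest_rel

knn_folds    = [82.1, 80.4, 83.0, 81.7, 79.9, 82.5, 80.8, 81.2, 83.4, 80.0]
maxent_folds = [79.8, 78.9, 81.2, 80.1, 78.4, 80.7, 79.0, 79.5, 81.0, 78.2]

t_stat, p_value = ttest_rel(knn_folds, maxent_folds)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")   # count a win if p < 0.05
```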
           k-nn        maxent      rules       winnow
k-nn       –           3/18/12     3/12/18     0/14/15
maxent     12/18/3     –           6/13/14     0/17/12
rules      18/12/3     14/13/6     –           2/19/8
winnow     15/14/0     12/17/0     8/19/2      –

Table 3.9: Comparison of the four parent algorithms. Summary of paired t-tests on the mean accuracy and standard deviation of the 10-fold cv experiments on the 33 tasks, and on 29 data sets for winnow. Each cell represents the number of times that the classifier in the column won/tied/lost compared to the classifier in the row.
Table 3.9 summarizes the results of the significance tests on the 33 tasks, and on 29 tasks for the pairs involving winnow. k-nn wins against maxent on 12 of the 33 data sets, and in three cases k-nn loses to maxent. k-nn wins on 18 data sets against rules and loses three times. Compared to winnow, k-nn wins on 15 data sets and never loses. maxent wins 14 times against rules, while rules wins on six data sets. maxent wins against winnow on 12 data sets and no significant difference is found on the remaining 17 data sets. rules wins eight times against winnow and loses twice. In sum, k-nn performs best of all algorithms, maxent performs better than rules and winnow, and rules outperforms winnow.
3.4.2 AUC compared to Accuracy
Accuracy is our main evaluation method for the generalization performance of the algorithms, and the machine learning algorithms are also optimized with respect to accuracy. Recently, AUC has become a popular measure in the area of machine learning (Fawcett, 2003; Flach, 2004), and for this reason we also want to compare our results with AUC scores. We calculate micro-averaged AUC scores (Provost and Domingos, 2003) on the basis of the same results of the 10-fold cv experiments used for the accuracy calculations. We perform paired t-tests on these AUC scores and compare the results to the results on accuracy scores (Huang and Ling, 2005). Table 3.10 shows the micro-averaged AUC scores on the 33 data sets for all algorithms. We also repeat the paired t-tests to compare the parent algorithms and hybrids; this time, however, we base the significance tests on mean micro-averaged AUC scores. Table 3.11 summarizes the results of the significance tests on mean accuracies next to the results on mean AUC scores. The won/tied/lost counts based on AUC scores do not deviate much from the counts based on accuracy scores. There are only minor differences between evaluating with AUC or with accuracy: the largest difference found is for the winnow-h – winnow pair, where three data sets get a different outcome from the paired t-test.
data set      k-nn          maxent        rules         winnow        maxent-h      rules-r-h     rules-a-h     winnow-h
abalone       56.7 ± 1.5    56.4 ± 0.9    51.2 ± 0.5    56.5 ± 0.7    55.9 ± 1.2    51.2 ± 0.5    56.9 ± 1.2    55.1 ± 1.0
audiology     89.0 ± 3.8    88.9 ± 2.3    86.6 ± 4.6    84.1 ± 4.4    89.6 ± 3.1    75.8 ± 4.9    89.6 ± 3.0    81.9 ± 4.2
bridges       68.6 ± 6.6    74.5 ± 7.7    68.5 ± 8.8    66.7 ± 5.8    70.5 ± 6.7    67.2 ± 11.8   68.8 ± 7.7    70.7 ± 7.4
car           97.6 ± 1.0    92.5 ± 1.7    97.8 ± 0.9    91.5 ± 1.9    97.2 ± 0.7    95.7 ± 2.8    98.9 ± 0.5    79.8 ± 3.7
c-h-disease   64.4 ± 5.0    66.4 ± 4.3    69.0 ± 3.9    59.6 ± 5.8    62.9 ± 5.3    69.0 ± 3.9    67.8 ± 4.4    66.6 ± 4.7
connect4      74.7 ± 1.6    70.7 ± 0.6    71.0 ± 5.9    73.5 ± 0.9    74.5 ± 2.1    69.3 ± 5.7    76.1 ± 6.1    75.5 ± 1.1
ecoli         86.2 ± 3.1    84.2 ± 4.9    79.6 ± 6.6    83.0 ± 3.6    85.8 ± 2.2    81.3 ± 7.6    85.1 ± 3.9    78.2 ± 5.6
flag          79.7 ± 7.2    81.5 ± 7.7    75.2 ± 4.9    79.8 ± 7.4    80.8 ± 8.6    75.4 ± 4.2    79.3 ± 6.7    80.6 ± 7.7
glass         76.3 ± 6.5    77.8 ± 8.2    72.2 ± 5.5    71.7 ± 7.9    72.7 ± 6.8    71.5 ± 5.9    76.2 ± 8.1    68.6 ± 6.7
kr-vs-kp      96.7 ± 1.2    96.8 ± 0.6    99.2 ± 0.5    93.7 ± 1.4    99.1 ± 0.3    99.2 ± 0.5    99.1 ± 0.5    98.4 ± 0.8
letter        97.7 ± 0.2    92.2 ± 0.4    86.4 ± 0.8    82.0 ± 1.0    97.9 ± 0.4    86.8 ± 0.7    97.7 ± 0.2    90.6 ± 13.7
lung-cancer   50.0 ± 12.4   55.8 ± 19.4   53.3 ± 19.8   48.3 ± 17.4   60.0 ± 14.8   45.8 ± 10.0   56.7 ± 13.3   55.8 ± 13.5
monks1        100.0 ± 0.0   74.9 ± 4.3    99.2 ± 2.2    74.2 ± 4.6    93.7 ± 10.4   100.0 ± 0.0   100.0 ± 0.0   98.5 ± 4.6
monks2        91.9 ± 15.6   48.6 ± 2.3    59.6 ± 11.0   55.4 ± 11.0   94.7 ± 10.2   66.3 ± 18.6   96.5 ± 4.8    72.9 ± 22.4
monks3        97.4 ± 2.4    97.4 ± 2.4    97.4 ± 2.4    96.3 ± 5.2    97.4 ± 2.4    96.6 ± 2.6    97.4 ± 2.4    97.4 ± 2.4
mushroom      100.0 ± 0.0   100.0 ± 0.0   100.0 ± 0.0   100.0 ± 0.1   100.0 ± 0.0   98.8 ± 2.4    100.0 ± 0.0   100.0 ± 0.0
nursery       99.6 ± 0.4    94.5 ± 0.3    98.3 ± 0.6    93.2 ± 0.4    98.5 ± 0.8    98.5 ± 0.5    99.5 ± 0.3    85.5 ± 2.9
optdigits     98.9 ± 0.4    97.7 ± 0.4    94.1 ± 0.5    94.5 ± 0.7    98.5 ± 0.3    94.4 ± 0.6    98.8 ± 0.3    94.3 ± 1.0
pendigits     96.4 ± 0.5    92.2 ± 0.7    90.3 ± 0.9    88.6 ± 1.1    95.6 ± 0.8    89.8 ± 1.9    95.8 ± 0.8    89.2 ± 0.8
promoters     86.8 ± 7.7    92.2 ± 9.5    79.5 ± 8.0    86.7 ± 12.1   93.7 ± 8.8    80.4 ± 8.7    82.9 ± 12.2   86.6 ± 10.1
segment       97.5 ± 0.5    95.4 ± 1.7    94.5 ± 2.1    94.5 ± 1.3    97.6 ± 0.6    94.5 ± 2.1    97.6 ± 0.7    93.7 ± 1.7
solar-flare   55.6 ± 7.7    50.9 ± 2.2    49.7 ± 0.4    53.3 ± 3.3    52.9 ± 4.6    49.7 ± 0.4    53.6 ± 7.6    50.4 ± 1.9
soybean-l     95.9 ± 2.4    95.6 ± 1.6    95.0 ± 1.7    94.7 ± 2.4    96.1 ± 1.8    95.4 ± 1.8    95.9 ± 2.0    93.1 ± 2.2
splice        96.6 ± 0.8    95.9 ± 0.7    95.3 ± 1.2    96.0 ± 1.3    96.2 ± 1.2    95.5 ± 1.0    96.9 ± 0.5    95.7 ± 1.1
tictactoe     95.3 ± 4.2    97.6 ± 1.0    99.5 ± 1.1    97.1 ± 2.0    100.0 ± 0.0   100.0 ± 0.0   100.0 ± 0.0   90.4 ± 3.5
vehicle       78.4 ± 3.0    75.8 ± 3.4    70.6 ± 3.1    72.5 ± 3.9    77.5 ± 3.7    70.8 ± 3.3    76.1 ± 3.7    60.5 ± 3.2
votes         95.3 ± 2.8    96.6 ± 2.4    94.7 ± 1.6    94.1 ± 3.5    95.7 ± 2.2    93.6 ± 3.3    95.2 ± 1.8    94.6 ± 2.3
wine          96.8 ± 2.2    95.8 ± 4.2    94.2 ± 5.8    95.5 ± 3.9    97.4 ± 2.3    93.8 ± 6.5    96.3 ± 4.2    91.2 ± 4.2
yeast         68.6 ± 2.0    65.8 ± 3.0    59.1 ± 2.8    65.8 ± 2.5    69.8 ± 2.0    57.6 ± 2.2    67.8 ± 3.1    66.5 ± 3.7
pos           97.2 ± 0.1    95.9 ± 0.1    81.9 ± 0.3    –             96.5 ± 0.1    82.0 ± 0.3    97.3 ± 0.1    –
chunk         97.2 ± 0.1    96.9 ± 0.1    95.5 ± 0.4    –             97.2 ± 0.1    95.2 ± 0.2    97.2 ± 0.1    –
ner           96.1 ± 2.0    95.2 ± 0.2    80.6 ± 8.8    –             96.4 ± 0.9    79.1 ± 7.2    96.5 ± 1.7    –
pp            80.1 ± 1.3    81.7 ± 1.0    77.6 ± 2.0    –             78.2 ± 0.9    78.1 ± 1.5    80.1 ± 2.0    –
macro av.     86.6 ± 3.2    84.1 ± 3.0    82.3 ± 3.6    80.8 ± 4.1    87.0 ± 3.2    81.8 ± 3.7    87.1 ± 3.1    81.5 ± 4.8

Table 3.10: Mean micro-averaged AUC scores and standard deviations on the 33 tasks (and 29 tasks for winnow and winnow-h).

algorithm pair        accuracy won/tied/lost    micro-AUC won/tied/lost
k-nn – maxent         12/18/3                   13/17/3
k-nn – rules          18/12/3                   17/13/3
maxent – rules        14/13/6                   15/11/7
k-nn – winnow         15/14/0                   15/14/0
maxent – winnow       12/17/0                   11/17/1
rules – winnow        8/19/2                    8/17/4
maxent-h – k-nn       4/23/6                    4/23/6
maxent-h – maxent     15/17/1                   15/16/2
rules-r-h – k-nn      2/14/17                   3/13/17
rules-r-h – rules     3/26/4                    4/27/2
rules-a-h – k-nn      5/27/1                    4/28/1
rules-a-h – rules     20/13/0                   20/13/0
winnow-h – k-nn       1/11/17                   1/13/15
winnow-h – winnow     2/19/8                    5/16/8

Table 3.11: The results of paired t-tests on mean accuracies compared to paired t-tests on micro-averaged AUC scores.
The AUC scores support the observations made on the basis of the accuracy scores.
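For reference, one common way to compute a micro-averaged AUC for a multi-class task is to binarize the labels one-versus-rest and compute a single AUC over the flattened arrays. The sketch below uses toy scores and need not match the exact procedure of Provost and Domingos (2003) in every detail.

```python
# Sketch: micro-averaged AUC via one-vs-rest binarization and flattening
# (illustrative only; toy labels and scores).
import numpy as np
from sklearn.preprocessing import label_binarize
from sklearn.metrics import roc_auc_score

classes = ["a", "b", "c"]
y_true  = ["a", "b", "c", "a", "b", "c"]
y_score = np.array([[0.7, 0.2, 0.1],
                    [0.1, 0.6, 0.3],
                    [0.2, 0.2, 0.6],
                    [0.5, 0.3, 0.2],
                    [0.3, 0.5, 0.2],
                    [0.1, 0.1, 0.8]])   # per-class scores from a classifier

y_bin = label_binarize(y_true, classes=classes)          # shape (6, 3)
micro_auc = roc_auc_score(y_bin.ravel(), y_score.ravel())
print(round(micro_auc, 3))
```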
3.4.3 Results on NLP data sets
In this section we focus on the results obtained on the four NLP data sets. As explained in Chapter 1, a marked characteristic of NLP tasks is that they contain many exceptions and sub-regularities which form a challenge for machine learning algorithms, and which appear to favor k-nn classification (Daelemans et al., 1999). Here we scrutinize the results of the three hybrids maxent-h, rules-r-h and rules-a-h on the four NLP tasks. The mean accuracies of the algorithms on the NLP data sets are shown in Table 3.1.

First we investigate the results of the hybrid maxent-h and compare these to the results of its parent algorithms. Table 3.12 shows the outcomes of significance tests between maxent-h and its parent algorithms on the four NLP tasks. We see that the hybrid loses twice to the parent algorithm k-nn, and twice no significant difference is found. Thus, the performance of maxent-h on the four NLP tasks deviates from its performance on all 33 tasks, where we observed that maxent-h performs similarly to k-nn. Compared to its eager parent maxent, the hybrid wins three times and loses once; the hybrid also performs better than maxent when all data sets are considered. When we compare the two parents to each other, we observe that k-nn performs better than maxent, similar to the observation on all data sets.

algorithm pair      won/tied/lost
maxent-h – k-nn     0/2/2
maxent-h – maxent   3/0/1
k-nn – maxent       2/1/1

Table 3.12: Summarized results of significance tests on the accuracies of the 10-fold cv experiments on the four natural language tasks. maxent-h is compared to its two parent algorithms.

The observation that k-nn performs better on the NLP data sets than the hybrid maxent-h may indicate that the mvdm–maxent distance metric is not the optimal choice for this particular type of task. The NLP data sets are characterized by exceptions, sub-regularities and a large average number of feature values per feature. The weights as produced by maxent seem not to capture all the low-frequency but important cases. This may be explained by the fact that maxent-h maximizes the entropy of the weights and smoothes over low-frequency events.

Next we investigate the results of the two hybrids that combine ripper with k-nn on the four tasks, and we compare their performance to that of their parents. Table 3.13 summarizes the outcomes of paired t-tests between the hybrids and their parents. On the four NLP data sets the performance of the algorithms is not very different from the performance on all 33 data sets. The hybrid rules-r-h performs worse than k-nn; it loses four times. rules-r-h has a performance similar to rules: in three cases no significant difference is found, and the hybrid loses once to rules. The results of the hybrid rules-a-h on the four NLP tasks match the observations made on all 33 tasks. The hybrid rules-a-h performs similarly to k-nn: no significant difference is found on three data sets, and the hybrid wins once against k-nn. The hybrid performs clearly better than its eager parent rules and wins four times. When we compare the two parents, we observe that k-nn outperforms rules on each of the tasks.

algorithm pair      won/tied/lost
rules-r-h – k-nn    0/0/4
rules-r-h – rules   0/3/1
rules-a-h – k-nn    1/3/0
rules-a-h – rules   4/0/0
k-nn – rules        4/0/0

Table 3.13: Summarized results of paired t-tests on mean accuracy obtained in 10-fold cv experiments on the four natural language tasks. The hybrids rules-r-h and rules-a-h are compared to both parent algorithms.
3.5 Related work
Lee and Shin (1999) present a hybrid algorithm that combines rule induction and k-nn. Their approach differs from our method in that it uses a two-step procedure: the hybrid has a rule induction module and a k-nn module that are used separately to classify new instances. The rule induction module induces strong rules as measured by Hellinger
divergence (Beran, 1977). In classification, new instances are first tested against the rule set, and when a match is found, the class is assigned. New instances that do not match any of the rules are classified by the k-nn module. Their method obtains a better performance compared to other machine learning algorithms such as CART (Breiman et al., 1984).

Another hybrid algorithm that combines rule induction with k-nn is the RISE system of Domingos (1996). As mentioned in Section 3.2 there is a continuum between rules and instances: instances can be considered as specific rules, and rules can be considered as generalized instances. RISE uses this continuum as it makes no distinction between rules and instances. RISE applies a specific-to-general approach and creates rules by carefully generalizing instances. Contrary to sequential covering algorithms, RISE induces all rules at once. It searches for an optimal rule set by repeatedly finding the nearest instances and generalizing over them. When this step improves the global accuracy of the whole rule set, the generalization is maintained. The most important difference between RISE and our approach is that RISE considers rules as generalized instances, while our approach differentiates between rules and instances, as we transform rules to create features that are added to the instances or that replace the original features in the instances.

The work of Ting and Cameron-Jones (1994) is closely related to our work concerning the hybrid maxent-h. Ting and Cameron-Jones (1994) compare memory-based learning with Naive Bayes. They use the relationship between the class distributions as used by the mvdm distance metric and the conditional class distributions as used by Naive Bayes. This relationship is used as a starting point in their research, which aims to relate the differences between the two approaches (memory-based learning and Naive Bayes) to their performance in different domains. They interpret Naive Bayes in a memory-based framework. This Naive Bayes stores 'ideal prototype instances' in memory, which are used in classification to find the most similar prototype. For each class one prototype instance is created, consisting of the feature values that maximize the probability of the class given those feature values. Next, a range of intermediate variants is constructed, each more or less similar to Naive
Bayes or the memory-based learner with mvdm. One of these variants has a close resemblance to our hybrid maxent-h: the variant that stores all training instances in memory and uses the Naive Bayes metric to calculate distances between instances. When the different variants are tested on 16 UCI data sets, their findings are not unequivocal. The step-wise transition of variants between the memory-based algorithm and Naive Bayes does not behave in a similar manner for different data sets. Moreover, the relationship between a change in accuracy for two variants and the overall change in accuracy along the entire range of variants for a particular data set is not obvious. The strongest factor found to influence the relative performance of memory-based learning and Naive Bayes in different domains is whether or not each class can be represented satisfactorily by an ideal prototype.

Zheng and Webb (2000) also propose a hybrid algorithm that combines Naive Bayes, rule learning and memory-based learning. This approach differs from ours and from that of Ting and Cameron-Jones (1994) in that it proposes a lazy Bayesian rule learning algorithm. This hybrid performs local classification, as it keeps all instances in memory. In the classification phase, the most appropriate rule is constructed for each new instance. This rule has as antecedent a conjunction of feature-value tests, and as consequent a local Naive Bayesian classifier created on the basis of the training examples that match the antecedent of the rule ('the nearest neighbors'). The lazy Bayesian rule learning hybrid is compared on a large range of UCI data sets to several other algorithms such as Naive Bayes, C4.5 (Quinlan, 1986) and a lazy decision tree algorithm (Friedman et al., 1996). On average, the hybrid performs better than the other algorithms.

Pennock et al. (2000) describe a hybrid system that combines a probabilistic Bayesian approach with memory-based learning. This system, called Personality Diagnosis, is a collaborative filtering algorithm which makes recommendations to a new user on the basis of the preferences of a group of users about items such as books or movies. The task of collaborative filtering is to predict, for a user of whom a part of the preferences or ratings is known, the ratings for other items on the basis of the ratings of other users. Pennock et al. construct for each user a vector of ratings. To make a prediction for an unknown rating for a certain user, they compute the probability that this user is of the same personality type as any other user (i.e. similar to the other vectors) and then compute the probability distribution for the unknown rating. As the output is a probability, this memory-based approach can be regarded as a local regression method. Pennock et al. compare the performance of their hybrid algorithm against two memory-based approaches and two probabilistic approaches. The results show that the hybrid system performs better than the four other algorithms.
3.6 Summary
In this chapter we discussed and motivated the construction of hybrid algorithms that combine lazy and eager learning. We described the hybrid maxent-h, which combines k-nn with maximum entropy modeling, the hybrids rules-r-h and rules-a-h, which both combine k-nn with rule induction, and the hybrid winnow-h, which combines k-nn with winnow. The results of comparative experiments with the hybrid algorithms and their parent algorithms on a range of tasks were presented. These results showed that the hybrid maxent-h performs similarly to k-nn and better than maxent. rules-a-h performs slightly better than k-nn and better than rules. The other two hybrids, rules-r-h and winnow-h, are less successful: rules-r-h performs similarly to rules but worse than k-nn, and winnow-h performs worse than both parent algorithms.

Additionally, we performed experiments with two different rule induction algorithms, cn2 and ripper. We expected that hybrids combining cn2 with k-nn might perform better than the ones with ripper, because cn2 produces larger rule sets and rules that can overlap in the instances they cover, which may lead to a more diverse set of instances and may be beneficial for k-nn classification. Against our expectations, we found that ripper, as well as the hybrids that have ripper as parent algorithm, performs better than cn2 and the hybrids that combine cn2 with k-nn. ripper's method of producing a smaller rule set containing rules with a higher coverage leads to better-performing hybrids.

We performed some extra comparisons which provided additional information about the main comparative experiments between the hybrids and their parent algorithms. We compared the results of the four parent algorithms to each other and observed that k-nn outperforms the eager learners on a large range of data sets; maxent performs better than rules and winnow, and rules outperforms winnow. We use accuracy to measure the performance of the algorithms; as an extra evaluation we computed micro-averaged AUC scores and compared these to the accuracy scores in Section 3.4.2. We found only small differences between the accuracy and AUC measurements, and the AUC scores support the observations made on the basis of the accuracy scores.

We also analyzed the results of the algorithms on the NLP tasks only. The results on the NLP tasks do not differ much from the observations made on all 33 tasks. An exception is the hybrid maxent-h, which has a lower performance than k-nn on the NLP tasks, while in the comparison on all tasks their performance was similar. The NLP data sets are characterized by exceptions and large average numbers of feature values per feature. The weights as produced by maxent seem not to capture all low-frequency but important cases. This may be explained by the fact that maxent-h maximizes the entropy of the weights, thereby smoothing over low-frequency events.
Chapter 4

Comparison with stacking

In this chapter we compare two of the hybrid algorithms described in Chapter 3, maxent-h and rules-a-h, with two classifier ensembles. We employ stacking, an ensemble technique in which the prediction of a first classifier (maxent, rules) is used as input for a second classifier (k-nn).

In this chapter we compare the hybrid algorithms and classifier ensembles. In Chapter 3 we introduced four hybrid algorithms and tested their performance on a wide range of data sets. The results of these tests showed that two of the hybrids outperformed their eager learning parent algorithm and performed equally well as or slightly better than k-nn. These two hybrids are maxent-h, which combines k-nn classification with the internal representation of maximum entropy modeling, and rules-a-h, which is a k-nn classifier that uses extra features, representing the rule set of ripper, added to the instances.

In this chapter we discuss the construction and performance of two classifier ensembles that have a close resemblance to the two successful hybrids maxent-h and rules-a-h. The goal of constructing these ensembles is to create an analogy with the hybrids, in order to build a minimal pair for comparison. We compare the two ensemble classifiers with the two hybrid classifiers, and we also compare the classifier ensembles with their parent algorithms.

This chapter is structured as follows. First we give a general introduction to classifier ensembles. In Section 4.1.1 we describe stacking, the ensemble method we choose for our experiments. The experimental setup of the experiments with stacked classification is detailed in Section 4.2; the results are shown in Section 4.3. We discuss related work in Section 4.4 and summarize our findings in Section 4.5. We analyze the differences and commonalities between the stacked classifiers and the hybrid algorithms in Chapter 5.
4.1 Classifier ensembles
The term classifier ensembles refers to techniques that combine the predictions of various classifiers into a single classifier. Concerning their goal, classifier ensembles are closely related to the hybrid algorithms as defined in Chapter 3, as both try to combine classifiers to reach a performance level intended to surpass that of the parents. Theoretical and empirical work offers many proofs and demonstrations that classifier ensembles have a better performance than any of the single classifiers that form the ensemble, given that the single classifiers are accurate and diverse (Hansen and Salamon, 1990; Opitz and Maclin, 1999; Opitz and Shavlik, 1996; van Halteren et al., 2001; Wolpert, 1992). Accurate classifiers are classifiers with a performance better than random guessing. Diverse classifiers are classifiers that make errors in different parts of the hypothesis space. In order to create a successful ensemble, the ensemble should therefore be composed of accurate classifiers that are diverse and make uncorrelated errors. Dietterich (2002) mentions four methods to force diversity in an ensemble of classifiers.

Different training material: Training classifiers on different subsamples of the labeled training set results in different classifiers. Bagging (Breiman, 1996a) and boosting (Freund, 1990) are examples of this approach. In bagging, the training set for each individual classifier is randomly sampled from the original training set with replacement (a small code sketch of this sampling step follows after this list). Each sampled set has the same size as the original training set, but the sets differ because some instances occur several times and others are left out. In boosting, a sequence of classifiers is used. Instances that are misclassified by the first classifier receive more weight in the new training set of the next classifier, for instance by duplicating them in the training set.

Feature selection: Training classifiers on different subsets of features leads to diverse classifiers. Examples of this approach can be found in (Bay, 1999; Zenobi and Cunningham, 2001). Both studies use feature subset selection to create an ensemble of k-nn classifiers.

Different output labels: Each classifier can be trained with a different set of output labels. One can split up a multi-class problem into sets of binary tasks. Two commonly used methods are one-versus-all and one-versus-one classification (also called pairwise classification) (Fürnkranz, 2001). With one-versus-all, a binary classifier is built for each class. This classifier is trained on data in which this class is the positive class and all other classes are labeled negative. With one-versus-one, each pair of classes forms a binary classifier. Each classifier is only trained on the examples of the two classes involved; examples with other labels are not taken into account. Error correcting output codes, as described in Section 2.3, are another example of a method that produces different output labels.

Randomness: Many machine learning algorithms offer the possibility to use randomness in some part of their learning module. Some algorithms already contain a random
element in their learning module. For instance, the winnow algorithm starts with random initial weights for its connections.
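The bootstrap sampling step of bagging mentioned in the list above can be sketched as follows (toy data; not the thesis code):

```python
# Sketch of the bagging resampling step: a bootstrap sample of the same size
# as the original training set, drawn with replacement.
import random

def bootstrap_sample(training_set, seed=None):
    rng = random.Random(seed)
    n = len(training_set)
    return [training_set[rng.randrange(n)] for _ in range(n)]

data = [("x1", "yes"), ("x2", "no"), ("x3", "yes"), ("x4", "no")]
print(bootstrap_sample(data, seed=0))   # some instances repeat, others are left out
```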
Another aspect of classifier ensembles is the method used to combine the predictions of the individual classifiers. Majority voting is a simple method in which the outputs of a set of different classifiers are combined in a voting scheme: for each new instance the predictions of all classifiers are counted, and the class label that is predicted by the majority of the classifiers is assigned to the new instance. By assigning a weight to each classifier, more importance can be given to the votes of the better classifiers in the ensemble. This method is called weighted voting. An obvious method to estimate the weight of a classifier is its performance on the training set or on a separate validation set.

Instead of using a fixed combination scheme to combine the predictions of the different classifiers, it is also possible to add a learning algorithm that learns the best classifier combination. This process, by which another classifier is trained on the predictions made by a set of classifiers trained on the instances, is generally called meta-learning. We refer to (Vilalta and Drissi, 2002; Vilalta et al., 2005) for an overview of meta-learning. One particular method of meta-learning is stacking.
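The two voting schemes just described can be sketched as follows:

```python
# Sketch of plain majority voting and weighted voting over the predictions
# of an ensemble of classifiers.
from collections import Counter, defaultdict

def majority_vote(predictions):
    return Counter(predictions).most_common(1)[0][0]

def weighted_vote(predictions, weights):
    scores = defaultdict(float)
    for label, w in zip(predictions, weights):
        scores[label] += w
    return max(scores, key=scores.get)

preds = ["yes", "no", "yes"]                   # predictions of three classifiers
print(majority_vote(preds))                    # -> 'yes'
print(weighted_vote(preds, [0.6, 0.9, 0.5]))   # -> 'yes' (0.6 + 0.5 > 0.9)
```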
4.1.1 Stacking
The general framework of stacking was introduced by Wolpert (1992). A set of different classifiers is trained on the original labeled training set and produces a set of outputs. These classifiers are named level-0 or base classifiers. Next, a level-1 classifier is trained on the set of outputs of the base classifiers. When classifying a new test instance, the base classifiers predict a set of outputs, which forms the input of the level-1 classifier that assigns the final class label. There are many variations of stacking: stacked classifiers can have two to multiple levels of classifiers, the output of the base classifiers can have any form, such as a predicted class label or a probability, and the level-1 classifier may not only receive the outputs of the base classifiers but also the original training instances as input. Figure 4.1 presents the architecture of a stacked classifier with one classifier at the base level that produces predictions which form the input for both the learning module and the classification module of the level-1 classifier.

Figure 4.1: A stacked classifier with one level-0 classifier that produces predictions which form the input for the level-1 classifier.

We design a variant of stacking in which the level-1 classifier does not only receive the predictions of the base classifier as input, but also the original training instances. We choose this design for the ensemble classifier to keep a close analogy to the construction of the hybrid algorithms. The stacked classifier uses an eager learning algorithm as base classifier and k-nn as level-1 classifier. The construction of the stacked classifier consists of three parts:

Training set labeling: The training set is split into 10 folds. 10-fold cv experiments are performed with the level-0 classifier on the training set. The predictions for the 10 test parts are concatenated and added as features to the training set. This produces a new training set in which each instance has an added feature: a predicted class label.

Training the classifiers: An eager level-0 classifier is trained on the whole training set. The level-1 k-nn classifier is trained on the new training set with the added predictions of the base classifier.

Classification phase: The level-0 classifier trained on the whole training set predicts a label for the new test instance. This predicted label is added as an extra feature to the test instance. The test instance plus the extra feature forms the input for the level-1 classifier, which predicts the final class label.

We construct two stacked classifiers in analogy with the two hybrids maxent-h and rules-a-h. The first stacked classifier has maxent as base classifier and k-nn as level-1 classifier; we label this classifier stack-maxent. The other stacked classifier, labeled stack-rules, has the ripper rule induction algorithm as base classifier and k-nn as level-1 classifier.

The motivation for the construction method of the stacked classifiers is analogy with the hybrids, in order to create a minimal pair for comparison. In the hybrid algorithms the eager parent algorithm first produces an internal representation on the basis of the training set. The k-nn algorithm uses this internal representation of the eager learner in the classification of the test instances. In the stacked classifier, the eager algorithm
produces a set of predicted labels for the test instances. These labels are used by the k-nn algorithm in the final classification of the test instances. The main difference between the hybrid algorithms and the stacked classifiers is that the stacked classifiers consist of two separate algorithms, of which the second uses the predictions of the first. The hybrid algorithm consists of one classifier, and the combination of the two classifiers takes place internally in the learning phase of the hybrid. We expect the stacked classifiers to perform quite similarly to the level-1 k-nn classifier, because the level-1 classifier predicts the final class label. We are uncertain about the influence of the level-0 classifier on the final prediction, as this will depend on the feature weight assigned to this feature in the k-nn classification process.
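For concreteness, the following sketch outlines the three-step construction described above, using scikit-learn components as stand-ins: logistic regression plays the role of the eager (maxent-like) level-0 learner and KNeighborsClassifier the role of the level-1 k-nn. These are assumptions for illustration only — the experiments in this thesis use a maxent toolkit, Ripper and Timbl (with an overlap metric and feature weighting), not these library classes — and features and class labels are assumed to be numerically encoded.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression      # stand-in for the eager maxent learner
from sklearn.model_selection import cross_val_predict
from sklearn.neighbors import KNeighborsClassifier       # stand-in for the level-1 k-nn

def train_stacked(X_train, y_train, n_folds=10):
    # 1. Training set labeling: internal 10-fold cv predictions of the level-0 learner.
    level0 = LogisticRegression(max_iter=1000)
    internal_preds = cross_val_predict(level0, X_train, y_train, cv=n_folds)
    # 2. Training the classifiers: level-0 on the whole set, level-1 on the augmented set.
    level0.fit(X_train, y_train)
    X_aug = np.column_stack([X_train, internal_preds])
    level1 = KNeighborsClassifier(n_neighbors=1)
    level1.fit(X_aug, y_train)
    return level0, level1

def classify_stacked(level0, level1, X_test):
    # 3. Classification phase: the level-0 prediction becomes an extra feature.
    extra = level0.predict(X_test)
    return level1.predict(np.column_stack([X_test, extra]))

# Toy usage with numerically encoded data (hypothetical, purely for illustration).
rng = np.random.default_rng(0)
X = rng.integers(0, 3, size=(200, 9))
y = (X[:, 0] == X[:, 4]).astype(int)
l0, l1 = train_stacked(X[:150], y[:150])
print(classify_stacked(l0, l1, X[150:])[:10])
```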
4.2 Experimental setup
We run experiments with the stacked classifiers on the 29 UCI benchmark tasks and the four NLP tasks described in Section 2.2. We perform 10-fold cv experiments to estimate the performance of a classifier. Each classifier trained on a particular training set is optimized with respect to its algorithmic parameters. As evaluation method we calculate the average accuracy over the outcomes of the 10-fold cv experiments, and we use paired t-tests to estimate the significance of the differences found between algorithms. This approach is detailed in Chapter 2.

We compare the results of the two hybrid algorithms maxent-h and rules-a-h as described in Chapter 3 with the results of the stacked classifiers. We perform paired t-tests between the stacked classifiers and the hybrid algorithms to estimate the significance of the differences in performance. We also compare the performance of the stacked classifiers with the results of their parent algorithms: we compare stack-maxent with its parent algorithms maxent and k-nn, and stack-rules with rules and k-nn. The experiments with the parent algorithms k-nn, rules and maxent are discussed in Chapter 3.

Stacked classification is time-consuming, as classifier stacking requires an internal 10-fold cv on the training set. To estimate the performance of a stacked classifier on a particular data set, we perform 10-fold cv on the data set, and for each training part the stacked classifier performs an internal round of 10-fold cv experiments. Not only does this imply 10*10 training and testing rounds, but each training round also involves an optimization process of the algorithmic parameters.
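A paired t-test over the per-fold accuracies can be computed as in the sketch below. The accuracy values are invented for illustration, and the 0.05 threshold is the conventional choice, not necessarily the exact significance criterion detailed in Chapter 2.

```python
from scipy.stats import ttest_rel

# Per-fold accuracies of two classifiers on the same 10 cross-validation folds
# (hypothetical numbers, not results from this thesis).
acc_stacked = [0.95, 0.94, 0.96, 0.95, 0.93, 0.95, 0.96, 0.94, 0.95, 0.96]
acc_hybrid  = [0.94, 0.94, 0.95, 0.95, 0.93, 0.94, 0.95, 0.94, 0.94, 0.95]

t_stat, p_value = ttest_rel(acc_stacked, acc_hybrid)   # paired t-test on matched folds
significant = p_value < 0.05                           # assumed conventional threshold
print(t_stat, p_value, significant)
```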
4.3 Results
First we describe the results of the stacked algorithm stack-maxent and compare these with the results of the hybrid maxent-h and the parent algorithms maxent and k-nn. Next we look at the stacked classifier stack-rules and compare its results with the hybrid rules-a-h and its parent algorithms rules and k-nn.
4.3.1 Comparison with stack-maxent
We compare the performance of stack-maxent with the performance of the resembling hybrid maxent-h and with the parent algorithms maxent and k-nn. Table 4.1 shows the mean accuracies and standard deviations of the stacked classifier as measured in 10-fold cv experiments on the 33 data sets, together with the results of maxent-h, maxent and k-nn that were already reported in Chapter 3. The last row of the table shows the macro-average over all data sets. The hybrid has a slightly higher average than the stacked classifier. The parent k-nn has a similar average and maxent a lower average. We do not display graphs of the accuracies of these algorithms compared to each other, because such graphs do not display the information more clearly than the table does. The accuracies of the hybrid and the stacked classifier are very similar, and when plotted in a graph the points lie close to the diagonal. When we compare the stacked classifier to both parents we observe behavior very similar to that shown in the comparison between the hybrid and both parents in Figure 3.2.

We perform paired t-tests between the pairs of algorithms to estimate the significance of the differences between the scores. We first compare the results of the two algorithms that both combine maximum entropy modeling with k-nn: stack-maxent and the hybrid maxent-h. The counts of won/tied/lost of the significance tests on the results of the two algorithms are shown in the first row of Table 4.2. The two algorithms have a similar performance. No significant difference is found on 27 of the 33 data sets. Both win and lose three times against each other. The results of the paired t-tests between the stacked classifier stack-maxent and its two parent algorithms are shown in the second and third row of Table 4.2. stack-maxent performs similarly to its parent algorithm k-nn, as no significant difference in performance is found on 29 data sets. stack-maxent and k-nn both win and lose two times against each other. Compared with its eager parent maxent, the stacked classifier performs significantly better on 14 data sets and stack-maxent is never outperformed by maxent.
data set       stack-maxent    maxent-h        k-nn            maxent
abalone        24.8 ± 2.3      22.8 ± 2.2      24.6 ± 2.8      23.6 ± 1.7
audiology      81.8 ± 5.4      81.3 ± 5.8      80.5 ± 6.3      80.9 ± 5.0
bridges        53.7 ± 13.1     55.7 ± 13.1     54.7 ± 10.6     61.6 ± 9.1
car            95.8 ± 1.1      96.5 ± 1.5      96.5 ± 1.3      90.9 ± 2.2
c-h-disease    55.4 ± 7.5      54.8 ± 5.8      55.7 ± 5.4      55.1 ± 5.0
connect4       78.0 ± 0.9      78.1 ± 1.9      77.7 ± 1.8      75.7 ± 0.5
ecoli          76.8 ± 8.1      78.6 ± 2.8      79.5 ± 4.9      76.5 ± 7.8
flag           70.4 ± 16.2     68.9 ± 14.9     66.9 ± 11.4     69.8 ± 13.4
glass          67.2 ± 10.2     61.5 ± 9.9      67.7 ± 8.4      70.1 ± 11.6
kr-vs-kp       97.6 ± 0.8      99.1 ± 0.4      96.8 ± 1.2      96.8 ± 0.6
letter         90.5 ± 1.7      95.9 ± 0.7      95.6 ± 0.5      85.0 ± 0.7
lung-cancer    35.8 ± 17.5     43.3 ± 13.3     33.3 ± 12.9     39.2 ± 24.7
monks1         97.9 ± 6.3      93.7 ± 10.5     100.0 ± 0.0     75.0 ± 4.0
monks2         92.6 ± 11.8     96.3 ± 8.3      94.0 ± 11.8     65.1 ± 5.5
monks3         97.2 ± 2.5      97.2 ± 2.5      97.2 ± 2.5      97.2 ± 2.5
mushroom       100.0 ± 0.0     100.0 ± 0.0     100.0 ± 0.0     100.0 ± 0.0
nursery        98.3 ± 2.4      97.9 ± 1.1      99.4 ± 0.5      92.4 ± 0.4
optdigits      97.6 ± 0.7      97.3 ± 0.6      98.0 ± 0.7      95.8 ± 0.8
pendigits      91.8 ± 1.4      92.0 ± 1.4      93.4 ± 0.9      86.0 ± 1.2
promoters      92.6 ± 8.9      93.5 ± 8.2      87.0 ± 7.1      92.5 ± 9.0
segment        95.2 ± 1.7      95.9 ± 1.1      95.7 ± 0.9      92.1 ± 2.8
solar-flare    94.0 ± 2.1      94.2 ± 1.8      94.2 ± 2.2      94.7 ± 1.8
soybean-l      92.8 ± 3.6      93.1 ± 3.2      92.8 ± 4.2      92.2 ± 2.8
splice         94.7 ± 1.2      94.8 ± 1.5      95.3 ± 1.0      94.6 ± 0.8
tictactoe      97.9 ± 1.0      100.0 ± 0.0     95.8 ± 3.8      98.3 ± 0.7
vehicle        67.3 ± 3.0      66.3 ± 5.4      67.6 ± 4.5      63.5 ± 5.2
votes          95.9 ± 2.7      95.4 ± 2.3      95.2 ± 2.4      96.5 ± 2.1
wine           96.0 ± 3.7      96.6 ± 2.7      96.1 ± 2.6      94.9 ± 5.3
yeast          54.0 ± 3.2      55.4 ± 3.0      53.3 ± 2.9      49.3 ± 3.9
pos            94.6 ± 0.2      93.5 ± 0.2      94.8 ± 0.1      92.5 ± 0.1
chunk          95.5 ± 0.2      95.5 ± 0.1      95.5 ± 0.2      95.0 ± 0.2
ner            98.1 ± 0.4      98.0 ± 0.4      97.4 ± 1.2      96.9 ± 0.1
pp             80.8 ± 1.4      78.3 ± 0.9      80.1 ± 1.3      81.6 ± 1.0
macro av.      83.4 ± 4.3      83.7 ± 3.9      83.4 ± 3.6      81.0 ± 4.0

Table 4.1: Mean accuracies and standard deviation of the stacked classifier stack-maxent, the hybrid maxent-h and their parent algorithms on the 33 data sets. Highest scores are printed in bold-face.
algorithm pair             won/tied/lost
stack-maxent – maxent-h    3/27/3
stack-maxent – k-nn        2/29/2
stack-maxent – maxent      14/19/0

Table 4.2: Summary of paired t-test outcomes of the stacked classifier stack-maxent compared to maxent-h and both parent algorithms on the 33 data sets.
4.3.2 Comparison with stack-rules
We report on the results of the stacked classifier stack-rules and compare these to the results of the hybrid rules-a-h and the parent algorithms rules and k-nn. Table 4.4 shows the mean accuracies and standard deviations on the 33 data sets of the four algorithms. The last row of the table shows that, averaged over all data sets, the hybrid rules-a-h and k-nn have a slightly higher average than the stacked classifier. Again we do not display any graphs of these accuracies, for the same reasons as mentioned for stack-maxent.

We perform paired t-tests between pairs of algorithms to estimate the significance of the differences in performance. The outcomes of these significance tests are displayed in Table 4.3. When we compare the stacked classifier stack-rules and the hybrid rules-a-h we see that they have a similar performance. No significant difference is found on 28 of the 33 data sets. stack-rules wins against rules-a-h two times and loses three times. We also compare the stacked classifier stack-rules with its parent algorithms rules and k-nn. The stacked classifier stack-rules performs quite similarly to k-nn, as there is no significant difference in the mean accuracies on 25 data sets. stack-rules wins against k-nn on three tasks and loses on five tasks. When compared to the eager parent rules, the stacked classifier performs much better. stack-rules wins against rules on 17 data sets and does not lose on any of the data sets.

algorithm pair             won/tied/lost
stack-rules – rules-a-h    2/28/3
stack-rules – k-nn         3/25/5
stack-rules – rules        17/16/0

Table 4.3: Summary of paired t-test outcomes of the stacked classifier stack-rules compared to rules-a-h and both parent algorithms on the 33 data sets.
data set       stack-rules     rules-a-h       k-nn            rules
abalone        25.1 ± 2.0      25.0 ± 2.5      24.6 ± 2.8      18.1 ± 1.7
audiology      74.4 ± 9.6      81.8 ± 5.2      80.5 ± 6.3      76.5 ± 7.7
bridges        54.6 ± 11.1     55.7 ± 13.1     54.7 ± 10.6     53.8 ± 14.3
car            97.8 ± 1.3      98.4 ± 0.9      96.5 ± 1.3      97.6 ± 1.1
c-h-disease    56.1 ± 7.7      58.4 ± 5.2      55.7 ± 5.4      58.4 ± 5.9
connect4       77.4 ± 1.8      78.6 ± 2.5      77.7 ± 1.8      76.3 ± 1.7
ecoli          77.4 ± 6.4      78.0 ± 6.3      79.5 ± 4.9      69.7 ± 10.9
flag           65.4 ± 11.5     65.8 ± 10.9     66.9 ± 11.4     61.8 ± 8.8
glass          64.0 ± 10.6     66.9 ± 10.1     67.7 ± 8.4      60.7 ± 6.5
kr-vs-kp       99.3 ± 0.5      99.2 ± 0.5      96.8 ± 1.2      99.2 ± 0.5
letter         89.3 ± 5.2      95.6 ± 0.4      95.6 ± 0.5      73.8 ± 1.5
lung-cancer    45.0 ± 33.4     34.2 ± 16.0     33.3 ± 12.9     31.7 ± 24.1
monks1         99.3 ± 2.0      100.0 ± 0.0     100.0 ± 0.0     99.3 ± 2.0
monks2         85.5 ± 14.5     97.0 ± 4.2      94.0 ± 11.8     72.0 ± 8.1
monks3         97.2 ± 2.5      97.2 ± 2.5      97.2 ± 2.5      97.2 ± 2.5
mushroom       100.0 ± 0.0     100.0 ± 0.0     100.0 ± 0.0     100.0 ± 0.0
nursery        98.1 ± 0.6      99.2 ± 0.4      99.4 ± 0.5      97.7 ± 0.8
optdigits      97.5 ± 0.4      97.9 ± 0.6      98.0 ± 0.7      89.3 ± 1.0
pendigits      91.1 ± 2.5      92.5 ± 1.5      93.4 ± 0.9      82.6 ± 1.7
promoters      86.9 ± 8.4      83.5 ± 11.2     87.0 ± 7.1      79.4 ± 7.9
segment        95.6 ± 1.0      95.8 ± 1.3      95.7 ± 0.9      90.5 ± 3.6
solar-flare    94.7 ± 1.5      94.2 ± 1.9      94.2 ± 2.2      94.6 ± 1.5
soybean-l      92.4 ± 3.8      92.8 ± 3.5      92.8 ± 4.2      91.1 ± 3.0
splice         95.6 ± 1.2      95.8 ± 0.7      95.3 ± 1.0      94.1 ± 1.6
tictactoe      99.7 ± 0.7      100.0 ± 0.0     95.8 ± 3.8      99.7 ± 0.7
vehicle        67.8 ± 3.4      64.2 ± 5.7      67.6 ± 4.5      55.7 ± 5.0
votes          94.0 ± 1.5      94.9 ± 1.3      95.2 ± 2.4      94.2 ± 1.6
wine           95.5 ± 5.5      95.5 ± 4.9      96.1 ± 2.6      93.2 ± 6.5
yeast          54.4 ± 4.3      52.0 ± 4.7      53.3 ± 2.9      42.2 ± 3.2
pos            94.8 ± 0.4      95.0 ± 0.1      94.8 ± 0.1      68.8 ± 0.5
chunk          95.2 ± 0.2      95.5 ± 0.2      95.5 ± 0.2      92.9 ± 0.5
ner            96.2 ± 1.9      97.5 ± 1.0      97.4 ± 1.2      90.3 ± 0.9
pp             80.6 ± 2.6      80.2 ± 1.9      80.1 ± 1.3      77.0 ± 1.9
macro av.      83.0 ± 4.8      83.6 ± 3.7      83.4 ± 3.6      78.2 ± 4.2

Table 4.4: Mean accuracies and standard deviations of the stacked classifier stack-rules, the hybrid rules-a-h and their parent algorithms on the 33 data sets. Highest scores are printed in bold-face.
algorithm pair             won/tied/lost
stack-maxent – maxent-h    2/2/0
stack-maxent – k-nn        0/4/0
stack-maxent – maxent      3/1/0

Table 4.5: Summary of paired t-test outcomes of the stacked classifier stack-maxent compared to maxent-h and both parent algorithms on the four NLP data sets.

algorithm pair             won/tied/lost
stack-rules – rules-a-h    0/3/1
stack-rules – k-nn         0/3/1
stack-rules – rules        4/0/0

Table 4.6: Summary of paired t-test outcomes of the stacked classifier stack-rules compared to rules-a-h and both parent algorithms on the four NLP data sets.
4.3.3 Comparisons on NLP data sets
In this section we concentrate on the results obtained on the four NLP data sets. We first look at the stacked classifier stack-maxent. Table 4.5 lists the outcomes of paired t-tests on the four NLP data sets. We see that stack-maxent performs better than the hybrid maxent-h, as it wins on two of the four data sets. These outcomes differ from the outcomes on all 33 data sets; compared over all data sets the two algorithms have a similar performance. Compared to both parent algorithms, stack-maxent performs similarly to k-nn and better than maxent. The fact that the stacked classifier performs similarly to k-nn also explains why stack-maxent performs better than maxent-h: k-nn, too, performs better than maxent-h on the NLP data sets.

Next we look at the results for the stacked classifier stack-rules. Table 4.6 displays the outcomes of the paired t-tests on the four NLP data sets. When we compare stack-rules to the hybrid rules-a-h we see that the hybrid wins once. On the other three data sets no significant difference is found. Compared to the parent algorithms, k-nn wins once and has a similar performance on the other data sets. stack-rules is much better than rules and wins on all four data sets. Overall, the results found for stack-rules on the NLP data sets do not differ much from the results found on all 33 data sets.
4.4 Related work
In this section we discuss several variants of stacking and some issues that have been raised concerning stacking, and we briefly discuss stacking applied to NLP tasks.
Ting and Witten (1999) have further investigated stacking and propose a variant of stacking that uses the class probability distributions produced by the level-0 classifiers as input for the level-1 classifier instead of predicted class labels. These probabilities not only predict the class label but also give information about the confidence of the base classifier. They also conclude that multi-response least squares linear regression is suited as a second-level classifier, as this classifier can take the confidence information into account.

Chan and Stolfo (1997) propose meta-learning, and in particular stacking, as a method to deal with the problem of scaling to huge amounts of data. By training several classifiers on different data sets, one can cover large amounts of training material. Stacking offers the possibility to learn the coordination between these classifiers. They discuss two variants of stacking which they call arbitration and combining. In the first variant the level-1 classifier is an arbiter that arbitrates among the predictions of the base classifiers. The arbiter is trained on a held-out set of the training material. In the combining approach, the level-1 classifier receives as input only the predictions of the base classifiers. Both variants perform better than any of the single classifiers or than majority voting over the set of base classifiers.

Seewald (2002) shows that stacking performs worse on multi-class data sets than on two-class data sets. He proposes a variant of stacking, an extension of the variant of Ting and Witten (1999), that is independent of the number of classes and has a significantly better performance than Ting's stacking variant.

A problem of stacking is the choice of the right combination of level-0 and level-1 classifiers for a particular task. Ledezma et al. (2004) propose to use genetic algorithms (GA) to find an optimal combination of stacked classifiers. Besides combinations of level-0 and level-1 classifiers, the GA also optimizes the algorithmic parameter settings of the base and level-1 classifiers. Ledezma et al. show that the stacked combinations found with the GA perform at least as well as the best stacking methods reported in the literature.

Dzeroski and Zenko (2004) report on an extensive empirical evaluation of several state-of-the-art stacked classifiers. They compare the stacking methods to each other, and to the best individual level-0 classifier. They show that most stacked classifiers perform at best similarly to selecting the best individual classifier by cross-validation. They also propose two extensions of the stacked classifier that performed best in their comparative experiments (the classifier proposed by Ting and Witten (1999)), and show that one of these extensions performs better than the best individual classifier. This best-performing stacked classifier is called multi-response model trees, and uses class probability distributions produced by the level-0 classifiers as input for the level-1 classifier, a model tree algorithm (Frank et al., 1998).

In recent years, stacking has been applied to many NLP tasks. It has been applied to part-of-speech tagging (van Halteren et al., 2001), text classification (Mihalcea, 2002), word sense disambiguation (Florian et al., 2002), named entity recognition (Florian, 2002; Hendrickx and van den Bosch, 2003; Wu et al., 2003), phrase chunking (Veenstra, 1998), spam filtering (Sakkis et al., 2001) and semantic role labeling (van den Bosch et al., 2004).
In a sequential task such as ner and chunk, labels are assigned to sequences of words or tokens. A common machine learning approach is to apply a labeling scheme such as IOB tagging and to create an instance for each word and its surrounding context, as described in Section 2.3. For each word a prediction is made without regard for previous predictions. Stacking can also be applied to sequential NLP tasks (Hendrickx and van den Bosch, 2003; van den Bosch et al., 2004; Veenstra, 1998). In these studies it is not the level-0 prediction for the focus word that is added as an extra feature to the instances for the level-1 classifier, but the level-0 predictions for the context words to the left and right of the focus word (a sketch of this instance construction is given at the end of this section). Information about the class labels of the context words is valuable because these words can also be part of the same sequence.

Work by van den Bosch and Daelemans (2005) discusses three methods to tackle the problem of blindness to previous or future predictions in sequential NLP tasks. The study focuses on the trigram method, in which the labels to predict are changed by concatenating the label of each word with the labels of the previous and next word. This method is compared to and combined with the feedback-loop method and stacking. In the feedback loop, previous predictions are added as extra features when predicting the current word. The results show that trigrams work as well as stacking or the feedback-loop method. Combining trigrams with stacking has a slightly better performance than using only the trigrams.
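The following sketch illustrates the sequential stacking idea described above: the level-0 predictions for the left and right context words are appended as extra features for the level-1 classifier. It is a simplified illustration; the tokens, IOB labels and window width are hypothetical, and sentence boundaries are padded with a dummy symbol.

```python
def add_context_predictions(instances, level0_predictions, width=1):
    """For each token instance, append the level-0 predictions for the `width`
    tokens to the left and right (padding with '_' at sentence boundaries)."""
    augmented = []
    n = len(level0_predictions)
    for i, features in enumerate(instances):
        left = [level0_predictions[i - d] if i - d >= 0 else "_" for d in range(width, 0, -1)]
        right = [level0_predictions[i + d] if i + d < n else "_" for d in range(1, width + 1)]
        augmented.append(list(features) + left + right)
    return augmented

# Toy IOB-style example for one sentence (hypothetical tokens and predictions).
tokens = [["U.N.", "NNP"], ["official", "NN"], ["Ekeus", "NNP"]]
preds = ["I-ORG", "O", "I-PER"]
print(add_context_predictions(tokens, preds))
```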
4.5 Summary
In this chapter we compared classifier ensembles against the two best-performing hybrids discussed in Chapter 3. We used stacking, a technique for combining classifiers that uses the predictions of the level-0 classifier as input for the level-1 classifier. We designed two stacked classifiers that closely resemble the two hybrid algorithms maxent-h and rules-a-h. In these stacked classifiers, the level-0 classifier is an eager learning algorithm and the level-1 algorithm is a k-nn classifier. The first stacked classifier, stack-maxent, uses maximum entropy modeling as eager parent. The second stacked classifier is called stack-rules and has a rule induction algorithm as level-0 classifier. The main difference between these stacked classifiers and the hybrid algorithms is that the stacked classifiers consist of two classifiers while the hybrids consist of one classifier.

We ran experiments with the two stacked classifiers and compared them to the two hybrid algorithms and their parent algorithms. We observed that the stacked classifiers perform very similarly to the hybrid algorithms. Compared to their parent algorithms, the stacked classifiers performed similarly to k-nn and markedly better than their eager parent algorithm. When we focus on the results on the NLP data sets, we see that stack-maxent performs better than maxent-h. On the other hand, the hybrid rules-a-h performs slightly better on the NLP data than stack-rules.
Chapter 5

Analysis

In this chapter we investigate to what extent the classification behavior of the hybrid algorithms resembles or deviates from that of the original algorithms. We investigate the differences and commonalities in the classification behavior of the hybrid algorithms and their parent algorithms. As a comparison, we perform the same analyses for the stacked algorithms that were discussed in Chapter 4. We conduct two types of analysis: we compute bias–variance decompositions of error rates, and we analyze the errors made in the comparative experiments described in Chapters 3 and 4.

Bias–variance analyses offer the possibility to split a classifier's error into two meaningful parts with different causes. The bias component of the error rate reflects the systematic errors made by the algorithm. Variance expresses the errors due to variability over different training sets. Looking at the proportion between these two parts can offer insight into the differences between algorithms. Section 5.1 describes the bias–variance analysis. To conduct this analysis, we perform additional experiments with all algorithms. These experiments are somewhat different from the experiments discussed in the previous chapters. We do not use 10-fold cv or algorithmic parameter optimization. Rather, we perform sampling experiments, measure the average error rate and calculate the decomposition into bias and variance components.

In Section 5.2 we analyze the differences in the errors gathered in the comparative experiments described in Chapters 3 and 4. We perform two types of error analysis on these outcomes. We compute complementary rates, which measure to what extent classifiers misclassify the same instances. These rates give an indication of the similarity and difference in classification behavior of the algorithms. We also analyze the overlap in correct and incorrect predictions between the hybrid algorithms and their parent algorithms, and between the stacked classifiers and their
parent algorithms. We try to answer a range of related questions, such as: How often does the hybrid predict the correct label when both parent algorithms also predict the correct label? What happens when only one parent algorithm predicts the correct label? Does the hybrid indeed combine the best of both worlds and also predict the correct label in the majority of these cases?

In Section 5.3 we provide an illustrative example of some of the classification processes of the two successful hybrids maxent-h and rules-a-h by analyzing their classifications of a single instance. As a comparison we also discuss the classifications of their parent algorithms and the two stacked classifiers. Section 5.4 discusses related work and Section 5.5 summarizes this chapter.
5.1 Bias and Variance
The term machine learning bias was introduced in Chapter 1. The bias of a machine learning algorithm stems from the learning strategy and the model of the algorithm (Mitchell, 1997). Machine learning bias can be split into absolute bias, by which certain hypotheses are eliminated from the hypothesis space, and relative bias, a preference for some hypotheses over others (Dietterich and Kong, 1995). The absolute bias of an algorithm is determined by the choice of the representation of the hypothesis space. For instance, a rule induction algorithm represents its hypothesis of the target function as a set of rules. A typical rule learner induces rules that consist of conjunctions of feature tests – this representation choice eliminates many other possible representations. The relative bias of a machine learning algorithm is determined by the choices (heuristics) made in the design of the learning module of the algorithm. For example, the search strategy of a typical rule induction algorithm is designed according to the minimum description length principle. As a result, the algorithm has a preference for small rule sets over large rule sets.

The machine learning bias determines an algorithm's capacity to learn a classification task. An algorithm that has no bias, i.e., that is fully consistent with the training data, can only perform a simple table lookup. Such an algorithm can only classify new instances that are exact matches of the ones present in the training data. An algorithm with no bias therefore has no generalization power. It misses the essential capacity to generalize and learn the classification task.

In statistical theory the term bias is also used, often in combination with the term variance. Given a classifier, a fixed test set representing a certain classification task and a sample of training sets of a fixed size, the expected average error of the classifier can be decomposed into three components: statistical bias, variance and noise. The statistical bias of an algorithm reflects the systematic error of the algorithm, whereas the term variance expresses the
variability in error over a set of different training sets. Noise represents the errors in the data. There is also a relation between the terms machine learning bias and statistical bias: a machine learning bias that suits the classification task at hand will result in a low statistical bias and low variance (Dietterich and Kong, 1995).

Several bias–variance decomposition methods are described in the literature (Dietterich and Kong, 1995; Domingos, 2000; Geman et al., 1992; Kohavi and Wolpert, 1996). We use the zero-one loss decomposition of Kohavi and Wolpert (1996). The literature suggests that this is a commonly used method in machine learning research (Brain and Webb, 2002; Witten and Frank, 2005). Given a classifier, a target function to be learned and a sequence of training sets of a fixed size sampled from a set of labeled training material, the formula in Equation 5.1 expresses the decomposition of the expected zero-one loss E(C) of the classifier into bias and variance:

E(C) = \sum_{x} p(x) \left( \sigma_x^2 + bias_x^2 + variance_x \right)    (5.1)

bias_x^2 \equiv 0.5 \sum_{y \in Y} \left[ p(Y_T = y|x) - p(Y_H = y|x) \right]^2    (5.2)

variance_x \equiv 0.5 \left( 1 - \sum_{y \in Y} P(Y_H = y|x)^2 \right)    (5.3)
where x represents a test point and σ_x^2 represents the noise in the data set. Kohavi and Wolpert propose to estimate the noise to be zero, as it is hard to calculate in practice. bias_x^2 in Equation 5.2 is estimated as the squared difference between the true target class and the predicted class, averaged over the training samples. (We refer to bias^2 as 'bias'.) Variance as expressed in Equation 5.3 is estimated as the variability over the different training sets. p(Y_T = y|x) is the probability that test point x has the true target label y, averaged over the training set samples. p(Y_H = y|x) is the estimation that test point x is classified as y by the learning algorithm, in the context of target function f, training set d, and averaged over the training set samples of size m, written in full in Equation 5.4:
P(Y_H = y|f, m, x) = \sum_{d} P(d|f, m, x) P(Y_H = y|d, f, m, x) = \sum_{d} P(d|f, m) P(Y_H = y|d, x)    (5.4)

where P(d|f, m) is the probability that d is generated from target function f. Summarized,
P(Y_H = y|x) is the average estimation of the classifier for a test point x over the training samples.
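A minimal sketch of this decomposition is given below, assuming (as above) that the noise is estimated as zero and that p(x) is uniform over the fixed test set; the prediction lists and labels are invented for illustration. For every test point, bias_x^2 + variance_x then equals the expected zero-one error at that point.

```python
from collections import Counter

def kw_decomposition(predictions_per_point, true_labels, labels):
    """Kohavi-Wolpert zero-one loss decomposition (noise estimated as zero).

    predictions_per_point: for each test point, the list of labels predicted by the
    classifiers trained on the different training samples."""
    bias_total, var_total = 0.0, 0.0
    n_points = len(true_labels)
    for preds, true in zip(predictions_per_point, true_labels):
        counts = Counter(preds)
        p_h = {y: counts.get(y, 0) / len(preds) for y in labels}        # P(Y_H = y | x)
        p_t = {y: 1.0 if y == true else 0.0 for y in labels}            # p(Y_T = y | x), no noise
        bias_total += 0.5 * sum((p_t[y] - p_h[y]) ** 2 for y in labels)  # Equation 5.2
        var_total += 0.5 * (1.0 - sum(p_h[y] ** 2 for y in labels))      # Equation 5.3
    # Uniform p(x) over the fixed test set (Equation 5.1 with sigma^2 = 0).
    return bias_total / n_points, var_total / n_points

bias, variance = kw_decomposition(
    predictions_per_point=[["pos"] * 45 + ["neg"] * 5, ["neg"] * 30 + ["pos"] * 20],
    true_labels=["pos", "pos"],
    labels=["pos", "neg"])
print(bias, variance, bias + variance)   # bias + variance equals the average expected error
```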
5.1.1 Experimental setup
We investigate whether the proportion between the bias and variance components differs for the hybrids and their parent algorithms. We run additional experiments to calculate the bias–variance decomposition on a subset of the tasks described in Section 2.2. We choose the following 16 data sets from the UCI repository that have more than 500 instances: abalone, car, connect4, letter, kr-vs-kp, mushroom, nursery, optdigits, pendigits, segment, solar-flare, soybean-l, splice, tictactoe, vehicle, yeast.

We run experiments on these 16 data sets with ten algorithms: k-nn, maxent, rules and winnow, which are described in Chapter 2; the hybrid algorithms maxent-h, rules-a-h, rules-r-h and winnow-h, detailed in Chapter 3; and the two stacked classifiers stack-maxent and stack-rules described in Chapter 4. Following Kohavi and Wolpert (1996), we divide each data set in a training part d and a test set e. From training part d, 50 training sets of size m are randomly sampled, without replacement within one sample. The size of d is chosen as 2m; m is 100 for data sets with less than 1000 instances and 250 for larger data sets. We compute the average expected error rate for each algorithm by training a classifier on each of the 50 training samples and testing it on the fixed test set e. For each instance in e, 50 predictions are collected. We use these predictions to calculate the average error rate and its decomposition into a bias and a variance component. We scale the error rate to 100% to get a better view on the balance between bias and variance. We report on averages over the 16 data sets.

We use the default parameter settings for each of the 50 classifiers trained on different subsamples. In part we find a motivation for using only the default settings in the work of Kohavi and Wolpert (1996), who show that varying an algorithmic parameter can have an effect on the balance between bias and variance. At the same time, we did not have a fixed parameter setting in our previous experiments, but rather a data-dependent one determined by wrapping. We made a practical choice between several options: we could optimize the algorithmic parameter settings for each of the 50 random samples; we could run the optimization step for one random sample and then use the resulting setting combination for each of the 50 sampling experiments; or we could use the default settings. The first option was arguably the best one, but unfortunately infeasible due to time limitations, as the pseudo-exhaustive cross-validation process described in Section 2.3.1 is time-consuming. Concerning the second option, we do not think that a parameter setting chosen on the basis of one random sample would be a better choice than the more neutral default parameter setting. In sum, the results described in the next section are based on experiments that deviate in
their experimental setup from the setup used in the previous chapters, as these experiments are random sampling experiments with default parameter settings.
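The sampling protocol itself can be sketched as follows. `train_classifier` is a placeholder for training any of the ten algorithms with default settings, and the per-instance prediction lists it returns can be fed into a decomposition routine such as the one sketched in the previous section; the toy data and majority-class learner below are purely illustrative.

```python
import random
from collections import Counter

def sampling_experiment(pool, test_set, train_classifier, m, n_samples=50, seed=1):
    """Train classifiers on n_samples random subsamples of size m (drawn without
    replacement within each sample from a pool of size 2m) and collect, for every
    instance in the fixed test set, one prediction per sample."""
    rng = random.Random(seed)
    predictions = [[] for _ in test_set]
    for _ in range(n_samples):
        sample = rng.sample(pool, m)            # without replacement within one sample
        classifier = train_classifier(sample)   # default parameter settings assumed
        for i, (features, _true_label) in enumerate(test_set):
            predictions[i].append(classifier(features))
    return predictions

# Placeholder learner: predicts the majority class of its training sample.
def train_majority(sample):
    majority = Counter(label for _features, label in sample).most_common(1)[0][0]
    return lambda features: majority

pool = [([i], "pos" if i % 3 else "neg") for i in range(200)]   # toy labeled pool of size 2m
test_set = [([0], "neg"), ([1], "pos")]
preds = sampling_experiment(pool, test_set, train_majority, m=100)
print(len(preds), len(preds[0]))    # 2 test points, 50 predictions each
```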
5.1.2 Results
The first column in Table 5.1 lists the error rates averaged over the results of the classifiers trained on the 50 subsamples and averaged over the 16 data sets. When we compare the differences in performance of the algorithms found in these experiments to the differences observed in Chapters 3 and 4, we see that for the larger part the observations are quite similar. The main difference is that maxent has the lowest error rate of the four original algorithms. This is no surprise, as the other algorithms were used without parameter optimization in these experiments. This affects the score of the hybrid maxent-h, which performs similarly to its parent k-nn but lower than maxent. This also has an effect on stack-maxent, which has a lower error rate than k-nn and a higher rate than maxent. The hybrid rules-r-h performs worse than k-nn; its performance is more similar to and slightly lower than the performance of rules. The hybrid rules-a-h performs slightly better than k-nn and better than rules. winnow-h performs worse than both parent algorithms. stack-rules has a slightly higher error rate than k-nn and a lower error rate than rules.

algorithm       error   bias %
k-nn            28.6    57.3
maxent          26.3    65.6
rules           34.9    55.5
winnow          35.7    46.9
maxent-h        28.6    56.9
rules-r-h       34.0    55.0
rules-a-h       28.3    51.6
winnow-h        39.0    52.8
stack-maxent    27.7    52.8
stack-rules     28.9    52.3

Table 5.1: The first column shows the average error rate and the second column shows the scaled bias component in the average error of all algorithms, averaged over the 16 data sets.

The second column of Table 5.1 shows the scaled statistical bias component of the error rates of the ten algorithms, averaged over the 16 data sets (the variance always being (100 − bias)%). maxent has the highest bias; winnow has the lowest. k-nn has a higher bias than rules. This means that the majority of the errors made by maxent are systematic errors, while most of the errors made by winnow are caused by its variability over different training sets. An explanation for the high variance of winnow is the random initialization of the weights, which differs for each training set.
           maxent-h   k-nn    maxent
error      28.6       28.6    26.3
bias %     56.9       57.3    65.6

Table 5.2: Average error rates and the bias percentage of these error rates of the hybrid maxent-h and both parents, averaged over the 16 data sets.
In the next part of this section we compare the hybrids to their parents. We repeat the numbers presented in Table 5.1 in separate tables to give a clearer overview of the outcomes. Table 5.2 displays the outcomes of the hybrid maxent-h and both parents. The outcomes of the hybrid are very similar to the outcomes of k-nn. maxent-h has the same average error rate, but a marginally lower bias component. The hybrid has a higher error rate than its parent maxent. This higher error rate is necessarily due to the variance component of the error rate, because the bias of the hybrid is much lower than the bias of maxent.

           rules-r-h   rules-a-h   k-nn    rules
error      34.0        28.3        28.6    34.9
bias %     55.0        51.6        57.3    55.5

Table 5.3: Average error rates and the bias percentage of these error rates of the two rule hybrids and both parents, averaged over the 16 data sets.
Next we compare the results of the two rule hybrids to their parents, shown in Table 5.3. The error rate and bias percentage of the hybrid rules-r-h are more similar to the outcomes of the parent rules than to those of k-nn. rules-r-h has a slightly higher error rate than k-nn. As the hybrid has a lower bias, these errors are mostly caused by a higher variance of the hybrid. rules-r-h has a lower error rate than rules, in combination with a slightly lower bias. The other hybrid, rules-a-h, has a lower error rate than both parents. With respect to the bias component, the algorithm deviates from both parents, as its bias is much lower. This means that the lower error rate is due to a reduction in bias errors.

           winnow-h   k-nn    winnow
error      39.0       28.6    35.7
bias %     52.8       57.3    46.9

Table 5.4: Average error rates and the bias percentage of these error rates of the hybrid winnow-h and both parents, averaged over the 16 data sets.
The outcomes of the hybrid winnow-h differ from the outcomes of both parents, as can be seen in Table 5.4. winnow-h has a higher error rate than both parents, a lower bias than k-nn and a higher bias than winnow. This suggests that the lower error rate of k-nn is for the larger part due to fewer variance errors. The lower error rate of winnow, on the other hand, seems to be caused by fewer bias errors. In sum, all hybrids have a lower bias component in the average error rate than both parent algorithms, except for the pair winnow-h – winnow. Concerning the cases in which the parents have lower error rates, this decrease in error is in the majority of the cases due to a reduction of variance errors.

Next we look at the bias–variance decomposition of the error rates of the stacked classifiers. Table 5.5 lists the outcomes of the stacked classifier stack-maxent, the hybrid maxent-h and both parents. stack-maxent has a lower error rate than the hybrid maxent-h, combined with a lower bias component; the stacked classifier makes fewer systematic errors than the hybrid. stack-maxent also makes fewer errors than k-nn, due to a reduction of bias errors. stack-maxent has a higher error rate than maxent. The lower error rate of maxent is due to a reduction in variance errors.

Next we look at the stacked classifier stack-rules. The outcomes of stack-rules, rules-a-h and both parents are displayed in Table 5.6. When we compare the stacked classifier to the hybrid rules-a-h, the stacked classifier has a slightly higher error rate and bias component. This means that the hybrid makes fewer systematic errors than stack-rules. The parent k-nn has a slightly lower error rate than stack-rules, for the most part due to a reduction of variance errors. The parent rules has a higher error rate and a higher bias than stack-rules.

           stack-maxent   maxent-h   k-nn    maxent
error      27.7           28.6       28.6    26.3
bias %     52.8           56.9       57.3    65.6

Table 5.5: Average error rates and the bias percentage of these error rates of stack-maxent, maxent-h and both parents, averaged over the 16 data sets.
           stack-rules   rules-a-h   k-nn    rules
error      28.9          28.3        28.6    34.9
bias %     52.3          51.6        57.3    55.5

Table 5.6: Average error rates and the bias percentage of these error rates of stack-rules, rules-a-h and both parents, averaged over the 16 data sets.

As an illustration of the bias–variance analysis we provide a graph of the actual bias–variance decompositions of the ten algorithms on one randomly chosen data set. Figure 5.1 shows the average error rate over 50 samples for each algorithm on the car data set. Each bar represents the expected error decomposed in bias (lower part of bar) and variance
(upper part of bar). We see that the hybrid winnow-h has the highest average error of 26.9%. The hybrid maxent-h has the lowest error rate of 13.9%. When we compare the error bar of maxent-h to the error bars of both parents, we see only a marginal difference with k-nn and a small difference in error compared to maxent. However, the bias component of the hybrid is much lower than the bias component of maxent, which means that the hybrid makes fewer systematic errors. Most of the other algorithms have approximately a 50–50 proportion between bias and variance, except for the hybrids rules-r-h and winnow-h, which both have a slightly higher bias than variance.
Figure 5.1: The bias and variance decomposition of the averaged error rates of all algorithms on the car data set. The upper part of each bar represents variance and the lower part the bias.
5.2 Error analysis
We investigate the outcomes of the comparative experiments discussed in Chapters 3 and 4. We study the predictions of the hybrid algorithms, the stacked classifiers and their parent algorithms. We perform two types of error analysis. First we describe a complementary rate analysis. Complementary rates measure to what extent two algorithms misclassify the same instances. We compute complementary rates between the hybrid algorithms and their parent algorithms, between the stacked classifiers and their parents, and between the hybrid algorithms and the stacked classifiers. Next we count the cases of correct and incorrect predictions of both parent algorithms and whether the hybrids or stacked classifiers also make correct or incorrect predictions.

               B correct   B incorrect
A correct      a           b
A incorrect    c           d

Table 5.7: Contingency table of the predictions of two algorithms A and B.
5.2.1 Complementary rates
We employ the complementary rate analysis as proposed by Brill and Wu (1998). The complementary error rate comp(A, B) between two classifiers A and B measures the percentage of mistakes that A makes which are not made by algorithm B. The relative magnitude of the complementary error rate between two algorithms can be seen as an indication of their functional similarity with respect to classification behavior:

comp(A, B) = \left( 1 - \frac{\#\text{ of common errors}}{\#\text{ of errors of A only}} \right) \times 100    (5.5)
Two classifiers that misclassify exactly the same instances have a complementary rate of 0%. Two classifiers that always misclassify different instances have a complementary rate of 100%. Contingency Table 5.7 presents the correct and incorrect predictions of the two algorithms to be compared. The complementary rate comp(A, B) is then computed as 1 − (d/(c + d)), and expresses the proportion of c to c + d. The asymmetric complementary rate comp(B, A) states the proportion of b to b + d. Another measure that computes overlap or agreement between two algorithms is kappa (Carletta, 1996; Cohen, 1960), which is a symmetric measure. We have chosen the complementary rate as our measure because it concentrates on errors only, while kappa is computed on the basis of all predictions.

We compute the complementary rates between pairs of algorithms on the basis of the outcomes of the experiments discussed in Chapters 3 and 4. We compare the hybrid algorithms to both parents, the stacked classifiers to both parent algorithms, and the two stacked classifiers to the two hybrids maxent-h and rules-a-h. To calculate the complementary rate between two classifiers on a data set, we concatenate the 10 test parts of the 10-fold cv experiments into one large test set. We use the predictions on this whole set as the basis of the calculation. We report macro-averages over the 33 data
sets, except for the pairs involving winnow and winnow-h. For these two algorithms we calculate macro-averages over the 29 UCI data sets only.
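A sketch of the computation, assuming the errors of each classifier are collected as sets of misclassified instance identifiers over the concatenated test parts; the instance ids are invented for illustration and the two calls show the asymmetry of the measure.

```python
def complementary_rate(errors_a, errors_b):
    """comp(A, B): percentage of A's errors (misclassified instance ids)
    that are not shared with B; see Equation 5.5."""
    if not errors_a:
        return 0.0
    common = len(errors_a & errors_b)
    return (1.0 - common / len(errors_a)) * 100.0

errors_a = {3, 7, 12, 25}        # instance ids misclassified by A (toy example)
errors_b = {7, 25, 40}           # instance ids misclassified by B
print(complementary_rate(errors_a, errors_b))   # 50.0: half of A's errors are not made by B
print(complementary_rate(errors_b, errors_a))   # 33.3...: the rate is asymmetric
```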
5.2.1.1 Complementary rates of the hybrid algorithms

algorithm pair        comp(A, B)   comp(B, A)
maxent-h – k-nn       33.4         34.7
maxent-h – maxent     35.0         47.0
rules-a-h – k-nn      31.6         33.9
rules-a-h – rules     28.4         49.6
rules-r-h – k-nn      56.6         38.1
rules-r-h – rules     19.7         14.6
winnow-h – k-nn       55.2         34.8
winnow-h – winnow     41.8         40.2

Table 5.8: Complementary rates between hybrids and their parent algorithms, macro-averaged over the 33 data sets (29 for winnow-h and winnow).

Table 5.8 displays the macro-averaged complementary rates in both directions between each hybrid and its two parent algorithms. We first discuss the results in the first column. The complementary rate comp(A, B) expresses the average percentage of instances misclassified by the hybrid that are correctly classified by the parent algorithm. The first observation is that all pairs have complementary rates of approximately 30% or higher, indicating that one-third of the misclassified test instances of the hybrids are predicted correctly by the parents. The exception to this observation is the pair rules-r-h – rules, with a lower rate of 19.7%. The hybrids maxent-h and rules-a-h are quite similar to k-nn in their generalization performance, but the high complementary rates show that they are not similar in the errors that they make.

The complementary rate comp(B, A) in the second column of the table expresses the percentage of errors made by the parent algorithm that are correctly predicted by the hybrid. When the hybrid performs better than the parent algorithm, the hybrid predicts correct labels for the instances misclassified by the parent to a higher extent than the parent predicts correct labels for the mistakes of the hybrid. In other words, the rates in the first column should be lower than the rates in the second column. This is the case for the hybrids maxent-h and rules-a-h compared to both parents, but not for the hybrids rules-r-h and winnow-h.

One pair in the table has a distinctly low complementary rate in both directions, namely the rules-r-h – rules pair. These low rates indicate that these algorithms tend to make the same errors to a high degree, indicating that their functional classification behavior is quite similar. The comparative experiments in Chapter 3 showed that these two algorithms also have a similar generalization performance. This can be explained by the observation
that rules-r-h does not fully explore the possibilities to involve different rules in the classification process. We observed in Section 3.2 that the average number of active bits in the binary rule-feature vectors is low, and that the hybrid rules-r-h is trained with k = 1 in the majority of the experiments. As a result, rules-r-h often uses only one rule to classify a test instance, similar to rules.

5.2.1.2 Complementary rates of the stacked classifiers
Table 5.9 shows the complementary rates, macro-averaged over the 33 data sets, of stack-maxent, maxent-h and both parents. First we compare stack-maxent to the hybrid maxent-h, displayed in the first row of the table. The complementary rate comp(A, B) is higher than comp(B, A), indicating that the hybrid generates fewer errors than the stacked classifier. Both complementary rates are above 35%, indicating that at least one-third of the instances misclassified by one algorithm is correctly classified by the other: the functional classification behavior of maxent-h and the stacked classifier stack-maxent is quite different.

Next we look at the complementary rates between stack-maxent and both parents, and the rates between the hybrid maxent-h and its parents. We observe a marked difference in the rates of the pair stack-maxent – maxent compared to the pair maxent-h – maxent. The two complementary rates of the first pair are lower than the rates of the latter, indicating that the stacked classifier is more similar to its eager parent in functional classification behavior than the hybrid.

algorithm pair              comp(A, B)   comp(B, A)
stack-maxent – maxent-h     36.6         35.5
stack-maxent – k-nn         33.6         32.1
stack-maxent – maxent       20.5         34.4
maxent-h – k-nn             33.4         34.7
maxent-h – maxent           35.0         47.0
k-nn – maxent               37.9         46.5

Table 5.9: Complementary rates between stacked algorithms and their parent algorithms, and between the stacked algorithms and hybrid algorithms, macro-averaged over the 33 data sets.

Table 5.10 lists the complementary rates of the stacked classifier stack-rules, rules-a-h and both parents. Comparing the stacked classifier to the hybrid, we observe the same points as in Table 5.9: the hybrid misclassifies fewer instances than the stacked classifier, and both algorithms have quite different functional behavior. Next we compare the two stack-rules–parent pairs to the rules-a-h–parent pairs. We observe that the complementary rates of stack-rules – k-nn are higher than the rates
of the pair rules-a-h – k-nn. This implies that the hybrid is more similar to k-nn than the stacked classifier. Interestingly, the pair stack-rules – rules has lower complementary rates than the pair rules-a-h – rules, similar to what we observed in Table 5.9. The stacked classifiers are more similar in classification behavior to their eager parents than the hybrids are to their eager parents. This can be explained by the fact that the level-1 k-nn classifier of the stacked classifier uses the prediction of the eager level-0 classifier as an extra input feature, which influences the final class label that is assigned by the level-1 classifier.

algorithm pair              comp(A, B)   comp(B, A)
stack-rules – rules-a-h     38.7         32.7
stack-rules – k-nn          38.9         35.0
stack-rules – rules         17.6         36.6
rules-a-h – k-nn            31.6         33.9
rules-a-h – rules           28.4         49.6
k-nn – rules                37.4         55.7

Table 5.10: Complementary rates between stacked algorithms and their parent algorithms, and between the stacked algorithms and hybrid algorithms, macro-averaged over the 33 data sets.
5.2.2 Error overlap with parent algorithms
We compute the overlap in correctly and incorrectly predicted class labels for each hybrid and its two parent algorithms. We check the cases where both parents assign a correct label, where both parents predict an incorrect label, and the cases in which one of the parents assigns an incorrect label. For each of these cases we verify whether the hybrid gives a correct or incorrect label. This leads to a total of six possible events. We count the numbers for these cases on the results of the comparative 10-fold cv experiments described in Chapter 3. The ten test folds are concatenated into one large test set and the counts are based on this large set. We report macro-averaged percentages over the 33 data sets, and over the 29 data sets for winnow-h. As a comparison, we also compute this error overlap analysis for the stacked classifiers and their parent algorithms, on the basis of the comparative experiments described in Chapter 4.
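The counting of the six events can be sketched as follows; the labels and predictions in the example are invented for illustration.

```python
from collections import Counter

def overlap_events(true_labels, parent1, parent2, combined):
    """Count the six parent/combined correctness events described above and
    return them as percentages of all predictions."""
    counts = Counter()
    for true, p1, p2, c in zip(true_labels, parent1, parent2, combined):
        n_parents_correct = (p1 == true) + (p2 == true)
        parents = {2: "both correct", 1: "one correct", 0: "both incorrect"}[n_parents_correct]
        counts[(parents, "correct" if c == true else "incorrect")] += 1
    total = sum(counts.values())
    return {event: 100.0 * n / total for event, n in counts.items()}

# Toy example: 'combined' stands for a hybrid or stacked classifier.
print(overlap_events(true_labels=["a", "b", "a", "b"],
                     parent1=["a", "b", "b", "b"],
                     parent2=["a", "a", "b", "b"],
                     combined=["a", "b", "b", "b"]))
```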
5.2.2.1 Results of the hybrid algorithms
The results for the four hybrid algorithms are listed in Table 5.11. The table shows the possible combinations of correct and incorrect labels predicted by the parents (column 1) and whether in these cases the hybrid assigned a correct or incorrect label (column 2). The other columns show the percentages of the six events for each of the hybrids. Each
column totals to 100%, which represents the total number of predictions made by each hybrid.

The most likely event is that both parents assign a correct label and the hybrid also assigns a correct label. This is expressed in the first row of the table. We see that this is the case for 73–75% of the cases for the three hybrids maxent-h, rules-r-h and rules-a-h. The hybrid winnow-h has a much lower percentage of 65.6%.

The case that both parent algorithms assign a correct label while the hybrid assigns an incorrect label, listed in the second row of Table 5.11, is far less probable and also highly undesirable. The percentages in the second row are low, except for the hybrid winnow-h, which attains a percentage of 6.5%. The hybrid rules-a-h has the lowest percentage for this negative case.

The case that both parent algorithms assign an incorrect label and the hybrid also assigns an incorrect label is expressed in the third row of the table. This is the case in roughly 11% of the cases for all hybrids. Next, a desirable but low-frequency event is the case in which both parents predict incorrect labels while the hybrid predicts the label correctly. Surprisingly, we see that the hybrid winnow-h has the highest percentage for this positive outcome.

The case in which one of the parents predicts a correct label is of particular interest, because we hypothesize that the hybrids can combine the best of both worlds. If this hypothesis is true, the hybrids should predict the correct label more often than the incorrect label in these cases. One could say that the hybrid ideally follows the prediction of the one parent that is correct. In the table we add an extra row that shows the difference in the percentage of the cases where one parent is correct and the hybrid is correct or incorrect. We see a positive difference for three of the hybrids. These hybrids predict the correct class label more often than the incorrect label. This is not the case for the hybrid rules-r-h, which has a negative net score – it predicts the incorrect label more often than the correct label.

The bottom row of the table summarizes the differences between both parents and the hybrids. We summed the percentages of the case that both parents are correct and the hybrid is incorrect, the case that both parents are incorrect and the hybrid is correct, and the cases in which only one parent is correct. These summed scores confirm that the two hybrids maxent-h and rules-a-h have a positive total net score, while the other two hybrids have a negative net score. We also observe that the two rule hybrids have a larger deviation from their parents than the other two hybrids. The two hybrids rules-r-h and rules-a-h have the same parent algorithms. When we compare the counts of overlap in predictions between the rule hybrids and both parents, we see that all six percentages are favorable to the hybrid rules-a-h.
parents           hybrid        maxent-h   rules-r-h   rules-a-h   winnow-h
both correct      correct       74.8       72.3        73.0        65.6
both correct      incorrect     1.8        1.6         0.8         6.5
both incorrect    incorrect     10.6       11.8        11.1        11.3
both incorrect    correct       1.5        0.4         1.2         2.1
one correct       correct       7.4        4.8         9.4         7.8
one correct       incorrect     3.8        9.1         4.5         6.7
one correct       difference    +3.6       -4.3        +4.9        +1.1
hybrid–parent     difference    +3.3       -5.4        +5.2        -3.4

Table 5.11: Normalized overlap in correct/incorrect labels between the hybrids and their parents, macro-averaged over the 33 data sets (and 29 for winnow-h), in percentages.
5.2.2.2 Results of the stacked algorithms
We discuss the counts for the possible events of correct and incorrect predictions by both parents and whether the hybrid algorithms or stacked classifiers also predict the correct or incorrect label. We first look at the hybrid maxent-h and the stacked algorithm stack-maxent. Table 5.12 lists the percentages of overlap of the predictions of both parents with maxent-h, next to the overlap of the predictions of the parents with stack-maxent, macro-averaged over the 33 tasks. The stacked classifier performs better than the hybrid maxent-h for the two cases in which both parents predict a correct label. For these cases, stack-maxent has a slightly higher percentage of predicting the correct label and a lower percentage of assigning an incorrect label. For the cases in which both parents predict an incorrect label, the hybrid performs better than the stacked classifier. maxent-h has a slightly lower percentage of assigning an incorrect label and a higher percentage of assigning the correct label in comparison to stack-maxent. The last row in the table presents the summed differences of the four cases for which the hybrid and stacked classifier deviate from both parents (the sum of rows 2, 4, 5 and 6). Both algorithms have a positive net score, but the hybrid maxent-h has a higher difference than the stacked classifier.

parents           algorithm     maxent-h   stack-max
both correct      correct       74.8       75.4
both correct      incorrect     1.8        1.2
both incorrect    incorrect     10.6       11.0
both incorrect    correct       1.5        1.1
one correct       correct       7.4        7.0
one correct       incorrect     3.8        4.3
one correct       difference    +3.6       +2.7
hybrid–parent     difference    +3.3       +2.6

Table 5.12: Overlap in correct/incorrect labels between maxent-h and both parents, and between stack-maxent and both parents, macro-averaged over the 33 data sets.

parents           algorithm     rules-a-h   stack-rules
both correct      correct       73.0        72.6
both correct      incorrect     0.8         1.2
both incorrect    incorrect     11.1        10.8
both incorrect    correct       1.2         1.5
one correct       correct       9.4         8.9
one correct       incorrect     4.5         5.0
one correct       difference    +4.9        +3.9
hybrid–parent     difference    +5.2        +4.2

Table 5.13: Overlap in correct/incorrect labels between rules-a-h and both parents, and between stack-rules and both parents, macro-averaged over the 33 data sets.

We also compare the hybrid rules-a-h to the stacked classifier stack-rules. Table 5.13 displays the overlap in correct and incorrect predictions of these two algorithms with their parents. The hybrid has a better score than the stacked classifier for the two cases in which both parents predict the correct label. However, stack-rules has a better performance
for the two cases in which both parents predict an incorrect label. (This is the opposite of what we observed in Table 5.12 for the two algorithms that combine maxent with k-nn.) In the case where only one of the parents predicts the correct label, the hybrid rules-a-h performs better and predicts the correct label more often than the stacked classifier. The last row of the table sums the differences between the hybrid and stacked classifier and both parents. We observe that both algorithms have a positive summed difference, but rules-a-h has a higher net score.

Both stacked classifiers have a lower positive net score than the hybrids. The complementary rates showed that stacked classifiers are more similar to the eager parent algorithm in their classification behavior than the hybrids are. These lower positive net scores are again an indirect indication that the stacked classifiers follow the predictions of the eager parent algorithm. The comparative experiments between the four parent algorithms, detailed in Section 3.4.1, showed that the eager learning algorithms have a lower performance than
94
Chapter 5: Analysis
o
x
x
x
x
o
o
o
x
Figure 5.2: The test instance represents the depicted tic-tac-toe end game in which ’x’ does not win.
k-nn. Because the stacked classifiers follow the predictions of the eager parent relatively more often than the hybrids do, they also replicate incorrect predictions of the eager parent more often.
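The net scores in the last two rows of Tables 5.12 and 5.13 follow directly from per-instance bookkeeping over the predictions of the two parents and the combined classifier. As an illustration only, the sketch below tallies such an overlap table from lists of gold labels and predictions; the variable names and toy labels are hypothetical, and this is not the tooling used for the thesis experiments.

```python
from collections import Counter

def overlap_table(gold, parent1, parent2, combined):
    """Tally how a combined classifier (hybrid or stacked) behaves,
    split by whether its two parent classifiers predicted correctly."""
    counts = Counter()
    for y, p1, p2, c in zip(gold, parent1, parent2, combined):
        n_correct = (p1 == y) + (p2 == y)
        parents = {2: "both correct", 1: "one correct", 0: "both incorrect"}[n_correct]
        counts[(parents, "correct" if c == y else "incorrect")] += 1

    pct = {event: 100.0 * n / len(gold) for event, n in counts.items()}
    # The two summary scores reported in the last rows of Tables 5.12/5.13:
    one_correct_diff = (pct.get(("one correct", "correct"), 0.0)
                        - pct.get(("one correct", "incorrect"), 0.0))
    hybrid_parent_diff = (pct.get(("both incorrect", "correct"), 0.0)
                          + pct.get(("one correct", "correct"), 0.0)
                          - pct.get(("both correct", "incorrect"), 0.0)
                          - pct.get(("one correct", "incorrect"), 0.0))
    return pct, one_correct_diff, hybrid_parent_diff

# Toy usage with made-up label sequences:
gold   = ["neg", "pos", "neg", "pos"]
knn    = ["neg", "pos", "pos", "pos"]
eager  = ["pos", "pos", "pos", "pos"]
hybrid = ["neg", "pos", "pos", "pos"]
print(overlap_table(gold, knn, eager, hybrid))
```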
5.3 Illustrative analysis
In this section we provide an illustration of the classification processes of the different classifiers and of some of the factors that can play a role in classification. We look at the details of the classification of a single instance; it is not feasible to investigate every classification made by every classifier and identify the cause of each correct or incorrect prediction, given the complexity of factors that can play a role in the classification process and the large number of experiments. We study the classification process of the two successful hybrids maxent-h and rules-a-h for one particular instance. As a comparison we also investigate the classification of their parent algorithms and of the two stacked classifiers. As our exemplary instance we choose one instance from the tictactoe data set. This data set is chosen because it is a simple problem with a small number of features (nine) and a small number of classes (two). Another important reason for choosing this task is that the two hybrid algorithms (maxent-h and rules-a-h) score 100% accuracy on it while their parents have a lower score. We illustrate what exactly happens in the classification process and show why the algorithms classify the instance correctly or incorrectly. The tictactoe data set concerns the task of determining whether x has won (positive) or has not won (negative) at the end of a tic-tac-toe game. Each of the features represents one of the positions on the board and can have the values x, o, or b, where b stands for 'blank'.

Figure 5.2: The test instance represents the depicted tic-tac-toe end game in which 'x' does not win. (Cell values recovered from the figure, in extraction order: o, x, x, x, x, o, o, o, x.)
feature   value   neg      pos
1         x       -1.910    1.910
1         o        2.199   -2.199
1         b       -0.072    0.072
2         x       -1.785    1.785
2         o        2.068   -2.068
2         b       -0.065    0.065
5         x       -2.083    2.083
5         o        2.369   -2.369
5         b       -0.068    0.068

Table 5.14: The weights of the λ parameters produced by maxent for features 1, 2 and 5.
The class distribution of the data set shows a majority of 65% for the positive class. The test instance that we choose from the data set is visualized in Figure 5.2 and has the class label negative. We report on the classification process as it takes place in the 10-fold cv experiments described in the previous chapters. The test instance is correctly classified by the two hybrid algorithms, the parent algorithm rules and the stacked classifier stack-rules. The other algorithms, k-nn, maxent and the stacked classifier stack-maxent, misclassify this instance. In the remainder of this section we detail the classification process of each of the classifiers. We start with k-nn, which misclassifies the test instance. The k-nn classifier is trained with the following setting chosen by the algorithmic parameter optimization step: k = 1, no feature weighting, overlap distance metric. When classifying the test instance, the k-nn classifier finds 19 nearest neighbors that all have the same distance of 2, i.e., all these training instances mismatch on two of the nine features. Eight of the 19 nearest neighbors predict the correct class negative, but the majority of 11 neighbors predict the incorrect class positive. The chosen parameters yield a coarse and simple distance calculation between instances, and in this case the set of nearest neighbors could not provide the correct prediction.
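For illustration, the following minimal sketch mimics the k-nn step described above: the overlap metric simply counts mismatching features, and with k = 1 all training instances tied at the smallest distance vote for the class (compare the 19 neighbors at distance 2). It is a simplified stand-in with made-up toy data, not the Timbl implementation used in the experiments.

```python
from collections import Counter

def overlap_distance(a, b):
    """Overlap metric: the number of features on which two instances mismatch."""
    return sum(1 for fa, fb in zip(a, b) if fa != fb)

def knn_overlap_classify(test, train):
    """k = 1 with ties: every training instance at the smallest distance votes."""
    scored = [(overlap_distance(test, feats), label) for feats, label in train]
    best = min(d for d, _ in scored)
    votes = Counter(label for d, label in scored if d == best)
    majority_label, _ = votes.most_common(1)[0]
    return majority_label, best, sum(votes.values())

# Hypothetical toy data: three symbolic features per instance, made-up labels.
train = [(("x", "o", "x"), "positive"),
         (("o", "x", "x"), "negative"),
         (("x", "o", "o"), "positive")]
print(knn_overlap_classify(("x", "o", "b"), train))  # -> ('positive', 1, 2)
```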
                   mvdm weights       maxent-h weights
feature   value    neg      pos       neg      pos
1         x        0.289    0.711     0.097    0.903
1         o        0.420    0.580     0.964    0.036
1         b        0.289    0.711     0.485    0.515
2         x        0.391    0.609     0.123    0.877
2         o        0.301    0.699     0.937    0.063
2         b        0.296    0.704     0.486    0.514
5         x        0.192    0.808     0.060    0.940
5         o        0.557    0.443     1.000    0.000
5         b        0.299    0.701     0.486    0.514

Table 5.15: The left part of the table shows the class distributions used in the mvdm metric. The right part of the table shows the scaled weights produced by maxent that are used by maxent-h.
The maximum entropy modeling algorithm also misclassifies the test instance. maxent produces a λ-weight matrix between classes and feature values. When classifying the test instance, maxent sums the λ parameters, which represent the weights between the feature values present in the test instance and the classes. These summed weights are translated to probabilities through an exponential function (Equation 2.19), and the class with the highest probability is chosen. Table 5.14 shows the λ weights as produced by maxent for features one, two and five. Feature one represents the top-left corner of the board, feature two the top middle position, and feature five the center of the board. The table shows the effect of 'maximizing the entropy': the weights are equally and symmetrically distributed over the two classes. As the task is to predict whether or not x wins, the presence of x is highly correlated with the positive class, while o is correlated with the negative class (regardless of the position on the board). The value b is not correlated with the classes and should receive low weights. We see that the weights produced by maxent nicely mirror these correlations: x has a weight of approximately 2 for the positive class and -2 for the negative class, while the value o has the opposite weights. b has a very slight tendency towards the (majority) positive class, as it has a weight of approximately 0.1 for the positive class and -0.1 for the negative class. The test instance contains five x values and four o values. The majority of the weights points to the positive class, and maxent assigns this incorrect class label to the instance.
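The maxent decision just described amounts to summing the λ weights of the active feature values per class and turning the sums into probabilities with an exponential normalization. The sketch below illustrates this using the λ values of Table 5.14; only features 1, 2 and 5 are included and the instance in the example call is hypothetical, so the numbers are illustrative rather than a reproduction of the experiment.

```python
import math

# λ weights per (feature, value) -> {class: weight}, copied from Table 5.14.
LAMBDA = {
    (1, "x"): {"neg": -1.910, "pos": 1.910},
    (1, "o"): {"neg": 2.199, "pos": -2.199},
    (1, "b"): {"neg": -0.072, "pos": 0.072},
    (2, "x"): {"neg": -1.785, "pos": 1.785},
    (2, "o"): {"neg": 2.068, "pos": -2.068},
    (2, "b"): {"neg": -0.065, "pos": 0.065},
    (5, "x"): {"neg": -2.083, "pos": 2.083},
    (5, "o"): {"neg": 2.369, "pos": -2.369},
    (5, "b"): {"neg": -0.068, "pos": 0.068},
}

def maxent_classify(instance, classes=("neg", "pos")):
    """Sum the λ weights of the active feature values per class,
    exponentiate and normalize, and return the most probable class."""
    scores = {c: sum(LAMBDA.get((f, v), {}).get(c, 0.0)
                     for f, v in instance.items())
              for c in classes}
    z = sum(math.exp(s) for s in scores.values())
    probs = {c: math.exp(s) / z for c, s in scores.items()}
    return max(probs, key=probs.get), probs

# Hypothetical partial instance: only features 1 (top-left), 2 and 5 (centre).
print(maxent_classify({1: "o", 2: "x", 5: "x"}))
```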
Figure 5.3: The one nearest neighbor found by the hybrid maxent-h (left) and the nearest neighbor found by rules-a-h (right). (Cell values recovered from the figure, in extraction order: o, x, x, o, o, x, o, x, b, x, x, o, o, o, x, o, x, x.)
The hybrid maxent-h combines maximum entropy modeling with k-nn classification. This hybrid correctly predicts the class label of the test instance. The classifier is trained with the following chosen setting: k = 1, no weighting and the maxent–mvdm distance metric. The classifier finds one nearest neighbor to the test instance, which assigns the correct class label. The tic-tac-toe end game represented by this nearest neighbor is shown in Figure 5.3 (left). The fact that maxent-h is able to find one nearest neighbor that predicts the correct class while k-nn was not is solely due to the different distance metric used (k-nn uses the overlap metric); all other circumstances are equal. The weights used in the maxent–mvdm distance metric are arguably better than the original mvdm weights. Table 5.15 shows the weights of the values of features one, two and five in relation to the classes. The left part of the table shows the class distributions used in the mvdm metric. These estimates represent the probability of the class given the feature value, and are based on straight co-occurrence counts. The right part of the table shows the same λ weight parameters as Table 5.14, but scaled between zero and one. These scaled weights are used in the maxent–mvdm distance metric of the hybrid maxent-h. The class distribution probabilities used in mvdm deviate from the weights produced by maxent. Both express the correlation between x and the positive class, but the correlation between o and the negative class is weak in the mvdm probabilities. Only in the case of feature five does the value o have a higher probability for the negative class; for all other features the value o has a higher probability for the (majority) positive class. The value b also has a higher probability for the positive class. The fact that the positive class is the majority has a large influence on the probabilities used in the mvdm metric, while this influence is only marginally visible in the weights produced by maxent. It is clear that maxent pushes the weights to values that reflect the probabilities underlying this task better than the mvdm probabilities, which are biased by the prior class probabilities.
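Both halves of Table 5.15 can be read as per-value class weights, and the (maxent–)mvdm distance compares two feature values through such weights. The sketch below is a simplified reading of this idea: it computes an MVDM-style value difference from a table of class weights, which could hold either the co-occurrence-based mvdm estimates or the scaled maxent weights. It is an illustration under that assumption, not the exact metric implementation used in the thesis.

```python
def value_difference(class_weights, feature, v1, v2, classes=("neg", "pos")):
    """MVDM-style difference between two values of one feature: how differently
    the two values are distributed over the classes."""
    w1, w2 = class_weights[(feature, v1)], class_weights[(feature, v2)]
    return sum(abs(w1[c] - w2[c]) for c in classes)

def instance_distance(class_weights, a, b):
    """Sum the per-feature value differences over all (numbered) features."""
    return sum(value_difference(class_weights, f, va, vb)
               for f, (va, vb) in enumerate(zip(a, b), start=1))

# Scaled maxent weights for feature 5, taken from the right part of Table 5.15.
weights = {(5, "x"): {"neg": 0.060, "pos": 0.940},
           (5, "o"): {"neg": 1.000, "pos": 0.000},
           (5, "b"): {"neg": 0.486, "pos": 0.514}}
print(value_difference(weights, 5, "x", "o"))  # large: x and o behave very differently
print(value_difference(weights, 5, "x", "x"))  # 0.0: identical values are at distance zero
```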
Next we look at the rule induction algorithm. The algorithmic parameter optimization step selected the following setting: rules produces an unordered rule set, each rule covers at least five instances, negative feature tests are allowed, the misclassification cost is equal for all errors, rule simplification is set to one (neutral), and the number of optimization rounds is two. An unordered rule set means that rules induces rules for both classes. rules produces a rule set of 19 rules. There are eight lines to check in the tic-tac-toe game (three horizontal, three vertical and two diagonal). rules makes 18 rules which each check one of these lines for each of the two classes. The last rule is a 'default' rule without conditions that predicts the class negative. This default rule covers a group of 'draw' end games in which neither x nor o wins. It has the lowest coverage of all rules and applies only when none of the other rules matches. This is the case for our test instance, and the default rule correctly predicts the class label. The hybrid rules-a-h uses the rule set constructed by rules. The rules are transformed into binary rule-features, which are added to the original instances. The instances on which rules-a-h is trained contain 18 rule-features that represent the rules produced by rules, excluding the default rule without conditions. The automatic parameter selection selects the following setting for the hybrid: k = 1, overlap distance metric, gain ratio feature weighting. When classifying the test instance, the classifier finds one nearest neighbor, which correctly predicts the class label. Figure 5.3 (right) shows the tic-tac-toe end game represented by this neighbor. The neighbor contains five x values and four o values and represents a 'draw' end game, similar to the test instance. The stacked classifier stack-maxent misclassifies the test instance, as do both parent algorithms maxent and k-nn. The level-0 maxent classifier misclassifies the test instance as positive. This prediction is added as a feature to the test instance, which forms the input for the level-1 classifier. The level-1 k-nn classifier is trained with the parameter settings k = 15, Jeffrey divergence distance metric, chi-square feature weighting and inverse distance neighbor weighting. The second column of Table 5.16 shows the chi-square weights for the features of the level-1 classifier. The tenth feature represents the added prediction of maxent and has a very high weight. Feature five, which represents the center of the board, also receives a much higher weight. We also see a difference between the weights of the features that represent corners of the board (1, 3, 7, 9) and the other positions (2, 4, 6, 8). This last group of positions has the lowest chance of being part of a 'three-in-a-row', and these features are therefore relatively less important for predicting the class. When classifying the test instance, the level-1 k-nn classifier finds 15 nearest neighbors. The highly weighted feature ten (which represents the misclassification of maxent) has the effect that all the found nearest neighbors match on this feature. Only one of these 15 neighbors predicts the correct class negative; all others predict the class positive. The mistake of the level-0 classifier is replicated by the level-1 classifier. The stacked classifier stack-rules correctly classifies the test instance. First the level-0 classifier rules correctly classifies the test instance, and this prediction is added as a feature. The level-1 k-nn classifier is trained with the settings k = 25, mvdm distance metric, gain ratio feature weighting and inverse linear neighbor weighting. The third column of Table 5.16 shows the gain ratio feature weights used in the level-1 classifier.
The gain ratio weights show a similar picture to the chi-square weights discussed above: the tenth feature gets a much higher weight than the other features.
feature   chi-square   gain ratio
1          15.089      0.008
2           7.894      0.004
3          16.775      0.009
4           9.285      0.005
5         105.320      0.060
6           9.147      0.005
7          17.614      0.010
8           6.329      0.003
9          16.641      0.009
10        800.794      0.978

Table 5.16: Chi-square and gain ratio feature weights of the stacked level-1 classifiers stack-maxent and stack-rules respectively. Feature 10 is the added prediction of the level-0 classifier.
The classifier finds 25 nearest neighbors for the test instance. Each of these neighbors matches the test instance on feature ten, and each of them predicts the correct class negative.
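Seen from the k-nn side, rules-a-h and the two stacked classifiers do essentially the same thing before classification: each instance is extended with extra symbolic features (binary rule matches, or the level-0 prediction), and feature weights then decide how strongly those added features steer the neighbor search. The sketch below illustrates that mechanism with hypothetical weights in the spirit of Table 5.16 and made-up example vectors; it is not the actual pipeline used in the experiments.

```python
def augment(instance, extra_features):
    """Append extra symbolic features (e.g. binary rule matches or a
    level-0 prediction) to the original feature vector."""
    return list(instance) + list(extra_features)

def weighted_overlap(a, b, weights):
    """Weighted overlap distance: every mismatching feature adds its weight."""
    return sum(w for fa, fb, w in zip(a, b, weights) if fa != fb)

# Made-up example: nine board features with modest weights plus one added
# level-0 prediction whose weight dwarfs the rest, as in Table 5.16.
weights  = [15.1, 7.9, 16.8, 9.3, 105.3, 9.1, 17.6, 6.3, 16.6, 800.8]
test     = augment(["o", "x", "x", "x", "x", "o", "o", "o", "x"], ["positive"])
neighbor = augment(["x", "x", "o", "o", "o", "x", "o", "x", "x"], ["positive"])
print(weighted_overlap(test, neighbor, weights))
# Neighbors that disagree with the added prediction are pushed roughly 800 units
# away, which is why the level-1 k-nn tends to follow the level-0 classifier.
```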
5.4 Related work
We discuss related work on bias and variance analysis of classifier ensembles. The literature shows that most types of classifier ensembles can reduce the variance component of the error rate, and that many ensembles reduce both variance and bias (for example Bauer and Kohavi, 1999; Breiman, 1996a,b). We discuss a few recent studies in this area. Xie et al. (2004) introduce a new ensemble method that consists of single classifiers that aim specifically at reducing the statistical bias. For these single classifiers a new learning algorithm framework, the k-mode competitor, is introduced; it is a local learning, partition-based clustering method. This method aims at a low bias at the cost of possibly high variance. However, combining these k-mode competitors in a simple majority-voting ensemble works to reduce the variance. The new ensemble method is tested with two types of machine learning algorithms, C4.5 (Quinlan, 1993) and Naive Bayes (Kononenko, 1990), and compared to bagging (Breiman, 1996a). The new method outperforms bagging. A bias–variance analysis shows that the k-mode competitors indeed have a lower bias than C4.5 and Naive Bayes. The study of Webb and Zheng (2004) concentrates on the demand for diversity in classifier ensembles. They hypothesize that combining ensemble methods will increase the diversity in the ensemble and improve its overall generalization accuracy
at the cost of some increase in the error rate of the individual components. They investigate three ensemble methods: wagging (Bauer and Kohavi, 1999), boosting (Bauer and Kohavi, 1999) and stochastic attribute selection committee learning (Zheng and Webb, 1998). These three methods are combined in pairs and all together, forming new ensemble strategies. Experimental results demonstrate that these combined ensembles have a higher accuracy than their component ensembles. Measurements of the diversity and error of the individual components of the ensembles did not confirm the hypothesis: for some of the combined ensemble techniques, neither the diversity nor the error rate of the components increased. Bay (1999) describes a method for successfully combining multiple k-nn classifiers. Standard methods that use subsamples of the training material, such as boosting and bagging, do not lead to improved accuracy when used with k-nn classifiers. Bay proposes k-nn ensembles that consist of k-nn classifiers trained on randomly selected subsets of features. The experiments in the paper show that these k-nn ensembles perform better than several single k-nn variants, and that they are competitive with boosted decision trees (Dietterich, 2000). A bias–variance analysis shows that these ensembles reduce both bias and variance.
5.5 Summary
In this chapter the functional classification behavior of the ten algorithms described in Chapters 2, 3 and 4 was investigated. The differences and commonalities between the hybrid algorithms and their parent algorithms were analyzed by performing a bias–variance analysis and an error analysis. These analyses were also performed on the stacked algorithms. We performed a bias–variance analysis of the error rates of the algorithms. We conducted additional experiments on a subset of the data sets, measured the average error rate, and decomposed it into bias and variance components. The results showed that the hybrid algorithms have a lower bias than both parent algorithms, except for the pair winnow – winnow-h. A lower bias indicates that the hybrids make fewer systematic errors than the parents. The stacked algorithms also have a lower bias than their parent algorithms. Comparing the stacked classifiers to the hybrids, we observed that rules-a-h has a lower bias than stack-rules, while stack-maxent has a lower bias than the hybrid maxent-h. Next, the outcomes of the comparative experiments described in Chapters 3 and 4 were investigated. We performed two types of error analyses on these outcomes. We computed complementary rates, which measure to what extent classifiers misclassify the same instances. For most pairs of algorithms we found complementary rates of approximately 30% or more, which means that at least one-third of the errors made by one algorithm are not made by the other algorithm. This means for the hybrids maxent-h and rules-a-h that, although they have a generalization performance similar to that of k-nn, they differ quite markedly in their functional classification behavior.
For the pair rules-r-h – rules we found complementary rates lower than 20%, indicating that these algorithms not only have a similar generalization performance, as was shown in Chapter 3, but also a similar functional behavior. This can be explained by the observation that rules-r-h does not fully explore the possibilities of involving different rules in the classification of a test instance: in many cases both rules-r-h and rules use a single rule to classify a test instance. We also compared the complementary rates of the stacked classifiers to the rates of the hybrids. These complementary rates showed that the stacked classifiers are more similar to their eager parents in functional behavior than the hybrids are to theirs. This can be explained by the fact that the level-1 k-nn classifier in a stacked classifier uses the predictions of the eager level-0 classifier as an extra input feature, which influences the final class label assigned by the level-1 classifier. We also investigated the overlap of correct and incorrect predictions of both parent algorithms, and whether the hybrids also made correct or incorrect predictions. The two most common events are the case that both parents predict the correct label and the case that both parents predict the incorrect label. For these two events, the hybrids follow the predictions of the parents with a deviation of approximately 2%. For the interesting event that only one parent predicts the correct label, we observed that three of the hybrids choose the correct label more often than the incorrect label. The negative exception is the hybrid rules-r-h, which predicts the incorrect label more often. We also compared the results of the two stacked classifiers to the results of the hybrids maxent-h and rules-a-h. We found for the case in which one parent predicts the correct label that all four algorithms have a positive net score and predict the correct label more often than the incorrect label. The positive net score of the hybrids is higher than that of the stacked classifiers. This lower net score of the stacked classifiers is again an indirect indication that the stacked classifiers follow the predictions of their eager parent. The eager parents make more errors than k-nn, and the stacked classifiers replicate these incorrect predictions relatively more often than the hybrids do.
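For reference, the complementary rate used throughout this chapter can be computed from two aligned prediction lists. The sketch below takes it as the percentage of one classifier's errors that the other classifier gets right, which matches the one-third reading above; it is shown for illustration only, with made-up labels.

```python
def complementary_rate(gold, pred_a, pred_b):
    """Percentage of instances misclassified by classifier A that
    classifier B classifies correctly."""
    errors_a = [i for i, (y, a) in enumerate(zip(gold, pred_a)) if a != y]
    if not errors_a:
        return 0.0
    rescued = sum(1 for i in errors_a if pred_b[i] == gold[i])
    return 100.0 * rescued / len(errors_a)

# Toy usage with made-up label sequences:
gold  = ["pos", "neg", "neg", "pos", "neg"]
knn   = ["pos", "pos", "neg", "neg", "neg"]
eager = ["pos", "neg", "pos", "neg", "neg"]
print(complementary_rate(gold, knn, eager))  # share of knn's errors avoided by the other classifier
```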
Chapter 6
Conclusions

We summarize the answers to the research questions central to this thesis.
6.1 Research questions
In this section we discuss the four research questions that were introduced in Chapter 1. For each question we summarize and interpret the experiments we performed to answer it.
6.1.1 Construction and performance of hybrid algorithms
k-nn classification is a machine learning algorithm that has been applied successfully to many classification tasks (Aha et al., 1991; Cover and Hart, 1967; Daelemans and van den Bosch, 2005; Dasarathy, 1991). In this thesis we investigated whether the k-nn classification method also works well when combined with eager learning. The main research question investigated in this thesis was the following: can we, by replacing the simple classification method of an eager learner with memory-based classification, improve generalization performance in comparison to one or both of the parent algorithms? To answer this question, we designed hybrid algorithms that combine the learning module of eager learning algorithms with the k-nn classification method. These hybrids use the model produced by the eager learning algorithm to modify the distance calculations in k-nn. We performed our study using instantiations of three quite different types of eager machine learning algorithms: rule induction, maximum entropy modeling and multiplicative update learning. We constructed the following hybrid algorithms:
• The hybrid algorithm maxent-h combines maximum entropy modeling with k-nn classification. This algorithm classifies new instances by using the maxent–mvdm distance metric to find the k nearest neighbors in the data.

• We designed two hybrids that combine rule induction with k-nn classification. The first hybrid, rules-r-h, is trained on instances consisting of binary rule-features that transform and replace the original features in the instances. The second hybrid, rules-a-h, is trained on instances in which binary rule-features are added to the original instances.

• The hybrid winnow-h combines multiplicative update learning with k-nn. This hybrid uses the weights produced by winnow as feature weights. These weights are used in combination with the distance metric of the k-nn classification method to determine the similarity between instances in the classification phase.

We tested the generalization performance of the hybrid algorithms and their parents by running experiments on a range of data sets. We estimated the significance of the differences found by performing paired t-tests. The results of the comparative experiments showed that the two hybrids maxent-h and rules-a-h performed as well as or slightly better than their parent k-nn, and better than their eager parent. The hybrid rules-r-h performed worse than k-nn and similarly to its parent rules. The hybrid winnow-h performed worse than both of its parents. In sum, the answer to our main research question is that we have only partly succeeded in creating hybrid algorithms that perform better than at least one of their parents. The k-nn classification method does not always work successfully when it is combined with the model of an eager learner. We investigated the classification behavior of these hybrids in more detail to find possible explanations of the differences in performance between the hybrids and their parents. This investigation is summarized in the next section.
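The significance estimates mentioned above rest on paired t-tests over matched scores. Assuming the pairing is over the accuracies of two algorithms on the same data sets (or the same cross-validation folds), a minimal SciPy-based sketch looks as follows; the accuracy numbers are made up.

```python
from scipy.stats import ttest_rel

# Hypothetical paired accuracies of a hybrid and its k-nn parent on the same tasks.
hybrid_acc = [0.842, 0.913, 0.778, 0.951, 0.804, 0.869]
knn_acc    = [0.836, 0.910, 0.771, 0.948, 0.809, 0.861]

t_statistic, p_value = ttest_rel(hybrid_acc, knn_acc)
print(f"t = {t_statistic:.3f}, p = {p_value:.3f}")
# A small p-value indicates that the mean of the paired differences is unlikely
# to be zero, i.e. the observed performance difference is statistically significant.
```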
6.1.2 Differences and commonalities between hybrids and their parents
Besides performance we were also interested in the differences and commonalities between the hybrid algorithms and their parents. We analyzed the functional classification behavior of the hybrids and their parents by performing bias–variance analyses and error analyses on the outcomes of the comparative experiments. First we performed a bias–variance analysis of the error rates of the algorithms. We conducted additional experiments on a subset of the data sets, measured the average error rate, and decomposed it into bias and variance components. The results showed that the hybrid algorithms have a lower bias than both parent algorithms, except for winnow-h, which has a higher bias than winnow. A lower bias indicates that the algorithm makes relatively fewer systematic errors.
The hybrids rules-r-h and winnow-h have a lower number of systematic errors than their parents only in relative terms, as these hybrids have a higher overall error rate; the other two hybrids have a lower number of systematic errors in absolute terms. Next we investigated the outcomes of the comparative experiments between the hybrid algorithms and their parent algorithms. We performed two types of error analyses on these outcomes. We computed complementary rates, which measure to what extent classifiers misclassify the same instances. We observed that, in general, at least one-third of the instances misclassified by one algorithm are correctly classified by the other. This indicates that the classification behavior of the algorithms differs markedly. The comparative experiments between hybrids and parents showed that the two hybrids maxent-h and rules-a-h are equal to k-nn in terms of their generalization performance; this analysis shows that they nevertheless differ from k-nn in terms of their functional behavior. Exceptions are the complementary rates of the pair rules-r-h and rules, which are much lower than the rates of the other algorithm pairs. This implies that not only the performance of the hybrid rules-r-h is similar to that of its parent rules, but also its functional behavior. The explanation of this observation is that rules-r-h does not fully explore the possibilities of involving different rules in the classification of a test instance: in many cases both rules-r-h and rules use a single rule to classify a test instance. We also investigated the overlap of correct and incorrect predictions of both parent algorithms and whether the hybrids also made correct or incorrect predictions. For the interesting case in which one parent predicts the correct label while the other parent predicts an incorrect label, all hybrids except rules-r-h have a positive net score; they predict the correct label more often than the incorrect label.
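One common way to decompose zero-one loss over repeated training runs (in the spirit of Domingos, 2000) takes the modal prediction per test instance as the main prediction, counts a wrong main prediction as bias, and measures variance as disagreement with the main prediction. The sketch below follows that reading for illustration; the exact decomposition procedure used in the thesis experiments may differ in its details.

```python
from collections import Counter

def bias_variance_zero_one(gold, runs):
    """gold: the true label of each test instance.
    runs: one prediction list per training run (e.g. per resampled training
    set), each aligned with gold. Returns (bias, variance) averaged over
    the test instances."""
    n = len(gold)
    bias = variance = 0.0
    for i, y in enumerate(gold):
        preds = [run[i] for run in runs]
        main = Counter(preds).most_common(1)[0][0]   # modal ("main") prediction
        bias += float(main != y)                     # systematic error at this point
        variance += sum(p != main for p in preds) / len(preds)  # spread around it
    return bias / n, variance / n

# Toy usage: three runs over four test instances with made-up labels.
gold = ["a", "b", "a", "b"]
runs = [["a", "b", "b", "b"],
        ["a", "a", "b", "b"],
        ["a", "b", "b", "a"]]
print(bias_variance_zero_one(gold, runs))
```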
6.1.3 Comparison between hybrids and classifier ensembles
Classifier ensembles are similar to the hybrid algorithms in the sense that they also combine classifiers in order to exploit the complementarity of the individual classifiers and achieve better performance. We therefore performed a comparison between hybrid algorithms and classifier ensembles built from the same pair of algorithms. We chose the ensemble method stacking, in which the predictions of base classifiers form the input for another classifier that predicts the final class labels. We designed two stacked classifier ensembles that are closely analogous to the two successful hybrids maxent-h and rules-a-h, with the goal of creating a minimal pair for comparison. In each stacked classifier, the eager algorithm produces a set of predicted labels for the test instances. These labels are used by the k-nn algorithm in the final classification of the test instances. The first stacked classifier, stack-maxent, uses maximum entropy modeling as its level-0 classifier. The second stacked classifier is called stack-rules and has a rule induction algorithm as its level-0 classifier. The main difference between these stacked classifiers and the hybrid algorithms is that the stacked classifiers consist of two classifiers while the hybrids consist of one.
We ran experiments with the stacked classifiers and compared their performance to the performance of the hybrids, and to their parent algorithms. We observed that the stacked classifiers perform very similarly to the hybrid algorithms. Compared to their parent algorithms, the stacked classifiers performed similarly to k-nn and markedly better than their eager parent algorithm. We performed further analyses and investigated the differences and commonalities in the classification behavior of hybrid algorithms and stacked classifiers. As with the hybrid classifiers, we conducted a bias–variance analysis of error rates and performed additional error analyses. The decomposition of error into bias and variance components showed that, similarly to the hybrids, both stacked classifiers have a lower bias component in their error rate than both of their parents. Comparing the bias of the two stacked classifiers to the bias components of the two hybrids does not give unequivocal results: the hybrid maxent-h has a higher bias than stack-maxent, while the hybrid rules-a-h has a lower bias than the stacked classifier stack-rules. Analyzing complementary rates between the stacked classifiers and their parents, we made the interesting observation that the stacked classifiers are relatively more similar to their eager parents in the errors that they make than the hybrids are. In this aspect the stacked classifiers deviate from the hybrids. When we analyzed the complementary rates between the stacked classifiers and the hybrid algorithms, we observed that they are quite different in their functional classification behavior, as more than one-third of the mistakes made by one algorithm are correctly classified by the other algorithm. We also compared the overlap in predictions with both parents of the two stacked classifiers to the results of the hybrids maxent-h and rules-a-h. We found for the case that one parent predicts the correct label that all four algorithms have a positive net score and predict the correct label more often than the incorrect label. However, the positive net score of the hybrids is higher than the net scores of the stacked classifiers. This is an indirect indication that the stacked classifiers follow the predictions of their eager parent more often. The eager parents make more errors than k-nn, and the stacked classifiers replicate part of these incorrect predictions relatively more often than the hybrids do. In sum, although the generalization performance of the stacked classifiers and hybrid algorithms is equal, they are quite different in their functional classification behavior. A marked difference is the observation that the stacked classifiers are more similar in their behavior to their eager parent algorithm than the hybrids are.
6.1.4 Performance of hybrids on NLP tasks
In this thesis we have concentrated on combining the k-nn classification method with eager learning. As explained in Chapter 1, k-nn has been shown to be a suitable machine learning algorithm for natural language processing tasks (Daelemans and van den Bosch, 2005; Daelemans et al., 1999).
For this reason we were specifically interested in the performance of the hybrid algorithms on the NLP tasks. We investigated the results of the three hybrids maxent-h, rules-a-h and rules-r-h and their parent algorithms. We observed that the results on the NLP tasks do not differ much from the observations made on all tasks. An exception was the hybrid maxent-h, which had a lower performance than k-nn on the NLP tasks, while over all tasks the two had quite similar performance. An explanation for this observation may be that the weights used in the maxent–mvdm distance metric do not properly capture the many low-frequency events in a typical NLP task, as maxent optimizes these weights to maximize the entropy, thereby smoothing over low-frequency events.
6.2 k-nn classification
The starting point of the research presented in this thesis was the contrast between lazy and eager learning. Eager learners put their effort into abstracting a condensed model of the target function to be learned from the training material. Lazy learners, on the other hand, put no effort into the learning phase, as they simply store all training instances in memory. They divert their effort to the classification phase and make a local estimation of the target function by searching for similar instances in their memory. We have seen in our comparative experiments that k-nn, when equipped with automatic parameter optimization, is a powerful machine learning algorithm: on average it performed better than the three eager learning algorithms. From the perspective of the eager learners, we used the k-nn classification method to enhance the performance of the eager algorithms by creating hybrid algorithms. Two of these constructed hybrids indeed performed better than their eager parents. From k-nn's perspective, we enhanced the distance calculation between instances by adding the model of an eager learner. We have seen that k-nn is a machine learning algorithm that offers many possibilities to further refine and enhance the distance calculation, by modifying the distance metrics, the feature weighting methods, or the instances on which the algorithm is trained. In this thesis we explored a part of these possibilities. Exploring other ways to enhance the k-nearest neighbor algorithm is an appealing path for future research.
Bibliography

Abney, S. (1991). Parsing by chunks. In Principle-Based Parsing, pages 257–278. Kluwer Academic Publishers, Dordrecht. Aha, D. W., editor (1997). Lazy learning. Kluwer Academic Publishers, Dordrecht. Aha, D. W. and Bankert, R. L. (1997). Cloud classification using error-correcting output codes. Artificial Intelligence Applications: Natural Resources, Agriculture, and Environmental Science, 11(1):13–28. Aha, D. W., Kibler, D., and Albert, M. (1991). Instance-based learning algorithms. Machine Learning, 6:37–66. Allwein, E. L., Schapire, R. E., and Singer, Y. (2000). Reducing multiclass to binary: A unifying approach for margin classifiers. Journal of Machine Learning Research, 1:113–141. Alpaydin, E. (2004). Introduction to Machine Learning. MIT Press. Appelt, D. and Israel, D. (1999). Introduction to information extraction technology. Tutorial presented at the International Joint Conference on Artificial Intelligence (IJCAI-1999). Bauer, E. and Kohavi, R. (1999). An empirical comparison of voting classification algorithms: Bagging, boosting, and variants. Machine Learning, 36:105–139. Bay, S. D. (1999). Nearest neighbor classification from multiple feature subsets. Intelligent Data Analysis, 3:191–209. Beran, R. J. (1977). Minimum Hellinger distance estimates for parametric models. Annals of Statistics, 5:445–463. Berger, A., Della Pietra, S., and Della Pietra, V. (1996). Maximum Entropy Approach to Natural Language Processing. Computational Linguistics, 22(1). Blake, C. and Merz, C. (1998). UCI repository of machine learning databases. http://www.ics.uci.edu/~mlearn/MLRepository.html.
Bottou, L. and Vapnik, V. (1992). Local learning algorithms. Neural Computation, 4:888–900. Bradley, J. (1997). The use of the area under the curve in the evaluation of machine learning algorithms. Pattern Recognition, 30(7):1145–1159. Brain, D. and Webb, G. I. (2002). The need for low bias algorithms in classification learning from large data sets. In Proceedings of the 6th European Conference on Principles of Data Mining and Knowledge Discovery (PKDD-2002), pages 62–73. Breiman, L. (1996a). Bagging Predictors. Machine Learning, 24(2):123–140. Breiman, L. (1996b). Bias, variance and arcing classifiers. Technical Report 460, University of California, Statistics Department, Berkeley, CA. Breiman, L., Friedman, J., Ohlsen, R., and Stone, C. (1984). Classification and regression trees. Wadsworth International Group, Belmont, CA. Brill, E. (1994). Some advances in transformation-based part-of-speech tagging. In Proceedings of the Twelfth National Conference on Artificial Intelligence (AAAI-1994), pages 722–727. Brill, E. and Wu, J. (1998). Classifier combination for improved lexical disambiguation. In Proceedings of the 36th Annual Meeting of the Association for Computational Linguistics and 17th International Conference on Computational Linguistics (COLING-ACL-1998), pages 191–195. Buchholz, S. (2002). Memory-based Grammatical Relation Finding. PhD thesis, Tilburg University, Tilburg, The Netherlands. Carletta, J. (1996). Assessing agreement on classification tasks: the kappa statistic. Computational Linguistics, 22(2):249–254. Carlson, A. J., Cumby, C. M., Rosen, J. L., and Roth, D. (1999). SNoW user's guide. Technical report, University of Illinois, Computer Science Department. Chan, P. and Stolfo, S. (1997). On the accuracy of meta-learning for scalable data mining. Journal of Intelligent Information Systems, 8(1):5–28. Chen, S. and Goodman, J. (1996). An empirical study of smoothing techniques for language modelling. In Proceedings of the 34th Annual Meeting of the Association for Computational Linguistics (ACL-1996), pages 310–318. Chen, S. F. and Rosenfeld, R. (1999). A Gaussian prior for smoothing maximum entropy models. Technical Report CMU-CS-99-108, Computer Science Department, Carnegie Mellon University. Clark, P. and Boswell, R. (1991). Rule induction with CN2: Some recent improvements. In Proceedings of the Sixth European Working Session on Learning, pages 151–163.
Clark, P. and Niblett, T. (1989). The CN2 rule induction algorithm. Machine Learning, 3:261–284. Cohen, J. (1960). A coefficient of agreement for nominal scales. Educational and Psychological Measurement, 20:37–46.
Cohen, W. (1995). Fast effective rule induction. In Proceedings 12th International Conference on Machine Learning (ICML-1995), pages 115–123. Cost, S. and Salzberg, S. (1993). A weighted nearest neighbour algorithm for learning with symbolic features. Machine Learning, 10:57–78. Cover, T. M. and Hart, P. E. (1967). Nearest neighbor pattern classification. Institute of Electrical and Electronics Engineers Transactions on Information Theory, 13:21–27. Culberson, J. C. (1998). On the futility of blind search: An algorithmic view of 'no free lunch'. Evolutionary Computation, 6:109–127. Daelemans, W. and van den Bosch, A. (2005). Memory-based Language Processing. Cambridge University Press.
Daelemans, W., Van den Bosch, A., and Zavrel, J. (1999). Forgetting exceptions is harmful in language learning. Machine Learning, Special issue on Natural Language Learning, 34:11–41. Daelemans, W., Zavrel, J., van den Bosch, A., and Van der Sloot, K. (2003). MBT: Memory based tagger, version 2.0, reference guide. ILK Technical Report 03-13, Tilburg University, ILK. Daelemans, W., Zavrel, J., Van der Sloot, K., and Van den Bosch, A. (2004). TiMBL: Tilburg Memory Based Learner, version 5.1, reference manual. Technical Report ILK0402, ILK, Tilburg University. Dagan, I., Karov, Y., and Roth, D. (1997). Mistake-driven learning in text categorization. In Proceedings of the Second Conference on Empirical Methods in Natural Language Processing (EMNLP-1997), pages 55–63. Darroch, J. N. and Ratcliff, D. (1972). Generalized iterative scaling for log-linear models. Annals of Mathematical Statistics, 43(5):1470–1480. Dasarathy, B. V. (1991). Nearest Neighbor (NN) Norms: NN Pattern Classification Techniques. IEEE Computer Society Press, Los Alamitos, CA. Della Pietra, S., Della Pietra, V., and Lafferty, J. (1997). Inducing features of random fields. IEEE Transactions on Pattern Analysis and Machine Intelligence, 19(4):380–393. Dietterich, T. G. (2000). An experimental comparison of three methods for constructing ensembles of decision trees: Bagging, boosting, and randomization. Machine Learning, 40(2):139–158.
Dietterich, T. G. (2002). Ensemble learning. In The Handbook of Brain Theory and Neural Networks, Second edition, pages 405–408. The MIT Press, Cambridge, MA. Dietterich, T. G. and Bakiri, G. (1995). Solving Multiclass Learning Problems via Error–Correcting Output Codes. Journal of Artificial Intelligence Research, 2:263–286. Dietterich, T. G. and Kong, E. B. (1995). Machine learning bias, statistical bias, and statistical variance of decision tree algorithms. Technical report, Oregon State University, Department of Computer Science. Domingos, P. (1996). Unifying instance-based and rule-based induction. Machine Learning, 24:141–168.
Domingos, P. (2000). A unified bias-variance decomposition for zero-one and squared loss. In Proceedings of the Seventeenth National Conference on Artificial Intelligence (NCAI-2000), pages 564–569. Dougherty, J., Kohavi, R., and Sahami, M. (1995). Supervised and unsupervised discretization of continuous features. In Proceedings of the International Conference on Machine Learning (ICML-1995), pages 194–202. Dudani, S. A. (1976). The Distance-Weighted k-Nearest Neighbor Rule. In IEEE Transactions on Systems, Man, and Cybernetics, volume SMC-6, pages 325–327. Dzeroski, S. and Zenko, D. (2004). Is combining classifiers better than selecting the best one? Machine Learning, 54:255–273. Escudero, G., Márquez, L., and Rigau, G. (2000). A comparison between supervised learning algorithms for word sense disambiguation. In Proceedings of the 4th Conference on Computational Natural Language Learning (CoNLL-2000), pages 31–36. Fawcett, T. (2003). ROC graphs: Notes and practical considerations for data mining researchers. Technical report, HP Laboratories, Palo Alto, USA. Fayyad, U. and Irani, K. (1993). Multi-interval discretization of continuous-valued attributes for classification learning. In Proceedings of the 13th International Joint Conference on Artificial Intelligence (IJCAI-1993), pages 1022–1027. Fix, E. and Hodges, J. L. (1951). Discriminatory analysis, nonparametric discrimination, consistency properties. Technical Report Project 21-49-004, Report No. 4, USAF School of Aviation Medicine. Flach, P. A. (2004). The Many Faces of ROC Analysis in Machine Learning. Tutorial presented at the Twenty-First International Conference on Machine Learning. Florian, R. (2002). Named entity recognition as a house of cards: Classifier stacking. In Proceedings of the seventh Conference on Computational Natural Language Learning (CoNLL-2002), pages 175–178.
Florian, R., Cucerzan, S., Schafer, C., and Yarowsky, D. (2002). Combining classifiers for word sense disambiguation. Journal of Natural Language Engineering, 8(4):327–341. Frank, E., Wang, Y., Inglis, S., Holmes, G., and Witten, I. H. (1998). Using model trees for classification. Machine Learning, 32(1):63–76. Freund, Y. (1990). Boosting a weak learning algorithm by majority. In Proceedings of the Third Workshop on Computational Learning Theory, pages 202–216. Friedman, J. H., Kohavi, R., and Yun, Y. (1996). Lazy decision trees. In Proceedings of the Thirteenth National Conference on Artificial Intelligence and the Eighth Innovative Applications of Artificial Intelligence Conference (AAAI-1996), pages 717–724. F¨ urnkranz, J. (1999). Separate-and-conquer rule learning. Artificial Intelligence Review, 13(1):3–54. F¨ urnkranz, J. (2001). Round robin rule learning. In Proceedings of the 18th International Conference on Machine Learning (ICML-2001), pages 146–153. F¨ urnkranz, J. and Widmer, G. (1994). Incremental reduced error pruning. In Proceedings of the Eleventh Annual Conference of Machine Learning, pages 70–77. German, S., Bienenstock, E., and Doursat, R. (1992). bias/variance dilemma. Neural Computation, 4:1–48.
Neural networks and the
Golding, A. R. and Roth, D. (1999). A Winnow-Based Approach to Context-Sensitive Spelling Correction. Machine Learning, 34(1–3):107–130. Grünwald, P., Myung, I. J., and Pitt, M., editors (2005). Advances in Minimum Description Length: Theory and Applications. MIT Press.
Guiasu, S. and Shenitzer, A. (1985). The principle of maximum entropy. The Mathematical Intelligencer, 7(1). Hansen, L. and Salamon, P. (1990). Neural network ensembles. IEEE Transactions on Pattern Analysis and Machine Intelligence, 12:993–1001. Hendrickx, I. and van den Bosch, A. (2003). Memory-based one-step named-entity recognition: Effects of seed list features, classifier stacking, and unannotated data. In Proceedings of the seventh Conference on Computational Natural Language Learning (CoNLL-2003), pages 176–179. Hendrickx, I. and van den Bosch, A. (2004). Maximum-entropy parameter estimation for the k-nn modified value-difference kernel. In Proceedings of the 16th Belgian-Dutch Conference on Artificial Intelligence (BNAIC-2004), pages 19–26. Hendrickx, I. and van den Bosch, A. (to appear, 2005). Hybrid algorithms with instance-based classification. In Proceedings of the European Conference on Machine Learning (ECML-2005).
Huang, J. and Ling, C. X. (2005). Using AUC and accuracy in evaluating learning algorithms. IEEE Transactions on Knowledge and Data Engineering, 17(3):299–310. Jackson, P. and Moulinier, I. (2002). Natural Language Processing for Online Applications: Text Retrieval, Extraction and Categorization. John Benjamins. Keynes, J. M. (1921). Treatise on probability. Klein, D. and Manning, C. (2003). Maxent models, conditional estimation, and optimization, without the magic. Tutorial presented at the 41st Annual Meeting of the Association for Computational Linguistics (ACL-2003). Kohavi, R. and John, G. (1997). Wrappers for feature subset selection. Artificial Intelligence Journal, 97(1–2):273–324.
Kohavi, R. and Sahami, M. (1996). Error-based and entropy-based discretization of continuous features. In Proceedings of the Second International Conference on Knowledge Discovery and Data Mining (KDD-1996), pages 114–119. Kohavi, R. and Wolpert, D. (1996). Bias plus variance decomposition for zero-one loss functions. In Proceedings of the thirteenth International Conference on Machine Learning (ICML-1996), pages 275–283. Kolodner, J. (1993). Case-based reasoning. San Mateo, CA: Morgan Kaufmann. Kong, E. B. and Dietterich, T. G. (1995). Error-correcting output coding corrects bias and variance. In Proceedings of the 12th International Conference on Machine Learning, pages 313–321. Kononenko, I. (1990). Comparison of inductive and naive bayesian learning approaches to automatic knowledge acquisition. Current Trends in Knowledge Acquisition, pages 190–197. Koster, C. H. A., Seutter, M., and Beney, J. G. (2003). Multi-Classification of Patent Applications with Winnow. In Proceedings of the fifth Ershov Memorial Conference, pages 545–554. Langley, P. (1996). Elements of machine learning. Morgan Kaufmann, San Mateo, CA. Le, Z. (2004). Maximum Entropy Modeling Toolkit for Python and C++. Natural Language Processing Lab, Northeastern University, China. Ledezma, A., Aler, R., Sanchis, A., and Borrajo, D. (2004). Empirical evaluation of optimized stacking configurations. In Proceedings of the IEEE International Conference on Tools with Artificial Intelligence (ICTAI-2004), pages 49–55. Lee, C. H. and Shin, D. G. (1999). A multistrategy approach to classification learning in databases. Data & Knowledge Engineering, 31(1):67–93. Littlestone, N. (1988). Learning quickly when irrelevant attributes abound: A new linear-threshold algorithm. Machine Learning, 2:285–318.
Liu, H., Hussain, F., Tan, C. L., and Dash, M. (2002). Discretization: An enabling technique. Data Mining and Knowledge Discovery, 6(4):393–423. Marcus, M., Santorini, S., and Marcinkiewicz, M. (1993). Building a Large Annotated Corpus of English: the Penn Treebank. Computational Linguistics, 19(2):313–330. Masulli, F. and Valentini, G. (2004). An experimental analysis of the dependence among codeword bit errors in ECOC learning machines. Neurocomputing, 57:189–214. Mihalcea, R. (2002). Classifier Stacking and Voting for Text Classification. In Proceedings of the 11th Text Retrieval Conference (TREC-11). Mitchell, T. (1997). Machine Learning. McGraw-Hill, New York, NY. Munoz, M., Punyakanok, V., Roth, D., and Zimak, D. (1999). A learning approach to shallow parsing. In Proceedings of the Joint SIGDAT Conference on Empirical Methods in Natural Language Processing and Very Large Corpora (EMNLP-VLC’1999), pages 168–178. Ngai, G., Wu, D., Carpuat, M., Wang, C. S., and Wang, C. Y. (2004). Semantic role labeling with boosting, svms, maximum entropy, snow, and decision lists. In Proceedings of the Third International Workshop on the Evaluation of Systems for the Semantic Analysis of Text (Senseval-3), pages 183–186. Nigam, K., Lafferty, J., and McCallum, A. (1999). Using maximum entropy for text classification. In IJCAI-99 Workshop on Machine Learning for Information Filtering, pages 61–67. Nocedal, J. (1980). Updating quasi-Newton matrices with limited storage. Mathematics of Computation, 35:773–782. Opitz, D. and Maclin, R. (1999). Popular ensemble methods, an empirical study. Journal of AI Research, 11:169–198. Opitz, D. W. and Shavlik, J. W. (1996). Actively searching for an effective neural-network ensemble. Connection Science, 8:337–353. Pennock, D. M., Horvitz, E., Lawrence, S., and Giles, C. L. (2000). Collaborative filtering by personality diagnosis: A hybrid memory and model-based approach. In Proceedings of the 16th Conference on Uncertainty in Artificial Intelligence (UAI-2000), pages 473– 480. Provost, F. and Domingos, P. (2003). Tree induction for probability-based rankings. Machine Learning, 52(3):199–215. Provost, F., Fawcett, T., and Kohavi, R. (1998). The case against accuracy estimation for comparing induction algorithms. In Proceedings of the Fifteenth International Conference on Machine Learning (ICML-1998), pages 445–453.
Punyakanok, V., Roth, D., Yih, W.-T., Zimak, D., and Tu, Y. (2004). Semantic role labeling via generalized inference over classifiers. In Proceedings of the 8th Conference on Computational Natural Language Learning (CoNLL-2004), pages 130–133. Quinlan, J. R. (1986). Induction of Decision Trees. Machine Learning, 1:81–206. Quinlan, J. R. (1993). C4.5: Programs for Machine Learning. Morgan Kaufmann, San Mateo, CA. Raaijmakers, S. (2000). Learning distributed linguistic classes. In Proceedings of the Fourth Conference on Computational Language Learning and the Second Learning Language in Logic Workshop (CoNLL-LLL’2000), pages 55–60. Ramshaw, L. A. and Marcus, M. P. (1995). Text chunking using transformation-based learning. In Proceedings of the 3rd ACL/SIGDAT Workshop on Very Large Corpora, pages 82–94. Ratnaparkhi, A. (1998). Maximum Entropy Models for Natural Language Ambiguity Resolution. PhD thesis, University of Pennsylvania. Ratnaparkhi, A., Reynar, J., and Roukos, S. (1994). A maximum entropy model for prepositional phrase attachment. In Proceedings of the ARPA Human Language Technology Workshop, pages 250–255. Ricci, F. and Aha, D. W. (1998). Error-correcting output codes for local learners. In Proceedings of the Tenth European Conference on Machine Learning (ECML-1998), pages 280–291. Rissanen, J. (1983). A universal prior for integers and estimation by minimum description length. Annals of Statistics, 11:416–431. Rosenblatt, F. (1958). The perceptron: A probabilistic model for information storage and organisation in the brain. Psychological Review, 65:368–408. Roth, D. (1998). Learning to resolve natural language ambiguities: A unified approach. In Proceedings of the National Conference on Artificial Intelligence (NCAI-1998), pages 806–813. Roth, D. and Zelenko, D. (1998). Part of speech tagging using a network of linear separators. In Proceedings of the 36th Annual Meeting of the Association for Computational Linguistics and 17th International Conference on Computational Linguistics(COLING-ACL’1998), pages 1136–1142. Sakkis, G., Androutsopoulos, I., Paliouras, G., Karkaletsis, V., Spyropoulos, C. D., and Stamatopoulos, P. (2001). Stacking classifiers for anti-spam filtering of e-mail. In Proceedings of the International Conference on Empirical Methods in Natural Language Processing (EMNLP-2001), pages 44–50. Santorini, B. (1990). Part-of-speech Tagging Guidelines for the Penn Treebank Project (third revision).
Sebag, M. and Schoenauer, M. (1994). A rule-based similarity measure. In Wess, S., Althoff, K.-D., and Richter, M. M., editors, Topics in case-based reasoning, pages 119–130. Springer Verlag. Seewald, A. K. (2002). How to make stacking better and faster while also taking care of an unknown weakness. In Proceedings of the International Conference on Machine Learning (ICML-2002), pages 554–561. Shepard, R. N. (1987). Toward a universal law of generalization for psychological science. Science, 237:1317–1323. Stanfill, C. and Waltz, D. (1986). Toward memory-based reasoning. Communications of the ACM, 29(12):1213–1228. Ting, K. and Cameron-Jones, R. (1994). Exploring a framework for instance based learning and naive bayesian classifiers. In Proceedings of the Seventh Australian Joint Conference on Artificial Intelligence (AI-1994), pages 100–107. Ting, K. M. (1994). Discretization of continuous-valued attributes and instance-based learning, technical report 491. Technical report, Basser Department of Computer Science, University of Sydney. Ting, K. M. and Witten, I. H. (1999). Issues in stacked generalization. Journal of Artificial Intelligence Research, 10:271–289. Tjong Kim Sang, E. (2002). Memory-based shallow parsing. Journal of Machine Learning Research. Tjong Kim Sang, E. and Buchholz, S. (2000). Introduction to the CoNLL-2000 shared task: Chunking. In Proceedings of the Fourth Conference on Computational Language Learning and the Second Learning Language in Logic Workshop (CoNLL-LLL'2000), pages 127–132. Tjong Kim Sang, E. and De Meulder, F. (2003). Introduction to the CoNLL-2003 shared task: Language-independent named entity recognition. In Proceedings of the seventh Conference on Computational Language Learning (CoNLL-2003), pages 142–147. Tjong Kim Sang, E. and Veenstra, J. (1999). Representing text chunks. In Proceedings of the ninth Conference of the European Chapter of the Association for Computational Linguistics (EACL-1999), pages 173–179.
van den Bosch, A., Canisius, S., Daelemans, W., Hendrickx, I., and Tjong Kim Sang, E. (2004). Memory-based semantic role labeling: Optimizing features, algorithm, and output. In Proceedings of the Eighth Conference on Computational Natural Language Learning (CoNLL-2004). van den Bosch, A. and Daelemans, W. (2005). Improving sequence segmentation learning by predicting trigrams. In Proceedings of the Ninth Conference on Computational Natural Language Learning, pages 80–87. van den Bosch, A. and Zavrel, J. (2000). Unpacking multi-valued symbolic features and classes in memory-based language learning. In Proceedings of the Seventeenth International Conference on Machine Learning (ICML-2000), pages 1055–1062. van Halteren, H., Zavrel, J., and Daelemans, W. (2001). Improving accuracy in word class tagging through combination of machine learning systems. Computational Linguistics, 27(2):199–230. van Rijsbergen, C. J. (1979). Information Retrieval. Butterworths, London. Vapnik, V. (1998). Statistical Learning Theory. John Wiley and Sons Inc., New York. Veenstra, J. (1998). Fast NP chunking using memory-based learning techniques. In Proceedings of the Annual Machine Learning Conference of Belgium and the Netherlands (BENELEARN-98), pages 71–78. Vilalta, R. and Drissi, Y. (2002). A perspective view and survey of meta-learning. Artificial Intelligence Review, 18(2):77–95. Vilalta, R., Giraud-Carrier, C., and Brazdil, P. (2005). Data Mining and Knowledge Discovery Handbook: A Complete Guide for Practitioners and Researchers, chapter Meta-Learning: Concepts and Techniques. Kluwer Academic Publishers (In Press).
Wolpert, D. H. (1992). On overfitting avoidance as bias. Technical Report SFI TR 92-035001, The Santa Fe Institute. Wolpert, D. H. and Macready, W. G. (1995). No free lunch theorems for search. Technical Report SFI-TR-95-02-010, Santa Fe Institute. Wu, D., Ngai, G., and Carpuat, M. (2003). A stacked, voted, stacked model for named entity recognition. In Proceedings of the seventh Conference on Computational Natural Language Learning (CoNLL-2003). Xie, Z., Hsu, W., and Lee, M. L. (2004). Mode committee: A novel ensemble method by clustering and local learning. In Proceedings of the 16th IEEE International Conference on Tools with Artificial Intelligence, pages 628–633. Yang, M., Roth, D., and Ahuja, N. (2000). A SNoW-based Face Detector. In The Conference on Advances in Neural Information Processing Systems (NIPS-2000), pages 855–861. Zenobi, G. and Cunningham, P. (2001). Using diversity in preparing ensembles of classifiers based on different feature subsets to minimize generalization error. Lecture Notes in Computer Science, 2167:576–591. Zheng, Z. and Webb, G. I. (1998). Stochastic attribute selection committees. In Selected papers from the 11th Australian Joint Conference on Artificial Intelligence on Advanced Topics in Artificial Intelligence (AI-1998), pages 321–332. Zheng, Z. and Webb, G. I. (2000). Lazy learning of bayesian rules. Machine Learning, 41(1):53–84.
Appendix A
UCI data sets

We enumerate the UCI data sets and describe for each data set the changes we made to the original format. We perform simple manipulations of the data sets, such as replacing spaces by commas or moving the class label from the first position of the instance to the last. All data sets used in our experiments have the same format: each line in a data file represents one instance, and each instance consists of a feature vector followed by a class label, separated by commas. Some of the UCI data sets have attributes that need special manipulation, such as removing name labels that are unique for each instance. For each task we mention the subject of the task and any special manipulations. We thank the donors, designers and maintainers of the data in the UCI benchmark repository for machine learning (Blake and Merz, 1998), to which we refer for detailed information about the data sets. We first detail the data sets that are manipulated, followed by a list of the data sets that were used unchanged.
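The format conversion described above amounts to a few lines of scripting. The following Python sketch shows one way to perform this kind of reformatting; the file names and the assumed input layout (whitespace-separated values with the class label in front) are illustrative assumptions, not properties of every UCI file.

# Minimal sketch of the format conversion described above. Assumed input
# layout: whitespace-separated values with the class label as the first
# element; the output is comma-separated with the class label last.
def convert_line(line):
    values = line.strip().split()
    if not values:
        return None
    class_label, features = values[0], values[1:]
    return ",".join(features + [class_label])

def convert_file(src_path, dst_path):
    with open(src_path) as src, open(dst_path, "w") as dst:
        for line in src:
            converted = convert_line(line)
            if converted is not None:
                dst.write(converted + "\n")

# Example with hypothetical file names:
# convert_file("audiology.raw", "audiology.data")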
audiology
This task is to make a diagnosis on the basis of descriptions of symptoms in the medical field of audiology. We use the standardized version of the data donated by J. Quinlan. We concatenate the provided training and test set to form one large data set. We thank Professor Jergen at Baylor College of Medicine for creating the original data set.
bridges
This task concerns the classification of Pittsburgh bridges into types of design. We use version 2 of the data, which contains discretized numeric features. We remove the identifier feature, which has a unique value for each instance. The original data does not have a single class label; instead, each instance contains five design attributes that can be predicted. We choose the attribute type as class label and consider the other four potential class attributes as features. The attribute type has missing values for three instances; we remove these three instances from the data set, as we cannot evaluate on instances without a class label. These manipulations result in a data set of 104 instances, each consisting of seven features and a class label.
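As an illustration of this kind of manipulation, the Python sketch below selects one attribute as class label, drops the identifier, and removes instances whose class label is missing. The file names, the column positions of the identifier and the type attribute, and the '?' missing-value symbol are assumptions, not details taken from the thesis.

import csv

# Sketch of the bridges preparation: drop the identifier column, use the
# chosen class attribute as label, and skip instances with a missing label.
# Column indices and the missing-value marker '?' are assumed, not verified.
def prepare_bridges(src_path, dst_path, id_column=0, class_column=-1, missing="?"):
    with open(src_path) as src, open(dst_path, "w", newline="") as dst:
        writer = csv.writer(dst)
        for row in csv.reader(src):
            if not row or row[class_column] == missing:
                continue  # cannot evaluate instances without a class label
            label = row[class_column]
            skip = {id_column, class_column % len(row)}
            features = [v for i, v in enumerate(row) if i not in skip]
            writer.writerow(features + [label])  # class label in last position

# prepare_bridges("bridges.data.version2", "bridges.data")  # hypothetical names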
cl-h-disease
The full name of this task is cleveland-heart-disease. The UCI repository contains four heart-disease data sets that are concerned with heart-disease diagnosis. We choose the cleveland-heart-disease data set. We use the version of the data in which each instance consists of 13 features. We thank Robert Detrano of the V.A. Medical Center, Long Beach and Cleveland Clinic Foundation for making the data collection available.
ecoli
This task concerns the prediction of the cellular localization sites of proteins. We remove the first feature Sequence Name because it is a unique identifier for each instance.
flag
This data set contains attributes describing details of various nations and their flags. The data does not have a specified class label; any of the features can be chosen as class label. We choose the feature religion as class label. We remove the feature name, which contains a country name and is unique for each instance.
glass
This task concerns the classification of types of glass. We remove the first feature as this is a unique identifier for each instance.
monks
These tasks concern the classification of artificial robots. The third monks data set has noise added to the data. We only use the provided training set.
mushroom
This task concerns the classification of the edibility of mushrooms. We remove the feature veil-type because it has the same value for all instances and therefore provides no information about the task.
pendigits
This task concerns the pen-based recognition of handwritten digits. The original data consists of a training set and a test set, and the digits in the test set are written by different writers than those in the training material. We concatenate the training and test set, which implies that the independence of the test material is lost.
optdigits
This task concerns the optical recognition of handwritten digits. The original data consists of a training set and a test set, and the digits in the test set are written by different writers than those in the training material. We concatenate the training and test set, which implies that the independence of the test material is lost.
promoters
This task is part of the collection of Molecular Biology Databases. The task is to predict membership of the class of DNA sequences with biological promoter activity. We remove the feature instance name which is a unique identifier for each instance.
solar flare
This task concerns the prediction of solar flares in regions on the sun. The original data contains three potential class attributes, each pointing to a certain type of solar flare. We choose the M-class flares as class label and consider the other two potential class attributes as normal features. The original data set consists of two parts, which we concatenate.
soybean-l
The full name of this task is soybean large. This task concerns the classification of soybean diseases. We concatenate the provided training and test set.
splice
This task is one of the Molecular Biology Databases. The full title of the task is Primate splice-junction gene sequences. The task concerns the recognition of boundaries in a sequence of DNA. We remove the second feature instance name, which is a unique identifier for each instance.
yeast
This task concerns the prediction of the cellular localization sites of proteins. We remove the first feature which is a unique identifier for each instance.
abalone
This data set consists of descriptive features of abalones. The task is to predict the number of rings in the shell, which corresponds to the age of the animal.
car
This task concerns the classification of whether or not to buy a car on the basis of attributes of the car such as price, technical characteristics and comfort.
connect4
This task concerns the classification of the game situations in the game connect-4.
kr-vs-kp
This task is part of the chess databases. The title is an abbreviation of: King+Rook versus King+Pawn on a7. The task concerns the classification of chess end games.
letter
This task concerns the recognition of scanned images of letters.
lung-cancer
This task concerns the classification of pathological lung cancers.
nursery
This task concerns the classification of applications for nursery schools.
segment
The full name of this task is Image Segmentation data. The task is part of the Statlog collection, and concerns the classification of regions in outdoor pictures.
tictactoe
This task concerns the prediction of win or loss for the final board configurations of a tic-tac-toe game.
vehicle
This task is part of the Statlog collection. The task concerns the classification of a given silhouette as one of four types of vehicles.
votes
The full title of this task is Congressional Voting Records Database. This task concerns the classification of U.S. House of Representatives congressmen on the basis of their voting records.
wine
This task concerns the recognition of wine types on the basis of chemical analysis.
Appendix B
Summary

Supervised machine learning can be divided into two main categories: lazy and eager learning. Eager learners put their effort into the learning phase: they abstract a condensed model of the target function from a set of labeled training instances, so that the classification phase reduces to a relatively effortless application of the abstracted model to new instances. In contrast, lazy learners or memory-based learning algorithms do not abstract a model from the training instances in the learning phase, but simply store them as such in memory. Lazy learners divert their effort to the classification phase and abstract a local model for each new instance by searching for the most similar instances in memory.

The contrast between lazy and eager machine learning algorithms forms the motivation for constructing hybrid algorithms which combine both approaches. We aim to show that memory-based classification can replace the classifier component of an eager machine learning algorithm. The main research question investigated in this thesis is the following: can we, by replacing the simple classification method of an eager learner with memory-based classification, improve generalization performance in comparison to one or both original algorithms?

We perform our study using instantiations of three quite different types of eager machine learning algorithms: a rule induction algorithm, which produces a symbolic model in the form of lists of classification rules; maximum entropy modeling, a probabilistic classifier which produces a matrix of weights associating feature values to classes; and winnow, a hyperplane discriminant algorithm which produces a (sparse) associative network with weighted connections between feature values and class units. We construct the following hybrid algorithms:
• The hybrid algorithm maxent-h combines maximum entropy modeling with k-nn classification. This algorithm classifies new instances by using the weights in the matrix produced by maximum entropy modeling within the distance metric of k-nn to find the k nearest neighbors among the instances stored in memory.

• Two hybrids combine rule induction with k-nn classification. The first hybrid, rules-r-h, is trained on instances consisting of binary rule-features that transform and replace the original features in the instances. The second hybrid, rules-a-h, is trained on instances in which binary rule-features are added to the original instances.

• The hybrid winnow-h combines multiplicative update learning with k-nn. The hybrid uses the weights of the connections produced by winnow as feature weights in the k-nn classification method (a minimal sketch of this kind of weighted k-nn classification is given at the end of this summary).

We perform experiments to test the generalization performance of the hybrid algorithms and compare their performance to that of the original (parent) algorithms. We conduct paired t-tests to estimate the significance of the differences found between algorithms. Experimental results show that two of the hybrid algorithms, maxent-h and rules-a-h, outperform their eager learning parent and perform similarly to, or slightly better than, their memory-based learning parent. The other two hybrids perform worse than, or at most similarly to, their parent algorithms. In answer to our research question we conclude that we only partly succeeded in creating hybrids that improve generalization performance upon one or both original algorithms.

Further analyses are conducted to investigate the differences and commonalities in the classification behavior of the hybrid algorithms and their parent algorithms. We conduct a bias–variance decomposition of error rates and we perform additional error analyses. We observe that in general the functional classification behavior of the hybrid algorithms differs substantially from the behavior of their parents. The statistical bias of an algorithm represents the systematic errors which the algorithm makes. The bias–variance analysis shows that the hybrid algorithms have on average a lower bias than their parent algorithms. Given that the two hybrids maxent-h and rules-a-h perform better than their eager parent and similarly to k-nn, we conclude that these two hybrids represent a "best of both worlds" situation, since they avoid the systematic errors their parent algorithms make.

The next part of the research concerns the construction and evaluation of two stacked classifier ensembles that closely resemble the two successful hybrids. The main difference between these algorithms is that the hybrids consist of one classifier, while the stacked classifiers consist of two classifiers. Experiments show that the stacked classifiers perform rather similarly to the hybrid algorithms. Further error analyses show that although the generalization performance of the two hybrids and the resembling stacked classifiers is similar, they are quite different in their functional classification behavior. A marked difference is that the stacked classifiers are more similar in their behavior to their eager parent algorithm than the hybrids are.
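The hybrids above share one idea: plug weights learned by an eager algorithm into the k-nn distance metric. The following Python sketch is a simplified illustration of that idea and not the implementation used in this thesis (which is based on Timbl); the toy memory, the weight values, and the function names are all hypothetical.

from collections import Counter

# Weighted overlap distance over symbolic feature vectors; the per-feature
# weights could come from an eager model, e.g. winnow's connection weights.
def weighted_overlap_distance(x, y, weights):
    return sum(w for xi, yi, w in zip(x, y, weights) if xi != yi)

# Plain k-nn over an instance base kept in memory, with majority voting
# among the k nearest neighbors.
def knn_classify(instance, memory, weights, k=1):
    neighbors = sorted(
        memory, key=lambda ex: weighted_overlap_distance(instance, ex[0], weights))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

# Hypothetical toy example: three symbolic features, feature 0 weighted heavily.
memory = [(("a", "x", "p"), "pos"), (("b", "x", "q"), "neg"), (("a", "y", "q"), "pos")]
weights = [2.0, 0.5, 0.5]
print(knn_classify(("a", "y", "p"), memory, weights, k=1))  # prints 'pos'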
Appendix C
Samenvatting

Learning systems can be divided into two groups: memory-based learning techniques and abstracting learning techniques. Abstracting learning systems concentrate on building a compact model from a set of labeled examples. When such a system has to classify a new example, the model is used to predict a label. Memory-based systems, in contrast, do not learn a model from the examples, but simply store all examples in memory. Only when such a learning system has to classify a new example does it build a local model for that example, by searching for the most similar examples in memory.

The contrast between memory-based and abstracting learning techniques forms the motivation for building hybrid learning systems that combine both methods. These hybrids use the model constructed by an abstracting learning system within the memory-based classification method. In this thesis we try to answer the following main question: can we, by replacing the simple classification method of an abstracting learning system with a memory-based classification method, design a hybrid system that performs better than one or both of the original learning systems?

In our research we use three different types of abstracting systems: a rule induction system, which derives a list of classification rules from the set of examples as its model; maximum entropy modeling, a statistical learning technique that produces a matrix of weights between attributes and labels; and winnow, a learning system based on the winnow learning rule that builds as its model an associative network of weighted connections between attribute and label nodes. We constructed the following hybrid learning systems:
• The hybrid maxent-h combines maximum entropy modeling with memory-based learning. This learning system uses the weight matrix produced by maximum entropy modeling in the distance metric to find the most similar examples in memory.

• Two hybrids combine rule induction with memory-based learning. The first hybrid, rules-r-h, works on examples whose attributes have been replaced by special rule attributes. The second hybrid, rules-a-h, uses examples in which the rule attributes have been added to the original attributes.

• The hybrid winnow-h combines the winnow learning rule with memory-based learning. This hybrid uses the weights that the winnow learning rule derives for the connections as weights for the attributes in the examples.

We have carried out experiments to test the performance of the hybrid systems. We compare the hybrid systems with the abstracting and memory-based techniques, and we perform statistical tests to estimate the significance of the differences found between the learning techniques. The experimental results show that two of the hybrids, maxent-h and rules-a-h, perform better than the abstracting techniques and nearly equally to the memory-based systems. The two other hybrid systems perform worse than, or equally to, the two learning techniques. In answer to our research question we conclude that we only partly succeeded in developing hybrid systems that perform better than one or both of the original learning techniques.

Further analyses were carried out in which we investigated the differences and similarities in the functional behavior of the hybrid and original learning techniques. We performed a bias–variance analysis and an error analysis. These analyses show that, in general, the functional behavior of the hybrid systems clearly deviates from that of the other two learning techniques. The bias–variance analysis shows that the hybrids have on average a lower bias than the abstracting and memory-based systems; the statistical bias reflects the systematic errors a system makes. Given the observation that the two hybrids maxent-h and rules-a-h perform better than the abstracting systems and equally to the memory-based systems, we conclude that these two hybrids represent the best of both worlds, because they make fewer systematic errors than the original systems.

Another part of the research describes the construction of two stacked learning systems that strongly resemble the two successful hybrids. Experiments with these stacked systems show that they perform rather similarly to the hybrid systems. Further analyses show that although the stacked systems are equal to the hybrids in terms of performance, their functional behavior clearly differs. A striking difference is the observation that the stacked systems resemble the abstracting learning techniques more than the hybrids do.