Department of Computer Science Series of Publications A Report A-2001-1
Efficient Range Partitioning in Classification Learning
Juho Rousu
To be presented, with the permission of the Faculty of Science of the University of Helsinki, for public criticism in Auditorium III, Porthania, on January 27th, 2001, at 10 o'clock.
University of Helsinki Finland
Contact information

Postal address:
Department of Computer Science
P.O. Box 26 (Teollisuuskatu 23)
FIN-00014 University of Helsinki, Finland

Email address: [email protected] (Internet)
URL: http://www.cs.Helsinki.FI/
Telephone: +358 9 1911
Telefax: +358 9 1914 4441
ISSN 1238-8645
ISBN 951-45-9676-5
Computing Reviews (1991) Classification: I.2.6, F.2.2, G.2.1
Helsinki 2001
Helsinki University Press
Efficient Range Partitioning in Classification Learning

Juho Rousu
Department of Computer Science
P.O. Box 26, FIN-00014 University of Helsinki, Finland
[email protected]

PhD Thesis, Series of Publications A, Report A-2001-1
Helsinki, January 2001, 68 + 74 pages
ISSN 1238-8645, ISBN 951-45-9676-5
Abstract

Partitioning of data is the essence of many machine learning and data mining methods. It is particularly important in classification learning tasks, where the aim is to induce rules, decision trees or network structures that separate instances of different classes. This thesis examines the problem of partitioning ordered value ranges into two or more subsets, optimally with respect to an evaluation function. This task is encountered in the induction of multisplitting decision trees and, in many learning paradigms, as a data preprocessing stage preceding the actual learning phase. The goal of the partitioning in preprocessing is to transform the data into a form that better suits the learning algorithm or to decrease the resource demands of the algorithm; the handling of numerical values during learning is often the bottleneck in time consumption.

No polynomial-time algorithm is known for the range partitioning task in the general case, which has led to a number of heuristic approaches. These methods are fast and some of them produce good, but sub-optimal, partitions in practice. We study ways to make optimal partitioning more feasible in terms of time complexity. The approach taken in this study is to take advantage of the general properties of the evaluation functions to decrease the computational demands. We show that many commonly used evaluation functions obtain their minima on a well-defined subset of all cut point combinations. This subset can be identified in a linear-time preprocessing step. The size of the subset is not directly dependent on the size of the dataset but on the diversity of the class distributions along the numerical range. Restricting the class of evaluation functions enables quadratic- or cubic-time evaluation over the preprocessed sequence by dynamic programming. We introduce a pruning technique that lets us speed up the algorithms further. In our tests on a large number of publicly available datasets, the average speed-up from these improvements was over 50% of the running time.

As an application, we consider the induction of multisplitting decision trees. We present a comprehensive experimental comparison between the binary splitting, optimal multisplitting and heuristic multisplitting strategies using two well-known evaluation functions. We examine ways to postpone the evaluation of seemingly irrelevant attributes to a later stage, in order to further improve the efficiency of the tree induction. Our main conclusion from these studies is that generating optimal multisplits during tree induction is feasible. However, the predictive accuracy of decision trees only marginally depends on the splitting strategy.
Computing Reviews (1991) Categories and Subject Descriptors:
I.2.6 Artificial Intelligence: Learning
F.2.2 Analysis of Algorithms and Problem Complexity: Nonnumerical Algorithms and Problems
G.2.1 Discrete Mathematics: Combinatorics
General Terms:
Algorithms, Experimentation, Performance, Theory
Additional Key Words and Phrases:
Classification learning, Numerical attributes, Optimal partitioning, Dynamic programming, Decision trees
Acknowledgements

I am privileged to have Prof. Esko Ukkonen as my advisor. He has introduced me both to machine learning research and to the art of algorithm design and analysis. The high standard of his research has always been an inspiration and a goal to me. I am deeply grateful to him for his continuous support during my studies. I am indebted to Dr. Tapio Elomaa, not only for his contribution as an excellent co-author but also for his encouragement and persistence that helped me to carry on the research during my weaker moments. My warmest thanks are also due to Dr. Jyrki Kivinen for his insightful comments and tips. I also thank Marina Kurten, MA, for the speedy language correction.

A major part of the work described in this thesis was conducted while I was working at the Biotechnology institute of the Technical Research Centre of Finland (VTT), where I got to know many wonderful people. Working together with Dr. Robert Aarts, Dr. John Londesborough and Ilkka Virkajarvi, Lic.(Tech), to mention only a few, has been rewarding. My warmest thanks are due to Dr. Karin Autio for her interest in my research and for the efforts that have helped my work at VTT considerably. I value the support from Matti Siika-aho, Lic.(Tech), Dr. Silja Home, and Prof. Liisa Viikari. I also thank Prof. Juha Ahvenainen for his flexibility and for the financial support from VTT that made it possible for me to put this thesis together.

Finally, I would like to thank my family for their love and care over the years. My parents Heikki and Helena have always been supportive of my studies. The value of the friendship of my brother Pekka and my sisters Outi, Elisa and Anna is immense.

Helsinki, December 20, 2000
Juho Rousu
"The danger from computers is not that they will eventually get as smart as men, but we will meanwhile agree to meet them halfway."
– Bernard Avishai
Contents

1 Introduction
  1.1 Classification learning
  1.2 Range partitioning in classification learning
  1.3 Contributions
  1.4 Structure of the thesis

2 Preliminaries
  2.1 Classification learning
  2.2 Partitions, splits and decision trees
  2.3 Evaluation functions
    2.3.1 Training set error
    2.3.2 Gini index
    2.3.3 Class entropy
    2.3.4 Balancing the bias of impurity functions

3 Efficient Range Partitioning
  3.1 The problem
  3.2 Heuristic approaches
    3.2.1 Top-down methods
    3.2.2 Bottom-up methods
  3.3 Optimization algorithms for cumulative functions
    3.3.1 A dynamic programming algorithm
    3.3.2 Pruning the search space of the dynamic programming algorithm
  3.4 On the complexity of optimal partitioning

4 Preprocessing Value Ranges
  4.1 Boundary points
  4.2 Segment borders
    4.2.1 A note on the proof setting
  4.3 Utilizing boundary points and segment borders in preprocessing
  4.4 Discussion

5 Multisplitting Numeric Ranges in Decision Trees
  5.1 Top-down decision tree learning
  5.2 Using multisplitting decision trees
  5.3 Postponing the evaluation
  5.4 Discussion

6 Software for Range Partitioning
  6.1 Using the program

7 Concluding Remarks

References
Chapter 1
Introduction

The last two decades have witnessed an explosive increase in the amount of data stored in digital form. In industry, the automation of processes has brought databases to production plants. Popular quality systems have also contributed to the stored data volumes. In commerce, collection of customer information for later analysis is increasingly commonplace. Electronic commerce, while still in its infancy, will certainly accelerate this trend. In science, astrophysics and bioinformatics are examples of increasingly data-intensive research areas. In all these fields, numeric data has an important, often crucial role beside text and images.

The increasingly massive data collections require techniques that refine the information to a more applicable form. Machine learning and data mining aim at helping this work by finding patterns, trends and dependencies hidden in the data and inducing models that have predictive power. Learning methods can be divided into supervised and unsupervised schemes based on whether a dedicated target function for prediction has been defined or not. In unsupervised methods, such a function is not available; the goal is grouping or clustering instances based on some similarity or distance metric. In supervised learning, there is either a continuous or nominal-valued target function to be predicted. The former case is referred to as regression or continuous prediction and the latter as classification.
1.1 Classification learning

In classification learning the setting is the following (Figure 1.1). We have a set of data instances from some instance space. The data has been divided into classes with some external mechanism, for example, by a human expert.
[Figure 1.1 appears here. Panels: a training sample (a table over attributes A1, A2, ..., Ap and a class attribute C), a learning algorithm, a model family (M1, M2, M5, M10, ...), and the selected model M.]
Figure 1.1: In classification learning, a learning algorithm is given a sample of preclassified examples from the problem domain. The algorithm learns a model that is then used to predict the classification of future examples.

The instances are described by attributes that have a fixed set of possible values, their domain. We distinguish three types of attributes: nominal (unordered, non-numeric), ordinal (ordered, non-numeric) and numeric attributes. Numeric attributes further divide into integer (discrete) and real-valued (continuous) attributes. A special nominal attribute, the class attribute, assigns each instance to one of the predefined classes. At our disposal, we have a learning algorithm capable of outputting models from a predefined set, a model family. The task of the learning algorithm is to build, using the attributes, a model that best predicts the classification of future, possibly yet unseen, instances.

Many model families are used for classification tasks. Very roughly, they can be divided into symbolic and subsymbolic approaches based on their representation form. Symbolic models such as decision trees [Qui86, Qui93, BFOS84] and classification rules [Riv87, CN89, Coh95] are data structures that partition the data using logical tests defined on the attributes. Models of this kind are often understandable to human experts and, thus, they are
widely used in knowledge discovery tasks. Sub-symbolic models, such as artificial neural networks [RHW86, Koh88], Bayesian networks [Pea88] and support vector machines [CV95, CS00], arguably lack in understandability, and thus their utility for data mining is more limited. Yet, they often work well as predictors.
1.2 Range partitioning in classification learning

The need to partition ordered value ranges into subsets arises frequently in machine learning schemes. This task appears both as a preprocessing step preceding the learning phase and as a step integrated into the induction algorithms. The need to partition the ranges in preprocessing stems from two sources. First, the learning algorithm may not support continuous values, or it may perform poorly when continuous values are given as input. Examples of such situations are found in many learning paradigms such as Bayesian inference [FG96, KMST97, FGL98, Paz95], instance-based learning [Tin94, WAM97], inductive logic programming [BR97], genetic algorithms [Hek98] and others [KF94, PU98]. Second, the discretization may be done to decrease the resource demands of the learning algorithms, since handling numerical values during induction is often the bottleneck in time consumption. The main goal of this thesis is to find ways to improve the time-efficiency of numeric range partitioning.

Another area where range partitioning is heavily applied is the induction of symbolic classifiers, such as decision trees or classification rules. Typically, the required partitions are binary, that is, the range is partitioned into two intervals by selecting one threshold value from the attribute's domain. Multi-way partitions are used in so-called multisplitting decision trees [FI93, AHM95, DFG+97]. Traditionally in decision tree learning [BFOS84, Qui86, Qui88], the tests on numerical attributes were restricted to binary while multi-way partitions were allowed for nominal attributes. This restriction was necessary as no efficient algorithms were known for computing such partitions in numerical domains until the early 1990's.

The formulations of the partitioning problems and the methods to solve them are many. In Figure 1.2 the partitioning methods are organized into a taxonomy. Class-blind methods are typically used in unsupervised learning tasks where there is no dedicated target attribute or the target attribute has not been disclosed to the learner. Partitioning into equal-width or equal-frequency intervals are examples of basic class-blind methods. More advanced methods are found, for example, in Bayesian inference schemes [FG96, KMST97, MC99].
[Figure 1.2 appears here: a taxonomy tree of range partitioning methods. The top level distinguishes class-blind and class-driven methods; class-driven methods divide into multivariate and univariate; univariate methods into context-dependent and context-free; context-free methods into heuristic (top-down, with depth-first and best-first variants, and bottom-up) and optimal (brute-force, generic dynamic programming, and TSE optimization) methods.]
Figure 1.2: The taxonomy of range partitioning methods.

In classification learning, the use of class-driven methods is advisable, as the goal of partitioning is to facilitate class prediction. Those methods take advantage of the class attribute and usually some metric to quantify the quality of the partition from the perspective of class prediction. In broad terms, the goal of range partitioning is to map the original values into a smaller set of ordinal values while retaining as well as possible the ability to separate instances of different classes.

Class-driven methods can be further decomposed into multivariate and univariate methods. In multivariate methods, the partitioning is performed simultaneously for two or more attributes. Using a linear combination of attributes to induce a binary partition of the data is perhaps the most popular multivariate approach [BU95, MKSB93, SC98].

Univariate partitioning methods are called context-dependent or context-free based on whether the values of the other attributes are taken into account or not. In context-dependent methods [LS94, Hon97, KRP96] the distance of instances, as measured by all attributes and not only by the attribute to be partitioned, is taken into account in the quality criteria. Such
an approach is well suited for situations where there are strong dependencies between the attributes. This scheme, however, leaves open the question of how to weight the contributions of different attributes in the metric. In context-free partitioning, the problem domain of this thesis, only the values of the attribute to be partitioned and the class attribute are considered. The quality of the partitions is measured by evaluation functions that depend on the class distributions of the subsets and, in some cases, the sizes of the subsets. The partition optimizing the criteria is considered the best.

Although the problem setting is simpler than in the other partitioning schemes, up until the early nineties there were no efficient methods for class-driven context-free partitioning tasks. In 1991, Catlett [Cat91] proposed an efficient algorithm for multi-way partitioning of a numeric value range in a classification setting. The algorithm was based on a greedy recursive computation of binary splits equipped with a heuristic criterion to limit the number of intervals. A year later, Kerber [Ker92] presented an approach using a test of statistical independence. His algorithm worked in a bottom-up fashion, that is, by repeated merging of neighboring intervals. Fayyad and Irani [FI93] shortly followed with another greedy top-down algorithm using a metric combining the average class entropy with a cost term penalizing high-arity partitions. The algorithm took advantage of a profound result [FI92b] that the average class entropy criterion [Qui86] has local optima only in so-called boundary points in the range. Heuristic partitioning methods have later been presented by many authors [RR95, Pfa95, HS97, CL97, CWC95, PT98, LS97a, Wu96], most of them being variants of the methods by Catlett or Kerber, altering the evaluation function or proposing some other search heuristic.

Maass [Maa94] was the first to suggest a polynomial-time optimization algorithm for this problem. His algorithm was applicable to the training set error function. Fulton, Kasif and Salzberg [FKS95] followed by introducing a quadratic-time general algorithm and a linear-time algorithm for the training set error in two-class learning tasks. Later, Auer [Aue97] and Birkendorf [Bir97] devised linear-time algorithms for the multi-class case.
1.3 Contributions

This thesis extends the work of Fayyad and Irani [FI92b] and that of Fulton, Kasif and Salzberg [FKS95]. It contains the following contributions:

- Analysis of the minima of common evaluation functions (Publications I and II). We show in Publication I that the result of Fayyad and Irani [FI92b] holds true for many commonly used evaluation functions. Also, we show that the results immediately generalize to multisplitting. Independently of us, the same fact was realized by Zighed, Rakotomala and Feschet [ZRF97]. In Publication II, we show that an even more general result holds true: the optimal cut points lie in so-called segment borders, a subset of the boundary points.

- A pruning technique to improve dynamic programming search (Publication III). The technique lets us discard partitions from the search space of the dynamic programming algorithm without losing optimality. This technique is applicable to all convex functions that are optimizable by the quadratic-time dynamic programming algorithm of Fulton, Kasif and Salzberg [FKS95].

- An optimized implementation of range partitioning algorithms. We have implemented a preprocessing algorithm that transforms the original range into a sequence of intervals, delimited by the boundary points or segment borders. To combine with the preprocessing, a selection of search algorithms is available, including the dynamic programming algorithm equipped with the search-space pruning criteria. The software package is available from www.cs.helsinki.fi/u/rousu/SplittER.tgz.

- An empirical comparison of binary and multisplitting decision tree methods (Publication I). We compare the binary splitting and multisplitting decision tree methods on a large number of commonly used datasets. The most important finding is that multisplitting and binary splitting do not significantly differ in terms of predictive accuracy, independent of whether heuristic or optimal multisplitting is employed. The choice of the evaluation function, however, has an effect on the results.

- A study on postponing methods for attribute evaluation (Publication IV). We study ways of postponing the evaluation of difficult attributes during top-down decision tree construction. Several methods, both optima-preserving and heuristic, are devised. Unfortunately, in our tests the empirical behavior of the optima-preserving methods was not satisfactory.
1.4 Structure of the thesis

This thesis is a summary of four original papers published in peer-reviewed scientific journals and conferences. The papers have been co-authored by Dr. Tapio Elomaa. In all the studies summarized in this thesis, the author's contributions have been central in the development of the theory and the algorithms. The author has also contributed significantly in conducting the tests and writing the publications.
Publication I: General and Efficient Multisplitting of Numerical Attributes. Machine Learning 36, 3 (1999), pages 201–244. Copyright © 1999 Kluwer Academic Publishers. Reprinted with permission.

Publication II: Generalizing Boundary Points. In Proceedings of the Seventeenth National Conference on Artificial Intelligence (AAAI-2000), Austin, TX, July/August 2000, MIT Press, pages 570–576. Copyright © 2000 American Association for Artificial Intelligence. Reprinted with permission.

Publication III: Speeding Up The Search for Optimal Partitions. In J. Żytkow and J. Rauch, editors, Proceedings of the Third European Conference on Principles and Practice of Knowledge Discovery in Databases (PKDD-99), Prague, Czech Republic, September 1999. Lecture Notes in Artificial Intelligence 1704 (1999), pages 89–97. Copyright © 1999 Springer-Verlag. Reprinted with permission.

Publication IV: Postponing the Evaluation of Attributes with a High Number of Boundary Points. In M. Quafafou and J. Żytkow, editors, Proceedings of the Second European Conference on Principles of Data Mining and Knowledge Discovery (PKDD-98), Nantes, France, September 1998. Lecture Notes in Artificial Intelligence 1510 (1998), pages 121–129. Copyright © 1998 Springer-Verlag. Reprinted with permission.
The rest of this thesis is organized as follows: Chapter 2 presents the central concepts in classification learning. Chapter 3 reviews the algorithms that can be used for solving ordered range partitioning problems. In Chapter 4, we discuss ways to improve the efficiency of partitioning algorithms by preprocessing the range. Chapter 5 reviews the application of multisplitting in decision tree induction. Chapter 6 presents a software package for range partitioning. Chapter 7 concludes the thesis with some general remarks.
Chapter 2
Preliminaries

In this section we review elements of classification learning and introduce terminology that is used in later chapters of the summary and the publications.
2.1 Classification learning

We assume an arbitrary set $X$ called the instance space. The instances $x \in X$ are described by a predefined set $A = \{A_1, \ldots, A_p\}$ of attributes. The domain $Dom(A)$ of attribute $A$ is the set of all possible values of the attribute. For each attribute, the function $val_A : X \to Dom(A)$ defines the value of the attribute on the instances. A special class attribute $C$ with the domain $Dom(C) = \{c_1, \ldots, c_m\}$ is used to assign each instance to a class. The attribute $A$ is called ordered if $Dom(A)$ is a totally ordered set; otherwise the attribute is called unordered. Furthermore, an ordered attribute is called numeric if $Dom(A) \subseteq \mathbb{R}$. The class attribute $C$ is assumed to be unordered. A set $\{v \in Dom(A) \mid a \leq v \leq b\}$, for some ordered attribute $A$ and arbitrary values $a$ and $b$, is called a (value) range.

A training sample $S = \{s_1, \ldots, s_n\} \subseteq X \times Dom(C)$ consists of classified training examples $s = (x, c)$. We assume that the examples are independently drawn from an unknown probability distribution $D : X \times Dom(C) \to [0, 1]$. The class (frequency) distribution of the set $S$ is the vector $(n(c_1, S), \ldots, n(c_m, S))$, where $n(c, S) = |\{s \in S \mid val_C(s) = c\}|$ is the number of instances of class $c$ in $S$. By $p(c, S) = |\{s \in S \mid val_C(s) = c\}| / |S|$ we denote the relative frequency of items of the class $c$ in the set $S$. The relative class (frequency) distribution
of the set $S$ is the vector $(p(c_1, S), \ldots, p(c_m, S))$. The set $S$ is called class uniform if $p(c, S) = 1$ for some $c \in Dom(C)$.

In order to facilitate learning, it is necessary to restrict the hypothesis space of the learning algorithm. This is accomplished by letting the algorithm suggest models only from a predefined set, a model family. Formally, a model family $\mathcal{M}$ is a set of functions of the form

$$M : Dom(A_1) \times \cdots \times Dom(A_p) \to Dom(C).$$

The value $M(val_{A_1}(x), \ldots, val_{A_p}(x))$ is the model's prediction of the class of the instance $x$. The model's training error is the relative frequency of incorrect class predictions:

$$error_S(M) = \frac{|\{(x, c) \in S \mid M(val_{A_1}(x), \ldots, val_{A_p}(x)) \neq c\}|}{|S|}. \qquad (2.1)$$

A model $M$ is called consistent (with sample $S$) if it predicts the class of each instance (in $S$) correctly, that is, $error_S(M) = 0$. The model's generalization error is the probability of incorrect class prediction in the instance space as a whole:

$$error_D(M) = \Pr_D\{(x, c) \in X \times Dom(C) \mid M(val_{A_1}(x), \ldots, val_{A_p}(x)) \neq c\}.$$

Above, $\Pr_D$ denotes the probability with respect to the distribution $D$. In practice, this probability cannot be computed exactly. Instead, it is usually estimated from an independent test sample (not shown to the learning algorithm) using equation (2.1).

The classification learning problem can be stated as follows: Given a sample $S \subseteq X \times Dom(C)$ and a model family $\mathcal{M}$, find a model $M \in \mathcal{M}$ that, with a high probability, has a generalization error, $error_D(M)$, close to the generalization error of the optimal model in $\mathcal{M}$ [Hau92].
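As a concrete illustration of these definitions (this sketch is not part of the thesis; the function and variable names are mine), the following Python fragment computes the class frequency distribution, the relative frequencies $p(c, S)$, and the training error of equation (2.1) for a small sample and a trivial majority-class model.

    from collections import Counter
    from typing import Callable, Hashable, Sequence, Tuple

    # A labelled example is a pair (x, c): an attribute-value tuple and a class label.
    Example = Tuple[Sequence[float], Hashable]

    def class_distribution(sample: Sequence[Example]) -> Counter:
        """Class frequency distribution (n(c_1, S), ..., n(c_m, S)) of a sample."""
        return Counter(c for _, c in sample)

    def relative_distribution(sample: Sequence[Example]) -> dict:
        """Relative class frequencies p(c, S) = n(c, S) / |S|."""
        counts = class_distribution(sample)
        return {c: k / len(sample) for c, k in counts.items()}

    def training_error(model: Callable[[Sequence[float]], Hashable],
                       sample: Sequence[Example]) -> float:
        """error_S(M): relative frequency of incorrect class predictions, Eq. (2.1)."""
        mistakes = sum(1 for x, c in sample if model(x) != c)
        return mistakes / len(sample)

    if __name__ == "__main__":
        S = [((1.0, 3.0), "Y"), ((2.0, 2.5), "Y"), ((3.0, 5.0), "N"), ((1.5, 4.0), "N")]
        majority = class_distribution(S).most_common(1)[0][0]
        print(relative_distribution(S))               # {'Y': 0.5, 'N': 0.5}
        print(training_error(lambda x: majority, S))  # 0.5 for the majority-class model

The generalization error, by contrast, can only be estimated, for example by applying the same training_error function to a held-out test sample.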
2.2 Partitions, splits and decision trees

Classification is, in essence, a partitioning task; our intent is to find a function that divides the instance space cleanly into class-uniform regions by decision boundaries. In Figure 2.1 the sample of instances of three classes has been divided into class-uniform regions by inserting axis-parallel decision boundaries in the space spanned by the two attributes $A_1$ and $A_2$. Depending on the model family, the learned partition may be represented implicitly, for example in a weighted combination of hyperplanes (as in some neural nets) or in a distance function (as in nearest-neighbor schemes). Alternatively, the model may contain an explicit description of the decision boundaries, which is the case in learning of symbolic rules. In schemes of the latter kind, learning can often be seen as search in the space of candidate partitions or decision boundaries.

Figure 2.1: A sample of instances from three classes (circle, square, triangle) described by two attributes, $A_1$ and $A_2$. The sample can be divided into class-uniform regions by axis-parallel decision boundaries (dashed lines).

Let us recapitulate the textbook definition of partitions: A partition of a finite set $S$ is a set $\{S_1, \ldots, S_k\}$ that satisfies

1. for all $i$, $S_i \subseteq S$,
2. for all $i$, $S_i \neq \emptyset$,
3. $\bigcup_{i=1}^{k} S_i = S$, and
4. if $i \neq j$ then $S_i \cap S_j = \emptyset$.

The arity of the partition is the number of subsets it contains. A $k$-partition is a partition with $k$ subsets. In machine learning, the partitions that can be defined in terms of the attribute values are relevant. A surjective function
$$\sigma : X \to \{1, \ldots, k\}, \quad \text{for some } k \geq 2,$$

is called a splitting function or a split. A split with $k$ outcomes, a $k$-split, induces a $k$-partition of the instance space: an instance $x \in X$ belongs to the subset $S_{\sigma(x)}$. In this thesis, we usually use the terms split and partition interchangeably.

The different kinds of splits are many. We concentrate on splits of ordered attributes. A $k$-split of an ordered domain is generated by selecting from the attribute's domain a set of threshold values or cut points $T = \{T_1, \ldots, T_{k-1}\} \subseteq Dom(A)$ that satisfy $T_1 < T_2 < \cdots < T_{k-1}$. The resulting split is then

$$\sigma_T(x) = \begin{cases} 1 & \text{if } val_A(x) \leq T_1, \\ i & \text{if } T_{i-1} < val_A(x) \leq T_i, \text{ and} \\ k & \text{if } val_A(x) > T_{k-1}. \end{cases}$$

By $P(T; S)$ we denote the partition of $S$ induced by a set of thresholds $T$. The threshold values split the attribute's domain into intervals and, thus, the resulting partition is ordered in the following sense: if $s \in S_i$ and $s' \in S_{i+1}$ then $val_A(s) < val_A(s')$. We use the shorthand $\biguplus_{i=1}^{k} S_i$ to denote an ordered $k$-partition $\{S_1, \ldots, S_k\}$. Given an ordered partition $\biguplus_{i=1}^{k} S_i$, we call the partition $\biguplus_{i=1}^{l} S_i$, $l \leq k$, a prefix of the first partition.

More than one combination of thresholds may result in the same partition of the training set. If the attribute to be partitioned is continuous, there is an infinite number of alternatives. To induce a given partition of $S_i \uplus S_{i+1}$, any value from the interval $[u, v)$ could be chosen, where $u = \max_{s \in S_i}\{val_A(s)\}$ and $v = \min_{s \in S_{i+1}}\{val_A(s)\}$. In the formulations of the theory and in the experiments reported in Publications I and IV, we have chosen to always take the minimum possible value, that is, the value $u$, as the threshold. This strategy is also used by the C4.5 decision tree learner [Qui93]. An alternative approach is to take the midpoint $(u + v)/2$ as the threshold. Evidence in favor of this strategy has been presented by Shawe-Taylor and Christianini [SC98] in conjunction with learning perceptron decision trees. We stress, however, that the contributions in this thesis are largely independent of the threshold selection strategy. The theoretical results hold true regardless, and modifying the algorithms to implement another strategy is straightforward.

A popular model family in data mining tasks is that of decision trees (Figure 2.2). A decision tree $T$ is a data structure that is either

1. a leaf with a class label attached to it, or
2. a decision node that contains a $k$-split $\sigma_T$ for some $k \geq 2$ and has decision trees $T_1, \ldots, T_k$ as its children.

An instance $x$ is classified by a decision tree as follows: if $T$ is a leaf, answer the class label associated with $T$; otherwise answer the class label returned by the child $T_l$, where $\sigma_T(x) = l$.

Figure 2.2: A multisplitting decision tree classifying the example in Figure 2.1.

The induction of multisplitting decision trees is a potential application area of the methods developed in this thesis (see Chapter 5). In multisplitting decision trees the nodes may contain splits that partition the data into more than two subsets (cf. Figure 2.2).
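The following Python sketch (mine, not from the thesis; all names are illustrative) shows how a set of cut points $T_1 < \cdots < T_{k-1}$ induces the split $\sigma_T$ above, and how a multisplit decision node dispatches an instance to one of its $k$ children.

    from bisect import bisect_left
    from typing import Hashable, List, Sequence, Union

    def split_index(thresholds: Sequence[float], value: float) -> int:
        """The split sigma_T: returns 1 if value <= T_1, i if T_{i-1} < value <= T_i,
        and k if value > T_{k-1} (1-based interval index)."""
        return bisect_left(thresholds, value) + 1

    def induced_partition(thresholds: Sequence[float], sample: Sequence[tuple],
                          attr: int) -> List[list]:
        """P(T; S): the ordered partition of a sample induced by the cut points,
        where each example is a pair (attribute-value tuple, class label)."""
        parts: List[list] = [[] for _ in range(len(thresholds) + 1)]
        for s in sample:
            parts[split_index(thresholds, s[0][attr]) - 1].append(s)
        return parts

    class Leaf:
        def __init__(self, label: Hashable):
            self.label = label
        def classify(self, x: Sequence[float]) -> Hashable:
            return self.label

    class DecisionNode:
        """A multisplit decision node: a k-split on one attribute with k subtrees."""
        def __init__(self, attr: int, thresholds: Sequence[float],
                     children: Sequence[Union["DecisionNode", "Leaf"]]):
            assert len(children) == len(thresholds) + 1
            self.attr, self.thresholds, self.children = attr, thresholds, children
        def classify(self, x: Sequence[float]) -> Hashable:
            child = self.children[split_index(self.thresholds, x[self.attr]) - 1]
            return child.classify(x)

Note that bisect_left places a value equal to a threshold into the lower interval, which matches the convention of using the largest value of the left subset as the threshold.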
2.3 Evaluation functions

In partitioning tasks one has to decide how to rank the different choices for partitioning the data. Motivated by the well-known decision tree learning algorithms [BFOS84, Qui86, Qui93, FI93], we concentrate on methods that are

1. class-dependent (or supervised), that is, they take advantage of the class distribution of the instances, and
2. context-free, that is, they do not consider the values of the other attributes when ranking the candidate partitions.
In such partitioning schemes the ranking of the competing partitions is based on evaluation functions. They typically measure both the internal class coherence of the subsets and the overall complexity of the partition, for example, the number of subsets in the partition or the simplicity of the splitting function. The intent is to find coherent subsets with a low complexity partition. The design of evaluation functions that keep these two effects in good balance is a delicate and still not very well understood issue [Qui88, Lop91, WL94, Kon95b].

The class-coherence of partitions is typically measured by an impurity function $f : P(X) \to \mathbb{R}_+$. The following properties are recognized as beneficial [BFOS84, FI92a] for an impurity function:

1. $f(S)$ is minimized if and only if $p(c, S) = 1$ for some $c \in Dom(C)$.
2. $f(S)$ is maximized when for all $c \in Dom(C)$, $p(c, S) = 1/|Dom(C)|$.
3. If for some sets of instances $S$ and $S'$, $(p(c_1, S), \ldots, p(c_m, S))$ and $(p(c_1, S'), \ldots, p(c_m, S'))$ are the same, except for a permutation of the values, then $f(S) = f(S')$.
4. $f$ is smooth.

The first property states that a class-uniform set has minimum impurity. The second property states that a set where each class is equally likely has maximum impurity. These two properties are necessary for an impurity function. The third property requires the measure to be symmetric with respect to classes. It is a reasonable requirement if the cost of misclassifying an instance does not depend on its class label. In cost-sensitive schemes [BFOS84, PFK98, MD00] this requirement should be transformed into a form that appropriately reflects the misclassification cost structure. These schemes are not handled in this thesis. The fourth property means that small changes to the class distribution result in small changes to the measure. This property is useful, for example, in noisy situations; a smooth measure tolerates slight distortions well. This property is sometimes expressed as the requirement of differentiability [FI92a]. However, differentiability may be an unnecessarily strict requirement. For example, the widely used training set error function (cf. Section 2.3.1) is piecewise linear, thus not differentiable in its range.

Many evaluation functions for partitions compute the average impurity of the subsets, weighted by the sizes of the subsets:

$$F\left(\biguplus_{i=1}^{k} S_i\right) = \sum_{i=1}^{k} \frac{|S_i|}{|S|} f(S_i) = \frac{1}{|S|} \sum_{i=1}^{k} |S_i| f(S_i).$$
In the following, we review a few of the most common impurity functions and approaches to penalize the excessive number of subsets in the partition. The theoretical and algorithmic ideas presented in this thesis are validated against these evaluation functions. Many other evaluation functions have been proposed [LS97b, FW98, Mar97, HS97, FI92a, WL94, CWC95], see also [Kon95b]. A rigorous study and comparison of these proposals would be interesting but it is out of the scope of this thesis.
2.3.1 Training set error
The simplest impurity function measures the frequency of instances in the set $S$ that do not belong to the most frequent class:

$$\delta(S) = 1 - \max_{c \in Dom(C)} \{p(c, S)\}.$$

In other words, $\delta(S)$ quantifies the disagreements with the class label of the most frequent class in the set $S$. The corresponding measure for partitions is the relative error measure

$$RE\left(\biguplus_{i=1}^{k} S_i\right) = \sum_{i=1}^{k} \frac{|S_i|}{|S|}\, \delta(S_i),$$

which returns the average number of disagreements in the subsets. In Publications I and II we consider the unnormalized version of the $RE$ function, the training set error function

$$TSE\left(\biguplus_{i=1}^{k} S_i\right) = |S| \cdot RE\left(\biguplus_{i=1}^{k} S_i\right).$$
The behavior of these functions is essentially the same. The $RE$ (or $TSE$) functions are widely used in Computational Learning Theory [Val84, Ang92, KV94]. Algorithms that find the training set error minimizing model from a model family are known to be effective learners [Hau92]. In broad terms, if the sample size $|S|$ is large enough with respect to the model family and the learning algorithm, the probability of encountering a model $M$ with a large difference $|error_S(M) - error_D(M)|$ is small, and, consequently, the training set error of a model will be a good estimate of the generalization error. In such a situation, the training set error minimizing model is not likely to have a significantly higher generalization error than the optimal model in the model family [Hau92].
Despite their theoretical justification, the $RE$ and $TSE$ measures suffer from only considering the dichotomy between the most frequent class and all other classes. In multiclass classification tasks, partitions that result in good overall coherence lose to partitions with a single most frequent class standing out. In greedy induction algorithms that use evaluation functions as a means to guide the search, using the $RE$ function may result in models with an impaired ability to predict the minority classes.
2.3.2 Gini index
To make up for the deficiencies of the relative error measure, Breiman et al. [BFOS84] devised a new impurity function, the gini index of diversity. This function has a well-founded decision theoretic formulation. Consider a set of instances $S$ with the relative class distribution $(p(c_1, S), \ldots, p(c_m, S))$. Now, let us decide to (re)classify at random an arbitrary instance $s \in S$ into class $c_i$ with the probability $p(c_i, S)$. The probability that this is a correct decision is

$$\sum_{i=1}^{m} p(c_i, S)\, p(c_i, S),$$

where the first part of each term denotes the probability of picking an instance of class $c_i$ and the second part the probability that the instance's class label does not change in the reclassification. With the probability

$$gini(S) = 1 - \sum_{i=1}^{m} p(c_i, S)\, p(c_i, S) = \sum_{i=1}^{m} \sum_{j \neq i} p(c_i, S)\, p(c_j, S)$$

the decision is incorrect. Thus, $gini(S)$ represents the probability that the class distribution of the set changes when the class label of one random instance is discarded and a new one is drawn for it from the empirical probability distribution defined by the current class frequency distribution. As before, the partitions are evaluated by the weighted sum

$$GI\left(\biguplus_{i=1}^{k} S_i\right) = \sum_{i=1}^{k} \frac{|S_i|}{|S|}\, gini(S_i).$$
2.3.3 Class entropy
Perhaps the most widely utilized impurity measure is the class entropy

$$Ent(S) = -\sum_{i=1}^{m} p(c_i, S) \log_2 p(c_i, S).$$

It is used in the average class entropy function to measure the partition impurity:

$$ACE\left(\biguplus_{i=1}^{k} S_i\right) = \sum_{i=1}^{k} \frac{|S_i|}{|S|}\, Ent(S_i).$$

The entropy measure has the following information theoretic justification [SW49, BCW90]: Assume a set of $m$ classes (or "messages") with probabilities $p(c_1, S), \ldots, p(c_m, S)$ that sum up to 1. We want to construct a function $f$ to measure the amount of uncertainty that is associated with choosing one of the classes. Furthermore, the function should have the following properties:

1. $f$ is a continuous function of $p(c_i, S)$.
2. If $p(c_1, S) = \cdots = p(c_m, S)$, then $f$ should be a monotonously increasing function of $m$.
3. If the selection of a single class among the $m$ classes is performed as a series of binary decisions, the sum of the entropies of the binary decisions, weighted with the selection probabilities, should equal $f$.

Shannon and Weaver [SW49] demonstrated that the functions of the form $c \cdot Ent(S)$, for some positive constant $c$, are the only functions having all these properties.

The classic ID3 [Qui86] and C4.5 [Qui93] decision tree learners provide the information gain metric that is easily expressed in terms of the average class entropy:

$$IG\left(\biguplus_{i=1}^{k} S_i\right) = Ent(S) - ACE\left(\biguplus_{i=1}^{k} S_i\right).$$

Since the first term is independent of the partition, $ACE$ and $IG$ rank the partitions identically. The only difference is that $IG$ should be maximized while $ACE$ should be minimized.
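The impurity functions of Sections 2.3.1-2.3.3 and their size-weighted averages are easy to compute directly from the class labels of the subsets. The sketch below (mine, not part of the thesis) implements $\delta$, $gini$ and $Ent$ together with the generic weighted average, so that the same helper yields $RE$, $GI$ or $ACE$ depending on which impurity is passed in.

    from collections import Counter
    from math import log2
    from typing import Callable, Hashable, List, Sequence

    def _relative_freqs(labels: Sequence[Hashable]) -> List[float]:
        n = len(labels)
        return [count / n for count in Counter(labels).values()]

    def error_impurity(labels: Sequence[Hashable]) -> float:
        """delta(S) = 1 - max_c p(c, S)."""
        return 1.0 - max(_relative_freqs(labels))

    def gini(labels: Sequence[Hashable]) -> float:
        """gini(S) = 1 - sum_c p(c, S)^2."""
        return 1.0 - sum(p * p for p in _relative_freqs(labels))

    def entropy(labels: Sequence[Hashable]) -> float:
        """Ent(S) = -sum_c p(c, S) log2 p(c, S)."""
        return -sum(p * log2(p) for p in _relative_freqs(labels) if p > 0.0)

    def weighted_average(impurity: Callable[[Sequence[Hashable]], float],
                         subsets: Sequence[Sequence[Hashable]]) -> float:
        """F(S_1 (+) ... (+) S_k) = sum_i |S_i|/|S| f(S_i): gives RE, GI or ACE."""
        total = sum(len(s) for s in subsets)
        return sum(len(s) / total * impurity(s) for s in subsets)

    if __name__ == "__main__":
        left, right = ["a", "a", "a", "b"], ["b", "b", "c"]
        print(weighted_average(error_impurity, [left, right]))  # RE of the 2-partition
        print(weighted_average(gini, [left, right]))            # GI
        print(weighted_average(entropy, [left, right]))         # ACE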
2.3.4 Balancing the bias of impurity functions
The $RE$, $GI$ and $ACE$ functions share the following property [BFOS84, Hic96, CT91]: any partition of a set of instances has at most as high an impurity as the original set, that is, for any partition $\biguplus_{i=1}^{k} S_i$ of the set $S$,

$$F(S) \geq F\left(\biguplus_{i=1}^{k} S_i\right). \qquad (2.2)$$
This property is problematic when multi-valued attributes are evaluated [KBR84, Qui88], and especially so with real-valued attributes. The number of different values may sometimes be of the same order of magnitude as the number of instances. In such situations (2.2) implies that the partition minimizing the impurity has a very large number of subsets. Indeed, the partition that has as many subsets as there are different attribute values in the sample is among the optimal ones.

This problem was realized early by classification learning researchers, in conjunction with using multiway tests on nominal attributes in decision trees. Kononenko et al. [KBR84] decided to add a penalization to the information gain function. The resulting function was a "balanced" gain function

$$BG\left(\biguplus_{i=1}^{k} S_i\right) = \frac{IG\left(\biguplus_{i=1}^{k} S_i\right)}{\log_2 k}.$$

Later, Quinlan [Qui86] suggested a slightly adjusted penalization in his gain ratio function

$$GR\left(\biguplus_{i=1}^{k} S_i\right) = \frac{IG\left(\biguplus_{i=1}^{k} S_i\right)}{-\sum_{i=1}^{k} \frac{|S_i|}{|S|} \log_2 \frac{|S_i|}{|S|}}.$$
The two penalization factors are closely related: the maximum value of the denominator in the gain ratio function is $\log_2 k$. This is obtained when the $k$ subsets of the partition have equal sizes. This, incidentally, reveals a hardly justifiable bias: given two partitions of equal impurity, gain ratio favors the partition that has the most uneven distribution of subset sizes. Consequently, the frequency of small subsets in $GR$-optimal partitions is higher than in $BG$-optimal partitions. In decision tree learning, small subsets cannot usually be processed further and their impurity will contribute to the error of the final tree. This would not be a problem if the small subsets were guaranteed to be purer than the larger ones, but this is not the case when $GR$ is used to evaluate partitions. Incidentally, in our experiments in Publication I the $BG$ function outperforms $GR$ in a statistically significant manner.

Yet another penalization of the information gain was suggested by Lopez de Mantaras [Lop91]. His measure is based on the information theoretic distance between the class distribution and the evaluated partition. The
function, the normalized distance measure, can be expressed as follows:

$$ND\left(\biguplus_{i=1}^{k} S_i\right) = 1 - \frac{IG\left(\biguplus_{i=1}^{k} S_i\right)}{-\sum_{i=1}^{k} \sum_{j=1}^{m} \frac{|S_{ij}|}{|S|} \log_2 \frac{|S_{ij}|}{|S|}},$$

where $S_{ij} = S_i \cap \{s \in S \mid val_C(s) = c_j\}$.
Another approach for penalizing high-arity partitions is the use of the Minimum Description Length (MDL, MDLP) [Ris83, Ris89, Ris95] and Minimum Message Length (MML) [WB68, WF87] principles. Intuitively, those theories suggest that the best model to describe the data is the one that minimizes the total length of a message encoding the theory and the data given the theory. Thus, in order to excel, complex models need to explain the data better than simpler models. MDL-based evaluation functions are typically composed of two parts: one term codes the interval boundaries in the split (the "theory") and the other part codes the class distributions of the subsets (the "exceptions").

There are many coding methods that reach the message length $\lceil -\log_2 p \rceil$ bits for a message with probability $p$, near the information theoretic lower bound $-\log_2 p$ [SW49, BCW90, HV94]. Therefore, if there are $n$ equally probable messages, we can transmit any of them with $\lceil \log_2 n \rceil$ bits. Given an attribute with $|V|$ values, the message encoding the partition cut points needs to contain the value $k \in \{1, \ldots, |V|\}$ followed by the description of the locations of the cut points in the ordered example sequence. There are $\binom{|V|-1}{k-1}$ ways to pick $k-1$ cut points from the $|V|-1$ possible locations. The length of the resulting code for the partition is approximately

$$DL_{Partition}(k, V) = \log_2 |V| + \log_2 \binom{|V|-1}{k-1}.$$
There are many possibilities for coding the exception part of the message [QR89, WP93, FI93, Pfa95, Kon95b]. Fayyad and Irani [FI93] used the entropy $|S| \cdot ACE(\biguplus_{i=1}^{k} S_i)$ as the code length measure. This can be justified, since there are many codes that asymptotically approach entropy. Wallace and Patrick [WP93] chose to use a "real" arithmetic code. The message length for coding the class distributions in their code is

$$WPcost(S) = -\log_2\left(\frac{\prod_{j=1}^{m} \alpha^{(n(c_j, S))}}{(m\alpha)^{(|S|)}}\right),$$

where $\alpha > 0$ and $a^{(b)}$ is a shorthand notation for the increasing factorial $a(a+1)\cdots(a+b-1)$. This function reflects an arithmetic coding scheme [RL79, BCW90, HV94], where each class has an initial weight of $\alpha$. When coding an instance, the current weight of the class is increased by 1 and the resulting weight vector is normalized to a probability distribution. The lower the value of $\alpha$ is, the faster the code adjusts to the observed class distribution and "forgets" the prior distribution. Correspondingly, high $\alpha$-values result in long codes for skewed class distributions. Asymptotically the $WPcost$ measure approaches entropy (from above), but for small sets $WPcost$ is significantly higher. This surplus cost can actually be considered an advantage of $WPcost$ over entropy: entropy cannot decide between two partitions that have identical relative class distributions in the subsets, even if one partition contains more subsets than the other. In this situation, $WPcost$ will choose the partition with fewer subsets, which is intuitively the better decision. To code the partition, the codes of each subset can be attached together, so with the $WPcost$ function we obtain the code length
$$WP\left(\biguplus_{i=1}^{k} S_i\right) = \sum_{i=1}^{k} WPcost(S_i)$$

for the exceptions. In the description of the partition the codes for the theory and the exceptions are combined. The resulting MDL/MML evaluation function is then

$$DL\left(\biguplus_{i=1}^{k} S_i, V\right) = DL_{Partition}(k, V) + WP\left(\biguplus_{i=1}^{k} S_i\right).$$
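To make the description length concrete, the sketch below (my own illustration, not code from the thesis) computes $DL_{Partition}$ and the Wallace-Patrick cost as reconstructed above; the initial class weight is exposed as a parameter alpha, and the rising factorials are evaluated in log space via the gamma function.

    from collections import Counter
    from math import comb, lgamma, log, log2
    from typing import Hashable, Sequence

    def dl_partition(k: int, num_values: int) -> float:
        """DL_Partition(k, V) = log2 |V| + log2 C(|V| - 1, k - 1): the 'theory' part."""
        return log2(num_values) + log2(comb(num_values - 1, k - 1))

    def _log2_rising_factorial(a: float, b: int) -> float:
        """log2 of the increasing factorial a^(b) = a (a + 1) ... (a + b - 1)."""
        return (lgamma(a + b) - lgamma(a)) / log(2)

    def wp_cost(labels: Sequence[Hashable], num_classes: int, alpha: float = 1.0) -> float:
        """Wallace-Patrick code length of the class labels of one subset."""
        log2_numer = sum(_log2_rising_factorial(alpha, n)
                         for n in Counter(labels).values())
        log2_denom = _log2_rising_factorial(num_classes * alpha, len(labels))
        return log2_denom - log2_numer   # = -log2(numerator / denominator)

    def dl_evaluation(subsets: Sequence[Sequence[Hashable]], num_values: int,
                      num_classes: int, alpha: float = 1.0) -> float:
        """DL = DL_Partition(k, V) + sum_i WPcost(S_i): total message length."""
        return (dl_partition(len(subsets), num_values)
                + sum(wp_cost(s, num_classes, alpha) for s in subsets))

Classes that do not occur in a subset contribute a rising factorial of alpha raised to the zeroth power, that is, zero bits, so only the observed counts need to be summed.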
Chapter 3
Efficient Range Partitioning

In this section we concentrate on the main problem in this thesis, efficient discovery of good partitions for ordered value ranges. First, we give a precise formulation of the range partitioning problem. Then, we review the main heuristic approaches for solving this task. Next, we turn to exact algorithms, that is, algorithms that optimize the evaluation function. We recapitulate the dynamic programming schema that underlies our own methods. A presentation of a pruning technique to further improve the methods follows. Finally, we discuss the inherent complexity of the range partitioning problem.
3.1 The problem

Assume an evaluation function $F$, a set of instances $S$ from some instance space $X$, an ordered attribute $A$, a set of candidate cut points $T = \{T_1, \ldots, T_{b-1}\} \subseteq Dom(A)$, $T_1 < T_2 < \cdots < T_{b-1}$, that partition the set $S$ into $b$ indivisible subsets $I_1, \ldots, I_b$, and an integer $k \in \{1, \ldots, b\}$. Without loss of generality, assume that $F$ gives good partitions low scores, that is, $F$ should be minimized. The ordered range partitioning problem is the following:

Definition 3.1 (Bounded-arity partitioning) Find a set $T^* \subseteq T$ of at most $k-1$ cut points such that $F(P(T^*; S)) \leq F(P(T'; S))$ for all $T' \subseteq T$, $|T'| < k$.

This definition accepts as a solution a partition of any arity, up to the upper bound $k$. In some cases it may be necessary to require the result to have an arity of exactly $k$:
Definition 3.2 (Fixed-arity partitioning) Find a set $T^* \subseteq T$ of $k-1$ cut points such that $F(P(T^*; S)) \leq F(P(T'; S))$ for all $T' \subseteq T$, $|T'| = k-1$.

In this thesis we are mostly interested in bounded-arity partitioning. Fixed-arity partitioning problems can usually be solved with the same algorithms. Some evaluation functions, however, are not well suited to fixed-arity problems; in Section 4.4 we briefly touch on this question.
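Read literally, Definition 3.1 can be solved by enumerating all cut point subsets; the following sketch (mine, for illustration only) does exactly that over the indivisible blocks $I_1, \ldots, I_b$ and an arbitrary evaluation function, and its exponential cost is precisely what motivates the rest of this chapter.

    from itertools import combinations
    from typing import Callable, List, Sequence, Tuple

    def brute_force_partition(blocks: Sequence[list],
                              evaluate: Callable[[List[list]], float],
                              max_arity: int) -> Tuple[Tuple[int, ...], float]:
        """Try every set of at most max_arity - 1 cut positions between the b
        indivisible blocks and keep the partition minimizing the evaluation
        function (bounded-arity partitioning, Definition 3.1)."""
        b = len(blocks)
        best_cuts: Tuple[int, ...] = ()
        best_score = float("inf")
        for arity in range(1, min(max_arity, b) + 1):
            for cuts in combinations(range(1, b), arity - 1):
                bounds = (0,) + cuts + (b,)
                subsets = [sum(blocks[lo:hi], []) for lo, hi in zip(bounds, bounds[1:])]
                score = evaluate(subsets)
                if score < best_score:
                    best_cuts, best_score = cuts, score
        return best_cuts, best_score

Paired with, for example, the weighted entropy helper of Section 2.3, evaluate could simply be lambda parts: weighted_average(entropy, parts); the search examines on the order of binomial(b-1, k-1) candidate partitions, as discussed at the beginning of Section 3.3.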
3.2 Heuristic approaches

Many heuristic approaches for the partition search problem have been suggested [Cat91, Ker92, FI93, Pfa95, RR95, CWC95, HS97, CL97, PT98, LS97a, Wu96]. The majority of the methods are based on repetitive local splitting or merging operations. Top-down methods [Cat91, FI93] start with the whole range, divide it into two and repeat the procedure for the resulting intervals. Bottom-up [Ker92, RR95] methods, on the other hand, start with the partition of the maximal possible arity and iteratively merge neighboring intervals.
3.2.1 Top-down methods
In a top-down partitioning algorithm there are many possibilities for selecting the next interval to be split. Here, we give a generic description of a few alternatives.
procedure DFPartition(S, A, T)
1.  Let T* ∈ T be a value such that F(P({T*}; S)) ≤ F(P({T}; S)) for all T ∈ T;
2.  if P({T*}; S) fulfills the stopping criterion then
3.      return ∅;
    else
4.      R_left := DFPartition({s ∈ S | val_A(s) ≤ T*}, A, {T ∈ T | T < T*});
5.      R_right := DFPartition({s ∈ S | val_A(s) > T*}, A, {T ∈ T | T > T*});
6.      return {T*} ∪ R_left ∪ R_right;
end

Table 3.1: A depth-first top-down partitioning algorithm. The algorithm takes as input a sample S, a numerical attribute A, a set of thresholds T, and returns a partition of S.
A depth-first algorithm [Cat91, FI93] recursively finds the best (as measured by the evaluation function $F$) binary splits of parts of the data until a stopping criterion is met. Both the selection of the cut point location and the stopping criterion depend only on the part of the data that is currently being processed (Table 3.1); the quality of the partitioning of the data as a whole is not taken into account. Hence, the scheme does not directly fit the fixed-arity and bounded-arity problems defined in the previous section. In practice, the partition generated by the depth-first algorithm may violate the bound on the arity and needs to be post-processed, for example, with pruning methods designed for decision tree learning [BFOS84, Qui87, KM98]. The time-complexity of the depth-first search algorithm is $O(mb \log b)$.

A best-first partitioning algorithm [CL97, Pfa95] (Table 3.2) is a better fit to the problems presented in the previous section. In each stage of the processing, the algorithm picks a cut point that results in the best gain in the global quality of the partition. This search method has two advantages over the depth-first search. First, the algorithm can be used to solve fixed- and bounded-arity problems. This only requires stopping the algorithm once $k-1$ cut points are selected. Second, the penalization methods described in Section 2.3.4 can be used in place of a separate stopping criterion. Then, the same evaluation function is used for ranking the partitions and for deciding when to stop further splitting.
procedure BFPartition(S, A, T)
1.  R := ∅;
2.  Let T* ∈ T be a value such that F(P({T*}; S)) ≤ F(P({T}; S)) for all T ∈ T;
3.  while P(R ∪ {T*}; S) does not fulfill the stopping criterion do
4.      R := R ∪ {T*};
5.      T := T \ {T*};
6.      Let T* ∈ T be a value such that F(P(R ∪ {T*}; S)) ≤ F(P(R ∪ {T}; S)) for all T ∈ T;
    od
7.  return R;
end

Table 3.2: A best-first top-down partitioning algorithm. It takes as input a sample S, an attribute A, a set of thresholds T, and returns a partition of S.
A straightforward implementation of a best-first search leads to an $O(k^2 mb)$ algorithm. However, the time-complexity can be decreased to $O(kmb)$ for the evaluation functions described in Section 2.3 by careful implementation. The idea is to compute the best binary split of each subinterval at most once (instead of up to $k$ times) and to store them in a list together with data indicating how large a gain in the impurity score would be obtained by making the split. In Publications I and III, an algorithm Greedy that incorporated this implementation trick was used as the representative heuristic approach.
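The sketch below (my own illustration of this kind of bookkeeping, not the Greedy implementation used in the publications) keeps, for every current interval, its best binary split and the associated impurity decrease in a priority queue, so that no interval is re-evaluated unless it has just been split; for simplicity the stopping rule is the arity bound alone.

    import heapq
    from typing import Callable, Hashable, List, Sequence, Tuple

    Example = Tuple[float, Hashable]   # (attribute value, class label)

    def best_binary_split(block: Sequence[Example],
                          impurity: Callable[[Sequence[Hashable]], float]
                          ) -> Tuple[float, int]:
        """Best single cut inside one interval and the resulting decrease in
        size-weighted impurity; position 0 means no impurity-reducing cut exists."""
        labels = [c for _, c in block]
        n = len(block)
        base = n * impurity(labels)
        best_gain, best_pos = 0.0, 0
        for pos in range(1, n):
            if block[pos][0] == block[pos - 1][0]:
                continue   # no real threshold separates equal attribute values
            gain = base - (pos * impurity(labels[:pos])
                           + (n - pos) * impurity(labels[pos:]))
            if gain > best_gain:
                best_gain, best_pos = gain, pos
        return best_gain, best_pos

    def greedy_multisplit(sample: Sequence[Example],
                          impurity: Callable[[Sequence[Hashable]], float],
                          max_arity: int) -> List[float]:
        """Best-first multisplitting with the 'evaluate each interval once' trick."""
        data = sorted(sample, key=lambda e: e[0])
        heap: List[Tuple[float, int, int, int]] = []   # (-gain, start, end, cut index)

        def push(start: int, end: int) -> None:
            gain, pos = best_binary_split(data[start:end], impurity)
            if pos:
                heapq.heappush(heap, (-gain, start, end, start + pos))

        push(0, len(data))
        cuts: List[float] = []
        while heap and len(cuts) < max_arity - 1:
            _, start, end, cut = heapq.heappop(heap)
            cuts.append(data[cut - 1][0])   # C4.5 convention: left subset's maximum
            push(start, cut)
            push(cut, end)
        return sorted(cuts)

With the entropy function of Section 2.3 as the impurity argument this corresponds to greedy average-class-entropy multisplitting; a penalizing evaluation function could replace the plain arity bound as the stopping criterion.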
3.2.2 Bottom-up methods
An alternative way to approach the partition generation is to start with the maximally partitioned range, that is, the partition $P(T; S) = \biguplus_{i=1}^{b} I_i$ defined by the set of potential cut points $T$, and to iteratively merge adjacent intervals by removing cut points from the split, until a stopping criterion is satisfied (Table 3.3). Such a bottom-up approach was first suggested by Kerber [Ker92]. In his ChiMerge algorithm the merging decision was based on the $\chi^2$-test of independence: if the class label is independent of the ordered attribute value in a subrange consisting of two adjacent intervals, the two intervals are candidates to be merged. ChiMerge chooses at each point the pair of intervals whose $\chi^2$-score is the lowest, indicating a high independence between the class label and the attribute value.
procedure BUPartition(S, A, T)
1.  R := T;
2.  Let T* ∈ R be such a value that F(P(R \ {T*}; S)) ≤ F(P(R \ {T}; S)) for all T ∈ R;
3.  while P(R \ {T*}; S) does not fulfill the stopping criterion do
4.      R := R \ {T*};
5.      Let T* ∈ R be such a value that F(P(R \ {T*}; S)) ≤ F(P(R \ {T}; S)) for all T ∈ R;
    od
6.  return R;
end

Table 3.3: A bottom-up partitioning algorithm. The algorithm takes as input the sample S, an attribute A, a set of thresholds T, and outputs a partition of S.
Richeldi and Rossotto [RR95], among some practical modifications, replaced the $\chi^2$-test with a modified statistical test that rectifies a shortcoming of the $\chi^2$-test that makes it generate excessively high-arity partitions when large samples are subjected to it. Another approach to rectify this problem was put forward by Liu and Setiono [LS97a], who essentially wrapped ChiMerge into a loop that adjusted the significance level of the $\chi^2$-test such that a partition meeting the user-defined training set error bound was found.
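A bottom-up merger of this kind needs little more than the class frequency counts of the adjacent intervals. The sketch below (mine; it stops at an arity bound rather than at a significance level, unlike Kerber's ChiMerge) computes the Pearson chi-square statistic of each adjacent pair and repeatedly merges the pair with the lowest score.

    from collections import Counter
    from typing import List

    def chi2_statistic(a: Counter, b: Counter) -> float:
        """Pearson chi-square statistic of the 2 x m contingency table formed by
        the class counts of two adjacent intervals; a low value suggests the class
        label is independent of which of the two intervals an instance falls in."""
        classes = set(a) | set(b)
        n_a, n_b = sum(a.values()), sum(b.values())
        total = n_a + n_b
        stat = 0.0
        for c in classes:
            col = a.get(c, 0) + b.get(c, 0)
            for observed, row in ((a.get(c, 0), n_a), (b.get(c, 0), n_b)):
                expected = row * col / total
                if expected > 0.0:
                    stat += (observed - expected) ** 2 / expected
        return stat

    def bottom_up_merge(intervals: List[Counter], max_arity: int) -> List[Counter]:
        """Merge the adjacent pair with the lowest chi-square score until at most
        max_arity intervals remain; each interval is a Counter of class counts."""
        intervals = [Counter(iv) for iv in intervals]
        while len(intervals) > max_arity:
            scores = [chi2_statistic(intervals[i], intervals[i + 1])
                      for i in range(len(intervals) - 1)]
            i = min(range(len(scores)), key=scores.__getitem__)
            intervals[i:i + 2] = [intervals[i] + intervals[i + 1]]
        return intervals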
3.3 Optimization algorithms for cumulative functions

Heuristic algorithms cannot guarantee finding the optimal partition, although some of them produce high-quality partitions in practice. The foremost motivation for using heuristics has been the perceived difficulty of finding the optima. Finding the optimal partition is difficult in the general case, when nothing is known or assumed about the function to be optimized. The only method available then is the brute-force exhaustive search that evaluates all $\binom{b-1}{k-1} = O((b-1)^{k-1})$ ordered $k$-partitions that can be induced on a range with $b$ indivisible subintervals. The way forward is to take advantage of the properties of the evaluation functions. An example of such a useful property is cumulativity, which means that the evaluation function $F$ has the form
$$F\left(\biguplus_{i=1}^{k} S_i\right) = c \sum_{i=1}^{k} g(S_i)$$

for some function $g$ and constant $c$. The evaluation functions $ACE$ and $GI$ conform to this form with $g(S_i) = |S_i| f(S_i)$, $c = \frac{1}{|S|}$, and the $TSE$ and $WP$ functions with $g(S_i) = f(S_i)$, $c = 1$, where $f$ is the corresponding impurity function (cf. Section 2.3). The $IG$ and $BG$ functions do not satisfy the cumulativity property by themselves. However, the $IG$ maximizing partition is the same as the $ACE$ minimizing partition. Optimizing the $BG(\biguplus_{i=1}^{k} S_i) = IG(\biguplus_{i=1}^{k} S_i)/\log_2 k$ function, however, requires computing the $IG$ maximizing partition for each arity and selecting the best of them. A similar scheme needs to be employed with the $DL$ function as well. In the methods described below, the lower arity partitions are computed as a by-product of computing the maximum arity (allowed by the upper bound) partition. Hence, the additional computational cost when optimizing $BG$ or $DL$ is negligible.
Because of the cumulativity, the best partition of a set $S' \subseteq S$ is independent of the best partition of $S \setminus S'$. This makes it possible to use dynamic programming. We can tabulate the scores of subsets and their partitions and use table lookup to avoid repetitive computation of the scores.
3.3.1 A dynamic programming algorithm
Fulton, Kasif and Salzberg [FKS95] were the first to suggest a generic dynamic programming algorithm for cumulative functions in the partitioning task. They defined a recurrence for computing the best $k$-partition from the impurity scores of subsets and the best $(k-1)$-partitions of the prefixes of the data. Let us denote by $Imp(i, k)$ the impurity of the best $k$-partition of the set $I_1 \cup \cdots \cup I_i$, that is, the $i$ leftmost indivisible subsets. The intent is to compute $Imp(b, k)$ together with the corresponding partition. Solving the recurrence

$$Imp(i, k) = \begin{cases} g(I_1 \cup \cdots \cup I_i) & \text{if } k = 1, \\ \min_{k-1 \leq j < i} \left\{ Imp(j, k-1) + g(I_{j+1} \cup \cdots \cup I_i) \right\} & \text{otherwise} \end{cases} \qquad (3.1)$$
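A direct implementation of this recurrence is sketched below (my own code, not the implementation evaluated in the thesis, and without the search-space pruning of Section 3.3.2). It fills the table bottom-up for a cumulative function $F = c \sum_i g(S_i)$ (the constant factor $c$ does not affect the minimization and is omitted) and recovers the cut positions by backtracking; for brevity the block scores are recomputed by concatenation, whereas a tuned implementation would use cumulative class-frequency counts.

    from typing import Callable, List, Sequence, Tuple

    def optimal_partition_dp(blocks: Sequence[list],
                             g: Callable[[list], float],
                             max_arity: int) -> Tuple[float, List[int]]:
        """Dynamic programming over recurrence (3.1): imp[i][k] is the best score
        of a k-partition of the first i indivisible blocks I_1, ..., I_i.
        Returns the best score over arities 1..max_arity and the cut positions."""
        b = len(blocks)

        def g_run(j: int, i: int) -> float:
            """Score g of the union I_{j+1} u ... u I_i."""
            merged: list = []
            for block in blocks[j:i]:
                merged.extend(block)
            return g(merged)

        INF = float("inf")
        imp = [[INF] * (max_arity + 1) for _ in range(b + 1)]
        back = [[0] * (max_arity + 1) for _ in range(b + 1)]
        for i in range(1, b + 1):
            imp[i][1] = g_run(0, i)
            for k in range(2, min(max_arity, i) + 1):
                for j in range(k - 1, i):
                    candidate = imp[j][k - 1] + g_run(j, i)
                    if candidate < imp[i][k]:
                        imp[i][k], back[i][k] = candidate, j
        best_k = min(range(1, min(max_arity, b) + 1), key=lambda k: imp[b][k])
        cuts, i, k = [], b, best_k
        while k > 1:                     # backtrack the chosen prefix boundaries
            i = back[i][k]
            cuts.append(i)
            k -= 1
        return imp[b][best_k], sorted(cuts)

Lower-arity optima are produced as a by-product in the table, which is what makes the arity-penalized functions ($BG$, $DL$) cheap to optimize on top of this scheme, as noted in the previous section.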