Preprocessing Opportunities in Optimal Numerical Range Partitioning

Tapio Elomaa
Department of Computer Science, P.O. Box 26
FIN-00014 University of Helsinki, Finland
[email protected]

Juho Rousu
Department of Computer Science, P.O. Box 26
FIN-00014 University of Helsinki, Finland
[email protected]

Abstract

We show that only the segment borders have to be taken into account as cut point candidates in searching for the optimal multisplit of a numerical value range with respect to convex attribute evaluation functions. Segment borders can be found efficiently in a linear-time preprocessing step. For strictly convex evaluation functions inspecting all segment borders is also necessary. With Training Set Error, which is not strictly convex, the data can be preprocessed into an even smaller number of cut point candidates, called alternations, when striving for an optimal partition. Examining all alternations also seems necessary. We test empirically the reduction in the number of cut point candidates that can be obtained for Training Set Error on real-world data. The experiment shows that in some domains a significant reduction in the number of cut point candidates can be obtained.

1. Introduction

In classifier induction numerical attribute domains need to be partitioned. Discretization of numerical domains is a potential time-consumption bottleneck of induction, since in the general case the number of possible partitions is exponential in the number of candidate cut points within the domain. In this paper we consider class-driven partitioning in which knowledge of the class labels of the examples is used. Moreover, we are only concerned with univariate discretization methods, where only one attribute is partitioned at a time. More specifically, we study context-free methods, where, in addition to the class label, only the value of the attribute at hand is considered. In context-dependent methods [15, 19, 17] the distance of instances, as measured by all attributes, is taken into account in the quality criteria. With respect to the most commonly used evaluation functions, numerical attribute value ranges can be optimally partitioned in quadratic time in the number of examples


using a dynamic programming algorithm [13, 22, 9]. The practical efficiency of the process is further enhanced by preprocessing the data suitably [12, 22, 9, 10, 7, 6]. Preprocessing does not improve the asymptotic efficiency of the optimal multisplitting task [11], but in practice it has a huge impact. In this paper we explore further enhancement possibilities and limitations of efficient preprocessing. Cumulative functions can be optimized in quadratic time in the number of possible cut points [13, 9], but only Training Set Error (TSE) is known to be optimizable in linear time [13, 1, 2]. Even quadratic-time evaluation may be too much if the number of potential cut points is high. Therefore, reducing their number is critical for the efficiency of numerical attribute handling. It is known that only the so-called segment borders [10] need to be examined in searching for the optimal partition of a numerical value range with respect to many commonly-used functions. In this paper we show that this linear-time preprocessing applies to all functions that satisfy Jensen's inequality [8]. Moreover, for TSE even fewer cut point candidates need to be examined when striving for its optimal partition: examining only a subset of segment borders is enough for it.

There are many different relevant definitions of optimality. One can look for any partition that optimizes the value of the evaluation function being used, or search for the globally optimal partition with as few intervals as possible. One can, further, restrict the arity of a partition to some upper bound and then look for the optimal one among the partitions fulfilling this restriction. An even stricter restriction is to search for the optimal partition of some fixed arity $k$.

Section 2 discusses partitions of numerical value ranges and their optimality with respect to evaluation functions. In Section 3 we recapitulate the known preprocessing opportunities for common evaluation functions. Section 4 considers the consequences of Jensen's inequality for concave and strictly concave functions. We show that a partition which has all example segments as its intervals is the unique minimal globally optimal partition of any value range with respect to any strictly concave function. We examine TSE, which is not strictly concave, separately in Section 5.

Class: X X X X Y Y Y Y Y Y X X X Y Y Y Y Y Y Y Y Y X X X Y Y
Value: 1 2 2 3 3 3 4 4 4 4 4 4 5 5 5 5 5 6 7 7 7 8 8 9 9 9 9

Figure 1. A set of examples sorted according to the value of a numerical attribute. The class labels (X and Y) of the examples are also shown.

val:  1    2    3    4    5    6    7    8    9
X/Y:  1/–  2/–  1/2  2/4  1/4  –/1  –/3  1/1  2/2

Figure 2. Example bins for the sample of Fig. 1. The class distributions (X/Y) of examples belonging to a bin are recorded. Partition cut points can be set at the bin borders.

It is proved that for TSE it suffices to examine points in which the class majority changes in between two adjacent segments. In Section 6, we study empirically the reduction of cut point candidates that can be obtained for Training Set Error. Finally, Sections 7 and 8 conclude the paper.

2. Optimality of a partition

A partition $\{S_1, \ldots, S_k\}$ of $S$ consists of non-empty, disjoint subsets $S_i \subseteq S$ and covers the whole domain, $\bigcup_{i=1}^{k} S_i = S$. When splitting a set of examples $S$ on the basis of the value of an attribute $A$, there is a set of thresholds $T_1 < T_2 < \cdots < T_{k-1}$, $T_i \in \mathrm{Dom}(A)$, where $\mathrm{Dom}(A)$ is the value range of the attribute $A$, that defines a partition for the sample in an obvious manner:

$$ S_i = \begin{cases} \{\, s \in S : \mathrm{val}_A(s) \le T_1 \,\} & \text{if } i = 1,\\ \{\, s \in S : T_{i-1} < \mathrm{val}_A(s) \le T_i \,\} & \text{if } 1 < i < k,\\ \{\, s \in S : \mathrm{val}_A(s) > T_{k-1} \,\} & \text{if } i = k, \end{cases} $$

where $\mathrm{val}_A(s)$ denotes the value of attribute $A$ in example $s$. Before evaluating the partitions of the sample with respect to an attribute $A$, the data is usually sorted by the value of $A$. Therefore, we actually consider partitioning of a numerical value range on the basis of the sorted sample.

A partition is optimal if it optimizes the value of the evaluation function being used. In machine learning algorithms there are many commonly-used attribute evaluation functions. Let us consider, for the time being, the very simple function Training Set Error, or TSE. The majority class $\mathrm{maj}(S)$ of sample $S$ is its most frequently occurring class. The number of disagreeing instances, those in the set not belonging to its majority class, is given by $\delta(S) = |\{\, s \in S : \mathrm{val}_C(s) \ne \mathrm{maj}(S) \,\}|$, where $C$ denotes the class attribute. Training Set Error is the number of training instances falsely classified in the partition when all intervals are labeled by their majority class. For a partition $\{S_1, \ldots, S_k\}$ it is defined as

$$ \mathrm{TSE}(S_1, \ldots, S_k) = \sum_{i=1}^{k} \delta(S_i). $$

If one could make its own partition interval out of each example, this partition would have zero misclassification rate. However, in practice one cannot discern between all examples. Only examples that differ in the value of the attribute under consideration can be separated from each other. Consider, for example, the data set shown in Fig. 1. There are 27 examples ordered by an integer-valued attribute $A$. The examples are instances of two classes, X and Y. Partition cut points can only be set on those points where the value of $A$ changes. Therefore, for partitioning purposes, we can preprocess the data into bins as shown in Fig. 2. There is one bin for each separate value of attribute $A$. Within each bin we record the class distribution of the instances belonging to it.

It is evident that the optimal TSE value is achieved by the partition that has all bins as separate intervals, because that partition has the minimal misclassification rate. There are also other partitions that are guaranteed to have the optimal value; we will discuss them in the next section. However, this partition, in which the number of intervals, or the arity, of the partition is not restricted, is not always the optimal partition that we are looking for. Sometimes the optimal partition of at most $k$ intervals is sought after.

Let $\mathcal{P}$ be the set of all partitions of the value range $S$ and $\mathcal{P}_k = \{\, P \in \mathcal{P} : |P| \le k \,\}$. Then by $|P|$ we denote the arity of $P$ and by $F(P)$ the value of $P$ with respect to evaluation function $F$. Let $\mathrm{OPT}_k(F, S)$ be the set of optimal partitions of the numerical value range $S$ with respect to evaluation function $F$ such that the partitions have at most $k$ intervals. In other words, $\mathrm{OPT}_k(F, S)$ is

$$ \{\, P \in \mathcal{P}_k : F(P) = \operatorname{opt}_{P' \in \mathcal{P}_k} F(P') \,\}, $$

where opt denotes minimization or maximization, depending on the evaluation function.


From a member of $\mathrm{OPT}^*_k(F, S)$ it is further required that it has as few intervals as any optimal partition:

$$ \mathrm{OPT}^*_k(F, S) = \{\, P \in \mathrm{OPT}_k(F, S) : |P| \le |P'| \text{ for all } P' \in \mathrm{OPT}_k(F, S) \,\}. $$

In practice the maximum number of intervals has the upper bound $V$, the number of bins in the range. One can define many different versions of the optimal multisplitting problem. From the practical viewpoint the interesting ones are the following.

Globally optimal: Find an optimal partition $P \in \mathrm{OPT}_V(F, S)$. It is sufficient to return any optimal partition, be it of any arity.

Minimal globally optimal: Find an optimal partition $P \in \mathrm{OPT}_V(F, S)$ such that its arity is at most that of any other optimal partition of $S$.

Bounded arity optimal: Find an optimal partition $P \in \mathrm{OPT}_k(F, S)$. It is sufficient to return any bounded arity optimal partition as long as its arity is at most $k$.

Minimal bounded arity optimal: Find a bounded arity optimal partition $P \in \mathrm{OPT}_k(F, S)$ such that its arity is at most that of any other bounded arity optimal partition of $S$.

Fixed arity optimal: Find an optimal partition $P \in \mathrm{OPT}_k(F, S)$ such that $|P| = k$, where $1 < k \le V$.

Example. Consider the numerical value range shown in Figs. 1 and 2. As already observed, the partition with all bins as separate intervals is one of the globally TSE-optimal partitions in $\mathrm{OPT}_V(\mathrm{TSE}, S)$ for this numerical value range. It has nine intervals and makes 7 misclassifications. However, it is not a minimal globally optimal partition in $\mathrm{OPT}^*_V(\mathrm{TSE}, S)$, because there is a partition with only two intervals that obtains the same misclassification rate, or TSE-value: the one in which the two first bins make up the first interval and the remaining bins are gathered into the second interval. Next, observe that the optimal three-interval partition also has score 7. Since this is the globally optimal score, that partition must also be bounded arity optimal, a member of $\mathrm{OPT}_3(\mathrm{TSE}, S)$. However, it is not minimal bounded arity optimal, a member of $\mathrm{OPT}^*_3(\mathrm{TSE}, S)$, because of the existence of the binary partition with the same TSE-score.
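To make the example concrete, the following small script recomputes the quoted TSE values from the bin class counts of Fig. 2. It is an illustrative sketch only; the helper names are not from the paper.

```python
# Sketch: verify the TSE values quoted in the example above.
# Bin class distributions (X, Y) for attribute values 1..9, read off Fig. 2.
bins = [(1, 0), (2, 0), (1, 2), (2, 4), (1, 4), (0, 1), (0, 3), (1, 1), (2, 2)]

def tse(intervals):
    """Training Set Error: each interval is labeled by its majority class,
    so it contributes its minority count to the error."""
    return sum(min(x, y) for x, y in intervals)

def merge(bins_slice):
    """Merge a run of bins into one interval by summing the class counts."""
    xs, ys = zip(*bins_slice)
    return (sum(xs), sum(ys))

# Full-bin partition: every bin is its own interval.
full = list(bins)
# Binary partition: the bins for values 1-2 form the first interval,
# the remaining bins form the second interval.
binary = [merge(bins[:2]), merge(bins[2:])]

print(tse(full))    # 7
print(tse(binary))  # 7
```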

3. Pruning of cut points in preprocessing

As discussed above, the partition containing all bins as its intervals is a globally optimal partition for TSE.

val:  1–2  3    4    5    6–7  8    9
X/Y:  3/–  1/2  2/4  1/4  –/4  1/1  2/2

Figure 3. The blocks in the sample of Fig. 1. Block borders are the boundary points of the range.

val:  1–2  3–4  5    6–7  8–9
X/Y:  3/–  3/6  1/4  –/4  3/3

Figure 4. The segments in the sample of Fig. 1. Segment borders are a subset of the boundary points in the range.

The same actually holds for all convex (and concave) evaluation functions (see the next section). However, it is possible to preprocess the numerical value range into an often radically smaller number of example groups without losing the possibility to recover optimal partitions. It has been proven that most known attribute evaluation functions cannot obtain their optimal value within a sequence of examples in between two consecutive boundary points [12], the so-called block of examples [9], or within a segment of examples in which the relative class distribution of the examples is static [10]. Most often the property is a consequence of the convexity of the evaluation function within a block or a segment of examples, but some non-convex functions also possess these properties. We recapitulate blocks and segments of examples, as well as boundary points, intuitively with the help of an illustration. The definition of boundary points assumes that the sample has been sorted into an ordered value range with respect to the value of the numerical attribute under consideration [12]. Let us start with the bins corresponding to the value range. To determine the correlation between the value of an attribute and that of the class it suffices to examine their mutual frequencies, which have been stored in the bins. To construct blocks of examples we merge together adjacent class-uniform bins with the same class label (see Fig. 3). The boundary points of the value range are the borders of its blocks. Block construction still leaves all bins with a mixed class distribution as their own blocks. The evaluation functions that are known to have optimal partitions defined by boundary points are Average Class Entropy, Information Gain [18], Gain Ratio [18], Normalized Distance Measure [16], Gini Index [5], Training Set Error, and the MDL-measure of Wallace and Patrick [21]. From bins we obtain segments of examples by combining

adjacent bins with an equal relative class distribution (see Fig. 4). Segments group together adjacent mixed-distribution bins that have an equal relative class distribution. Adjacent class-uniform bins also fulfill this condition; hence, uniform blocks are a special case of segments and segment borders are a subset of boundary points. Elomaa and Rousu [10] have shown that for the above-mentioned attribute evaluation functions (with the exception of the MDL-measure) only segment borders need to be examined. How can blocks and segments of examples be used in finding optimal partitions? The results show that the functions mentioned do not have an optimal cut point within a block or a segment. Therefore, when searching for an optimal partition of any arity, it suffices to inspect the respective combinations of boundary points or segment borders. When the evaluation function is cumulative [13, 9], that is, takes the form of a sum over the intervals, the combinations can be checked in quadratic time using dynamic programming. Not all of the above-mentioned evaluation functions possess this property. For example, the non-convex evaluation functions Gain Ratio and Normalized Distance Measure are not cumulative and, thus, cannot be optimized efficiently using dynamic programming. Next we will look into whether further advantage could be obtained through efficient preprocessing in searching for optimal partitions.
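As an illustration of the linear-time preprocessing just described, the following sketch groups a sorted sample into bins and then merges adjacent bins into blocks and segments. The function names and the Counter-based representation are illustrative choices, not taken from the paper.

```python
from collections import Counter
from itertools import groupby
from fractions import Fraction

def make_bins(examples):
    """Group a sample of (value, class) pairs into bins: one class-count
    Counter per distinct attribute value, in ascending value order."""
    examples = sorted(examples)          # sort by attribute value
    return [Counter(c for _, c in group)
            for _, group in groupby(examples, key=lambda vc: vc[0])]

def merge_adjacent(bins, same_group):
    """Merge adjacent bins whenever same_group(left, right) holds."""
    merged = [bins[0].copy()]
    for b in bins[1:]:
        if same_group(merged[-1], b):
            merged[-1] += b
        else:
            merged.append(b.copy())
    return merged

def uniform_same_class(left, right):
    """Blocks: merge adjacent class-uniform bins of the same class."""
    return len(left) == 1 and len(right) == 1 and set(left) == set(right)

def same_relative_distribution(left, right):
    """Segments: merge adjacent bins with equal relative class distributions."""
    classes = set(left) | set(right)
    nl, nr = sum(left.values()), sum(right.values())
    return all(Fraction(left[c], nl) == Fraction(right[c], nr) for c in classes)

# The sample of Fig. 1 as (value, class) pairs.
sample = list(zip(
    [1, 2, 2, 3, 3, 3, 4, 4, 4, 4, 4, 4, 5, 5, 5, 5, 5, 6, 7, 7, 7, 8, 8, 9, 9, 9, 9],
    "XXXXYYYYYYXXXYYYYYYYYYXXXYY"))

bins = make_bins(sample)                                      # 9 bins (Fig. 2)
blocks = merge_adjacent(bins, uniform_same_class)             # 7 blocks (Fig. 3)
segments = merge_adjacent(bins, same_relative_distribution)   # 5 segments (Fig. 4)
```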

4. Optimal partitions of strictly convex evaluation functions

Many of the most widely used attribute evaluation functions are either convex (upwards) or concave (i.e., convex downwards) [4, 14, 9, 6]; both are usually referred to as convex functions.

Definition. A function $f$ is said to be convex over an interval $(a, b)$ if for every $x_1, x_2 \in (a, b)$ and $0 \le \lambda \le 1$,
$$ f(\lambda x_1 + (1 - \lambda) x_2) \le \lambda f(x_1) + (1 - \lambda) f(x_2). $$
A function $f$ is said to be strictly convex if equality holds only if $\lambda = 0$ or $\lambda = 1$. A function $f$ is concave if $-f$ is convex.

Let $X$ be a random variable with domain $\mathcal{X}$ and let $E[X]$ denote its expectation; in the discrete case $E[X] = \sum_{x \in \mathcal{X}} p(x)\, x$, where $p(x)$ is the probability of the value $x$.

Theorem 1 (Jensen's inequality [8]) If $f$ is a convex function and $X$ is a random variable, then
$$ E[f(X)] \ge f(E[X]). $$
If $f$ is strictly convex, then equality in the above implies that $X = E[X]$ with probability 1, i.e., $X$ is a constant.

For a concave function Jensen's inequality is reversed. Thus, for a concave $f$, substituting the discrete expectation, the inequality becomes

$$ \sum_{i} p_i f(x_i) \le f\Big(\sum_{i} p_i x_i\Big). \qquad (1) $$

Jensen's inequality does not restrict the probability distribution underlying the expectation. It is enough that the probabilities $p_i$ are all non-negative and sum up to 1.

Typically, partition ranking functions give each interval a score using another function, which tries to estimate the class coherence of the interval. A common class of such functions are the impurity functions [5]. The interval scores are usually weighted relative to the sizes of the intervals. Thus, a common form of an evaluation function is

$$ F(S_1, \ldots, S_k) = \sum_{i=1}^{k} \frac{|S_i|}{|S|}\, f(P(S_i)), \qquad (2) $$

where $P(S_i)$ is the relative class frequency distribution of the set $S_i$ and $f$ is an impurity function. By the relative class frequency distribution of the set $S$ we mean the vector $P(S) = (p_1(S), \ldots, p_m(S))$, in which $c_1, \ldots, c_m$ are the possible values of the class attribute and $p_j(S)$ stands for the proportion of elements of $S$ that have class $c_j$: $p_j(S) = |\{\, s \in S : \mathrm{val}_C(s) = c_j \,\}| / |S|$. Now, $p_j(S) \ge 0$ for all $j$ and $\sum_{j=1}^{m} p_j(S) = 1$.

Corollary 1 If the impurity function $f$ is concave, then (by Eq. 1)

$$ F(S_1, \ldots, S_k) = \sum_{i=1}^{k} \frac{|S_i|}{|S|}\, f(P(S_i)) \le f\Big(\sum_{i=1}^{k} \frac{|S_i|}{|S|}\, P(S_i)\Big) = f(P(S)) = F(S), \qquad (3) $$

in which $F(S)$ is the score of the unpartitioned data. Thus, any splitting of the data can only decrease the value of $F$ and splitting on all cut points will lead to the best score. Moreover, for a strictly concave $f$, the equality holds only if $P(S_i) = P(S)$ for all $i$. Therefore, merging together adjacent segments will always result in a partition with a worse score. By considering a convex, rather than a concave, evaluation function, for which optimization means maximization, and reversing the inequalities of Equations 1 and 3, it is observed that also then any partitioning leads to a better score. Examples of commonly used attribute evaluation functions that are strictly convex or concave are Average Class Entropy, Information Gain, and Gini index.
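As an illustration of Eq. 2 and Corollary 1, the following sketch scores partitions of the Fig. 4 segments with the entropy impurity (so that Eq. 2 becomes a weighted class entropy). The entropy is used only as one example of a concave impurity, and the helper names are illustrative.

```python
from math import log2

def entropy(dist):
    """Entropy impurity of a relative class frequency distribution."""
    return -sum(p * log2(p) for p in dist if p > 0)

def weighted_impurity(intervals, impurity=entropy):
    """Eq. 2: size-weighted impurity of a partition given as a list of
    class-count tuples, one tuple per interval."""
    n = sum(sum(counts) for counts in intervals)
    total = 0.0
    for counts in intervals:
        m = sum(counts)
        total += (m / n) * impurity(c / m for c in counts)
    return total

# Class counts (X, Y) of the five segments of Fig. 4.
segments = [(3, 0), (3, 6), (1, 4), (0, 4), (3, 3)]

unsplit = [tuple(map(sum, zip(*segments)))]   # the whole range as one interval
print(weighted_impurity(unsplit))             # impurity of the unpartitioned data
print(weighted_impurity(segments))            # full segment partition: never larger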

Let the full segment partition of a numerical value range be the one in which each segment of the range constitutes a partition interval. By Corollary 1, strictly convex and concave functions must have a globally optimal partition defined by segment borders. Next we show that this partition, in fact, is the unique minimal globally optimal partition with respect to such functions.

Theorem 2 For any strictly concave function $f$ the full segment partition of any value range is the unique minimal globally optimal partition.

Proof Let $P$ be an arbitrary partition of a value range $S$. Let $P'$ be obtained from $P$ by adding a cut point on each segment border that is not in $P$. Each new cut point defines two intervals into $P'$ with different relative class frequency distributions. By the strict concavity of $f$ and Corollary 1, $F(P') \le F(P)$. Adding a further cut point to $P'$ cannot improve the score of the partition, since it must introduce two new intervals with an equal relative class frequency distribution. Therefore, $P'$ is an optimal partition of $S$. Next, we remove from $P'$ all cut points that are not on segment borders to obtain partition $P''$. The intervals on both sides of a cut point that is removed must have the same relative class frequency distribution. Thus, $F(P'') = F(P')$. Obviously, $P''$ is the full segment partition of $S$. No more cut points can be removed from $P''$ without changing the $F$-score, because all adjacent intervals now have different relative class distributions and $f$ is strictly concave. Hence, $P''$ must be minimal. The claim follows because $S$ and $P$ are arbitrary.

Bounding the arity of the partition below the number of segments means that the full segment partition cannot be considered as an alternative. However, many concave impurity functions have non-positive second derivatives. This will help us to restrict our attention to segment borders even when the partition arity is bounded. Elomaa and Rousu [10] gave an explicit proof for a couple of such functions; the proof below applies to all such functions. The following proof (and some subsequent ones) concerns binary partitioning. However, for cumulative evaluation functions the theorem applies to the multi-interval case as well, since moving the cut point within an embedded binary partition affects only the corresponding two terms in the impurity score. The other terms in the multisplit are not affected by moving the cut point within the example segments under consideration (see [9, 10] for a more thorough treatment of this topic).

Theorem 3 For an impurity function $f$ with a second derivative that is non-positive everywhere, there is a partition in $\mathrm{OPT}_k(F, S)$, where $k$ and $S$ are arbitrary, such that its cut points are segment borders.

Figure 5. The proof setting considers partitioning of the sample within segment Q, which lies between the example sets P and R. Dividing Q at any point results in two subsets with equal relative class distributions.

Proof Let $P$, $Q$, and $R$ be example sets separated by cut points $T_1$ and $T_2$. Let the set $Q$ be composed of a single example segment. Let us consider splitting the sets $P$, $Q$, and $R$ into two intervals so that the cut point is situated inside $Q$ or on its borders. Let us denote by $\lambda$ the fraction of $Q$ situated to the left of the cut point (see Fig. 5). The class distribution of the left-hand side of the cut is then

$$ \frac{|P|\, P(P) + \lambda |Q|\, P(Q)}{|P| + \lambda |Q|}. $$

Thus, the class distribution of the left-hand side of the partition moves along a line segment in the space of class distributions as $\lambda$ grows from 0 to 1. Since $f$ has a second derivative that is non-positive everywhere, $f$ is concave along the line segment. Hence, the local minima lie at the end points of the line segment; that is, at $\lambda = 0$ and $\lambda = 1$. By symmetry, also the right-hand side of the split forms a concave curve, and the quality of the split is concave over $\lambda$, because the sum of two concave functions is also concave. Hence, no local minimum can lie inside the segment $Q$.

Based on the above theorem, it is sufficient to restrict one's attention to segment borders in bounded arity optimization given a strictly convex attribute evaluation function. Whether two segments should be separated or not in a bounded arity partition cannot be decided by looking at the (relative) class distributions of the segments alone. One needs to take into account as well the context in which the segments are. A context consists of the example set preceding the pair of segments and that following it.

Theorem 4 For any strictly concave function $f$ and every pair of adjacent example segments $Q_1$ and $Q_2$, there is a context $\langle P, R \rangle$ such that the partition in which $Q_1$ and $Q_2$ are separate intervals is optimal.

Proof Because $Q_1$ and $Q_2$ are segments, their relative class distributions are different. Let us consider the context $\langle P, R \rangle$ such that $P = R = \emptyset$. Then

$$ F(Q_1, Q_2) < F(Q_1 \cup Q_2) $$

because of the strict concavity of $f$.

Because of Theorem 4 one cannot prune away segment borders without taking into account the placement of the other cut points. Thus, it seems that developing a general subquadratic preprocessing scheme for operating on a subset of segment borders is difficult.
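Once preprocessing has reduced the candidates to segment borders, a cumulative evaluation function can be searched over them with the quadratic-time dynamic programming mentioned in Section 3. The following is a minimal sketch of such a bounded arity search for TSE; it is an illustration only, not the algorithm of [13, 9] verbatim, and the names are ours.

```python
# Dynamic-programming sketch of bounded arity optimization for a cumulative
# evaluation function (TSE) over a fixed list of candidate intervals,
# e.g. the example segments produced by preprocessing.
from functools import lru_cache

def tse_of_range(prefix, i, j):
    """Error of merging candidate intervals i..j-1 into one interval,
    computed from per-class prefix sums."""
    counts = [prefix[c][j] - prefix[c][i] for c in range(len(prefix))]
    return sum(counts) - max(counts)

def optimal_multisplit(intervals, k):
    """Minimum TSE over partitions of the candidate intervals into at most
    k contiguous groups. intervals: list of class-count tuples."""
    n = len(intervals)
    m = len(intervals[0])
    # prefix[c][j] = number of class-c examples in intervals[0..j-1]
    prefix = [[0] * (n + 1) for _ in range(m)]
    for j, counts in enumerate(intervals):
        for c in range(m):
            prefix[c][j + 1] = prefix[c][j] + counts[c]

    @lru_cache(maxsize=None)
    def best(j, parts):
        """Minimum error of covering intervals[0..j-1] with at most parts groups."""
        if j == 0:
            return 0
        if parts == 1:
            return tse_of_range(prefix, 0, j)
        return min(best(i, parts - 1) + tse_of_range(prefix, i, j)
                   for i in range(j))

    return best(n, k)

segments = [(3, 0), (3, 6), (1, 4), (0, 4), (3, 3)]   # segments of Fig. 4
print(optimal_multisplit(segments, 2))   # binary partition: 7
print(optimal_multisplit(segments, 9))   # unrestricted arity: also 7
```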

5. Optimal partitions of Training Set Error

TSE is not a strictly concave function [11]. Therefore, Corollary 1 and Theorem 4 do not tell us anything about the function's minimal globally optimal partition, although Theorem 3 and the earlier explicit proof [10] show that only segment borders need to be considered. We prove that some segment borders can, indeed, be disregarded when trying to find optimal partitions with respect to TSE.

Let a segment majority alternation point be the border in between two consecutive segments that have different majority classes. For example, in the data set of Fig. 4 there is only one majority alternation, in between the first and the second segment. Majority alternations help to find TSE-optimal partitions.

Theorem 5 The partition defined by segment majority alternations is the unique minimal globally optimal partition with respect to TSE.

Proof Let $P^*$ be the partition defined by segment majority alternations. It is easy to check that $P^*$ has the same misclassification rate as the full segment partition. Thus, $P^*$ is globally optimal with respect to TSE. To see that $P^*$ is minimal as well, let us first show that all majority alternation points must be cut points in the minimal globally optimal partition of TSE. Let $S$ be an arbitrary value range (with at least one majority alternation) and let $P \in \mathrm{OPT}_V(\mathrm{TSE}, S)$ be such that it does not contain all majority alternation points. Let $S_i$ be an interval of $P$ containing majority alternation points that are not a part of the partition. Because the majority class of segments changes at the majority alternation points, partitioning $S_i$ into subintervals at these majority alternations reduces the misclassification of $S_i$ and, thus, gives a better partition than $P$. This contradicts the optimality of $P$. Therefore, any minimal globally optimal partition has a cut point at each majority alternation point. Let us then assume that the minimal globally optimal partition has, in addition to the majority alternation points, other cut points. Let $T$ be one such extra cut point. Now $T$ lies in between two consecutive majority alternation points. Thus, the majority classes on both sides of $T$ are the same, and the partition intervals on both sides of $T$ have the same class label. Removing $T$ will not affect the

number of misclassifications, but it reduces the arity of the partition. Hence, the existence of $T$ contradicts the minimality of the partition. If there is more than one extra cut point in between two consecutive majority alternation points, then the intervals induced by all of them have the same majority class, and all of them can be removed without affecting the misclassification of the partition.

The segments of a value range can be identified in a left-to-right scan through the sorted example sequence. Thus, segments and also majority alternation points can be found in linear time with respect to the number of examples in the sample. As a consequence of Theorem 5, the unique minimal globally optimal partition can also be found in linear time. However, when a bounded arity optimal partition is required, examining majority alternation points alone does not suffice; one has to consider the frequencies of the minority classes as well. In the following, we show how the concept of a majority alternation point can be generalized so that handling the bounded arity case becomes possible.

Let $Q_1$ and $Q_2$ be two adjacent example segments with relative class frequency distributions $P(Q_1)$ and $P(Q_2)$, respectively. There is a segment alternation point in between $Q_1$ and $Q_2$ if there is no ordering $c_{i_1}, \ldots, c_{i_m}$ of the classes such that $p_{i_1}(Q_1) \ge \cdots \ge p_{i_m}(Q_1)$ and $p_{i_1}(Q_2) \ge \cdots \ge p_{i_m}(Q_2)$. In other words, an alternation occurs if no common ordering of all the classes in descending order of frequency fits both segments. For a segment border to be a majority alternation, it suffices that the majority classes of the two segments are different. Therefore, a majority alternation is a special case of an alternation. However, all alternations are not majority alternations, except in the two-class setting, where the two concepts coincide. Thus, the number of alternations is always at least as high as that of majority alternations. Next we show that, analogously to Theorem 3, only alternations need to be considered when searching for the TSE-optimizing partition.

Theorem 6 For each value range $S$ and each arity bound $k$ there is an optimal partition in $\mathrm{OPT}_k(\mathrm{TSE}, S)$ that is defined on segment alternations.

Proof Let $Q_1$, $Q_2$, $Q_3$, and $Q_4$ form a sequence of subsets along the value range. Let the relative class distributions of the sets $Q_1$ and $Q_4$ be arbitrary and let $Q_2$ and $Q_3$ be such that there is no alternation point in between the sets $Q_2$ and $Q_3$. Let us now consider splitting the range into two intervals and labeling the left-hand side with class $c_i$ and the right-hand side with class $c_j$. Let $E_L$ denote the error of the binary partition with $Q_1$ as the left-hand side and all remaining segments as the right-hand side. Respectively, $E_R$

has $Q_4$ as the right-hand side, and $E_M$ denotes the error of the binary partition with the cut point in between $Q_2$ and $Q_3$. Furthermore, let $\varepsilon(Q, c)$ denote the error of subset $Q$ with respect to class $c$, i.e., the number of examples in $Q$ that are not of class $c$. The errors of the partitions are

$$ E_L = \varepsilon(Q_1, c_i) + \varepsilon(Q_2, c_j) + \varepsilon(Q_3, c_j) + \varepsilon(Q_4, c_j), $$
$$ E_M = \varepsilon(Q_1, c_i) + \varepsilon(Q_2, c_i) + \varepsilon(Q_3, c_j) + \varepsilon(Q_4, c_j), $$
$$ E_R = \varepsilon(Q_1, c_i) + \varepsilon(Q_2, c_i) + \varepsilon(Q_3, c_i) + \varepsilon(Q_4, c_j). $$

By assumption, the point in between $Q_2$ and $Q_3$ is not a segment alternation. Therefore, from the definition of a segment alternation it follows that for any pair of classes $c_i$ and $c_j$, $i \ne j$, either 1) $\varepsilon(Q_2, c_i) \le \varepsilon(Q_2, c_j)$ and $\varepsilon(Q_3, c_i) \le \varepsilon(Q_3, c_j)$, or 2) $\varepsilon(Q_2, c_i) \ge \varepsilon(Q_2, c_j)$ and $\varepsilon(Q_3, c_i) \ge \varepsilon(Q_3, c_j)$. In the first case $\varepsilon(Q_3, c_i) \le \varepsilon(Q_3, c_j)$ and, consequently,

$$ E_R = E_M - \varepsilon(Q_3, c_j) + \varepsilon(Q_3, c_i) \le E_M. $$

In the second case $\varepsilon(Q_2, c_j) \le \varepsilon(Q_2, c_i)$, and

$$ E_L = E_M - \varepsilon(Q_2, c_i) + \varepsilon(Q_2, c_j) \le E_M. $$

Hence, the partition corresponding to $E_M$ is at most as good as the two other partitions. Since the classes $c_i$ and $c_j$ and the sets $Q_1$ and $Q_4$ are arbitrary, we have shown that in any partition a cut point that is not a segment alternation can be replaced with another cut point without increasing the error.

The consequence of the above theorem is that only those cut points that are alternations need to be considered when looking for the bounded arity optimal TSE partition. Hence, the sample can be processed into intervals separated by alternation points. Note that this can be done in time $O(b \, m \log m)$, where $b$ is the number of example segments and $m$ the number of classes, if the decreasing frequency order of the classes is determined by sorting the relative class frequency vector. The time requirement, obviously, is linear in the size of the data, but not in the number of classes. However, the number of classes is typically small.

Analogously to Theorem 4, we can show that no alternation can be proven suboptimal without considering the context in which the cut point is; that is, which other cut points are present.

Theorem 7 For each segment alternation there is a context in which it is the TSE-optimal cut point.

Proof Let us consider the pair of adjacent segments $Q_2$ and $Q_3$ in the context $\langle Q_1, Q_4 \rangle$. Let there be a pair of classes $c_i$ and $c_j$ for which it holds that $\varepsilon(Q_2, c_i) < \varepsilon(Q_2, c_j)$ and $\varepsilon(Q_3, c_j) < \varepsilon(Q_3, c_i)$. That is, there is a segment alternation in between $Q_2$ and $Q_3$. Now, let the class distribution of $Q_1$ be such that $c_i$ is the majority class of the left-hand interval of any binary split of the range. Note that distributions of this kind are easily generated by choosing $Q_1$ to be large enough and to consist of instances of a single class $c_i$. Now, every left-hand interval is labeled with $c_i$ and, consequently, the part of $Q_2 \cup Q_3$ falling into the left-hand interval contributes its error with respect to $c_i$. Similarly, the class distribution of $Q_4$ can be set so that $c_j$ is the majority class of the right-hand interval of any binary split. Then a similar reasoning concludes that the part of $Q_2 \cup Q_3$ falling into the right-hand interval contributes its error with respect to $c_j$. A cut at the border between $Q_2$ and $Q_3$ thus incurs the error $\varepsilon(Q_2, c_i) + \varepsilon(Q_3, c_j)$, which by the above inequalities is strictly smaller than the errors $\varepsilon(Q_2, c_j) + \varepsilon(Q_3, c_j)$ and $\varepsilon(Q_2, c_i) + \varepsilon(Q_3, c_i)$ of cutting at the outer borders of $Q_2$ and $Q_3$. Thus, the optimal cut point in this case is the alternation in between $Q_2$ and $Q_3$. Since $Q_2$ and $Q_3$ are arbitrary, the claim follows.

The above theorem shows that the usefulness of an alternation point cannot be judged by examining the two neighboring subsets alone. Hence, the existence of a linear-time preprocessing scheme to prune out alternations from the set of possible cut points seems unlikely.
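As an illustration of the left-to-right scan discussed above, the following sketch marks, for a list of adjacent segment class-count vectors, which borders are majority alternations and which are segment alternations. The names are illustrative, and ties in class frequencies are treated as compatible with either ordering, which is an implementation choice rather than something fixed by the paper.

```python
# Sketch: identifying alternation points from segment class-count vectors.

def majorities(counts):
    """Set of classes attaining the maximum count in a segment."""
    top = max(counts)
    return {c for c, n in enumerate(counts) if n == top}

def is_majority_alternation(left, right):
    """True if no class is a majority class of both adjacent segments."""
    return not (majorities(left) & majorities(right))

def is_segment_alternation(left, right):
    """True if some pair of classes appears in strictly opposite frequency
    order in the two segments, so no common descending order of all the
    classes exists."""
    m = len(left)
    return any(left[i] > left[j] and right[i] < right[j]
               for i in range(m) for j in range(m))

def alternation_borders(segments, test):
    """Indices i such that the border between segments i-1 and i passes test."""
    return [i for i in range(1, len(segments))
            if test(segments[i - 1], segments[i])]

segments = [(3, 0), (3, 6), (1, 4), (0, 4), (3, 3)]    # segments of Fig. 4
print(alternation_borders(segments, is_majority_alternation))  # [1]
print(alternation_borders(segments, is_segment_alternation))   # [1] (two classes)
```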

6. Empirical experiments

In this section we examine segment alternations with real-world data. For 28 well-known data sets from the UCI data repository [3] we measure the relations of the average numbers of bin borders, boundary points, segment borders, segment alternations, and majority alternations per numerical attribute. Fig. 6 depicts the results of the experiment. A striking result is that in those domains where there are many classes (e.g., Abalone, Auto insurance, Letter recognition, and Yeast) the number of segment alternations is not much smaller than that of segment borders. The number of majority alternations, though, is somewhat smaller. This result is not really surprising, because the more classes there are, the less common it is that two adjacent segments have the same frequency order for all of them. The majority class may still be the same in two adjacent segments, even though the number of classes is high. There is no difference in the numbers of alternations and majority alternations on two-class domains (e.g., Adult, Australian, Breast Wisconsin, etc.), which is clear from their definition. For some two-class domains (e.g., Breast Wisconsin, Euthyroid, Heart Hungarian, and Hypothyroid) the number of alternations and majority alternations is significantly lower than that of segment borders (ca. 75%). Hence, one can expect important time savings for TSE by processing alternations rather than segment borders. On other multiclass domains (e.g., Annealing, Heart Cleveland, Letter Recognition, Satellite, Vehicle, and Yeast) there is a significant difference in the numbers of alternations and majority alternations.

Figure 6. The average number of bin borders (the figures on the right) and the relative numbers of boundary points (black bars), segment borders (white bars), segment alternation points (gray bars), and majority alternations (dark gray bars) per numerical attribute of the domain. The bar chart covers the 28 UCI domains Abalone, Adult, Annealing, Australian, Auto insurance, Breast Wisconsin, Colic, Diabetes, Euthyroid, German, Glass, Heart Cleveland, Heart Hungarian, Hepatitis, Hypothyroid, Iris, Letter recognition, Liver, Page blocks, Satellite, Segmentation, Shuttle, Sonar, Vehicle, Vowel, Waveform, Wine, and Yeast; the horizontal axis gives the number of cut points relative to the number of bin borders (25%, 50%, 75%).

7. Discussion

We saw above that only segment borders need to be examined when searching for any bounded arity optimal partition of a convex or concave evaluation function. Hence, only a well-defined subset of boundary points needs to be

examined as cut point candidates. Processing the data into a sequence of example segments can be done in linear time, which makes the result also applicable in practice. Preprocessing the data into segments rather than using all the cut points can result in a speed-up of 40–90% [20]. On the negative side, we showed that the technique of deciding the optimality of a cut point just by looking at the neighboring indivisible subsets cannot be extended beyond example segments if the evaluation function is strictly convex. Hence, other linear-time preprocessing schemes may be hard to come by. Using significantly more than linear time is not fruitful either, because the complexity of the subsequent search phase is usually quadratic.

Training Set Error, which differs from other common evaluation functions by not being strictly convex, is able to break the barrier of segment borders. For it, only alternation points have to be examined in bounded-arity partitioning, and a subset of them, the majority alternations, are the only cut point candidates in global optimization. Examining all of them cannot be avoided in the left-to-right preprocessing scan. On some real-world domains the number of alternation points was discovered to be significantly smaller than that of segment borders. Therefore, practical enhancements in optimizing the value of TSE can be expected from using alternation points rather than segment borders. However, since TSE optimization requires only linear time [1, 2], careful implementation is needed to harvest the benefit.

Let us still illuminate the significance of the lower bound results for the number of cut point candidates presented in Theorems 4 and 7. The results show that the optimality of an alternation point (segment border) cannot be decided by looking at the neighboring example segments alone. Instead, one must consider the context of the cut point, that is, the neighboring cut points to the left and right. However, since there is a quadratic number of such contexts, one cannot enumerate all of them in a reasonable time. In any case, a linear-time algorithm seems to be out of reach. Moreover, the question of the optimality of a segment border (or an alternation) becomes conditional on the bound on split arity. Since the full segment (and full majority alternation) partition is the minimal globally optimal partition, when the arity bound is very close to the number of segments (or majority alternations), only very few segment borders (and majority alternation points) are suboptimal. Hence, in general, no further savings can be obtained.

8. Conclusion

In this paper we have shown that it suffices to examine segment borders, which are a subset of boundary points, in searching for the optimal partition of a value range with respect to a convex evaluation function. This holds both when searching for globally optimal and for bounded arity optimal partitions.

The evaluation function Training Set Error, which is not strictly convex, even lets us ignore some of the segment borders; only majority alternations need to be considered when searching for the global optimum, and only segment alternations when searching for the bounded arity optimum. On the other hand, we were able to show that no segment borders can be ignored with strictly convex functions, and in bounded arity partitioning no alternations can be ignored with TSE. For every such point there is a context in which it is a part of the optimal split. Hence, the existence of a very fast preprocessing scheme for strictly convex functions and TSE that would prune out segment borders and alternations, respectively, is unlikely.

References

[1] P. Auer. Optimal splits of single attributes. Technical report, Institute for Theoretical Computer Science, Graz University of Technology, 1997.
[2] A. Birkendorf. On fast and simple algorithms for finding maximal subarrays and applications in learning theory. In S. Ben-David, editor, Computational Learning Theory, volume 1208 of Lecture Notes in Artificial Intelligence, pages 198–209, Berlin, Heidelberg, 1997. Springer-Verlag.
[3] C. L. Blake and C. J. Merz. UCI Repository of Machine Learning Databases. University of California, Department of Information and Computer Science, Irvine, CA, 1998. http://www.ics.uci.edu/~mlearn/MLRepository.html.
[4] L. Breiman. Some properties of splitting criteria. Machine Learning, 24(1):41–47, 1996.
[5] L. Breiman, J. H. Friedman, R. A. Olshen, and C. J. Stone. Classification and Regression Trees. Wadsworth, Pacific Grove, CA, 1984.
[6] C. W. Codrington and C. E. Brodley. On the qualitative behavior of impurity-based splitting rules I: The minima-free property. Machine Learning, 2001. To appear.
[7] D. Coppersmith, S. J. Hong, and J. R. M. Hosking. Partitioning nominal attributes in decision trees. Data Mining and Knowledge Discovery, 3(2):197–217, 1999.
[8] T. M. Cover and J. A. Thomas. Elements of Information Theory. John Wiley & Sons, New York, NY, 1991.
[9] T. Elomaa and J. Rousu. General and efficient multisplitting of numerical attributes. Machine Learning, 36(3):201–244, 1999.
[10] T. Elomaa and J. Rousu. Generalizing boundary points. In Proc. Seventeenth National Conference on Artificial Intelligence, pages 570–576, Menlo Park, CA, 2000. AAAI Press.
[11] T. Elomaa and J. Rousu. On the computational complexity of optimal multisplitting. Fundamenta Informaticae, 47, 2001. In press.
[12] U. M. Fayyad and K. B. Irani. On the handling of continuous-valued attributes in decision tree generation. Machine Learning, 8:87–102, 1992.
[13] T. Fulton, S. Kasif, and S. Salzberg. Efficient algorithms for finding multi-way splits for decision trees. In A. Prieditis and S. Russell, editors, Proc. Twelfth International Conference on Machine Learning, pages 244–251, San Francisco, CA, 1995. Morgan Kaufmann.
[14] R. J. Hickey. Noise modelling and evaluating learning from examples. Artificial Intelligence, 82(1–2):157–179, 1996.
[15] S. J. Hong. Use of contextual information for feature ranking and discretization. IEEE Transactions on Knowledge and Data Engineering, 9(5):718–730, 1997.
[16] R. López de Màntaras. A distance-based attribute selection measure for decision tree induction. Machine Learning, 6(1):81–92, 1991.
[17] H. Nguyen and A. Skowron. Boolean reasoning for feature extraction problems. In Z. Raś and A. Skowron, editors, Foundations of Intelligent Systems, volume 1325 of Lecture Notes in Artificial Intelligence, pages 117–126, Charlotte, NC, 1997. Springer-Verlag.
[18] J. R. Quinlan. Induction of decision trees. Machine Learning, 1(1):81–106, 1986.
[19] M. Robnik-Šikonja and I. Kononenko. An adaptation of Relief for attribute estimation in regression. In D. H. Fisher, editor, Proc. Fourteenth International Conference on Machine Learning, pages 296–304, San Francisco, CA, 1997. Morgan Kaufmann.
[20] J. Rousu. Efficient Range Partitioning in Classification Learning. PhD thesis, Department of Computer Science, University of Helsinki, 2001. Report A-2001-1.
[21] C. S. Wallace and J. D. Patrick. Coding decision trees. Machine Learning, 11(1):7–22, 1993.
[22] D. Zighed, R. Rakotomalala, and F. Feschet. Optimal multiple intervals discretization of continuous attributes for supervised learning. In D. Heckerman et al., editors, Proc. Third International Conference on Knowledge Discovery and Data Mining, pages 295–298, Menlo Park, CA, 1997. AAAI Press.
