Fast Subsampling Performance Estimates for Classification Algorithm Selection
Technical Report TR-2000-07
http://www.ai.univie.ac.at/cgi-bin/tr-online?number+2000-07

Johann Petrak ([email protected])
Austrian Research Institute for Artificial Intelligence

Abstract

The typical data mining process is characterized by the prospective and iterative application of a variety of different data mining algorithms from an algorithm toolbox. While it would be desirable to check many different algorithms and algorithm combinations for their performance on a database, it is often not feasible because of time and other resource constraints. This paper investigates the effectiveness of simple and fast subsampling strategies for algorithm selection. We show that even such simple strategies perform quite well in many cases and propose to use them as a base-line for comparison with meta-learning and other advanced algorithm selection strategies.

1 Introduction

With the availability of a wide range of different classification algorithms, strategies for selecting the most adequate one in a particular data mining situation become more crucial. Many characteristics of both the learning algorithm and the kind of model generated by the algorithm potentially influence the decision of which algorithm should be selected: the most important characteristics are the accuracy and understandability of the model, and the amount of computing resources (time, memory) needed to find the model. Data mining often also requires iteratively processing a database and reassessing the question of which algorithms to choose based on earlier results.

In this paper we analyze a much simpler non-iterative decision problem: given a database and a set of classification learning algorithms, how can we find the algorithm that has minimal expected i.i.d. error (i.e. minimal error on independent identically distributed samples from the domain of the database)? A common solution for this problem is to estimate the expected error of all algorithms and to pick the one with the minimum error estimate. Previous publications have found cross validation to be an effective strategy for algorithm selection [Schaffer 1993]. Cross validation as a means for estimating the expected i.i.d. error of learning algorithms has been used in many comparative studies of learning algorithms [Lim 2000], and has been compared to other, often more computing-intensive methods like the bootstrap [Bailey & Elkan 1993], [Kohavi 1995]. Variants of a cross validation estimate are typically used for smaller datasets, e.g. for empirical comparison of learning algorithms [Dietterich 1996], while for bigger databases, hold-out estimates are usually preferred [Rasmussen et al. 1996]. But even the computation of a single hold-out estimate can become too slow for very big databases. Subsampling the data can be used to further reduce the computational effort.

An entirely different approach to algorithm selection is to avoid error estimation altogether and to base the decision of which algorithm to choose on other kinds of information. With metalearning for algorithm selection, learning algorithms are used to find a model that can be used to choose a suitable learning algorithm based on characteristics of the database [Michie et al. 1994], [Brazdil & Soares 2000], [Hilario & Kalousis 1999], [Lindner & Studer 1999]. The computational effort of this approach depends mainly on which database characteristics are used and how efficiently these measures can be calculated. Consequently, metalearning approaches are preferable to approaches that use error estimation if the quality of the suggestions is comparable but the computational effort for calculating the suggestions is smaller.

The work presented in this paper was partly motivated by the wish to empirically establish a "base line" against which metalearning approaches can be compared, and to find measures by which the effectiveness of such approaches can be compared. As a first step we study how well very simple subsampling-based estimation strategies do when compared to the more computing-intensive crossvalidation method. These strategies are described in section 2. In sections 4 and 5, we present a qualitative and quantitative analysis of how well these simple estimation procedures perform when compared to the computationally expensive procedure based on a full crossvalidation estimate, and when compared to very simple strategies (e.g. always picking a random algorithm).

2 Algorithm selection strategies

Data mining practitioners often assess the usability of algorithms in a data mining situation by calculating a quick error estimate on a subsample of the data. This is often done in an ad-hoc fashion, where the sample size is chosen fairly arbitrarily. The algorithm to proceed with is then selected based on which estimate yielded the least error. The optimal decision in this situation would of course be to choose the algorithm with the least true i.i.d. error. Since the true i.i.d. error cannot be known, we need some other procedure to decide which algorithm to choose. We call any such procedure a selection strategy.

To assess how useful even simple ad-hoc selection strategies might be, we define three strategies S1, S2, and S3 which are based on a fast single holdout estimate of error. To compute this estimate, each learning algorithm is trained on a subset of size nt of the database (the training set) and the error of the generated model, as evaluated on a different subset of size ne (the test set), is used as the desired estimate. Table 1 shows the training and test set sizes for the three selection strategies analyzed. Strategy S1 is based on constant training and test set sizes, no matter what the size of the database is, so that the computational effort is roughly the same for all databases. Strategy S2 uses a constant size training set, but a test set that grows with the size of the database, since for most learning algorithms the computational effort of computing the estimate mainly grows with the training set size, and bigger test sets reduce the variance of the estimates. Finally, strategy S3 utilizes all the data available after removing the test set, with the same test set size as for S2. This strategy is computationally expensive for big databases, but still more than 10 times faster than the full 10-fold crossvalidation procedure.

Strategy   nt     ne
S1         1000   1000
S2         1000   n/4
S3         3n/4   n/4

Table 1: Training and test set sizes used for the holdout error estimates
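For illustration, the following sketch (our own, not code from the report) computes the training and test set sizes that the three strategies would use for a database of n cases, following Table 1; the function name, the integer-division rounding, and the reading of the test set size as n/4 are our assumptions.

```python
# Illustrative sketch: training and test set sizes for strategies S1, S2, S3
# as described in Table 1 (S1: 1000/1000, S2: 1000 and n/4, S3: remaining 3n/4 and n/4).
# Function name and rounding are our own choices, not taken from the report.

def holdout_sizes(n, strategy):
    """Return (n_train, n_test) for a database with n cases."""
    quarter = n // 4
    if strategy == "S1":
        return 1000, 1000
    if strategy == "S2":
        return 1000, quarter
    if strategy == "S3":
        return n - quarter, quarter
    raise ValueError("unknown strategy: %s" % strategy)

if __name__ == "__main__":
    for s in ("S1", "S2", "S3"):
        print(s, holdout_sizes(20000, s))  # e.g. the letter database has n = 20000
```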

In order to evaluate these strategies, we have compared them to the following set of alternative strategies:

- SLk: always pick algorithm Lk
- S0: pick one of the algorithms at random
- SXV+: always pick the best algorithm as determined by a stratified 10-fold crossvalidation error estimate eXV for all algorithms on the full database
- SXV-: always pick the worst algorithm as determined by a stratified 10-fold crossvalidation error estimate for all algorithms on the full database

Strategy SXV+ serves as the closest approximation of the ideal strategy based on true i.i.d. error. S0 represents a strategy where nothing is known about the relative performance of the algorithms, and SLk represents a strategy where a specific algorithm is assumed a priori to perform best. Finally, SXV- represents the worst strategy we can possibly use. Note that the purpose of the fast subsampling estimates in strategies S1, S2, and S3 is not to give estimates of the true i.i.d. error that are as good as possible, but to mimic as closely as possible the decisions we would reach by using strategy SXV+.

Name       Type
c50boost   boosted decision trees, a component of the C5.0 program (see http://www.rulequest.com)
c50rules   ordered rule list, a component of C5.0
c50tree    decision tree, a component of C5.0
lindiscr   linear discriminant classifier
ltree      linear tree [Gama & Brazdil 1999]
mlcib1     nearest-neighbor (IB1) implementation in MLC++ [Kohavi et al. 1996]
mlcnb      naive Bayes implementation in MLC++
ripper     ordered rule list [Cohen 1995]

Table 2: Learning algorithms used in the experiments


name                 n       attr  symb  num  classes  default acc.
abalone              4177    9     1     7    29       0.165
allbp                3772    28    21    6    3        0.957
allhyper             3772    28    21    6    5        0.973
allhypo              3772    28    21    6    5        0.923
allrep               3772    28    21    6    4        0.967
ann                  7200    22    15    6    3        0.926
byzantine            17750   61    0     60   71       0.014
dis                  3772    29    22    6    2        0.985
hypo                 3772    28    21    6    5        0.923
image                2310    20    0     19   7        0.143
kr-vs-kp             3196    37    36    0    2        0.522
krkopt               28056   7     6     0    18       0.162
letter               20000   17    0     16   26       0.041
musk                 6598    167   0     166  2        0.846
nettalk              146934  16    15    0    52       0.148
nursery              12960   9     8     0    5        0.333
optical              5620    65    0     64   10       0.102
page-blocks          5473    11    0     10   5        0.898
pyrimidines          6996    55    0     54   2        0.500
quadruped            5000    73    0     72   4        0.259
quisclas             5891    19    0     18   3        0.426
recljan2jun97        33170   20    2     17   2        0.940
sat                  6435    37    0     36   7        0.238
segmentation         2310    20    0     19   7        0.143
shuttle              58000   10    0     9    7        0.786
sick-euthyroid       3163    26    18    7    2        0.907
sick                 3772    29    22    6    2        0.939
splice               3190    61    60    0    3        0.519
taska-part-hhold     17267   52    33    18   3        0.621
taska-part-related   18254   57    38    18   3        0.611
taskb-hhold          12934   44    30    13   3        0.433
titanic              2201    4     3     0    2        0.677
triazines            52264   121   0     120  2        0.500
waveform21           5000    22    0     21   3        0.339
waveform40           5000    41    0     40   3        0.338

Table 3: The 35 datasets used in the experiments


3 Learning Curves In order to evaluate the performance of the various selection strategies, a number of experiments were performed using a set of 35 real-world databases with at least 2000 cases in combination with 8 learning algorithms. The list of the databases used is shown in table 3 which also shows the number of cases in each database, the total number of attributes, the number of symbolic and numeric attributes, and the baseline accuracy (the relative frequency of the most frequent class). The list of algorithms is shown in table 2. These databases and algorithms are both also used in the metalearning-related project “METAL”1 . For a first impression on how the subsampling estimates relate to the crossvalidation estimates, we created the learning curves of all algorithms obtained for varying amounts of training data and a test set containing 25% of the data. In these experiments, the remaining 75% of the data were repeatedly divided by a factor of 1.2 as long as the training set size remained larger than 1000 cases. The factor 1.2 is a good compromise between evenly spaced increments that will not show the details for small training sets when the complete data base is large, and repeatedly taking half of the data which only yields a small number of measures for small data bases. Some illustrating examples of the whole set of 35 learning curves are shown in figures 1 to 6. The short segments on the right side of each figure show the full crossvalidation estimates for the algorithms. The x-axis shows the sizes of the training (sizes given for the crossvalidation part of the figures are meaningless). The figures show that often the holdout estimates for even small training sets can roughly indicate which algorithms have a low crossvalidation estimate, even though the actual values are highly biased. Figure 8 shows a good example for a database where it is possible to select the best algorithm based on a small subsample even though the estimates are very biased for all algorithms. Figure 6 shows that sometimes the rate of how fast the holdout error estimates converge towards the final values can be rather different for different algorithms. In this case the estimate for mlcnb converges much faster than the estimate for mlcib1, making it difficult to find the correct relative performance of these algorithms with training sets smaller than 20000 cases. Luckily, in this case, the distinction between these two algorithms does not matter in order to pick the best (c50boost) which could be done based on the smallest training set sample of about 1000 cases.
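A minimal sketch (our own illustration, not code from the report) of the training set size schedule described above: starting from 75% of the data, the candidate training set size is repeatedly divided by 1.2 as long as it stays above 1000 cases. The function name, rounding, and exact stopping rule are our interpretation of the text.

```python
# Illustrative sketch of the training-set-size schedule described in the text:
# 25% of the data is held out as test set, and the remaining 75% is repeatedly
# divided by 1.2 as long as the training set size remains larger than 1000 cases.

def training_set_sizes(n_cases, test_fraction=0.25, factor=1.2, min_size=1000):
    """Return the list of training set sizes, from largest to smallest."""
    size = int(n_cases * (1.0 - test_fraction))
    sizes = []
    while size > min_size:
        sizes.append(size)
        size = int(size / factor)
    return sizes

if __name__ == "__main__":
    # e.g. for the letter database with 20000 cases
    print(training_set_sizes(20000))
```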

4 Qualitative evaluation

From the preliminary impressions gained from these learning curves, we decided to evaluate the three strategies as defined in the previous section: S2 corresponds roughly to the situation at the beginning of the learning curves (though the actual values in the learning curves differ due to the way training set sizes were calculated), S3 corresponds to the end of the learning curves, and S1 represents a strategy with a fixed-size test set. We conducted experiments to obtain the error estimates and the CPU times for selection strategies S1, S2, and S3.

[1] Esprit Long-Term Research Project "A Meta-Learning Assistant for Providing User Support in Data Mining and Machine Learning", project number 26357.


An obvious way of comparing the selection strategies is to find out how often each of them suggests exactly the same algorithm as strategy SXV+. Table 4 shows how often the algorithm selected by the subsampling-based strategies S1, S2, or S3 and the alternative strategy S0 is identical to the algorithm selected by the target strategy SXV+. The number of matches is also shown for strategy Sc50boost of always using algorithm c50boost, the algorithm which has the lowest crossvalidation error estimate on more of the 35 databases than any other algorithm. The expected rate of success for S0 (picking one of the 8 algorithms at random) is always 0.12. Strategy SXV- will never match the target strategy SXV+ and is therefore not shown. Since the primary use of fast selection strategies is with large databases, the number of matching decisions is also given for subsets of databases with different minimal numbers of cases.

Size      Nr.DBs   S1          S2          S3          Sc50boost   S0 (expected)
all       35       22 (0.63)   26 (0.74)   26 (0.74)   19 (0.54)   4.4 (0.12)
>= 5000   21       11 (0.52)   18 (0.86)   17 (0.81)   10 (0.48)   2.6 (0.12)
>= 10000  11       6 (0.55)    10 (0.91)   11 (1.00)   5 (0.45)    1.4 (0.12)
>= 15000  9        6 (0.67)    8 (0.89)    9 (1.00)    4 (0.44)    1.1 (0.12)

Table 4: Rates of predicting the best learning algorithm

On the set of all databases, strategies S2 and S3 perform better than S1, which was to be expected, since S1 uses constant-size training and test sets. However, S1 is still better than always picking the algorithm known to be the best most of the time on these datasets. The performance of all subsampling strategies increases with growing database sizes; strategy S3 suggests exactly the same algorithms as SXV+ for all databases with at least 10000 cases. It is interesting that S2 also improves quite a lot with bigger datasets, although its training set always contains only 1000 cases. For bigger databases, all subsampling strategies perform better than Sc50boost.
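The match rates in tables 4 to 7 can be computed along the following lines; this is our own sketch, and the function names and the dictionary-based data layout are assumptions, not part of the report.

```python
# Illustrative sketch of how the match rates in Tables 4-7 can be computed:
# for each database, the algorithm with the lowest fast holdout estimate is
# compared with the algorithm chosen by the target strategy S_XV+ (lowest
# 10-fold crossvalidation error estimate).

def selected(estimates):
    """Return the algorithm name with the lowest error estimate."""
    return min(estimates, key=estimates.get)

def match_rate(holdout_estimates, xv_estimates):
    """Fraction of databases on which the fast strategy picks the same algorithm
    as the crossvalidation-based target strategy. Both arguments map
    database name -> {algorithm name -> error estimate}."""
    matches = sum(
        1 for db in xv_estimates
        if selected(holdout_estimates[db]) == selected(xv_estimates[db])
    )
    return matches / len(xv_estimates)
```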

Picking exactly the algorithm with the lowest eXV may not be necessary, though: if the difference between the error estimates eXV of two algorithms is not statistically significant, we can argue that it does not matter which of the two algorithms gets selected. We therefore evaluated the performance of the subsampling-based selection strategies at picking one of those learning algorithms whose error estimates are not statistically significantly different from that of the best one. The McNemar test for equality of binomial distributions with a significance level of alpha = 0.05 was used to define statistical significance [Feelders & Verkooijen 1995]. The results are shown in table 5.
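As a reference point, the following is a minimal sketch of a McNemar test on the paired predictions of two classifiers. The report only cites [Feelders & Verkooijen 1995] for the test, so the chi-squared formulation with continuity correction used here is an assumption about the variant, not a statement of what the authors implemented.

```python
# Minimal sketch of a McNemar test on paired classifier predictions, using the
# common chi-squared statistic with continuity correction. The exact variant
# used in the report may differ; treat this as an assumption.

def mcnemar_statistic(correct_a, correct_b):
    """correct_a, correct_b: lists of booleans, one per test case, indicating
    whether classifier A (resp. B) classified that case correctly."""
    n01 = sum(1 for a, b in zip(correct_a, correct_b) if not a and b)
    n10 = sum(1 for a, b in zip(correct_a, correct_b) if a and not b)
    if n01 + n10 == 0:
        return 0.0
    return (abs(n01 - n10) - 1) ** 2 / (n01 + n10)

def not_significantly_different(correct_a, correct_b, critical=3.84):
    # 3.84 is the chi-squared critical value for 1 degree of freedom at alpha = 0.05
    return mcnemar_statistic(correct_a, correct_b) < critical
```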

Size      Nr.DBs   S1          S2          S3          Sc50boost
all       35       24 (0.69)   26 (0.74)   30 (0.86)   21 (0.60)
>= 5000   21       14 (0.67)   18 (0.86)   20 (0.95)   12 (0.57)
>= 10000  11       6 (0.55)    10 (0.91)   11 (1.00)   5 (0.45)
>= 15000  9        6 (0.67)    8 (0.89)    9 (1.00)    4 (0.44)

Table 5: Rates of predicting one of the learning algorithms not distinguishable from the best one

This analysis shows that for smaller databases, the subsampling strategies will sometimes pick an algorithm that is not the one with the smallest crossvalidation estimate, but one whose estimate cannot be regarded as significantly different from it either. One such case is the difference between algorithms ltree and c50tree for the dataset allrep: as can be seen in figure 1, the strategy based on the smallest subsampling estimate will pick algorithm ltree, while the smallest crossvalidation estimate is the one for algorithm c50tree. However, the distributions of the predictions of algorithms ltree and c50tree in the full crossvalidation are not significantly different at the 5% level, so both selections can be regarded as equally good.

For the whole set of datasets, even strategy S1 shows success rates a bit higher than the default strategy of always using algorithm c50boost, both for matching the best algorithm and for matching one of the algorithms in the best group. Since algorithm c50boost performed better than the other algorithms on more than half of all the databases, and since it is the only boosted algorithm, it is interesting to see how the qualitative evaluation changes when this algorithm is excluded. Tables 6 and 7 show the success rates in that case for matching the best algorithm and for matching one of the algorithms in the best group, respectively.

Size      Nr.DBs   S1          S2          S3          Sc50tree    S0 (expected)
all       35       13 (0.37)   20 (0.57)   22 (0.63)   10 (0.29)   5.0 (0.14)
>= 5000   21       11 (0.52)   13 (0.62)   19 (0.90)   3 (0.14)    3.0 (0.14)
>= 10000  11       7 (0.64)    8 (0.73)    11 (1.00)   1 (0.09)    1.6 (0.14)
>= 15000  9        7 (0.78)    7 (0.78)    9 (1.00)    1 (0.11)    1.3 (0.14)

Table 6: Rates of predicting the best learning algorithm (without c50boost)

Size      Nr.DBs   S1          S2          S3          Sc50tree
all       35       15 (0.43)   20 (0.57)   27 (0.77)   14 (0.40)
>= 5000   21       12 (0.57)   13 (0.62)   20 (0.95)   4 (0.19)
>= 10000  11       7 (0.64)    8 (0.73)    11 (1.00)   1 (0.09)
>= 15000  9        7 (0.78)    7 (0.78)    9 (1.00)    1 (0.11)

Table 7: Rates of predicting one of the learning algorithms not distinguishable from the best one (without c50boost)

Success rates for the full set of databases are again better than the default strategy of always picking the best overall algorithm, which is now c50tree[2]. With increasing database sizes, the success rates increase to about the same level as for the whole set of algorithms.

[2] Note, however, that in practice an overall best algorithm can only be known after performing all the computation-intensive crossvalidation error estimation procedures. It only makes sense to use this strategy as a "base line" in the comparison if we expect it to be the best learning algorithm for new databases with about the same probability. This of course cannot be the case in general [Wolpert 1996], and it is unclear to what extent it should be expected in practical situations.

5 Quantitative evaluation

Picking an algorithm that is not in the best group might not be that bad a decision after all: what we are really interested in is the increase in error we have to accept when using a simple but fast strategy. So instead of just counting the number of correct matches, it seems more reasonable to quantitatively assess the performance of the different selection strategies by comparing the increase or decrease of accuracy: picking learning algorithm Lj instead of Lk as suggested by SXV+ means that we have to expect the lower accuracy aXV(Lj) = 1 - eXV(Lj) instead of aXV(Lk) = 1 - eXV(Lk), i.e. we expect the accuracy to change by the factor fj,k = aXV(Lk)/aXV(Lj).
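The accuracy-change factor defined above can be computed directly from the two crossvalidation error estimates; the short sketch below is our own illustration (the function name and the example values are not from the report).

```python
# Illustrative sketch of the accuracy-change factor defined in the text:
# f_{j,k} = a_XV(L_k) / a_XV(L_j), with a_XV(L) = 1 - e_XV(L), where L_j is the
# algorithm actually picked by a fast strategy and L_k is the algorithm
# suggested by the target strategy S_XV+.

def accuracy_change_factor(e_xv_picked, e_xv_target):
    """Both arguments are crossvalidation error estimates in [0, 1]."""
    return (1.0 - e_xv_target) / (1.0 - e_xv_picked)

# Hypothetical example: if the S_XV+ algorithm has e_XV = 0.05 and the fast
# strategy picks an algorithm with e_XV = 0.08, the factor is 0.95/0.92,
# i.e. the S_XV+ choice is about 3.3% relatively more accurate.
print(accuracy_change_factor(0.08, 0.05))
```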

             S1      S2      S3
SXV+         0.996   0.998   0.999
S0           1.052   1.055   1.055
SXV-         1.204   1.208   1.209
Slindiscr    1.197   1.200   1.202
Smlcnb       1.180   1.184   1.184
Sltree       1.023   1.026   1.027
Sc50rules    1.027   1.030   1.031
Sc50boost    1.012   1.015   1.016
Sc50tree     1.034   1.037   1.038
Smlcib1      1.050   1.054   1.054
Sripper      1.063   1.066   1.067

Table 8: Average change of accuracy

Table 8 shows the increase or decrease of accuracy, averaged over all databases, of selection strategies S1, S2, or S3, compared with the alternative selection strategies SXV+, S0, SXV-, and SLk. The closer the ratio shown for the comparison with the target SXV+ is to 1.0, the better (1.0 would indicate no loss in accuracy on any dataset). For the comparison with all other strategies, ratios further above 1.0 are better. Interestingly, strategy S3 is not substantially better than S2: the bigger test sets used for S2 seem to cause the improvement of S2 over S1, while additional training data does not do much to improve the quality of the suggestions with regard to the average change of accuracy.

             SXV+    S0      SXV-
Slindiscr    0.753   0.796   0.911
Smlcnb       0.873   0.923   1.057
Sltree       0.948   1.002   1.147
Sc50rules    0.916   0.968   1.108
Sc50boost    0.959   1.013   1.160
Sc50tree     0.970   1.025   1.173
Smlcib1      0.928   0.981   1.123
Sripper      0.869   0.918   1.052
S0           0.947   -       1.145

Table 9: Average change of accuracy relative to the best, average, or worst strategy when always selecting a fixed algorithm, and when selecting a random algorithm

Excluding algorithm c50boost from the analysis does not change the picture much: in that case, instead of c50boost, algorithm ltree is the algorithm whose overall performance is closest to that of the algorithms picked by the subsampling strategies. Table 9 shows the average decrease of accuracy in comparison with selection strategy SXV+ when always picking a single fixed algorithm, or when picking a random algorithm each time.

All of the subsampling strategies perform better than picking the algorithm known to be the best one overall. The improvement over the strategy of picking a random algorithm (S0) is even higher. Again, it is interesting that the improvement over S0 is very similar for strategies S2 and S3, while there is a bigger difference to S1. This is an indication that, for many databases in the set used for these experiments, a training sample of 1000 cases was quite sufficient, while the quality of the single holdout estimate improved with bigger test set sizes.

6 CPU time comparison

A rough comparison of the CPU times necessary to obtain the estimates for strategies S1, S2, and S3 with the effort for SXV+ is shown in table 10. For each dataset, the total CPU time needed to calculate all estimates for a strategy was measured. We then calculated the ratio between the total time for SXV+ and each of the total times for S1, S2, and S3, i.e. the factor by which the subsampling-based selection strategies are faster than the crossvalidation-based strategy for each database. Since the big ratios from a few large databases have a large impact on the average, table 10 shows the median value in addition to the mean value for each of the three sets of ratios.
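The summary in table 10 amounts to a per-database ratio followed by mean and median; the sketch below is our own illustration of that computation (the function name and data layout are assumptions).

```python
# Illustrative sketch of the speedup ratios summarized in Table 10: per database,
# the ratio of the total CPU time for the crossvalidation-based strategy S_XV+
# to the total CPU time for a subsampling strategy, summarized by mean and median
# (the median is less affected by a few very large databases).

from statistics import mean, median

def speedup_summary(xv_times, strategy_times):
    """Both arguments map database name -> total CPU time in seconds."""
    ratios = [xv_times[db] / strategy_times[db] for db in xv_times]
    return mean(ratios), median(ratios)
```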

      mean    median
S1    499.7   53.9
S2    131.2   45.3
S3    11.4    10.8

Table 10: Mean and median ratios of CPU times when comparing with SXV+

7 Conclusion

The findings of this analysis support what is common folklore among data mining and machine learning practitioners: simple subsampling estimates can be used quite well for picking a good algorithm. This is especially true for big databases, where a quick and simple procedure is most needed in the first place. Surprisingly, having a test set that grows with the size of the database seemed to be more important in our experiments than having a growing training set. We believe that the evaluation of such simple strategies for algorithm selection can serve as a reasonable base-line against which to compare other strategies, like metalearning. We are also interested in studying more advanced sampling strategies that dynamically adapt the sample size to the actual database for which an algorithm has to be selected, so that a good compromise is achieved between the quality of the suggestion and the amount of computational resources needed.

Empirical studies of this kind face a paradoxical problem, however: while a commonly cited reason for the pressing need for effective and efficient data mining algorithms is the growing number of huge databases, the data mining research community almost never gets to see those databases. Most databases available for empirical studies are ridiculously small. Unless a number of realistic and big databases become publicly available, the only way to fill the gap seems to be the use of artificially generated databases. Therefore we are currently studying the usability of such artificial data for empirical studies similar to the one presented in this paper.


8 Acknowledgments

This work was funded as part of ESPRIT long-term project METAL 26357. The Austrian Research Institute for Artificial Intelligence is supported by the Austrian Federal Ministry of Education, Science and Culture.

References

[Bailey & Elkan 1993] Bailey T.L., Elkan C.: Estimating the Accuracy of Learned Concepts, in Bajcsy R. (ed.), Proceedings of the 13th International Joint Conference on Artificial Intelligence, Morgan Kaufmann, Los Altos/Palo Alto/San Francisco, pp. 895-901, 1993.

[Brazdil & Soares 2000] Brazdil P., Soares C.: A Comparison of Ranking Methods for Classification Algorithm Selection. Proceedings of ECML-2000, 2000.

[Cohen 1995] Cohen W.W.: Fast Effective Rule Induction, in Prieditis A. & Russell S. (eds.), Proceedings of the 12th International Conference on Machine Learning (ICML'95), Morgan Kaufmann, Los Altos/Palo Alto/San Francisco, pp. 115-123, 1995.

[Dietterich 1996] Dietterich T.G.: Statistical Tests for Comparing Supervised Classification Learning Algorithms, Department of Computer and Information Science, University of Oregon, 1996.

[Feelders & Verkooijen 1995] Feelders A., Verkooijen W.: Which Method Learns Most from the Data? Proceedings of the 5th International Workshop on Artificial Intelligence and Statistics, Fort Lauderdale, Florida, pp. 219-225, January 1995.

[Gama & Brazdil 1999] Gama J., Brazdil P.: Linear Tree, Intelligent Data Analysis, 3(1), pp. 1-22, 1999.

[Hilario & Kalousis 1999] Hilario M., Kalousis A.: Building Algorithm Profiles for Prior Model Selection in Knowledge Discovery Systems. Proceedings of the IEEE International Conference on Systems, Man, and Cybernetics, 1999.

[Kohavi 1995] Kohavi R.: A Study of Cross-Validation and Bootstrap for Accuracy Estimation and Model Selection, in Mellish C.S. (ed.), Proceedings of the 14th International Joint Conference on Artificial Intelligence, Morgan Kaufmann, Los Altos/Palo Alto/San Francisco, pp. 1137-1143, 1995.

[Kohavi et al. 1996] Kohavi R., Sommerfield D., Dougherty J.: Data Mining Using MLC++, Data Mining and Visualization, Silicon Graphics, Inc., CA, 1996.

[Lim 2000] Lim T.-S., Loh W.Y., Shih Y.S.: A Comparison of Prediction Accuracy, Complexity, and Training Time of Thirty-three Old and New Classification Algorithms. Machine Learning, to appear.

[Lindner & Studer 1999] Lindner G., Studer R.: AST: Support for Algorithm Selection Using a CBR Approach. Proceedings of the Third European Conference on Principles and Practice of Knowledge Discovery in Databases (PKDD'99), 1999.

[Michie et al. 1994] Michie D., Spiegelhalter D.J., Taylor C.C. (eds.): Machine Learning, Neural and Statistical Classification, Ellis Horwood, Chichester/New York, 1994.

[Rasmussen et al. 1996] Rasmussen C.E., Neal R.M., Hinton G.E., van Camp D., Revow M., Ghahramani Z., Kustra R., Tibshirani R.: The DELVE Manual, University of Toronto, Canada, 1996.

[Schaffer 1993] Schaffer C.: Technical Note: Selecting a Classification Method by Cross-Validation, Machine Learning, 13(1), pp. 135-143, 1993.

[Wolpert 1996] Wolpert D.H.: The Lack of A Priori Distinctions Between Learning Algorithms. Neural Computation, 8:1381-1390, 1996.

[Figure 1: Sample learning curves obtained with single holdout estimates (left part of each graph) in comparison with the full-size crossvalidation estimates (right part) for database allrep. Each figure plots the error of c50boost, c50rules, c50tree, lindiscr, ltree, mlcib1, mlcnb, and ripper against the training set size.]

[Figure 2: Sample learning curves for database byzantine.]

[Figure 3: Sample learning curves for database image.]

[Figure 4: Sample learning curves for database krkopt.]

[Figure 5: Sample learning curves for database letter.]

[Figure 6: Sample learning curves for database nettalk.]
