Virtual Attribute Subsetting

Michael Horton, Mike Cameron-Jones, Ray Williams
School of Computing, University of Tasmania, TAS 7250, Australia
{Michael.Horton, Michael.CameronJones, R.Williams}@utas.edu.au

Abstract. Attribute subsetting is a meta-classification technique, based on learning multiple base-level classifiers on projections of the training data. In prior work with nearest-neighbour base classifiers, attribute subsetting was modified to learn only one classifier, then to selectively ignore attributes at classification time to generate multiple predictions. In this paper, the approach is generalized to any type of base classifier. This ‘virtual attribute subsetting’ requires a fast subset choice algorithm; one such algorithm is found and described. In tests with three different base classifier types, virtual attribute subsetting is shown to yield some or all of the benefits of standard attribute subsetting while reducing training time and storage requirements.

A. Sattar and B.H. Kang (Eds.): AI 2006, LNAI 4304, pp. 214–223, 2006. © Springer-Verlag Berlin Heidelberg 2006. http://www.springerlink.com/content/wk5r285q58488v3q/

1 Introduction

A large portion of machine learning research covers the field of 'classifier learning'. Given a list of instances with known classes as training data, a classifier learning algorithm creates a classifier which, given an instance of unknown class, attempts to predict that class. More recent classifier research includes 'meta-classification' algorithms, which learn multiple classifiers and combine their output to yield classifications that are often more accurate than those of any of the individual classifiers used. Some common meta-classification techniques are bagging [1], boosting [2] and stacking [3].

Attribute subsetting [4, 5] is another meta-classification technique, based on learning multiple base classifiers on different training sets. Each training set contains all the original training instances, but only a subset of the attributes. In [4], for the specific case of nearest-neighbour base classifiers, one copy of the training data was kept; at classification time, attributes in the instance to classify were selectively ignored.

In this paper, the same approach is generalized: regardless of classifier type, only one base classifier is learnt, although multiple attribute subsets are still generated. At classification time the instance to be classified is copied, with one copy for each subset. The values of the attributes not appearing in the corresponding subset are set to 'unknown', and each copy is passed to the single base classifier. The predictions for the copies are then combined into a final prediction. Because this form of attribute subsetting generates multiple classifications without learning multiple classifiers, it is called 'virtual attribute subsetting'.

This paper describes existing work on standard attribute subsetting and on classification of instances with unknown attribute values, followed by an outline of the virtual attribute subsetting algorithm. The choice of attribute subsets is significant, so four different attribute subset choice algorithms are considered. Results are then reported from tests using virtual attribute subsetting with three different base classifier types: Naïve Bayes, decision trees and rule sets. It is shown to gain some or all of the improved accuracy of standard attribute subsetting while reducing training time and storage requirements; the degree of benefit depends on the base classifier type.

2 Previous Work

The most relevant previous work, that on standard attribute subsetting, will now be discussed, after a description of the related and better-known technique of bagging. Virtual attribute subsetting depends on its base classifier being able to classify instances with missing values, so some methods used to achieve this are also covered.

2.1 Bagging

'Bagging', or 'bootstrap aggregating' [1], is an ensemble classification technique. Multiple training datasets, each with as many instances as the original training data, are built by sampling with replacement. A classifier is then learnt on each training dataset. When a new instance is classified, each classifier makes a prediction and the predictions are combined to give a final classification. (The training sets are known as 'bootstrap' samples, hence 'bootstrap aggregating'.)

2.2 Standard Attribute Subsetting

Attribute subsetting is an ensemble classification technique with some similarities to bagging. Multiple training datasets are generated from the initial training data; each training set contains all the instances from the original data, but some attributes are removed. A classifier is then learnt on each training set. To classify a new instance, predictions from each classifier are combined to give a final classification.

If the training data are stored in a table with each row representing an instance and each column containing attribute values, bagged training samples are built by selecting table rows, while attribute-subsetted training samples are built by selecting table columns. Because of this similarity, the technique has been referred to as 'attribute bagging' [6] and 'feature bagging' [7], terms which are only technically accurate if the attributes are sampled with replacement. Sampling with replacement has been tested [4], but did not significantly affect accuracy.
Attribute subsetting has been effective when the base classifiers are neural networks [8], decision trees [5, 6], nearest-neighbour classifiers [4] or conditional random fields [7].
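The row/column contrast between bagging and attribute subsetting described above can be sketched in a few lines (a toy illustration; the list-of-lists layout and variable names are ours, not from the paper):

```python
import random

rng = random.Random(0)
# Training data: 5 instances (rows) x 4 attributes (columns).
X = [[r * 4 + c for c in range(4)] for r in range(5)]

# Bagging: sample *rows* with replacement; every attribute is kept.
bootstrap = [X[rng.randrange(len(X))] for _ in range(len(X))]

# Attribute subsetting: keep every row, project onto a *column* subset.
subset = sorted(rng.sample(range(4), 3))          # e.g. 3 of the 4 attributes
projected = [[row[c] for c in subset] for row in X]

print(len(bootstrap), len(bootstrap[0]))  # 5 4  (same shape, resampled rows)
print(len(projected), len(projected[0]))  # 5 3  (same rows, fewer columns)
```

Each base classifier of a standard attribute subsetting ensemble is trained on one such column projection.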


2.3 Unknown Attributes in Classification

Sometimes the instance provided to a classifier does not have a known value for every attribute. Many types of classifier can handle such 'missing values', but their means of doing so differ. Three such classifiers are considered here.

Naïve Bayesian classifiers handle missing values easily, by omitting the term for the missing attribute when the probabilities are multiplied [9].

Decision tree classifiers can deal with missing values at classification time in several different ways [10]. In the specific case of C4.5, if a node in the tree depends on an attribute whose value is unknown, all subtrees of that node are checked and their class distributions summed [11].

Many rule induction algorithms can also handle missing values, but some, such as RIPPER, use rules which fail if they need a missing value [12]. Early tests showed that virtual attribute subsetting performs poorly with base classifiers of this type. A more promising rule set inductor is the PART algorithm, which derives its rules from partial C4.5 decision trees [13].
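The Naïve Bayes treatment of missing values can be illustrated concretely: the class score is the product of a prior and one likelihood factor per known attribute, and an unknown attribute simply contributes no factor. All probability values below are invented for illustration:

```python
# Toy conditional probabilities P(attribute = value | class); values invented.
likelihood = {
    ("outlook", "sunny"): {"yes": 0.2, "no": 0.6},
    ("windy", "true"):    {"yes": 0.4, "no": 0.5},
}
prior = {"yes": 0.6, "no": 0.4}

def nb_score(instance, cls):
    """Unnormalized Naive Bayes score; attributes set to None (unknown)
    are skipped, i.e. their term is omitted from the product [9]."""
    score = prior[cls]
    for attr, value in instance.items():
        if value is not None:
            score *= likelihood[(attr, value)][cls]
    return score

full    = {"outlook": "sunny", "windy": "true"}
partial = {"outlook": "sunny", "windy": None}   # 'windy' is unknown

print(nb_score(full, "yes"))     # 0.6 * 0.2 * 0.4 = 0.048
print(nb_score(partial, "yes"))  # 0.6 * 0.2       = 0.12
```

This behaviour is what allows virtual attribute subsetting to reproduce standard attribute subsetting exactly for Naïve Bayes, as verified in Section 5.1.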

3 The Virtual Attribute Subsetting Algorithm

Bay [4] applied attribute subsetting to nearest-neighbour base classifiers using a single stored classifier. The only training step for a nearest-neighbour classifier is reading the training data. At classification time, multiple nearest-neighbour measurements were made, each measuring a subset of attributes. This saved training time and storage space while returning results identical to standard attribute subsetting.

This approach can be generalized to other base classifiers, creating 'virtual attribute subsetting'. At training time, a single base classifier is learnt on all of the training data and multiple attribute subsets are generated. To classify an instance, one copy of the instance is created for each subset; in each copy, the attributes missing from the corresponding subset are set to 'unknown'. The copies are then passed to the base classifier, and its predictions are combined into an overall prediction.

For most base classifiers, virtual attribute subsetting may be less accurate than standard attribute subsetting, but it uses less training time and storage: it needs only the time and space required by the single base classifier and the subsets, and even subset generation can be deferred to classification time if necessary. A side benefit is that virtual attribute subsetting can be applied to an already-learnt classifier, provided that classifier can classify instances with missing attribute values.

Three parameters of a virtual attribute subsetting classifier may be varied: the subsets, the type of base classifier, and the means of combining predictions.

3.1 Subset Choice

There are many ways to select multiple subsets of attributes. Some subsets have been chosen by hand, based on domain knowledge [7, 8]. Pseudorandom subsets have also been used, as in [5], where each subset contained 50% of the attributes. Pseudorandom subsets may be optimized by introspective subset choice: a portion of the training data is set aside for evaluation, and only those subsets which lead to accurate classifiers on the evaluation data are used. Introspective subset choice may be based on learning the optimal proportion of attributes for each dataset [4], or on generating many subsets and using only those which lead to accurate classifiers [6].

Virtual attribute subsetting is intended to be a fast, generic technique, so its subsets should not be based on domain knowledge (which is not generic) or introspective subset choice (which is time-consuming). For this experiment, four types of pseudorandom subset choice were tested: random, classifier balanced, attribute balanced and both balanced. Each subset generation algorithm receives three parameters: a, the number of attributes; s, the number of subsets to generate; and p, the desired proportion of attributes per subset.

3.1.1 Random Subsets
This is the simplest algorithm. It iterates through the a×s attribute/subset pairs, randomly selecting a×s×p of them to include in the subsets.

3.1.2 Classifier Balanced Subsets
This algorithm chooses subsets such that each subset contains close to a×p attributes. Since a×p may not be an integer, it rounds some subset sizes up and some down to bring the overall proportion as close to p as possible.

3.1.3 Attribute Balanced Subsets
This algorithm chooses subsets such that each attribute appears in close to s×p subsets. Since s×p may not be an integer, it rounds some counts up and some down to bring the overall proportion as close to p as possible.

3.1.4 Both Balanced Subsets
Creating subsets that are balanced both ways is slightly more difficult. The algorithm can be pictured by associating a line segment with each attribute, whose length is the number of further times that attribute needs to be added to a subset. It is described in pseudocode below, and the process is illustrated in Figure 1.
Note that this algorithm may create duplicated subsets; a version that prevented duplicate subsets was tested and had an insignificant effect on accuracy.

    Determine the number of times each attribute should appear;
      if achieving the correct proportion requires varied attribute
      counts (some must be rounded up and some rounded down),
      randomly distribute the counts
    For each subset:
        Randomly arrange the attributes in a line
        selectorLength ← number of subsets remaining
        selectorPos ← random(0..selectorLength−1)
        While selectorPos lies adjacent to an attribute:
            Add that attribute to the subset
            selectorPos ← selectorPos + selectorLength
        Reduce the lengths of the chosen attributes by 1
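A possible Python rendering of this pseudocode follows (the function name, the use of round() for the target total, and the random tie-breaking of the count distribution are our assumptions; only the pseudocode above is from the paper):

```python
import random

def both_balanced_subsets(a, s, p, rng=None):
    """Sketch of 'both balanced' subset choice.

    a: number of attributes, s: number of subsets, p: desired proportion.
    Returns a list of s sorted attribute-index lists.
    """
    rng = rng or random.Random()
    # Decide how many times each attribute should appear, rounding some
    # counts up and some down, and distribute the rounding at random.
    total = round(a * s * p)
    base, extra = divmod(total, a)
    counts = [base + 1] * extra + [base] * (a - extra)
    rng.shuffle(counts)
    lengths = dict(enumerate(counts))   # attribute -> remaining appearances

    subsets = []
    for remaining in range(s, 0, -1):
        # Arrange the attributes in random order as contiguous line segments,
        # each segment as long as the attribute's remaining count.
        order = list(range(a))
        rng.shuffle(order)
        line = [attr for attr in order for _ in range(lengths[attr])]
        # Walk a selector of length `remaining` from a random start position;
        # each position it lands on selects the attribute at that position.
        pos = rng.randrange(remaining)
        chosen = set()
        while pos < len(line):
            chosen.add(line[pos])
            pos += remaining
        for attr in chosen:
            lengths[attr] -= 1
        subsets.append(sorted(chosen))
    return subsets

print(both_balanced_subsets(4, 5, 0.7, random.Random(1)))
```

With a = 4, s = 5 and p = 0.7 this yields five subsets whose sizes total round(4×5×0.7) = 14, with each attribute appearing three or four times, as in the 'both balanced' part of Table 1.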

Step 1: Attribute lengths are calculated.
  Attributes: A B C D
Step 2: Attributes are arranged in random order.
  Attributes: C D B A
Step 3: The length of the attribute selector is the number of subsets remaining (in this example, 5 subsets remain). It starts at a random position 0..selectorLength−1 units from the start of the first attribute.
Step 4: Attributes are selected at steps the length of the attribute selector apart (here C, D and A are selected).
Step 5: The attributes that were selected have their lengths reduced by 1.
Steps 2–5 are then repeated until all subsets have been generated.

Fig. 1. Visualization of steps in both balanced attribute subset choice

Sample subsets generated by all four subset choice algorithms are shown in Table 1.

Table 1. Examples of subset generation algorithm output with a = 4, s = 5 and p = 0.7. Each row is a subset over the four attributes (1 = included); 'Ttl' gives row and column totals.

No balancing       Classifier balanced   Attribute balanced   Both balanced
0 1 1 1 |  3       0 1 1 1 |  3          1 0 1 1 |  3         1 0 1 1 |  3
0 1 1 1 |  3       1 1 0 1 |  3          0 1 0 1 |  2         1 1 1 0 |  3
1 1 0 1 |  3       0 1 0 1 |  2          1 1 1 0 |  3         1 1 0 1 |  3
0 1 0 0 |  1       1 1 1 0 |  3          0 0 1 1 |  2         0 1 1 1 |  3
1 1 1 1 |  4       1 1 1 0 |  3          1 1 1 1 |  4         0 0 1 1 |  2
2 5 3 4 | 14       3 5 3 3 | 14          3 3 4 4 | 14         3 3 4 4 | 14

3.2 Base Classifiers

The effect of virtual attribute subsetting will vary depending upon the base classifier used. Three types of base classifier were tested in this experiment: Naïve Bayes, C4.5 decision trees, and PART rule sets.

Making an attribute value unknown makes Naïve Bayes ignore it as if it were not in the training data, so there should be no difference in output between standard and virtual attribute subsetting if both use Naïve Bayes as the base classifier. For C4.5 and PART base classifiers, virtual attribute subsetting is unlikely to match the accuracy of standard attribute subsetting, as the classifiers used for each subset will not be appropriately independent. However, virtual attribute subsetting may still be more accurate than a single base classifier.


3.3 Combining Predictions

There are many ways to combine the predictions of multiple classifiers. The method chosen for virtual attribute subsetting was to sum and normalize the class probability distributions of the base classifiers to give an overall class probability distribution.
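The whole classification step — masking one instance copy per subset, querying the single base classifier, then summing and normalizing the distributions — can be sketched as follows (schematic only; it assumes a base classifier exposed as a function returning a class distribution and treating None as unknown, and the toy classifier below is invented):

```python
def virtual_subset_classify(base_distribution, instance, subsets):
    """Combine predictions from one base classifier over masked instance copies.

    base_distribution(instance) -> dict mapping class -> probability.
    instance: dict mapping attribute -> value; subsets: list of attribute sets.
    """
    combined = {}
    for subset in subsets:
        # Copy the instance, setting attributes outside the subset to unknown.
        masked = {a: (v if a in subset else None) for a, v in instance.items()}
        for cls, prob in base_distribution(masked).items():
            combined[cls] = combined.get(cls, 0.0) + prob
    total = sum(combined.values())
    return {cls: prob / total for cls, prob in combined.items()}

# Toy base classifier: scores depend only on how many values are known,
# purely to exercise the plumbing.
def toy_base(instance):
    known = sum(v is not None for v in instance.values())
    return {"A": 0.5 + 0.1 * known, "B": 0.5 - 0.1 * known}

dist = virtual_subset_classify(
    toy_base,
    {"x": 1, "y": 2, "z": 3},
    [{"x", "y"}, {"y", "z"}, {"x", "z"}],
)
print(dist)  # normalized distribution; probabilities sum to 1
```

Only one classifier is stored; the ensemble exists solely in the masked copies, which is the sense in which the subsetting is 'virtual'.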

4 Method

All experiments were carried out in the Waikato Environment for Knowledge Analysis (WEKA), a generic machine learning environment [14]. WEKA conversions of 31 datasets from the UCI repository [15, 16] were selected for testing; the dataset names are listed in Table 8. Each test of a classifier on a dataset involved 10 repetitions of 10-fold cross-validation.

Both standard and virtual attribute subsetting have three adjustable settings: subset choice algorithm ('no balancing', 'classifier balanced', 'attribute balanced' or 'both balanced'), attribute proportion (a floating-point number from 0.0 to 1.0) and number of subsets (any positive integer). Preliminary tests showed that reasonable default settings are balancing = 'both balanced', attribute proportion = 0.8 and subsets = 10.

The decision tree classifier used was 'J4.8', the WEKA implementation of C4.5. The rule set classifier was the WEKA implementation of PART. All base classifier settings were left at their defaults; for both J4.8 and PART, this meant pruning with a confidence factor of 0.25 and requiring a minimum of two instances per leaf.
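The evaluation protocol can be sketched as a small harness (an illustrative stand-in for WEKA's evaluation code; unlike WEKA it does not stratify folds by class, and all names are ours):

```python
import random
from statistics import mean

def cross_validate(learn, data, folds=10, repetitions=10, seed=0):
    """Estimate accuracy by repeated k-fold cross-validation.

    learn(train) -> classify, where classify(instance) -> predicted class.
    data: list of (instance, true_class) pairs.
    """
    rng = random.Random(seed)
    accuracies = []
    for _ in range(repetitions):
        shuffled = data[:]
        rng.shuffle(shuffled)
        for k in range(folds):
            test = shuffled[k::folds]          # every folds-th instance
            train = [d for i, d in enumerate(shuffled) if i % folds != k]
            classify = learn(train)
            correct = sum(classify(x) == y for x, y in test)
            accuracies.append(correct / len(test))
    return mean(accuracies)

# Trivial majority-class learner, just to exercise the harness.
def majority_learner(train):
    classes = [y for _, y in train]
    majority = max(set(classes), key=classes.count)
    return lambda x: majority

data = [({"f": i}, "pos" if i < 7 else "neg") for i in range(10)]
print(cross_validate(majority_learner, data))  # 0.7
```

In the experiments, `learn` would be a single classifier, a standard attribute subsetting ensemble, or a virtual attribute subsetting wrapper, all evaluated on identical folds.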

5 Results

A virtual attribute subsetting classifier with a given base classifier may be considered to succeed if it is more accurate than a single classifier of the same type and settings: it has then made an accuracy gain for no significant increase in storage space or training time. This is the main comparison made here; standard attribute subsetting results are also shown, as they provide a probable upper bound on accuracy.

5.1 Naïve Bayes

The ability of virtual attribute subsetting to yield exactly the same results as standard attribute subsetting when Naïve Bayes is the base classifier was verified, so the results shown apply to both standard and virtual attribute subsetting. Attribute subsetting only significantly improved the accuracy of a Naïve Bayesian classifier when attributes were balanced (Table 2). The best attribute proportion was 0.9 (Table 3).


Table 2. Win/draw/loss analysis for standard/virtual attribute subsetting with varying subset choice algorithms, compared with a single Naïve Bayesian classifier

               No balancing  Classifier balanced  Attribute balanced  Both balanced
Wins                 16               17                  20                21
Draws                 0                1                   1                 0
Losses               15               13                  10                10
Wins-Losses           1                4                  10                11

Table 3. Win/draw/loss analysis for standard/virtual attribute subsetting with varying attribute proportion, compared with a single Naïve Bayesian classifier

Proportion    0.1  0.2  0.3  0.4  0.5  0.6  0.7  0.8  0.9
Wins            6    9   15   15   17   18   19   21   22
Draws           0    0    0    0    1    0    0    0    0
Losses         25   22   16   16   13   13   12   10    9
Wins-Losses   -19  -13   -1   -1    4    5    7   11   13

5.2 Decision Trees

With J4.8 as the base classifier, standard attribute subsetting performed well with all subset algorithms, but virtual attribute subsetting was only effective when attributes were balanced, and improved further when both attributes and classifiers were balanced (Table 4). Both attribute subsetting classifiers were most accurate with attribute proportions of 0.8 or 0.9 (Table 5). The accuracies achieved with optimal settings are listed in Table 8.

Three additional experiments with J4.8 were undertaken but are not reported in detail. The number of subsets was varied from 2 to 40, with no significant improvement in virtual attribute subsetting once the number of subsets was at least 6. Unpruned decision trees were also tested; virtual attribute subsetting using unpruned J4.8 as the base classifier performed well against a single unpruned J4.8 classifier but poorly against a single pruned J4.8 classifier. Finally, measurement of decision tree size showed that neither method learnt consistently larger trees.

Table 4. Wins-losses for standard/virtual attribute subsetting with varying subset choice algorithms compared with a single J4.8 classifier

                               No balancing  Classifier balanced  Attribute balanced  Both balanced
Standard attribute subsetting        19               22                  23                20
Virtual attribute subsetting          3                3                   8                12

Table 5. Wins-losses for standard/virtual attribute subsetting with varying attribute proportion compared with a single J4.8 classifier

Proportion                     0.1  0.2  0.3  0.4  0.5  0.6  0.7  0.8  0.9
Standard attribute subsetting  -17  -11    1   10   18   20   21   20   23
Virtual attribute subsetting   -25  -25  -17   -9   -5    1    7   12    4


5.3 Rule Learning

Results for the PART algorithm are similar to those for J4.8. Standard attribute subsetting gave good results, while virtual attribute subsetting was only effective when the attributes were balanced; balancing both attributes and classifiers did not cause further improvement (Table 6). Of the attribute proportions tested, attribute subsetting using PART was most accurate with proportions of 0.8 or 0.9 (Table 7). The accuracies achieved with optimal settings are listed in Table 8.

Table 6. Wins-losses for standard/virtual attribute subsetting with varying subset choice algorithms compared with a single PART classifier

                               No balancing  Classifier balanced  Attribute balanced  Both balanced
Standard attribute subsetting        26               23                  29                29
Virtual attribute subsetting          0                4                  15                13

Table 7. Wins-losses for standard/virtual attribute subsetting with varying attribute proportion compared with a single PART classifier

Proportion                     0.1  0.2  0.3  0.4  0.5  0.6  0.7  0.8  0.9
Standard attribute subsetting  -25   -5    3   14   18   22   20   29   28
Virtual attribute subsetting   -29  -23  -17   -9   -5   -3    6   13   15

5.4 Training Time

Since the training for virtual attribute subsetting is limited to building subsets and training one classifier on the original training data, it was expected to have a training time similar to that of a single classifier. The standard attribute subsetting algorithm builds a classifier for each subset. Since the training data for each classifier have some attributes removed, the individual classifiers may take less time to learn than a single classifier (as there are fewer attributes to consider) or more time (as more steps may be needed to build an accurate classifier). To test this, the ratios of attribute subsetting training time to single classifier training time were measured. As expected, virtual attribute subsetting took training time comparable to a single classifier, while standard attribute subsetting needed at least six times longer than both a single classifier and virtual attribute subsetting.

5.5 Accuracy Table

The accuracies returned by the most accurate standard and virtual attribute subsetting classifiers using J4.8 and PART base classifiers are shown in Table 8, along with their win/draw/loss totals.


Table 8. Percentage accuracy over the 31 datasets for J4.8, PART and standard/virtual attribute subsetting using PART and J4.8 base classifiers; wins and losses are based on a simple accuracy comparison and were measured before these figures were truncated for presentation

Dataset          J4.8 single  J4.8 virtual  J4.8 standard  PART single  PART virtual  PART standard
anneal               98.6        98.6 D        98.6 W          98.3        98.3 D        98.5 W
audiology            77.3        77.4 W        79.2 W          79.4        79.3 L        82.2 W
autos                81.8        81.9 W        83.4 W          75.1        75.1 W        81.7 W
balance-scale        77.8        78.3 W        78.2 W          83.2        83.5 W        86.1 W
horse-colic          85.2        85.2 W        85.2 D          84.4        84.7 W        85.4 W
credit-rating        85.6        85.6 W        85.7 W          84.4        84.5 W        86.7 W
german_credit        71.3        71.3 W        72.1 W          70.5        70.7 W        74.6 W
pima_diabetes        74.5        74.8 W        74.7 W          73.4        74.1 W        74.5 W
Glass                67.6        68.0 W        68.8 W          68.7        68.8 W        74.0 W
cleveland-heart      76.9        76.7 L        77.4 W          78.0        78.1 W        81.8 W
hungarian-heart      80.2        80.2 L        80.3 W          81.1        81.3 W        81.9 W
heart-statlog        78.1        78.4 W        78.9 W          77.3        77.3 D        81.2 W
hepatitis            79.2        79.3 W        79.7 W          79.8        79.9 W        82.7 W
hypothyroid          99.5        99.5 D        99.6 W          99.5        99.5 W        99.6 W
ionosphere           89.7        89.8 W        90.6 W          90.8        90.8 D        92.9 W
iris                 94.7        94.7 D        94.7 D          94.2        94.2 D        94.6 W
kr-vs-kp             99.4        99.4 D        99.4 W          99.2        99.2 D        99.4 W
labor                78.6        78.6 D        78.6 D          77.7        77.7 D        79.1 W
lymphography         75.8        75.4 L        75.9 W          76.4        76.4 W        81.0 W
mushroom            100.0       100.0 L       100.0 D         100.0       100.0 D       100.0 D
primary-tumor        41.4        42.0 W        43.1 W          40.9        41.6 W        45.2 W
segment              96.8        96.8 W        97.0 W          96.6        96.6 L        97.9 W
sick                 98.7        98.5 L        98.7 L          98.6        98.6 L        98.6 W
sonar                73.6        73.7 W        74.3 W          77.4        77.4 D        81.2 W
soybean              91.8        91.6 L        92.3 W          91.4        91.4 W        93.7 W
tic-tac-toe          85.6        86.4 W        87.3 W          93.6        93.6 W        95.8 W
vehicle              72.3        72.4 W        73.6 W          72.2        72.6 W        74.5 W
vote                 96.6        96.6 D        96.6 D          96.0        96.0 D        96.6 W
vowel                80.2        80.3 W        82.2 W          77.7        77.7 W        91.6 W
waveform             75.3        76.8 W        78.1 W          77.6        78.9 W        84.2 W
zoo                  92.6        92.6 D        92.6 D          93.4        93.4 D        93.4 D
Win/Draw/Loss                  18/7/6        24/6/1                      18/10/3       29/2/0
Wins-Losses                        12            23                           15           29

6 Conclusions and Further Work

Compared to standard attribute subsetting using Naïve Bayes as the base classifier, virtual attribute subsetting produces the same classifications with reduced training time and storage. Virtual attribute subsetting also frequently increases the accuracy of C4.5 decision trees and PART rule sets for no significant increase in training time.

The subset choice was found to be significant. Balancing attributes consistently improved accuracy; balancing both attributes and subsets usually increased it further. More sophisticated subset choice algorithms may be better still.


Virtual attribute subsetting may also help other base-level classifiers; whether it does so depends upon how they handle unknown attribute values. It may also be possible to combine virtual attribute subsetting with other meta-classification techniques.

Acknowledgements

We would like to thank the authors of WEKA [14], the compilers of the UCI repository [15, 16], and Mr Neville Holmes and the anonymous reviewers for their comments, some of which we were unable to implement due to time constraints. This research was supported by a Tasmanian Graduate Research Scholarship.

References

1. Breiman, L.: Bagging Predictors. Machine Learning 24 (1996) 123-140
2. Freund, Y., Schapire, R.E.: A Short Introduction to Boosting. Journal of the Japanese Society for Artificial Intelligence 14 (1999) 771-780
3. Wolpert, D.H.: Stacked Generalization. Neural Networks 5 (1992) 241-259
4. Bay, S.D.: Combining Nearest Neighbor Classifiers Through Multiple Feature Subsets. In: Proceedings of the 15th International Conference on Machine Learning, Morgan Kaufmann (1998) 37-45
5. Ho, T.K.: The Random Subspace Method for Constructing Decision Forests. IEEE Transactions on Pattern Analysis and Machine Intelligence 20 (1998) 832-844
6. Bryll, R., Gutierrez-Osuna, R., Quek, F.: Attribute Bagging: Improving Accuracy of Classifier Ensembles by Using Random Feature Subsets. Pattern Recognition 36 (2003) 1291-1302
7. Sutton, C., Sindelar, M., McCallum, A.: Feature Bagging: Preventing Weight Undertraining in Structured Discriminative Learning. CIIR Technical Report IR-402 (2005)
8. Cherkauer, K.J.: Human Expert-Level Performance on a Scientific Image Analysis Task by a System Using Combined Artificial Neural Networks. In: Chan, P. (ed.): Working Notes of the AAAI Workshop on Integrating Multiple Learned Models, 13th National Conference on Artificial Intelligence, AAAI Press (1996) 15-21
9. Kohavi, R., Becker, B., Sommerfield, D.: Improving Simple Bayes. In: van Someren, M., Widmer, G. (eds.): 9th European Conference on Machine Learning, Springer-Verlag (1997)
10. Quinlan, J.R.: Unknown Attribute Values in Induction. In: Segre, A.M. (ed.): Proceedings of the 6th International Workshop on Machine Learning, Morgan Kaufmann (1989) 164-168
11. Quinlan, J.R.: Unknown Attribute Values. In: C4.5: Programs for Machine Learning. Morgan Kaufmann (1993) 30
12. Cohen, W.W.: Fast Effective Rule Induction. In: Prieditis, A., Russell, S.J. (eds.): Proceedings of the 12th International Conference on Machine Learning, Morgan Kaufmann (1995) 115-123
13. Frank, E., Witten, I.H.: Generating Accurate Rule Sets Without Global Optimization. In: Proceedings of the 15th International Conference on Machine Learning, Morgan Kaufmann (1998) 144-151
14. Witten, I.H., Frank, E.: Data Mining: Practical Machine Learning Tools and Techniques. 2nd edn. Morgan Kaufmann (2005)
15. Newman, D.J., Hettich, S., Blake, C.L., Merz, C.J.: UCI Repository of Machine Learning Databases. http://www.ics.uci.edu/~mlearn/MLRepository.html (1998)
16. Weka 3 - Data Mining with Open Source Machine Learning Software in Java - Collections of Datasets. http://www.cs.waikato.ac.nz/~ml/weka/index_datasets.html
