Minimal Attribute Space Bias for Attribute Reduction

Fan Min, Xianghui Du, Hang Qiu, and Qihe Liu

School of Computer Science and Engineering, University of Electronic Science and Technology of China, Chengdu 610054, China
{minfan, xianghd, qiuhang, qiheliu}@uestc.edu.cn
Abstract. Attribute reduction is an important inductive learning issue addressed by the Rough Sets society. Most existing works on this issue use the minimal attribute bias, i.e., they search for reducts with the minimal number of attributes. But this bias does not work well for datasets in which different attributes have domains of different sizes. In this paper, we propose a more reasonable strategy called the minimal attribute space bias, i.e., searching for reducts with the minimal product of attribute domain sizes. In most cases, this bias helps to obtain reduced decision tables with the best space coverage, and it is thus helpful for obtaining small rule sets with good predicting performance. Empirical study on some datasets validates our analysis.

Keywords. Attribute reduction, bias, space coverage, rule set.
1 Introduction
Practical machine learning algorithms are known to degrade in performance (prediction accuracy) when faced with many attributes that are not necessary for rule discovery [1]. It is therefore not surprising that much research has been carried out on attribute reduction [2], particularly by people in the Rough Sets society. A reduct is a subset of attributes that is jointly sufficient and individually necessary for preserving the same information (in terms of the positive region [3], the class distribution [4], among others) as that provided by the entire set of attributes [5].

A commonly used reduct selection/construction strategy, called the minimal attribute bias [1][3][6], is to search for a reduct with the minimal number of attributes. In some cases, especially when different attributes have approximately the same domain size, this bias may be helpful for obtaining small rule sets with good performance. However, for real-world data where attribute domain sizes vary, this strategy is unfair, since attributes with larger domains tend to have better discernibility or other significance measures [7], and it has severe implications when applied blindly without regard for the resulting induced concept [1].

To cope with these problems, in this paper we propose a new bias, called the minimal attribute space bias, which is intended to minimize the attribute space. We argue that this bias is more reasonable, and thus more helpful for obtaining small rule sets, than the minimal attribute bias. Empirical study on some datasets from the UCI library [8] validates our analysis.
2 Preliminaries
In this section we enumerate the basic concepts introduced by Pawlak [3] through an example. Formally, a decision table is a triple S = (U, C, {d}), where d ∉ C is the decision attribute and the elements of C are called conditional attributes, or simply conditions. Table 1 lists a decision table where U = {t1, . . . , t8}, C = {Shape, Material, Weight, Color} and d = Size.

Table 1. An exemplary decision table

Toy   Shape      Material   Weight   Color    Size
t1    round      wood       light    red      small
t2    round      plastic    heavy    black    small
t3    round      wood       heavy    white    large
t4    round      wood       light    white    small
t5    triangle   wood       light    green    small
t6    triangle   plastic    heavy    blue     large
t7    triangle   plastic    light    pink     large
t8    triangle   plastic    heavy    yellow   large
Any ∅ ≠ B ⊆ C ∪ {d} determines a binary relation I(B) on U, which will be called an indiscernibility relation, and is defined as follows:

    I(B) = {(x_i, x_j) ∈ U × U | ∀a ∈ B, a(x_i) = a(x_j)},    (1)
where a(x) denotes the value of attribute a for element x. The partition determined by B is denoted by U/I(B), or simply by U/B. Let B(X) denote the B-lower approximation of X; the positive region of {d} with respect to B ⊆ C is defined as POS_B({d}) = ⋃_{X ∈ U/{d}} B(X). A reduct is a minimal subset of attributes that enables the same classification of the elements of the universe as the whole set of attributes. This can be formally defined as follows:

Definition 1. Any B ⊆ C is called a decision relative reduct of S = (U, C, {d}) iff:
1. POS_B({d}) = POS_C({d}), and
2. ∀a ∈ B, POS_{B−{a}}({d}) ⊂ POS_C({d}).

A decision relative reduct can simply be called a relative reduct, or a reduct for brevity, if the decision attribute is obvious from the context. According to Definition 1, the exemplary decision table has two reducts: R1 = {Shape, Material, Weight} and R2 = {Weight, Color}.
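The construction above can be made concrete with a short sketch (not part of the original paper; helper names such as `partition` and `is_reduct` are illustrative). It builds Table 1 as a Python dictionary, computes U/B and POS_B({d}), and checks Definition 1 for every candidate attribute subset, recovering exactly R1 and R2.

```python
from itertools import combinations

# Table 1, with 'Size' as the decision attribute d.
U = {
    't1': {'Shape': 'round',    'Material': 'wood',    'Weight': 'light', 'Color': 'red',    'Size': 'small'},
    't2': {'Shape': 'round',    'Material': 'plastic', 'Weight': 'heavy', 'Color': 'black',  'Size': 'small'},
    't3': {'Shape': 'round',    'Material': 'wood',    'Weight': 'heavy', 'Color': 'white',  'Size': 'large'},
    't4': {'Shape': 'round',    'Material': 'wood',    'Weight': 'light', 'Color': 'white',  'Size': 'small'},
    't5': {'Shape': 'triangle', 'Material': 'wood',    'Weight': 'light', 'Color': 'green',  'Size': 'small'},
    't6': {'Shape': 'triangle', 'Material': 'plastic', 'Weight': 'heavy', 'Color': 'blue',   'Size': 'large'},
    't7': {'Shape': 'triangle', 'Material': 'plastic', 'Weight': 'light', 'Color': 'pink',   'Size': 'large'},
    't8': {'Shape': 'triangle', 'Material': 'plastic', 'Weight': 'heavy', 'Color': 'yellow', 'Size': 'large'},
}
C = ['Shape', 'Material', 'Weight', 'Color']
d = 'Size'

def partition(attrs):
    """U/I(B): group objects that agree on every attribute in attrs (Equation 1)."""
    blocks = {}
    for x, row in U.items():
        blocks.setdefault(tuple(row[a] for a in attrs), set()).add(x)
    return list(blocks.values())

def lower_approx(attrs, X):
    """B-lower approximation of X: union of the blocks of U/B contained in X."""
    return {x for blk in partition(attrs) if blk <= X for x in blk}

def positive_region(attrs):
    """POS_B({d}): union of the B-lower approximations of the decision classes."""
    return {x for X in partition([d]) for x in lower_approx(attrs, X)}

def is_reduct(B):
    """Definition 1: B preserves POS_C({d}) and no attribute of B is redundant."""
    full = positive_region(C)
    return (positive_region(B) == full and
            all(positive_region([a for a in B if a != b]) != full for b in B))

reducts = [B for r in range(1, len(C) + 1) for B in combinations(C, r) if is_reduct(list(B))]
print(reducts)  # [('Weight', 'Color'), ('Shape', 'Material', 'Weight')]
```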
3 The Minimal Attribute Bias
This bias is described as follows: a reduct R is optimal iff |R| is minimal, where | · | denotes the cardinality of a set. According to this bias, R2 = {Weight, Color} is an optimal reduct of Table 1 because |R1| = 3 and |R2| = 2. For the sake of clarity, we use the term minimal reduct instead of optimal reduct in what follows; a small code sketch of this selection rule is given at the end of this section.

The main advantage of this bias is that it is simple and tends to give short rules. For datasets where different attributes have approximately the same domain size, it may also be helpful for obtaining small rule sets with good predicting performance. However, it also has the following drawbacks:

1. It is unfair to different attributes. For example, in Table 1 the attribute Color is the most important attribute from the viewpoint of discernibility. But this is due to its relatively large domain (7 values versus 2 for the other attributes) rather than its intrinsic importance.
2. There may be too many optimal reducts. For example, the Mushroom dataset [8] (further discussed in Subsection 5.1) has 14 minimal reducts. Some of them perform well in terms of further rule set generation and/or decision tree construction, but others do not. The bias does not indicate a more detailed strategy for choosing among these reducts.
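Stated as code (a hypothetical selection step, reusing the table and the `is_reduct` helper from the sketch in Section 2), the minimal attribute bias simply keeps the reducts of smallest cardinality:

```python
def minimal_reducts(reducts):
    """Minimal attribute bias: keep only the reducts with the fewest attributes."""
    k = min(len(R) for R in reducts)
    return [R for R in reducts if len(R) == k]

print(minimal_reducts(reducts))  # [('Weight', 'Color')] -- R2 wins because |R2| = 2 < |R1| = 3
```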
4 The Minimal Attribute Space Bias
We propose the minimal attribute space bias as follows: a reduct R is optimal iff Π_{a∈R} |V_a| is minimal, where V_a is the domain of attribute a. According to this bias, R1 = {Shape, Material, Weight} is an optimal reduct of Table 1 because Π_{a∈R1} |V_a| = 8 and Π_{a∈R2} |V_a| = 14. We also use the term minimal space reduct instead of optimal reduct.

Remark 1. If V_{a_i} = V_{a_j} for any a_i, a_j ∈ C, then the minimal attribute space bias coincides with the minimal attribute bias.

Now we explain why this bias is more reasonable than the minimal attribute bias using the exemplary decision table. Each object in the table corresponds to a decision rule. For example, t1 corresponds to Shape = round ∧ Material = wood ∧ Weight = light ∧ Color = red ⇒ Size = small. Rules of this type will be called original rules, since no inductive learning algorithm has been introduced. Because no object pairs are indiscernible, there are a total of 8 original rules. On the other hand, the attribute space of the decision table is |V_Shape| × |V_Material| × |V_Weight| × |V_Color| = 2 × 2 × 2 × 7 = 56. Therefore the objects in the decision table cover only 8/56 = 1/7 ≈ 0.143 of the attribute space, and the original rule set may have poor performance in terms of coverage.
As an inductive approach, attribute reduction can reduce the number of attributes; more importantly, it can reduce the attribute space. The attribute space of S(R1) = ({t1, . . . , t8}, R1, {Size}) is |V_Shape| × |V_Material| × |V_Weight| = 2 × 2 × 2 = 8, while this table has two indiscernible object pairs (t1 and t4, as well as t6 and t8), hence only 8 − 2 = 6 original rules can be obtained, giving a space coverage of 6/8 = 0.75. The attribute space of S(R2) = ({t1, . . . , t8}, R2, {Size}) is |V_Weight| × |V_Color| = 2 × 7 = 14, and no indiscernible object pairs exist, hence 8 original rules can be obtained, giving a space coverage of 8/14 ≈ 0.571. From this viewpoint, both reducts have notable generalization ability, and R1 performs better (with a space coverage of 0.75 versus 0.571 for R2).
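The figures above can be reproduced with another short sketch (again illustrative and not from the paper, reusing `U` and `partition` from the Section 2 listing); the domain sizes |V_a| are taken as the number of distinct values observed in the table.

```python
def domain_size(a):
    """|V_a|: number of distinct values of attribute a observed in the table."""
    return len({row[a] for row in U.values()})

def attribute_space(R):
    """Pi_{a in R} |V_a|: size of the attribute space spanned by R."""
    space = 1
    for a in R:
        space *= domain_size(a)
    return space

def space_coverage(R):
    """|U/R| / Pi_{a in R} |V_a|: fraction of the attribute space covered by the objects."""
    return len(partition(R)) / attribute_space(R)

R1, R2 = ['Shape', 'Material', 'Weight'], ['Weight', 'Color']
for R in (R1, R2):
    print(R, len(R), attribute_space(R), round(space_coverage(R), 3))
# R1: 3 attributes, space 2*2*2 = 8,  coverage 6/8  = 0.75
# R2: 2 attributes, space 2*7  = 14,  coverage 8/14 = 0.571
# The minimal attribute bias prefers R2; the minimal attribute space bias prefers R1.
```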
Table 2. Rule sets generated from S(R1) and S(R2)

rule No.   rule                                                     support
r1         Material = wood ∧ Weight = light ⇒ Size = small          3
r2         Shape = triangle ∧ Material = plastic ⇒ Size = large     3
r3         Shape = round ∧ Weight = light ⇒ Size = small            2
r4         Shape = triangle ∧ Weight = heavy ⇒ Size = large         2
r5         Shape = round ∧ Material = plastic ⇒ Size = small        1
r6         Material = wood ∧ Weight = heavy ⇒ Size = large          1
r7         Shape = triangle ∧ Material = wood ⇒ Size = small        1
r8         Material = plastic ∧ Weight = light ⇒ Size = large       1
r9         Color = red ⇒ Size = small                               1
r10        Color = black ⇒ Size = small                             1
r11        Weight = heavy ∧ Color = white ⇒ Size = large            1
r12        Weight = light ∧ Color = white ⇒ Size = small            1
r13        Color = green ⇒ Size = small                             1
r14        Color = blue ⇒ Size = large                              1
r15        Color = pink ⇒ Size = large                              1
r16        Color = yellow ⇒ Size = large                            1
It should be noted that better generalization ability does not ensure smaller rule sets. In fact, using the exhaustive algorithm [9] we obtained 8 rules from each reduced decision table, as listed in Table 2, where the first 8 rules correspond to S(R1). It should be noted further that each rule generated from S(R2) is supported by only 1 object, while among the rules generated from S(R1), 2 rules (r1 and r2) are supported by 3 objects and another 2 rules (r3 and r4) are supported by 2 objects. In other words, R1 is more helpful for generating strong rules. One can observe that in this case both rule sets cover the whole attribute space, while for larger datasets, reducts with smaller attribute spaces often result in smaller rule sets with better space coverage. Formally, the space coverage of S = (U, C, {d}) is defined as

    SC(S) = |U/C| / Π_{a∈C} |V_a|,    (2)

which serves as an important factor for further rule generation / decision tree construction. From the viewpoint of space coverage, the goal of attribute reduction should be to search for a reduct R such that SC((U, R, {d})) is maximal. One approach is to maximize |U/R|, but |U/R| has the upper bound |U| and does not vary much across different reducts, so this approach is of little use. The other approach is to minimize Π_{a∈R} |V_a|, which coincides with the minimal attribute space bias. In most cases, the minimal attribute space results in the maximal space coverage. One might construct a counterexample as follows: a decision table S = (U, C, {d}) and two reducts R1, R2 satisfying |U/R1| > |U/R2|, Π_{a∈R1} |V_a| > Π_{a∈R2} |V_a| and SC((U, R1, {d})) > SC((U, R2, {d})); but this situation is quite unlikely to happen in real data.
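As a sketch of the selection rule implied by this bias (the candidate reducts would in practice come from a reduct construction algorithm; here they are simply passed in, and the helpers from the previous listings are reused), one can pick the candidate with the smallest attribute space, breaking ties by space coverage:

```python
def minimal_space_reduct(candidates):
    """Minimal attribute space bias: smallest Pi|V_a|; ties broken by larger coverage."""
    return min(candidates, key=lambda R: (attribute_space(R), -space_coverage(R)))

print(minimal_space_reduct([R1, R2]))  # ['Shape', 'Material', 'Weight']
```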
5 Experiments and Comparisons
There are very few datasets (e.g., Nursery) with complete coverage of the attribute space. For many datasets, attribute reduction is a very important approach to improving the space coverage. We tested the two reduct selection biases on some datasets from the UCI library [8] using RSES [9] and a software package developed by us called RDK (Rough Developer's Kit).

For some datasets the set of all minimal space reducts coincides with the set of all minimal reducts. These datasets can be classified as follows:

1. datasets with only one reduct, e.g., Zoo, Solar-flare and Monks;
2. datasets in which all conditional attributes have the same domain size, e.g., Letter Recognition and Tic-Tac-Toe; and
3. other datasets, e.g., Bridges.

In what follows, experiments on two datasets are discussed in more detail.

5.1 Experiments on Mushroom
The Mushroom dataset [8] contains 8124 objects and 22 conditional attributes. The domain sizes of the attributes vary from 1 (veil-type; the attribute value UNIVERSAL announced in agaricus-lepiota.names never appears in the dataset) to 12 (gill-color). The dataset has 292 reducts, and all minimal space reducts are also minimal reducts. For each minimal reduct, LEM2 was employed (with the cover parameter set to 1.0) to generate rule sets on the reduced decision tables. Furthermore, we used 5-fold cross validation (CV-5; the rule generation algorithm was also LEM2) to test the performance of those reducts. For all reducts tested, both the coverage and the accuracy of the respective rule sets were 1.0. Other results are listed in Table 3. For this dataset, the number of minimal space reducts is much smaller than the number of minimal reducts. Also, minimal space reducts are more helpful for obtaining smaller rule sets.
Table 3. Experimental results of Mushroom

                          minimal attribute bias   minimal attribute space bias
optimal reducts           14                       2
minimal rule set size     19                       19
maximal rule set size     30                       26
average rule set size     26.5                     22.5
5.2 Experiments on Soybean
The Large Soybean Database [8] contains two parts: the training set with 307 instances and the testing set with 376 instances. There are 35 nominal conditional attributes, with domain sizes varying from 2 to 8, and some of them have missing (unknown) values. The domain size of the decision attribute (18) is rather large. Due to the relatively large number of attributes, when we tried to use the exhaustive algorithm [9] to obtain the set of all reducts, an "out of memory" error was reported. So we used the genetic algorithm on the training set to obtain reducts, with the speed set to low and the number of reducts set to 100. In this way, 100 reducts were obtained, 35 of which were minimal reducts and 9 of which were minimal space reducts. 8 of the 9 minimal space reducts contain 9 attributes, hence they were also minimal reducts. Rule sets were generated by applying the exhaustive algorithm to the reduced decision tables and were then tested on the testing set. Some results are listed in Table 4.
Table 4. Experimental results of Soybean

                 minimal attribute reducts        minimal attribute space reducts
                 minimal   maximal   average      minimal   maximal   average
rule set size    1201      2563      1902         1324      1806      1577
coverage         0.818     0.960     0.930        0.912     0.949     0.926
accuracy         0.544     0.766     0.668        0.626     0.766     0.692
F-measure        0.522     0.740     0.644        0.602     0.740     0.668
Since the minimal attribute reduct set is much larger than the minimal attribute space reduct set, it contains both the "best" and the "worst" reducts. In general, for this dataset the minimal attribute space bias outperforms the minimal attribute bias in terms of average rule set size (325 fewer rules), average accuracy (0.024 higher) and average F-measure (0.024 higher). It is quite interesting that the minimal attribute bias outperforms the new bias in terms of average coverage (by 0.004). Two reducts drew our special attention:
1. The best reduct. It helped to obtain a rule set with a prediction accuracy of 0.766 and an F-measure of 0.740, and it was included in both reduct sets; and
2. The minimal attribute space reduct with 10 attributes. Although not a minimal reduct, it helped to obtain a rule set containing 1789 rules, with a prediction coverage of 0.949, accuracy of 0.658 and F-measure of 0.641. These results are fairly good compared with those of the minimal reducts.
6 Discussion
In this section we discuss these two biases from a broader viewpoint.
[Fig. 1. A typical inductive learning scenario. The figure relates the discretization scheme, the reduct, the rule set, the decision tree and the predicting performance.]
As depicted in Fig. 1, the ultimate goal of inductive learning is to obtain good predicting performance, defined by the coverage, the accuracy, or a combination of both (e.g., the F-measure) on the data. But the predicting performance can be measured only after a rule set has been generated or a decision tree [10] has been constructed (for the sake of simplicity, other approaches such as neural networks or kNN are not included in Fig. 1). According to Occam's Razor, smaller rule sets, or simpler decision trees (with the fewest nodes), are preferred. In order to obtain a small rule set or a simple decision tree, also according to Occam's Razor, the simplest reduct is desired. The key issue is then: what is the metric for evaluating the simplicity of a reduct? The aforementioned biases are two such metrics, of which the new bias seems to be closer to the essence.

Why, then, has the minimal attribute bias worked well in so many applications? In fact, many reduct construction algorithms use the following strategy: "[i]f two attributes have the same performance with respect to the criterion described above then the one having less values is selected" [1], so it is quite possible that a minimal space reduct is constructed even though a minimal reduct is what was asked for. Moreover, even if the minimal reduct obtained is not a minimal space reduct, its attribute space is usually not too large compared with that of a minimal space reduct. In other words, the minimal attribute bias is often a good approximation of the minimal attribute space bias.
7 Conclusions and Further Works
Compared with the minimal attribute bias, the minimal attribute space bias is closer to the goal of constructing simple reducts from the viewpoints of attribute space and attribute space coverage. Moreover, it does not suffer from the fairness problem. Experiments on two datasets showed that the new bias can help to narrow the scope of optimal reducts and, more importantly, can help to obtain better rule sets in terms of accuracy and F-measure. Since the definition of a bias is a quite fundamental issue, much further research on the new bias, e.g., on reduct construction algorithms, is expected in the near future.
Acknowledgement. Fan Min was supported by an information distribution project under grant No. 9140A06060106DZ223 and the Youth Foundation of UESTC. The authors would like to thank Zichun Zhong and Yue Liu for their help with the experiments and the proofreading of this paper.
References

1. Zhong, N., Dong, J.: Using rough sets with heuristics for feature selection. Journal of Intelligent Information Systems 16 (2001) 199–214
2. Pawlak, Z.: Rough sets. International Journal of Computer and Information Sciences 11 (1982) 341–356
3. Pawlak, Z.: Some issues on rough sets. In Peters, J.F., Skowron, A., Grzymała-Busse, J.W., Kostek, B., Świniarski, R.W., Szczuka, M.S., eds.: Transactions on Rough Sets I. LNCS 3100. Springer-Verlag, Berlin Heidelberg (2004) 1–58
4. Zhang, W., Mi, J., Wu, W.: Knowledge reductions in inconsistent information systems. Chinese Journal of Computers 26(1) (2003) 12–18
5. Yao, Y., Yan, Z., Wang, J.: On reduct construction algorithms. In Wang, G., Peters, J.F., Skowron, A., Yao, Y., eds.: RSKT 2006. LNCS 4062, Berlin Heidelberg, Springer-Verlag (2006) 297–304
6. Skowron, A., Rauszer, C.: The discernibility matrices and functions in information systems. In Słowiński, R., ed.: Intelligent Decision Support – Handbook of Applications and Advances of the Rough Sets Theory. Kluwer Academic Publishers, Dordrecht (1992) 331–362
7. Xu, C., Min, F.: Weighted reduction for decision tables. In Wang, L., Jiao, L., Shi, G., Li, X., Liu, J., eds.: FSKD 2006. LNCS 4223, Berlin Heidelberg, Springer-Verlag (2006) 246–255
8. Blake, C.L., Merz, C.J.: UCI repository of machine learning databases, http://www.ics.uci.edu/~mlearn/mlrepository.html (1998)
9. Bazan, J., Szczuka, M.: The RSES homepage, http://alfa.mimuw.edu.pl/~rses (1994–2005)
10. Quinlan, J.R.: Induction of decision trees. Machine Learning 1 (1986) 81–106