Consistency Based Attribute Reduction

Qinghua Hu, Hui Zhao, Zongxia Xie, and Daren Yu

Harbin Institute of Technology, Harbin 150001, P.R. China
[email protected]

Abstract. Rough sets are widely used in feature subset selection and attribute reduction. In most of the existing algorithms, the dependency function is employed to evaluate the quality of a feature subset. In this paper we discuss the disadvantages of using dependency and the problem it causes in forward greedy search algorithms. We introduce the consistency measure to deal with these problems and analyze the relationship between dependency and consistency. It is shown that the consistency measure reflects not only the size of the decision positive region, as dependency does, but also the sample distribution in the boundary region; it can therefore describe the distinguishing power of an attribute set more finely. Based on consistency, we redefine redundancy and reducts of a decision system, and we construct a forward greedy search algorithm to find reducts. Furthermore, we employ cross validation to test the selected features and to remove over-fitting features from a reduct. Experimental results with UCI data show that the proposed algorithm is effective and efficient.

1 Introduction

As the capability of gathering and storing data increases, there are a lot of candidate features in some pattern recognition and machine learning tasks. Applications show that excessive features not only significantly slow down the learning process, but also decrease the generalization power of the learned classifiers. Attribute reduction, also called feature subset selection, is usually employed as a preprocessing step to select part of the features and focus the learning algorithm on the relevant information [1, 3, 4, 5, 7, 8].

In recent years, rough set theory has been widely discussed and used in attribute reduction and feature selection [6, 7, 8, 14, 16, 17]. Reduct is a proper term in rough set methodology. It means a minimal attribute subset with the same approximating power as the whole set [14]. This definition shows that a reduct should contain the least redundant information and should not lose the classification ability of the raw data. Thus the attributes in a reduct should not only be strongly relevant to the learning task, but also be non-redundant with each other. This property of reducts exactly accords with the objective of feature selection. Thereby, the process of searching for reducts, called attribute reduction, is a feature subset selection process.

So far, a series of approaches to search for reducts have been published. Discernibility matrices [11, 14] were introduced to store the features which can distinguish the corresponding pairs of objects, and Boolean operations were then conducted on the matrices to search for all of the reducts. The main problem of this method is space and


time cost: we need a 10^4 × 10^4 matrix if there are 10^4 samples. What's more, it is also time-consuming to search for reducts in the matrix with Boolean operations. With the dependency function, heuristic search algorithms have been constructed [1, 6, 7, 8, 16].

There are some problems in dependency based attribute reduction. The dependency function in rough set approaches is the ratio of the size of the positive region over the sample space. The positive region is the set of samples which can be undoubtedly classified into a certain class according to the existing attributes. From the definition of the dependency function, we can see that it ignores the influence of boundary samples, which may belong to more than one class. However, in classification learning, the boundary samples also exert an influence on the learned results. For example, when learning decision trees with CART or C4.5, the samples in leaf nodes sometimes belong to more than one class [2, 10]. In this case, the nodes are labeled with the class of the majority of samples. The dependency function does not take this kind of sample into account.

What's more, there is another risk in using the dependency function in greedy feature subset search algorithms. In a forward greedy search, we usually start with an empty set of attributes and then add the selected features into the reduct one by one. In the first round, we need to compute the dependency of each single attribute and select the attribute with the greatest dependency value. We find that the greatest dependency of a single attribute is zero in some applications, because no sample can be classified beyond dispute with any single candidate feature. Therefore, according to the criterion that the dependency function should be greater than zero, none of the attributes can be selected, and the feature selection algorithm finds nothing. However, some combinations of the attributes are able to distinguish all of the samples although no single attribute can distinguish any of them. As far as we know, there is no research reporting on this issue so far.

These issues essentially result from the same problem of the dependency function: it completely neglects the boundary samples. In this paper, we introduce a function proposed by Dash and Liu [3], called consistency, to evaluate the significance of attributes. We discuss the relationship between dependency and consistency, and employ the consistency function to construct a greedy search attribute reduction algorithm. The main difference between the two functions lies in considering the boundary samples. Consistency counts not only the positive region, but also the samples of the majority class in the boundary region. Therefore, even if the positive region is empty, we can still compare the distinguishing power of the features according to the sample distribution in the boundary region. Consistency is the ratio of consistent samples; hence it is linear in the number of consistent samples. Therefore it is easy to specify a stopping criterion in a consistency-based algorithm. With numerical experiments, we will show that this specification is necessary for real-world applications.

In the next section, we review the basic concepts of rough sets. We then present the definition and properties of the consistency function, compare the dependency function with consistency, and construct consistency based attribute reduction in section 3. We present the results of experiments in section 4.
Finally, the conclusions are presented in section 5.


2 Basic Concepts on Rough Sets

Rough set theory, which was introduced to deal with imperfect and vague concepts, has attracted a lot of attention from theory and application research areas. Data sets are usually given in the form of tables; we call a data table an information system, formulated as IS = <U, A, V, f>, where U = {x_1, x_2, ..., x_n} is a finite and nonempty set of objects, called the universe, A is the set of attributes characterizing the objects, V is the domain of attribute values and f : U × A → V is the information function. If the attribute set is divided into a condition attribute set C and a decision attribute set D, the information system is also called a decision table. With an arbitrary attribute subset B ⊆ A, there is an indiscernibility relation IND(B):

IND(B) = {<x, y> ∈ U × U | ∀a ∈ B, a(x) = a(y)}.

<x, y> ∈ IND(B) means that objects x and y are indiscernible with respect to the attribute set B. Obviously, an indiscernibility relation is an equivalence relation, which satisfies the properties of reflexivity, symmetry and transitivity. The equivalence class induced by the attributes B is denoted by

[x_i]_B = {x ∈ U | <x_i, x> ∈ IND(B)}.

Equivalence classes generated by B are also called B-elemental granules or B-information granules. The set of elemental granules forms a concept system, which is used to characterize the imperfect concepts in the information system. Given an arbitrary concept X in the information system, two unions of elemental granules are associated with it:

B_*(X) = ∪{[x]_B | [x]_B ⊆ X, x ∈ U},   B^*(X) = ∪{[x]_B | [x]_B ∩ X ≠ ∅, x ∈ U}.

The concept X is approximated with these two sets of elemental granules. B_*(X) and B^*(X) are called the lower and upper approximations of X in terms of the attributes B. B_*(X) is also called the positive region. X is definable if B_*(X) = B^*(X), which means the concept X can be perfectly characterized with the knowledge B; otherwise, X is indefinable. An indefinable set is called a rough set. BND(X) = B^*(X) − B_*(X) is called the boundary of the approximations. For a definable set, the boundary is empty.

Given a decision table DT = <U, C ∪ D, V, f>, C and D generate two partitions of the universe. Machine learning is usually involved in using the condition knowledge to approximate the decision and finding the mapping from conditions to decisions. Approximating U/D with U/C, the positive and boundary regions are defined as:

POS_C(D) = ∪_{X ∈ U/D} C_*(X),   BND_C(D) = ∪_{X ∈ U/D} C^*(X) − ∪_{X ∈ U/D} C_*(X).

The boundary region is the set of elemental granules which cannot be perfectly described by the knowledge C, while the positive region is the set of C-elemental granules which completely belong to one of the decision concepts. The size of the positive or boundary region reflects the approximation power of the condition attributes. Given a decision table, for any B ⊆ C, it is said that the decision attribute set D depends on the condition attributes with degree k, denoted by B ⇒_k D, where

k = γ_B(D) = |POS_B(D)| / |U|.
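As an illustration of these definitions (not part of the paper), the following Python sketch computes the partition induced by IND(B), the positive region POS_B(D) and the dependency γ_B(D) for a small, invented decision table stored as rows of attribute values:

from collections import defaultdict

def partition(table, attrs):
    """Group row indices by their value vector on `attrs` (the blocks of IND(attrs))."""
    blocks = defaultdict(list)
    for i, row in enumerate(table):
        key = tuple(row[a] for a in attrs)
        blocks[key].append(i)
    return list(blocks.values())

def positive_region(table, cond_attrs, dec_attr):
    """Indices of samples whose IND(cond_attrs)-block lies inside one decision class."""
    pos = []
    for block in partition(table, cond_attrs):
        labels = {table[i][dec_attr] for i in block}
        if len(labels) == 1:          # consistent block: subset of one decision class
            pos.extend(block)
    return pos

def dependency(table, cond_attrs, dec_attr):
    """gamma_B(D) = |POS_B(D)| / |U|."""
    return len(positive_region(table, cond_attrs, dec_attr)) / len(table)

# toy decision table: condition attributes at indices 0, 1 and the decision at index 2
U = [(0, 0, 'y'), (0, 1, 'n'), (1, 0, 'y'), (1, 0, 'n'), (1, 1, 'n')]
print(dependency(U, [0, 1], 2))   # rows 2 and 3 share values (1, 0) but differ in class -> 3/5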

The dependency function k measures the approximation power of a condition attribute set with respect to the decision D. In data mining, especially in feature selection, it is important to find the dependence relations between attribute sets and to find a concise and efficient representation of the data.

Given a decision table DT = <U, C ∪ D, V, f>, if P ⊆ Q ⊆ C, we have γ_Q(D) ≥ γ_P(D).

Given a decision table DT = <U, C ∪ D, V, f>, B ⊆ C and a ∈ B, we say that the condition attribute a is indispensable if γ_(B−a)(D) < γ_B(D); otherwise we say a is redundant. We say B ⊆ C is independent if every a in B is indispensable.

Attribute subset B is a reduct of the decision table if
1) γ_B(D) = γ_C(D);
2) ∀a ∈ B: γ_B(D) > γ_(B−a)(D).

A reduct of a decision table is an attribute subset which keeps the approximating capability of all the condition attributes and, at the same time, has no redundant attribute. The term "reduct" presents a concise and complete way to define the objective of feature selection and attribute reduction.
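Building on the sketch above, the two reduct conditions can be checked directly; `dependency` is the hypothetical helper defined earlier, not a routine from the paper:

def is_reduct(table, subset, all_cond, dec_attr):
    """B is a reduct iff gamma_B(D) = gamma_C(D) and every attribute in B is indispensable."""
    if dependency(table, subset, dec_attr) != dependency(table, all_cond, dec_attr):
        return False                              # condition 1: same approximating power
    for a in subset:
        rest = [b for b in subset if b != a]
        if dependency(table, rest, dec_attr) >= dependency(table, subset, dec_attr):
            return False                          # condition 2: a would be redundant in B
    return True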

3 Consistency Based Attribute Reduction

A binary classification problem in a discrete space is shown in Fig. 1, where the samples are divided into a finite set of equivalence classes {E_1, E_2, ..., E_K} based on their feature values. The samples with the same feature values are grouped into one equivalence class. We find that some of the equivalence classes are pure, as their samples belong to one of the decision classes, but there are also some inconsistent equivalence classes, such as E_3 and E_4 in Fig. 1. According to rough set theory, the inconsistent classes form the decision boundary region, and the set of consistent equivalence classes is named the decision positive region. The objective of feature selection is to find a feature subset which minimizes the inconsistent region, in either discrete or numerical cases, and accordingly minimizes the Bayesian decision error. It is therefore desirable to have a measure which reflects the size of the inconsistent region in discrete and numerical spaces for feature selection.

Dependency reflects the ratio of consistent samples over the whole set of samples; it does not take the boundary samples into account when computing the significance of attributes. Once there are inconsistent samples in an equivalence class, that equivalence class is simply ignored. However, inconsistent samples can be divided into two groups: a subset of samples under the majority class and a subset under the minority classes.

Fig. 1. Classification complexity in a discrete feature space. Two panels, (1) and (2), show equivalence classes E_1-E_6 with class-conditional probabilities p(E_i | ω_1) and p(E_i | ω_2); in both panels E_3 and E_4 contain samples from both classes.

According to the Bayesian rule, only the samples under the minority classes are misclassified. For example, the samples in E_3 and E_4 in Fig. 1 are inconsistent, but only the samples labeled with p(E_3 | ω_2) and p(E_4 | ω_1) are misclassified. The classification power in this case can be given by

f = 1 − [p(E_3 | ω_2) p(E_3) + p(E_4 | ω_1) p(E_4)].
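To make this concrete, here is a small numerical illustration with invented probabilities (they are not read from Fig. 1): the two panels carry the same inconsistent mass, so dependency is identical, but the misclassified mass and hence f differ:

def classification_power(cells):
    """cells: list of (P(E_i), minority fraction inside E_i) for the inconsistent cells."""
    return 1 - sum(p * minority for p, minority in cells)

panel_1 = [(0.1, 0.5), (0.1, 0.5)]    # E3, E4 evenly mixed
panel_2 = [(0.1, 0.1), (0.1, 0.2)]    # same inconsistent mass 0.2, cleaner majorities
print(classification_power(panel_1))  # 0.90
print(classification_power(panel_2))  # 0.97 -> dependency would report 0.8 in both cases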

Dependency cannot reflect the true classification complexity. In the discrete case, we can see from the comparison of panels (1) and (2) of Fig. 1 that, although the probabilities of inconsistent samples are identical, the probabilities of misclassification are different. The dependency function in rough sets cannot reflect this difference. In [3], Dash and Liu introduced the consistency function, which can measure the difference. We now present the basic definitions. The consistency measure is defined via an inconsistency rate, computed as follows.

Definition 1. A pattern is considered to be inconsistent if there are at least two objects that match the whole condition attribute set but have different decision labels.

Definition 2. The inconsistency count ξ_i for a pattern p_i of a feature subset is the number of times the pattern appears in the data minus the largest number among its different class labels.

Definition 3. The inconsistency rate of a feature subset is the sum Σ_i ξ_i of the inconsistency counts over all patterns of the feature subset that appear in the data, divided by |U|, the size of all samples, namely Σ_i ξ_i / |U|. Correspondingly, consistency is computed as δ = (|U| − Σ_i ξ_i) / |U|.
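A minimal sketch of Definitions 1-3 (again illustrative, not the authors' code, and using the same row layout as the earlier toy table) groups samples by their pattern, subtracts the majority-class count per pattern, and normalizes by |U|:

from collections import Counter, defaultdict

def consistency(table, attrs, dec_attr):
    """delta_B(D): fraction of samples that are consistent or belong to the
    majority class of their pattern (Dash and Liu's consistency measure)."""
    class_counts = defaultdict(Counter)
    for row in table:
        pattern = tuple(row[a] for a in attrs)
        class_counts[pattern][row[dec_attr]] += 1
    inconsistency = sum(sum(c.values()) - max(c.values())   # xi_i per pattern
                        for c in class_counts.values())
    return (len(table) - inconsistency) / len(table)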

Based on the above analysis, we can understand that dependency is the ratio of samples undoubtedly correctly classified, while consistency is the ratio of samples probably correctly classified. There are two kinds of samples in POS_B(D) ∪ M: POS_B(D) is the set of consistent samples, while M is the set of samples of the largest class within each inconsistent pattern in the boundary region. In this paper, we will call M the pseudo-consistent samples.
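As a usage example with invented data, a single attribute whose value blocks all mix the two classes gets dependency 0 but a positive consistency, which is exactly the situation described in the introduction (`dependency` and `consistency` are the hypothetical helpers sketched earlier):

# every block of attribute 0 mixes the two classes, so gamma = 0,
# but the majority classes still leave 4 of 6 samples 'probably correct'
U2 = [(0, 'y'), (0, 'y'), (0, 'n'), (1, 'n'), (1, 'n'), (1, 'y')]
print(dependency(U2, [0], 1))    # 0.0        -> the attribute looks useless to gamma
print(consistency(U2, [0], 1))   # 0.666...   -> delta still ranks its quality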


Property 1: Given a decision table DT = <U, C ∪ D, V, f> and ∀B ⊆ C, we have 0 ≤ δ_B(D) ≤ 1 and γ_B(D) ≤ δ_B(D).

Property 2 (monotonicity): Given a decision table DT = <U, C ∪ D, V, f>, if B_1 ⊆ B_2 ⊆ C, we have δ_B1(D) ≤ δ_B2(D).

Property 3: Given a decision table DT = <U, C ∪ D, V, f>, we have δ_C(D) = γ_C(D) = 1 if and only if U/C ⊆ U/D, namely, the table is consistent.

Definition 4. Given a decision table DT = <U, C ∪ D, V, f>, B ⊆ C and a ∈ B, we say that the condition attribute a is indispensable in B if δ_(B−a)(D) < δ_B(D); otherwise, we say a is redundant. We say B ⊆ C is independent if every attribute a in B is indispensable.

δ_B(D) reflects not only the size of the positive region, but also the distribution of the boundary samples. An attribute is said to be redundant if the consistency does not decrease when we delete it. Here the term "redundant" has two meanings. The first is relevant but redundant, the same as the meaning in the general literature [6, 7, 8, 14, 16, 17]. The second is irrelevant. So consistency can detect both kinds of superfluous attributes [3].

Definition 5. Attribute subset B is a consistency-based reduct of the decision table if
(1) δ_B(D) = δ_C(D);
(2) ∀a ∈ B: δ_B(D) > δ_(B−a)(D).

In this definition, the first condition guarantees that the reduct has the same distinguishing ability as the whole set of features; the second guarantees that all of the attributes in the reduct are indispensable. Therefore, there is no superfluous attribute in the reduct.

Finding the optimal subset of features is an NP-hard problem; we would have to evaluate 2^N − 1 combinations of features to find the optimal subset if there are N features in the decision table. Considering the computational complexity, here we construct a forward greedy search algorithm based on the consistency function. We start with an empty set of attributes and add one attribute to the reduct in each round. The selected attribute should make the increment of consistency maximal. Given an attribute subset B, we evaluate the significance of an attribute a as

SIG(a, B, D) = δ_(B ∪ a)(D) − δ_B(D).

SIG(a, B, D) is the increment of consistency produced by introducing the new attribute a in the condition of B. The measure is linear in the number of new consistent and pseudo-consistent samples. Formally, a forward greedy reduction algorithm based on consistency can be formulated as follows.


Algorithm: Greedy Reduction Algorithm based on Consistency
Input: Decision table <U, C ∪ D, V, f>
Output: One reduct red.
Step 1: ∅ → red; // red is the pool of selected attributes
Step 2: for each a_i ∈ C − red, compute
            SIG(a_i, red, D) = δ_(red ∪ a_i)(D) − δ_red(D)
        end
Step 3: select the attribute a_k which satisfies
            SIG(a_k, red, D) = max_i SIG(a_i, red, D)
Step 4: if SIG(a_k, red, D) > 0, then red ∪ a_k → red and go to Step 2; otherwise return red
Step 5: end

In the first round we start with an empty set, and we specify δ_∅(D) = 0. In this algorithm, we generate attribute subsets with a semi-exhaustive search; namely, we evaluate all of the remaining attributes in each round with the consistency function and select the feature producing the maximal significance. The algorithm stops when adding any of the remaining attributes does not increase the consistency value. In real-world applications, we can stop the algorithm when the increment of consistency is less than a given threshold, to avoid the over-fitting problem. In section 4, we discuss this problem in detail. The output of the algorithm is a reduced decision table; the irrelevant attributes and the relevant but redundant attributes are deleted from the system. The output results will be validated with two popular learning algorithms, CART and SVM, in section 4.

By employing a hashing mechanism, we can compute the inconsistency rate approximately with a time complexity of O(|U|) [3]. In the worst case, the whole computational complexity of the algorithm is

|U| × |C| + |U| × (|C| − 1) + ⋯ + |U| = (|C| + 1) × |C| × |U| / 2.
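As an illustration only (a sketch under the assumptions of the earlier toy code, not the authors' implementation), the greedy procedure can be written in Python; the dictionary inside the hypothetical `consistency` helper plays the role of the hashing mechanism mentioned above, and `threshold` corresponds to the optional stopping threshold:

def greedy_consistency_reduct(table, cond_attrs, dec_attr, threshold=0.0):
    """Forward greedy search: repeatedly add the attribute with the largest
    significance SIG(a, red, D) = delta_{red + a}(D) - delta_red(D)."""
    red = []
    current = 0.0                       # delta_empty(D) is specified as 0
    while True:
        best_attr, best_gain = None, 0.0
        for a in cond_attrs:
            if a in red:
                continue
            gain = consistency(table, red + [a], dec_attr) - current
            if gain > best_gain:
                best_attr, best_gain = a, gain
        if best_attr is None or best_gain <= threshold:
            return red                  # no attribute raises consistency any further
        red.append(best_attr)
        current += best_gain

# print(greedy_consistency_reduct(U, [0, 1], 2))  # -> [1]: on the toy table, attribute 1
#                                                 #    alone already reaches delta_C(D) = 0.8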

4 Experimental Analysis

There are two main objectives in conducting the experiments. First, we compare the proposed method with the dependency based algorithm. Second, we study the classification performance of the attributes selected with the proposed algorithm; in particular, how the classification accuracy varies when a new feature is added. This can tell us where the algorithm should be stopped. We downloaded the data sets from the UCI Repository of machine learning databases; they are described in Table 1. There are some numerical attributes in the data sets. Here we employ four discretization techniques to transform the numerical data into categorical data: equal-width, equal-frequency, FCM and entropy.

Table 1. Data description

Data set                              Abbreviation   Samples   Features   Classes
Australian Credit Approval            Crd            690       15         2
Ecoli                                 Ecoli          336       7          7
Heart disease                         Heart          270       13         2
Ionosphere                            Iono           351       34         2
Sonar, Mines vs. Rocks                Sonar          208       60         2
Wisconsin Diagnostic Breast Cancer    WDBC           569       31         2
Wisconsin Prognostic Breast Cancer    WPBC           198       33         2
Wine recognition                      Wine           178       13         3
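For completeness, a minimal equal-width binning sketch is given below; the bin count and the exact discretizers used by the authors (equal-frequency, FCM and entropy-based splits) are not specified here, so this is only an assumed illustration of the preprocessing step:

def equal_width_discretize(values, n_bins=4):
    """Map each numerical value to a bin index 0..n_bins-1 over [min, max]."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins or 1.0          # guard against a constant column
    return [min(int((v - lo) / width), n_bins - 1) for v in values]

print(equal_width_discretize([0.1, 0.4, 0.55, 0.9, 1.0]))   # -> [0, 1, 2, 3, 3]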

We then run the dependency based algorithm [8] and the proposed algorithm on the discretized data sets. The numbers of selected features are presented in Table 2, where P stands for the dependency based algorithm and C stands for the consistency based algorithm. From Table 2 we can see that there is a serious problem with the dependency based algorithm: it selects too few features for classification learning on some data sets. For the data discretized with the equal-width method, the dependency based algorithm selects only one attribute for Iono, while the consistency based one selects 7 attributes. For the equal-frequency method, the dependency based algorithm selects nothing for the data sets Heart, Sonar and WPBC. Similar cases occur with the entropy and FCM based discretization methods. Obviously, the results are unacceptable if a feature selection algorithm finds nothing. By contrast, the consistency based attribute reduction algorithm finds feature subsets of moderate size for all of the data sets. What's more, the sizes of the feature subsets selected by the two algorithms are comparable when the dependency algorithm works well.

Why does the dependency based algorithm find nothing for some data sets? As we know, dependency just reflects the ratio of the positive region. The forward greedy algorithm starts with an empty set and adds, in turn, one attribute at a time into the pool, namely the attribute that produces the greatest increase of the dependency function, until this produces its maximum possible value for the data set. In the first turn, we need to evaluate each single attribute. For some data sets, the dependency is zero for every single attribute. Therefore, no attribute can be added into the pool in the first turn, and the algorithm stops there. Sometimes the algorithm stops in the second or third turn; the selected features are then not enough for classification learning. Consistency can overcome this problem, as it reflects the change in the distribution of boundary samples.

Table 2. The numbers of selected features with different methods

        Raw data   Equal-width    Equal-frequency   Entropy        FCM
                   P      C       P      C          P      C       P      C
Crd     15         11     11      9      9          11     11      12     11
Ecoli   7          6      6       7      7          1      7       1      6
Heart   13         10     9       0      8          0      11      0      8
Iono    34         1      7       1      7          10     8       10     9
Sonar   60         7      7       0      6          0      14      6      6
WDBC    30         12     12      6      6          7      7       8      10
WPBC    33         9      10      0      6          11     7       7      7
Wine    13         5      4       4      4          4      5       4      4
Aver.   25.63      7.63   8.25    --     6.63       --     8.75    --     7.63


Now we use the selected data to train classifiers with the CART and SVM learning algorithms, and we test the classification power of the selected data with 10-fold cross validation. The average classification accuracies with CART and SVM are presented in Tables 3 and 4, respectively. From Table 3, we can see that most of the reduced data sets keep, or even improve, the classification power when the number of selected attributes is appropriate, although most of the candidate features are deleted from the data. This shows that most of the features in these data sets are irrelevant or redundant for training decision trees and should therefore be deleted. However, the classification performance decreases greatly if the data are excessively reduced, such as Iono in the equal-width case and Ecoli in the entropy and FCM cases.

Table 3. Classification accuracy with 10-fold cross validation (CART)

        Raw data   Equal-width        Equal-frequency    Entropy            FCM
                   P        C         P        C         P        C         P        C
Crd     0.8217     0.8246   0.8246    0.8346   0.8150    0.8288   0.8186    0.8274   0.8158
Ecoli   0.8197     0.8138   0.8138    0.8197   0.8138    0.4262   0.8168    0.4262   0.8168
Heart   0.7407     0.7630   0.7630    0        0.7704    0        0.7630    0        0.7815
Iono    0.8755     0.7499   0.9064    0.7499   0.9064    0.9318   0.8922    0.9089   0.9062
Sonar   0.7207     0.7024   0.7014    0        0.7445    0        0.7448    0.6926   0.6976
WDBC    0.9050     0.9367   0.9402    0.9402   0.9508    0.9420   0.9420    0.9351   0.9315
WPBC    0.6963     0.7413   0.7024    0        0.7121    0.6805   0.6855    0.6955   0.6924
Wine    0.8986     0.9090   0.9035    0.9208   0.9153    0.9208   0.9437    0.8972   0.8972
Aver.   0.8098     0.8051   0.8194    --       0.8285    --       0.8258    --       0.8174

We can also see from Table 4 that most of the classification accuracies of the reduced data decrease a little compared with the original data. Correspondingly, the average classification accuracies for all four discretization algorithms are a little lower than for the original data. This suggests that neither the dependency nor the consistency based feature selection algorithm is well suited to SVM learning, because both dependency and consistency compute the distinguishing power in discrete spaces. Table 5 shows the features selected by the consistency based algorithm and the turns in which they are selected for part of the data, where we use the FCM discretized data sets. The trends of the consistency value and of the classification accuracies with CART and SVM are shown in Fig. 4.

Table 4. Classification accuracy with 10-fold cross validation (SVM)

        Raw data   Equal-width        Equal-frequency    Entropy            FCM
                   P        C         P        C         P        C         P        C
Crd     0.8144     0.8144   0.8144    0.8028   0.8273    0.8100   0.8275    0.8058   0.8058
Ecoli   0.8512     0.8512   0.8512    0.8512   0.8512    0.4262   0.8512    0.4262   0.8512
Heart   0.8111     0.8074   0.8074    0        0.8111    0        0.8111    0        0.8074
Iono    0.9379     0.7499   0.9320    0.7499   0.9320    0.9154   0.9207    0.9348   0.9435
Sonar   0.8510     0.7398   0.7595    0        0.7300    0        0.8229    0.7074   0.7843
WDBC    0.9808     0.9668   0.9650    0.9597   0.9684    0.9561   0.9649    0.9649   0.9632
WPBC    0.7779     0.7737   0.7684    0        0.7737    0.7632   0.7632    0.7837   0.7632
Wine    0.9889     0.9444   0.9701    0.9660   0.9660    0.9722   0.9556    0.9486   0.9486
Aver.   0.8767     0.8309   0.8585    --       0.8575    --       0.8646    --       0.8584

Table 5. The selected features with the method FCM + Consistency

          1st    2nd    3rd    4th    5th    6th    ...
Heart     13     12     3      10     1      4      ...
Iono      5      6      8      21     9      3      ...
Sonar     11     16     37     3      9      33
WDBC      28     21     22     3      7      14     ...
WPBC      25     33     1      7      23     18     ...

Fig. 4. Trends of consistency and of the classification accuracies with CART and SVM. Five plots, (1) Heart, (2) Iono, (3) Sonar, (4) WDBC and (5) WPBC, show the index of classification (consistency, CART accuracy and SVM accuracy) against the number of selected features.

For all five plots, the consistency increases monotonically with the number of selected attributes. The maximal value of consistency is 1, which shows that the corresponding decision table is consistent: with the selected attributes, all of the samples can be distinguished. What's more, it is noticeable that the consistency rises rapidly at the beginning and then increases slowly until it stops at 1. This means that the majority of the samples can be distinguished with a few features, while the rest of the selected features are introduced only to discern a few samples. This may lead to an over-fitting problem; therefore the algorithm should be stopped earlier, or we need a pruning step to delete the over-fitting features.

The classification accuracy curves also show this problem. In Fig. 4, the accuracies with CART and SVM rise at first, arrive at a peak, and then stay unchanged or even decrease. In terms of classification learning, the features added after the peak are useless; they sometimes even deteriorate the learning performance. Here we can take two measures to overcome the problem. The first is to stop the algorithm when the increment of consistency is less than a given threshold. The second is to employ some learning algorithm to validate the selected features and delete the features after the accuracy peak. However, the first measure, called pre-pruning, is sometimes not feasible because we usually cannot exactly predict where the algorithm should stop. The latter, called post-pruning, is widely employed. In this work, cross validation is used to test the selected features. Table 6 shows the numbers of selected features and the corresponding classification accuracies. We can see that the classification performance improves in most of the cases, while the feature subsets selected with consistency are further reduced. In particular, for the data sets Heart and Iono, the improvement is as high as 10% and 18% with the CART algorithm.

Table 6. Comparison of features and classification performance with post-pruning

         Raw data                       CART                   SVM
         Features   CART      SVM      Features   Accuracy    Features   Accuracy
Heart    13         0.7630    0.8111   3          0.8519      4          0.8593
Iono     34         0.7499    0.9379   6          0.9260      9          0.9435
Sonar    60         0.7024    0.8510   3          0.7407      6          0.7843
WDBC     30         0.9367    0.9808   6          0.9420      5          0.9702
WPBC     33         0.7413    0.7779   7          0.7461      7          0.7837
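The post-pruning step described above can be sketched as follows; `cross_val_accuracy` is a hypothetical stand-in for any 10-fold cross validation routine (for example, training CART or SVM on the listed features) and is not defined in the paper:

def post_prune(table, ordered_reduct, dec_attr, cross_val_accuracy):
    """Keep the shortest prefix of the consistency-ordered reduct whose
    cross-validated accuracy is maximal (drop the over-fitting tail)."""
    best_k, best_acc = 1, -1.0
    for k in range(1, len(ordered_reduct) + 1):
        acc = cross_val_accuracy(table, ordered_reduct[:k], dec_attr)
        if acc > best_acc:
            best_k, best_acc = k, acc
    return ordered_reduct[:best_k], best_acc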

5 Conclusions

In this paper, we introduce the consistency function to overcome the problems in dependency based algorithms. We discuss the relationship between dependency and consistency, and analyze the properties of consistency. With this measure, redundancy and reducts are redefined, and a forward greedy attribute reduction algorithm based on consistency is constructed. The numerical experiments show that the proposed method is effective. Some conclusions are drawn as follows.

Compared with dependency, consistency reflects not only the size of the decision positive region, but also the sample distribution in the boundary region. Therefore, the consistency measure is able to describe the distinguishing power of an attribute set more finely than the dependency function.

Consistency is monotonic: the consistency value increases or stays the same when a new attribute is added to the attribute set. What's more, some attributes are introduced into the reduct just to distinguish a few samples. If we keep these attributes in the final result, they may overfit the data. Therefore, a pruning technique is required. We use 10-fold cross validation to test the results in the experiments and find more effective and efficient feature subsets.

References

1. Bhatt, R. B., Gopal, M.: On fuzzy-rough sets approach to feature selection. Pattern Recognition Letters 26 (2005) 965–975
2. Breiman, L., Friedman, J., Olshen, R., Stone, C.: Classification and Regression Trees. Wadsworth International, California (1984)
3. Dash, M., Liu, H.: Consistency-based search in feature selection. Artificial Intelligence 151 (2003) 155–176
4. Guyon, I., Weston, J., Barnhill, S., et al.: Gene selection for cancer classification using support vector machines. Machine Learning 46 (2002) 389–422
5. Guyon, I., Elisseeff, A.: An introduction to variable and feature selection. Journal of Machine Learning Research 3 (2003) 1157–1182
6. Hu, Q. H., Li, X. D., Yu, D. R.: Analysis on classification performance of rough set based reducts. In: Yang, Q., Webb, G. (eds.): PRICAI 2006, LNAI 4099. Springer-Verlag, Berlin Heidelberg (2006) 423–433
7. Hu, Q. H., Yu, D. R., Xie, Z. X.: Information-preserving hybrid data reduction based on fuzzy-rough techniques. Pattern Recognition Letters 27 (2006) 414–423
8. Jensen, R., Shen, Q.: Semantics-preserving dimensionality reduction: rough and fuzzy-rough-based approaches. IEEE Transactions on Knowledge and Data Engineering 16 (2004) 1457–1471
9. Liu, H., Yu, L.: Toward integrating feature selection algorithms for classification and clustering. IEEE Transactions on Knowledge and Data Engineering 17 (2005) 491–502
10. Quinlan, J. R.: Induction of decision trees. Machine Learning 1 (1986) 81–106
11. Skowron, A., Rauszer, C.: The discernibility matrices and functions in information systems. In: Slowinski, R. (ed.): Intelligent Decision Support - Handbook of Applications and Advances of the Rough Sets Theory (1991) 331–362
12. Slezak, D.: Approximate decision reducts. Ph.D. Thesis, Warsaw University (2001)
13. Slezak, D.: Approximate entropy reducts. Fundamenta Informaticae 53 (2002) 365–390
14. Swiniarski, R. W., Skowron, A.: Rough set methods in feature selection and recognition. Pattern Recognition Letters 24 (2003) 833–849
15. Xie, Z. X., Hu, Q. H., Yu, D. R.: Improved feature selection algorithm based on SVM and correlation. Lecture Notes in Computer Science 3971 (2006) 1373–1380
16. Zhong, N., Dong, J., Ohsuga, S.: Using rough sets with heuristics for feature selection. Journal of Intelligent Information Systems 16 (2001) 199–214
17. Ziarko, W.: Variable precision rough sets model. Journal of Computer and System Sciences 46 (1993) 39–59
