Measures of classification complexity based on neighborhood model

Qinghua Hu, Hui Zhao, Daren Yu
Harbin Institute of Technology, Harbin 150001, P. R. China

Abstract: It is useful to measure classification complexity for understanding classification tasks and for selecting feature subsets and learning algorithms. In this work, we review some existing measures of classification complexity and propose two new coefficients: neighborhood dependency (ND) and neighborhood decision error minimization (NDEM). ND reflects the ratio of boundary samples to the whole sample set, while NDEM is the decision error rate estimated from the neighborhood-based local information of the samples. We introduce the neighborhood rough set model to define and compute the decision boundary, and further to compute NDEM. Since one hopes to find the feature subspace in which the classification task has the least complexity, we construct a feature selection algorithm based on the proposed measures and sequential forward selection. Experimental results show that, compared with existing measures and ND, NDEM correlates best with the classification error rates of well-known classifiers. Accordingly, NDEM can select the smallest feature subsets with comparable classification performance.

Key words: classification complexity; neighborhood; rough set; decision error; dependency; feature selection

1. Introduction
Classification complexity plays an important role in understanding classification tasks, selecting learning algorithms and selecting feature subsets. For a complex task, one should employ an elaborate algorithm to learn the decision boundary; otherwise the learning algorithm will not be capable of extracting the complex classification rules [1]. It is therefore useful to estimate the complexity of a classification task before learning, both to understand the task and to select an appropriate learning algorithm. Moreover, complexity measures can be used to evaluate the quality of feature subspaces and to select a good subset of features [2, 3, 4].

Several factors determine the complexity of a classification problem [5]. Overlapping of decision regions is known to be a source of nonzero Bayes error: intuitively, samples within the overlapping region are easy to misclassify because they belong to different classes although they have similar or even identical feature values. Complex decision boundaries or subclass structures are another origin of complexity, in which case there is no compact description of the decision boundary. These two factors depend on the sample size and the dimensionality of the feature space. Sparsity of samples in the feature space introduces a further difficulty for classification learning, as it is hard to estimate the classification boundary precisely from a limited set of training samples.

A number of measures have been proposed to compute classification complexity [2, 5, 6, 15]. Most of the evaluation measures used in feature selection can also be employed to estimate the complexity of learning tasks, such as mutual information [9, 10, 11], dependency [12] and consistency [13, 14]. In [5], Ho and Basu presented several measures to characterize the geometrical complexity of classification problems; these measures are chosen to highlight the manner in which classes are separated or interleaved. In [2], Singh introduced several coefficients to compute complexity based on information slicing, including purity, neighborhood separability, collective entropy and compactness. These measures were compared with those presented in [5], and the experimental results show that purity and neighborhood separability correlate well with the true classification errors of known classifiers [15].

In this paper we present two new measures, called neighborhood dependency (ND) and neighborhood decision error minimization (NDEM), which compute the complexity of classification tasks based on boundary region analysis. Furthermore, we use the proposed measures to evaluate the quality of feature subsets and construct feature selection algorithms based on sequential forward search. We experimentally compare the proposed measures with purity and neighborhood separability and find that NDEM is considerably more effective.
2. Neighborhood rough sets
Definition 1. Given arbitrary x_i ∈ U and B ⊆ C, the neighborhood δ_B(x_i) of x_i in B is defined as

    δ_B(x_i) = {x_j | x_j ∈ U, Δ_B(x_i, x_j) ≤ δ},

where Δ is a metric function.

Definition 2. Let B1 ⊆ A and B2 ⊆ A be sets of numerical and discrete attributes, respectively. The neighborhood granules induced by B1, B2 and B1 ∪ B2 are defined as
(1) δ_B1(x) = {x_i | Δ_B1(x, x_i) ≤ δ, x_i ∈ U};
(2) δ_B2(x) = {x_i | Δ_B2(x, x_i) = 0, x_i ∈ U};
(3) δ_B1∪B2(x) = {x_i | Δ_B1(x, x_i) ≤ δ ∧ Δ_B2(x, x_i) = 0, x_i ∈ U},
where ∧ denotes the "and" operator. The first item is designed for numerical attributes, the second for discrete attributes, and the last for systems in which numerical and discrete attributes coexist; Definition 2 is therefore able to deal with mixed numerical and discrete features. By this definition, the samples in the same neighborhood granule are similar: they are close to each other in the numerical subspace and equivalent to each other in the discrete subspace.

Given a metric space <U, Δ>, the family of neighborhood granules {δ(x_i) | x_i ∈ U} forms an elemental granule system which covers the universe, and we have
1) ∀x_i ∈ U: δ(x_i) ≠ ∅, since x_i ∈ δ(x_i);
2) ∪_{x∈U} δ(x) = U.
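As an illustration (a sketch of ours, not code from the paper), the neighborhood granules of Definition 2 can be computed directly from a data matrix. Here the Euclidean distance is assumed for the numerical attributes and exact matching for the discrete ones, following case (3) of Definition 2; the function name neighborhood() is ours.

import numpy as np

def neighborhood(X_num, X_cat, i, delta):
    """Indices of the delta-neighborhood of sample i (Definition 2, case 3).
    X_num: (n, p) array of numerical attribute values (p may be 0).
    X_cat: (n, q) array of discrete attribute values (q may be 0).
    Euclidean distance is an assumption; the paper only requires some metric."""
    n = X_num.shape[0] if X_num.size else X_cat.shape[0]
    close = np.ones(n, dtype=bool)
    if X_num.size:                                  # metric condition on numerical attributes
        dist = np.linalg.norm(X_num - X_num[i], axis=1)
        close &= dist <= delta
    if X_cat.size:                                  # equivalence condition on discrete attributes
        close &= np.all(X_cat == X_cat[i], axis=1)
    return np.flatnonzero(close)                    # always contains i itself

For a purely numerical table one passes an (n, 0) array as X_cat, and vice versa for a purely discrete table.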
Obviously, neighborhood relations are a kind of similarity relation, satisfying reflexivity and symmetry; they draw objects together according to their similarity or indistinguishability in terms of distance.

Definition 3. Given a set of samples U and a neighborhood relation Ν on U, {δ(x_i) | x_i ∈ U} is the family of neighborhood granules. Then we call <U, Ν> a neighborhood approximation space.

Definition 4. Given <U, Ν>, for arbitrary X ⊆ U, two subsets of objects, called the lower and upper approximations of X, are defined as

    N̲X = {x_i | δ(x_i) ⊆ X, x_i ∈ U},   N̄X = {x_i | δ(x_i) ∩ X ≠ ∅, x_i ∈ U}.

The lower approximation is also called the positive region of X, and U − N̄X is called the negative region of X. The boundary region of X in the approximation space is formulated as BN(X) = N̄X − N̲X.

Theorem 1. Given <U, Δ, Ν> and two nonnegative thresholds δ1 and δ2 with δ1 ≥ δ2, we have
1) Ν1 ⊇ Ν2 and, for all x_i ∈ U, δ1(x_i) ⊇ δ2(x_i);
2) ∀X ⊆ U: N̲1 X ⊆ N̲2 X and N̄1 X ⊇ N̄2 X,
where Ν1 and Ν2 are the neighborhood relations induced by δ1 and δ2, respectively.

δ is a parameter that controls the size of the neighborhoods: if δ1 ≥ δ2 ≥ 0, then δ1(x_i) ⊇ δ2(x_i) for all x_i ∈ U. It can also be understood as the resolution at which the classification complexity is estimated.

An information system is called a neighborhood information system if its attributes generate a neighborhood relation on the universe; it is denoted by NIS = <U, A, V, f>. More specifically, a neighborhood information system is called a neighborhood decision system if it contains two kinds of attributes, condition and decision, and it is denoted by NDT = <U, C ∪ D, V, f>.

Definition 5. Given a neighborhood decision table NDT = <U, C ∪ D, V, f>, let X_1, X_2, …, X_N be the object subsets with decisions 1 to N, and let δ_B(x_i) be the neighborhood granule containing x_i and generated by the attributes B ⊆ C. The lower and upper approximations of the decision D with respect to B are then defined as

    N̲_B D = ∪_{i=1}^{N} N̲_B X_i,   N̄_B D = ∪_{i=1}^{N} N̄_B X_i,

where N̲_B X = {x_j | δ_B(x_j) ⊆ X, x_j ∈ U} and N̄_B X = {x_j | δ_B(x_j) ∩ X ≠ ∅, x_j ∈ U}.
The decision boundary region of D with respect to the attributes B is defined as BN(D) = N̄_B D − N̲_B D. The decision boundary is the subset of objects whose neighborhoods contain samples from more than one decision class. The lower approximation of the decision, also called the positive region of the decision and denoted by POS_B(D), is the subset of objects whose neighborhoods consistently belong to one of the decision classes.
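A minimal sketch of Definitions 4 and 5 under the same assumptions, reusing the neighborhood() helper above: a sample lies in the positive region if its neighborhood is contained in a single decision class, and in the decision boundary otherwise. The names are ours.

import numpy as np

def approximations(X_num, X_cat, y, delta):
    """Positive region and decision boundary of the decision y w.r.t. the given attributes.
    y is an (n,) array of class labels; X_num and X_cat are as in neighborhood()."""
    positive, boundary = [], []
    for i in range(len(y)):
        nbrs = neighborhood(X_num, X_cat, i, delta)
        if np.all(y[nbrs] == y[i]):   # neighborhood consistent with one class: lower approximation
            positive.append(i)
        else:                         # neighborhood mixes several decisions: boundary region
            boundary.append(i)
    return np.array(positive), np.array(boundary)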
3. Complexity measures
The proportion of boundary samples reflects the complexity of a classification problem; it also reflects the discriminating capability of the corresponding attributes. A measure of complexity, called neighborhood dependency, can be formulated as follows.

Definition 6. Given a neighborhood decision system NDT = <U, C ∪ D, V, f> and B ⊆ C, the dependency of D on B in the neighborhood space is defined as

    γ_B(D) = ||POS_B(D)|| / ||U||.

γ_B(D) reflects the ability of B to approximate D. Obviously, 0 ≤ γ_B(D) ≤ 1. We say that D totally depends on B if γ_B(D) = 1, denoted by B ⇒ D; otherwise we say that D γ-depends on B, denoted by B ⇒_γ D.

Theorem 2. Let <U, C ∪ D, V, f> be a neighborhood decision system and B1, B2 ⊆ C with B1 ⊆ B2. With the same metric used in computing the neighborhoods, we have
1) Ν_B1 ⊇ Ν_B2;
2) ∀X ⊆ U: N̲_B1 X ⊆ N̲_B2 X;
3) POS_B1(D) ⊆ POS_B2(D) and γ_B1(D) ≤ γ_B2(D).

Theorem 2 shows that dependency increases monotonically with the attributes. The monotonicity of the dependency function is useful for constructing greedy search, floating search or branch and bound algorithms, and it guarantees that adding a new feature to the current subset never decreases the significance of the subset. Neighborhood dependency reflects the fraction of boundary samples in numerical, discrete or mixed feature spaces, and it extends the definition of dependency in classical rough sets so that numerical features can be handled without discretization. Since discrete and numerical features usually coexist in real-world databases, this extension expands the application scope of rough set theory.
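Neighborhood dependency is then simply the fraction of samples falling into the positive region; a short sketch of ours, building on the approximations() helper above:

def neighborhood_dependency(X_num, X_cat, y, delta):
    """gamma_B(D) = |POS_B(D)| / |U| for the attributes packed into X_num and X_cat."""
    positive, _ = approximations(X_num, X_cat, y, delta)
    return len(positive) / len(y)

By Theorem 2, appending further attribute columns (with the same δ and metric) can only enlarge the positive region, so the returned value never decreases.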
[Figure 1 about here: an inconsistent discrete decision system partitioned into equivalence classes E1 to E6, with samples such as x1, x2, x3 and the class probabilities p(ω1 | E3) and p(ω1 | E4) marked.]

Fig. 1 An inconsistent discrete decision system

As we know, not all the samples in a decision boundary region are necessarily misclassified; only those belonging to the minority classes cannot be recognized with the Bayes rule. Dependency therefore cannot reflect the true classification complexity. Figure 2 shows a similar problem in numerical feature spaces. In subplots A and B, the whole feature space is inconsistent because both class probability densities are greater than zero everywhere. As a result, the neighborhood of no sample is pure, its samples coming from two classes, and the dependency is zero in both cases. However, the Bayes error probabilities are less than one in these two cases; moreover, the error probability in A is far less than that in B. Dependency is not able to reflect these differences. Subplots C and D show a similar fact: the proportions of inconsistent samples differ little, but the Bayes error rates are distinct.
[Figure 2 about here: four subplots A, B, C and D; A and B show the class-conditional densities p(x | ω1) and p(x | ω2), C and D show the posterior probabilities p(ω1 | x) and p(ω2 | x), over an axis marked x0, x1, x2, x3, x4.]
Fig. 2 Inconsistent numerical decision systems

Taking subplot C as an example, the region between x_0 and x_1 is the decision positive region of class ω_1, the region between x_3 and x_4 is the decision positive region of class ω_2, and the region between x_1 and x_3 is the decision boundary region. The inconsistency rate can be computed as

    I = ∑_{i=1}^{2} ∫_{x_1}^{x_3} p(ω_i | x) dx.

However, the Bayes error rate is

    e = ∫_{x_1}^{x_2} p(ω_2 | x) dx + ∫_{x_2}^{x_3} p(ω_1 | x) dx.

Generally speaking, e cannot be computed exactly because the class probability densities are unknown; it can, however, be estimated from the local, neighborhood-based information of the samples.

Definition 7. Given arbitrary x_i ∈ U, let δ(x_i) be the neighborhood granule of x_i and let P(ω_j | δ(x_i)), j = 1, 2, …, c, be the probability of class ω_j within that neighborhood. The neighborhood decision of x_i is ND(x_i) = ω_l if

    P(ω_l | δ(x_i)) = max_j P(ω_j | δ(x_i)),

where P(ω_j | δ(x_i)) = n_j / N, N is the number of samples in the neighborhood and n_j is the number of samples with decision ω_j in this neighborhood. We introduce the 0-1 loss function for misclassified samples:

    λ(ND(x_i) | ω(x_i)) = 0 if ω(x_i) = ND(x_i), and 1 if ω(x_i) ≠ ND(x_i),

where ω(x_i) is the real class of x_i.

Definition 8. The neighborhood decision error rate (NDER) is defined as

    NDER = (1/n) ∑_{i=1}^{n} λ(ND(x_i) | ω(x_i)),

where n is the number of samples. We have the following properties:
1) γ_A(D) ≤ 1 − NDER;
2) γ_A(D) = 1 − NDER if the NDT is consistent in the neighborhood granulation space; in this case γ_C(D) = 1 and NDER = 0.
For convenience, we call 1 − NDER the neighborhood recognition rate (NRR). NDER is an estimate of the Bayes decision error. Neighborhood decision error minimization (NDEM) is the idea of selecting features by minimizing NDER, or equivalently maximizing NRR, over different feature subsets.
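Under the same assumptions as before, NDER can be estimated by assigning each sample the majority class of its neighborhood (Definition 7) and averaging the 0-1 losses; the following is a sketch of ours, not the authors' implementation.

import numpy as np

def neighborhood_decision_error(X_num, X_cat, y, delta):
    """NDER: fraction of samples whose neighborhood majority class differs from their true class."""
    errors = 0
    for i in range(len(y)):
        nbrs = neighborhood(X_num, X_cat, i, delta)
        classes, counts = np.unique(y[nbrs], return_counts=True)
        nd = classes[np.argmax(counts)]   # ND(x_i): majority class in the neighborhood
        errors += int(nd != y[i])         # 0-1 loss lambda(ND(x_i) | omega(x_i))
    return errors / len(y)                # NDER; the recognition rate NRR is 1 - NDER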
4. Feature selection with complexity measures
There are two key problems in constructing a feature selection algorithm: the feature evaluation measure and the search strategy. Measures of classification complexity can obviously be used to evaluate the quality of feature subsets. In this paper we focus on the measures of classification complexity rather than on search strategies, and use sequential forward search to construct the feature selection algorithms, comparing the results obtained with the different complexity measures. Formally, a sequential forward selection algorithm based on NDEM can be written as follows.

Algorithm 1: SFS-NDEM
Input: decision table <U, C ∪ d, f>; neighborhood size δ
Output: feature subset red
1: ∅ → red // red is the pool of selected features
2: if C − red = ∅
3:   return red
4: else
5:   for each a_i ∈ C − red
6:     compute SIG(a_i, red, d) = NRR_{red ∪ a_i}(d)
7:   end
8:   select the attribute a_k which satisfies
9:     SIG(a_k, red, d) = max_i SIG(a_i, red, d)
10:  if SIG(a_k, red, d) − SIG(red, d) > 0
11:    red ∪ a_k → red
12:    go to line 2
13:  else
14:    return red
15: end

In Algorithm 1 the feature evaluation function can be replaced with other measures, such as neighborhood dependency, AS_H and AS_NN; accordingly, further algorithms are constructed. We denote these algorithms by SFS-NDEM, SFS-ND, SFS-ASH and SFS-ASNN, respectively.
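A runnable sketch of Algorithm 1 for the purely numerical case, reusing the neighborhood_decision_error() sketch from Section 3; the function and variable names are ours, and the recognition rate of the empty feature set is assumed to be 0.

import numpy as np

def sfs_ndem(X, y, delta):
    """Sequential forward selection driven by NRR = 1 - NDER; X is an (n, m) numerical array."""
    m = X.shape[1]
    red, best_nrr = [], 0.0                       # assumption: SIG of the empty subset is 0
    empty_cat = np.empty((len(y), 0))             # no discrete attributes in this sketch
    while len(red) < m:
        sig = {}
        for a in set(range(m)) - set(red):        # evaluate every candidate attribute (lines 5-7)
            nder = neighborhood_decision_error(X[:, red + [a]], empty_cat, y, delta)
            sig[a] = 1.0 - nder                   # SIG(a, red, d) = NRR of red U {a}
        a_best = max(sig, key=sig.get)            # line 9: attribute with maximal significance
        if sig[a_best] - best_nrr > 0:            # line 10: keep it only if NRR strictly increases
            red.append(a_best)
            best_nrr = sig[a_best]
        else:
            break                                 # lines 13-14: no improvement, stop
    return red                                    # indices of the selected features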
5. Experimental analysis
There are two objectives in conducting the experiments: first, to test how well the measures estimate the true classification complexity; second, to test the capability of the measures in feature selection. We take AS_H and AS_NN as baseline measures and compare them with ND and NDEM. Ten databases are gathered and described in Table 1. We estimate their classification complexities with the four measures AS_H, AS_NN, neighborhood dependency (ND) and neighborhood decision error (NDEM), presented in Table 1, where the neighborhood size δ is set to 0.14. Meanwhile, we employ four classifiers, CART, RBF-SVM, KNN and the neighborhood classifier (NEC) [7], to estimate the classification errors with 10-fold cross validation. The classification complexities and accuracies are shown in Tables 1 and 2, and the correlations between complexities and accuracies are given in Table 3.

Table 1 UCI data (features(classes)-samples) and classification complexities

No.  Dataset   F(C)-S       ASH      ASNN     ND(0.14)    NDEM(0.14)
1    balance   4(3)-625     0.96311  0.98793  0.5824      0.912
2    ecoli     7(8)-336     0.95398  0.98931  0.044643    0.87798
3    glass     9(7)-214     0.91079  0.97847  0.13551     0.68692
4    iono      34(2)-351    0.9875   0.99788  0.66667     0.96439
5    iris      4(3)-150     0.95935  0.99666  0.50667     0.95667
6    sonar     60(2)-208    0.98341  0.9972   0.79808     0.95433
7    wdbc      30(2)-569    0.98367  0.9984   0.18981     0.93322
8    wine      13(3)-178    0.98386  0.9992   0.86517     0.98876
9    yeast     8(10)-1484   0.84792  0.9463   0.0067385   0.42621
10   abalone   8(29)-4177   0.51613  0.81161  0.00023941  0.24611
Table 2 Average accuracies with 10-fold CV using four classifiers

Dataset   CART     SVM      KNN      NEC
balance   0.64814  0.88779  0.735    0.87206
ecoli     0.79615  0.85118  0.85082  0.86565
glass     0.53235  0.57908  0.65427  0.50511
iono      0.87546  0.93789  0.84116  0.7673
iris      0.96667  0.96667  0.95333  0.96
sonar     0.72071  0.85095  0.81286  0.7931
wdbc      0.90501  0.98076  0.96839  0.95263
wine      0.86944  0.98333  0.96042  0.97153
yeast     0.5262   0.58202  0.54358  0.55084
abalone   0.19174  0.25727  0.22857  0.24709
Table 3 Correlation coefficients between accuracies and classification complexities

        CART    SVM     KNN     NEC
ASH     0.8726  0.9095  0.9154  0.8583
ASNN    0.8717  0.8964  0.9156  0.8484
ND      0.5426  0.6533  0.5667  0.5854
NDEM    0.9074  0.9580  0.9433  0.9187

Table 3 shows the correlations between the accuracies of the four classifiers and the complexity measures. NDEM consistently obtains the highest correlation among the measures, while ND obtains the lowest, and all four correlation coefficients of NDEM are greater than 0.9. We can therefore conclude that NDEM is a better estimate of classification complexity than AS_H, AS_NN and ND. In order to show the influence of the parameter δ, we specify δ with a series of values and compute ND and NDEM at each value; the correlation coefficients between accuracies and complexities are shown in Figure 3. We see that ND is very sensitive to the size of the neighborhood, whereas NDEM is rather robust: there is a wide range of δ over which NDEM yields good estimates of complexity. Since the size of the neighborhood can be understood as the resolution of classification complexity analysis [15], we conclude that NDEM is robust to this resolution. This property is important for constructing feature selection algorithms based on a complexity measure.
[Figure 3 about here: two panels, (1) ND and (2) NDEM, plotting the correlation between complexity and accuracy against δ (from 0 to 0.4) for CART, SVM, KNN and NEC.]
Fig. 3 Variation of the correlation between accuracies and complexities with different sizes of neighborhoods

Now we conduct feature selection on the 10 databases with the different measures and compare the numbers of selected features and the corresponding 10-fold cross validation accuracies obtained with CART, SVM, KNN and NEC. The results are shown in Figures 4 and 5, where Figure 4 presents the number of selected features and Figure 5 gives the accuracies obtained on the selected feature subsets.
[Figure 4 about here: bar chart of the number of features selected by NDEM, ND, ASNN and ASH on the 10 data sets.]

Fig. 4 Number of features selected with different measures of complexity
[Figure 5 about here: four bar charts of 10-fold CV accuracies on the selected feature subsets, (1) forward selection (CART), (2) forward selection (SVM), (3) forward selection (KNN), (4) forward selection (NEC), each comparing NDEM, ND, ASNN and ASH on the 10 data sets.]
Fig. 5 Comparison of classification accuracies with different measures

Looking at the results of forward selection, NDEM selects the fewest features on 5 of the data sets, ND on 2, ASNN on 3 and ASH on 2; at the same time, NDEM selects the most features on 1 data set, ND on 4, ASNN on 2 and ASH on 5. As to the GA-based algorithm, NDEM selects the fewest features on 6 data sets, ND on 3, ASNN on 4 and ASH on 1, while NDEM selects the most features on 3 data sets, ND on 4, ASNN on 4 and ASH on 3. On the whole, NDEM most often finds the smallest feature subsets and least often the largest ones. Comparing the accuracies obtained from the feature subsets selected with the different complexity measures, Figure 5 shows that NDEM most often achieves the best classification performance.

6. Conclusion

There are several factors causing classification complexity. In this work we focus on the complexity introduced by class ambiguity, i.e. the overlapping region between different classes. Neighborhood rough sets are introduced to define and compute the decision boundary, and neighborhood dependency is defined as a measure of classification complexity which reflects the ratio of boundary samples to the whole sample set. Furthermore, we point out that not all boundary samples are misclassified in practice, and propose a new measure of complexity, called the neighborhood decision error. Experiments are conducted on UCI databases. We compute the correlation coefficients between the estimated complexities and the classification error rates of four well-known classifiers, and the results show that NDEM correlates best with all the classification errors compared with the other three measures; all the correlation coefficients of NDEM are greater than 0.9. As to feature selection, we also find that NDEM selects the smallest feature subsets while keeping comparable classification performance.

References
[1] D. Elizondo. The linear separability problem: some testing methods. IEEE Transactions on Neural Networks 17 (2006) 330-344.
[2] S. Singh. PRISM - a novel framework for pattern recognition. Pattern Analysis and Applications 6 (2003) 134-149.
[3] N. Abe, M. Kudo. Non-parametric classifier-independent feature selection. Pattern Recognition 39 (2006) 737-746.
[4] A. Kohn, G. M. Nakano, M. Silva. A class discriminability measure based on feature space partition. Pattern Recognition 29 (1996) 873-887.
[5] T. K. Ho, M. Basu. Complexity measures of supervised classification problems. IEEE Transactions on Pattern Analysis and Machine Intelligence 24 (2002) 289-300.
[6] W. Pierson. Using boundary methods for estimating class separability. Ph.D. Thesis, Ohio State University, Ohio, 1998.
[7] Q. H. Hu, D. R. Yu, Z. X. Xie. Neighborhood classifiers. Expert Systems with Applications (2006), doi:10.1016/j.eswa.2006.10.043.
[8] J. Sancho, W. Pierson, B. Ulug, A. Figueiras-Vidal, S. Ahalt. Class separability estimation and incremental learning using boundary methods. Neurocomputing 35 (2000) 3-26.
[9] H. Peng, F. Long, C. Ding. Feature selection based on mutual information: criteria of max-dependency, max-relevance, and min-redundancy. IEEE Transactions on Pattern Analysis and Machine Intelligence 27 (2005) 1226-1238.
[10] Q. H. Hu, D. R. Yu, Z. X. Xie. Information-preserving hybrid data reduction based on fuzzy-rough techniques. Pattern Recognition Letters 27 (2006) 414-423.
[11] Q. H. Hu, D. R. Yu, Z. X. Xie, J. F. Liu. Fuzzy probabilistic approximation spaces and their information measures. IEEE Transactions on Fuzzy Systems 14 (2006) 191-201.
[12] R. Jensen, Q. Shen. Semantics-preserving dimensionality reduction: rough and fuzzy-rough based approaches. IEEE Transactions on Knowledge and Data Engineering 16 (2004) 1457-1471.
[13] M. Dash, H. Liu. Consistency-based search in feature selection. Artificial Intelligence 151 (2003) 155-176.
[14] Q. H. Hu, H. Zhao, Z. X. Xie, D. R. Yu. Consistency based attribute reduction. In: Z.-H. Zhou, H. Li, Q. Yang (Eds.), PAKDD 2007, LNAI 4426, pp. 96-107, 2007.
[15] S. Singh. Multiresolution estimates of classification complexity. IEEE Transactions on Pattern Analysis and Machine Intelligence 25 (2003) 1534-1539.
[16] R. Thawonmas, S. Abe. A novel approach to feature selection based on analysis of class regions. IEEE Transactions on Systems, Man, and Cybernetics - Part B: Cybernetics 27 (1997) 196-207.
[17] S. Abe, R. Thawonmas, Y. Kobayashi. Feature selection by analyzing class regions approximated by ellipsoids. IEEE Transactions on Systems, Man, and Cybernetics - Part C: Applications and Reviews 28 (1998) 282-287.
[18] Z. Pawlak. Rough sets. International Journal of Computer and Information Sciences 11 (1982) 341-356.