Journal of Information & Computational Science 10:13 (2013) 4105–4115 Available at http://www.joics.com
September 1, 2013
A Backtracking Approach to Minimal Cost Feature Selection of Numerical Data ⋆

Hong Zhao ∗, Fan Min ∗, William Zhu

Lab of Granular Computing, Zhangzhou Normal University, Zhangzhou 363000, China
Abstract

In data mining applications, feature selection is one of the most fundamental problems; it can improve the generalization capability of patterns. Recent developments in this field have led to renewed interest in cost-sensitive feature selection. Minimal cost feature selection seeks a trade-off between test costs and misclassification costs. However, this problem has so far been addressed only on nominal data. In this paper, we consider numerical data with measurement errors and propose a backtracking approach to this problem. First, we build a data model with normally distributed measurement errors. To deal with this model, the neighborhood of each data item is constructed through the confidence interval. Compared with discretized intervals, neighborhoods preserve more of the information in the data. We then redefine the minimal total cost feature selection problem on the neighborhood model. Finally, a backtracking algorithm with three effective pruning techniques is proposed for this problem. Experimental results indicate that the pruning techniques are effective and that the algorithm is efficient for datasets with nearly one thousand objects.

Keywords: Feature Selection; Normal Distribution Measurement Errors; Test Costs; Misclassification Costs
1 Introduction
Minimal cost feature selection seeks a trade-off between test costs and misclassification costs. Test costs are what we pay for collecting data items [10]; they are often measured in time, money, or other resources. When only test costs are considered (see, e.g., [12, 16, 17, 19, 20]), the minimal cost feature selection problem degrades to the minimal test cost reduct problem [10, 11]. Therefore, the minimal cost feature selection problem is a generalization of the minimal test cost reduct problem.

⋆ This work is supported in part by the Natural Science Foundation of Fujian Province, China under Grant No. 2012J01294, the Fujian Province Foundation of Higher Education under Grant No. JK2012028, the National Science Foundation of China under Grant No. 61170128, and the Education Department of Fujian Province under Grant No. JA12222.
∗ Corresponding author. Tel.: +86 133 7690 8359
Email addresses:
[email protected] (Hong Zhao),
[email protected] (Fan Min).
1548–7741 / Copyright © 2013 Binary Information Press DOI: 10.12733/jics20102163
In addition to test costs, misclassification costs must be considered in cost-sensitive learning [22]. The misclassification cost is the penalty we incur when deciding that an object belongs to class J while its real class is K [2, 3]. For example, in medical diagnosis, if cancer (non-cancer) is regarded as the negative (positive) class, misclassifying a patient incurs a penalty. When only misclassification costs are considered (see, e.g., [2, 26]), the average misclassification cost is a more general metric than the accuracy [13].

It is important to consider both test costs and misclassification costs in many applications. Test costs are paid on each object, while misclassification costs are paid only on misclassified objects [20]. Therefore we should take into account the total cost over a set of objects. This issue has recently been addressed on nominal data. In real applications, however, data are often numerical, and they always carry some measurement error. Such measurement errors are ubiquitous and unavoidable.

In this paper, we consider numerical data with measurement errors and propose a backtracking approach to the minimal cost feature selection problem. First, a data model with measurement errors under the normal distribution is defined. Second, we construct the neighborhood of each data item through the confidence interval. Compared with discretized intervals, neighborhoods preserve more of the information in the data. Third, considering the trade-off between test costs and misclassification costs, we define a new minimal total cost feature selection problem. Fourth, a backtracking algorithm with three effective pruning techniques is proposed for this problem. Four open datasets from the University of California - Irvine (UCI) library are employed to study the effectiveness and efficiency of our algorithm. Experiments are undertaken with the open source software Coser [15] to validate the performance of this algorithm.
Experimental results indicate that the pruning techniques are effective and that the algorithm is efficient for datasets with nearly one thousand objects.

This paper is organized as follows. Section 2 introduces a decision system with normal distribution measurement errors; we then introduce test costs and misclassification costs into this decision system and define a cost-sensitive decision system. In Section 3, we define a new minimal cost feature selection problem considering both test costs and misclassification costs. A full description of a backtracking algorithm for this problem is given in Section 4. Experimental settings and results are discussed in Section 5. Finally, Section 6 concludes and suggests further research directions.
2 Data Models
In this section, the concept of decision systems with Normal Distribution Measurement Errors (NDME) is revisited. Then the neighborhood of each data item is constructed through the confidence interval. Finally, decision systems based on NDME with test costs and misclassification costs are presented.

Definition 1 [24] A decision system with Normal Distribution Measurement Errors (NEDS) S is the 6-tuple:

S = (U, C, d, V = {Va | a ∈ C ∪ {d}}, I = {Ia | a ∈ C ∪ {d}}, n),  (1)
where U is the nonempty set called a universe, C is the nonempty set of conditional attributes, d is the decision attribute, Va is the set of values for each a ∈ C ∪ {d}, and Ia : U → Va is an information function for each a ∈ C ∪ {d}. n : C → R+ ∪ {0} gives the maximal measurement error of each a ∈ C, and ±n(a) are the confidence limits of a, that is, the lower and upper boundaries of a confidence interval.

In real applications, there are a number of measurement methods for obtaining a numerical data item, with different measurement errors. Measurement errors often follow a normal distribution, which is found to be applicable over almost the whole of science and engineering measurement. We therefore introduce the confidence interval of the normal distribution into our model. Based on Definition 1, we have defined a neighborhood in [25] as follows.

Definition 2 Let S = (U, C, d, V, I, n) be a NEDS. Given xi ∈ U and B ⊆ C, the neighborhood of xi with respect to normal distribution measurement errors on test set B is defined as

nB(xi) = {x ∈ U | ∀a ∈ B, |a(x) − a(xi)| ≤ 2n(a)},  (2)

where the error value of each a lies in [−n(a), +n(a)]. From Definition 2 we know that

nB(xi) = ∩_{a∈B} n{a}(xi).  (3)
That is, the neighborhood nB(xi) is the intersection of a number of basic neighborhoods (see, e.g., [1, 7, 8, 23]). For any x ∈ U and any a ∈ B, x ∈ nB(x); therefore, for any B ⊆ C, ∪_{x∈U} nB(x) = U. Hence the set {nB(xi) | xi ∈ U} is a covering (see, e.g., [9, 27, 28, 29, 30]) of U.

In cost-sensitive learning, test costs and misclassification costs are the two most important types of costs. We now define a cost-sensitive decision system considering both.

Definition 3 A decision system based on NDME with test costs and misclassification costs (NEDS-TM) S is the 8-tuple:

S = (U, C, d, V, I, n, tc, mc),  (4)

where U, C, d, V, I and n have the same meanings as in Definition 1, tc : C → R+ ∪ {0} is the test cost function, and mc : k × k → R+ ∪ {0} is the misclassification cost function, where k = |Id|. For any B ⊆ C, the sequence-independent test cost function tc is defined as

tc(B) = Σ_{a∈B} tc(a).  (5)
The misclassification cost function can be represented by a matrix mc = {mc_{k×k}}. The misclassification cost [4, 5, 21] is the penalty we receive when deciding that an object belongs to class m while its real class is n [3, 14]. If the classification is correct, the misclassification cost mc[m, m] = 0. The following example gives an intuitive understanding.
Table 1: An example of a numerical attribute decision table

x    a1    a2    a3    d
x1   0.31  0.23  0.08  y
x2   0.14  0.38  0.23  y
x3   0.25  0.40  0.40  y
x4   0.60  0.46  0.15  n
x5   0.41  0.64  0.62  n
x6   0.35  0.50  0.75  n
Table 2: A neighborhood boundary vector and a test cost vector

a     a1      a2      a3
n(a)  0.0069  0.0087  0.0086
c(a)  $28     $19     $56
Example 1 A NEDS-TM is given by Tables 1 and 2, and

    mc = [  0   800
           200    0 ].  (6)
That is, the test costs are $28, $19 and $56, respectively. In this dataset, the decision attribute splits the objects into two sets. If a person belonging to class "y" is misclassified into class "n", a penalty of $800 is paid; in the opposite case, $200 is paid.
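To make Definition 2 concrete, the following sketch computes neighborhoods on the data of Tables 1 and 2. The dictionary layout and the helper name `neighborhood` are our own illustrative choices, not part of the paper's Coser implementation.

```python
# Table 1: objects, their attribute values, and decision classes.
data = {
    "x1": {"a1": 0.31, "a2": 0.23, "a3": 0.08, "d": "y"},
    "x2": {"a1": 0.14, "a2": 0.38, "a3": 0.23, "d": "y"},
    "x3": {"a1": 0.25, "a2": 0.40, "a3": 0.40, "d": "y"},
    "x4": {"a1": 0.60, "a2": 0.46, "a3": 0.15, "d": "n"},
    "x5": {"a1": 0.41, "a2": 0.64, "a3": 0.62, "d": "n"},
    "x6": {"a1": 0.35, "a2": 0.50, "a3": 0.75, "d": "n"},
}
# Table 2: maximal measurement errors n(a).
n_err = {"a1": 0.0069, "a2": 0.0087, "a3": 0.0086}

def neighborhood(xi, B):
    """Eq. (2): nB(xi) = {x in U | for all a in B, |a(x) - a(xi)| <= 2 n(a)}."""
    return {x for x, row in data.items()
            if all(abs(row[a] - data[xi][a]) <= 2 * n_err[a] for a in B)}

# Eq. (3): nB(xi) is the intersection of single-attribute neighborhoods.
assert neighborhood("x1", ["a1", "a2"]) == \
    neighborhood("x1", ["a1"]) & neighborhood("x1", ["a2"])

# With the small error bounds of Table 2 every neighborhood is a singleton:
# no two a1-values lie within 2 * 0.0069 = 0.0138 of each other.
print(neighborhood("x1", ["a1"]))  # {'x1'}
```

Note that for B = ∅ the membership condition holds vacuously, so n∅(xi) = U for every xi; this matches the covering property discussed after Eq. (3).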
3 Minimal Cost Feature Selection Problem
In this work, we focus on selecting a subset of features to minimize the total cost, that is, cost-sensitive feature selection based on test costs and misclassification costs. The problem of finding such a subset of features is called the minimal cost feature selection problem.

Problem 1 The minimal cost feature selection problem (MCFS).
Input: S = (U, C, d, V, I, n, tc, mc);
Output: R ⊆ C and the average total cost ATC(R);
Optimization objective: min ATC(R).

Compared with the classical minimal reduct problem, there are several differences. The first is the input: the external information includes the test costs and misclassification costs as well as the normal distribution measurement errors. The second is the optimization objective, which is to minimize the total cost instead of the number of features.

In data mining applications, the average total cost (ATC) is considered a more general metric than the accuracy [26]. The average total cost is computed as follows.

Step 1. Compute the neighborhood of each data item. Let B be a selected feature set, and let N(B) be the set {nB(xi) | xi ∈ U}.
Step 2. Assign one class. Let U′ = nB(xi) and let cB(xi)(dm) be the number of objects of class dm in U′, where dm ∈ {Id}. We relabel the elements of U′ with a single class so as to minimize the misclassification cost of U′. There are two cases.

1. U′ ⊆ POS_B({d}) if and only if d(x) = d(y) for any x, y ∈ U′. In this case, the misclassification cost of U′ is mc(U′, B) = 0, and for any x ∈ U′ the assigned class is d′(x) = d(x).

2. If there exist x, y ∈ U′ such that d(x) ≠ d(y), we relabel the elements of U′ with one class. Let d(x) be the m-class and d(y) the n-class. We select the relabelling that minimizes the misclassification cost of U′:

mc(U′, B) = min(mc_{m,n} × |U′_m|, mc_{n,m} × |U′_n|).  (7)

For any x ∈ U′,

d′(x) = { n-class, if mc(U′, B) = mc_{m,n} × |U′_m|,
          m-class, if mc(U′, B) = mc_{n,m} × |U′_n|,  (8)

where mc_{m,n} is the cost of classifying an object of the m-class into the n-class, and |U′_m| is the number of m-class objects.
Step 3. Compute the average misclassification cost. The class of xi is d′(xi) = dm if and only if cB(xi)(dm) = max{cB(xi)(dl) | dl ∈ {Id}}. The misclassification cost is

mc*(d(xi), d′(xi)) = { 0,         if d(xi) = d′(xi),
                       mc_{m,n},  if d(xi) = m and d′(xi) = n,
                       mc_{n,m},  if d(xi) = n and d′(xi) = m.  (9)

In this way, the average misclassification cost is given by

mc(U, B) = ( Σ_{xi∈U} mc*(d(xi), d′(xi)) ) / |U|.  (10)
Step 4. Compute the average total cost (ATC). The average total cost is given by

ATC(U, B) = tc(B) + mc(U, B).  (11)

In this context, we select a feature subset that minimizes the average total cost. The minimal average total cost is given by

ATC(U, B) = min{ATC(U, B′) | B′ ⊆ C}.  (12)
The MCFS problem has a motivation similar to that of the cheapest cost problem [18] and the Minimal Test cost Reduct (MTR) problem (see, e.g., [10]). However, our MCFS problem differs from the MTR problem in two respects.

1. In addition to the test cost of each attribute, we take the misclassification cost into account. When the misclassification costs are very large compared with the test costs, the MCFS problem coincides with the MTR problem. Therefore, the MCFS problem is a generalization of the MTR problem.

2. Attribute reduction [6] must preserve a particular property of the decision system, whereas feature selection relies only on the cost information.
4 Algorithm
In this section, the backtracking algorithm for the Minimal Cost Feature Selection (MCFS) problem is illustrated in Algorithm 1. To invoke the algorithm, one should initialize the global variables as follows: R = ∅ is the feature subset with minimal total cost, and cmc = mc(U, R) is the current minimal cost; then use the statement backtracking(R, 0). The feature subset with minimal total cost is stored in R.

Algorithm 1 A backtracking algorithm for the MCFS problem.
Input: (U, C, d, {Va}, {Ia}, n, tc, mc), selected tests R, current level test index lower bound l
Output: A set of features R with ATC, and cmc; both are global variables
Method: backtracking
 1: for (i = l; i < |C|; i++) do
 2:   B = R ∪ {ai}
 3:   if (tc(B) > cmc) then
 4:     continue; // Pruning for too expensive test cost
 5:   end if
 6:   if ((ATC(U, B) ≥ ATC(U, R)) and (mc(B) < mc(R))) then
 7:     continue; // Pruning for non-decreasing total cost despite decreasing misclassification cost
 8:   end if
 9:   if (ATC(U, B) < cmc) then
10:     cmc = ATC(U, B); // Update the minimal total cost
11:     R = B; // Update the set of features with minimal total cost
12:   end if
13:   backtracking(B, i + 1);
14: end for

Generally, the search space of a feature selection algorithm is of size 2^|C|. In this context, a backtracking algorithm with pruning techniques is used to select a feature subset minimizing the total cost. Three pruning techniques are employed in Algorithm 1.

1. Line 1 indicates that the variable i starts from l instead of 0. Whenever we move forward (see Line 13), the lower bound is increased. With this pruning technique, the solution space is 2^|C| instead of |C|!.

2. Lines 3 through 5 show the second pruning technique. Since misclassification costs are non-negative in practical applications, a feature subset B can be discarded if its test cost alone exceeds the current minimal cost (cmc). This technique prunes most branches.
3. Lines 6 through 8 indicate that if the new feature subset produces a higher cost, the current branch will never yield the feature subset with minimal total cost, so it is pruned.
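The pseudocode above can be sketched in a self-contained way on the Example 1 data. This is our own illustrative transcription under the simplified cost computation of Section 3 (the authors' Coser implementation may differ in detail); `best` plays the role of the global variables R and cmc.

```python
# Example 1 data (Tables 1 and 2, cost matrix of Eq. (6)).
data = {
    "x1": {"a1": 0.31, "a2": 0.23, "a3": 0.08, "d": "y"},
    "x2": {"a1": 0.14, "a2": 0.38, "a3": 0.23, "d": "y"},
    "x3": {"a1": 0.25, "a2": 0.40, "a3": 0.40, "d": "y"},
    "x4": {"a1": 0.60, "a2": 0.46, "a3": 0.15, "d": "n"},
    "x5": {"a1": 0.41, "a2": 0.64, "a3": 0.62, "d": "n"},
    "x6": {"a1": 0.35, "a2": 0.50, "a3": 0.75, "d": "n"},
}
n_err = {"a1": 0.0069, "a2": 0.0087, "a3": 0.0086}
tc = {"a1": 28, "a2": 19, "a3": 56}
mc = {("y", "y"): 0, ("n", "n"): 0, ("y", "n"): 800, ("n", "y"): 200}
attrs = ["a1", "a2", "a3"]

def avg_mc(B):
    """Average misclassification cost mc(U, B), Eqs. (7)-(10), simplified."""
    total = 0
    for xi in data:
        nbhd = [x for x, row in data.items()
                if all(abs(row[a] - data[xi][a]) <= 2 * n_err[a] for a in B)]
        classes = [data[x]["d"] for x in nbhd]
        c = min(set(classes),
                key=lambda lab: sum(mc[(k, lab)] for k in classes))
        total += mc[(data[xi]["d"], c)]
    return total / len(data)

def atc(B):
    return sum(tc[a] for a in B) + avg_mc(B)  # Eq. (11)

best = {"R": [], "cmc": avg_mc([])}  # R = empty set, cmc = mc(U, R)

def backtracking(R, l):
    for i in range(l, len(attrs)):       # pruning 1: i starts from l
        B = R + [attrs[i]]
        if sum(tc[a] for a in B) > best["cmc"]:
            continue                     # pruning 2: test cost alone too high
        if atc(B) >= atc(best["R"]) and avg_mc(B) < avg_mc(best["R"]):
            continue                     # pruning 3: total cost not decreasing
        if atc(B) < best["cmc"]:
            best["cmc"], best["R"] = atc(B), B
        backtracking(B, i + 1)

backtracking([], 0)
print(best)  # {'R': ['a2'], 'cmc': 19.0}
```

On this tiny example, pruning 2 cuts every extension of {a2} (any superset costs at least $75 in tests alone), so the search visits only a handful of the 2^3 subsets before settling on R = {a2}.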
5 Experiments
Experiments are undertaken on four datasets from the UCI Repository of Machine Learning Databases, as listed in Table 3. We conduct three groups of experiments from different viewpoints.
In many real-world applications, such as fraud detection and medical diagnosis, the costs of different types of misclassification can be vastly different. Therefore, in our experiments, we assume that the misclassification costs differ from each other. Table 4 shows the optimal feature subset under different misclassification costs for the Diab dataset; the ratio of the two misclassification costs is set to 10 in this experiment. As shown in the table, when the misclassification costs are very large compared with the test costs, the test costs increase and even become equal to the average total cost. In this case, the MCFS problem coincides with the MTR problem. Table 5 shows the optimal feature subset under a unified misclassification cost. Since the misclassification costs are small enough,

Table 3: Dataset information

No.  Name    Domain    |U|  |C|  D = {d}
1    Liver   clinic    345  6    selector
2    Credit  commerce  690  15   class
3    Iono    physics   351  34   class
4    Diab    clinic    768  8    class
Table 4: The optimal feature subset under different misclassification costs

MisCost1  MisCost2  Test costs  Total cost  Feature subset
120       1200      6.00        9.75        [1, 2, 3]
140       1400      6.00        10.38       [1, 2, 3]
160       1600      6.00        11.00       [1, 2, 3]
180       1800      6.00        11.63       [1, 2, 3]
200       2000      6.00        12.25       [1, 2, 3]
220       2200      10.00       12.58       [1, 3, 6]
240       2400      10.00       12.81       [1, 3, 6]
260       2600      13.00       13.00       [1, 2, 6]
Table 5: The optimal feature subset under a unified misclassification cost

MisCost1  MisCost2  Test costs  Total cost  Feature subset
120       120       6.00        9.59        [1, 2, 3]
140       140       6.00        10.19       [1, 2, 3]
160       160       10.00       10.42       [0, 1, 2, 3]
180       180       10.00       10.47       [0, 1, 2, 3]
200       200       10.00       10.52       [0, 1, 2, 3]
220       220       10.00       10.57       [0, 1, 2, 3]
240       240       10.00       10.63       [0, 1, 2, 3]
260       260       10.00       10.68       [0, 1, 2, 3]
the algorithm chooses a feature subset with minimal average total cost.

To evaluate the efficiency of the algorithm, we design two sets of experiments on the performance of the pruning techniques. We compare a Fast approach, the backtracking algorithm with all three pruning techniques as given in Algorithm 1, with a Slow approach, which omits the third pruning technique. The first set of experiments studies how the number of backtracking steps changes with different misclassification costs. Fig. 1 shows the backtracking steps of the two approaches; the smaller misclassification cost is listed on the X-axis. From these figures we observe the effectiveness of the third pruning technique.
Fig. 1: Backtracking steps of the Fast and Slow approaches under different misclassification cost settings: (a) Liver; (b) Credit; (c) Iono; (d) Diab
From the cost viewpoint, the changes of the test costs and the average minimal total cost are shown in Fig. 2; the smaller misclassification cost is listed on the X-axis. As shown in Fig. 2 (a), when the misclassification costs are large enough, the average total cost becomes equal to the test cost. In this case the misclassification cost is zero, and the set of selected tests is an attribute reduct. In real applications, no attribute may be selected at all when the misclassification costs are low; Fig. 2 (d) shows this situation.
Fig. 2: Test costs and average total cost under different misclassification cost settings: (a) Liver; (b) Credit; (c) Iono; (d) Diab
6 Conclusion
In this paper, we take data with normal distribution measurement errors into account and study feature selection with minimal cost. This minimal cost feature selection problem has a wide application area for two reasons. From the viewpoint of the data, the measurement errors considered here are ubiquitous. From the viewpoint of the minimal cost problem, the resources one can afford are often limited. To obtain the optimal result, a backtracking algorithm with three effective pruning techniques is designed for the MCFS problem. Experimental results show that the pruning techniques are effective. This work also serves as a benchmark for heuristic algorithms, which we will design in future work for large datasets.
References

[1] A. Bargiela, W. Pedrycz, Granular Computing: An Introduction, Kluwer Academic Publishers, Boston, 2002
[2] C. Elkan, The foundations of cost-sensitive learning, in: Proceedings of the 7th International Joint Conference on Artificial Intelligence, 2001
[3] E. B. Hunt, J. Marin, P. J. Stone (eds.), Experiments in Induction, Academic Press, New York, 1966
[4] M. Kukar, I. Kononenko et al., Cost-sensitive learning with neural networks, in: Proceedings of the 13th European Conference on Artificial Intelligence (ECAI-98), Chichester, John Wiley & Sons, 1998
[5] J. Lan, M. Hu, E. Patuwo, G. Zhang, An investigation of neural network classifiers with unequal misclassification costs and group sizes, Decision Support Systems, 48(4), 2010, 582-591
[6] H. X. Li, X. Z. Zhou, J. B. Zhao, D. Liu, Attribute reduction in decision-theoretic rough set model: A further investigation, in: Proceedings of Rough Sets and Knowledge Technology, LNCS, Vol. 6954, 2011
[7] T. Y. Lin, Granular computing on binary relations - analysis of conflict and Chinese wall security policy, in: Proceedings of Rough Sets and Current Trends in Computing, LNAI, Vol. 2475, 2002
[8] T. Y. Lin, Granular computing - structures, representations, and applications, in: Lecture Notes in Artificial Intelligence, Vol. 2639, 2003
[9] L. W. Ma, On some types of neighborhood-related covering rough sets, International Journal of Approximate Reasoning, 53(6), 2012, 901-911
[10] F. Min, H. P. He, Y. H. Qian, W. Zhu, Test-cost-sensitive attribute reduction, Information Sciences, 181 (2011), 4928-4942
[11] F. Min, Q. H. Liu, A hierarchical model for test-cost-sensitive decision systems, Information Sciences, 179 (2009), 2442-2452
[12] F. Min, W. Zhu, Attribute reduction with test cost constraint, Journal of Electronic Science and Technology of China, 9(2), 2011, 97-102
[13] F. Min, W. Zhu, Minimal cost attribute reduction through backtracking, in: Database Theory and Application, Bio-Science and Bio-Technology, Communications in Computer and Information Science, Vol. 258, 2011, 100-107
[14] F. Min, W. Zhu, A competition strategy to cost-sensitive decision trees, in: Proceedings of Rough Set and Knowledge Technology, LNAI, Vol. 7414, 2012
[15] F. Min, W. Zhu, H. Zhao, G. Y. Pan, J. B. Liu, Z. L. Xu, Coser: Cost-sensitive rough sets, 2012, http://grc.fjzs.edu.cn/~fmin/coser/
[16] S. Norton, Generating better decision trees, in: Proceedings of the Eleventh International Joint Conference on Artificial Intelligence, 1989
[17] M. Núñez, The use of background knowledge in decision tree induction, Machine Learning, 6(3), 1991, 231-250
[18] R. Susmaga, Computation of minimal cost reducts, in: Z. Ras, A. Skowron (eds.), Foundations of Intelligent Systems, LNCS, Vol. 1609, 1999, 448-456
[19] M. Tan, Cost-sensitive learning of classification knowledge and its applications in robotics, Machine Learning, 13(1), 1993, 7-33
[20] P. Turney, Cost-sensitive classification: Empirical evaluation of a hybrid genetic decision tree induction algorithm, Journal of Artificial Intelligence Research (JAIR), 2 (1995), 369-409
[21] P. Turney, Types of cost in inductive concept learning, in: Proceedings of the ICML'2000 Workshop on Cost-Sensitive Learning, 2000
[22] Y. Weiss, Y. Elovici, L. Rokach, The CASH algorithm - cost-sensitive attribute selection using histograms, Information Sciences, Vol. 222, 2013, 247-268
[23] H. Zhao, F. Min, W. Zhu, Test-cost-sensitive attribute reduction based on neighborhood rough set, in: Proceedings of the 2011 IEEE International Conference on Granular Computing, 2011
[24] H. Zhao, F. Min, W. Zhu, Inducing covering rough sets from error distribution, Journal of Information & Computational Science, 10(3), 2013, 851-859
[25] H. Zhao, F. Min, W. Zhu, Test-cost-sensitive attribute reduction of data with normal distribution measurement errors, to appear in Mathematical Problems in Engineering, 2013
[26] Z. Zhou, X. Liu, Training cost-sensitive neural networks with methods addressing the class imbalance problem, IEEE Transactions on Knowledge and Data Engineering, 18(1), 2006, 63-77
[27] W. Zhu, Generalized rough sets based on relations, Information Sciences, 177(22), 2007, 4997-5011
[28] W. Zhu, Relationship among basic concepts in covering-based rough sets, Information Sciences, 179(14), 2009, 2478-2486
[29] W. Zhu, Relationship between generalized rough sets based on binary relation and covering, Information Sciences, 179(3), 2009, 210-225
[30] W. Zhu, F. Wang, Reduction and axiomization of covering generalized rough sets, Information Sciences, 152(1), 2003, 217-230