Minimal cost feature selection of data with normal distribution measurement errors

Hong Zhao, Fan Min∗, William Zhu
Lab of Granular Computing, Zhangzhou Normal University, Zhangzhou 363000, China

∗Corresponding author. Tel.: +86 133 7690 8359. Email addresses: [email protected] (Hong Zhao), [email protected] (Fan Min), [email protected] (William Zhu).
Abstract

Minimal cost feature selection is devoted to obtaining a trade-off between test costs and misclassification costs. This issue has recently been addressed on nominal data. In this paper, we consider numerical data with measurement errors and study minimal cost feature selection in this model. First, we build a data model with normal distribution measurement errors. Second, the neighborhood of each data item is constructed through the confidence interval. Compared with discretized intervals, neighborhoods preserve the information of the data more faithfully. Third, we define a new minimal total cost feature selection problem by considering the trade-off between test costs and misclassification costs. Fourth, we propose a backtracking algorithm with three effective pruning techniques to deal with this problem. The algorithm is tested on four UCI data sets. Experimental results indicate that the pruning techniques are effective, and that the algorithm is efficient for data sets with nearly one thousand objects.

Keywords: feature selection; normal distribution measurement errors; test costs; misclassification costs.
1. Introduction

Minimal cost feature selection is devoted to obtaining a trade-off between test costs and misclassification costs. Test costs are what we pay for collecting data items [1]. They are often measured in time, money, or other resources. When only test costs are considered (see, e.g., [2, 3, 4, 5]), the minimal cost feature selection problem degrades to the minimal test cost reduct problem [1]. Therefore, the minimal cost feature selection problem is a generalization of the minimal test cost reduct problem. In addition to test costs, misclassification costs need to be considered in cost-sensitive learning [6]. Misclassification cost is the penalty we
receive while deciding that an object belongs to class J when its real class is K [7, 8]. For example, in medical diagnosis, if cancer is regarded as the negative class and non-cancer as the positive class, misclassifying a patient in either direction incurs a penalty. When only misclassification costs are considered (see, e.g., [8, 9]), the average misclassification cost is a more general metric than the accuracy [10].

It is important to consider both test costs and misclassification costs in many applications [5]. Test costs are paid for every object, while misclassification costs are paid only for misclassified objects. Therefore we should take the total cost over a set of objects into account by considering the trade-off between test costs and misclassification costs. This issue has recently been addressed on nominal data. In real applications, however, data are often numerical and almost always carry measurement errors. Such errors are universal and unavoidable.

In this paper, we study minimal cost feature selection on numerical data with measurement errors. First, a data model with measurement errors under the normal distribution is defined. Second, we construct the neighborhood of each data item through the confidence interval. Compared with discretized intervals, neighborhoods preserve the information of the data more faithfully. Third, considering the trade-off between test costs and misclassification costs, we define a new minimal total cost feature selection problem. Fourth, a backtracking algorithm with three effective pruning techniques is proposed to deal with this problem. Four open data sets from the UCI (University of California - Irvine) library are employed to study the efficiency and effectiveness of our algorithm. Experiments are undertaken with the open source software Coser [11] to validate the performance of the algorithm. Experimental results indicate that the pruning techniques are effective, and that the algorithm is efficient for data sets with nearly one thousand objects.

This paper is organized as follows. Section 2 introduces a decision system with normal distribution measurement errors. We then introduce test costs and misclassification costs to this decision system and define a cost-sensitive decision system. In Section 3, we define a new minimal cost feature selection problem by considering test costs and misclassification costs. A full description of a backtracking algorithm for this problem is given in Section 4. Experimental settings and results are discussed in Section 5. Finally, Section 6 concludes and suggests further research trends.

2. Data models

In this section, the concept of decision systems with normal distribution measurement errors (NDME) is revisited. Then the neighborhood of each data item is constructed through the confidence interval. Finally, a decision system based on NDME with test costs and misclassification costs is presented.
Definition 1. [12] A decision system with normal distribution measurement errors (NEDS) S is the 6-tuple:

S = (U, C, d, V = {V_a | a ∈ C ∪ {d}}, I = {I_a | a ∈ C ∪ {d}}, n),    (1)
where U is a nonempty set called the universe, C is the nonempty set of conditional attributes, d is the decision attribute, V_a is the set of values for each a ∈ C ∪ {d}, and I_a : U → V_a is an information function for each a ∈ C ∪ {d}. n : C → R+ ∪ {0} gives the maximal measurement error of each a ∈ C, and ±n(a) are the confidence limits of a, namely the lower and upper boundaries of its confidence interval.

In real applications, there are a number of measurement methods to obtain a numerical data item, each with a different measurement error. Measurement errors often follow a normal distribution, which applies over almost the whole of science and engineering measurement. We therefore introduce the confidence interval of the normal distribution into our model. With Definition 1, a new neighborhood is defined as follows.

Definition 2. [12] Let S = (U, C, d, V, I, n) be a NEDS. Given x_i ∈ U and B ⊆ C, the neighborhood of x_i with respect to normal distribution measurement errors on test set B is defined as

n_B(x_i) = {x ∈ U | ∀a ∈ B, |a(x) − a(x_i)| ≤ 2n(a)},    (2)
where the error value of each a lies in [−n(a), +n(a)]. From Definition 2 we know that

n_B(x_i) = \bigcap_{a \in B} n_{\{a\}}(x_i).    (3)

That is, the neighborhood n_B(x_i) is the intersection of a number of basic neighborhoods (see, e.g., [13, 14, 15, 16]). For any x ∈ U and any a ∈ B, we have x ∈ n_B(x). Therefore, for any B ⊆ C, ⋃_{x∈U} n_B(x) = U. Hence the set {n_B(x_i) | x_i ∈ U} is a covering (see, e.g., [17, 18, 19]) of U.
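To make Definition 2 and Eq. (3) concrete, the following small Python sketch (our own illustration, not code from the paper; the function name, the dictionary-based data layout, and the toy values are all assumptions) computes a neighborhood from the 2n(a) bound and checks the intersection property.

```python
# A minimal sketch of Definition 2; all identifiers and data are illustrative.

def neighborhood(U, x_i, B, n):
    """Return n_B(x_i): all objects whose value on every test a in B differs
    from that of x_i by at most 2 * n(a)."""
    return [x for x in U
            if all(abs(x[a] - x_i[a]) <= 2 * n[a] for a in B)]

# Toy data in the spirit of Tables 1 and 2 (values made up for illustration).
U = [{"a1": 0.310, "a2": 0.230},
     {"a1": 0.320, "a2": 0.240},
     {"a1": 0.600, "a2": 0.460}]
n = {"a1": 0.0069, "a2": 0.0087}
B = ["a1", "a2"]

print(neighborhood(U, U[0], B, n))   # the first two objects are mutual neighbors

# Eq. (3): the neighborhood on B is the intersection of the single-test neighborhoods.
nb = {id(x) for x in neighborhood(U, U[0], B, n)}
assert nb == set.intersection(*({id(x) for x in neighborhood(U, U[0], [a], n)} for a in B))
```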
In cost-sensitive learning, test costs and misclassification costs are the two most important types of costs. We now define a cost-sensitive decision system by considering both of them.

Definition 3. A decision system based on NDME with test costs and misclassification costs (NEDS-TM) S is the 8-tuple:

S = (U, C, d, V, I, n, tc, mc),    (4)
where U, C, d, V, I and n have the same meanings as in Definition 1, tc : C → R+ ∪ {0} is the test cost function, and mc : k × k → R+ ∪ {0} is the misclassification cost function, where k = |V_d| is the number of decision classes.
For any B ⊆ C, the sequence-independent test cost function tc is defined as

tc(B) = \sum_{a \in B} tc(a).    (5)
The misclassification cost function can be represented by a matrix mc = {mc_{k×k}}. Misclassification cost [20, 21, 22] is the penalty we receive when deciding that an object belongs to class m while its real class is n [7, 23]. If the classification is correct, the misclassification cost is mc[m, m] = 0. The following example gives an intuitive understanding.

Table 1: An example numerical value attribute decision table.
x     a1     a2     a3     d
x1    0.31   0.23   0.08   y
x2    0.14   0.38   0.23   y
x3    0.25   0.40   0.40   y
x4    0.60   0.46   0.51   n
x5    0.41   0.64   0.62   n
x6    0.35   0.50   0.75   n
Table 2: A neighborhood boundaries vector and a test costs vector.
a     n(a)      c(a)
a1    0.0069    $28
a2    0.0087    $19
a3    0.0086    $56

Example 4. A NEDS-TM is given by Tables 1 and 2, and

mc = \begin{pmatrix} 0 & 800 \\ 200 & 0 \end{pmatrix}.    (6)
That is, the test costs are $28, $19 and $56, respectively. In this data set, the decision attribute splits the objects into two classes. If a person belonging to class "y" is misclassified into class "n", a penalty of $800 is paid; conversely, a penalty of $200 is paid.
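For intuition, Eq. (5) and the cost matrix of Example 4 can be encoded as follows. This is a sketch in our own notation (a dictionary of test costs and a nested list for mc), not code from the paper.

```python
# Test costs from Table 2 and the misclassification cost matrix of Eq. (6);
# rows index the real class, columns the predicted class, with 0 = "y" and 1 = "n".
test_cost = {"a1": 28, "a2": 19, "a3": 56}
mc = [[0, 800],
      [200, 0]]

def tc(B):
    """Sequence-independent test cost of a feature subset B, as in Eq. (5)."""
    return sum(test_cost[a] for a in B)

print(tc({"a1", "a2"}))     # 47: cost of performing tests a1 and a2
print(mc[0][1], mc[1][0])   # 800 ("y" classified as "n"), 200 ("n" classified as "y")
```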
3. Minimal cost feature selection problem

In this work, we focus on selecting a subset of features to minimize the total cost, that is, cost-sensitive feature selection based on test costs and misclassification costs. The problem of finding such a subset of features is called the minimal cost feature selection problem.

Problem 5. The minimal cost feature selection problem (MCFS).
Input: S = (U, C, d, V, I, n, tc, mc);
Output: R ⊆ C and the average total cost (ATC);
Optimization objective: min ATC(R).

Compared with the classical minimal reduction problem, there are several differences. The first is the input: the external information consists of the test costs and misclassification costs as well as the normal distribution measurement errors. The second is the optimization objective, which is to minimize the total cost instead of the number of features.

In data mining applications, the average total cost (ATC) is considered a more general metric than the accuracy [9]. The average total cost is computed as follows.

Step 1. Compute the neighborhood of each data item. Let B be a selected feature set, and let N(B) be the set {n_B(x_i) | x_i ∈ U}.

Step 2. Assign one class. Let U' = n_B(x_i), and let c_B(x_i)(d_m) be the number of objects of class d_m in U', where d_m ∈ V_d. We adjust the different classes of the elements in U' to one class so as to minimize the misclassification cost of U'. There are two cases.

1. U' ⊆ POS_B({d}) if and only if d(x) = d(y) for any x, y ∈ U'. In this case, the misclassification cost of U' is mc(U', B) = 0, and for any x ∈ U' the assigned class is d'(x) = d(x).

2. If there exist x, y ∈ U' s.t. d(x) ≠ d(y), we adjust the different classes of the elements in U' to one class. Let d(x) be the m-class and d(y) be the n-class. We select the case that minimizes the misclassification cost of U':

mc(U', B) = \min(mc_{m,n} \times |U'_m|,\; mc_{n,m} \times |U'_n|).    (7)

For any x ∈ U',

d'(x) = \begin{cases} n\text{-class} & \text{if } mc(U', B) = mc_{m,n} \times |U'_m|, \\ m\text{-class} & \text{if } mc(U', B) = mc_{n,m} \times |U'_n|, \end{cases}    (8)

where mc_{m,n} is the cost of classifying an object of the m-class into the n-class, and |U'_m| is the number of m-class objects in U'.

Step 3. Compute the average misclassification cost. The class of x_i is d'(x_i) = d_m if and only if d_m maximizes c_B(x_i)(d_m) over d_m ∈ V_d. The misclassification cost of x_i is

mc^*(d(x_i), d'(x_i)) = \begin{cases} 0 & \text{if } d(x_i) = d'(x_i), \\ mc_{m,n} & \text{if } d(x_i) = m \text{ and } d'(x_i) = n, \\ mc_{n,m} & \text{if } d(x_i) = n \text{ and } d'(x_i) = m. \end{cases}    (9)

In this way, the average misclassification cost is given by

mc(U, B) = \frac{\sum_{x_i \in U} mc^*(d(x_i), d'(x_i))}{|U|}.    (10)
Step 4. Compute the average total cost (ATC). The average total cost is given by

ATC(U, B) = tc(B) + mc(U, B).    (11)

In this context, we select a feature subset so as to minimize the average total cost. The minimal average total cost is given by

ATC(U, R) = min{ATC(U, B') | B' ⊆ C}.    (12)
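The four steps above can be prototyped in a few lines. The sketch below is our own illustration under stated assumptions: objects are plain dictionaries, the label assigned to a neighborhood is the one minimizing its misclassification cost as in Step 2, and none of the helper names come from the paper or from Coser.

```python
# Hedged sketch of Steps 1-4: average total cost ATC(U, B) for a feature subset B.

def neighborhood(U, x_i, B, n):
    """Step 1: n_B(x_i) under the 2 * n(a) confidence bound of Definition 2."""
    return [x for x in U if all(abs(x[a] - x_i[a]) <= 2 * n[a] for a in B)]

def average_total_cost(U, B, n, test_cost, mc, classes):
    total_mc = 0.0
    for x_i in U:
        nb = neighborhood(U, x_i, B, n)
        # Step 2: assign the single label that minimizes the misclassification
        # cost inside this neighborhood, Eqs. (7)-(8).
        label = min(classes, key=lambda c: sum(mc[x["d"]][c] for x in nb))
        # Step 3: penalty for x_i itself, Eq. (9).
        total_mc += mc[x_i["d"]][label]
    # Step 4: Eq. (11), test cost plus average misclassification cost of Eq. (10).
    return sum(test_cost[a] for a in B) + total_mc / len(U)

# Illustrative call; the values are made up and are not those of the experiments.
U = [{"a1": 0.310, "d": 0}, {"a1": 0.315, "d": 0}, {"a1": 0.600, "d": 1}]
print(average_total_cost(U, ["a1"], {"a1": 0.0069}, {"a1": 28},
                         [[0, 800], [200, 0]], classes=[0, 1]))   # -> 28.0
```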
The MCFS problem has a similar motivation to the cheapest cost problem [24] and the minimal test cost reduct (MTR) problem (see, e.g., [1]). Compared with the MTR problem, however, our MCFS problem differs in two respects.

1. In addition to the test cost of each attribute, we take the misclassification costs into account. When the misclassification costs are very large compared with the test costs, the MCFS problem coincides with the MTR problem. Therefore the MCFS problem is a generalization of the MTR problem.

2. Attribute reduction needs to preserve a particular property of the decision system, whereas our feature selection relies only on the cost information.

4. Algorithm

In this section, a backtracking algorithm for the minimal cost feature selection (MCFS) problem is illustrated in Algorithm 1. To invoke the algorithm, one should initialize the global variables as follows: R = ∅ is the feature subset with minimal total cost, and cmc = mc(U, R) is the current minimal cost; one then uses the statement backtrack(R, 0). The feature subset with minimal total cost will be stored in R.

Generally, the search space of a feature selection algorithm is 2^{|C|}. In this context, a backtracking algorithm with pruning techniques is used to select a feature subset that minimizes the total cost. Three pruning techniques are employed in Algorithm 1.

1. Line 1 indicates that the variable i starts from l instead of 0. Whenever we move forward (see Line 13), the lower bound is increased. With this pruning technique, the solution space is 2^{|C|} instead of |C|!.

2. Lines 3 through 5 show the second pruning technique. Misclassification costs are non-negative in practical applications, so a feature subset B is discarded if its test cost alone already exceeds the current minimal cost (cmc). This technique prunes most branches.

3. Lines 6 through 8 indicate that if the new feature subset produces a non-decreasing total cost, the current branch will never produce the feature subset with minimal total cost, and it is also pruned.

5. Experiments

Experiments are undertaken on four data sets from the UCI Repository of Machine Learning Databases, as listed in Table 3. We undertake three groups of experiments from different viewpoints.
Algorithm 1 A backtracking algorithm for the MCFS problem
Input: (U, C, d, V, I, n, tc, mc), selected tests R, current test index lower bound l
Output: A feature subset R with minimal total cost and the minimal cost cmc; both are global variables
Method: backtracking
1: for (i = l; i < |C|; i++) do
2:   B = R ∪ {a_i}
3:   if (tc(B) > cmc) then
4:     continue; // Pruning for too expensive test costs
5:   end if
6:   if (ATC(U, B) ≥ ATC(U, R)) then
7:     continue; // Pruning for non-decreasing total cost
8:   end if
9:   if (ATC(U, B) < cmc) then
10:    cmc = ATC(U, B); // Update the minimal total cost
11:    R = B; // Update the feature subset with minimal total cost
12:  end if
13:  backtrack(B, i + 1); // Move forward; the lower bound increases
14: end for
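For readers who prefer running code, the search of Algorithm 1 can be transcribed as follows. This is our own sketch, not the Coser implementation: it assumes helpers tc(B) and atc(U, B) behaving like Eq. (5) and Eq. (11), and it keeps the paper's global variables in a small state dictionary.

```python
# A Python transcription of Algorithm 1 (a sketch under the assumptions above).

def backtrack(R, lower, C, U, tc, atc, state):
    """Explore supersets of R formed with tests whose index is >= lower."""
    for i in range(lower, len(C)):                  # pruning 1: indices only grow
        B = R | {C[i]}
        if tc(B) > state["cmc"]:                    # pruning 2: tests alone too expensive
            continue
        cost = atc(U, B)
        if cost >= atc(U, R):                       # pruning 3: non-decreasing total cost
            continue
        if cost < state["cmc"]:                     # update the best solution so far
            state["cmc"], state["best"] = cost, B
        backtrack(B, i + 1, C, U, tc, atc, state)   # move forward with a higher bound

# Typical invocation, mirroring the initialization R = {} and cmc = mc(U, {}):
# state = {"cmc": atc(U, set()), "best": set()}
# backtrack(set(), 0, attributes, U, tc, atc, state)
```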
Table 3: Data sets information.
No.   Name     Domain     |U|   |C|   D = {d}
1     Liver    clinic     345   6     selector
2     Credit   commerce   690   15    class
3     Iono     physics    351   34    class
4     Diab     clinic     768   8     class
From the viewpoint of different misclassification cost settings, we undertake two sets of experiments. First, we assume that the two misclassification costs are different from each other. Table 4 shows the optimal feature subset under different misclassification costs for the Diab data set; the ratio of the two misclassification costs is set to 10 in this experiment. As shown in the table, when the misclassification costs are very large compared with the test costs, the test cost increases and may even equal the average total cost. In this case, the MCFS problem coincides with the MTR problem. Second, the misclassification costs are identical for the different kinds of misclassification. Table 5 shows the optimal feature subset under a unified misclassification cost. Since the misclassification costs are small enough, the algorithm chooses a feature subset with minimal average total cost.

To evaluate the efficiency of the algorithm, we design two sets of experiments to assess the performance of the pruning techniques. In these experiments, the misclassification costs are identical for the different kinds of misclassification. In this case, we
Table 4: The optimal feature subset based on different misclassification costs.
MisCost1   MisCost2   Test costs   Total cost   Feature subset
120        1200       6.00         9.75         [1,2,3]
140        1400       6.00         10.38        [1,2,3]
160        1600       6.00         11.00        [1,2,3]
180        1800       6.00         11.63        [1,2,3]
200        2000       6.00         12.25        [1,2,3]
220        2200       10.00        12.58        [1,3,6]
240        2400       10.00        12.81        [1,3,6]
260        2600       13.00        13.00        [1,2,6]
Table 5: The optimal feature subset based on unified misclassification cost.
MisCost1   MisCost2   Test costs   Total cost   Feature subset
120        120        6.00         9.59         [1,2,3]
140        140        6.00         10.19        [1,2,3]
160        160        10.00        10.42        [0,1,2,3]
180        180        10.00        10.47        [0,1,2,3]
200        200        10.00        10.52        [0,1,2,3]
220        220        10.00        10.57        [0,1,2,3]
240        240        10.00        10.63        [0,1,2,3]
260        260        10.00        10.68        [0,1,2,3]
propose the Fast approach and the Slow approach. The Fast approach is the backtracking algorithm with all three pruning techniques, as given in Algorithm 1. The Slow approach is the same algorithm without the third pruning technique. The first set of experiments studies how the number of backtracking steps changes with the misclassification costs; Figure 1 shows the backtracking steps of the two approaches. The second set studies how the run time changes with the misclassification costs; Figure 2 shows the run time. From these two figures we observe the effectiveness of the third pruning technique.

From the cost viewpoint, the changes of the test costs and of the average minimal total cost are shown in Figure 3. In the real world, one would not select expensive tests when the misclassification costs are low, and Figure 3 shows this situation clearly.
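The kind of comparison behind Figures 1 and 2 can be reproduced in spirit with a small driver such as the one below. It is our own illustration rather than the Coser experiment code: run_search wraps the backtracking sketch given after Algorithm 1, the step counter and the hypothetical make_atc factory are assumptions, and only the idea of toggling the third pruning technique comes from the paper.

```python
# Illustrative driver: count backtracking steps and run time for the Fast
# (all three prunings) and Slow (without the third pruning) approaches.
import time

def run_search(U, C, tc, atc, use_third_pruning):
    state = {"cmc": atc(U, set()), "best": set(), "steps": 0}

    def backtrack(R, lower):
        for i in range(lower, len(C)):
            state["steps"] += 1
            B = R | {C[i]}
            if tc(B) > state["cmc"]:
                continue
            cost = atc(U, B)
            if use_third_pruning and cost >= atc(U, R):
                continue
            if cost < state["cmc"]:
                state["cmc"], state["best"] = cost, B
            backtrack(B, i + 1)

    start = time.time()
    backtrack(set(), 0)
    return state["steps"], time.time() - start

# Hypothetical loop over unified misclassification cost settings (10, 40, ..., 190):
# for m in range(10, 200, 30):
#     fast_steps, fast_time = run_search(U, C, tc, make_atc(m), use_third_pruning=True)
#     slow_steps, slow_time = run_search(U, C, tc, make_atc(m), use_third_pruning=False)
```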
[Line plots of backtracking steps versus misclassification cost setting for the Fast and Slow approaches on each data set.]
Figure 1: Backtracking steps: (a) Liver; (b) Credit; (c) Iono; (d) Diab
6. Conclusion and further works

In this paper, we take data with normal distribution measurement errors into account and study feature selection with minimal cost. This new feature selection problem, called minimal cost feature selection (MCFS), has a wide application area for two reasons. From the viewpoint of the data, the measurement errors considered here are ubiquitous. From the viewpoint of the minimal cost problem, the resources one can afford are often limited. In order to obtain the optimal result, a backtracking algorithm with three effective pruning techniques is designed for the MCFS problem. Experimental results show that the pruning techniques are effective. This work also serves as a benchmark for the heuristic algorithms that we will design in further work for large data sets.

Acknowledgment

This work is supported in part by the Fujian Province Foundation of Higher Education (No. JK2012028), the National Natural Science Foundation of China (No. 61170128), and the Natural Science Foundation of Fujian Province, China (Nos. 2011J01374, 2012J01294).
[Line plots of run time in milliseconds versus misclassification cost setting for the Fast and Slow approaches on each data set.]
Figure 2: Run time: (a) Liver; (b) Credit; (c) Iono; (d) Diab
References

[1] F. Min, H. P. He, Y. H. Qian, and W. Zhu, "Test-cost-sensitive attribute reduction," Information Sciences, vol. 181, pp. 4928–4942, 2011.

[2] S. Norton, "Generating better decision trees," in Proceedings of the Eleventh International Joint Conference on Artificial Intelligence, 1989, pp. 800–805.

[3] M. Núñez, "The use of background knowledge in decision tree induction," Machine Learning, vol. 6, no. 3, pp. 231–250, 1991.

[4] M. Tan, "Cost-sensitive learning of classification knowledge and its applications in robotics," Machine Learning, vol. 13, no. 1, pp. 7–33, 1993.

[5] P. Turney, "Cost-sensitive classification: Empirical evaluation of a hybrid genetic decision tree induction algorithm," Journal of Artificial Intelligence Research (JAIR), vol. 2, 1995.
[Line plots of test cost and average total cost versus misclassification cost setting for each data set.]
Figure 3: Test costs and average total cost: (a) Liver; (b) Credit; (c) Iono; (d) Diab
[6] Y. Weiss, Y. Elovici, and L. Rokach, "The CASH algorithm - cost-sensitive attribute selection using histograms," Information Sciences, 2011.

[7] E. B. Hunt, J. Marin, and P. J. Stone, Eds., Experiments in Induction. New York: Academic Press, 1966.

[8] C. Elkan, "The foundations of cost-sensitive learning," in Proceedings of the 17th International Joint Conference on Artificial Intelligence, 2001, pp. 973–978.

[9] Z. Zhou and X. Liu, "Training cost-sensitive neural networks with methods addressing the class imbalance problem," IEEE Transactions on Knowledge and Data Engineering, vol. 18, no. 1, pp. 63–77, 2006.

[10] F. Min and W. Zhu, "Minimal cost attribute reduction through backtracking," in Proceedings of International Conference on Database Theory and Application, ser. FGIT-DTA/BSBT, CCIS, vol. 258, 2011, pp. 100–107.
[11] F. Min, W. Zhu, H. Zhao, J. B. Liu, and Z. L. Xu, "Coser: Cost-sensitive rough sets, http://grc.fjzs.edu.cn/~fmin/coser/," 2012.

[12] H. Zhao, F. Min, and W. Zhu, "Inducing covering rough sets from error distribution," Journal of Information & Computational Science, 2012.

[13] T. Y. Lin, "Granular computing on binary relations - analysis of conflict and Chinese wall security policy," in Proceedings of Rough Sets and Current Trends in Computing, ser. LNAI, vol. 2475, 2002, pp. 296–299.

[14] A. Bargiela and W. Pedrycz, Granular Computing: An Introduction. Boston: Kluwer Academic Publishers, 2002.

[15] T. Y. Lin, "Granular computing - structures, representations, and applications," in Lecture Notes in Artificial Intelligence, vol. 2639, 2003, pp. 16–24.

[16] H. Zhao, F. Min, and W. Zhu, "Test-cost-sensitive attribute reduction based on neighborhood rough set," in Proceedings of the 2011 IEEE International Conference on Granular Computing, 2011, pp. 802–806.

[17] W. Zhu and F. Wang, "Reduction and axiomization of covering generalized rough sets," Information Sciences, vol. 152, no. 1, pp. 217–230, 2003.

[18] W. Zhu, "Generalized rough sets based on relations," Information Sciences, vol. 177, no. 22, pp. 4997–5011, 2007.

[19] L. Ma, "On some types of neighborhood-related covering rough sets," International Journal of Approximate Reasoning, 2012.

[20] J. Lan, M. Hu, E. Patuwo, and G. Zhang, "An investigation of neural network classifiers with unequal misclassification costs and group sizes," Decision Support Systems, vol. 48, no. 4, pp. 582–591, 2010.

[21] P. Turney, "Types of cost in inductive concept learning," in Proceedings of the ICML'2000 Workshop on Cost-Sensitive Learning, 2000, pp. 15–21.

[22] M. Kukar and I. Kononenko, "Cost-sensitive learning with neural networks," in Proceedings of the 13th European Conference on Artificial Intelligence (ECAI-98). Chichester: John Wiley & Sons, 1998, pp. 445–449.

[23] F. Min and W. Zhu, "A competition strategy to cost-sensitive decision trees," in Proceedings of Rough Set and Knowledge Technology, ser. LNAI, vol. 7414, 2012, pp. 359–368.

[24] R. Susmaga, "Computation of minimal cost reducts," in Foundations of Intelligent Systems, ser. LNCS, vol. 1609, Z. Ras and A. Skowron, Eds. Berlin / Heidelberg: Springer, 1999, pp. 448–456.