Minimal cost feature selection of data with normal distribution measurement errors

Hong Zhao, Fan Min∗, William Zhu
Lab of Granular Computing, Zhangzhou Normal University, Zhangzhou 363000, China

∗Corresponding author. Tel.: +86 133 7690 8359. Email addresses: [email protected] (Hong Zhao), [email protected] (Fan Min), [email protected] (William Zhu).
Abstract

Minimal cost feature selection is devoted to obtaining a trade-off between test costs and misclassification costs. This issue has recently been addressed on nominal data. In this paper, we consider numerical data with measurement errors and study minimal cost feature selection in this model. First, we build a data model with normal distribution measurement errors. Second, the neighborhood of each data item is constructed through the confidence interval. Compared with discretized intervals, neighborhoods preserve the information of the data more faithfully. Third, we define a new minimal total cost feature selection problem by considering the trade-off between test costs and misclassification costs. Fourth, we propose a backtracking algorithm with three effective pruning techniques to deal with this problem. The algorithm is tested on four UCI data sets. Experimental results indicate that the pruning techniques are effective, and that the algorithm is efficient for data sets with nearly one thousand objects.

Keywords: feature selection; normal distribution measurement errors; test costs; misclassification costs.
1. Introduction

Minimal cost feature selection is devoted to obtaining a trade-off between test costs and misclassification costs. Test costs are what we pay for collecting data items [1]. They are often measured in time, money, or other resources. When only test costs are considered (see, e.g., [2, 3, 4, 5]), the minimal cost feature selection problem degrades to the minimal test cost reduct problem [1]. Therefore, the minimal cost feature selection problem is a generalization of the minimal test cost reduct problem. In addition to test costs, misclassification costs need to be considered in cost-sensitive learning [6]. Misclassification cost is the penalty we
receive while deciding that an object belongs to class J when its real class is K [7, 8]. For example, in medical diagnosis, if cancer is regarded as the negative class and non-cancer as the positive class, misclassifying a patient in either direction incurs a penalty. When only misclassification costs are considered (see, e.g., [8, 9]), the average misclassification cost is a more general metric than the accuracy [10].

It is important to consider both test costs and misclassification costs in many applications [5]. Test costs are paid for every object, while misclassification costs are paid only for misclassified objects. Therefore we should take the total cost over a set of objects into account by considering the trade-off between test costs and misclassification costs. This issue has recently been addressed on nominal data. In real applications, however, data are often numerical and almost always carry measurement errors. Such errors are universal and unavoidable.

In this paper, we study minimal cost feature selection on numerical data with measurement errors. First, a data model with measurement errors under the normal distribution is defined. Second, we construct the neighborhood of each data item through the confidence interval. Compared with discretized intervals, neighborhoods preserve the information of the data more faithfully. Third, considering the trade-off between test costs and misclassification costs, we define a new minimal total cost feature selection problem. Fourth, a backtracking algorithm with three effective pruning techniques is proposed to deal with this problem. Four open data sets from the UCI (University of California - Irvine) library are employed to study the efficiency and effectiveness of our algorithm. Experiments are undertaken with the open source software Coser [11] to validate the performance of the algorithm. Experimental results indicate that the pruning techniques are effective, and that the algorithm is efficient for data sets with nearly one thousand objects.

This paper is organized as follows. Section 2 introduces a decision system with normal distribution measurement errors. We then introduce test costs and misclassification costs to this decision system and define a cost-sensitive decision system. In Section 3, we define a new minimal cost feature selection problem by considering test costs and misclassification costs. A full description of a backtracking algorithm for this problem is given in Section 4. Experimental settings and results are discussed in Section 5. Finally, Section 6 concludes and suggests further research trends.

2. Data models

In this section, the concept of decision systems with normal distribution measurement errors (NDME) is revisited. Then the neighborhood of each data item is constructed through the confidence interval. Finally, a decision system based on NDME with test costs and misclassification costs is presented.
Definition 1. [12] A decision system with normal distribution measurement errors (NEDS) S is the 6-tuple:

S = (U, C, d, V = {V_a | a ∈ C ∪ {d}}, I = {I_a | a ∈ C ∪ {d}}, n),    (1)
where U is a nonempty set called the universe, C is the nonempty set of conditional attributes, d is the decision attribute, V_a is the set of values for each a ∈ C ∪ {d}, and I_a : U → V_a is an information function for each a ∈ C ∪ {d}. n : C → R+ ∪ {0} gives the maximal measurement error of each a ∈ C, and ±n(a) are the confidence limits of a, namely the lower and upper boundaries of its confidence interval.

In real applications, there are a number of measurement methods to obtain a numerical data item, each with a different measurement error. Measurement errors often follow a normal distribution, which applies over almost the whole of science and engineering measurement. We therefore introduce the confidence interval of the normal distribution into our model. With Definition 1, a new neighborhood is defined as follows.

Definition 2. [12] Let S = (U, C, d, V, I, n) be a NEDS. Given x_i ∈ U and B ⊆ C, the neighborhood of x_i with respect to normal distribution measurement errors on test set B is defined as

n_B(x_i) = {x ∈ U | ∀a ∈ B, |a(x) − a(x_i)| ≤ 2n(a)},    (2)
where the error value of each a lies in [−n(a), +n(a)]. From Definition 2 we know that

n_B(x_i) = \bigcap_{a \in B} n_{\{a\}}(x_i).    (3)

That is, the neighborhood n_B(x_i) is the intersection of a number of basic neighborhoods (see, e.g., [13, 14, 15, 16]). For any x ∈ U and any a ∈ B, we have x ∈ n_B(x). Therefore, for any B ⊆ C, ⋃_{x∈U} n_B(x) = U. Hence the set {n_B(x_i) | x_i ∈ U} is a covering (see, e.g., [17, 18, 19]) of U.
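To make Definition 2 and Eq. (3) concrete, the following small Python sketch (our own illustration, not code from the paper; the function name, the dictionary-based data layout, and the toy values are all assumptions) computes a neighborhood from the 2n(a) bound and checks the intersection property.

```python
# A minimal sketch of Definition 2; all identifiers and data are illustrative.

def neighborhood(U, x_i, B, n):
    """Return n_B(x_i): all objects whose value on every test a in B differs
    from that of x_i by at most 2 * n(a)."""
    return [x for x in U
            if all(abs(x[a] - x_i[a]) <= 2 * n[a] for a in B)]

# Toy data in the spirit of Tables 1 and 2 (values made up for illustration).
U = [{"a1": 0.310, "a2": 0.230},
     {"a1": 0.320, "a2": 0.240},
     {"a1": 0.600, "a2": 0.460}]
n = {"a1": 0.0069, "a2": 0.0087}
B = ["a1", "a2"]

print(neighborhood(U, U[0], B, n))   # the first two objects are mutual neighbors

# Eq. (3): the neighborhood on B is the intersection of the single-test neighborhoods.
nb = {id(x) for x in neighborhood(U, U[0], B, n)}
assert nb == set.intersection(*({id(x) for x in neighborhood(U, U[0], [a], n)} for a in B))
```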
In cost-sensitive learning, test costs and misclassification costs are the two most important types of costs. We now define a cost-sensitive decision system by considering both of them.

Definition 3. A decision system based on NDME with test costs and misclassification costs (NEDS-TM) S is the 8-tuple:

S = (U, C, d, V, I, n, tc, mc),    (4)
where U, C, d, V, I and n have the same meanings as in Definition 1, tc : C → R+ ∪ {0} is the test cost function, and mc : k × k → R+ ∪ {0} is the misclassification cost function, where k = |V_d| is the number of decision classes.
For any B ⊆ C, the sequence-independent test cost function tc is defined as

tc(B) = \sum_{a \in B} tc(a).    (5)
The misclassification cost function can be represented by a matrix mc = {mc_{k×k}}. Misclassification cost [20, 21, 22] is the penalty we receive when deciding that an object belongs to class m while its real class is n [7, 23]. If the classification is correct, the misclassification cost is mc[m, m] = 0. The following example gives an intuitive understanding.

Table 1: An example numerical value attribute decision table.
x     a1     a2     a3     d
x1    0.31   0.23   0.08   y
x2    0.14   0.38   0.23   y
x3    0.25   0.40   0.40   y
x4    0.60   0.46   0.51   n
x5    0.41   0.64   0.62   n
x6    0.35   0.50   0.75   n
Table 2: A neighborhood boundaries vector and a test costs vector.
a     n(a)      c(a)
a1    0.0069    $28
a2    0.0087    $19
a3    0.0086    $56

Example 4. A NEDS-TM is given by Tables 1 and 2, and

mc = \begin{pmatrix} 0 & 800 \\ 200 & 0 \end{pmatrix}.    (6)
That is, the test costs are $28, $19 and $56, respectively. In this data set, the decision attribute splits the objects into two classes. If a person belonging to class "y" is misclassified into class "n", a penalty of $800 is paid; conversely, a penalty of $200 is paid.
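For intuition, Eq. (5) and the cost matrix of Example 4 can be encoded as follows. This is a sketch in our own notation (a dictionary of test costs and a nested list for mc), not code from the paper.

```python
# Test costs from Table 2 and the misclassification cost matrix of Eq. (6);
# rows index the real class, columns the predicted class, with 0 = "y" and 1 = "n".
test_cost = {"a1": 28, "a2": 19, "a3": 56}
mc = [[0, 800],
      [200, 0]]

def tc(B):
    """Sequence-independent test cost of a feature subset B, as in Eq. (5)."""
    return sum(test_cost[a] for a in B)

print(tc({"a1", "a2"}))     # 47: cost of performing tests a1 and a2
print(mc[0][1], mc[1][0])   # 800 ("y" classified as "n"), 200 ("n" classified as "y")
```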
3. Minimal cost feature selection problem

In this work, we focus on selecting a subset of features to minimize the total cost, that is, cost-sensitive feature selection based on test costs and misclassification costs. The problem of finding such a subset of features is called the minimal cost feature selection problem.

Problem 5. The minimal cost feature selection problem (MCFS).
Input: S = (U, C, d, V, I, n, tc, mc);
Output: R ⊆ C and the average total cost (ATC);
Optimization objective: min ATC(R).

Compared with the classical minimal reduction problem, there are several differences. The first is the input: the external information consists of the test costs and misclassification costs as well as the normal distribution measurement errors. The second is the optimization objective, which is to minimize the total cost instead of the number of features.

In data mining applications, the average total cost (ATC) is considered a more general metric than the accuracy [9]. The average total cost is computed as follows.

Step 1. Compute the neighborhood of each data item. Let B be a selected feature set, and let N(B) be the set {n_B(x_i) | x_i ∈ U}.

Step 2. Assign one class. Let U' = n_B(x_i), and let c_B(x_i)(d_m) be the number of objects of class d_m in U', where d_m ∈ V_d. We adjust the different classes of the elements in U' to one class so as to minimize the misclassification cost of U'. There are two cases.

1. U' ⊆ POS_B({d}) if and only if d(x) = d(y) for any x, y ∈ U'. In this case, the misclassification cost of U' is mc(U', B) = 0, and for any x ∈ U' the assigned class is d'(x) = d(x).

2. If there exist x, y ∈ U' s.t. d(x) ≠ d(y), we adjust the different classes of the elements in U' to one class. Let d(x) be the m-class and d(y) be the n-class. We select the case that minimizes the misclassification cost of U':

mc(U', B) = \min(mc_{m,n} \times |U'_m|,\; mc_{n,m} \times |U'_n|).    (7)

For any x ∈ U',

d'(x) = \begin{cases} n\text{-class} & \text{if } mc(U', B) = mc_{m,n} \times |U'_m|, \\ m\text{-class} & \text{if } mc(U', B) = mc_{n,m} \times |U'_n|, \end{cases}    (8)

where mc_{m,n} is the cost of classifying an object of the m-class into the n-class, and |U'_m| is the number of m-class objects in U'.

Step 3. Compute the average misclassification cost. The class of x_i is d'(x_i) = d_m if and only if d_m maximizes c_B(x_i)(d_m) over d_m ∈ V_d. The misclassification cost of x_i is

mc^*(d(x_i), d'(x_i)) = \begin{cases} 0 & \text{if } d(x_i) = d'(x_i), \\ mc_{m,n} & \text{if } d(x_i) = m \text{ and } d'(x_i) = n, \\ mc_{n,m} & \text{if } d(x_i) = n \text{ and } d'(x_i) = m. \end{cases}    (9)

In this way, the average misclassification cost is given by

mc(U, B) = \frac{\sum_{x_i \in U} mc^*(d(x_i), d'(x_i))}{|U|}.    (10)
Step 4. Compute the average total cost (ATC). The average total cost is given by

ATC(U, B) = tc(B) + mc(U, B).    (11)

In this context, we select a feature subset so as to minimize the average total cost. The minimal average total cost is given by

ATC(U, R) = min{ATC(U, B') | B' ⊆ C}.    (12)
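The four steps above can be prototyped in a few lines. The sketch below is our own illustration under stated assumptions: objects are plain dictionaries, the label assigned to a neighborhood is the one minimizing its misclassification cost as in Step 2, and none of the helper names come from the paper or from Coser.

```python
# Hedged sketch of Steps 1-4: average total cost ATC(U, B) for a feature subset B.

def neighborhood(U, x_i, B, n):
    """Step 1: n_B(x_i) under the 2 * n(a) confidence bound of Definition 2."""
    return [x for x in U if all(abs(x[a] - x_i[a]) <= 2 * n[a] for a in B)]

def average_total_cost(U, B, n, test_cost, mc, classes):
    total_mc = 0.0
    for x_i in U:
        nb = neighborhood(U, x_i, B, n)
        # Step 2: assign the single label that minimizes the misclassification
        # cost inside this neighborhood, Eqs. (7)-(8).
        label = min(classes, key=lambda c: sum(mc[x["d"]][c] for x in nb))
        # Step 3: penalty for x_i itself, Eq. (9).
        total_mc += mc[x_i["d"]][label]
    # Step 4: Eq. (11), test cost plus average misclassification cost of Eq. (10).
    return sum(test_cost[a] for a in B) + total_mc / len(U)

# Illustrative call; the values are made up and are not those of the experiments.
U = [{"a1": 0.310, "d": 0}, {"a1": 0.315, "d": 0}, {"a1": 0.600, "d": 1}]
print(average_total_cost(U, ["a1"], {"a1": 0.0069}, {"a1": 28},
                         [[0, 800], [200, 0]], classes=[0, 1]))   # -> 28.0
```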
The MCFS problem has a similar motivation to the cheapest cost problem [24] and the minimal test cost reduct (MTR) problem (see, e.g., [1]). Compared with the MTR problem, however, our MCFS problem differs in two respects.

1. In addition to the test cost of each attribute, we take the misclassification costs into account. When the misclassification costs are very large compared with the test costs, the MCFS problem coincides with the MTR problem. Therefore the MCFS problem is a generalization of the MTR problem.

2. Attribute reduction needs to preserve a particular property of the decision system, whereas our feature selection relies only on the cost information.

4. Algorithm

In this section, a backtracking algorithm for the minimal cost feature selection (MCFS) problem is illustrated in Algorithm 1. To invoke the algorithm, one should initialize the global variables as follows: R = ∅ is the feature subset with minimal total cost, and cmc = mc(U, R) is the current minimal cost; one then uses the statement backtrack(R, 0). The feature subset with minimal total cost will be stored in R.

Generally, the search space of a feature selection algorithm is 2^{|C|}. In this context, a backtracking algorithm with pruning techniques is used to select a feature subset that minimizes the total cost. Three pruning techniques are employed in Algorithm 1.

1. Line 1 indicates that the variable i starts from l instead of 0. Whenever we move forward (see Line 13), the lower bound is increased. With this pruning technique, the solution space is 2^{|C|} instead of |C|!.

2. Lines 3 through 5 show the second pruning technique. Misclassification costs are non-negative in practical applications, so a feature subset B is discarded if its test cost alone already exceeds the current minimal cost (cmc). This technique prunes most branches.

3. Lines 6 through 8 indicate that if the new feature subset produces a non-decreasing total cost, the current branch will never produce the feature subset with minimal total cost, and it is also pruned.

5. Experiments

Experiments are undertaken on four data sets from the UCI Repository of Machine Learning Databases, as listed in Table 3. We undertake three groups of experiments from different viewpoints.
Algorithm 1 A backtracking algorithm for the MCFS problem
Input: (U, C, d, V, I, n, tc, mc), selected tests R, current test index lower bound l
Output: A feature subset R with minimal total cost and the minimal cost cmc; both are global variables
Method: backtracking
1: for (i = l; i < |C|; i++) do
2:   B = R ∪ {a_i}
3:   if (tc(B) > cmc) then
4:     continue; // Pruning for too expensive test costs
5:   end if
6:   if (ATC(U, B) ≥ ATC(U, R)) then
7:     continue; // Pruning for non-decreasing total cost
8:   end if
9:   if (ATC(U, B) < cmc) then
10:    cmc = ATC(U, B); // Update the minimal total cost
11:    R = B; // Update the feature subset with minimal total cost
12:  end if
13:  backtrack(B, i + 1); // Move forward; the lower bound increases
14: end for
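For readers who prefer running code, the search of Algorithm 1 can be transcribed as follows. This is our own sketch, not the Coser implementation: it assumes helpers tc(B) and atc(U, B) behaving like Eq. (5) and Eq. (11), and it keeps the paper's global variables in a small state dictionary.

```python
# A Python transcription of Algorithm 1 (a sketch under the assumptions above).

def backtrack(R, lower, C, U, tc, atc, state):
    """Explore supersets of R formed with tests whose index is >= lower."""
    for i in range(lower, len(C)):                  # pruning 1: indices only grow
        B = R | {C[i]}
        if tc(B) > state["cmc"]:                    # pruning 2: tests alone too expensive
            continue
        cost = atc(U, B)
        if cost >= atc(U, R):                       # pruning 3: non-decreasing total cost
            continue
        if cost < state["cmc"]:                     # update the best solution so far
            state["cmc"], state["best"] = cost, B
        backtrack(B, i + 1, C, U, tc, atc, state)   # move forward with a higher bound

# Typical invocation, mirroring the initialization R = {} and cmc = mc(U, {}):
# state = {"cmc": atc(U, set()), "best": set()}
# backtrack(set(), 0, attributes, U, tc, atc, state)
```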
Table 3: Data sets information.
No.   Name     Domain     |U|   |C|   D = {d}
1     Liver    clinic     345   6     selector
2     Credit   commerce   690   15    class
3     Iono     physics    351   34    class
4     Diab     clinic     768   8     class
From the viewpoint of different misclassification cost settings, we undertake two sets of experiments. First, we assume that the two misclassification costs are different from each other. Table 4 shows the optimal feature subset under different misclassification costs for the Diab data set; the ratio of the two misclassification costs is set to 10 in this experiment. As shown in the table, when the misclassification costs are very large compared with the test costs, the test cost increases and may even equal the average total cost. In this case, the MCFS problem coincides with the MTR problem. Second, the misclassification costs are identical for the different kinds of misclassification. Table 5 shows the optimal feature subset under a unified misclassification cost. Since the misclassification costs are small enough, the algorithm chooses a feature subset with minimal average total cost.

To evaluate the efficiency of the algorithm, we design two sets of experiments to assess the performance of the pruning techniques. In these experiments, the misclassification costs are identical for the different kinds of misclassification. In this case, we
Table 4: The optimal feature subset based on different misclassification costs.
MisCost1   MisCost2   Test costs   Total cost   Feature subset
120        1200       6.00         9.75         [1,2,3]
140        1400       6.00         10.38        [1,2,3]
160        1600       6.00         11.00        [1,2,3]
180        1800       6.00         11.63        [1,2,3]
200        2000       6.00         12.25        [1,2,3]
220        2200       10.00        12.58        [1,3,6]
240        2400       10.00        12.81        [1,3,6]
260        2600       13.00        13.00        [1,2,6]
Table 5: The optimal feature subset based on unified misclassification cost.
MisCost1   MisCost2   Test costs   Total cost   Feature subset
120        120        6.00         9.59         [1,2,3]
140        140        6.00         10.19        [1,2,3]
160        160        10.00        10.42        [0,1,2,3]
180        180        10.00        10.47        [0,1,2,3]
200        200        10.00        10.52        [0,1,2,3]
220        220        10.00        10.57        [0,1,2,3]
240        240        10.00        10.63        [0,1,2,3]
260        260        10.00        10.68        [0,1,2,3]
propose the Fast approach and the Slow approach. The Fast approach is the backtracking algorithm with all three pruning techniques, as given in Algorithm 1. The Slow approach is the same algorithm without the third pruning technique. The first set of experiments studies how the number of backtracking steps changes with the misclassification costs; Figure 1 shows the backtracking steps of the two approaches. The second set studies how the run time changes with the misclassification costs; Figure 2 shows the run time. From these two figures we observe the effectiveness of the third pruning technique.

From the cost viewpoint, the changes of the test costs and of the average minimal total cost are shown in Figure 3. In the real world, one would not select expensive tests when the misclassification costs are low, and Figure 3 shows this situation clearly.
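The kind of comparison behind Figures 1 and 2 can be reproduced in spirit with a small driver such as the one below. It is our own illustration rather than the Coser experiment code: run_search wraps the backtracking sketch given after Algorithm 1, the step counter and the hypothetical make_atc factory are assumptions, and only the idea of toggling the third pruning technique comes from the paper.

```python
# Illustrative driver: count backtracking steps and run time for the Fast
# (all three prunings) and Slow (without the third pruning) approaches.
import time

def run_search(U, C, tc, atc, use_third_pruning):
    state = {"cmc": atc(U, set()), "best": set(), "steps": 0}

    def backtrack(R, lower):
        for i in range(lower, len(C)):
            state["steps"] += 1
            B = R | {C[i]}
            if tc(B) > state["cmc"]:
                continue
            cost = atc(U, B)
            if use_third_pruning and cost >= atc(U, R):
                continue
            if cost < state["cmc"]:
                state["cmc"], state["best"] = cost, B
            backtrack(B, i + 1)

    start = time.time()
    backtrack(set(), 0)
    return state["steps"], time.time() - start

# Hypothetical loop over unified misclassification cost settings (10, 40, ..., 190):
# for m in range(10, 200, 30):
#     fast_steps, fast_time = run_search(U, C, tc, make_atc(m), use_third_pruning=True)
#     slow_steps, slow_time = run_search(U, C, tc, make_atc(m), use_third_pruning=False)
```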
[Line plots of backtracking steps versus misclassification cost setting for the Fast and Slow approaches on each data set.]
Figure 1: Backtracking steps: (a) Liver; (b) Credit; (c) Iono; (d) Diab
6. Conclusion and further works

In this paper, we take data with normal distribution measurement errors into account and study feature selection with minimal cost. This new feature selection problem, called minimal cost feature selection (MCFS), has a wide application area for two reasons. From the viewpoint of the data, the measurement errors considered here are ubiquitous. From the viewpoint of the minimal cost problem, the resources one can afford are often limited. In order to obtain the optimal result, a backtracking algorithm with three effective pruning techniques is designed for the MCFS problem. Experimental results show that the pruning techniques are effective. This work also serves as a benchmark for the heuristic algorithms that we will design in further work for large data sets.

Acknowledgment

This work is supported in part by the Fujian Province Foundation of Higher Education (No. JK2012028), the National Natural Science Foundation of China (No. 61170128), and the Natural Science Foundation of Fujian Province, China (Nos. 2011J01374, 2012J01294).
[Line plots of run time in milliseconds versus misclassification cost setting for the Fast and Slow approaches on each data set.]
Figure 2: Run time: (a) Liver; (b) Credit; (c) Iono; (d) Diab
References

[1] F. Min, H. P. He, Y. H. Qian, and W. Zhu, "Test-cost-sensitive attribute reduction," Information Sciences, vol. 181, pp. 4928–4942, 2011.

[2] S. Norton, "Generating better decision trees," in Proceedings of the Eleventh International Joint Conference on Artificial Intelligence, 1989, pp. 800–805.

[3] M. Núñez, "The use of background knowledge in decision tree induction," Machine Learning, vol. 6, no. 3, pp. 231–250, 1991.

[4] M. Tan, "Cost-sensitive learning of classification knowledge and its applications in robotics," Machine Learning, vol. 13, no. 1, pp. 7–33, 1993.

[5] P. Turney, "Cost-sensitive classification: Empirical evaluation of a hybrid genetic decision tree induction algorithm," Journal of Artificial Intelligence Research (JAIR), vol. 2, 1995.
[Line plots of test cost and average total cost versus misclassification cost setting for each data set.]
Figure 3: Test costs and average total cost: (a) Liver; (b) Credit; (c) Iono; (d) Diab
[6] Y. Weiss, Y. Elovici, and L. Rokach, "The CASH algorithm - cost-sensitive attribute selection using histograms," Information Sciences, 2011.

[7] E. B. Hunt, J. Marin, and P. J. Stone, Eds., Experiments in Induction. New York: Academic Press, 1966.

[8] C. Elkan, "The foundations of cost-sensitive learning," in Proceedings of the 17th International Joint Conference on Artificial Intelligence, 2001, pp. 973–978.

[9] Z. Zhou and X. Liu, "Training cost-sensitive neural networks with methods addressing the class imbalance problem," IEEE Transactions on Knowledge and Data Engineering, vol. 18, no. 1, pp. 63–77, 2006.

[10] F. Min and W. Zhu, "Minimal cost attribute reduction through backtracking," in Proceedings of International Conference on Database Theory and Application, ser. FGIT-DTA/BSBT, CCIS, vol. 258, 2011, pp. 100–107.
[11] F. Min, W. Zhu, H. Zhao, J. B. Liu, and Z. L. Xu, "Coser: Cost-sensitive rough sets, http://grc.fjzs.edu.cn/~fmin/coser/," 2012.

[12] H. Zhao, F. Min, and W. Zhu, "Inducing covering rough sets from error distribution," Journal of Information & Computational Science, 2012.

[13] T. Y. Lin, "Granular computing on binary relations - analysis of conflict and Chinese wall security policy," in Proceedings of Rough Sets and Current Trends in Computing, ser. LNAI, vol. 2475, 2002, pp. 296–299.

[14] A. Bargiela and W. Pedrycz, Granular Computing: An Introduction. Boston: Kluwer Academic Publishers, 2002.

[15] T. Y. Lin, "Granular computing - structures, representations, and applications," in Lecture Notes in Artificial Intelligence, vol. 2639, 2003, pp. 16–24.

[16] H. Zhao, F. Min, and W. Zhu, "Test-cost-sensitive attribute reduction based on neighborhood rough set," in Proceedings of the 2011 IEEE International Conference on Granular Computing, 2011, pp. 802–806.

[17] W. Zhu and F. Wang, "Reduction and axiomization of covering generalized rough sets," Information Sciences, vol. 152, no. 1, pp. 217–230, 2003.

[18] W. Zhu, "Generalized rough sets based on relations," Information Sciences, vol. 177, no. 22, pp. 4997–5011, 2007.

[19] L. Ma, "On some types of neighborhood-related covering rough sets," International Journal of Approximate Reasoning, 2012.

[20] J. Lan, M. Hu, E. Patuwo, and G. Zhang, "An investigation of neural network classifiers with unequal misclassification costs and group sizes," Decision Support Systems, vol. 48, no. 4, pp. 582–591, 2010.

[21] P. Turney, "Types of cost in inductive concept learning," in Proceedings of the ICML'2000 Workshop on Cost-Sensitive Learning, 2000, pp. 15–21.

[22] M. Kukar and I. Kononenko, "Cost-sensitive learning with neural networks," in Proceedings of the 13th European Conference on Artificial Intelligence (ECAI-98). Chichester: John Wiley & Sons, 1998, pp. 445–449.

[23] F. Min and W. Zhu, "A competition strategy to cost-sensitive decision trees," in Proceedings of Rough Set and Knowledge Technology, ser. LNAI, vol. 7414, 2012, pp. 359–368.

[24] R. Susmaga, "Computation of minimal cost reducts," in Foundations of Intelligent Systems, ser. LNCS, vol. 1609, Z. Ras and A. Skowron, Eds. Berlin / Heidelberg: Springer, 1999, pp. 448–456.