Multi-objective Differential Evolution Algorithm for Multi-label Feature Selection in Classification

Yong Zhang(B), Dun-Wei Gong, and Miao Rong

School of Information and Electrical Engineering, China University of Mining and Technology, Xuzhou 221116, China
[email protected]
Abstract. Multi-label feature selection is in nature a multi-objective optimization problem with two conflicting objectives: the classification performance and the number of features. However, most existing approaches treat the task as a single-objective problem. In order to meet the different requirements of decision-makers in real-world applications, this paper presents an effective multi-objective differential evolution algorithm for multi-label feature selection. The proposed algorithm applies efficient non-dominated sort, the crowding distance, and the Pareto dominance relationship to differential evolution in order to find a Pareto solution set. The proposed algorithm was applied to several multi-label classification problems, and experimental results show that it obtains better performance than two conventional methods.

Keywords: Classification · Multi-label feature selection · Multi-objective · Differential evolution

1 Introduction
Multi-label classification is a challenging problem that emerges in many modern real-world applications [1,2]. By removing irrelevant or redundant features, feature selection can effectively reduce data dimensionality, speed up training, simplify the learned classifiers, and/or improve classification performance [3]. However, this problem has not received much attention yet. In the limited existing literature, the main approach is to convert a multi-label problem into a traditional single-label multi-class one, and then evaluate each feature with the transformed single-label approach [4-6]. This approach provides a connection between single-label learning and multi-label learning. However, since a newly created label may contain too many classes, it may increase the difficulty of learning and reduce the classification accuracy. Differential evolution (DE) has been applied to single-label feature selection [7,8] because of its population-based character and good global search capability. However, the use of DE for multi-label feature selection has not been investigated. Compared with single-label classification learning [9], multi-label feature selection is more difficult because there can
be complex interactions among features, and the labels are usually correlated. Furthermore, multi-label feature selection has two conflicting objectives: maximizing the classification performance and minimizing the number of features. Therefore, in this paper, we study an effective multi-objective approach for multi-label feature selection based on DE.
2 Problem Formulation
We use a binary string to represent a solution of the problem. Taking a data set with D features as an example, a solution can be represented as

X = (x_1, x_2, ..., x_D),  x_i ∈ {0, 1},  i = 1, 2, ..., D.      (1)

Selecting the Hamming loss [5] to evaluate the classification performance of the classifier induced by a feature subset, the multi-label feature selection problem is formulated as a combinatorial multi-objective optimization problem with discrete variables:

min F(X) = (Hloss(X), |X|),      (2)

where |X| denotes the number of selected features and Hloss(X) is the Hamming loss in terms of the feature subset X.
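To make the formulation concrete, the following sketch (in Python, not part of the original paper) shows how a candidate feature mask X could be scored against the two objectives of Eq. (2). The function name evaluate_subset and the stand-in loss function are hypothetical; in practice Hloss(X) would come from training and testing a multi-label classifier on the selected features.

```python
import numpy as np

def evaluate_subset(mask, hamming_loss_fn):
    """Return the two objectives of Eq. (2) for a binary feature mask X:
    the Hamming loss of the classifier built on the selected features,
    and the number of selected features |X|."""
    n_selected = int(np.sum(mask))
    if n_selected == 0:
        return 1.0, 0          # no feature selected: assign the worst possible loss
    return hamming_loss_fn(mask), n_selected

# Toy stand-in for training/testing a classifier on the masked features.
toy_loss = lambda mask: 0.5 / (1.0 + mask.sum())
print(evaluate_subset(np.array([1, 0, 1, 1, 0]), toy_loss))   # -> (0.125, 3)
```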
3 Proposed Algorithm

3.1 Encoding
In DE, an individual represents a possible solution of the optimized problem, so it is important to define a suitable encoding strategy first. This paper adopts the probability-based encoding strategy proposed in our previous work [10]. In this strategy, an individual is represented as a vector of probabilities,

P_i = (p_{i,1}, p_{i,2}, ..., p_{i,D}),  p_{i,j} ∈ [0, 1],      (3)
where a probability p_{i,j} > 0.5 means that the j-th feature is selected into the i-th feature subset.
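As an illustration (not from the paper), a minimal Python sketch of this decoding step follows; the helper name decode and the default threshold argument are assumptions based on the rule p_{i,j} > 0.5.

```python
import numpy as np

def decode(individual, threshold=0.5):
    """Map a probability-encoded individual P_i (Eq. 3) to a binary feature
    mask: feature j is selected when p_{i,j} exceeds the threshold."""
    return (np.asarray(individual) > threshold).astype(int)

print(decode([0.91, 0.12, 0.55, 0.49]))   # -> [1 0 1 0]
```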
3.2 Improved Randomized Localization Mutation
For the mutation operator in DE, the traditional approach chooses the base vector at random from among three randomly selected vectors [11]. This has an exploratory effect but slows down the convergence of DE. Randomized localization mutation (RLM) was first introduced for single-objective optimization in [12]. Since it achieves a good balance between global exploration and convergence,
this paper extends it to the multi-objective case by incorporating the Pareto dominance relationship. The improved mutation is described as follows:

V_i(t) = P_{i,best}(t) + F · (P_{r2}(t) − P_{r3}(t)),      (4)
where P_{i,best}(t) is the non-dominated vector among the three randomly chosen vectors, and P_{r2}(t) and P_{r3}(t) are the remaining two vectors.
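A minimal Python sketch of this operator is given below, assuming minimization of both objectives; the function names (dominates, rlm_mutation) and the clipping of mutant vectors back into [0, 1] are illustrative assumptions, not details from the paper.

```python
import numpy as np

def dominates(f_a, f_b):
    """Pareto dominance for minimization: f_a dominates f_b."""
    return all(a <= b for a, b in zip(f_a, f_b)) and any(a < b for a, b in zip(f_a, f_b))

def rlm_mutation(pop, objs, i, F, rng):
    """Improved randomized localization mutation (Eq. 4): among three randomly
    chosen individuals, a non-dominated one serves as the base vector and the
    remaining two form the difference vector."""
    candidates = [idx for idx in range(len(pop)) if idx != i]
    trio = list(rng.choice(candidates, size=3, replace=False))
    best = trio[0]
    for idx in trio[1:]:                       # pick a non-dominated member as base
        if dominates(objs[idx], objs[best]):
            best = idx
    r2, r3 = [idx for idx in trio if idx != best]
    v = pop[best] + F * (pop[r2] - pop[r3])
    return np.clip(v, 0.0, 1.0)                # keep probabilities within [0, 1]
```

Compared with the standard rand/1 mutation, this keeps the exploratory difference term but biases the base vector toward a non-dominated individual of the trio.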
3.3 Selection Based on Efficient Non-dominated Sort
For the selection operator, the fast non-dominated sorting (FNS) proposed in [14] has often been used to find Pareto-optimal individuals in DE. Efficient non-dominated sort (ENS) [13] is a newer, computationally efficient alternative: theoretical analysis shows that it has a space complexity of O(1), which is lower than that of FNS. Based on this advantage, this paper uses a variation of ENS, together with the crowding distance, to update the external archive. Suppose that the parent population at generation t is S_t and that the set of trial vectors produced by mutation and crossover is Q_t. First, all individuals in R_t = S_t ∪ Q_t are sorted into rank sets (fronts) by ENS. Here, a solution to be assigned to a Pareto front needs to be compared only with those that have already been assigned to a front, which avoids many unnecessary dominance comparisons. Individuals in the first rank set are the best ones in R_t. The new population is then filled from the rank sets in the order of their ranking. If the number of selected individuals would exceed the population size, individuals in the last admitted rank set with the smallest crowding distance values are deleted.
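This selection step can be sketched as follows (Python, illustrative only, reusing the dominates helper from the previous sketch). For brevity the sketch uses a naive non-dominated sort in place of ENS, which yields the same fronts with slower bookkeeping, together with the standard crowding-distance truncation; all function names are hypothetical.

```python
def sort_into_fronts(objs):
    """Naive non-dominated sort (stand-in for ENS; identical fronts, slower)."""
    fronts, remaining = [], list(range(len(objs)))
    while remaining:
        front = [i for i in remaining
                 if not any(dominates(objs[j], objs[i]) for j in remaining if j != i)]
        fronts.append(front)
        remaining = [i for i in remaining if i not in front]
    return fronts

def crowding_distance(objs, front):
    """Crowding distance of each member of a front."""
    dist = {i: 0.0 for i in front}
    for m in range(len(objs[0])):
        order = sorted(front, key=lambda i: objs[i][m])
        dist[order[0]] = dist[order[-1]] = float("inf")
        span = (objs[order[-1]][m] - objs[order[0]][m]) or 1.0
        for k in range(1, len(order) - 1):
            dist[order[k]] += (objs[order[k + 1]][m] - objs[order[k - 1]][m]) / span
    return dist

def environmental_selection(objs, pop_size):
    """Fill the next population front by front; truncate the last admitted
    front by discarding its members with the smallest crowding distance."""
    selected = []
    for front in sort_into_fronts(objs):
        if len(selected) + len(front) <= pop_size:
            selected.extend(front)
        else:
            dist = crowding_distance(objs, front)
            ordered = sorted(front, key=lambda i: dist[i], reverse=True)
            selected.extend(ordered[:pop_size - len(selected)])
            break
    return selected
```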
3.4 Implementation of the Proposed Algorithm
Based on the operators above and other established operators, the detailed steps of the proposed algorithm are as follows:

Step 1: Initialize. First, set the relevant parameters, including the population size N, the scale factor F, the crossover probability CR, and the maximal number of generations Tmax. Then, initialize the positions of the individuals in the search space.
Step 2: Apply the mutation proposed in Subsection 3.2.
Step 3: Apply the uniform crossover technique introduced in [15] to generate a trial vector, i.e., a new offspring.
Step 4: Select the new population. First, evaluate the fitness of each offspring using the method introduced in Subsection 3.1; then, combine the offspring with the parent population and generate the new population using the method proposed in Subsection 3.3.
Step 5: Check whether the algorithm meets the termination criterion. If yes, stop and output the individuals of the first rank as the final result; otherwise, go to Step 2.

Furthermore, Figure 1 shows the flowchart of the proposed multi-objective feature selection algorithm; a code-level sketch of these steps is given after the figure.
Fig. 1. Flowchart of the proposed DE-based multi-objective feature selection algorithm
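For reference, the sketch below assembles Steps 1-5 into a single Python loop, reusing the rlm_mutation, environmental_selection, and sort_into_fronts helpers sketched above; the function name, parameter defaults, and the forced-crossover index are illustrative assumptions rather than settings taken from the paper.

```python
import numpy as np

def mode_feature_selection(evaluate, n_features, pop_size=30, F=0.5, CR=0.9,
                           t_max=100, seed=0):
    """Skeleton of the proposed multi-objective DE. `evaluate` maps a binary
    feature mask to its objective tuple (Hamming loss, number of features)."""
    rng = np.random.default_rng(seed)
    pop = rng.random((pop_size, n_features))                  # Step 1: probability encoding
    objs = [evaluate((p > 0.5).astype(int)) for p in pop]
    for _ in range(t_max):
        trials = []
        for i in range(pop_size):
            v = rlm_mutation(pop, objs, i, F, rng)            # Step 2 (Sect. 3.2)
            cross = rng.random(n_features) < CR               # Step 3: uniform crossover
            cross[rng.integers(n_features)] = True            # take at least one gene from v
            trials.append(np.where(cross, v, pop[i]))
        combined = np.vstack([pop, np.array(trials)])         # Step 4: parents + offspring
        comb_objs = objs + [evaluate((u > 0.5).astype(int)) for u in trials]
        keep = environmental_selection(comb_objs, pop_size)   # Sect. 3.3
        pop, objs = combined[keep], [comb_objs[k] for k in keep]
    first_front = sort_into_fronts(objs)[0]                   # Step 5: rank-1 individuals
    return [(pop[i] > 0.5).astype(int) for i in first_front]
```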
3.5 Complexity Analysis
Since ENS and the crowding-distance computation both require O(MN log N) basic operations (where M is the number of objectives and N is the population size), the mutation operator requires O(3N) basic operations, and the crossover operator requires O(N) basic operations, the time complexity of the proposed algorithm per generation can be simplified to O(MN log N).
4 Experiments and Results
We compared the proposed algorithm with two conventional multi-label feature selection methods: ReliefF based on binary relevance (RF-BR) [4] and the mutual information approach based on pruned problem transformation (MI-PPT) [6].

Table 1. Data sets used in the experiments

Data set   Patterns  Features  Labels
Emotions   593       72        6
Yeast      2417      103       14
Scene      2407      294       6
Table 2. Best Hamming loss obtained by the two algorithms

            Proposed algorithm          RF-BR
Data set    HLoss    No. of features    HLoss    No. of features
Emotions    0.18     17.89              0.22     17.2
Yeast       0.193    39.29              0.24     40.58
Scene       0.087    140.24             0.12     56.3
Fig. 2. Solution sets obtained by the proposed algorithm and MI-PPT on the Emotions, Yeast, and Scene data sets (Hamming loss versus number of features)
For the proposed algorithm, we set the population size to 30 and the maximum number of iterations to 100. Due to its easy implementation and few parameters, the widely used ML-KNN [16] is selected as the classifier. Table 1 lists the data sets [17] employed in the experiments, which have been widely used in multi-label classification.

The first experiment is designed to evaluate the proposed algorithm's performance in finding the extreme optimal solution, i.e., the smallest Hamming loss value. Table 2 compares our best Hamming loss values with the existing results obtained by RF-BR. Although RF-BR clearly reduced the number of features, the proposed algorithm obtained a significant improvement in terms of the Hamming loss. Taking Emotions as an example, the Hamming loss of the proposed algorithm decreased from 0.22 (RF-BR) to 0.18.

The second experiment is designed to evaluate the parallel search capability of the proposed DE-based algorithm. Here we ran the proposed algorithm only once for each test problem. Due to its population-based character, the proposed algorithm simultaneously obtained a set of optimal solutions (i.e., feature subsets) for each test problem. In order to evaluate the overall quality of the solution set produced by the proposed algorithm, MI-PPT was run sequentially to find a set of solutions with the same numbers of features as those found by the proposed algorithm. Figure 2 shows the solution sets obtained by the proposed algorithm and MI-PPT. Clearly, the solutions of MI-PPT are dominated by those of the proposed algorithm in most cases. Taking Yeast as an example, the proposed algorithm has an inferior Hamming loss only when the number of features is equal to 15.
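To illustrate how a feature subset is scored in this experimental setting, the sketch below (not from the paper) uses scikit-learn's KNeighborsClassifier as a rough stand-in for ML-KNN and measures the Hamming loss on a hold-out split; the function name, split ratio, k value, and the random toy data are assumptions.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import hamming_loss
from sklearn.model_selection import train_test_split

def subset_hamming_loss(X, Y, mask, k=10, seed=0):
    """Train a k-NN multi-label classifier on the selected features only and
    return its Hamming loss on a held-out split (first objective of Eq. 2)."""
    X_sub = X[:, mask.astype(bool)]
    X_tr, X_te, Y_tr, Y_te = train_test_split(X_sub, Y, test_size=0.3, random_state=seed)
    clf = KNeighborsClassifier(n_neighbors=k).fit(X_tr, Y_tr)
    return hamming_loss(Y_te, clf.predict(X_te))

# Toy data standing in for a benchmark such as Emotions (593 x 72, 6 labels).
rng = np.random.default_rng(0)
X = rng.random((200, 72))
Y = (rng.random((200, 6)) > 0.7).astype(int)
mask = (rng.random(72) > 0.5).astype(int)
print(subset_hamming_loss(X, Y, mask))
```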
5 Conclusion
This paper proposed a new multi-objective multi-label feature selection technique based on DE. From the experiments, the following can be concluded: 1) the proposed algorithm performs well at finding the extreme optimal solution, i.e., the one with the smallest Hamming loss; 2) due to its population-based character, the proposed algorithm can simultaneously find a set of optimal solutions for a test problem, and these solutions have smaller Hamming loss than those of MI-PPT.

Acknowledgments. This work was jointly supported by the Fundamental Research Funds for the Central Universities (No. 2013XK09), the National Natural Science Foundation of China (No. 61473299), the China Postdoctoral Science Foundation funded project (Nos. 2012M521142, 2014T70557), and the Jiangsu Planned Projects for Postdoctoral Research Funds (No. 1301009B).
References

1. Schapire, R., Singer, Y.: BoosTexter: A boosting-based system for text categorization. Machine Learning 39, 135–168 (2000)
2. Sun, F.M., Tang, J.H., Li, H.J., Qi, G.J., Huang, S.: Multi-label image categorization with sparse factor representation. IEEE Transactions on Image Processing 23, 1028–1037 (2014)
3. Xue, B., Zhang, M.J., Browne, W.N.: Particle swarm optimization for feature selection in classification: a multi-objective approach. IEEE Transactions on Cybernetics 43, 1656–1671 (2013)
4. Spolaor, N., Alvares Cherman, E., Carolina Monard, M., Lee, H.D.: A comparison of multi-label feature selection methods using the problem transformation approach. Electronic Notes in Theoretical Computer Science 292, 135–151 (2013)
5. Chen, W., Yan, J., Zhang, B., Chen, Z., Yang, Q.: Document transformation for multi-label feature selection in text categorization. In: 7th IEEE International Conference on Data Mining, pp. 451–456. IEEE Press, Omaha, NE (2007)
6. Doquire, G., Verleysen, M.: Feature selection for multi-label classification problems. In: Cabestany, J., Rojas, I., Joya, G. (eds.) IWANN 2011, Part I. LNCS, vol. 6691, pp. 9–16. Springer, Heidelberg (2011)
7. Sikdar, U.K., Ekbal, A., Saha, S., Uryupina, O., Poesio, M.: Differential evolution-based feature selection technique for anaphora resolution. Soft Computing (2014). doi:10.1007/s00500-014-1397-3
8. Al-Ani, A., Alsukker, A., Khushaba, R.N.: Feature subset selection using differential evolution and a wheel based search strategy. Swarm and Evolutionary Computation 9, 15–26 (2013)
9. Unler, A., Murat, A.: A discrete particle swarm optimization method for feature selection in binary classification problems. European Journal of Operational Research 206, 528–539 (2010)
10. Zhang, Y., Gong, D.W.: Feature selection algorithm based on bare bones particle swarm optimization. Neurocomputing 148, 150–157 (2015)
11. Fieldsend, J., Everson, R.: Using unconstrained elite archives for multi-objective optimization. IEEE Transactions on Evolutionary Computation 7, 305–323 (2003)
12. Kumar, P., Pant, M.: Enhanced mutation strategy for differential evolution. In: IEEE World Congress on Computational Intelligence, pp. 1–6. IEEE Press, Brisbane (2012)
13. Zhang, X., Tian, Y., Cheng, R., Jin, Y.C.: An efficient approach to non-dominated sorting for evolutionary multi-objective optimization. IEEE Transactions on Evolutionary Computation (2015). doi:10.1109/TEVC.2014.2308305
14. Deb, K., Pratap, A., Agarwal, S., Meyarivan, T.: A fast and elitist multiobjective genetic algorithm: NSGA-II. IEEE Transactions on Evolutionary Computation 6, 182–197 (2002)
15. Engelbrecht, A.P.: Computational Intelligence: An Introduction, 2nd edn. John Wiley and Sons, West Sussex (2007)
16. Zhang, M.L., Zhou, Z.H.: ML-KNN: A lazy learning approach to multi-label learning. Pattern Recognition 40, 2038–2048 (2007)
17. A Java Library for Multi-Label Learning. http://mulan.sourceforge.net/datasets.html