Swarm and Evolutionary Computation 9 (2013) 15–26
Feature subset selection using differential evolution and a wheel based search strategy
Ahmed Al-Ani*, Akram Alsukker, Rami N. Khushaba
Faculty of Engineering and Information Technology, University of Technology, Sydney, Australia
Article history: Received 6 February 2012; Received in revised form 19 September 2012; Accepted 21 September 2012; Available online 5 October 2012.
Differential evolution has started to attract a lot of attention as a powerful search method and has been successfully applied to a variety of applications including pattern recognition. One of the most important tasks in many pattern recognition systems is to find an informative subset of features that can effectively represent the underlying problem. Specifically, a large number of features can affect the system's classification accuracy and learning time. In order to overcome such problems, we propose a new feature selection method that utilizes differential evolution in a novel manner to identify relevant feature subsets. The proposed method aims to reduce the search space using a simple, yet powerful, procedure that involves distributing the features among a set of wheels. Two versions of the method are presented. In the first one, the desired feature subset size is predefined by the user, while in the second the user only needs to set an upper limit to the feature subset size. Experiments on a number of datasets with different sizes proved that the proposed method can achieve remarkably good results when compared with some of the well-known feature selection methods. © 2012 Elsevier B.V. All rights reserved.
Keywords: Feature selection; Differential evolution; Wheel structure; Search strategy
1. Introduction

Feature selection has proved to be of increased importance to a wide range of applications such as classification of remote sensing images, gene expression analysis, text categorization, image recognition, and many others [1]. Researchers working in the fields of pattern recognition, machine learning and data mining have investigated the problem of feature selection. The reasons behind using feature selection methods include: reducing dimensionality, removing irrelevant and redundant features, reducing the amount of data needed for learning, improving methods' predictive accuracy, and increasing the constructed models' comprehensibility [2]. This is achieved by identifying features that offer complementary information to best discriminate between the target classes. The search for the optimal feature subset requires an evaluation measure to estimate the goodness of subsets and a search strategy to generate candidate feature subsets [3]. Evaluation measures are broadly divided into three categories: filters, wrappers and embedded methods. Filter methods use measures that are independent of the predetermined classification algorithms to estimate the goodness of candidate subsets. Wrapper methods estimate the
* Corresponding author. E-mail addresses: [email protected], [email protected] (A. Al-Ani), [email protected] (A. Alsukker), [email protected] (R.N. Khushaba).
2210-6502/$ - see front matter © 2012 Elsevier B.V. All rights reserved. http://dx.doi.org/10.1016/j.swevo.2012.09.003
goodness of candidate subsets using the classification accuracy obtained by feeding those subsets to the adopted classification algorithms. Embedded methods incorporate feature selection as part of the classifier training process. Wrapper and embedded methods are computationally more expensive than filters; however, they are usually more accurate. Searching for the optimal subset, which can achieve the best performance according to the defined evaluation measure, is quite a challenging task. The exhaustive search, which considers all possible subsets, is guaranteed to find the optimal solution. However, it is impractical to run, even with moderate-sized feature sets. A number of other search strategies that differ in their computational cost and optimality have been proposed in the literature. One of the early search strategies is branch and bound [4], which requires the evaluation function to be monotonic. This method can be computationally expensive for large datasets. Sequential search methods, such as sequential forward selection (SFS) and sequential backward elimination (SBE) [5], have been widely used because of their simplicity and relatively low computational cost. The major drawback of the traditional sequential search methods is the nesting effect, i.e., in backward search when a feature is deleted, it cannot be re-selected, and in forward search when a feature is selected, it cannot be deleted afterwards. A slightly more reliable sequential search method is the plus-l-minus-r (l−r), which considers removing features that were previously selected and selecting features that were previously eliminated [6]. Another trend of search strategies is the
stochastic search, where it has been found that including some randomness in the search process makes it less sensitive to the dataset [3], and hence helps avoid local minima. Some of the well-known stochastic methods used in feature selection are: simulated annealing [7,8], genetic algorithms (GA) [9–16], ant colony optimization (ACO) [17–20], particle swarm optimization (PSO) [21–23] and, recently, differential evolution (DE) [24]. Another stochastic evolutionary algorithm that has recently been applied to a classification problem is bacterial foraging optimization (BFO) [25]. A review of evolutionary optimization can be found in [26,27]. When considering the feature selection problem, many of the aforementioned methods perform well on certain datasets, but may fail to escape local minima when applied to huge datasets with a very large number of features. One possible cause of this limitation is that some of these methods lack the ability to explore and exploit the search space in a proper way. In addition, most of these methods do not attempt to narrow down the search space properly, which can be quite important when dealing with large feature sets. In this paper we propose a novel differential evolution algorithm for wrapper feature selection that uses a simple, yet effective, way to narrow down the search space without eliminating any feature. A number of datasets with different sizes will be used to evaluate the performance of the proposed method, which can give a good indication of its exploration and exploitation capabilities. The paper is organized as follows: Section 2 gives a brief description of different population-based feature selection algorithms. Section 3 describes the proposed DE-based feature selection method. Section 4 presents the experimental results, and a conclusion is given in Section 5.
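To make the sequential search baseline concrete, the following is an illustrative sketch of sequential forward selection (not taken from the paper); `evaluate` is a hypothetical wrapper fitness function that scores a candidate subset, higher being better.

```python
# Illustrative sketch (our own, not from the paper): greedy sequential
# forward selection (SFS).  Once a feature is added it is never removed,
# which is exactly the "nesting effect" discussed above.

def sfs(n_features, target_size, evaluate):
    selected = []
    remaining = list(range(n_features))
    while len(selected) < target_size:
        # pick the feature whose addition maximizes the subset score
        best = max(remaining, key=lambda f: evaluate(selected + [f]))
        selected.append(best)
        remaining.remove(best)
    return selected

# toy fitness: reward overlap with a hypothetical informative set {0, 3, 7}
informative = {0, 3, 7}
score = lambda subset: len(informative & set(subset)) - 0.01 * len(subset)
print(sorted(sfs(10, 3, score)))  # recovers the informative features
```

In a real wrapper, `evaluate` would train and score a classifier on the candidate subset, which is what makes SFS cheap in subset count but still potentially expensive per evaluation.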
2. Background

Different population-based search strategies that use stochastic operators have been explored for the feature selection problem. These are categorized into the following subsections.

2.1. Genetic algorithm

The simple genetic algorithm (SGA) [9] employs binary chromosomes to represent the presence/absence of features in subsets. For a particular subset, if a certain feature is represented by '1', then the feature is present in the subset, while '0' means it is not present. The length of the chromosome is equal to the total number of features. The crossover and mutation operators are used to search for the optimal feature subset. When using enough chromosomes, GA can produce good results. However, the main limitation of GA-based feature selection is that GA has a number of parameters that need to be handled properly to achieve a reasonably good performance [13]. Moreover, for large datasets with a limited number of informative features, the performance of the algorithm will be highly dependent on the initial population. For instance, if one of the features that form "the optimal" subset is not present in any subset of the initial population, then the chance of selecting this particular feature is very low. Many variations of GA-based feature selection algorithms have been introduced in the literature [10,11,15,16]. A constrained version of GA that limits the number of selected features to a predefined subset size was introduced in [13] and denoted as hybrid genetic algorithm (HGA). It uses an embedded local search to fine-tune the search by performing constrained crossover and mutation operations that produce subsets whose sizes are close, but not necessarily equal, to the desired number of features (DNF).
A local sequential search is then executed to make the size of all subsets equal to DNF. Two local search operators are introduced, ripple_add(r) and ripple_rem(r), where r is a constant that reflects the number of features to be added/removed. The ripple_add(r) operator adds r features, one at a time, to the current subset, followed by removing r − 1 features. The ripple_rem(r) operator, on the other hand, removes r features followed by adding r − 1 features. The addition/removal of features is implemented by examining all features that do not belong/belong to the current subset. This method was proved to give good results when used on datasets with a small number of features (less than 100) [13]. However, for larger datasets that consist of thousands of features, this method is computationally very expensive, as the number of subsets to be formed and evaluated increases exponentially with the number of features in the dataset.

2.2. Swarm intelligence algorithms

Swarm intelligence-based algorithms can be mainly divided into particle swarm optimization [21] and ant colony optimization [17]. For feature subset selection, the binary particle swarm optimization (BPSO) [22] algorithm employs a set of particles that adjust their own positions according to two fitness values: a local fitness value, also denoted as the personal best pbest, and a global fitness value gbest. An inertia weight that controls those two values is fine-tuned to enhance the exploration capability of PSO. However, it is stated in the literature that the performance of PSO degrades when the dimensionality of the problem is too large, and that PSO can easily get trapped in local minima [28]. An improved version of the binary particle swarm (IBPSO) was presented in [23]. This modified algorithm retires gbest after a certain number of iterations to avoid being trapped in a local optimum. However, there is no guarantee that this modification would produce better results.
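As a rough illustration of the BPSO update just described, the following sketch performs one velocity/position step for a bit-mask of selected features; the inertia and attraction parameter values (w, c1, c2) are our own illustrative assumptions, not values from [22].

```python
import random, math

# Illustrative one-step binary-PSO update (bit i = feature i selected).
# Parameter values are assumptions for the sketch, not taken from [22].

def bpso_step(position, velocity, pbest, gbest, w=0.7, c1=1.5, c2=1.5):
    new_pos, new_vel = [], []
    for x, v, p, g in zip(position, velocity, pbest, gbest):
        # inertia term plus attraction toward personal and global bests
        v = w * v + c1 * random.random() * (p - x) + c2 * random.random() * (g - x)
        # sigmoid maps the velocity to the probability of the bit being 1
        prob = 1.0 / (1.0 + math.exp(-v))
        new_pos.append(1 if random.random() < prob else 0)
        new_vel.append(v)
    return new_pos, new_vel

random.seed(0)
pos, vel = [0, 1, 0, 1], [0.0] * 4
pos, vel = bpso_step(pos, vel, pbest=[1, 1, 0, 0], gbest=[1, 0, 1, 0])
print(pos)  # a new candidate feature-subset bit mask
```

The sigmoid mapping is what turns the continuous PSO dynamics into a binary subset indicator, and it is also why velocities that grow large in magnitude make bits saturate, contributing to the premature convergence noted above.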
On the other hand, ACO based feature selection methods represent the features as nodes that are connected with links. The search for the optimal feature subset is implemented by employing a number of ants that traverse the graph (feature space). This is performed by utilizing previous knowledge, i.e., pheromone trails, and local importance that is measured with respect to the features (nodes) that have already been visited. The advantage associated with such a representation is that the pheromone laid by the ants while traversing the graph represents a global information sharing medium that can lead the ants to the vicinity of the best solution. Since solutions represented by the ants in the original ACO algorithm are constructed sequentially, an optimal performance is not guaranteed for the problem of feature selection. Thus, modifications to ACO-based feature selection need to be introduced to overcome this problem. Also, many ACO based feature selection algorithms employ some sort of prior estimation of features' importance. For large datasets with thousands of features, this property will increase the memory requirement of the algorithm. For instance, the algorithm presented in [18] requires the computation of mutual information between feature pairs, I(X_i; X_j), and between feature pairs and the target classes, I(C; {X_i, X_j}). One can easily notice that the computational cost and memory requirement increase exponentially with the dataset size. Hence, the algorithm may not be practical to run for very large datasets. More details about this algorithm can be found in [18].

2.3. Differential evolution

Differential evolution (DE) is a stochastic optimization method that has recently attracted increased attention [29]. It is a simple population-based optimizer that encodes all the parameters as
floating-point numbers and manipulates them with arithmetic operators [30]. Considering that we have NP members in the population, the first step in the DE optimization technique is to generate a D-dimensional real-valued parameter vector for each member, i.e., we will have a matrix of size NP × D [30]. The differential combination operator of element j that belongs to vector x_i, where i is a population index that ranges between 1 and NP, is implemented by adding the weighted difference between element j of two randomly selected population members m and n, x_{j,m} and x_{j,n}, to the value of a third random member of the population, x_{j,p}, in order to create a new mutant element v_{j,i}:

v_{j,i} = x_{j,p} + F (x_{j,m} − x_{j,n}),   (1)
where F ∈ (0,1) is a scale factor that controls the rate at which the population evolves. Parameters within vectors are indexed with j, which ranges between 1 and D. Extracting both distance and direction information from the population to generate random deviations results in an adaptive scheme that has good convergence properties. It is important to mention that not all elements of the new vector v_i are generated using Eq. (1), as DE also employs uniform crossover (discrete recombination). To increase exploration, each dimension of the new individual v_{j,i} is switched with the corresponding dimension of x_{j,i} with a uniform probability CO ∈ [0,1], and the new x^new_{j,i} is generated as

x^new_{j,i} = v_{j,i}  if rand(0,1) ≤ CO,
x^new_{j,i} = x_{j,i}  otherwise.   (2)

If the newly generated vector x^new_i results in a lower objective function value (better fitness) than the predetermined population vector x_i, then x^new_i replaces x_i [31]. There have been a number of modifications to the original DE algorithm to improve its performance. In particular, several alternatives to the differential combination operator have been proposed in the literature. These include, in addition to Eq. (1) [35]:

v_{j,i} = x_{j,l} + F (x_{j,m} − x_{j,n}),   (3)

v_{j,i} = x_{j,i} + F (x_{j,l} − x_{j,i}) + F (x_{j,m} − x_{j,n}),   (4)

v_{j,i} = x_{j,l} + F (x_{j,p} − x_{j,q}) + F (x_{j,m} − x_{j,n}),   (5)
where l is the member with the best fitness in the current generation, while m,n,p and q are randomly chosen members. The difference between these three equations and Eq. (1) is the utilization of the best member of the population, as this may lead to a faster convergence. Eq. (1) on the other hand gives higher emphasis to the exploration of the search space. Also, Eqs. (4) and (5) are in a way a combination of Eqs. (1) and (3). In addition to the above, a number of adaptive DE algorithms have been proposed in the literature. In [36], separate values of COi and Fi were assigned to each member of the population and adaptively optimized using fuzzy logic controllers. The values of COi and Fi were adapted in [37] using a simple probabilistic equation. In [38], the optimization of COi and Fi along with the population size was implemented through mutation and crossover in a way similar to that of the main vector. Qin et al. [39] proposed an algorithm that adapts both the differential combination operator (Eqs. (3)–(5)) and the optimization parameters COi and Fi by learning from their previous experiences in generating promising solutions. A memory is used to keep track of the success rates of the optimized variables. In [40] the optimization parameters COi and Fi were sampled from normal and Cauchy probability distributions. The mean values of these distributions were updated in each generation based on their historical records of success. In [41], the values of COi and Fi are adjusted based on
the fitness values of the population members with respect to that of the best member. Members whose fitness lies far from that of the best member are given large Fi values to keep on exploring, and hence to maintain adequate population diversity. Similarly, the values of COi vary based on the fitness values of the different population members, where the higher the fitness of a given vector, the more influence it will have on the new vector. Other DE versions include distributed and compact DE [42,43]. For an in-depth review of DE, the reader is referred to [35,44,45]. In terms of feature selection, to the authors' knowledge, only a few attempts have considered DE to search for optimal subsets. In [46], the parameters of an asymmetric subsethood product fuzzy neural network were optimized using DE. The network was reported to have feature selection capabilities when applied to a small synthetic dataset. A discrete binary DE was also presented in [47] that transforms numerical encoding to binary encoding depending on some probability measure. However, the method was only tested on very small datasets with a maximum of 56 features. A modified binary DE for feature selection, combined with a support vector machine classifier as a wrapper function, was presented in [48]. However, a significantly large number of iterations, reported as 4000 [48], was required to achieve good performance. One can notice that these methods are based on very minor modifications of the original DE algorithm with no attempt to narrow down the search space, and hence may not lead to fast convergence to global minima when applied to large search spaces. In comparison, the work presented in [24,49] employs statistical feature distribution measures to aid the DE optimization process in selecting the most promising subset of features, an approach that proved effective when applied to very large datasets with thousands of features.
A floating-point DE was utilized with a roulette wheel structure and a statistical repair mechanism to avoid duplications of features when rounding up the floating-point solutions of DE. However, the algorithm, which will be referred to as DEFSO, lacks the ability to discover the optimal feature subset size as its functionality is mainly limited to selecting feature subsets with a pre-defined cardinality. The next section presents the proposed DE-based feature selection algorithm, which will be abbreviated as DEFSW, and a comparison with DEFSO and other well-known population-based feature selection methods will be presented in the experimental section.
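As background for the method that follows, the classic DE operators of Eqs. (1) and (2) can be sketched as one generation of the standard "rand/1/bin" scheme; the F and CO values and the sphere test function are illustrative choices, not values from the paper.

```python
import random

# Minimal sketch of one classic DE ("rand/1/bin") generation, following
# Eqs. (1) and (2): mutate using a scaled difference of two random members
# added to a third, then uniformly cross over with the parent vector, and
# keep whichever of trial/parent has the lower objective value.

def de_generation(pop, fitness, F=0.5, CO=0.9):
    NP, D = len(pop), len(pop[0])
    new_pop = []
    for i, x in enumerate(pop):
        # three distinct members other than i: p (base), m, n (difference)
        m, n, p = random.sample([k for k in range(NP) if k != i], 3)
        trial = []
        for j in range(D):
            if random.random() <= CO:
                # differential combination, Eq. (1)
                trial.append(pop[p][j] + F * (pop[m][j] - pop[n][j]))
            else:
                # uniform crossover keeps the parent's value, Eq. (2)
                trial.append(x[j])
        # greedy selection between trial and parent
        new_pop.append(trial if fitness(trial) < fitness(x) else x)
    return new_pop

# usage: minimize the sphere function
sphere = lambda v: sum(t * t for t in v)
random.seed(1)
pop = [[random.uniform(-5, 5) for _ in range(3)] for _ in range(10)]
for _ in range(100):
    pop = de_generation(pop, sphere)
print(min(sphere(v) for v in pop))  # best fitness, near zero after 100 generations
```

Note that full DE implementations usually force at least one mutated dimension per trial vector; this sketch omits that detail for brevity.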
3. The proposed algorithm

3.1. Selection of subsets using a pre-defined subset size

The first version of the proposed algorithm requires the desired number of features (DNF), or the size of feature subsets, to be specified by the user. If NF refers to the total number of features in the feature set, then we assume that DNF ≤ NF/2. A number of wheels equal to DNF will be constructed, and the features will be shuffled and "equally" distributed among the wheels (some wheels may have one feature more than others, depending on the remainder of NF/DNF). To form a feature subset, one feature from each wheel is chosen. This ensures that a feature will not be duplicated in the same subset, and most importantly leads to a noticeable shrink in the search space. For example, if the size of the original feature set is NF = 62 and the desired subset size is DNF = 4, then four wheels will be constructed, two of which will have 16 features, while the other two will have 15 features, as shown in Fig. 1(a). Thus, NFw, which represents the number of features in each wheel, will be NFw = {16, 16, 15, 15}. The objective now is to search for the subset S = {f1, f2, f3, f4} that best discriminates between patterns according to their target
Fig. 1. Wheels constructed to select subsets of size 4 from the original set of 62 features.
classes. Note that each of the four features is selected from its corresponding wheel, i.e., f1 can only be selected from the features of wheel one. A DE-based feature selection algorithm is developed to search through the feature space of each wheel to produce new candidate subsets. It is important to mention that other optimization algorithms, such as GA, can also be utilized, however we will focus here on DE. A flowchart of the proposed search strategy is shown in Fig. 2. Like nearly all Evolutionary Algorithms (EAs), the proposed differential evolution feature selection algorithm is a population-based optimizer that attacks the starting point problem by sampling the objective function at multiple, randomly chosen initial points, where the number of points is equal to the population size (NP). Because we restricted the formation of subsets to the wheel structure, a re-shuffle of features among the wheels is important to explore new regions of the feature space and reduce the possibility of having two or more of the important features stuck in the same wheel. The re-shuffling will
be performed after searching the features of the current wheels for a specified number of iterations, i.e., after the feature space of the current wheels has been well explored. When re-shuffling the wheels, the algorithm fixes the features of the best k subsets in their corresponding wheels to ensure that the best subsets found so far still exist after the re-shuffling. The rest of the features are mixed and re-distributed among the wheels. The size of the population matrix is NP × DNF; it is formed using NP vectors x_i (1 ≤ i ≤ NP) that are randomly initialized, and the dimension of each vector is D = DNF. In the proposed algorithm, we use a real-number representation; for example, x_1 = {7.20, 12.48, 3.76, 7.68} and x_2 = {16.08, 3.89, 5.52, 12.90} represent possible vectors for two members of the population. In fact, x_{j,i} can take any value in the range [0.5, NFw(j) + 0.5), where NFw(j) is the number of features in wheel j; hence, when the number is rounded, it will produce an integer that ranges between 1 and NFw(j). To illustrate this point, rounding vector x_1 will produce the integer vector {7, 12, 4, 8}, which represents features {12, 22, 57, 11}
according to the wheels shown in Fig. 1(a). The detailed implementation steps of the algorithm are shown below.

1. Randomly distribute the features among the wheels and set g = 0, where g indicates the number of times the features have been re-distributed among the wheels.
2. Randomly generate the vectors of the first generation and produce the corresponding subset for each member of the population.
3. Evaluate the subsets and find the k best ones (a value of k = round(NP/10) was used here).
4. For each member of the population i:
   - For each wheel, j, determine whether to perform uniform crossover or differential combination by evaluating the following formula: rand(0,1) ≤ CO, where CO is the crossover probability as defined previously.
   - To implement differential combination, choose two members of the population, other than i. The first member is randomly chosen from the k best members, while the other is randomly chosen from the rest of the population. Let's refer to those two members as m and n. The value of v_{j,i} will be calculated according to the equation

     v_{j,i} = x_{j,i} + F (x_{j,m} − x_{j,n}).

   - Otherwise, to perform uniform crossover, assign x^new_{j,i} = x_{j,l}, where l is one randomly chosen member of the k best subsets defined in step 3 (selected using a roulette wheel approach).
   - Check the wheel boundaries as follows:

     v_{j,i} = v_{j,i} − NFw(j)  if v_{j,i} > NFw(j) + 0.5,
     v_{j,i} = v_{j,i} + NFw(j)  if v_{j,i} < 0.5.   (6)

   - Identify the features of the newly generated subset, then evaluate the subset.
   - If the newly generated subset achieved a lower fitness than the old one, then assign x_i = v_i; otherwise keep x_i unchanged.
5. Go to step 4 until a certain number of iterations (iter) is reached.
6. Fix the features of the k best subsets in their corresponding wheels. Randomly distribute the rest of the features among the D wheels.
7. Randomly assign real values to the NP − k vectors and evaluate the fitness of their corresponding subsets.
8. g = g + 1.
9. If g ≤ G, then go to step 4; otherwise stop, where G is the maximum allowed number of re-distributions of features. Accordingly, the total number of iterations (Titer) is calculated as Titer = G × iter.

Fig. 2. Flowchart of the proposed DEFS algorithm.
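The wheel mechanics described above, distributing features among DNF wheels, decoding a real-valued vector into a duplicate-free subset, and the boundary wrap of Eq. (6), can be sketched as follows; the function names are ours, and the round-robin dealing is one plausible way to realize the "equal" distribution.

```python
import random

# Sketch of the wheel mechanics (our naming, not the authors').  Features
# are shuffled and dealt into DNF wheels; a member's real-valued vector is
# decoded by rounding each entry to a slot index in its own wheel, so a
# subset can never contain duplicated features.

def build_wheels(n_features, dnf, rng):
    feats = list(range(n_features))
    rng.shuffle(feats)
    # deal round-robin so wheel sizes differ by at most one feature
    return [feats[j::dnf] for j in range(dnf)]

def wrap(v, size):
    # boundary check of Eq. (6): keep v inside [0.5, size + 0.5)
    if v >= size + 0.5:
        return v - size
    if v < 0.5:
        return v + size
    return v

def decode(vector, wheels):
    # round each coordinate to a 1-based slot index in its wheel
    return [w[round(wrap(v, len(w))) - 1] for v, w in zip(vector, wheels)]

rng = random.Random(42)
wheels = build_wheels(62, 4, rng)
print([len(w) for w in wheels])         # [16, 16, 15, 15], as in Fig. 1(a)
subset = decode([7.20, 12.48, 3.76, 7.68], wheels)
print(len(subset) == len(set(subset)))  # no duplicated features by construction
```

The re-shuffle step would then pin the features of the k best subsets in their current wheels and re-deal only the remaining features, which is what lets features initially stuck in the same wheel eventually co-occur in one subset.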
The use of both uniform crossover and differential combination operators enhances the search through proper exploration and exploitation of the restricted search space (according to the current wheel distribution). In fact, distributing the original features among a number of wheels can lead to a noticeable reduction in the search space. Considering the example discussed earlier, f1 is selected from the first wheel, which consists of 16 features, i.e., a search space that is remarkably smaller than the original feature space. This attribute becomes more useful as the size of the original feature set gets bigger. As mentioned earlier, the rationale behind the re-distribution of features among the wheels is to enhance the global exploration of the search space. It also enables the algorithm to deal with the case where more than one of the important features are stuck in the same wheel. For instance, consider that subset {4, 11, 14, 39} is the best subset of four features out of the original 62 features in our example shown in Fig. 1. We can notice that features 4, 11 and 39 were originally located in wheel 4; hence, they cannot be used together to form a subset with another feature (recall that for a given subset, only one feature is selected from each wheel). Let's presume that when the algorithm reached step 6, the three best subsets (assuming that k = 3 in this example) were: {12, 56, 9, 11}, {12, 56, 14, 60} and {61, 56, 37, 60}. These features were not allowed to move to another wheel; however, the remaining features were randomly distributed among the four wheels. This process would lead to the formation of new wheels, which are shown in Fig. 1(b). Note that features that formed the k best subsets are distinguished from the rest using a bold font. The algorithm would then use the new wheels and perform the search again to explore new regions of the search space. In such a case, it is possible now for features 4, 11 and 39 to be members of the same subset.
Hence, the optimal subset can be found using the new wheels when step 6 is reached, as shown in Fig. 1(c). In other words, the re-distribution of features helps in avoiding local minima, as it allows the exploration of new regions of the search space. Regarding the value of k, one may want to introduce upper and lower limits when dealing with small population sizes
(less than 15) or very large population sizes. It is important to mention that we have not attempted to optimize the DE parameters, as this work aims at validating the concept of the proposed search strategy. Moreover, utilizing an adaptive version of DE may increase the computational cost or memory requirements of the algorithm.

3.2. Selection of subsets using an upper subset size limit

For a number of problems, the user may not be able to specify the desired number of features to be selected; rather, he (or she) may only be able to provide an "upper limit". For instance, if NF is the number of features in the original feature set, then the user may choose an upper limit UL ≤ NF. For datasets containing a small or moderate number of features, one can choose UL = NF; however, for datasets containing thousands of features, UL may be assigned a value that is a fraction of NF. We present here an extension to the DEFSW algorithm to deal with this case. The algorithm searches for subsets of size less than or equal to UL. For the example presented in the previous section, we set UL = 6. The wheels are quite similar to those used in the previous section; the only difference is the use of an additional feature, which we call the "imaginary feature" or "feature number 0". Hence, the original 62 features are now distributed among six wheels. Each wheel would also include the "imaginary feature", as shown in Fig. 3(a). It can be seen that wheels 1 and 2 consist of 12 features, while the rest of the wheels consist of 11 features, where the imaginary feature is placed in the last slot of each wheel. If vector x_i = {5.37, 11.91, 2.33, 7.18, 11.27, 1.84}, then according to Fig. 3(a) the produced feature subset is: {21, 0, 33, 3, 0, 35}. We can notice that the imaginary feature is selected to represent wheels 2 and 5. This reduces the subset to {21, 33, 3, 35}, i.e., shrinks its size to 4.
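The imaginary-feature decoding can be sketched as follows; this is a self-contained illustration with our own names and a hypothetical random wheel layout, so the concrete features differ from Fig. 3(a), but the wheel sizes and the vector match the example above.

```python
import random

# Self-contained sketch of "imaginary feature" decoding (our naming).
# Each wheel gains one extra last slot, feature number 0; a coordinate
# that rounds into that slot contributes nothing, so subsets can be
# smaller than the upper limit UL.

def build_wheels(n_features, n_wheels, rng):
    feats = list(range(1, n_features + 1))  # real features numbered 1..NF
    rng.shuffle(feats)
    return [feats[j::n_wheels] for j in range(n_wheels)]

def decode(vector, wheels):
    subset = []
    for v, w in zip(vector, wheels):
        slots = w + [0]                     # imaginary feature in last slot
        size = len(slots)
        # boundary wrap, as in Eq. (6)
        v = v - size if v >= size + 0.5 else (v + size if v < 0.5 else v)
        f = slots[round(v) - 1]
        if f != 0:                          # imaginary => skip this wheel
            subset.append(f)
    return subset

rng = random.Random(7)
wheels = build_wheels(62, 6, rng)           # UL = 6, as in the example
vec = [5.37, 11.91, 2.33, 7.18, 11.27, 1.84]
print(len(decode(vec, wheels)))  # wheels 2 and 5 hit the imaginary slot -> 4
```

With this layout, wheels 1-2 hold 12 slots and wheels 3-6 hold 11, so coordinates 11.91 and 11.27 round into the last (imaginary) slot of their wheels, reproducing the shrink from 6 to 4 described above.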
The steps of the algorithm are quite similar to those described in the previous section. The main difference between the two versions is the inclusion of the imaginary feature and its impact on the wheels and the selected subsets. Let's presume that the three best subsets found by the algorithm when it reached step 6 were: {34, 0, 16, 0, 2, 24}, {34, 0, 16, 59, 2, 31} and {34, 0, 42, 59, 2, 24}, with sizes of 4, 5 and 5, respectively. This suggests that subsets of different sizes may be found by the different members of the population. Step 6 is replaced by the following:
Fix the imaginary feature and the features of the k best subsets in their corresponding wheels. Randomly distribute the rest of the features among the D wheels.

This revised step guarantees that each wheel will always have one imaginary feature. The new wheels are shown in Fig. 3(b). Finally, in order to encourage the selection of smaller subsets, a penalty may be added to the fitness function that is proportional to the size of the selected feature subset.

Fig. 3. Wheels constructed to select subsets of size 6 or less (UL = 6).
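One possible form of the size penalty just mentioned is sketched below; the linear form and the weight `lam` are our assumptions, as the paper does not specify the penalty's shape.

```python
# Hypothetical size-penalized fitness (our assumption, not specified in
# the paper): add a term proportional to the relative subset size, so
# that between equally accurate subsets the smaller one scores better
# (lower fitness = better, matching the minimization used above).

def penalized_error(error_rate, subset_size, n_features, lam=0.01):
    # error_rate in [0, 1]; penalty grows linearly with relative size
    return error_rate + lam * subset_size / n_features

small = penalized_error(0.10, 4, 62)
large = penalized_error(0.10, 6, 62)
print(small < large)  # equally accurate, so the smaller subset wins
```

The weight `lam` trades accuracy against parsimony: too large and the search collapses toward the imaginary features, too small and the penalty has no effect.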
4. Experimental results

Various experiments with different datasets were conducted to test the performance of the two versions of the proposed DEFSW, i.e., the constrained version (which searches for the best subset of a predefined size) and the non-constrained version (which automatically determines the optimal subset size, which will be smaller than or equal to the provided upper limit). The performance of the proposed DEFSW will be evaluated against the following feature selection methods:
- Simple genetic algorithm (SGA) [9]. Two versions are used: constrained and non-constrained. For the constrained version, modified crossover and mutation operators are used to make sure that the number of '1's (selected features) matches the predefined number of desired features.
- Hybrid genetic search algorithm (HGA) [13]. This algorithm was proposed to search for subsets of fixed sizes. It adds a local sequential search to the original GA algorithm, using ripple_add(r) and ripple_rem(r) (r is assigned a value of 2 for all experiments presented here). As mentioned earlier, this method is computationally very expensive for larger datasets, as the number of subsets to be formed and evaluated increases
with the number of features in the dataset. Hence, we restrict the number of subsets to be evaluated to a certain multiple of those of SGA.

Binary particle swarm optimization (BPSO) [22]. Both constrained and non-constrained versions of the algorithm were implemented. The constrained version was implemented according to the algorithm described in [32].

Improved binary particle swarm optimization (IBPSO) [23]. Similar to SGA and BPSO, constrained and non-constrained versions of the algorithm were implemented. The constrained version was implemented according to the algorithm described in [32].

An earlier version of differential evolution based feature selection proposed by the authors (DEFSO) [24]. Unlike DEFSW, DEFSO does not use wheels; rather, it places all features in a single list, and hence solutions may include duplicated features. To overcome this problem, the algorithm utilizes a roulette wheel weighting scheme to replace the duplicated features. Distribution factors, calculated using measures that include the contribution of features to the formation of "good" subsets and the overall usage of each feature, are used to estimate the probability of selecting replacement features.

The constrained ANT algorithm described in [18]. The local information is measured using the mutual information between pairs of features and the target classes, implemented using a histogram approach. As mentioned earlier, this property increases the computational cost and memory requirements of the algorithm.

A non-constrained genetic algorithm based wheel implementation (GAFSW). This algorithm is quite similar to DEFSW; however, instead of using the DE operators of differential combination and uniform crossover, the scattered crossover and mutation operators of GA are utilized.
All methods used the same population size, which was set to 50 in all experiments presented here, and the same number of iterations. In order to reduce the effect of the initial population, the initial feature subsets of DEFSW are saved, and the other methods are forced to start the search with these subsets. The averaged classification accuracy achieved by each method over 10 runs is used to evaluate its performance.

In the first experiment, the Madelon dataset from the UCI repository was chosen to test the performance of the different methods on datasets with a large degree of redundancy among the original features. It is a two-class classification problem with sparse binary input features. There are 500 features in this dataset, from
which only 5 are useful and the rest are either redundant or irrelevant. The original contributors of this dataset subdivided the data into a training set that consists of 2000 patterns and a validation set that contains the remaining 600 patterns (see [33] for more details). A kNN classifier with k = 3 was utilized to evaluate the fitness of candidate feature subsets. In the first part of the experiment, ANT, HGA and DEFSO, none of which has a non-constrained version, as well as the constrained versions of DEFSW, SGA, BPSO and IBPSO, were used to search for the best subset of five features. The maximum number of iterations was fixed to 560 for all methods, which represented the stopping criterion. For DEFSW, five wheels were used and features were re-distributed among the wheels eight times (G = 8), where each time the number of iterations (iter) was set to 70. The averaged classification accuracies achieved by these methods are illustrated using a box plot, as shown in Fig. 4(a). These results can be roughly categorized into two main groups. The first group includes DEFSW, DEFSO, ANT, HGA and IBPSO, with ANT showing better performance than all other methods. However, as mentioned earlier, the ANT algorithm requires the estimation and storage of the mutual information between each pair of features and the target classes, which increases its computational cost and memory requirement. HGA also achieved a relatively good performance with very low variance, which can be justified by the effect of the sequential local search, where for each run the algorithm usually converges to a similar solution. However, HGA is computationally far more expensive to run than all other methods, including ANT. This is due to its local search, which requires the evaluation of a new subset using the adopted classifier for each addition/removal of a feature to the current subset.
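As a concrete illustration of this wrapper-style evaluation, the fitness of a candidate subset can be computed as the validation accuracy of a kNN classifier restricted to that subset (k = 3, as above). This is a schematic stdlib-only sketch with toy data, not the paper's implementation:

```python
from collections import Counter

def knn_fitness(train, val, subset, k=3):
    """Wrapper fitness: validation accuracy of a kNN classifier that only
    sees the feature columns listed in `subset`.
    `train` and `val` are lists of (feature_vector, label) pairs."""
    def sq_dist(a, b):
        return sum((a[j] - b[j]) ** 2 for j in subset)
    correct = 0
    for x, y in val:
        nearest = sorted(train, key=lambda t: sq_dist(t[0], x))[:k]
        vote = Counter(label for _, label in nearest).most_common(1)[0][0]
        correct += (vote == y)
    return correct / len(val)

# Toy data: feature 0 is informative, feature 1 is noise
train = [([0.0, 9.0], 0), ([0.1, 1.0], 0), ([1.0, 5.0], 1), ([0.9, 2.0], 1)]
val = [([0.05, 3.0], 0), ([0.95, 8.0], 1)]
print(knn_fitness(train, val, subset=[0]))  # → 1.0 (informative feature only)
```

Selecting only the noisy column drops the fitness, which is exactly the signal the search strategies above exploit.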
Thus, we decided to restrict the number of subsets to be evaluated to 10 times the number of subsets evaluated by SGA, as otherwise the execution time for this and other datasets can be remarkably higher than that of the other methods. Accordingly, the performance of DEFSW, DEFSO and IBPSO can be considered very acceptable given their limited computational and memory requirements. On the other hand, the second group included SGA and BPSO, where both methods exhibited, on average, lower classification accuracies. The results also suggest that the performance of SGA and BPSO might be sensitive to the initial population when dealing with large feature sets with highly redundant and irrelevant features; thus both SGA and BPSO might easily get trapped in local minima. In the second part of the experiment, the non-constrained versions of SGA, BPSO, IBPSO, GAFSW and DEFSW were applied to the Madelon dataset to search for the best feature subset that can have any size less than 150 (upper limit). Fig. 4(b) shows the
Fig. 4. Performance of the constrained and non-constrained methods on the Madelon dataset. (a) Constrained (DNF = 5). (b) Non-constrained (UL = 150).
Table 1
Description of the datasets employed.

Dataset | No. of features | No. of classes | No. of samples
Lung | 325 | 7 | 73
Colon | 2000 | 2 | 62
Lymphoma | 4026 | 9 | 96
Leukemia1 | 5327 | 3 | 72
9_Tumors | 5726 | 9 | 60
Brain_Tumor1 | 5920 | 5 | 90
Brain_Tumor2 | 10,367 | 4 | 50
Prostate_Tumor | 10,509 | 2 | 102
Hill-Valley | 100 | 2 | 606
Gas Sensor Array Drift | 128 | 7 | 13,910
Musk (Ver. 1) | 166 | 2 | 476
Musk (Ver. 2) | 166 | 2 | 6598
Semeion Handwritten Digit | 256 | 10 | 1593
Location of CT Slices (a) | 385 | 10 | 53,500
ISOLET | 617 | 26 | 6238
Multiple Features | 649 | 10 | 2000
Gisette | 5000 | 2 | 7000
Arcene | 10,000 | 2 | 200
Amazon Commerce Reviews | 10,000 | 50 | 1500
Dexter | 20,000 | 2 | 300
Dorothea | 100,000 | 2 | 800

(a) This is originally a regression dataset. To use it here, 10 labels have been generated from the target variable.
averaged classification accuracies achieved by these methods. These results prove the effectiveness of the proposed DEFSW in comparison to the other methods when the desired number of features is not specified. GAFSW achieved the second best performance and is found to be noticeably better than SGA. The five methods selected subsets with different sizes: 83, 10, 11, 61 and 62 for SGA, BPSO, IBPSO, GAFSW and DEFSW, respectively. This indicates that BPSO and IBPSO selected remarkably smaller subsets than the other three algorithms, with SGA selecting the largest subsets. One can also notice that the non-constrained DEFSW achieved higher accuracy than its constrained counterpart, which can be attributed to its selection of larger subsets. In contrast, there is little difference in performance between the constrained and non-constrained versions of SGA, BPSO and IBPSO.

In the second experiment, datasets with different numbers of features, ranging from one hundred to one hundred thousand, are considered. The first three were obtained from http://research.janelia.org/peng/proj/mRMR/, the following five from http://www.gems-system.org and the rest from http://archive.ics.uci.edu/ml/index.html. Details of these datasets are given in Table 1. For datasets with more than 1000 samples, the training and testing portions are formed by allocating 1000 samples that are randomly selected in each run for training and the rest for testing. A 10-fold cross validation technique was used in each run for datasets with less than 1000 samples. In order to have different divisions of training and testing, the 10 folds are randomized at the start of each run. A multi-class linear SVM obtained from http://www.csie.ntu.edu.tw/cjlin/liblinear was adopted as the classification algorithm. Since the appropriate size of the most predictive feature subset is unknown, the desired number of features was varied between 5 and 50 features with a five-step increment.
In the first three datasets, the performance of some of the methods was close to 100% using only 25 features, and therefore there was no need to continue to 50 features. For the first eight datasets, all constrained methods were used to search for the most important feature subsets of predefined sizes. A predefined maximum number of iterations, set to 200, was used by all methods as the stopping criterion. In addition, all methods used the same fitness function, which was the classification accuracy on the validation set. The obtained results are shown in Fig. 5(a–h). It can be seen that for some datasets near perfect
performance was achieved by a number of methods. For the Leukemia1 dataset, it was indicated in [34] that it would be possible to achieve 100% accuracy if all features are ranked according to their degree of correlation with the target classes and a subset is formed using the first 50 features. Feature selection, on the other hand, enables us to achieve the same performance using fewer than 10 features, as shown in Fig. 5(d). These results can be categorized, according to the performance of the different methods, into two main categories: the first is occupied by DEFSW and DEFSO, while the second is occupied by SGA, BPSO and IBPSO. The performance of the ANT and HGA methods, on the other hand, fluctuates between these two categories. Despite the "on average" good performance of the ANT method in comparison to SGA, BPSO and IBPSO, its relatively high computational and memory requirements can represent a major drawback. Like the ANT method, HGA also showed a high degree of fluctuation in performance between the different datasets. As mentioned earlier, we restricted the number of subsets that were evaluated to 10 times the number of subsets evaluated by SGA (10X). This produced good results for some datasets. In fact, for the two datasets of Lung and Colon, which have smaller numbers of features than the other datasets, the performance of HGA was slightly better than that of the other methods. However, for other datasets, restricting the number of evaluated subsets to 10X produced results that were worse than those of the other methods, and hence we increased this number to 50X, but it was still not enough to achieve a performance comparable to that of DEFSW and DEFSO. The figures also show that for some datasets the performance of HGA degrades as DNF increases. This can be justified by the effect of the local search, which becomes less reliable when searching for larger subsets.
It is also very clear that for almost all datasets the performances of the BPSO and IBPSO methods are quite similar. The performance of SGA, on the other hand, is close to that of BPSO and IBPSO in some datasets and a bit worse in others. On average, DEFSW and DEFSO outperformed all other methods, with DEFSW showing more consistent performance across all datasets; one can notice the difference in performance between the two methods when applied to the 9_Tumors dataset. It is important to mention that DEFSO builds an estimation vector of probabilities and updates it within each iteration to replace duplicated features, but it is still not as computationally demanding as the ANT algorithm. On the other hand, none of DEFSW, SGA, BPSO and IBPSO requires any re-estimation of information contents. This gives DEFSW an advantage over both DEFSO and ANT, where, with reduced memory requirements, it almost always exhibited the best, or near the best, performance.

The non-constrained versions of SGA, BPSO, IBPSO, GAFSW and DEFSW were employed in order to investigate their performance when the desired number of features is not specified. An upper limit on the subset size was provided to all methods, which was set to UL = 150 for all considered datasets, apart from Hill-Valley, for which it was set to UL = NF = 100. A minor penalty term was introduced into the fitness function to favor the selection of smaller subsets. The maximum number of iterations was fixed to 600 for all methods. Table 2 presents the average classification accuracy of the competing methods (together with the estimated standard error, which is calculated by dividing the standard deviation of the accuracy by the square root of the number of runs). A two-tailed paired t-test was performed with a significance level of α = 0.05 between DEFSW and each of the other methods for each dataset.
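The per-dataset significance check can be sketched as follows: a paired t-test over the per-run accuracies, with the t statistic compared against the two-tailed 5% critical value for 9 degrees of freedom (2.262, for 10 runs). The accuracy lists below are illustrative numbers, not the paper's per-run results:

```python
import math

def paired_t_significant(a, b, t_crit=2.262):
    """Two-tailed paired t-test; t_crit = 2.262 is the 5% critical value
    for df = 9 (10 runs). Returns True if the mean difference between
    the two methods' per-run accuracies is significant."""
    d = [x - y for x, y in zip(a, b)]
    n = len(d)
    mean = sum(d) / n
    var = sum((x - mean) ** 2 for x in d) / (n - 1)
    se = math.sqrt(var / n)   # estimated standard error of the mean difference
    return abs(mean / se) > t_crit

defsw = [93.9, 93.7, 93.8, 94.0, 93.6, 93.9, 93.8, 93.7, 94.1, 93.7]
sga = [88.5, 88.1, 88.9, 88.0, 88.6, 88.4, 88.2, 88.7, 88.3, 88.5]
print(paired_t_significant(defsw, sga))  # → True
```

The same standard-error term (standard deviation divided by the square root of the number of runs) is the one reported alongside each accuracy in Table 2.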
Thus, for a given dataset, if there is a significant difference between DEFSW and another method, then a bullet is displayed next to the accuracy of that method. This table indicates that DEFSW is significantly better than the other methods for most of the datasets. Out of the 22 considered
Fig. 5. Average classification accuracies for different subset sizes. (a) Colon dataset results. (b) Lung dataset results. (c) Lymphoma dataset results. (d) Leukemia1 dataset results. (e) Brain_Tumor1 dataset results. (f) Brain_Tumor2 dataset results. (g) Prostate_Tumor dataset results. (h) 9_Tumors dataset results.
datasets, DEFSW is significantly better than SGA, BPSO, IBPSO and GAFSW in 16, 18, 19 and 13 datasets, respectively. Table 3 presents other comparison measures. The first row of the table represents the mean accuracy across all the datasets. Table 3 also presents pair-wise comparisons between the five feature selection methods according to their geometric mean accuracy ratio ($\bar{r}$), the win-tie-loss record (s) and the p-value of the sign test for the win-tie-loss measure (p). Below is a brief description of these measures:

The geometric accuracy ratio. For two methods that have classification accuracies $a_1, a_2, \ldots, a_n$ and $b_1, b_2, \ldots, b_n$, respectively
Table 2
Classification accuracy and estimated standard error for the considered feature selection methods.

Dataset | SGA | BPSO | IBPSO | GAFSW | DEFSW
Madelon | 88.42 ± 0.41 | 90.63 ± 0.14 | 90.85 ± 0.19 | 92.50 ± 0.22 | 93.82 ± 0.13
Lung | 99.86 ± 0.14 | 98.58 ± 0.36 | 98.99 ± 0.17 | 100.00 ± 0.00 | 100.00 ± 0.00
Colon | 100.00 ± 0.00 | 99.86 ± 0.14 | 99.55 ± 0.31 | 100.00 ± 0.00 | 100.00 ± 0.00
Lymphoma | 99.72 ± 0.14 | 99.71 ± 0.15 | 99.56 ± 0.15 | 100.00 ± 0.00 | 100.00 ± 0.00
Leukemia1 | 100.00 ± 0.00 | 100.00 ± 0.00 | 100.00 ± 0.00 | 100.00 ± 0.00 | 100.00 ± 0.00
9_Tumors | 97.07 ± 0.39 | 90.56 ± 0.97 | 91.81 ± 1.06 | 98.12 ± 0.45 | 100.00 ± 0.00
Brain_Tumor1 | 97.37 ± 0.33 | 97.07 ± 0.20 | 97.19 ± 0.27 | 98.22 ± 0.22 | 99.06 ± 0.24
Brain_Tumor2 | 99.67 ± 0.33 | 98.87 ± 0.41 | 98.72 ± 0.41 | 99.80 ± 0.20 | 100.00 ± 0.00
Prostate_Tumor | 99.21 ± 0.20 | 99.51 ± 0.16 | 99.42 ± 0.21 | 99.62 ± 0.16 | 99.90 ± 0.10
Hill-Valley | 81.37 ± 0.80 | 78.39 ± 0.52 | 78.27 ± 0.64 | 84.08 ± 0.61 | 84.26 ± 0.72
Gas Sensor | 94.77 ± 0.21 | 94.53 ± 0.24 | 94.47 ± 0.23 | 94.92 ± 0.18 | 94.95 ± 0.16
Musk_ver1 | 90.58 ± 0.18 | 86.96 ± 0.20 | 87.01 ± 0.18 | 91.82 ± 0.17 | 92.05 ± 0.13
Musk_ver2 | 94.68 ± 0.10 | 94.32 ± 0.09 | 94.33 ± 0.08 | 94.97 ± 0.07 | 95.07 ± 0.09
Semeion | 93.98 ± 0.39 | 91.42 ± 0.20 | 90.99 ± 0.21 | 95.82 ± 0.21 | 96.36 ± 0.26
CTSlices | 89.99 ± 0.08 | 88.46 ± 0.11 | 88.55 ± 0.10 | 89.79 ± 0.07 | 90.08 ± 0.06
ISOLET | 93.56 ± 0.07 | 91.70 ± 0.14 | 91.72 ± 0.11 | 94.12 ± 0.06 | 94.75 ± 0.08
Multiple Fea | 98.70 ± 0.08 | 98.33 ± 0.14 | 98.47 ± 0.12 | 98.00 ± 0.09 | 99.13 ± 0.08
Gisette | 95.97 ± 0.14 | 95.19 ± 0.11 | 94.92 ± 0.20 | 96.72 ± 0.09 | 98.10 ± 0.07
Arcene | 98.81 ± 0.22 | 96.42 ± 0.21 | 96.67 ± 0.42 | 99.60 ± 0.10 | 100.00 ± 0.00
Amazon Comm | 61.16 ± 1.03 | 56.02 ± 0.42 | 57.42 ± 0.57 | 67.86 ± 0.72 | 86.54 ± 0.48
Dexter | 97.80 ± 0.38 | 94.40 ± 0.38 | 94.67 ± 0.52 | 98.30 ± 0.26 | 99.87 ± 0.07
Dorothea | 95.90 ± 0.17 | 95.97 ± 0.09 | 96.36 ± 0.12 | 96.88 ± 0.10 | 98.76 ± 0.10
Table 3
Comparison of averaged classification accuracy, geometric accuracy ratio, win-tie-loss, and p-value of the sign test across all datasets.

 | SGA | BPSO | IBPSO | GAFSW | DEFSW
Mean accuracy | 94.03 | 92.59 | 92.72 | 95.05 | 96.49
SGA: $\bar{r}$ | - | 1.0175 | 1.0155 | 0.9877 | 0.9729
SGA: s | - | 18-1-3 | 18-1-3 | 2-2-18 | 0-2-20
SGA: p | - | 0.0015 | 0.0015 | 0.0004 | 0.0000
BPSO: $\bar{r}$ | - | - | 0.9981 | 0.9716 | 0.9577
BPSO: s | - | - | 8-1-13 | 1-1-20 | 0-1-21
BPSO: p | - | - | 0.3833 | 0.0000 | 0.0000
IBPSO: $\bar{r}$ | - | - | - | 0.9733 | 0.9591
IBPSO: s | - | - | - | 1-1-20 | 0-1-21
IBPSO: p | - | - | - | 0.0000 | 0.0000
GAFSW: $\bar{r}$ | - | - | - | - | 0.9842
GAFSW: s | - | - | - | - | 0-4-18
GAFSW: p | - | - | - | - | 0.0000

Table 4
Average subset sizes (rounded to the nearest integer) for the considered feature selection methods.

Dataset | SGA | BPSO | IBPSO | GAFSW | DEFSW
Madelon | 83 | 10 | 11 | 61 | 62
Lung | 18 | 39 | 42 | 18 | 17
Colon | 11 | 35 | 36 | 11 | 10
Lymphoma | 20 | 66 | 63 | 20 | 20
Leukemia1 | 11 | 107 | 98 | 9 | 8
9_Tumors | 44 | 106 | 118 | 36 | 45
Brain_Tumor1 | 18 | 37 | 50 | 21 | 22
Brain_Tumor2 | 14 | 48 | 42 | 13 | 16
Prostate_Tumor | 13 | 47 | 48 | 10 | 10
Hill-Valley | 65 | 53 | 56 | 59 | 55
Gas Sensor | 82 | 82 | 85 | 73 | 74
Musk_ver1 | 94 | 94 | 80 | 77 | 72
Musk_ver2 | 117 | 99 | 107 | 97 | 92
Semeion | 115 | 122 | 120 | 97 | 96
CTSlices | 144 | 146 | 144 | 129 | 130
ISOLET | 140 | 145 | 143 | 127 | 126
Multiple Fea | 68 | 57 | 42 | 66 | 61
Gisette | 108 | 119 | 123 | 96 | 116
Arcene | 41 | 104 | 104 | 41 | 53
Amazon Comm | 129 | 143 | 142 | 119 | 132
Dexter | 52 | 117 | 123 | 46 | 38
Dorothea | 20 | 41 | 45 | 24 | 92
Mean | 64 | 83 | 83 | 57 | 61

(n represents the number of runs), the geometric accuracy ratio is calculated according to the following equation:

$\bar{r} = \exp\left(\frac{1}{n}\sum_{i=1}^{n}\log(a_i/b_i)\right) = \sqrt[n]{\prod_{i=1}^{n} a_i/b_i}.$  (7)

This measure reflects the relative performance of one method with respect to another. If the outcome is greater than 1, then it is an indication that the first classifier outperforms the second one in terms of accuracy.

Win-tie-loss (s). This is an important measure, where the three values are the numbers of datasets for which classifier a obtained better, equal, or worse performance outcomes than classifier b.

Sign test (p). The p-value of a two-tailed sign test based on the win-tie-loss record. If p is significantly low, then one can conclude that it is unlikely that the outcome was obtained by chance, i.e., the difference between the two methods is
significant. On the other hand, a higher p value indicates that the two methods are not significantly different.
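These three comparison measures can be sketched in a few lines (our reconstruction, with the sign test implemented as an exact two-tailed binomial test on the wins and losses, ties excluded). Applied to the win-tie-loss records of Table 3, e.g. 18-1-3 and 8-1-13, it reproduces the reported p-values of 0.0015 and 0.3833:

```python
import math

def geometric_ratio(a, b):
    """Geometric accuracy ratio of Eq. (7): exp of the mean log accuracy ratio."""
    return math.exp(sum(math.log(x / y) for x, y in zip(a, b)) / len(a))

def win_tie_loss(a, b):
    """Numbers of datasets where method a is better, equal, or worse than b."""
    wins = sum(x > y for x, y in zip(a, b))
    ties = sum(x == y for x, y in zip(a, b))
    return wins, ties, len(a) - wins - ties

def sign_test_p(wins, losses):
    """Two-tailed sign test p-value on the win/loss record (ties dropped)."""
    n, m = wins + losses, min(wins, losses)
    tail = sum(math.comb(n, i) for i in range(m + 1)) / 2 ** n
    return min(1.0, 2 * tail)

print(round(sign_test_p(18, 3), 4))  # → 0.0015 (e.g. the SGA vs BPSO record)
print(round(sign_test_p(8, 13), 4))  # → 0.3833 (the BPSO vs IBPSO record)
```

The large p-value of the 8-1-13 record is what supports the observation below that BPSO and IBPSO are not significantly different.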
The mean accuracy measure indicates that DEFSW is the best feature selection method, outperforming all remaining methods followed by GAFSW and then SGA. The three other measures confirm these findings. One can also notice from this table that BPSO and IBPSO are not significantly different. Another important aspect in evaluating the performance of non-constrained feature selection methods is the size of the selected feature subsets. Table 4 shows the average size of the best subset selected by each of the five methods. Obtained subset sizes indicate that there is a variation between the different
datasets, as well as between the five methods. There is no single method that always tends to select subsets with small sizes, nor one that always selects big subsets. However, on average, GAFSW, DEFSW and SGA select subsets that are smaller than those selected by BPSO and IBPSO. In order to further elaborate on the issue of subset sizes, Fig. 6 shows the maximum subset size, minimum subset size and size of the best subset in each iteration when DEFSW is applied to the 9_Tumors dataset. When the algorithm starts, subsets are formed using different sizes that range between a few features and the upper limit (150 in all experiments presented here). The gap between the maximum and minimum sizes then starts to decrease until the wheels are re-distributed. As mentioned in Section 3.2, the re-distribution of wheels is implemented by fixing the features of the k elite subsets (k was assigned a value of 5, i.e., 0.1 NP, in all experiments) and randomly distributing the rest of the features among the various wheels. Subsets of the remaining members of the population are then randomly generated using the features of the new wheels, where again each subset can be of any size smaller than the upper limit. This explains the spiky shape of the graph. The figure also shows that the size of the best subset gradually decreases as the number of iterations increases.

Fig. 6. Maximum subset size, minimum subset size and size of best subset for each iteration of the DEFSW algorithm when applied to the 9_Tumors dataset.

An additional significance test (a nonparametric equivalent of ANOVA) known as the Friedman test was employed to further compare the aforementioned feature selection methods on multiple datasets. The Friedman test ranks the different algorithms on each dataset according to their average error rates, giving rank 1 to the one with the smallest error. If the algorithms have no difference between their expected errors (the null-hypothesis), then their average ranks should not differ either, which is what is checked by the Friedman test. The Friedman statistic is calculated according to the following equations:

$\chi_F^2 = \frac{12N}{K(K+1)}\left[\sum_{j} R_j^2 - \frac{K(K+1)^2}{4}\right],$  (8)

$F_F = \frac{(N-1)\chi_F^2}{N(K-1)-\chi_F^2},$  (9)

where N is the number of datasets, K is the number of algorithms, and $R_j$ is the average rank of algorithm j.

Table 5
Average rank of the four algorithms based on the results presented in Table 2.

SGA | IBPSO | GAFSW | DEFSW
2.98 | 3.75 | 2.11 | 1.16

Table 6
Tabular representation of the cost-conscious Friedman's test, with its post-hoc Bergman's test.

 | SGA | IBPSO | GAFSW | DEFSW
SGA | 0 | 1 | 1 | 1
IBPSO | 1 | 0 | 1 | 1
GAFSW | 1 | 1 | 0 | 1
DEFSW | 1 | 1 | 1 | 0
Average-rank | 3 (2.98) | 4 (3.75) | 2 (2.11) | 1 (1.16)
Prior cost-rank | 3 (64) | 4 (83) | 1 (57) | 2 (61)
Final-rank | 3 | 4 | 2 | 1

As BPSO and IBPSO showed no significant difference in performance and subset size, we opted not to include BPSO, as it achieved a slightly lower mean accuracy than IBPSO. Hence,
K = 4, N = 22, and the rankings of the four algorithms based on the results presented in Table 2 are shown in Table 5. Based on these ranks, $\chi_F^2 = 49.34$ and $F_F = 62.18$. $F_F$ is distributed according to the F distribution with $4-1 = 3$ and $(4-1)(22-1) = 63$ degrees of freedom. The critical value of F(3,63) for α = 0.05 is 2.75, which is obtained from the F distribution table. The null-hypothesis (the four methods are not different) is rejected as $F_F > F(3,63)$. The implementation of the Friedman test was adopted from [50], with cost-conscious comparisons between different methods on multiple datasets. The mean subset size from Table 4 was included as the cost of each algorithm. This was done in order to check the significance of the different methods while taking into consideration both their average errors and the sizes of the selected subsets. For further details on the computation of the Friedman test statistic, the reader can refer to [50–52]. Results of running the Friedman test on the error rates shown in Table 2 are given in Table 6, together with the average rank, prior rank (based on cost) and final rank of the different algorithms. As the null-hypothesis is rejected, we use a post-hoc test (Bergman's test is chosen in this paper; other tests gave the same results) to check which pairs of algorithms have different ranks, with 1's indicating significant differences and 0's indicating no significant differences. According to the Friedman test results with post-hoc analysis, one can clearly notice that there is a significant difference between our proposed DEFSW method and all of the SGA, IBPSO and GAFSW methods. Additionally, despite the slightly smaller subset size achieved by GAFSW, the cost-conscious Friedman test ranked the proposed DEFSW as the best method, where our goodness measure is composed of a prior cost term in addition to the generalization error.
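The computation of Eqs. (8) and (9) can be reproduced directly from the average ranks in Table 5 (our sketch; using the ranks as rounded in the table gives 49.38 and 62.37, close to the paper's 49.34 and 62.18, which were computed before rounding):

```python
def friedman_stats(ranks, N):
    """Friedman chi-square (Eq. (8)) and its F-distributed version (Eq. (9))
    from the average ranks of K algorithms over N datasets."""
    K = len(ranks)
    chi2 = 12 * N / (K * (K + 1)) * (sum(r * r for r in ranks) - K * (K + 1) ** 2 / 4)
    F = (N - 1) * chi2 / (N * (K - 1) - chi2)
    return chi2, F

# Average ranks of SGA, IBPSO, GAFSW and DEFSW over the 22 datasets (Table 5)
chi2, F = friedman_stats([2.98, 3.75, 2.11, 1.16], N=22)
print(round(chi2, 2), round(F, 2))  # → 49.38 62.37
```

Since the resulting $F_F$ far exceeds the critical value F(3,63) = 2.75, the null-hypothesis is rejected either way.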
Based on the obtained experimental results, the wheel structure proved to be very effective, as DEFSW consistently outperformed the other algorithms. The strength of the DEFSW algorithm rests on modifying the DE algorithm to make it capable of selecting relevant feature subsets using the proposed wheel structure. This algorithm overcomes the main limitation of the DEFSO algorithm, namely the need for a roulette wheel weighting scheme to replace duplicated features, which leads to an increased computational cost. The wheel structure has also noticeably improved the performance of GA based feature selection (GAFSW versus SGA). In fact, the wheel structure enhances the ability of the two algorithms to explore and exploit the search space, which explains the consistently good performance of DEFSW and, to a slightly lesser degree, GAFSW compared to the other methods.
5. Conclusion

This paper presented a powerful differential evolution based feature selection method. The proposed method, which is based on a novel wheel-structure approach, proved to be very successful in exploring and exploiting the feature space and avoiding local minima. Two versions of the algorithm were presented: one requires the desired feature subset size to be specified, while the other only requires an upper limit on the subset size. The proposed wheel structure also showed very good potential in improving the performance of GA-based feature selection. A number of problems were used to compare the performance of the proposed DEFSW algorithm with other population-based feature selection methods. Both versions of the algorithm achieved, on average, better results than all other feature selection methods, especially when dealing with large feature sets. Statistical significance tests also indicated the significance of the results achieved by the proposed algorithm in comparison to all other algorithms on various datasets.

References

[1] H. Liu, H. Motoda, Less is more, in: H. Liu, H. Motoda (Eds.), Computational Methods of Feature Selection, Taylor and Francis Group, LLC, New York, USA, 2008, pp. 3–17.
[2] H. Liu, E.R. Dougherty, J.G. Dy, K.A. Torkkola, E. Tuv, H.A. Peng, C.A. Ding, F.A. Long, M.A. Berens, L.A. Parsons, Z. Zhao, L.A. Yu, G.A. Forman, Evolving feature selection, IEEE Intelligent Systems 20 (6) (2005) 64–76.
[3] I. Guyon, S. Gunn, M. Nikravesh, L.A. Zadeh, Feature Extraction: Foundations and Applications, Springer-Verlag, New York, 2006.
[4] P.M. Narendra, K. Fukunaga, A branch and bound algorithm for feature subset selection, IEEE Transactions on Computers 26 (1977) 917–922.
[5] R. Kohavi, G.H. John, Wrappers for feature subset selection, Artificial Intelligence 97 (1997) 273–324, special issue on relevance.
[6] P.A. Devijver, J. Kittler, Pattern Recognition: A Statistical Approach, Prentice-Hall, Englewood Cliffs, NJ, 1982.
[7] P.J.M. Laarhoven, E.H.L. Aarts, Simulated Annealing: Theory and Applications, Kluwer Academic Publishers, 1988.
[8] S.-W. Lin, T.-Y. Tseng, S.-Y. Chou, S.-C. Chen, A simulated-annealing-based approach for simultaneous parameter optimization and feature selection of back-propagation networks, Expert Systems with Applications 34 (2) (2008) 1491–1499.
[9] R.L. Haupt, S.E. Haupt, Practical Genetic Algorithms, second ed., John Wiley & Sons, 2004.
[10] P.L. Lanzi, Fast feature selection with genetic algorithms: a filter approach, in: Proceedings of the IEEE International Conference on Evolutionary Computation, 1997, pp. 537–540.
[11] F.Z. Brill, D.E. Brown, W.N. Martin, Fast genetic selection of features for neural network classifiers, IEEE Transactions on Neural Networks 3 (2) (1992) 324–328.
[12] M.L. Raymer, W.F. Punch, E.D. Goodman, L.A. Kuhn, A.K. Jain, Dimensionality reduction using genetic algorithms, IEEE Transactions on Evolutionary Computation 4 (2) (2000) 164–171.
[13] I.S. Oh, J.S. Lee, B.R. Moon, Hybrid genetic algorithms for feature selection, IEEE Transactions on Pattern Analysis and Machine Intelligence 26 (11) (2004) 1424–1437.
[14] H. Frohlich, O. Chapelle, B. Scholkopf, Feature selection for support vector machines by means of genetic algorithms, in: Proceedings of the IEEE International Conference on Tools with Artificial Intelligence (ICTAI'03), 2003, pp. 142–148.
[15] H.R. Kanan, K. Faez, GA-based optimal selection of PZMI features for face recognition, Applied Mathematics and Computation 205 (2) (2008) 706–715.
[16] J. Lu, T. Zhao, Y. Zhang, Feature selection based-on genetic algorithm for image annotation, Knowledge-Based Systems 21 (8) (2008) 887–891.
[17] M. Dorigo, T. Stutzle, Ant Colony Optimization, MIT Press, London, 2004.
[18] A. Al-Ani, Feature subset selection using ant colony optimization, International Journal of Computational Intelligence 2 (2005) 53–58.
[19] M.H. Aghdam, N.G. Aghaee, M.E. Basiri, Text feature selection using ant colony optimization, Expert Systems with Applications (2008).
[20] H.R. Kanan, K. Faez, An improved feature selection method based on ant colony optimization (ACO) evaluated on face recognition system, Applied Mathematics and Computation 205 (2008) 716–725.
[21] J. Kennedy, R.C. Eberhart, Y. Shi, Swarm Intelligence, Morgan Kaufmann Publishers, London, 2001.
[22] H.A. Firpi, E. Goodman, Swarmed feature selection, in: Proceedings of the Applied Imagery Pattern Recognition Workshop, 2004, pp. 112–118.
[23] L.Y. Chuang, H.W. Chang, C.J. Tu, C.H. Yang, Improved binary PSO for feature selection using gene expression data, Computational Biology and Chemistry 32 (1) (2008) 29–38.
[24] R. Khushaba, A. Al-Ani, A. Al-Jumaily, Differential evolution based feature subset selection, in: Proceedings of the International Conference on Pattern Recognition (ICPR'08), 2008.
[25] R. Panda, M.K. Naik, B.K. Panigrahi, Face recognition using bacterial foraging strategy, Swarm and Evolutionary Computation 1 (3) (2011) 138–146.
[26] S. Das, S. Maity, B.-Y. Qu, P.N. Suganthan, Real-parameter evolutionary multimodal optimization – a survey of the state-of-the-art, Swarm and Evolutionary Computation 1 (2) (2011) 71–88.
[27] T.T. Nguyen, S. Yang, J. Branke, Evolutionary dynamic optimization: a survey of the state of the art, Swarm and Evolutionary Computation 6 (2012) 1–24.
[28] F.V.D. Bergh, An Analysis of Particle Swarm Optimizers, Ph.D. Thesis, University of Pretoria, Pretoria, South Africa, 2001.
[29] K.V. Price, R.M. Storn, J.A. Lampinen, Differential Evolution: A Practical Approach to Global Optimization, Springer, 2005.
[30] R. Storn, Differential evolution research – trends and open questions, in: U.K. Chakraborty (Ed.), Advances in Differential Evolution, SCI, vol. 143, Springer-Verlag, Berlin, Heidelberg, 2008, pp. 1–31.
[31] A.K. Palit, D. Popovic, Computational Intelligence in Time Series Forecasting: Theory and Engineering Applications, Springer, 2005.
[32] R.N. Khushaba, A. Alsukker, A. Al-Ani, A. Al-Jumaily, A.Y. Zomaya, A novel swarm based feature selection algorithm in multifunction myoelectric control, Journal of Intelligent and Fuzzy Systems 20 (4–5) (2009) 175–185.
[33] I. Guyon, S. Gunn, M. Nikravesh, L.A. Zadeh, Feature Extraction: Foundations and Applications, Springer-Verlag, Berlin, Heidelberg, Netherlands, 2006.
[34] T.R. Golub, D.K. Slonim, P. Tamayo, C. Huard, M. Gaasenbeek, J.P. Mesirov, H. Coller, M.L. Loh, J.R. Downing, M.A. Caligiuri, C.D. Bloomfield, E.S. Lander, Molecular classification of cancer: class discovery and class prediction by gene expression monitoring, Science 286 (1999) 531–537.
[35] V.P. Plagianakos, D.K. Tasoulis, M.N. Vrahatis, A review of major application areas of differential evolution, in: Advances in Differential Evolution, Springer-Verlag, Berlin, 2008, pp. 197–238.
[36] J. Liu, J. Lampinen, A fuzzy adaptive differential evolution algorithm, Soft Computing 9 (6) (2005) 448–462.
[37] J. Brest, S. Greiner, B. Boskovic, M. Mernik, V. Zumer, Self-adapting control parameters in differential evolution: a comparative study on numerical benchmark problems, IEEE Transactions on Evolutionary Computation 10 (6) (2006) 646–657.
[38] J. Teo, Exploring dynamic self-adaptive populations in differential evolution, Soft Computing 10 (8) (2006) 673–686.
[39] A.K. Qin, V.L. Huang, P.N. Suganthan, Differential evolution algorithm with strategy adaptation for global numerical optimization, IEEE Transactions on Evolutionary Computation 13 (2) (2009) 398–417.
[40] J. Zhang, A.C. Sanderson, JADE: adaptive differential evolution with optional external archive, IEEE Transactions on Evolutionary Computation 13 (5) (2009) 945–958.
[41] A. Ghosh, S. Das, A. Chowdhury, R. Giri, An improved differential evolution algorithm with fitness-based adaptation of the control parameters, Information Sciences 181 (2011) 3749–3765.
[42] M. Weber, V. Tirronen, F. Neri, Scale factor inheritance mechanism in distributed differential evolution, Soft Computing 14 (2010) 1187–1207.
[43] F. Neri, E. Mininno, Memetic compact differential evolution for cartesian robot control, IEEE Computational Intelligence Magazine 5 (2) (2010) 54–65.
[44] F. Neri, V. Tirronen, Recent advances in differential evolution: a survey and experimental analysis, Artificial Intelligence Review 33 (1–2) (2010) 61–106.
[45] S. Das, P.N. Suganthan, Differential evolution: a survey of the state-of-the-art, IEEE Transactions on Evolutionary Computation 15 (1) (2011) 4–31.
[46] C.S. Velayutham, S. Kumar, Differential evolution based on-line feature analysis in an asymmetric subsethood product fuzzy neural network, in: ICONIP, Lecture Notes in Computer Science, vol. 3316, 2004, pp. 959–964.
[47] X. He, Q. Zhang, N. Sun, Y. Dong, Feature selection with discrete binary differential evolution, in: International Conference on Artificial Intelligence and Computational Intelligence, 2009, pp. 327–330.
[48] J.G. Nieto, E. Alba, J. Apolloni, Hybrid DE-SVM approach for feature selection: application to gene expression datasets, in: Second International Symposium on Logistics and Industrial Informatics (LINDI 2009), 2009, pp. 1–6.
[49] R.N. Khushaba, A. Al-Ani, A. Al-Jumaily, Feature subset selection using differential evolution and a statistical repair mechanism, Expert Systems with Applications 38 (9) (2011) 11515–11526.
[50] A. Ulas, O.T. Yildiz, E. Alpaydin, Cost-conscious comparison of supervised learning algorithms over multiple data sets, Pattern Recognition 45 (2012) 1772–1781.
[51] J. Demsar, Statistical comparisons of classifiers over multiple data sets, Journal of Machine Learning Research 7 (2006) 1–30.
[52] J. Derrac, S. Garcia, D. Molina, F. Herrera, A practical tutorial on the use of nonparametric statistical tests as a methodology for comparing evolutionary and swarm intelligence algorithms, Swarm and Evolutionary Computation 1 (2011) 3–18.