Correlation Based Feature Selection Using Quantum Bio Inspired Estimation of Distribution Algorithm

Omar S. Soliman and Aliaa Rassem

Faculty of Computers and Information, Cairo University, 5 Ahmed Zewal Street, Orman, Giza, Egypt
[email protected],
[email protected]
Abstract. Correlation based Feature Selection (CFS) evaluates different feature subsets based on pairwise feature-feature correlations and feature-class correlations. Machine learning techniques are applied to CFS to help discover as many different combinations of features as possible, especially in large feature spaces. This paper introduces a quantum bio inspired estimation of distribution algorithm (EDA) for CFS. The proposed algorithm integrates quantum computing concepts and the vaccination process with immune clonal selection (QVICA) and EDA. It is employed as a search technique for CFS to find the optimal feature subset in the feature space. It is implemented and evaluated using the benchmark dataset KDD-Cup99 and compared with the GA algorithm. The obtained results showed the ability of QVICA-with EDA to obtain better feature subsets of shorter length, with higher fitness values and in reduced computation time. Keywords: Correlation Based Feature Selection (CFS), Network intrusion detection, Quantum Computing, Vaccine Principles, Immune Clonal Algorithm, EDA.
1
Introduction
Feature selection (FS) is an important preprocessing step before classification tasks. In large data sets with a huge number of features, some features may be irrelevant or redundant, which may increase computation time or harm the accuracy of the results. Efficient FS methods are those which can explore the whole feature space and select the best subset of features, i.e., the one containing the most relevant and non-redundant features with high class prediction power. There are two main methods that deal with feature selection: filter methods and wrapper methods. Filter methods rely on the general characteristics of the training data to select features independently of any predictor [11]. Machine learning (ML) algorithms have become an essential need to deal efficiently with these FS issues. They are often used with filters to search for good representative features in large feature spaces [15]. One of these ML algorithms, introduced in this paper, is the quantum vaccined immune clonal algorithm with estimation of distribution algorithm (QVICA-with EDA).

C. Sombattheera et al. (Eds.): MIWAI 2012, LNCS 7694, pp. 318–329, 2012. © Springer-Verlag Berlin Heidelberg 2012

The KDD-Cup99 data set is used as an intrusion detection case study for testing these ML algorithms as search techniques. Detailed descriptions of filters, the algorithm and the dataset are given in the following subsections, and discretization techniques are described as well. The aim of this paper is to develop a quantum bio inspired estimation of distribution algorithm for correlation based feature selection to obtain optimal feature subsets. The algorithm is applied to a benchmark dataset and compared with the GA algorithm to evaluate its effectiveness. The rest of this paper is organized as follows: Section 2 introduces related work in the field of feature selection. Section 3 introduces the proposed algorithm. Experimental results and discussion are presented in Section 4. The last section is devoted to conclusions and further work.
2
Related Works and Background
Much research has been done on developing new algorithms for feature selection, where different evolutionary algorithms (EAs) were applied with either filter or wrapper methods. EDA was introduced as a wrapper for feature subset selection in a splice site prediction application [14]. EDA appeared again in a study that aimed to determine whether EDAs present advantages over simple GAs in terms of accuracy or speed when applied to feature selection problems. The study presented experiments with four evolutionary algorithms, GA, compact GA, extended compact GA and the Bayesian optimization algorithm, applied to the feature selection problem. The classification results using a Naive Bayes classifier on artificial data sets did not provide evidence of advantages of EDAs over the other EAs [2]. Another wrapper-filter feature selection algorithm (WFFSA) using a memetic framework, a combination of genetic algorithm (GA) and local search (LS), was developed in [18]. A modified Kolmogorov-Smirnov Correlation Based Filter algorithm for feature selection was also proposed, with results compared against CFS and the simple Kolmogorov-Smirnov Correlation Based Filter (KS-CBF); the classification accuracy with the reduced feature set was highest using the proposed approach [9]. The KDD dataset was used for experimental comparison in many FS studies; a sample of these studies follows. A rough set based feature selection method was developed to select the most relevant features which can represent the pattern of the network traffic in an intrusion detection system [17]. A wrapper based feature selection approach was applied using the Bees Algorithm (BA) as a search strategy for subset generation and Support Vector Machine (SVM) as the classifier.
The algorithm was tested on the KDD-Cup 99 data set and compared with other feature selection techniques such as Rough-DPSO, Rough, Linear Genetic Programming (LGP), Multivariate Regression Splines (MARS), and Support Vector Decision Function Ranking (SVDF). The BASVM yielded a better quality intrusion detection system (IDS), with higher classification accuracy, a higher detection rate and a lower false alarm rate [7].
2.1
Correlation Based Feature Selection (CFS)
This paper focuses on one of the filter methods, correlation based feature selection, and tries to optimize its performance. Correlation based Feature Selection (CFS), developed by Hall (1999), is a simple filter algorithm that ranks feature subsets according to a correlation based heuristic evaluation function. The bias of the evaluation function is toward subsets that contain features that are highly correlated with the class and uncorrelated with each other. Irrelevant features should be ignored because they will have low correlation with the class. Redundant features should be screened out as they will be highly correlated with one or more of the remaining features. CFS has two main phases: the first is calculating the matrix of feature-feature and feature-class correlations; the second is a search procedure that is applied to explore the feature space and get the optimal subset. Examining all possible subsets and selecting the best is prohibitive due to the large feature space, so various heuristic search strategies, such as best first, are often applied to search the feature space in reasonable time. The CFS correlation based heuristic evaluation function is defined as in equation 1 [3], [5], [13]:

Ms = (k * rcf) / sqrt(k + k(k - 1) * rff)    (1)

Equation 1 is Pearson's correlation where all variables have been standardized. Ms is the heuristic merit of a feature subset S containing k features, rcf is the mean feature-class correlation and rff is the average feature-feature intercorrelation. The numerator of this equation can be thought of as indicating how predictive of the class a set of features is, and the denominator as indicating how much redundancy there is among the features [5], [11].
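Equation 1 can be computed directly from the two average correlations. The following sketch is our own illustration of the merit calculation, not code from the paper:

```python
import math

def cfs_merit(k, r_cf, r_ff):
    """CFS heuristic merit (equation 1) of a subset of k features.

    r_cf: mean feature-class correlation of the subset.
    r_ff: average feature-feature intercorrelation of the subset.
    """
    return (k * r_cf) / math.sqrt(k + k * (k - 1) * r_ff)
```

For example, adding a redundant feature raises r_ff and therefore lowers the merit, which is exactly the bias toward non-redundant, class-predictive subsets described above.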
2.2
Discretization
Discretization is important before feature selection because many FS methods work on discrete data, so continuous attributes should be converted. Discretization involves the grouping of continuous values into a number of discrete intervals. Discretization methods first sort the continuous attribute/feature values; then all possible cut points are calculated and the best point, according to some evaluation measure, is selected. The attribute values are split or merged at this cut point and the process continues until a stopping criterion is met. Equal frequency discretization (EFD) is one of the most suitable methods for large datasets [11], [16]. EFD divides the sorted values into k intervals so that each interval contains approximately the same number of training instances. Thus each interval contains n/k adjacent values; k is a user predefined parameter and is usually set to 10 [11].
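As a concrete illustration (ours, not the authors' code), one EFD pass over a single feature can be sketched as follows; ties at interval borders are ignored for simplicity:

```python
def equal_frequency_discretize(values, k=10):
    """Map each continuous value to one of k interval labels so that
    each interval receives roughly n/k of the sorted values.
    Equal values may straddle a border in this simplified version."""
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])  # sort once by value
    labels = [0] * n
    for rank, idx in enumerate(order):
        labels[idx] = min(rank * k // n, k - 1)  # rank decides the interval
    return labels
```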
2.3
Quantum Inspired Immune Clonal Algorithm
Immune Clonal Algorithm (ICA) is inspired by the human immune system's clonal selection process over B cells, where the evolution process of the antibodies is a repeated cycle of matching, cloning, mutating and replacing. The best
B cells are allowed through this process to survive, which increases the attacking performance against unknown antigens. Vaccination is another immunological concept that ICA applies, through the vaccine operator. This operator is used to introduce some degree of diversity between solutions and increase their fitness values by using artificial vaccines. These vaccines go through a different evolution process, where the genetic operators of crossover and mutation are used for their optimization through generations. EAs sometimes show bad performance in high dimensional problems due to the numerous evolutionary operations and fitness evaluations applied. Some approaches have been used to overcome this limitation, like parallelizing EAs or hybridizing them with other powerful algorithms. ICA, as an EA, suffers from such a limitation: it does not perform effectively in complicated problems due to the large population of antibodies that has to be created. One of the hybridization approaches that enhances ICA performance is shown in this paper, namely the quantum inspired evolutionary algorithms (QIEA). QIEAs were introduced in the 1990s to take advantage of quantum computing (QC) in solving problems on classical computers. QIEAs merge QC concepts, like q-bits, quantum gates, the superposition property and the quantum observation process, with classical EAs to improve their performance. The quantum inspired ICA (QICA), the hybridization between QC and classical ICA, enhanced the performance of ICA and helped solve the problem of its ineffective performance in high dimensional problems. The quantum concepts used in QICA are described in detail in [8].
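For illustration only (this is our sketch of the generic QIEA machinery, not the authors' implementation), a q-bit individual and the observation step can be written as:

```python
import math, random

def init_qbit_individual(n_features):
    """One quantum individual: a pair of amplitudes (alpha, beta) per
    feature, started in equal superposition so every subset is equally
    likely; alpha^2 + beta^2 = 1 must always hold."""
    a = 1.0 / math.sqrt(2.0)
    return [(a, a) for _ in range(n_features)]

def observe(individual):
    """Collapse each q-bit to a classical bit (feature selected or not)
    with P(bit = 1) = beta^2."""
    return [1 if random.random() < beta * beta else 0
            for _, beta in individual]
```

A single quantum individual thus encodes a probability distribution over many classical feature subsets, which is what lets QIEAs work with very small populations.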
2.4
Estimation of Distribution Algorithm
Most evolutionary algorithms (EAs) use sampling during their evolution process for generating new solutions. Some of these algorithms use it implicitly, like the Genetic Algorithm (GA), as new individuals are sampled through the genetic operators of crossover and mutation of the parents. Other algorithms apply an explicit sampling procedure using probabilistic models that represent the solutions' characteristics, like the estimation of distribution algorithms (EDAs). EDAs are population based algorithms with a theoretical foundation in probability theory. They can extract global statistical information about the search space from the search so far and build a probability model of promising solutions. Unlike GAs, the new individuals in the next population are generated without crossover or mutation operators. They are randomly reproduced from a probability distribution estimated from the selected individuals of the previous generation. EDA has some advantages over other traditional EAs: it is able to capture the interrelations and interdependencies between the problem variables through the estimation of their joint density function, and it does not have the problem of finding appropriate values for many parameters, as it relies only on the probability estimation with no additional parameters. EDA relies on the construction and maintenance of a probability model that generates satisfactory solutions for the problem solved. An estimated probabilistic model, which captures the joint probabilities between variables, is constructed by selecting the
current best solutions and then it is simulated for producing samples to guide the search process and update the induced model [8].
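The estimate-then-sample cycle can be illustrated with a univariate marginal model, a common simplification of the joint density described above (our sketch, not the paper's EDA):

```python
import random

def eda_generation(population, fitness, select_ratio=0.5):
    """One EDA generation: select the fittest individuals, estimate a
    per-variable probability model from them, and sample the next
    population from that model (no crossover or mutation)."""
    ranked = sorted(population, key=fitness, reverse=True)
    elite = ranked[: max(1, int(len(ranked) * select_ratio))]
    n_vars = len(population[0])
    # marginal probability that each binary variable is 1 among the elite
    probs = [sum(ind[j] for ind in elite) / len(elite) for j in range(n_vars)]
    return [[1 if random.random() < p else 0 for p in probs]
            for _ in range(len(population))]
```

Iterating eda_generation concentrates the sampling model on the statistically promising region of the search space found so far.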
3
Proposed Algorithm
The effectiveness of the correlation based feature selection method (CFS) depends on the effectiveness of the search algorithm and its ability to examine the largest possible number of feature subsets in order to select the most representative one. As in equation 1, the fitness of any candidate subset is evaluated by dividing the subset's class prediction power by the features' interactions, so only relevant and non-redundant features are favored for selection. The proposed quantum bio inspired estimation of distribution algorithm for CFS is based on the quantum vaccined immune clonal algorithm and EDA. The schema of the proposed algorithm is shown in figure 1. As shown in figure 1, the algorithm has two main stages: the first is data preprocessing and the second is CFS with its two phases. The details of the proposed algorithm are described as follows: – Data Preprocessing Phase: This phase is concerned with data preprocessing of the KDD dataset; it is composed of two main steps, symbolic data conversion and data discretization. In this phase all features are normalized so they can be handled equally, as follows:
Fig. 1. The schema of proposed algorithm
• Symbolic data conversion: The first step of data preprocessing is the conversion of any symbolic data. Some features of the KDD dataset are symbolic, for example, protocol type, service and attack type. These features are converted into numeric values so all features can be treated in the same way. Each possible value of each symbolic feature is converted to a number from 1 to N, where N is the total number of the feature's values. • Data Discretization: As many filter methods work on discrete data and the KDD data set has some continuous features, discretization is the second preprocessing step. Equal frequency discretization (EFD) is the discretization method used, with the number of intervals k equal to 10. – CFS Phases: After the dataset is ready for the feature selection process, i.e., all features have discrete numeric values, the two CFS phases are implemented as follows. • Phase 1: Correlations Matrix: a matrix of feature-feature correlations and feature-class correlations is computed from the features' values. This matrix is the evaluation criterion for any possible feature subset and the basis of subset selection in the second phase. • Phase 2: Search Technique: QVICA-with EDA is applied in this phase to search, rank and select better subsets. This technique integrates quantum computing and immune clonal selection principles with vaccination and the EDA sampling mechanism to improve the solutions' fitness and the degree of diversity. Two populations, one of solutions (feature subsets) and one of vaccines, are first initialized in the quantum bit representation. After initialization, two parallel flows make up the evolution process: one over the vaccines population and the other over the subsets population. In the first flow, the vaccines population is divided into two sub populations.
Genetic operators of crossover and mutation are used to evolve the first sub population, while EDA is applied to the second sub population: a probability model representing the fittest vaccines is estimated, and new vaccines are then sampled from this model. The fittest vaccines are the farthest ones, with the highest distance values from the subsets, so that a higher degree of space exploration is ensured. In the second flow, the quantum subsets, i.e., the candidate problem solutions, are evolved by the clonal and quantum mutation operators and then observed into more subsets. The final subsets are vaccined with the newly generated vaccines, then immune clonal selection takes place to select the best subsets as the next population. The phase is repeated until the number of iterations is reached [8].
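To make the second flow concrete, the following heavily simplified toy (ours; the vaccine population, the EDA sampling of vaccines and the exact operators of [8] are omitted) keeps only the clone, quantum-mutate, observe and select skeleton:

```python
import math, random

def clonal_quantum_search(fitness, n_bits, iters=20, pop=5, clones=4, n_obs=10):
    """Skeleton of the phase-2 search: each quantum individual is
    cloned, each clone's amplitudes get a small random rotation
    (quantum mutation), clones are observed into classical bit
    strings, and the clone with the best observation survives."""
    a = 1.0 / math.sqrt(2.0)
    Q = [[(a, a)] * n_bits for _ in range(pop)]  # equal superposition start
    best_bits, best_fit = None, float("-inf")
    for _ in range(iters):
        next_Q = []
        for q in Q:
            scored = []
            for _ in range(clones):
                # quantum mutation: rotate each (alpha, beta) pair slightly
                clone = []
                for alpha, beta in q:
                    theta = math.atan2(beta, alpha) + random.uniform(-0.1, 0.1)
                    clone.append((math.cos(theta), math.sin(theta)))
                # observe the clone several times into classical subsets
                for _ in range(n_obs):
                    bits = [1 if random.random() < b * b else 0
                            for _, b in clone]
                    scored.append((fitness(bits), bits, clone))
            f, bits, clone = max(scored, key=lambda s: s[0])  # clonal selection
            next_Q.append(clone)
            if f > best_fit:
                best_fit, best_bits = f, bits
        Q = next_Q
    return best_bits, best_fit
```

In the actual algorithm the fitness would be the CFS merit of the subset encoded by the bit string, and the surviving subsets would additionally be vaccined before selection.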
4
Experimental Results
Most of the research done on the CFS method has used hill climbing and best first as heuristic search strategies, to search the feature space in reasonable time. With higher data dimensionality and more problem features, these techniques may take more processing time to get the best results. GA is a well known heuristic search that is popular for its simple implementation, its various crossover and mutation operators, its parallelism, and its application in many different domains. GA search is more powerful than other heuristics due to its ability to examine more solutions at the same time and the flexibility of changing its operators so as to control the search process. In this paper, our proposed algorithm is therefore compared with GA search through many experiments. All the experiments are done on a benchmark dataset in the intrusion detection field, the KDD (Knowledge Discovery and Data Mining Tools Conference) Cup 99 dataset [4]. KDD contains 5 million input patterns; each record represents a TCP/IP connection and is composed of 41 features that are both qualitative and quantitative in nature. The database contains 39 types of distinct attacks, grouped into four classes of attack and one class of non attack. These characteristics make this dataset a challenge for classification. The data set used in this study is a smaller subset (10% of the original training set) that contains 494,021 instances. Table 1 shows the number of samples of each attack type and the number of its sub attacks [1], [6]. The four major types of attacks in KDD are [11]: – Denial of Service (DoS) attacks, where an attacker makes some computing or memory resource too busy or too full to handle legitimate requests, thus denying legitimate users access to a machine. – Probe attacks, where an attacker scans a network to gather information or find known vulnerabilities.
– Remote-to-Local (R2L) attacks, where an attacker sends packets to a machine over a network, then exploits the machine's vulnerability to illegally gain local access as a user. – User-to-Root (U2R) attacks, where an attacker starts out with access to a normal user account on the system and is able to exploit a vulnerability to gain root access to the system. The GA parameters, crossover and mutation probabilities, are set, in addition to the number of iterations, which is set to 1000, and the population size (number of subsets), which is also set to 1000. The initial quantum population size is set to 5, with a clone scale of 4, and the number of observations is 40. Three evaluation measures are used as the basis of the comparison in this paper: the fitness of the best subset obtained, the average feature-feature (f-f) correlation of the best subset, and its average feature-class (f-c) correlation. Five experiments are executed: the first is done over all the data records of KDD, and each of the other four is done over only the records of one specific type of attack. Table 2 shows the first evaluation measure values for GA and
Table 1. Distribution of attack types in KDD dataset
Attack   Samples No.   Sub Attacks
Normal   97280         (none)
DoS      391458        Back, Land, Neptune, Pod, Smurf, Teardrop
Probe    4107          Satan, Ipsweep, Nmap, Portsweep
R2L      1124          Guess-passwd, Ftp-write, Imap, Phf, Multihop, Warezmaster, Warezclient, Spy
U2R      52            Buffer-overflow, Loadmodule, Perl, Rootkit
Table 2. Best selected feature subset and its fitness value using GA and QVICA-with EDA search (subset population size = 1000, iterations = 1000, crossover probability = 0.3 and mutation probability = 0.5)
Search           Exp.   Best fitness   Features No.   Selected features
GA Search        1      0.9746         10             4,6,8,12,13,23,25,29,30,35
                 2      0.9786         13             2,3,5,10,11,12,22,23,24,29,33,34,36
                 3      0.9747         12             2,3,18,21,23,27,29,30,31,32,33,37
                 4      0.8969         8              9,12,21,28,29,31,37,39
                 5      0.8544         14             1,3,5,10,13,14,17,18,20,24,26,32,35,38
QVICA-with EDA   1      0.9762         9              3,4,6,12,24,25,30,34,35
Search           2      0.9998         7              1,2,7,29,33,34,36
                 3      0.9933         8              2,27,31,32,33,34,37,40
                 4      0.9964         2              15,29
                 5      0.939          4              10,14,24,32
proposed algorithm, where the best subset obtained is shown at each experiment, together with the number of features it includes and its fitness value (CFS evaluation function score). As shown in table 2, the QVICA-with EDA search method outperformed GA search in finding optimal subsets with higher fitness values of the CFS merit function. Only in the first experiment, tested on all data records, do both algorithms show almost the same behavior, with a slight difference between their fitness results. In the other four experiments, the proposed algorithm behaves better than GA, obtaining better fitness values. It can be seen that the proposed algorithm was more effective than GA in searching for relevant and non-redundant features, which is clear from the number of features obtained by each algorithm at each experiment. QVICA-with EDA is able to get feature subsets of shorter length, which means that it discovers more irrelevant and redundant features than GA, which outputs larger subsets.
Table 3. Average (f-f) and (f-c) Correlations of the Best Subset of QVICA-with EDA and GA Search
Exp.   Average features correlations      Average class correlations
       GA       QVICA-with EDA            GA       QVICA-with EDA
1      0.645    0.6212                    0.804    0.7951
2      0.4245   0.6264                    0.67     0.8243
3      0.4604   0.3203                    0.65     0.8302
4      0.4808   0.3137                    0.6625   0.8075
5      0.4521   0.7047                    0.5988   0.8285
The last two evaluation measures, the average feature-feature (f-f) correlation and the average feature-class (f-c) correlation of the best subset, are calculated for both algorithms at each experiment, as in table 3. CFS aims both to lower the first measure, to ensure redundancy elimination, and to maximize the second, for higher class predictability. Table 3 shows the ability of the QVICA-with EDA search method to meet the first CFS goal: in almost all cases, its selected feature subset has smaller feature correlations/interactions than the GA-selected one. The algorithm is also able to achieve the second goal and get features with higher class predictability than GA's selected features (as listed in the last two columns of the table). For more analysis of the performance of the proposed algorithm, the best feature subset fitness value obtained over each experiment's iterations is visualized for both algorithms in Fig. 2 and Fig. 3(a), (b), (c) & (d), to track the search strategy of both algorithms and their population dynamics through iterations. As shown in Fig. 2 and Fig. 3(a), (b), (c) & (d) for the five experiments, the proposed algorithm has an increasing curve of the best fitness value through all iterations, whereas GA shows a random behavior. The increasing curve shows how the whole subset population in QVICA-with EDA moves towards the best solution, due to the EDA sampling that always samples from the best solutions found so far at each iteration. A small degree of randomization also appears in the curve, gained from both the vaccination and the quantum observation process. This randomization means that not only is the search directed towards optimality, but at each iteration there is also room for searching more areas of the feature space to explore new, unknown subsets.
On the other hand, the GA search strategy appears to be highly random, as expected from the GA operators, but it fails to reach or get closer to the optimality regions, which is clear from the maximum values obtained by both algorithms. The proposed algorithm is able to get the highest fitness values for the DoS attack, because it has the largest number of records in the dataset. The large number of records provides sufficient information about the features
Fig. 2. Best Subset fitness value found by QVICA-with EDA and GA searches over all data records
(a) Best fitness for DoS data records. (b) Best fitness for Probe data records.
(c) Best fitness for R2L data records. (d) Best fitness for U2R data records. Fig. 3. Best Subset fitness value found by QVICA-with EDA and GA Algorithm
dependency and the features' class predictability, which helps in building a more accurate correlation matrix and reaching better fitness values. The lowest fitness values are achieved by both GA and our algorithm for the U2R attack, because it has the smallest number of records in the dataset, so little information about the feature correlations is provided.
5
Conclusions
This paper introduced a quantum bio inspired estimation of distribution algorithm for correlation based feature selection that combines quantum computing (QC) concepts with vaccination principles, immune clonal selection and estimation of distribution algorithm (EDA) sampling. The quantum properties of the q-bit representation, quantum mutation and observation, together with EDA sampling, were utilized to improve search performance and reduce computation time. The proposed algorithm was employed as a search technique for CFS to find the optimal subsets of the feature space. It was implemented and evaluated using the benchmark KDD-Cup99 dataset, and compared with GA search as one of the best heuristic search methods. Results showed that it was capable of obtaining better feature subsets of shorter length, with higher fitness values and with computation time reduced by the quantum representation of solutions. For future work we intend to apply the proposed search algorithm to a real application, with more experiments and a more comprehensive comparative study against the most recent machine learning algorithms.
References

1. Olusola, A., Oladele, A., Abosede, D.: Analysis of KDD 99 Intrusion Detection Dataset for Selection of Relevance Features. In: Proceedings of the World Congress on Engineering and Computer Science, vol. I, pp. 20–22 (2010)
2. Cantu-Paz, E.: Feature Subset Selection by Estimation of Distribution Algorithms. In: Proceedings of the Genetic and Evolutionary Computation Conference (GECCO 2002), pp. 303–310 (2002)
3. Hall, M.: Correlation-based feature selection for machine learning. PhD Thesis, Department of Computer Science, Waikato University, New Zealand (1999)
4. KDD 1999 archive: The Fifth International Conference on Knowledge Discovery and Data Mining, http://kdd.ics.uci.edu/databases/kddcup99/kddcup99.html
5. Hall, M., Smith, L.: Feature Selection for Machine Learning: Comparing a Correlation based Filter Approach to the Wrapper. In: Proceedings of the Twelfth International Florida Artificial Intelligence Research Society Conference, pp. 235–239 (1999)
6. Hoque, M., Mukit, M., Bikas, M.: An Implementation of Intrusion Detection System Using Genetic Algorithm. International Journal of Network Security and Its Applications (IJNSA) 4(2), 109–120 (2012)
7. Alomari, O., Othman, Z.: Bees Algorithm for feature selection in Network Anomaly detection. Journal of Applied Sciences Research 8(3), 1748–1756 (2012)
8. Soliman, O.S., Rassem, A.: A bio inspired clonal algorithm with estimation of distribution algorithm for global optimization. Informatics and Systems (INFOS), 166–173 (2012)
9. Srinivasu, P., Avadhani, P.S., Satapathy, S.C., Pradeep, T.: A Modified Kolmogorov-Smirnov Correlation Based Filter Algorithm for Feature Selection. In: Satapathy, S.C., Avadhani, P.S., Abraham, A. (eds.) Proceedings of the InConINDIA 2012. AISC, vol. 132, pp. 819–826. Springer, Heidelberg (2012)
10. Niu, Q., Zhou, T., Ma, S.: A Quantum-Inspired Immune Algorithm for Hybrid Flow Shop with Makespan Criterion. Journal of Universal Computer Science 15(4), 765–785 (2009)
11. Bolón-Canedo, V., Sánchez-Maroño, N., Alonso-Betanzos, A.: Feature selection and classification in multiple class datasets: An application to KDD Cup 99 dataset. Expert Systems with Applications 38(5), 5947–5957 (2011)
12. He, X., Zeng, J., Xue, S., Wang, L.: A New Estimation of Distribution Algorithm Based Edge Histogram Model for Flexible Job-Shop Problem. In: Yu, Y., Yu, Z., Zhao, J. (eds.) CSEEE 2011. CCIS, vol. 158, pp. 315–320. Springer, Heidelberg (2011)
13. Saeys, Y., Inza, I., Larranaga, P.: A review of feature selection techniques in bioinformatics. Bioinformatics 23(19), 2507–2517 (2007)
14. Saeys, Y., Degroeve, S., Aeyels, D., Van de Peer, Y., Rouze, P.: Fast feature selection using a simple estimation of distribution algorithm: a case study on splice site prediction. Bioinformatics 19, 179–188 (2003)
15. Saeys, Y., Degroeve, S., Van de Peer, Y.: Feature Ranking Using an EDA-based Wrapper Approach. STUD FUZZ, vol. 192, pp. 243–257 (2006)
16. Yang, Y., Webb, G.: A Comparative Study of Discretization Methods for Naive-Bayes Classifiers. In: Proceedings of the Pacific Rim Knowledge Acquisition Workshop, pp. 159–173 (2002)
17. Chunga, Y., Wahid, N.: A hybrid network intrusion detection system using simplified swarm optimization (SSO). Applied Soft Computing 12(9), 3014–3022 (2012)
18. Zhu, Z., Ong, Y., Dash, M.: Wrapper-Filter Feature Selection Algorithm Using A Memetic Framework. IEEE Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics 37(1), 70–76 (2007)