Decision Tree Techniques Applied on NSL-KDD Data and Its Comparison with Various Feature Selection Techniques

H.S. Hota 1 and Akhilesh Kumar Shrivas 2

1 Guru Ghasidas Central University, Chhattisgarh, India
2 Dr. C.V. Raman University, Chhattisgarh, India
{profhota,akhilesh.mca29}@gmail.com

Abstract. Intrusion detection system (IDS) is one of the important research areas in the field of information and network security, aimed at protecting information or data from unauthorized access. An IDS is a classifier that can classify data as normal or attack. In this paper, we focus on several existing feature selection techniques for removing irrelevant features from the NSL-KDD data set, in order to develop a robust classifier that is computationally efficient and effective. Four different feature selection techniques: Info Gain, Correlation, ReliefF and Symmetrical Uncertainty are combined with the C4.5 decision tree technique to develop an IDS. Experimental work is carried out using the WEKA open source data mining tool, and the obtained results show that C4.5 with the Info Gain feature selection technique produced the highest accuracy of 99.68% with 17 features; the result obtained with Symmetrical Uncertainty and C4.5 is also promising, with 99.64% accuracy using only 11 features. These results are better than those of work already done in this area.

Keywords: Decision Tree (DT), Feature Selection (FS), Intrusion Detection System (IDS).

1 Introduction

The ever increasing size of data in computers has made information security more important. Information security means protecting information and information systems from unauthorized access. It becomes even more important as data are accessed in a network environment and transferred over insecure media. Many authors have worked on this issue and applied feature selection techniques on the NSL-KDD data set for the multiclass problem. Mukherjee, S. et al. [9] have proposed a new feature reduction method, the Feature Validity Based Reduction Method (FVBRM), applied it to the efficient Naive Bayes classifier, and achieved 97.78% accuracy on the reduced NSL-KDD data set with 24 features. This technique performs better than Correlation based Feature Selection (CFS), Gain Ratio (GR) and Info Gain Ratio (IGR) for designing an IDS. Panda, M. et al. [10] have suggested a hybrid technique combining random forest, dichotomies and an ensemble of balanced nested dichotomies (END), and achieved a detection rate of 99.50% and a low false alarm rate of 0.1%, which are quite encouraging in comparison to all other models. Imran, H.M. et al. [11] have proposed a hybrid of the Linear Discriminant Analysis (LDA) algorithm and a Genetic Algorithm (GA) for feature selection. The proposed feature selection technique was applied to a radial basis function (RBF) model with the NSL-KDD data set to develop a robust IDS; among the different feature subsets applied, the RBF model produced its highest accuracy of 99.3% with 11 features. Bhavsar, Y.B. et al. [12] have discussed different support vector machine (SVM) kernel functions, the Gaussian radial basis function (RBF) kernel, the polynomial kernel and the sigmoid kernel, for developing an IDS. Comparing the accuracy and computation time of the different kernel functions, the authors suggested the Gaussian RBF kernel as the best, achieving the highest accuracy of 98.57% with 10-fold cross validation. Aziz, A.S.A. et al. [8] have proposed a Minkowski distance technique based on a genetic algorithm to develop an IDS that detects anomalies. Applied on NSL-KDD data, the proposed technique produces a higher detection rate of 82.13% with a higher threshold value and a smaller population; the authors have also compared their results with the Euclidean distance. In a recent work by Hota, H.S. et al. [15], a binary class based IDS using the random forest technique combined with rank based feature selection was investigated; the accuracy achieved was 99.76% with 15 features. It is observed from the literature that feature selection is an important issue due to the high dimensional feature space of IDS data. This research work explores many existing feature selection techniques applied on the NSL-KDD data set to remove irrelevant features, so that the IDS becomes computationally fast and efficient.

M.K. Kundu et al. (eds.), Advanced Computing, Networking and Informatics - Volume 1, Smart Innovation, Systems and Technologies 27, DOI: 10.1007/978-3-319-07353-8_24, © Springer International Publishing Switzerland 2014
Experimental work is carried out using the WEKA (Waikato Environment for Knowledge Analysis) [13] and Tanagra [14] open source data mining tools. The obtained results reveal that C4.5 with the Info Gain feature selection technique and C4.5 with the Symmetrical Uncertainty feature selection technique produce the highest accuracy with 17 and 11 features respectively, the highest among all the research outcomes reviewed above.

2 Methods and Materials

The various methods (techniques) based on data mining and the benchmark data (material) used in this research work are explained in detail below.

2.1 Decision Tree

Decision tree [2] is probably the most popular data mining technique for classification. The principal idea of a decision tree is to split the data recursively into subsets so that each subset contains more or less homogeneous states of the target variable. At each split in the tree, all input attributes are evaluated for their impact on the predictable attribute. When this recursive process is completed, the resulting decision tree can be converted into simple If-Then rules. We have used various decision tree techniques: C4.5, ID3 (Iterative Dichotomizer 3), CART (Classification and Regression Tree), REP Tree and decision list.
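The recursive splitting and If-Then conversion described above can be sketched with scikit-learn, whose DecisionTreeClassifier implements CART rather than C4.5; the two feature names and the six toy records below are invented for illustration, not taken from NSL-KDD:

```python
# Illustrative sketch: fit an entropy-based decision tree on toy traffic
# records and render it as nested If-Then rules.
from sklearn.tree import DecisionTreeClassifier, export_text

# Toy records: [duration, src_bytes]; labels: 0 = normal, 1 = attack
X = [[0, 500], [2, 480], [300, 10], [250, 5], [1, 520], [280, 8]]
y = [0, 0, 1, 1, 0, 1]

tree = DecisionTreeClassifier(criterion="entropy", random_state=0)
tree.fit(X, y)

# The fitted tree printed as If-Then style rules
print(export_text(tree, feature_names=["duration", "src_bytes"]))
```

The same split-and-convert idea underlies all the tree variants the paper compares; only the split criterion and branching scheme differ.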


Among the above decision tree techniques, C4.5 is the most powerful and produces better results for many classification problems. C4.5 [1] is an extension of ID3 that accounts for unavailable values, continuous attribute value ranges, pruning of decision trees and rule derivation. In building a decision tree, C4.5 can deal with training sets that contain records with unknown attribute values by evaluating the gain, or the gain ratio, only on those records for which the attribute values are available. Records with unknown attribute values can then be classified by estimating the probability of the various possible results. Unlike CART, which generates a binary decision tree, C4.5 produces a tree with a variable number of branches per node: when a discrete variable is chosen as the splitting attribute, there is one branch for each value of the attribute.

2.2 Feature Selection

Feature selection [5] is an optimization process in which one tries to find the best feature subset from the fixed set of original features, according to a given processing goal and feature selection criteria. The solution of an optimal feature selection need not be unique: different subsets of the original features may accomplish the same goal with the same performance measure. The optimal feature set depends on the data, the processing goal, and the selection criteria being used. In this experiment, we have used the following ranking based feature selection techniques.

Information gain [4] is a feature selection technique which ranks the features of a data set. The measure is based on the pioneering work by Claude Shannon on information theory, which studied the value or information content of messages. Let node N represent or hold the tuples of partition D. The attribute with the highest information gain is chosen as the splitting attribute for node N. This attribute minimizes the information needed to classify the tuples in the resulting partitions and reflects the least randomness or impurity in these partitions.

Correlation based feature selection (CFS) [7] is used to determine the best feature subset and is usually combined with search strategies such as forward selection, backward elimination, bi-directional search, best-first search and genetic search. Among the given features, it finds an optimal subset which is most relevant to a class and contains no redundant features. It evaluates the merit of a feature subset on the basis of the hypothesis: good feature subsets contain features highly correlated with the class, yet uncorrelated with each other. This hypothesis gives rise to two definitions. Feature-class correlation indicates how much a feature is correlated to a specific class, while feature-feature correlation is the correlation between two features.
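As a concrete illustration of the information gain measure described above, the following plain-Python sketch (not WEKA's implementation) computes the gain of a discrete feature against a class label; the toy feature values and class labels are invented for the example:

```python
# Information gain = H(class) minus the expected entropy of the class
# after partitioning the records by the feature's values.
from collections import Counter
from math import log2

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def info_gain(feature_values, labels):
    n = len(labels)
    gain = entropy(labels)
    for v in set(feature_values):
        subset = [l for f, l in zip(feature_values, labels) if f == v]
        gain -= (len(subset) / n) * entropy(subset)
    return gain

# Toy data: 'protocol' perfectly separates normal from attack traffic
protocol = ["tcp", "tcp", "udp", "udp"]
labels = ["attack", "attack", "normal", "normal"]
print(info_gain(protocol, labels))  # 1.0 bit: a perfect binary split
```

Ranking all 41 NSL-KDD features by this score is what produces the Info Gain ordering reported later in Table 3.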
ReliefF [6] is an extension of the Relief algorithm that can handle noisy and multiclass data sets. ReliefF uses a nearest neighbor implementation to maintain relevancy scores for each attribute. It defines a good discriminating attribute as one that has the same value for instances of the same class and different values for instances of different classes. The ReliefF algorithm is robust and can handle incomplete and noisy data as well as multiclass data sets.

Symmetrical Uncertainty [7] is another feature selection method, devised to compensate for information gain's bias towards features with more values. It capitalizes on the symmetrical property of information gain. The symmetrical uncertainty between a feature and the target concept can be used to evaluate the goodness of that feature for classification.
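Symmetrical uncertainty is commonly defined as SU(X, C) = 2 · IG(X, C) / (H(X) + H(C)), which normalizes information gain into [0, 1] and thereby offsets its bias toward many-valued features. A minimal self-contained sketch, with invented toy data:

```python
# Symmetrical uncertainty of a discrete feature against the class label.
from collections import Counter
from math import log2

def entropy(values):
    n = len(values)
    return -sum((c / n) * log2(c / n) for c in Counter(values).values())

def symmetrical_uncertainty(feature, labels):
    n = len(labels)
    # Expected class entropy after conditioning on the feature
    cond = sum(
        (feature.count(v) / n)
        * entropy([l for f, l in zip(feature, labels) if f == v])
        for v in set(feature)
    )
    gain = entropy(labels) - cond
    denom = entropy(feature) + entropy(labels)
    return 2 * gain / denom if denom else 0.0

feature = ["tcp", "tcp", "udp", "udp"]
labels = ["attack", "attack", "normal", "normal"]
print(symmetrical_uncertainty(feature, labels))  # 1.0: feature fully determines class
```

A score of 1.0 means the feature fully determines the class; 0.0 means they are independent.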

2.3 NSL-KDD Data

Many benchmark data sets related to IDS are available in repository sites. One of the data sets publicly available for the development and evaluation of IDS is the NSL-KDD data set [3], which is derived from the KDD99 data set. One of the problems with the KDD99 data set is the huge number of redundant records, which causes the learning algorithms to be biased towards the frequent records and thus prevents them from learning infrequent records, which are usually more harmful to networks, such as U2R and R2L attacks. In addition, the existence of these repeated records in the test set causes the evaluation results to be biased towards methods which have better detection rates on the frequent records. The data set has 25192 samples in total, consisting of 41 features with four different types of attack plus normal data. Details of the samples according to attack class are shown in Table 1.

Table 1. Different attacks and normal data along with sample size

Class type    Number of instances
Normal        13449
DoS           9234
R2L           209
U2R           11
Probe         2289
Total         25192

From the above table it is clear that the data set is highly unbalanced: there is no uniform distribution of samples. The number of samples of the DoS attack type is high (9234 samples), while the U2R attack type has only 11 samples. This unbalanced distribution of samples may create problems during training of any data mining based classification model. In order to verify the efficiency and accuracy of the IDS model, the data set is divided into various partitions of training and testing samples.
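The partitioning step described above might be sketched as follows; scikit-learn's train_test_split is used here as a stand-in for WEKA's percentage split (an assumption, since the paper does not state how the splits were produced), and the class counts are those of Table 1:

```python
# Stratified 60-40%, 80-20% and 90-10% training-testing partitions of a
# stand-in data set with the NSL-KDD class distribution from Table 1.
from sklearn.model_selection import train_test_split

records = list(range(25192))                       # stand-in for the 25192 samples
labels = [0] * 13449 + [1] * 9234 + [2] * 209 + [3] * 11 + [4] * 2289

for test_fraction in (0.40, 0.20, 0.10):
    train, test, _, _ = train_test_split(
        records, labels, test_size=test_fraction,
        stratify=labels,                           # preserve the skewed class ratios
        random_state=0,
    )
    print(len(train), len(test))
```

Stratification keeps the rare U2R class represented in both splits, which matters precisely because of the imbalance noted above.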

3 Experimental Work

The experimental work is carried out using the WEKA [13] and TANAGRA [14] open source data mining software. The entire experimental work is divided into two parts: first, building a multiclass classifier, and second, applying feature selection techniques. A multiclass classifier for IDS is developed based on various decision tree techniques, and finally the best model (C4.5) with a reduced feature subset is selected. As explained, the NSL-KDD data set is divided into three different partitions: 60-40%, 80-20%, and 90-10% training-testing samples. The decision tree based models are first trained and then tested on the various partitions of the data set. The accuracy of the different models on the different partitions is shown in Table 2. Accuracy varies from one partition to another; the highest accuracy of 99.56% was achieved by C4.5 with the 90-10% partition using all available features.


Table 2. Accuracy of different partitions of data set

Model           60-40% partition  80-20% partition  90-10% partition
C4.5            99.39             99.52             99.56
ID3             97.75             97.92             97.86
CART            99.29             99.33             99.05
REP Tree        98.55             98.86             99.12
Decision Table  98.44             98.70             98.69

Table 3. Obtained rank in case of various ranking based feature selection techniques

Feature selection technique and features with rank (in descending order):

Info Gain: 5,3,6,4,30,29,33,34,35,38,12,39,25,23,26,37,32,36,31,24,41,2,27,40,28,1,10,8,13,16,19,22,17,15,14,18,11,7,9,20,21

Correlation: 29,39,38,25,26,33,34,4,12,23,32,3,35,40,27,41,28,36,31,37,2,30,8,1,22,19,10,24,14,15,6,17,16,13,18,11,5,9,7,20,21

ReliefF: 3,29,4,36,32,38,12,33,2,34,39,23,26,35,40,30,31,8,24,25,37,27,41,28,10,22,1,6,14,11,13,15,19,18,16,5,17,9,7,20,21

Symmetrical Uncertainty: 4,30,5,25,26,39,38,6,29,12,3,35,34,33,23,37,36,32,31,40,2,41,27,24,1,28,10,22,8,13,16,14,17,19,11,15,9,18,7,20,21

Feature selection is an optimization process in which we find the best feature subset from the original data set. The best model obtained, C4.5 on the 90-10% partition, is selected for applying feature selection. The various ranking based feature selection techniques, Info Gain, Correlation, ReliefF and Symmetrical Uncertainty, are applied with the C4.5 model. Table 3 shows the ranks obtained from the various feature selection techniques in descending order (left to right); the lowest-ranked features are then removed one by one and the reduced subsets applied to the C4.5 model. Since the ranks obtained from the different feature selection techniques differ, the results obtained with these techniques also differ in accuracy. Table 4 shows the accuracy of the model with the different feature selection techniques and reduced feature subsets. C4.5 with Info Gain and C4.5 with Symmetrical Uncertainty produce the best accuracies of 99.68% (with 17 features) and 99.64% (with 11 features) respectively.
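The remove-one-feature-at-a-time procedure described above can be sketched as follows; evaluate is a hypothetical stand-in for training and testing the C4.5 model in WEKA, and the toy ranking and accuracy values are invented:

```python
# Rank-based subset search: starting from the full ranked feature list,
# drop the lowest-ranked feature one at a time, re-evaluate the classifier,
# and keep the subset that scored best.
def best_subset(ranked_features, evaluate):
    best_acc, best_feats = 0.0, list(ranked_features)
    feats = list(ranked_features)
    while feats:
        acc = evaluate(feats)
        if acc >= best_acc:               # '>=' prefers smaller subsets on ties
            best_acc, best_feats = acc, list(feats)
        feats.pop()                       # remove the lowest-ranked remaining feature
    return best_acc, best_feats

# Toy evaluator: pretend accuracy rises once three or fewer features remain
toy_rank = [5, 3, 6, 4, 30]
acc, feats = best_subset(toy_rank, lambda f: 0.99 if len(f) <= 3 else 0.95)
print(acc, feats)
```

With a real evaluator this search would trace out exactly the per-subset accuracies summarized in Table 4.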

Fig. 1. Accuracy graph in case of different feature selection techniques with C4.5


Both models are competitive and can be accepted for the development of an IDS. Fig. 1 shows a pictorial view of the accuracy obtained with the different feature selection techniques combined with the C4.5 model.

4 Conclusion

Information security is a crucial issue due to the exchange of huge amounts of data and information in day to day life. An intrusion detection system (IDS) is a way to secure our host computers as well as networked computers from intruders. An efficient and computationally fast IDS can be developed using decision tree based techniques; on the other hand, the irrelevant features available in the data set must be removed. In this research work, an attempt has been made to explore various decision tree techniques with many existing feature selection techniques. Four different feature selection techniques were tested on CART, ID3, REP Tree, Decision Table and C4.5. Two rank based feature selection techniques, Info Gain and Symmetrical Uncertainty, along with C4.5 produce 99.68% and 99.64% accuracy with 17 and 11 features respectively. The results obtained in these two cases are quite satisfactory compared to other research work already done in this field. In future, a feature selection technique based on statistical measures will be developed and tested on more than one benchmark data set related to intrusion; the proposed feature selection technique will also be compared with all other existing feature selection techniques.

Table 4. Selected features in case of C4.5 model using various feature selection techniques

Feature Selection Technique    No. of features  Accuracy  Selected Features
Info Gain-C4.5                 17               99.68     {5,3,6,4,30,29,33,34,35,38,12,39,25,23,26,37,32}
Correlation-C4.5               37               99.56     {29,39,38,25,26,33,34,4,12,23,32,3,35,40,27,41,28,36,31,37,2,30,8,1,22,19,10,24,14,15,6,17,16,13,18,11,5}
ReliefF-C4.5                   37               99.56     {3,29,4,36,32,38,12,33,2,34,39,23,26,35,40,30,31,8,24,25,37,27,41,28,10,22,1,6,14,11,13,15,19,18,16,5,17}
Symmetrical Uncertainty-C4.5   11               99.64     {4,30,5,25,26,39,38,6,29,12,3}

References

1. Pujari, A.K.: Data Mining Techniques, 4th edn. Universities Press (India) Private Limited (2001)
2. Tang, Z.H., MacLennan, J.: Data Mining with SQL Server 2005. Wiley Publishing, Inc., USA (2005)
3. Web sources, http://www.iscx.info/NSL-KDD/ (last accessed on October 2013)
4. Han, J., Kamber, M.: Data Mining Concepts and Techniques, 2nd edn. Morgan Kaufmann, San Francisco (2006)


5. Krzysztopf, J.C., Pedrycz, W., Roman, W.S.: Data Mining Methods for Knowledge Discovery, 3rd edn. Kluwer Academic Publishers (2000)
6. Zewdie, M.: Optimal feature selection for Network Intrusion Detection System: A data mining approach. Thesis, Master of Science in Information Science, Addis Ababa University, Ethiopia (2011)
7. Parimala, R., Nallaswamy, R.: A study of a spam E-mail classification using feature selection package. Global Journals Inc. (USA) 11, 44–54 (2011)
8. Aziz, A.S.A., Salama, M.A., Hassanien, A., Hanafi, S.E.-O.: Artificial Immune System Inspired Intrusion Detection System Using Genetic Algorithm. Informatica 36, 347–357 (2012)
9. Mukherjee, S., Sharma, N.: Intrusion detection using Bayes classifier with feature reduction. Procedia Technology 4, 119–128 (2012)
10. Panda, M., Abraham, A., Patra, M.R.: A hybrid intelligent approach for network intrusion detection. Procedia Engineering 30, 1–9 (2012)
11. Imran, H.M., Abdullah, A.B., Hussain, M., Palaniappan, S., Ahmad, I.: Intrusion Detection based on Optimum Features Subset and Efficient Dataset Selection. International Journal of Engineering and Innovative Technology (IJEIT) 2, 265–270 (2012)
12. Bhavsar, Y.B., Waghmare, K.C.: Intrusion Detection System Using Data Mining Technique: Support Vector Machine. International Journal of Emerging Technology and Advanced Engineering 3, 581–586 (2013)
13. Web sources, http://www.cs.waikato.ac.nz/~ml/weka/ (last accessed on October 2013)
14. Web sources, http://eric.univ-lyon2.fr/~ricco/tanagra/en/tanagra (last accessed on October 2013)
15. Hota, H.S., Shrivas, A.K.: Data Mining Approach for Developing Various Models Based on Types of Attack and Feature Selection as Intrusion Detection Systems (IDS). In: Intelligent Computing, Networking, and Informatics. AISC, vol. 243, pp. 845–851. Springer, Heidelberg (2014)
