Network Intrusion Detection using Feature Selection and Decision Tree Classifier

Shina Sheen
Dept. of Mathematics and Computer Applications, PSG College of Technology, Coimbatore, India
[email protected]

R. Rajesh, Member, IEEE
School of Computer Science and Engineering, Bharathiar University, Coimbatore, India
[email protected]
Abstract— Security of computers and of the networks that connect them is of increasing significance. Machine learning techniques such as decision trees have been applied to the field of intrusion detection: they can learn normal and anomalous patterns from training data and generate classifiers that are then used to detect attacks on computer systems. In general the input to a classifier lies in a high-dimensional feature space, but not all of the features are relevant to the classes to be distinguished. Feature selection is therefore an important step in classification, since the inclusion of irrelevant and redundant features often degrades the performance of classification algorithms in both speed and accuracy. In this paper, we consider three different filter-based approaches to feature selection: Chi square, Information Gain and ReliefF. A comparative study of the three approaches is carried out using a decision tree as the classifier. The KDD Cup 99 data set is used to train and test the decision tree classifiers.

Keywords— Decision trees, Feature selection, Filter method, Chi square, Information Gain, ReliefF
I. INTRODUCTION

As the cost of information processing and Internet accessibility falls, more and more organizations are becoming vulnerable to a wide variety of cyber threats. According to a recent survey by CERT/CC (Computer Emergency Response Team/Coordination Center) [12], the rate of cyber attacks has been more than doubling every year in recent times. It has become increasingly important to make our information systems, especially those used for critical functions in the military and commercial sectors, resistant to and tolerant of such attacks. Intrusion detection involves identifying a set of malicious actions that compromise the integrity, confidentiality, and availability of information resources. Traditional methods for intrusion detection are based on extensive knowledge of signatures of known attacks: monitored events are matched against the signatures to detect intrusions. These methods extract features from various audit streams and detect intrusions by comparing the feature values to a set of attack signatures provided by human experts. The signature database has to be manually revised for each new type of intrusion that is discovered. A significant limitation of signature-based methods is that they cannot detect emerging cyber threats, since by their very nature these threats are launched using previously unknown attacks. In addition, even after a new attack is discovered and its signature developed, there is often a substantial latency before it is deployed across networks. These limitations have led to an increasing interest in intrusion detection techniques based upon data mining [7, 8, 16, 17, 18].

We describe the use of machine learning techniques which provide decision aids for analysts and which automatically generate rules to be used for computer network intrusion detection. We use Quinlan's C4.5 algorithm [21] to construct decision trees from structured data. The C4.5 algorithm uses information-theoretic precepts to create efficient decision trees: given a structured data set, a list of attributes describing each data element, and a set of categories to partition the data into, it determines which attribute most accurately categorizes the data.

Feature selection is a preprocessing step to machine learning in which a subset of relevant features is selected for building robust learning models [3, 6, 11, 14]. It is the process of choosing a subset of the original features so that the feature space is optimally reduced according to an evaluation criterion. The raw data collected is usually large, so it is desirable to select a subset of it by creating feature vectors that retain most of the information in the data. Existing feature selection methods for machine learning typically fall into two broad categories: those which evaluate the worth of features using the learning algorithm that is ultimately to be applied to the data, and those which evaluate the worth of features using heuristics based on general characteristics of the data. The former are referred to as wrappers and the latter as filters [15, 27]. In this paper, we consider three filter-based approaches to feature selection: Chi square [10], Information Gain [25] and ReliefF [19]. A comparative study of the three approaches is carried out using a decision tree as the classifier. The KDD Cup 99 data set is used to train and test the decision tree classifiers.
II. CLASSIFICATION
A. Introduction

The classification of large data sets is an important problem in data mining. Given a database of records and a set of classes such that each record belongs to one of the given classes, the problem of classification is to decide the class to which a given record belongs. In supervised classification we have a training set of records, and for each record of this set the class to which it belongs is known. Using the training set, the classification process attempts to generate descriptions of the classes, and these descriptions are then used to classify unknown records. There are several approaches to supervised classification; decision trees are one of them.

B. Decision Trees

The decision tree classifier by Quinlan is one of the most well-known machine learning techniques. A decision tree is made of decision nodes and leaf nodes. Each decision node corresponds to a test X over a single attribute of the input data and has a number of branches, each of which handles one outcome of the test X. Each leaf node represents a class that is the result of the decision for a case. The process of constructing a decision tree is basically a divide-and-conquer process. A set T of training data consists of k classes (c1, c2, ..., ck). If T consists only of cases of one single class, T will be a leaf. If T contains no cases, T is a leaf and the class associated with this leaf is assigned the majority class of its parent node. If T contains cases of mixed classes, a test based on some attribute ai of the training data is carried out and T is split into n subsets (T1, T2, ..., Tn), where n is the number of outcomes of the test over attribute ai. The same process of constructing a decision tree is recursively performed over each Tj, where j ranges from 1 to n, until every subset belongs to a single class.
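To make this divide-and-conquer process concrete, the following is a minimal sketch of ID3-style induction over categorical attributes. It is illustrative only, not Quinlan's full C4.5, which additionally uses gain ratio, handles continuous attributes and missing values, and prunes the tree; the toy protocol/flag data is hypothetical and not drawn from our experiments.

```python
# Minimal ID3-style decision tree induction over categorical attributes,
# illustrating the divide-and-conquer process described above. Sketch only;
# C4.5 adds gain ratio, continuous attributes, missing-value handling and
# pruning. The toy data below is hypothetical.
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def info_gain(rows, labels, attr):
    """Entropy reduction obtained by splitting on attribute index `attr`."""
    n = len(labels)
    split = {}
    for row, y in zip(rows, labels):
        split.setdefault(row[attr], []).append(y)
    remainder = sum(len(ys) / n * entropy(ys) for ys in split.values())
    return entropy(labels) - remainder

def build_tree(rows, labels, attrs):
    """Return a leaf (class label) or an (attribute, {value: subtree}) node."""
    if len(set(labels)) == 1:                 # single class -> leaf
        return labels[0]
    if not attrs:                             # no tests left -> majority leaf
        return Counter(labels).most_common(1)[0][0]
    best = max(attrs, key=lambda a: info_gain(rows, labels, a))
    branches = {}
    for value in set(row[best] for row in rows):
        sub = [(r, y) for r, y in zip(rows, labels) if r[best] == value]
        branches[value] = build_tree([r for r, _ in sub],
                                     [y for _, y in sub],
                                     [a for a in attrs if a != best])
    return (best, branches)

# Attribute 0 is the protocol, attribute 1 the status flag.
rows = [("tcp", "SF"), ("udp", "SF"), ("tcp", "REJ"), ("tcp", "REJ")]
labels = ["normal", "normal", "attack", "attack"]
print(build_tree(rows, labels, attrs=[0, 1]))  # splits on the flag attribute
```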
III. FEATURE SELECTION
Feature selection, also known as "subset selection" or "variable selection", is a common pre-processing step used in machine learning, in which a subset of the features available from the original data is selected for subsequent application of a learning algorithm [1]. Feature selection is necessary either because it is computationally infeasible to use all available features, or because of estimation problems when limited data samples (but a large number of features) are present. The latter problem is related to the so-called "curse of dimensionality", which refers to the fact that the number of data samples required to estimate some arbitrary multivariate probability distribution increases exponentially as the number of dimensions in the data increases linearly. All feature selection methods need to use some sort of evaluation function together with a search procedure to find the optimal feature set. The evaluation functions can be divided into two main groups: filters and wrappers. Filters measure the relevance of feature subsets independently of any classifier, whereas wrappers use the classifier's performance as the evaluation measure. In this paper, we consider three filter-based approaches to feature selection: Chi square, Information Gain and ReliefF.
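As an illustration of the filter idea, the sketch below scores each feature against the class label with no classifier in the loop. It assumes scikit-learn is available: its chi2 and mutual_info_classif scorers approximate the Chi square and Information Gain evaluators we used in Weka, while ReliefF has no scikit-learn implementation and is omitted. The data is synthetic.

```python
# Filter-style ranking sketch: score features against the class label,
# independently of any classifier. Assumes scikit-learn; synthetic data.
import numpy as np
from sklearn.feature_selection import chi2, mutual_info_classif

rng = np.random.default_rng(0)
X = rng.integers(0, 10, size=(1000, 8)).astype(float)  # toy non-negative features
y = rng.integers(0, 2, size=1000)                      # toy normal/attack labels

chi_scores, _ = chi2(X, y)             # higher = stronger class dependence
ig_scores = mutual_info_classif(X, y)  # information-gain-like relevance

# Rank features from most to least relevant under each criterion.
for name, scores in (("chi square", chi_scores), ("info gain", ig_scores)):
    ranking = np.argsort(scores)[::-1]
    print(f"{name} ranking (best first): {ranking.tolist()}")
```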
IV. EXPERIMENTS AND EVALUATIONS
In our preliminary work, we selected the KDD 1999 dataset to test the performance of our approach based on decision trees with feature selection, because this dataset is still a common benchmark for evaluating IDS techniques. We used an open-source machine learning framework, Weka [28] (the latest Windows version at the time of writing is Weka 3.4). It is a collection of machine learning algorithms for data mining tasks and contains tools for data pre-processing, classification, regression, clustering, association rules, and visualization. For feature selection, we randomly selected a subset of 5000 records from the KDD 1999 dataset and divided it into two classes, normal (no) or attack (yes). We applied the Chi square, Information Gain and ReliefF approaches to it to acquire the most relevant and necessary features. To identify attacks, we adopted 10-fold cross validation to verify the feature selection results. We then used the decision tree approach (the C4.5 algorithm) to evaluate our work on intrusion detection.

A. Results and Analysis

The results of our work are given below. When applying the algorithms specified above, we kept the top 20 features selected by each method. The most relevant features ranked according to the three algorithms are shown in Table I. These 20 features were then provided to the decision tree for classification, and it was found that Chi square and Information Gain had similar performance, while ReliefF gave lower performance on the KDD data set, as shown in Fig. 1.
Fig. 1. Performance chart of Chi square, ReliefF and Information Gain.
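For concreteness, the evaluation procedure just described can be sketched as follows. scikit-learn stands in for Weka here, CART for C4.5, and the data is synthetic, so this is an assumed illustration of the protocol rather than a reproduction of our numbers.

```python
# Sketch of the evaluation procedure: keep the top-20 ranked features, then
# score a decision tree with 10-fold cross-validation. scikit-learn stands
# in for Weka, CART for C4.5; the data is synthetic.
import numpy as np
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(1)
X = rng.random((5000, 41))         # stand-in for the 41 KDD'99 features
y = rng.integers(0, 2, size=5000)  # 0 = normal, 1 = attack

pipe = Pipeline([
    ("select", SelectKBest(chi2, k=20)),    # filter step: top-20 features
    ("tree", DecisionTreeClassifier()),     # CART; the paper used C4.5
])
# Putting selection inside the pipeline re-runs it per CV fold, so the
# feature ranking never sees the held-out test fold.
scores = cross_val_score(pipe, X, y, cv=10)
print("10-fold accuracy: %.4f +/- %.4f" % (scores.mean(), scores.std()))
```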
Table I. Most relevant top 20 features of the KDD data set

F.No  Feature                        Description
2     protocol type                  Connection protocol (e.g. tcp, udp)
3     service                        Destination service (e.g. telnet, ftp)
4     flag                           Status flag of the connection
5     source bytes                   Bytes sent from source to destination
12    logged in                      1 if successfully logged in; 0 otherwise
22    is guest login                 1 if the login is a "guest" login; 0 otherwise
23    count                          Number of connections to the same host as the current connection in the past two seconds
24    srv count                      Number of connections to the same service as the current connection in the past two seconds
27    rerror rate                    % of connections that have "REJ" errors
28    srv rerror rate                % of connections to the same service that have "REJ" errors
30    diff srv rate                  % of connections to different services
31    srv diff host rate             % of connections to different hosts
32    dst host count                 Count of connections having the same destination host
33    dst host srv count             Count of connections having the same destination host and using the same service
34    dst host same srv rate         % of connections having the same destination host and using the same service
35    dst host diff srv rate         % of different services on the current host
37    dst host srv diff host rate    % of connections to the same service coming from different hosts
38    dst host serror rate           % of connections to the current host that have an S0 error
40    dst host rerror rate           % of connections to the current host that have an RST error
41    dst host srv rerror rate       % of connections to the current host and specified service that have an RST error
It was also seen that the time taken to build the model is reduced considerably when feature selection is performed, without compromising the accuracy of classification, as shown in Fig. 2. We took the top 5, top 10, top 15 and top 20 features and carried out a comparative study of the classification accuracy. It was found that selecting the top 15 and top 20 features gave almost the same performance for Chi square and Information Gain, as shown in Table II. The comparison of accuracy based on the number of features selected is shown in Fig. 3.

Table II. Classification accuracy (%)
No. of Features   Chi Square   Info Gain   ReliefF
5                 95.0207      95.0207     92.6349
10                95.6432      95.6432     92.8769
15                95.8506      95.8506     92.8769
20                95.8506      95.8506     95.6432
Fig. 2. Time taken to build the model.
Fig. 3. Performance of the algorithms based on the number of selected features.
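The shape of the Table II / Fig. 2 comparison, sweeping the number of retained features while timing model building, can be sketched as below. As before, the data is synthetic and scikit-learn is an assumed stand-in for Weka; only the structure of the experiment is reproduced, not its numbers.

```python
# Sketch of the Table II / Fig. 2 style comparison: accuracy and build time
# as the number of selected features grows. Synthetic data; assumed
# scikit-learn stand-in for the Weka tooling actually used.
import time
import numpy as np
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(2)
X = rng.random((5000, 41))
y = rng.integers(0, 2, size=5000)

for k in (5, 10, 15, 20, 41):  # k = 41 means no feature selection at all
    pipe = Pipeline([("select", SelectKBest(chi2, k=k)),
                     ("tree", DecisionTreeClassifier())])
    start = time.perf_counter()
    acc = cross_val_score(pipe, X, y, cv=10).mean()
    elapsed = time.perf_counter() - start
    print(f"top-{k:2d} features: accuracy = {acc:.4f}, time = {elapsed:.2f}s")
```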
V. CONCLUSION
The decision tree with feature selection was able to outperform the decision tree algorithm without feature selection. We believe that this improvement is due to the fact that the former is able to focus on relevant features and eliminate unnecessary or distracting ones. This initial filtering improves the classification ability of the decision tree and shortens training time. Of the three filter algorithms chosen, Chi square and Information Gain were found to give better performance than ReliefF on the KDD data set. The work may be further extended by considering the four major attack categories in the KDD data set; the data set can be divided in such a way that enough data is present for each of the four major types of attack.

ACKNOWLEDGMENT

The first author is thankful to all the staff members of PSG College of Technology, Coimbatore. The second author is thankful to UGC for the major research project grant. The authors are thankful to all the staff members of Bharathiar University for their valuable support. The authors are also thankful to the reviewers for their valuable comments.

REFERENCES
[1] A. Blum and P. Langley, "Selection of relevant features and examples in machine learning," Artificial Intelligence, vol. 97, no. 1-2, pp. 245-271, Dec. 1997.
[2] B. Caswell and M. Roesch, "Snort: The open source network intrusion detection system," 2004.
[3] M. Dash and H. Liu, "Feature selection for classification," Intelligent Data Analysis: An International Journal, vol. 1, no. 3, pp. 131-156, 1997.
[4] D. Denning, "An intrusion-detection model," IEEE Transactions on Software Engineering, vol. SE-13, no. 2, pp. 222-232, 1987.
[5] O. Depren, M. Topallar, E. Anarim, and M. K. Ciliz, "An intelligent intrusion detection system for anomaly and misuse detection in computer networks," Expert Systems with Applications, vol. 29, pp. 713-722, 2005.
[6] J. Doak, "An evaluation of feature selection methods and their application to computer security," Technical report, University of California, Davis, Department of Computer Science, 1992.
[7] E. Bloedorn et al., "Data mining for network intrusion detection: How to get started," Technical paper, 2001.
[8] G. Stein, B. Chen, A. S. Wu, and K. A. Hua, "Decision tree classifier for network intrusion detection with GA-based feature selection," in Proceedings of the 43rd ACM Annual Southeast Conference, Kennesaw, Georgia, vol. 2, 2005.
[9] H. G. Kayacik, A. N. Zincir-Heywood, and M. I. Heywood, "Selecting features for intrusion detection: A feature relevance analysis on KDD 99 intrusion detection datasets."
[10] H. Liu and R. Setiono, "Chi2: Feature selection and discretization of numeric attributes," in Proceedings of the Seventh International Conference on Tools with Artificial Intelligence, pp. 388-391, 1995.
[11] I. Guyon and A. Elisseeff, "An introduction to variable and feature selection," Journal of Machine Learning Research, vol. 3, pp. 1157-1182, 2003.
[12] A. K. Jones and R. S. Sielken, "Computer system intrusion detection: A survey," Feb. 9, 2000.
[13] KDD Cup 1999.
[14] K. Kira and L. A. Rendell, "The feature selection problem: Traditional methods and a new algorithm," in Proceedings of AAAI'92, 1992.
[15] R. Kohavi and G. H. John, "Wrappers for feature subset selection," Artificial Intelligence, vol. 97, no. 1-2, pp. 273-324, 1997.
[16] W. Lee, S. Stolfo, and K. Mok, "A data mining framework for building intrusion detection models," in Proceedings of the IEEE Symposium on Security and Privacy, 1999.
[17] W. Lee, S. J. Stolfo, and K. W. Mok, "Adaptive intrusion detection: A data mining approach," Artificial Intelligence Review, vol. 14, no. 6, pp. 533-567, 2000.
[18] X. Li and N. Ye, "Decision tree classifiers for computer intrusion detection," Journal of Parallel and Distributed Computing Practices, vol. 4, no. 2, pp. 179-190, 2001.
[19] M. Robnik-Šikonja and I. Kononenko, "Theoretical and empirical analysis of ReliefF and RReliefF," Machine Learning, vol. 53, pp. 23-69, 2003.
[20] T. Peng, C. Leckie, and K. Ramamohanarao, "Survey of network-based defense mechanisms countering the DoS and DDoS problems," ACM Computing Surveys, vol. 39, no. 1, article 3, Apr. 2007.
[21] J. R. Quinlan, C4.5: Programs for Machine Learning, Morgan Kaufmann, San Mateo, CA, 1993.
[22] A. Sundaram, "An introduction to intrusion detection," ACM Crossroads, vol. 2, no. 4, Apr. 1996.
[23] R. Verwoerd and R. Hunt, "Intrusion detection techniques and approaches," Computer Communications, vol. 25, no. 15, pp. 1356-1365, Sep. 2002.
[24] I. H. Witten and E. Frank, Data Mining: Practical Machine Learning Tools and Techniques, 2nd ed., Morgan Kaufmann, 2005.
[25] J. Han and M. Kamber, Data Mining: Concepts and Techniques, Harcourt India Pvt. Ltd.
[26] S. Northcutt and J. Novak, Network Intrusion Detection, Pearson Education, 2003.
[27] W. Duch, T. Winiarski, J. Biesiada, and A. Kachel, "Feature selection and ranking filters," 2003.
[28] "Weka Machine Learning Project."