Building Lightweight Intrusion Detection System Based on Random Forest

Dong Seong Kim, Sang Min Lee, and Jong Sou Park

Network Security Lab., Computer Engineering Department, Hankuk Aviation University,
200-1, Hwajeon-dong, Deogyang-gu, Goyang-city, Gyeonggi-do, 412-791, Korea
{dskim, minuri33, jspark}@hau.ac.kr

Abstract. This paper proposes a new approach to building a lightweight Intrusion Detection System (IDS) based on Random Forest (RF). RF is a special kind of ensemble learning technique that turns out to perform very well compared to other classification algorithms such as Support Vector Machines (SVM) and Artificial Neural Networks (ANN). In addition, RF produces a measure of the importance of the feature variables. Our approach is able not only to achieve high detection rates but also to produce a stable set of important features at the same time. The results of experiments on the KDD 1999 intrusion detection dataset indicate the feasibility of our approach.

1 Introduction

Intrusion Detection Systems (IDS) play a vital role in detecting various kinds of attacks. The main purpose of an IDS is to find intrusions among normal audit data, which can be considered a classification problem. As new attacks appear and the amount of audit data increases, an IDS should keep pace with them. An IDS may utilize additional hardware such as network processors, System on Chip (SoC), and Field-Programmable Gate Arrays (FPGA) [17]. Additional hardware can increase packet capture speed and decrease processing time, but it incurs more cost and may not enhance the detection rates of the IDS. In addition, as networks become faster, there is an emerging need for security analysis techniques that can keep up with the increased network throughput [10]. Therefore, the IDS itself should be lightweight while guaranteeing high detection rates.

This paper proposes a new approach to building a lightweight IDS. Previous research has focused on two directions: parameter optimization of the intrusion detection model and feature selection of the audit data. First, the objective of parameter optimization is to tune the parameters of a detection model based on various classification algorithms such as Support Vector Machines (SVM) [5, 6, 12], Hidden Markov Models (HMM) [13], and several kinds of Artificial Neural Networks (ANN) [4, 18]. For example, Kim and Park [9, 14] and Mukkamala et al. [20] have tried to optimize the parameters of the kernel function in SVM. Second, the objective of feature selection is to identify the relevant features among all features of the audit data, both to decrease processing time and to improve detection rates. For feature selection, wrapper and filter methods have been proposed. The wrapper method adopts classification algorithms and performs cross-validation to identify important features. Sung and Mukkamala [18] have tried to identify and categorize features according to their importance for detecting specific kinds of attacks such as probe, DoS (Denial of Service), Remote to Local (R2L), and User to Root (U2R); they used backward feature selection with SVM and ANN as the feature selection algorithms. The filter method, on the other hand, takes a correlation-based approach that exploits feature-class and inter-feature correlations. The filter method is more lightweight than the wrapper method in terms of computation time and overhead, since it is performed independently of any classification algorithm. However, the filter method yields lower detection rates than the wrapper method because of its weak coupling with the classification algorithm. To improve on both, several studies have proposed hybrid approaches that combine the wrapper and filter methods. Kim et al. [9] proposed a fusion approach that optimizes both feature selection and parameter regulation. Park et al. [14] proposed a correlation-based hybrid approach, which combines the filter method with the wrapper method through Genetic Algorithm (GA) operations. However, hybrid approaches may inherit the drawbacks of both the filter and wrapper approaches; they sometimes show a slight degradation in detection rates with more computation than the naive filter method.

To cope with these issues, this paper proposes a new approach to building a lightweight IDS based on Random Forest (RF). RF is a special kind of ensemble learning technique and turns out to perform very well compared to other classification algorithms, including SVM and ANN. In addition, RF produces a measure of the importance of the feature variables. The proposed RF-based approach is able not only to achieve high detection rates but also to identify important features simultaneously, without further overhead compared to hybrid approaches [9, 14]. The results of experiments on the KDD 1999 intrusion detection dataset indicate the feasibility of our approach.

J. Wang et al. (Eds.): ISNN 2006, LNCS 3973, pp. 224–230, 2006. © Springer-Verlag Berlin Heidelberg 2006

2 Random Forest

This paper proposes a lightweight IDS based on Random Forest (RF). Random forests are comparable to, and sometimes better than, state-of-the-art methods in classification and regression [11]. RF is a special kind of ensemble learning technique [2]. RF has a low classification (and regression) error, comparable to that of SVM. Moreover, RF provides additional facilities, notably feature importance. RF is robust with respect to noise and to the number of attributes. The learning instances that are not selected by bootstrap replication are used for evaluation of the tree; this is called OOB (out-of-bag) evaluation and is an unbiased estimator of the generalization error.

RF builds an ensemble of CART tree classifiers using the bagging mechanism [3]. With bagging, each node of a tree considers only a small subset of features for the split, which enables the algorithm to create classifiers for high-dimensional data very quickly. One has to specify the number of randomly selected features at each split (mtry); the default value for classification is sqrt(p), where p is the number of features. The Gini index [1] is used as the splitting criterion. The largest possible tree is grown and is not pruned. One should also choose a large enough number of trees (ntree) to ensure that every input feature gets predicted several times. The root node of each tree in the forest keeps a bootstrap sample from the original data as its training set. The OOB estimates are based on roughly one third of the original data set. By contrasting these OOB predictions with the training set outcomes, one can estimate the prediction error rate, referred to as the OOB error rate.

In summary, the classification accuracy of RF is high, comparable to that of SVM, and RF also reports the importance of individual features; these two properties help one build a lightweight IDS with small overhead compared to previous approaches. Our proposed approach is presented in the next section.
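The ingredients above (bagging, a random subset of mtry features per split, Gini splitting, OOB error estimation, and feature importance) map directly onto common RF implementations. The paper's experiments use the randomForest package in R; purely as an illustrative sketch, the same ideas expressed with Python and scikit-learn (our substitution, not the authors' code) on synthetic stand-in data look like this:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for network audit data: 41 features, binary class
# (normal vs. attack), mirroring the KDD feature count used in the paper.
X, y = make_classification(n_samples=2000, n_features=41, random_state=0)

rf = RandomForestClassifier(
    n_estimators=130,      # ntree: number of trees in the forest
    max_features="sqrt",   # mtry default: sqrt(p) features tried per split
    criterion="gini",      # Gini index as the splitting criterion
    oob_score=True,        # evaluate on out-of-bag samples
    random_state=0,
)
rf.fit(X, y)

oob_error = 1.0 - rf.oob_score_        # OOB error rate
importances = rf.feature_importances_  # per-feature importance measure
print(f"OOB error: {oob_error:.4f}")
```

Note that the OOB error comes for free from the bootstrap sampling, with no separate held-out set, which is what makes it attractive for the lightweight setting described here.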

3 Proposed Approach

The overall flow of the proposed approach is depicted in figure 1. The network audit data consists of a training set and a testing set. The training set is further separated into a learning set and a validation set. The testing set contains additional attacks that are not included in the training set. Even though RF is robust against over-fitting [2], we use n-fold cross validation to minimize generalization errors [3]. The learning set is used to train classifiers based on RF and to compute the importance of each feature of the network audit data. These classifiers can be considered the detection models of the IDS. The validation set is used to compute classification rates, which correspond to detection rates in an IDS, by estimating OOB errors in RF. Feature importance ranking is then performed according to the feature importance values from the previous step. Irrelevant features are eliminated and only the important features survive. In the next phase, only the important features are used to build detection models, which are evaluated on the testing set in terms of detection rates. If the detection rates satisfy our design requirement, the overall procedure ends; otherwise, it iterates. We carried out several experiments on the KDD 1999 intrusion detection dataset; the experimental results and analysis are presented in the next section.

Fig. 1. Overall flow of proposed approach
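The loop of figure 1 (train RF, estimate the detection rate from OOB error, rank features by importance, eliminate the least relevant, repeat while the design requirement holds) can be sketched as follows. This is a hedged illustration in Python/scikit-learn rather than the authors' R code; the threshold value, the drop-one-feature-per-round policy, and the synthetic data are all assumptions of the sketch.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the audit data (smaller than the paper's 2000
# samples to keep the sketch fast).
X, y = make_classification(n_samples=1000, n_features=41, random_state=0)

required_rate = 0.90                # hypothetical design requirement
features = list(range(X.shape[1]))  # start from all 41 features
detection_rate = 0.0

while len(features) > 5:            # keep at least a handful of features
    rf = RandomForestClassifier(n_estimators=50, max_features="sqrt",
                                oob_score=True, random_state=0)
    rf.fit(X[:, features], y)
    detection_rate = rf.oob_score_  # detection rate = 1 - OOB error
    if detection_rate < required_rate:
        break                       # requirement violated: stop eliminating
    # rank features by importance and drop the least important one
    least = int(np.argsort(rf.feature_importances_)[0])
    features.pop(least)

print(f"{len(features)} features kept, detection rate {detection_rate:.3f}")
```

In the paper the final evaluation is done on the testing set, which contains unseen attacks; the sketch stops at the OOB-based validation step.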

4 Experiments and Analysis

4.1 Experimental Environments and Dataset

All experiments were performed on a Windows machine with an Intel(R) Pentium(R) 4 processor at 1.70 GHz and 512 MB RAM. We used the randomForest package under R version 2.2.0 from the open-source R project [19]. We used the KDD 1999 CUP labeled dataset [7] to evaluate our approach. Stolfo et al. [8] defined higher-level features that help in distinguishing normal connections from attacks. The dataset contains 24 different types of attacks that are broadly categorized into four groups: probe, DoS (Denial of Service), U2R (User to Root), and R2L (Remote to Local). Each data instance consists of 41 features, which we have labeled f1, f2, f3, f4, and so forth. We used only the DoS type of attacks, since the other attack types have too few instances to be suitable for our experiments [15]. The dataset consists of a training set and a testing set. We randomly split the labeled training set into two parts: a learning set and a validation set. The learning set is used to adjust the parameters in RF. The validation set is used to estimate the generalization error of the detection model; in RF, the generalization errors are represented as OOB errors. In order to achieve low generalization errors, in other words high detection rates, we adopted 10-fold cross validation with 2000 samples. The testing set is used to evaluate the detection model built from the learning and validation sets.

4.2 Experimental Results and Analysis

RF has two parameters: the number of variables in the random subset at each node (mtry) and the number of trees in the forest (ntree). Optimization of these two parameters is necessary to guarantee high classification accuracy, that is, high detection rates. We found the optimal value of mtry using the tuneRF() function in the randomForest package, which yielded mtry = 6. For ntree there is no such function, so we determined the optimal ntree value by carrying out experiments with ntree varying from 0 to 350. The optimal ntree value can be identified by referring to the OOB errors, assuming that the detection rate is given by "1 - OOB error". The experimental results for determining the optimal ntree value are depicted in figure 2.

Fig. 2. Detection rates vs. ntree values from 0 to 350 (upper, average, and lower detection rates in %)

According to figure 2, the detection rate of RF turned out to be highest and most stable at ntree = 130 and ntree = 340. Since training takes longer with ntree = 340 than with ntree = 130, we selected ntree = 130. As the result of these experiments, we set the two optimized parameter values: mtry = 6 and ntree = 130.

After determining these two parameters, we performed feature selection according to the feature importance results. The individual feature importance values vary slightly from experiment to experiment, so we ranked the features in descending order by the average importance value over 30 iterations with 2000 samples. The top 5 important features and their properties are summarized in Table 1.

Table 1. The top 5 important features and their properties

Feature  Property                                                           Average importance value
f23      number of connections to the same host as the current connection   0.3314
         in the past two seconds
f6       number of data bytes from destination to source                    0.2961
f13      number of "compromised" conditions                                 0.2484
f3       network service on the destination, e.g., http, telnet, etc.       0.2153
f12      1 if successfully logged in; 0 otherwise                           0.2065
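The parameter search of Sect. 4.2 (tuneRF() for mtry, a manual sweep for ntree, both judged by OOB error) can be approximated in any RF implementation by refitting over a small grid and keeping the setting with the lowest OOB error. A sketch in Python/scikit-learn follows; the grid values echo the paper's candidates, but the library, the synthetic data, and the reduced sample size are our assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Smaller synthetic stand-in than the paper's 2000-sample experiments.
X, y = make_classification(n_samples=1000, n_features=41, random_state=0)

best = None
for ntree in (50, 130, 340):   # candidate forest sizes from the sweep
    for mtry in (4, 6):        # candidate features per split
        rf = RandomForestClassifier(n_estimators=ntree, max_features=mtry,
                                    oob_score=True, random_state=0)
        rf.fit(X, y)
        oob_error = 1.0 - rf.oob_score_   # detection rate = 1 - OOB error
        if best is None or oob_error < best[0]:
            best = (oob_error, ntree, mtry)

oob_error, ntree, mtry = best
print(f"best: ntree={ntree}, mtry={mtry}, detection rate={1 - oob_error:.4f}")
```

When several settings tie (as ntree = 130 and 340 nearly do in figure 2), the paper's tie-break of preferring the cheaper model applies: a smaller forest trains and predicts faster at the same accuracy.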

Fig. 3. Detection rates (%) vs. total number of features, from 5 to 30 (upper, average, and lower curves)

The order of the feature importance ranking varies with the dataset, but these 5 features almost always appear among the top 5, which suggests that they are intrinsic features. This result is comparable to those of Kim et al. [9] and Sung and Mukkamala [18]. Kim et al.'s approach identified an important feature set but did not show the importance of individual features. Sung and Mukkamala's approach [18] ranked features by importance, but the importance values differ so little from one another that the ranking is not readily applicable to building a real IDS. Our approach, on the other hand, yields not only clearly separated individual feature importance values but also reasonable context information for each important feature. For example, feature f23 represents the "number of connections to the same host as the current connection in the past two seconds". For DoS attack detection this is an important metric, and it is even applied in a preprocessor plug-in of SNORT [16].

As a result of feature selection and elimination, there is a slight degradation in detection rates, but it is marginal (see figure 3). The experimental results showed detection rates comparable to or higher than those of Kim et al.'s fusion approach [9] and Park et al.'s hybrid feature selection approach [14]. In summary, the optimal values of the mtry and ntree parameters were determined through parameter optimization, and the important features were identified through feature selection on the audit data. This means that our approach performs both parameter optimization and feature selection while achieving high detection rates with stable results for the important features. These advantages enable one to model and implement a lightweight IDS.
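The stability claimed for the ranking above comes from averaging importance values over 30 repeated runs rather than trusting a single fit. That averaging step can be sketched as follows (Python/scikit-learn stand-in for the authors' R code; 10 repetitions instead of 30 and synthetic data, to keep the sketch quick):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=1000, n_features=41, random_state=0)

runs = []
for seed in range(10):  # the paper averages over 30 iterations
    rf = RandomForestClassifier(n_estimators=130, max_features=6,
                                random_state=seed)
    rf.fit(X, y)
    runs.append(rf.feature_importances_)

avg_importance = np.mean(runs, axis=0)
ranking = np.argsort(avg_importance)[::-1]  # descending, as in Table 1
top5 = [f"f{i + 1}" for i in ranking[:5]]   # features labeled f1..f41
print("top 5 features:", top5)
```

Varying only the random seed between runs changes the bootstrap samples and the per-split feature draws, so features that stay on top across runs are exactly the "intrinsic" ones the paper is after.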

5 Conclusions

Existing studies on building lightweight IDS have followed two main approaches: parameter optimization of detection models (classification algorithms) and feature selection of audit data. A number of studies on parameter optimization have been proposed, based on a variety of classification algorithms including SVM and ANN [5, 6, 9, 12, 14, 18]. Feature selection of audit data has adopted two main methods, the wrapper and filter methods, and hybrid approaches have been proposed to improve on both [9, 14]. However, there is still room for improvement in terms of detection rates and stable selection of important features. This paper has presented how to build a lightweight IDS based on RF, since the performance of RF turns out to be comparable to that of SVM and RF also produces feature variable importance. Several experiments on the KDD 1999 intrusion detection dataset have been carried out; the experimental results indicate that our approach is able not only to guarantee high detection rates but also to identify the important features of the audit data.

Acknowledgements

This research was supported by the MIC (Ministry of Information and Communication), Korea, under the ITRC (Information Technology Research Center) support program supervised by the IITA (Institute of Information Technology Assessment).

References

1. Breiman, L., Friedman, J. H., Olshen, R. A., Stone, C. J.: Classification and Regression Trees. Chapman and Hall, New York (1984)
2. Breiman, L.: Random Forests. Machine Learning 45(1) (2001) 5–32
3. Duda, R. O., Hart, P. E., Stork, D. G.: Pattern Classification. 2nd edn. John Wiley & Sons, Inc. (2001)
4. Fox, K. L., Henning, R. R., Reed, J. H., Simonian, R. P.: A Neural Network Approach Towards Intrusion Detection. In Proc. of the 13th National Computer Security Conf., Washington, DC (1990)
5. Fugate, M., Gattiker, J. R.: Anomaly Detection Enhanced Classification in Computer Intrusion Detection. In: Lee, S.-W., Verri, A. (eds.): Pattern Recognition with Support Vector Machines. Lecture Notes in Computer Science, Vol. 2388. Springer-Verlag, Berlin Heidelberg New York (2002) 186–197
6. Hu, W., Liao, Y., Vemuri, V. R.: Robust Support Vector Machines for Anomaly Detection in Computer Security. In Proc. of Int. Conf. on Machine Learning and Applications 2003, CSREA Press (2003) 168–174
7. KDD Cup 1999 Data: http://kdd.ics.uci.edu/databases/kddcup99/kddcup99.html
8. KDD-CUP-99 Task Description: http://kdd.ics.uci.edu/databases/kddcup99/task.html
9. Kim, D., Nguyen, H.-N., Ohn, S.-Y., Park, J.: Fusions of GA and SVM for Anomaly Detection in Intrusion Detection System. In: Wang, J., Liao, X., Yi, Z. (eds.): Advances in Neural Networks. Lecture Notes in Computer Science, Vol. 3498. Springer-Verlag, Berlin Heidelberg New York (2005) 415–420
10. Kruegel, C., Valeur, F.: Stateful Intrusion Detection for High-Speed Networks. In Proc. of the IEEE Symposium on Research on Security and Privacy (2002) 285–293
11. Meyer, D., Leisch, F., Hornik, K.: The Support Vector Machine under Test. Neurocomputing 55 (2003) 169–186
12. Nguyen, B. V.: An Application of Support Vector Machines to Anomaly Detection. Research in Computer Science-Support Vector Machine, report (2002)
13. Ourston, D., Matzner, S., Stump, W., Hopkins, B.: Applications of Hidden Markov Models to Detecting Multi-Stage Network Attacks. In Proc. of the 36th Hawaii Int. Conf. on System Sciences, IEEE Computer Society Press (2002) 334–343
14. Park, J., Shazzad, K. M., Kim, D.: Toward Modeling Lightweight Intrusion Detection System through Correlation-Based Hybrid Feature Selection. In: Feng, D., Lin, D., Yung, M. (eds.): Information Security and Cryptology. Lecture Notes in Computer Science, Vol. 3822. Springer-Verlag, Berlin Heidelberg New York (2005) 279–289
15. Sabhnani, M., Serpen, G.: On Failure of Machine Learning Algorithms for Detecting Misuse in KDD Intrusion Detection Data Set. Intelligent Data Analysis (2004)
16. SNORT: http://www.snort.org
17. Song, H., Lockwood, J. W.: Efficient Packet Classification for Network Intrusion Detection using FPGA. In Proc. of the ACM/SIGDA 13th Int. Symposium on Field-Programmable Gate Arrays (FPGA 2005) 238–245
18. Sung, A. H., Mukkamala, S.: Identifying Important Features for Intrusion Detection Using Support Vector Machines and Neural Networks. In Proc. of the 2003 Int. Symposium on Applications and the Internet Technology, IEEE Computer Society Press (2003) 209–216
19. The R Project for Statistical Computing: http://www.r-project.org/
20. Mukkamala, S., Sung, A. H., Ribeiro, B. M.: Model Selection for Kernel Based Intrusion Detection Systems. In Proc. of Int. Conf. on Adaptive and Natural Computing Algorithms, Springer-Verlag (2005) 458–461