A Novel Framework Based on Fuzzy Ensemble of Classifiers for Intrusion Detection Systems

Saman Masarat
Switching and Network Lab., Amirkabir University of Technology (Tehran Polytechnic), Tehran, Iran
[email protected]

Hassan Taheri, Saeed Sharifian
Department of Electrical and Electronic Engineering, Amirkabir University of Technology (Tehran Polytechnic), Tehran, Iran
{htaheri, sharifian_s}@aut.ac.ir
Abstract—With the development of technology and the increasing speed of communications, providing network security has become a significant topic in network interactions. Intrusion Detection Systems (IDS) play an important role in providing overall security in networks. The major challenges for IDSs are the detection rate and the cost of misclassified samples. In this paper we introduce a novel multi-step framework based on machine learning techniques to create an efficient classifier. In the first step, feature selection is performed based on the gain ratio of the features; this improves the performance of the classifiers built on the selected features. In the classifier combination step, we present a novel fuzzy ensemble method, so that classifiers with higher performance and lower cost have more influence on the final classifier.

Keywords: IDS; Classifier; Feature Selection; Tree; Fuzzy Ensemble.
I. INTRODUCTION
There are three main aspects in providing security for systems: prevention, detection and response. Prevention, as the first important part of security, is protecting the system against abnormal behaviors. Detection is permanently checking the system to determine whether an abnormal activity has entered it and how it can be detected, and the final step is how to respond to it. According to security reports published in recent years, the number of attacks on networks has increased significantly [1]. Firewalls have been introduced as one of the security measures against attacks, but they only inspect the traffic entering the network and are not sufficient for complete protection of a system. As a result, a mechanism with integrated control over the whole network is needed. This is the main reason that intrusion detection systems were created and play an important role in providing network security.
Conceptually, intrusion detection systems are hardware or software systems that should identify and detect any unauthorized use of, or harm to, the system by internal or external users. The three important functions of an IDS are therefore (1) observation, (2) detection and (3) reaction, which correspond to the three network security factors mentioned before.
In recent years, much research has focused on IDSs, divided into two general topics: event analysis [2] and alert correlation [3]. The former is a significant topic in IDSs: in event analysis, the IDS detects the type of anomalous activity. In this paper we focus on event analysis. Event analysis can be categorized into two main approaches: signature-based methods and anomaly (behavior-analysis) methods. In signature-based methods, the IDS uses predefined information such as rules and signatures for training, so signature-based IDSs can detect any type of attack that exists in the training set. They have a high detection rate for such attacks and therefore a very low false alarm rate. Despite this benefit, a major defect of signature-based IDSs is their inability to identify novel attacks that did not exist in the training set. Snort [4] is a well-known open source signature-based IDS. Although the performance of signature-based IDSs can be improved by continuously updating rules and signatures, their inability to find novel attacks has increased the tendency toward behavior-based IDSs. Anomaly-based IDSs try to create a profile of normal behavior, so any activity conflicting with the normal profile can be an attack or abnormal action. Although these systems can find novel attacks, they have a high false alarm rate; importing data mining and artificial intelligence techniques into anomaly-based IDSs can reduce the rate of false alarms. After an IDS is set up in a network, it analyzes the traffic, returns an alarm for each network interaction and generates a large volume of alarms. These alarms might be related to previous or future alarms, so investigating and finding these relations can help to detect attack scenarios. On the other hand, because the rate of these alarms is very high, alert correlation has an important role in IDSs and can be used to reduce the volume of alarms.
In this paper we focus on the classification problem in IDSs. When events arrive at an IDS and it tries to separate normal traffic from abnormal traffic, it needs to solve a classification problem, so one of the important aspects of improving an IDS is developing its classification module. The remainder of the paper is organized as follows: the next section provides a brief review of related work on data mining and artificial intelligence techniques in intrusion detection systems. In Section III, we introduce the proposed method in four parts: (1) workload, (2) feature reduction, (3) training classifiers and (4) fuzzy ensemble of classifiers; the results of each part are discussed in detail. Finally, Section IV contains the conclusion of the paper.
II. RELATED WORKS

Machine learning techniques are used widely in many fields of science. A wide variety of them, including neural networks, fuzzy logic, support vector machines (SVM), evolutionary algorithms and pattern recognition methods, are used to improve the performance of intrusion detection systems. Table 1 briefly summarizes related works and the advantages and limitations of each method; the related references are cited next to each method.
Table 1. Summary of artificial intelligence and data mining techniques used in IDSs.

Neural Network based [5]
  Advantages: Approximately good detection rate.
  Challenges: Training a neural network is a time-consuming process; effective training needs a large number of features; adding new samples to the training set requires retraining the IDS; patterns that did not exist in the training set cannot be recognized.

Fuzzy Logic based [6]
  Advantages: Can be used to reduce the training time of neural networks; adds flexibility to other methods; helps to make better decisions about uncertain patterns; increases the robustness and adaptation ability of IDSs.
  Challenges: Lower detection rate than neural networks.

Association Rules based [7]
  Advantages: Can detect novel attacks by combining known attacks.
  Challenges: Time-consuming process; high computational cost.

SVM (Support Vector Machine) based [8]
  Advantages: Can use a kernel appropriate to the goal (e.g., a Gaussian kernel to put stress on similar attacks); has a unique solution (whereas neural networks may return different results in different runs).
  Challenges: On large datasets, training may be infeasible or take too long to finish.

Evolutionary methods (GA, SI, etc.) [9][10]
  Advantages: Can determine efficient parameters for other techniques; can discover classification rules or clusters for misuse and anomaly detection; can be used for tracking the intruder trail; aim to solve complex problems by employing multiple simple agents without any form of supervision [10].
  Challenges: Many methods have no predefined value for the number of iterations or for fitness thresholds; some algorithms may not converge on some problems.

Hybrid methods [11][12]
  Advantages: Can exploit the advantages of each combined method.
  Challenges: Combining different methods needs expert knowledge.
III. PROPOSED APPROACH AND SIMULATION RESULTS

Previous work has applied feature selection and random trees to several datasets and evaluated some benefits of these techniques. In this paper we propose the framework shown in Fig. 1. Briefly, the proposed method has three main phases: feature selection, training tree classifiers and fuzzy ensemble of the classifiers. Although the proposed method is general and can be applied to any dataset, we choose the IDS field as a case study and verify the method on the KDDCup99 dataset.
Figure 1. Proposed approach (overview).

In the first step we reduce the number of features. In the second step, we select random subsets of features and train J48 trees. Finally, we apply a fuzzy combination of the classifiers to obtain the final classifier. The details and results of each step are described below.
A. Workload

Unfortunately, there are few reliable datasets in the IDS field; many datasets are not free and many are not publicly available [13]. The Knowledge Discovery Dataset (KDD) [14] is one of the best-known datasets in the IDS field. Although KDD was introduced in 1999, it is still the main dataset for IDS analysis. Some limitations of this dataset are reported in [15], but being free, having labeled traffic for both the training and test sets, and allowing comparison with other methods are advantages of this dataset. These benefits have led security designers to report their results on this dataset along with other workloads. KDD includes seven weeks of traffic in TCP format and contains about five million connection records, each approximately 100 bytes; the test set contains about two million connections. KDDCup99 is a branch of KDD that is used in the annual KDD competitions [16]. KDDCup99 contains 41 features (such as duration, protocol type and service) and includes four main attack types with their sub-attacks: DoS, U2R, R2L and Probe. A DoS attack takes the system out of service; R2L means an intruder gains unauthorized access from a remote machine; U2R means unauthorized access to local superuser privileges; and a Probe attack means probing the victim system by port scanning or similar methods. Due to the large volume of the dataset, many studies use 10 percent of the dataset as a benchmark for evaluation [17]; we also use the 10 percent subset to evaluate our method. More details about KDDCup99 and the distribution of attacks are shown in Tables 2 and 3.
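As a rough illustration of how the 10 percent subset can be prepared, the sketch below reads the comma-separated KDD records and maps each sub-attack label to one of the four main categories. The file name, the trailing-dot label format and the Python code are assumptions based on the public KDDCup99 distribution, not details given in the paper (whose implementation is in Java/WEKA).

```python
# Minimal sketch: load the KDDCup99 10% subset and map sub-attack labels
# to the four main attack categories. File name and label spellings are
# assumptions based on the public KDD distribution.
import csv

# Sub-attack -> main category (training-set labels listed in Table 2).
CATEGORY = {
    'normal': 'Normal',
    'ipsweep': 'Probe', 'nmap': 'Probe', 'portsweep': 'Probe', 'satan': 'Probe',
    'back': 'DoS', 'land': 'DoS', 'neptune': 'DoS', 'pod': 'DoS',
    'smurf': 'DoS', 'teardrop': 'DoS',
    'buffer_overflow': 'U2R', 'loadmodule': 'U2R', 'perl': 'U2R', 'rootkit': 'U2R',
    'ftp_write': 'R2L', 'guess_passwd': 'R2L', 'imap': 'R2L', 'multihop': 'R2L',
    'phf': 'R2L', 'spy': 'R2L', 'warezclient': 'R2L', 'warezmaster': 'R2L',
}

def load_kdd(path):
    """Return (features, labels); each record has 41 features plus a label."""
    features, labels = [], []
    with open(path, newline='') as f:
        for row in csv.reader(f):
            if len(row) < 42:
                continue
            features.append(row[:41])
            label = row[41].rstrip('.')        # raw labels end with a dot
            labels.append(CATEGORY.get(label, label))
    return features, labels

if __name__ == '__main__':
    X, y = load_kdd('kddcup.data_10_percent')  # hypothetical local path
    print(len(X), 'records loaded')
```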
Table 2. Attacks and sub-attacks in the KDDCup99 training set.
Type | Sub-attacks (number of occurrences) | Total | Dist. in dataset (%)
Probe | ipsweep (1247), nmap (231), portsweep (1040), satan (1589), mscan (0), saint (0) | 4107 | 0.83
DoS | back (2203), land (21), neptune (107201), pod (264), smurf (280790), teardrop (979), apache2 (0), mailbomb (0), processtable (0), udpstorm (0) | 391458 | 79.23
U2R | buffer_overflow (30), loadmodule (9), perl (3), rootkit (10), httptunnel (0), ps (0), worm (0), xterm (0) | 52 | 0.01
R2L | ftp_write (8), guess_passwd (53), imap (12), multihop (7), phf (4), spy (2), warezclient (1020), warezmaster (20), named (0), sendmail (0), snmpgetattack (0), snmpguess (0), sqlattack (0), xlock (0), xsnoop (0) | 1126 | 0.22
Normal | - | 97278 | 19.69

Table 3. Attacks and sub-attacks in the KDDCup99 testing set.
Type | Sub-attacks (number of occurrences) | Total | Dist. in dataset (%)
Probe | ipsweep (306), nmap (84), portsweep (354), satan (1633), mscan (1053), saint (736) | 4166 | 1.33
DoS | back (1098), land (9), neptune (58001), pod (87), smurf (164091), teardrop (12), apache2 (794), mailbomb (5000), processtable (759), udpstorm (2) | 229853 | 73.91
U2R | buffer_overflow (22), loadmodule (2), perl (2), rootkit (13), httptunnel (158), ps (16), worm (2), xterm (13) | 228 | 0.07
R2L | ftp_write (3), guess_passwd (4367), imap (1), multihop (18), phf (2), spy (0), warezmaster (1602), warezclient (0), named (17), sendmail (17), snmpgetattack (7741), snmpguess (2406), sqlattack (2), xlock (9), xsnoop (4) | 16189 | 5.2
Normal | - | 60593 | 19.48

There are some samples in the test set that do not exist in the training set. For example, the Probe category in the test set contains sub-attacks such as "mscan" and "saint" that do not appear in the training set at all; there are similar examples for DoS, U2R and R2L.
In addition to performance, another important parameter for evaluating the results of a classification algorithm in an IDS is cost. As shown in Table 4, if a DoS attack is classified incorrectly as an R2L attack, the IDS incurs a penalty of 2 points, and the total cost of the system can be calculated by (1):

$cost = \frac{1}{N}\sum_{i,j} M_{ij} \times C_{ij}$   (1)

where $M_{ij}$ is the number of samples of class $j$ that are classified incorrectly as class $i$, and $C_{ij}$ are constants extracted from the confusion cost matrix shown in Table 4.

Table 4. Confusion cost matrix in IDS [18].
       | Normal | Probe | DoS | U2R | R2L
Normal |   0    |   1   |  2  |  2  |  2
Probe  |   1    |   0   |  2  |  2  |  2
DoS    |   2    |   1   |  0  |  2  |  2
U2R    |   3    |   2   |  2  |  0  |  2
R2L    |   4    |   2   |  2  |  2  |  0

Cost and performance are thus two important parameters of an IDS, and security administrators prefer an IDS with lower cost and higher performance.
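To make the cost measure concrete, the sketch below evaluates (1) for a confusion matrix using the penalty values of Table 4. The class ordering, the index convention (rows are actual classes) and the example confusion counts are illustrative assumptions rather than results from the paper.

```python
# Sketch of equation (1): total cost = (1/N) * sum_ij M_ij * C_ij,
# with C taken from the cost matrix of Table 4.
CLASSES = ['Normal', 'Probe', 'DoS', 'U2R', 'R2L']

# Rows and columns follow the ordering above (values from Table 4).
COST = [
    [0, 1, 2, 2, 2],
    [1, 0, 2, 2, 2],
    [2, 1, 0, 2, 2],
    [3, 2, 2, 0, 2],
    [4, 2, 2, 2, 0],
]

def average_cost(confusion):
    """confusion[i][j]: number of samples of class i classified as class j."""
    n = sum(sum(row) for row in confusion)
    total = sum(confusion[i][j] * COST[i][j]
                for i in range(len(CLASSES)) for j in range(len(CLASSES)))
    return total / n

# Illustrative confusion matrix (not taken from the paper's experiments).
example = [
    [60000,  100,    300,  10, 183],
    [  200, 3800,    150,   5,  11],
    [ 1500,  400, 227000, 100, 853],
    [  100,   20,      8,  90,  10],
    [ 9000,  300,    200,  89, 6600],
]
print('average cost per record: %.4f' % average_cost(example))
```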
B. Feature Reduction

In the first step we use feature selection to reduce the number of features in the dataset. KDDCup99 has 41 features and not all of them are important, so we use gain ratio evaluation to determine which features are more important than others. Entropy is the average amount of information needed to specify the state of a random variable and is an indicator of the impurity of the data; it can be defined by:

$H(X) = -\sum_{i=1}^{n} p_i \log_2 p_i$   (2)

where $p_i$ is the probability of class $i$. Entropy reflects the information content of the data: in a dataset with a single class the entropy is zero, which means the impurity is minimal and no discriminating information can be extracted from the data. Conversely, in a dataset whose classes occur in equal proportions the entropy is maximal (one bit in the two-class case), which means the data is informative for training. Information gain is a method for determining which features of a training dataset are most useful for discriminating between the classes to be learned. The expected information gain of one feature of the dataset can be calculated by (3):

$InformationGain = H(T) - H(T \mid A)$   (3)

where $H(T)$ is the entropy of the dataset and $H(T \mid A)$ is the average entropy of the dataset after splitting it on attribute $A$. Equation (3) can be rewritten as:

$InformationGain(Ex, a) = H(Ex) - \sum_{v \in values(a)} \frac{|\{x \in Ex \mid value(x,a)=v\}|}{|Ex|} \cdot H(\{x \in Ex \mid value(x,a)=v\})$   (4)

where $Ex$ is the set of all training records and $value(x, a)$, with $x \in Ex$, is the value of a specific example $x$ for attribute $a \in Attributes$. Information gain is used to order the features in the nodes of a decision tree, but it is biased toward attributes with a large number of values and may cause overfitting. Therefore, the gain ratio and the intrinsic information have been introduced. The intrinsic information is the entropy of the distribution of instances over the branches, given by (5), and the gain ratio normalizes the information gain by (6) [19]:

$IntrinsicValue(Ex, a) = -\sum_{v \in values(a)} \frac{|\{x \in Ex \mid value(x,a)=v\}|}{|Ex|} \cdot \log_2\!\left(\frac{|\{x \in Ex \mid value(x,a)=v\}|}{|Ex|}\right)$   (5)

$GainRatio(Ex, a) = \frac{InformationGain(Ex, a)}{IntrinsicValue(Ex, a)}$   (6)
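The quantities in (2)-(6) can be computed directly from label counts. The following is a minimal sketch in plain Python (the authors' own implementation uses WEKA); the toy attribute and labels at the bottom are made up for illustration.

```python
# Sketch of equations (2)-(6): entropy, information gain, intrinsic value
# and gain ratio for one nominal attribute.
import math
from collections import Counter, defaultdict

def entropy(labels):
    """H(X) = -sum p_i log2 p_i over the class distribution (eq. 2)."""
    counts = Counter(labels)
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def gain_ratio(values, labels):
    """Gain ratio of one attribute (eqs. 3-6)."""
    n = len(labels)
    by_value = defaultdict(list)
    for v, y in zip(values, labels):
        by_value[v].append(y)
    # H(Ex | a): weighted entropy of the partitions induced by the attribute.
    cond = sum(len(part) / n * entropy(part) for part in by_value.values())
    info_gain = entropy(labels) - cond                      # eq. (4)
    intrinsic = -sum(len(part) / n * math.log2(len(part) / n)
                     for part in by_value.values())         # eq. (5)
    return info_gain / intrinsic if intrinsic > 0 else 0.0  # eq. (6)

# Toy example (illustrative values, not KDD data).
proto = ['tcp', 'udp', 'tcp', 'icmp', 'tcp', 'udp']
label = ['normal', 'normal', 'dos', 'dos', 'dos', 'normal']
print('gain ratio = %.3f' % gain_ratio(proto, label))
# Features whose gain ratio is (near) zero carry no discriminating
# information and can be removed, as done for features 7, 20 and 21.
```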
We applied the gain ratio evaluation to the KDDCup99 dataset; the results are shown in Fig. 2. As the figure shows, features 7, 20 and 21 have gain ratios equal to zero or very close to zero, which means that removing them from the dataset not only reduces the computation volume but can also improve the performance of the classifiers.
Figure 2. Gain ratio of the features.

Feature 7 (land) indicates a Local Area Network Denial attack. Land is a DoS attack that consists of sending a specially crafted spoofed packet to a computer, causing it to lock up: a TCP SYN packet is sent to the victim IP address so that the victim replies to itself, which wastes system resources. Only one packet is sent over the network, so this attack is essentially not a network-level pattern; it can instead be detected by a host IDS (HIDS). After down-sampling the DoS attacks, the land records are removed from the balanced dataset, so this feature carries no important information for improving the detection rate of a network IDS (NIDS). Feature 20 represents the number of outbound commands in an FTP session and feature 21 is the hot-login indicator; these features never change in the dataset and therefore have no positive impact on the classification problem.

After removing the low-importance features, we used random forests to evaluate the effect of this type of feature selection. Although our classification mechanism, discussed in the next step, is different from random forests, selecting random features is a process common to both. Random forests provide a parameter for evaluating the performance of the classifier, the out-of-bag (OOB) error, which is as accurate as using a separate test set. In a random forest, each tree is constructed from a different bootstrap sample of the original data; about one third of the samples are left out of each bootstrap sample and are not used in the construction of that tree. These are the out-of-bag samples. After a tree is trained, its OOB samples are used to test it, and by combining the results of all trees the final OOB error is calculated. In our work, comparing OOB errors therefore shows the effect of the feature reduction process.

Table 5. Evaluating the feature selection method using 10-fold cross-validation.
                         | OOB Error    | Performance | Error Rate
Before feature selection | 2.73 x 10^-4 | 99.9779     | 0.0221
After feature selection  | 2.28 x 10^-4 | 99.9785     | 0.0215

As shown in Table 5, the OOB error after feature selection improved compared with the value before feature selection. In addition to the OOB error, Table 5 also reports the results of 10-fold cross-validation with a random forest classifier: the data is split into 10 random sections, nine sections are used for training and the remaining one for testing.
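The OOB estimate can be reproduced with any bagged ensemble: every tree is trained on a bootstrap sample and tested only on the records it never saw. The sketch below uses scikit-learn decision trees purely for illustration; the paper's experiments rely on WEKA's random forest, and X, y are assumed to be numeric arrays prepared beforehand.

```python
# Sketch of out-of-bag (OOB) error estimation for a bagged tree ensemble.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def oob_error(X, y, n_trees=10, seed=0):
    rng = np.random.default_rng(seed)
    X, y = np.asarray(X), np.asarray(y)
    n = len(y)
    votes = [dict() for _ in range(n)]           # per-record label -> vote count
    for _ in range(n_trees):
        boot = rng.integers(0, n, size=n)        # bootstrap sample (with replacement)
        oob = np.setdiff1d(np.arange(n), boot)   # roughly one third of the records
        if len(oob) == 0:
            continue
        tree = DecisionTreeClassifier().fit(X[boot], y[boot])
        for i, pred in zip(oob, tree.predict(X[oob])):
            votes[i][pred] = votes[i].get(pred, 0) + 1
    wrong = total = 0
    for i, v in enumerate(votes):
        if v:                                    # records that were OOB at least once
            total += 1
            if max(v, key=v.get) != y[i]:
                wrong += 1
    return wrong / total if total else float('nan')

# Example: print(oob_error(X_train, y_train, n_trees=10))
```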
C. Training Classifiers

Fig. 3 shows the proposed method in detail. We now select several random subsets of features and train a tree with each subset. In the previous section we calculated the gain ratio of each feature (Fig. 2); here we use a roulette-wheel algorithm based on those gain ratios to select the features. The basic strategy of roulette-wheel selection is that the probability of selection is proportional to the fitness of an individual, as illustrated in Fig. 4, so features with a higher gain ratio have a higher probability of being selected. Roulette-wheel selection is defined by (7) [20].

Figure 3. Training phase of the proposed approach.
Figure 4. Roulette Wheel diagram.
$p_i = \frac{w_i}{\sum_{j=1}^{N} w_j}$   (7)
where $w_i$ is the fitness of individual $i$ in the population and $N$ is the number of individuals in the population. There are 41 features in KDDCup99; after removing features 7, 20 and 21, as discussed in the previous section, 38 features remain. Using the gain ratio values shown in Fig. 2, the roulette wheel can be designed so that features with a higher gain ratio have a higher probability of being selected. To find the optimum number of features per tree, we used a random forest evaluation: we trained random forest classifiers with different numbers of features, using 10 trees, the default number in WEKA [21]. The results of the 10-fold cross-validation are shown in Table 6. Based on these results, we select 15 random features from the dataset and train a J48 classifier with each such selection. In the next section we apply the fuzzy combination algorithm to these classifiers to obtain better results.

Table 6. Evaluation of a random forest with 10 trees and different numbers of features.
Feat. Num. | 5       | 10      | 15      | 20      | 25      | 30
DR (%)     | 99.9812 | 99.9716 | 99.9879 | 99.9765 | 99.9753 | 99.9709
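A sketch of this training loop is given below: 15 features are drawn per tree by roulette-wheel selection with probabilities proportional to their gain ratios (eq. 7), and one tree is trained per subset. DecisionTreeClassifier stands in for WEKA's J48 here, and gain_ratios, X, y are assumed to be available from the previous steps.

```python
# Sketch of the training phase: roulette-wheel selection of 15 features per
# tree (selection probability proportional to gain ratio, eq. 7) and one
# decision tree trained per feature subset.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def roulette_select(gain_ratios, k, rng):
    """Draw k distinct feature indices with probability proportional to gain ratio."""
    w = np.asarray(gain_ratios, dtype=float)
    p = w / w.sum()
    return rng.choice(len(w), size=k, replace=False, p=p)

def train_ensemble(X, y, gain_ratios, n_trees=10, k=15, seed=0):
    rng = np.random.default_rng(seed)
    X = np.asarray(X)
    ensemble = []
    for _ in range(n_trees):
        feats = roulette_select(gain_ratios, k, rng)
        tree = DecisionTreeClassifier().fit(X[:, feats], y)
        ensemble.append((feats, tree))    # remember which features the tree uses
    return ensemble

# Example: ensemble = train_ensemble(X_train, y_train, gain_ratios)
```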
D. Fuzzy Ensemble of Classifiers

We now apply a fuzzy ensemble algorithm to combine the classifiers and obtain a final classifier with a high detection rate and a lower cost. In this step, a fuzzy method is used to determine the effect of each classifier on the final classifier: trees with a higher detection rate and a lower cost must have more influence on the final classifier. We therefore use fuzzy membership functions such as those in Fig. 5 for performance and Fig. 6 for cost. As mentioned in the previous sections, there are two important parameters in an IDS: performance and cost.
Figure 5. Membership Functions for Performance.
Figure 6. Membership functions for Cost.

The effect of each classifier in the intrusion detection system is evaluated by these two parameters. Performance is a single value for each classifier, and applying (1) to the confusion matrix of each trained classifier returns its cost. We used the OOB samples to calculate the weight of each classifier; in other words, one third of the training records, chosen at random, are used to optimize the combination weights of the classifiers. As shown in Fig. 5, there are three membership functions for performance, so each classifier obtains a membership degree in Good, Medium or Bad; the same holds for cost. Based on the values of performance and cost we can therefore define 9 rules. In our work, the effect of each tree on the final classifier is the weight of the tree divided by the sum of the weights of all trees. As before, we used 15 features for each tree. We used 10 trees in the combination: because of the nature of the random selection, more trees return repetitive results and have no positive effect on the final result, so producing and combining 10 trees gives the optimum result for both computational cost and performance. Finally, we combine the classifiers weighted by their performance and cost. The framework shown in Fig. 7 is used for testing. The final classifier achieved a detection rate of 93.00 and a cost of 0.2179 on the KDDCup99 test set. All parts of the proposed approach are implemented in Java on the WEKA core, with the required modifications applied to it. A comparison of our results with other methods is shown in Table 7. All stages of the proposed framework, including feature reduction, the feature selection method based on gain ratio, training the classifiers with the optimum number of features and their fuzzy combination, are reported as the 'Proposed Approach' in the table; some of the other methods in Table 7 are the approaches mentioned in the related works.

Figure 7. Testing phase of the proposed approach.

Table 7. Performance and cost of the proposed approach and other methods.
Model                | Accuracy | Cost
Proposed Approach    | 0.9300   | 0.2179
Random Forest [17]   | 0.9293   | 0.2282
Best KDD Result [17] | 0.9271   | 0.2331
JRip [22]            | 0.9230   | N.A*
NBTree [22]          | 0.9228   | N.A
IBk [22]             | 0.9222   | N.A
SVM [23]             | 0.9218   | N.A
J48 [22]             | 0.9206   | N.A
MLP [22]             | 0.9203   | N.A
Decision Table [22]  | 0.9166   | N.A
SMO [22]             | 0.9165   | N.A
BayesNet [22]        | 0.9062   | N.A
OneR [22]            | 0.8931   | N.A
Naïve Bayes [22]     | 0.7832   | N.A
*N.A = Not Available.
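The weighting scheme described above can be sketched as follows: each classifier's OOB performance and cost are fuzzified with triangular membership functions, a weight is derived from the nine performance/cost rules, and the final label is a weighted vote. The membership breakpoints and rule weights below are placeholders (the paper defines the shapes in Figs. 5 and 6, but their exact values are not reproduced here), and the classifier outputs and their (performance, cost) pairs are assumed to come from the previous steps.

```python
# Sketch of the fuzzy weighting and weighted-vote combination.
# Membership breakpoints and rule weights are illustrative placeholders.
from collections import Counter

def tri(x, a, b, c):
    """Triangular membership function with support [a, c] and peak at b."""
    if x <= a or x >= c:
        return 0.0
    return (x - a) / (b - a) if x <= b else (c - x) / (c - b)

def fuzzify(value, sets):
    """Return the membership degree of value in each labelled fuzzy set."""
    return {name: tri(value, *abc) for name, abc in sets.items()}

# Placeholder fuzzy sets for performance (%) and for cost.
PERF_SETS = {'bad': (0, 0, 90), 'medium': (85, 92, 99), 'good': (95, 100, 101)}
COST_SETS = {'good': (-0.01, 0, 0.3), 'medium': (0.2, 0.4, 0.6), 'bad': (0.5, 1, 1.01)}
# One rule weight per (performance set, cost set) pair: 9 rules in total.
RULE_WEIGHT = {('good', 'good'): 1.0, ('good', 'medium'): 0.8, ('good', 'bad'): 0.6,
               ('medium', 'good'): 0.7, ('medium', 'medium'): 0.5, ('medium', 'bad'): 0.3,
               ('bad', 'good'): 0.4, ('bad', 'medium'): 0.2, ('bad', 'bad'): 0.1}

def classifier_weight(performance, cost):
    """Weighted average of the fired rules (simple Sugeno-style inference)."""
    mp, mc = fuzzify(performance, PERF_SETS), fuzzify(cost, COST_SETS)
    num = den = 0.0
    for (p, c), w in RULE_WEIGHT.items():
        fire = min(mp[p], mc[c])
        num += fire * w
        den += fire
    return num / den if den else 0.0

def combine(predictions, weights):
    """predictions[i]: label output by classifier i; weighted majority vote."""
    score = Counter()
    for label, w in zip(predictions, weights):
        score[label] += w
    return score.most_common(1)[0][0]

# Example: weights = [classifier_weight(p, c) for p, c in zip(perfs, costs)]
#          final_label = combine(per_classifier_labels, weights)
```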
IV. CONCLUSION

In this paper we presented a new multi-step framework for intrusion detection systems. In the first step, we noted that KDD has many features and not all of them are important for the classification problem; we therefore presented a feature selection method based on gain ratio. Using a roulette wheel built on the gain ratios of the features gives features with a higher gain ratio a higher probability of being chosen during random feature selection, and J48 trees are trained on the selected features. In the final phase, we introduced a novel fuzzy weighting method to ensemble the classifiers: the fuzzy weighted combiner assigns a weight to each classifier according to its cost and performance. Finally, we showed that the proposed approach returns better results than other similar methods.
REFERENCES

[1] US-State of Cyber-Crime. Available: http://www.pwc.com/en_US/us/increasing-it-effectiveness/publications/assets/2014-us-state-of-cybercrime.pdf, 2014.
[2] P. Garcia-Teodoro, J. Diaz-Verdejo, G. Maciá-Fernández, and E. Vázquez, "Anomaly-based network intrusion detection: Techniques, systems and challenges," Computers & Security, vol. 28, pp. 18-28, 2009.
[3] C. V. Zhou, C. Leckie, and S. Karunasekera, "A survey of coordinated attacks and collaborative intrusion detection," Computers & Security, vol. 29, pp. 124-140, 2010.
[4] SNORT. Available: www.snort.com, 2014.
[5] J. Jiang, C. Zhang, and M. Kamel, "RBF-based real-time hierarchical intrusion detection systems," in Proceedings of the International Joint Conference on Neural Networks, pp. 1512-1516, 2003.
[6] S. Chavan, K. Shah, N. Dave, S. Mukherjee, A. Abraham, and S. Sanyal, "Adaptive neuro-fuzzy intrusion detection systems," in Proceedings of the International Conference on Information Technology: Coding and Computing (ITCC), pp. 70-74, 2004.
[7] L. Li, D.-Z. Yang, and F.-C. Shen, "A novel rule-based intrusion detection system using data mining," in 3rd IEEE International Conference on Computer Science and Information Technology (ICCSIT), pp. 169-172, 2010.
[8] S.-J. Horng, M.-Y. Su, Y.-H. Chen, T.-W. Kao, R.-J. Chen, J.-L. Lai, et al., "A novel intrusion detection system based on hierarchical clustering and support vector machines," Expert Systems with Applications, vol. 38, pp. 306-313, 2011.
[9] D. S. Kim, H.-N. Nguyen, and J. S. Park, "Genetic algorithm to improve SVM based network intrusion detection system," in 19th International Conference on Advanced Information Networking and Applications (AINA), pp. 155-158, 2005.
[10] C. Kolias, G. Kambourakis, and M. Maragoudakis, "Swarm intelligence in intrusion detection: A survey," Computers & Security, vol. 30, pp. 625-642, 2011.
[11] S. Srinoy, "Intrusion detection model based on particle swarm optimization and support vector machine," in IEEE Symposium on Computational Intelligence in Security and Defense Applications (CISDA), pp. 186-192, 2007.
[12] S. Peddabachigari, A. Abraham, C. Grosan, and J. Thomas, "Modeling intrusion detection system using hybrid intelligent systems," Journal of Network and Computer Applications, vol. 30, pp. 114-132, 2007.
[13] UNB-ISCX. Available: http://www.iscx.ca/datasets, 2013.
[14] Knowledge Discovery Dataset. Available: http://kdd.ics.uci.edu/databases/kddcup99/kddcup99.html
[15] C. Thomas, V. Sharma, and N. Balakrishnan, "Usefulness of DARPA dataset for intrusion detection system evaluation," in SPIE Defense and Security Symposium, pp. 69730G-69730G-8, 2008.
[16] KDD 2014 competition. Available: http://www.kdd.org/kdd2014, 2014.
[17] J. Zhang, M. Zulkernine, and A. Haque, "Random-forests-based network intrusion detection systems," IEEE Transactions on Systems, Man, and Cybernetics, Part C: Applications and Reviews, vol. 38, pp. 649-659, 2008.
[18] C. Elkan, "Results of the KDD'99 classifier learning," ACM SIGKDD Explorations Newsletter, vol. 1, pp. 63-64, 2000.
[19] S.-Y. Wu and E. Yen, "Data mining-based intrusion detectors," Expert Systems with Applications, vol. 36, pp. 5605-5612, 2009.
[20] A. Lipowski and D. Lipowska, "Roulette-wheel selection via stochastic acceptance," Physica A: Statistical Mechanics and its Applications, vol. 391, pp. 2193-2196, 2012.
[21] M. Hall, E. Frank, G. Holmes, B. Pfahringer, P. Reutemann, and I. H. Witten, "The WEKA data mining software: an update," ACM SIGKDD Explorations Newsletter, vol. 11, pp. 10-18, 2009.
[22] H. A. Nguyen and D. Choi, "Application of data mining to network intrusion detection: classifier selection model," in Challenges for Next Generation Network Operations and Service Management, Springer, 2008, pp. 399-408.
[23] T. Ambwani, "Multi class support vector machine implementation to intrusion detection," in Proceedings of the International Joint Conference on Neural Networks, pp. 2300-2305, 2003.