Improvement in minority attack detection with skewness in network traffic
Ciza Thomas (a) and N. Balakrishnan (b)
(a) PhD student, SERC, IISc, Bangalore, India; (b) Professor and Associate Director, IISc, Bangalore, India
ABSTRACT
The acceptability and usability of Intrusion Detection Systems are seriously affected by the data skewness in network traffic. A large number of false alarms weighs heavily against the acceptability of Intrusion Detection Systems, and the false alerts increase because normal traffic abounds. Even with highly accurate Intrusion Detection Systems, the effective detection rate of the minority attack types will be unacceptably low, and those attack types are often the most serious ones. Thus high accuracy is not necessarily an indicator of high model quality, and therein lies the accuracy paradox of predictive analytics. The cost of missing an attack is higher than the cost of a false alarm. The data-dependent sensor fusion architecture presented in this paper learns from the data and then appropriately weights the decisions of the various Intrusion Detection Systems. The fusion combines these weighted decisions into a single decision, which is better than those of the existing Intrusion Detection Systems. This method reduces the false positive rate and improves the overall detection rate, and in particular the detection rate of the minority attack types.
Keywords: Intrusion Detection Systems (IDS), Anomaly-based IDS, Data-Dependent Fusion (DD Fusion), False Positive (FP), False Negative (FN), Precision, Recall, F-score, Detection Performance, Sensor Fusion, Neural Network
1. INTRODUCTION
The threat of attacks on the Internet is real and frequent, and this has led to an increased need for securing any network connected to the Internet. An Intrusion Detection System (IDS) provides an additional layer of security to a network's perimeter defense, which is usually implemented using a firewall. The goal of an IDS is to collect information from a variety of system and network sources, and then analyze that information for signs of intrusion and misuse. IDSs are implemented in hardware, software, or a combination of both.
Network traffic is made up of attack (or anomalous) traffic and normal traffic. Real-world traffic is predominantly normal rather than attack traffic. Even within the attack traffic, some attacks are rarer than others, and these rarer attacks may also cause significant damage. IDSs are normally characterized by their overall accuracy. Though an IDS can give very high overall accuracy, its performance on the class of rarer attacks has been found to be less than acceptable. The problem of designing IDSs that work effectively and yield higher accuracy for minority attacks even under data skewness has been receiving serious attention in recent times.
The imbalance in data degrades the prediction accuracy. In most of the available literature this is overcome by resampling the training distribution, either by oversampling the minority class or by undersampling the majority class.1–3 Other commonly used approaches for overcoming data imbalance are cost-sensitive learning,4, 5 the two-phase rule induction method,6 and rule-based classification algorithms like RIPPER7 and C4.5 rules.8 In view of the enormous computing power available in present-day processors, multiple IDSs have been deployed in the same network to obtain best-of-breed solutions and enhance the performance of attack detection.
In spite of all such attempts, the performance of IDSs in detecting minority and rarer attacks leaves scope for improvement. In this paper, a method of combining the decisions of multiple IDSs using a data-dependent sensor fusion technique is presented. Sensor fusion is the process of combining information from various suboptimal sources in order to obtain a more accurate and optimal result. For illustrative purposes the DARPA 1999 dataset has been used. The performance of the proposed IDS is shown to be better than that reported so far for the minority attacks, along with improved performance for the majority attacks.
The paper is organized as follows. Section 2 discusses the classification of attacks with reference to the DARPA dataset. Section 3 illustrates the data skewness problem, with reference to the DARPA dataset as well as observations of typical University traffic. Section 4 reviews the existing approaches. Section 5 discusses the motivation behind the present study and the proposed approach. Section 6 presents the proposed architecture. Section 7 covers the implementation and section 8 discusses the results. Section 9 concludes the paper.
Data Mining, Intrusion Detection, Information Assurance, and Data Networks Security 2008, edited by Belur V. Dasarathy, Proc. of SPIE Vol. 6973, 69730N, (2008) · 0277-786X/08/$18 · doi: 10.1117/12.785623
2. CLASSIFICATION OF ATTACKS
The classification of the various attacks found in network traffic is explained in detail in the thesis work of Kendall9 with respect to the DARPA Intrusion Detection Evaluation dataset10 and is discussed here in brief. The attacks are classified into five main categories, namely Probes, Denial of Service (DoS) attacks, Remote to Local (R2L) attacks, User to Root (U2R) attacks, and Data attacks. Probe or scan attacks automatically scan a network of computers or a DNS server to find whether a particular IP address exists. If it exists, the probe then tries to find additional information about the machine, such as the operating system and the services running on it. DoS attacks are designed to disrupt a host or network service. Hence DoS attacks cause service denial, whereas probes are for reconnaissance or surveillance. An R2L attack, on the other hand, gains an account on a remote machine, exfiltrates files, modifies data, installs trojans for back-door entry, and so on. A U2R attack uses a buffer overflow to acquire a root shell and so gain full control of the system. In Data attacks an attacker gains the privilege to access special files. An attack can be labeled as both U2R and Data if a U2R attack was used to obtain access to the special files.
3. DATA-SKEWNESS IN NETWORK TRAFFIC
Attack traffic is rare in real-world traffic. In addition, the attack class itself is skewed: probes and DoS attacks abound, whereas R2L and U2R attacks are rare. This data skewness poses serious problems for intrusion detection performance, mainly in two ways:
1) Applying conditional probability through Bayes' theorem, the detection of an attack can be shown to be difficult unless both the percentage of attacks in the entire traffic and the accuracy rate of their identification are far higher than they are at present. The Bayesian rate of attack detection Pr(I|A) is given by:
\[ \Pr(I \mid A) = \frac{\Pr(A \mid I)\,\Pr(I)}{\Pr(A \mid I)\,\Pr(I) + \Pr(A \mid NI)\,\Pr(NI)} \tag{1} \]
where I denotes intrusion, NI denotes no intrusion, and A denotes an alert. The false alarm rate is the limiting factor for the performance of most IDSs. This is due to the base-rate fallacy, which says that in order to achieve a substantial value for the Bayesian attack detection rate, it is necessary to achieve an unattainably low false positive rate.11 To apply this reasoning in the evaluation of IDSs, the data set commonly used is the DARPA 1999 evaluation data set, where the ratio of the number of attacks to the number of normal traffic records is roughly of the order of 1:26,000. The DARPA data is supposed to model a realistic situation, having been synthesized from the traffic observed on a large US Air Force base. With an IDS that detects 99% of the attacks, Pr(A|I) = 0.99, at a false positive rate of Pr(A|NI) = 0.01, the Bayesian rate of attack detection Pr(I|A) is obtained as 0.00379. Hence there are roughly 262 false positives for every real attack detected. This clearly shows the inability of the Intrusion Detection System to perform its proposed task of attack detection, since the actual attacks get embedded in the large volume of false positives. Even though the detection is 99% certain, the chance that an alert corresponds to an attack is only 1/263, because the normal traffic is much larger than the attack traffic. Thus it is difficult to interpret what a small false alarm rate is when the base rate is also small. Considering the large proportion of normal traffic in the network traffic, Pr(NI) ≈ 1, and a high TP rate for the IDS, Pr(A|I) ≈ 1, and noting that Pr(I) is much smaller than Pr(A|NI), eqn. 1 can be approximated as:
\[ \Pr(I \mid A) \approx \frac{\Pr(I)}{\Pr(A \mid NI)} \tag{2} \]
Eqn. 2 shows that for the Bayesian attack detection rate to approach 1, the false positive rate must be of the same order as the prior probability of attack in the network traffic. Thus data skewness demands an extremely low false positive rate in order to achieve an appreciably high Bayesian attack detection rate.
2) The standard base rate says that it is safe to assume that all the traffic is normal and still get an accuracy of 99.99%. The performance of an IDS is commonly evaluated using the accuracy measure, but this measure fails when the data is imbalanced and when the costs of different errors vary markedly. Most IDSs generate a trivial model that almost always predicts the majority class, since predicting minority classes has a much higher error rate and hence degrades the measured IDS performance. This is the accuracy paradox,12 which says that in predictive analytics, high accuracy is not necessarily an indicator of high model quality. It can be illustrated with the DARPA test data set of 5 million test records containing 190 attacks. Consider an IDS which detects 100 attacks at a false positive rate of 0.01%. The accuracy of this detector is about 99.99%. However, the accuracy can easily be raised to 99.996% by always predicting normal. The second model, even though it has a higher accuracy, is useless since it does not detect any attacks. Hence most IDSs do not detect the minority class types sufficiently well, since they aim to minimize the overall error rate rather than pay attention to the minority class, which is obviously not the desired detection result. Thus the present-day stand-alone Intrusion Detection Systems are not effective in detecting attacks, especially the rare attack types. In a real-world environment, the minority attacks, namely R2L and U2R/Data attacks, are more dangerous than the majority attacks like probe and DoS.
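The base-rate arithmetic behind eqns. 1 and 2 can be checked with a few lines of Python. This is only a sketch under an assumed reading of the text: one attack per 26,000 normal records, Pr(A|I) = 0.99 and Pr(A|NI) = 0.01. The function name `bayesian_detection_rate` is illustrative and not part of the original work.

```python
def bayesian_detection_rate(prior, tp_rate, fp_rate):
    """Pr(I|A) from eqn. 1: probability that an alert is a real intrusion."""
    true_alerts = tp_rate * prior
    false_alerts = fp_rate * (1.0 - prior)
    return true_alerts / (true_alerts + false_alerts)


# Assumed reading of the text: 1 attack per 26,000 normal records,
# Pr(A|I) = 0.99 and Pr(A|NI) = 0.01.
prior = 1.0 / 26001.0
rate = bayesian_detection_rate(prior, tp_rate=0.99, fp_rate=0.01)
false_per_true = (1.0 - rate) / rate  # false alerts per real attack alert
approx = prior / 0.01                 # eqn. 2 approximation of Pr(I|A)

print(f"Pr(I|A) = {rate:.5f}")        # prints Pr(I|A) = 0.00379
print(f"false alerts per true alert = {false_per_true:.1f}")
print(f"eqn. 2 approximation = {approx:.5f}")
```

The exact and approximate values agree to within a few percent here because the prior is far smaller than the false positive rate, which is the regime in which eqn. 2 is meant to be used.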
Hence it is essential to improve the detection performance for the minority intrusions, while maintaining a reasonable overall detection rate.
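The accuracy paradox can be illustrated with a short calculation. The sketch below assumes that the number of false positives is the false positive rate multiplied by the number of normal records, so the printed percentages may differ in the last digit from the figures quoted in the text. The ordering of the two models is the point of the example.

```python
def accuracy(tp, tn, fp, fn):
    """Fraction of all records that are classified correctly."""
    return (tp + tn) / (tp + tn + fp + fn)


total_records = 5_000_000
attacks = 190
normal = total_records - attacks

# Detector from the text: 100 attacks found, 0.01% of normal records flagged.
# Assumption: false positives = FP rate x number of normal records.
detected = 100
false_alarms = round(0.0001 * normal)
acc_detector = accuracy(detected, normal - false_alarms,
                        false_alarms, attacks - detected)

# Trivial model: every record is labelled normal, so no attack is found.
acc_trivial = accuracy(0, normal, 0, attacks)

print(f"detector accuracy:      {100 * acc_detector:.4f}%")
print(f"always-normal accuracy: {100 * acc_trivial:.4f}%")
print("trivial model scores higher:", acc_trivial > acc_detector)
```

The model that never raises an alert obtains the higher accuracy, although its recall on the attack class is zero.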
4. EXISTING APPROACHES
In most of the available literature the imbalance in the data is overcome by resampling the training distribution, either by oversampling the minority class or by undersampling the majority class. The known disadvantages of oversampling are the increased training time caused by the larger training set and the risk of overfitting when minority samples are replicated. The disadvantage of undersampling is the possible elimination of useful data, which may affect the IDS performance. Oversampling and undersampling of the data for overcoming the data imbalance can be seen in the work of Breiman.1 Kubat and Matwin2 resample more selectively, removing the redundant data of the majority classes or the borderline minority class examples which cause errors. Chawla3 oversamples by interpolating between several minority class examples, thus avoiding overfitting and causing the decision boundaries for the minority class to spread further into the majority class space. The other commonly used approach for overcoming data imbalance is cost-sensitive learning.4, 5 Cost-sensitive learning has not shown any significant advantage in overcoming the imbalance in the data. In the work by Joshi et al.,6, 13, 14 the problem of detecting rare classes is addressed by the two-phase rule induction method, in the context of learning complete and precise signatures of rare classes. The performance evaluation is done on a specially designed synthetic dataset.
Other rule-based classification algorithms like RIPPER7 and C4.5 rules8 face the problems of splintered false positives and small disjuncts. Other predictive classifier models for rare events are given in refs. 15 and 16. But none of these attempts has shown any significant contribution in overcoming the data skewness problems in comparison to the initial resampling attempt. Hence, in spite of all the earlier attempts, there is still scope for a significant improvement in the detection of rare attacks.
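The two resampling strategies reviewed in this section can be sketched as follows. This is a simplified illustration and not the procedure of any of the cited works: the interpolation step pairs minority examples at random, whereas interpolation-based oversampling as usually described restricts the pairs to near neighbours. The function names and the toy two-dimensional data are assumptions made for the example.

```python
import random


def undersample(majority, target_size, rng):
    """Keep a random subset of the majority class (may discard useful records)."""
    return rng.sample(majority, target_size)


def interpolate_oversample(minority, n_new, rng):
    """Create synthetic minority points on the segment between two real ones."""
    synthetic = []
    for _ in range(n_new):
        a, b = rng.sample(minority, 2)
        t = rng.random()
        synthetic.append(tuple(ai + t * (bi - ai) for ai, bi in zip(a, b)))
    return synthetic


rng = random.Random(0)
# Toy data: 1000 "normal" points and 10 "attack" points in two dimensions.
normal = [(rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)) for _ in range(1000)]
attacks = [(rng.gauss(3.0, 0.5), rng.gauss(3.0, 0.5)) for _ in range(10)]

balanced_normal = undersample(normal, 100, rng)
synthetic_attacks = interpolate_oversample(attacks, 90, rng)
balanced_attacks = attacks + synthetic_attacks

print(len(balanced_normal), len(balanced_attacks))  # prints 100 100
```

Because every synthetic point lies between two real minority points, none of them is an exact replica, which is the property credited above with reducing overfitting.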
5. MOTIVATION FOR THE PRESENT STUDY AND THE PROPOSED APPROACH
It is important to understand that the cost of misclassifying an attack as normal (a false negative, FN, or type II error) is often higher than the cost of misclassifying normal traffic as an attack (a false positive, FP, or type I error). Lee et al.17 have come up with an attack taxonomy, which categorizes the intrusions that occur in the DARPA Intrusion Detection Evaluation dataset. An attempt is made to look at the skewness within the attack class, and it is observed that the minority attack types carry a still higher misclassification cost. This is also highlighted in the cost matrix published for the DARPA evaluation.18 Hence it is important to have IDSs that minimize the overall misclassification cost by performing better on the minority class and, within it, on the minority attack types. Thus, in line with the KDD evaluations of IDSs, the goal is not mere accuracy but also a low misclassification cost. The misclassification cost penalty was the highest for one of the most infrequent attack types, and that too for the type II error.18 The total misclassification cost can be reduced if both the type II errors and the type I errors are reduced.
Resampling, in general and in particular for the experiments conducted in this work, is open to the following objections. There is no point in reducing the normal data in the training data set, since this data set is expected to replicate the real-time data and its distribution is more or less the naturally occurring class distribution. Additionally, changing the data distribution will affect the accuracy of the IDS in predicting the class of the test samples that belong to each of these classes. Also, anomaly detectors learn from the normal samples, so the more normal traffic there is in the training data, the better they perform.
Also, it has been realized that the base-rate fallacy cannot be avoided, and the only remedy is to set the acceptable false alarm rate extremely low, almost as low as the prior probability. Several detection algorithms report a high detection rate and a low false positive rate without considering the real impact of the false alerts generated. Since one of the main goals of this work is to prevent the misinterpretation of the metrics used, a good estimate of an acceptably low false alarm rate is approximately the prior probability of attack.
Various IDSs reported in the literature have shown distinct preferences for detecting a certain class of attacks with improved accuracy, while performing moderately on the other classes. With advances in sensor fusion it has become possible to obtain a more reliable and accurate decision for a wider class of attacks by combining the decisions of multiple IDSs. This paper presents a possible solution, based on fusion of the IDSs, to the major problems highlighted in the initial part of the paper. This is expected to provide an optimal overall performance from potentially suboptimal individual solutions. Decision fusion has been shown to have great potential to raise the overall performance beyond the level reached by the individual IDSs in the work of Thomas and Balakrishnan.19 Thomas and Balakrishnan proposed the data-dependent decision fusion architecture to enhance the performance of IDSs in an earlier work.20 The present work tries to improve the detection rate of the minority attack types without increasing the false positive rate, using the same architecture.
Other, more distantly related works are the alarm clustering method by Perdisci et al.,21 the aggregation of alerts by Valdes et al.,22 and the alert correlation by Cuppens et al.23 These works address the issue of efficiently managing the large number of alerts by providing a unified description of the alerts from individual IDSs.
The use of data fusion in the field of DoS anomaly detection is presented by Siaterlis and Maglaris.24 In this work, the Dempster-Shafer theory of evidence is used as the mathematical foundation for the development of a novel DoS detection engine. The detection engine is evaluated using real network traffic. Tim Bass25 has presented a framework to improve the performance of intrusion detection systems based on data fusion. A few first steps towards developing the engineering requirements, using the art and science of multi-sensor data fusion as an underlying model, are provided in this work.25 Giacinto et al.26 proposed an approach to intrusion detection based on fusion of multiple classifiers. With each member of the classifier ensemble trained on a distinct feature representation of patterns, the individual results are combined using a number of fixed and trainable fusion rules. The formulation of the intrusion detection problem as a pattern recognition task, using a data fusion approach based on multiple classifiers, is attempted by Didaci et al.27 The work confirms that the combination reduces the overall error rate, but may also reduce the generalization capabilities. The superiority of data fusion technology applied to intrusion detection systems is presented in the work of Wang et al.28 The method used is information collection from the network and host agents and application of the Dempster-Shafer theory of evidence. Another work incorporating the Dempster-Shafer theory of evidence is by Hu et al.29 The Dempster-Shafer theory of evidence in data fusion is observed to solve the problem of how to analyze the uncertainty in a quantitative way. In the evaluation, the ratio of ingoing to outgoing traffic and the service rate are selected as the detection metrics, and prior knowledge in the DDoS domain is used to assign probability to evidence.
Siraj et al.30 discuss a Decision Engine for an Intelligent Intrusion Detection System (IIDS) that fuses information from different intrusion detection sensors using an artificial intelligence technique. The Decision Engine uses Fuzzy Cognitive Maps (FCMs) and fuzzy rule-bases for causal knowledge acquisition and to support the causal knowledge reasoning process.
6. PROPOSED DATA-DEPENDENT DECISION FUSION SCHEME
An architecture which introduces data dependence into the fusion technique was proposed and implemented in the work of Thomas and Balakrishnan.20 The idea in the proposed architecture is to properly analyze the data and understand when the individual IDSs fail. The fusion unit should incorporate this learning from the input as well as from the output of the detectors to make an appropriate decision. The architecture which makes use of the data-dependent fusion method is shown in figure 1.
Figure 1. Data-dependent Fusion method. The input x is fed to IDS1, IDS2, ..., IDSn, which produce the decisions S1, S2, ..., Sn. The Neural Network learner takes x together with S1, S2, ..., Sn and outputs the weights w1, w2, ..., wn. The fusion unit combines S1, S2, ..., Sn using these weights to give the output y.
The proposed architecture is a three-stage architecture, with optimizing the individual IDSs as the first stage, the Neural Network learner determining the weights of the individual IDSs as the second stage, and the fusion unit doing the weighted aggregation as the final stage. The Neural Network learner can be considered as a pre-processing stage to the fusion unit. The neural network is most appropriate for weight determination as it is difficult to define the rules clearly, especially as more IDSs are added to the fusion unit. When a record is correctly classified by one or more detectors, the neural network accumulates this knowledge as a weight, and with more iterations the weight stabilizes. The architecture is independent of the dataset and the structures employed, and can be used with
any real valued data set. The neural network processes the entire available feature set to extract more effective signatures. Thus it is reasonable to use the Neural Network learner unit to understand the performance of the various individual IDSs and assign weights to them for a large data set. The weight assigned to any IDS depends not only on the output of that IDS, but also on the input traffic which causes this output. The neural network unit is fed with the output of the IDSs along with the respective input, for an in-depth estimation of the reliability of the IDSs. The alarms produced by the different IDSs when they are presented with a certain attack clearly tell which sensor generated the more precise result and what attacks are actually occurring in the network traffic. The output of the neural network unit corresponds to the weights which are assigned to each one of the individual IDSs. With the improved weight factors, the IDSs can be fused to produce an improved resultant output. Thus the proposed architecture refers to a collection of diverse IDSs that respond to an input traffic and the weighted combination of their predictions. The weights are learned by looking at the response of the individual sensors for every input traffic. The fusion output can be represented as:
\[ y = F_j\bigl(w_j^i(x_j, S_j^i),\; S_j^i\bigr) \tag{3} \]
where the weights w_j^i depend on both the input x_j and the individual IDS's output S_j^i, the index j refers to the class label, and the index i refers to the IDS.
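The form of eqn. 3 can be made concrete with a small sketch. Everything here is hypothetical: `toy_weight_fn` is a hand-written stand-in for the Neural Network learner, the `service` field is an invented traffic attribute, and the 0.5 threshold is an arbitrary choice. The only point illustrated is that the weights are a function of both the input and the IDS decisions, and that the fusion unit thresholds the weighted sum.

```python
def fuse(x, decisions, weight_fn, threshold=0.5):
    """y = F(w(x, S), S): threshold the weighted sum of binary IDS decisions."""
    weights = weight_fn(x, decisions)
    score = sum(w * s for w, s in zip(weights, decisions))
    return 1 if score >= threshold else 0


def toy_weight_fn(x, decisions):
    """Hypothetical stand-in for the neural network learner.

    Trusts the second detector more on 'telnet' traffic and the first elsewhere.
    """
    if x["service"] == "telnet":
        return [0.2, 0.8]
    return [0.8, 0.2]


# Same pair of IDS decisions, different inputs, different fused outputs.
telnet_result = fuse({"service": "telnet"}, [0, 1], toy_weight_fn)
http_result = fuse({"service": "http"}, [0, 1], toy_weight_fn)
print(telnet_result, http_result)  # prints 1 0
```

A fixed-weight fusion rule would return the same output for both calls. The data dependence enters only through `weight_fn`.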
7. IMPLEMENTATION
7.1 Choice and modification of the individual IDSs
Taking into account that the acceptable false alarm rate is extremely low, almost as low as the prior probability, three IDSs are considered: PHAD31 and ALAD,32 which give an extremely low false alarm rate of the order of 0.00002, and Snort,33 the popular open-source IDS. With the first two IDSs, the Bayesian attack detection rate is of the order of 35% and 38% respectively. A further reason for this choice is that most of the existing IDS algorithms neglect the minority attack types, R2L and U2R, in favor of the majority attack types, Probes and DoS. ALAD is highly successful in detecting these rare attack types, and Snort detects the U2R/Data attacks exceptionally well. The detection performance of the anomaly detectors PHAD and ALAD can be improved further by training them on normal traffic in addition to that of weeks 1 and 3 of the DARPA 1999 data set. The misuse-based IDS, Snort, is likewise improved by modifying the Snort rules.
7.2 Test setup
The test setup for the experimental evaluation consisted of three Pentium machines running the Linux operating system. The experiments were conducted with the simulated IDSs PHAD, ALAD, and Snort, distributed across a single subnet and observing the same domain. The weight analysis of the IDS data coming from PHAD, ALAD, and Snort was carried out by the Neural Network learner before the data was fed to the fusion element. The detectors PHAD and ALAD produce the IP address along with an anomaly score, whereas Snort produces the IP address along with the severity score of the alert. The alerts produced by these IDSs are converted to a standard binary form. The Neural Network learner takes as input these decisions along with the particular traffic input which was monitored by the IDSs. The Intrusion Detection Message Exchange Format (IDMEF) of the Internet Engineering Task Force Intrusion Detection working group, which enables different types of IDSs to report events in a unified language, can be used instead.
The Neural Network learner was designed as a feed-forward network trained with the back propagation algorithm, with a single hidden layer of 25 sigmoidal hidden units. Experimental evidence is available that the Neural Network performs best when the number of hidden units is log(T), where T is the number of training samples in the data set.34 In order to train the neural network, it is necessary to expose it to both normal and anomalous data. Hence, during training, the network was exposed to weeks 1, 2, and 3 of the training data and the weights were adjusted using the back propagation algorithm. An epoch of training consisted of one pass over the training data. The training proceeded until the total error made during each epoch stopped decreasing or 1000 epochs had been reached. The fusion unit performed the weighted aggregation of the IDS outputs and used an optimized threshold to identify the attacks in the test data set. It used binary fusion, giving an output value of one or zero depending on the value of the weighted aggregation of the various IDS decisions. The packets were identified by their timestamp on aggregation. A value of one at the output of the fusion unit indicated that the record was an attack, and a zero indicated the absence of an attack.
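A learner of the kind described above can be sketched in pure Python. This is a minimal illustration under stated assumptions, not the implementation used in the experiments: the class name `WeightLearner`, the learning rate, the squared-error loss, the toy records, and in particular the training targets (1 where an IDS was right for a record, 0 otherwise) are all assumptions. The hidden layer size of 25, the sigmoid units, and the stopping rule follow the description in the text.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


class WeightLearner:
    """One hidden layer of sigmoid units, trained by online back propagation."""

    def __init__(self, n_in, n_hidden, n_out, rng):
        # The last entry of every row is the bias weight.
        self.w1 = [[rng.uniform(-0.5, 0.5) for _ in range(n_in + 1)]
                   for _ in range(n_hidden)]
        self.w2 = [[rng.uniform(-0.5, 0.5) for _ in range(n_hidden + 1)]
                   for _ in range(n_out)]

    def forward(self, x):
        xb = list(x) + [1.0]
        hb = [sigmoid(sum(w * v for w, v in zip(row, xb)))
              for row in self.w1] + [1.0]
        out = [sigmoid(sum(w * v for w, v in zip(row, hb))) for row in self.w2]
        return xb, hb, out

    def train_epoch(self, data, lr):
        """One pass over the data; returns the summed squared error."""
        total = 0.0
        for x, target in data:
            xb, hb, out = self.forward(x)
            # Error terms for the output and hidden layers (squared error).
            d_out = [(o - t) * o * (1.0 - o) for o, t in zip(out, target)]
            d_hid = [hb[j] * (1.0 - hb[j]) *
                     sum(d * row[j] for d, row in zip(d_out, self.w2))
                     for j in range(len(self.w1))]
            for d, row in zip(d_out, self.w2):
                for j in range(len(row)):
                    row[j] -= lr * d * hb[j]
            for d, row in zip(d_hid, self.w1):
                for i in range(len(row)):
                    row[i] -= lr * d * xb[i]
            total += sum((o - t) ** 2 for o, t in zip(out, target))
        return total


def train(net, data, lr=0.1, max_epochs=1000):
    """Stop when the epoch error stops decreasing or after max_epochs."""
    best = float("inf")
    epochs = 0
    for _ in range(max_epochs):
        err = net.train_epoch(data, lr)
        epochs += 1
        if err >= best:
            break
        best = err
    return epochs, best


# Toy records: [traffic feature, IDS-1 decision, IDS-2 decision].
# Assumed targets: 1 where that IDS was right for the record, else 0.
data = [([0.0, 1.0, 0.0], [1.0, 0.0]), ([0.0, 0.0, 1.0], [1.0, 0.0]),
        ([1.0, 0.0, 1.0], [0.0, 1.0]), ([1.0, 1.0, 0.0], [0.0, 1.0])]
net = WeightLearner(n_in=3, n_hidden=25, n_out=2, rng=random.Random(0))
epochs, error = train(net, data)
weights = net.forward([0.0, 1.0, 0.0])[2]
print(epochs, round(error, 4), [round(w, 3) for w in weights])
```

The two outputs play the role of the per-IDS weights that are passed to the fusion unit. With the strict stopping rule, training may end early if the epoch error fluctuates, so a practical version would probably add some patience before stopping.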
7.3 Dataset
The DARPA 1999 evaluation data set10 was used for training the detectors as well as for testing the individual detectors and the overall fusion system. The data consists of weeks one and three of training data, week two of labeled attacks, and weeks four and five of test data. Evaluating the proposed IDS with the DARPA data set may not be representative of its performance with more recent attacks or with other attacks against different types of machines, routers, firewalls or other network infrastructure. Even with its serious drawbacks, as observed by McHugh35 and by Mahoney and Chan,36 and the potential questions about the adequacy of the data for its intended purpose, there is still no good data set other than the DARPA data set for IDS evaluation. The DARPA data has certainly been useful in the development of the proposed system. It is important to mention at this point that the proposed architecture can be generalized beyond the dataset or the IDSs that are used. The proposed method is independent of the input traffic and of the individual IDSs that take part in the fusion. Since none of the IDSs performs exceptionally well on the DARPA dataset, the aim is to show that the performance improves with the proposed method. A system that is evaluated only on the DARPA dataset cannot claim anything more about its performance on real network traffic. Hence this dataset can be considered as the baseline of any research.37 Also, Thomas et al.37 illustrate that even eight years after its generation, there are many attacks in the dataset for which signatures are not available in the databases of frequently updated signature-based IDSs. Real data traffic is difficult to work with, the main reason being the lack of information regarding the status of the traffic.
Even with intense analysis, the prediction can never be 100 percent accurate because of the stealthiness and sophistication of the attacks, the unpredictability of the non-malicious user, and the intricacies of users in general. Normal data generated from an internal University network was collected and randomly divided into two parts. PHAD is trained on week three of the DARPA data set and one portion of the internal network traffic data, and ALAD is trained on week one of the DARPA data set and the other portion of the internal network traffic data. The test has been conducted on the entire test set of weeks four and five.
7.4 Metrics for performance evaluation
Let TP be the number of attacks that are correctly detected, FN be the number of attacks that are not detected, TN be the number of normal traffic packets/connections that are correctly classified, and FP be the number of normal traffic packets/connections that are incorrectly detected as attacks.
Table 1. Attacks of each type detected by PHAD at a false positive rate of 0.00002

Attack Type   Total attacks   Attacks detected   % detection
Probe         37              26                 70%
DoS           63              27                 43%
R2L           53              6                  11%
U2R/Data      37              4                  11%
Total         190             63                 33%
Table 2. Attacks of each type detected by ALAD at a false positive rate of 0.002%

Attack Type   Total attacks   Attacks detected   % detection
Probe         37              9                  24%
DoS           63              23                 37%
R2L           53              31                 59%
U2R/Data      37              15                 31%
Total         190             78                 41%
7.4.1 ROC and AUC: ROC curves are used to evaluate classifier performance over a range of trade-offs between the TP rate and the FP rate.
7.4.2 Precision, Recall and F-score: Precision is the fraction of the test data detected as attacks that actually belongs to the attack classes. Recall is the fraction of the attack class that is correctly detected.
\[ \text{Precision}\,(P) = \frac{TP}{TP + FP} \]
\[ \text{Recall}\,(R) = \frac{TP}{TP + FN} \]
The F-score is the harmonic mean of recall and precision and is considered as the overall accuracy score of an IDS.
\[ \text{F-score} = \frac{2PR}{P + R} \]
The primary goal is to achieve improvement in recall as well as precision for the rare classes. The hypothesis that the proposed model is suitable for the rare classes is empirically evaluated in the next section using the DARPA’99 dataset.
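The three metrics defined above translate directly into code. The sketch below uses the counts reported for the fusion IDS in Table 4 (143 of 190 attacks detected) and the precision and recall reported in Table 5. The function names are illustrative.

```python
def precision(tp, fp):
    """Fraction of alerts that are real attacks."""
    return tp / (tp + fp)


def recall(tp, fn):
    """Fraction of attacks that raise an alert."""
    return tp / (tp + fn)


def f_score(p, r):
    """Harmonic mean of precision and recall."""
    return 2 * p * r / (p + r)


# Fusion IDS in Table 4: 143 of the 190 attacks are detected.
fusion_recall = recall(143, 190 - 143)
print(round(fusion_recall, 2))          # prints 0.75

# P and R as reported for the fusion IDS in Table 5.
fusion_f = f_score(0.42, 0.75)
print(round(fusion_f, 2))               # prints 0.54
```

Because the F-score is a harmonic mean, it stays low whenever either precision or recall is low, which is why it penalizes a detector that buys recall with a flood of false alarms.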
8. RESULTS AND DISCUSSION
All the Intrusion Detection Systems that form part of the fusion IDS were separately evaluated with the same data set. It can be observed from Tables 1, 2, and 3 that the attacks detected by the different IDSs were not necessarily the same, and that no individual IDS was able to provide acceptable values of all the performance measures. The proposed data-dependent decision fusion method was then evaluated empirically, and the observations are given in Table 4. When a discrete IDS is applied to a test set, it yields a single confusion matrix, which in turn corresponds to one ROC point. Thus a discrete IDS produces only a single point in the ROC space, whereas a scoring IDS can be used with a threshold to produce different points in the ROC space. The improved overall performance of the IDS on fusion can be observed from Figure 2.
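The difference between a discrete and a scoring IDS in ROC space can be sketched as follows. The scores, labels and thresholds are made-up toy values, and the function names are illustrative. A confusion matrix gives one (FP rate, TP rate) point, and sweeping a threshold over the scores gives one point per threshold.

```python
def roc_point(tp, fn, fp, tn):
    """Map one confusion matrix to a single (FP rate, TP rate) point."""
    return fp / (fp + tn), tp / (tp + fn)


def roc_points(scores, labels, thresholds):
    """A scoring IDS yields one ROC point per alert threshold."""
    points = []
    for th in thresholds:
        tp = sum(1 for s, y in zip(scores, labels) if s >= th and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= th and y == 0)
        fn = sum(1 for s, y in zip(scores, labels) if s < th and y == 1)
        tn = sum(1 for s, y in zip(scores, labels) if s < th and y == 0)
        points.append(roc_point(tp, fn, fp, tn))
    return points


# Toy anomaly scores; label 1 marks an attack record, 0 a normal record.
scores = [0.9, 0.8, 0.35, 0.3, 0.1]
labels = [1, 0, 1, 0, 0]
points = roc_points(scores, labels, [0.85, 0.5, 0.2])
for fpr, tpr in points:
    print(f"FP rate {fpr:.2f}, TP rate {tpr:.2f}")
```

Lowering the threshold moves the operating point up and to the right: more attacks are caught at the price of more false alarms.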
Table 3. Attacks of each type detected by Snort at a false positive rate of 0.02%

Attack Type   Total attacks   Attacks detected   % detection
Probe         37              15                 41%
DoS           63              35                 56%
R2L           53              30                 57%
U2R/Data      37              34                 92%
Total         190             115                61%
Table 4. Attacks of each type detected by the data-dependent architecture at a false positive rate of 0.004%

Attack Type   Total attacks   Attacks detected   % detection
Probe         37              31                 84%
DoS           63              44                 70%
R2L           53              34                 64%
U2R/Data      37              34                 92%
Total         190             143                75%
Figure 2. ROC curve resulting from a single evaluation of individual IDSs and the Fusion IDS (ROC semilog curve: true positive rate from 0 to 1 against false positive rate on a log scale from 10^-6 to 10^0, with curves for PHAD, ALAD, Snort, and DD Fusion)
Table 5. Comparison of the evaluated IDSs with various evaluation metrics

Detector                P      R      F-score   Accuracy   AUC
PHAD                    0.39   0.33   0.36      0.99       0.66
ALAD                    0.44   0.41   0.42      0.99       0.71
Snort                   0.10   0.61   0.17      0.99       0.8
Data-dependent fusion   0.42   0.75   0.54      0.99       0.88
Table 6. Detection of different attack types by individual IDSs and the DD Fusion method

Attack Type        PHAD     ALAD     Snort   Data-Dependent Fusion
Probe              70%      24%      41%     84%
DoS                43%      37%      56%     70%
R2L                11%      59%      57%     64%
U2R/Data           11%      31%      92%     92%
False Positive %   0.002%   0.002%   0.02%   0.004%
The results in Table 5 show that accuracy and AUC are not good metrics for imbalanced data. Accuracy is heavily biased in favor of the majority class. Using accuracy as a performance measure assumes that the target class distribution is known and unchanging and that the costs of FP and FN are equal. These assumptions are unrealistic. If metrics like accuracy and AUC are to be used, the data has to be more balanced across the various classes. If AUC is to be used as an evaluation metric, a possible solution is to consider only the area under the ROC curve until the FP rate reaches the prior probability. Recall and Precision are good measures for IDS evaluation and, because of the trade-off between the two, the F-score can be used to score the balance between them. The F-score is therefore a good metric for IDS evaluation when the data is imbalanced.
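As a check on the F-score row of Table 5, the sketch below recomputes it from the tabulated precision and recall. It assumes the balanced F-score (the harmonic mean of P and R, i.e. beta = 1); the function name f_score and the dictionary layout are illustrative and not part of the evaluated systems.

```python
def f_score(precision, recall, beta=1.0):
    """F-beta score; beta = 1 gives the harmonic mean of precision and recall."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# (precision, recall) pairs as reported in Table 5.
table5 = {
    "PHAD": (0.39, 0.33),
    "ALAD": (0.44, 0.41),
    "Snort": (0.10, 0.61),
    "Data-dependent fusion": (0.42, 0.75),
}

# Assumption: scores are rounded to two decimals, as in the table.
f_scores = {name: round(f_score(p, r), 2) for name, (p, r) in table5.items()}

for name, f in f_scores.items():
    print(f"{name}: F-score = {f:.2f}")
```

Under this assumption the recomputed values agree with the table. Snort has the highest recall among the individual detectors but the lowest F-score because of its low precision, while the accuracy of 0.99 does not separate the four systems at all.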
Figure 3. Comparison of the evaluated systems, illustrating the higher detection performance on rare attack types achieved by the Data-Dependent fusion method
In a real-world network environment, rare attacks like U2R and R2L are more dangerous than Probe and DoS attacks. Hence it is essential to improve the detection performance for these rare classes of attacks while maintaining a reasonable overall detection rate. The results presented in Table 6 and Figure 3 indicate that the proposed method performs significantly better for rare attack types, achieving high recall as well as high precision rather than high accuracy alone. The claim that the proposed method performs better is supported by a statement from Kubat et al.,38 which states that “a classifier that labels all regions as majority class will achieve an accuracy of 96%...a system achieving 94% on the minority class and 94% on the majority class will have worse accuracy yet be deemed highly successful”. The results of the data-dependent fusion method are better than what was predicted by the Lincoln Laboratory after the DARPA IDS evaluation. With the proposed method, an intrusion detection rate of 75% at a false positive rate as low as 0.004% has been achieved. The F-score has been improved to 0.54.
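The accuracy comparison in the statement quoted from Kubat et al. can be reproduced with a few lines of arithmetic. The sketch assumes a minority class share of 4%, which is implied by the quoted 96% accuracy of the all-majority classifier; the function name accuracy is illustrative.

```python
def accuracy(minority_share, minority_recall, majority_recall):
    """Overall accuracy as the class-share-weighted mean of per-class recalls."""
    return minority_share * minority_recall + (1 - minority_share) * majority_recall


# Assumption: the minority class is 4% of the data, as implied by the quoted 96%.
minority_share = 0.04

# Classifier that labels everything as the majority class.
trivial = accuracy(minority_share, minority_recall=0.0, majority_recall=1.0)

# Classifier that is right 94% of the time on each class.
balanced = accuracy(minority_share, minority_recall=0.94, majority_recall=0.94)

print(f"trivial:  accuracy = {trivial:.2f}, minority recall = 0.00")
print(f"balanced: accuracy = {balanced:.2f}, minority recall = 0.94")
```

The trivial classifier wins on accuracy while detecting none of the minority class, which is why accuracy alone is a poor yardstick for rare attack types.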
9. CONCLUSION
Despite the research and development efforts in the field of IDS, state-of-the-art IDSs still have marginal detection rates and high false alarm rates, especially in the case of stealthy, novel, and R2L attacks. In the environment in which an IDS is expected to operate, attacks are the minority, so very low false positive rates are required for acceptable detection. Basic domain knowledge about network intrusions makes it clear that U2R and R2L attacks are intrinsically rare. The poor performance of the detectors has been improved by discriminative training of the anomaly detectors and by incorporating additional rules into the misuse detector. This paper proposes a new machine learning approach for a learning problem characterized by a number of features, skewness in the data, a class of interest that is the minority class and the minority attack type, and nonuniform misclassification costs. The proposed method demonstrates that the neural network learner encapsulates expert knowledge for the weighted fusion of individual detector decisions. This creates an adaptable algorithm that can substantially outperform state-of-the-art methods for minority class type detection in both coverage and precision. The evaluations show the ability of the proposed approach to perform very well for those rare classes. The experimental comparison of this method has confirmed its usefulness and significance.
REFERENCES
[1] L. Breiman, J. H. Friedman, R. A. Olshen, C. J. Stone, Classification and regression trees, Belmont, CA: Wadsworth, 1984
[2] M. Kubat, R. C. Holte, S. Matwin, Learning when negative examples abound: One-sided selection, Proceedings of the ninth European Conference on machine learning, pp. 146-153, 1997
[3] N. V. Chawla, K. W. Bowyer, L. O. Hall, W. P. Kegelmeyer, Smote: Synthetic minority over-sampling technique, Journal of Artificial Intelligence research, vol. 16, pp. 321-357, 2002
[4] K. McCarthy, B. Zabar, G. Weiss, Does cost-sensitive learning beat sampling for classifying rare classes?, Proceedings of the 1st International workshop on utility-based data mining, pp. 69-77, 2005
[5] P. K. Chan, S. J. Stolfo, Toward scalable learning with non-uniform class and cost distributions: A case study in credit card fraud detection, KDD 1998, pp. 164-168, 1998
[6] M. V. Joshi, R. C. Agarwal, V. Kumar, Mining needles in a haystack: classifying rare classes via two-phase rule induction, ACM SIGMOD, May 2001
[7] W. W. Cohen, Fast effective rule induction, Proceedings of 12th International conference on machine learning, California, 1995
[8] J. R. Quinlan, C4.5: Programs for machine learning, Morgan Kaufmann, 1993
[9] K. Kendall, A database of computer attacks for the evaluation of intrusion detection systems, Thesis, MIT, 1999
[10] DARPA Intrusion Detection Evaluation Data Set, http://www.ll.mit.edu/IST/ideval/data/data_index.html
[11] S. Axelsson, The base-rate fallacy and the difficulty of Intrusion Detection, ACM Transactions on Information and System Security, vol. 3, No. 3, Aug 2000
[12] Accuracy Paradox, http://en.wikipedia.org/wiki/Accuracy_paradox
[13] M. V. Joshi, On evaluating performance of classifiers for rare classes, Proceedings of the 2002 IEEE International Conference on data mining, pp. 641-644, 2002
[14] R. Agarwal and M. V. Joshi, PN rule: A new framework for learning classifier models in data mining (a case-study in network intrusion detection), Technical Report RC 21719, IBM Research report, Computer Science/Mathematics, April 2000
[15] P. Chan, S. Stolfo, Towards scalable learning with non-uniform class and cost distributions: A case study in credit card fraud detection, Proceedings of 4th International Conference on knowledge discovery and data mining (KDD-98), pp. 164-168, 1998
[16] R. C. Holte, N. Japkowicz, C. X. Ling, Learning from imbalanced data sets, Technical Report WS-00-05, AAAI Press, Menlo Park, CA, 2000
[17] W. Lee, W. Fan, M. Miller, S. Stolfo, E. Zadok, Toward cost-sensitive modeling for intrusion detection and response, Technical report CUCS-002-00, Computer Science, Columbia University, 2000
[18] C. Elkan, Results of the KDD’99 classifier learning, SIGKDD Explorations, Vol. 1, Issue 2, pp. 63-64, Jan 2000
[19] C. Thomas, N. Balakrishnan, Selection of Intrusion Detection Threshold Bounds for effective Sensor Fusion, Proceedings of SPIE Defense and Security Conference, 6570-5, Orlando, Florida, April 2007
[20] C. Thomas, N. Balakrishnan, Performance Enhancement of Intrusion Detection Systems using Data-Dependent Decision Fusion (submitted to Journal of Computer Security)
[21] R. Perdisci, G. Giacinto, F. Roli, Alarm clustering for intrusion detection systems in computer networks, Engg. applications of Artificial Intelligence, Elsevier publications, March 2006
[22] A. Valdes, K. Skinner, Probabilistic alert correlation, Springer Verlag notes in Computer Science, 2001
[23] F. Cuppens, A. Miege, Alert correlation in a cooperative intrusion detection framework, Proceedings of the 2002 IEEE symposium on security and privacy, 2002
[24] C. Siaterlis, B. Maglaris, Towards Multisensor Data Fusion for DoS detection, ACM Symposium on Applied Computing, 2004
[25] T. Bass, Multisensor Data Fusion for Next Generation Distributed Intrusion Detection Systems, IRIS National Symposium, 1999
[26] G. Giacinto, F. Roli, L. Didaci, Fusion of multiple Classifiers for Intrusion Detection in Computer Networks, Pattern Recognition Letters, 24, pp. 1795-1803, 2003
[27] L. Didaci, G. Giacinto, F. Roli, Intrusion detection in computer networks by multiple classifiers systems, International Conference on Pattern Recognition, 2002
[28] Y. Wang, H. Yang, X. Wang, R. Zhang, Distributed intrusion detection system based on data fusion method, Intelligent control and automation, WCICA 2004
[29] W. Hu, J. Li, Q. Gao, Intrusion Detection Engine on Dempster-Shafer’s Theory of Evidence, Proceedings of International Conference on Communications, Circuits and Systems, vol. 3, pp. 1627-1631, Jun 2006
[30] A. Siraj, R. B. Vaughn, S. M. Bridges, Intrusion Sensor Data Fusion in an Intelligent Intrusion Detection System Architecture, Proceedings of the 37th Hawaii international Conference on System Sciences, 2004
[31] M. V. Mahoney, P. K. Chan, Detecting Novel attacks by identifying anomalous Network Packet Headers, Florida Institute of Technology Technical Report CS-2001-2
[32] M. V. Mahoney and P. K. Chan, Learning non stationary models of normal network traffic for detecting novel attacks, SIGKDD, 2002
[33] www.snort.org/docs/snort_htmanuals/htmanual_260
[34] R. P. Lippmann, R. K. Cunningham, Improving Intrusion Detection Performance using keyword selection and Neural Networks, Web proceedings of the 2nd International Workshop on Recent Advances in Intrusion Detection, 1999
[35] J. McHugh, Testing Intrusion Detection Systems: A critique of the 1998 and 1999 DARPA IDS evaluations as performed by Lincoln Laboratory, ACM Transactions on information and system security, vol. 3, No. 4, Nov 2000
[36] M. V. Mahoney, P. K. Chan, An analysis of the 1999 DARPA/Lincoln Laboratory evaluation data for network anomaly detection, Technical Report CS-2003-02
[37] C. Thomas, V. Sharma, N. Balakrishnan, Usefulness of DARPA data set for Intrusion Detection Systems Evaluation, Proc. of the SPIE, Defense and Security Symposium, Mar. 2008 (accepted for publication)
[38] M. Kubat, R. C. Holte, S. Matwin, Machine learning for the detection of oil spills in satellite radar images, Machine Learning, 30: 195-215, 1998