Intelligent Data Analysis 9 (2005) 309–326 IOS Press
Evaluating indirect and direct classification techniques for network intrusion detection

Taghi M. Khoshgoftaar (a,∗), Kehan Gao (b) and Nawal H. Ibrahim (a)
(a) Department of Computer Science and Engineering, Florida Atlantic University, Boca Raton, FL, USA
(b) Math and Computer Science Department, Eastern Connecticut State University, CT 06226, USA
E-mail: [email protected]
Received 2 June 2004
Revised 24 July 2004
Accepted 4 August 2004

Abstract. The network intrusion detection domain has seen increased research that exploits data mining and machine learning techniques and principles. Typically, multi-category classification models are built to classify network traffic instances either as normal or as belonging to a specific attack category. While many existing works on data mining in intrusion detection have focused on applying direct classification methods, to our knowledge indirect classification techniques have not been investigated for intrusion detection. In contrast to indirect classification techniques, direct classification techniques generally extend associated binary classifiers to handle multi-category classification problems. An indirect classification technique decomposes the original multi-category problem into multiple binary classification problems, a step called binarization. The classification technique used to train the set of binary classification problems is called the base classifier. Subsequently, a combining strategy is used to merge the results of the binary classifiers. We investigate two binarization techniques and three combining strategies, yielding six indirect classification methods. This paper presents a comprehensive comparison of five direct classification methods with thirty indirect classification models (six indirect classification models for each of the five base classifiers). To our knowledge, there are no existing works that evaluate as many indirect classification techniques and compare them with direct classification methods, particularly for network intrusion detection. A case study of the DARPA KDD-1999 offline intrusion detection project is used to evaluate the different techniques. It is empirically shown that certain indirect classification techniques yield better network intrusion detection models.
Keywords: Network intrusion detection, multi-category classifications, binary classifiers, combining strategies, base learners
∗ Corresponding author: Taghi M. Khoshgoftaar, Empirical Software Engineering Laboratory, Dept. of Computer Science and Engineering, Florida Atlantic University, Boca Raton, FL 33431 USA. Tel.: +1 561 297 3994; Fax: +1 561 297 2800; E-mail: [email protected].
1088-467X/05/$17.00 © 2005 – IOS Press and the authors. All rights reserved
T.M. Khoshgoftaar et al. / Evaluating indirect and direct classification techniques for network intrusion detection

1. Introduction

With the explosive growth of the Internet and the increased availability of tools for attacking networks, network intrusion detection has become a critical component of successful network administration. Intrusion detection systems (IDSs) provide an additional level of protection in detecting the presence of an intruder, and help provide accountability for the attacker's actions. Since most intrusions can be located by examining patterns of user activities and audit records, many IDSs have been built by utilizing the recognized attack patterns. IDSs are generally classified as misuse detectors [31] or anomaly detectors [7,11]. Misuse detectors rely on signatures of past and known attacks. In contrast, anomaly detectors rely on organization-specific behavioral profiles for detecting abnormal behaviors. The focus of our study is on misuse detection classification models in the domain of network security.

Identifying intrusions, misuses, and attacks in general requires that systems be monitored for anomalous behavior. This includes the analysis of network traffic flow, system log files, system activity statistics, and user connection activity. Such audit data is subsequently mined and compared with a set of recognized misuse and attack patterns to identify possible security violations. Various data mining techniques, such as neural networks, statistical techniques, pattern recognition and machine learning, have been utilized to automate the intrusion detection process and reduce human intervention.

In related literature, classification models are generally built to predict network traffic instances as either normal or belonging to a specific attack category. For example, the commonly used DARPA KDD-1999 offline intrusion detection project dataset categorizes network traffic data according to a five-class taxonomy – Normal, Probe (reconnaissance), DOS (denial of service), U2R (user to root) and R2L (remote to local) [24,32]. These categories are based on the inherent hierarchical level of attacks and their varying level of severity in a given network environment. The classification problem for network intrusion detection is therefore a multi-category classification problem. A variety of modeling approaches have been used for the network intrusion detection classification problem, including fuzzy classifiers [17,18], neural networks [33], support vector machines [35], clustering [36,41], and rule-based techniques [2]. However, intrusion detection is often viewed as a two-group classification problem that integrates all attack types into a single attack category [1,23].
Two-group classification models in intrusion detection have proven their value, especially in anomaly detection techniques that aim to characterize traffic as either suspicious or not [11]. However, not all data mining techniques that build binary classification models can facilitate multi-category classification. In our study, we focus on the multi-category classification problem in network intrusion detection [30].

Research works that investigate multi-category classification models for intrusion detection have focused on using a Direct classification approach, i.e., a classification technique that facilitates multi-category classification is used, such as the C4.5 decision tree learner. Another approach to addressing multi-category classification is employing an Indirect classification approach [4,8,9]. The multi-category classification problem is decomposed (binarization) into a set of binary classification problems, the predictions of which are then merged using a specific combining strategy. The classification technique used to solve the set of binary classification problems is called the base learner or base classifier. An Indirect approach is therefore defined by its binarization method and combining strategy [4,8].

This study presents a comprehensive comparative evaluation of some commonly used Direct multi-category classification techniques with some Indirect classification approaches. The Direct classification techniques investigated include C4.5 (decision tree learner), JRip (rule-based learner), PART (combinatory rule- and tree-based learner), IBk (instance-based learner), and LWL (memory-based learner). All of these techniques are implemented in the freeware Weka, a data mining and machine learning tool [42]. The selection of these techniques was based on their ability to handle nominal and numeric features and to support multi-category classification.
The Indirect classification approaches were obtained by pairing each of two binarization techniques with each of three combining strategies. The binarization techniques investigated include One vs One and One vs Rest, and the combining strategies investigated include Hamming Decoding, Loss-Based Decoding, and Soft-Max Function. The base learners for building the set of binary classifiers are the five Direct classification techniques. Consequently, we obtain thirty Indirect classification approaches for addressing the multi-category intrusion detection problem. We empirically compare the five Direct classification techniques with the thirty Indirect classification techniques. A case study of the benchmark KDD-1999 offline intrusion detection project dataset is used
in our comparative study. The dataset contains network connection records obtained from live (simulated) traffic consisting of 42 network traffic features. We also evaluate the five Direct classification techniques with respect to each other, and the six Indirect classification techniques with respect to each other for each base learner.

This study is unique in evaluating six Indirect classification techniques and comparing them with Direct classification methods, particularly for network intrusion detection. Such a comprehensive comparative analysis would be very useful to fellow researchers and practitioners. Our study indicates that certain Indirect classification methods performed better than the Direct classification methods. Considering primary performance factors like detection accuracy, stability and modeling time, the C4.5 and PART classification techniques performed well.

The remainder of the paper is organized as follows: Section 2 describes the methodology used in this study. It provides a detailed explanation of the Indirect approach to multi-category classification and describes the class binarization techniques and the combining strategies used therein. Section 3 explains the modeling techniques used to obtain sets of intrusion detection models. Sections 4 and 5 include the description of experiments and the analysis of the experimental results obtained. Finally, we draw conclusions from our findings in this study and suggest future work in Section 6.

2. Methodology for multi-group classification

Although real-world classification problems often have multiple classes [12,26,28,40], many learning algorithms are inherently binary, that is, they are only able to discriminate between two classes. We identify two general approaches for building a multi-group classification model: Direct and Indirect. The Direct approach generalizes the original binary classification method to tackle multi-category problems, as with C4.5 [38].
The Indirect approach employs class binarization techniques, which reduce the multi-class problem to a series of simple binary-class problems. These are solved individually, and the resultant predictions are combined to obtain a final solution.

2.1. Class binarization techniques

Class binarization, the first phase in the Indirect approach for multi-category classification, is a mapping of a multi-class learning problem to several two-class learning problems in a way that allows a sensible decoding of the prediction, i.e., it allows the derivation of a prediction for the multi-class problem from the predictions of the set of two-class classifiers. The learning algorithm used for solving the two-class problems is called the base learner or base classifier. Various class binarization approaches have been proposed in the literature [8,14,20]. Our research, however, focused on two popular class binarization techniques, namely One vs Rest and One vs One.

2.1.1. One versus rest

The most popular class binarization technique is the One vs Rest class binarization, where one takes each class in turn and learns binary concepts that discriminate that class from all other classes. The One vs Rest class binarization transforms an n-class problem into n two-class problems. These two-class problems are constructed by using the examples of class i as the positive examples and the examples of the remaining classes as the negative examples. One binary classifier is learned for each class by training on the corresponding dataset. Hence, for our intrusion detection dataset we learn five such One vs Rest binary classifiers, namely: Normal vs Rest, Probe vs Rest, DOS vs Rest, U2R vs Rest and R2L vs Rest.
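The One vs Rest relabelling described above can be illustrated with a minimal Python sketch. It is not the authors' implementation (the study used Weka); the helper names and the toy label list are hypothetical and serve only to show how one five-class label vector yields five binary label vectors.

```python
# Minimal sketch of One vs Rest relabelling.
# Helper names and the toy labels are hypothetical, for illustration only.
CLASSES = ["Normal", "Probe", "DOS", "U2R", "R2L"]


def one_vs_rest_labels(labels, positive_class):
    """Map multi-class labels to +1 (positive_class) or -1 (every other class)."""
    return [1 if y == positive_class else -1 for y in labels]


def one_vs_rest_problems(labels, classes=CLASSES):
    """Build one binary label vector per class, keyed by '<class> vs Rest'."""
    return {f"{c} vs Rest": one_vs_rest_labels(labels, c) for c in classes}


labels = ["Normal", "DOS", "Probe", "Normal", "R2L"]  # toy data
problems = one_vs_rest_problems(labels)
print(len(problems))            # number of binary problems
print(problems["DOS vs Rest"])  # binary labels for the DOS vs Rest problem
```

Each binary label vector would then be paired with the unchanged feature vectors and passed to the base learner.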
2.1.2. One versus one

The basic idea of the One vs One technique is quite simple, namely to learn one classifier for each pair of classes. The One vs One class binarization transforms an $n$-class problem into $n(n-1)/2$ two-class problems $\langle i, j \rangle$, one for each set of classes $\{i, j\}$, where $i, j = 1, \ldots, n$ and $i < j$. The binary classifier for a problem $\langle i, j \rangle$ is trained with examples of classes $i$ and $j$, whereas examples of classes $k \neq i, j$ are ignored for this problem [15]. Hence, for the intrusion detection dataset the One vs One procedure learns ten binary classifiers, namely: Normal vs Probe, Normal vs DOS, Normal vs U2R, Normal vs R2L, Probe vs DOS, Probe vs U2R, Probe vs R2L, DOS vs U2R, DOS vs R2L and U2R vs R2L.

2.2. Combining strategies

In our research, we implemented combining strategies, the second phase in the Indirect approach for multi-category classification, to decode the predictions obtained from the binary classifiers produced by the respective binarization process and obtain the final solution. In the following sections, we discuss the three widely used decoding techniques that we implemented in this research.

2.2.1. Output coding for multi-class problems

Output coding for multi-class problems is composed of two stages. As described earlier, in the training stage we need to construct multiple (supposedly) independent binary classifiers, i.e., employ class binarization, each of which is based on a different partition of the set of labels into two disjoint sets. In the second stage, the predictions of the binary classifiers are combined to produce a prediction of the original label of a test instance. Error-correcting output codes, which are a unified procedure to get the final prediction for the multi-class problem, are a popular and powerful method for class combination [8,15].
A detailed description of this unified framework is as follows. We first build $l$ binary training datasets by re-labelling the instances in the original dataset. Consequently, for each $s = 1, \ldots, l$, the binary learner is provided with labelled data of the form $(x_i, M(y_i, s))$ for all instances $x_i$ in the training set, where $M$ is the coding matrix, which depends on the type of class binarization. For the One vs Rest approach, $M$ is an $n \times n$ matrix in which all diagonal elements are $+1$ and all other elements are $-1$, and $n$ is the number of classes ($n = 5$ in our intrusion detection dataset). For the One vs One approach, $M$ is an $n \times \binom{n}{2}$ matrix in which each column corresponds to a distinct pair $(i, j)$ of classes, where $i, j = 1, 2, \ldots, 5$ and $i < j$. For each column, $M$ has $+1$ in row $i$, $-1$ in row $j$, and zeros in all other rows.

The binary learning algorithm (base learner) is used to construct classifiers, one for each column of $M$, i.e., for the set of instances induced by each column. This set is fed as training data to the learning algorithm, which finds a hypothesis $h_s$. This reduction yields $l$ different binary classifiers $h_1, \ldots, h_l$. We denote the vector of predictions of these classifiers on an instance $x$ as $h(x) = (h_1(x), \ldots, h_l(x))$. We denote the $r$th row of $M$ by $M_r$ or $M(r)$. Given a new instance $x$, we predict the label $y$ for which the row $M_y$ is the "closest" to $h(x)$. In other words, we predict the label $y$ that minimizes $d(M(y), h(x))$ for some distance $d$. Several methods of combining the $h_s$'s have been devised in the past to calculate the minimal distance $d$, thus increasing the confidence that $y$ is the correct label of $x$ according to the classifiers $h$. The following sections describe the two methods implemented in our research to calculate this distance.
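As an illustration of the coding matrices just described, the following Python sketch builds $M$ for both binarization schemes and re-labels a toy dataset for one column. It is a sketch under stated assumptions: classes are indexed from 0 rather than 1, the function names are hypothetical, and instances whose matrix entry is zero are simply dropped from the column's training set.

```python
from itertools import combinations


def one_vs_rest_matrix(n):
    """n x n coding matrix: +1 on the diagonal, -1 elsewhere."""
    return [[1 if r == s else -1 for s in range(n)] for r in range(n)]


def one_vs_one_matrix(n):
    """n x C(n, 2) coding matrix; column s encodes the class pair (i, j), i < j."""
    pairs = list(combinations(range(n), 2))
    matrix = [[0] * len(pairs) for _ in range(n)]
    for s, (i, j) in enumerate(pairs):
        matrix[i][s] = 1    # class i is the positive class in column s
        matrix[j][s] = -1   # class j is the negative class in column s
    return matrix, pairs


def relabel_for_column(labels, matrix, s):
    """Return (instance index, M(y_i, s)) pairs, skipping zero entries."""
    return [(idx, matrix[y][s]) for idx, y in enumerate(labels) if matrix[y][s] != 0]


ovr = one_vs_rest_matrix(5)
ovo, pairs = one_vs_one_matrix(5)
toy_labels = [0, 2, 1, 0, 4]  # class indices of five toy instances
print(len(ovo[0]))                             # number of One vs One columns
print(relabel_for_column(toy_labels, ovo, 0))  # training labels for pair (0, 1)
```

For $n = 5$ the One vs One matrix has ten columns, matching the ten pairwise classifiers listed in Section 2.1.2.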
2.2.2. Hamming decoding

This technique counts up the number of positions $s$ in which the sign of the prediction $h_s(x)$ differs from the matrix entry $M(r, s)$. The distance measure is

$$d_H(M(r), h(x)) = \sum_{s=1}^{l} \frac{1 - \mathrm{sign}(M(r, s)\, h_s(x))}{2} \tag{1}$$

where $\mathrm{sign}(z)$ is $+1$ if $z > 0$, $-1$ if $z < 0$ and $0$ if $z = 0$. This is essentially like computing the Hamming distance between row $M_r$ and the signs of $h_s(x)$. However, note that if either $M(r, s)$ or $h_s(x)$ is zero then that component contributes $\frac{1}{2}$ to the sum. For an instance $x$ and matrix $M$, the predicted label $y \in \{1, \ldots, n\}$ is therefore

$$y = \arg\min_{r} d_H(M(r), h(x)) \tag{2}$$
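Equations (1) and (2) can be sketched directly in Python. The snippet below is illustrative only: the function names, the three-class One vs Rest matrix and the prediction vector are made-up examples, classes are indexed from 0, and ties are broken in favor of the lowest class index (a choice the text does not specify).

```python
def sign(z):
    """Return +1, -1 or 0 according to the sign of z."""
    return (z > 0) - (z < 0)


def hamming_distance(row, predictions):
    """Eq. (1): each sign disagreement adds 1, each zero product adds 1/2."""
    return sum((1 - sign(m * h)) / 2 for m, h in zip(row, predictions))


def hamming_decode(matrix, predictions):
    """Eq. (2): index of the row of M closest to the prediction vector."""
    distances = [hamming_distance(row, predictions) for row in matrix]
    return min(range(len(matrix)), key=distances.__getitem__)


# Toy three-class One vs Rest coding matrix and classifier outputs.
M = [[1, -1, -1],
     [-1, 1, -1],
     [-1, -1, 1]]
h = [-0.8, 0.3, -0.5]
print(hamming_decode(M, h))
print(hamming_distance([1, 0, -1], [0.5, 0.9, -0.2]))
```

In the toy example only the second classifier votes positive, so the second row of $M$ has distance zero and class index 1 is predicted.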
2.2.3. Loss-based decoding

As we see, with Hamming Decoding, potentially useful information specific to the binary classifiers is ignored in the calculation of distance. To overcome this shortcoming, a method was proposed that would take into account the magnitude of the predictions, which can often be an indication of a level of confidence, as well as the relevant loss function $L$. This method was termed Loss-Based Decoding [4,8]. The idea is to choose the label $r$ that is most consistent with the predictions $h_s(x)$ in the sense that, if instance $x$ were labelled $r$, the total loss on instance $(x, r)$ would be minimized over choices of $r \in \{1, \ldots, n\}$. This means that our distance measure is the total loss on a proposed instance $(x, r)$:

$$d_L(M(r), h(x)) = \sum_{s=1}^{l} L(M(r, s)\, h_s(x)) \tag{3}$$

where $L(z)$ is the corresponding loss function [4]. Analogous to Hamming Decoding, the predicted label $y \in \{1, \ldots, n\}$ is

$$y = \arg\min_{r} d_L(M(r), h(x)) \tag{4}$$
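A minimal Python sketch of Eqs (3) and (4) follows. The text leaves the loss function $L$ unspecified, so the exponential loss $L(z) = e^{-z}$ used here is an assumption chosen purely for illustration; the function names and the toy inputs are likewise hypothetical, and classes are indexed from 0.

```python
import math


def exponential_loss(z):
    """Assumed loss L(z) = exp(-z); the source does not fix a particular L."""
    return math.exp(-z)


def loss_based_distance(row, predictions, loss=exponential_loss):
    """Eq. (3): total loss if the instance were given the label of this row."""
    return sum(loss(m * h) for m, h in zip(row, predictions))


def loss_based_decode(matrix, predictions, loss=exponential_loss):
    """Eq. (4): index of the row of M with the smallest total loss."""
    distances = [loss_based_distance(row, predictions, loss) for row in matrix]
    return min(range(len(matrix)), key=distances.__getitem__)


# Toy three-class One vs Rest coding matrix and classifier outputs.
M = [[1, -1, -1],
     [-1, 1, -1],
     [-1, -1, 1]]
h = [0.1, 0.9, -0.7]  # two positive votes, one weak and one confident
print(loss_based_decode(M, h))
```

In this toy case the first two rows are tied under Hamming Decoding (one sign disagreement each), whereas the magnitudes of the predictions let Loss-Based Decoding prefer the class with the confident positive output.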
2.2.4. Soft-max function method

The Soft-Max Function is designed depending on the type of class binarization technique used. The set of parameters for this function is determined by solving a particular optimization problem. The idea of the method is to model the relationship between each instance and its posterior probability as accurately as possible so as to reliably predict the class membership of that instance. This relationship is modelled by the soft-max function [9]. The Soft-Max combination method can be used with any class binarization technique.

Soft-max combination for one versus rest classifiers

Suppose there are $n$ classes and $m$ labelled training data $(x_1, y_1), \ldots, (x_m, y_m)$, where $y_i \in \{1, \ldots, n\}$ is the class label of the $i$th training instance $x_i$. For this $n$-class classification problem, there exists a posteriori probability $P_k^i$ for each instance $x_i$ and each class $k$, where $k \in \{1, \ldots, n\}$. $P_k^i$ therefore must satisfy $\sum_{k=1}^{n} P_k^i = 1$. For $x_i$, we denote the output (decision value) of the $k$th binary classifier $w_k$ (class $k$ versus the rest) as $r_k^i$, which is expected to be large if $x_i$ is in class $k$ and small otherwise. After constructing $n$ One vs Rest classifiers, posterior probabilities are obtained from the following soft-max function [9]:

$$P_k^i = \mathrm{Prob}(w_k \mid x_i) = \frac{e^{\omega_k r_k^i + \omega_{k0}}}{z^i} \tag{5}$$

where $z^i = \sum_{k=1}^{n} e^{\omega_k r_k^i + \omega_{k0}}$ is a normalization term. The parameters of the soft-max function, $(\omega_1, \omega_{10}), \ldots, (\omega_n, \omega_{n0})$, can be determined by minimizing the negative log-likelihood (NLL) function $E$:

$$E = -\sum_{i=1}^{m} \log P_{y_i}^i \tag{6}$$

Soft-max combination for one versus one classifiers

Following the same lines, posteriori probabilities can also be obtained by Soft-Max combination of One vs One binary classifiers. For an instance $x_i$, we denote the output of binary classifier $w_{kt}$ (class $k$ versus class $t$) as $r_{kt}^i$. Obviously we have $r_{tk}^i = -r_{kt}^i$. After constructing $n(n-1)/2$ binary classifiers, posterior probabilities are obtained from the following soft-max function:

$$P_k^i = \mathrm{Prob}(w_{kt} \mid x_i) = \frac{e^{\sum_{t \neq k} \omega_{kt} r_{kt}^i + \omega_{k0}}}{z^i} \tag{7}$$

where $z^i = \sum_{k=1}^{n} e^{\sum_{t \neq k} \omega_{kt} r_{kt}^i + \omega_{k0}}$ is a normalization term. The soft-max function parameters can be determined by solving the optimization problem shown in Eq. (6). Different weights assigned to each instance in the training dataset yield optimal parameters by minimizing the NLL function on the training data, thereby estimating posterior probabilities (Eqs (5) and (7)) based on predictions of the classifier on each weighted instance. Note that the classifier that performed best will be assigned the highest weight.
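For the One vs Rest case, Eqs (5) and (6) can be sketched as below. This is only an evaluation of the two formulas for given parameters; the optimization that actually fits $(\omega_k, \omega_{k0})$ is not shown, and the function names, the toy decision values and the parameter settings are hypothetical. Classes are indexed from 0, and the maximum score is subtracted before exponentiation purely for numerical stability, which leaves the ratio in Eq. (5) unchanged.

```python
import math


def softmax_posteriors(outputs, weights, biases):
    """Eq. (5): outputs[k] is r_k^i, weights[k] is omega_k, biases[k] is omega_k0."""
    scores = [w * r + b for r, w, b in zip(outputs, weights, biases)]
    top = max(scores)  # stabilizes exp() without changing the normalized result
    exps = [math.exp(s - top) for s in scores]
    z = sum(exps)      # normalization term z^i (up to the common factor exp(top))
    return [e / z for e in exps]


def negative_log_likelihood(all_outputs, labels, weights, biases):
    """Eq. (6): minus the summed log posterior of each instance's true class."""
    return -sum(
        math.log(softmax_posteriors(outputs, weights, biases)[y])
        for outputs, y in zip(all_outputs, labels)
    )


# Toy decision values of three One vs Rest classifiers on two instances.
all_outputs = [[1.2, -0.4, -0.9], [-0.3, 0.8, -1.1]]
labels = [0, 1]                # true class index of each instance
weights = [1.0, 1.0, 1.0]      # untrained placeholder parameters
biases = [0.0, 0.0, 0.0]

posteriors = softmax_posteriors(all_outputs[0], weights, biases)
print(posteriors)
print(negative_log_likelihood(all_outputs, labels, weights, biases))
```

A fitting procedure would adjust `weights` and `biases` to reduce the value returned by `negative_log_likelihood` on the training data.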
3. The base learners

This section presents a brief description of the five base classifiers used in our study. The aim is to provide an overview of the techniques rather than extensive algorithmic details. These base classifiers are part of the Weka data mining tool, which is a freeware application [27,42].

3.1. C4.5, The decision tree learner

The C4.5 algorithm is an inductive supervised learning system which employs decision trees to represent a classification model [25]. C4.5 consists of four principal programs: decision tree generator, production rule generator, decision tree interpreter, and production rule interpreter [37]. The algorithm uses these four programs when constructing and evaluating classification tree models. Different tree models were built by varying two parameters: the minimum number of objects per leaf node before splitting and the confidence factor for pruning [42]. The C4.5 algorithm requires certain pre-processing of data in order to build decision tree models. The requirements include attribute value description type, predefined discrete classes, and a sufficient number of observations for supervised learning. We used the Weka implementation of C4.5, named J48 [42]. The tree is initially empty, and the algorithm builds it from the root, adding decision nodes or leaf nodes as it grows. The over-fitting problem is addressed at the pruning stage.
3.2. Ripper, The rule learner

A viable competitor to decision trees is the rule-learning algorithm Ripper, short for Repeated Incremental Pruning to Produce Error Reduction [6,16]. Ripper learns from a fixed set of attributes and outputs the model in the form of rules. It is commonly used because it can efficiently process large noisy datasets and is competitive in generalization performance with many learning methods. The algorithm starts with an empty rule set. Rules are created for every class in the training set and are then pruned [42]. The rule learner, Ripper, implemented in the Weka tool is known as JRip. We will hence refer to this algorithm as JRip throughout this paper.

3.3. PART, The combinatory tree and rule learner

Both of the leading machine learning techniques explained previously, C4.5 and Ripper, use global optimization, in the sense that they induce an initial rule set and then refine it using an optimization stage that discards (for C4.5) or adjusts (for Ripper) individual rules. PART, named for the partial decision trees it builds, has an advantage in that it is a rule generator that performs no global optimization [13]. It learns one rule at a time by repeatedly generating pruned decision trees, thus combining the two major paradigms for rule generation – creating rules from decision trees and the separate-and-conquer rule-learning technique. Thus, we call it the Combinatory Tree and Rule Learner.

To make a single rule, PART builds a pruned decision tree for the current set of instances, the leaf with the largest coverage is made into a rule, and the tree is discarded. Building a "partial" tree instead of a full decision tree to extract a single rule avoids the over-pruning problem and accelerates the process significantly. A partial decision tree is an ordinary decision tree that contains branches to undefined subtrees. Construction and pruning operations are applied in order to find a "stable" subtree that can be simplified no further.
Once this stage is reached, tree-building ceases and a single rule is extracted.

3.4. IBk, The instance-based learner

Instance-based learning is a popular classification technique that is implemented in numerous statistical and pattern recognition software packages. It is based on the concept of how people use memory to perform tasks, i.e., we often recall past experiences to guide us to the solution of new problems. Instance-based learners simulate this by using a similarity metric to determine which case in memory, or the training library, is the most similar to the new instance [3]. Instance-based learners are "lazy" in the sense that they perform little work when learning from the dataset, but expend more effort classifying new examples. These systems learn new concepts by storing past cases in such a way that new examples can be directly compared with them. On the basis of this comparison the system decides the class of the new example. Weka implements two Instance-Based Learners – IB1 and IBk. IBk is an instance-based learning technique with k nearest neighbors. It expands the functionality introduced in IB1, since selecting only one nearest neighbor can cause predictions and classifications to be unduly influenced by noisy exemplars.

3.5. LWL, The memory-based learner

LWL (Locally Weighted Learning) is a non-adaptive lazy learning regression technique intended primarily for prediction. It defers processing of training data until a query is answered. This involves
storing the training data in memory and finding relevant data in the database to answer a particular query [5]. Hence, this learning is also referred to as Memory-Based Learning. LWL uses locally-weighted training to combine training data, using a distance function to fit a surface to nearby points. It is used in conjunction with another classifier to perform classification rather than prediction. The four components that define LWL are: a distance metric, near neighbors, a weighting function, and fitting the local model [39].

4. System description

The Information Systems Technology (IST) Group of MIT Lincoln Laboratory, under AFRL/SNHS sponsorship, has collected and distributed the first standard framework for the evaluation of computer network intrusion detection systems. The efforts have involved the execution of a variety of simulated attacks over a number of platforms using scripts and programs in a closed test environment in order to collect the audit data useful for the evaluation of intrusion detection systems [19,32,34]. The datasets produced as a result of the effort were characterized by three types of attributes (independent variables): basic, content and traffic.

Basic Attributes contain information about network packet level data, also referred to as the intrinsic attributes of a network packet, and are the basic features that can be extracted from a TCP network stream. They are the quickest to extract, and need no more processing than simple packet parsing.

Content Attributes summarize the behavior of the network based on the payload or the content of a packet. For example, each failed login that results from entering a wrong password increments the num_failed_logins attribute. This records malicious activity based on content carried over the network aimed at causing an intrusion.

Traffic Attributes measure rates or occurrence of network events over a number of ports.
These are the most expensive metrics to derive from a stream of traffic. They are often significant in identifying some important attacks like denial of service (DOS).

Our experiment used the dataset including all three types of attributes. There were a total of 41 attributes, of which 3 were categorical variables (protocol, service and connection flags) and the remaining were numerical (source and destination bytes, number of failed logins, etc.). The dependent variable (class) was the attack category, which belonged to one of five categories: Normal, Probe (surveillance), DOS (denial-of-service), U2R (unauthorized access to local superuser (root) privileges) and R2L (unauthorized access from a remote machine).

4.1. Data preprocessing

The original dataset, which was composed of approximately five million records, included attacks that appeared in extremely different proportions, with DOS containing the overwhelming majority. The U2R attack group formed a negligible proportion relative to the massive size of the dataset. Another problem in the dataset was the presence of noise. By that we mean that there were some instances in the dataset that had identical attribute values for the given attributes but were labelled with a different class. This obviously defeats the purpose of coherent learning by any data mining algorithm. Therefore, a considerable amount of data preprocessing had to be undertaken before performing any modeling experiments.

In addition, even after attempting to modify the dataset's attack proportions and removing noise, the large size of the dataset still made it difficult for any modeling technique to be applied in a reasonable execution time. Data sampling was necessary for a more feasible modeling task. It was necessary to ensure, though, that the reduced dataset was as representative of the original dataset as possible.
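The kind of label noise described above (identical attribute values, different class labels) can be detected with a short Python sketch. The text does not say how the conflicting instances were handled, so the policy shown here, dropping every record that belongs to a conflicting group, is an assumption; the function name and the toy records are hypothetical.

```python
from collections import defaultdict


def remove_conflicting_records(records):
    """Drop records whose attribute tuple occurs with more than one class label.

    records: list of (attributes, label) pairs, where attributes is a tuple.
    Assumed policy: all members of a conflicting group are discarded.
    """
    labels_seen = defaultdict(set)
    for attributes, label in records:
        labels_seen[attributes].add(label)
    return [(a, y) for a, y in records if len(labels_seen[a]) == 1]


# Toy records with three made-up attributes each.
records = [
    (("tcp", "http", 215), "Normal"),
    (("tcp", "http", 215), "DOS"),      # same attributes, different label: noise
    (("icmp", "ecr_i", 1032), "DOS"),
    (("icmp", "ecr_i", 1032), "DOS"),   # exact duplicate, consistent label
]
clean = remove_conflicting_records(records)
print(len(clean))
```

Consistent duplicates are kept by this sketch; reducing redundant records is a separate sampling step.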
Table 1
Distribution of Test Dataset

Category        Training data   Test data – T1   Test data – TT1
Normal          88.44%          89.69%           90.16%
Probe           3.73%           4.10%            3.76%
DOS             6.68%           5.19%            6.68%
U2R             0.12%           0.36%            0.18%
R2L             1.02%           0.66%            1.06%
Total records   43,994          20,012           21,580
The entire dataset was originally composed of 4,940,208 records. Out of that total, almost 80% was made up of DOS attacks. More specifically, the Smurf and Neptune attacks, which belong to the DOS category, made up the great majority of that attack group. One of the main reasons behind this is inherent in the nature of denial-of-service attacks. Such attacks are often characterized by a large number of network connections that are usually recorded during flooding attacks that intend to compromise a networked machine or cause it to overflow its applications' buffers.

As in prior research [32], we decided to reduce the quantities of redundant DOS class instances, especially from the two largest attacks in that category, namely, Smurf and Neptune. We decided to keep a factor of 1/100 of the total occurrences of each of Smurf and Neptune. This choice was also motivated by prior work that did the same, but with a different reduction factor (1/20) [29]. All other attack instance proportions remained unaltered. It is important to point out that these modifications were carried out on a reduced version of the five million record set, i.e., 494,023 records. This reduction was already prepared and made available at the UCI (University of California, Irvine) data repository. Note that after removing many redundant instances of Smurf and Neptune from the reduced dataset, the remaining training dataset had 109,910 records. A further reduced training dataset was formed by simply sampling from this dataset with the sampling factor 2/5.

A given attack category consisted of multiple attack types that were similar in nature. The test dataset available from the UCI repository included instances with attack types not seen in the training dataset. However, these new attack types belonged to one of the five attack categories. We made two versions of the test dataset. The first contained test records that belonged to attack types also available in the training dataset.
We denote this dataset as TT1. The second test dataset included traffic instances that belong to attack types not found in the training set. We denote this dataset as T1. Evaluation with this test dataset helped demonstrate the ability of our selected model to generalize to "unseen" instances. By "seen" we refer to attack types that the model has been trained on during model building, as they were present in the training dataset. "Unseen" instances belong to attack types that are new. A further detailed discussion of our data processing is presented in [1]. The distributions of attack categories in our final training dataset, test dataset T1 and test dataset TT1 are listed in Table 1.

5. Experiments

The cost-sensitive C4.5 and JRip classifiers were built using optimal parameters as well as default parameters (2) for both Direct and Indirect methods. The PART, IBk and LWL classifiers were built using both default and optimal parameters for the Direct method, and with default parameters for the Indirect method. In addition, we illustrate the use of Expected Cost of Misclassification (ECM) [22] for selecting the intrusion detection models, for analyzing results from the Direct and Indirect methods, and for the comparison of our chosen learners based on these results.

(2) Optimal parameters refer to those parameters for which the model can have the minimal Expected Cost of Misclassification [22]. Default parameters refer to those parameters that the Weka Data Mining tool [42] adopts without specification.

5.1. Expected cost of misclassification

In network intrusion detection, the cost of acting on each type of erroneous prediction is not the same. The cost matrix used for scoring entries provided by the KDD Cup'99 classifier learning contests [10] is shown in Table 2. The rows correspond to predicted categories, while columns correspond to actual categories. From the cost matrix, we observe that each intrusion attack category has a different severity. For example, R2L is the most serious attack category. If it is misclassified as Normal, the cost is 4. In addition, the diagonal elements of the cost matrix are all zeros, which implies the cost is zero when an instance is correctly classified.

Table 2
Cost Matrix

classified as →   Normal   Probe   DOS   U2R   R2L
Normal            0        1       2     3     4
Probe             1        0       1     2     2
DOS               2        2       0     2     2
U2R               2        2       2     0     2
R2L               2        2       2     2     0

A preferred classification rule attempts to minimize the Expected Cost of Misclassification (ECM), which is defined as follows [22]:

$$\mathrm{ECM} = \frac{1}{N} \left( \sum_{i=1}^{n} \sum_{j=1}^{n} C_{ij} \times N_{ij} \right) \tag{8}$$
where $C_{ij}$ is the misclassification cost incurred when an instance of class $i$ is classified as class $j$, and $N_{ij}$ is the number of misclassified instances predicted as class $j$ whose actual class is $i$. In our case, we have five classes, so $n = 5$. $N$ is the total number of instances in the dataset.

5.2. Results and analysis

We implemented the multi-category classification by combining binary classifiers. The base classifiers were used to build the binary models from the appropriate datasets, depending on the binarization technique used. Once the binary classification problems have been solved, the best model of each binary classifier, i.e., the model whose parameters yield the minimal ECM, is fed into each of the combiners (Hamming Decoding, Loss-Based Decoding and Soft-Max Function) to obtain the final solution. We analyze these results to find which combining strategy and which binarization technique performed best. This analysis is then compared with the results obtained from the Direct multi-category classification approach, implemented using the same base classification algorithms from the Weka Data Mining tool.

For each base classifier, we began with the One vs Rest binarization technique, for which we built five binary classifiers. We then conducted experiments with the One vs One binarization technique, for which we built ten binary classifiers. We started the Indirect method with C4.5 as the base learner, followed by JRip. Experiments were performed for optimal parameters as well as default parameters. For PART, IBk and LWL, in contrast, we used only the default parameters for the Indirect method because of the poor performance of the optimal parameters on the test datasets when using C4.5 and JRip. This is explained in detail in the following sections.

Table 3
C4.5 Training Result with M = 2 and C = 0.75 for the Normal vs Rest Classifier

  cr        aa      ab     ba      bb      ECM
 0.01     38749    161     17    5067    0.0074
 0.1      38867     43     32    5052    0.0035
 1        38884     26     39    5045    0.0033
 2        38891     19     51    5033    0.0037
 3        38894     16     61    5023    0.0041
 4        38893     17     65    5019    0.0044
 5        38891     19     64    5020    0.0044
 6        38889     21     69    5015    0.0048
 7        38892     18     66    5018    0.0045
 8        38895     15     84    5000    0.0054
 9        38891     19     84    5000    0.0055
 10       38891     19     80    5004    0.0053

5.2.1. Results of C4.5 and JRip

We conducted experiments for the base learner C4.5 with optimal parameters as well as default parameters. In the Indirect classification approach, optimal parameters are those whose values, tuned in the binary classifiers, minimize the ECM value of the binary model, while default parameters are those whose values are adopted by the Weka tool without specification. The C4.5 algorithm that comes with Weka has two pruning modes: Reduced Error Pruning (REP) and Subtree Raising. We used the default pruning option, Subtree Raising, to conduct our experiments. A brief description follows.

For the experiments with optimal parameters, we varied the parameter M, the minimum number of objects per leaf node, through 2, 5, 10, 20, 50 and 100, and the parameter C, the confidence factor for pruning, through 0.25, 0.50 and 0.75. Table 3 shows an example of the selected optimal model from the ten-fold cross validation³ results of C4.5 for the One vs Rest binarization technique. The table summarizes the C4.5 binary models that were built with the parameter values M = 2 and C = 0.75. The models differ in the modeling cost ratio parameter (cr), which we varied over a range of values for each fixed pair of values of M and C. The ‘aa’ (‘bb’) column lists the number of instances belonging to class ‘a’ (‘b’) that were correctly predicted. The ‘ab’ (‘ba’) column lists the number of instances that actually belong to class ‘a’ (‘b’) but are predicted to be of class ‘b’ (‘a’). The last column lists the ECM values. The models in the table are sorted by modeling cost ratio.
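The selection step described above can be illustrated with the values reported in Table 3. The following minimal Python sketch picks the row with the smallest ECM and checks that every row accounts for the same number of training records. The tuple layout and the function name `select_min_ecm` are illustrative assumptions and are not part of the original study's tooling.

```python
# Rows of Table 3: (cr, aa, ab, ba, bb, ECM) for Normal vs Rest, M = 2, C = 0.75.
TABLE3 = [
    (0.01, 38749, 161, 17, 5067, 0.0074),
    (0.1, 38867, 43, 32, 5052, 0.0035),
    (1, 38884, 26, 39, 5045, 0.0033),
    (2, 38891, 19, 51, 5033, 0.0037),
    (3, 38894, 16, 61, 5023, 0.0041),
    (4, 38893, 17, 65, 5019, 0.0044),
    (5, 38891, 19, 64, 5020, 0.0044),
    (6, 38889, 21, 69, 5015, 0.0048),
    (7, 38892, 18, 66, 5018, 0.0045),
    (8, 38895, 15, 84, 5000, 0.0054),
    (9, 38891, 19, 84, 5000, 0.0055),
    (10, 38891, 19, 80, 5004, 0.0053),
]


def select_min_ecm(rows):
    """Return the row (one candidate model) with the smallest ECM."""
    return min(rows, key=lambda row: row[5])


# Each row partitions the same training set into aa + ab + ba + bb instances.
totals = {aa + ab + ba + bb for _, aa, ab, ba, bb, _ in TABLE3}

best = select_min_ecm(TABLE3)
print(totals)            # {43994}
print(best[0], best[5])  # 1 0.0033
```

In the study, the same comparison is made across all candidate values of M, C and cr for each binary classifier; the sketch covers only the twelve cr values shown for M = 2 and C = 0.75.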
Here, ‘a’ and ‘b’ denote the first and second labels, respectively, of the pair for which the binary model is built; e.g., for the Normal vs Rest model, ‘a’ is ‘Normal’ and ‘b’ is ‘Rest’. It should be noted that we tuned three parameters, M, C and cr, and found that the optimal model occurred at M = 2, C = 0.75 and cr = 1 (marked in bold in the table) for the binary classifier Normal vs Rest.

³ Ten-fold cross validation is used throughout our study for selecting the respective classification models.

The optimal parameters selected for each binary classifier from the One vs Rest technique were applied to the respective combiners (Hamming Decoding, Loss-Based Decoding, and Soft-Max Function) to carry out the combination task for the final model. Along the same lines, the optimal parameters selected for the ten binary classifiers from the One vs One technique were applied to the three respective combiners to obtain the final predictions. In addition, we conducted similar experiments with the default parameters provided by the Weka tool and obtained the predictions for the six Indirect classification approaches. Since we would like to compare the Indirect and Direct classification approaches, we also carried out experiments using the Direct classification method with the optimal and default parameters.

Table 4
C4.5: Misclassification Costs for the Training, T1 and TT1 Datasets

                         C.V.                 T1                  TT1
                   Optimal  Default    Optimal  Default    Optimal  Default
DIRECT               177      182       1602     1977       1715     2001
INDIRECT
 One vs One
  Hamming            115      195       2800     1488       3235     1654
  Loss-Based         109      172       2800     1516       3233     1693
  Soft-Max           109      183       2868     3642       3281     3904
 One vs Rest
  Hamming            154      252       2758     3024       3001     3246
  Loss-Based         140      181       2742     3043       2978     3290
  Soft-Max           125      211       3535     3498       3592     3973

Table 4 summarizes the misclassification costs for the C4.5 models built with the training dataset and tested with the T1 and TT1 test datasets. We used the misclassification cost, MC⁴, rather than ECM to evaluate the performance of a given model, because ECM values are very small, which makes comparisons between the different classification techniques difficult. The cross-validation (C.V.) results on the training dataset are better than the prediction results for the T1 and TT1 test datasets, as shown in Table 4. This is a typical overfitting problem⁵. It should be noted that it is not fair to compare the performance of a model on two datasets in terms of the total misclassification cost unless both datasets have the same number of observations. Therefore, the average misclassification cost, i.e., ECM, should be used instead when the datasets differ in the number of observations.

For the Indirect method with Hamming and Loss-Based Decoding, models based on default parameters yielded better test dataset results for One vs One than those based on optimal parameters. In addition, those models had prediction results for One vs Rest similar to those of the models based on optimal parameters. This could also be due to overfitting of the model on the fit dataset caused by the selected ‘optimal’ parameters.
A further discussion of the overfitting problem is beyond the scope of this paper, but will be part of our future work. Moreover, the performance of the One vs One model is better than or similar to (within 10% of) that of the One vs Rest model for the default parameters. The Loss-Based and Hamming Decoding techniques seem to outperform the Soft-Max Function. For the Direct method, experiments conducted with optimal parameters showed better results for cross-validation as well as for the test datasets, T1 and TT1, than those with default parameters. Overall, the Loss-Based method for combining the One vs One classifiers using default parameters performed better than the Direct classification method for both cross-validation and the test datasets.

⁴ $\mathrm{MC} = \sum_{i=1}^{n} \sum_{j=1}^{n} (C_{ij} \times N_{ij})$.

⁵ Overfitting occurs when the performance of the model on the test dataset is not as good as its performance on the fit dataset.
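The relation between the cost matrix of Table 2, the misclassification cost MC and the ECM of Eq. (8) can be sketched in Python as follows. The cost matrix is re-indexed here as `COST[actual][predicted]`, matching the subscripts of $C_{ij}$. The confusion matrix is hypothetical and serves only to illustrate the arithmetic, and the function names are assumptions. The last lines check one reported value: the cross-validation cost of 172 (Table 4, One vs One with Loss-Based Decoding, default parameters) over the 43,994 training records counted in each row of Table 3 corresponds to an ECM of about 0.0039.

```python
CLASSES = ["Normal", "Prob", "DOS", "U2R", "R2L"]

# COST[i][j]: cost of classifying an instance of actual class i as class j
# (Table 2 with the first index taken as the actual category).
COST = [
    [0, 1, 2, 2, 2],  # actual Normal
    [1, 0, 2, 2, 2],  # actual Prob
    [2, 1, 0, 2, 2],  # actual DOS
    [3, 2, 2, 0, 2],  # actual U2R
    [4, 2, 2, 2, 0],  # actual R2L
]

# Hypothetical confusion matrix N[i][j] (actual i, predicted j), illustration only.
CONFUSION = [
    [95, 2, 1, 1, 1],
    [1, 9, 0, 0, 0],
    [2, 0, 38, 0, 0],
    [1, 0, 0, 1, 0],
    [3, 0, 0, 0, 2],
]


def misclassification_cost(cost, confusion):
    """MC: sum over all (i, j) of C_ij * N_ij."""
    return sum(c * n
               for cost_row, conf_row in zip(cost, confusion)
               for c, n in zip(cost_row, conf_row))


def expected_cost(cost, confusion):
    """ECM: MC divided by the total number of instances N."""
    total = sum(sum(row) for row in confusion)
    return misclassification_cost(cost, confusion) / total


mc = misclassification_cost(COST, CONFUSION)
ecm = expected_cost(COST, CONFUSION)
print(mc)                       # 28
print(round(ecm, 4))            # 0.1783

# Reported C4.5 cross-validation cost over the 43,994 training records.
print(round(172 / 43994, 4))    # 0.0039
```

Because ECM is simply MC divided by the dataset size, comparing MC values is equivalent to comparing ECM values only when the datasets have the same number of instances.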
Table 5
JRip: Misclassification Costs for the Training, T1 and TT1 Datasets

                         C.V.                 T1                  TT1
                   Optimal  Default    Optimal  Default    Optimal  Default
DIRECT               117      143      29011    29284      29074    29327
INDIRECT
 One vs One
  Hamming            155      121       1522     1480       1630     1583
  Loss-Based         155      119       1522     1482       1630     1583
  Soft-Max           433      114      29252    24813      29409    25097
 One vs Rest
  Hamming            183      164       1847     1681       1819     1779
  Loss-Based         180      164       1839     1680       1819     1781
  Soft-Max           133      126      29595    29288      29845    29482
All experiments above were carried out for the base learner C4.5. These experiments were then repeated in a similar manner with JRip as the base learner, in order to compare the performance of different learners for the Indirect and Direct methods. For the experiments with optimal parameters for JRip, we varied the parameter F, the number of folds for reduced error pruning, through 3 and 5; the parameter W, the minimal weight of instances within a split, through 2, 3 and 5; and the parameter O, the number of optimization runs, through 2, 5 and 10. Increasing the number of optimization runs improves the performance of the classifier, at the cost of speed.

Table 5 summarizes the misclassification costs for the JRip models built with the training dataset and tested with the T1 and TT1 test datasets. From Table 5, note that the cross-validation results are better than the prediction results obtained by applying the preferred model to T1 and TT1. Moreover, the performance of the One vs One model is better than or similar to (within 10% of) that of the One vs Rest model. The performance of the Loss-Based technique seems to be similar to that of the Hamming Decoding technique. The Soft-Max Function, however, performed very poorly on the test datasets: a large proportion of the Normal instances were detected as U2R. This is because the Soft-Max Function gives more weight to those binary classifiers that build more accurate models. Further analysis revealed that the U2R binary models had the highest detection rate (i.e., from a set of 43,994 records only one or two were misclassified). Hence, the majority of Normal instances were predicted as U2R. Similar results were observed in the experiments for the Direct classification method. In addition, the rule-based nature of the algorithm could be a reason behind this behavior. Overall, JRip shows weak robustness and high sensitivity to the distribution of the dataset.
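For concreteness, the two binarization schemes and the Hamming Decoding combiner compared above can be sketched as follows. The sketch follows the general code-matrix formulation of Allwein et al. [4]; the function names, the +1/−1/0 encoding and the example classifier outputs are illustrative assumptions and may differ in detail from the implementation used in this study.

```python
from itertools import combinations

CLASSES = ["Normal", "Prob", "DOS", "U2R", "R2L"]


def one_vs_rest_matrix(k):
    """Code matrix with one column per class: +1 for that class, -1 for the rest."""
    return [[1 if r == s else -1 for s in range(k)] for r in range(k)]


def one_vs_one_matrix(k):
    """Code matrix with one column per class pair (a, b): +1 for a, -1 for b, 0 otherwise."""
    pairs = list(combinations(range(k), 2))
    return [[1 if r == a else -1 if r == b else 0 for a, b in pairs]
            for r in range(k)]


def sign(v):
    return (v > 0) - (v < 0)


def hamming_decode(matrix, outputs):
    """Return the index of the class whose code row is closest to the binary outputs.

    An agreeing column adds 0, a disagreeing column adds 1, and a column in
    which the class does not take part (code 0) adds 1/2.
    """
    distances = [sum((1 - sign(m * f)) / 2 for m, f in zip(row, outputs))
                 for row in matrix]
    return min(range(len(matrix)), key=distances.__getitem__)


ovr = one_vs_rest_matrix(len(CLASSES))   # 5 binary problems
ovo = one_vs_one_matrix(len(CLASSES))    # 10 binary problems

# Hypothetical binary outputs (+1 / -1), for illustration only.
ovr_outputs = [-1, -1, 1, -1, -1]                  # only "DOS vs Rest" fires
ovo_outputs = [1, 1, 1, -1, 1, 1, -1, 1, -1, -1]   # R2L wins all four of its pairs

print(CLASSES[hamming_decode(ovr, ovr_outputs)])   # DOS
print(CLASSES[hamming_decode(ovo, ovo_outputs)])   # R2L
```

Hamming Decoding uses only the sign of each binary output, which is one plausible reason why it can behave differently from combiners that also weigh the magnitude or reliability of each binary model.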
For the Indirect method, experiments conducted with default parameters showed better or similar results compared to those with optimal parameters on cross-validation as well as on the test datasets T1 and TT1. Experiments with default parameters for the Indirect method generally outperformed experiments with optimal parameters on the test datasets and, as seen in the case of JRip, on cross-validation too. Hence, we carried out experiments with the remaining three algorithms (PART, IBk and LWL) using default parameters only. For the Direct classification method, however, we conducted experiments with both default and optimal parameters, since the optimal model performed better than (for C4.5) or similar to (for JRip) the default model.

5.2.2. Results of PART, IBk and LWL

We carried out further experiments for the Indirect method of classification using PART, IBk and LWL, respectively, as the base learners. The results of these experiments were then compared with those of the Direct method carried out for the same base learners. A brief description of the experiments using the three base classifiers follows.

For PART, we conducted experiments with the default parameters M = 2, the minimum number of objects per leaf node; C = 0.25, the confidence factor for pruning; and F = 3, the number of folds for reduced error pruning. For IBk, we conducted experiments with the default parameter k = 1, the number of nearest neighbors. LWL performs classification when employed in conjunction with another classifier. For this purpose, we adopted the default classifier, “Decision Stump”. We conducted experiments with the default parameter k = −1, the number of nearest neighbors (−1 implies all nearest neighbors), and with the default weighting function, i.e., W = 0 (the linear weighting function).

Tables 6, 7 and 8 summarize the misclassification costs for the PART, IBk and LWL models built with the training dataset and tested with the T1 and TT1 test datasets.

Table 6
PART: Misclassification Costs for the Training, T1 and TT1 Datasets

                      C.V.       T1      TT1
DIRECT
  Optimal              132     1659     1844
  Default              178     2444     2563
INDIRECT
 One vs One
  Hamming              150     1480     1577
  Loss-Based           134     1236     1265
  Soft-Max             145     5939     8133
 One vs Rest
  Hamming              186     5956     8756
  Loss-Based           157     5474     7814
  Soft-Max             171    21380    21630

Table 7
IBk: Misclassification Costs for the Training, T1 and TT1 Datasets

                      C.V.       T1      TT1
DIRECT
  Optimal & Default    156     4718     4783
INDIRECT
 One vs One
  Hamming              148     4713     4775
  Loss-Based           148     4713     4775
  Soft-Max             148     4721     4779
 One vs Rest
  Hamming              156     4718     4783
  Loss-Based           156     4718     4783
  Soft-Max             156     4718     4784

Table 8
LWL: Misclassification Costs for the Training, T1 and TT1 Datasets

                      C.V.       T1      TT1
DIRECT
  Optimal & Default    305     4660     4816
INDIRECT
 One vs One
  Hamming              198     2008     2262
  Loss-Based           162     1783     2056
  Soft-Max             208     4886     5047
 One vs Rest
  Hamming              258     2133     2347
  Loss-Based           210     1885     2117
  Soft-Max             265     4978     5150

The cross-validation results are better than the prediction results when the PART models are applied to the test datasets. The performance of the One vs One model is significantly better than that of the One vs Rest model. For the Direct method, the optimal model outperformed the default model. On the other hand, the Indirect method using the Loss-Based technique to combine One vs One classifiers seems to outperform the other two combiner techniques as well as the Direct classification method. This is evident from the cross-validation results, but more so from the results for the test datasets T1 and TT1.

Similar to the PART models, the cross-validation results of the IBk models are better than the results on T1 and TT1. Moreover, the performance of the One vs One model is similar to that of the Direct approach and the One vs Rest model. The results for the One vs Rest and the Direct method on the fit and test datasets are consistent. The three combiner techniques show very consistent results for the One vs One and the One vs Rest models, respectively. However, if we observe all the misclassification costs closely, we can still conclude that the performance of the Loss-Based technique for the One vs One approach is better than that of the Direct classification approach for cross-validation and the test datasets T1 and TT1.

It is observed from Table 8 that the cross-validation results of the LWL models are better than the results on the test datasets. Moreover, the performance of the One vs One model is better than that of the One vs Rest model. The Direct method of multi-group classification yielded poor results. We can conclude that the Loss-Based technique for combining the One vs One classifiers outperforms the other two combiner techniques and the Direct classification approach for cross-validation as well as for the test datasets T1 and TT1.

As discussed earlier, T1 has both “seen” and “unseen” instances and TT1 has only “seen” instances from the training dataset.
Hence, one would assume that the misclassification cost of the model applied to T1 would be higher than that for TT1, since there would be more misclassifications in the case of T1. However, our observations from the experiments performed on all learners were the opposite. This could be due to the fact that the distribution of the Normal category in T1 is closer to that of the training dataset than the distribution in TT1, and this category makes up the overwhelming majority of the fit and test datasets. Hence, as observed from the confusion matrix [21], the differing numbers of misclassifications in this category (Normal) for T1 and TT1 caused a much larger misclassification cost for TT1.

5.2.3. Performance comparison of modeling techniques

From the experiments conducted using the five base classifiers (Tables 4 to 8), we selected the result that corresponds to the lowest value (or close to the lowest value⁶) of misclassification cost and/or ECM for each of the learners. The selected results show that the Loss-Based technique for combining One vs One classifiers, using default parameters, performed the best for all five learners. We tabulate these misclassification costs and the corresponding ECM values in Table 9 for all five learners and hence evaluate their performance. Table 9 shows that PART, C4.5 and JRip performed reasonably well in terms of misclassification cost and ECM. Also, the results for the fitted model applied to the test datasets T1 and TT1 show that PART outperformed every other learner, yielding the lowest values of ECM and misclassification cost.

⁶ Some base learners, such as C4.5 and JRip, do not exhibit a consistently best result for both cross-validation and prediction; e.g., the Loss-Based combiner had better quality of fit, while the Hamming combiner had better predictive output (see Tables 4 and 5). Overall, however, both combiners had similar performance.

Table 9
The Lowest Misclassification Cost (MC) and ECM Values for the Five Learners

                  C.V.               T1                TT1
Learner        MC     ECM        MC     ECM        MC     ECM
C4.5          172   0.0039     1516   0.0757     1693   0.0785
JRip          119   0.0027     1482   0.0740     1583   0.0733
PART          134   0.0030     1236   0.0617     1265   0.0586
IBk           148   0.0034     4713   0.2355     4775   0.2213
LWL           162   0.0037     1783   0.0891     2056   0.0953

6. Conclusions

This paper compared six different Indirect classification techniques (each run with five different base classifiers) with five different Direct classification techniques. The Indirect classification method breaks down the multi-category problem into multiple binary problems that are solved individually and then combined to yield a final solution. Model selection and evaluation were performed based on the misclassification cost and/or the Expected Cost of Misclassification.

It is difficult to conclude which of the different Direct and Indirect classification methods consistently performs better than the others. Many factors may influence the conclusions, such as the chosen base learner, the class binarization technique, and the combining strategy adopted. However, for the Indirect classification methods a general conclusion can be drawn: the One vs One binarization technique generally performed better than the One vs Rest technique, and the Loss-Based combining strategy generally performed better than the Hamming Decoding and Soft-Max Function combining strategies. Hence, for all learners the Loss-Based technique for combining One vs One classifiers generally yielded the lowest misclassification costs and ECM. We compared these values for all learners and concluded that C4.5, JRip and PART performed well, with PART yielding the lowest ECM for the test datasets.

Considering the run times of the different Direct classification techniques, we observed that the JRip, C4.5 and PART techniques build models in a reasonable amount of time. On the other hand, modeling with IBk and LWL was very time-consuming. IBk took a couple of weeks, whereas LWL took months to complete all required experiments, even with attempts at parallel processing. An effective Intrusion Detection System must detect malicious activity in real time. Thus, a fast and accurate response is imperative in a real-time processing environment and marks the efficiency of the learner. It is difficult to judge any particular algorithm, because a learner may behave differently under different performance measures.
From our experience with this study, we note that in terms of factors such as stability, detection accuracy and modeling time, which are critical to real-time IDSs, PART outperformed all other learners we considered.

The Indirect five-group classification approach implemented in this paper used a single base classifier (one of the five Direct classification techniques) for all experiments. In order to improve the detection rate, different base learners could be used with our combining techniques; e.g., C4.5 could be used for the One vs Rest Normal classifier, JRip for the corresponding One vs Rest DOS classifier, and so on. These classifiers could then be combined, likely providing improved results. This idea is prompted by the intrinsic nature of certain binary class problems, which may be more suitable for specific machine learning techniques. In addition, future work could be extended to allow dynamic model building and evaluation using real-time network traffic flow.

Acknowledgments

We thank the anonymous reviewers for their comments and suggestions. We also thank Naeem Seliya for his assistance with modifications, suggestions, and patient reviews of this paper.
References

[1] M. Abushadi, Resource-sensitive intrusion detection models for network traffic, Master’s thesis, Florida Atlantic University, Boca Raton, FL, USA, December 2003. Advised by Taghi M. Khoshgoftaar.
[2] R. Agarwal and M.V. Joshi, PNrule: A new framework for learning classifier models in data mining, Technical Report TR 00-015, Department of Computer Science, University of Minnesota, 2000.
[3] D. Aha and D. Kibler, Instance-based learning algorithms, Machine Learning 6 (1991), 37–66.
[4] E.L. Allwein, R.E. Schapire and Y. Singer, Reducing multiclass to binary: A unifying approach for margin classifiers, in: Proceedings of the 17th International Conference on Machine Learning, Morgan Kaufmann, San Francisco, CA, 2000, pp. 9–16.
[5] C. Atkeson, A. Moore, S. Schaal and D. Kibler, Locally weighted learning, Machine Learning 11 (1997), 11–73.
[6] W.W. Cohen, Fast effective rule induction, in: Proceedings of the 12th International Conference on Machine Learning (ML-95), A. Prieditis and S. Russell, eds, Lake Tahoe, California, 1995, Morgan Kaufmann, pp. 115–123.
[7] D.E. Denning, An intrusion-detection model, IEEE Transactions on Software Engineering 13(2) (1987), 222–232.
[8] T.G. Dietterich and G. Bakiri, Solving multiclass learning problems via error-correcting output codes, Journal of Artificial Intelligence Research 2 (1995), 263–286.
[9] K. Duan, S.S. Keerthi and W. Chu, Multi-category classification by soft-max combination of binary classifiers, in: Multiple Classifier Systems, Lecture Notes in Computer Science (Vol. 2709), Guilford, UK, June 11–13 2003, Springer, pp. 125–134.
[10] C. Elkan, Results of the KDD’99 classifier learning contest, in: Proceedings: International Conference on Software Maintenance, Orlando, Florida, USA, November 1999, IEEE Computer Society, pp. 328–336.
[11] E. Eskin, A. Arnold, M. Prerau, L. Portnoy and S. Stolfo, A geometric framework for unsupervised anomaly detection: Detecting intrusions in unlabeled data, Applications of Data Mining in Computer Security (2002).
[12] S. Fine, G. Saon and R.A. Gopinath, Digit recognition in noisy environments via a sequential GMM/SVM system, in: IEEE International Conference on Acoustics, Speech, and Signal Processing 1 (May 13–17 2002), I-49–I-52.
[13] E. Frank and I.H. Witten, Generating accurate rule sets without global optimization, in: Proceedings of the 15th International Conference on Machine Learning, San Francisco, CA, 1998, Morgan Kaufmann, pp. 144–151.
[14] J. Fürnkranz, Pairwise classification as an ensemble technique, in: Proceedings of the 13th European Conference on Machine Learning (ECML’02), Helsinki, Finland, 2002.
[15] J. Fürnkranz, Round robin classification, Machine Learning Research 2 (2002), 721–747.
[16] J. Fürnkranz and G. Widmer, Incremental reduced error pruning, in: Proceedings of the Eleventh International Conference on Machine Learning, New Brunswick, NJ, 1994, pp. 70–77.
[17] J. Gomez and D. Dasgupta, Evolving fuzzy classifiers for intrusion detection, in: Proceedings of the 2002 IEEE Workshop on Information Assurance, 2002.
[18] J. Gomez, D. Dasgupta, O. Nasaroui and F. Gonzalez, Complete expressions trees for evolving fuzzy classifier systems with genetic algorithms and application to network intrusion detection, in: Proceedings of the North American Fuzzy Information Processing Society, June 2002, pp. 63–78.
[19] J.W. Haines, L.M. Rossey, R.P. Lippmann and R.K. Cunningham, Extending the DARPA off-line intrusion detection evaluations, in: Proceedings of the 2nd DARPA Information Survivability Conference and Exposition (Vol. 1), Anaheim, California, June 12–14 2001, pp. 35–45.
[20] T. Hastie and R. Tibshirani, Classification by pairwise coupling, in: Advances in Neural Information Processing Systems (Vol. 10), M.I. Jordan, M.J. Kearns and S.A. Solla, eds, The MIT Press, 1998.
[21] N.H. Ibrahim, Evaluating indirect and direct classification techniques for network intrusion detection, Master’s thesis, Florida Atlantic University, Boca Raton, FL, USA, April 2004. Advised by Taghi M. Khoshgoftaar.
[22] R.A. Johnson and D.W. Wichern, Applied Multivariate Statistical Analysis, 2nd edition, Prentice Hall, Englewood Cliffs, NJ, USA, 1992.
[23] T.M. Khoshgoftaar and M. Abushadi, Resource-sensitive intrusion detection models for network traffic, in: Proceedings of the 8th International Symposium on High Assurance Systems Engineering, Tampa, Florida, USA, March 2004, IEEE Computer Society, pp. 249–258.
[24] T.M. Khoshgoftaar, K. Gao and Y. Wang, A comparative study of classification algorithms for network intrusion detection, in: Proceedings: 10th International Conference on Reliability and Quality in Design, H. Pham and M.-W. Lu, eds, Las Vegas, Nevada, USA, August 2004, International Society of Science and Applied Technologies, pp. 168–172.
[25] T.M. Khoshgoftaar and N. Seliya, Comparative assessment of software quality classification techniques: An empirical case study, Empirical Software Engineering Journal 9(3) (2004), 229–257.
[26] T.M. Khoshgoftaar, N. Seliya and K. Gao, Assessment of a new three-group software quality classification technique: An empirical case study, Empirical Software Engineering Journal 10 (2005), 183–218.
[27] T.M. Khoshgoftaar, S. Zhong and V. Joshi, Noise elimination with ensemble-classifier filtering for software quality estimation, Intelligent Data Analysis: An International Journal 9(1) (2005), 1–25.
[28] A. Klautau, N. Jevtic and A. Orlitsky, Combined binary classifiers with applications to speech recognition, in: Seventh International Conference on Spoken Language Processing, Denver, Colorado, USA, September 16–20 2002, pp. 2469–2472.
[29] W. Lee, S. Stolfo and K. Mok, Mining in a data-flow environment: Experience in network intrusion detection, in: Proceedings of the Fifth International Conference on Knowledge Discovery and Data Mining (KDD-99), S. Chaudhuri and D. Madigan, eds, 1999, pp. 114–124.
[30] W. Lee, S.J. Stolfo and K.W. Mok, A data mining framework for building intrusion detection models, in: IEEE Symposium on Security and Privacy (1999), 120–132.
[31] U. Lindqvist and P. Porras, Detecting computer and network misuse through the production-based expert system toolset (P-BEST), in: Proceedings of the 1999 IEEE Symposium on Security and Privacy, Oakland, California, May 1999, IEEE Computer Society Press, Los Alamitos, California, pp. 146–161.
[32] R. Lippmann, J.W. Haines, D.J. Fried, J. Korba and K. Das, The 1999 DARPA off-line intrusion detection evaluation, Computer Networks 34 (2000), 579–595.
[33] R.P. Lippmann and R.K. Cunningham, Using key-string selection and neural networks to reduce false alarms and detect new attacks with sniffer-based intrusion detection systems, in: RAID 1999 Conference, West Lafayette, Indiana, September 7–9, 1999.
[34] J. McHugh, The Lincoln Laboratory intrusion detection evaluation: A critique, in: Proceedings of the 2000 DARPA Information Survivability Conference and Exposition, 2000.
[35] S. Mukkamala, G. Janoski and A. Sung, Intrusion detection: Support vector machines and neural networks, IEEE IJCN (May 2002).
[36] L. Portnoy, E. Eskin and S.J. Stolfo, Intrusion detection with unlabeled data using clustering, in: Proceedings of ACM CSS Workshop on Data Mining Applied to Security (DMSA-2001), Philadelphia, PA, November 5–8 2001, pp. 51–62.
[37] J.R. Quinlan, C4.5: Programs for Machine Learning, Machine Learning, Morgan Kaufmann, San Mateo, California, 1993.
[38] R.J. Quinlan, Simplifying decision trees, International Journal of Man-Machine Studies 27(3) (December 1987), 221–234, Special Issue: Knowledge Acquisition for Knowledge-based Systems, Part 5.
[39] J. Schneider and A.W. Moore, A locally weighted learning tutorial using Vizier 1.0, Technical report, Carnegie Mellon University, 1997.
[40] N. Smith and M. Gales, Speech recognition using SVMs, in: Advances in Neural Information Processing Systems 14, MIT Press, 2002.
[41] J. Tollel and O. Niggemann, Supporting intrusion detection by graph clustering and graph drawing, in: RAID 2000 Third International Workshop on the Recent Advances in Intrusion Detection, Toulouse, France, October 2–4, 2000, pp. 51–62.
[42] I.H. Witten and E. Frank, Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations, Morgan Kaufmann, October 1999.