Network Intrusion Detection by a Multi-stage Classification System*

Luigi Pietro Cordella1, Alessandro Limongiello2, and Carlo Sansone1

1 Dipartimento di Informatica e Sistemistica, Università degli Studi di Napoli “Federico II”, Via Claudio 21, I-80125 Napoli, Italy
{cordel,carlosan}@unina.it
2 Dip. di Ingegn. dell’Informazione ed Ingegn. Elettrica, Università degli Studi di Salerno, Via Ponte Don Melillo, I, I-84084, Fisciano (SA), Italy
[email protected]
Abstract. A serial multi-stage classification system for intrusion detection in computer networks is proposed. The whole decision process is organized into successive stages, each one using a set of features tailored for recognizing a specific attack category. All the stages employ suitable criteria for estimating the reliability of the performed classification, so that, in case of uncertainty, information related to a possible attack is only logged for further processing, without raising an alert for the system manager. This reduces the number of false alarms. The proposed multi-stage intrusion detection system has been tested on two different services (http and ftp) of a standard database used for benchmarking intrusion detection systems. The experimental analysis highlights the effectiveness of the approach: the proposed system behaves significantly better than other multi-expert systems performing classification in a single stage.
1 Introduction

The increasing number of services offered through the Internet creates a strong demand for network security techniques that protect Internet providers and commercial sites from malicious attacks (intrusions). A variety of approaches to the network intrusion detection problem have been proposed so far [1]. Nevertheless, Intrusion Detection Systems (IDSs) can basically be ascribed to two categories. The first one, which exploits signatures of known attacks to detect when an attack occurs, is known as misuse, or signature, detection. IDSs that fall into this category are based on a model of all the possible misuses of the network resources. This completeness requirement is actually their major limitation [2].

* This work has been partially supported by the Ministero dell'Istruzione, dell'Università e della Ricerca (MIUR) in the framework of the FIRB Project “Middleware for advanced services over large-scale, wired-wireless distributed systems (WEB-MINDS)”.
F. Roli, J. Kittler, and T. Windeatt (Eds.): MCS 2004, LNCS 3077, pp. 324–333, 2004. Springer-Verlag Berlin Heidelberg 2004
A dual approach tries to characterize the normal usage of the resources under monitoring. An intrusion is then suspected when a significant deviation from the resource’s normal usage is observed. IDSs following this approach, known as anomaly detection, seem to be more promising because of their potential ability to detect unknown intrusions. However, in this case the major challenge is acquiring a model of normal use that is general enough to allow authorized users to work without raising false alarms, but specific enough to recognize unauthorized usage [3,4].

Different attack types can occur in a real network. Kendall [5] proposed a taxonomy of attacks, grouping them into four major categories: Probes, Denial of service (DoS), Remote to local (R2L) and User to root (U2R). The first category is made up of attacks that test a potential target to collect information for a possible intrusion. They are therefore usually harmless, unless a vulnerability is discovered and later exploited. DoS attacks prevent normal operation, causing the target host or a server to crash, or blocking network traffic; they do not, however, violate the target host. The last two categories, by contrast, group together attacks that permit the attacker to compromise the target host. In particular, in R2L attacks an unauthorized user is able to bypass normal authentication and execute commands on the target host, while in U2R attacks a user with login access is able to bypass normal authentication to gain the privileges of another user, typically the root user.

With this taxonomy in mind, the network intrusion detection problem can be formulated as a typical pattern recognition problem [6]: given information about network connections between pairs of hosts, the task is to assign each connection to one of five classes, which represent normal traffic conditions or one of the four attack categories described above.
Here the term “connection” refers to a sequence of data packets related to a particular service, such as a file transfer via the ftp protocol. Since an IDS must detect connections related to malicious activities, each network connection can be viewed as a “pattern” to be classified. This formulation implies the use of an IDS based on a misuse detection approach. The main advantage of the pattern recognition approach is the generalization capability exhibited by pattern recognition systems: they are able to detect some novel attacks without needing a complete description of all the possible attack signatures, thus overcoming one of the main drawbacks of the misuse detection approach. The feasibility of the pattern recognition approach to the intrusion detection problem is addressed in [7].

Different pattern recognition systems have been proposed in the recent past for realizing an IDS, mainly based on neural network architectures [3,8,9]. In order to improve performance, approaches based on multi-expert architectures have also been proposed [6,10]. However, it is worth noting that one of the main drawbacks of using pattern recognition techniques in real environments is the high false alarm rate they often produce [6] (a drawback shared by a large number of commercial IDSs). Moreover, classification is usually performed in a feature space made up of all the features needed to detect the considered attack classes. Since the distribution of the classes is very unbalanced, a quite large set of features with different semantic meanings must be employed to reliably distinguish between patterns coming from distinct classes. This is not advisable, because the excessive size of the feature vectors could make the construction of reliable classifiers more difficult.

Starting from these considerations, in this paper we propose a multi-stage classification system for network intrusion detection. Each stage performs a binary classification, distinguishing between normal connections and a single type of attack, and uses features especially tailored to discriminating between these two classes. In order to reduce the number of missed detections, the proposed multi-stage system declares a connection as normal traffic only if no stage detects an attack. Furthermore, in order to keep the number of false alarms low, some criteria have been established for evaluating the reliability of each classification act directly from the output of the classifiers. The reliability value is used to reject patterns that are unreliably attributed to an attack class. Reject, in this case, implies that the data about a ‘rejected’ connection are only logged for further processing, without raising an alert for the system manager. This could be done, for example, by using the proposed multi-stage system as the detection engine module of a public-domain IDS like Snort [11].

The organization of the paper is as follows: in Section 2 the proposed architecture is described, while in Section 3 several tests on a standard database used for benchmarking IDSs are reported, together with a comparison of the proposed system with some parallel multi-expert systems. Finally, some conclusions are drawn.
2 The Proposed Approach

As anticipated in the introduction, if the detection of an intrusion is performed in a feature space made up of all the features needed to detect all the considered attack classes, the excessive size of the feature vectors could make the construction of a reliable classifier difficult. Moreover, the false alarm rate should be kept low in order to make pattern recognition techniques appealing to a system manager as well.

In order to address the first problem, the proposed architecture is made up of a cascade of stages; each stage considers a different set of features and is tailored for distinguishing only between normal traffic and one attack class. Each stage therefore behaves as a binary classifier on a (possibly) reduced set of features, which makes an improvement in the overall system performance more likely. As regards the flow of the decision process, if a stage recognizes a pattern as normal traffic, the pattern is forwarded to the next stage for further classification. In this way, a pattern is recognized as normal traffic only if all the stages attribute it to the normal traffic class. This choice should keep the number of undetected attacks low.

As regards the dual need of decreasing the false alarm rate, it must be noted that each stage is made up of an expert, devoted to the classification of an input pattern, and of a decider. The latter, on the basis of the output vector provided by the corresponding expert, estimates the reliability of the classification decision, thus isolating all the patterns that can be reliably considered as attacks. If a stage is not able to reliably assign the sample to an attack class, the pattern is rejected. On the contrary, if an
expert recognizes a pattern as normal traffic, it is always forwarded to the next stage for further classification, regardless of its reliability. An overview of the decisional flow of the proposed system is given in Fig. 1. Note that the reliability parameters, whose values range from 0 to 1, are indicated with ψ, and the reliability thresholds (formally defined hereafter) with σ. These symbols have a subscript denoting the stage they refer to.

[Figure 1: flowchart. The input connection enters the I Stage (guess class C1 ∈ {normal, attack}, reliability ψI). If C1 = attack and ψI > σI, the II’ Stage (C2’ ∈ {DoS, Probe}, reliability ψII’) raises a DoS alert if ψII’ > σII’,1 or a Probe alert if ψII’ > σII’,2. If C1 is not attack, the II Stage (C2 ∈ {normal, R2L}, reliability ψII) raises an R2L alert if C2 = R2L and ψII > σII; otherwise the III Stage (C3 ∈ {normal, U2R}, reliability ψIII) raises a U2R alert if C3 = U2R and ψIII > σIII. An attack guess whose reliability does not exceed its threshold is rejected (the connection is logged). If C3 is not U2R, the guess class is normal traffic (no action).]

Fig. 1. The implemented decisional flow as a function of the classification decisions at the different stages and of the corresponding reliability evaluations.
The classification process starts by presenting the input pattern to the first stage, which discriminates between normal traffic and attacks belonging to the DoS or Probe classes. These two attack classes are considered together since their features should be quite similar. If the expert of this stage reliably recognizes the pattern as an attack, a second stage (Stage II’ in Fig. 1) is activated to discriminate between the DoS and Probe classes. However, if the classification reliability at either of these two stages is below the corresponding threshold, the input pattern is rejected, i.e. the data related to the connection are logged for further processing and no alert is raised. The method used for fixing the thresholds of each stage is illustrated in the next subsection.

On the other hand, if the first stage does not attribute the input connection to the attack class, the stage devoted to discriminating between normal connections and R2L attacks is activated. If the expert of this stage attributes the input pattern to the R2L class with high reliability, an alert is generated and the classification process stops; if, on the contrary, the reliability is below a suitably fixed threshold, the pattern is rejected and the data related to the input connection are logged. Once again, the process continues if the second stage expert classifies the input pattern as normal traffic, whatever the associated reliability. In this case, a third (and last) stage is activated. It must discriminate between the normal connection and U2R classes, adopting the same decision rule as the previous stage.

In summary, it is worth noting that while the attribution to an attack class can be made by a single expert, a pattern can be assigned to the normal traffic class only after taking into account the results of three decision stages.

The choice of considering DoS and Probe attacks together in the first stage is justified because connections related to these types of attacks typically behave quite differently from normal connections. Thus, it is simpler to detect them at the first step. On the contrary, U2R and R2L attacks are more difficult to recognize and are even more dangerous, since their aim is to violate the target host. Obviously, the order in which the different stages are activated influences the overall system performance. In the next Section, tests are reported to confirm that the chosen activation sequence gives the best system performance among those tested.
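The decision flow described in this section can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions, not the authors' implementation: each expert is modelled as a hypothetical callable returning a guessed class and a reliability ψ, the names classify_connection, constant_expert and FTP_SIGMA are invented here, and the threshold values are those reported for the ftp service in Table 3.

```python
from typing import Callable, Dict, Sequence, Tuple

# Assumption: an expert maps a feature vector to (guessed class, reliability psi in [0, 1]).
Expert = Callable[[Sequence[float]], Tuple[str, float]]


def classify_connection(x: Sequence[float],
                        experts: Dict[str, Expert],
                        sigma: Dict[str, float]) -> str:
    """Return 'normal', 'reject' (connection only logged) or 'alert:<class>'."""
    # Stage I: normal traffic vs. attack (DoS and Probe taken together).
    label, psi = experts["I"](x)
    if label == "attack":
        if psi <= sigma["I"]:
            return "reject"
        # Stage II': DoS vs. Probe, with one threshold per guessed class.
        label, psi = experts["II'"](x)
        key = "II',1" if label == "DoS" else "II',2"
        return f"alert:{label}" if psi > sigma[key] else "reject"
    # Stage II (normal vs. R2L), then Stage III (normal vs. U2R).
    for stage, attack in (("II", "R2L"), ("III", "U2R")):
        label, psi = experts[stage](x)
        if label == attack:
            return f"alert:{attack}" if psi > sigma[stage] else "reject"
        # A 'normal' guess is forwarded to the next stage whatever its reliability.
    return "normal"


def constant_expert(label: str, psi: float) -> Expert:
    """Hypothetical stub expert that always gives the same answer (for illustration)."""
    return lambda x: (label, psi)


# Thresholds reported for the ftp service in Table 3.
FTP_SIGMA = {"I": 0.406, "II": 0.210, "III": 0.100, "II',1": 0.001, "II',2": 0.003}

if __name__ == "__main__":
    experts = {
        "I": constant_expert("normal", 0.9),
        "II": constant_expert("normal", 0.2),
        "III": constant_expert("U2R", 0.5),
        "II'": constant_expert("DoS", 0.9),
    }
    print(classify_connection([0.0], experts, FTP_SIGMA))
```

Note that only attack guesses are compared with a threshold, which mirrors the asymmetry of the scheme: an alert can come from a single expert, while the normal label requires agreement of all the stages on that path.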
2.1 Reliability Thresholds

The low reliability of a classification can be traced back to one of the following situations: a) the considered sample is significantly different from those present in the training set; b) the point which represents the considered sample in the feature space lies where the regions pertaining to different classes overlap. To distinguish between classifications which are unreliable because a sample is of type a or b, we define two reliability parameters, ψa and ψb, whose values vary between 0 (completely unreliable) and 1 (very reliable). Each parameter is a function of the classifier output vector; the definitions of the reliability parameters for some popular neural architectures are given in [12].

A parameter ψ providing an inclusive measure of the reliability of a classification can be computed by combining the values of ψa and ψb. The form chosen for ψ is: ψ = min{ψa, ψb}. This is certainly a conservative choice, because it implies that a low value for just one of the parameters is sufficient to consider the whole classification unreliable. However, it is consistent with the kind of classification system considered, which is aimed at achieving the highest reliability.

Once the reliability of each classification has been evaluated, the optimal values of the thresholds can be determined by using the method described in [13]. It is assumed that an effectiveness function P is defined which, taking into account the requirements of the particular application, evaluates the quality of the classification in terms of correct recognition, misclassification and rejection rates. Under this assumption the optimal reject threshold value σ, determining the best trade-off between reject rate and misclassification rate, is the one for which the function P reaches its absolute maximum.

The requirements of the particular application domain are specified by attributing costs to misclassifications, rejects and correct classifications. As in [13], in our case the cost of an error depends on the actual class. So a cost matrix Cij has to be defined, whose generic element indicates the cost of misclassifying a pattern belonging to the i-th class by attributing it to the j-th class. We can assume that the gain for a correct classification and the cost of a reject do not depend on the class of the pattern. On the contrary, we can expect the costs of a misclassification to differ considerably depending on the actual and the guess class. Finally, it is worth noting that misclassification costs are typically higher than reject costs. This is also true in our domain, where a reject implies that the data related to the ‘rejected’ connection are logged by the IDS for off-line processing, so that there is still the possibility of detecting an intrusion.
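The threshold selection just described can be illustrated with a small Python sketch. It is not the method of [13] itself: the exact form of P is not given here, so the sketch assumes that P is the average gain minus the average cost over a labelled validation set, that a pattern is accepted when ψ > σ, and that candidate thresholds are 0 and the observed reliability values. The names reliability, effectiveness and best_threshold, and the sample data, are hypothetical.

```python
from typing import List, Tuple


def reliability(psi_a: float, psi_b: float) -> float:
    """Overall reliability psi = min{psi_a, psi_b}, as chosen in the text."""
    return min(psi_a, psi_b)


def effectiveness(samples: List[Tuple[float, float]], sigma: float,
                  reject_cost: float = 0.5, gain_correct: float = 0.0) -> float:
    """Assumed form of P: average gain minus average cost at threshold sigma.

    Each sample is (psi, error_cost), where error_cost is 0 for a correct
    classification and the cost-matrix entry for a misclassification.
    """
    total = 0.0
    for psi, error_cost in samples:
        if psi <= sigma:          # unreliable: the pattern is rejected (logged)
            total -= reject_cost
        elif error_cost == 0:     # accepted and correct
            total += gain_correct
        else:                     # accepted but wrong
            total -= error_cost
    return total / len(samples)


def best_threshold(samples: List[Tuple[float, float]],
                   reject_cost: float = 0.5, gain_correct: float = 0.0) -> float:
    """Return the candidate sigma at which the assumed P is maximal."""
    candidates = [0.0] + sorted({psi for psi, _ in samples})
    return max(candidates,
               key=lambda s: effectiveness(samples, s, reject_cost, gain_correct))


if __name__ == "__main__":
    # Hypothetical validation outcomes: (reliability, cost of the error if any).
    samples = [(0.9, 0), (0.8, 0), (0.6, 0), (0.3, 2), (0.2, 2)]
    sigma = best_threshold(samples)
    print(sigma, effectiveness(samples, sigma))
```

With these hypothetical samples, rejecting the two low-reliability errors costs 0.5 each instead of 2 each, so the maximum of the assumed P is reached at the reliability of the most reliable misclassified sample.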
3 Experimental Results

The proposed system has been tested on a subset of the database created by DARPA in the framework of the 1998 Intrusion Detection Evaluation Program. It is made up of a large number of network connections related to normal and malicious traffic. This database was pre-processed at Columbia University, yielding a feature vector of 41 elements for each connection, according to the set of features defined in [14] and tailored for the intrusion detection problem. In the database each connection is labelled as belonging to one of the five classes described in the previous Sections: normal traffic, Probe, DoS, U2R and R2L attacks. It is worth noting that each attack class is made up of different attack variants, each one exploiting different vulnerabilities of a computer network.

Our results have been systematically compared with those obtained by some single classifiers operating in a single stage and by some parallel Multi-Expert Systems (MES). As single experts we considered two different neural architectures: a Learning Vector Quantization (LVQ) net and a three-layered Multi-Layer Perceptron (MLP). In order to train them, the training data was split into two disjoint sets: a training set (in the following TRS) and a training-test set (in the following TTS), used to stop the learning process so as to avoid possible overtraining. Several classifiers have been tested, by varying the number of hidden nodes for the MLP nets and by using different numbers of prototypes for the LVQ nets. Different feature sets have also been used for training.

Different combining rules have been taken into account in order to build several MES, each one composed of two or three of the previously described neural architectures. In particular, both “fixed” (Majority Vote, Min, Max, Average) and “trainable” (Naïve Bayes, Dempster-Shafer, BKS) combining rules have been considered.

The results obtained by all the considered classification systems are reported in terms of i) the overall classification error, ii) the sum of the false alarm and the missed detection rates, without taking into account class confusion among attack classes, and iii) the average classification cost calculated according to the cost matrix shown in Table 1. This type of evaluation has been proposed in [7] so as to have a more significant parameter for evaluating the effectiveness of an IDS. The average cost is computed by multiplying each entry of the confusion matrix by the corresponding entry in the cost matrix, summing the products and dividing the result by the total number of test samples. The cost matrix of Table 1 is also used for evaluating the reliability thresholds by using the method illustrated in Section 2.1. In this case the cost of a reject is assumed to be 0.5.

Table 1. Cost matrix used to weight the confusion matrix related to each classifier and to calculate the reliability thresholds. The cost of a reject is assumed to be 0.5.

                         Actual Class
Guess Class   Normal   DoS   U2R   R2L   Probe
Normal           0      2     3     4      1
DoS              2      0     2     2      2
U2R              2      2     0     2      2
R2L              2      2     2     0      2
Probe            1      1     2     2      0
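As an illustration of the three evaluation measures, the following Python sketch computes them from a confusion matrix laid out like Table 1 (rows are guess classes, columns are actual classes). The confusion counts are hypothetical and the function name evaluate is invented here; the sketch also assumes that false alarms and missed detections are both expressed as fractions of the whole test set, which is one possible reading of measure ii).

```python
CLASSES = ["Normal", "DoS", "U2R", "R2L", "Probe"]

# COST[guess][actual], copied from Table 1.
COST = [
    [0, 2, 3, 4, 1],
    [2, 0, 2, 2, 2],
    [2, 2, 0, 2, 2],
    [2, 2, 2, 0, 2],
    [1, 1, 2, 2, 0],
]


def evaluate(confusion):
    """confusion[guess][actual] holds the number of test samples in each cell."""
    n = len(CLASSES)
    total = sum(sum(row) for row in confusion)
    correct = sum(confusion[i][i] for i in range(n))
    # Entry-wise product of confusion and cost matrices, summed over all cells.
    cost = sum(confusion[g][a] * COST[g][a] for g in range(n) for a in range(n))
    # False alarm: normal traffic guessed as any attack class.
    false_alarms = sum(confusion[g][0] for g in range(1, n))
    # Missed detection: any attack guessed as normal traffic.
    missed = sum(confusion[0][a] for a in range(1, n))
    return {
        "overall_error": (total - correct) / total,
        "false_plus_missed": (false_alarms + missed) / total,
        "average_cost": cost / total,
    }


if __name__ == "__main__":
    # Hypothetical confusion matrix over 100 test connections.
    confusion = [
        [50, 1, 0, 2, 0],
        [1, 20, 0, 0, 1],
        [0, 0, 5, 0, 0],
        [0, 0, 1, 9, 0],
        [1, 0, 0, 0, 9],
    ]
    print(evaluate(confusion))
```

Confusions between two attack classes count toward the overall error and the cost, but not toward the false-plus-missed measure, which is why the latter can be much lower than the former.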
In the following we present the results obtained by our classification method applied to two different network services (ftp and http) among those present in the DARPA database. Other services have also been tested, but the results obtained are not reported here for the sake of brevity. The choice of designing a different multi-stage classifier system for each service follows the so-called modular approach presented in [10], where the authors experimentally demonstrate the advantage, in terms of recognition performance, of an IDS that develops a different classification module for each of the network services to be protected.

3.1 FTP Service

In this case the training data was made up of 798 patterns related to different attacks and to the normal class. These samples are not well balanced among the different classes: in particular, there are very few samples of U2R and Probe attacks. For this reason such samples have been duplicated before training, yielding 841 training samples. This procedure was motivated by the fact that the code for launching some types of attacks can easily be retrieved on the Web, so it is reasonable to assume that several identical connections may occur in case of attacks. We use 65% of the training data as TRS (549 samples) and the remaining 35% as TTS (292 samples). The Test Set (TS) for this service is made up of 825 patterns.
Table 2 reports the results of the best single classifiers and of the best parallel MES made up of three experts. As is evident, the results in terms of overall error are very poor, while the percentage of false alarms and missed detections is quite acceptable. In this case the best MES does not significantly improve the performance obtained by the best single classifier.

Table 3 shows the characteristics of each stage of the proposed architecture. All the stages employ an LVQ net as expert, trained in the same way as the single classifiers described above. Note that different feature sets have been selected at different stages.

Table 2. Results obtained by the best single experts and by the best Parallel MES on the TS.

Classifier                     Overall Error   Cost     False+Missed alarm rates
Single   LVQ (all features)    13.82 %         0.3358   3.76 %
Single   LVQ (six features)    14.18 %         0.2861   1.09 %
MES      Average               13.70 %         0.2564   1.05 %
MES      Majority Vote         13.70 %         0.2564   1.05 %

Table 3. Details about each stage of the proposed system.

Stage   Neural Architecture   Features                       Feature Normalization   Threshold
I       LVQ                   duration, src_bytes,           Yes                     σI = 0.406
                              dst_bytes, same_srv_rate,
                              dst_host_srv_diff_host_rate
II      LVQ                   All                            Yes                     σII = 0.210
III     LVQ                   All                            No                      σIII = 0.100
II’     LVQ                   All                            Yes                     σII’,1 = 0.001, σII’,2 = 0.003
The results obtained on the TS by the proposed system are reported in Table 4. For the sake of comparison, the system has been considered both with and without the reject option. The overall error is significantly lower than that of the single-stage systems of Table 2 (2.79 % without the reject option, against 13.70 % for the best MES). Moreover, the false alarm rate further decreases with the introduction of the reject option, as expected. It is worth noting that the chosen activation sequence for the stages of the proposed architecture is the most effective among those tested. In fact, the overall error, without the reject option, rises to 9.45 % if the II Stage is chosen as the first one in our architecture, and to 8.36 % if the III Stage is the first to be activated.

Table 4. Results obtained on the TS by the proposed multi-stage classification system, with and without the reject option.

                        Overall error                    Cost     False+Missed alarm rates
Without reject option   2.79 %                           0.0509   0.85 %
With reject option      0.72 % (with 3.76 % of reject)   0.0309   0.24 %
3.2 Http Service

The training data for this service in the DARPA database are made up of 64292 patterns. However, in [9] it has been demonstrated that a dataset of about 15% of the whole http data is sufficient for training neural classifiers. Therefore, only 8866 samples have been considered as training data. Differently from the previous case, there are no attacks belonging to the U2R class. This implies that the proposed multi-stage classification system for this service does not include the III Stage. Also in this case samples of the R2L and Probe classes have been duplicated before training. 70% of the whole training data has been used as TRS and the remaining 30% as TTS. The TS for this service is made up of 40442 patterns.

Table 5 reports the results of the best single classifiers and of the best parallel MES made up of three experts. In this case the performance is quite good, especially as regards the number of false alarms. The best MES exhibits exactly the same performance as the best single expert.

Table 5. Results obtained by the best single experts and by the best Parallel MES on the TS.

Classifier                Overall error   Cost     False+Missed alarm rates
Single   LVQ              0.22 %          0.0027   0.09 %
Single   MLP              0.27 %          0.0033   0.14 %
MES      Average          0.22 %          0.0027   0.09 %
MES      Majority Vote    0.22 %          0.0027   0.09 %
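For readers unfamiliar with the “fixed” combining rules used by the parallel MES baselines, the following Python sketch shows how Majority Vote, Min, Max and Average fuse the output vectors of several experts. It is a generic illustration, not the configuration used in the experiments: the function name combine and the score vectors are hypothetical, and the experts' outputs are assumed to be comparable class scores.

```python
from collections import Counter
from typing import List, Sequence


def combine(outputs: List[Sequence[float]], rule: str) -> int:
    """Return the index of the class chosen by a fixed combining rule.

    outputs[k][c] is the score that expert k assigns to class c.
    """
    n_classes = len(outputs[0])
    if rule == "majority":
        # Each expert votes for its own top-scoring class.
        votes = Counter(max(range(n_classes), key=o.__getitem__) for o in outputs)
        return votes.most_common(1)[0][0]
    aggregate = {
        "min": min,
        "max": max,
        "average": lambda values: sum(values) / len(values),
    }[rule]
    # Fuse the scores class by class, then pick the best fused score.
    fused = [aggregate([o[c] for o in outputs]) for c in range(n_classes)]
    return max(range(n_classes), key=fused.__getitem__)


if __name__ == "__main__":
    # Hypothetical scores of three experts over three classes.
    outputs = [[0.6, 0.3, 0.1], [0.1, 0.5, 0.4], [0.2, 0.45, 0.35]]
    for rule in ("majority", "min", "max", "average"):
        print(rule, combine(outputs, rule))
```

In this hypothetical example the Max rule follows the single most confident expert and selects class 0, while the other three rules select class 1.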
Table 6 shows the characteristics of each stage of the proposed architecture; all of them employ an LVQ net as expert. Once again, these experts were trained in the same way as the single classifiers reported in Table 5. In this case, the feature set is the same for all three stages, although in one case no normalization was performed. Since the expert present at each stage is very reliable, no thresholds were fixed by the proposed method in this case (all the thresholds in Table 6 are 0.00).

Table 6. Details about each stage of the proposed system.

Stage   Neural Architecture   Features   Feature Normalization   Threshold
I       LVQ                   All        Yes                     0.00
II      LVQ                   All        Yes                     0.00
II’     LVQ                   All        No                      0.00
Finally, Table 7 shows the results of the proposed system on the TS. Also in this case the overall error decreases with respect to the single-stage systems of Table 5 (0.11 % against 0.22 %), while the false alarm rate remains low. This confirms the effectiveness of the proposed approach.

Table 7. Results obtained by the proposed multi-stage classification system on the TS.

Overall error   Cost     False+Missed alarm rates
0.11 %          0.0014   0.09 %
4 Conclusions

In this paper we have presented a multi-stage classification system for network intrusion detection. For this architecture, we have used some criteria for evaluating the reliability of the response and a method for determining an optimal reject threshold, in order to reduce the false alarm rate of the system. The effectiveness of our proposal has been experimentally evaluated on a standard database, where a significant improvement in the reliability of the system has been demonstrated.
References

1. S. Axelsson, Research in Intrusion Detection Systems: A Survey, TR 98-17, Chalmers University of Technology, 1999.
2. R. Kumar, E.H. Spafford, “A Software Architecture to Support Misuse Intrusion Detection”, in Proceedings of the 18th National Information Security Conference, pp. 194-204, 1995.
3. A.K. Ghosh, A. Schwartzbard, “A Study in Using Neural Networks for Anomaly and Misuse Detection”, Proc. 8th USENIX Security Symposium, Aug. 26-29 1999, Washington DC.
4. T. Lane, C.E. Brodley, “Temporal Sequence Learning and Data Reduction for Anomaly Detection”, ACM Trans. on Inform. and System Security, vol. 2, no. 3, pp. 295-261, 1999.
5. K. Kendall, A Database of Computer Attacks for the Evaluation of Intrusion Detection Systems, Master's Thesis, Massachusetts Institute of Technology, 1998.
6. G. Giacinto, F. Roli, L. Didaci, “Fusion of Multiple Classifiers for Intrusion Detection in Computer Networks”, Pattern Recognition Letters, vol. 24, pp. 1795-1803, 2003.
7. C. Elkan, “Results of the KDD99 Classifier Learning”, ACM SIGKDD Explorations 1, 63–64, 2000.
8. S.C. Lee, D.V. Heinbuch, “Training a Neural Network Based Intrusion Detector to Recognize Novel Attack”, IEEE Trans. Syst., Man, and Cybernetics, Part A, vol. 31, pp. 294-299, 2001.
9. M. Fugate, J.R. Gattiker, “Computer Intrusion Detection with Classification and Anomaly Detection, using SVMs”, International Journal of Pattern Recognition and Artificial Intelligence, vol. 17, no. 3, pp. 441-458, 2003.
10. G. Giacinto, F. Roli, L. Didaci, “A Modular Multiple Classifier System for the Detection of Intrusions”, Lecture Notes in Computer Science vol. 2709, pp. 346-355, 2003.
11. J. Beale, J.C. Foster, Snort 2.0 Intrusion Detection, Syngress Publishing, Inc., Rockland, MA, 2003.
12. L.P. Cordella, C. Sansone, F. Tortorella, M. Vento, C. De Stefano, “Neural Networks Classification Reliability”, in C.T. Leondes (ed.), Academic Press theme volumes on Neural Network Systems, Techniques and Applications, Academic Press, vol. 5, pp. 161-199, 1998.
13. C. Sansone, F. Tortorella, M. Vento, “A Classification Reliability Driven Reject Rule for Multi-Expert Systems”, International Journal of Pattern Recognition and Artificial Intelligence, vol. 15, no. 6, pp. 885-904, 2001.
14. W. Lee, S.J. Stolfo, “A Framework for Constructing Features and Models for Intrusion Detection Systems”, ACM Transactions on Inform. System Security, vol. 3, no. 4, pp. 227-261, 2000.