Using Behavior Knowledge Space and Temporal Information for Detecting Intrusions in Computer Networks* L.P. Cordella, I. Finizio, C. Mazzariello, and C. Sansone Dipartimento di Informatica e Sistemistica, Università di Napoli “Federico II”, Via Claudio, 21 I-80125 Napoli, Italy {cordel, ifinizio, cmazzari, carlosan}@unina.it
Abstract. Pattern Recognition (PR) techniques have proven effective in detecting malicious activities within network traffic. Systems based on multiple classifiers can further strengthen detection capabilities by combining and correlating the results obtained from different sources. An aspect often disregarded in PR approaches to the intrusion detection problem is the use of temporal information. Indeed, an attack is typically carried out over a set of consecutive network packets; therefore, a PR system could improve its reliability by examining sequences of network connections before expressing a decision. In this paper we present a system that uses a multiple classifier approach together with temporal information about the network packets to be classified. In order to improve classification reliability, we introduce the concept of rejection: instead of emitting an unreliable verdict, an ambiguously classified packet can be logged for further analysis. The proposed system has been tested on a large database made up of real network traffic traces.
* This work has been partially supported by the Ministero dell'Istruzione, dell'Università e della Ricerca (MIUR) in the framework of the FIRB Project “Middleware for advanced services over large-scale, wired-wireless distributed systems (WEB-MINDS)”.
S. Singh et al. (Eds.): ICAPR 2005, LNCS 3687, pp. 94 – 102, 2005. © Springer-Verlag Berlin Heidelberg 2005
1 Introduction
The most common and best-known tools used to ensure the security of companies, campuses and, more generally, of any network are firewalls and antivirus software. Though widely deployed, such tools alone are not enough to protect a system from malicious activities. On this assumption, many researchers have developed systems able to detect intrusions and, in some cases, to trace the path leading to the attack source. Intrusion Detection Systems (IDS) can be grouped into different categories on the basis of the information sources they analyze to detect an intrusive activity. In the following, we concentrate on Network-based IDS (N-IDS) [1]. Depending on the detection technique employed, IDS can also be roughly classified into two main groups [2]. The first one, which exploits
signatures of known attacks to detect when an attack occurs, is known as misuse (or signature) detection. IDSs in this category rely on a model of all the possible misuses of the network resources; this completeness requirement is their major limitation [3]. A dual approach tries to characterize the normal usage of the monitored resources. An intrusion is then suspected when a significant deviation from normal usage is observed. IDSs following this approach, known as anomaly detection, seem more promising because of their potential ability to detect unknown intrusions (the so-called zero-day attacks). They face a major challenge, however: the model of normal resource usage must be general enough to let authorized users work without raising false alarms, yet specific enough to recognize unauthorized usage [4,5].
The network intrusion detection problem can also be formulated as a binary classification problem: once the information about network connections between pairs of hosts is given, the task is to assign each connection to one of two classes, representing normal traffic or an attack. Here the term connection refers to a sequence of data packets sharing some properties. In this framework, several proposals have been made for extracting high-level features from data packets [6,7]. Each network connection can then be described by a “pattern” to be classified, and a pattern recognition (PR) approach can be followed. PR systems typically follow the misuse detection approach. Their main advantage is the ability to generalize: they can detect some novel attacks, since different variants of the same attack are typically described by very similar patterns.
Moreover, the high-level features extracted from the connections of a totally new attack should behave quite differently from those extracted from normal connections. In summary, these PR systems do not need a complete description of all the possible attack signatures. This overcomes one of the main drawbacks of the misuse detection approach: signature-based systems, in fact, may fail to detect attacks that differ even slightly from a known pattern. Several misuse-based PR systems for realizing an IDS have been reported in the recent past, mainly based on neural network architectures [8,9]. In order to improve the detection performance, approaches based on multi-expert architectures have also been proposed [10,11,12]. Anomaly-based systems have been considered in the PR field as well. Here, they can be ascribed to the more general category of approaches based on novelty detection, i.e. the identification of new or unknown data or signals that a system was not aware of during training [13]. Examples of neural-based systems that follow the anomaly detection approach can be found in [14,15]. However, one of the main drawbacks of PR techniques in real environments is the high false alarm rate they often produce [10]. This is a very critical point, as pointed out in [16]. An information source that is commonly disregarded in PR systems is the temporal sequence of the network traffic patterns. In our opinion, however, this kind of information can be profitably used to increase the reliability of attack detection. An attack is, in fact, typically spread over several network packets close to each other. Even though high-level features are extracted from the traffic, it is quite
unusual for an attack to be confined to a single connection pattern isolated within a sequence of normal connections. A PR system could then improve its performance by examining sequences of network connections. In order to realize an IDS capable of detecting intrusions while keeping the number of false alarms as low as possible, in this paper we propose a multiple classifier system that combines the behavior knowledge space with temporal information coming from a real-time analysis of the network traffic. In particular, starting from the proposal made in [6], a framework for extracting features from real traffic is adopted [7]. The collected data are then fed to a multiple classifier system that employs the Behavior Knowledge Space rule for combining the outputs of its component classifiers. The standard BKS rule is generalized here to cope with the temporal information of a sequence of connection patterns. In order to maximize the complementarity of the decisions to be combined, a rule-based classifier [17] and a neural network are employed as base classifiers. The paper is organized as follows: in Section 2 the proposed approach is presented, while in Section 3 the database obtained from real network traffic is described. Tests of the proposed IDS are reported in Section 4; finally, some conclusions are drawn in Section 5.
2 A Behavior-Knowledge Space Combining Rule Using Temporal Information
In [18], Huang and Suen proposed a combining rule that does not require the base classifiers to be independent. It derives the information needed to combine a set of classifiers from a knowledge space that concurrently records the decisions of all the classifiers on a suitable set of samples. Since this space records the behavior of all the classifiers on this set, it is called the Behavior-Knowledge Space, and the combining rule that uses it is called the Behavior-Knowledge Space (BKS) rule. In more detail, a Behavior-Knowledge Space is a $K$-dimensional space where each dimension corresponds to the decision of one classifier. Given a pattern $x$ to be assigned to one out of $M$ possible classes, the ensemble of classifiers can in theory provide $M^K$ different decisions. Each of these decisions $(D_1(x), D_2(x), \ldots, D_K(x))$, where $D_j(x)$ represents the class guessed by the $j$-th classifier, constitutes one unit of the BKS. In our case $M$ is equal to 2, so the number of units is $2^K$. The BKS combining rule operates in two phases: a learning phase for knowledge modeling and an operating phase for decision-making. In the learning phase the BKS look-up table is built: each BKS unit $U$ records $M$ values $e_i$, one for each class. Given a suitably chosen training set, each pattern $x_{tr}$ of this set is classified by all the classifiers, and the unit (called the focal unit) that corresponds to the decision of the ensemble $(D_1(x_{tr}), D_2(x_{tr}), \ldots, D_K(x_{tr}))$ is activated. Let us denote this unit by $FU(x_{tr})$. It records the actual class $C(x_{tr})$ of $x_{tr}$, say $j$, by adding one to the value of $e_j$. At the end of this phase, each unit can calculate its best representative class, $C(U)$, defined as the class with the highest value of $e_i$, i.e.:
$$C(U) = j, \quad \text{where } j = \arg\max_i e_i \qquad (1)$$
In other words, this class is the most likely one, given the classifiers' decisions that activate that unit. In the operating mode, for each pattern $x_{test}$ to be classified, the decisions $(D_1(x_{test}), D_2(x_{test}), \ldots, D_K(x_{test}))$ of the classifiers are collected and the corresponding focal unit $FU(x_{test})$ is selected. The class attributed to $x_{test}$ is then the best representative class associated with that focal unit, i.e.:
$$C(x_{test}) = C(U), \quad \text{where } U = FU(x_{test}) \qquad (2)$$
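The two phases of the standard BKS rule can be illustrated with a minimal Python sketch. It is not the implementation used in this work: the function names, the tie-breaking rule (lowest class index wins) and the `default` class returned for units never seen in training are all assumptions made for the example.

```python
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple

Unit = Tuple[int, ...]  # one decision per classifier: (D1(x), ..., DK(x))


def train_bks(decisions: Sequence[Unit], labels: Sequence[int],
              n_classes: int = 2) -> Dict[Unit, List[int]]:
    """Learning phase: for every focal unit, count the true classes observed."""
    table: Dict[Unit, List[int]] = defaultdict(lambda: [0] * n_classes)
    for unit, label in zip(decisions, labels):
        table[tuple(unit)][label] += 1  # add one to e_j of the focal unit
    return dict(table)


def best_class(counts: Sequence[int]) -> int:
    """Eq. (1): index of the largest e_i (assumption: ties go to the lowest index)."""
    return max(range(len(counts)), key=lambda i: counts[i])


def classify(table: Dict[Unit, List[int]], unit: Unit, default: int = 0) -> int:
    """Operating phase, Eq. (2). `default` for unseen units is an assumption."""
    counts = table.get(tuple(unit))
    return default if counts is None else best_class(counts)


# Toy example with K = 2 classifiers and M = 2 classes (0 = normal, 1 = attack).
train_decisions = [(0, 0), (0, 0), (0, 1), (0, 1), (0, 1), (1, 1), (1, 0)]
train_labels = [0, 0, 1, 1, 0, 1, 0]
bks_table = train_bks(train_decisions, train_labels)
print(bks_table[(0, 1)], classify(bks_table, (0, 1)))
```

In the toy data the unit (0, 1), where the two classifiers disagree, was activated three times, twice by attack patterns, so a test pattern falling in that unit is assigned to the attack class.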
Since in our case the temporal sequence of the patterns to be classified is particularly significant, the BKS can be augmented with a temporal dimension. In this case, the number of units becomes $2^{K \cdot t}$, where $t$ is the size of the considered temporal window. Each unit, in fact, has to record a sequence of $t$ values for each of the $K$ classifiers, so the new behavior knowledge space has dimensionality $K \cdot t$. For instance, with $K = 2$ classifiers and $t = 4$ the space has 8 dimensions and $2^8 = 256$ units. In operating mode, $t$ successive decisions for each classifier (relative to a sequence of $t$ consecutive patterns) need to be collected. These $K \cdot t$ values then select a focal unit, whose best representative class is assigned to the last pattern of the sequence. The next pattern is classified by shifting the temporal window one pattern forward, thus identifying a (possibly) different focal unit. The sequence of decisions relative to a temporal window can also be used for evaluating the reliability of each classification act. A reliability $R(U)$ can in fact be associated with each unit, as specified in the following. In operating mode, every time a focal unit is selected, its reliability is taken as the reliability of the performed classification. We have chosen to evaluate $R(U)$ in the following way:
$$R(U) = \begin{cases} \dfrac{e_j}{e_k} & \text{if } e_j > 0 \\ 0 & \text{if } e_j = 0 \end{cases} \quad \text{where } e_j = \max_i e_i \text{ and } e_k = \max_{i \neq j} e_i \qquad (3)$$
In other words, $R(U)$ is the ratio between the values associated with the first and the second most representative classes of the unit. If the value associated with the most representative class is zero (i.e., the unit was never activated by the patterns of the training set), the reliability of the unit is set to zero. The value of $R(U)$ can be used to reject a pattern instead of running the risk of misclassifying it. Rejection, in this context, means that the data about a ‘rejected’ connection are only logged for further processing, without raising an alert for the system manager [12]. To decide on a rejection, a suitable threshold has to be fixed. This could be done adaptively, with respect to the requirements of the application at hand, by using the method proposed in [19]. Nevertheless, in Sect. 4 we report results obtained with the reliability threshold fixed to 0.6. Finally, in order to choose the optimal size of the temporal window, the performance of the proposed approach on a suitable set of data can be analyzed as a function of t. Then, the value of t that allows us to obtain
the best trade-off between reject and error rate can be selected. If the chosen set is sufficiently representative of the target domain this should guarantee the best performance also in the operating mode.
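The temporal extension and the reject option described in this section can be sketched in Python as follows. This is only an illustration under stated assumptions: the helper names are hypothetical, the first t−1 patterns of a stream (which have no complete window) are simply skipped, a pattern is rejected when the reliability is strictly below the threshold, and a unit whose second-best count is zero is treated as infinitely reliable, a case the text does not specify. The threshold used in the toy run is arbitrary and is not the value adopted in Sect. 4.

```python
import math
from collections import defaultdict
from typing import Dict, Iterator, List, Optional, Sequence, Tuple

Decision = Tuple[int, ...]  # K classifier outputs for one pattern
Unit = Tuple[int, ...]      # K*t outputs for one temporal window


def windows(decisions: Sequence[Decision], t: int) -> Iterator[Tuple[int, Unit]]:
    """Yield (index of the last pattern, flattened K*t unit) for each full window."""
    for end in range(t - 1, len(decisions)):
        window = decisions[end - t + 1:end + 1]
        yield end, tuple(d for step in window for d in step)


def train_temporal_bks(decisions: Sequence[Decision], labels: Sequence[int],
                       t: int, n_classes: int = 2) -> Dict[Unit, List[int]]:
    """Learning phase: each window votes with the true class of its last pattern."""
    table: Dict[Unit, List[int]] = defaultdict(lambda: [0] * n_classes)
    for end, unit in windows(decisions, t):
        table[unit][labels[end]] += 1
    return dict(table)


def reliability(counts: Sequence[int]) -> float:
    """Ratio between the best and second-best counts; 0 for never-activated units."""
    ordered = sorted(counts, reverse=True)
    e_j, e_k = ordered[0], ordered[1]
    if e_j == 0:
        return 0.0
    return math.inf if e_k == 0 else e_j / e_k  # e_k == 0 handling is an assumption


def classify_stream(table: Dict[Unit, List[int]], decisions: Sequence[Decision],
                    t: int, threshold: float,
                    n_classes: int = 2) -> List[Tuple[int, Optional[int]]]:
    """Operating phase: slide the window one pattern at a time; None means 'rejected'."""
    results: List[Tuple[int, Optional[int]]] = []
    for end, unit in windows(decisions, t):
        counts = table.get(unit, [0] * n_classes)
        if reliability(counts) < threshold:
            results.append((end, None))  # log only, no alert raised
        else:
            results.append((end, max(range(n_classes), key=lambda i: counts[i])))
    return results


# Toy run with K = 2 classifiers and a window of t = 2 patterns.
train_dec = [(0, 0), (0, 0), (1, 1), (1, 1), (0, 0), (1, 0), (0, 0)]
train_lab = [0, 0, 1, 1, 0, 0, 0]
table = train_temporal_bks(train_dec, train_lab, t=2)
print(classify_stream(table, [(0, 0), (0, 0), (1, 1), (0, 1)], t=2, threshold=1.5))
```

The last window of the toy stream selects a unit that was never activated during training, so its reliability is zero and the corresponding pattern is rejected rather than classified.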
3 A Real Network Traffic Database
One of the main issues related to PR in intrusion detection is the use of a proper database. Two main approaches are possible: the former relies on simulating a real-world network scenario; the latter builds the data set from actual network traffic. The first approach has usually been adopted. The best-known dataset is the so-called KDD Cup 1999 Data¹, which was created for the Third International Knowledge Discovery and Data Mining Tools Competition, held within KDD-99, the Fifth International Conference on Knowledge Discovery and Data Mining. It was created by the Lincoln Laboratory at MIT in order to conduct a comparative evaluation of intrusion detection systems, developed under DARPA (Defense Advanced Research Projects Agency) and AFRL (Air Force Research Laboratory) sponsorship². This set was created in order to evaluate the ability of data mining and PR algorithms to build predictive models able to distinguish between normal and malicious behaviour. The KDD Cup 1999 Data contain a set of connection records obtained by pre-processing raw TCPdump data. Each connection is labelled as either normal or attack. The connection records are built from a set of higher-level connection features, defined by Lee and Stolfo [6], that are able to tell normal activities apart from illegal network activities. Although this database is widely employed [9,10,12,20], some criticisms have been raised against it [21]. Indeed, numerous research works analyze the difficulties arising when trying to reproduce actual network traffic patterns by means of simulation [22]. The major issue lies in how faithfully the behaviour of network traffic sources can be reproduced.
¹ http://kdd.ics.uci.edu/databases/kddcup99/kddcup99.html
² http://www.ll.mit.edu/IST/ideval
On the basis of the above considerations, we have concluded that the KDD Cup 1999 Data can only be used for a first evaluation of the effectiveness of the PR algorithms under study, rather than for providing useful indications for a real application of intrusion detection systems. On the other hand, collecting real traffic can be considered a viable alternative approach for the construction of a traffic data set [23]. Although it can prove effective in real-time intrusion detection, it still raises some concerns. In particular, the collection of a real traffic data set requires a data pre-classification process for packet labelling: real traffic carries no information that distinguishes normal activities from malicious ones, so the data set cannot be labelled directly. Last but not least, the privacy of the information contained in real network data has to be considered: payload anonymizers and IP address spoofing tools are needed in order to preserve sensitive information. Nevertheless, we decided to collect real traffic traces. We deem such an approach a forced choice when the computed patterns have to be
applied in a system that must exploit temporal information. Our data set has been built by collecting real traffic on the local network at Genova National Research Council (CNR). The raw traffic data set contains about one million packets, equivalent to 1 GByte of data. The network traffic has been captured by means of the TCPdump tool and logged to a file. In order to solve the pre-classification problem (which, as already stated, requires labelling the items in the data set), we have used previous work by Genova's research team. By using different intrusion detection systems, researchers in Genova analyzed the generated alert files and manually identified a set of known intrusions in the logged traffic. We have leveraged the results of this research in order to extract the connection feature records and properly label each of them with either a normal or an attack tag. The number of attack packets in the whole data set is about 3,500; both Denial of Service and Probing attacks have been found in the traffic data. As regards the connection features, starting from the work of Lee and Stolfo [6], we extracted 26 features for each network connection. More details about the feature extraction process, which can be carried out in real time, can be found in [7].
4 Experimental Results
As stated before, we defined 26 features starting from the 41 proposed by Lee and Stolfo in [6]. Such a high number of features may result in redundancy in the information provided about each traffic pattern to be analyzed; furthermore, not all the features are necessary to detect the presence of a particular attack type. Given the particular attack distribution and normal traffic characteristics of the analysed network scenario, it is desirable to reduce the dimensionality of the feature space while preserving most of the information. Thus, we applied a feature selection process to the database described above, adopting a Sequential Forward Selection strategy with the Minimum Estimated Probability classification criterion. Though Best Feature Selection would probably lead to slightly better results, its heavy computational load and the huge amount of data to be analyzed led us to choose the former technique. At the end of the feature selection process, each network connection was represented by a feature vector of 8 components. The whole database was then split into three disjoint sets: a training set (TRS in the following) used for training the base classifiers and for calculating the BKS look-up tables, a validation set (VS in the following) used for stopping the learning process so as to avoid possible overtraining and for choosing the optimal value of t, and a test set (TS). In particular, 30% of the data (about 300,000 patterns) was used as TRS, 30% as VS (about 300,000 patterns) and the remaining 40% as TS (about 400,000 patterns). As base classifiers, we employed a neural network, namely an LVQ classifier with 10 prototypes for the attack class and 50 prototypes for the normal class, and a rule-based learning system, SLIPPER, that creates a rule set by iteratively boosting a greedy rule-builder [17]. In Table 1 the results of these classifiers, in terms of the overall error rate on TRS, VS and TS, are shown.
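As an illustration of the Sequential Forward Selection strategy mentioned above, the following Python sketch greedily adds, at each step, the feature that minimizes a criterion computed on the current subset. It is a sketch under assumptions, not the procedure actually run on the database: the criterion used here is the resubstitution error of a nearest-class-mean classifier, chosen only as a simple stand-in for the criterion adopted in the paper, and the toy data and function names are hypothetical.

```python
from typing import Callable, List, Sequence


def sequential_forward_selection(n_features: int, n_select: int,
                                 criterion: Callable[[List[int]], float]) -> List[int]:
    """Greedy SFS: start from the empty set, add the best single feature each step."""
    selected: List[int] = []
    remaining = set(range(n_features))
    while len(selected) < n_select and remaining:
        # Lower criterion is better; ties go to the lowest feature index.
        best = min(sorted(remaining), key=lambda f: criterion(selected + [f]))
        selected.append(best)
        remaining.remove(best)
    return selected


def nearest_mean_error(X: Sequence[Sequence[float]], y: Sequence[int],
                       features: List[int]) -> float:
    """Stand-in criterion (assumption): error of a nearest-class-mean rule on X itself."""
    classes = sorted(set(y))
    means = {}
    for c in classes:
        rows = [x for x, label in zip(X, y) if label == c]
        means[c] = [sum(r[f] for r in rows) / len(rows) for f in features]
    errors = 0
    for x, label in zip(X, y):
        predicted = min(classes, key=lambda c: sum(
            (x[f] - m) ** 2 for f, m in zip(features, means[c])))
        errors += predicted != label
    return errors / len(X)


# Toy data: feature 1 separates the classes, feature 2 partly, feature 0 not at all.
X = [[1, 0, 0], [5, 0, 0], [3, 0, 1], [3, 1, 1], [1, 1, 1], [5, 1, 0]]
y = [0, 0, 0, 1, 1, 1]
chosen = sequential_forward_selection(3, 2, lambda feats: nearest_mean_error(X, y, feats))
print(chosen)
```

Because each step evaluates only the candidate features not yet selected, the number of criterion evaluations grows with the product of the number of features and the size of the target subset, which is what makes the strategy affordable on a large database.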
Table 1. Results obtained by the base classifiers on the three considered data sets

Classifier   Data set   Error rate
LVQ          TRS        0.265 %
LVQ          VS         0.456 %
LVQ          TS         1.004 %
SLIPPER      TRS        0.204 %
SLIPPER      VS         0.223 %
SLIPPER      TS         0.261 %
As is evident from Table 1, the performance of the base classifiers is good. Nevertheless, the best result on the TS, obtained by SLIPPER, indicates that about a thousand connection records are still misclassified. In order to choose a value for t by following the approach described in Sect. 2, Table 2 reports the results obtained by the proposed system on the VS for different values of the temporal window t. From this table it is evident that the optimal value of t can be fixed to 4, even though values of t equal to 3 and 5 also give good results.

Table 2. Results obtained by the proposed system on the VS as the value of t varies. The reliability threshold was fixed to 0.6. The optimal value of t is marked with an asterisk.

t     Error rate   Reject rate
1     0.198 %      0.216 %
2     0.191 %      0.153 %
3     0.189 %      0.165 %
4*    0.187 %      0.179 %
5     0.187 %      0.202 %
6     0.294 %      0.118 %
7     0.289 %      0.153 %
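The choice of t from the validation results can be made explicit with a short Python sketch. The selection rule used here, lowest error rate first and lowest reject rate as tie-breaker, is only one possible reading of the "best trade-off between reject and error rate" and is an assumption of this example; the function name is hypothetical, while the figures are those of Table 2.

```python
from typing import Dict, Tuple

# Table 2: t -> (error rate %, reject rate %) on the validation set.
validation_results: Dict[int, Tuple[float, float]] = {
    1: (0.198, 0.216),
    2: (0.191, 0.153),
    3: (0.189, 0.165),
    4: (0.187, 0.179),
    5: (0.187, 0.202),
    6: (0.294, 0.118),
    7: (0.289, 0.153),
}


def select_window(results: Dict[int, Tuple[float, float]]) -> int:
    """Pick t with the lowest error rate, breaking ties on the reject rate (assumed rule)."""
    return min(results, key=lambda t: results[t])


best_t = select_window(validation_results)
print(best_t)
```

With this rule, t = 4 and t = 5 tie on the error rate, and t = 4 is preferred because it rejects fewer patterns.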
In order to verify the soundness of this choice, Table 3 reports the results obtained on the TS as a function of t. In this case, the best results were obtained for a slightly different value of t (3 instead of 4). However, the results obtained for the selected value of t are also significant. In particular, the proposed system reduces the error rate from 0.261% for the best base classifier (SLIPPER) to 0.162%, i.e. by about 38%. The temporal window yields only a slight improvement in error rate with respect to the case t = 1 (i.e., the standard BKS rule), but it improves the reliability of the system: adopting t = 4 instead of the standard BKS rule lowers the reject rate from 0.922% to 0.735%. Since the error rate remains practically the same, about 750 patterns (0.187% of the roughly 400,000 in the TS) that were previously rejected are now correctly classified.
Table 3. Results obtained by the proposed system on the TS as the value of t varies. The reliability threshold was fixed to 0.6.

t     Error rate   Reject rate
1     0.163 %      0.922 %
2     0.162 %      0.666 %
3     0.162 %      0.656 %
4     0.162 %      0.735 %
5     0.162 %      0.861 %
6     0.536 %      0.522 %
7     0.522 %      0.688 %
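The pattern counts implied by the percentages in Tables 1 and 3 can be checked with a few lines of Python. The test set size is taken as exactly 400,000, which is an approximation of the "about 400,000 patterns" stated above, so the resulting counts are approximate as well; the variable and function names are hypothetical.

```python
TS_SIZE = 400_000  # approximation of the "about 400,000 patterns" in the test set


def count_from_rate(rate_percent: float, n_patterns: int) -> int:
    """Convert a percentage rate into an (approximate) number of patterns."""
    return round(rate_percent * n_patterns / 100)


# Errors of the best base classifier (SLIPPER, Table 1) and of the proposed system (Table 3).
slipper_errors = count_from_rate(0.261, TS_SIZE)
bks_errors = count_from_rate(0.162, TS_SIZE)
relative_reduction = 1 - bks_errors / slipper_errors

# Patterns no longer rejected when moving from t = 1 to t = 4 (Table 3).
recovered = count_from_rate(0.922, TS_SIZE) - count_from_rate(0.735, TS_SIZE)

print(slipper_errors, bks_errors, round(relative_reduction, 3), recovered)
```

Under this approximation, SLIPPER misclassifies about 1,044 test patterns against about 648 for the proposed system, a relative reduction of roughly 38%, and about 748 patterns are no longer rejected when the temporal window is set to t = 4 instead of t = 1.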
5 Conclusions
In this paper we proposed a multiple classifier approach to the problem of detecting intrusions in computer networks. It makes explicit use of temporal information to improve the reliability of the detection. The approach has been tested on a large database of patterns extracted from real network traffic traces. It proved able to improve on the classification capability of the base classifiers, as well as on the reliability of the performed detection, by suitably exploiting the temporal information. As a future development of the proposed multiple classifier approach, we plan to address the problem of automatically selecting the optimal reject threshold value. Moreover, we will work on the analysis of the rejected packets with slower but more accurate algorithms, in order to further improve the detection capability of the proposed approach.
References
1. G. Vigna, R. Kemmerer, “Netstat: a network based intrusion detection system”, Journal of Computer Security, vol. 7, no. 1, 1999.
2. S. Axelsson, “Research in Intrusion Detection Systems: A Survey”, TR 98-17, Chalmers University of Technology, 1999.
3. R. Kumar, E.H. Spafford, “A Software Architecture to Support Misuse Intrusion Detection”, in Proceedings of the 18th National Information Security Conference, pp. 194-204, 1995.
4. A.K. Ghosh, A. Schwartzbard, “A Study in Using Neural Networks for Anomaly and Misuse Detection”, Proc. 8th USENIX Security Symposium, Aug. 26-29 1999, Washington DC.
5. T. Lane, C.E. Brodley, “Temporal Sequence learning and data reduction for anomaly detection”, ACM Trans. on Inform. and System Security, vol. 2, no. 3, pp. 295-331, 1999.
6. W. Lee, S.J. Stolfo, “A framework for constructing features and models for intrusion detection systems”, ACM Trans. on Inform. and System Security, vol. 3, no. 4, pp. 227-261, 2000.
7. M. Esposito, C. Mazzariello, F. Oliviero, S.P. Romano, C. Sansone, “Real Time Detection of Novel Attacks by Means of Data Mining Techniques”, Proceedings of the 7th International Conference on Enterprise Information Systems, Miami (USA), May 24-28, pp. 120-127, 2005.
8. S.C. Lee, D.V. Heinbuch, “Training a neural Network based intrusion detector to recognize novel attack”, IEEE Trans. on Systems, Man, and Cybernetics, Part A, vol. 31, pp. 294-299, 2001.
9. M. Fugate, J.R. Gattiker, “Computer Intrusion Detection with Classification and Anomaly Detection, using SVMs”, International Journal of Pattern Recognition and Artificial Intelligence, vol. 17, no. 3, pp. 441-458, 2003.
10. G. Giacinto, F. Roli, L. Didaci, “Fusion of multiple classifiers for intrusion detection in computer networks”, Pattern Recognition Letters, vol. 24, pp. 1795-1803, 2003.
11. G. Giacinto, F. Roli, L. Didaci, “A Modular Multiple Classifier System for the Detection of Intrusions”, Lecture Notes in Computer Science vol. 2709, pp. 346-355, 2003.
12. L.P. Cordella, A. Limongiello, C. Sansone, “Network Intrusion Detection by a Multi Stage Classification System”, Lecture Notes in Computer Science vol. 3077, Springer, Berlin, pp. 324-333, 2004.
13. S. Singh, M. Markou, “Novelty detection: a review - part 2: neural network based approaches”, Signal Processing, vol. 83, no. 12, pp. 2499-2521, 2003.
14. J. Ryan, M.J. Lin, R. Miikkulainen, “Intrusion detection with neural networks”, in Advances in Neural Information Processing Systems 10, M. Jordan et al., Eds., Cambridge, MA: MIT Press, pp. 943-949, 1998.
15. K. Labib, R. Vemuri, “NSOM: A real-time network-based intrusion detection system using self-organizing maps”, Technical report, Dept. of Applied Science, University of California, Davis, 2002.
16. S. Axelsson, “The Base-Rate Fallacy and the Difficulty of Intrusion Detection”, ACM Trans. on Information and System Security, vol. 3, no. 3, pp. 186-205, 2000.
17. W.W. Cohen, Y. Singer, “A Simple, Fast, and Effective Rule Learner”, in Proc. of the Sixteenth National Conference on Artificial Intelligence and Eleventh Conference on Innovative Applications of Artificial Intelligence, July 18-22, Orlando, Florida, USA, pp. 335-342, 1999.
18. Y.S. Huang, C.Y. Suen, “A Method of Combining Multiple Experts for the Recognition of Unconstrained Handwritten Numerals”, IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 17, no. 1, pp. 90-94, 1995.
19. L.P. Cordella, C. Sansone, F. Tortorella, M. Vento, C. De Stefano, “Neural Network Classification Reliability: Problems and Applications”, in Image Processing and Pattern Recognition, vol. 5 of Neural Network Systems Techniques and Applications, Academic Press, San Diego, CA, pp. 161-200, 1998.
20. Y. Liu, K. Chen, X. Liao, W. Zhang, “A genetic clustering method for intrusion detection”, Pattern Recognition, vol. 37, 2004.
21. J. McHugh, “Testing Intrusion Detection Systems: A Critique of the 1998 and 1999 DARPA Intrusion Detection System Evaluations as Performed by Lincoln Laboratory”, ACM Transactions on Information and System Security, vol. 3, no. 4, pp. 262-294, 2000.
22. V. Paxson, S. Floyd, “Difficulties in simulating the internet”, IEEE/ACM Transactions on Networking, vol. 9, no. 4, pp. 392-403, 2001.
23. M. Mahoney, “A Machine Learning Approach to Detecting Attacks by Identifying Anomalies in Network Traffic”, PhD thesis, Florida Institute of Technology, 2003.