Distributed Investigations of Intrusion Detection Data on the Grid
Marius Joldos, Ioan Lucian Muntean
Computer Science Dept., Technical University of Cluj-Napoca, Cluj-Napoca, Romania
Email: Marius.Joldos, [email protected]
Abstract—In this paper we demonstrate the use of grid-based computing to reveal aspects of intrusion detection data that were difficult to uncover some years ago due to constrained computing resources. Exploiting grids for their computing resources can rapidly turn into a tedious task, entailing solid technical knowledge about the grid. Grid application frameworks are the category of tools that greatly simplify the use of the grid for the scientist whose interest lies solely in the computing resources. Mining security data is, we believe, a task that can be better served by the grid. Here, we attempt to reveal new insights into intrusion detection data, such as the well-known KDDCUP'99 data set.
Keywords: Grid computing, Intrusion detection, Data mining, KDD data set.
I. MOTIVATION
Data mining has been one of the approaches used as a basis for building intrusion detection systems (IDS), because the amount of information an IDS has to process is vast and grows with the speed of computer networks. Unfortunately, the performance of the most successful IDS based on this approach does not meet the constraints imposed by production systems: a very low rate of false positive and false negative alerts. Our intention is to further investigate the available data sets in order to discover what would be a better aggregate of the available data mining methods, and which features would result in more accurate detection, given the highly increased computing power of grid computing.
Why this topic? We believe that the available data is still not entirely exploited and understood. This is due to the selection of rather small subsets of the existing data when applying various approaches, which resulted in a rather poor representation of many of the attack patterns.
Why the grid? It is nowadays a common way to gain access to large computing resources. Grid application frameworks currently exist that allow the user/researcher to focus on the computation task rather than spend time learning how to exploit the resources. Here we make use of grids to find correlations between the degree of success of various machine learning algorithms and the characteristics of a well-known intrusion detection data set: the KDD Cup 1999 data [1].
II. RELATED WORK
John McHugh [2] was the first to criticize the KDDCUP data sets, pointing out that the synthetic data generated were quite far from what one collects from real networks. Still, due to the lack of publicly available data sets, they continue to be used. Sabhnani and Serpen [3] showed that no pattern classification or machine learning algorithm can be trained successfully on the KDD data set to perform misuse detection for two of the four attack categories it includes: the user-to-root and remote-to-local categories. Tavallaee et al. [4] conducted a statistical analysis of the KDDCUP'99 data set and found two important issues that affect the performance of machine learners: the huge number of redundant records in the train set and in the test set. To correct this, they proposed a new data set, which they called NSL-KDD, consisting of selected records from the original set. A noticeable fact is that, when machine learning is used for mining intrusion detection data, only a small part of the available data set (about 5%) is typically used, usually due to limited computing resources and the huge amount of time needed by some approaches (e.g., Support Vector Machines).
Exploiting grids for their computing resources can rapidly turn into a tedious task, entailing solid technical knowledge about the grid. Grid application frameworks are the category of tools that greatly simplify the use of the grid for the scientist whose interest lies solely in the computing resources. Relevant examples of such frameworks with applicability to a large group of end-user applications are Gridbus [5], g-Eclipse [6], GridSFEA [7] and its derivatives GridSFEA Trotter [8] and FSIonGrid [9]. All these applications and frameworks allow the scientist to pack computing tasks into grid jobs, to dispatch them to the computing resources, and to fetch the results. Both Gridbus and GridSFEA can access most of the grid middleware currently available on production grids.
g-Eclipse can, in addition, operate with clouds as well. Whereas Gridbus plays the roles of both middleware and end-user application, GridSFEA and g-Eclipse run on the end-user machines only. The software realization of GridSFEA (and derivatives) is lightweight compared to g-Eclipse and has better support for scenarios similar to the ones needed in this research. Data mining applications with support for distributed execution on the grid, such as Grid WEKA [10] and Weka4WS [11], are generally hard to maintain, especially due to the rapid change of the grid software. For this reason, we chose to distribute computing tasks using GridSFEA Trotter.

III. MINING SECURITY DATA
There are two categories of intrusion detection systems: misuse detection and anomaly detection. Systems in the first category find intrusions by monitoring network traffic in search of direct matches to known patterns of attack (called signatures or rules). A disadvantage of this approach is that it can only detect intrusions that match a pre-defined rule; an advantage is that such systems have low false alarm rates. When using the anomaly detection approach, the system defines the expected behavior of the network in advance. The profile of normal behavior is built using techniques that include statistical methods, association rules, and neural networks. Any significant deviation from this expected behavior is reported as a possible attack. In principle, the primary advantage of anomaly-based detection is the ability to detect novel attacks for which signatures have not yet been defined. In practice, however, this is difficult to achieve because it is hard to obtain accurate and comprehensive profiles of normal behavior. This makes an anomaly detection system generate too many false alarms, and sifting through the resulting data can be very time consuming and labor intensive.
As stated by Singhal and Jajodia in [12], the problem of intrusion detection can be reduced to a data mining task of classifying data: one is given a set of data points belonging to different classes (normal activity, different attacks) and aims to separate them as accurately as possible by means of a model. Unfortunately, as is known in the research community, public data sets for security purposes are scarce.
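The classification framing above can be illustrated with a minimal sketch. This is a toy nearest-centroid classifier over three hypothetical connection-record features (duration, source bytes, connection count); the feature values and class labels are illustrative only, not taken from any real data set or from the paper's pipeline.

```python
# Toy sketch: intrusion detection as classification. Records belonging to
# different classes are separated by a model -- here, nearest centroid.
from collections import defaultdict
import math

# Hypothetical records: (duration, src_bytes, connection_count) -> class.
train = [
    ((2.0, 300.0, 1.0), "normal"),
    ((1.0, 250.0, 2.0), "normal"),
    ((0.0, 0.0, 500.0), "dos"),   # flood-like burst of short connections
    ((0.0, 1.0, 480.0), "dos"),
]

def fit_centroids(records):
    """Average the feature vectors of each class."""
    sums = defaultdict(lambda: [0.0, 0.0, 0.0])
    counts = defaultdict(int)
    for x, y in records:
        for i, v in enumerate(x):
            sums[y][i] += v
        counts[y] += 1
    return {y: tuple(s / counts[y] for s in sums[y]) for y in sums}

def predict(centroids, x):
    """Assign x to the class with the nearest centroid."""
    return min(centroids, key=lambda y: math.dist(x, centroids[y]))

centroids = fit_centroids(train)
print(predict(centroids, (0.0, 2.0, 450.0)))  # flood-like record -> "dos"
```

Real approaches in the paper (decision trees, Naive Bayes, rule learners) differ in the model they fit, but share this separate-the-classes formulation.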
The sets one can access easily are quite few, e.g.:
∙ The original KDDCUP data set, provided as a CSV file with features listed in a separate file, suitable for many classifiers.
∙ Dae-Ki Kang [13] provides intrusion detection benchmark data sets in a bag-of-system-calls representation. His sets are available upon request and include the Sequence Time-Delay Embedding (STIDE) benchmark data sets from the University of New Mexico and the evaluation set from MIT.
∙ Song et al. [14] proposed a new data set retrieved from a real environment. By utilizing IDS and honeypots deployed on a real network, they gathered a fresh data set which includes a large number of false positive alerts.
The tools involved in processing the available data vary from specially crafted tools to general machine learning tools. Yale [15] offers extensive functionality for process evaluation and optimization, which is a crucial property for any KDD rapid prototyping tool. It started as open source, and the community edition is still free, although with fewer features. Perhaps the most widely used is the Waikato Environment for Knowledge Analysis (WEKA) [16]. The workbench includes
algorithms for regression, classification, clustering, association rule mining, and attribute selection. It includes data visualization facilities and many preprocessing tools for preliminary exploration of data.
The KDD Cup 1999 intrusion detection contest data was prepared for the 1998 DARPA Intrusion Detection Evaluation program by MIT Lincoln Laboratory, which acquired nine weeks of raw TCP dump data. The raw data was processed into about 5 million connection records. The data set contains 24 attack types, which fall into four main categories:
1) Denial of service (DoS): the attacker makes some computing or memory resource too busy or too full to handle legitimate requests, or denies legitimate users access to a machine. Examples: Apache2, Back, Land, Mailbomb, SYN Flood, Ping of Death, Process Table, Smurf, Teardrop.
2) Remote to user (R2L): an attacker who does not have an account on a remote machine sends packets to that machine over a network and exploits some vulnerability to gain local access as a user of that machine. Examples: Dictionary, Ftp_write, Guest, Imap, Named, Phf, Sendmail, Xlock.
3) User to root (U2R): the attacker starts out with access to a normal user account on the system and exploits system vulnerabilities to gain root access. Examples: Eject, Loadmodule, Ps, Xterm, Perl, Fdformat.
4) Probe: the attacker scans a network of computers to gather information or find known vulnerabilities; with a map of the machines and services available on a network, the attacker can look for exploits. Examples: Ipsweep, Mscan, Saint, Satan, Nmap.
The data set has 41 attributes for each connection record, plus one class label.
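Collapsing the individual attack labels into these four groups is a simple lookup. The sketch below is illustrative: it covers only a handful of the raw KDD labels (in the lowercase, dot-terminated form used in the KDD CSV files), not the full set of 24 attack types.

```python
# Hedged sketch: map raw KDD class labels to the four attack categories.
# Only a subset of the attack names is listed here for illustration.
ATTACK_CATEGORY = {
    # Denial of service
    "back": "dos", "land": "dos", "smurf": "dos", "teardrop": "dos",
    "pod": "dos", "neptune": "dos",
    # Remote to user
    "ftp_write": "r2l", "guess_passwd": "r2l", "imap": "r2l", "phf": "r2l",
    # User to root
    "loadmodule": "u2r", "perl": "u2r", "rootkit": "u2r",
    # Probe
    "ipsweep": "probe", "nmap": "probe", "satan": "probe", "portsweep": "probe",
}

def categorize(label: str) -> str:
    """Collapse a raw KDD class label (e.g. 'smurf.') into one of the four
    attack groups, keeping 'normal' as its own class."""
    label = label.rstrip(".").lower()
    if label == "normal":
        return "normal"
    return ATTACK_CATEGORY.get(label, "unknown")

print(categorize("smurf."))   # dos
print(categorize("normal."))  # normal
```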
R2L and U2R attacks do not have sequential patterns like DoS and Probe attacks do, because the former embed the attack in the data packets, whereas the latter involve many connections in a short amount of time. Therefore, features that look for suspicious behavior in the data packets, such as the number of failed logins, are constructed; these are called content features.

IV. OUR APPROACH: GRID-BASED INVESTIGATIONS
In our approach to the mining of security data, we partition the benchmark data into large data sets. We compare the distances between the feature vectors inside these sets, in an attempt to discover, e.g., why clustering methods fail, and we compare this with what statistical methods yield on the same data. We then employ WEKA to train different classifiers, using the larger sets and different feature vectors, and aggregate them using voting.
The goal of our distributed approach is to harness the computing resources available on the grid. The analysis of each partition is handled as an independent grid job. The initial data
resides locally. The formulation of the grid task is carried out by GridSFEA, which runs on the end-user's machine. Using the user's X.509 certificate, the application framework uploads the data partitions to the grid and initiates the remote computations. Upon completion, the results of the first processing phase are fed to the grid job corresponding to the second phase. All jobs can be carried out on grid deployments based on the Globus Toolkit (4 or 5) or on the gLite middleware. The computation tasks are described using the Job Submission Description Language (JSDL). GridSFEA translates from this generic language into the ones employed by each grid middleware.

A. GridSFEA Trotter - An Application for Grids
In our approach, the grid user is assisted by the GridSFEA Trotter application in the task of formulating and executing jobs on the grid. The Trotter is based on the GridSFEA framework and has a plugin-based software organization. As depicted in Fig. 1, the Trotter loads at runtime plugins capable of submitting jobs to GT4-, GT5-, or gLite-based grids. The middleware details are handled by means of third-party
software such as the CoG Kit, GAT, etc. By writing the jobs in JSDL format, the user can express typical requirements for his computations without having to learn the specification languages native to the different grid middleware (e.g., the Resource Specification Language for Globus or the Job Definition Language for gLite). This application runs on the end-user machine and handles all the typical steps that occur in the interaction with the grid: it delegates the user's credentials for the execution of the job; it stages local data to the grid so that everything needed for the processing resides remotely; it registers computation results with the GridSFEA framework; and it fetches remote results back to the end-user machine. Submitted jobs can be managed by the user, e.g., stopped or canceled.

Fig. 1. User jobs defined in JSDL format are transformed on-the-fly into middleware-specific outgoing job requests by GridSFEA Trotter.

In our investigations, the data used was derived from the original KDD Cup 99 data set according to the following procedure:
1) The original category labels were changed into labels for the attack groups mentioned in Section III.
2) The redundant records were eliminated from the sets.
3) Four training sets were created, using a random choice of records from each group. Each set contains about one quarter of the total records, as detailed in Table I. Note that the rare attacks are all included in each mix.
4) Batches of jobs for Weka are created and run on the grid.
5) Results are downloaded to a local machine for analysis and interpretation.
Steps 1 to 3 are carried out outside the grid because they are not resource intensive and need the intervention of the experimenter in order to correct possible errors.
Reduced sets of features have been proposed by different authors, such as Zhu [17] and Srinivasulu [18]. We are currently investigating the applicability of these feature sets to a larger subset of the data. Those authors worked with a small subset (about 5% of the original) chosen at random. An important argument against using such small subsets is that some attacks, such as U2R, have very few records in the set and, if the proportions of the original file are kept, they will be underrepresented. This is why we decided to use 25% of the original set and to oversample the types of attacks that were very scarce in the original data.

TABLE I
CATEGORIES OF UNIQUE RECORDS FROM KDD CUP 99 DATA SET
Category   Number of records per training set
Normal     203204
DoS        61817
Probe      3463
R2L        999
U2R        52
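Steps 1 to 3 can be sketched as follows. This is a toy illustration on synthetic records with hypothetical helper names, not the actual preprocessing code: it assumes records are already relabeled into attack groups (Step 1), then deduplicates them (Step 2) and draws a training set that keeps every rare-attack record while sampling a quarter of the rest (Step 3).

```python
# Toy sketch of Steps 2-3: deduplicate records, keep all rare-attack
# records, and sample a fraction of the remaining ones.
import random

def build_training_set(records, rare={"r2l", "u2r"}, fraction=0.25, seed=0):
    """records: list of (feature_tuple, group_label) pairs, already
    relabeled into the four attack groups plus 'normal' (Step 1)."""
    # Step 2: eliminate redundant (duplicate) records, preserving order.
    unique = list(dict.fromkeys(records))
    # Step 3: keep every rare-attack record, sample a quarter of the rest.
    rng = random.Random(seed)
    kept = [r for r in unique if r[1] in rare]
    rest = [r for r in unique if r[1] not in rare]
    kept += rng.sample(rest, int(len(rest) * fraction))
    return kept

# Synthetic data: 100 normal, 40 dos, and a few rare records (one duplicate).
toy = [((i, i % 3), "normal") for i in range(100)] \
    + [((i, 0), "dos") for i in range(40)] \
    + [((1, 1), "u2r"), ((2, 2), "r2l"), ((1, 1), "u2r")]
train = build_training_set(toy)
print(len(train))  # all rare records plus a quarter of the unique rest
```

In the actual experiments, the original authors' practice of drawing a 5% random subset would leave the rare categories underrepresented, which is exactly what the keep-all-rare rule above avoids.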
V. RESULTS
The experiments we have carried out so far used the following machine learning approaches: J48 decision tree learning [19], Naive Bayes [20], NBTree [21], and JRip, Weka's implementation of Cohen's RIPPER rule learner. Figure 2 shows the accuracies for the four methods involved. No feature extraction method was used; thus all original features were considered in learning, which led to high computational demands. One can notice that the best accuracy is obtained by J48 (a C4.5 implementation), whereas the probabilistic (Bayes) methods perform quite poorly. The amounts of time needed to complete learning for the methods used are given in Figure 3.

Fig. 2. Accuracies of learning for four methods (J48, Naive Bayes, NBTree and JRip).

Fig. 3. Amounts of time taken to build a model for four machine learning methods (J48, Naive Bayes, NBTree and JRip).

Table II shows the confusion matrix for the best performing method, on test data. It is easy to notice that the worst cases are the R2L and U2R attacks.

TABLE II
CONFUSION MATRIX FOR THE BEST PERFORMING APPROACH (J48)
Attack type   Classified as
              none    probe   dos    r2l   u2r
none          40283   3       37     0     0
probe         6       264     2      0     0
dos           372     3       2415   0     0
r2l           190     155     0      1     0
u2r           11      0       0      6     0

VI. CONCLUSIONS
Exploiting grids for their computing resources can rapidly turn into a tedious task, entailing solid technical knowledge about the grid. Grid application frameworks are the category of tools that greatly simplify the use of the grid for the scientist whose interest lies solely in the computing resources. In this paper we showed a way of employing grid-based computing to perform learning tasks for intrusion detection that were impossible to carry out on constrained computing resources. We intend to continue our investigations on the available sets using combined methods, and to provide a semi-automated method for carrying out the experiments on the grid machines. We also plan to record a new data set for intrusion detection in our university (with a focus on insider activity), and to make that set publicly available by the end of this year.
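The per-class detection rates behind Table II follow directly from the confusion matrix. The sketch below recomputes them from the J48 numbers in the table (rows are actual classes, columns are predicted classes); the recall figures themselves are derived here, not quoted from the paper.

```python
# Per-class recall (detection rate) from the Table II confusion matrix.
labels = ["none", "probe", "dos", "r2l", "u2r"]
matrix = [  # predicted: none   probe  dos   r2l  u2r
    [40283,     3,    37,   0,  0],   # actual none
    [    6,   264,     2,   0,  0],   # actual probe
    [  372,     3,  2415,   0,  0],   # actual dos
    [  190,   155,     0,   1,  0],   # actual r2l
    [   11,     0,     0,   6,  0],   # actual u2r
]

recalls = {}
for i, label in enumerate(labels):
    total = sum(matrix[i])  # all test records of this actual class
    recalls[label] = matrix[i][i] / total if total else 0.0
    print(f"{label:6s} recall = {recalls[label]:.3f}")
# R2L and U2R are almost never recognized, matching the observation that
# J48 performs worst on these two categories.
```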
REFERENCES
[1] (1999) KDD Cup 1999 Data. [Online]. Available: http://kdd.ics.uci.edu/databases/kddcup99/kddcup99.html
[2] J. McHugh, "Testing intrusion detection systems: a critique of the 1998 and 1999 DARPA intrusion detection system evaluations as performed by Lincoln Laboratory," ACM Trans. Inf. Syst. Secur., vol. 3, pp. 262–294, November 2000. [Online]. Available: http://doi.acm.org/10.1145/382912.382923
[3] M. Sabhnani and G. Serpen, "Why machine learning algorithms fail in misuse detection on KDD intrusion detection data set," Intelligent Data Analysis, vol. 8, no. 4, pp. 403–415, 2004.
[4] M. Tavallaee, E. Bagheri, W. Lu, and A. A. Ghorbani, "A detailed analysis of the KDD CUP 99 data set," in Proceedings of the 2009 IEEE Symposium on Computational Intelligence in Security and Defense Applications (CISDA 2009), 2009.
[5] R. Buyya and S. Venugopal, "The Gridbus toolkit for service oriented grid and utility computing: An overview and status report," 2004.
[6] H. Kornmayer, M. Stümpert, H. Gjermundrød, and P. Wolniewicz, "g-Eclipse – a contextualised framework for grid users, grid resource providers and grid application developers," in Computational Science – ICCS 2008, ser. LNCS, M. Bubak et al., Eds., vol. 5103. Springer, 2008, pp. 399–408.
[7] I. Muntean, "Efficient distributed numerical simulation on the grid," Ph.D. dissertation, Institut für Informatik, Technische Universität München, 2008.
[8] I. Muntean and A. Badiu, "Application plugins for distributed simulations on the grid," INT J COMPUT COMMUN CONTROL, 2011, accepted.
[9] I. Muntean, "Plugins for numerical simulation with GridSFEA on computing grid," in Procs. of the 8th Intl. RoEduNet Conference, Galati, 2009, pp. 51–56.
[10] R. Khoussainov, X. Zuo, and N. Kushmerick, "Grid-enabled Weka: a toolkit for machine learning on the grid," http://userweb.port.ac.uk/~khusainr/weka/ercim04 rk final.pdf, 2004.
[11] M. Lackovic, D. Talia, and P. Trunfio, "Service oriented KDD: A framework for grid data mining workflows," in Proceedings of the 2008 IEEE International Conference on Data Mining Workshops. Washington, DC, USA: IEEE Computer Society, 2008, pp. 496–505. [Online]. Available: http://portal.acm.org/citation.cfm?id=1490299.1490731
[12] A. Singhal and S. Jajodia, "Data mining for intrusion detection," in Data Mining and Knowledge Discovery Handbook, 2nd ed. Springer-Verlag, 2010, pp. 1171–1180.
[13] D.-K. Kang. (2005) Benchmark data sets for intrusion detection system in bag of system calls representation. [Online]. Available: http://www.cs.iastate.edu/~dkkang/IDS Bag/
[14] J. Song, H. Takakura, and Y. Okabe. (2007, January) A proposal of new benchmark data to evaluate mining algorithms for intrusion detection. http://www.apan.net/meetings/manila2007/presentations/security/algo.ppt
[15] I. Mierswa, M. Wurst, R. Klinkenberg, M. Scholz, and T. Euler, "Yale: Rapid prototyping for complex data mining tasks," in KDD '06: Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. New York, NY, USA: ACM, August 2006, pp. 935–940.
[16] M. Hall, E. Frank, G. Holmes, B. Pfahringer, P. Reutemann, and I. H. Witten, "The WEKA data mining software: An update," SIGKDD Explorations, vol. 11, pp. 10–19, 2009.
[17] X. Zhu, "Anomaly detection through statistics-based machine learning for computer networks," Ph.D. dissertation, The University of Arizona, 2006.
[18] P. Srinivasulu, R. S. Prasad, and I. R. Babu, "Intelligent network intrusion detection using DT and BN classification techniques," Int. J. Advance. Soft Comput. Appl., vol. 2, no. 1, pp. 124–141, March 2010.
[19] R. Quinlan, C4.5: Programs for Machine Learning. San Mateo, CA: Morgan Kaufmann, 1993.
[20] G. H. John and P. Langley, "Estimating continuous distributions in Bayesian classifiers," in Eleventh Conference on Uncertainty in Artificial Intelligence. San Mateo: Morgan Kaufmann, 1995, pp. 338–345.
[21] R. Kohavi, "Scaling up the accuracy of naive-Bayes classifiers: A decision-tree hybrid," in Second International Conference on Knowledge Discovery and Data Mining, 1996, pp. 202–207.