Intrusion Detection Using Graph Support: A Hybrid Approach of Supervised and Unsupervised Techniques
Om Pal *1, Peeyush Jain *1, Sudhansu Goyal *1, Zia Saquib *1, Bernard L. Menezes *2
*1 Centre for Development of Advanced Computing, Mumbai, India
*2 Indian Institute of Technology Bombay, Mumbai, India
{ompal, peeyushj, sudhansu, saquib}@cdacmumbai.in, [email protected]
doi: 10.4156/ijact.vol2.issue3.12
Abstract
At present it is almost impossible to detect zero-day attacks with supervised anomaly detection methods. Unsupervised techniques can detect zero-day attacks but suffer from a low detection rate. Combining unsupervised and supervised methods can produce promising detection results. In this paper we present a new sequence-based graph support technique that is suitable for both supervised and unsupervised anomaly detection. Based on system call interception, we present an intrusion detection system using graph support. The system intercepts every system call invoked by application programs and builds a graph from which the support is calculated. Once there is evidence of a deviation (the graph support is less than a threshold), the system can terminate the malicious process before it harms the system. With the help of graph support we could detect malicious attacks on the system.
Keywords: Intrusion detection, Unsupervised anomaly detection, Graph support, System calls.

1. Introduction
The number of attacks on computer systems is on the rise. Network intruders have often easily overcome the password authentication mechanisms designed to protect systems. With an increased understanding of how systems work, intruders have become skilled at determining their weaknesses and exploiting them to obtain unauthorized privileges. Intruders also use patterns of intrusion that are often difficult to trace and identify. Tools are therefore necessary to monitor systems, to detect break-ins, and to respond actively to attacks in real time.

An Intrusion Detection System (IDS) is the technology that reports a security compromise by an unwanted party. Intrusion detection is defined as “identifying unauthorized use, misuse, and abuse of computer systems by both inside and outside intruders” [3]. An IDS simply monitors activities on a host system (HIDS) or on a network (NIDS) without offering any resistance to the attackers.

IDSs can be divided into two categories: (a) misuse-based IDS and (b) anomaly-based IDS. Anomaly-based IDS is further divided into two categories: (b1) supervised anomaly detection and (b2) unsupervised anomaly detection [2]. Misuse detection maintains a database of signatures and attack history for detecting attacks. It gives a high detection rate with a low false alarm rate if the signature database is up to date. The drawback of misuse-based techniques is that they cannot detect novel attacks, because no signatures are available for them. Supervised anomaly-based techniques, on the other hand, detect novel attacks by learning the behaviour and flow of data. However, anomaly-based detection gives a high false alarm rate because the deviation layer between normal flow and abnormal flow is thin.
Another drawback of the supervised anomaly-based approach is that the system cannot produce correct results if the training data is contaminated by an attack during the training phase; as a result it gives a high false negative rate. Unsupervised algorithms, on the other hand, learn the internal tightness and association of the training data items with one another. In an attack, the intrusiveness of one dimension depends on the other dimensions as well as on its sub-dimensions, so unsupervised techniques produce fewer false alarms than supervised techniques because they do not depend on the thickness of the deviation layer between normal flow and abnormal flow. The drawback of unsupervised techniques is that they do not detect DoS attacks, because they do not check the threshold level of data flow in the detection phase.
International journal of Advancements in Computing Technology Volume 2, Number 3, August, 2010

Data mining techniques are heavily used in intrusion detection [2,4]. Much research on unsupervised anomaly detection has applied data mining to the order of system call sequences [4,5]. In the sequence-based approach, if the system call sequence a, b, c is a normal pattern in the training phase, then only a, b, c is considered a normal pattern in the detection phase, and any other ordering such as c, b, a is treated as abnormal. Some researchers have focused on system call sequences, but the length of the sequence is still an open issue [1]: the ideal length is not known. Our approach to these problems is based on graph support and system call sequences. We first create graphs of the system call sequences in the training phase and in the detection phase, and then calculate the support of the detection phase graph. By requiring a minimum support of the detection phase graph with reference to the training phase graph, we are able to improve detection accuracy.
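The order sensitivity of the sequence-based approach can be illustrated with a minimal Python sketch. The window length of 3 and the names `windows` and `is_normal_window` are assumptions made for illustration; they are not taken from the cited work.

```python
def windows(trace, length=3):
    """Return every contiguous window of `length` system calls in `trace`."""
    return [tuple(trace[i:i + length]) for i in range(len(trace) - length + 1)]


# Training phase: remember every window observed in a normal trace.
training_trace = ["a", "b", "c", "a", "b", "c"]
normal_windows = set(windows(training_trace))


def is_normal_window(window):
    """A window is normal only if the same calls occurred in the same order."""
    return tuple(window) in normal_windows


print(is_normal_window(["a", "b", "c"]))  # seen in training
print(is_normal_window(["c", "b", "a"]))  # same calls, different order
```

The fixed window length is exactly the parameter whose ideal value is unclear, which is the issue the graph-based scheme is meant to avoid.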
2. Related Work and Background
Leung and Leckie [2] introduced a grid-based and density-based clustering approach for separating frequent item sets from non-frequent item sets, and used their density-based support mechanism to identify unseen attacks in anomaly detection. The authors of [1,8] introduced sequence-based anomaly detection approaches. Han et al. [6] developed the frequent pattern tree, a data structure that stores data in compact form and gives quick access to frequent items. Ye and Chen [7] presented a chi-square-based anomaly detection technique that detects intrusions on the basis of a threshold value from the profile phase. Rouil et al. [9] introduced a similarity-cluster-based unsupervised technique for detecting DoS attacks. A group from Iowa State [12] introduced a system-call-based state machine model for detecting intrusions; however, it is not able to detect novel attacks. The scheme proposed in this paper combines unsupervised and supervised methods and produces promising detection results for both seen and unseen attacks.
2.1 Support, Confidence and Frequent Mining
Support, confidence and frequent mining are general terms in the field of data mining. Yu and Wu [10] presented a mining algorithm that generates frequent sets using a frequent pattern tree. If the support of bread → milk is 60% over 100 transactions, the set {bread, milk} appears in 60 transactions. If the confidence of bread → milk is 70%, milk occurs in 70% of the transactions that contain bread, and bread occurs without milk in the remaining 30%.

2.2 Graph Support
Given a collection of graphs G, the support of a subgraph g is defined as the fraction of all graphs in G that contain g as a subgraph:

$$\mathrm{Support}(g) = \frac{\left|\{\, G_i \mid g \subseteq G_i,\ G_i \in G \,\}\right|}{|G|}$$

2.3 Frequent Subgraph Mining
Given a set of graphs G and a minimum support threshold, the goal of frequent subgraph mining is to find all subgraphs g such that Support(g) >= minimum support.
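The definitions in Sections 2.2 and 2.3 can be sketched in Python as follows. This is a minimal illustration under a simplifying assumption: every vertex carries a unique label, so a graph can be represented as a set of directed edges and subgraph containment reduces to a subset test (general subgraph isomorphism is not handled). The function names and the sample graphs are hypothetical.

```python
def support(g, graphs):
    """Fraction of graphs in `graphs` that contain every edge of `g`."""
    if not graphs:
        return 0.0
    containing = sum(1 for G_i in graphs if g <= G_i)
    return containing / len(graphs)


def frequent_subgraphs(candidates, graphs, min_support):
    """Keep the candidate subgraphs whose support reaches `min_support`."""
    return [g for g in candidates if support(g, graphs) >= min_support]


# Hypothetical collection of graphs, each a set of directed edges.
graphs = [
    frozenset({("a", "b"), ("b", "d"), ("d", "c")}),
    frozenset({("a", "b"), ("b", "d")}),
    frozenset({("a", "c"), ("c", "d")}),
    frozenset({("a", "b"), ("a", "c")}),
]
candidates = [
    frozenset({("a", "b")}),
    frozenset({("a", "b"), ("b", "d")}),
    frozenset({("c", "d")}),
]

for g in candidates:
    print(sorted(g), support(g, graphs))
print(frequent_subgraphs(candidates, graphs, min_support=0.5))
```

With these sample graphs the three candidates have supports 0.75, 0.5 and 0.25, so only the first two pass a minimum support of 0.5.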
3. Methodology
The aim of the proposed scheme is to discover the closeness and associability among the data items in the training phase and in the detection phase. If the closeness and associability of the data in the detection phase match those of the training phase, the flow of data in the detection phase is normal; in case of a mismatch, the scheme assumes that the flow is abnormal. To measure closeness and associability, the scheme uses the concept of graph support.
3.1 Generation of Training Phase Graph

[Figure 1. Graph G. Labels recoverable from the figure: Start; nodes a, b, c, d, e; edges Eab (Fab=7), Ebd (Fbd=20), Eac (Fac=2), Edc (Fdc=32), Ecd (Fcd=4).]

The scheme collects system call data in the training phase and generates a graph of these system calls. The generated training graph is stored permanently in a database and is used in the detection phase. Let a, b, c, d be system calls generated in the training phase and G the graph generated from them. Eab denotes the event that system call b follows system call a, and Fab is the frequency of Eab; likewise, Eba denotes the event that system call a follows system call b, and Fba is its frequency.
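A minimal Python sketch of this graph construction is given below. It assumes the trace is available as an ordered list of system call names and stores the graph as a mapping from a directed edge (a, b) to its frequency Fab. The function name `build_graph` and the sample trace are hypothetical, and the Start node of Figure 1 is omitted for brevity.

```python
from collections import Counter


def build_graph(trace):
    """Count how often each system call is directly followed by another.

    The key (a, b) stands for event Eab and the value for its frequency Fab.
    """
    return Counter(zip(trace, trace[1:]))


# Hypothetical training trace of system calls.
training_trace = ["a", "b", "d", "c", "d", "c", "a", "b"]
G = build_graph(training_trace)

for (src, dst), freq in sorted(G.items()):
    print(f"E{src}{dst}: F{src}{dst}={freq}")
```

Because the edges are directed, Eab and Eba are counted separately, as the text requires.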
3.2 Setting of Support Threshold
To set the minimum support threshold for normal flow, graphs of normal flows as well as abnormal flows are generated in the training phase. Using the training phase graph together with the generated normal flow and abnormal flow graphs, the scheme calculates the false positive and false negative rates while varying the threshold value from 0 to 1. The support threshold is then set according to the desired false positive and false negative rates. This support setting phase is similar to the detection phase. The only difference is that in the detection phase the scheme compares the calculated support with the stored threshold, whereas in the support setting phase the scheme determines the minimum support by experimenting on normal and abnormal flows.
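The threshold sweep can be sketched as follows. The sketch assumes that the graph support of every labelled normal and abnormal flow has already been computed; the support values, the step count and the function names are hypothetical and are not taken from the paper's experiments.

```python
def error_rates(normal_supports, abnormal_supports, threshold):
    """Return (false positive rate, false negative rate) for one threshold.

    A flow is flagged as abnormal when its support is not greater than
    the threshold, following the rule stated in Section 3.4.
    """
    false_pos = sum(1 for s in normal_supports if s <= threshold)
    false_neg = sum(1 for s in abnormal_supports if s > threshold)
    return (false_pos / len(normal_supports),
            false_neg / len(abnormal_supports))


def sweep(normal_supports, abnormal_supports, steps=10):
    """Evaluate the thresholds 0, 1/steps, ..., 1."""
    return [(k / steps,
             *error_rates(normal_supports, abnormal_supports, k / steps))
            for k in range(steps + 1)]


# Hypothetical support values for labelled flows.
normal_supports = [0.9, 0.8, 0.6, 0.4]
abnormal_supports = [0.1, 0.3, 0.5, 0.7]

for threshold, fp, fn in sweep(normal_supports, abnormal_supports):
    print(f"threshold={threshold:.1f}  FP={fp:.2f}  FN={fn:.2f}")
```

Raising the threshold moves errors from false negatives to false positives, so the operator picks the row that matches the desired balance.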
3.3 Generation of Detection Phase Graph
In the detection phase, a graph g of the system calls observed over a duration t is created. After creating graph g, the support of each edge of g is calculated with reference to the training phase graph G. Let ei,j be a directed edge of graph g and fi,j the frequency of edge ei,j.

[Figure 2. Graph g. Labels recoverable from the figure: Start; nodes a, b, c, d, e; edges Eab (Fab=7), Ebd (Fbd=20), Eac (Fac=2), Edc (Fdc=32), Ecd (Fcd=4).]

Calculation of support flag for each edge
The support flag records whether a directed edge ei,j of graph g also exists in the training phase graph G. If the edge exists in G, the flag is 1 (true); otherwise it is 0 (false). Let Sf(ei,j) be the support flag for edge ei,j.

For every edge ei,j ∈ g do
    If ei,j ∈ G then Sf(ei,j) = 1
    Else Sf(ei,j) = 0

Calculation of graph support of g
Let S(g) be the graph support of g, e = {e1, e2, e3, ..., en} the set of directed edges of graph g, and fk the frequency of edge ek. Then

$$S(g) = \frac{\sum_{k=1}^{n} f_k \, S_f(e_k)}{\sum_{k=1}^{n} f_k}$$
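The support flag and the graph support S(g) can be sketched in Python as below. The detection graph g is assumed to be a mapping from directed edges to frequencies and the training graph G a set of directed edges; these representations, the function names and the choice of which edge is missing from G are illustrative assumptions. The frequencies are the ones shown in Figure 2.

```python
def support_flag(edge, G):
    """Sf(e): 1 if the directed edge also exists in the training graph G."""
    return 1 if edge in G else 0


def graph_support(g, G):
    """S(g): frequency-weighted share of the edges of g that exist in G."""
    total = sum(g.values())
    if total == 0:
        return 0.0  # assumption: an empty detection graph has no support
    matched = sum(freq * support_flag(edge, G) for edge, freq in g.items())
    return matched / total


# Detection graph with the frequencies shown in Figure 2.
g = {("a", "b"): 7, ("b", "d"): 20, ("a", "c"): 2, ("d", "c"): 32, ("c", "d"): 4}
# Hypothetical training graph in which edge (a, c) was never observed.
G = {("a", "b"), ("b", "d"), ("d", "c"), ("c", "d")}

print(graph_support(g, G))
```

Weighting by frequency means that a rarely taken unknown edge lowers the support only slightly, while a frequently taken unknown edge lowers it sharply.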
3.4 Detection of Abnormal Behavior
To detect abnormal flow, the scheme compares the detection phase support S(g) with the support threshold. If S(g) is greater than the threshold, the flow is normal; otherwise the flow is abnormal.
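As a worked illustration of this rule, take the edge frequencies shown in Figure 2, whose total is 7 + 20 + 2 + 32 + 4 = 65, and the threshold 0.555 reported in Section 5. Which edges are missing from the training graph G is a hypothetical assumption in both cases below. If the edges Ebd and Eac were absent from G:

$$S(g) = \frac{7 + 32 + 4}{65} = \frac{43}{65} \approx 0.662 > 0.555 \quad\Rightarrow\quad \text{normal}$$

If instead only the edge Edc were absent from G:

$$S(g) = \frac{7 + 20 + 2 + 4}{65} = \frac{33}{65} \approx 0.508 < 0.555 \quad\Rightarrow\quad \text{abnormal}$$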
4. Complexity Analysis
At run time the training graph G is needed to detect anomalies. The detection graph g is generated from run-time data, and its graph support is calculated against the training graph G. Let |E| be the number of edges and |V| the number of vertices of graph G, and let |e| be the number of edges and |v| the number of vertices of graph g. The space complexity can then be expressed as O(|E| + |V| + |e| + |v|). In the worst case, the search (assume breadth-first search) has to consider all paths to all possible nodes, so the time complexity of the search is O(b^d) for a single edge and |e| * O(b^d) for the whole graph g, where b is the branching factor and d is the depth of the tree. The time complexity can also be expressed as O(|E| + |V|) for a single edge and |e| * O(|E| + |V|) for the whole graph g, since every vertex and every edge will be explored in the worst case.
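The per-edge search assumed in this analysis can be sketched as a breadth-first traversal of the training graph. The adjacency-list representation, the node name "start" and the function name are assumptions for illustration. Each vertex is dequeued at most once and each edge is examined at most once, which gives the O(|E| + |V|) bound for a single edge.

```python
from collections import deque


def edge_exists_bfs(adjacency, start, edge):
    """Breadth-first search of the training graph for one directed edge."""
    source, target = edge
    visited = {start}
    queue = deque([start])
    while queue:
        vertex = queue.popleft()
        for neighbour in adjacency.get(vertex, ()):
            if (vertex, neighbour) == (source, target):
                return True
            if neighbour not in visited:
                visited.add(neighbour)
                queue.append(neighbour)
    return False


# Hypothetical training graph stored as adjacency lists.
adjacency = {
    "start": ["a"],
    "a": ["b", "c"],
    "b": ["d"],
    "d": ["c"],
    "c": ["d"],
}

print(edge_exists_bfs(adjacency, "start", ("d", "c")))
print(edge_exists_bfs(adjacency, "start", ("c", "a")))
```

Repeating the search for each of the |e| edges of g yields the |e| * O(|E| + |V|) total stated above.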
5. Experimental Result
For the experiments, the system calls of different processes such as ps and ssh were traced. The strace command was used to trace the system calls of each process. Normal data was first collected with strace; the system was then attacked to obtain abnormal data. Experiments were run on multiple sets of this data to calculate the graph support as explained in Section 3. Table 1 compares the experimental results for different threshold values. We conclude that the optimal threshold value of graph support for balancing false positives and false negatives is 0.555. If the graph support of real data falls below 0.555, the data is treated as an attack; otherwise the incoming data is normal.
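The tracing step can be turned into a system call sequence with a short parser such as the one sketched below. It assumes the default text output of strace, where each call is printed as a name followed by an opening parenthesis, optionally preceded by a process id. The sample lines are hand-written in that general shape and were not captured from a real run; the paper does not describe how its traces were parsed.

```python
import re

# A system call name followed by "(", optionally preceded by "[pid N]" or "N".
SYSCALL_RE = re.compile(r"^(?:\[pid\s+\d+\]\s+|\d+\s+)?(\w+)\(")


def syscall_names(strace_lines):
    """Extract the ordered system call names from strace output lines."""
    names = []
    for line in strace_lines:
        match = SYSCALL_RE.match(line)
        if match:  # signal and exit lines do not match and are skipped
            names.append(match.group(1))
    return names


# Hand-written sample lines, for illustration only.
sample = [
    'execve("/bin/ps", ["ps"], 0x7ffc0 /* 20 vars */) = 0',
    'brk(NULL) = 0x55d0c000',
    'openat(AT_FDCWD, "/etc/ld.so.cache", O_RDONLY|O_CLOEXEC) = 3',
    '--- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED} ---',
    '+++ exited with 0 +++',
]

print(syscall_names(sample))
```

The resulting list of names is the kind of ordered trace from which the training and detection graphs of Section 3 are built.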
Table 1. Optimal support

Support Threshold    False Positive (%)    False Negative (%)
0.554                17.62                 19.00
0.555                18.01                 15.08
0.556                20.67                 14.97
0.557                24.11                 14.94
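The paper does not state the exact criterion used to balance the two error rates. One criterion that reproduces the reported choice, assumed here purely for illustration, is to minimise the sum of the false positive and false negative percentages in Table 1. The function name `best_threshold` is hypothetical.

```python
# Rows of Table 1: (support threshold, false positive %, false negative %).
table_1 = [
    (0.554, 17.62, 19.00),
    (0.555, 18.01, 15.08),
    (0.556, 20.67, 14.97),
    (0.557, 24.11, 14.94),
]


def best_threshold(rows):
    """Return the threshold with the smallest FP + FN sum (assumed criterion)."""
    return min(rows, key=lambda row: row[1] + row[2])[0]


print(best_threshold(table_1))  # 0.555
```

An application that is more sensitive to one kind of error could weight the two terms differently, in line with the remark in the conclusion about the criticality of the application.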
6. Conclusion
A lot of research has been done in the field of intrusion detection using supervised techniques, but these techniques are not able to detect unseen attacks, so researchers have moved to unsupervised techniques to detect them. In this paper we first discussed the work done on supervised and unsupervised techniques, and then discussed data mining techniques and graph support. Finally, we presented a solution based on the graph support technique for anomaly detection that detects both seen and unseen attacks. The experiments show that there is a close relationship between false positives and false negatives, so the two can be balanced according to the criticality of the application by choosing an appropriate threshold value of graph support.
7. References
[1] Min-Feng Wang, Yen-Ching Wu, and Meng-Feng Tsai: Exploiting Frequent Episodes in Weighted Suffix Tree to Improve Intrusion Detection System. In: 22nd International Conference on Advanced Information Networking and Applications
[2] Kingsly Leung, Christopher Leckie: Unsupervised Anomaly Detection in Network Intrusion Detection Using Clusters. In: 28th Australasian Computer Science Conference, The University of Newcastle, Australia, January 2005
[3] S. Hossain, S. M. Bridges, and R. B. Vaughn: Adaptive Intrusion Detection with Data Mining. In: Proc. of IEEE Int’l Conf. on Systems, Man and Cybernetics, pp. 3097-3103, 2003
[4] R. Li and W.-M. Pan: Intrusion Detection System Based on New Association Rule Mining Model. In: Proc. of IEEE Int’l Conf. on Granular Computing, pp. 512-515, Beijing, China, 2005
[5] M.-J. Xu and G. Gu: Method and Usage of Mining Association Rules in System Call Serial. In: Computer Systems & Applications, no. 4, pp. 26-28, 2004
[6] Jiawei Han, Jian Pei, and Yiwen Yin: Mining Frequent Patterns without Candidate Generation
[7] Nong Ye, Q. Chen: An Anomaly Detection Technique Based on a Chi-Square Statistic for Detecting Intrusions into Information Systems, 2001
[8] M. M. Yasin and A. A. Awan: A Study of Host-based IDS Using System Calls. In: Proc. of the IEEE Int’l Conf. on Networking and Communication, pp. 36-41, 2004
[9] Richard Rouil, Nicolas Chevrollier, Nada Golmie: Unsupervised Anomaly Detection System using Next-generation Router Architecture
[10] Kun-Ming Yu, Bin-Chang Wu: A High Performance Frequent Itemset Mining Algorithm Using Confidence Frequent Pattern Tree. In: 3rd International Conference on Innovative Computing Information and Control (ICICIC '08), 18-20 June 2008, p. 329
[11] Marcos M. Campos and Boriana L. Milenova: Creation and Deployment of Data Mining-Based Intrusion Detection Systems in Oracle Database 10g
[12] A.P. Kosoresow and S.A. Hofmeyr: Intrusion Detection via System Call Traces. In: Software, 14(5):35-42, September-October 1997, IEEE Computer Society