Exploiting Frequent Episodes in Weighted Suffix Tree to Improve ...

2 downloads 0 Views 1MB Size Report
Exploiting Frequent Episodes in Weighted Suffix Tree to Improve Intrusion .... list of system calls issued by a single process from the ..... Vegas, Nevada, 2005.
Exploiting Frequent Episodes in Weighted Suffix Tree to Improve Intrusion Detection System Min-Feng Wang, Yen-Ching Wu, and Meng-Feng Tsai

Department of Computer Science and Information Engineering National Central University, Jhongli, Taiwan 32001 [email protected] Abstract In this paper we proposed a weighted suffix tree and find out it can improve the Intrusion Detection System (IDS). We firstly focus on the analysis of computer kernel system call, and discover some meaningful information from the unorganized system call sequences. We design a weighted suffix tree algorithm which derives from the concept of suffix tree algorithm for string matching, which then allows to mine the frequent episodes in order to get ordered frequent patterns. We therefore apply these rules to detect malicious attacks, and it shows our IDS still has a good ability to detect intrusion when we use fewer rules.

1. Introduction Intrusion Detection System (IDS) is one of key technologies to ensure dynamic system security. Intrusion detection is defined as “identifying unauthorized use, misuse, and abuse of computer systems by both inside and outside intruders” [5]. There are two basic approaches of intrusion detection: misuse intrusion detection [7, 10], and anomaly intrusion detection [18]. Generally speaking, an IDS does not include preventing intrusions from occurring. All IDSs must have three important functions: data collection, data classification, and data reporting. In general, there are several levels of data to be collected for IDS. Network data has a lot of features which can be provided to analyze intrusion detections. But the transmission of network data is always being encrypted for security; In addition, when source code of the application does not publish, the method of embedding monitoring application code become even more difficult to implement. Unlike network data, system calls of operating system do not have the problem of encryption or decryption, they operate more directly than network packages and are much

easier to analyze. We can collect the system calls simply by monitoring process. In addition, since the system calls will not be affected by the language of process being written, we can successfully avoid the publication problem of the source code. These advantages interest us to analyze system calls of operating system. Data mining generally refers to the process of extracting models from large stores of data [4]. However, researches of IDS which apply data mining techniques [13, 25] do not focus on time order between system calls. And the extracted rule set only contains frequent patterns of disorder system calls. For example, [25] used Apriori algorithm [1] to extract the rules. If it has a rule {01, 02, 03}, then the system calls 01, 02, and 03 will be bundled together all the time without time order identification. In this case, if system calls 01 followed by 02, followed by 03 is the normal order of system call of the process, but system calls 03 followed by 02, followed by 01 is an intrusive behavior, then, the method will not be able to detect the intrusion which applies the same system calls with different order. In order to improve the accuracy of detection, we will keep the order between system calls by using our proposed weighted suffix tree algorithm, and exploit frequent episode mining algorithm to extract significant rules from ordered sequences.

2. Related work Lee et al. introduced data mining techniques into IDS [11, 14, 15, 16, 17, 21]. They hope to exploit the ability of handling huge amount of data to improve the performance of IDS. [13] used the mining algorithm based on FP-tree to extract rule set of normal behavior. [8] proposed a “bag of system calls” to represent IDS. [12] introduced several data mining techniques to enhance IDS. [26] analyzed several anomaly IDSs

which used system calls. [2] involved with several main aspects and technologies of IDS.

2.1. Frequent episodes mining The frequent episodes algorithm described in [20] is an extension of association rules. A frequent episode is a set of items that occur frequently within a time window of a specified length. They emphasize that the items in a serial episode must occur in partial order in time. An example is home, research→theory [sup = 5%, conf = 20%, time = 30s], which means that 5% of the total time windows (30s) of all transactions under analysis, the visiting order starts with home page, and followed by the research guide and the theory page; and 20% of the users who visit the home page and the research guide in time order will visit the theory page subsequently.

2.2. Suffix tree The concept of suffix tree was first introduced as a position tree by Weiner [23]. The construction was greatly simplified by McCreight [19], and also by Ukkonen [22]. Ukkonen provided the first linear-time online-construction of suffix trees. A suffix tree is a tree-like data structure representing all suffixes of a string. The characteristics of suffix tree can support to keep the order between system calls that are recorded in the sequence, and to find all substrings of the sequences.

3. Problem definition A set of system calls which are arranged in time order is called system call sequence. Each trace is the list of system calls issued by a single process from the beginning to the end of execution. A daemon process is used to denote a background system process on UNIX which runs independently of any user who is logged-in. Trace lengths vary widely because of differences in program complexity because some traces are collected daemon processes and others are not. The University of New Mexico (UNM) provides a number of system call data sets. In UNM system call traces, each trace is an output of one program. Most traces involve only one process and usually one sequence is created for each trace. However, sometimes one trace has multiple processes. In such cases, we have extracted one sequence per process in

the original trace. Thus, each system call trace can yield multiple sequences of system calls if the trace has multiple processes. Data sets have been collected from those programs that run with privilege. Monitoring privileged programs has several advantages over monitoring user profiles [9] because the misuse of those programs has the greatest potential for harm to the system. Root processes are more dangerous than user processes because they have access to more parts of the computer system. They have a limited range of behavior, and their behavior is relatively stable over time [3]. Some of the normal traces are “synthetic” and some are “live”. Synthetic traces are collected in production environments by running a prepared script and the program options are chosen solely for the purpose of exercising the program without any real user’s requests, whereas live traces are collected from the program during normal usage of a production computer system. Without loss of generality, we assume that system call sequence is mapped to a set of contiguous integers. Each system call number is an instruction and is recorded into a mapping file. Therefore, we can understand the original instruction by finding the integer from the mapping file. Here we should emphasize that the same operating system with different version will have different mapping file of system calls. It should be noted that the experiments must work under the same version of operating system.

4. Methodology 4.1. Architecture Before introducing our methodology, we’d like to bring out the architecture of our IDS. Because some traces might have multiple processes. The traces must be sent to preprocess so as to gain the system call sequences which contain one process per sequence. For the trace preprocessor in Figure 1, inputs of the program are source data sets and output system call sequences. After preprocessing the traces, the sequences with different length of normal usage are sent to frequent pattern generator. Frequent pattern generator constructs weighted suffix tree for further computation. Rule extractor will extract all rules that satisfy the conditions set by users from frequent pattern generator and collect the rules into rule set pool for further using.

Table 1. The algorithm for weighted suffix tree

Fig. 1. Architecture of proposed system When we collect enough information into rule set pool, our decision engine will exploit rules that are provided from rule set pool to analyze system calls of the monitoring program. The decision engine will report to user interface if an intrusion is detected. Hence these reports will help human experts to solve the problems easier than before.

4.2. Weighted suffix tree A suffix tree can keep the order between system calls that are recorded in the sequence. We can also extract all substrings of the sequence from its suffix tree. Weighting the suffix tree will help us discover the occurring frequency of the patterns. These statistics can then be used to calculate frequent episodes later. There is only one sequence presented in the original suffix tree. Now we change the presentation of suffix tree and our weighted suffix tree is consisted of all sequences. Table 1 is the algorithm pseudo-code for weighted suffix tree. We assume that there are two sequences S1 = {03, 01, 03, 01, 15} and S2 = {01, 01, 03, 01} in the data set D = {S1, S2}. The weighted suffix tree after processing sequence S1 and S2 is shown in Figure 2. We record the traversed number of each node when we built the tree. For example, the rightmost node of weighted suffix tree in Figure 2 is noted as 15(1). 15 means the system number of that node, whereas 1 means the traversed frequency of that node. We extend the representation of suffix tree to deal with multiple input sequences. We note that two or more input sequences may have common suffixes, so the nodes of common suffixes will increase the count number of weight. Further, we can exploit the information of nodes to help us improving intrusion detection technique.

Fig. 2. The weighted suffix tree

4.3. Frequent episodes mining Since the processes we monitored are sometimes used intermittently, we change the setting of fixed time window in [20] into fixed length of system call sequence to suit the feature of our experiments. As we mentioned before, the weighted suffix tree algorithm has processed over data set D which contains two sequences S1, S2. Without loss of generality, we assume that we want to collect the rule set of a fixed sequence of length 3. Hence we can discover a rule set of normal usage which has the fixed sequence length of 3 (see Table 2). There are four different sequences of length 3 collected from the weighted suffix tree. The first column in Table 2 shows all of the sequences which are obtained from the weighted suffix tree, whereas the second column shows the number of the sequences which stands for the number of appearances of the sequence in the whole data set. We can easily obtain the number of each sequence by reading the

Table 2. The sequence of length 3

Table 4. Two parts of IDS

Table 3. Frequent episodes mining algorithm

Fig. 3. Number of rules with different length

weighted counts of the last node of the sequence. Then we can calculate the frequent episodes of each rule further. Here we have a frequent episode rule example is 03, 01 → 03 [sup = 20%, conf = 50%, length = 3], which means 20% of total sequences (of length 3) under analysis system call 03, 01, 03 are used in time order, and 50% of the cases of which system calls 03, 01 have been used in order, system call 03 will be used subsequently. Below we show the definition of support and confidence used in our frequent episodes mining: sup =

the weighted count of the last node of the sequence the sum of the weighted count of the last node of all sequences

conf =

the weighted count of the last node of the sequence the sum of the weighted count of the last node and its all siblings

The value of confidence is a conditional probability. It means the probability that the consequent will occur, given that the antecedents have occurred. We found that we can summarize the weighted counts of all the rules with the same antecedent to be denominator and the weighted count of the sequence be the numerator. We discovered that the rules with the same antecedent have the same prefix in the weighted suffix tree (see Figure 2). Hence, the last nodes of these rules are siblings to each other. We can easily calculate the sum of the weighted count of the last node and its all siblings to gain the value of denominator of confidence. We can exploit the mining algorithm described in Table 3 to repeatedly extract the rule set of different length when the weighted suffix tree has been constructed. In the past, those who focused on analysis of intrusion detection using system call sequences had

Fig. 4. Number of rules with different support threshold a lot trouble with how to choose the most appropriate sequence length. They chose the length from experiment results empirically. When they need to collect another rule set with another sequence length, they must read whole traces again [6, 14]. Once the weighted suffix tree has been constructed, our mining algorithm can overcome the drawback. For example, suppose it is the first time we collect the rule set of length 7. After the operating, nodes of level 7 in the whole tree are labeled as traversed. If we want to collect the rule set of length 11 later, we can use the same tree mining again.

4.4. Intrusion detection model We decide to use t-stide method that is proposed by C. Warrender et al. [24] verifies whether intrusive events occur or not. The “t” in “t-stide” represents the addition of this threshold on sequence frequencies. They set a threshold to prune rare sequences which are regarded as abnormal. We choose support and

Table 5. The rules of length 5

confidence values of frequent episodes to be the threshold condition in t-stide method, instead of the occurring frequencies of the sequences that are previously used in t-stide method. We collect sequences of fixed length k that exceed the threshold and store those sequences together for further using. The traces are compared with the rule set by the decision engine. The decision engine will report to the user interface an alarm when mismatch rate exceeds the threshold.

5. Experiments In this chapter, we will discuss the performance of execution time between the original t-stide method and our method first. Then we will use data sets of synthetic sendmail [3] in our experiments to show the results of the experiments. The experimental results obtained from [14] let us know that a set of specific rules for normal sequences can be used to detect any known and unknown intrusions. Whereas having only the rules for abnormal sequences only provides the capability to detect known intrusions we used 80% of normal traces to construct weighted suffix tree and generated rule sets of different lengths from the weighted suffix tree. The rest 20% of normal traces were used for testing. Both normal and abnormal traces are used for testing. There are three traces of the sscp (sunsendmail) attacks, two traces of the decode attacks, and five traces of the error condition- forwarding loops attacks used for testing too.

5.1. Comparison of execution during rules checking Our IDS can be divided into two parts from data collection, construction of a weighted suffix tree, generation of rule sets, and operation of the decision engine. As we can observe from Table 4, the data collection, construction of the weighted suffix tree, and the generation of rule sets are completed off-line.

Table 6. The rules of length 7

Table 7. The rules of length 9

Whereas the decision engine must response as soon as possible if it detects any deviation of the programs so as to improve the whole security of computer systems. Generally speaking, an IDS wishes to have all possible normal usages of a program to improve the ability of identifying normal and abnormal behaviors. Unfortunately, the increment of rules will affect the performance of the decision engine. We assume that the length of the trace is L, the size of sliding window is w, and there are n rules, L>>w. The window sliding needs to check rule set again in the t-stide method each time. The time complexity of t-stide in worst is O((L-w+1)*n) = O(L*n). Figure 3 shows the original number of rules in the rule set with the length 5 to 11. Figure 4 shows the number of rules under different support threshold. We compare the rule sets of length 5, 7, and 9. We discover that the support threshold becomes higher and the number of rules becomes fewer. Hence the execution time of IDS will be improved by exploiting a smaller rule set. Our IDS can response much more quickly than before when an intrusion has been detected.

5.2. Experimental results Now, we are interested in the accuracy of intrusion detection. Table 5, 6, 7 show that the experimental results using the rules of length 5, 7, and 9. The first row is the experimental result running with the method proposed in [24]. The column of support means the support threshold of frequent episodes. The column of

confidence means the confidence threshold of frequent episodes. The column of normal abn% means the mismatch rate of normal traces. We also record the mismatch rate of the sscp (sunsendmail) attack traces in the column of sscp abn%, the mismatch rate of the decode attack traces in the column of decode abn%, and the mismatch rate of the error conditionforwarding loops attack traces in the column of fwd abn%. We compare the results of the different support and confidence thresholds. The column of gap records the minimal abn% of attack traces minus abn% of normal traces. We can see from Table 5. There are 660 rules when we use the method of [24]. The mismatch rate of normal traces is 3.3%. The mismatch rate of sscp attack is 16.5%. The mismatch rate of decode attack is 5.9%. The mismatch rate of error conditionforwarding loops attack is 13.9%. Hence the mismatch rate gap between normal traces and abnormal traces is 5.9% - 3.3% = 2.6%. After we use different support and confidence thresholds, we can observe that the number of rules is between 546 ~ 256. Hence the operation of decision engine will be faster than before. Because we use fewer rules, some sequences in the trace will be treated as mismatch. Hence the mismatch rate of normal traces will be higher than the method of [24]. On the other hand, the mismatch rate of abnormal traces will also be higher than the method of [24]. The gaps between normal traces and abnormal traces are 10.5%, 12.7%, 17.6%, 17.2%, 18.7%, and 16.2%, respectively. We can discover that the gap is larger than original method, which help us to prove that the decision engine can explicitly recognize which trace is normal or abnormal. We can see from Table 6 which uses the rules of length 7, and Table 7 using the rules of length 9. The gap between normal traces and abnormal traces are still large enough after pruning some rules.

6. Conclusions An IDS needs to process a lot of data sets to carry on analysis. This characteristic allows the application of data mining to become an important component in an IDS. The algorithms which were proposed before can simply find out disorder frequent patterns. It means that we cannot know the order of the system calls. This characteristic is very fallible. For example, an intrusion can use the same frequent patterns but different order to escape the check of IDS. Weighted suffix tree proposed by us can solve this problem easily. Because the system calls between the rules extracted from weighted suffix tree conserve the order

of them, therefore we can fix the fallible drawback of the existing algorithms and improve the performance of the IDS. For different programs, the fittest length of the rules will also be different, which forces an IDS to repeatedly execute experiments with different lengths of rule sets to find out which rule set is the most suitable one. According to the method proposed by [3, 6, 24], constructing a new rule set of another length needs to load the whole traces again. However, with weighted suffix tree, our method needs to load whole trace only one time for constructing, and this tree can quickly extract lots of rule sets. These advantages of weighted suffix tree help to enhance the timeefficiency of the IDS. Frequent episodes mining can identify the rules whose order between system calls is conserved. After setting different support and confidence thresholds, we can prune those rare rules. And we can use fewer rules to speed up the operation of the decision engine. The accuracy is the most important issue of an IDS. The experimental results also show that the gaps between the mismatch rate of normal traces and the mismatch rate of abnormal traces are large after pruning rare rules by Frequent Episodes Mining. It means that IDS can report or alarm to the users or experts accurately.

7. Acknowledgment This work was sponsored by National Central University Project of Promoting Academic Excellence and Developing World Class Research Centers – Applied Informatics and Creative Contents: ServiceOriented Information Platform.

8. References [1] R. Agrawal, and A. Swami, “Fast Algorithms for Mining Association Rules,” Proc. of the 20th Int’l Conf. on Very Large Data Bases, pp. 487-499, Santiago, Chile, 1994. [2] Y. Bai, and H. Kobayashi, “Intrusion Detection Systems: Technology and Development,” Proc. of the 17th IEEE Int’l Conf. on Advanced Information Networking and Applications, pp. 710-715, 2003. [3] S. Forrest, S. A. Hofmeyr, A. Somayaji, and T. A. Longstaff, “A Sense of Self for Unix Processes,” Proc. of the 1996 IEEE Symp. on Security and Privacy, pp. 120-128, Los Alamitos, 1996. [4] U. Fayyad, G. Piatetsky-Shapiro, and P. Smyth, “The KDD Process of Eextracting Uuseful Knowledge from Volumes of Data,” Communication of the ACM, vol. 39, no. 11, pp. 27-34, 1996.

[5] S. Hossain, S. M. Bridges, and R. B. Vaughn, “Adaptive Intrusion Detection with Data Mining,” Proc. of IEEE Int’l Conf. on Systems, Man and Cybernetics, pp. 3097-3103, 2003. [6] S. A. Hofmeyr, S. Forrest, and A. Somayaji, “Intrusion Detection Using Sequences of System Calls,” J. Computer Security, vol. 6, pp. 151-180, 1998. [7] K. Ilgun, R. A. Kemmerer, and P. A. Porras, “State Transition Analysis: A Rule-Based Intrusion Detection Approach,” IEEE Trans. on Software Engineering, vol. 21, no. 3, pp. 181-199, 1995. [8] D-K Kang, D. Fuller, and V. Honavar, ”Learning Classifiers for Misuse Detection Using a Bag of System Calls Representation,” Proc. of IEEE Int’l Conf. on Intelligence and Security Informatics, pp. 511-516, Atlanta, GA, USA, 2005. [9] C. Ko, G. Fink, and K. Levitt, “Automated Detection of Vulnerabilities in Privileged Programs by Execution Monitoring,” Proc. of the 10th Annual Computer Security Applications Conference, pp. 134-144, Dec. 1994. [10] S. Kumar, and E. H. Spafford, “A Software Architecture to Support Misuse Intrusion Detection,” Proc. of the 18th National Information Security Conference, pp. 194-204, 1995. [11] W. Lee, “A Data Mining Framework for Constructing Features and Model for Intrusion Detection System,” Ph.D Dissertations, Columbia University, New York, USA 1999. [12] C.-T. Lu, A. P. Boedihardjo, and P. Manalwar, “Exploiting Efficient Data Mining Techniques to Enhance Intrusion Detection Systems,” Proc. of the IEEE Int’l Conf. on Information Reuse and Integration, pp. 512-517, Las Vegas, Nevada, 2005. [13] T.-R. Li, and W.-M. Pan, “Intrusion Detection System Based on New Association Rule Mining Model,” Proc. of IEEE Int’l Conf. on Granular Computing, pp. 512-515, Beijing, China, 2005. [14] W. Lee, and S. J. Stolfo, “Data Mining Approaches for Intrusion Detection,” Proc. of the 7th USENIX Security Symposium, San Antonio, Texas, 1998. [15] W. Lee, and S. J. Stolfo, “A Framework for Constructing Features and Models for Intrusion Detection Systems,” ACM Trans. on Information and System Security, vol. 3, no. 4, pp. 227-261, 2000. [16] W. Lee, S. J. Stolfo, and P. K. Chan, “Learning Patterns from Unix Process Execution Traces for Intrusion Detection,” Proc. of the AAAI97 Workshop on AI Approaches to Fraud Detection and Risk Management, pp. 50-56, 1997.

[17] W. Lee, S. J. Stolfo, and K. W. Mok, “A Data Mining Framework for Building Intrusion Detection Model,” Proc. of the IEEE Symp. on Security and Privacy, pp. 120-132, Oakland, CA, May 1999. [18] T. Lunt, A. Tamaru, F. Gilham, R. Jagannathan, P. Neumann, H. Javitz, A. Valdes, and T. Garvey, ”A RealTime Intrusion Detection Expert System (IDES) –Final Technical Report,” Technical Report, Computer Science Laboratory, SRI International, Menlo Park, California, 1992. [19] E. M. McCreight, “A Space-Economical Suffix Tree Construction Algorithm,” J. ACM, vol. 23, no. 2, pp. 262272, 1976. [20] H. Mannila, H. Toivonen, and A. I. Verkamo, “Discovering Frequent Episodes in Sequences,” Proc. of the 1st Int’l Conf. on Knowledge Discovery in Databases and Data Mining, pp. 210-215, Montreal, Canada, 1995. [21] S. J. Stolfo, W. Lee, P. K. Chan, W. Fan, and E. Eskin, “Special Section on Data Mining for Intrusion Detection and Threat Analysis: Data Mining-Based Intrusion Detectors: An Overview of the Columbia IDS Project,” ACM SIGMOD Record, vol. 30, no. 4, pp. 5-14, Dec. 2001. [22] E. Ukkonen, “On-Line Construction of Suffix Trees,” Algorithmica, vol. 14, no. 3, pp. 249-260, 1995. [23] P. Weiner, “Linear Pattern Matching Algorithm,” Proc. of the 14th IEEE Symp. on Switching and Automata Theory, pp 1-11, 1973. [24] C. Warrender, S. Forrest, and B. Pearlmutter, “Detecting Intrusions Using System Calls: Alternative Data Models,” Proc. of the IEEE Symp. on Security and Privacy, pp. 133145, 1999. [25] M.-J. Xu, and G. Gu, “Method and Usage of Mining Association Rules in System Call Serial,” Computer Systems & Applications, no. 4, pp. 26-28, 2004. [26] M. M. Yasin, and A. A. Awan, ”A Study of Host-based IDS Using System Calls,” Proc. of the IEEE Int’l Conf. on Networking and Communication, pp. 36-41, 2004.