Computers & Security (2004) 23, 241-252

www.elsevier.com/locate/cose

Predicting the intrusion intentions by observing system call sequences*

Li Feng a,*, Xiaohong Guan a,b, Sangang Guo a, Yan Gao a, Peini Liu a

a Xi'an Jiaotong University, State Key Laboratory for Manufacturing Systems (SKLMS), Center for Networked Systems and Information Security (CNSIS), 1190 Mailbox, Xi'an 710049, China
b Center for Intelligent and Networked Systems, Tsinghua University, Beijing 100084, China

Received 11 June 2003; revised 23 January 2004; accepted 26 January 2004

KEYWORDS Intrusion intention prediction; Intrusion detection; Plan recognition; Dynamic Bayesian network; Parameter compensation

Abstract  Identifying the intentions or attempts of monitored agents through observations is vital in computer network security. In this paper, a plan recognition method for predicting the anomaly events and the intentions of possible intruders to a computer system is developed based on the observation of system call sequences. The probability of the goal state for a system call sequence is defined as the prediction index to determine whether the intention is normal. An efficient algorithm based on dynamic Bayesian network theory with parameter compensation is derived and applied to update the index recursively. Extensive empirical testing is performed on the data sets published in the literature and those collected in an actual computer system at our lab. The testing results show that this method can identify intrusion behaviors from the observed system call sequences with good accuracy.
© 2004 Elsevier Ltd. All rights reserved.

* The research presented in this paper was supported in part by the National Outstanding Young Investigator Grant (6970025), the National Natural Science Foundation (60243001) and the 863 High Tech Plan (2001AA140213) of China.
* Corresponding author.
E-mail addresses: [email protected] (L. Feng), xhguan@sei.xjtu.edu.cn (X. Guan), [email protected] (S. Guo), [email protected] (Y. Gao), [email protected] (P. Liu).

Introduction

Computer network security is extremely important in the modern information society. Among the many technologies deployed for network security, intrusion detection is a method for detecting hostile attacks against computer network systems, both

from outside and inside. Intrusion detection, together with firewalls, authentication and other technologies, constitutes the state-of-the-art defense-in-depth framework for securing a computer network. Intrusion detection has been a very active research area and has been widely studied for more than a decade. Anomaly detection based on system modeling or machine learning is one of the systematic approaches to intrusion detection. The normal system behavior is modeled by a systematic method, and intrusion detection is performed by tracing significant deviations between the monitored system activities and the model. Anomaly detection has the advantage of being able to detect unknown

0167-4048/$ - see front matter © 2004 Elsevier Ltd. All rights reserved. doi:10.1016/j.cose.2004.01.016

intrusions. Denning (1987) proposed a model for building a real-time intrusion-detection expert system by analyzing the profiles representing system behaviors from audit records. Forrest et al. (1996) developed an intrusion-detection system (IDS) inspired by natural immune systems and provided a simple and practical way for detecting anomalous behavior by analyzing short sequences of system calls, which can generate a stable signature for normal system behaviors. In other words, an abnormal activity can be detected from the system call sequences when an attack occurs. This approach incorporates two phases. The first phase involves collecting traces of normal behaviors and building a database that characterizes the normal patterns of the observed system calls. In the second phase, newly observed system call sequences are matched against the normal patterns of the system behaviors. Lane and Brodley (1998) developed an approach to distinguish the behavior of normal users from masquerading or illegal users by comparing a current behavioral sequence to historical user profiles. Lee and Xiang (2001) introduced information-theoretic measures such as entropy, conditional entropy, relative conditional entropy, information gain, and information cost for intrusion detection. Through these measures, Unix system call data, BSM data, and network tcpdump data are processed to determine the attacks. The above methods are only able to detect intrusions after the attacks have occurred, either partially or fully, which makes it difficult to contain or stop the attack in real time. Therefore, it is desirable to incorporate a prediction function into future IDSs (Geib and Goldman, 2001). Intention prediction, or goal recognition, is a theory and method for predicting the goals or intentions of monitored objects based on the observed data. This method has been successful in predicting security-related behaviors in many applications. Geib and Goldman (2001) first introduced plan recognition into intrusion detection and presented a probabilistic model of plan recognition in IDS. It aims at recognizing and predicting the goals or intentions of the monitored agents based on the observed data. Plan recognition can be considered as an inference problem under uncertain conditions (Charniak and Goldman, 1993; Goldman et al., 1999). Current research focuses on (1) predicting plans or goals during cooperative interactions; (2) understanding stories (natural language processing); and (3) recognizing the plans of an agent that is unaware of being monitored, known as keyhole plan recognition (Albrecht et al., 1997; Wærn and Stenborg, 1995). There are two main features of the keyhole

plan recognition: (1) the monitored agent is not aware of the fact that its behavior is monitored and analyzed; and (2) the observed data are incomplete. In traditional plan recognition methods, the plan library is built manually, which greatly hinders the wide application of plan recognition. To overcome this obstacle, machine learning approaches are applied to collect information about the plans and to make decisions. Being capable of modeling a time-varying system, the dynamic Bayesian network is one of the few methods that enable us to develop effective methods for recognizing and monitoring time-varying plans (Albrecht et al., 1997; Nicholson and Brady, 1994). If an IDS can not only detect the actions that have already happened but also predict future actions, one can implement both passive detection and proactive defense. It is then possible to block intrusions or respond to hacking in real time. It is also possible to predict the intruder's intentions in a large-scale distributed IDS and to give early warning (Huang and Wicks, 1999). In this paper, our aim is to predict the occurrence of abnormal events by using the plan recognition approach. An efficient method is developed based on dynamic Bayesian network theory, with sequences of system calls used as the information source. First, the goals of different sequences of system calls are classified into two states: normal and anomalous. With clearly defined normal goals (e.g. issuing normal commands) or abnormal goals (e.g. launching local or remote buffer overflows to get higher privileges), we build a dynamic Bayesian network whose structure changes with time as new nodes are added with new incoming calls. From the statistics of observed system call data, the prior conditional probability distributions (CPDs) are obtained. Extensive empirical testing is performed. Firstly, the Unix system call sequences collected by the University of New Mexico group (Forrest et al., 1996) are analyzed, and the normal and abnormal goals are effectively predicted. Secondly, intrusion prediction experiments are conducted in our lab, and the system call sequences of normal and anomalous Http, Ftp and Samba processes are collected from a Red Hat Linux system. The comprehensive testing results reveal that this method has good prediction accuracy in determining the intrusion behaviors from the observed system call sequences. The remainder of the paper is organized as follows. The following sections present a dynamic Bayesian network based plan recognition method for predicting the goal of sequences of system calls and show the experimental results obtained using the new method. The final section draws

conclusions regarding this method and outlines the need for future research.

Dynamic Bayesian network based intrusion prediction

A Bayesian network is a graphical relational probability model and an efficient representation of the joint probability distribution of a set of random variables based on a set of conditional independence assumptions (Friedman et al., 1997). A Bayesian network consists of a set of nodes, each associated with a random variable. The conditional independencies among the variables are represented by a directed acyclic graph (DAG) whose arcs represent dependencies. Most Bayesian network based methods are for static systems, where the network structure does not change with time. Applying these methods requires specifying the network structure, providing the conditional probabilities, adding and deleting evidence, and applying the reasoning algorithm repeatedly as the evidence changes (Dean et al., 1990; Sterritt et al., 2000; Tawfik and Neufeld, 1994). This would cause the network to become very complicated in most cases. In recent years, the dynamic Bayesian network (also called temporal probabilistic network, or dynamic causal probabilistic network) has attracted a lot of attention from researchers. Albrecht et al. (1997) first used the dynamic Bayesian network to predict the goals of multiple players in the multi-user dungeon (MUD) game, with good experimental results. The advantage of the dynamic Bayesian network is that the network expands over time, with its nodes representing the states of time-varying variables. This may reduce the network scale and simplify the structure. In this section, a method based on the dynamic Bayesian network is developed to predict the intentions or attempts of potential network intruders. We will focus on: (1) how to determine the network structure, nodes and associated states; (2) how to obtain the conditional probability distributions (CPDs) among the various state variables from the observed data; and (3) how to develop an algorithm with parameter compensation to predict the goal of a system call sequence.

Network structure

To discover the irregularity of system call sequences and detect intrusions, the following state variables are defined.

(1) System call (S): representing all possible calls in a system call sequence. The size of the state space of system calls is |S|, the total number of possible calls in an operating system.
(2) Goal (Q): a set of state variables describing the goal of a sequence as normal or abnormal. Goal Q consists of two classes of states, normal and abnormal. The normal goal denotes that normal users or processes want to accomplish normal tasks. The abnormal goal is what intruders or hackers intend to reach by exploiting the vulnerabilities of the system.

A simple dynamic Bayesian network for predicting the goal of a system call sequence based on the above definitions is shown in Fig. 1, where Q_0 is the initial goal state and S_k is the kth system call in the sequence.

Figure 1  A dynamic Bayesian network for predicting the goal of a system call sequence (nodes Q_0, Q and S_0, S_1, S_2, S_3, ...).

This model stipulates that the system call S_k at time step k depends on the current goal Q and the previous system call S_{k-1} at time step k - 1. These dependencies are based on the following assumptions:

(1) The goal Q of the current system call sequence does not change in the ongoing process but depends only on the initial value Q_0.
(2) The sequence has a Markovian nature, that is, the current call depends only on the previous call and the current Q, and is independent of the earlier history.
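Under these two assumptions, the joint distribution over a sequence of length n factorizes into the initial goal prior and one transition term per call. The display below makes this explicit as a reading aid; it is not stated in this form in the original text, and the term P(s_0 | Q) may simply be dropped if the first call is treated as given evidence, as in the initial step of the algorithm presented later.

$$ P(Q, s_0, s_1, \ldots, s_n) = P(Q)\, P(s_0 \mid Q) \prod_{k=1}^{n} P(s_k \mid s_{k-1}, Q). $$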


Recursive prediction

Based on the collected system call sequences, the CPDs can be calculated. Given the CPD of the previous node, the CPD of the current node is calculated as shown in Table 1; if a node has no parent node, its prior probability is used as the CPD. To give a detailed procedure, the following definitions are required.

Definition 1. The goal state variable is Q = {q_1 = normal, q_2 = abnormal}, and the following notations are introduced:

q: the initial goal state;
q_i: the ith goal state, which also denotes the event space containing all the sub-sequences with goal state q_i;
S_k (k = 0, 1, 2, ..., n-1): the kth call in a system call sequence of length n;
B_k = {q, s_0, ..., s_{k-1}, s_k}: a segment of system calls from s_0 to s_k with initial goal state q;
P_i,(k-1)k (k = 1, 2, ..., n): the transition probability from the (k-1)th call S_{k-1} to the kth call S_k in a system call sequence of length n for the ith goal state;
M_i,(k-1)k (k = 1, 2, ..., n): the total number of state transitions for goal state i between the (k-1)th call S_{k-1} and the kth call S_k in a system call sequence of length n;
β_ki: the compensation coefficient (explained later).

Table 1  Conditional probability distributions (CPDs)

Previous call    Current call    Transition probability    Transition times    Goal
S_0              S_1             P_i,01                    M_i,01              q_i
S_1              S_2             P_i,12                    M_i,12              q_i
...              ...             ...                       ...                 ...
S_(n-1)          S_n             P_i,(n-1)n                M_i,(n-1)n          q_i
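As an illustration of how the entries of Table 1 can be obtained, the sketch below estimates the per-goal transition probabilities P_i,(k-1)k = M_i,(k-1)k / T_i by counting call-to-call transitions in labeled training sequences. It is a minimal sketch under our own assumptions (names such as `estimate_cpds` are illustrative, not from the paper), and it ignores issues such as smoothing of unseen transitions.

```python
from collections import defaultdict

def estimate_cpds(training_sequences):
    """Estimate P(s_k | s_{k-1}, goal) from labeled training data.

    training_sequences: iterable of (goal, calls) pairs, where goal is
    'normal' or 'abnormal' and calls is a list of system call names.
    Returns: dict goal -> dict prev_call -> dict next_call -> probability.
    """
    # counts[goal][prev][next] = number of observed prev -> next transitions
    counts = defaultdict(lambda: defaultdict(lambda: defaultdict(int)))
    for goal, calls in training_sequences:
        for prev, nxt in zip(calls, calls[1:]):
            counts[goal][prev][nxt] += 1

    cpds = {}
    for goal, by_prev in counts.items():
        cpds[goal] = {}
        for prev, by_next in by_prev.items():
            total = sum(by_next.values())  # T_i: all transitions out of prev
            cpds[goal][prev] = {nxt: m / total for nxt, m in by_next.items()}
    return cpds

# Example usage with toy traces (hypothetical data):
data = [("normal", ["open", "read", "close"]),
        ("abnormal", ["open", "execve", "socket"])]
cpds = estimate_cpds(data)
print(cpds["normal"]["open"])  # {'read': 1.0}
```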

Definition 2. Assume that {q_1, q_2} is a perfect partition of the event space U containing all the possible sub-sequences, such that

$$ \bigcup_i q_i = U \quad \text{and} \quad q_1 \cap q_2 = \emptyset. \qquad (1) $$

Based on the above definitions, a Bayesian model can be derived for predicting the upcoming system call given the current system call sequence. For a specific sequence B_k = {q, s_0, ..., s_{k-1}, s_k} with initial state q,

$$ B_k = B_k \cap U = \bigcup_{i=1}^{2} (B_k \cap q_i). \qquad (2) $$

According to the Markovian assumption in section 'Network structure',

$$ (s_k \mid q_i, q, s_0, \ldots, s_{k-1}) = (s_k \mid q_i, s_{k-1}). \qquad (3) $$

Then

$$ B_k \cap q_i = (q_i \mid B_{k-1}) \cap (s_k \mid q_i, s_{k-1}) \cap B_{k-1}, \qquad (4) $$

where (q_i | B_{k-1}) is the goal state q_i given B_{k-1}, and (s_k | q_i, s_{k-1}) is the next call s_k given s_{k-1} and q_i. The following probability is obtained by the total probability theorem:

$$ P(B_k) = \sum_{i=1}^{2} P(q_i \mid B_{k-1})\, P(s_k \mid q_i, s_{k-1})\, P(B_{k-1}), \qquad (5) $$

and based on the conditional probability theorem, we have

$$ P(q_i \mid B_k) = \frac{P(q_i \mid B_{k-1})\, P(s_k \mid q_i, s_{k-1})}{\sum_{i=1}^{2} P(q_i \mid B_{k-1})\, P(s_k \mid q_i, s_{k-1})}. \qquad (6) $$

The recursive Eq. (6) is the Bayesian posterior equation for updating the probability of the goal state. As presented in Albrecht et al. (1997), the Bayesian network based algorithm for goal prediction is the following.

Initial step:

$$ P(Q = q_i \mid B_0) = P(Q = q_i \mid q) = \frac{1}{2}, \qquad (7) $$

$$ P(S = s_1 \mid B_0) = \sum_{i=1}^{2} P(S = s_1 \mid s_0, q_i)\, P(Q = q_i \mid B_0). \qquad (8) $$

The kth step:

$$ P(Q = q_i \mid B_k) = \alpha_k\, P(S = s_k \mid s_{k-1}, q_i)\, P(Q = q_i \mid B_{k-1}), \qquad (9) $$

which is the probability of the goal state based on the system call sequence monitored so far and is used as the prediction index, where α_k is the normalizing factor. If Q = q_i is given, the probability of the transition s_{k-1} -> s_k is

$$ P(s_k \mid s_{k-1}, q_i) = P_{i,(k-1)k} = \frac{M_{i,(k-1)k}}{T_i}, \qquad (10) $$

where T_i is the total number of state transitions from s_{k-1} for the ith goal state.
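To make the recursion concrete, the following sketch applies Eqs. (7), (9) and (10) to a monitored sequence, renormalizing at every step. It is a hedged illustration, not the authors' implementation: names such as `predict_goal` and the `floor` probability used for transitions unseen in training are our own assumptions, and no parameter compensation is applied yet (see the next section).

```python
def predict_goal(calls, cpds, goals=("normal", "abnormal"), floor=1e-6):
    """Recursively compute P(Q = q_i | B_k) for each goal state.

    calls: observed system call sequence [s_0, s_1, ..., s_n].
    cpds:  dict goal -> prev_call -> next_call -> P(next | prev, goal),
           e.g. as produced by estimate_cpds() above.
    Returns the trajectory of posteriors, one dict per step k >= 1.
    """
    # Eq. (7): uniform prior over the two goal states given B_0 = {q, s_0}
    posterior = {g: 1.0 / len(goals) for g in goals}
    trajectory = []
    for prev, cur in zip(calls, calls[1:]):
        unnormalized = {}
        for g in goals:
            # Eq. (10): transition probability for this goal state;
            # a small floor stands in for transitions unseen in training.
            p_trans = cpds.get(g, {}).get(prev, {}).get(cur, floor)
            # Eq. (9) without the normalizing factor alpha_k
            unnormalized[g] = p_trans * posterior[g]
        z = sum(unnormalized.values())  # inverse of alpha_k
        posterior = {g: v / z for g, v in unnormalized.items()}
        trajectory.append(dict(posterior))
    return trajectory

# Usage: flag the sequence as intrusive once P(abnormal | B_k) crosses a threshold.
# for step in predict_goal(observed_calls, cpds):
#     if step["abnormal"] > 0.99: ...
```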


Corrections for the overlapping partition

The Bayesian formula is based on the total probability theorem and an exclusive partition of the event space. In reality, however, it is difficult to obtain an ideal partition. There exist sequences with different goals but the same initial sub-sequences, and certain parts of an anomalous system call sequence may be identical to those of a normal sequence, so that

$$ q_1 \cap q_2 \neq \emptyset, \qquad (11) $$

as shown in Fig. 2. In other words, it is not certain which goal set the sub-sequence B_n (a normal sub-sequence) or B_a (an anomalous sub-sequence) falls into.

Figure 2  Description of event spaces: the event spaces of goal states q_1 and q_2 overlap in the intersection q_1 ∩ q_2, and {q'_1, q'_2} denotes the actual exclusive partition.

In Fig. 2, the intersected part of the event space belongs either to the event space q_1 or to q_2. In this case, direct application of the Bayesian Eq. (6) would lead to incorrect prediction results. To reduce the prediction errors caused by the overlapping partition, we propose to compensate the conditional probability of the system call sequence; that is, the results obtained by Eq. (6) need to be corrected. Assume that the actual exclusive partition is {q'_1, q'_2}, as shown in Fig. 2, with

$$ P(Q = q'_i \mid B_k) = \frac{P(Q = q'_i \mid B_{k-1})\, P(s_k \mid s_{k-1}, q'_i)}{\sum_{i=1}^{2} P(Q = q'_i \mid B_{k-1})\, P(s_k \mid s_{k-1}, q'_i)}. \qquad (12) $$

In comparison with Eq. (6), only P(q_i | B_{k-1}) P(s_k | q_i, s_{k-1}) is affected by the overlapping partition of {q_1, q_2}, since P(B_k) is independent of the partition. Therefore, only P(s_k | s_{k-1}, q_i) P(q_i | B_{k-1}) in Eq. (9) needs to be corrected. Define the compensation coefficient as

$$ \beta_{ki} = \frac{P(Q = q'_i \mid B_k)}{P(Q = q_i \mid B_k)} = \frac{P(Q = q'_i \mid B_{k-1})\, P(s_k \mid s_{k-1}, q'_i)}{P(Q = q_i \mid B_{k-1})\, P(s_k \mid s_{k-1}, q_i)}, \qquad (13) $$

and modify Eq. (9) as

$$ P(Q = q_i \mid B_k) = \alpha_k\, \beta_{ki}\, P(S = s_k \mid s_{k-1}, q_i)\, P(Q = q_i \mid B_{k-1}). \qquad (14) $$

This makes the prediction index more accurate. In actual applications, the coefficient β_ki is usually determined by experience from experiments, since the accurate partition is generally not available and β_ki cannot be obtained theoretically from Eq. (13).
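A minimal way to apply the compensation of Eq. (14) is to scale the per-goal update by a fixed coefficient before renormalizing, as in the sketch below. This assumes, as the paper does for its experiments, that β_ki is a constant chosen empirically rather than computed from Eq. (13); the function name `predict_goal_compensated`, the per-goal `betas` dictionary and the reuse of `cpds` from the earlier sketch are our own assumptions.

```python
def predict_goal_compensated(calls, cpds, betas, goals=("normal", "abnormal"),
                             floor=1e-6):
    """Recursive goal prediction with parameter compensation (Eq. 14).

    betas: dict goal -> constant compensation coefficient beta_{ki},
           e.g. {"normal": 1.0, "abnormal": 1.3} (values are illustrative).
    """
    posterior = {g: 1.0 / len(goals) for g in goals}
    trajectory = []
    for prev, cur in zip(calls, calls[1:]):
        unnormalized = {}
        for g in goals:
            p_trans = cpds.get(g, {}).get(prev, {}).get(cur, floor)
            # Eq. (14): alpha_k * beta_{ki} * P(s_k | s_{k-1}, q_i) * P(q_i | B_{k-1})
            unnormalized[g] = betas.get(g, 1.0) * p_trans * posterior[g]
        z = sum(unnormalized.values())
        posterior = {g: v / z for g, v in unnormalized.items()}
        trajectory.append(dict(posterior))
    return trajectory
```

In this form the coefficient simply biases the update toward the goal state whose event space is believed to absorb most of the overlapping sub-sequences; setting both coefficients to 1 recovers the uncompensated update of Eq. (9).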

Empirical results

The experiments are designed to test the capability of our method to recognize intrusion intentions by predicting the goals of system call sequences once a hacker breaks into the computer network system by exploiting system vulnerabilities. The testing is performed on two different data sets: the system call data from the University of New Mexico (UNM; see Forrest et al., 1996) and our own data collected from the computer network system in our lab. The approach to collecting and processing the data is given in detail in section 'Testing and analysis based on the system calls data from CNSIS lab'. In our experiments, we tried different coefficients β_ki for compensating the conditional probability of the system call sequence in order to optimize the prediction performance. Furthermore, an average prediction index P_v, a stability index δ and a prediction time index γ are proposed to evaluate the prediction results. The testing results and evaluations are reported in the following two sub-sections.

To measure the overall prediction performance, the average prediction index across the temporal horizon is defined as

$$ P_v = \frac{1}{T} \sum_{t=1}^{T} P(Q = q_i \mid B_t), \qquad (15) $$

where T is the total number of system calls in one sequence. In addition, we define the prediction time index γ, indicating when the prediction is confirmed. This is a measure of how quickly the goal is detected: a smaller γ (0 < γ < 1) means that a desirable result is obtained more quickly. The formal definition of the prediction time index is

$$ \gamma = \frac{T_M}{T}, $$

where T_M is the number of temporal steps at which the prediction index reaches its maximum. To determine T_M, a fluctuation criterion based on a sliding window is defined:

$$ T_M = T_k \quad \text{iff} \quad \left| \sum_{n=k}^{k+W_L} \frac{100\% - P(Q = q_i \mid B_n)}{100\%} \right| < 1, \qquad (16) $$

where W_L is the sliding window length, usually specified as 1/5 to 1/8 of the sequence length. Tables 2-5 list the compensation coefficient β_ki and the minimum and maximum values of P_v and γ. A stability index δ is also defined to describe the degree of fluctuation after the prediction index has reached its maximum value:

$$ \delta = \sum_{k=T_M}^{T} \frac{100\% - P(Q = q_i \mid B_k)}{100\%}, \qquad 0 \le P(Q = q_i \mid B_k) \le 1. \qquad (17) $$

A large δ means the prediction index fluctuates a lot. The fact that the values of β_ki are not 1 means that the compensation is active, which validates our analysis in section 'Corrections for the overlapping partition'. Since more than one system call sequence is processed, the indices defined above fall in ranges, that is, P_v ∈ [Min P_v, Max P_v], γ ∈ [γ_min, γ_max], δ ∈ [δ_min, δ_max]. In the following tests, it is assumed that the goal is identified when the prediction index reaches 100%.
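The sketch below computes the three evaluation indices from a posterior trajectory such as the one returned by `predict_goal_compensated` above. It follows Eqs. (15)-(17) under our reading of the sliding-window criterion; the helper name and the default window fraction are assumptions, not taken from the paper.

```python
def evaluate_prediction(trajectory, goal, window_fraction=0.2):
    """Compute (P_v, gamma, delta) for one sequence and one goal state.

    trajectory: list of dicts, step k -> {goal: P(Q = goal | B_k)}.
    goal: the goal state whose prediction index is evaluated.
    """
    p = [step[goal] for step in trajectory]
    T = len(p)
    P_v = sum(p) / T                                   # Eq. (15)

    # Eq. (16): first step whose sliding window shows a negligible shortfall
    W_L = max(1, int(window_fraction * T))
    T_M = T  # default: the maximum is only reached at the very end
    for k in range(T - W_L):
        if sum(1.0 - p[n] for n in range(k, k + W_L + 1)) < 1.0:
            T_M = k + 1
            break
    gamma = T_M / T

    delta = sum(1.0 - p[k] for k in range(T_M - 1, T))  # Eq. (17)
    return P_v, gamma, delta
```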

Testing and analysis based on the UNM system calls data

The testing is first performed on the system call data from the University of New Mexico (Forrest et al., 1996). The processes involved are Login, Ps, Xlock, Inetd, Stide, Named and Ftp, and the intrusion types include buffer overflows, symbolic link attacks, Trojan agents, etc. First, the event occurrence probabilities P(s_k | s_{k-1}, q_i) used in Eq. (9) are calculated off-line based on the training data sets. The threshold of the prediction index is set to determine the goal state. The results, showing the temporal index of the system call sequences versus the probabilities of the normal or anomalous goal states used as the prediction index, are given in Figs. 3-9.

In Fig. 3, an anomalous and a normal sequence of the Login process are tested. It is seen in Fig. 3(a) that the prediction index for the anomalous goal state reaches 100% in about 300 steps. Actually, the real goals of the normal and the anomalous sequences can be differentiated at about 70% of the sequence length. The low index value of the anomalous sequence at the beginning is due to the fact that the anomalous and normal sequences are similar initially. The prediction index for a normal system call sequence in Fig. 3(b) is high even at the beginning, with slight fluctuations in the middle of the sequence. Similarly, testing is performed for the Stide, Inetd, Ftp, Named, Xlock and Ps processes and the results are shown in Figs. 4-9. The prediction results are all similar: the goal states are predicted with the preset accuracy after certain percentages of the monitored system call sequences are analyzed. Therefore the algorithm developed in section 'Dynamic Bayesian network based intrusion prediction' is an effective method for plan recognition or goal prediction.

In Table 2, the goals of five sequences of anomalous Ftp processes are predicted and the prediction time index lies between γ_min = 0.91% and γ_max = 100% (the goal is identified at the very end of the system call sequence). This means that the prediction time varies most among the processes tested. The prediction time index for Xlock varies by only 2.2%, indicating that the prediction times for its different sequences are about the same. The testing results show that the goal of the Ftp process is identified earliest while the goal of the Login process is predicted last. The minimum and maximum average prediction indices Min P_v and Max P_v among all the sequences tested reflect the overall prediction performance in general.

Table 2  Experimental results for predicting the goal of anomalous system call sequences from the UNM dataset

System calls  Number of sequences  β_ki   γ_min (%)  γ_max (%)  Min P_v  Max P_v  δ_min  δ_max
Login         5                    1.05   71.6       78.7       0.4740   0.4914   0      0
Ps            12                   1.02   5.4        38.3       0.6818   0.9524   0      0
Xlock         2                    1.9    34.6       36.8       0.6367   0.6583   0      0
Inetd         30                   0.9    24.9       47.4       0.5315   0.9223   0      0
Stide         8                    1.02   5.1        22.7       0.8511   0.9662   0      0
Named         5                    1.3    0.82       52.7       0.5674   0.9988   0      0
Ftp           5                    1.3    0.91       100        0.5018   0.9960   0      0

The average prediction index and prediction times for normal Login processes are quite stable, with |Max P_v - Min P_v| = 0.0007 and |γ_max - γ_min| = 0.4%, the most stable among the processes tested. This is due to the fact that Login is a small program with a very simple function; any deviation from the Login function can be easily identified as abnormal behavior. The stability indices are perfect (δ = 0) for almost all the processes, except that the normal Login processes show a little fluctuation with δ_max = 0.61. Based on the testing results, it is evident that the method developed in this paper can indeed discover intrusion attempts and lays a basis for effectively fending off attacks in advance.

Table 3  Experimental results for predicting the goal of normal system call sequences from the UNM dataset

System calls  Number of sequences  β_ki   γ_min (%)  γ_max (%)  Min P_v  Max P_v  δ_min  δ_max
Login         12                   1.05   0.27       0.67       0.9993   1        0      0.61
Ps            5                    1.02   7.4        94         0.5049   0.9941   0      0
Xlock         39                   1.9    0.18       17.3       0.9838   1        0      0
Inetd         2                    0.9    14.6       30.3       0.9205   0.9954   0      0
Stide         10                   1.02   10.6       100        0.8231   0.9752   0      0
Named         13                   1.3    0.0029     51.9       0.5199   0.9999   0      0
Ftp           7                    1.3    2.3        30.5       0.8226   0.9970   0      0

Testing and analysis based on the system calls data from CNSIS lab

To validate our method, extensive testing was also performed on the data sets collected from the computer network system of our own lab. Several normal and anomalous Http, Ftp and Samba system call sequences were collected on a Red Hat Linux system with kernel 2.4.7-10. The data collector is implemented by tapping into the kernel. Each sequence is the series of system calls invoked by a process from its beginning to its end. Sequence lengths vary widely because of differences in program complexity and users' goals. The data collection procedure is as follows.

(1) Collection of normal system call sequences: Following Forrest et al.'s (1996) collection mechanism, the data for normal system calls are divided into two types: synthetic and live data. Synthetic data are traces obtained by running prepared scripts or simulators to impersonate a real user's behavior. Live normal data are traces of processes collected during normal usage of a real computer system by real users. We obtained the synthetic Http system call traces with the simulation tool Webstress, which impersonates users to access WWW servers and test their load. For Ftp and Samba, we collected live data spanning a period of about one month; the users were graduate students of our lab with rich computer experience.

(2) Collection of anomalous system call sequences: The intrusion traces involved attacking events associated with Http, Ftp and Samba. Apache is widely used WWW service software with an important component, Apache-SSL, implementing the SSL service. It has a remote buffer overflow vulnerability that allows remote hackers to execute arbitrary commands with root privilege. Anomalous Http system call traces were collected from such Apache-SSL exploitations. The widely known Wu-Ftpd has a remote file globbing heap corruption vulnerability, which allows remote attackers to execute arbitrary commands on the victim host; the anomalous Ftp traces consist of exploitations of this vulnerability. Similarly, Samba 2.2.8 and earlier versions have a vulnerability leading to a remote buffer overflow when illegal users send packets of excessive length to the Samba server. The anomalous Samba system call traces mainly comprise attacks exploiting this vulnerability.

Table 4  Experimental results for predicting the goal of anomalous system call sequences from the CNSIS dataset

System calls  Number of sequences  β_ki   γ_min (%)  γ_max (%)  Min P_v  Max P_v  δ_min  δ_max
Http          33                   1.03   2.94       52.9       0.5330   1        0      0
Ftp           7                    1.4    54.4       74.2       0.26     0.457    0      0
Samba         20                   1.0    34.5       35         0.94     0.941    0.3    0.3

Table 5  Experimental results for predicting the goal of normal system call sequences from the CNSIS dataset

System calls  Number of sequences  β_ki   γ_min (%)  γ_max (%)  Min P_v  Max P_v  δ_min  δ_max
Http          18                   1.03   0.07       10.64      0.9320   1        0      0
Ftp           10                   1.4    0.93       1.4        0.9995   0.9996   0      0
Samba         22                   1.0    1.6        45.3       0.5890   0.9965   0      0

The testing results of the normal and anomalous goals of these traces are shown in Figs. 10-12, and the prediction performance, measured by the indices defined at the beginning of this section, is summarized in Tables 4 and 5. The prediction results are very consistent with those based on the UNM data: the prediction index gradually reaches a stable maximum value with little fluctuation after time T_M. The testing results for a normal and an anomalous sequence of the Http process are shown in Fig. 10. It is seen in Fig. 10(a) that the prediction index for the anomalous goal reaches 100% in about 40 steps. Actually, the goals of the normal and the anomalous sequences can be differentiated at about 50% of the sequence length. As illustrated in Figs. 10-12, Http, Ftp and Samba show similar results, though with different T_M. Among the three types of processes, the goal of the anomalous Ftp process is identified last, while that of the normal Ftp process is identified first. Except for a little fluctuation of the prediction index for anomalous Samba, the indices for all other processes are quite stable, as seen in the figures and demonstrated in Tables 4 and 5. Tables 4 and 5 list the prediction indices for all the processes tested in our lab. From Table 4, it is seen that γ_min of Http is the minimum among the tested anomalous data, which indicates that anomalous Http is discovered earliest among those with anomalous goals; in contrast, anomalous Ftp is discovered last. Judging by δ, the prediction indices are very stable except those for Samba, which show some slight fluctuation.

Figure 3  Results for Login system call sequence.
Figure 4  Results for Stide system call sequence.
Figure 5  Results for Inetd system call sequence.
Figure 6  Results for Ftp system call sequence.
Figure 7  Results for Named system call sequence.
Figure 8  Results for Xlock system call sequence.
Figure 9  Results for Ps system call sequence.
Figure 10  Results for Http system call sequence.
Figure 11  Results for Ftp system call sequence.
Figure 12  Results for Samba system call sequence.

Conclusions and future work

Intention prediction clears the way for possible real-time response to hostile intrusions into computer network systems. A plan recognition based method is developed in this paper to predict anomalous events and the intentions of potential intruders, using system call sequences as observation data. An algorithm based on a dynamic Bayesian network with parameter compensation accomplishes the prediction progressively, and the hostile goal or intrusion intention is identified in the recursive process. Our extensive experimental results showed that this method is effective in predicting the goals of many normal and anomalous system call sequences of Unix operating systems. This method provides an effective way to analyze and predict attack attempts and to issue short-term early warnings with direct evidence, and it also forms the basis for developing defense and control strategies to respond to coordinated attacks in near real time. The techniques may be applicable to other information

security problems to discover intrusion attempts in real time. There is still much left to explore. Our future work focuses on four aspects. (1) How to reduce the intersection of the normal and abnormal event spaces in order to reduce false alarms. Currently we use a sliding window with a length of two calls to segment the sequences. According to the method proposed by Tan and Maxion (2002), sliding windows with a length of six calls or more can distinguish anomalous and normal sequences more effectively. Also, Lee and Xiang (2001) explained why the relative conditional entropy of anomalous and normal sequences is minimized when the sliding window length is six or more. Therefore we will test the effect of sliding windows with lengths other than two calls, as sketched below. (2) How to choose β_ki by a systematic approach. Currently β_ki is selected based on heuristics and experience; we are developing a systematic approach based on machine learning. (3) How to reduce the overhead caused by goal prediction in implementation. (4) How to validate our prediction method with more extensive data and in other practical systems.
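As a small illustration of the segmentation discussed in item (1), the sketch below extracts fixed-length windows of system calls from a trace; the window length is a parameter, so the same routine covers both the current length-two setting and the length-six windows suggested by Tan and Maxion (2002). The function name and interface are our own assumptions.

```python
def sliding_windows(calls, length=6):
    """Return all contiguous windows of `length` calls from a trace.

    With length=2 this reproduces the call-pair segmentation currently
    used for the transition counts; length=6 follows the window size
    recommended by Tan and Maxion (2002).
    """
    if length < 1 or length > len(calls):
        return []
    return [tuple(calls[i:i + length]) for i in range(len(calls) - length + 1)]

# Example: sliding_windows(["open", "read", "read", "close"], length=2)
# -> [("open", "read"), ("read", "read"), ("read", "close")]
```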

References

Albrecht D, Zukerman I, Nicholson A, Bud A. Towards a Bayesian model for keyhole plan recognition in large domains. In: Proceedings of the Sixth International Conference on User Modeling, Sardinia, Italy; 1997. p. 365-76.
Charniak E, Goldman R. A Bayesian model of plan recognition. Artif Intell 1993;64(1):53-79.
Dean T, Kanazawa K, Shewchuk J. Prediction, observation and estimation in planning and control. In: Proceedings of the Fifth IEEE International Intelligent Control Symposium, vol. 2; 1990. p. 645-6.
Denning DE. An intrusion-detection model. IEEE Trans Software Eng February 1987;SE-13(2):222-32.
Forrest S, Hofmeyr SA, Somayaji A, Longstaff TA. A sense of self for Unix processes. In: Proceedings of the 1996 IEEE Symposium on Security and Privacy, Los Alamitos, CA. IEEE Computer Society Press; 1996. p. 120-8.
Friedman N, Geiger D, Goldszmidt M. Bayesian network classifiers. Machine Learn 1997;29(2-3):131-63.
Geib CW, Goldman RP. Plan recognition in intrusion detection systems. In: DARPA Information Survivability Conference and Exposition (DISCEX), vol. 1; 2001. p. 46-55.
Goldman RP, Geib CW, Miller CA. A new model of plan recognition. In: Proceedings of the 1999 Conference on Uncertainty in Artificial Intelligence; 1999. p. 245-54.
Huang M-Y, Wicks TM. A large-scale distributed intrusion detection framework based on attack strategy analysis. Computer Networks 1999;31(23-24):2465-75.
Lane T, Brodley CE. Temporal sequence learning and data reduction for anomaly detection. In: Proceedings of the Fifth ACM Conference on Computer & Communication Security; 1998. p. 295-331.
Lee W, Xiang D. Information-theoretic measures for anomaly detection. In: Proceedings of the 2001 IEEE Symposium on Security and Privacy; 2001. p. 130-43.
Nicholson AE, Brady JM. Dynamic belief networks for discrete monitoring. IEEE Trans Syst Man Cybern 1994;24(11):1593-610.
Sterritt R, Marshall AH, Shapcott CM, McClean SI. Exploring dynamic Bayesian belief networks for intelligent fault management systems. In: Proceedings of the IEEE International Conference on Systems, Man and Cybernetics; 2000. p. 3646-52.
Tan K, Maxion R. "Why 6?" Defining the operational limits of stide. In: Proceedings of the 2002 IEEE Symposium on Security & Privacy; 2002. p. 188-201.
Tawfik AY, Neufeld E. Temporal Bayesian networks. In: Proceedings of the First International Workshop on Temporal Representation and Reasoning (TIME); 1994.
Wærn A, Stenborg O. Recognizing the plans of a replanning user. In: Proceedings of the IJCAI95 Workshop on the Next Generation of Plan Recognition Systems: Challenges for and Insight from Related Areas of AI, Montreal, Canada; 1995. p. 113-8.

Li Feng received his B.S. and M.S. degrees in electrical engineering from Xi'an University of Science and Technology, Xi'an, China, in 1997 and 2001, respectively. He is currently pursuing his PhD degree at the Center for Networked Systems and Information Security of Xi'an Jiaotong University, Xi'an, China. His research interests currently focus on intrusion detection and network security.

Xiaohong Guan received his B.S. and M.S. degrees in Control Engineering from Tsinghua University, Beijing, China, in 1982 and 1985, respectively, and his PhD degree in Electrical Engineering from the University of Connecticut in 1993. He was a consulting engineer with PG&E from 1993 to 1995. From 1985 to 1988 and since 1995 he has been with the Systems Engineering Institute at Xi'an Jiaotong University, Xi'an, China, and currently he is the Cheung Kong Professor of Systems Engineering and Director of the National Lab for Manufacturing Systems. He is also the Chair of the Department of Automation and Director of the Center for Intelligent and Networked Systems, Tsinghua University, China. He visited the Division of Engineering and Applied Science, Harvard University, from January 1999 to February 2000. His research interests include computer network security, and economics and security of networked systems.

Sangang Guo received his B.S. degree in theoretical mathematics from Northwest University and his M.S. degree in applied mathematics from Xi'an Jiaotong University, Xi'an, China, in 1989 and 1996, respectively. His research interests include optimization of large-scale systems, such as power generation scheduling methods with transmission constraints and algorithms for solving routing and wavelength assignment problems in wavelength division multiplexed optical networks. He is also interested in computer network security.

Yan Gao received her B.S. and M.S. degrees in electrical engineering from Xi'an Jiaotong University, Xi'an, China, in 2000 and 2003, respectively. Her research interests cover network security and system simulation.

Peini Liu received her B.S. degree in electrical engineering from Xi'an Jiaotong University, Shaanxi, China, in 2001. Her research interests include intrusion detection and network security.
