Frequent Episode Rules for Intrusive Anomaly Detection with Internet Datamining*

Min Qin and Kai Hwang
Internet and Grid Computing Laboratory
University of Southern California, Los Angeles, CA 90089
Emails: [email protected] and [email protected]
Abstract: We present a new datamining scheme for building anomaly-based intrusion detection systems (IDS) in a network environment. Frequent episode rules are generated for anomaly detection. Several rule-pruning laws are introduced to reduce the search space by up to 80% in anomaly detection. The new method demonstrates its effectiveness in detecting unknown network attacks embedded in traffic connections frequently requested by many Internet services, such as telnet, http, ftp, smtp, Email, authentication, and authorization. We test the new episode rules and pruning techniques over the 1999 DARPA Lincoln Lab IDS evaluation datasets. The rule-pruning process results in an average of 82% successful detections and a 13% reduction in false-positive alarms against more than 50 unknown network attacks, ranging from port scanning to R2L and DoS attack types. Our new scheme detects many attacks that cannot be detected by Snort, including smurf, Apache2, Guesstelnet, etc. Our scheme is applicable to detecting network anomalies in all TCP, UDP, and ICMP connections.
Index Terms: Network security, intrusion detection, datamining, anomaly detection, connection episodes, false alarms, Internet traffic, distributed systems, and Grid computing
1. Introduction

Network attacks have become a major threat to Internet computing and web services. In August 2003, the outbreak of the MS Blast worm left millions of host machines defenseless, with interrupted Internet services. An effective intrusion detection system (IDS) should be able to detect such attack profiles at an early stage. The purpose is to raise alarms in time to prevent major damage to network or client resources. Extensive research has been reported on the design and evaluation of IDSs in the past few years. Gaffney et al. [11] proposed a decision-theoretic approach to evaluating IDSs. A method for reducing the false alarm rate of an IDS was introduced by Axelsson [3], who identified the base-rate fallacy and implementation barriers. Other recent studies on IDSs can be found in Burroughs et al. [7], Ranum [27], and Sekar et al. [30]. According to the detection methods used, IDSs are classified into two major categories: signature-based versus anomaly-based. A signature-based IDS applies a misuse-detection model, by which attacks are checked against signatures saved from known attacks previously reported. Snort [29] is a good example of this kind of IDS. The misuse model is based on pattern matching, which is only good at detecting known attacks; unless the signatures are updated frequently, the misuse model will fail to detect new attacks. Anomaly-based IDSs are based on a normal-use detection model, which checks attack patterns against normal network behavior. The incoming traffic is compared with normal profiles to reveal any significant deviations. To distinguish between intrusive and normal behavior, algorithms are needed to generate frequent episode rules (FERs) [21] from audit traffic data. The concept of generating FERs using minimal occurrences was proposed by Mannila and Toivonen [22]. The advantage of anomaly detection lies in its ability to cope with unknown attack patterns. Its major drawback is a higher false alarm rate than the misuse model [2].
_____________________________________
* Manuscript submitted on Jan. 27, 2004 to the 13th USENIX Security Symposium, to be held in San Diego, CA, August 9-13, 2004. This work was supported by NSF/ITR Grant ACI-0325409 to the Internet and Grid Computing Laboratory at the University of Southern California. All rights are reserved by the coauthors. The corresponding author is Kai Hwang ([email protected]). Min Qin can be reached at [email protected].
January 29, 2004
Page 1 of 15
Associations are used to capture intra-record patterns, while FERs are used to detect inter-record patterns. Datamining often generates many long FERs with a high degree of redundancy or repetition. In this paper, we aim to remove the ineffective episode rules from the anomaly detection process. Statistically generated sequential rules for detecting anomalies were introduced in [32]. Instead of using datamining, a time-based inductive learning machine was introduced to adapt to changes in normal user behavior. An anomaly is detected when a sequence of events deviates significantly from the normal sequential rules. Hofmeyr et al. [13] use a similar approach by analyzing sequences of system calls to detect intrusions. In [18], Lane et al. transform a discrete temporal sequence into a metric space and use a clustering technique to reduce the size of the user model. Datamining has been suggested for IDS construction in [5, 8, 10, 19, 24]. In the JAM project [19], Lee et al. suggested the use of axis and reference attributes to constrain the number of rules generated. Their method reduces the number of rules used to some extent. They generate FERs to reveal useful temporal traffic features. JAM uses RIPPER [8] to build classifiers that can detect attack signatures. Fan et al. [9] extended Lee's work by introducing artificial anomalies to discover accurate boundaries between known classes and anomalies. Bridges et al. [6] apply fuzzy frequent episodes and fuzzy association rules to the problem of intrusion detection. The ADAM project [4] offered a datamining framework for detecting network intrusions. Unlike JAM, ADAM is an anomaly-based detection system. ADAM uses a sliding-window algorithm to find frequent associations in TCP connection data. These associations are then compared with normal profiles that have already been constructed. In this paper, we make a further reduction in the applicable FER rule space.
Our method differs from both Lee's scheme and ADAM by using a new FER-matching scheme. At present, a universally accepted IDS evaluation benchmark is rather difficult to find. In 1999, MIT Lincoln Laboratory [12, 17] conducted an evaluation of many IDSs under DARPA sponsorship.
The DARPA datasets have been widely used to evaluate IDS systems. Our IDS scheme was initially tested over these datasets, and this paper reports the testing results on the LL traffic datasets. Subsequently, we will test the new detection scheme on TCPdump data collected from several USC campuses [14]. This two-stage experimentation is intended to be fair and objective, independent of the specific training data used. Several research groups have identified problems associated with using the DARPA datasets for detection benchmarking. McHugh [23] criticized the DARPA data for its superficial background traffic and the unreliable accuracy yielded by tuning an IDS towards the target attacks. According to a recent analysis by Mahoney and Chan [20], the attack-free training data of the DARPA evaluation lacks some attributes covering TCP SYN regularity, source address spectrum, checksum, and packet header information, while these attributes do exist in the attack dataset. Anomaly IDS experimental results over the MIT/LL datasets may therefore lead to over-claimed accuracy. For example, the TTL values used to detect attacks in the DARPA 1999 data are not available in the training dataset. In order to correctly reflect the new IDS performance, we did not include these attributes when generating features from the available traffic data. We use the MIT/LL dataset to prove the viability and effectiveness of the new IDS scheme. The accuracy problem will be further addressed with real-life benchmarks in follow-up work [26]. The rest of the paper is organized as follows: Section 2 introduces the basic techniques for mining audit traffic of connection data. Section 3 presents an anomaly-based IDS architecture. Section 4 specifies a base-support algorithm for generating useful FERs to detect intrusions. In Section 5, three pruning techniques are introduced to reduce the FER search space. In Section 6, we outline the pruning procedure and FER matching issues.
Section 7 reports experimental results in terms of intrusion detection rate and false alarm rate. Finally, we summarize the lessons learned and suggest further research directions.
2. Mining of Audit Data on Internet Traffic

To build an effective network IDS against intrusions, we use datamining to find the patterns of both normal and intrusion behaviors from system audit data. We adopted the idea of axis and reference attributes introduced by Lee et al., since it includes domain-specific knowledge and is able to describe relationships among traffic records. The tasks of datamining are described by either association rules or frequent episode rules (FERs). An association rule is aimed at finding interesting intra-relationships inside a connection record. A FER describes the inter-relationships among multiple connection records.

2.1 Basic Mining Terminologies

Let T be a set of traffic connection records. Consider the set of attributes defined over the traffic records, such as A = {timestamp, duration, service, srchost, desthost} for TCP connections. Let I be a set of values for the attributes in A, such as I = {timestamp = 10, duration = 1, service = http, srchost = 128.125.1.1, desthost = 128.125.1.10}. Any subset of I is called an itemset representing certain traffic characteristics. Let X be a traffic itemset under evaluation. The support value for X, denoted Support(X), is defined as the percentage of connection records in T that satisfy X. For example, X = {timestamp = 10, duration = 1} is an itemset and Y = {service = http} is another itemset. In this example, X ∩ Y = ∅. The union of the two itemsets, X ∪ Y = {timestamp = 10, duration = 1, service = http}, represents the three traffic attributes as listed.

Association Rules: An association rule is defined between two disjoint traffic itemsets X and Y, with X ∩ Y = ∅. An association rule is denoted by:

X → Y, (c, s)

The association rule is characterized by a support value s and a confidence level c. These are probabilities of the corresponding traffic events, defined by:

s = Support(X ∪ Y)    (1.a)

c = Support(X ∪ Y) / Support(X)    (1.b)

Both s and c are fractional numbers calculated directly from the Support functions on the itemset X and on the joint itemset X ∪ Y as exemplified above.

Frequent Episode Rules: In general, a FER is expressed as:

L1, L2, …, Ln → R1, …, Rm, (c, s, window)    (2.a)

where Li (1 ≤ i ≤ n) and Rj (1 ≤ j ≤ m) are ordered itemsets in a traffic record set T. We call L1, L2, …, Ln the LHS (left-hand side) episode and R1, …, Rm the RHS (right-hand side) episode of the rule. Note that all itemsets are sequentially ordered, that is, L1, L2, …, Ln, R1, …, Rm must occur in the order listed. However, other itemsets could be embedded within the episode sequence. We define the support and confidence of rule (2.a) by the following two expressions:

s = Support(L1, L2, …, Ln, R1, …, Rm) ≥ s0    (2.b)

c = Support(L1, L2, …, Ln, R1, …, Rm) / Support(L1, L2, …, Ln) ≥ c0    (2.c)

We consider only the minimal occurrences of the episode sequence in the entire traffic sequence. The support value s is defined by the percentage of occurrences of the episode out of the total number of traffic records audited. The confidence level c is the conditional probability of the minimal occurrence of the joint episode given the LHS episode. Both parameters are lower bounded by s0 and c0, the minimum support value and the minimum confidence level, respectively. The window size is an upper bound on the time duration of the episode sequence.

Example 1: Support and confidence in frequent episode rules

Consider the association rule:

(service = http) → (duration = 1) (0.8, 0.1)

which means that 80% of all http connections have a duration of less than one second, and 10% of all network connections are http requests with a duration of less than one second. The itemset on the LHS differs in value from that on the RHS in the association rule. The episode rule:

(service = authentication) → (service = smtp), (service = smtp) (0.6, 0.1, 2 sec)

is used to specify an authentication event. The physical meaning is that if the authentication service is requested at time t1, there is a confidence level of c = 60% that two smtp services will follow before the time t1 + w, where the event window is w = 2 sec. The support of the three traffic events (service = authentication), (service = smtp), (service = smtp) accounts for 10% of all network connections. Note that the events on both sides of a FER need not be disjoint in an episode of traffic connection events.

2.2 Axis Attribute and Reference Attribute

The basic rule generation algorithm does not take any domain-specific knowledge into consideration. Often, too many ineffective rules are generated to be useful. For example, the following association rule:

srcbytes = 200 → destbytes = 300

is of little interest to the intrusion detection process, since the size information on source bytes and destination bytes is normally irrelevant to the traffic and threat condition. To address this issue, Lee et al. [19] introduced the concepts of axis attributes and reference attributes to constrain the generation of mining rules. Each association rule must contain some values of axis attributes; association rules that do not contain any axis attributes are considered irrelevant to the context. Axis attributes are selected from srchost (source host), desthost (destination host), srcport (source port), and service (destination port). Different combinations of these essential attributes form the axis attributes. The itemsets in a FER must contain some axis attributes. The reference attributes demand that the itemsets used in a FER have the same reference value, as exemplified below.

Example 2: FER for detecting a SYN flood attack

The SYN flood attack is specified by the following episode rule:

(service = http, flag = S0), (service = http, flag = S0) → (service = http, flag = S0)

where the event (service = http, flag = S0) is an association. The combination of associations and FERs reveals valuable information on both normal and intrusive behaviors. These can be applied to build an IDS against both known and unknown attacks.
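The support and confidence definitions in Eqs. (1.a) and (1.b) can be sketched in a few lines of Python. This is a minimal illustration; the record fields and values below are made up for the example, not drawn from the DARPA traces.

```python
# Hedged sketch: support and confidence of an association rule X -> Y
# over a list of connection records (Eqs. 1.a and 1.b).

def satisfies(record, itemset):
    """True if the record matches every attribute value in the itemset."""
    return all(record.get(attr) == val for attr, val in itemset.items())

def support(records, itemset):
    """Fraction of records in T that satisfy the itemset."""
    return sum(satisfies(r, itemset) for r in records) / len(records)

def rule_stats(records, lhs, rhs):
    """Return (confidence, support) of the rule lhs -> rhs."""
    joint = {**lhs, **rhs}                 # X union Y
    s = support(records, joint)            # Eq. (1.a)
    c = s / support(records, lhs)          # Eq. (1.b)
    return c, s

records = [
    {"service": "http", "duration": 1},
    {"service": "http", "duration": 1},
    {"service": "http", "duration": 5},
    {"service": "smtp", "duration": 1},
]
c, s = rule_stats(records, {"service": "http"}, {"duration": 1})
# Two of the three http records have duration 1, so c = 2/3;
# two of the four records match the joint itemset, so s = 0.5.
```

Counting FER support is analogous but sequential: the minimal occurrences of the ordered episode within the time window are counted instead of single-record matches.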
We identify below several open problems that must be solved in order to apply datamining effectively in building an anomaly-based IDS. These problems are solved in subsequent sections.
• The normal profile contains a large number of rules, which may enlarge the search space beyond the possibility of performing real-time association or episode analysis.
• Datamining often overlooks rare patterns. Infrequent network services, such as an authentication process, might be overlooked in the intrusion detection process.
• Many FERs are generated from daily network traffic. Many of them are not in the normal profile, raising false alarms that make the IDS less effective in practice.
3. Datamining Architecture for Anomaly Intrusion Detection

Our long-term goal is to build an intelligent intrusion detection system that can help secure any distributed computing infrastructure, such as a Grid computing system. The system can detect not only known attacks but also novel, unknown intrusions. The three major components of our IDS are the datamining engine, the intrusion detection engine, and the alarm generation engine, as shown in Fig. 1. In this paper, we focus on generating the normal profile database and constructing the anomaly detection engine. In order to correctly detect intrusion patterns, we extract two levels of information from the raw audit data of the network traffic. Using connection information alone can help detect only a small portion of the attacks, although it is very effective against flood and scan attacks [22]. After adding features extracted from the packet-level data, new attacks can be detected. In our work, we extract the traffic features from both connection and packet information in the raw audit data. In addition to generating new features for connection records, we use packet-level information to detect protocol anomalies. For each packet, we generate the following record from its header:
(connection.id, timestamp, srchost, srcport, desthost, destport, flags)

The connection.id uniquely identifies the connection for which the packet was generated. Timestamp is the sending time of the packet. The desthost and destport represent the IP address and port number of the destination, while srchost and srcport represent those of the source. Flags indicate the connection status and some special attributes of a packet, such as whether the srchost is identical to the desthost. For packets with the same connection.id, we check whether they violate any TCP protocols. For example, the TCP three-way handshake protocol can easily be verified by looking at the packets that establish the connection. Also during the preprocessing stage, packets with infrequent properties are identified for the purpose of anomaly detection. We keep a strong interest in those infrequent attribute values, since attackers often exploit them. For example, packets with the same destination and source address normally indicate a potential attack. For each connection, we generate a summary record by aggregating all its packet information.

[Figure 1. The datamining architecture for anomaly-based intrusion detection. Audit data passes through a data preprocessor and feature extraction; rules from real-time traffic feed the intrusion detection engine, which consults a signature database; the datamining engine produces attack-free episode rules for the anomaly detection engine, which uses a normal profile database and a security policy before the alarm generation engine raises alarms.]

We consider the following traffic connection records, abbreviated by:

timestamp, duration, srchost, srcport, desthost, service (destport), srcbyte, destbyte, and flags,
where srcbyte and destbyte are the numbers of bytes sent in each direction. This record is fed into our datamining engine for training and detection purposes. The anomaly detection engine compares episode rules in the current traffic data with normal profiles to detect anomalous behaviors. Our scheme detects two kinds of anomalies: sequential anomalies, and single-packet or single-connection anomalies. Instead of comparing frequent episodes, we use FERs as the indicator to detect anomalies, since a FER describes the relationship among a series of connections. If the FERs generated by the datamining engine deviate significantly from all normal FER rules, an alarm is raised. We calculate some temporal statistics from the current traffic data to analyze the connection data. To cope with both known and unknown attacks, we also check the signature database to classify known intrusions.
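The per-packet preprocessing described above — grouping header records by connection.id, flagging self-addressed packets, and checking the three-way handshake — can be sketched as follows. The tuple layout mirrors the packet record given earlier; the flag strings and the exact handshake check are assumptions for illustration.

```python
# Hedged sketch: grouping packet header records by connection.id and
# flagging protocol anomalies. Flag strings ("SYN", "SYN/ACK", "ACK")
# are assumed; a real preprocessor would decode TCP header bits.
from collections import defaultdict

def handshake_ok(packets):
    """packets: list of (timestamp, flags) for one connection.
    True if the connection opens with a valid three-way handshake."""
    flags = [f for _, f in sorted(packets)]
    return flags[:3] == ["SYN", "SYN/ACK", "ACK"]

def find_protocol_anomalies(packet_records):
    """packet_records: iterable of
    (connection_id, timestamp, srchost, srcport, desthost, destport, flags).
    Returns the ids of connections with a bad handshake or with
    srchost == desthost (a self-addressed, Land-style packet)."""
    conns = defaultdict(list)
    anomalies = set()
    for cid, ts, sh, sp, dh, dp, fl in packet_records:
        conns[cid].append((ts, fl))
        if sh == dh:                      # self-addressed packet
            anomalies.add(cid)
    for cid, pkts in conns.items():
        if not handshake_ok(pkts):
            anomalies.add(cid)
    return anomalies
```

Per-connection summary records for the mining engine would then be built by aggregating each group in `conns`.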
4. A New Base-Support Datamining Algorithm

Most mining techniques exclude infrequent traffic patterns. This may cause the IDS to be ineffective in detecting rare network events. For example, the authentication service is infrequent in ordinary network traffic, except in E-commerce and digital government applications. If we simply lower the support threshold, a large number of uninteresting patterns associated with the frequent services are discovered. Lee et al. used level-wise mining to iteratively lower the minimum support value. Initially, they use a high minimum support value si to find the episodes related to high-frequency attribute values. The procedure iteratively lowers the support threshold by half, so that each new candidate itemset must contain at least one "new" axis value. The procedure terminates when a very small threshold s0 is reached. We have tested Lee's level-wise algorithm using the 1999 DARPA intrusion dataset. The results show that some FERs contain unrelated associations.
Example 3: FER rules for http and telnet operations

Consider the following rule with three events on the left-hand side and one consequent event on the right-hand side, where service is the axis attribute and the source host is the reference attribute:

(service = http, flag = SF), (service = http, flag = SF, srcbyte = 5000), (service = telnet, flag = SF) → (service = http, flag = SF, srcbyte = 5000) (0.7, 0.0025)

Since telnet has no relationship with the http operation, this episode rule is not useful for describing normal traffic patterns. Although telnet is a frequent service, the episode rules related to telnet are rare. The above episode rule has the highest support value among all rules related to telnet. Thus it is probable for a common service to appear in an episode rule with an extremely low support value, if the individual connections are independent of one another.

We introduce a base-support mining algorithm to address this problem. The base-support of an episode is defined as the ratio between the number of minimal occurrences of the episode and the number of records in T that contain the uncommon axis attributes of this episode. Define 1-itemsets as those containing only one attribute, saxis as the support value of an axis attribute value, and min{saxis} as the minimum support over all axis attribute values. Our base-support mining process is specified in Algorithm 1.

Algorithm 1: Base-Support Mining Procedure
Input: Base-support threshold s and its axis attribute(s)
Output: Frequent episode rules
Begin
  (1) For all distinct axis attribute values, calculate their support in the database
  (2) Scan the database to form L = {large 1-itemsets that meet s × saxis}
  (3) While there are new rules generated
  (4)   Find serial episodes from L: each episode must have a support value larger than s × min{saxis}
  (5)   Append the generated episode rules to the output rule set
      End while
End
We have experimented with our base-support mining algorithm on the 1999 DARPA IDS evaluation dataset [17]. To construct normal network patterns, the attack-free training data of the first and third weeks are fed into our base-support mining engine. We use a simplified approach to merging frequent episode rules from multiple days: after finding FERs from each day's audit record, we simply merge them into a large rule set by removing redundant rules. The minimum confidence value was set to 0.6 and the window size was chosen as 30 seconds. We chose the source host as the reference attribute and the service as the axis attribute. For uncommon services, related rules are hard to generate if we do not aggregate the connection records. For example, rules related to rare services such as printer may have only one occurrence in one day's audit data. Such one-time patterns will often be ignored by most datamining engines. In order to capture rules related to these services, we aggregate the connections of rare services into the database. When more data are accumulated, we use our base-support algorithm to mine the frequent rules.

Figure 2 shows the results of our base-support algorithm and Lee's level-wise mining algorithm for all the TCP connections in two weeks of attack-free data from Lincoln Laboratory [17]. An interesting feature of Lee's level-wise mining algorithm is that lowering the initial support value does not necessarily generate more frequent episode rules. When dealing with 10 days of training data, the level-wise algorithm with an initial support value of 0.3 generates more rules than with an initial value of 0.1. Because the initial support value is halved at each iteration, its starting value has less impact after more iterations, and it is rather hard to generate rules related to infrequent axis attributes. On the other hand, our base-support mining is fair to different axis attribute values, since the same percentage of records related to each attribute value is required in any candidate rule. A high minimum base-support value results in fewer frequent episode rules. Under the context of different axis attribute values, the base-support mining algorithm provides the same relative minimum support that a normal datamining algorithm does. If the daily network traffic does not change often, our base-support mining algorithm stabilizes very quickly and needs only a few days of training data in practical detection applications.
[Figure 2. Experiment on attack-free TCP connections of the 1999 DARPA intrusion detection evaluation dataset, using our base-support mining algorithm with a minimum confidence level of 0.6 and a window size of 30 sec. The plot shows the number of episode rules generated (0 to 400) versus training sets of 1, 4, 7, and 10 days, for the base-support algorithm (minimum base-support 0.1 and 0.3) and the level-wise algorithm (initial support value 0.1 and 0.3).]
However, if new axis attribute values for new services show up very often in the network traffic, our algorithm cannot stabilize quickly, since each training set introduces new frequent episode rules.

5. Pruning of Ineffective Episode Rules

Because of the large number of records in the TCPdump data, a large number of uninteresting rules are still generated. To reduce the number of rules generated and to provide a simplified view of the data patterns, we propose the following pruning techniques to reduce the rule space. We consider a FER effective if it is more applicable and more frequently used. An episode rule is said to be ineffective if it is rarely used in detecting anomalies in network traffic. The following pruning laws apply to today's open networks.
5.1 Transposition Law

Compare the following two FERs:

L1, L2, …, Ln → R1, …, Rm (c1, s1)    (3.a)

L1, L2, …, Ln-1 → Ln, R1, …, Rm (c2, s2)    (3.b)

The second rule, in Eq. (3.b), is transposed by moving the event Ln from the LHS to the RHS. We consider the second rule more effective than the first one. These two rules have the same support value, s1 = s2 = Support(L1, L2, …, Ln, R1, …, Rm), as defined in (2.b). However, their confidence levels differ:

c1 = Support(L1, …, Ln, R1, …, Rm) / Support(L1, …, Ln) ≥ Support(L1, …, Ln, R1, …, Rm) / Support(L1, …, Ln-1) = c2    (3.c)

The transposed rule in Eq. (3.b) has a smaller confidence value than the original rule in Eq. (3.a). If the confidence level c2 is above the minimum confidence value c0, then c1 must also be above c0. Thus the first rule is always implied by the second one, so we can prune the first rule from the rule set.
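The transposition-law pruning amounts to a set lookup: a rule is dropped when its transposed form (the last LHS event moved to the front of the RHS) is already in the rule set. The tuple representation of rules below is an assumption for illustration; itemsets are any hashable values.

```python
# Hedged sketch of transposition-law pruning. A rule is a pair
# (lhs_tuple, rhs_tuple) of ordered itemsets. The rule
# L1..Ln -> R1..Rm is pruned when the transposed rule
# L1..Ln-1 -> Ln, R1..Rm is present, since both share the same
# support and the transposed rule has the lower confidence.

def transpose(rule):
    """Move the last LHS event to the front of the RHS."""
    lhs, rhs = rule
    return (lhs[:-1], (lhs[-1],) + rhs)

def prune_by_transposition(rules):
    rule_set = set(rules)
    # Rules with a single LHS event cannot be transposed (the LHS
    # would become empty), so they are always kept.
    return [r for r in rule_set
            if len(r[0]) <= 1 or transpose(r) not in rule_set]
```

For instance, with both (A, B) → (C) and (A) → (B, C) in the set, only the transposed rule (A) → (B, C) survives.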
Example 4: Application of Transposition Law

Compare the following two rules. The first one is more effective than the second because it satisfies the transposition law:

(service = http, flag = SF) → (service = http, flag = SF), (service = http, flag = SF)

Both frequent episode rules describe normal http behavior. We need only include the first one in our normal FER rule set and ignore the second rule:

(service = http, flag = SF), (service = http, flag = SF) → (service = http, flag = SF)

During the detection phase, if the second rule is generated, we apply the transposition law and regard it as a normal rule, since it is implied by the first rule in our normal rule set.

Since we generate only one rule from each frequent episode, a large number of redundant comparisons become unnecessary. However, datamining may generate some longer rules describing similar normal behavior. A good example is given below:

(service = http, flag = SF), (service = http, flag = SF) → (service = http, flag = SF), (service = http, flag = SF)

Compared with the above shorter rules, this rule has the same power in describing normal http behavior. We need to remove as much redundancy from these rules as possible.

5.2 Rule Elimination Law

Shorter rules, or rules with a shorter LHS (left-hand side), are considered more effective than longer rules or rules with a longer LHS, because shorter rules are often easier to apply or to compare. Clustering of shorter rules is also much easier. The following FER:

L1, L2 → R1 (c1, s1)    (4.a)

becomes ineffective if one of the following two conditions is met:

(i) Rule L1 → R1 does not exist and rule L2 → R1 (c2, s2) exists in the rule set, where the difference c1 − c2 is small and L1 does not exclude R1. A similar conclusion can be drawn between L1, L2 → R1 and L1 → R1 (c3, s3).

Example 5: Application of Elimination Law

Based on the above law, the following rule is considered ineffective:

(service = http), (service = authentication) → (service = smtp) (0.6, 0.1)

because of the existence of the rule (service = authentication) → (service = smtp), which is related only to the smtp operation; the appearance of http does not affect the other two itemsets. We keep only the shorter rule.