Immunizing Computer Networks: Getting All the Machines in Your Network to Fight the Hacker Disease

Steven A. Hofmeyr and Stephanie Forrest
Dept. of Computer Science, University of New Mexico, Albuquerque, NM 87131-1386
{steveah, forrest}@cs.unm.edu

November 2, 1998

Abstract

This paper introduces a method of distributed network intrusion detection that scales with the number of computers on a network and is tunable (the probability of detection can be traded off against overhead). Experiments with real network traffic show that the system has high detection rates for several common routes of intrusion, and low false-positive rates under normal behavior. The method can easily be extended to accommodate dynamically changing definitions of normal behavior (for example, adding a host to the network) and to remember known patterns of intrusion.

1 Introduction

(Submitted to the 1999 IEEE Symposium on Security and Privacy.)

The problem of protecting networks of computers from harm is of increasing importance, due to the explosive growth and increasing connectivity of these networks. “Harm” to networks can arise from unauthorized activities that violate security policy on individual hosts (intrusions or intrusion attempts), malicious activities that are technically legal (such as denial-of-service attacks), inadvertent misuse of systems, and hardware or software problems. In this paper, we introduce a method of distributed network intrusion detection that scales with the number of computers on a network, is tunable (the probability of detection can be traded off against overhead), has high detection rates for common routes of intrusion, and has low false-positive rates under normal behavior. The method can easily be extended to accommodate dynamically changing definitions of normal behavior (say, when hosts are added to or removed from the network) and to remember known patterns of intrusion.

Our method builds on two earlier computer security projects: the Network Security Monitor [6, 8] and the immune-inspired negative-detection algorithm [4]. The Intrusion Detection System (IDS) presented here monitors network traffic over a broadcast local area network (LAN). Earlier, researchers at the University of California, Davis (UC Davis) showed that intrusions can be successfully detected by monitoring traffic patterns in broadcast LANs [6, 8]. Their system, known as the Network Security Monitor (NSM), has several important advantages: monitoring network traffic allows the IDS to be independent of the operating system (it depends only on network protocols, which are highly standardized); monitoring broadcast traffic means that every machine in the LAN has real-time access to all traffic (so every machine can monitor every other machine); and there are minimal tracing costs. Although successful, NSM has serious limitations. It is computationally expensive, requiring its own dedicated machine. Further, its architecture does not scale: the computational complexity increases as the square of the number of machines communicating. Finally, NSM is a single point of failure, because it runs on a single machine. These limitations can be overcome by distributing the IDS over all machines in the network. Distribution makes the IDS robust by eliminating the single point of failure, and makes it more flexible and efficient: computation can vary from machine to machine, fully utilizing idle cycles. However, the architecture of NSM is not easily distributable; distributing NSM would require either excessive resource consumption on every machine on which it was run, or communication between machines. Negative detection provides a natural way to distribute a detection task across multiple locations [3, 2, 4].
With negative detection, the system retains a set of detectors that match occurrences of abnormal or unusual patterns (in this case, the patterns are representations of network packets). The detectors are “negative” in the sense that they (implicitly) define normal network traffic in terms of non-normal packets. In the original papers, negative detection was used to monitor program files for changes made by computer viruses. Because in that domain the system being monitored is static and centralized, it does not exploit all that negative detection has to offer. In this paper, we bring together the themes of negative detection and intrusion detection in a new context: network intrusion detection. Distributed negative detection has numerous advantages: it is localized (no communication is required between machines), it is scalable (errors decrease exponentially with the number of machines that independently implement the detection system), it is tunable (there is a trade-off between computational resources used and the number of errors), and it is flexible (different machines can have different numbers of detectors, to take advantage of idle cycles or to free up heavily used machines). One limitation of the negative-selection algorithm as originally implemented is that it can result in undetectable abnormal patterns called holes, which limit detection rates [3, 2]. In this paper, we address the problem of holes by introducing permutation masks to remap the representation of network packets seen by different detectors. We use two mechanisms to reduce false-alarm rates: activation thresholds (which require a detector to find multiple abnormal patterns before an alarm is raised) and adaptive thresholds (which change the sensitivity of the system). Activation thresholds allow a detector to integrate abnormal patterns over time (as in temporally separated abnormal activity from one location), and adaptive thresholds allow the system to integrate abnormal activity from multiple locations (as in distributed coordinated attacks).

(As switched networks become increasingly popular, the broadcast assumption may become problematic. Possible extensions to switched networks are discussed in section 5.)

Initial results of using negative detection to distribute the IDS over a LAN are promising. With as little as 1 Kbyte of information per machine on a network of 50 machines, the IDS clearly detects 100% of the 8 intrusive incidents against which the system was tested. Even with such successful detection rates, we have very low false-positive rates, with under two false positives per day over a week of testing against normal traffic. Furthermore, the IDS is easily tunable; for example, we can improve the chance of detecting stealth attacks if we are willing to tolerate increased false positives. Fortunately, these false-positive rates do not grow indefinitely: in the tests we carried out, there were at most 20 new packets per day (averaged over a week), which means a false-positive rate of at most 20 per day.

In the remainder of the paper, we review NSM (Section 2), show how the negative-selection algorithm can be used to distribute an NSM-like IDS (Section 3), describe a set of experiments to test the system (Section 4), discuss the significance of our results and outline areas of future investigation (Section 5), and draw some conclusions (Section 6).

2 Background

The IDS presented here monitors network traffic over a broadcast local area network (LAN). The normal traffic patterns in a LAN can be defined in terms of datapaths, where a datapath is a triple (source host, destination host, service), representing a possible TCP connection between a source and destination host using a particular network service (see figure 1). A profile of connections can be built up by monitoring the network during normal usage, and this profile can be used to train an anomaly detection system. Once the system is trained, it can be used to monitor the network for deviations from the normal patterns. A similar approach was used in the Network Security Monitor (NSM) intrusion detection system developed at UC Davis [8, 6]. Results reported in Mukherjee et al. [8] indicate that over a two-month period, NSM identified 300 out of 11000 connections as anomalous, and subsequent analysis revealed that most of these 300 connections were indicative of attacks or illegitimate behavior. This approach to anomaly detection was successful because most machines in the LAN typically communicated with only a few (3 to 5) other machines, over a very limited set of services. If the network is viewed as a graph, with a node for each host and each connection between hosts represented by an edge between the corresponding nodes, then the graph of normal traffic was very sparsely connected. This sparseness is encouraging because it means that the normal profile is fairly compact, and that attackers have a high probability of generating unusual connections. The success of NSM suggests that this is a good approach to network intrusion detection. However, there are limitations to the way in which this intrusion detection was implemented in NSM. NSM runs on a single machine, and retains a complete profile of all normal patterns in a matrix of connections. This centralized approach required significant processing power and storage space.
Furthermore, this kind of system is not scalable: as more machines are added, the complexity of storage increases quadratically. And finally, the machine upon which NSM is running constitutes a single point of failure: if that machine is compromised, the IDS is effectively disabled. We demonstrate that these limitations can be overcome by distributing the IDS. In the following section, we describe exactly how this kind of detection scheme can be distributed.

[Figure 1 diagram: two hosts (ip 10.10.10.5, port 21, and ip 20.20.20.5, port 1700), one internal and one external, communicating over a broadcast LAN; the datapath triple shown is (10.10.10.2, 20.20.20.5, ftp).]

Figure 1: Patterns of network traffic on a broadcast LAN. Each machine can be thought of as a node in a graph, with a connection between machines forming an edge between the corresponding nodes. Note that a collection of normal datapaths will include paths between hosts on the LAN, as well as connections between hosts on the LAN and external hosts.
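Before distributing detection, the centralized NSM-style profile amounts to a set of observed datapath triples. The following is a minimal sketch of that idea (function names are hypothetical; Python is used for illustration throughout):

```python
# Minimal sketch of an NSM-style normal profile: the set of datapath
# triples observed during training. Function names are hypothetical.

def build_profile(connections):
    """connections: iterable of (source_host, destination_host, service)."""
    return set(connections)

def is_anomalous(profile, triple):
    # A connection is flagged if its datapath never occurred in training.
    return triple not in profile

training = [
    ("10.10.10.2", "20.20.20.5", "ftp"),
    ("10.10.10.2", "10.10.10.7", "telnet"),
]
profile = build_profile(training)
assert not is_anomalous(profile, ("10.10.10.2", "20.20.20.5", "ftp"))
assert is_anomalous(profile, ("10.10.10.9", "20.20.20.5", "ftp"))
```

Replicating this complete set on every host is exactly the non-scalable option that negative detection avoids.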

3 A scheme for distributing network intrusion detection

The detection scheme presented here is based upon an abstract model of detection, for which much theoretical analysis has been done [4, 3, 2]. In this paper, we concentrate on the practical application of the model, and so discuss the abstract model only briefly, within the context of network intrusion detection. Consider a universe of binary strings (although we could use any alphabet, we use binary strings for reasons explained in section 3.3), where each string represents an event in the system of interest. We assume that the universe of strings is closed, and can be subdivided into two disjoint sets, which we label self and nonself. (This terminology is borrowed from immunology. In general, we shall avoid talking about the immune analogy, except in the discussion, section 5.) The self set consists of all strings that represent legitimate events, and the nonself set consists of all strings that represent illegitimate events. We have assumed that these sets are disjoint, but in reality they could overlap; in such cases, we would hope that any illegitimate incident generates several strings, at least some of which are exclusively in nonself (in our experiments in the network-traffic domain, we find that this is precisely the case, as shown in section 4.2).



Figure 2: The universe of strings. Each string belongs to one of two sets, self or nonself. Self strings represent acceptable, or legitimate, events; nonself strings represent unacceptable, or illegitimate, events. A detection system attempts to encode the boundary between the two sets. Where it fails to classify self strings as normal, false positives are generated; nonself strings that are not classified as anomalous generate false negatives.

An anomaly detection system must somehow encode the boundary between self and nonself, so that, given any string, it can classify that string as either self or nonself. Typically, such a system is trained on some representative collection of self strings, and then used to classify new strings according to the model of self retained by the detection system. In effect, the system classifies new strings as normal or anomalous, and can make two kinds of errors in classification: a false positive occurs when a self string is classified as anomalous, and a false negative occurs when a nonself string is classified as normal (see figure 2).

3.1 Representation of network traffic

In our application of the model, each event is represented by a TCP/IP SYN packet header [7]. We consider only SYN packets because all subsequent packets on the same connection represent the same datapath triple as the original SYN packet. Furthermore, the volume of data is greatly reduced, because there are typically more than 1000 normal packets for every SYN packet. Monitoring only packet headers, and not packet contents, requires less processing and storage, and avoids privacy infringement. Packet headers are 4-tuples of the form (source IP address, source port, destination IP address, destination port). These 4-tuples are mapped to triples of the form (source IP, destination IP, service) by mapping the two port numbers to a service as follows. Generally, one machine is the server and the other is the client, and the port on the server usually identifies the kind of connection. For example, if we have the 4-tuple (10.10.10.5, 21, 20.20.20.5, 1700), then the source host port is 21, which identifies the connection as ftp. Some ports are assigned to commonly known services, such as ftp, telnet, login, etc., whereas other ports are non-assigned (for a sample list of assigned ports, see Appendix G in Garfinkel and Spafford [5]). Furthermore, ports can be privileged (those below 1024) or non-privileged (1024 and up). When converting a source and destination port pair to a service, if one of the ports is an assigned port, the service is identified with that assignment (in the above example, the service would be ftp). If neither port is assigned but one is privileged, the service is classified as a non-assigned privileged service; if neither is assigned and neither is privileged, it is classified as a non-assigned non-privileged service. That is, all non-assigned port pairs are grouped into one of two categories: privileged or non-privileged. Using this mapping from port numbers to services, the total number of TCP service categories is about 130.

[Figure 3 layout: bits 1–8: internal host (or, when both hosts are internal, the server); bits 9–40: external host (or the client); bit 41: server flag; bits 42–49: service.]

Figure 3: Binary string representation of a TCP network packet header.

The derived triple (source host, destination host, service) is now mapped to a binary string of length l = 49. The breakdown of the binary representation is as follows (see figure 3). Because at least one host must be on the LAN, the first 8 bits represent a host on the LAN, and are taken from the least significant byte of its IP address (the other bytes are irrelevant). The following 32 bits represent the other host involved in the communication, which will require all 32 bits if that host is external. If the other host is also on the LAN, only 8 bits are needed, but the full 32 bits are still used to maintain a fixed-length representation.
If an external host is involved, it is always represented in the second 32 bits, so an extra bit is used to indicate whether or not the first host is the server; if the first host is the server, the bit is set to one, otherwise it is set to zero. The final 8 bits represent the type of service. The service type is mapped from its category to a number from 0 to 255. All non-assigned privileged ports are represented by a single number, and all non-assigned non-privileged ports are represented by a different, single number.
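The port-to-service mapping and the 49-bit packing described above might be sketched as follows. The assigned-port table shown and the numeric service codes are illustrative assumptions, not the paper's actual tables:

```python
# Hedged sketch of the triple derivation and 49-bit packing. The assigned-port
# table and the numeric codes for the two non-assigned categories are
# illustrative assumptions.

ASSIGNED = {21: "ftp", 23: "telnet", 25: "smtp", 80: "http"}   # small sample
SERVICE_CODE = {"ftp": 0, "telnet": 1, "smtp": 2, "http": 3,
                "priv": 254, "nonpriv": 255}                    # hypothetical

def ports_to_service(src_port, dst_port):
    for p in (src_port, dst_port):
        if p in ASSIGNED:                 # an assigned port identifies the service
            return ASSIGNED[p]
    if src_port < 1024 or dst_port < 1024:
        return "priv"                     # non-assigned, privileged
    return "nonpriv"                      # non-assigned, non-privileged

def encode(internal_ip, other_ip, first_is_server, service):
    """Pack the triple into a 49-bit integer: 8 bits for the internal host's
    low IP byte, 32 bits for the other host, 1 server-flag bit, 8 service bits."""
    internal_byte = int(internal_ip.split(".")[-1])
    other = sum(int(b) << (8 * (3 - i))
                for i, b in enumerate(other_ip.split(".")))
    code = SERVICE_CODE[service]
    return (internal_byte << 41) | (other << 9) | (int(first_is_server) << 8) | code

# The 4-tuple (10.10.10.5, 21, 20.20.20.5, 1700) from the text maps to ftp:
assert ports_to_service(21, 1700) == "ftp"
assert encode("10.10.10.5", "20.20.20.5", True, "ftp").bit_length() <= 49
```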

3.2 Distributing the detection system

NSM functions by encoding as complete a description of the self set as possible during the training phase; then, during the test phase, if a new string is not found in the normal profile, it is declared to be anomalous. We could distribute this system by simply replicating it across many hosts, but such a solution is neither scalable nor efficient. There is a much better solution: instead of trying to encode precisely the boundaries of self, we can encode some crude or approximate superset of self. (It is essential that the encoding does not exclude any part of self, in order to avoid false positives.) Clearly, the more general the detector, the cheaper it will be to implement. This concept is illustrated in figure 4. If each host had a different approximate encoding, local detection would not be very accurate, but over all locations the detection would be refined by the combination of all the different encodings.

Figure 4: A solution to distributed detection. A detector is an approximate encoding of the boundary of the self set; this minimizes the information requirements of the detector by ignoring the specific convolutions of the boundary. Each machine has a different approximate detector, and the combination of these detectors refines the distinction between self and nonself.

How can we generate these crude “superset-detectors”? The immune system has a very elegant solution to offer. The first step is to define detection in terms of the complement of the self set; that is, we define negative detectors, each of which encodes, or covers, a subset of nonself rather than a subset of self. A set of such negative detectors then encodes a crude “superset-detector”, but only if each negative detector in the set is actually valid (i.e., does not cover any self strings). Sets of valid detectors can be constructed by the negative selection algorithm (see figure 5) [4]: randomly generated detectors are compared to the representative set of self strings (the training set), and censored (deleted) if they match any of those self strings (i.e., if the detector covers any part of the self set). We repeatedly generate and censor detectors until we have enough to encode the “superset-detector” at the desired level of approximation.

Figure 5: The negative selection algorithm. Candidate negative detectors are generated randomly; if a candidate covers any part of the self set, it is eliminated and regenerated. This process is repeated until we have a set of valid negative detectors.

The detection system is distributed by placing different sets of detectors on different hosts. To the extent that the training set is a complete representation of self, the negative selection algorithm guarantees no false positives. We shall return to the problem of collecting a complete self set in section 3.5.
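The generate-and-censor loop of figure 5 can be sketched as follows, using toy string length and matching threshold chosen only so the example runs quickly (the paper uses l = 49 and r = 12):

```python
import random

# Sketch of the generate-and-censor loop of the negative selection algorithm,
# using the r-contiguous-bits rule. Toy sizes (L = 12, R = 4) for illustration.

L, R = 12, 4

def rcb_match(a, b, r=R):
    # True if a and b agree in at least r contiguous positions.
    run = 0
    for x, y in zip(a, b):
        run = run + 1 if x == y else 0
        if run >= r:
            return True
    return False

def negative_selection(self_set, n_detectors, rng=random.Random(0)):
    detectors = []
    while len(detectors) < n_detectors:
        candidate = "".join(rng.choice("01") for _ in range(L))
        # Censor: discard any candidate that matches (covers) a self string.
        if not any(rcb_match(candidate, s) for s in self_set):
            detectors.append(candidate)
    return detectors

self_set = {"000000000000", "111111111111"}
detectors = negative_selection(self_set, 20)
# Every surviving detector is valid: it covers no self string.
assert all(not rcb_match(d, s) for d in detectors for s in self_set)
```

Distributing the system then amounts to running this generation independently on each host, with a different random seed per host.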

3.3 Implementing negative detectors

There are many ways of implementing the detectors; for example, a detector could be a production rule, a neural network, or an agent [1]. We chose to implement detection as string matching, where each detector is a string d, and detection of a string s occurs when there is a match between s and d, according to a matching rule. We use string matching because it is simple and efficient to implement, and easy to analyze and understand. Obvious matching rules include Hamming distance or edit distance, but in this paper we use the r-contiguous bits rule [9], because of available theoretical analysis [4, 3, 2], and because it is localized. We discuss the advantages of localization in section 3.4. Two strings d and s match under the r-contiguous bits rule if they have the same symbols in at least r contiguous locations (see figure 6). The value r is a threshold, and can be set differently for different detectors or detector sets. The value of r determines the specificity of the detector, which is an indication of the number of strings covered by a single detector. For example, if r = l, the matching is completely specific; that is, a detector will detect only a single string (itself).

Figure 6: Matching under the r-contiguous bits match rule (detector 1110111101 compared against 0110100101 [match] and 0100110100 [no match], with r = 4). In the no-match example, the detector matches for r = 3, but not for r = 4.

A consequence of approximate matching is that there is a trade-off between the number of detectors used and their specificity: as the specificity of the detectors increases, so the number of detectors required to achieve a certain level of coverage also increases. This trade-off has been theoretically analyzed in Forrest et al. [4] and D'haeseleer et al. [3]. We shall not discuss the analysis here, but simply note that obtaining general results that are meaningful is difficult, because the analysis depends on the structure (i.e. the regularities) of the self set. We can obtain solutions for problem instances, but deriving nontrivial solutions for classes of problems is hard. The correct balance in the trade-off depends on the specific problem instance, which is why we have used a binary alphabet for the strings: binary alphabets give the most flexibility in the choice of r, and hence the most flexibility in the trade-off.
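A minimal implementation of the r-contiguous bits rule, checked against the detector and strings shown in figure 6:

```python
# A minimal r-contiguous-bits matcher, checked against the detector and
# strings shown in figure 6 (detector 1110111101, r = 4).

def rcb_match(d, s, r):
    """d and s match if they have the same symbol in at least r
    contiguous positions."""
    run = 0
    for x, y in zip(d, s):
        run = run + 1 if x == y else 0
        if run >= r:
            return True
    return False

assert rcb_match("1110111101", "0110100101", 4)      # match at r = 4
assert not rcb_match("1110111101", "0100110100", 4)  # no match at r = 4 ...
assert rcb_match("1110111101", "0100110100", 3)      # ... but a match at r = 3
```

Lowering r makes each detector more general (it covers more strings), which is the specificity side of the trade-off discussed above.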

3.4 Consequences of a negative detection scheme: the problem of holes

Although increasing the specificity of detectors increases the number of detectors required, it also increases the accuracy with which the detection system can distinguish between self and nonself. This is another important aspect of the trade-off, because at any given level of nontrivial specificity (i.e. for which r < l), the detectors may not be precise enough to encode the self set accurately; that is, there may be strings in the nonself set that cannot be covered by valid detectors. Such strings are called holes because they are “holes” in the detection system's coverage of nonself (see figure 7). It has been proved that, for a given level of nontrivial specificity, holes can exist regardless of the match rule used [3]. The number of holes depends on both the regularities in the self set and the specificity of the detectors; generally, the less specific the detectors, the greater the number of holes. Holes place a fundamental lower bound on the number of false negatives (i.e. an upper bound on the best detection, or true-positive, rates).

Figure 7: The existence of holes. There are parts of the nonself set that cannot be covered by valid negative detectors of a given specificity.

In general, these limitations can be overcome in two ways: by varying the specificity of the detectors, or by changing the “shape” of the detectors. With a set of detectors of varied specificities, the idea is that large areas of the nonself set can be covered by very general detectors, while the intricate convolutions of the self set are tracked by highly specific detectors, thus eliminating holes. This approach is problematic with randomly generated detectors, because we have no guarantee that the highly specific detectors will fall anywhere near the self/nonself boundary. In general, the probability that the highly specific detectors will track the intricacies of the self/nonself boundary is extremely low if the self set is very small compared to the size of the universe (the expected case). The “shape” of the detectors can be changed by using a different matching rule for each detector. If each detector set on each machine has a different matching rule, it will be vulnerable to a different set of (possibly overlapping) holes; the combined set of false negatives over all machines is then the intersection of the different hole sets (see figure 8). A simple way of generating a wide variety of different matching rules is to use a locality-based match rule such as r-contiguous bits, together with randomized permutation masks. Each detector set has its own randomly generated permutation mask, by which it reorders all input strings before checking for matches with the r-contiguous bits rule. For efficiency, in our current implementation, binary strings are packed into bytes, and the bytes are permuted. A permutation mask is then a list of 6 numbers, representing the remapping of the 6 bytes in the binary string (the final, 49th bit is not remapped). Because this gives only 720 possible combinations, we also use a randomly generated hash table to remap byte values: each location in the hash table contains a randomly generated number, which replaces the byte values that hash to that location. The random hash table increases the number of possible combinations to 184320, with little additional cost in computation. Figure 9 shows an example of this byte-and-hash permutation. (Experimental results suggest that the specific permutation used is not particularly important, provided it generates sufficient diversity of detector “shapes”.)
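Under the assumptions above (a per-host mask over 6 byte positions and a 256-entry random byte-remapping table), the byte-and-hash permutation might be sketched as follows; the seed and names are illustrative:

```python
import random

# Sketch of a per-host byte-and-hash permutation: six byte positions are
# reordered by a random mask, then each byte value is remapped through a
# random 256-entry table. The 49th bit is passed through unchanged.
# The seed and data-layout details are illustrative assumptions.

rng = random.Random(42)                               # per-host seed
mask = rng.sample(range(6), 6)                        # random byte-order mask
table = [rng.randrange(256) for _ in range(256)]      # random byte remapping

def permute(bytes6, extra_bit):
    reordered = [bytes6[i] for i in mask]             # reorder the 6 bytes
    hashed = [table[b] for b in reordered]            # remap each byte value
    return hashed, extra_bit                          # 49th bit not remapped

out, bit = permute([112, 31, 83, 201, 172, 18], 1)    # byte values from fig. 9
assert len(out) == 6 and bit == 1
# Per-host determinism keeps matching well defined on that host:
assert permute([112, 31, 83, 201, 172, 18], 1) == (out, 1)
```

Each host applies its own fixed (mask, table) pair to every input string before r-contiguous matching, so different hosts see differently shaped representations of the same packet.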



Figure 8: The problem of holes can be ameliorated by having a different matching rule for each detector set on each machine.

3.5 Overcoming the problem of incomplete self sets: activation thresholds

We have described how the negative detection algorithm can be implemented so as to avoid false positives. However, if we train the system on an incomplete description of self, new but legitimate patterns will generate false positives. We would like the system to be tolerant of such minor, legitimate new patterns, yet still detect intrusive activity. We have implemented two methods designed to overcome this problem. The first is an activation threshold: a detector is not activated on every match; instead, it records the number of times it matches, and when the number of matches exceeds the activation threshold, the detector raises an alarm. Once a detector has raised an alarm, it resets its match count to zero. Thus, only repeated occurrences of structurally similar strings will trigger alarms. There is no time horizon to this mechanism: even if the new strings are encountered only at large time intervals, the detector will still be activated once sufficient matches have accumulated. Thus, repeated activity from closely related sources will raise alarms. However, some attacks may be launched from many different machines, in which case the first method is unlikely to succeed. To detect such distributed coordinated attacks, we introduce a second method, called adaptive activation: whenever the match count of a detector goes from 0 to 1, the local activation threshold (i.e. the threshold specific to a single machine) is reduced by one. Hence, each detector that matches for the first time “sensitizes” the detection system, so that all detectors on that machine are more easily activated in the future. This mechanism does have a time horizon: over time, the activation threshold gradually returns to its default value. Thus, this method will detect diverse activity from many different sources,


[Figure 9 diagram: the six byte values (112, 31, 83, 201, 172, 18), plus the extra 49th bit, are reordered by the permutation mask 1-3-4-2-6-5 and then remapped through a randomly generated hash table.]

Figure 9: Byte-and-hash permutation. The binary string of length l = 49 is represented as six byte values plus an additional bit. This byte representation is permuted via the permutation mask 1 3 4 2 6 5, which is randomly generated, and then the binary string is remapped via a randomly generated hash table. All detectors on a single host use the same randomly generated mask and the same randomly generated hash table.


provided that the activity happens within a certain period of time. Clearly, these mechanisms suggest ways in which the detection system could still be evaded. However, they are tunable: if we are concerned about stealth attacks and can afford to tolerate high false-positive rates, we can set low thresholds; alternatively, if we are concerned only with preventing less skilled attackers from penetrating, we can set high thresholds. It transpires that these mechanisms are useful in reducing false positives and do not affect the ability of the system to detect common intrusions (see section 4.2).
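A minimal sketch of the two mechanisms (class names are our own, and the gradual decay of the threshold back to its default value is omitted):

```python
# Sketch of the two alarm mechanisms: a per-detector activation threshold,
# and adaptive activation, where a detector's first-time match lowers the
# host-local threshold. Class names are ours; threshold decay is omitted.

class Detector:
    def __init__(self):
        self.count = 0        # matches accumulated so far

class Host:
    def __init__(self, n_detectors, threshold):
        self.detectors = [Detector() for _ in range(n_detectors)]
        self.threshold = threshold   # local, adaptive activation threshold

    def on_match(self, i):
        d = self.detectors[i]
        if d.count == 0:
            # Adaptive activation: a first-time match sensitizes the host.
            self.threshold = max(1, self.threshold - 1)
        d.count += 1
        if d.count >= self.threshold:
            d.count = 0              # reset after raising an alarm
            return True              # alarm
        return False

host = Host(n_detectors=3, threshold=3)
# Repeated matches on one detector cross the threshold over time ...
assert [host.on_match(0) for _ in range(2)] == [False, True]
# ... and first-time matches on other detectors sensitize the whole host.
fresh = Host(n_detectors=3, threshold=3)
assert fresh.on_match(0) is False
assert fresh.on_match(1) is True    # threshold already lowered to 1
```

The first pattern models repeated activity from one source; the second models a distributed coordinated attack, where many detectors each match once.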

4 Experiments

The detection system was tested at the (name omitted for anonymous review), on a subnet in the Computer Science department. The subnet consisted of 50 machines on a switched segment. All analysis was conducted off-line, although the detection system could certainly run online: as we shall see in section 4.2, the computational requirements are negligible.

4.1 Input data

Two data sets were collected: a self set consisting of normal traffic, and a nonself set consisting of traffic generated during intrusive activity. The self set was collected over 50 days, during which a total of 2.3 million TCP SYN packets were logged. These 2.3 million packets were filtered down to 1.5 million packets. The filtering removed several classes of noisy traffic sources, such as web servers and ftp servers. There is little consistency in web browsing: new connections are generated all the time, and inclusion of such traffic would lead to high false-positive rates. Consequently, we ignored all web/ftp traffic from local clients to external servers (outside the LAN), and all web/ftp traffic from anywhere to local servers. However, we did monitor traffic destined for the web or ftp ports of local machines that did not function as servers. There were a few other classes of noisy traffic; for example, one machine functioned as a telnet server to the outside world, continually receiving connections from many different locations, many of them using dynamic IP numbers. Such sources of “ambient” traffic were also excluded.

The 1.5 million packets were mapped to binary strings, using the mapping described in section 3.1. It was found that there were only 3900 unique strings; that is, the graph of normal connections is indeed sparsely connected, confirming the results reported in Mukherjee et al. [8]. The growth of this set of unique strings is shown in figure 10. As time passes (indicated by the x-axis), the number of unique strings tends to level off, although it is unlikely that this curve will ever completely flatten. So we should always expect a few new triples, which will result in false positives unless we use a threshold scheme as described in section 3.5. The self set was divided into two parts: a training set (the first 43 days) and a test set (the last 7 days). The boundary between the training and test sets is shown by the dotted vertical line in figure 10. During the test period, 137 new triples were logged, out of a total of 182000 packets. Without threshold activation, this



Figure 10: The growth of unique triples versus total number of packets. The vertical dotted line indicates the division between the training and test sets.


would be approximately 20 false positives per day. In section 4.2, we show that we can greatly reduce this number while still detecting nonself. The nonself set comprised eight different intrusive incidents. Seven of these are faithful logs of real incidents that occurred on the network being studied; the eighth was synthetically generated to simulate an attack from many different locations. This simulated intrusion consisted of 200 random connections between internal hosts (the supposition being that the attackers had already penetrated at least one machine on the LAN). Most of the real attacks consisted of probing of one sort or another. There was one incident (cartan) of massive portscanning (i.e. of all ports), but most incidents involved much more limited scanning; for example, one incident (phear) involved probing of only port 1, which is a way of determining whether a machine is an SGI. Interestingly, most incidents (dt03ln93, xtream, sauron, pc35nl) involved probing of ports 53 and 143, which are DNS and IMAP, respectively. Recently, vulnerabilities in both of these services have been widely publicized; clearly, the attackers were looking for the most recent security flaws. At least one incident involved the compromise of an internal machine: in one case (cougar), after breaking into an internal machine, the attackers attempted to connect from there to the telnet ports of many different machines. The traffic tested for each incident consisted of all packets from the first nonself packet (the start of the incident) to the last nonself packet. Thus, each incident reproduces the timing of the attack, as well as including all normal traffic that was interspersed throughout the attack.

4.2 Results

The first experiment investigated the issue of false positives. Detectors were generated against the training set and were then used to scan the test set. The parameter settings were r = 12, with 50 machines, 100 detectors per machine, and a different, randomly generated permutation mask for each machine. The results reported in table 1 are the average false-positive rates over 30 runs (each run used a different random seed); standard deviations are not reported because they are very low. Results for activation thresholds of 1 and 10 are given. Although the total number of false positives is reported, the most important measure is the number of unique false positives, because a fielded system is likely to update its definition of normal every time a new false positive is encountered. Under this assumption, with an activation threshold of 10, there are effectively fewer than 2 false positives per day. Even with an activation threshold of 1, the false-positive rate is not excessively high, at about 20 per day. The important points are that the false-positive rate can be very low and that the system is tunable, so it can be set to produce an acceptable level of false positives.

In the second experiment, we wanted to determine how effective the system is at detecting attacks. The parameter settings were the same as in the false-positive experiment: r = 12, and 50 machines with 100 detectors per machine. To detect an attack, the detection system must identify at least some of the strings in the incident as nonself; obviously no detection is possible if all the strings in the incident are self strings. As a baseline, we compared the set of self strings to each incident and counted the number of nonself strings in that incident (see column 2 of table 2). In all cases, there was a substantial fraction of nonself strings in the incident. These fractions indicate the best possible anomaly signal, which is

Measure                             Threshold = 1   Threshold = 10
False positive rate                 0.0030          0.0003
Total number of false positives     546             55
Number of unique false positives    137             10
False positive rate per day         19.6            1.4

Table 1: False positive results for the detection system against the test set when detectors are generated against the training set.

determined by the number of detector alarms divided by the total number of strings in the incident. Columns 3, 4 and 5 of table 2 report anomaly signals for all eight incidents, averaged over 30 runs (standard deviations are omitted because they are very small). The strength of the anomaly signal reflects our confidence in detection of the anomaly. With permutation masks, these signal strengths are high for all incidents, which means that the system clearly detects eight out of eight incidents; in other words, the true positive rate is 100% (it is worth noting that these detection rates are achieved very cheaply: we used only 100 detectors per host, which is equivalent to 100 bit strings of length 49, a negligible amount of information). Increasing the activation threshold reduces the true positive rate, but even with a threshold of 10 the incidents are clearly detected (with permutation masks). The synthetic incident is the most sensitive to the activation threshold because it contains nonself strings that are close to self (recall that the synthetic incident consists of connections between internal hosts). Because of this, we can regard the synthetic incident as the most stringent and difficult test of the detection system.

The difficulty of detecting the synthetic incident is further illustrated by computing the holes in the detection system, assuming that r = 12 and there are no permutation masks. Under these assumptions, 21% of the nonself strings in the synthetic incident are holes, whereas in all other incidents none of the nonself strings are holes. Without permutation, the anomaly signal for the synthetic incident is therefore limited by the holes to a maximum of 0.79. When permutation masks are used with an activation threshold of 1, the true positive rate is 0.94, which clearly shows that permutation masks overcome the hole limit. The effects of permutation masks can also be seen by comparing detection with and without permutation masks for an activation threshold of 10 (columns 4 and 5 in table 2). In six out of eight incidents, the anomaly signal is still strong without permutation masks, but in the other two (synthetic and phear) it is very weak; in the case of the synthetic intrusion it is almost zero.

Because permutation masks seem particularly effective for the synthetic incident, we ran the system against that incident (the most difficult one), with the same parameter settings as before and an activation threshold of 10, but varying the number of detectors. The results are plotted in figure 11: as the number of detectors increases, so does the anomaly signal. With permutation masks, we rapidly exceed the upper bound caused by holes (indicated by the dashed line), whereas without permutation masks we cannot achieve better detection than the hole limit. Not only do permutation masks help us overcome the hole limit, they also improve detection before that limit is reached: for example, with 100 detectors per host, permutation masks increase the anomaly signal from 0.01 to 0.33.

Incident     Fraction nonself   Threshold = 1   Threshold = 10   Threshold = 10
                                                (permutation)    (no permutation)
phear             1.00              1.00            0.50             0.09
cartan            0.44              0.44            0.43             0.34
dt03ln93          0.17              0.17            0.16             0.16
xtream            0.62              0.62            0.59             0.61
cougar            0.54              0.58            0.53             0.49
sauron            0.10              0.10            0.09             0.09
pc35nl            1.00              1.00            0.84             0.43
synthetic         1.00              0.94            0.33             0.01

Table 2: Detection results against 8 incidents. Values in the columns are anomaly signals.
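The detection mechanics behind these results combine three pieces: a partial-matching rule parameterized by r (r = 12 in these experiments), a per-host permutation mask applied to strings before matching, and an activation threshold requiring several matches before an alarm is raised. The paper does not spell out the matching rule here; the sketch below assumes the r-contiguous-bits rule standard in the negative-selection literature it builds on, and all function names and bookkeeping are our own:

```python
R = 12  # matching length used in the experiments

def rcb_match(detector: str, s: str, r: int = R) -> bool:
    """r-contiguous-bits rule (assumed): detector matches s iff the two
    strings are identical in at least r contiguous positions."""
    assert len(detector) == len(s)
    run = 0
    for a, b in zip(detector, s):
        run = run + 1 if a == b else 0
        if run >= r:
            return True
    return False

def permute(s: str, mask: list) -> str:
    """Apply a host's permutation mask (a permutation of bit positions),
    so each host effectively matches against a reordered string space."""
    return "".join(s[i] for i in mask)

def anomalies(detectors, packets, mask, threshold=10):
    """Raise an alarm only after a detector has accumulated `threshold`
    matches -- a minimal reading of the activation-threshold idea."""
    counts = {d: 0 for d in detectors}
    alarms = []
    for s in packets:
        p = permute(s, mask)
        for d in detectors:
            if rcb_match(d, p):
                counts[d] += 1
                if counts[d] >= threshold:
                    alarms.append(s)
                    counts[d] = 0  # reset after alarming
    return alarms
```

With threshold = 1 this degenerates to alarming on every match; raising the threshold trades isolated false positives for slower detection, which is the trade-off visible in tables 1 and 2.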

5 Discussion

The results presented in the previous section are quite promising. However, we have made several important simplifying assumptions.

The assumption of a broadcast network is likely to be problematic because of the increasing popularity of switched networks. We believe that our system can be extended to a network in which each node sees only its own traffic, either by placing the detection system at the routers or by allowing SYN packets to "leak" through to all machines on the subnet, for example by modifying the switches so that they broadcast all SYN packets. At low computational cost to the network (SYN packets are infrequent and contain little data), we could then use our current system without modification.

A related issue is that of dynamically allocated IP addresses. Although not explicitly studied in the current system, we believe that some of the permutation masks will generalize across these varying addresses, because dynamically allocated IP addresses typically vary in only a few bits.

A third important decision was to filter out traffic to web servers. Using an anomaly detection system based only on network connections to look for anomalies in web traffic seems unpromising; it is likely that other methods will be needed to protect this important class of traffic.

In the introduction, we mentioned several principles that we thought important for distributed detection. In the scheme presented here, these are achieved as follows:



Figure 11: How anomaly signals scale with number of detectors, both with and without permutation masks. The dashed line indicates the upper limit caused by holes. The error bars indicate a 90% confidence interval in the mean.


- Localized: A single match on a single host is sufficient to conclude that a string is nonself. No communication between hosts is needed to reach a decision.

- Scalable: As the number of computers on a network increases, so does the number of connections, and hence the number of self strings. Thus the size of the nonself set decreases in absolute terms, and we need more detectors only if the self set becomes more complicated. In such cases, we can put the extra detectors on the new machines; that is, our resource demands may increase, but so will resource availability. This is a subject of future research.

- Tunable: We can trade performance against overhead by varying the number of detectors on each computer. We can also vary parameters to trade off false positives against false negatives, adjusting the system to the desired domain of application (for example, high- versus low-security environments).

- Robust: Each computer on the protected network has its own detector set, so compromise of any single machine causes an incremental reduction in detection effectiveness, not catastrophic failure. This robustness arises because we incorporated negative detection into our system. When the system experiences some dramatic change, due to the introduction of new software, new hardware, new users, etc., negative detection can continue to provide partial, reduced coverage. When the detection system is updated, only those detectors that raise false alarms need to be deleted; existing detectors that do not raise false alarms can be retained to provide continuous, partial coverage.

- Diverse: Through permutation masks, each protected computer has a slightly different form of protection. Our experimental results quantify the extent to which this form of diversity increases the security (interpreted as detection ability) of the overall system.
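The tunability point admits a simple back-of-envelope model. Under the idealized assumption that each detector independently matches a given nonself string with probability p, a pool of n detectors misses that string with probability (1 - p)^n; both p and the independence assumption are illustrative here, not quantities measured in the paper:

```python
def detection_probability(n_detectors: int, p_match: float) -> float:
    """P(at least one of n independent detectors matches a nonself string),
    under the idealized independence assumption described above."""
    return 1.0 - (1.0 - p_match) ** n_detectors

# Illustration with a hypothetical per-detector match probability of 1%:
# one host's 100 detectors catch a given nonself string ~63% of the time,
# while the whole network's pool (50 hosts x 100 detectors) almost surely does.
p_one_host = detection_probability(100, 0.01)
p_network = detection_probability(50 * 100, 0.01)
```

This is the trade-off knob: detectors per host set the per-host overhead, while the aggregate pool across the network sets the overall detection probability.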

There are compelling similarities between the problem of computer security and the problem faced by the human immune system. Both must protect highly complex, dynamically changing systems against intrusions from a wide variety of sources; both must ensure the continued functioning of the system, and must ensure that the protective mechanisms do not themselves seriously damage it. The system described in this paper incorporates several important immune-like properties: detection of novel foreign patterns (anomaly detection), distributed detection via the negative-selection algorithm, and diversity across individuals (computers) in a population (the protected network) through permutation masks.

In the future, we plan to use this system as the basis for a more ambitious exploration of the immunology metaphor. This includes introducing a life-cycle for detectors, analogous to the life-cycle of immune cells in the body. Each detector would undergo the following events during its "lifetime": random generation (cell creation); a negative-selection training phase (as described earlier); a testing phase, during which detectors that survive negative selection are used to monitor network traffic (as described earlier); and finally a learning phase, during which some detectors are deleted (those that have not found any foreign patterns) and others are rewarded, either with longer lifetimes or through replication with mutation. In addition to the detector life-cycle, other features that we believe are important include: a scheme for distributed

generation of detectors (related to ideas about peripheral tolerance in immunology), memory of previous intrusions (known as a secondary response in the immune system), competition for foreign packets (creating a pressure for more specific memories), autonomy, and placing the permutation masks under evolutionary control.
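The detector life-cycle sketched above is future work; purely as a speculative illustration of the phases described (with all names and constants invented for this sketch), it might be organized as a small state machine:

```python
from enum import Enum, auto

class Phase(Enum):
    IMMATURE = auto()  # freshly generated at random; undergoing negative selection
    MATURE = auto()    # survived training; monitoring live traffic
    DEAD = auto()      # matched self during training, or expired unrewarded

class Detector:
    """Hypothetical detector life-cycle, loosely mirroring immune cells."""

    def __init__(self, bits: str, lifetime: int = 1000):
        self.bits = bits
        self.phase = Phase.IMMATURE
        self.lifetime = lifetime  # remaining monitoring steps (arbitrary units)
        self.matches = 0          # foreign patterns found so far

    def train(self, matches_self: bool):
        """Negative-selection phase: detectors that match self are discarded."""
        self.phase = Phase.DEAD if matches_self else Phase.MATURE

    def tick(self):
        """One monitoring step: detectors that never find anything expire."""
        if self.phase is Phase.MATURE:
            self.lifetime -= 1
            if self.lifetime <= 0 and self.matches == 0:
                self.phase = Phase.DEAD

    def reward(self):
        """A detector that found a foreign pattern earns a longer lifetime."""
        self.matches += 1
        self.lifetime += 1000
```

Replication with mutation of rewarded detectors, the other reward mechanism mentioned, would be a separate step operating on the surviving population.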

6 Conclusions

We have presented a new method of network intrusion detection. Our system combines earlier work on NSM with the immunologically inspired negative-selection algorithm. In addition, our system has the following new features: permutation masks to increase the effectiveness of negative detection (adding diversity), activation thresholds that allow the system to aggregate foreign activity over time, and adaptive thresholds that allow it to integrate foreign patterns from multiple locations. We presented experimental results suggesting that the system is highly effective at detecting common forms of intrusion and has surprisingly low false-positive rates. The current system is implemented on a broadcast LAN; an important area of future investigation is the extension to switched networks. Finally, the system is promising as an avenue for incorporating many interesting and potentially important aspects of natural immune systems into a concretely defined system that solves a practical problem.

References

[1] M. Crosbie and G. Spafford. Defending a computer system using autonomous agents. In Proceedings of the 18th National Information Systems Security Conference, 1995.

[2] P. D'haeseleer. An immunological approach to change detection: Theoretical results. In Proceedings of the 9th IEEE Computer Security Foundations Workshop, Los Alamitos, CA, 1996. IEEE Computer Society Press.

[3] P. D'haeseleer, S. Forrest, and P. Helman. An immunological approach to change detection: Algorithms, analysis and implications. In Proceedings of the 1996 IEEE Symposium on Research in Security and Privacy, Los Alamitos, CA, 1996. IEEE Computer Society Press.

[4] S. Forrest, A. S. Perelson, L. Allen, and R. Cherukuri. Self-nonself discrimination in a computer. In Proceedings of the 1994 IEEE Symposium on Research in Security and Privacy, Los Alamitos, CA, 1994. IEEE Computer Society Press.

[5] S. Garfinkel and G. Spafford. Practical Unix and Internet Security, 2nd edition. O'Reilly and Associates, Inc., 1996.

[6] L. T. Heberlein, G. V. Dias, K. N. Levitt, B. Mukherjee, J. Wood, and D. Wolber. A network security monitor. In Proceedings of the IEEE Symposium on Security and Privacy. IEEE Press, 1990.

[7] C. Hunt. TCP/IP Network Administration. O'Reilly and Associates, Sebastopol, CA, 1992.

[8] B. Mukherjee, L. T. Heberlein, and K. N. Levitt. Network intrusion detection. IEEE Network, pages 26-41, May/June 1994.

[9] J. K. Percus, O. E. Percus, and A. S. Perelson. Probability of self-nonself discrimination. Theoretical and Experimental Insights into Immunology, 1992.

Acknowledgements (Omitted for the purposes of anonymous review)
