Experiments with the V-Detector algorithm

Andrzej Chmielewski¹, Sławomir T. Wierzchoń²

¹ Faculty of Computer Science, Technical University of Bialystok, Poland
  [email protected]
² Institute of Computer Science, Polish Academy of Sciences, Warsaw, Poland
  [email protected]

Abstract. V-Detector is a real-valued negative selection algorithm designed to detect anomalies in datasets containing real-valued data. Many previous experiments focused on analyzing the usability of this algorithm for detecting intruders in computer networks. Intrusion Detection Systems (IDS) should be efficient and reliable due to the large number of network connections and their diversity. Additionally, every connection is described as a record containing tens of numerical and symbolic attributes. We show that by choosing an appropriate representation of "typical" connections and a smart decomposition of the learning data it is possible to obtain a quite efficient and cheap algorithm for detecting non-typical connections.

1 Introduction

Following [12], by Intrusion Detection (ID for short) we understand "the art of detecting inappropriate, incorrect, or anomalous activity. IDSs (i.e. ID systems) that operate on a host to detect malicious activity on that host are called host-based IDSs, and ID systems that operate on network data flows are called network-based IDSs". A perfect IDS should not only recognize all intruders, but also utilize as few computer resources as possible. The approach we advocate is based on the biologically inspired mechanism of negative selection [7]. It is the main mechanism used by the immune system to censor so-called T-lymphocytes. Namely, young T-lymphocytes produced in the thymus are tested against self proteins: only those T-cells that do not recognize any self cells survive, with the hope that they will be able to detect any dangerous (or non-self) substances. An extensive bibliography reporting approaches to ID can be found e.g. in [13]. The results presented in [2] and [3] show that the V-Detector algorithm, briefly described in Section 3, is, after some modifications, quite robust and efficient in comparison to e.g. a Support Vector Machine (SVM) classifier, known to be a very effective classifier [16]. Classification time is very similar for both algorithms. On the other hand, in the learning stage, time increases linearly with the number of samples of normal connections for V-Detector, while for SVM the learning time increases logarithmically.

In this paper we present our analysis of the content of the empirical dataset, namely KDD Cup 1999 [9], and various improvements to the basic V-Detector algorithm.

2 Negative selection

One of the major algorithms developed within the emerging field of artificial immune systems (AIS) is the Negative Selection Algorithm, proposed by Forrest et al. [7]. It is based on the principles of self/nonself discrimination in the immune system. To formalize this algorithm, denote by U the problem space, e.g. the set of all possible bit strings of fixed length, and assume that S stands for the set of strings representing typical behavior of the system under consideration. Then the set of strings characterizing anomalous behavior, N, can be viewed as the set-theoretical complement of S:

N = U \ S    (1)

The elements of S are called self, and those of N are termed non-self. To apply the negative selection algorithm it is necessary to generate a set D ⊂ N of detectors, such that each d ∈ D recognizes at least one element n ∈ N and does not recognize any self element. Thus, we must designate a rule, match(d, u), specifying when d recognizes an element u; consult [17] for details. This approach, although intuitive and simple, admits at least two serious drawbacks. First, it is hard to specify the full set S in advance; typically we observe only a subset S′ ⊂ S. Second, the majority of detection rules induce so-called holes, i.e. regions of N which are not covered by any detector. Of course, instead of the binary representation of the space U we can use a real-valued representation, originally proposed in [8]. This paper is focused only on real-valued detectors, described in Section 3.
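To make the scheme concrete, below is a minimal Python sketch of the censoring and detection phases, using the classic r-contiguous-bits rule as an example match(d, u); the function names and default parameters are illustrative, not taken from [7] or [17].

import random

def r_contiguous_match(a, b, r=3):
    """Example match rule: two equal-length bit strings match if they
    agree on at least r consecutive positions."""
    run = 0
    for x, y in zip(a, b):
        run = run + 1 if x == y else 0
        if run >= r:
            return True
    return False

def censor(self_set, n_detectors, length=8, r=3):
    """Censoring phase: random candidates that match any self string
    are discarded; survivors become the detector set D."""
    detectors = []
    while len(detectors) < n_detectors:
        d = "".join(random.choice("01") for _ in range(length))
        if not any(r_contiguous_match(d, s, r) for s in self_set):
            detectors.append(d)
    return detectors

def is_nonself(u, detectors, r=3):
    """Detection phase: a sample matched by any detector is non-self."""
    return any(r_contiguous_match(u, d, r) for d in detectors)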

3 V-Detector algorithm

The V-Detector algorithm was formally proposed by Ji and Dasgupta [10]. It operates on (normalized) vectors of real-valued attributes, being points in the m-dimensional unit hypercube, U = [0, 1]^m. Each self sample, s_i ∈ S, is represented as a hypersphere with the center at c_i ∈ U and fixed radius r_s, i.e. s_i = (c_i, r_s), i = 1, ..., |S|, where |S| is the number of self samples. Every point u ∈ U belonging to any such hypersphere is considered a self element. Detectors d_j are also represented as hyperspheres: d_j = (c_j, r_j), j = 1, ..., |D|, where |D| is the number of detectors (not known in advance). In contrast to self elements, the radius r_j is not fixed but is computed as the Euclidean distance from a randomly chosen center c_j to the nearest self element (this distance must be greater than r_s, otherwise the detector is not created). Formally, we define r_j as

r_j = max{0, min_{1≤i≤|S|} dist(c_j, c_i) − r_s}    (2)

The algorithm terminates if a predefined number Tmax of detectors is generated, or the degree of coverage of the space U \ S by these detectors exceeds a given threshold c0 (consult [10] for details). The overall pessimistic computational complexity of creating a detector is of order O(m · (|S| + |D|)). Similarly, during the classification process, each tested sample is compared with all detectors; in the pessimistic case O(m · |D|). When |S|, |D|, and m increase, this process becomes time consuming. A particularly pessimistic analysis of the algorithm is given in [14] and [15], where the main author states that (cf. [14], p. 114) "(...) hyperspheres have undesirable properties in high dimensions – the volume tends to zero by keeping the radius fixed, and nearly all uniformly randomly distributed points are close to the hypersphere surface". And he adds later ([14], pp. 112-113): "A hypersphere, for example with radius r = 1, has a high volume in relation to its radius length, up to dimension 15 (...) In higher dimensions (m > 15), for r = 1 the volume is nearly 0. This means that the recognition space – or in the context of real-valued negative selection the covered space – is nearly 0. In contrast, a radius that is too large (r > 2) in high dimensional spaces (m > 10) implies an exponential volume. This exponential volume behavior, in combination with an imprecise volume estimation of overlapping hyperspheres, is the reason for the poor classification results" [reported in [15]].
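As a reading aid, here is a minimal sketch of detector generation under these rules. It assumes uniform random candidate centers and estimates coverage with a simple consecutive-covered-probes heuristic; the exact estimation procedure in [10] is more involved, and all names and defaults below are illustrative.

import math
import random

def dist(a, b):
    """Euclidean distance between two points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def generate_v_detectors(self_centers, r_s, m, t_max=1000, c0=0.99,
                         max_tries=100_000):
    """Sketch of V-Detector generation. Detectors are (center, radius)
    pairs; the radius follows Eqn. (2). Generation stops after t_max
    detectors, or when a long run of random probes falls inside existing
    detectors, suggesting coverage of roughly c0."""
    detectors = []
    covered_run = 0
    needed_run = int(1.0 / (1.0 - c0))     # e.g. 100 probes for c0 = 0.99
    for _ in range(max_tries):
        if len(detectors) >= t_max or covered_run >= needed_run:
            break
        c = [random.random() for _ in range(m)]
        if any(dist(c, dc) <= dr for dc, dr in detectors):
            covered_run += 1               # probe already covered
            continue
        covered_run = 0
        r = max(0.0, min(dist(c, s) for s in self_centers) - r_s)  # Eqn. (2)
        if r > 0.0:                        # otherwise the candidate lies
            detectors.append((c, r))       # too close to a self sample
    return detectors

def classify(u, detectors):
    """A sample covered by any detector is classified as non-self."""
    return any(dist(u, dc) <= dr for dc, dr in detectors)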

3.1 Modified V-Detector algorithm

In most cases, the results of the V-Detector algorithm were presented for the hypercube spanned over all self and nonself elements, and this was probably the main reason why poor results for multidimensional datasets – reported e.g. in [15] – were obtained. In our approach, described in [2], only self samples were used to generate receptors (for another, intriguing approach to the problem consult [5]). In consequence, all samples which lie outside the hypercube are classified as nonself, see Fig. 1(b). Additionally, if the value of a specific attribute is constant over all self samples and no nonself sample is equal to this value, then this attribute can be removed from the dataset and, simultaneously, a new detector is created. This detector will recognize all samples using the value specific only to self samples. This corresponds to the idea of a magic bit, introduced for binary detectors in [1]. Let us look at the following example:

Self Set               Nonself Set
0.10 0.34 0.22 0.27    0.56 0.53 0.89 0.29
0.10 0.77 0.38 0.87    0.87 0.19 0.99 0.69
0.10 0.42 0.76 0.65    0.29 0.09 0.32 0.43


Fig. 1. Example of the performance of V-Detector for a 2-dimensional problem: (a) original algorithm, (b) modified algorithm. Grey circles denote self samples, dashed circles denote V-detectors, the dashed area denotes the detector which recognizes all samples lying outside the space spanned over all self samples, and white areas denote holes.

In this case, it is possible to create a detector which will recognize all samples for which the value of the first attribute is not equal to 0.10 ± ∆T (where ∆T is a tolerance), and to reduce our 4-dimensional problem to a 3-dimensional one. Systems for identification of intruders in computer networks (or, roughly speaking, for anomaly detection in some datasets) should of course be very robust. But another significant feature is the time spent on learning the system and on classifying new samples. It can be reduced by applying tree-based structures (for example based on the k-d tree family) to store self samples and detectors. Details were presented in [4]. Another direction in the improvement of the algorithm is concerned with the way the radius of a detector is determined, see Eqn. (2). Namely, it is possible that the value r_j only slightly exceeds the radius r_s while other neighbors of the center c_j are quite distant. Then, to get a better coverage of the space N, it is reasonable to adjust the location c_j, e.g. towards the center of gravity of a few nearest neighbors. This strategy will be explored in our further works.
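The attribute-reduction step can be sketched as follows; the function name, the tolerance handling, and the returned detector format are assumptions for illustration, not the paper's implementation.

def constant_attribute_detectors(self_samples, nonself_samples, tol=1e-6):
    """For every attribute that is constant (value v) over all self
    samples and that no nonself sample matches within the tolerance,
    emit a 'magic bit'-style detector (attribute index, v, tol) and drop
    the attribute; such a detector fires on any sample whose attribute
    differs from v by more than tol."""
    m = len(self_samples[0])
    kept, detectors = [], []
    for j in range(m):
        v = self_samples[0][j]
        is_const = all(abs(s[j] - v) <= tol for s in self_samples)
        clashes = any(abs(n[j] - v) <= tol for n in nonself_samples)
        if is_const and not clashes:
            detectors.append((j, v, tol))
        else:
            kept.append(j)
    return kept, detectors

On the example above, attribute 0 is constantly 0.10 for self samples and never 0.10 for nonself samples, so one detector (0, 0.10, tol) is created and the problem is reduced from 4 to 3 dimensions.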

4 The dataset

KDD Cup 1999 [9] is a very popular dataset used to evaluate ID algorithms as well as other classification algorithms. It contains almost 5 million records. Each record describes a network connection as a vector of 41 attributes (38 numeric and 3 symbolic) with an additional end-label (the value normal denotes a self connection). Examples of self and nonself connections are given below:

91, udp, domain_u, SF, 87, 45, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0.00, 0.00, 0.00, 0.00, 1.00, 0.00, 0.00, 147, 140, 0.95, 0.01, 0.18, 0.00, 0.00, 0.00, 0.00, 0.00, normal.

0, icmp, eco_i, SF, 18, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0.00, 0.00, 0.00, 0.00, 1.00, 0.00, 0.00, 1, 153, 1.00, 0.00, 1.00, 1.00, 0.00, 0.00, 0.00, 0.00, ipsweep.

In our experiments all redundant connections were eliminated first. As a result we obtained a reduced set S_unique containing 1074994 unique connections (an almost five-fold reduction): 812813 normal and 262181 anomalous.
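A minimal sketch of this preprocessing, assuming the comma-separated, period-terminated record layout shown above; the helper names are illustrative.

def parse_record(line):
    """Split a KDD Cup 1999 record into its 41 attributes and end-label."""
    fields = [f.strip() for f in line.strip().rstrip(".").split(",")]
    return tuple(fields[:-1]), fields[-1]

def unique_connections(lines):
    """Drop duplicate connection records (attributes and label alike)."""
    seen, unique = set(), []
    for line in lines:
        rec = parse_record(line)
        if rec not in seen:
            seen.add(rec)
            unique.append(rec)
    return unique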

4.1 Hierarchical division of the testing dataset

Although KDD Cup 1999 contains over 1 million connections, we can expect this value to be much larger in real-world problems, and then both the time needed to generate detectors and the time needed to cover the subspace N by these detectors will grow very fast. So, it seems quite natural to look for a more adequate method of splitting the dataset into disjoint subsets and to generate specific detectors for each subset separately. Besides, and this is probably most significant, in some cases it will be possible to reduce the dimension of the problem by eliminating those attributes for which all samples (self and nonself) have the same constant value (usually, these attributes are not used for particular protocols, services, etc.). In [2] a hierarchical approach to this problem was suggested. It relies upon splitting the data according to a priori selected attributes. For example, we can choose the following 3 discriminative attributes: protocol, service and flag (the 2nd, 3rd and 4th attributes of the KDD Cup dataset, respectively). More precisely, the full dataset (K) is divided according to the attribute protocol, giving as a result three subsets containing connections specific to a particular protocol: K_ICMP, K_UDP and K_TCP, where

K = K_ICMP ∪ K_UDP ∪ K_TCP

It is worth mentioning that, especially for K_ICMP and K_UDP, we can observe a significant reduction of useless attributes: to 20 attributes for K_ICMP, to 22 for K_UDP and to 40 for K_TCP. If the obtained subsets are still too big or still contain too many attributes, then we can divide them according to the next attribute, namely service. For the ICMP and UDP protocols we have:

K_ICMP = K_ICMP_eco_i ∪ K_ICMP_ecr_i ∪ K_ICMP_red_i ∪ K_ICMP_tim_i ∪ K_ICMP_urh_i ∪ K_ICMP_urp_i
K_UDP = K_UDP_domain_u ∪ K_UDP_ntp_u ∪ K_UDP_other ∪ K_UDP_private ∪ K_UDP_tftp_u

The biggest set (K_TCP) was divided into 61 subsets. In the next step, we can perform a similar division with respect to the last attribute, flag. There are several advantages of such a division:

1. Each subset contains fewer samples (connections). This reduces the duration of the learning process.

2. Some subsets can have attributes with values identical for all the samples. Hence the dimensionality of the problem space reduces. As already stated, K_ICMP can be described by 20 attributes and K_UDP by 22 attributes.
3. Detectors are generated for a specific type of connection, service and flag. This makes the classification of connections much easier. Usually, detectors are generated for special types of attack. The problem is that we do not know in advance what type of attack to expect when a nonself connection is established, so we do not know which set of detectors we should use. However, we do have information about the protocol, service, flag, etc.
4. There is no need to perform normalization on the discriminative attributes. This is especially important for non-numerical values (such as protocol, service, flag).
5. There is no need to generate receptors for subsets containing only self or only nonself elements. In this case, the triple (protocol, service, flag) is a special type of detector which can be used in a process of positive (when the subset contains only nonself elements) or negative (when it contains only self elements) selection; see the sketch below. In the case of the ICMP protocol, the subsets K_ICMP_red_i and K_ICMP_urh_i contain only self samples. So, all connections for the services red_i and urh_i using the ICMP protocol will be automatically classified as normal. Similarly, the subsets K_UDP_ntp_u and K_UDP_tftp_u for the UDP protocol have no nonself connections. In the case of the TCP protocol, about 3% of samples belong to subsets of this type.
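A minimal sketch of this division and of advantage 5, assuming records parsed as in Section 4 (attribute tuple plus label); grouping keys, indices, and function names are illustrative.

from collections import defaultdict

def hierarchical_split(records, attr_indices=(1, 2)):
    """Group records by the discriminative attributes; indices 1 and 2
    select the protocol and service fields of a KDD record."""
    subsets = defaultdict(list)
    for attrs, label in records:
        key = tuple(attrs[i] for i in attr_indices)  # e.g. ('udp', 'domain_u')
        subsets[key].append((attrs, label))
    return subsets

def single_class_rules(subsets):
    """Advantage 5: a subset holding only self or only nonself samples
    needs no receptors -- its key alone classifies future connections."""
    rules = {}
    for key, recs in subsets.items():
        labels = {label == "normal" for _, label in recs}
        if labels == {True}:
            rules[key] = "self"      # e.g. ('icmp', 'red_i'): always normal
        elif labels == {False}:
            rules[key] = "nonself"   # subset contained only attacks
    return rules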

5 Experiments

In this section we present some experiments performed on the original and modified versions of the V-Detector algorithm (described in Section 3) with the hierarchical division of the KDD Cup 1999 dataset (see Section 4). Next, these results are compared with a Support Vector Machine (described in Section 5.2), a strong classification tool. All experiments were repeated 20 times. We chose only 2 discriminative attributes (protocol and service) and did not use the attribute flag. To evaluate the performance of the V-Detector algorithm two indices are used: Detection Rate (DR) and False Alarm Rate (FAR), computed as follows:

DR = TP / (TP + FN)    (3)

FAR = FP / (FP + TN)    (4)

where TP (true positive) is the number of correctly classified anomalous (nonself) samples, FN (false negative) is the number of nonself samples recognized as self, FP (false positive) is the number of self samples recognized as nonself, and TN (true negative) is the number of correctly classified self samples. It is worth noticing that for the V-Detector algorithm FAR is always equal to 0 when the same dataset is used in the learning and the classification process.
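A small sketch of these two indices, treating nonself as the positive class; the helper names are illustrative.

def confusion_counts(labels, predictions):
    """labels / predictions: True for nonself (anomalous), False for self."""
    tp = sum(l and p for l, p in zip(labels, predictions))
    fn = sum(l and not p for l, p in zip(labels, predictions))
    fp = sum(p and not l for l, p in zip(labels, predictions))
    tn = sum(not l and not p for l, p in zip(labels, predictions))
    return tp, fn, fp, tn

def rates(tp, fn, fp, tn):
    """Detection Rate (Eqn. 3) and False Alarm Rate (Eqn. 4)."""
    dr = tp / (tp + fn) if tp + fn else 0.0
    far = fp / (fp + tn) if fp + tn else 0.0
    return dr, far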

Dataset | Subset | Self count | Nonself count | Detectors count | Detection Rate [%] | Std. dev. | Learning time [s] | Classification time [s]
ICMP    | FULL   | 4427       | 7483          | 4.8             | 20.56              | 0.193     | 1                 | 43
ICMP    | eco_i  | 1081       | 4221          | 4.0             | 22.24              | 0.294     |                   |